# CNN Architecture

### TOPIC: Understanding Pooling and Padding in CNN

#### Q1. Describe the purpose and benefits of pooling in CNN.

Pooling, in Convolutional Neural Networks (CNNs), serves several important purposes and offers several benefits:
* **Purpose:**
    * **Dimensionality Reduction:** Pooling reduces the spatial dimensions (width and height) of feature maps, effectively downscaling the information. This helps in managing computational complexity and memory requirements, which is crucial in deep networks.
    * **Translation Invariance:** Pooling helps make the network more robust to small translations or distortions in the input. By summarizing local information into a single value, the network becomes less sensitive to precise spatial locations of features.
    * **Feature Hierarchy:** Pooling, when used in conjunction with convolutional layers, creates a hierarchy of features. Higher-level features are learned from the output of convolutional layers, while pooling layers capture more abstract and higher-level information.
    * **Computation Efficiency:** Pooling reduces the amount of computation required during both forward and backward passes, making training and inference faster and more efficient.
* **Benefits:**
    * **Reduction of Overfitting:** By downsampling the feature maps, pooling reduces the risk of overfitting, as it reduces the number of parameters in the network.
    * **Improved Computation:** Smaller feature maps after pooling require less computation, which speeds up training and inference.
    * **Enhanced Generalization:** Pooling helps the network generalize better by focusing on essential features while ignoring less relevant details.
    * **Shift Invariance:** Especially in max pooling, the network becomes partially invariant to small translations, which can be advantageous in recognizing objects regardless of their exact position within the receptive field.

---

#### Q2. Explain the difference between Min pooling and Max pooling.

Max pooling and Min pooling are two common types of pooling operations in CNNs, and they differ in how they select the representative value from a local region:

* **Max Pooling:**
    - In max pooling, for each local region (typically a 2x2 or 3x3 window), the maximum value is selected as the representative value for that region.
    - Max pooling is often used to retain the most prominent feature within each local region.
    - It emphasizes the most active feature in the region, which can be useful for detecting specific patterns.

* **Min Pooling:**
    - In min pooling, the minimum value from the local region is chosen as the representative value.
    - Min pooling is used when we want to emphasize the presence of smaller features or the least prominent information in the local region.
    - It can be used to detect the least active feature in the region, which might be useful in certain scenarios.

The choice between max pooling and min pooling depends on the specific problem and the characteristics of the data being processed. Max pooling is more common and often produces better results for many tasks, as it tends to capture the most relevant features.

---

#### Q3. Discuss the concept of padding in CNN and its significance.

Padding is a technique used in CNNs to control the spatial dimensions of the output feature maps after convolution or pooling operations. It involves adding extra rows and columns of zeros (or other values) around the input data before applying the operation. Padding is significant for several reasons:
* **Preserving Spatial Information:**
    - Padding allows the output feature map to have the same spatial dimensions as the input. This is crucial when maintaining spatial information is important.
    - It ensures that information at the edges of the input is considered during convolution or pooling.
* **Controlling Output Size:**
    - Padding enables us to control the size of the output feature map. We can choose the amount of padding to achieve the desired output size, which is essential for network design and computational resource management.
* **Edge Information:**
    - Padding ensures that the information at the edges of the input is considered during convolution or pooling. Without padding, the outer pixels might be underrepresented in the output feature map.
* There are two common types of padding:
    - **Zero-padding (Same Padding):** Here, zeros are added around the input data. It helps maintain the spatial dimensions and is often used when preserving spatial information is crucial.
    - **Valid Padding (No Padding):** In this case, no padding is added, and the spatial dimensions of the output feature map are reduced. It's used when dimensionality reduction is desired or when the spatial information at the edges is less important.

The choice between zero-padding and valid-padding depends on the network architecture, the nature of the problem, and whether spatial information preservation is a priority.

---

#### Q4. Compare and contrast zero-padding and valid-padding in terms of their effects on the output feature map size.

Zero-padding and valid-padding have distinct effects on the output feature map size in CNNs:
* **Zero-Padding (Same Padding):**
    - Increases the spatial dimensions of the output feature map.
    - Helps maintain the spatial information from the input.
    - Commonly used when we want the output feature map to have the same spatial dimensions as the input.
    - Output size (O) can be calculated as:
          **O = [(W - F + 2P) / S] + 1,** *where:*
      - *W is the input size.*
      - *F is the filter size.*
      - *P is the amount of zero-padding.*
      - *S is the stride.*
* **Valid-Padding (No Padding):**
    - Does not add any extra rows or columns to the input.
    - Reduces the spatial dimensions of the output feature map.
    - Commonly used when dimensionality reduction is desired or when the spatial information at the edges is less important.
    - Output size (O) can be calculated as:
          **O = [(W - F) / S] + 1,** *where:*
      - *W is the input size.*
      - *F is the filter size.*
      - *S is the stride.*

In summary, zero-padding increases the output size and preserves spatial information, while valid-padding reduces the output size and may discard information from the edges of the input. The choice between these padding strategies depends on the specific requirements of the CNN architecture and the task at hand.

<hr style="border: 2px solid black">

## TOPIC: Exploring LeNet

#### Q1. Provide a concise overview of LeNet-5 architecture.

LeNet-5 is a pioneering convolutional neural network (CNN) architecture developed by Yann LeCun and his colleagues in the late 1990s. It was designed for handwritten digit recognition and has played a significant role in the advancement of deep learning. Here's a concise overview of the LeNet-5 architecture:
- **Input Layer:** LeNet-5 takes grayscale images as input, typically of size 32x32 pixels.
- **Convolutional Layers:** LeNet-5 consists of two pairs of convolutional layers followed by activation functions. These convolutional layers learn hierarchical features from the input image.
- **Max Pooling Layers:** After each pair of convolutional layers, max-pooling layers are applied to downsample the spatial dimensions of the feature maps and enhance translation invariance.
- **Fully Connected Layers:** Following the convolutional and pooling layers, LeNet-5 has three fully connected layers. These layers flatten the output from the previous layers and map it to the output class probabilities.
- **Activation Functions:** Sigmoid activation functions were primarily used in LeNet-5.
- **Output Layer:** The final output layer typically has 10 units (one for each digit class) and uses the softmax activation function to produce class probabilities.

---

#### Q2. Describe the key components of LeNet-5 and their respective purposes.

1. **Convolutional Layers:** The convolutional layers are responsible for learning local patterns and features in the input image. The learned features become more abstract as we move deeper into the network.
2. **Max Pooling Layers:** Max pooling layers downsample the spatial dimensions of the feature maps while preserving essential information. They help make the network invariant to small translations or distortions in the input.
3. **Fully Connected Layers:** These layers serve as a classifier, taking the high-level features learned from the convolutional layers and making class predictions. The fully connected layers are responsible for capturing global patterns and relationships in the data.
4. **Activation Functions:** In the original LeNet-5, sigmoid activation functions were used. While modern CNNs often use rectified linear units (ReLUs), sigmoid activations were common at the time of LeNet-5's development.

---

#### Q3. Discuss the advantages and limitations of LeNet-5 in the context of image classification tasks.

**Advantages:**
- **Pioneering Work:** LeNet-5 was one of the earliest successful CNN architectures, setting the stage for modern deep learning research.
- **Effective for Handwriting Recognition:** It was highly effective for its original purpose of recognizing handwritten digits, achieving state-of-the-art results at the time.

**Limitations:**
- **Limited Depth:** LeNet-5 has a relatively shallow architecture compared to modern CNNs. Deeper networks can learn more complex and abstract features, which is crucial for tasks with large and diverse datasets.
- **Activation Functions:** The use of sigmoid activations in LeNet-5 can result in the vanishing gradient problem, limiting its ability to train very deep networks effectively.
- **Small Input Size:** LeNet-5 was designed for small 32x32 input images. It may not perform well on tasks that require high-resolution inputs, such as object detection or recognition in natural images.
- **Simplicity:** While simplicity can be an advantage, LeNet-5's architecture may not capture the complexity of modern image recognition tasks.

---

#### Q4. Implement LeNet-5 using a deep learning framework of your choice (e.g., TensorFlow, PyTorch) and train it on a publicly available dataset (e.g., MNIST). Evaluate its performance and provide insights.

##### LeNet-5 using Pytorch

In [1]:
import torch
import torchvision
import torch.nn as nn
import torch.optim as opt
import torchvision.transforms as transform
from torch.utils.data import DataLoader

# Architechture
class lenet5 (nn.Module):
    def __init__(self):
        super(lenet5, self).__init__()
        self.c1 = nn.Conv2d(1,6,5)      # 1 input channel, 6 output channels, 5x5 kernel
        self.a1 = nn.ReLU()
        self.p1 = nn.MaxPool2d(2,2)     # Max pooling with 2x2 window
        self.c2 = nn.Conv2d(6,16,5)     # 6 input channel, 16 output channels, 5x5 kernel
        self.a2 = nn.ReLU()
        self.p2 = nn.MaxPool2d(2,2)     # Max pooling with 2x2 window
        self.f1 = nn.Linear(16*4*4,120) # 16 feature maps, 4x4 spatial size
        self.a3 = nn.ReLU()
        self.f2 = nn.Linear(120,84)
        self.a4 = nn.ReLU()
        self.f3 = nn.Linear(84,10)      # 10 output classes for MNIST
    def forward(self,x):
        x = self.p1(self.a1(self.c1(x)))
        x = self.p2(self.a2(self.c2(x)))
        x = x.view(-1,16*4*4)           # Flatten the output
        x = self.a3(self.f1(x))
        x = self.a4(self.f2(x))
        x = self.f3(x)
        return x
    
# Loading MNIST Dataset
transform = transform.Compose([transform.ToTensor(), transform.Normalize((0.5,),(0.5,))])
Train = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
Test  = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transform)
train = DataLoader(Train, batch_size=64, shuffle=True)
test  = DataLoader(Test, batch_size=64, shuffle=False)

# Model Prepartions
model = lenet5()
criterion = nn.CrossEntropyLoss()
optimizer = opt.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Training
for e in range(10):
    loss = 0
    for i, d in enumerate(train,0):
        inputs , labels = d
        optimizer.zero_grad()
        outputs = model(inputs)
        l = criterion(outputs,labels)
        l.backward()
        optimizer.step()
        loss+= l.item()
    print(f"Epoch {e + 1}, Loss: {loss / len(train)}")
print('Finishing Training')

# Evaluation
correct = 0
total = 0
with torch.no_grad():
    for d in test:
        image, labels = d
        output = model(image)
        _,predict = torch.max(output.data,1)
        total += labels.size(0)
        correct += (predict == labels).sum().item()
acc = 100*correct/total
print(f"Accuracy on test dataset: {acc}%")

Epoch 1, Loss: 1.9297293821123362
Epoch 2, Loss: 0.34765644957309466
Epoch 3, Loss: 0.19032202784750443
Epoch 4, Loss: 0.1386182621192894
Epoch 5, Loss: 0.11206938845437092
Epoch 6, Loss: 0.09449441493776783
Epoch 7, Loss: 0.0827250789626559
Epoch 8, Loss: 0.07468811215272844
Epoch 9, Loss: 0.06742035615434652
Epoch 10, Loss: 0.06202100622288978
Finishing Training
Accuracy on test dataset: 98.19%


##### LeNet-5 using Tensorflow

In [2]:
import tensorflow as tf
from tensorflow.keras import datasets, layers, models

# Loading and Scaling Datasets
(x_train, y_train) , (x_test, y_test) = datasets.mnist.load_data()
x_train, x_test = x_train/255.0, x_test/255.0
x_train, x_test = x_train.reshape(-1,28,28,1).astype('float32'), x_test.reshape(-1,28,28,1).astype('float32')

# Model Creations
model = models.Sequential([layers.Conv2D(6,(5,5), activation='relu', input_shape=(28,28,1)), layers.MaxPool2D((2,2)),
                           layers.Conv2D(16,(5,5), activation='relu'), layers.MaxPool2D((2,2)), layers.Flatten(),
                           layers.Dense(120, activation='relu'), layers.Dense(84, activation='relu'), layers.Dense(10)])
model.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=['accuracy'])
model.fit(x_train,y_train, epochs=10, verbose=2)

# Model Evaluation
loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
print(f"Test accuracy: {accuracy * 100:.2f}%")

Epoch 1/10
1875/1875 - 11s - loss: 0.2006 - accuracy: 0.9386 - 11s/epoch - 6ms/step
Epoch 2/10
1875/1875 - 10s - loss: 0.0654 - accuracy: 0.9792 - 10s/epoch - 5ms/step
Epoch 3/10
1875/1875 - 9s - loss: 0.0454 - accuracy: 0.9856 - 9s/epoch - 5ms/step
Epoch 4/10
1875/1875 - 9s - loss: 0.0368 - accuracy: 0.9881 - 9s/epoch - 5ms/step
Epoch 5/10
1875/1875 - 9s - loss: 0.0284 - accuracy: 0.9909 - 9s/epoch - 5ms/step
Epoch 6/10
1875/1875 - 9s - loss: 0.0261 - accuracy: 0.9916 - 9s/epoch - 5ms/step
Epoch 7/10
1875/1875 - 9s - loss: 0.0219 - accuracy: 0.9927 - 9s/epoch - 5ms/step
Epoch 8/10
1875/1875 - 9s - loss: 0.0183 - accuracy: 0.9939 - 9s/epoch - 5ms/step
Epoch 9/10
1875/1875 - 9s - loss: 0.0161 - accuracy: 0.9948 - 9s/epoch - 5ms/step
Epoch 10/10
1875/1875 - 9s - loss: 0.0157 - accuracy: 0.9951 - 9s/epoch - 5ms/step
Test accuracy: 98.79%


<hr style="border: 2px solid black">

## TOPIC: Analyzing AlexNet

#### Q1. Present an overview of the AlexNet architecture.

AlexNet is a deep convolutional neural network (CNN) architecture that gained significant attention and marked a breakthrough in image classification when it won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012. Here's an overview of the AlexNet architecture:
1. **Input Layer:** AlexNet takes color images as input, typically of size 224x224 pixels.
2. **Convolutional Layers:** It consists of five convolutional layers, each followed by a Rectified Linear Unit (ReLU) activation function. These layers are responsible for learning hierarchical features from the input images. The number of filters increases with depth.
3. **Max Pooling Layers:** After the first two convolutional layers, max-pooling layers are applied to downsample the spatial dimensions of the feature maps.
4. **Normalization Layers:** Local Response Normalization (LRN) layers are used to normalize the output of some convolutional layers. This helps enhance the network's ability to generalize.
5. **Dropout Layers:** Dropout layers are introduced after the fully connected layers to prevent overfitting during training.
6. **Fully Connected Layers:** AlexNet has three fully connected layers. The first two have 4,096 units each, and the last one has 1,000 units (corresponding to the 1,000 ImageNet classes). The final fully connected layer produces class predictions using a softmax activation function.

---

#### Q2. Explain the architectural innovations introduced in AlexNet that contributed to its breakthrough performance.

AlexNet introduced several architectural innovations that contributed to its breakthrough performance:
1. **Deep Architecture:** AlexNet was one of the first deep CNNs to have multiple convolutional layers stacked on top of each other. This depth allowed the network to learn hierarchical features from low-level edges and textures to high-level object parts and semantics.
2. **ReLU Activation:** AlexNet used Rectified Linear Units (ReLU) as activation functions instead of traditional sigmoid or tanh activations. ReLU activations accelerated training by mitigating the vanishing gradient problem.
3. **Local Response Normalization (LRN):** LRN layers were applied after some convolutional layers to provide local contrast normalization, which improved the network's generalization ability.
4. **Overlapping Max Pooling:** AlexNet used max-pooling layers with a stride smaller than the pooling window size, leading to overlapping pooling regions. This increased the network's ability to capture spatial hierarchies.
5. **Dropout Regularization:** Dropout layers were introduced after the fully connected layers to prevent overfitting. During training, dropout randomly deactivates neurons, which reduces co-dependency among them.
6. **Data Augmentation:** Data augmentation techniques, such as cropping and horizontal flipping of training images, were used to artificially increase the size of the training dataset. This helped improve the model's robustness to variations in input data.
7. **Large-Scale Training Data:** AlexNet was trained on a large dataset, ImageNet, which contained over a million images spanning 1,000 different classes. The availability of this extensive training dataset was crucial for the network's performance.

---

#### Q3. Discuss the role of convolutional layers, pooling layers, and fully connected layers in AlexNet.

- **Convolutional Layers:** The convolutional layers in AlexNet play a central role in feature extraction. They learn various low-level and high-level features from the input images, gradually building a hierarchical representation of visual information. Convolutional layers are responsible for capturing patterns, textures, and object parts.
- **Pooling Layers:** Max pooling layers in AlexNet downsample the spatial dimensions of the feature maps, reducing computational complexity and enhancing translation invariance. Overlapping pooling regions help capture spatial hierarchies within the feature maps.
- **Fully Connected Layers:** The fully connected layers at the end of AlexNet serve as a classifier. They take the high-level features learned by the convolutional layers and make class predictions. The final fully connected layer produces class probabilities using a softmax activation function. These layers capture global patterns and relationships in the data and provide the network's output.

---

#### Q4. Implement AlexNet using a deep learning framework of your choice and evaluate its performance on a dataset of your choice.

##### LeNet-5 using Pytorch

In [3]:
import torch
import torchvision
import torch.nn as nn
import torch.optim as opt
import torchvision.transforms as transform
from torch.utils.data import DataLoader

# Architechture
class alexnet(nn.Module):
    def __init__(self, num=10):
        super(alexnet, self).__init__()
        self.features = nn.Sequential(nn.Conv2d(3,64, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
                                      nn.MaxPool2d(kernel_size=3, stride=2),
                                      nn.Conv2d(64,192, kernel_size=5, padding=2), nn.ReLU(inplace=True),
                                      nn.MaxPool2d(kernel_size=3, stride=2), 
                                      nn.Conv2d(192,384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
                                      nn.Conv2d(384,256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
                                      nn.Conv2d(256,256, kernel_size=3, padding=1), nn.ReLU(inplace=True), 
                                      nn.MaxPool2d(kernel_size=3,stride=2),)
        self.avgpool = nn.AdaptiveAvgPool2d((6,6))
        self.classifier = nn.Sequential(nn.Dropout(), nn.Linear(256*6*6,4096), nn.ReLU(inplace=True),
                                        nn.Dropout(), nn.Linear(4096,4096), nn.ReLU(inplace=True),
                                        nn.Linear(4096,num),)
    def forward(self,x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x,1)
        x = self.classifier(x)
        return x

# Loading CIFAR-10 Dataset
transform = transform.Compose([transform.Resize((224,224)),transform.ToTensor(),
                               transform.Normalize(mean=[0.485,0.456,0.406], std=[0.229, 0.224, 0.225])])
Train = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
Test  = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
train = DataLoader(Train, batch_size=64, shuffle=True)
test  = DataLoader(Test, batch_size=64, shuffle=False)

# Model Prepartions
model = alexnet(num=10)
criterion = nn.CrossEntropyLoss()
optimizer = opt.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Training
for e in range(10):
    loss = 0
    for i, d in enumerate(train,0):
        inputs , labels = d
        optimizer.zero_grad()
        outputs = model(inputs)
        l = criterion(outputs,labels)
        l.backward()
        optimizer.step()
        loss+= l.item()
    print(f"Epoch {e + 1}, Loss: {loss / len(train)}")
print('Finishing Training')

# Evaluation
correct = 0
total = 0
with torch.no_grad():
    for d in test:
        image, labels = d
        output = model(image)
        _,predict = torch.max(output.data,1)
        total += labels.size(0)
        correct += (predict == labels).sum().item()
acc = 100*correct/total
print(f"Accuracy on test dataset: {acc}%")

Files already downloaded and verified
Files already downloaded and verified
Epoch 1, Loss: 2.296324372596448
Epoch 2, Loss: 2.0316340918736078
Epoch 3, Loss: 1.6874255785704269
Epoch 4, Loss: 1.49628219229486
Epoch 5, Loss: 1.3600778068270525
Epoch 6, Loss: 1.24536047856826
Epoch 7, Loss: 1.1388245041641738
Epoch 8, Loss: 1.0397641702991007
Epoch 9, Loss: 0.9535676575530215
Epoch 10, Loss: 0.8738080864138615
Finishing Training
Accuracy on test dataset: 68.7%


##### LeNet-5 using Tensorflow

In [9]:
import tensorflow as tf
from tensorflow.keras import datasets, layers, models

# Loading and Scaling Datasets
(x_train, y_train) , (x_test, y_test) = datasets.cifar10.load_data()
x_train, x_test = x_train/255.0, x_test/255.0

# Model Creations
model = models.Sequential([layers.Conv2D(96,(3,3), strides=(4,4), padding='same',activation='relu',input_shape=(32,32,3)),
                           layers.MaxPool2D((2,2), strides=(2,2)),
                           layers.Conv2D(256,(3,3), padding='same', activation='relu'),
                           layers.MaxPool2D((2,2), strides=(2,2)),
                           layers.Conv2D(384,(3,3), padding='same', activation='relu'),
                           layers.Conv2D(384,(3,3), padding='same', activation='relu'),
                           layers.Conv2D(256,(3,3), padding='same', activation='relu'),
                           layers.MaxPool2D((2,2), strides=(2,2)), 
                           layers.Flatten(),
                           layers.Dense(4096, activation='relu'),
                           layers.Dense(4096, activation='relu'),
                           layers.Dense(10)])
model.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=['accuracy'])
model.fit(x_train,y_train, epochs=10, verbose=2)

# Model Evaluation
loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
print(f"Test accuracy: {accuracy * 100:.2f}%")

Epoch 1/10
1563/1563 - 371s - loss: 2.3027 - accuracy: 0.0996 - 371s/epoch - 237ms/step
Epoch 2/10
1563/1563 - 366s - loss: 2.3028 - accuracy: 0.0984 - 366s/epoch - 234ms/step
Epoch 3/10
1563/1563 - 364s - loss: 2.3028 - accuracy: 0.0978 - 364s/epoch - 233ms/step
Epoch 4/10
1563/1563 - 365s - loss: 2.3028 - accuracy: 0.0984 - 365s/epoch - 234ms/step
Epoch 5/10
1563/1563 - 366s - loss: 2.3028 - accuracy: 0.0996 - 366s/epoch - 234ms/step
Epoch 6/10
1563/1563 - 376s - loss: 2.3028 - accuracy: 0.0973 - 376s/epoch - 241ms/step
Epoch 7/10
1563/1563 - 381s - loss: 2.3028 - accuracy: 0.0989 - 381s/epoch - 244ms/step
Epoch 8/10
1563/1563 - 371s - loss: 2.3028 - accuracy: 0.0982 - 371s/epoch - 237ms/step
Epoch 9/10
1563/1563 - 380s - loss: 2.3028 - accuracy: 0.0990 - 380s/epoch - 243ms/step
Epoch 10/10
1563/1563 - 395s - loss: 2.3028 - accuracy: 0.0959 - 395s/epoch - 253ms/step
Test accuracy: 10.00%
