# Understanding pooling and padding in CNN

## Q1. Describe the purpose and benefits of pooling in CNNs

Pooling is a fundamental operation in Convolutional Neural Networks (CNNs) used for down-sampling or reducing the spatial dimensions of feature maps. The primary purpose of pooling is to extract the most important information from an input volume (e.g., an image or feature map) while reducing the computational complexity of the network.

The main benefits of pooling are:

1. **Spatial Down-Sampling**: Reduces spatial dimensions, making the network computationally more efficient.

2. **Translation Invariance**: Makes the output less sensitive to small shifts of a feature within the pooling window.

3. **Feature Reduction**: Mitigates overfitting by reducing dimensionality and noise.

4. **Increased Receptive Field**: Enables capturing larger-scale features in deeper layers.

5. **Computation Efficiency**: Reduces computational complexity, vital for deep networks and high-resolution images.

Common types of pooling in CNNs are:

- **Max Pooling**: The maximum value within each pooling window (typically 2x2 or 3x3) is kept. Max pooling is effective at capturing the most salient feature in each region.

- **Average Pooling**: Instead of selecting the maximum value, average pooling computes the average of values within the pooling window. It can be useful when you want a smoother down-sampling and are less concerned with preserving the most prominent features.
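
The two pooling types above can be illustrated with a minimal NumPy sketch. The helper name `pool2d` and the sample values are assumptions made for this example; deep learning frameworks provide their own optimized pooling layers.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Apply max or average pooling to a 2-D feature map (no padding)."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    reduce_fn = np.max if mode == "max" else np.mean
    out = np.empty((out_h, out_w), dtype=float)
    for i in range(out_h):
        for j in range(out_w):
            # Slice the window that this output position summarizes.
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = reduce_fn(window)
    return out

# Illustrative 4x4 feature map (values chosen arbitrarily).
x = np.array([[1, 3, 2, 0],
              [5, 7, 1, 1],
              [0, 2, 8, 4],
              [2, 2, 6, 2]])

max_pooled = pool2d(x, mode="max")  # [[7, 2], [2, 8]]
avg_pooled = pool2d(x, mode="avg")  # [[4, 1], [1.5, 5]]
```

Both results are 2x2: the spatial size is halved, and only the summary of each window differs.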

## Q2. Explain the difference between MinPooling and MaxPooling?

MaxPooling and MinPooling are both pooling operations that down-sample feature maps in Convolutional Neural Networks (CNNs). MaxPooling is one of the most widely used pooling operations, while MinPooling is far less common in practice.

1. **MaxPooling**:

   - **Operation**: MaxPooling selects the maximum value from each local region of the input feature map.
   - **Purpose**: It is primarily used to capture the most prominent feature within each region. MaxPooling focuses on retaining the most important information.
   - **Effect**: MaxPooling helps in preserving the salient features of the input data and is effective in tasks where identifying the presence of specific features or patterns is critical.
   - **Advantage**: It is robust to noise in the input data and can be helpful when dealing with translation-invariant features.

2. **MinPooling**:

   - **Operation**: MinPooling selects the minimum value from each local region of the input feature map.
   - **Purpose**: It aims to capture the least prominent feature within each region. MinPooling can be useful when you want to identify the absence of specific features or patterns.
   - **Effect**: MinPooling focuses on the least prominent features, which can help in certain types of feature detection or anomaly detection tasks.
   - **Advantage**: It can be sensitive to variations in the input data and is suitable for specific use cases where detecting outliers is important.



In summary, max pooling selects the maximum value within each pooling window (typically 2x2 or 3x3) and is effective at capturing the most salient feature in each region. MinPooling selects the minimum value within each window and captures the least prominent feature, which can be useful for tasks such as anomaly detection.
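
As a rough illustration of the contrast, the sketch below applies both operations to the same feature map with non-overlapping 2x2 windows. The helper `pool_2x2` and the input values are hypothetical and only serve this example.

```python
import numpy as np

def pool_2x2(feature_map, reduce_fn):
    """Non-overlapping 2x2 pooling; height and width must be even."""
    h, w = feature_map.shape
    # Split the map into (h/2, w/2) blocks of shape 2x2, then reduce each block.
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return reduce_fn(blocks, axis=(1, 3))

x = np.array([[0.9, 0.1, 0.4, 0.5],
              [0.2, 0.3, 0.6, 0.0],
              [0.7, 0.7, 0.2, 0.8],
              [0.1, 0.5, 0.3, 0.9]])

max_pooled = pool_2x2(x, np.max)  # strongest activation per window
min_pooled = pool_2x2(x, np.min)  # weakest activation per window
```

Min pooling is equivalent to negating the input, max pooling, and negating the result, which is one way to obtain it in a framework that only ships a max pooling layer.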


## Q3. Discuss the concept of padding and its significance


Padding is a technique used in Convolutional Neural Networks (CNNs). It involves adding extra elements, usually zeros, around the edges of an input such as an image or a feature map.

Padding is significant for the following reasons:

1. **Preservation of Spatial Dimensions**:
   - Without padding, the spatial dimensions of feature maps decrease with each convolutional layer. Padding, especially "same" padding, ensures that the spatial dimensions remain the same after convolution. This is important because it preserves the spatial information of the input, which can be crucial for tasks like object localization.

2. **Mitigation of Boundary Effects**:
   - Convolutional operations typically involve sliding a filter (also known as a kernel) over the input data. Without padding, the center of the filter will not be applied to the pixels near the edges of the input. Padding ensures that every pixel in the input has an equal opportunity to be part of the convolution operation, reducing boundary effects.

3. **Control Over Receptive Field**:
   - Padding does not change the kernel size, but it controls how quickly feature maps shrink and therefore how many layers can be stacked. Because the receptive field (the region of input data used to compute a feature in the output) grows with depth, padding indirectly affects the receptive field a network can reach.

4. **Prevention of Information Loss**:
   - In the absence of padding, the spatial dimensions of feature maps shrink as you move deeper into the network. This can lead to information loss and reduced ability to capture fine-grained details. Padding helps mitigate this issue by preserving spatial information.

5. **Better Handling of Strides**:
   - When you apply a convolutional layer with a stride larger than 1 (i.e., skipping some positions during the filter's movement), padding ensures that the output dimensions are well-defined. It can also help maintain alignment with other layers in the network.

6. **Compatibility with Deconvolution/Transposed Convolution**:
   - Padding can be essential when using deconvolution or transposed convolution layers in neural networks. It ensures that the output dimensions align with the desired input dimensions.


In summary, padding is a critical technique in CNNs that allows for the preservation of spatial information, prevention of boundary effects, control over receptive field size, and better management of spatial dimensions, ensuring that deep neural networks can effectively capture and process features in input data while avoiding issues related to border pixels and data loss.
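
A small sketch of zero-padding with NumPy's `np.pad` follows. The 3x3 input is an arbitrary example; one ring of zeros is the amount a 3x3 kernel needs to keep the output the same size as the input at stride 1.

```python
import numpy as np

# Arbitrary 3x3 "image" used only for illustration.
image = np.arange(1, 10).reshape(3, 3)

# Add one ring of zeros around the image.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

print(image.shape)   # (3, 3)
print(padded.shape)  # (5, 5)
print(padded)
```

The original values sit unchanged in the middle of the padded array, and every original border pixel can now be the center of a 3x3 window.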

## Q4. Compare and contrast zero-padding and valid-padding in terms of their effects on the output feature map size.

Zero-padding and valid-padding are common padding techniques used in convolutional neural networks (CNNs), and they have different effects on the size of the output feature maps.

1. **Zero-Padding** (also known as "Same" Padding):
   - In zero-padding, additional rows and columns of zeros are added to the input feature map to ensure that the output feature map has the same spatial dimensions (width and height) as the input.
   - The number of rows and columns added as padding is determined by the size of the filter (kernel). If the ***filter size is F (with F odd) and the stride is 1, the amount of padding added to each side is (F - 1) / 2*** to keep the dimensions the same.
   - Zero-padding results in an output feature map with the same spatial dimensions as the input, making it useful for tasks where preserving spatial information is crucial (e.g., image segmentation or object detection).
   - Zero-padding ensures that the center of the filter is applied to every pixel in the input, mitigating boundary effects.

2. **Valid-Padding** (No Padding):
   - In valid-padding, no extra rows or columns are added to the input feature map. The filter is applied to the input without extending beyond its edges.
   - Valid-padding results in an output feature map with reduced spatial dimensions compared to the input. The amount of reduction depends on the filter size, stride, and the shape of the input.
   - This padding type is suitable when you want to reduce the spatial dimensions and capture features at various scales or when you have enough data and don't need to preserve spatial information.

### Example
- Input feature map size: 6x6
- Filter size: 3x3

With zero-padding (Same Padding):
- The filter is centered at each pixel of the input, including the pixels at the edges.
- The output feature map size remains 6x6.

With valid-padding (No Padding):
- The filter is not allowed to extend beyond the input's edges.
- The output feature map size is reduced due to the absence of padding. With a stride of 1, each side becomes 6 - 3 + 1 = 4, so the output is 4x4.

In summary, zero-padding preserves the spatial dimensions of the feature maps, while valid-padding reduces the feature map size. The choice of padding depends on the specific requirements of your task and the architecture of your neural network. Zero-padding is commonly used when you need to maintain spatial information, while valid-padding is used when you want to reduce dimensions and capture features at various scales.
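
The sizes in the example follow from the standard output-size formula, output = floor((N + 2P - F) / S) + 1, where N is the input size, F the filter size, P the padding per side, and S the stride. The sketch below encodes it; the function names are illustrative, not part of any library.

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Output side length for input size n, filter size f, given stride and padding."""
    return (n + 2 * padding - f) // stride + 1

def same_padding(f):
    """Padding per side that preserves the size at stride 1 (odd filter sizes only)."""
    if f % 2 == 0:
        raise ValueError("this simple rule assumes an odd filter size")
    return (f - 1) // 2

valid_size = conv_output_size(6, 3)                          # 4
same_size = conv_output_size(6, 3, padding=same_padding(3))  # 6
print(valid_size, same_size)
```

With a 6x6 input and a 3x3 filter this reproduces the 4x4 (valid) and 6x6 (same) outputs from the example.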

# Exploring LeNet

## Q1. Provide a brief overview of LeNet-5 Architecture

LeNet-5, from the paper Gradient-Based Learning Applied to Document Recognition, is a very efficient convolutional neural network for handwritten character recognition.


<a href="http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf" target="_blank">Paper: <u>Gradient-Based Learning Applied to Document Recognition</u></a>

**Authors**: Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner

**Published in**: Proceedings of the IEEE (1998)

### Structure of the LeNet network

LeNet-5 is a small network that contains the basic building blocks of deep learning: convolutional layers, pooling layers, and fully connected layers. It is the basis of many later deep learning models. Here we analyze LeNet-5 in depth and use it as a worked example to deepen the understanding of convolutional and pooling layers.

![lenet](https://raw.githubusercontent.com/entbappy/Branching-tutorial/master/lenet/lenet-5.png)


LeNet-5 has seven layers in total, not counting the input, and each of them contains trainable parameters. Each layer has multiple feature maps. Each feature map extracts one characteristic of its input through a convolution filter, and each feature map contains multiple neurons.

![lenet1](https://raw.githubusercontent.com/entbappy/Branching-tutorial/master/lenet/arch.jpg)

Detailed explanation of each layer parameter:

#### **INPUT Layer**

The first is the data INPUT layer. The size of the input image is uniformly normalized to 32 * 32.

> Note: This layer does not count as part of the LeNet-5 network structure. Traditionally, the input layer is not considered one of the network's layers.


#### **C1 layer-convolutional layer**

>**Input picture**: 32 * 32

>**Convolution kernel size**: 5 * 5

>**Number of convolution kernels**: 6

>**Output feature map size**: 28 * 28 (32 - 5 + 1 = 28)

>**Number of neurons**: 28 * 28 * 6

>**Trainable parameters**: (5 * 5 + 1) * 6 (5 * 5 = 25 unit parameters and one bias parameter per filter, a total of 6 filters)

>**Number of connections**: (5 * 5 + 1) * 6 * 28 * 28 = 122304

**Detailed description:**

1. The first convolution operation is performed on the input image (using 6 convolution kernels of size 5 * 5) to obtain 6 C1 feature maps (6 feature maps of size 28 * 28, 32-5 + 1 = 28).

2. Let's take a look at how many parameters are needed. The size of the convolution kernel is 5 * 5, and there are 6 * (5 * 5 + 1) = 156 parameters in total, where +1 indicates that a kernel has a bias.

3. For the convolutional layer C1, each pixel in C1 is connected to 5 * 5 pixels and 1 bias in the input image, so there are 156 * 28 * 28 = 122304 connections in total. There are 122,304 connections, but we only need to learn 156 parameters, mainly through weight sharing.
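
The C1 figures above can be checked with a few lines of arithmetic. The helper `conv_layer_counts` is a hypothetical name for this sketch; it assumes every output map reads all input maps and has one bias per filter.

```python
def conv_layer_counts(in_maps, kernel, out_maps, out_size):
    """Return (trainable parameters, connections) for a convolutional layer."""
    # Each filter has kernel*kernel weights per input map plus one bias.
    params = (in_maps * kernel * kernel + 1) * out_maps
    # Every output pixel reuses the same weights, so connections scale with the map area.
    connections = params * out_size * out_size
    return params, connections

c1_out = 32 - 5 + 1  # 28
c1_params, c1_connections = conv_layer_counts(in_maps=1, kernel=5,
                                              out_maps=6, out_size=c1_out)
print(c1_params, c1_connections)  # 156 122304
```

The gap between 156 parameters and 122304 connections is the effect of weight sharing.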


#### **S2 layer-pooling layer (downsampling layer)**

>**Input**: 28 * 28

>**Sampling area**: 2 * 2

>**Sampling method**: the 4 inputs are summed, multiplied by a trainable coefficient, and added to a trainable bias; the result is passed through a sigmoid

>**Number of feature maps**: 6

>**Output feature map size**: 14 * 14 (28 / 2 = 14)

>**Number of neurons**: 14 * 14 * 6

>**Trainable parameters**: 2 * 6 (the weight of the sum + the offset)

>**Number of connections**: (2 * 2 + 1) * 6 * 14 * 14

>The size of each feature map in S2 is 1/4 of the size of the feature map in C1.

**Detailed description:**

Pooling follows immediately after the first convolution. It uses 2 * 2 windows and produces S2: 6 feature maps of size 14 * 14 (28 / 2 = 14).

Each unit in S2 takes the sum of the pixels in a 2 * 2 area of C1, multiplies it by a weight coefficient, adds a bias, and passes the result through the sigmoid function.

Each pooling kernel therefore has two trainable parameters, which gives 2 x 6 = 12 trainable parameters in total, while there are 5 x 14 x 14 x 6 = 5880 connections.
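
A minimal sketch of this trainable subsampling step for a single feature map is shown below. The function name `lenet_subsample`, the random input, and the chosen weight and bias values are assumptions for illustration only.

```python
import numpy as np

def lenet_subsample(feature_map, weight, bias):
    """Sum each 2x2 window, scale by one weight, add one bias, apply a sigmoid."""
    h, w = feature_map.shape
    window_sums = feature_map.reshape(h // 2, 2, w // 2, 2).sum(axis=(1, 3))
    return 1.0 / (1.0 + np.exp(-(weight * window_sums + bias)))

# Stand-in for one 28x28 C1 feature map (random values, not real activations).
c1_map = np.random.default_rng(0).normal(size=(28, 28))
s2_map = lenet_subsample(c1_map, weight=0.25, bias=0.0)
print(s2_map.shape)  # (14, 14)

s2_params = 2 * 6                         # one weight and one bias per map
s2_connections = (2 * 2 + 1) * 6 * 14 * 14
print(s2_params, s2_connections)          # 12 5880
```

With `weight=0.25` and `bias=0.0` the layer reduces to average pooling followed by a sigmoid, which is why it is often approximated by plain average pooling in re-implementations.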

#### **C3 layer-convolutional layer**

>**Input**: combinations of some or all of the 6 feature maps in S2

>**Convolution kernel size**: 5 * 5

>**Number of convolution kernels**: 16

>**Output feature map size**: 10 * 10 (14 - 5 + 1 = 10)

>Each feature map in C3 is connected to several or all of the 6 feature maps in S2, so each feature map of this layer is a different combination of the feature maps extracted by the previous layer.

>In LeNet-5, the first 6 feature maps of C3 each take a subset of 3 adjacent feature maps in S2 as input. The next 6 each take a subset of 4 adjacent feature maps in S2. The next 3 each take a subset of 4 non-adjacent feature maps. The last one takes all 6 feature maps in S2 as input.

>**The trainable parameters are**: 6 * (3 * 5 * 5 + 1) + 6 * (4 * 5 * 5 + 1) + 3 * (4 * 5 * 5 + 1) + 1 * (6 * 5 * 5 +1) = 1516

>**Number of connections**: 10 * 10 * 1516 = 151600

**Detailed description:**

After the first pooling comes the second convolution. Its output is C3: 16 feature maps of size 10x10, computed with 5 * 5 convolution kernels. S2 has only 6 feature maps of size 14 * 14, so how are 16 feature maps obtained from 6? Each C3 feature map is computed from a particular combination of S2 feature maps, as follows.

The first 6 feature maps of C3 (the first red box in the figure below) are each connected to 3 adjacent feature maps of S2. The next 6 feature maps (the second red box in the figure below) are each connected to 4 adjacent feature maps of S2. The next 3 feature maps are each connected to 4 non-adjacent feature maps of S2, and the last one is connected to all 6 feature maps of S2. The convolution kernel size is still 5 * 5, so there are 6 * (3 * 5 * 5 + 1) + 6 * (4 * 5 * 5 + 1) + 3 * (4 * 5 * 5 + 1) + 1 * (6 * 5 * 5 + 1) = 1516 parameters. Each output feature map is 10 * 10, so there are 151600 connections.

![lenet1](https://raw.githubusercontent.com/entbappy/Branching-tutorial/master/lenet/c31.png)


The convolution between C3 and the first 3 feature maps of S2 is shown below:

![lenet1](https://raw.githubusercontent.com/entbappy/Branching-tutorial/master/lenet/c32.png)
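
The C3 parameter count can be reproduced from the grouping described above. In this sketch the list `c3_groups` simply restates that grouping as (number of C3 maps, number of S2 maps each one reads); the variable names are illustrative.

```python
kernel_weights = 5 * 5

# (number of C3 maps in the group, number of S2 maps each of them reads)
c3_groups = [(6, 3), (6, 4), (3, 4), (1, 6)]

c3_maps = sum(n_maps for n_maps, _ in c3_groups)
c3_params = sum(n_maps * (n_inputs * kernel_weights + 1)
                for n_maps, n_inputs in c3_groups)
c3_connections = c3_params * 10 * 10

# For comparison: every C3 map connected to all 6 S2 maps.
fully_connected_params = 16 * (6 * kernel_weights + 1)

print(c3_maps, c3_params, c3_connections)  # 16 1516 151600
print(fully_connected_params)              # 2416
```

The sparse connection scheme needs 1516 parameters, while a convolution that connects every C3 map to every S2 map would need 2416.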


#### **S4 layer-pooling layer (downsampling layer)**

>**Input**: 10 * 10

>**Sampling area**: 2 * 2

>**Sampling method**: the 4 inputs are summed, multiplied by a trainable coefficient, and added to a trainable bias; the result is passed through a sigmoid

>**Number of feature maps**: 16

>**Output feature map size**: 5 * 5 (10 / 2 = 5)

>**Number of neurons**: 5 * 5 * 16 = 400

>**Trainable parameters**: 2 * 16 = 32 (the weight of the sum + the offset)

>**Number of connections**: 16 * (2 * 2 + 1) * 5 * 5 = 2000

>The size of each feature map in S4 is 1/4 of the size of the feature map in C3

**Detailed description:**

S4 is a pooling layer whose window size is still 2 * 2. The 16 feature maps of size 10x10 from the C3 layer are pooled in 2x2 units to obtain 16 feature maps of size 5x5. This layer has 2 x 16 = 32 trainable parameters and 5 x 5 x 5 x 16 = 2000 connections.

*The connection is similar to the S2 layer.*

#### **C5 layer-convolution layer**

>**Input**: all 16 feature maps of the S4 layer (each C5 unit is connected to all of them)

>**Convolution kernel size**: 5 * 5

>**Number of convolution kernels**: 120

>**Output feature map size**: 1 * 1 (5 - 5 + 1 = 1)

>**Trainable parameters / connection**: 120 * (16 * 5 * 5 + 1) = 48120

**Detailed description:**


The C5 layer is a convolutional layer. Since each of the 16 feature maps of the S4 layer is 5x5, which is the same size as the convolution kernel, each feature map produced by the convolution is 1x1. This results in 120 convolution outputs, each connected to all 16 feature maps of the previous layer. So there are (5x5x16 + 1) x 120 = 48120 parameters, and also 48120 connections. The network structure of the C5 layer is as follows:

![lenet1](https://raw.githubusercontent.com/entbappy/Branching-tutorial/master/lenet/c5.png)


#### **F6 layer-fully connected layer**

>**Input**: the 120-dimensional output vector of C5

>**Calculation method**: calculate the dot product between the input vector and the weight vector, plus an offset, and the result is output through the sigmoid function.

>**Trainable parameters**: 84 * (120 + 1) = 10164

**Detailed description:**

F6 is a fully connected layer with 84 nodes, corresponding to a 7x12 bitmap in which -1 means white and 1 means black, so the black-and-white bitmap of each symbol corresponds to a code. The number of trainable parameters and connections for this layer is (120 + 1) x 84 = 10164. The ASCII encoding diagram is as follows:

![lenet1](https://raw.githubusercontent.com/entbappy/Branching-tutorial/master/lenet/f61.png)

The connection method of the F6 layer is as follows:

![lenet1](https://raw.githubusercontent.com/entbappy/Branching-tutorial/master/lenet/f62.png)


#### **Output layer-fully connected layer**

The output layer is also a fully connected layer, with 10 nodes that represent the digits 0 to 9. If the output of node i is the closest to 0, the network recognizes the input as the digit i. The nodes are radial basis function (RBF) units. With x as the input from the previous layer and y as the RBF output, the output is computed as:

![lenet1](https://raw.githubusercontent.com/entbappy/Branching-tutorial/master/lenet/81.png)

In the formula above, w_ij is determined by the bitmap encoding of i, where i ranges from 0 to 9 and j ranges from 0 to 7 * 12 - 1. The closer the RBF output of node i is to 0, the closer the F6 output is to the ASCII bitmap encoding of i, and the more strongly the network recognizes the input as the character i. This layer has 84 x 10 = 840 parameters and connections.

![lenet1](https://raw.githubusercontent.com/entbappy/Branching-tutorial/master/lenet/82.png)
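
A rough NumPy sketch of the RBF output is given below, assuming the squared Euclidean distance form y_i = sum_j (x_j - w_ij)^2 from the LeNet-5 paper. The random +1/-1 `prototypes` are a hypothetical stand-in for the real 7x12 bitmap codes, which are not reproduced here.

```python
import numpy as np

def rbf_output(x, w):
    """x: (84,) F6 activations; w: (10, 84) class codes. Returns one distance per class."""
    return np.sum((x - w) ** 2, axis=1)

rng = np.random.default_rng(0)
# Hypothetical stand-in for the bitmap codes: one +1/-1 vector per digit.
prototypes = rng.choice([-1.0, 1.0], size=(10, 84))

# Pretend F6 reproduced the code of digit 3 exactly.
x = prototypes[3].copy()
y = rbf_output(x, prototypes)
predicted = int(np.argmin(y))  # the class whose output is closest to 0
print(predicted)
```

The prediction is the index with the smallest output, which is the opposite convention from a softmax layer, where the largest output wins.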


**Summary**


* LeNet-5 is a very efficient convolutional neural network for handwritten character recognition.
* Convolutional neural networks can make good use of the structural information of images.
* Convolutional layers need relatively few parameters, which follows from their two main characteristics: local connections and shared weights.
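
The per-layer parameter counts derived in this section can be tallied in a short sketch. The dictionary below only restates the numbers given above; the variable names are illustrative.

```python
# Trainable parameters per layer, as derived in the sections above.
lenet5_params = {
    "C1": 156,
    "S2": 12,
    "C3": 1516,
    "S4": 32,
    "C5": 48120,
    "F6": 10164,
    "OUTPUT": 840,
}

total_params = sum(lenet5_params.values())
# C1 and C3 are the layers that share weights across spatial positions.
shared_weight_params = lenet5_params["C1"] + lenet5_params["C3"]

print(total_params)                         # 60840
print(shared_weight_params / total_params)  # fraction held by C1 and C3
```

Most of the parameters sit in C5 and F6, where every unit is connected to the whole previous layer.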












## Q2. Describe the Key components of the LeNet-5 Architecture and their respective purposes


LeNet-5 is a classic convolutional neural network (CNN) architecture developed by Yann LeCun and his colleagues in the 1990s. It played a significant role in popularizing the use of CNNs in computer vision tasks and was one of the early models for handwritten digit recognition. The architecture consists of several key components, each with its own purpose:

1. **Input Layer**:
   - Purpose: The input layer receives the input image data. LeNet-5 was primarily designed for grayscale images, so it typically has one input channel (for color images, this would be three channels, such as RGB).

2. **Convolutional Layers**:
   - Purpose: LeNet-5 contains multiple convolutional layers that perform feature extraction. Each convolutional layer consists of a set of filters (kernels) that slide over the input data to detect local features. The purpose is to capture patterns and features in the input image.

3. **Activation Function (Sigmoid/Tanh)**:
   - Purpose: After each convolutional layer, LeNet-5 applies a sigmoidal squashing function (a scaled hyperbolic tangent in the original paper) to introduce non-linearity to the model. The activation function helps the network learn complex patterns and relationships in the data.

4. **Average Pooling Layers**:
   - Purpose: Average pooling layers down-sample the feature maps, reducing their spatial dimensions. This operation helps in reducing the number of parameters, computational complexity, and making the network more robust to small spatial translations.

5. **Fully Connected Layers**:
   - Purpose: LeNet-5 ends with fully connected layers. These layers connect every neuron in one layer to every neuron in the subsequent layer. The fully connected layers perform classification based on the features extracted by the convolutional layers and down-sampled by the pooling layers.

6. **Softmax Layer**:
   - Purpose: In modern re-implementations, the final layer of LeNet-5 is typically a softmax layer, which computes class probabilities (the original network used RBF output units instead). It is used for multi-class classification tasks, such as digit recognition. The softmax function ensures that the class probabilities sum to 1, making it easy to identify the most likely class.

7. **Output Layer**:
   - Purpose: The output layer produces the final classification output, indicating the predicted class label. The class with the highest probability from the softmax layer is selected as the prediction.

8. **Fully Connected and Softmax Layer (Optional)**:
   - Some variants of LeNet-5 include additional fully connected layers and softmax layers to perform hierarchical classification, allowing for more complex classification tasks.

In summary, LeNet-5's key components include convolutional layers for feature extraction, activation functions for introducing non-linearity, average pooling layers for down-sampling, fully connected layers for classification, and a softmax layer for multi-class classification. LeNet-5 played a pivotal role in shaping the architecture of modern CNNs and is a foundational model in the field of deep learning.

## Q3. Discuss the advantages and disadvantages of the LeNet-5 in the context of image classification tasks

LeNet-5, as one of the early convolutional neural network (CNN) architectures, has both advantages and disadvantages in the context of image classification tasks.

**Advantages**:

1. **Pioneering Role**: LeNet-5 was one of the pioneering CNN architectures in the field of computer vision and deep learning. It demonstrated the effectiveness of using convolutional layers for feature extraction in image classification tasks.

2. **Hierarchical Feature Extraction**: LeNet-5 introduced the concept of hierarchical feature extraction, where features are extracted at multiple levels of abstraction. This approach is fundamental to the success of CNNs.

3. **Simplicity and Interpretability**: LeNet-5 is relatively simple compared to more complex modern architectures like ResNet or Inception. Its simplicity makes it more interpretable and easier to understand.

4. **Efficiency for Small Datasets**: LeNet-5 is efficient when working with relatively small datasets. It can perform well in tasks like handwritten digit recognition and basic image classification with limited training data.

**Disadvantages**:

1. **Limited Depth**: LeNet-5 is relatively shallow by modern standards. It lacks the depth and capacity to learn complex hierarchical features from large-scale datasets. Deep networks like VGG, ResNet, and Inception have demonstrated superior performance on more challenging tasks.

2. **Limited Receptive Field**: The small receptive field of LeNet-5's early layers may limit its ability to capture global context and long-range dependencies in images. Modern architectures have larger receptive fields.

3. **Sensitivity to Variations**: LeNet-5 is less robust to various image variations, such as changes in scale, rotation, or illumination, compared to more advanced architectures that include normalization layers, residual connections, and data augmentation techniques.

4. **Lack of Advanced Activation Functions**: LeNet-5 typically uses sigmoid or tanh activation functions. Modern architectures often employ rectified linear units (ReLU) or variants, which are computationally more efficient and better at mitigating vanishing gradient problems.

5. **Less Scalability**: LeNet-5 may not scale well to very large datasets and high-resolution images. Modern architectures are designed to handle large-scale datasets and higher-resolution images efficiently.

In conclusion, while LeNet-5 played a pivotal role in the development of CNNs and is a remarkable achievement in the field of deep learning, it has limitations that make it less suitable for state-of-the-art image classification tasks. Modern architectures have addressed these limitations and offer significant advantages in terms of depth, scalability, robustness, and performance. However, LeNet-5 remains a valuable model for educational purposes and tasks where simplicity and interpretability are desired.

## Q4. Implement LeNet-5 using a deep learning framework of your choice (e.g., TensorFlow, PyTorch) and train it on a publicly available dataset (e.g., MNIST). Evaluate its performance and provide insights.

In [1]:
#Importing the libraries
import keras
from keras.datasets import mnist
from keras.layers import Dense,Flatten,Conv2D,MaxPooling2D,AveragePooling2D
from keras.models import Sequential
from keras.utils import to_categorical

#Loading the data
(X_train,y_train) ,(X_test,y_test) = mnist.load_data()

#Normalizing the pixel values between 0 and 1
X_train = X_train /255.0
X_test = X_test / 255.0

#Converting labels to one-hot encoded
y_train = to_categorical(y_train,10)
y_test = to_categorical(y_test,10)

print(X_train.shape)

(60000, 28, 28)


In [2]:
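#Building Model Architecture (LeNet-5 style)
#Note: MNIST images are 28x28, so the first convolution outputs 24x24 instead of the 28x28 obtained from the original 32x32 input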
model = Sequential()

model.add(Conv2D(6,kernel_size=(5,5),padding="valid",activation="tanh",input_shape =(28,28,1)))
model.add(AveragePooling2D(pool_size=(2,2),strides=2,padding='valid'))

model.add(Conv2D(16,kernel_size=(5,5),padding="valid",activation="tanh"))
model.add(AveragePooling2D(pool_size=(2,2),strides=2,padding='valid'))

model.add(Flatten())

model.add(Dense(120,activation="tanh"))
model.add(Dense(84,activation="tanh"))
model.add(Dense(10,activation="softmax"))

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d (Conv2D)             (None, 24, 24, 6)         156       
                                                                 
 average_pooling2d (Average  (None, 12, 12, 6)         0         
 Pooling2D)                                                      
                                                                 
 conv2d_1 (Conv2D)           (None, 8, 8, 16)          2416      
                                                                 
 average_pooling2d_1 (Avera  (None, 4, 4, 16)          0         
 gePooling2D)                                                    
                                                                 
 flatten (Flatten)           (None, 256)               0         
                                                                 
 dense (Dense)               (None, 120)               3

In [3]:
#Compiling the model
model.compile(loss=keras.losses.categorical_crossentropy, optimizer=keras.optimizers.Adam(), metrics=['accuracy'])

#Fitting the model
history = model.fit(X_train.reshape(-1, 28, 28, 1),y_train,
                    batch_size=128,
                    epochs=3,
                    verbose=2,
                    validation_split=0.2
                    )



Epoch 1/3
375/375 - 12s - loss: 0.4070 - accuracy: 0.8844 - val_loss: 0.1962 - val_accuracy: 0.9433 - 12s/epoch - 33ms/step
Epoch 2/3
375/375 - 2s - loss: 0.1501 - accuracy: 0.9547 - val_loss: 0.1191 - val_accuracy: 0.9664 - 2s/epoch - 6ms/step
Epoch 3/3
375/375 - 1s - loss: 0.0994 - accuracy: 0.9696 - val_loss: 0.0884 - val_accuracy: 0.9737 - 1s/epoch - 4ms/step


In [4]:
#Evaluating the model
score = model.evaluate(X_test.reshape(-1, 28, 28, 1), y_test)

print('Test Loss:', score[0])
print('Test accuracy:', score[1])

Test Loss: 0.08106596022844315
Test accuracy: 0.9764000177383423


# Analyzing AlexNet

## Q1. Present an overview of the AlexNet architecture



>AlexNet, the winner of the 2012 ImageNet competition, was designed by Geoffrey Hinton and his student Alex Krizhevsky. In the following years, more and deeper neural networks were proposed, such as VGG and GoogLeNet. Its officially released model reaches a top-1 accuracy of 57.1% and a top-5 accuracy of 80.2%, which is outstanding compared with traditional machine learning classification algorithms.


![title](https://raw.githubusercontent.com/entbappy/Branching-tutorial/19087e9920ff7db29e4103cc660bb41eca510b57/alexnet/alexnet.png)


![title](https://raw.githubusercontent.com/entbappy/Branching-tutorial/master/alexnet/alexnet2.png)

>The following table explains the network structure of AlexNet:



<table>
<thead>
	<tr>
		<th>Size / Operation</th>
		<th>Filter</th>
		<th>Depth</th>
		<th>Stride</th>
		<th>Padding</th>
		<th>Number of Parameters</th>
		<th>Forward Computation</th>
	</tr>
</thead>
<tbody>
	<tr>
		<td>3 * 227 * 227</td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
	</tr>
	<tr>
		<td>Conv1 + Relu</td>
		<td>11 * 11</td>
		<td>96</td>
		<td>4</td>
		<td></td>
		<td>(11*11*3 + 1) * 96=34944</td>
		<td>(11*11*3 + 1) * 96 * 55 * 55=105705600</td>
	</tr>
	<tr>
		<td>96 * 55 * 55</td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
	</tr>
	<tr>
		<td>Max Pooling</td>
		<td>3 * 3</td>
		<td></td>
		<td>2</td>
		<td></td>
		<td></td>
		<td></td>
	</tr>
	<tr>
		<td>96 * 27 * 27</td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
	</tr>
	<tr>
		<td>Norm</td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
	</tr>
	<tr>
		<td>Conv2 + Relu</td>
		<td>5 * 5</td>
		<td>256</td>
		<td>1</td>
		<td>2</td>
		<td>(5 * 5 * 96 + 1) * 256=614656</td>
		<td>(5 * 5 * 96 + 1) * 256 * 27 * 27=448084224</td>
	</tr>
	<tr>
		<td>256 * 27 * 27</td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
	</tr>
	<tr>
		<td>Max Pooling</td>
		<td>3 * 3</td>
		<td></td>
		<td>2</td>
		<td></td>
		<td></td>
		<td></td>
	</tr>
	<tr>
		<td>256 * 13 * 13</td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
	</tr>
	<tr>
		<td>Norm</td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
	</tr>
	<tr>
		<td>Conv3 + Relu</td>
		<td>3 * 3</td>
		<td>384</td>
		<td>1</td>
		<td>1</td>
		<td>(3 * 3 * 256 + 1) * 384=885120</td>
		<td>(3 * 3 * 256 + 1) * 384 * 13 * 13=149585280</td>
	</tr>
	<tr>
		<td>384 * 13 * 13</td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
	</tr>
	<tr>
		<td>Conv4 + Relu</td>
		<td>3 * 3</td>
		<td>384</td>
		<td>1</td>
		<td>1</td>
		<td>(3 * 3 * 384 + 1) * 384=1327488</td>
		<td>(3 * 3 * 384 + 1) * 384 * 13 * 13=224345472</td>
	</tr>
	<tr>
		<td>384 * 13 * 13</td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
	</tr>
	<tr>
		<td>Conv5 + Relu</td>
		<td>3 * 3</td>
		<td>256</td>
		<td>1</td>
		<td>1</td>
		<td>(3 * 3 * 384 + 1) * 256=884992</td>
		<td>(3 * 3 * 384 + 1) * 256 * 13 * 13=149563648</td>
	</tr>
	<tr>
		<td>256 * 13 * 13</td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
	</tr>
	<tr>
		<td>Max Pooling</td>
		<td>3 * 3</td>
		<td></td>
		<td>2</td>
		<td></td>
		<td></td>
		<td></td>
	</tr>
	<tr>
		<td>256 * 6 * 6</td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
	</tr>
	<tr>
		<td>Dropout (rate 0.5)</td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
	</tr>
	<tr>
		<td>FC6 + Relu</td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td>256 * 6 * 6 * 4096=37748736</td>
		<td>256 * 6 * 6 * 4096=37748736</td>
	</tr>
	<tr>
		<td>4096</td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
	</tr>
	<tr>
		<td>Dropout (rate 0.5)</td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
	</tr>
	<tr>
		<td>FC7 + Relu</td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td>4096 * 4096=16777216</td>
		<td>4096 * 4096=16777216</td>
	</tr>
	<tr>
		<td>4096</td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
	</tr>
	<tr>
		<td>FC8 + Softmax</td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td>4096 * 1000=4096000</td>
		<td>4096 * 1000=4096000</td>
	</tr>
	<tr>
		<td>1000 classes</td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
	</tr>
	<tr>
		<td>Overall</td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td>62369152=62.3 million</td>
		<td>1135906176=1.1 billion</td>
	</tr>
	<tr>
		<td>Conv vs. FC</td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td>Conv: 3.7 million (6%), FC: 58.6 million (94%)</td>
		<td>Conv: 1.08 billion (95%), FC: 58.6 million (5%)</td>
	</tr>
</tbody>
</table>
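
The parameter and FLOP columns follow a simple pattern that can be re-derived in a few lines. The sketch below is a minimal check, assuming the table's conventions: a convolution has `(k * k * C_in + 1) * C_out` parameters, its FLOPs are that count times the number of output positions, and fully connected layers are counted without biases. The Conv1 entry (11 x 11 kernel, 3 input channels, 55 x 55 output) is assumed from the standard AlexNet definition, and all helper names are illustrative.

```python
# Conv layers as (name, kernel_size, in_channels, out_channels, output_side).
CONV_LAYERS = [
    ("Conv1", 11, 3, 96, 55),   # assumed: standard AlexNet first layer
    ("Conv2", 5, 96, 256, 27),
    ("Conv3", 3, 256, 384, 13),
    ("Conv4", 3, 384, 384, 13),
    ("Conv5", 3, 384, 256, 13),
]
# Fully connected layers as (name, inputs, outputs); biases omitted, as in the table.
FC_LAYERS = [
    ("FC6", 256 * 6 * 6, 4096),
    ("FC7", 4096, 4096),
    ("FC8", 4096, 1000),
]


def conv_params(kernel, c_in, c_out):
    """Weights plus one bias per filter."""
    return (kernel * kernel * c_in + 1) * c_out


def conv_flops(kernel, c_in, c_out, out_side):
    """Table convention: parameter count times number of output positions."""
    return conv_params(kernel, c_in, c_out) * out_side * out_side


conv_param_total = sum(conv_params(k, ci, co) for _, k, ci, co, _ in CONV_LAYERS)
conv_flop_total = sum(conv_flops(k, ci, co, s) for _, k, ci, co, s in CONV_LAYERS)
fc_total = sum(n_in * n_out for _, n_in, n_out in FC_LAYERS)  # params == FLOPs here

total_params = conv_param_total + fc_total
total_flops = conv_flop_total + fc_total

print(f"Conv params:  {conv_param_total:,} ({conv_param_total / total_params:.0%})")
print(f"FC params:    {fc_total:,} ({fc_total / total_params:.0%})")
print(f"Total params: {total_params:,}")
print(f"Total FLOPs:  {total_flops:,}")
```

The totals match the "Overall" row: the convolutional layers account for most of the computation, while the fully connected layers hold most of the parameters.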


#### Why does AlexNet achieve better results?

1. **The ReLU activation function is used.**

ReLU function: f(x) = max(0, x)

![alex1](https://raw.githubusercontent.com/entbappy/Branching-tutorial/master/alexnet/alex512.png)

ReLU-based deep convolutional networks train several times faster than tanh- and sigmoid-based networks. The following figure compares the number of iterations a four-layer convolutional network needs to reach a 25% training error on CIFAR-10 with tanh and with ReLU:

![alex1](https://raw.githubusercontent.com/entbappy/Branching-tutorial/master/alexnet/alex612.png)
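
One common explanation for the faster training is that ReLU does not saturate for positive inputs, whereas the gradients of tanh and sigmoid shrink toward zero as the input grows. The following sketch, using only Python's standard library, compares the three derivatives at a few points. The function names are illustrative.

```python
import math


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # Derivative is 1 for positive inputs and 0 otherwise (0 is chosen at x == 0).
    return 1.0 if x > 0 else 0.0


def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (0.5, 2.0, 5.0):
    print(f"x={x:>4}: relu'={relu_grad(x):.4f}  "
          f"tanh'={tanh_grad(x):.4f}  sigmoid'={sigmoid_grad(x):.4f}")
```

For large positive inputs the tanh and sigmoid derivatives become very small, while the ReLU derivative stays at 1, so the error signal passed back through a ReLU unit is not attenuated.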

2. **Normalization (Local Response Normalization)**

Unlike tanh and sigmoid, ReLU f(x) = max(0, x) has no upper bound on its output, so a normalization step is usually applied after ReLU. Local Response Normalization (LRN) is one such method. It is inspired by a concept from neuroscience called "lateral inhibition", which describes how an active neuron suppresses the neurons around it.

![alex1](https://raw.githubusercontent.com/entbappy/Branching-tutorial/master/alexnet/alex3.jpg)
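
A minimal sketch of Local Response Normalization across channels at a single spatial position follows. It assumes the formulation and default constants given in the AlexNet paper (k = 2, n = 5, alpha = 1e-4, beta = 0.75); the function name and the plain-list representation are illustrative choices, not part of any library.

```python
def local_response_norm(channels, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Normalize each channel activation by the squared activity of its neighbors.

    `channels` holds the activations of all feature maps at one (x, y) position.
    Each value is divided by (k + alpha * sum of squares over up to n adjacent
    channels) ** beta.
    """
    total = len(channels)
    half = n // 2
    normalized = []
    for i, value in enumerate(channels):
        lo = max(0, i - half)
        hi = min(total - 1, i + half)
        square_sum = sum(channels[j] ** 2 for j in range(lo, hi + 1))
        normalized.append(value / (k + alpha * square_sum) ** beta)
    return normalized


# One very active channel (index 2) surrounded by weaker ones.
activations = [1.0, 2.0, 50.0, 2.0, 1.0, 2.0, 1.0]
for before, after in zip(activations, local_response_norm(activations)):
    print(f"{before:6.1f} -> {after:.4f}")
```

Channels that sit within the window of the strongly active one are divided by a larger term than channels farther away, which is the "inhibition" effect described above.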


3. **Dropout**

Dropout is a widely used technique that effectively reduces overfitting in neural networks. Linear models usually guard against overfitting by adding a regularization term, whereas Dropout works by modifying the structure of the network itself. For a given layer, some neurons are randomly removed with a defined probability while the input-layer and output-layer neurons are kept unchanged, and the parameters are then updated with the network's usual learning method. In the next iteration a new random set of neurons is removed, and this repeats until training ends.


![alex1](https://raw.githubusercontent.com/entbappy/Branching-tutorial/master/alexnet/alex4.jpg)
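
The random removal described above can be sketched in a few lines. Note the assumption: this is "inverted" dropout, the variant most current frameworks use, where surviving activations are scaled up during training so nothing needs to change at inference. The original AlexNet scheme instead scaled outputs at test time. The function name and signature are illustrative.

```python
import random


def dropout(activations, rate=0.5, training=True, rng=random):
    """Inverted dropout on a list of activations.

    During training each unit is zeroed with probability `rate` and the
    survivors are scaled by 1 / (1 - rate). At inference the input is
    returned unchanged.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
layer_output = [0.2, 1.5, 0.7, 3.0, 0.9, 1.1]
print(dropout(layer_output, rate=0.5, rng=rng))  # a fresh random mask on each call
print(dropout(layer_output, rate=0.5, rng=rng))
print(dropout(layer_output, training=False))     # unchanged at inference
```

Because a new mask is drawn on every call, each training iteration effectively updates a different thinned sub-network.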


4. **Data Augmentation**



**In deep learning, when the amount of data is not large enough, there are generally 4 solutions:**

> Data augmentation: artificially increase the size of the training set by creating a batch of "new" data from existing data through translation, flipping, or added noise.

> Regularization: a relatively small amount of data causes the model to overfit, so the training error is small while the test error is large. Adding a regularization term to the loss function suppresses overfitting. The drawback is that it introduces a hyperparameter that must be tuned manually.

> Dropout: also a regularization method, but unlike the above it works by randomly setting the outputs of some neurons to zero.

> Unsupervised pre-training: use an auto-encoder or a convolutional RBM to pre-train the network layer by layer without labels, then add a classification layer and fine-tune it with supervision.
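
As a rough illustration of the first option, the sketch below applies flipping, translation (via a random crop), and noise to a tiny image stored as a list of rows. It is a toy example under those assumptions; the helper names are hypothetical, and a real pipeline would normally use a framework's image utilities.

```python
import random


def horizontal_flip(image):
    """Mirror each row of a 2-D image (list of rows)."""
    return [row[::-1] for row in image]


def random_crop(image, crop_h, crop_w, rng=random):
    """Cut a crop_h x crop_w patch at a random position (a simple form of translation)."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def add_noise(image, sigma=0.05, rng=random):
    """Add zero-mean Gaussian noise to every pixel."""
    return [[pixel + rng.gauss(0.0, sigma) for pixel in row] for row in image]


image = [[0.0, 0.1, 0.2, 0.3],
         [0.4, 0.5, 0.6, 0.7],
         [0.8, 0.9, 1.0, 1.1],
         [1.2, 1.3, 1.4, 1.5]]

rng = random.Random(42)
augmented = [
    horizontal_flip(image),
    random_crop(image, 3, 3, rng),
    add_noise(image, rng=rng),
]
for sample in augmented:
    print(sample)
```

Each transformed copy keeps the original label, so one labelled image yields several training examples.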





## Q2. Explain the architectural innovations introduced in AlexNet that contributed to its breakthrough performance

AlexNet, a deep convolutional neural network architecture, achieved a breakthrough in image classification by winning the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012. Its success can be attributed to several architectural innovations that significantly improved its performance compared to previous models. The key architectural innovations introduced in AlexNet are:

1. **Deep Architecture**:
   - Prior to AlexNet, deep neural networks were not widely used due to concerns about vanishing gradients and computational cost. AlexNet showed that a relatively deep architecture of eight learned layers (five convolutional and three fully connected), with roughly 60 million parameters, could be trained effectively. This depth allowed the model to learn complex hierarchical features.

2. **Convolutional Layers**:
   - AlexNet employed convolutional layers for feature extraction. These layers learned a wide range of low-level and high-level features, which helped improve the network's understanding of image patterns.

3. **Rectified Linear Unit (ReLU) Activation Function**:
   - Instead of traditional activation functions like sigmoid or tanh, AlexNet used ReLU activation functions. ReLU is computationally efficient and helps mitigate the vanishing gradient problem. It allowed the model to train faster and achieve better results.

4. **Local Response Normalization**:
   - AlexNet used Local Response Normalization (LRN) layers. LRN normalizes a neuron's response by the activity of its neighboring neurons (adjacent feature maps at the same spatial position), promoting competition between them and enhancing contrast. This helped the model in distinguishing fine details in images.

5. **Overlapping Max-Pooling**:
   - In the pooling layers, AlexNet used overlapping max-pooling: each 3x3 window moves with a stride of 2, so neighboring windows overlap. This captures more spatial information and improves the model's translation invariance.

6. **Data Augmentation and Dropout**:
   - Data augmentation techniques, such as random cropping and horizontal flipping, were applied during training to increase the diversity of the training data. Additionally, dropout was introduced in the fully connected layers to reduce overfitting.

7. **Large-Scale Training Data**:
   - AlexNet was trained on a massive dataset, including millions of images from the ImageNet dataset. This large-scale training data helped the model generalize better and recognize a wide range of objects and categories.

8. **Parallelization with GPUs**:
   - The training of AlexNet was parallelized using multiple GPUs. This allowed the model to process a vast amount of data and optimize computation, significantly reducing training time.

9. **Softmax Output Layer**:
   - The final layer of AlexNet used a softmax activation function to predict class probabilities. This made it suitable for multi-class classification tasks.

These architectural innovations collectively contributed to AlexNet's breakthrough performance in image classification tasks, leading to significant improvements in accuracy and setting the stage for the deep learning revolution in computer vision. AlexNet served as a blueprint for subsequent deep CNN architectures and demonstrated the potential of deep learning in real-world applications.
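
The effect of overlapping pooling on feature-map size can be checked with the usual output-size formula for pooling without padding, `(input - window) // stride + 1`. The sketch below applies it to AlexNet's 3x3, stride-2 pooling and contrasts overlapping and non-overlapping windows on a short 1-D sequence. The helper names are illustrative, and the 55 x 55 input of the first pooling layer is assumed from the standard AlexNet definition.

```python
def pool_output_size(input_size, window, stride):
    """Output length of a pooling window applied without padding."""
    return (input_size - window) // stride + 1


def max_pool_1d(values, window, stride):
    """Max pooling over a 1-D sequence; windows overlap whenever stride < window."""
    n_out = pool_output_size(len(values), window, stride)
    return [max(values[i * stride:i * stride + window]) for i in range(n_out)]


# Spatial side length before each of AlexNet's three pooling layers.
for side in (55, 27, 13):
    print(side, "->", pool_output_size(side, window=3, stride=2))

signal = [1, 3, 2, 9, 4, 5, 7]
print(max_pool_1d(signal, window=3, stride=2))  # overlapping: [3, 9, 7]
print(max_pool_1d(signal, window=2, stride=2))  # non-overlapping: [3, 9, 5]
```

With a stride smaller than the window, every other input position is seen by two windows, which is what "overlapping" means here.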

## Q3. Discuss the role of convolutional layers, pooling layers and fully connected layers in AlexNet

In the AlexNet architecture, convolutional layers, pooling layers, and fully connected layers play distinct but complementary roles in processing and extracting features from the input data.

1. **Convolutional Layers**:

   - **Feature Extraction**: The primary role of convolutional layers is feature extraction. These layers use learnable filters (kernels) that slide over the input image to capture various patterns and features, such as edges, textures, and more complex shapes. In the context of AlexNet, these layers serve as the feature extractors for different levels of abstraction.

   - **Hierarchical Feature Learning**: AlexNet's deep convolutional layers learn features at multiple levels of abstraction. Lower layers capture low-level features like edges and corners, while higher layers capture more complex features like object parts and textures. This hierarchical feature learning enables the model to understand and recognize complex structures in images.

   - **Parameter Sharing**: Convolutional layers share weights and biases across different spatial positions in the input. This weight sharing reduces the number of parameters, making the model more efficient and helping to limit overfitting.

2. **Pooling Layers**:

   - **Down-Sampling**: Pooling layers are responsible for down-sampling the feature maps generated by the convolutional layers. They reduce the spatial dimensions of the feature maps while retaining the most essential information. In AlexNet, max-pooling layers are used, where the maximum value within a local region is selected.

   - **Translation Invariance**: Pooling layers introduce translation invariance, making the network robust to small shifts or translations in the input. By selecting the most significant value within a local region, they preserve the presence of features rather than their precise location.

   - **Dimension Reduction**: Pooling layers have no learnable parameters of their own, but by shrinking the feature maps they reduce the computation in later layers and the number of weights in the first fully connected layer. Smaller feature maps also lower the risk of overfitting and allow for faster training.

3. **Fully Connected Layers**:

   - **Classification**: The fully connected layers at the end of the network serve as the classifier. They take the high-level features extracted by the convolutional and pooling layers and perform classification based on these features.

   - **Multi-Class Classification**: In AlexNet, the final fully connected layer uses a softmax activation function to produce class probabilities. This makes the model suitable for multi-class classification tasks, where it can predict the probabilities of an input belonging to different categories.

   - **High-Level Abstraction**: The fully connected layers capture high-level semantic information about the input, making them well-suited for making decisions about the content of the image.

In summary, convolutional layers extract features hierarchically from the input image, pooling layers down-sample the feature maps and introduce translation invariance, while fully connected layers perform high-level classification based on the features extracted by the previous layers. These three types of layers work together to enable AlexNet to recognize and classify objects and patterns in images, and the combination of their roles contributes to the model's success in image classification tasks.
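
To make the three roles concrete, here is a toy forward pass in plain Python: one convolution filter, ReLU, max pooling, and a small dense layer with softmax. It is a sketch on a 6 x 6 single-channel image, not AlexNet itself; the kernel, weights, and biases are hand-picked hypothetical values rather than learned ones.

```python
import math


def conv2d_valid(image, kernel):
    """Slide one kernel over a 2-D image (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def relu2d(feature_map):
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool2d(feature_map, size=2, stride=2):
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [[max(feature_map[r * stride + i][c * stride + j]
                 for i in range(size) for j in range(size))
             for c in range(out_w)]
            for r in range(out_h)]


def dense(features, weights, biases):
    """One output per weight row: dot product with the features plus a bias."""
    return [sum(f * w for f, w in zip(features, row)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Image with a vertical edge: dark left half, bright right half.
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
vertical_edge = [[-1, 0, 1]] * 3            # the same 9 weights are reused at every position

feature_map = relu2d(conv2d_valid(image, vertical_edge))   # feature extraction
pooled = max_pool2d(feature_map)                           # down-sampling
features = [v for row in pooled for v in row]              # flatten

weights = [[0.5, 0.5, 0.5, 0.5], [0.0, 0.0, 0.0, 0.0]]     # hypothetical "edge" / "no edge" classes
biases = [0.0, 1.0]
probs = softmax(dense(features, weights, biases))          # classification
print(pooled, probs)
```

The convolution responds only where the edge is, pooling keeps that response while halving each spatial dimension, and the dense layer with softmax turns the pooled features into class probabilities.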

## Q4. Implement AlexNet using a deep learning framework of your choice and evaluate its performance

In [5]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Dropout, Conv2D, MaxPooling2D, Activation
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.datasets import cifar10

In [6]:
(X_train,y_train),(X_test,y_test) = cifar10.load_data()

# Normalize pixel values to be between 0 and 1
X_train = X_train.astype('float32') / 255.0
X_test =  X_test.astype('float32') / 255.0

# Convert integer class labels to one-hot vectors
y_train = to_categorical(y_train,10)
y_test = to_categorical(y_test,10)

print(X_train.shape)

(50000, 32, 32, 3)


In [7]:
modell = Sequential()

# Convolutional Layer 1 (3x3 kernel, stride 1: scaled down from AlexNet's 11x11 kernel to suit 32x32 CIFAR-10 images)
modell.add(Conv2D(96, (3, 3), strides=(1, 1), padding='same', input_shape=(32, 32, 3)))
modell.add(Activation('relu'))  # ReLU activation
# Overlapping max pooling: 3x3 window, stride 2
modell.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))

# Convolutional Layer 2
modell.add(Conv2D(256, (5, 5), strides=(1, 1), padding='same'))
modell.add(Activation('relu'))
# Pooling
modell.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))

# Convolutional Layer 3
modell.add(Conv2D(384, (3, 3), strides=(1, 1), padding='same'))
modell.add(Activation('relu'))

# Convolutional Layer 4
modell.add(Conv2D(384, (3, 3), strides=(1, 1), padding='same'))
modell.add(Activation('relu'))

# Convolutional Layer 5
modell.add(Conv2D(256, (3, 3), strides=(1, 1), padding='same'))
modell.add(Activation('relu'))
# Pooling
modell.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))

# Fully Connected Layers
# Passing it to a dense layer
modell.add(Flatten())

#Dense Layer 1
modell.add(Dense(4096))
modell.add(Activation('relu'))
modell.add(Dropout(0.5))

#Dense Layer 2
modell.add(Dense(4096))
modell.add(Activation('relu'))
modell.add(Dropout(0.5))

#Output Layer
modell.add(Dense(10))
modell.add(Activation('softmax'))

modell.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_2 (Conv2D)           (None, 32, 32, 96)        2688      
                                                                 
 activation (Activation)     (None, 32, 32, 96)        0         
                                                                 
 max_pooling2d (MaxPooling2  (None, 15, 15, 96)        0         
 D)                                                              
                                                                 
 conv2d_3 (Conv2D)           (None, 15, 15, 256)       614656    
                                                                 
 activation_1 (Activation)   (None, 15, 15, 256)       0         
                                                                 
 max_pooling2d_1 (MaxPoolin  (None, 7, 7, 256)         0         
 g2D)                                                 
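
The shapes and parameter counts in the summary can be re-derived from the layer definitions. The sketch below does this in plain Python, assuming `padding='same'` with stride 1 keeps height and width, and that pooling uses no padding (the Keras default). Rows beyond the truncated printout are computed from the code above, not copied from the notebook output, and the helper names are illustrative.

```python
def same_conv(shape, filters, kernel):
    """'same' padding with stride 1 keeps height and width; returns (new_shape, params)."""
    h, w, c = shape
    return (h, w, filters), (kernel * kernel * c + 1) * filters


def valid_pool(shape, window=3, stride=2):
    """Pooling without padding shrinks each spatial side."""
    h, w, c = shape
    return ((h - window) // stride + 1, (w - window) // stride + 1, c)


def dense(n_in, n_out):
    """Returns (n_out, params), counting one bias per output unit."""
    return n_out, (n_in + 1) * n_out


shape = (32, 32, 3)
trace = []  # (layer name, output shape, parameter count)
for name, filters, kernel, pooled in [
    ("conv1", 96, 3, True),
    ("conv2", 256, 5, True),
    ("conv3", 384, 3, False),
    ("conv4", 384, 3, False),
    ("conv5", 256, 3, True),
]:
    shape, params = same_conv(shape, filters, kernel)
    trace.append((name, shape, params))
    if pooled:
        shape = valid_pool(shape)
        trace.append((name + "_pool", shape, 0))

units = shape[0] * shape[1] * shape[2]  # Flatten
for name, n_out in [("dense1", 4096), ("dense2", 4096), ("output", 10)]:
    units_out, params = dense(units, n_out)
    trace.append((name, (units_out,), params))
    units = units_out

for name, out_shape, params in trace:
    print(f"{name:12s} {str(out_shape):16s} {params:>12,}")
print("total params:", f"{sum(p for _, _, p in trace):,}")
```

The first four rows reproduce the values visible in the printed summary (2688 and 614656 parameters, 15 x 15 and 7 x 7 feature maps after pooling).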

In [8]:
# Compile the model (categorical cross-entropy is a loss, so it is passed by its loss name)
modell.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.Adam(), metrics=['accuracy'])

#Fitting the model
historyy = modell.fit(X_train,y_train,
                    epochs=3,
                    batch_size=64,
                    verbose=2,
                    validation_split=0.2,
                    )


Epoch 1/3
625/625 - 22s - loss: 1.7261 - accuracy: 0.3443 - val_loss: 1.3602 - val_accuracy: 0.4990 - 22s/epoch - 36ms/step
Epoch 2/3
625/625 - 16s - loss: 1.2856 - accuracy: 0.5336 - val_loss: 1.1863 - val_accuracy: 0.5756 - 16s/epoch - 25ms/step
Epoch 3/3
625/625 - 16s - loss: 1.1126 - accuracy: 0.6040 - val_loss: 1.0254 - val_accuracy: 0.6345 - 16s/epoch - 25ms/step
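
The `625/625` in each log line is the number of batches per epoch. A quick check of that arithmetic, assuming the 20% validation split is taken off the 50,000 training images before batching:

```python
import math

n_train = 50_000          # CIFAR-10 training images, as printed by X_train.shape
validation_split = 0.2
batch_size = 64

n_val = round(n_train * validation_split)        # images held out for validation
n_fit = n_train - n_val                          # images actually used for fitting
steps_per_epoch = math.ceil(n_fit / batch_size)  # batches per epoch

print(n_fit, n_val, steps_per_epoch)
```

So each epoch fits on 40,000 images in 625 batches, and the `val_loss` and `val_accuracy` figures are measured on the remaining 10,000.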


In [9]:
#Evaluating the model
scoree = modell.evaluate(X_test, y_test)

print('Test Loss:', scoree[0])
print('Test accuracy:', scoree[1])

Test Loss: 1.0357236862182617
Test accuracy: 0.633899986743927
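
To interpret the two reported numbers, the sketch below computes categorical cross-entropy and accuracy by hand on three made-up predictions. It is a simplified stand-in for what the framework reports, written in plain Python; the function names and sample values are hypothetical.

```python
import math


def categorical_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -sum(t * log(p)) over samples; probabilities are clipped for stability."""
    total = 0.0
    for t_row, p_row in zip(y_true, y_prob):
        total -= sum(t * math.log(min(max(p, eps), 1.0 - eps))
                     for t, p in zip(t_row, p_row))
    return total / len(y_true)


def categorical_accuracy(y_true, y_prob):
    """Fraction of samples whose highest-probability class matches the one-hot label."""
    hits = sum(t.index(max(t)) == p.index(max(p)) for t, p in zip(y_true, y_prob))
    return hits / len(y_true)


y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
y_prob = [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.5, 0.3, 0.2]]

print("loss:", categorical_cross_entropy(y_true, y_prob))
print("accuracy:", categorical_accuracy(y_true, y_prob))
print("uniform-guess loss for 10 classes:", math.log(10))
```

As a reference point, a model that outputs uniform probabilities over the 10 CIFAR-10 classes would score a loss of ln(10), about 2.30, and roughly 10% accuracy, so a test loss near 1.04 and accuracy near 63% after three epochs is well above chance.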
