# Convolutional Neural Networks (or simply ConvNets)

ConvNets are sparsely connected artificial neural networks that use the convolution operation in at least one layer to learn patterns. They are mainly used for image processing tasks, or more generally for data with a grid-like structure.

## Limitations of MLPs

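- Unable to preserve the spatial (grid-like) structure of the input, since an image has to be flattened into a vector
- The number of parameters grows rapidly with the input size, because every input value is connected to every neuron in the next layer.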
- Computationally intensive
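
To make the parameter-count concern concrete, the sketch below counts the weights and biases of a single fully connected layer applied to a flattened image. The function name `dense_param_count`, the image sizes, and the hidden width are illustrative assumptions, not values taken from these notes.

```python
def dense_param_count(height, width, channels, hidden_units):
    """Weights + biases of one fully connected layer on a flattened image."""
    n_inputs = height * width * channels  # flattening discards the grid structure
    return n_inputs * hidden_units + hidden_units


# Illustrative sizes (assumed): a small and a medium RGB image, 1000 hidden units.
small = dense_param_count(32, 32, 3, 1000)
medium = dense_param_count(224, 224, 3, 1000)
print(small)
print(medium)
```

Even with a single hidden layer, the count is proportional to the number of input pixels, which is what makes plain MLPs expensive for images.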

<img src="images/MLP vs ConvNet.png" style="width:700px;height:300px;">
<caption><center> <u> <font color='purple'> **Figure 1** </u><font color='purple'>  : **MLP VS ConvNet**<br></center></caption>

## Inspiration
ConvNets were inspired by the breakthrough achieved by Hubel and Wiesel in the area of sensory processing.
<img src="Images/cat.png" style="width:500px;height:300px;">
<caption><center> <u> <font color='purple'> **Figure 2** </u><font color='purple'>

## The Convolutional Operation

- $\text{Input Volume and Filter} \rightarrow \text{Feature Map}$
- The operation is performed by superimposing a kernel tensor on a small region of the input tensor, multiplying the two elementwise, and summing the result.
- Equivalently, for a 2D image represented as a 2D matrix, it is a weighted sum over a neighborhood of pixels, with the kernel supplying the weights.
- The purpose of the convolution operation is to extract the information present in that region (the receptive field), for example to detect horizontal or vertical edges.

- A kernel, also known as a filter, is a multidimensional array of parameters learned by the model. Each filter is responsible for learning one type of visual feature, such as an edge of a particular orientation or a blob of color in the first layer, or a more complex feature later in the network.

- The filter extends through the entire depth of the input and is moved across the input tensor by sliding it along the spatial (width and height) axes.
- The receptive field always has the same size as the filter, so that the two can be multiplied elementwise.
- The output of this operation for a single filter is a 2D matrix. <br><br>
Mathematical formulation - <br><br>

$S(i,j) = (I*K)(i,j) = \sum_m \sum_n I(i+m,j+n)\,K(m,n)$<br><br>

Here $S$ is the output matrix, also known as the feature map, $I$ is the input tensor, and $K$ is the filter.
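
The double sum above can be written out directly in NumPy. This is a minimal sketch that assumes a 2D input, a stride of 1, and no padding; the function name `cross_correlate2d` and the toy image and kernel are illustrative choices, not part of the original notes.

```python
import numpy as np


def cross_correlate2d(image, kernel):
    """Compute S(i, j) = sum_m sum_n I(i + m, j + n) * K(m, n), stride 1, no padding."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]   # receptive field at (i, j)
            out[i, j] = np.sum(patch * kernel)  # elementwise product, then sum
    return out


# Toy input (assumed): dark left half, bright right half, so one vertical edge.
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
kernel = np.array([[-1.0, 1.0]])  # responds where intensity increases left to right
feature_map = cross_correlate2d(image, kernel)
print(feature_map)
```

The feature map is nonzero only in the column where the intensity changes, which is the sense in which this filter "detects" a vertical edge.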


<img src="images/Convolution_schematic.gif" style="width:500px;height:300px;">
<caption><center> <u> <font color='purple'> **Figure 3** </u><font color='purple'>  : **Convolution operation**<br> with a filter of 2x2 and a stride of 1 (stride = amount you move the window each time you slide) </center></caption>
    
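<b>Note</b> - The operation called convolution in machine learning is, strictly speaking, the cross-correlation function from mathematics: the kernel is applied without being flipped. Mathematical convolution flips the kernel along both axes first. Because the kernel weights are learned, the distinction rarely matters in practice.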
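
The relationship between the two operations can be checked numerically. The sketch below assumes 2D inputs, a stride of 1, and no padding; `cross_correlate2d` and `convolve2d` are hypothetical helper names defined here, not library functions.

```python
import numpy as np


def cross_correlate2d(image, kernel):
    """Slide the kernel over the image without flipping it (stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def convolve2d(image, kernel):
    """Mathematical convolution: flip the kernel along both axes, then cross-correlate."""
    return cross_correlate2d(image, kernel[::-1, ::-1])


image = np.arange(16, dtype=float).reshape(4, 4)
asymmetric = np.array([[1.0, 2.0], [3.0, 4.0]])
symmetric = np.array([[1.0, 2.0], [2.0, 1.0]])  # unchanged by a 180-degree flip

# The two operations disagree for an asymmetric kernel ...
differs = not np.allclose(cross_correlate2d(image, asymmetric),
                          convolve2d(image, asymmetric))
# ... and agree when flipping leaves the kernel unchanged.
agrees = np.allclose(cross_correlate2d(image, symmetric),
                     convolve2d(image, symmetric))
print(differs, agrees)
```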
    

<img src="Images/filters.png" style="width:700px;height:300px;">
<caption><center> <u> <font color='purple'> **Visualizing filters from lower Conv Layers** </u><font color='purple'>


<img src="Images/visuals.png" style="width:700px;height:300px;">
<caption><center> <u> <font color='purple'> **Visualizing features and filters** </u><font color='purple'>

## The Convolution Layer
- It is the core building block of a CNN.

### Sparse Interactions

- In a CNN, each neuron is connected to only a local region of the input volume.
- The spatial extent of this connectivity, called the receptive field, is a hyperparameter that has to be chosen.
- This results in fewer parameters than a fully connected layer.

### Parameter Sharing

- The number of parameters in the Conv layer is drastically reduced by making a simple assumption:<br>
<b>If detecting a feature is useful at one spatial location in the image, then it should also be useful at other locations, because the same feature can appear anywhere in an image.</b>  <br>
- Hence, in a Conv layer the same filter is applied at every position of the input image (sometimes except the border pixels) to compute the output, or feature map.
- Multiple filters are used to learn different feature maps. These feature maps are then stacked together across the depth to form the input volume of the next layer.
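
The points above can be put together in a small forward pass. This is a minimal sketch assuming a stride of 1, no padding, and an `(H, W, D)` memory layout; the name `conv_layer_forward` and all sizes are illustrative assumptions.

```python
import numpy as np


def conv_layer_forward(volume, filters, biases):
    """Stride-1, no-padding convolution layer.

    volume:  (H, W, D) input volume
    filters: (K, F, F, D), one F x F x D filter per output feature map
    biases:  (K,)
    returns: (H - F + 1, W - F + 1, K) stacked feature maps
    """
    H, W, D = volume.shape
    K, F = filters.shape[0], filters.shape[1]
    out = np.zeros((H - F + 1, W - F + 1, K))
    for k in range(K):                       # one feature map per filter
        for i in range(H - F + 1):
            for j in range(W - F + 1):
                patch = volume[i:i + F, j:j + F, :]  # spans the full depth
                # The same weights filters[k] are reused at every (i, j).
                out[i, j, k] = np.sum(patch * filters[k]) + biases[k]
    return out


rng = np.random.default_rng(0)
volume = rng.normal(size=(8, 8, 3))      # assumed toy input
filters = rng.normal(size=(4, 3, 3, 3))  # 4 filters of size 3 x 3 x 3
biases = np.zeros(4)

out = conv_layer_forward(volume, filters, biases)
conv_params = filters.size + biases.size
# A fully connected layer mapping the same input to the same output size:
dense_params = volume.size * out.size + out.size
print(out.shape, conv_params, dense_params)
```

The comparison at the end illustrates how sparse interactions and parameter sharing together shrink the parameter count relative to a dense layer with the same input and output sizes.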

### Equivariant representations

- Parameter sharing causes the layer to acquire a property known as equivariance to translation.
- According to this property, if we move an object in the input, its representation moves by the same amount in the output. E.g., if an edge shifts a few pixels to the right in the image, the corresponding response in the feature map shifts to the right as well.
- Convolutions are not naturally equivariant to some other kinds of transformations, such as scaling or rotation. Other mechanisms, such as pooling over the outputs of several filters, are needed to handle them.
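
Translation equivariance can be checked on a toy example. The sketch below assumes a single bright pixel far enough from the border that shifting it by one column does not push any response off the edge; the helper name and the sizes are illustrative.

```python
import numpy as np


def cross_correlate2d(image, kernel):
    """Stride-1, no-padding cross-correlation of a 2D image with a 2D kernel."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


kernel = np.arange(9, dtype=float).reshape(3, 3)

image = np.zeros((6, 6))
image[2, 2] = 1.0                        # the "object": one bright pixel
shifted_image = np.roll(image, 1, axis=1)  # move it one pixel to the right

response = cross_correlate2d(image, kernel)
shifted_response = cross_correlate2d(shifted_image, kernel)

# Equivariance: shifting the input shifts the feature map by the same amount.
equivariant = np.array_equal(shifted_response, np.roll(response, 1, axis=1))
# It is not invariance: the feature map itself does change.
invariant = np.array_equal(shifted_response, response)
print(equivariant, invariant)
```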


<img src="Images/activations.png" style="width:700px;height:300px;">
<caption><center> <u> <font color='purple'> **Figure 4** </u><font color='purple'>  : **Visualizing Activations and Feature Maps**<br> </center></caption>

In summary, a Conv layer:
- Accepts a volume of size $W_1 \times H_1 \times D_1$
- Requires four hyperparameters:
1. the number of filters $K$,
2. their spatial extent $F$,
3. the stride $S$,
4. the amount of zero padding $P$.
- Produces a volume of size $W_2 \times H_2 \times D_2$ where:
1. $W_2=\frac{W_1-F+2P}{S}+1$
2. $H_2=\frac{H_1-F+2P}{S}+1$ (i.e. width and height are computed in the same way, by symmetry)
3. $D_2=K$
- With parameter sharing, it introduces $F \cdot F \cdot D_1$ weights per filter, for a total of $(F \cdot F \cdot D_1) \cdot K$ weights and $K$ biases.
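
The size and parameter formulas above translate into a few lines of arithmetic. This is a sketch with hypothetical function names; it raises an error when the stride does not divide the padded extent evenly, which is one possible design choice (many libraries round down instead). The example sizes are assumptions.

```python
def conv_output_shape(w1, h1, k, f, s, p):
    """Return (W2, H2, D2) for a conv layer with K filters of size F, stride S, padding P."""
    w_span = w1 - f + 2 * p
    h_span = h1 - f + 2 * p
    if w_span % s != 0 or h_span % s != 0:
        raise ValueError("stride does not fit the padded input evenly")
    return w_span // s + 1, h_span // s + 1, k


def conv_param_count(d1, k, f):
    """Return (weights, biases) of a conv layer with parameter sharing."""
    return f * f * d1 * k, k


# Assumed example: 32 x 32 x 3 input, 10 filters of size 5, stride 1, padding 2.
shape = conv_output_shape(w1=32, h1=32, k=10, f=5, s=1, p=2)
weights, biases = conv_param_count(d1=3, k=10, f=5)
print(shape, weights, biases)
```

With these settings the padding exactly compensates for the filter size, so the spatial size is preserved and only the depth changes.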


## Pooling Layer

- Replaces the output of a layer at a certain location with a summary statistic of the nearby outputs.<br>
- Reduces the spatial size of the output volume, which helps to:<br>
1. Reduce the number of parameters to be learned by the next layer.
2. Make the representation approximately invariant to small translations of the input.

- If we pool over features learned by different filters, i.e. pool across the depth, the network can learn which transformations to become invariant to.
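
The invariance to small translations can be illustrated with non-overlapping 2x2 max pooling, which keeps the largest value in each window. The helper names and the 4x4 toy feature map below are assumptions made for this sketch.

```python
import numpy as np


def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling; height and width must be even."""
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))


def single_activation(row, col, shape=(4, 4)):
    """Feature map that is zero everywhere except one active position."""
    fm = np.zeros(shape)
    fm[row, col] = 1.0
    return fm


original = max_pool_2x2(single_activation(0, 0))
shifted_inside = max_pool_2x2(single_activation(0, 1))  # stays in the same window
shifted_across = max_pool_2x2(single_activation(0, 2))  # lands in the next window

print(np.array_equal(original, shifted_inside))
print(np.array_equal(original, shifted_across))
```

A one-pixel shift that stays inside a pooling window leaves the pooled output unchanged, while a shift that crosses a window boundary does not, which is why the invariance is only approximate.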

### Two Well-Known Pooling Techniques:
#### Max Pooling 
In Max Pooling the maximum value from a selected neighborhood is used to represent the region.

#### Average Pooling
In Average Pooling, the average of all the values in the region is used as the summary statistic.
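
Both techniques can be sketched with one function that differs only in the summary statistic. The function name `pool2d`, its `mode` argument, and the toy feature map are illustrative assumptions, not a library API.

```python
import numpy as np


def pool2d(feature_map, size, stride, mode="max"):
    """Pool a 2D feature map with square windows; mode is 'max' or 'avg'."""
    if mode == "max":
        reduce_fn = np.max
    elif mode == "avg":
        reduce_fn = np.mean
    else:
        raise ValueError("mode must be 'max' or 'avg'")
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = reduce_fn(window)  # summary statistic of the region
    return out


feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 9, 1],
                        [3, 4, 2, 8]], dtype=float)
max_pooled = pool2d(feature_map, size=2, stride=2, mode="max")
avg_pooled = pool2d(feature_map, size=2, stride=2, mode="avg")
print(max_pooled)
print(avg_pooled)
```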

<img src="Images/Pooling.png" style="width:700px;height:300px;">

<b>Use case</b> - One major use of CNNs is learning representations that can be fed to another network for tasks such as generating image captions.

<img src="Images/lenet.jpeg" style="width:700px;height:300px;">
<caption><center> <u> <font color='purple'> **LeNet CNN Architecture** </u><font color='purple'>

## Limitations of CNNs
- Prone to overfitting
- Vanishing and exploding gradients as the depth increases
- Inputs of variable size cannot be fed to the network directly.
- Increasing the depth does not always result in improved accuracy.

## Popular Architectures
- LeNet
- AlexNet
- VGGNet
- GoogLeNet
- ResNet
- DenseNet
- ENet
- Network in Network
- MobileNet