# Intuitively Understanding Convolutions

### Convolutional Neural Network

Convolutional networks are a class of neural networks that take an image as input, apply combinations of weights and biases to it, extract features, and output a result. They use small kernels to reduce the dimensions of the input image, which makes feature extraction easier than with a generic dense (fully connected) neural network. Convolutional networks trace their foundation to convolution operations on matrices.
Convnets were inspired by biological processes in that the connectivity pattern between neurons resembles the organization of the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field.

![](./fig/cnn.png)

The name "***Convolutional neural network***" indicates that the network employs a mathematical operation called ***Convolution***. Convolution is a specialized kind of linear operation. Convnets are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.

### **Kernels**  
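A kernel is a small 2D matrix whose values depend on the operation to be performed, such as blurring or edge detection. The kernel is laid over a patch of the input image, the overlapping values are multiplied element by element, and the products are summed to give one output value. The resulting output usually has smaller dimensions than the input and is therefore easier to work with.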
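As a minimal sketch of a single kernel step, the NumPy snippet below computes one output value. The 4x4 image and the 3x3 kernel values are hypothetical, picked only to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical 4x4 grayscale image (values chosen for illustration only).
image = np.array([
    [1, 2, 3, 0],
    [4, 5, 6, 1],
    [7, 8, 9, 2],
    [0, 1, 2, 3],
])

# Hypothetical 3x3 kernel that compares the left and right columns of a patch.
kernel = np.array([
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
])

# The 3x3 patch of the image that sits under the kernel at the top-left corner.
patch = image[0:3, 0:3]

# Multiply element by element, then sum the products into one output value.
value = int(np.sum(patch * kernel))
print(value)  # -6
```

Repeating this step at every kernel position fills in the whole output matrix.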

![](./fig/kertypes.png)

The shape of a kernel depends on the input shape of the image and the architecture of the network. Kernels are usually square matrices, i.e. of size (MxM). A kernel moves across the image from left to right and from top to bottom.

### **Stride**  
Stride defines the step by which the kernel moves. For example, a stride of 1 slides the kernel by one row/column at a time, and a stride of 2 moves it by two rows/columns.
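The effect of stride on the output size can be sketched with a small loop-based convolution. The function name `conv2d` and the test values are assumptions made for this illustration. It uses no padding and, like most deep learning libraries, does not flip the kernel.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide `kernel` over a 2D `image` with the given stride (no padding)."""
    kh, kw = kernel.shape
    # Number of positions the kernel can take along each axis.
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):          # top to bottom
        for j in range(out_w):      # left to right
            r, c = i * stride, j * stride
            patch = image[r:r + kh, c:c + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(25).reshape(5, 5)   # hypothetical 5x5 image
kernel = np.ones((3, 3))              # hypothetical 3x3 kernel

print(conv2d(image, kernel, stride=1).shape)  # (3, 3)
print(conv2d(image, kernel, stride=2).shape)  # (2, 2)
```

With a 5x5 input and a 3x3 kernel, stride 1 gives three positions per axis and stride 2 gives two, so a larger stride downsamples the output more aggressively.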

Below we see the 3x3 kernels of a filter striding across a 5x5 image with stride 1.

![](./fig/rgb.gif)

The output from each kernel is added to give one final output corresponding to that filter.


![](./fig/rgb2.gif)

For input images with several channels, such as RGB (3 channels), a *filter* is applied.


### **Filters**  

*Each filter actually happens to be a collection of kernels, with there being one kernel for every single input channel to the layer, and each kernel being unique.*

Each filter in a convolution layer produces one and only one output channel, as follows:
Each kernel of the filter “slides” over its respective input channel, producing a processed version of it. Some kernels may have stronger weights than others, giving more emphasis to certain input channels (e.g. a filter may have a red-channel kernel with stronger weights than the others and hence respond more to differences in red-channel features). The per-channel results are then summed to form the single output channel.

![](./fig/convo.gif)

In the above image we can see we have an input image of 3 channels and a filter with 3 kernels which output a single channel.
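The way a filter collapses several input channels into one output channel can be sketched as follows. The helper names `conv2d` and `apply_filter`, the channels-first layout `(C, H, W)`, and the random values are assumptions made for this example.

```python
import numpy as np

def conv2d(image, kernel):
    """Single-channel convolution with stride 1 and no padding."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def apply_filter(image, filt):
    """image: (C, H, W), filt: (C, kH, kW). Returns one 2D output channel."""
    # One kernel per input channel, each sliding over its own channel.
    per_channel = [conv2d(image[c], filt[c]) for c in range(image.shape[0])]
    # The per-channel outputs are summed into a single channel.
    return np.sum(per_channel, axis=0)

rng = np.random.default_rng(0)
image = rng.random((3, 5, 5))   # hypothetical RGB image, 3 channels of 5x5
filt = rng.random((3, 3, 3))    # one filter made of three 3x3 kernels

out = apply_filter(image, filt)
print(out.shape)  # (3, 3)
```

A layer with several filters would repeat `apply_filter` once per filter and stack the results, giving one output channel per filter.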

### **Pooling**

The next step in a convolutional network is downsampling, which is done by a **pooling layer**. Below you can see a 3x3 max pooling layer.

![](./fig/pooling_layer.gif)

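There are different types of pooling, such as max pooling and average pooling. Max pooling keeps the largest value in each window, while average pooling takes the mean of the values in the window.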

![](./fig/max_pool_avg.png)
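A rough sketch of both pooling types on a small matrix is given below. The function name `pool2d`, the 2x2 window with stride 2, and the input values are assumptions chosen for illustration.

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Downsample a 2D array by taking the max or mean of each window."""
    if mode == "max":
        reduce = np.max
    elif mode == "avg":
        reduce = np.mean
    else:
        raise ValueError("mode must be 'max' or 'avg'")
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = reduce(x[r:r + size, c:c + size])
    return out

# Hypothetical 4x4 feature map.
x = np.array([
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 4, 3, 8],
])

max_pooled = pool2d(x, mode="max")  # [[6, 4], [7, 9]]
avg_pooled = pool2d(x, mode="avg")  # [[3.75, 2.25], [3.5, 5.0]]
print(max_pooled)
print(avg_pooled)
```

Both variants halve each spatial dimension here. They differ only in how the values inside each window are summarized.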

### **Padding**

Padding is used when we want the output to have the same size as the input. It adds zeros to the top, bottom, left and right of an image. "same" padding pads the input so that the output has the same size as the input (for a stride of 1), while "valid" means no padding.

![](./fig/same_padding.gif)
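The difference between the two modes can be sketched with `np.pad`. This example assumes a stride of 1 and an odd kernel size `k`, in which case padding `(k - 1) // 2` zeros on every side keeps the output the same size as the input. The 5x5 image is hypothetical.

```python
import numpy as np

image = np.arange(1, 26).reshape(5, 5)  # hypothetical 5x5 image
k = 3                                   # kernel size (odd), stride 1 assumed

# "valid": no padding, so the output shrinks to 5 - 3 + 1 = 3 per axis.
valid_size = image.shape[0] - k + 1

# "same": add a border of zeros so the kernel can be centered on every pixel.
p = (k - 1) // 2
padded = np.pad(image, pad_width=p, mode="constant", constant_values=0)
same_size = padded.shape[0] - k + 1

print(padded.shape)  # (7, 7)
print(valid_size)    # 3
print(same_size)     # 5
```

For larger strides or even kernel sizes, libraries may pad unevenly or produce a smaller output, so the formula above should be read as the simple case only.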