# Building blocks of a CNN

Convolutional neural networks (CNNs) are the most prominent architectures for working
with images. They address the major limitations of deep neural networks that we saw in
the previous section. Besides image classification, they also help with object detection,
image segmentation, GANs, and many more tasks; essentially, wherever we use images.
Furthermore, there are different ways of constructing a CNN, and there are multiple
pre-trained models that leverage CNNs to perform various tasks. Starting with this
chapter, we will be using CNNs extensively.

In the upcoming subsections, we will understand the fundamental building blocks of
a CNN, which are as follows:

- Convolutions
- Filters
- Strides and padding
- Pooling

## Convolutions

A convolution is essentially an element-wise multiplication between two matrices,
followed by a sum of the products, repeated as the smaller matrix slides over the
larger one. As you saw in the previous chapter, matrix multiplication is a key
ingredient of training a neural network: we perform it when we calculate hidden layer
values (multiplying the input values by the weight values connecting the input to the
hidden layer) and again when we calculate output layer values.


To ensure we have a solid understanding of the convolution process, let's go through
the following example.

Let's assume we have two matrices we can use to perform convolution.

Here is Matrix A:

![imgs](./imgs/cnn0.png)


Here is Matrix B:


![imgs](./imgs/cnn1.png)

While performing the convolution operation, we slide Matrix B (the smaller
matrix) over Matrix A (the bigger matrix). At each position, we perform element-wise
multiplication between Matrix B and the patch of Matrix A that it overlaps, and sum
the products, as follows:

![imgs](imgs/cnn2.png)

The result of performing the preceding operations is as follows:

![img](./imgs/cnn3.png)
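
The sliding operation can be sketched in plain Python. This is a minimal illustration rather than an optimized implementation: the values of `matrix_a` and `matrix_b` are hypothetical (they are not taken from the figures), and `convolve2d` is a helper name introduced here. Like the convolution layers in deep learning frameworks, it does not flip the filter before multiplying.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` with a stride of 1 and no padding.

    At each position, multiply the overlapping elements and sum the products.
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            total = 0
            for di in range(k_rows):
                for dj in range(k_cols):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Hypothetical values for illustration only (not the ones in the figures).
matrix_a = [[1, 2, 3, 4],
            [5, 6, 7, 8],
            [9, 10, 11, 12],
            [13, 14, 15, 16]]
matrix_b = [[1, 0],
            [0, 1]]

result = convolve2d(matrix_a, matrix_b)
print(result)  # [[7, 9, 11], [15, 17, 19], [23, 25, 27]]
```

A 2 x 2 filter fits into a 4 x 4 matrix at 3 positions along each axis, which is why the output is 3 x 3.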

## Filter

A filter is a matrix of weights that is initialized randomly at the start. The model
learns the optimal weight values of a filter over increasing epochs.

The concept of filters brings us to two different aspects:
- What the filters learn about
- How filters are represented

In general, the more filters there are in a CNN, the more features of an image the
model can learn about.

For now, it is enough to understand that the filters learn about different features
present in the image. For example, a certain filter might learn about the ears of a cat
and produce a high activation (a large convolution output value) when the part of the
image it is convolving with contains the ear of a cat.

In the previous section, we learned that when we convolved one filter that has a size
of 2 x 2 with a matrix that has a size of 4 x 4, we got an output that is 3 x 3 in
dimension.

However, if 10 different filters are convolved with the bigger matrix (original image),
the result is 10 sets of the 3 x 3 output matrices.

Furthermore, when we are dealing with color images, which have three channels, the
filter that is convolving with the original image also has three channels, and each
position of the filter still produces a single scalar output. Similarly, if the filters
are convolving with an intermediate output (let's say of 64 x 112 x 112 in shape), each
filter has 64 channels to produce a scalar output. In addition, if 512 filters convolve
with that intermediate output, the output post convolution is 512 x 111 x 111 in shape
(for example, with 2 x 2 filters, a stride of 1, and no padding, as in the earlier
example, since 112 - 2 + 1 = 111).

To solidify our understanding of the output of filters further, let's take a look at the
following diagram:

![imgs](./imgs/cnn5.png)


In the preceding diagram, we can see that the input image is convolved with filters
that have the same depth as the input (which the filters are convolving with)
and that the output of a convolution has as many channels as there are
filters.
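
The shape bookkeeping described in this section can be checked with a few lines of Python. This is a sketch under stated assumptions: `conv_output_shape` and `filter_shape` are hypothetical helper names, shapes are written as (channels, height, width), and the 2 x 2 filter size, stride of 1, and zero padding are assumed because they reproduce the 111 x 111 output quoted above.

```python
def conv_output_shape(input_shape, num_filters, kernel_size, stride=1, padding=0):
    """Return the (channels, height, width) shape produced by a convolution layer."""
    _, height, width = input_shape
    out_height = (height + 2 * padding - kernel_size) // stride + 1
    out_width = (width + 2 * padding - kernel_size) // stride + 1
    # One output channel per filter, regardless of the input depth.
    return (num_filters, out_height, out_width)


def filter_shape(input_shape, kernel_size):
    """Each filter spans every input channel, so its depth equals the input depth."""
    in_channels = input_shape[0]
    return (in_channels, kernel_size, kernel_size)


# 10 filters of size 2 x 2 over a single-channel 4 x 4 input.
small_output = conv_output_shape((1, 4, 4), num_filters=10, kernel_size=2)
print(small_output)  # (10, 3, 3)

# 512 filters over a 64 x 112 x 112 intermediate output (2 x 2 filters assumed).
intermediate = (64, 112, 112)
print(filter_shape(intermediate, kernel_size=2))  # (64, 2, 2)
large_output = conv_output_shape(intermediate, num_filters=512, kernel_size=2)
print(large_output)  # (512, 111, 111)
```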

## Strides

Let's understand the impact of stride by leveraging the same example that we saw in
the Convolutions section. This time, we'll slide Matrix B over Matrix A with a stride
of 2. The output of convolution with a stride of 2 is as follows:

![img](./imgs/cnn6.png)

The result of performing the preceding operations is as follows:

![img](./imgs/cnn7.png)

Note that the preceding output has a lower dimension compared to the scenario
where the stride was 1 (where the output shape was 3 x 3) since we now have a stride
of 2.
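
To make the effect of the stride concrete, here is a minimal sketch of a convolution with a configurable step. The matrix values are hypothetical (not those in the figures), and `convolve2d` is a helper name introduced for this illustration.

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image`, moving `stride` cells at a time (no padding)."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = (len(image) - k_rows) // stride + 1
    out_cols = (len(image[0]) - k_cols) // stride + 1
    output = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            top, left = i * stride, j * stride  # top-left corner of the current patch
            total = sum(
                image[top + di][left + dj] * kernel[di][dj]
                for di in range(k_rows)
                for dj in range(k_cols)
            )
            row.append(total)
        output.append(row)
    return output


# Hypothetical values for illustration only.
matrix_a = [[1, 2, 3, 4],
            [5, 6, 7, 8],
            [9, 10, 11, 12],
            [13, 14, 15, 16]]
matrix_b = [[1, 0],
            [0, 1]]

stride_1 = convolve2d(matrix_a, matrix_b, stride=1)
stride_2 = convolve2d(matrix_a, matrix_b, stride=2)
print(len(stride_1), "x", len(stride_1[0]))  # 3 x 3
print(stride_2)  # [[7, 11], [23, 27]]
```

With a stride of 2, the 2 x 2 filter visits only 2 positions along each axis of the 4 x 4 matrix, so the output shrinks from 3 x 3 to 2 x 2.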

## Padding

In the preceding case, we could not multiply the leftmost elements of the filter by the
rightmost elements of the image. To make such multiplications possible, we pad the
image with zeros. This ensures that every element within the image can take part in
element-wise multiplication with every element of the filter.

Let's understand padding by using the same example we used in the
Convolution section.

Once we add padding on top of Matrix A, the revised version of Matrix A will look as
follows:

![imgs](./imgs/cnn8.png)

From the preceding matrix, we can see that we have padded Matrix A with zeros and
that the convolution with Matrix B will not result in the output dimension being
smaller than the input's dimension. This aspect comes in handy when we are working
on a residual network, where we must add the output of the convolution to its
input.
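
Zero padding is simple to sketch in code. The example below assumes one ring of zeros on every side of a hypothetical 4 x 4 matrix (the exact padding shown in the figure may differ), and `zero_pad` is a helper name introduced here.

```python
def zero_pad(image, pad=1):
    """Surround `image` with `pad` rows and columns of zeros on every side."""
    padded_width = len(image[0]) + 2 * pad
    padded = [[0] * padded_width for _ in range(pad)]      # top rows of zeros
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)   # zeros left and right
    padded += [[0] * padded_width for _ in range(pad)]     # bottom rows of zeros
    return padded


# Hypothetical values for illustration only.
matrix_a = [[1, 2, 3, 4],
            [5, 6, 7, 8],
            [9, 10, 11, 12],
            [13, 14, 15, 16]]

padded_a = zero_pad(matrix_a, pad=1)
for row in padded_a:
    print(row)

kernel_size = 2
# Output size of a stride-1 convolution over the padded matrix.
output_size = len(padded_a) - kernel_size + 1
print(output_size)  # 5
```

Under these assumptions, the padded matrix is 6 x 6 and a 2 x 2 filter yields a 5 x 5 output, which is not smaller than the original 4 x 4 input.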

## Pooling

Pooling aggregates information in a small patch. Imagine a scenario where the output
of convolution activation is as follows:



![imgs](./imgs/cnn9.png)

The max pooling for this patch is 4. Here, we have considered the elements in this
pool of elements and have taken the maximum value across all the elements present.

Similarly, let's understand the max pooling for a bigger matrix:

![imgs](./imgs/cnn10.png)

![imgs](./imgs/cnn11.png)

![imgs](./imgs/cnn12.png)
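
Max pooling over a larger matrix can be sketched as follows. The input is an assumption: a 4 x 4 matrix holding the values 1 to 16, chosen because it is consistent with the pooled values 6, 8, 14, and 16 that are flattened in the next section. `max_pool2d` is a helper name introduced for this illustration.

```python
def max_pool2d(feature_map, pool_size=2, stride=None):
    """Take the maximum of each `pool_size` x `pool_size` patch."""
    stride = stride or pool_size  # non-overlapping patches by default
    out_rows = (len(feature_map) - pool_size) // stride + 1
    out_cols = (len(feature_map[0]) - pool_size) // stride + 1
    return [
        [
            max(
                feature_map[i * stride + di][j * stride + dj]
                for di in range(pool_size)
                for dj in range(pool_size)
            )
            for j in range(out_cols)
        ]
        for i in range(out_rows)
    ]


# Assumed input values (not read from the figures).
activations = [[1, 2, 3, 4],
               [5, 6, 7, 8],
               [9, 10, 11, 12],
               [13, 14, 15, 16]]

pooled = max_pool2d(activations, pool_size=2)
print(pooled)  # [[6, 8], [14, 16]]
```

Each 2 x 2 patch is summarized by a single value, so the 4 x 4 input is reduced to 2 x 2.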

## Putting them all together

So far, we have learned about convolution, filters, strides, padding, and pooling, and
their impact on the dimension of an image. Now, we will learn about another critical
component of a CNN, the flatten layer (which feeds the fully connected layers), before
putting the pieces we have learned about together.


To understand the flattening process, we'll take the output of the pooling layer in the
previous section and flatten the output. The output of flattening the pooling layer is
as follows:

$\{6, 8, 14, 16\}$

By doing this, we can see that the flatten layer can be treated as equivalent to the
input layer of a fully connected network. Once the flatten layer's values have been
obtained, we can pass them through the hidden layer and then
obtain the output for predicting the class of an image.
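
A minimal sketch of flattening and feeding the result to a hidden layer is shown below. The 2 x 2 arrangement of the pooled values is assumed, and the weights and biases are hypothetical numbers chosen only to keep the arithmetic easy to follow (in a real model they would be learned). `flatten` and `dense_relu` are helper names introduced here.

```python
def flatten(feature_map):
    """Unroll a 2D feature map into a 1D list, row by row."""
    return [value for row in feature_map for value in row]


def dense_relu(inputs, weights, biases):
    """One fully connected layer followed by a ReLU activation."""
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        outputs.append(max(0, total))
    return outputs


pooled = [[6, 8],
          [14, 16]]  # assumed layout of the pooled output
flattened = flatten(pooled)
print(flattened)  # [6, 8, 14, 16]

# Hypothetical weights for a hidden layer with two units.
hidden_weights = [[1, 0, -1, 0],
                  [0, 1, 0, 1]]
hidden_biases = [0, 0]
hidden = dense_relu(flattened, hidden_weights, hidden_biases)
print(hidden)  # [0, 24]
```

From the hidden layer onward, the computation is the same as in the fully connected networks from the previous chapter.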

The overall flow of a CNN is as follows:

![imgs](./imgs/cnn13.png)

In the preceding image, we can see the overall flow of a CNN model, where we are
passing an image through convolution via multiple filters and then pooling (and in
the preceding case, repeating the convolution and pooling process twice), before
flattening the output of the final pooling layer. This forms the feature learning part of
the preceding image.

The operations of convolution and pooling constitute the feature learning section:
filters help in extracting relevant features from images, and pooling helps in
aggregating information, thereby reducing the number of nodes at the flatten
layer. If we directly flattened an input image that is, for example, 300 x 300 pixels
in size, we would be dealing with 90K input values. With 90K input pixel values
and 100K nodes in a hidden layer, we would be looking at ~9 billion parameters, which
is huge in terms of computation.

Convolution and pooling help in fetching a flattened layer that has a much smaller
representation than the original image.
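
The arithmetic behind these numbers is easy to verify. The second half of the sketch is an assumption added for comparison: it supposes two rounds of padded convolution followed by 2 x 2 pooling, and counts the values per channel that would reach the flatten layer.

```python
height, width = 300, 300
hidden_nodes = 100_000

# Flattening the raw image directly.
input_values = height * width
dense_weights = input_values * hidden_nodes
print(input_values)   # 90000
print(dense_weights)  # 9000000000

# Assumed setup: padded convolutions keep the size, each 2 x 2 pooling halves it.
pooled_side = height // 2 // 2
pooled_values_per_channel = pooled_side * pooled_side
print(pooled_side)                # 75
print(pooled_values_per_channel)  # 5625
```

The total size of the flatten layer also depends on the number of filters in the last convolution, which this sketch leaves out.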

## How convolution and pooling help in image translation

When we perform pooling, we can consider the output of the operation as an
abstraction of a region (a small patch). This phenomenon comes in handy, especially
when images are being translated.

Think of a scenario where an image is translated by 1 pixel to the left. Once we
perform convolution, activation, and pooling on top of it, we'll have reduced the
dimension of the image (due to pooling), which means that fewer pixels
store the majority of the information from the original image. Moreover, given that
pooling stores information about a region (patch), the value of a pixel in the
pooled image is unlikely to vary, even if the original image is translated by 1 unit.
This is because the maximum value of that region is likely to still be captured in the
pooled image.
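
A toy example illustrates this behavior, and also its limits. The 4 x 4 image below is hypothetical, with a single bright pixel, and `shift_left` and `max_pool2d` are helper names introduced for this sketch.

```python
def max_pool2d(feature_map, pool_size=2):
    """Non-overlapping max pooling over `pool_size` x `pool_size` patches."""
    out_rows = len(feature_map) // pool_size
    out_cols = len(feature_map[0]) // pool_size
    return [
        [
            max(
                feature_map[i * pool_size + di][j * pool_size + dj]
                for di in range(pool_size)
                for dj in range(pool_size)
            )
            for j in range(out_cols)
        ]
        for i in range(out_rows)
    ]


def shift_left(image, pixels=1):
    """Translate the image to the left, filling the vacated columns with zeros."""
    return [row[pixels:] + [0] * pixels for row in image]


# Hypothetical image with one bright pixel in the top-right 2 x 2 patch.
image = [[0, 0, 0, 0],
         [0, 0, 0, 9],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]

pooled_original = max_pool2d(image)
pooled_shift_1 = max_pool2d(shift_left(image, 1))
pooled_shift_2 = max_pool2d(shift_left(image, 2))

print(pooled_original)  # [[0, 9], [0, 0]]
print(pooled_shift_1)   # [[0, 9], [0, 0]]
print(pooled_shift_2)   # [[9, 0], [0, 0]]
```

After a shift of 1 pixel, the bright pixel stays inside the same pooling patch, so the pooled output is unchanged. After a shift of 2 pixels, it crosses into the neighboring patch and the pooled output changes, which is why pooling gives tolerance to small translations rather than full invariance.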

Convolution and pooling can also help us with the receptive field. To understand
the receptive field, let's imagine a scenario where we perform a convolution and
pooling operation twice on an image that is 100 x 100 in shape. The output at the end
of the two convolution and pooling operations is of the shape 25 x 25 (if the
convolution operation was done with padding and each pooling step halves the height
and width). Each cell in the 25 x 25 output now corresponds to a larger portion of the
original image, at least 4 x 4 pixels in size. Thus, because of the convolution and
pooling operations, each cell in the resulting image corresponds to a patch of the
original image.
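
The size bookkeeping in this example can be traced layer by layer. The sketch below assumes 3 x 3 convolutions with a padding of 1 and 2 x 2 pooling with a stride of 2; these settings are not stated in the text and are chosen because they reproduce the 25 x 25 output. The receptive field is tracked with the standard recurrence: each layer widens it by (kernel - 1) times the current step between neighboring cells.

```python
def conv_size(size, kernel=3, stride=1, padding=1):
    """Spatial size after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_size(size, window=2):
    """Spatial size after non-overlapping pooling."""
    return size // window


# (layer type, kernel or window size, stride): assumed settings.
layers = [("conv", 3, 1), ("pool", 2, 2), ("conv", 3, 1), ("pool", 2, 2)]

size = 100
receptive_field = 1  # pixels of the original image seen by one cell
jump = 1             # distance in original pixels between neighboring cells
sizes = [size]
for name, kernel, stride in layers:
    size = conv_size(size, kernel) if name == "conv" else pool_size(size, kernel)
    receptive_field += (kernel - 1) * jump
    jump *= stride
    sizes.append(size)

print(sizes)            # [100, 100, 50, 50, 25]
print(jump)             # 4
print(receptive_field)  # 10
```

The step of 4 matches the 4 x 4 patch in the text: neighboring cells in the 25 x 25 output are 4 pixels apart in the original image. Under the assumed 3 x 3 kernels, the region that actually influences each cell is somewhat larger (10 x 10), because neighboring patches overlap.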