# Convolutional Neural Network - MNIST

In this notebook, my goal is to implement a CNN to classify handwritten digits from the MNIST dataset using PyTorch. 

Classifying handwritten digits is not new to me at this point 😅. In the previous folder, I used PyTorch to implement an MLP to do the very same thing, and achieved 96% accuracy on the testing set, which I think is pretty good! 🤷

So, implementing this CNN is also a chance for me to compare the performance of two different architectures (CNN and MLP) on the very same task. In `README.md` of this folder, I introduced CNNs as a family of models that are *well-suited for image-related tasks*. By the end of this notebook, I will see for myself! The score to beat is: 96%

Okay, let's jump right in!

# Model Overview

The architecture of the network that I am going to implement is shown on the following image:

![Deep CNN architecture](./images/img-7.jpg)

It is not new to me that training examples are usually fed to networks in batches. So, I would like to describe the journey of a batch through the network. For simplicity, I choose a batch size of 3, and assume each image is grayscale (i.e. one color channel) and of size $(28 \times 28)$.

When the batch of 3 images gets fed into the network through the input layer, **EACH** image in the batch gets convolved with 32 different filters (of size $5 \times 5$). Convolving **EACH** image in the batch with 32 kernels results in **32** feature maps for **EACH** image. Another way to see this is that after an image in the batch gets convolved with the 32 filters, the result is a $(28 \times 28 \times 32)$ matrix, i.e. one with 32 channels (feature maps). The resolution staying at $(28 \times 28)$ assumes the image is padded before the convolution; without padding, a $5 \times 5$ filter would shrink it to $(24 \times 24)$. So, after all three images in the batch go through this layer, the result is a 4-D matrix containing THREE $(28 \times 28 \times 32)$ matrices. If you do not understand this, consider reading the section of `cnn-basics.ipynb` where I talk about convolving on 3-D matrices.
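
To check this step for myself, here is a minimal sketch of the first convolutional layer in PyTorch. The random batch stands in for real MNIST digits, and `padding=2` is my assumption for keeping the $(28 \times 28)$ resolution. Note that PyTorch puts the channels right after the batch dimension, so the shape prints as `[3, 32, 28, 28]` rather than in the order I use in the text.

```python
import torch
import torch.nn as nn

# Hypothetical batch: 3 grayscale 28x28 images (random values stand in for digits).
batch = torch.randn(3, 1, 28, 28)

# 32 filters of size 5x5. padding=2 is an assumption that keeps the output at 28x28.
conv_1 = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=5, padding=2)

feature_maps = conv_1(batch)
print(conv_1.weight.shape)  # torch.Size([32, 1, 5, 5]): 32 filters, 1 channel each
print(feature_maps.shape)   # torch.Size([3, 32, 28, 28]): 32 feature maps per image
```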

Still with me? Cool! Let's proceed.

The three sets of 32 feature maps we got out of the convolutional layer (in the form of a 4-D matrix containing THREE $(28 \times 28 \times 32)$ matrices) get fed into a subsampling layer where a max-pooling operation, $P_{2 \times 2}$, takes place. As I learned in `cnn-basics.ipynb`, pooling decreases the resolution of feature maps. So, the 32 feature maps of **EACH** image get downsized from $(28 \times 28)$ to $(14 \times 14)$. After the batch goes through the subsampling layer, the result is still a 4-D matrix, now containing THREE smaller $(14 \times 14 \times 32)$ feature map matrices.
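
A small sketch of this pooling step, assuming the $P_{2 \times 2}$ operation maps to `nn.MaxPool2d(kernel_size=2)` (whose stride defaults to the kernel size). The input is random and written in PyTorch's channels-second layout; only the shapes matter here.

```python
import torch
import torch.nn as nn

# Stand-in for the output of the first conv layer: 3 images, 32 feature maps each.
feature_maps = torch.randn(3, 32, 28, 28)

# 2x2 max-pooling; the stride defaults to the kernel size, so windows do not overlap.
pool_1 = nn.MaxPool2d(kernel_size=2)

pooled = pool_1(feature_maps)
print(pooled.shape)  # torch.Size([3, 32, 14, 14])

# Each pooled value is the maximum of one 2x2 window of the input.
top_left_window = feature_maps[0, 0, :2, :2]
print(pooled[0, 0, 0, 0].item() == top_left_window.max().item())  # True
```

The number of channels is untouched: pooling works on each feature map separately and only halves the height and width.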

The batch of feature map matrices now gets fed into another convolutional layer. Each feature map matrix gets convolved with 64 kernels. Remember in `cnn-basics.ipynb`, I said the following when talking about performing convolutions:

>  The filters must have the SAME number of channels as the input image

Here, we are feeding 3 feature map matrices (32 channels each) into the conv. layer. So, each of the 64 filters must also have 32 channels. After convolving the 64 filters (of size $5 \times 5$) with each feature map matrix, the result is a batch of 3 new feature map matrices of 64 channels each.
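
The channel-matching rule can be seen directly in the weight tensor of the layer. This is a sketch under the same assumptions as before (random input, `padding=2` to keep the resolution, PyTorch's channels-second layout).

```python
import torch
import torch.nn as nn

# Stand-in for the pooled batch: 3 images, 32 channels, 14x14 each.
pooled = torch.randn(3, 32, 14, 14)

# 64 filters of size 5x5; in_channels must match the 32 channels of the input.
conv_2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=5, padding=2)

# Every one of the 64 filters has 32 channels, one per input feature map.
print(conv_2.weight.shape)  # torch.Size([64, 32, 5, 5])

out = conv_2(pooled)
print(out.shape)  # torch.Size([3, 64, 14, 14])
```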

Are you still there?! Hang on, almost done. I need to understand it. Articulating helps!

Next is another pooling layer. The batch containing our THREE $(14 \times 14 \times 64)$ matrices gets fed into a subsampling layer where a max-pooling operation, $P_{2 \times 2}$, takes place. The result is our batch with three smaller $(7 \times 7 \times 64)$ feature map matrices.

Each feature map matrix in the batch is flattened (into a vector of $7 \times 7 \times 64 = 3136$ values), then fed into what is basically an MLP, and the rest I am already familiar with :)
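
Here is a rough sketch of that last part. The layer sizes (1024 hidden units, 10 outputs) are the ones from the architecture image; the ReLU between the two fully connected layers is my assumption, since I have not specified an activation yet.

```python
import torch
import torch.nn as nn

# Stand-in for the output of the second pooling layer: 3 images, 64 maps of 7x7.
pooled_2 = torch.randn(3, 64, 7, 7)

# Flatten everything except the batch dimension: 64 * 7 * 7 = 3136 values per image.
flat = torch.flatten(pooled_2, start_dim=1)
print(flat.shape)  # torch.Size([3, 3136])

fc_1 = nn.Linear(64 * 7 * 7, 1024)
fc_2 = nn.Linear(1024, 10)

# ReLU is an assumed activation; softmax turns the 10 scores into probabilities.
logits = fc_2(torch.relu(fc_1(flat)))
probs = torch.softmax(logits, dim=1)
print(probs.shape)  # torch.Size([3, 10])
```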

Here is the list of layers I described, and the size of the tensor that comes out of each one:

- `Input`: [batchsize $\times 28 \times 28 \times 1 $]

- `Conv_1`: [batchsize $\times 28 \times 28 \times 32 $]

- `Pooling_1`: [batchsize $\times 14 \times 14 \times 32 $]

- `Conv_2`: [batchsize $\times 14 \times 14 \times 64 $]

- `Pooling_2`: [batchsize $\times 7 \times 7 \times 64 $]

- `FC_1`: [batchsize $\times 1024$]

- `FC_2` and `softmax` layer: [batchsize $\times 10$]
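
The spatial sizes in this list can be double-checked with the usual output-size formula, $\lfloor (n + 2p - k) / s \rfloor + 1$, where $n$ is the input side, $k$ the kernel side, $p$ the padding and $s$ the stride. Below is a small sketch in plain Python; `conv_output_side` is a helper name I made up, and the padding of 2 for the convolutions is an assumption chosen so the resolution is preserved.

```python
def conv_output_side(side, kernel, padding=0, stride=1):
    """Side length after sliding a square conv or pooling window over the input."""
    return (side + 2 * padding - kernel) // stride + 1

side = 28
shapes = [("Input", (side, side, 1))]

side = conv_output_side(side, kernel=5, padding=2)  # assumed padding of 2
shapes.append(("Conv_1", (side, side, 32)))

side = conv_output_side(side, kernel=2, stride=2)   # 2x2 max-pooling
shapes.append(("Pooling_1", (side, side, 32)))

side = conv_output_side(side, kernel=5, padding=2)  # assumed padding of 2
shapes.append(("Conv_2", (side, side, 64)))

side = conv_output_side(side, kernel=2, stride=2)   # 2x2 max-pooling
shapes.append(("Pooling_2", (side, side, 64)))

for name, shape in shapes:
    print(name, shape)
```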

Remember, in `cnn-basics.ipynb`, I said this:

> Colored images are often represented by 3-D matrices. Different Deep Learning frameworks will follow **different conventions**, but I know for sure that one dimension represents the width of the image, another represents the height of the image, and the remaining one represents the number of color channel in the image.

The key term is: **convention**.

Above, I followed my own convention to help my understanding. 🤷

But, in PyTorch, the channels come first: they are the first dimension of an image tensor, or the second one once a batch dimension is added in front. This is called the **`NCHW` format**, where `N` stands for the number of images within the batch, `C` stands for channels, and `H` and `W` stand for height and width, respectively.

In PyTorch, `nn.Conv2d`, the convolutional class, assumes that inputs are in the `NCHW` format. TensorFlow, on the other hand, uses the `NHWC` format by default. Moving forward, I will follow the `NCHW` format.
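
Going from the convention I used in the list above (`NHWC`) to the one PyTorch expects (`NCHW`) is just a reordering of dimensions. A minimal sketch with a random stand-in batch:

```python
import torch

# Hypothetical batch in NHWC order: 3 images, 28x28, 1 channel.
nhwc = torch.randn(3, 28, 28, 1)

# Reorder the dimensions to NCHW: keep N first, move C in front of H and W.
nchw = nhwc.permute(0, 3, 1, 2)
print(nchw.shape)  # torch.Size([3, 1, 28, 28])

# The values are the same, only the indexing order changed.
print(nhwc[0, 5, 7, 0].item() == nchw[0, 0, 5, 7].item())  # True
```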

# Loading and Preprocessing the data

# Implementing the model using PyTorch

# Training the model

# Last words...