# PyTorch MNIST example dissected

In this notebook we'll explore the components of the
[PyTorch MNIST example](https://github.com/pytorch/examples/tree/master/mnist)
one-by-one.

**This is notebook 2 of the multi-part example.** Please see [Part 1 here](1_mnist_load.ipynb).

## 2. Building the model

In this notebook we'll explore the components of `torch.nn`.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

We'll start by copying the model from the original [PyTorch MNIST example](https://github.com/pytorch/examples/tree/master/mnist) verbatim:

In [2]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

That is, a class derived from `nn.Module` only needs to define its layers and implement the `.forward()` method; the backward pass is derived automagically by the auto-differentiation framework. Pretty neat.
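To illustrate what "taken care of" means in practice, here is a minimal sketch with a hypothetical one-layer module, `TinyNet`, which is not part of the MNIST example. It suggests that layers assigned as attributes in `__init__` get their parameters registered automatically, and that calling the module instance runs `.forward()`.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical one-layer module, used only for illustration."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)


tiny = TinyNet()

# Parameters of sub-modules assigned as attributes are registered automatically.
param_names = [name for name, _ in tiny.named_parameters()]
print(param_names)

# Calling the module instance invokes .forward() on the input batch.
out = tiny(torch.zeros(3, 4))
print(out.size())
```

The same mechanism is what lets an optimizer later receive all weights of `Net` through a single `.parameters()` call.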

The original `x` argument of the `.forward()` method is the **batch** of features generated by the `DataLoader` (see [Part 1](1_mnist_load.ipynb#1.4-Dataset-and-DataLoader) for details). That is, for the MNIST dataset and a batch size of 64, the dimensions of the input `x` are

    torch.Size([64, 1, 28, 28])

### 2.1 Conv2d

First, we feed the input batch through 2D convolution:

    self.conv1(x)

This convolution is defined with a 5x5 kernel and `10` output layers (also called channels):

    self.conv1 = nn.Conv2d(1, 10, kernel_size=5)

Recall that our input has 1 layer, because it is a greyscale image; hence the `1` in the `Conv2d()` constructor parameters.
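As a quick check of these constructor parameters, the sketch below applies a freshly created `Conv2d(1, 10, kernel_size=5)` to a random tensor shaped like an MNIST batch. The random tensor is only a stand-in for real images (an assumption made to keep the example self-contained).

```python
import torch
import torch.nn as nn

conv1 = nn.Conv2d(1, 10, kernel_size=5)

# Random stand-in for a batch of 64 greyscale 28x28 images.
batch = torch.randn(64, 1, 28, 28)
out = conv1(batch)

# Weights are stored as [output layers, input layers, kernel height, kernel width].
print(conv1.weight.size())

# Each 28x28 image shrinks to 24x24, because 28 - 5 + 1 = 24 window positions fit.
print(out.size())
```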

Below we will create an instance of `Conv2d` and see what it actually does to the data. We start with a tiny 3x3 input tensor and 3x3 convolution with 4 output layers:

In [3]:
data = 1 - torch.eye(3, dtype=torch.float32).reshape((1, 1, 3, 3))
data

tensor([[[[0., 1., 1.],
          [1., 0., 1.],
          [1., 1., 0.]]]])

Note that NN layers operate on *batches*, hence the input data dimensions of [batch of size 1 x 1 color layer x height 3 x width 3]:

In [4]:
data.size()

torch.Size([1, 1, 3, 3])

In [5]:
conv = nn.Conv2d(1, 4, kernel_size=3)
conv

Conv2d(1, 4, kernel_size=(3, 3), stride=(1, 1))

In [6]:
conv_result = conv(data)
conv_result

tensor([[[[ 0.3212]],

         [[-0.3721]],

         [[ 0.3503]],

         [[-0.1429]]]], grad_fn=<ThnnConv2DBackward>)

In [7]:
conv_result.size()

torch.Size([1, 4, 1, 1])

So our convolution converts a 1x3x3 patch of (greyscale) image into a 4x1x1 column of weird numbers. Where do those numbers come from?

Let's take a look at the `Conv2d` internals.

In [8]:
(params, bias) = list(conv.parameters())

In [9]:
params

Parameter containing:
tensor([[[[ 0.1652,  0.0623,  0.2246],
          [ 0.1479, -0.2734, -0.2719],
          [-0.3310,  0.1613, -0.0247]]],


        [[[ 0.0025, -0.1486, -0.2781],
          [ 0.0184, -0.0850, -0.1435],
          [ 0.1833,  0.0569, -0.2380]]],


        [[[ 0.0323,  0.1469,  0.1247],
          [-0.1735, -0.2636,  0.0736],
          [-0.1304,  0.2736,  0.2486]]],


        [[[ 0.0915, -0.1726,  0.1300],
          [-0.0631, -0.0104,  0.2462],
          [-0.0241, -0.2866,  0.0535]]]], requires_grad=True)

In [10]:
bias

Parameter containing:
tensor([ 0.3281, -0.0605,  0.0354,  0.0273], requires_grad=True)

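These are the randomly initialized parameters of our convolution. The first one is the tensor of kernel weights, with one 3x3 matrix per output layer; the second one is a bias vector with one element per output layer. (It is possible to have a convolution *without* a bias: that's the `bias=False` parameter in the constructor.)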
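For completeness, a short sketch of the bias-free variant. The name `conv_nobias` is made up for this illustration.

```python
import torch.nn as nn

# Same convolution as above, but without the bias vector.
conv_nobias = nn.Conv2d(1, 4, kernel_size=3, bias=False)

print(conv_nobias.bias)  # None

# Only the kernel weights remain as learnable parameters.
n_params = len(list(conv_nobias.parameters()))
print(n_params)  # 1
```

Note that unpacking `list(conv_nobias.parameters())` into two names, as we did for `conv`, would fail here because the list has a single element.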

So what exactly is the computation that `Conv2d` performs on the data during the forward pass?

In [11]:
(params * data).sum(dim=(1, 2, 3)) + bias

tensor([ 0.3212, -0.3721,  0.3503, -0.1429], grad_fn=<ThAddBackward>)

Woo hoo, we get the same results!

That is, for each of the 4 output layers of the convolution there is a 3x3 matrix of parameters that we multiply elementwise by a 3x3 patch of the input image; then we just sum up all elements of that product and add the bias of that layer.

Let's do it again, just for one layer.

In [12]:
params[0]

tensor([[[ 0.1652,  0.0623,  0.2246],
         [ 0.1479, -0.2734, -0.2719],
         [-0.3310,  0.1613, -0.0247]]], grad_fn=<SelectBackward>)

In [13]:
params[0] * data

tensor([[[[ 0.0000,  0.0623,  0.2246],
          [ 0.1479, -0.0000, -0.2719],
          [-0.3310,  0.1613, -0.0000]]]], grad_fn=<ThMulBackward>)

In [14]:
(params[0] * data).sum() + bias[0]

tensor(0.3212, grad_fn=<ThAddBackward>)

So each of 4 layers of the convolution parameters can be seen as a 3x3 patch of an image that we learn to recognize in the input. Right now it is random, but as we learn (i.e. backpropagate the gradient through it), it will morph into something like, say, a vertical line, or a diagonal gradient. Having 4 layers means that our convolution learns to recognize 4 different image patterns. Our original MNIST model has 10 such layers in its first convolution.

Having a bias parameter for each layer adds a learned constant offset to that layer's response, which allows us to learn image patterns of different intensity.

#### 2.1.1 Sliding window

In our toy example above, the input image is the same size as the convolution kernel (3x3). Let's apply our convolution to a larger image now.

In [15]:
data2 = 1 - torch.eye(5, dtype=torch.float32).reshape((1, 1, 5, 5))
data2

tensor([[[[0., 1., 1., 1., 1.],
          [1., 0., 1., 1., 1.],
          [1., 1., 0., 1., 1.],
          [1., 1., 1., 0., 1.],
          [1., 1., 1., 1., 0.]]]])

In [16]:
conv_result2 = conv(data2)
conv_result2

tensor([[[[ 0.3212, -0.1208,  0.5194],
          [ 0.3980,  0.3212, -0.1208],
          [-0.0362,  0.3980,  0.3212]],

         [[-0.3721, -0.7678, -0.8759],
          [-0.4005, -0.3721, -0.7678],
          [-0.4144, -0.4005, -0.3721]],

         [[ 0.3503,  0.2675,  0.4980],
          [ 0.1472,  0.3503,  0.2675],
          [ 0.2429,  0.1472,  0.3503]],

         [[-0.1429,  0.3413,  0.0158],
          [-0.0819, -0.1429,  0.3413],
          [-0.1383, -0.0819, -0.1429]]]], grad_fn=<ThnnConv2DBackward>)

Note that the elements at `[0,0]` in each of the 4 layers are exactly the same as the ones we got from our 3x3 image:

In [17]:
conv_result2[:,:,0,0]

tensor([[ 0.3212, -0.3721,  0.3503, -0.1429]], grad_fn=<SelectBackward>)

In [18]:
conv(data)

tensor([[[[ 0.3212]],

         [[-0.3721]],

         [[ 0.3503]],

         [[-0.1429]]]], grad_fn=<ThnnConv2DBackward>)

So the 2d convolution applies the 3x3 sliding window with the same parameters to each 3x3 patch of the input. In other words, on the forward pass it converts a 1x3x3 patch of input into 4x1x1 column output, and builds a new 4-layer tensor out of these elements.
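The sliding window can also be spelled out as an explicit loop. The sketch below is an illustration only, not how PyTorch implements convolution internally. It re-creates `conv` with an arbitrary seed, so the numbers differ from the outputs printed above, and the name `manual` is introduced here for the hand-computed result.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only to make the run repeatable
conv = nn.Conv2d(1, 4, kernel_size=3)
data2 = 1 - torch.eye(5, dtype=torch.float32).reshape((1, 1, 5, 5))

weight, bias = conv.weight, conv.bias
manual = torch.zeros(1, 4, 3, 3)

with torch.no_grad():
    for i in range(3):          # vertical position of the window
        for j in range(3):      # horizontal position of the window
            patch = data2[:, :, i:i + 3, j:j + 3]  # shape [1, 1, 3, 3]
            # The same weights are reused for every patch.
            manual[0, :, i, j] = (weight * patch).sum(dim=(1, 2, 3)) + bias

    matches = torch.allclose(manual, conv(data2), atol=1e-5)

print(matches)
```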

By default, 2d convolution has a **stride** of `1` and **padding** of `0`. That means, we'll move our sliding window by 1 pixel, starting from the [0,0] element and never going *outside* of the input - i.e. applying it first to

In [19]:
data2[:, :, 0:3, 0:3]

tensor([[[[0., 1., 1.],
          [1., 0., 1.],
          [1., 1., 0.]]]])

then to

In [20]:
data2[:, :, 0:3, 1:4]

tensor([[[[1., 1., 1.],
          [0., 1., 1.],
          [1., 0., 1.]]]])

and so on, all the way to `data2[:, :, 2:5, 2:5]`.
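The number of window positions along one dimension follows from the input size, kernel size, stride, and padding. The helper below, `conv_output_size`, is a hypothetical name introduced for this sketch and assumes a dilation of 1.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Hypothetical helper: window positions along one dimension (dilation 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1


print(conv_output_size(5, 3))             # 3: the 5x5 toy input above
print(conv_output_size(28, 5))            # 24: a 28x28 MNIST image, 5x5 kernel
print(conv_output_size(5, 3, padding=1))  # 5: padding of 1 preserves the size
print(conv_output_size(5, 3, stride=2))   # 2: windows start at offsets 0 and 2
```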

#### 2.1.2 Aside: Gradient computation

Note that the tensors holding the convolution parameters have the property `requires_grad=True`.

Likewise, the results of the convolution are tensors with the `grad_fn` property set.

This is where the auto-differentiation magic of PyTorch happens. Later in this course we will dedicate a whole section to it (see [Backpropagation](3_mnist_backprop.ipynb)).
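As a small preview, here is a hedged sketch of what these two properties enable. It re-creates the 3x3 toy setup, reduces the convolution result to a scalar, and back-propagates. The name `grad_before` is introduced for illustration.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 4, kernel_size=3)
data = 1 - torch.eye(3, dtype=torch.float32).reshape((1, 1, 3, 3))

result = conv(data)

# Nothing has been back-propagated yet, so no gradient is stored.
grad_before = conv.weight.grad
print(grad_before)  # None

# Reduce to a scalar and walk the graph recorded through grad_fn.
result.sum().backward()

# Each output is sum(weight * input) + bias, so the gradient with respect
# to each 3x3 weight matrix is the input patch itself.
print(conv.weight.grad[0])
print(conv.bias.grad)
```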

### 2.2 Max pooling

Next, we apply max pooling to the convolution results, i.e.

    F.max_pool2d(self.conv1(x), 2)

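With a kernel size of `2`, this function slides a 2x2 window over each layer and keeps only the largest value in each window. By default the stride equals the kernel size, so the windows do not overlap.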

Let's see how it works on our toy data:

In [21]:
conv_result2

tensor([[[[ 0.3212, -0.1208,  0.5194],
          [ 0.3980,  0.3212, -0.1208],
          [-0.0362,  0.3980,  0.3212]],

         [[-0.3721, -0.7678, -0.8759],
          [-0.4005, -0.3721, -0.7678],
          [-0.4144, -0.4005, -0.3721]],

         [[ 0.3503,  0.2675,  0.4980],
          [ 0.1472,  0.3503,  0.2675],
          [ 0.2429,  0.1472,  0.3503]],

         [[-0.1429,  0.3413,  0.0158],
          [-0.0819, -0.1429,  0.3413],
          [-0.1383, -0.0819, -0.1429]]]], grad_fn=<ThnnConv2DBackward>)

In [22]:
F.max_pool2d(conv_result2, 2)

tensor([[[[ 0.3980]],

         [[-0.3721]],

         [[ 0.3503]],

         [[ 0.3413]]]], grad_fn=<MaxPool2DWithIndicesBackward>)

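Only one full 2x2 window fits into each 3x3 layer, so the last row and column are dropped and each layer is reduced to a single value.

Recall that each of the 4 layers corresponds to some image pattern represented by the convolution parameters. Max pooling, then, keeps the strongest response to each pattern within every window: we still learn roughly where each pattern occurs in the input image, but at a coarser resolution that tolerates small shifts.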
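To make the windowing easier to follow, here is a small sketch on tensors with known values. The names `x`, `odd`, `pooled`, and `pooled_odd` are introduced for this illustration.

```python
import torch
import torch.nn.functional as F

# A single 4x4 layer holding the values 0..15, row by row:
#    0  1  2  3
#    4  5  6  7
#    8  9 10 11
#   12 13 14 15
x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

# Four non-overlapping 2x2 windows, each replaced by its maximum.
pooled = F.max_pool2d(x, 2)
print(pooled)  # maxima 5, 7, 13, 15 arranged as a 2x2 layer

# With a 3x3 input only the top-left 2x2 window fits;
# the last row and column are ignored by default.
odd = torch.arange(9, dtype=torch.float32).reshape(1, 1, 3, 3)
pooled_odd = F.max_pool2d(odd, 2)
print(pooled_odd)  # a single value, 4
```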

### 2.3 ReLU

Finally, we apply the Rectified Linear Unit activation function to the output:

    x = F.relu(F.max_pool2d(self.conv1(x), 2))

and that concludes the first layer of our neural network.

The `relu()` function simply clamps the negative elements of the tensor to zero, e.g.

In [23]:
F.relu(torch.tensor([-2, -1, 0, 1, 2]))

tensor([0, 0, 0, 1, 2])

This way, no gradient flows back through the elements that were clamped to zero, i.e. we will not propagate the gradient back to the layers that do not have a positive correlation with certain patches of the image.
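The effect on the gradient can be checked directly. This is a minimal sketch on a hand-picked input; zero itself is left out to keep the example unambiguous.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -1.0, 1.0, 2.0], requires_grad=True)

# Sum the activations to get a scalar, then back-propagate.
F.relu(x).sum().backward()

# The gradient is 0 where the input was negative and 1 where it was positive.
print(x.grad)  # tensor([0., 0., 1., 1.])
```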

### 2.4 Next layer

The next layer of our network is very similar to the first one:

    x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
   
where `conv2` is defined as

    self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
    
So we have the same sequence of `Conv2d`, `max_pool2d`, and `relu`.

To match the output of the first layer, `conv2` has `10` input layers and `20` output ones. That means that at this stage our neural net will learn 20 different patterns, each being a mosaic of 5x5 image components from all 10 layers of the stage below. We'll try to visualize these parameters later, after we train our network.
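To see how the two stages fit together, the sketch below traces tensor shapes through both of them on a random stand-in for an MNIST batch. Dropout is deliberately left out here, since it does not change shapes. The variable names are introduced for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv1 = nn.Conv2d(1, 10, kernel_size=5)
conv2 = nn.Conv2d(10, 20, kernel_size=5)

x = torch.randn(64, 1, 28, 28)  # random stand-in for an MNIST batch

# First stage: 28x28 -> 24x24 (convolution) -> 12x12 (pooling).
x = F.relu(F.max_pool2d(conv1(x), 2))
shape_after_first = tuple(x.size())
print(shape_after_first)   # (64, 10, 12, 12)

# Second stage: 12x12 -> 8x8 (convolution) -> 4x4 (pooling).
x = F.relu(F.max_pool2d(conv2(x), 2))
shape_after_second = tuple(x.size())
print(shape_after_second)  # (64, 20, 4, 4)

# 20 layers of 4x4 values give the 320 inputs that fc1 expects.
flat = x.view(-1, 320)
print(flat.size())
```

This is where the `320` in `nn.Linear(320, 50)` and in `x.view(-1, 320)` comes from: 20 x 4 x 4 = 320.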

### 2.5 Dropout

One new component that we encounter in the second layer is dropout, `conv2_drop`. It is defined as

    self.conv2_drop = nn.Dropout2d()
    
Let us see how it works.