# PyTorch MNIST example dissected

In this notebook we'll explore the components of the
[PyTorch MNIST example](https://github.com/pytorch/examples/tree/master/mnist)
one-by-one.

* Part 1: [Loading the data](1_mnist_load.ipynb)
* Part 2: [Model components and forward propagation](2_mnist_model.ipynb#2-Building-the-model) <-- **you are here**
   * [2.0 Input Layer](#2.0-Input-Layer)
   * [2.1 Conv2d](#2.1-Conv2d)
      * [2.1.1 Sliding window](#2.1.1-Sliding-window)
      * [2.1.2 Aside: Gradient computation](#2.1.2-Aside:-Gradient-computation)
   * [2.2 Max pooling](#2.2-Max-pooling)
   * [2.3 ReLU](#2.3-ReLU)
   * [2.4 Layer 2](#2.4-Layer-2)
   * [2.5 2d Dropout](#2.5-2d-Dropout)
      * [2.5.1 Aside: Dropout techniques](#2.5.1-Aside:-Dropout-techniques)
   * [2.6 Layer 3](#2.6-Layer-3)
      * [2.6.1 Aside: Dimensions](#2.6.1-Aside:-Dimensions)
      * [2.6.2 Linear unit](#2.6.2-Linear-unit)
* Part 3: [Autodiff and backpropagation](3_mnist_backprop.ipynb)
* Part 4: [Training the model](4_mnist_train.ipynb)
* Part 5: [Visualizing the results](5_mnist_visualize.ipynb)

## 2 Building the model

In this notebook we'll explore the components of `torch.nn`.

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F

We'll start by copying the model from the original [PyTorch MNIST example](https://github.com/pytorch/examples/tree/master/mnist) verbatim:

In [3]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

That is, any class derived from `nn.Module` should implement the `.forward()` method; everything else is taken care of automagically by the auto-differentiation framework. Pretty neat.

The original `x` argument of the `.forward()` method is the **batch** of features generated by the `DataLoader` (see [Part 1](1_mnist_load.ipynb#1.4-Dataset-and-DataLoader) for details). That is, for the MNIST dataset and batch size of 64, the dimensions of the input `x` are

    torch.Size([64, 1, 28, 28])
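
As a quick sanity check, the sketch below feeds a batch of that shape through the model. It repeats the `Net` definition so that it runs on its own, and it uses random numbers as a stand-in for real MNIST images (an assumption made only for illustration).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    """Same architecture as above, repeated so this snippet is self-contained."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


model = Net()
model.eval()  # switch off dropout for a deterministic forward pass

batch = torch.randn(64, 1, 28, 28)  # random stand-in for 64 greyscale 28x28 images
with torch.no_grad():
    log_probs = model(batch)

print(log_probs.shape)                 # one row of 10 log-probabilities per image
print(log_probs.exp().sum(dim=1)[:3])  # each row of probabilities sums to about 1
```

The output has one row per image and one column per digit class, which is what the final `log_softmax` over `dim=1` is expected to produce.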

### 2.0 Input Layer

The first layer of our neural net is defined as

    x = F.relu(F.max_pool2d(self.conv1(x), 2))

The combination of 2D convolution, max pooling, and ReLU activation function is very typical for image models. Below we will look at each of these three components in detail.

### 2.1 Conv2d

First, we feed the input batch through 2D convolution:

    self.conv1(x)

This convolution is defined with a 5x5 kernel and `10` output layers:

    self.conv1 = nn.Conv2d(1, 10, kernel_size=5)

Recall that our input has 1 layer (or *channel*, in PyTorch terminology), because it is a greyscale image; hence the `1` in the `Conv2d()` constructor parameters.

Below we will create an instance of `Conv2d` and see what it actually does to the data. We start with a tiny 3x3 input tensor and a 3x3 convolution with 4 output layers:

In [4]:
data = 1 - torch.eye(3, dtype=torch.float32).reshape((1, 1, 3, 3))
data

tensor([[[[0., 1., 1.],
          [1., 0., 1.],
          [1., 1., 0.]]]])

Note that NN layers operate on *batches*, hence the input data dimensions of [batch of size 1 x 1 color layer x height 3 x width 3]:

In [5]:
data.size()

torch.Size([1, 1, 3, 3])

In [6]:
conv = nn.Conv2d(1, 4, kernel_size=3)
conv

Conv2d(1, 4, kernel_size=(3, 3), stride=(1, 1))

In [7]:
conv_result = conv(data)
conv_result

tensor([[[[ 0.2802]],

         [[-0.0270]],

         [[ 0.2794]],

         [[-0.4983]]]], grad_fn=<ThnnConv2DBackward>)

In [8]:
conv_result.size()

torch.Size([1, 4, 1, 1])

So our convolution converts a 1x3x3 patch of (greyscale) image into a 4x1x1 column of weird numbers. Where do those numbers come from?

Let's take a look at the `Conv2d` internals.

In [9]:
(params, bias) = list(conv.parameters())

In [10]:
params

Parameter containing:
tensor([[[[ 0.3215,  0.0547, -0.0736],
          [-0.1007, -0.1459, -0.0529],
          [ 0.1067,  0.0479, -0.1120]]],


        [[[ 0.2073,  0.0388, -0.2968],
          [ 0.0361,  0.2942,  0.3329],
          [-0.2384,  0.2017, -0.2592]]],


        [[[ 0.1062, -0.2165, -0.1538],
          [ 0.2595, -0.0364, -0.0085],
          [ 0.0320,  0.1645, -0.0020]]],


        [[[ 0.0402,  0.0258,  0.1012],
          [-0.3169, -0.2258,  0.0269],
          [ 0.1091, -0.2625,  0.3272]]]], requires_grad=True)

In [11]:
bias

Parameter containing:
tensor([ 0.2981, -0.1014,  0.2020, -0.1818], requires_grad=True)

These are the randomly initialized parameters of our convolution: the first one is the weight tensor and the second one is the bias vector. (It is possible to have a convolution *without* a bias; that's the `bias=False` parameter in the constructor.)

So what exactly is the computation that `Conv2d` performs on the data during the forward pass?

In [12]:
(params * data).sum(dim=(1, 2, 3)) + bias

tensor([ 0.2802, -0.0270,  0.2794, -0.4983], grad_fn=<ThAddBackward>)

Woo hoo, we get the same results!

That is, for each of the 4 output layers of the convolution there is a 3x3 matrix of parameters that we multiply element-wise by a 3x3 patch of the input image; we then sum up all elements of that product and add the bias of that layer.

Let's do it again, just for one layer.

In [13]:
params[0]

tensor([[[ 0.3215,  0.0547, -0.0736],
         [-0.1007, -0.1459, -0.0529],
         [ 0.1067,  0.0479, -0.1120]]], grad_fn=<SelectBackward>)

In [14]:
params[0] * data

tensor([[[[ 0.0000,  0.0547, -0.0736],
          [-0.1007, -0.0000, -0.0529],
          [ 0.1067,  0.0479, -0.0000]]]], grad_fn=<ThMulBackward>)

In [15]:
(params[0] * data).sum() + bias[0]

tensor(0.2802, grad_fn=<ThAddBackward>)

Each of 4 layers of the convolution parameters can be seen as a 3x3 patch of an image that we learn to recognize in the input. Right now it is random, but as we learn (i.e. backpropagate the gradient through it), it will morph into something like, say, a vertical line, or a diagonal gradient. Having 4 layers means that our convolution learns to recognize 4 different image patterns. Our original MNIST model has 10 such layers in its first convolution.

Having the bias parameter for each layer allows us to learn image patterns of different intensity.
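
The same equivalence can be checked without copying numbers by hand. The sketch below compares three routes to the same result: the `Conv2d` module, the functional `F.conv2d` call with the module's own `weight` and `bias`, and the explicit multiply-and-sum from above. The seed is arbitrary, so the actual values will differ from the ones printed earlier.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only to make the run repeatable

data = 1 - torch.eye(3).reshape(1, 1, 3, 3)
conv = nn.Conv2d(1, 4, kernel_size=3)

# 1. The module itself.
module_out = conv(data)

# 2. The functional form, using the module's parameters explicitly.
functional_out = F.conv2d(data, conv.weight, conv.bias)

# 3. Element-wise product with each 3x3 filter, summed, plus the bias.
manual_out = (conv.weight * data).sum(dim=(1, 2, 3)) + conv.bias

print(torch.allclose(module_out, functional_out, atol=1e-6))
print(torch.allclose(module_out.flatten(), manual_out, atol=1e-6))
```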

#### 2.1.1 Sliding window

In our toy example above, the input image is the same size as the convolution kernel (3x3). Let's apply our convolution to a larger image now.

In [16]:
data2 = 1 - torch.eye(5, dtype=torch.float32).reshape((1, 1, 5, 5))
data2

tensor([[[[0., 1., 1., 1., 1.],
          [1., 0., 1., 1., 1.],
          [1., 1., 0., 1., 1.],
          [1., 1., 1., 0., 1.],
          [1., 1., 1., 1., 0.]]]])

In [17]:
conv_result2 = conv(data2)
conv_result2

tensor([[[[ 0.2802,  0.3964,  0.2370],
          [ 0.3419,  0.2802,  0.3964],
          [ 0.4173,  0.3419,  0.2802]],

         [[-0.0270, -0.0226,  0.4537],
          [-0.1564, -0.0270, -0.0226],
          [ 0.5121, -0.1564, -0.0270]],

         [[ 0.2794, -0.0770,  0.3151],
          [ 0.5721,  0.2794, -0.0770],
          [ 0.5009,  0.5721,  0.2794]],

         [[-0.4983,  0.2228, -0.4657],
          [-0.4093, -0.4983,  0.2228],
          [-0.4579, -0.4093, -0.4983]]]], grad_fn=<ThnnConv2DBackward>)

Note that the elements at `[0,0]` in each of 4 layers are exactly the same as we had from our 3x3 image:  

In [18]:
conv_result2[:,:,0,0]

tensor([[ 0.2802, -0.0270,  0.2794, -0.4983]], grad_fn=<SelectBackward>)

In [19]:
conv(data)

tensor([[[[ 0.2802]],

         [[-0.0270]],

         [[ 0.2794]],

         [[-0.4983]]]], grad_fn=<ThnnConv2DBackward>)

So the 2d convolution applies the 3x3 sliding window with the same parameters to each 3x3 patch of the input. In other words, on the forward pass it converts a 1x3x3 patch of input into 4x1x1 column output, and builds a new 4-layer tensor out of these elements.

By default, 2d convolution has a **stride** of `1` and **padding** of `0`. That means we move our sliding window by 1 pixel at a time, starting from the `[0,0]` element and never going *outside* of the input, i.e. applying it first to

In [20]:
data2[:, :, 0:3, 0:3]

tensor([[[[0., 1., 1.],
          [1., 0., 1.],
          [1., 1., 0.]]]])

then to

In [21]:
data2[:, :, 0:3, 1:4]

tensor([[[[1., 1., 1.],
          [0., 1., 1.],
          [1., 0., 1.]]]])

and so on, all the way to `data2[:, :, 2:5, 2:5]`.
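
The sliding window can also be written out as two explicit loops. This is only an illustrative sketch (PyTorch's own implementation is far more efficient), and the seed is arbitrary, so the numbers will not match the outputs above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only to make the run repeatable

data2 = 1 - torch.eye(5).reshape(1, 1, 5, 5)
conv = nn.Conv2d(1, 4, kernel_size=3)

k = 3
out_size = 5 - k + 1  # stride 1 and padding 0 give a 3x3 output

manual = torch.zeros(1, 4, out_size, out_size)
with torch.no_grad():
    for i in range(out_size):
        for j in range(out_size):
            # 3x3 patch whose top-left corner is at (i, j)
            patch = data2[:, :, i:i + k, j:j + k]
            # one number per output layer: sum of products plus bias
            manual[0, :, i, j] = (conv.weight * patch).sum(dim=(1, 2, 3)) + conv.bias

    reference = conv(data2)

print(torch.allclose(manual, reference, atol=1e-6))
```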

#### 2.1.2 Aside: Gradient computation

Note that tensors holding the convolution parameters have property `requires_grad=True`.

Likewise, the results of the convolution are tensors with `grad_fn` property set.

This is where the auto-differentiation magic of PyTorch happens. Later in this course we will dedicate a whole section to it (see [Backpropagation](3_mnist_backprop.ipynb)).
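
A minimal sketch of these two properties, using the same toy convolution as above. The `torch.no_grad()` context is included only to show the contrast: inside it, results are not tracked for differentiation.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 4, kernel_size=3)
data = 1 - torch.eye(3).reshape(1, 1, 3, 3)

print(conv.weight.requires_grad)  # parameters are tracked by default
print(data.requires_grad)         # plain input tensors are not

out = conv(data)
print(out.grad_fn is not None)    # the result remembers how it was computed

with torch.no_grad():
    out_no_grad = conv(data)
print(out_no_grad.grad_fn is None)  # no graph is recorded inside no_grad()
```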

### 2.2 Max pooling

Next, we apply max pooling to the convolution results, i.e.

    F.max_pool2d(self.conv1(x), 2)

Here it runs a 2x2 sliding window through the data and picks the largest value from each layer within that window. Note that unlike the convolution, by default max pooling windows do *not* overlap, because the stride defaults to the kernel size.

Let's see how it works on some toy data:

In [76]:
x = torch.linspace(10,60,6).reshape((1,6,1)).repeat(1,1,6) + torch.linspace(1,6,6).reshape((1,1,6))
x

tensor([[[11., 12., 13., 14., 15., 16.],
         [21., 22., 23., 24., 25., 26.],
         [31., 32., 33., 34., 35., 36.],
         [41., 42., 43., 44., 45., 46.],
         [51., 52., 53., 54., 55., 56.],
         [61., 62., 63., 64., 65., 66.]]])

In [80]:
F.max_pool2d(x, 2)

tensor([[[22., 24., 26.],
         [42., 44., 46.],
         [62., 64., 66.]]])

In [82]:
F.max_pool2d(x, 3)

tensor([[[33., 36.],
         [63., 66.]]])

That is, max pooling effectively scales down the input proportionally to the kernel size. 

If the input dimension is not a multiple of the max pooling kernel size, we may lose some information at the edges, e.g.

In [84]:
F.max_pool2d(x, 5)

tensor([[[55.]]])
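
The output sizes above follow a simple rule when the stride equals the kernel size and there is no padding: `(n - kernel_size) // kernel_size + 1`. The sketch below checks that rule on the same 6x6 grid, and also tries the `ceil_mode=True` option of `max_pool2d`, which keeps the partial windows at the edges instead of dropping them. The helper `pooled_size` is a hypothetical name introduced here for illustration.

```python
import torch
import torch.nn.functional as F

# Rebuild the 6x6 grid 11..66 with explicit batch and layer dimensions.
rows = torch.arange(1, 7).reshape(6, 1)
cols = torch.arange(1, 7).reshape(1, 6)
x = (10 * rows + cols).float().reshape(1, 1, 6, 6)


def pooled_size(n, kernel_size):
    """Output length for stride == kernel_size and no padding (illustrative helper)."""
    return (n - kernel_size) // kernel_size + 1


for k in (2, 3, 5):
    out = F.max_pool2d(x, k)
    print(k, tuple(out.shape[-2:]), pooled_size(6, k))

# With ceil_mode=True the leftover row and column form their own (smaller) windows.
ceil_out = F.max_pool2d(x, 5, ceil_mode=True)
print(ceil_out)
```

With `ceil_mode=True` the 5x5 pooling produces a 2x2 result rather than a single value, so the last row and column are no longer ignored.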

Now let's apply it to the multilayer tensor that is the result of our 2d convolution:

In [85]:
conv_result2

tensor([[[[ 0.2802,  0.3964,  0.2370],
          [ 0.3419,  0.2802,  0.3964],
          [ 0.4173,  0.3419,  0.2802]],

         [[-0.0270, -0.0226,  0.4537],
          [-0.1564, -0.0270, -0.0226],
          [ 0.5121, -0.1564, -0.0270]],

         [[ 0.2794, -0.0770,  0.3151],
          [ 0.5721,  0.2794, -0.0770],
          [ 0.5009,  0.5721,  0.2794]],

         [[-0.4983,  0.2228, -0.4657],
          [-0.4093, -0.4983,  0.2228],
          [-0.4579, -0.4093, -0.4983]]]], grad_fn=<ThnnConv2DBackward>)

In [86]:
F.max_pool2d(conv_result2, 3)

tensor([[[[0.4173]],

         [[0.5121]],

         [[0.5721]],

         [[0.2228]]]], grad_fn=<MaxPool2DWithIndicesBackward>)

Recall that each of the 4 layers corresponds to some image pattern represented by the convolution parameters. Max pooling, then, keeps the strongest response to each pattern within every pooling window, so the output tells us roughly where in the input image each pattern occurs.

### 2.3 ReLU

Finally, we apply the Rectified Linear Unit activation function to the output:

    x = F.relu(F.max_pool2d(self.conv1(x), 2))

and that concludes the first layer of our neural network.

The `relu()` function simply clamps the negative elements of the tensor to zero, e.g.

In [24]:
F.relu(torch.tensor([-2, -1, 0, 1, 2]))

tensor([0, 0, 0, 1, 2])

This way, we will not propagate the gradient back to the layers that do not have positive correlation with certain patches of the image.
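
The effect on the gradient can be seen directly on a small tensor. In this sketch the input values are arbitrary; the point is that the gradient of `relu` is 1 where the input is positive and 0 where it is negative, so nothing flows back through the clamped elements.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -1.0, 1.0, 2.0], requires_grad=True)

y = F.relu(x)
y.sum().backward()  # gradient of sum(relu(x)) with respect to x

print(y)       # negative inputs are clamped to 0
print(x.grad)  # 0 for negative inputs, 1 for positive ones
```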

### 2.4 Layer 2

The next layer of our network is very similar to the first one:

    x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
   
Where `conv2` is defined as

    self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
    
So we have the same sequence of `Conv2d`, `max_pool2d`, and `relu`.

To match the output of the first layer, `conv2` has `10` input layers and `20` output ones. That means that at this stage our neural net will learn 20 different 5x5 patterns, each being a mosaic of 5x5 image components from the layer below. We'll try to visualize these parameters later, after we train our network.
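
Inspecting the parameter shapes makes this concrete: each of the 20 output patterns has one 5x5 filter per input layer. A minimal sketch, with a random tensor standing in for the output of the first layer:

```python
import torch
import torch.nn as nn

conv2 = nn.Conv2d(10, 20, kernel_size=5)

print(conv2.weight.shape)  # [output layers, input layers, kernel height, kernel width]
print(conv2.bias.shape)    # one bias per output layer

n_params = sum(p.numel() for p in conv2.parameters())
print(n_params)            # 20 * 10 * 5 * 5 weights + 20 biases

# Random stand-in for the first layer's output: 64 images, 10 layers, 12x12 each.
x = torch.randn(64, 10, 12, 12)
out = conv2(x)
print(out.shape)
```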

### 2.5 2d Dropout

One new component that we encounter in the second layer is **2d dropout**, `conv2_drop`. It is defined as

    self.conv2_drop = nn.Dropout2d()

Let us see what it does.

In [25]:
drop = nn.Dropout2d(p=0.4)
drop

Dropout2d(p=0.4)

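Parameter `p` specifies the probability with which each *layer* of the input tensor is set to zero. The [Deep Learning book](http://www.deeplearningbook.org) suggests *keeping* input units with probability 0.8 and hidden units with probability 0.5; since `p` here is the probability of *dropping*, that corresponds to `p=0.2` for input layers and `p=0.5` for hidden layers of the net.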

For each new *batch*, `Dropout2d` will pick different layers at random.

In [26]:
conv_result2

tensor([[[[ 0.2802,  0.3964,  0.2370],
          [ 0.3419,  0.2802,  0.3964],
          [ 0.4173,  0.3419,  0.2802]],

         [[-0.0270, -0.0226,  0.4537],
          [-0.1564, -0.0270, -0.0226],
          [ 0.5121, -0.1564, -0.0270]],

         [[ 0.2794, -0.0770,  0.3151],
          [ 0.5721,  0.2794, -0.0770],
          [ 0.5009,  0.5721,  0.2794]],

         [[-0.4983,  0.2228, -0.4657],
          [-0.4093, -0.4983,  0.2228],
          [-0.4579, -0.4093, -0.4983]]]], grad_fn=<ThnnConv2DBackward>)

In [27]:
drop_result = drop(conv_result2)

print(drop_result)

tensor([[[[ 0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000],
          [ 0.0000,  0.0000,  0.0000]],

         [[-0.0450, -0.0376,  0.7561],
          [-0.2607, -0.0450, -0.0376],
          [ 0.8534, -0.2607, -0.0450]],

         [[ 0.4656, -0.1283,  0.5251],
          [ 0.9535,  0.4656, -0.1283],
          [ 0.8348,  0.9535,  0.4656]],

         [[-0.0000,  0.0000, -0.0000],
          [-0.0000, -0.0000,  0.0000],
          [-0.0000, -0.0000, -0.0000]]]], grad_fn=<FeatureDropoutBackward>)


Note that `Dropout2d` not only zeroes some (random) layers, but also rescales the remaining non-zero values by `1/(1-p)`:

In [28]:
drop_result / conv_result2

tensor([[[[0.0000, 0.0000, 0.0000],
          [0.0000, 0.0000, 0.0000],
          [0.0000, 0.0000, 0.0000]],

         [[1.6667, 1.6667, 1.6667],
          [1.6667, 1.6667, 1.6667],
          [1.6667, 1.6667, 1.6667]],

         [[1.6667, 1.6667, 1.6667],
          [1.6667, 1.6667, 1.6667],
          [1.6667, 1.6667, 1.6667]],

         [[0.0000, 0.0000, 0.0000],
          [0.0000, 0.0000, 0.0000],
          [0.0000, 0.0000, 0.0000]]]], grad_fn=<DivBackward1>)
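
Dropout is only active in training mode; in evaluation mode the module passes its input through unchanged. That is why the model's `forward()` hands `training=self.training` to `F.dropout`. The sketch below uses a tensor of ones (an arbitrary choice) so that the rescaling factor is easy to read off.

```python
import torch
import torch.nn as nn

drop = nn.Dropout2d(p=0.4)
x = torch.ones(1, 4, 3, 3)

drop.train()                  # training mode: random layers are zeroed
train_out = drop(x)
print(train_out[0, :, 0, 0])  # per layer: either 0 or 1 / (1 - 0.4)

drop.eval()                   # evaluation mode: dropout is a no-op
eval_out = drop(x)
print(torch.equal(eval_out, x))
```

Which layers are zeroed changes from run to run, since no seed is set here.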

#### 2.5.1 Aside: Dropout techniques

Dropout is a popular regularization technique that promotes independence between elements of the neural net. Since the dropout mask changes from one batch to the next, we effectively create an *ensemble* of models that share some of their parameters.

Unlike `Dropout2d`, the regular `Dropout` class zeroes out individual elements at random, e.g.

In [29]:
nn.Dropout(p=0.5)(conv_result2)

tensor([[[[ 0.0000,  0.7929,  0.4740],
          [ 0.6839,  0.5603,  0.0000],
          [ 0.8345,  0.6839,  0.5603]],

         [[-0.0000, -0.0452,  0.9073],
          [-0.3129, -0.0000, -0.0452],
          [ 1.0241, -0.3129, -0.0000]],

         [[ 0.5587, -0.1540,  0.0000],
          [ 0.0000,  0.0000, -0.0000],
          [ 0.0000,  1.1442,  0.0000]],

         [[-0.9965,  0.0000, -0.0000],
          [-0.0000, -0.0000,  0.0000],
          [-0.9158, -0.8186, -0.9965]]]], grad_fn=<DropoutBackward>)

This does not work well for convolutional nets, where each pixel of the input is strongly correlated with its immediate neighbors. For images, zeroing out random neurons will just slow down the learning process. Layer-wise dropout implemented in `Dropout2d` is much more effective in such cases.
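
To see the difference between the two masks side by side, the sketch below counts how many elements survive in each layer. The tensor of ones, its shape, and the seed are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only to make the run repeatable

x = torch.ones(1, 8, 4, 4)  # 8 layers of 16 elements each

# Freshly constructed modules are in training mode, so dropout is active.
elementwise = nn.Dropout(p=0.5)(x)
channelwise = nn.Dropout2d(p=0.5)(x)

# Number of non-zero elements left in each layer (out of 16).
kept_elementwise = (elementwise != 0).sum(dim=(2, 3))
kept_channelwise = (channelwise != 0).sum(dim=(2, 3))

print(kept_elementwise)  # typically a mix of partial counts
print(kept_channelwise)  # every layer is either fully kept (16) or fully dropped (0)
```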

### 2.6 Layer 3

The layer that follows two convolutional layers is defined as:

    self.fc1 = nn.Linear(320, 50)
    # ...
    x = x.view(-1, 320)
    x = F.relu(self.fc1(x))

It is a simple linear layer from 320 to 50 elements followed by a familiar ReLU activation function.

#### 2.6.1 Aside: Dimensions

Where does that dimension of `320` come from? Let's trace how the shape of the data changes as it passes through the network.

Recall that the input data has dimensions of `torch.Size([64, 1, 28, 28])`. That is, a batch of 64 images with one (greyscale) layer 28x28 pixels each.

After the first convolution `Conv2d(1, 10, kernel_size=5)` out comes the tensor of `torch.Size([64, 10, 24, 24])`: stride of 1, padding of 0, and 5x5 kernel produce the result of size `28 - 5 + 1 = 24` in height and width.

2d max pooling with 2x2 kernel shrinks the data to half its size in both width and height to `torch.Size([64, 10, 12, 12])`, as `int(dim / kernel_size) = 24/2 = 12`.

At the next layer, we apply 5x5 convolution with 20 output layers, `Conv2d(10, 20, kernel_size=5)`. It yields `torch.Size([64, 20, 8, 8])`, since `12 - 5 + 1 = 8`.

Finally, 2d max pooling with 2x2 kernel halves the height and width of the input tensor again, making it `torch.Size([64, 20, 4, 4])`. That's a batch of 64 tensors, each having size of 4x4 and with 20 layers, or 320 elements in total (`4 * 4 * 20 = 320`).

Now `x.view(-1, 320)` just flattens this tensor into a vector, preserving the first dimension (i.e. the batch size), and we end up with `torch.Size([64, 320])` that can be fed into our linear layer.
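
The whole trace can be reproduced by pushing a dummy batch through the same sequence of operations and recording the shape after each step. This sketch uses random data and freshly initialized layers; ReLU and dropout are left out because they do not change the shape.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv1 = nn.Conv2d(1, 10, kernel_size=5)
conv2 = nn.Conv2d(10, 20, kernel_size=5)

x = torch.randn(64, 1, 28, 28)  # random stand-in for a batch of MNIST images
shapes = [("input", tuple(x.shape))]

x = conv1(x)
shapes.append(("conv1", tuple(x.shape)))        # 28 - 5 + 1 = 24

x = F.max_pool2d(x, 2)
shapes.append(("max_pool2d", tuple(x.shape)))   # 24 / 2 = 12

x = conv2(x)
shapes.append(("conv2", tuple(x.shape)))        # 12 - 5 + 1 = 8

x = F.max_pool2d(x, 2)
shapes.append(("max_pool2d", tuple(x.shape)))   # 8 / 2 = 4

x = x.view(-1, 320)
shapes.append(("view", tuple(x.shape)))         # 20 * 4 * 4 = 320

for name, shape in shapes:
    print(f"{name:>10}: {shape}")
```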

#### 2.6.2 Linear unit

`Linear` is the most fundamental building block of every neural net. ...