# PyTorch MNIST example dissected

In this notebook we'll explore the components of the
[PyTorch MNIST example](https://github.com/pytorch/examples/tree/master/mnist)
one-by-one.

**This is notebook 2 of the multi-part example.** Please see [Part 1 here](1_mnist_load.ipynb).

## 2. Building the model

In this part we'll explore the components of `torch.nn`.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

We'll start by copying the model from the original [PyTorch MNIST example](https://github.com/pytorch/examples/tree/master/mnist) verbatim:

In [2]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

Note that a class derived from `nn.Module` only has to implement the `.forward()` method; the backward pass is taken care of automagically by the auto-differentiation framework. Pretty neat.
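To make the point concrete, here is a minimal sketch with a toy module. `TinyNet` is a hypothetical name used only for illustration and is not part of the MNIST example. It defines nothing but `.forward()`, yet its parameters get registered and gradients are computed for us.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):  # hypothetical toy model, not part of the MNIST example
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)


tiny = TinyNet()
x = torch.randn(3, 4)  # a batch of 3 samples with 4 features each

# Calling the module runs .forward() under the hood
out = tiny(x)
print(out.size())

# Layers assigned as attributes are registered as parameters automatically
param_names = [name for name, _ in tiny.named_parameters()]
print(param_names)

# Autograd derives the backward pass; we never write it ourselves
out.sum().backward()
print(tiny.fc.weight.grad.size())
```

The same mechanism is what makes `Net` above trainable without any hand-written gradient code.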

The original `x` that comes into the `.forward()` method is the **batch** of features generated by the `DataLoader` (see [Part 1](1_mnist_load.ipynb#1.4-Dataset-and-DataLoader) for details). That is, for the MNIST dataset and a batch size of 64, the dimensions of the input `x` will be

    torch.Size([64, 1, 28, 28])
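It can help to follow that shape through the network, for instance to see where the `320` in `x.view(-1, 320)` comes from. The sketch below recreates only the two convolutions with the same sizes as in `Net` and feeds them a random tensor standing in for an MNIST batch. The weights are random, so only the sizes are meaningful.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Same layer sizes as in Net above
conv1 = nn.Conv2d(1, 10, kernel_size=5)
conv2 = nn.Conv2d(10, 20, kernel_size=5)

x = torch.randn(64, 1, 28, 28)      # random stand-in for one MNIST batch
shapes = [tuple(x.size())]

x = conv1(x)                        # 28 - 5 + 1 = 24
shapes.append(tuple(x.size()))
x = F.max_pool2d(x, 2)              # 24 / 2 = 12
shapes.append(tuple(x.size()))
x = conv2(x)                        # 12 - 5 + 1 = 8
shapes.append(tuple(x.size()))
x = F.max_pool2d(x, 2)              # 8 / 2 = 4
shapes.append(tuple(x.size()))
x = x.view(-1, 320)                 # 20 layers * 4 * 4 = 320 features per image
shapes.append(tuple(x.size()))

for s in shapes:
    print(s)
```

ReLU and dropout are left out here because they do not change tensor sizes.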

### 2.1 Conv2d

First, we feed the input batch through the 2D convolution with a 5x5 kernel and `10` output layers (PyTorch calls them *channels*). Remember that our input has 1 layer, because it is a greyscale image; hence the `1` in the `Conv2d()` constructor parameters.

Below we will create an instance of `Conv2d` and see what it actually does to the data. We start with a tiny 3x3 tensor and a 3x3 convolution with 4 output layers:

In [20]:
data = torch.ones((1, 1, 3, 3), dtype=torch.float32)
data

tensor([[[[1., 1., 1.],
          [1., 1., 1.],
          [1., 1., 1.]]]])

Note that NN layers operate on *batches*, hence the input data dimensions of [batch size 1 x 1 color layer x height 3 x width 3]:

In [21]:
data.size()

torch.Size([1, 1, 3, 3])

In [22]:
conv = nn.Conv2d(1, 4, kernel_size=3)
conv

Conv2d(1, 4, kernel_size=(3, 3), stride=(1, 1))

In [23]:
result = conv(data)
result

tensor([[[[-0.8066]],

         [[-0.0852]],

         [[-0.0678]],

         [[-0.4776]]]], grad_fn=<ThnnConv2DBackward>)

In [19]:
result.size()

torch.Size([1, 4, 1, 1])

So our convolution converts a 1x3x3 patch of (greyscale) image into a 4x1x1 column of weird numbers. Where do those numbers come from?

Let's take a look at the `Conv2d` internals.

In [41]:
(params, bias) = list(conv.parameters())

In [43]:
params

Parameter containing:
tensor([[[[-0.0505, -0.2281,  0.1797],
          [-0.2954, -0.1113, -0.1966],
          [-0.1436,  0.0263, -0.1849]]],


        [[[ 0.2418, -0.0305,  0.0494],
          [ 0.1427, -0.2922, -0.1214],
          [-0.3194,  0.2209, -0.2240]]],


        [[[-0.1038, -0.0043, -0.1872],
          [ 0.2374, -0.2073,  0.2146],
          [-0.1888,  0.2902,  0.0201]]],


        [[[ 0.1617, -0.1213, -0.1789],
          [-0.1230, -0.0075, -0.0797],
          [-0.3140,  0.1756,  0.0823]]]], requires_grad=True)

In [44]:
bias

Parameter containing:
tensor([ 0.1980,  0.2475, -0.1387, -0.0727], requires_grad=True)

These are the randomly initialized parameters of our convolution: the first one is the weight tensor, the second one is a bias vector. (It is possible to have a convolution *without* a bias: pass `bias=False` to the constructor.)
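As a quick check of the `bias=False` variant, the sketch below builds a second convolution of the same size. The name `conv_nobias` is made up for this illustration.

```python
import torch.nn as nn

# Same sizes as the convolution above, but without the bias vector
conv_nobias = nn.Conv2d(1, 4, kernel_size=3, bias=False)

# Only the weight tensor is left among the parameters
n_params = len(list(conv_nobias.parameters()))
print(n_params)

# The bias attribute still exists, but it is None
print(conv_nobias.bias)

# The weights have the same shape as before
print(conv_nobias.weight.size())
```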

So what exactly is the computation that `Conv2d` performs on the data during the forward pass?

In [34]:
(params * data).sum(dim=(1, 2, 3)) + bias

tensor([-0.8066, -0.0852, -0.0678, -0.4776], grad_fn=<ThAddBackward>)

Woo hoo, we get the same results!

That is, for each of the 4 output layers of the convolution there is a 3x3 matrix of parameters that we multiply element-wise by a 3x3 patch of the input image; then we sum up all elements of that product and add the bias of that layer.
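When the input is larger than the kernel, the same multiply-and-sum is repeated for every 3x3 patch as the kernel slides over the image. The following sketch assumes a 4x4 input (not used elsewhere in this notebook), which gives a 2x2 output per layer, and compares `Conv2d` against an explicit loop over patches.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 4, kernel_size=3)

# A 4x4 "image" with distinct values, so that every patch is different
data = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)
result = conv(data)
print(result.size())  # 4 - 3 + 1 = 2 positions along each axis

# Recompute the same output by hand, one 3x3 patch at a time
manual = torch.empty(1, 4, 2, 2)
with torch.no_grad():
    for i in range(2):
        for j in range(2):
            patch = data[:, :, i:i + 3, j:j + 3]
            manual[0, :, i, j] = (conv.weight * patch).sum(dim=(1, 2, 3)) + conv.bias

same = torch.allclose(result.detach(), manual, atol=1e-4)
print(same)
```

The loop is only for illustration; the built-in layer does the same arithmetic far more efficiently.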

Let's do it again, just for one layer.

In [36]:
params[0]

tensor([[[-0.0505, -0.2281,  0.1797],
         [-0.2954, -0.1113, -0.1966],
         [-0.1436,  0.0263, -0.1849]]], grad_fn=<SelectBackward>)

In [37]:
params[0] * data

tensor([[[[-0.0505, -0.2281,  0.1797],
          [-0.2954, -0.1113, -0.1966],
          [-0.1436,  0.0263, -0.1849]]]], grad_fn=<ThMulBackward>)

In [39]:
(params[0] * data).sum() + bias[0]

tensor(-0.8066, grad_fn=<ThAddBackward>)

So each of the 4 layers of the convolution parameters can be seen as a 3x3 patch of an image that we learn to recognize in the input. Right now it is random, but as we learn (i.e. backpropagate the gradient through it), it will morph into something like, say, a vertical line, or a diagonal gradient. Having 4 layers means that our convolution learns to recognize 4 different image patterns. Our original MNIST model has 10 such layers in its first convolution.
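To get a feel for what a learned pattern could look like, we can set the weights by hand instead of training them. The kernel below is a hand-picked vertical-line detector chosen for illustration; it is an assumption of this sketch, not something the MNIST model is guaranteed to learn.

```python
import torch
import torch.nn as nn

# One output layer, no bias, so the response depends on the kernel alone
conv = nn.Conv2d(1, 1, kernel_size=3, bias=False)

# Overwrite the random weights with a hand-made vertical-line pattern
with torch.no_grad():
    conv.weight.copy_(torch.tensor([[[[-1., 2., -1.],
                                      [-1., 2., -1.],
                                      [-1., 2., -1.]]]]))

vertical = torch.tensor([[[[0., 1., 0.],
                           [0., 1., 0.],
                           [0., 1., 0.]]]])
horizontal = torch.tensor([[[[0., 0., 0.],
                             [1., 1., 1.],
                             [0., 0., 0.]]]])

v = conv(vertical).item()    # strong response: 2 + 2 + 2 = 6
h = conv(horizontal).item()  # no response: -1 + 2 - 1 = 0
print(v, h)
```

The kernel responds strongly to the patch that matches its pattern and not at all to the horizontal line.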

Each layer also has its own bias, a learnable constant added to the weighted sum. It lets each layer shift its response up or down, which allows us to learn image patterns of different intensity.

#### 2.1.1 Gradient computation

Note that the tensors holding the convolution parameters have the property `requires_grad=True`.

Likewise, the results of the convolution are tensors with the `grad_fn` property set.

This is where the auto-differentiation magic of PyTorch happens. Later in this course we will dedicate a whole section to it (see [Backpropagation](3_mnist_backprop.ipynb)).
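As a small preview, the sketch below rebuilds the 3x3 example from above and calls `.backward()` on the sum of the outputs. Summing the outputs is just a convenient way to get a scalar here, not the loss used for MNIST. Because the input is all ones, the gradient with respect to every weight is 1.

```python
import torch
import torch.nn as nn

data = torch.ones((1, 1, 3, 3), dtype=torch.float32)
conv = nn.Conv2d(1, 4, kernel_size=3)
result = conv(data)

# No gradients are stored until we ask for them
grad_before = conv.weight.grad
print(grad_before)

# Backpropagate from a scalar: the sum of all 4 outputs
result.sum().backward()

# d(sum)/d(weight) equals the input patch, which is all ones here
print(conv.weight.grad.size())
print(conv.bias.grad)
```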