# 💪 Part 3: Building and Training Your Model

In [None]:
# Import relevant libraries
import torch 
import torchvision
from torch import nn 
from torchvision import transforms
from torch.utils import data
import random
import matplotlib.pyplot as plt
import time
from IPython import display
import numpy as np
from helper import helper

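random.seed(2021)        # Seed Python's RNG so our random samples are the same every time we run the code.
torch.manual_seed(2021)  # Seed PyTorch as well, since it drives data shuffling and weight initialisation.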

## ⚗️ The Data Science Pipeline
*This section will be repeated in both Part 2 and Part 3*

> What on earth is data science?! -- George Washington (probably not)

Seriously though, in today's data-rich world, data science has become the new buzzword, the new cool kid on the block. But what exactly is it? Unfortunately, no one can really pin down a [rigorous definition](https://hdsr.mitpress.mit.edu/pub/jhy4g6eg/release/7) of data science. At a high level:

> Data science is the systematic extraction of novel information from data.

Good enough! With this definition, most practitioners can somewhat agree on a pipeline or flow. Here are the steps:
1. Identify your problem (What are you trying to do?)
2. Obtain your data (What resource do we have to work with?)
3. Explore your data (What does our data actually look like?)
4. Prepare your data (How do we clean/wrangle our data to make it ingestible?)
5. Model your data (How do we automate the process of drawing out insights?)
6. Evaluate your model (How good are our predictions?)
7. Deploy your model (How can the wider user base access these insights?)

The 7th step is out of scope for this workshop, but we will be exploring the other steps to varying degrees:
* Steps 1-4 will be explored in Part 2.
* Steps 5-6 will be explored in Parts 3 and 4.


## 🧢 Recap
Let's review what we have done so far:

|Pipeline | Our Problem |
|---| --- |
|1. Identify Your Problem | Classify images of items of clothing |
|2. Obtain Your Data | 70,000 labelled images (10 different types) of clothes |
|3. Explore Your Data | Class distribution perfectly equal across classes |
|4. Prepare Your Data | Split 70,000 into 60,000 train and 10,000 test set |

Note that we didn't have to do much cleaning because the data we have is close to *perfect* in many regards. For further details about the intricacies of this process, this excellent [textbook](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/) provides all the nitty-gritty detail.

We need to re-run the data reading part of the tutorial from the last notebook. Please set your variables accordingly.

In [None]:
# First define the function without running it
def load_data_fashion_mnist(batch_size, n_workers):
    """Download the Fashion-MNIST dataset and then load it into memory."""
    trans = [transforms.ToTensor()]
    trans = transforms.Compose(trans)
    mnist_train = torchvision.datasets.FashionMNIST(root="../data",
                                                    train=True,
                                                    transform=trans,
                                                    download=True)
    mnist_test = torchvision.datasets.FashionMNIST(root="../data",
                                                   train=False,
                                                   transform=trans,
                                                   download=True)
    return (data.DataLoader(mnist_train, batch_size, shuffle=True,
                            num_workers=n_workers),
            data.DataLoader(mnist_test, batch_size, shuffle=False,
                            num_workers=n_workers))

# Then execute the function here
batch_size = 512  # Set to 256 on your own device
n_workers = 0      # Set to 4 on your own device
train_iter, test_iter = load_data_fashion_mnist(batch_size=batch_size, n_workers = n_workers)

## Step 5A: Setting Up Your Model
Our data is fully prepared and we are ready to go! 🚀🚀🚀 A recap on neural networks:
* Building Blocks = Perceptrons
* Many Perceptrons = Multi-Layer Perceptrons (MLP)
* Extension to Images = Convolutional Neural Networks

### 🪟 Convolutional Neural Network
#### ⚠️⚠️⚠️ This part is the most conceptually confusing -- ask lots of questions here! ⚠️⚠️⚠️

* Typical MLPs are weak for image recognition ❄️
* We need some image pre-processing to bolster performance 💪
* We use [convolution](https://en.wikipedia.org/wiki/Convolution#Visual_explanation) (and pooling) to do that, which is effectively a sliding window 🪟 

![](../images/convolution.gif)  
[source](https://commons.wikimedia.org/wiki/File:Convolution_arithmetic_-_Full_padding_no_strides.gif)

### LeNet
The particular convolutional neural network architecture we will use is called LeNet. It was one of the first successful neural network architectures, conceived by Yann LeCun while he worked at Bell Labs. Here is the [original paper](https://www.researchgate.net/publication/2985446_Gradient-Based_Learning_Applied_to_Document_Recognition) if you are interested. In this instance, we will be building a slightly adapted version that the d2l.ai textbook outlines in [this chapter](http://d2l.ai/chapter_convolutional-neural-networks/lenet.html). (The key difference is that we will be dropping the Gaussian activation function in the final layer.)

We will be using the two diagrams below to construct our neural network:

![lenet](../images/lenet.svg)

**Figure 1:** The architecture of LeNet. ([source](http://d2l.ai/chapter_convolutional-neural-networks/lenet.html))

![lenetsimple](../images/lenet-vert.svg)

**Figure 2:** Compact version of the architecture of LeNet. ([source](http://d2l.ai/chapter_convolutional-neural-networks/lenet.html))

LeNet layers summary:

| Layer | Dimensions | Purpose |
| --- | --- | --- |
| Convolution | 2D | Draw out 'patterns' in the image |
| Pooling | 2D | Compress image |
| Flattening | 2D -> 1D | Convert to 1D |
| Dense/Linear | 1D | Typical MLP |

#### 🧅 Convolution Layers
The convolution layer is effectively the 'sliding window' part of the CNN. It takes in four hyperparameters. From Figure 2, we can determine the values for the first convolutional layer.

| Hyperparameter | Description | First Layer Values |
| --- | --- | --- |
| Kernel Size | Size of window | 5 (5x5) | 
| Output Channels | Number of filters/channels | 6 |
| Padding | Size of the '0' ring | 2 |
| Stride | How far to slide the window | 1 |

***If not stated, default Pad is 0 and Stride is 1.***
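
The four hyperparameters above determine how large the output of a layer is. As a rough sketch (the function name `conv_output_size` is made up for illustration and is not part of PyTorch), the usual sliding-window arithmetic for a square input, assuming no dilation, looks like this:

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Output width (or height) of a sliding window over a square input.

    Assumes no dilation. The same arithmetic applies to pooling windows.
    """
    return (input_size + 2 * padding - kernel_size) // stride + 1


# A 10x10 input with a 3x3 window, no padding, stride 1
print(conv_output_size(10, kernel_size=3))             # 8

# Adding a ring of zeros of width 1 keeps the size unchanged
print(conv_output_size(10, kernel_size=3, padding=1))  # 10

# A 2x2 window moved in steps of 2 halves the size
print(conv_output_size(8, kernel_size=2, stride=2))    # 4
```

You can plug in the values from the table to check your answer to the discussion question that follows.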




---
## <font color='#F89536'> **Discussion:** </font>
If the input image is 28x28, what are the output dimensions of the layer? (Hint: With pad-2, the image will be a 32x32 image. How many strides (of 1) can the window move horizontally before reaching the right-hand side?)

---

Since our images are greyscale, there is only a single input channel. In code, this looks like:  
`nn.Conv2d(in_channels = 1, out_channels = 6, kernel_size=5, padding=2)`.

**Note:** If you want to get a more conceptual understanding of what is happening, [this Stanford University course](https://cs231n.github.io/convolutional-networks/) has an animated figure which you can toggle on and off. The 'window' and the 'filter' matrices are combined via a [cross-correlation](http://d2l.ai/chapter_convolutional-neural-networks/conv-layer.html?highlight=cross%20correlation) operation.

#### 🧅 Pooling Layers
All pooling in LeNet is average pooling. It takes in two hyperparameters. From Figure 2, we can determine the values of the first pooling layer.

| Hyperparameter | Description | First Layer Values |
| --- | --- | --- |
| Kernel Size | Size of window | 2 (2x2) | 
| Stride | How far to slide the window | 2 |

Putting this together, the code becomes: `nn.AvgPool2d(kernel_size=2, stride=2)`.
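
To see what average pooling does to the numbers, here is a minimal pure-Python sketch on a tiny 4x4 'image'. The function `avg_pool_2d` is a hypothetical stand-in written for illustration; it is not how PyTorch implements `nn.AvgPool2d`, and it assumes no padding.

```python
def avg_pool_2d(image, kernel_size=2, stride=2):
    """Average pooling over a 2D list of numbers (no padding)."""
    out_h = (len(image) - kernel_size) // stride + 1
    out_w = (len(image[0]) - kernel_size) // stride + 1
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Collect every value inside the current window
            window = [image[i * stride + di][j * stride + dj]
                      for di in range(kernel_size)
                      for dj in range(kernel_size)]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]

pooled = avg_pool_2d(image)
print(pooled)  # [[3.5, 5.5], [11.5, 13.5]]
```

Each 2x2 block is replaced by its mean, so the 4x4 input is compressed to 2x2.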


#### 🧅 Linear/Dense Layer
The first dense (fully-connected) layer is FC(120) in Figure 2. The input is 16 channels of 5x5 images. How many pixels is that in total?  
$$16 \times 5 \times 5 = 400$$

Thus for this layer, the input dimension is 400, and the output is 120. 

Putting this together, the code becomes: `nn.Linear(in_features = 16 * 5 * 5, out_features = 120)`.
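
As a quick sanity check on the arithmetic, the sketch below counts the learnable parameters in a linear layer and in a convolutional layer. The helper names are hypothetical, and the counts assume each layer has a bias term (the PyTorch default).

```python
def linear_param_count(in_features, out_features):
    """One weight per input-output pair, plus one bias per output."""
    return in_features * out_features + out_features


def conv2d_param_count(in_channels, out_channels, kernel_size):
    """One square kernel per (input channel, output channel) pair, plus one bias per output channel."""
    return out_channels * in_channels * kernel_size * kernel_size + out_channels


flattened = 16 * 5 * 5
print(flattened)                           # 400
print(linear_param_count(flattened, 120))  # 48120
print(conv2d_param_count(1, 6, 5))         # 156
```

Notice that the first dense layer has far more parameters than the first convolutional layer, which is part of why convolutions are an efficient way to process images.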


#### 🪢 Tying Loose Ends
* After each convolutional and linear layer (except the last), we add a sigmoid activation function, `nn.Sigmoid()`
* Pooling layers do not need activation functions after them
* The final layer does not use an activation function
* To flatten the 2D feature maps into a 1D vector so they can be fed into an MLP, we use the `nn.Flatten()` layer

---
## <font color='#F89536'> **Your Turn!** </font> 
What are the hyperparameters for the *second* convolutional, pooling, and linear layers? Fill in the `?` below.

In [None]:
# Initialise LeNet Architecture
net = nn.Sequential(nn.Conv2d(in_channels = 1, out_channels = 6, kernel_size=5, padding=2), nn.Sigmoid(),
                    nn.AvgPool2d(kernel_size=2, stride=2),
                    nn.Conv2d(in_channels = ?, out_channels = ?, kernel_size=?), ?,
                    ?, nn.Flatten(),
                    nn.Linear(in_features = 16 * 5 * 5, out_features = 120), ?,
                    ?, ?, 
                    nn.Linear(in_features = 84, out_features = 10))

---

Let's look at the output shape of each layer:

In [None]:
# Show layers
X = torch.rand(size=(1, 1, 28, 28), dtype=torch.float32)
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__, 'output shape:    \t', X.shape)

---
## <font color='#F89536'> **Discussion** </font> 
Does this match the LeNet architecture we outlined at the beginning?

---

## Step 5B: Training Your Model

### Initialising weights
* When you set up your neural network `net`, the initial weight values are complete garbage
* You want to initialise your weights randomly, but in a systematic way (valuable garbage)
* We use the [Xavier Uniform](https://pytorch.org/docs/stable/nn.init.html#) distribution, as outlined in [this paper](https://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf), to do this.

In [None]:
def init_weights(m):
    # Only linear and Conv2d layers have weights to set; pooling layers do not require this
    if isinstance(m, (nn.Linear, nn.Conv2d)):
        nn.init.xavier_uniform_(m.weight)

net.apply(init_weights) # the function apply() takes in another function as input! 
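
For the curious, the 'systematic' part of Xavier uniform initialisation is the range the random values are drawn from: a uniform distribution on $[-a, a]$ with $a = \sqrt{6 / (\text{fan\_in} + \text{fan\_out})}$ when the gain is 1. The sketch below computes that bound by hand; `xavier_uniform_bound` is a name invented for this illustration, and the sampling uses Python's `random` module rather than PyTorch.

```python
import math
import random


def xavier_uniform_bound(fan_in, fan_out, gain=1.0):
    """Half-width of the uniform range used by Xavier (Glorot) initialisation."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))


# Linear layer with 400 inputs and 120 outputs
linear_bound = xavier_uniform_bound(400, 120)

# For a convolution, each fan is the channel count times the kernel area:
# 1 input channel, 6 output channels, 5x5 kernel
conv_bound = xavier_uniform_bound(1 * 5 * 5, 6 * 5 * 5)

print(linear_bound)  # roughly 0.107
print(conv_bound)    # roughly 0.185

# Draw a few example weights for the linear layer
rng = random.Random(0)
sample = [rng.uniform(-linear_bound, linear_bound) for _ in range(5)]
print(sample)
```

Layers with more connections get a narrower range, which helps keep the size of the signal roughly stable as it passes through the network.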

### Setting Hyperparameters
There are certain hyperparameters associated with the training phase we need to set:

| Hyperparameter | Description | Selected Value |
| --- | --- | --- |
| Learning Rate | Size of each weight update. Too large and we might overshoot (*miss*) the optimal weights. Too small and training will take a long time to converge | $0.9$ |
| Optimiser | What algorithm do we use to find the optimal weights? | [Stochastic Gradient Descent (SGD)](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html#torch.optim.SGD) |
| Loss Function | How do we measure the *correctness* of our predictions? | [Cross Entropy Loss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html) |

We define these below:

In [None]:
lr = 0.9 
optimizer = torch.optim.SGD(net.parameters(), lr=lr) 
loss = nn.CrossEntropyLoss() 
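
To build some intuition for the loss function, here is a hand-rolled sketch of cross-entropy for a single example. `nn.CrossEntropyLoss` takes raw scores (logits) and, by default, averages over the mini-batch; the function `cross_entropy` below is a simplified stand-in written for illustration, not PyTorch's implementation.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy for one example: softmax over the raw scores, then negative log of the true class."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[label]


# A model that has no idea: equal scores for all 10 classes
uniform = cross_entropy([0.0] * 10, label=3)
print(uniform)  # ln(10), roughly 2.30

# A model that is confident about class 3
confident_logits = [0.0, 0.0, 0.0, 8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
confident_right = cross_entropy(confident_logits, label=3)
confident_wrong = cross_entropy(confident_logits, label=0)
print(confident_right)  # close to 0
print(confident_wrong)  # large
```

A loss near $\ln(10) \approx 2.30$ is therefore what we would expect from a 10-class model that is effectively guessing.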

### Training the Model

To train the model we must:
1. Set the 'mode' of the network to `train` 🚅
2. Select a single mini-batch to train on ✅
3. Conduct a forward pass to make predictions ➡️
4. Calculate the loss (lack of 'correctness') of these predictions 🧮
5. Calculate the gradients required for back propagation (according to the loss function) 🧮
6. Update weights according to gradient descent ⬅️

The lines below follow these steps in order, with one extra line (`optimizer.zero_grad()`) that resets the gradients before the forward pass.

In [None]:
net.train() # This doesn't actually train, but sets the network to training mode
X, y = next(iter(train_iter)) # Pick a single minibatch at random to do the training
optimizer.zero_grad() # Before running the forward/backward pass we need to reset the gradients (otherwise they accumulate)
y_hat = net(X) # Forward pass on the data to make predictions
l = loss(y_hat, y) # Calculate the loss
l.backward() # Backpropagation: calculate the gradient of the loss for every weight
optimizer.step() # Take one gradient descent step, using those gradients to update the weights
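
To demystify the last two steps, here is a toy sketch of gradient descent on a single weight, with the gradient worked out by hand instead of by `backward()`. The functions `grad_fn`, `sgd_step` and `run_descent` are invented for this illustration; the update rule itself (weight minus learning rate times gradient) is what plain SGD without momentum does.

```python
def grad_fn(w):
    """Gradient of the toy loss (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


def sgd_step(weight, gradient, lr):
    """One plain SGD update: move against the gradient, scaled by the learning rate."""
    return weight - lr * gradient


def run_descent(lr, n_steps, start=0.0):
    w = start
    for _ in range(n_steps):
        w = sgd_step(w, grad_fn(w), lr)
    return w


w_good = run_descent(lr=0.1, n_steps=50)
print(w_good)  # very close to 3

# With a learning rate that is too large for this problem, every step overshoots further
w_bad = run_descent(lr=1.1, n_steps=10)
print(w_bad)   # far away from 3
```

The right learning rate depends on the problem: a value that diverges on this toy loss can be perfectly reasonable for another model.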

### 🎉🎉🎉 Congratulations! You have trained your first neural network in PyTorch 🎉🎉🎉

...well, not quite. This model is going to be quite terrible, since we only trained on a single mini-batch of our dataset. In the next part we will look into scaling this procedure up. But first, let's see how we went.


## Step 6: Evaluate Your Model
It's all well and good if you can train a model, but it's pretty useless if you can't see how well it does. Recall that we want two things from our model:
* Predictions to be correct (Accuracy)
* Model to generalise to unseen data (No Overfitting)

Thus we should extract both the train accuracy (how well the model performs on the dataset it trained on), and the test accuracy (how well the model performs on unseen/independent data).
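
The `helper` module's internals are not shown in this notebook, so the sketch below is only an assumption about what counting correct predictions typically involves: pick the class with the highest score for each example and compare it to the label. The function name `count_correct` and the toy scores are made up for illustration.

```python
def count_correct(scores, labels):
    """Count examples whose highest-scoring class matches the true label."""
    correct = 0
    for row, label in zip(scores, labels):
        predicted = max(range(len(row)), key=row.__getitem__)  # index of the largest score
        if predicted == label:
            correct += 1
    return correct


# Four examples, three classes
scores = [[0.1, 2.0, -1.0],
          [1.5, 0.2, 0.3],
          [0.0, 0.1, 0.9],
          [0.4, 0.3, 0.2]]
labels = [1, 0, 1, 0]

n_correct = count_correct(scores, labels)
print(n_correct)                # 3
print(n_correct / len(labels))  # 0.75
```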

### Training Accuracy
First we calculate the training accuracy.

In [None]:
# Total loss over the mini-batch. We use a new name so the `loss` function defined earlier is not overwritten.
batch_loss = l * X.shape[0]
n_correct = helper.accuracy(y_hat, y)
n_total = X.shape[0] 

print("1. The mini-batch loss is: \t\t\t\t", batch_loss)
print("2. The number of correct training predictions is: \t", n_correct)
print("3. The number of total training predictions is: \t", n_total)

print("This means we get a training accuracy of ", n_correct/n_total)
print("The average loss for each example is ", float(batch_loss/n_total))

### Testing Accuracy
Then we calculate the testing accuracy.

In [None]:
test_accuracy = helper.evaluate_accuracy(net, test_iter)
print("The testing accuracy is: ", test_accuracy)

| Dataset | Accuracy |
| --- | --- |
| Train | ~10% |
| Test | ~10% |

This is no better than random guessing (which would give us a 1 in 10 chance of being correct)! But we are not finished yet. Let's now scale up our training.