# Your first neural network

In this lab session we aim to accomplish the following:

- Show you the basics of PyTorch
- Introduce the implementation of a multilayer perceptron/fully connected deep network
- Train the network on an image dataset

### Contents
1. [CIFAR-10 Dataset - A 10-way classification problem](#CIFAR-10-Dataset)
2. [Colaboratory notebooks](#Colaboratory-notebooks)
3. [Intro to PyTorch](#PyTorch-Intro)
4. [Your first DNN](#Building-Your-First-Fully-Connected-Network)
5. [Porting the network to a script and running on BC4](#Porting-Your-Network-To-BC4)
6. [Logging metrics with TensorBoard](#Logging-Performance-Metrics)
7. [Optional Extension: Implementing library functions](#Optional-Extension:-Implementing-library-functions)


## CIFAR-10 Dataset

Our network is designed to operate on images from CIFAR-10, a dataset containing 60,000 RGB images, each 32 $\times$ 32 in resolution, split into 50,000 images for training and 10,000 images for testing. 

There are 10 classes with 6,000 examples per class. Some examples of each class can be seen in the diagram below:

<img alt="CIFAR-10 examples" src="./media/cifar10.png" style="max-height: 500px;" />

--- 
## PyTorch Intro

In this section we will introduce you to PyTorch in a hands-on way. You can check that you get the answers listed in the comments at the bottom of each cell in the notebook.

PyTorch is a lot like NumPy: it has a notion of an N-dimensional array, but it is called a tensor instead. Tensors generalise scalars, vectors, and matrices. A scalar is a 0D tensor, a vector is a 1D tensor, and a matrix is a 2D tensor.

In [None]:
import torch
import numpy as np

from pathlib import Path
import torchvision.datasets
from torchvision import transforms


array_np = np.array([[1, 2, 3],
                     [4, 5, 6]])
array_pytorch = torch.tensor([[1, 2, 3],
                              [4, 5, 6]])
print(array_np)
print(array_pytorch)

# OUTPUT
# [[1 2 3]
#  [4 5 6]]
# tensor([[1, 2, 3],
#         [4, 5, 6]])

Now we'll demonstrate some of the operations defined on tensors. Check out the docs for [`torch.Tensor`](https://pytorch.org/docs/stable/tensors.html) and [`torch`](https://pytorch.org/docs/stable/torch.html) for details. (You can click on each to find out more information)

In [None]:
x = torch.tensor([1, 2, 3], dtype=torch.float32)
y = torch.tensor([4, 5, 6], dtype=torch.float32)

In [None]:
x.shape  # what are the dimensions of the tensor?

# OUTPUT
# torch.Size([3])

In [None]:
x.dim()  # how many dimensions does the tensor have?

# OUTPUT
# 1

In [None]:
x + y # + performs elementwise addition

# OUTPUT
# tensor([5., 7., 9.])


In [None]:
x * y  # * performs elementwise multiplication 

# OUTPUT
# tensor([ 4., 10., 18.])

In [None]:
x / y  # / performs elementwise division

# OUTPUT
# tensor([0.2500, 0.4000, 0.5000])

In [None]:
torch.dot(x, y)  # compute the scalar product of two vectors

# OUTPUT
# tensor(32.)

In [None]:
torch.dot(x, y).dim() # this is a 0D tensor or a scalar

# OUTPUT
# 0

In [None]:
# Note how `torch.dot(x, y)` produces a scalar, yet this is still wrapped in a tensor.
# To get the raw scalar value, call .item() on the tensor. The wrapping is there so
# PyTorch can track the operations applied even to scalar values.
torch.dot(x, y).item()

# OUTPUT
# 32.0

## Time For Educated Guesses!
In this next section, see if you can work out what each cell is doing and what output you expect before running it.

In [None]:
x.mean()

In [None]:
x.std()

In [None]:
x.max()

In [None]:
x.argmax()  # the index of the maximum element of the tensor

In [None]:
y = x = torch.arange(0, 9).reshape((3, 3))
x

In [None]:
x.shape

In [None]:
x.dim()

In [None]:
x @ y  # Matrix multiplication

In [None]:
x = torch.randn((2, 3, 1, 10, 10))
x.shape

## More Advanced PyTorch Operations

Below are some more advanced operations that you will often see in PyTorch code and that form important parts of networks.

Once again, look at the comment and think about what the code will do before you run it.

In [None]:
x.squeeze().shape  # squeeze removes dimensions of size one.

In [None]:
x.unsqueeze(dim=4).shape  # unsqueeze adds a new dimension of size one at dimensional index `dim`.

In [None]:
x = torch.arange(0, 100).reshape((2, 5, 10))
x

In [None]:
x.reshape((10, 10))  # We can squash any number of dimensions into one using reshape

In [None]:
# Reshaping doesn't create new data here, it produces a view on the existing data
# (this holds for contiguous tensors like x). If you update that data, you'll
# update the old tensors it came from.

y = x.reshape((10, 10))
y[0, 0] = 100
print(y)
print(x)  # notice that the data in x has changed too!
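
The view behaviour above depends on how the tensor is laid out in memory. The following is a minimal sketch, using made-up tensors (`base`, `view`, `transposed`, and `copy` are illustrative names, not part of the lab), of one case where `reshape` shares memory and one where it has to copy:

```python
import torch

base = torch.arange(6).reshape(2, 3)

# Contiguous tensor: reshape returns a view that shares storage with `base`.
view = base.reshape(3, 2)
view[0, 0] = 100
shares_memory = view.data_ptr() == base.data_ptr()

# Transposed tensor: the elements are no longer stored row by row, so
# flattening cannot be expressed as a view and reshape copies the data.
transposed = base.t()
copy = transposed.reshape(6)
copy[0] = -1

print(shares_memory)        # True
print(base[0, 0].item())    # 100: the write to `copy` did not reach `base`
```

If your code relies on sharing memory, `.view()` is the stricter choice: it raises an error instead of silently copying.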

We've really only scratched the surface of the functionality provided by PyTorch. Before implementing something yourself, it's always worth scanning through the documentation to check if PyTorch already provides an implementation.

--- 
## Building Your First Fully Connected Network

Now that you've got to know the basics of PyTorch, we can implement a 2-layer fully connected network (a.k.a. multilayer perceptron) and train it using gradient descent.

First we need to load in our data. You can use the following code that downloads and handles loading the dataset.

In [None]:
default_dataset_dir = Path.home() / ".cache" / "torch" / "datasets"
batch_size = 256
transform = transforms.ToTensor()

default_dataset_dir.mkdir(parents=True, exist_ok=True)

* `default_dataset_dir` is where we will save the CIFAR-10 dataset. Feel free to save it wherever you want.
* `batch_size` is one of the more important hyperparameters (parameters that we set ourselves rather than learn). It is the number of examples the network sees in each iteration of the training loop.
* `transform` is not something to worry about at this point. It's a function that gets applied to each example, in this case turning each image into a tensor.

Finally, we make the dataset directory with the `mkdir` method.

In [None]:
train_dataset = torchvision.datasets.CIFAR10(
    default_dataset_dir, train=True, download=True, transform=transform
)
test_dataset = torchvision.datasets.CIFAR10(
    default_dataset_dir, train=False, download=False, transform=transform
)
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    shuffle=True,
    batch_size=batch_size,
    pin_memory=True,
    num_workers=1,
)
test_loader = torch.utils.data.DataLoader(
    test_dataset,
    shuffle=False,
    batch_size=batch_size,
    num_workers=1,
    pin_memory=True,
)


OK, that's a lot of code, what is it all doing?

`train_dataset` and `test_dataset` are objects that load our training set and our testing set. `torchvision` provides the `CIFAR10` class so that researchers can easily download and use the dataset.

The loaders, i.e. `train_loader` and `test_loader`, are essentially wrappers around the datasets that let us iterate over them in batches in our training/testing loops (you will see this later). They include some interesting parameters:
* shuffle: whether to shuffle the items in the dataset at the start of each epoch. This is important for training, as otherwise the model sees the examples in the same order every time and could overfit to that ordering!
* batch_size: same as above, how many items we see per iteration of the training loop.

Don't worry too much about the other parameters at this stage.
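
To see what a loader yields without downloading anything, here is a small sketch on random stand-in data. The names `fake_images`, `fake_labels`, `fake_dataset`, and `fake_loader` are hypothetical, and the 1,000-example size is chosen only to keep it quick:

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for CIFAR-10: 1,000 random "images" with random labels.
fake_images = torch.rand(1000, 3, 32, 32)
fake_labels = torch.randint(0, 10, (1000,))
fake_dataset = TensorDataset(fake_images, fake_labels)

fake_loader = DataLoader(fake_dataset, batch_size=256, shuffle=True)

# Iterating over the loader yields (images, labels) pairs, one batch at a time.
batch_shapes = [tuple(images.shape) for images, _ in fake_loader]

print(len(fake_loader))    # 4 batches, i.e. ceil(1000 / 256)
print(batch_shapes[0])     # (256, 3, 32, 32)
print(batch_shapes[-1])    # (232, 3, 32, 32): the final batch is smaller
```

With CIFAR-10's 50,000 training images and a batch size of 256, the same arithmetic gives 196 batches per pass over the training set.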

## A Fully Connected Layer

Now we need to create a fully connected layer that takes an input $x$, and trainable weights $W$ and biases $b$ and computes

$$Wx + b$$

PyTorch has a library of common layer types including a fully connected layer, its class name is `Linear` as the layer produces a linear transformation of the input data.
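
As a quick check of what `Linear` computes, the sketch below compares the layer's output with the formula written out by hand. The sizes (4 inputs, 2 outputs, a batch of 3) are arbitrary choices for illustration:

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(in_features=4, out_features=2)
x = torch.randn(3, 4)  # a batch of 3 examples with 4 features each

# nn.Linear stores W with shape (out_features, in_features) and applies it to
# every row of the batch, which in matrix form is x @ W^T + b.
manual = x @ layer.weight.T + layer.bias

print(layer.weight.shape)                            # torch.Size([2, 4])
print(layer.bias.shape)                              # torch.Size([2])
print(torch.allclose(layer(x), manual, atol=1e-6))   # True
```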

We have a single fully connected layer, but we want to stack these to produce a neural network composed of two layers (a.k.a Multi-layer Perceptron or MLP):

* Input size: 3,072 features
* Hidden layer size: 2,000 units
* Output size: 10 classes

We need to put a non-linear function in between these two layers, as otherwise the composition of the two layers is itself a single linear transformation, with no more representational capacity than a linear classifier. We want to produce non-linear decision boundaries as these will better fit our data.
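
The claim about stacked linear layers can be checked numerically. This is a small sketch with arbitrary layer sizes (`l1`, `l2`, and the other names are local to the example, not the lab's model):

```python
import torch
from torch import nn

torch.manual_seed(0)
l1 = nn.Linear(4, 5)
l2 = nn.Linear(5, 3)
x = torch.randn(8, 4)

# Without an activation: W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2),
# so the two layers collapse into one linear layer.
combined_weight = l2.weight @ l1.weight
combined_bias = l2.weight @ l1.bias + l2.bias
stacked = l2(l1(x))
single = x @ combined_weight.T + combined_bias
print(torch.allclose(stacked, single, atol=1e-5))    # True

# With a ReLU in between, the output no longer matches that single linear map.
with_relu = l2(torch.relu(l1(x)))
print(torch.allclose(with_relu, single, atol=1e-5))  # False
```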

Now we can define an MLP class that brings together 2 fully connected layers with a ReLU on the output of the first layer. We have done the first layer for you:

In [None]:
from torch import nn, optim
from torch.nn import functional as F
from typing import Callable


class MLP(nn.Module):
    def __init__(self,
                 input_size: int,
                 hidden_layer_size: int,
                 output_size: int,
                 activation_fn: Callable[[torch.Tensor], torch.Tensor] = F.relu):
        super().__init__()
        self.l1 = nn.Linear(input_size, hidden_layer_size)
        self.l2 = None #Fill this in yourself
        self.activation_fn = activation_fn
        
    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        x = self.l1(inputs)
        x = self.activation_fn(x)
        x = None #Fill this in yourself
        return x

Let's deconstruct the signature of the `forward` method

```python
def forward(self, inputs: torch.Tensor) -> torch.Tensor:
    ...
```

`inputs: torch.Tensor` says that the parameter `inputs` is of type `torch.Tensor`. The return type of the method is denoted by `-> torch.Tensor` stating that the method returns a tensor.

We use a generic type `Callable` which defines the type of a function, it has the format `Callable[[args], return_type]`. `activation_fn: Callable[[torch.Tensor], torch.Tensor]` means that `activation_fn` should be a function that takes in a single argument of type `torch.Tensor` and returns a `torch.Tensor`. We've also defined the default value of this parameter to be `F.relu` which is the functional implementation of a rectified linear unit in PyTorch.
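
To make the `Callable` annotation concrete, here is a minimal sketch with a hypothetical helper, `activate`, that accepts any function from tensor to tensor. It is not part of the `MLP` class, only an illustration of the type:

```python
from typing import Callable

import torch
from torch.nn import functional as F


def activate(x: torch.Tensor,
             activation_fn: Callable[[torch.Tensor], torch.Tensor] = F.relu) -> torch.Tensor:
    # Hypothetical helper: applies whichever activation function it is given.
    return activation_fn(x)


x = torch.tensor([-1.0, 0.0, 2.0])
relu_out = activate(x)                                   # default: tensor([0., 0., 2.])
tanh_out = activate(x, torch.tanh)                       # a different built-in function
leaky_out = activate(x, lambda t: F.leaky_relu(t, 0.1))  # a lambda fits the type too
print(relu_out, tanh_out, leaky_out)
```

The same idea applies to `MLP`: passing a different `activation_fn` swaps the non-linearity without changing the class.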

Now we can instantiate the MLP class for our problem:

In [None]:
feature_count = 3072
hidden_layer_size = 2000
class_count = 10
model = MLP(feature_count, hidden_layer_size, class_count)

Let's have a think about these sizes. The feature count is the input size: our images are 32x32 with 3 channels (red, green, and blue), so we have 32 $\times$ 32 $\times$ 3 = 3,072 inputs.
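
The arithmetic can be confirmed on a random stand-in batch (`fake_batch` is a hypothetical tensor with the same shape layout as a batch of CIFAR-10 images):

```python
import torch

# Hypothetical batch of 4 images: (batch, channels, height, width).
fake_batch = torch.rand(4, 3, 32, 32)

# flatten(1) keeps dimension 0 (the batch) and merges everything after it.
flat = fake_batch.flatten(1)

print(flat.shape)                                     # torch.Size([4, 3072])
print(torch.equal(flat, fake_batch.reshape(4, -1)))   # True: reshape gives the same result
```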

The hidden layer size can be whatever we wish. We will start with 2,000, but it's something you can change later.

Finally, the class count, which corresponds to our output, needs to be the same size as the number of classes in our dataset, which is 10 in this case.

*What would happen if we increase the number of classes in our dataset?*

## Training Loop

We'll now look into the training loop. This is a two-stage process:

1. First, we pass through examples from our training set to get a prediction and compare this with what it *should* be, i.e. the ground truth
2. Then, we update the network based on the difference between the prediction and the ground truth.

We'll define a **loss function**, which is what we use to compare our predicted values against the ground truth.

In [None]:
loss_function = torch.nn.CrossEntropyLoss()
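
To see what `CrossEntropyLoss` does with raw logits, the sketch below recomputes it by hand on two made-up examples with three classes: take the log-softmax of the logits, pick out the log-probability of the correct class, negate, and average over the batch. The logits and labels are invented for illustration:

```python
import torch
from torch import nn
from torch.nn import functional as F

# Two made-up examples with three classes each.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
labels = torch.tensor([0, 2])

# Log-probabilities per class, then the entry for the correct class of each row.
log_probs = F.log_softmax(logits, dim=1)
manual_loss = -log_probs[torch.arange(2), labels].mean()

library_loss = nn.CrossEntropyLoss()(logits, labels)
print(torch.allclose(manual_loss, library_loss))  # True
```

Note that the loss takes logits directly, so the model should not apply a softmax to its output first.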

In [None]:
for batch, labels in train_loader: #Get the inputs in a batch and the corresponding labels
    
    batch = batch.flatten(1) #This keeps the batch dimension and flattens each 3x32x32 image into a vector of size 3072
    
    logits = model(batch) #Get the output from the model
    
    loss = loss_function(logits, labels) #Calculate our loss, a value of how good (or bad) our network is doing
    break

In [None]:
print(loss)

Look at the loss: it is a tensor, but it also carries a `grad_fn`. This records the operation that produced the value, which is what lets PyTorch work backwards and compute how each weight should change to reduce the loss.
To actually update the weights we use an optimiser, which we define below:

In [None]:
optimiser = optim.SGD(model.parameters(), lr=0.05)

We now get introduced to a new hyperparameter, the learning rate (lr), which is one of the most important hyperparameters. If you want to change one hyperparameter, it will be this one!

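With the optimiser defined, we can now do a backward pass, which computes the gradient of the loss with respect to every weight, and then take an optimiser step, which updates the weights of the model!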

In [None]:
loss.backward()
optimiser.step()
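
To see what these two calls do, here is a toy sketch with a hypothetical two-parameter "model" (`w`) and a made-up loss. It assumes plain SGD with no momentum, as in the optimiser above, for which one step is `w - lr * gradient`:

```python
import torch
from torch import optim

# Hypothetical two-parameter "model" and a toy loss, for illustration only.
w = torch.tensor([1.0, -2.0], requires_grad=True)
toy_optimiser = optim.SGD([w], lr=0.1)

toy_loss = (w ** 2).sum()             # the gradient of this loss is 2 * w
print(toy_loss.grad_fn is not None)   # True: the loss remembers how it was computed

toy_loss.backward()                   # fills in w.grad, does not change w itself
gradient = w.grad.clone()             # tensor([ 2., -4.])
expected = w.detach() - 0.1 * gradient    # w - lr * gradient

toy_optimiser.step()                  # applies the update to w in place
print(torch.allclose(w.detach(), expected))  # True
```

A larger learning rate takes a bigger step along the same gradient, which is why it has such a strong effect on training.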

We can now compare the loss before the update and the loss now, hopefully, this value should be lower!

In [None]:
print(loss)
print(loss_function(model(batch), labels))
#The first value was our original loss, the second is the new one now that the model has been trained once!
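
One detail matters as soon as you take more than one step: PyTorch adds new gradients to any that are already stored, so they need resetting between steps. The sketch below shows this on a hypothetical single parameter (`w` and `toy_optimiser` are illustrative names):

```python
import torch
from torch import optim

w = torch.tensor([1.0], requires_grad=True)
toy_optimiser = optim.SGD([w], lr=0.1)

(3 * w).sum().backward()
first = w.grad.clone()           # tensor([3.])

(3 * w).sum().backward()         # no reset: the new gradient is added to the old one
accumulated = w.grad.clone()     # tensor([6.])

toy_optimiser.zero_grad()        # clear the stored gradients
(3 * w).sum().backward()
after_reset = w.grad.clone()     # tensor([3.]) again

print(first, accumulated, after_reset)
```

In a training loop, calling `zero_grad()` on the optimiser before each `backward()` keeps every step based on the current batch only.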

## The full training Loop

We want to do more than a single training step, so now we can put everything together.
Using the code cells above, finish the training loop below and see if you can train the model.

In [None]:
model = MLP(feature_count, hidden_layer_size, class_count)
optimiser = optim.SGD(model.parameters(), lr=0.01)

for i in range(10):
    print(f'Epoch {i+1} out of 10')
    for batch, labels in train_loader:
        batch_size = batch.shape[0]
        batch = batch.reshape(batch_size, -1)
        
        #Get the output logits from the model
 
        #Calculate the loss
 
        #Reset the gradients left over from the previous step (optimiser.zero_grad()), then calculate the backward values of the loss
 
        #Take a step of the optimiser
 