This notebook was inspired by neural network & machine learning labs led by [GMUM](https://gmum.net/).

# Training neural networks

There are three necessary ingredients to train a neural network:

* the model,
* the loss,
* the optimizer.

We've already implemented a popular loss function in the first lab. Today we will briefly remind ourselves what the most popular optimizer (SGD) does and implement a custom model in PyTorch.

## Stochastic gradient descent

As a brief recap, *stochastic gradient descent* is an iterative method for optimizing an *objective function* (or *criterion*; when minimizing, also called *cost function*, *loss function*, or *error function*) whose gradients we can calculate.

To minimize the cost function $L(X; \theta)$ for a set of data $X \in \mathbb{R}^{N \times D}$, we first define it as the average cost over all the elements $\mathbf{x}_i \in X$:

$$L(X; \theta) = \frac{1}{N} \sum_i L(\mathbf{x}_i; \theta).$$

We could then compute the gradient of this average cost and take a step in the opposite direction:

$$\theta_{\text{new}} = \theta_{\text{old}} - \alpha \nabla_\theta L(X; \theta_{\text{old}}),$$

with $\alpha$ being the step size. We would then apply this iteratively until convergence.
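
The update rule above can be sketched in a few lines of plain Python. This is only an illustration under assumptions that are not part of the lab: the toy dataset, the squared-distance cost, the starting point, and the step size are all made up for the example.

```python
# Toy data (hypothetical): the average squared distance to these points
# is minimized at their mean, which is 3.0.
data = [1.0, 2.0, 3.0, 6.0]

def loss(theta):
    # L(X; theta) = (1/N) * sum_i (theta - x_i)^2
    return sum((theta - x) ** 2 for x in data) / len(data)

def grad(theta):
    # Gradient of the average cost with respect to theta
    return sum(2.0 * (theta - x) for x in data) / len(data)

theta = 0.0   # initial guess (arbitrary)
alpha = 0.1   # step size (arbitrary)

for step in range(200):
    # theta_new = theta_old - alpha * gradient at theta_old
    theta = theta - alpha * grad(theta)

print(theta, loss(theta))
```

With these settings every step shrinks the distance to the minimizer by a constant factor, so `theta` ends up very close to the mean of the data. A step size that is too large would make the iterates diverge instead.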

![gradient descent](figures/fig4.png)
<center>Source: <a href="https://www.deeplearningbook.org/contents/numerical.html">Chapter 4</a> of the Deep Learning book.</center>

In practice, our dataset could turn out to be enormous. It would be impractical to calculate the loss (and the gradient) for the whole dataset. We usually replace that with the cost function and gradient over a subset of $X$, a so-called *batch* $B \subsetneq X$:

$$L(B; \theta) = \frac{1}{|B|} \sum_{\mathbf{x} \in B} L(\mathbf{x}; \theta).$$

Using SGD instead of full-batch GD also tends to benefit generalization, which we might talk about in the future.

The gradient of the cost calculated on the batch will generally differ from the gradient calculated on the whole dataset, but we can use it as an approximation. Each iteration becomes much cheaper, at the price of noisier steps and therefore a slower convergence rate per iteration:


$$\nabla_\theta L(B; \theta) \approx \nabla_\theta L(X; \theta).$$
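
A small, dependency-free sketch of this approximation follows. The synthetic dataset, the squared-distance cost, the batch size of 32, and the step size are assumptions chosen for illustration only.

```python
import random

random.seed(0)
# Synthetic dataset (hypothetical): 1000 scalars scattered around 3.0
data = [random.gauss(3.0, 1.0) for _ in range(1000)]

def grad(theta, points):
    # Gradient of the average squared distance, computed on the given points
    return sum(2.0 * (theta - x) for x in points) / len(points)

theta0 = 0.0
full_grad = grad(theta0, data)

# Each batch gradient is a noisy estimate of the full gradient...
batch_grads = [grad(theta0, random.sample(data, 32)) for _ in range(500)]
# ...but the estimates average out close to the full gradient.
mean_batch_grad = sum(batch_grads) / len(batch_grads)
print(full_grad, batch_grads[0], mean_batch_grad)

# SGD: the same update rule as GD, but with a fresh random batch per step.
theta = theta0
for _ in range(300):
    batch = random.sample(data, 32)
    theta = theta - 0.1 * grad(theta, batch)

print(theta, sum(data) / len(data))
```

Individual batch gradients fluctuate around the full gradient, so `theta` does not settle exactly on the minimizer but keeps jittering in its neighbourhood; a smaller step size or a larger batch reduces that jitter.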

Next, we will create an actual network for classification on the FashionMNIST dataset in PyTorch. First, however, we need to prepare the data.

## Task 1 (0.5p)
Prepare the data which we'll be using for the next task.
Using [transforms](https://pytorch.org/vision/0.8/transforms.html) (as in the first week's lab), you need to:
- convert the PIL images to tensors,
- calculate the mean and standard deviation of pixels of the training set and use that to normalize the training data (this is new),
- change the shape of each image from 28x28 to 784.
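
To make the normalization and reshaping steps concrete, here is what they do to a tiny made-up "image", written in plain Python rather than with torchvision transforms. The 2x2 pixel values are hypothetical, and using the population standard deviation is an assumption (with tens of millions of pixels the difference from the sample standard deviation is negligible).

```python
from statistics import fmean, pstdev

# Hypothetical 2x2 "image" with pixel intensities already scaled to [0, 1]
image = [[0.0, 0.2],
         [0.4, 1.0]]

# Flatten: a 2x2 grid becomes a vector of length 4 (28x28 would become 784)
flat = [p for row in image for p in row]

# Statistics over all pixels
mean = fmean(flat)
std = pstdev(flat)  # population standard deviation (assumption)

# Normalize: subtract the mean, divide by the standard deviation
normalized = [(p - mean) / std for p in flat]

print(len(flat), fmean(normalized), pstdev(normalized))
```

After normalization the pixels have (approximately) zero mean and unit standard deviation, which is the property the task asks you to obtain for the training set.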

In [None]:
import numpy as np
import torch
from torch.utils.data import DataLoader

from torchvision.datasets import FashionMNIST
from torchvision.transforms import ???

from typing import Tuple

In [None]:
def calculate_mean_and_std() -> Tuple[float, float]:
    loader = torch.utils.data.DataLoader(
        FashionMNIST(
            root='.',
            download=True,
            train=True,
            transform=ToTensor()
        )
    )
    ???
    return mean, std

mean, std = calculate_mean_and_std()

train_data = FashionMNIST(root='.', 
                          download=True, 
                          train=True, 
                          ???)

test_data = FashionMNIST(root='.', 
                         download=True, 
                         train=False, 
                         ???)

The following cell checks whether the mean and std calculation is correct.

In [None]:
assert np.isclose(mean, 0.286, atol=1e-4)
assert np.isclose(std, 0.353, atol=1e-4)

The following cell checks whether the dataloader returns objects of appropriate shape.

In [None]:
train_loader = DataLoader(train_data, batch_size=10)

x, y = next(iter(train_loader))

assert len(x.shape) == 2
assert x.shape == (10, 784)

We can now proceed to building our model.

## Task 2 (1p)
Implement a simple neural network in PyTorch.

The network should accept data of dimension `input_dim` and have one hidden layer of size `hidden_dim`, with weights initialized from the standard normal distribution and biases initialized with zeros. Use `torch.tanh` as the activation function for the first layer. For the second layer use a linear activation function (that is, no nonlinearity). Don't forget to use `requires_grad=True` when defining the parameters of the network.

Next, implement a training loop in PyTorch utilizing the cost function `nn.CrossEntropyLoss` and the SGD optimizer.

If everything was implemented correctly, the network should usually achieve accuracy higher than $80\%$ on the test set (you might need a few runs for this, depending on the initialization).
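
The second layer can stay linear because `nn.CrossEntropyLoss` expects raw, unnormalized scores (logits) and applies the log-softmax itself. The plain-Python sketch below shows the per-example quantity being computed; the helper name `cross_entropy_from_logits` and the example logits are made up for illustration, and the averaging over the batch that PyTorch performs by default is left out.

```python
import math

def cross_entropy_from_logits(logits, target):
    # -log softmax(logits)[target], using the log-sum-exp trick
    # (subtracting the max) for numerical stability
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

logits = [2.0, 0.5, -1.0]  # hypothetical network outputs for 3 classes

loss_if_class_0 = cross_entropy_from_logits(logits, 0)  # highest score is correct
loss_if_class_2 = cross_entropy_from_logits(logits, 2)  # lowest score is correct

# Uninformative logits give the loss of a uniform guess: log(number of classes)
uniform_loss = cross_entropy_from_logits([0.0, 0.0, 0.0], 1)

print(loss_if_class_0, loss_if_class_2, uniform_loss)
```

The loss is small when the correct class has the largest logit and grows as the correct class is scored lower. At the start of training a loss far above $\log 10 \approx 2.3$ on a 10-class problem is not unusual with standard-normal weights, since the initial logits can be large.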

In [None]:
from typing import List

class CustomNetwork(object):
    """
    Simple neural network with one hidden layer (tanh activation)
    """
    def __init__(self, input_dim, hidden_dim, output_dim):
        """
        Initialize the network's weights 
        """
        
        self.weight_1: torch.Tensor = ???
        self.bias_1: torch.Tensor = ???
        
        self.weight_2: torch.Tensor = ???
        self.bias_2: torch.Tensor = ???
        
    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass through the network
        """
        ???
        
    def parameters(self) -> List[torch.Tensor]:
        """
        Returns a list of all trainable parameters 
        """
        return [self.weight_1, self.bias_1, self.weight_2, self.bias_2]

The following cell checks whether the network's weights have appropriate shape.

In [None]:
network = CustomNetwork(100, 32, 54)

assert network.weight_1.shape == (100, 32)
assert network.weight_2.shape == (32, 54)

In [None]:
from torch import nn
from torch.optim import SGD
from torch.nn.functional import cross_entropy

# some hyperparams
batch_size: int = 64
n_epochs: int = 10

# prepare data loaders based on the already loaded datasets
train_loader = DataLoader(train_data, batch_size=batch_size)
test_loader = DataLoader(test_data, batch_size=batch_size)

# initialize the model
model: CustomNetwork = ???

# initialize the optimizer using the hyperparams below
lr: float = 0.01
momentum: float = 0.9
optimizer: torch.optim.Optimizer = SGD(???, lr=lr, momentum=momentum)
    
criterion = nn.CrossEntropyLoss()

# training loop
for e in range(n_epochs):
    for i, (x, y) in enumerate(train_loader):
        # reset the gradients from previous iteration
        optimizer.zero_grad()
        # pass through the network
        output: torch.Tensor = ???
        # calculate loss
        loss: torch.Tensor = criterion(???)
        # backward pass through the network
        loss.backward()
        # apply the gradients
        optimizer.step()
        
        # log the loss value
        if (i + 1) % 100 == 0:
            print(f"\rEpoch {e+1} iter {i+1}/{len(train_loader)} loss: {loss.item()}", end="")
            
    # at the end of an epoch run evaluation on the test set
    with torch.no_grad():
        # initialize the number of correct predictions
        correct: int = 0 
        for i, (x, y) in enumerate(test_loader):
            # pass through the network
            output: torch.Tensor = ???
            correct += ???

        print(f"\nTest accuracy: {correct / len(test_data)}")

        
# this is your test
assert correct / len(test_data) > 0.8, "Subject to random seed you should be able to get >80% accuracy"