In [None]:
import torch
import matplotlib.pyplot as plt
import numpy as np

In this notebook we will create our model to classify the FashionMNIST dataset. We will implement a Multilayer Perceptron (MLP) with one hidden layer.

# Model

## Small introduction

In this lab we will implement the MLP from scratch. Recall how an input $X \in \mathbb{R}^{n \times d}$ with $n$ samples and $d$ dimensions moves forward in the network. The output from the first layer is

- $H = \sigma(XW_1 + b_1)$

where $W_1 \in \mathbb{R}^{d \times h}$ are the weights of the first layer and $b_1 \in \mathbb{R}^{1 \times h}$ is its bias. We consider $h$ hidden units. Finally, $\sigma$ is the activation function.

Since we will consider just one hidden layer, the output layer is given as

- $O = H W_2 + b_2$

where $H \in \mathbb{R}^{n \times h}$, $W_2 \in \mathbb{R}^{h \times q}$ and $b_2 \in \mathbb{R}^{1 \times q}$.
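The bias is written above as a $1 \times h$ row, while in code it is often stored as a plain vector of length $h$. The following is a minimal sketch, with illustrative sizes and variable names that are assumptions rather than part of the lab, showing that broadcasting makes both forms give the same result when added to $XW_1$:

```python
import torch

n, d, h = 8, 4, 5                      # illustrative sizes (assumed)
X_demo = torch.randn(n, d)             # n samples, d features
W1_demo = torch.randn(d, h)            # first-layer weights

b1_vec = torch.randn(h)                # bias stored as a vector of length h
b1_row = b1_vec.reshape(1, h)          # the same bias as a 1 x h row

# Broadcasting adds the bias to every one of the n rows in both cases
Z_vec = X_demo @ W1_demo + b1_vec
Z_row = X_demo @ W1_demo + b1_row

print(Z_vec.shape, Z_row.shape)
print(torch.allclose(Z_vec, Z_row))
```

Either storage choice works for the formulas above; only the shape reported by `.shape` differs.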

## Building the model with PyTorch

To build our network we will always inherit from PyTorch's `nn.Module`, as this takes care of much of the bookkeeping for us (for example, tracking the model's parameters). In the case of our MLP, the hyperparameters are the number of inputs, the number of outputs and the number of hidden units. For now, just read this class and make sure you understand it; do not worry about the `forward` method.

In [None]:
from torch import nn

class MLPInit(nn.Module):

    def __init__(self, num_inputs, num_outputs, num_hiddens, lr = 0.01, sigma=0.01):

        super().__init__()

        # Saving the hyperparameters
        self.num_inputs = num_inputs
        self.num_outputs = num_outputs
        self.num_hiddens = num_hiddens
        self.lr = lr

        
    # for now you should ignore this
    def forward(self, X):
        return X

## Initializing parameters

First, we need to initialize the parameters of our MLP. For that we will use the `nn.Parameter` class.

In [None]:
sigma = 0.1
W = nn.Parameter(torch.randn(5, 3) * sigma)

You will notice that `W` behaves like an ordinary tensor. The advantage is that, when it is assigned as an attribute of an `nn.Module`, it is automatically registered as a parameter of the model. In addition, it is created with `requires_grad=True` by default, so we do not need to set that ourselves.
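To illustrate the registration behaviour described above, here is a small sketch. The class `Tiny` and its attribute names are hypothetical and are not part of the lab. It stores one `nn.Parameter` and one plain tensor, then lists what the module reports as parameters:

```python
import torch
from torch import nn

class Tiny(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.randn(2, 2))  # registered as a parameter
        self.t = torch.randn(2, 2)                # plain tensor: not registered

    def forward(self, X):
        return X @ self.w

tiny = Tiny()
names = [name for name, _ in tiny.named_parameters()]
print(names)                                       # only the nn.Parameter is listed
print(tiny.w.requires_grad, tiny.t.requires_grad)  # gradient tracking differs too
```

Only attributes wrapped in `nn.Parameter` show up in `parameters()`, which is what an optimizer will later iterate over.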

In [None]:
print(W)

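Now let's initialize all of the parameters in our `MLPInit` class: $W_1, W_2, b_1, b_2$. The weights should be drawn from a Gaussian distribution with zero mean and standard deviation `sigma` (as in the cell above, where `torch.randn` is scaled by `sigma`). The biases should be initialized with zeros. Store each bias as a 1-D tensor (length $h$ for $b_1$ and $q$ for $b_2$), as the asserts below expect; broadcasting will add it to every row.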

In [None]:
from torch import nn

class MLPInit(nn.Module):

    def __init__(self, num_inputs, num_outputs, num_hiddens, lr = 0.01, sigma=0.01):

        super().__init__()

        # Saving the hyperparameters
        self.num_inputs = num_inputs
        self.num_outputs = num_outputs
        self.num_hiddens = num_hiddens
        self.lr = lr


        # Exercise - initialize the parameters with the nn.Parameter class and the correct dimensions


    # for now you should ignore this
    def forward(self, X):
        return X

Now we will create a model with 4 input features, 3 outputs and 5 hidden units. Make sure your code passes all the asserts.

In [None]:
model = MLPInit(num_inputs=4, num_outputs=3, num_hiddens=5)
assert model.W1.shape == (4,5)
assert model.b1.shape == (5,)
assert model.W2.shape == (5,3)
assert model.b2.shape == (3,)

Loop over `model.parameters()` and print each parameter.

In [None]:
# Exercise


## Forward method

Now that we have the parameters we are ready to implement our forward method, but first we need to define an activation function.

### Activation function

For the activation function $\sigma$ we will use the ReLU function. Recall that

$\textrm{ReLU}(x) = \textrm{max}(x,0)$

Implement the ReLU function and plot it in the interval [-5, 5].

In [None]:
# Exercise
def relu(X):


We will not rely on our own implementation; instead we will use the `torch.nn.functional` module. Run the cell below. You should get the same plot as before.

In [None]:
import torch.nn.functional as F

x = torch.arange(-5.0, 5.0, 0.1)
y = F.relu(x)
plt.plot(x,y)

### Forward

Now that we have defined our activation function and know how to implement it, it is time to complete our `forward` method. Remember that we want to take an input $X$ and pass it through all the layers of the network (in the correct order). You should implement

- $H = \textrm{ReLU}(XW_1 + b_1)$
- $O = H W_2 + b_2$

In [None]:
import torch.nn.functional as F

class MLPScratch(nn.Module):
    def __init__(self, num_inputs, num_outputs, num_hiddens, lr=0.01, sigma=0.01):
        super().__init__()
        self.num_inputs = num_inputs
        self.num_outputs = num_outputs
        self.num_hiddens = num_hiddens
        self.lr = lr

        # Exercise - Just copy here your parameter initialization from before


    # Exercise - implement the forward method
    def forward(self, X):



When you inherit from `nn.Module` you must define a `forward` method, which is why we always defined it above. Now let's test your implementation with a small batch $X$ of 8 samples and 4 features. When we run `model(X)` the forward method is called.
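As a side illustration of the last sentence, the sketch below uses a hypothetical module `Doubler` (not part of the lab) to show that calling the module object gives the same result as calling `forward` directly. The call goes through `nn.Module.__call__`, which in turn invokes `forward`:

```python
import torch
from torch import nn

class Doubler(nn.Module):  # hypothetical module, for illustration only
    def forward(self, X):
        return 2 * X

m = Doubler()
X_demo = torch.ones(2, 3)

out_call = m(X_demo)             # routed through nn.Module.__call__
out_forward = m.forward(X_demo)  # direct call, same computation

print(torch.equal(out_call, out_forward))
```

Calling the module itself rather than `forward` is the usual convention, since `__call__` also runs any hooks registered on the module.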

In [None]:
model = MLPScratch(num_inputs=4, num_outputs=3, num_hiddens=5)
X = torch.randn(8, 4)
output = model(X)
print(output.shape)
print(output)

What is the meaning of this output? As expected, the output has shape (8, 3): the number of samples by the number of outputs. In classification, the number of outputs equals the number of classes. The values in `output` are therefore raw scores (logits), one per sample and class: the larger the score, the more strongly the model favors that class.

However, these scores are not probabilities, as you can easily check: they can be negative and the rows do not sum to 1. To turn them into probabilities we use the softmax function, which exponentiates each value (making it positive) and then normalizes, so that the values for each sample lie between 0 and 1 and sum to 1.

$\textrm{softmax}(x_i) = \frac{\textrm{exp}(x_i)}{\sum_j \textrm{exp}(x_j)}$
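The formula can be checked by hand on a single small vector. This is a sketch with illustrative values and variable names that are assumptions, computing the exponentials and the normalization explicitly:

```python
import torch

scores = torch.tensor([1.0, 2.0, 3.0])       # illustrative raw scores (assumed)

exp_scores = torch.exp(scores)               # every entry becomes positive
probs_demo = exp_scores / exp_scores.sum()   # normalize so the entries sum to 1

print(probs_demo)
print(probs_demo.sum())
```

The ordering of the scores is preserved: the largest score receives the largest probability.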

We can use `F.softmax` to implement this. Compute the softmax for the output above and check that the values are now probabilities (each row should sum to 1).

In [None]:
# Exercise


# Loss Function

We have our model but it is still not trained. That is, we have yet to find good values for the weights $W_1, W_2$ and biases $b_1, b_2$ (remember that they were initialized with random values or zeros). To find them, we will minimize a loss function. In a classification problem where we use the softmax function on the output, we will want to use the cross-entropy loss

$l(y,o) = -\sum_{j=1}^q y_j \log \frac{\exp(o_j)}{\sum_{k=1}^q \exp(o_k)}$

where $o$ is the output for one sample and $y$ is its one-hot encoded label.

Naturally, PyTorch also provides us with loss functions.

In [None]:
criterion = nn.CrossEntropyLoss()

An important thing to take into account is that `CrossEntropyLoss()` already applies the softmax internally, so you should pass it the raw `output` and not the softmax probabilities you computed above. The true labels for our batch are $[0,1,2,2,0,2,1,2]$. Compute the current cross-entropy loss between `output` and the true labels (given as a tensor of integer class indices).
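To make the point about the built-in softmax concrete, the sketch below compares `nn.CrossEntropyLoss` on raw scores with a manual computation. The logits and targets are illustrative values chosen here, not the batch from this lab. The manual version takes the log-softmax, picks the entry of the true class, negates it and averages over the samples:

```python
import torch
from torch import nn
import torch.nn.functional as F

# Illustrative raw scores for 2 samples and 3 classes (assumed values)
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])               # class indices (int64)

# Built-in loss: expects raw scores, applies log-softmax internally
loss_builtin = nn.CrossEntropyLoss()(logits, targets)

# Manual version: log-softmax, select the true-class entry, negate, average
log_probs = F.log_softmax(logits, dim=1)
loss_manual = -log_probs[torch.arange(2), targets].mean()

print(loss_builtin.item(), loss_manual.item())
```

Passing already-normalized probabilities instead of `logits` would apply the softmax twice and give a different, incorrect loss value.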

In [None]:
# Exercise


# Optimization

Our goal is now to minimize this loss function. For that we will use gradient descent. In a neural network this is done through backpropagation, where we propagate the gradient of the loss over the network (backwards) using the chain rule. Finally, recall that our update to the weights will be given as

$W' = W - \alpha \nabla_W L$

where $W$ is the current parameter value and $W'$ is the new value, $\alpha$ is the learning rate and $\nabla_W L$ is the gradient of the loss with respect to $W$.
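As a minimal sketch of this update rule, consider the scalar loss $f(w) = (w - 3)^2$, an example chosen here purely for illustration, with its derivative written out by hand rather than computed by autograd. Repeating the update moves $w$ towards the minimizer $w = 3$:

```python
def grad_f(w):
    # Derivative of the illustrative loss f(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w = 0.0        # starting value (assumed)
alpha = 0.1    # learning rate (assumed)

for _ in range(50):
    w = w - alpha * grad_f(w)   # W' = W - alpha * gradient

print(w)
```

With this learning rate each step shrinks the distance to the minimum by a constant factor. A learning rate that is too large would instead overshoot and diverge.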

## Optimizer from scratch

Let us start with a simple example by taking a loss function $f = 2x^Tu$, where both $x$ and $u$ are parameters. Create the tensors $x = [0,1]$ and $u = [2,3]$ and compute $f$. Look at the values of $x$, $u$ and their gradients (remember to set `requires_grad`, and use floating-point values, since only floating-point tensors can require gradients).

In [None]:
# Exercise
def myloss(x,u):
  # implement the loss


# create x and u


l = myloss(x,u)
print(x,u)
print(x.grad, u.grad)

Now run the following cell multiple times. What happens?

In [None]:
l = myloss(x,u)
l.backward()
print(x,u)
print(x.grad, u.grad)

Why is the gradient changing if we are not changing x or u? When we do several backward passes with the same parameters, gradients are accumulated, not replaced. At each pass of our training loop we want the current gradient, not its sum with those of previous passes, so we need to take care of this. Fortunately, we can easily set gradients to zero.

In [None]:
x.grad.zero_()
print(x.grad)

Now, what will be the values of x and u after one optimization step with a learning rate of 0.5? Try to do it by hand as well.

In [None]:
x = torch.arange(2.0, requires_grad=True)
u = torch.arange(2.0, 4.0, requires_grad=True)
l = myloss(x,u)

# Exercise compute the value of x and u after an update


We are now ready to build our own Stochastic Gradient Descent optimizer. You should complete the two methods, `step` and `zero_grad`. We need the `zero_grad` method to set all gradients to zero and prevent them from accumulating, as we have seen above.

In [None]:
class MySGD():
    def __init__(self, params, lr):
        self.params = params # list of parameters to be updated
        self.lr = lr # learning rate

    def step(self):
        with torch.no_grad(): #  disable gradient calculation while updating parameters
            # Exercise update all parameters in self.params (just like you did above)


    def zero_grad(self):
        # Exercise set each of the grad of the parameters to zero


Let us test our optimizer with the previous example.

In [None]:
# Just as before
x = torch.arange(2.0, requires_grad=True)
u = torch.arange(2.0, 4.0, requires_grad=True)
l = myloss(x,u)

# Test our optimizer
optimizer = MySGD(params = [x,u], lr = 0.5)
l.backward()
optimizer.step()
print(x,u)
optimizer.zero_grad()
print(x.grad,u.grad)

## PyTorch SGD

Fortunately, PyTorch already provides optimizer classes that we can use, and they work in the same way. Replicate the results with `torch.optim.SGD`.

In [None]:
# Exercise


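Note that in recent PyTorch versions (2.0 and later) `optimizer.zero_grad()` sets the gradients to `None` by default instead of filling them with zeros; pass `set_to_none=False` to get zero tensors.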
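The two behaviours of `zero_grad` can be compared directly. The sketch below uses an illustrative tensor `w_demo` and loss that are assumptions, not the lab's `x` and `u`. It passes `set_to_none` explicitly, so the result does not depend on the installed version's default:

```python
import torch

w_demo = torch.ones(2, requires_grad=True)   # illustrative parameter (assumed)
opt = torch.optim.SGD([w_demo], lr=0.1)

# Case 1: keep the gradient tensor and fill it with zeros
(w_demo ** 2).sum().backward()
opt.zero_grad(set_to_none=False)
grad_zeroed = w_demo.grad.clone()
print(grad_zeroed)

# Case 2: drop the gradient tensor entirely
(w_demo ** 2).sum().backward()
opt.zero_grad(set_to_none=True)
grad_none = w_demo.grad
print(grad_none)
```

In both cases the next `backward()` call starts from a clean state, so training behaves the same. The difference matters only for code that reads `.grad` between `zero_grad()` and `backward()`.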

We now have all the elements to train our model and we will do so in the next notebook.