[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lorenzobasile/DeepLearning2022/blob/main/2_gradient_descent.ipynb)

# Lab 2

# Training a Neural Network

Today we will introduce the basic tools we need to apply backpropagation and train a neural network with gradient-based methods.

To start, let's import the usual.

In [None]:
from google.colab import drive
import torch

drive.mount('/content/drive')
%cd drive/MyDrive/DeepLearning2022

## Automatic differentiation in torch



PyTorch is built with support for differentiation in mind.
In the end, Deep Learning (for now) is all about differentiation and building cascades of differentiable functions into complicated multilayer deep neural networks.

Essentially, all PyTorch built-ins support differentiability (unless the function is not differentiable, of course).
Today we will see how to compute derivatives in PyTorch.


Under the hood, each torch `Tensor` has a boolean attribute `requires_grad`, which tells `autograd` whether it should keep track of operations on the tensor or not. `autograd` is the automatic differentiation engine of PyTorch.

In [None]:
x = torch.rand(3,3)

x

In [None]:
x.requires_grad

We can manually set this to `True`, or directly create a Tensor that supports gradients.

In [None]:
x.requires_grad = True
# or equivalently x=torch.rand(3, 3, requires_grad=True)
print(x)

Now suppose we are handling a function $f:\mathbb{R}\rightarrow\mathbb{R}$.

For instance, $f(x) = x^2$.

We could apply $f$ to a singleton tensor and compute the derivative.

In [None]:
x = torch.rand(1, requires_grad=True)

print("x:", x)

y = x**2

print("y:", y)

To compute the gradient, we call `backward()` on the Tensor `y`:

In [None]:
y.backward()

We expect the derivative to be $2x$. We can check this is correct by inspecting the gradient of `x`:

In [None]:
x.grad==2*x

Notice that when there's no gradient, `grad` is automatically set to `None` to save memory.

In [None]:
torch.rand(3,3).grad is None

The same operation can be repeated in a slightly more complicated setting, when dealing with a scalar function of more than one variable $f:\mathbb{R}^d\rightarrow\mathbb{R}$.

This step is particularly important since the core operation of Deep Learning is computing the gradients of a scalar function (our loss function) with respect to the parameters of the network.

Now `x` will be a vector (or a matrix, it doesn't really matter for our case) and we will apply to it a function which returns a single scalar.

One example may be $f(\mathbf{x})=\sum_{i=1}^d x_i$.

In [None]:
x = torch.rand([5], requires_grad=True)

print(x)

y = x.sum()

y.backward()

In [None]:
x.grad
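
For the sum above, every partial derivative is 1, so `x.grad` is a vector of ones. As a further minimal sketch (the function $f(\mathbf{x})=\sum_i x_i^2$ and the input values are assumptions chosen for illustration), the gradient of a different scalar function can be checked against its analytic form $2\mathbf{x}$:

```python
import torch

# Illustrative input (an assumption, not part of the lab data)
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# f(x) = sum_i x_i^2 maps a vector to a single scalar
y = (x ** 2).sum()

# backward() fills x.grad with the gradient of y with respect to x
y.backward()

# The analytic gradient is 2x; detach() avoids tracking this comparison
print(torch.allclose(x.grad, 2 * x.detach()))
```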

## Composition of functions

We can also use `backward` to compute the gradient of a composition of functions. For our purposes, it will be very useful to think in terms of a computational graph.

We can view $y=g(f(x))$ as

![](images/compgra1.jpg)

We might extend this and add a hidden node $z$
between $f$ and $g$

![](images/compgra2.jpg)

Supposing $z=f(x)=\log(x)$
and $y=g(z)=z^2$, we can reproduce this example in PyTorch. 

To sum up, $y=\log^2(x)$, and so by the chain rule we expect $\frac{dy}{dx}=\frac{dy}{dz}\frac{dz}{dx}=2\frac{\log(x)}{x}$.

In [None]:
x = torch.rand(1, requires_grad=True)

print("x:", x)

z = x.log()

y = z**2

print("y:", y)

In [None]:
#z.retain_grad()

y.backward()

torch.isclose(x.grad, 2*x.log()/x) # isclose instead of == to tolerate floating-point rounding

Now suppose that we want to access $\frac{dy}{dz}=2z$:

In [None]:
z.grad==2*z

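By default, `autograd` only stores gradients for leaf tensors such as `x`, so `z.grad` is `None` here. To store gradients of intermediate computations, we can call `.retain_grad()` on the intermediate node before calling `backward()`.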
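
As a minimal sketch of this behaviour (the input value 2.0 is an arbitrary assumption), the example from this section can be rerun with `retain_grad()` called on `z` before the backward pass:

```python
import torch

x = torch.tensor([2.0], requires_grad=True)  # arbitrary illustrative value

z = x.log()
z.retain_grad()   # ask autograd to keep dy/dz; must happen before backward()

y = z ** 2
y.backward()

# dy/dz = 2z and dy/dx = 2*log(x)/x; allclose tolerates rounding errors
print(torch.allclose(z.grad, 2 * z.detach()))
print(torch.allclose(x.grad, 2 * x.detach().log() / x.detach()))
```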

## Gradient accumulation

Let us look at another feature of differentiation in torch.

We can call `backward()` multiple times. Run the second cell below several times and see what happens.

In [None]:
x_1 = torch.tensor([3.0], requires_grad=True)

x_2 = torch.tensor([2.0], requires_grad=True)

In [None]:
c = x_1.cos() * x_2.log()
c.backward()
print(x_1.grad, x_2.grad)

What is happening? Why is the gradient not the same?

PyTorch keeps accumulating (i.e., summing) gradients across calls to `backward()`. If we want to reset the gradient, we must set it to `None`:
```
x_1.grad = None
x_2.grad = None
```
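
The accumulation can be made explicit with a small self-contained sketch (the function $x^2$ and the value 3.0 are assumptions chosen so that the numbers are easy to follow):

```python
import torch

x = torch.tensor([3.0], requires_grad=True)

(x ** 2).backward()
first = x.grad.clone()    # d(x^2)/dx = 2 * 3 = 6

(x ** 2).backward()       # the new gradient is added to the stored one
second = x.grad.clone()   # 6 + 6 = 12

x.grad = None             # reset the stored gradient
(x ** 2).backward()
third = x.grad.clone()    # back to 6

print(first, second, third)
```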

## Optimizers

Now that we have the tools to compute gradients of any function (specifically, we will be interested in loss functions), it is time to figure out what to do with these gradients.

In torch, it is straightforward to use most optimization techniques based on Gradient Descent and its variations, such as SGD, Adam, RMSProp etc.

To do so, you can use the tools of the `torch.optim` package, which implements the most common training algorithms as subclasses of the `Optimizer` class. To name a few:

*   `torch.optim.SGD`
*   `torch.optim.Adam`
*   `torch.optim.RMSprop`
*   `torch.optim.Adagrad`

To construct an Optimizer you have to give it an iterable containing the parameters to optimize. Then, you can specify optimizer-specific options such as the learning rate, weight decay, etc. For example:

In [None]:
# always set lr explicitly: older PyTorch releases have no default for it in SGD
# (model and learning_rate are placeholders here, to be defined beforehand)
optimizer=torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=0, weight_decay=0)

Using an optimizer is simple. There are three main steps:

*   delete the current gradient information on the parameters: `optimizer.zero_grad()`
*   compute the derivatives of the loss: `loss.backward()`
*   perform a gradient descent step and update parameters (according to the algorithm in use): `optimizer.step()`

Note that the first step should always be performed since, as we saw, torch accumulates gradients.
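
The three steps above can be sketched on a toy problem. This is only an illustration under stated assumptions: the objective $(w-3)^2$, the single parameter `w`, the learning rate and the number of steps are all made up for the example.

```python
import torch

# Toy problem (an assumption): minimize (w - 3)^2, whose minimum is at w = 3
w = torch.tensor([0.0], requires_grad=True)

# The optimizer receives an iterable of the tensors to optimize
optimizer = torch.optim.SGD([w], lr=0.1)

for step in range(100):
    loss = ((w - 3.0) ** 2).sum()
    optimizer.zero_grad()   # 1. delete the gradients from the previous step
    loss.backward()         # 2. compute the derivative of the loss w.r.t. w
    optimizer.step()        # 3. update w with a gradient descent step

print(w)   # should be very close to 3
```

Passing `[w]` directly works because an optimizer only needs an iterable of tensors; `model.parameters()` provides exactly that for a `torch.nn.Module`.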


## Adaptive learning rate

At times, it is useful to vary the learning rate during training. For example, you may want a large learning rate in the initial phase to descend the loss quickly, and then a smaller one to be more precise around a minimum. The easiest way to do this is to use the schedulers that you can find in `torch.optim.lr_scheduler`.

The simplest LR scheduler is `ExponentialLR`, which takes a parameter $\gamma$ and, at each call to `scheduler.step()`, updates the learning rate as:

$$
\text{lr}\leftarrow\gamma\cdot\text{lr}
$$

More info on this [here](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate), but in a nutshell all you have to do is something like:

In [None]:
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

# then, whenever you want to decay the learning rate (typically once per epoch, after optimizer.step())
scheduler.step()
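
To see the decay in action, here is a minimal sketch in which the learning rate is read back from the optimizer at each epoch. The single parameter `w` is a placeholder standing in for `model.parameters()`, and the actual training steps are omitted.

```python
import torch

w = torch.zeros(1, requires_grad=True)   # placeholder parameter (an assumption)
optimizer = torch.optim.SGD([w], lr=0.1)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

lrs = []
for epoch in range(3):
    # the current learning rate is stored in the optimizer's parameter groups
    lrs.append(optimizer.param_groups[0]["lr"])
    # ... forward pass, backward pass on the mini-batches would go here ...
    optimizer.step()    # parameter update (a no-op here, since w has no gradient)
    scheduler.step()    # multiply the learning rate by gamma

print(lrs)   # approximately 0.1, 0.09, 0.081
```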

## Back to our linear regression

Coming back to the linear regression from the previous lab, we now have all the tools to fit its weights and bias with Stochastic Gradient Descent (SGD) instead of using Least Squares.

First of all, let's recover some code from the last lab:

In [None]:
X=torch.load("./data/X_reg.pt")
Y=torch.load("./data/y_reg.pt")
N,P=X.shape

In [None]:
class LinearRegressor(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.regressor = torch.nn.Linear(in_features=P, out_features=1, bias=True)

    def forward(self, X):
        return self.regressor(X)

lin_reg=LinearRegressor()

We define a `batch_size` of 8. This means that at each training step the network is fed with 8 datapoints from our dataset, on which a gradient descent step is performed.

In [None]:
dataset=torch.utils.data.TensorDataset(X,Y)
dataloader=torch.utils.data.DataLoader(dataset, batch_size=8, shuffle=True)

We are doing regression, so our objective is the minimization of the Mean Squared Error between the targets and the output of our model.

In [None]:
loss=torch.nn.MSELoss()

Now we can go on and define our optimizer:

In [None]:
optimizer=torch.optim.SGD(lin_reg.parameters(), lr=1e-3)

The following cell contains the nested loop that we will run to train the network. Its pseudocode is as follows:


```
Loop over epochs:
    Loop over data:
        Perform a forward pass
        Compute the loss
        Erase the past gradients
        Compute gradients performing a backward pass
        Update the parameters
```



SGD training is an iterative process, which we repeat for a (usually) fixed number of *epochs*. In each epoch we traverse the whole dataset through our `DataLoader` object, which provides us with randomly drawn mini-batches of 8 elements.

We can keep track of the loss so that we can compare it with the loss of the least squares estimate we obtained during the previous lab.

In [None]:
epochs=100
losses=[]
for epoch in range(epochs):
    epoch_loss=0 # we want to accumulate the loss on all the mini-batches to average and obtain the overall MSE
    for x, y in iter(dataloader):
        out=lin_reg(x)
        l=loss(out, y)
        epoch_loss+=l.item()
        optimizer.zero_grad()
        l.backward()
        optimizer.step()
    losses.append(epoch_loss/len(dataloader)) # len(dataloader) contains the number of mini-batches

The loss rapidly decreases during training, approaching the optimal value obtained with least squares in the previous lab.

In [None]:
import matplotlib.pyplot as plt

plt.title("Linear regression loss")
plt.axhline(0.9005, color='r', label='Least Squares')
plt.semilogy(losses, label='SGD')
plt.legend()
plt.xlabel("Optimization epoch")
plt.ylabel("MSE Loss")
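
The comparison with least squares can also be reproduced without the lab's data files. The sketch below is an illustration on synthetic data: the dataset, its size, the true weights, the noise level and the hyperparameters are all assumptions, not the values used in the lab.

```python
import torch

torch.manual_seed(0)

# Synthetic regression data (an assumption, not the lab's X_reg / y_reg)
N, P = 200, 3
X = torch.randn(N, P)
true_w = torch.tensor([[1.0], [-2.0], [0.5]])
Y = X @ true_w + 0.3 + 0.1 * torch.randn(N, 1)

# Closed-form least squares; a column of ones accounts for the bias
X1 = torch.cat([X, torch.ones(N, 1)], dim=1)
w_ls = torch.linalg.lstsq(X1, Y).solution
mse_ls = ((X1 @ w_ls - Y) ** 2).mean().item()

# The same model fitted with mini-batch SGD
model = torch.nn.Linear(P, 1)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
dataset = torch.utils.data.TensorDataset(X, Y)
loader = torch.utils.data.DataLoader(dataset, batch_size=8, shuffle=True)

for epoch in range(50):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss_fn(model(xb), yb).backward()
        optimizer.step()

# Evaluate the SGD solution on the full dataset
with torch.no_grad():
    mse_sgd = loss_fn(model(X), Y).item()

print(mse_ls, mse_sgd)   # the SGD loss should be close to the least squares one
```

Least squares gives the minimum achievable MSE on the training data, so the SGD loss can approach it but not go below it (up to floating-point error).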


# A simple MLP Classifier

We will now try to solve our first *real* problem with Deep Learning. The task is handwritten digit recognition, and we will use the MNIST dataset. It is an outstandingly popular benchmark in the Machine Learning community, and it is seen as the first and simplest real-world task one can solve with a neural network (aka the *Hello World of Deep Learning*).

First of all, let's download the data and create our DataLoaders.

The transforms we are employing are intended to convert the images into torch Tensors (`ToTensor()`) and to normalize the images to have 0 mean and standard deviation 1 (`Normalize`).

In [None]:
import torchvision

transforms = torchvision.transforms.Compose([
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize((0.1307,), (0.3081,))
    ])

trainset = torchvision.datasets.MNIST('./data/', transform=transforms,  train=True, download=True)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=256, shuffle=True)

testset = torchvision.datasets.MNIST('./data/', transform=transforms, train=False, download=True)
testloader = torch.utils.data.DataLoader(testset, batch_size=512, shuffle=False)


Now, we can visualize our data using `matplotlib`.

In [None]:
x,y=next(iter(trainloader))
print(x.shape)
first_img=x[0]
first_label=y[0]

For grayscale images, `imshow` expects input of shape $H\times W$, so we have to reshape the image:

In [None]:
plt.imshow(first_img.reshape(28,28), cmap='gray')
print("Label: ", first_label)

Now we go on and define our model: an MLP with two hidden layers of 32 and 16 neurons respectively, with ReLU activations. The input layer size is 784, since our images are 28x28, and the output layer has 10 neurons since we have 10 classes.

ReLU is the Rectified Linear Unit, defined as:
$$
\text{ReLU}(x)=\begin{cases}0 & \text{if } x\lt 0\\ x & \text{if } x\ge 0\end{cases}
$$
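
As a quick check of this definition (the input values are arbitrary), `torch.nn.ReLU` can be compared with an explicit clamp at zero:

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])   # arbitrary test values

relu = torch.nn.ReLU()
out = relu(x)

# negative entries become 0, non-negative entries are left unchanged
print(out)
print(torch.equal(out, torch.clamp(x, min=0)))
```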

Note that the `Linear` layer of torch expects each input sample to be a vector, so in the `forward` method we have to remember to flatten every image in the batch into a 1-D vector.

In [None]:
class Classifier(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = torch.nn.Linear(in_features=28*28, out_features=32, bias=True)
        self.layer2 = torch.nn.Linear(in_features=32, out_features=16, bias=True)
        self.layer3 = torch.nn.Linear(in_features=16, out_features=10, bias=True)
        self.activation = torch.nn.ReLU()
        # alternatively, self.activation = torch.nn.functional.relu

    def forward(self, x):
        x=x.reshape(-1,28*28)
        x=self.activation(self.layer1(x))
        x=self.activation(self.layer2(x))
        x=self.layer3(x) # we don't need any nonlinearity (i.e. softmax) on the output layer. why?
        return x

model=Classifier()

Now we define our optimizer and the loss function that we want to minimize. Since we are doing 10-class classification we now switch to Cross-Entropy.

You can find the details of this loss in torch [here](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html), but what is really important to point out is that the input to this loss is assumed to contain "raw, unnormalized scores for each class". This means that we shouldn't include any softmax in the network architecture, since it is already applied inside this implementation of the CE loss.
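
A minimal sketch of this point, using random scores and made-up labels as stand-ins for network outputs and targets: applying `CrossEntropyLoss` to raw scores gives the same value as applying a log-softmax followed by the negative log-likelihood loss.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.randn(4, 10)            # raw scores: 4 samples, 10 classes
targets = torch.tensor([3, 0, 9, 1])   # made-up class indices

# CrossEntropyLoss applied directly to the raw scores
ce = torch.nn.CrossEntropyLoss()(logits, targets)

# The same quantity computed in two explicit steps
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(torch.allclose(ce, manual))
```

Adding a softmax layer at the end of the network would therefore normalize the scores twice.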

In [None]:
optimizer=torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
loss=torch.nn.CrossEntropyLoss()

We define a utility function to compute how accurate our classifier is.

Note the `model.eval()` command and the context manager `with torch.no_grad()`. The former tells the network that we are evaluating it, not training it: in this simple example it makes no difference, but we will see cases where it is crucial. The latter switches off gradient tracking, which makes computations lighter and prevents you from performing gradient descent steps by mistake.

In [None]:
def get_accuracy(model, dataloader):
    model.eval()
    with torch.no_grad():
        correct=0
        for x, y in iter(dataloader):
            out=model(x)
            correct+=(torch.argmax(out, axis=1)==y).sum()
        return correct/len(dataloader.dataset)
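
The effect of `model.eval()` and `torch.no_grad()` described above can be inspected directly. In this sketch a single `Linear` layer is used as a stand-in for a full model (an assumption made to keep the example short):

```python
import torch

layer = torch.nn.Linear(4, 2)   # stand-in for a full model
x = torch.rand(3, 4)

# eval() and train() toggle the boolean `training` flag of the module
layer.eval()
mode_after_eval = layer.training
layer.train()
mode_after_train = layer.training

# Outside no_grad the output is tracked by autograd, inside it is not
out_tracked = layer(x)
with torch.no_grad():
    out_untracked = layer(x)

print(mode_after_eval, mode_after_train)
print(out_tracked.requires_grad, out_untracked.requires_grad)
```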

When training, we have to remember to call `model.train()`, the counterpart of `model.eval()`.

In [None]:
epochs=5
losses=[]
for epoch in range(epochs):
    print("Test accuracy: ", get_accuracy(model, testloader))
    model.train()
    print("Epoch: ", epoch)
    for x, y in iter(trainloader):
        out=model(x)
        l=loss(out, y)
        optimizer.zero_grad()
        l.backward()
        optimizer.step()
        losses.append(l.item())
print("Final accuracy: ", get_accuracy(model, testloader))

plt.figure()
plt.title("MNIST batch loss")
plt.plot(losses)
plt.xlabel("Optimization step")
plt.ylabel("CE Loss")

# First assignment (deadline 13 November)



- 1. Read carefully the paper [Learning representations by back-propagating errors](https://www.iro.umontreal.ca/~vincentp/ift3395/lectures/backprop_old.pdf)
- 2. Reproduce in PyTorch (or any other DL library you like) experiment 1. Try to be as close as possible to the original protocol regarding network architecture, activation function, training algorithm and parameter initialization.
    - Inspect the weights you obtained and check if they provide a solution to the problem
    - Compare the solution to the solution reported in the paper
- 3. Write a small report (1 page) about your experiment and what you learned from it. The report should be a jupyter notebook with text cells that describe the non-trivial parts of your work.

- Tips and comments:
    - Be careful: don't expect to be able to fully reproduce the results of the paper.
    - Optionally, you are warmly encouraged to explore (and discuss in your report) possible workarounds to improve your results. Some of these may include:
        - changing activation function;
        - changing optimizer;
        - adding a learning rate scheduler

You can send the jupyter notebook in any format you prefer (`ipynb`, `pdf` or `html`) to lore.basile@outlook.com by 23.59, 13/11/2022.

Also, use the same email address for any question regarding the assignment and coding.

If you have more "general" doubts regarding the assignment itself, the paper or any other part of the course until now, please also include prof. Ansuini in the conversation (alessio.ansuini@areasciencepark.it).