[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lorenzobasile/DeepLearningMHPC/blob/main/1_introduction.ipynb)



# Lab 1: Introduction


Lecturer: Lorenzo Basile (lorenzo.basile@areasciencepark.it), Research Fellow at AREA Science Park


Every lab ends with a small homework assignment that builds on the day's material. We can briefly discuss any doubts or curiosities the following day. Homework must be completed by the end of the week.

Please do reach out with any doubts or requests for clarification about the course and/or the assignments!

## Computational resources

We will not run particularly heavy experiments during the labs, so for the most part you should be able to reproduce them on the CPU of your personal laptop. However, to avoid issues with library versions and package installation (and to take advantage of some hardware acceleration from time to time), we will run the labs on Google Colab, a service that provides free access to GPUs.

For your assignments, it is advisable to switch from Colab to a proper HPC facility.


# Introduction to Colab

Colab is a free service provided by Google for ML research. It is based on Jupyter notebooks that run on a remote server, and it provides free (but time-limited) GPU acceleration.

To enable GPU or TPU acceleration just go to `Runtime>Change runtime type` and choose from the menu. Please note that GPU usage is limited in time, so avoid requesting one if you do not really need it.

Inside a code cell you can use `!` to run shell commands:

In [None]:
!nvidia-smi    # if you enable GPU acceleration, this command returns information on the GPU
!pip install torch==1.11.0    # just an example, torch is already installed in Colab
!sudo apt-get install gcc    # you can also run sudo commands
!wget https://roboti.us/download/mujoco200_linux.zip    # and download data to temporary storage
!git clone https://github.com/lorenzobasile/DeepLearningMHPC.git    # just another example (HTTPS, since Colab has no SSH keys configured)

## Colab file system

By default, Colab works on a volatile file system that is erased as soon as your session terminates, but you can connect it to your personal Google Drive to read and write data:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# note that for the directory change to persist across cells you need the magic % instead of !
%cd drive/MyDrive/DeepLearningMHPC

In [None]:
!ls

# Getting started with PyTorch

In [None]:
import torch

## What is PyTorch?

PyTorch (or informally torch) is a Python library built specifically for Deep Learning. Its many useful functionalities make it one of the most popular tools for DL research and applications.

Namely, it offers many built-in modules for DL, tensor arithmetic and automatic differentiation, and it allows for easy GPU acceleration through CUDA.

Another famous library for DL you may have heard of is TensorFlow, which also has a more user-friendly interface called Keras.

## Basic operations with Tensors

The main building block of PyTorch is the `Tensor` class. A torch `Tensor` is the equivalent of NumPy `ndarray` and most of the functionalities are the same as in NumPy.

In [None]:
import numpy as np

x=torch.tensor([[1,2,3],[4,5,6]])
y=np.array([[1,2,3],[4,5,6]])

print("X:", x)
print("Y:", y)



Basic NumPy array features exist for torch tensors:

In [None]:
x.shape, y.shape, x.size()

In [None]:
x.dtype, y.dtype

Note that you can build a tensor through the constructor `torch.Tensor`. In this case, since `torch.Tensor` is an alias for `torch.FloatTensor`, the tensor you create will have type `torch.float32`.

You can convert the dtype of a tensor by using the functions `float()`, `int()` etc.

More info on data types [here](https://pytorch.org/docs/stable/tensors.html).

In [None]:
x=torch.Tensor([[1,2,3], [4,5,6]])
print("Dtype of X:", x.dtype)
x=x.int()
print("Dtype of X:", x.dtype)
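The difference between the two constructors can be checked directly. The following is a minimal sketch, assuming PyTorch's default settings (default floating type `torch.float32`); the names `a`, `b` and `c` are only illustrative.

```python
import torch

a = torch.tensor([[1, 2, 3], [4, 5, 6]])   # dtype inferred from the Python ints
b = torch.Tensor([[1, 2, 3], [4, 5, 6]])   # always the default floating type
c = b.long()                               # explicit conversion to 64-bit integers

print(a.dtype, b.dtype, c.dtype)
```

In practice `torch.tensor` is usually the safer choice, since it keeps the dtype of the data you pass in.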

Special tensors, such as those made of ones and zeros, can be created using their corresponding functions:

In [None]:
x=torch.ones(2,3,2)
print("Ones:", x)
x=torch.zeros(2,3,2)
print("Zeros:", x)

And you can create random tensors just like you create random arrays:

In [None]:
x=torch.rand(2,3,2)    # you can also use a list or a tuple for the dimensions
y=np.random.rand(2,3,2)
print("X:", x)
print("Y:", y)

You can easily compute statistics of tensors (such as the sum, mean, max, min, std... of the elements) by either using the methods of the `Tensor` class or using the basic torch functions and using your tensor as input:

In [None]:
x.sum(), torch.sum(x)

In [None]:
x.mean(), torch.mean(x)

In [None]:
x.argmin(), torch.argmin(x)

It is sometimes very useful to specify one or more dimensions to reduce (along which you want to perform your operations):

In [None]:
print(x)
x.mean(dim=0)

In [None]:
x.argmax(dim=1)

In [None]:
x.sum(dim=(0,1))
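To see which dimensions disappear after a reduction, it helps to look at the resulting shapes. The sketch below uses an illustrative tensor `t` and the `keepdim` argument, which keeps the reduced dimension with size 1.

```python
import torch

t = torch.rand(2, 3, 2)

shape_mean = t.mean(dim=0).shape                # dimension 0 is reduced away
shape_sum = t.sum(dim=(0, 1)).shape             # dimensions 0 and 1 are reduced away
shape_keep = t.sum(dim=1, keepdim=True).shape   # reduced dimension kept with size 1

print(shape_mean, shape_sum, shape_keep)
```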

Tensor slicing works exactly like in NumPy, by means of square brackets:

In [None]:
x[0,1,1]

In [None]:
x[0,1:,1]

In [None]:
x[:,::2,:]

## Aggregating tensors

In [None]:
x = torch.randn(4,5,6)
y = torch.randn_like(x)
z = torch.ones_like(y)

The two key functions to aggregate different tensors are `torch.cat` and `torch.stack`. 

In a nutshell, `cat` concatenates the tensors along a given dimension. All tensors must have the same shape, except at most for the concatenation dimension:

In [None]:
c = torch.cat([x,y,z], dim=0)
print("Concatenated shape:", c.shape)

new_c = torch.cat([c,y,z], dim=0)
print("New concatenated shape:", new_c.shape)

new_c = torch.cat([c,y,z], dim=1) # this does not work
print("New concatenated shape:", new_c.shape)


`stack` instead creates a new dimension, along which the input tensors get aggregated. In this case, the shapes need to match exactly.

In [None]:
c = torch.stack([x,y,z], dim=0)
print("Stack shape:", c.shape)

new_c = torch.stack([c,y,z], dim=0) # this does not work: c has shape (3,4,5,6), y and z have shape (4,5,6)
print("New stack shape:", new_c.shape)
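The difference between the two functions is easiest to see from the output shapes. This is a small sketch with illustrative tensors `p` and `q`; the failure case is caught so that the snippet runs to the end.

```python
import torch

p = torch.randn(4, 5, 6)
q = torch.randn(4, 5, 6)

cat_shape = torch.cat([p, q], dim=0).shape      # existing dimension grows: (8, 5, 6)
stack_shape = torch.stack([p, q], dim=0).shape  # new leading dimension: (2, 4, 5, 6)
print(cat_shape, stack_shape)

# stack refuses inputs whose shapes differ, even in a single dimension
try:
    torch.stack([p, torch.randn(3, 5, 6)], dim=0)
    stack_failed = False
except RuntimeError:
    stack_failed = True
print("stack with mismatched shapes failed:", stack_failed)
```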


## Linear algebra and tensor reshaping



An operation we will frequently perform in Deep Learning (though often under the hood) is matrix multiplication. In torch, it can be done in many equivalent ways:

In [None]:
x=torch.rand(4,5)
y=x.T    # matrix transposition

print(x@y)
print(x.matmul(y))
print(torch.matmul(x,y))

Please note that the operator for matrix multiplication is `@`, not `*`, which indicates the Hadamard (element-wise) product instead.

In [None]:
x*x

Multiplying a tensor element-wise by itself is equivalent to raising each of its elements to the power of 2 (this is not the matrix power), and it can also be done by running one of the following commands:

In [None]:
torch.pow(x,2), x**2
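For a square matrix, the element-wise square and the matrix square are different objects. The sketch below illustrates this on a small matrix `A` (an illustrative name), using `torch.linalg.matrix_power` for the actual matrix power.

```python
import torch

A = torch.tensor([[1., 2.], [3., 4.]])

elementwise = A * A      # same as A**2 and torch.pow(A, 2)
matrix_square = A @ A    # same as torch.linalg.matrix_power(A, 2)

print(elementwise)       # tensor([[ 1.,  4.], [ 9., 16.]])
print(matrix_square)     # tensor([[ 7., 10.], [15., 22.]])
```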

As in NumPy, there exists a `dot` function to compute the scalar product between vectors. Note that differently from NumPy, in torch this is **not** equivalent to matrix multiplication, as it is intended to work only with 1D vectors.

In [None]:
v1=x[:,1]
v2=x[:,2]
print(v1.shape, v2.shape)

print(v1.dot(v2))    # in the case of 1D vectors, there is no difference between row and column vectors
print(v1.matmul(v2))
print(v1@v2)

If you want to do something fancier with two vectors, like multiplying a column by a row to obtain a matrix, you need to switch to 2D vectors by reshaping them or use `torch.outer`.

When you reshape a tensor, you can leave one dimension unspecified (using -1), as it can be inferred automatically by torch.

In [None]:
v1_reshaped=v1.reshape(-1,1)    # column vector
v2_reshaped=v2.reshape(1,-1)    # row vector

print(v1_reshaped.shape, v2_reshaped.shape)
result1 = v1_reshaped@v2_reshaped
result2 = torch.outer(v1,v2)
print(torch.allclose(result1, result2, atol=1e-6))

In [None]:
print(v1_reshaped.dot(v2_reshaped))    # this doesn't work! dot works only on 1D tensors

Changing the shape of a tensor is a crucial operation in DL. To have an idea of its application, just think of RGB images, commonly used in Computer Vision.

These are $3\times H\times W$ tensors, where $H$ and $W$ stand for the height and width of the image (in number of pixels). It is often necessary to treat an image as a linearized (flattened) array of pixels:

In [None]:
img=torch.stack([torch.zeros(8,8), torch.ones(8,8), torch.zeros(8,8)], dim=0)
img.reshape(3,64)    # note that reshaping is not in place, so this call does not change the actual shape of img
print(img.shape)

Very often (for instance when you have to pass an image to `matplotlib` for visualization), you need to change the shape of an image to $H\times W \times 3$. You may be tempted to do something like this:

In [None]:
import matplotlib.pyplot as plt

new_img=img.reshape(8,8,3)

plt.imshow(new_img)

This piece of code runs seamlessly, since the dimensions are consistent with the original ones. However, it will not produce the expected behaviour (a green image).

In fact, `reshape` only modifies the shape of a tensor, without touching the way data are stored in memory, meaning that you would end up mixing data from different dimensions.

The right way to change the order of dimensions is to use `permute`, which accepts as argument the ordering of dimensions that you desire:

In [None]:
new_img=img.permute(1,2,0)
print(new_img.shape)
plt.imshow(new_img)
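The same effect can be inspected numerically, without plotting. The following sketch uses a tiny $3\times 2\times 2$ "green" image (the name `tiny_img` is only illustrative) and compares one pixel after `reshape` and after `permute`.

```python
import torch

# tiny 3x2x2 "green" image: only the second (G) channel is non-zero
tiny_img = torch.stack([torch.zeros(2, 2), torch.ones(2, 2), torch.zeros(2, 2)], dim=0)

reshaped = tiny_img.reshape(2, 2, 3)   # reinterprets the same memory order
permuted = tiny_img.permute(1, 2, 0)   # moves the channel axis to the end

print(reshaped[0, 1])   # tensor([0., 1., 1.]): values of different pixels get mixed
print(permuted[0, 1])   # tensor([0., 1., 0.]): (R, G, B) of the pixel at row 0, column 1
```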

# Building Machine Learning models with PyTorch

## Linear regression

By using all the pieces we've seen till now, we can build our first ML model using PyTorch: a linear regressor, whose model is:

$$
y = XW + b
$$

which can also be simplified as:

$$
y = XW
$$

if we incorporate the bias $b$ into $W$ as an additional last row and append a column of ones to the right of $X$.


We start by creating our data. We randomly sample $X$ as an $N\times P$ tensor with $N=1000$ and $P=100$, meaning that we have 1000 datapoints and 100 features, and compute $y$ as:
$$
y=XM+\mathcal{N}(0,I)
$$
where $M$ is a randomly drawn projection vector (shape $P\times 1$, same as our weights).
We add some i.i.d. Gaussian noise to $y$ to avoid the interpolation regime, in which a linear model could fit our data perfectly.

In [None]:
N=1000
P=100
X=torch.rand(N,P)
M=torch.rand(P,1)
Y=X@M+torch.normal(torch.zeros(N,1),torch.ones(N,1))


We can add a column of ones to $X$ to include the bias:

In [None]:
X=torch.cat([X, torch.ones(N,1)], dim=1)
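As a sanity check of the bias trick described above, the two formulations can be compared on a small example. This is only a sketch: the `_demo` names are illustrative and chosen so as not to overwrite the variables of the lab.

```python
import torch

torch.manual_seed(0)
n_demo, p_demo = 5, 3
X_demo = torch.rand(n_demo, p_demo)
W_demo = torch.rand(p_demo, 1)
b_demo = torch.rand(1)

X_aug = torch.cat([X_demo, torch.ones(n_demo, 1)], dim=1)   # shape (n_demo, p_demo + 1)
W_aug = torch.cat([W_demo, b_demo.reshape(1, 1)], dim=0)    # shape (p_demo + 1, 1)

# XW + b and the augmented product agree up to floating-point rounding
same = torch.allclose(X_demo @ W_demo + b_demo, X_aug @ W_aug, atol=1e-6)
print(same)
```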

The regression can be fit with classical statistical methods such as Ordinary Least Squares, and the optimal $W$ has the form:

$$
W^*=(X^TX)^{-1}X^Ty
$$


In [None]:
W_star = ((X.T @ X).inverse()) @ X.T @ Y

In [None]:
W_star.shape

To assess the quality of this fit we can evaluate the Mean Squared Error (MSE) between the original $y$ and the prediction:

In [None]:
torch.nn.functional.mse_loss(X@W_star, Y)
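Explicitly inverting $X^TX$ mirrors the formula, but it is not the only way to obtain the OLS estimate. The sketch below (with illustrative `_demo` names and double precision, both choices made here and not part of the lab) compares solving the normal equations with `torch.linalg.solve` against the dedicated solver `torch.linalg.lstsq`.

```python
import torch

torch.manual_seed(0)
n_demo, p_demo = 1000, 100
kw = dict(dtype=torch.float64)   # double precision keeps the comparison tight

X_demo = torch.cat([torch.rand(n_demo, p_demo, **kw), torch.ones(n_demo, 1, **kw)], dim=1)
Y_demo = X_demo[:, :p_demo] @ torch.rand(p_demo, 1, **kw) + torch.randn(n_demo, 1, **kw)

# normal equations, solved as a linear system instead of forming the inverse
W_normal = torch.linalg.solve(X_demo.T @ X_demo, X_demo.T @ Y_demo)
# dedicated least-squares solver
W_lstsq = torch.linalg.lstsq(X_demo, Y_demo).solution

print(torch.allclose(W_normal, W_lstsq, atol=1e-6))
```

Avoiding the explicit inverse is generally considered better numerical practice, although for a well-conditioned problem like this one the results should coincide.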

## The same linear model, but in PyTorch style

A linear model like the one we saw before is nothing more than an artificial neuron with no activation function.

We will now be exploring the second chunk of PT functionalities, namely the built-in structures and routines supporting the creation of ML models.

We can create the same model we have seen before using torch built-in structures, so we start to see them right away.

Usually, a torch model is a `class` inheriting from `torch.nn.Module`. Inside this class, we'll define two methods:
* the constructor (`__init__`) in which we define the building blocks of our model as class variables;
* the `forward` method, which specifies how the data fed into the model needs to be processed in order to produce the output.

Note for those who already know something about NNs: we don't need to define `backward` methods since we're constructing our model with built-in PT building blocks. PyTorch automatically creates a `backward` routine based upon the `forward` method.

Our model only has one building block (layer) which is a `Linear` layer.
We need to specify the size of the input (i.e. the number of coefficients in $W$) and the size of the output (i.e. how many scalars the layer produces). We additionally request that our layer have a bias term $b$.

The `Linear` layer processes its input as $XW + b$, which is exactly the (first) equation of the linear regressor we saw before.



In [None]:
class LinearRegressor(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.regressor = torch.nn.Linear(in_features=P, out_features=1, bias=True)

    def forward(self, X):
        return self.regressor(X)

We can create an instance of our model and inspect its current parameters through the `state_dict` method, which returns the parameters of each building block of the model. Note that `state_dict` is essentially a dictionary indexed by the names of the building blocks we defined inside the constructor (plus some additional identifiers if a layer has more than one set of parameters).

In [None]:
model = LinearRegressor()

for param_name, param in model.state_dict().items():
    print(param_name, param)
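To make the naming scheme concrete, the sketch below builds a small stand-in model (`TinyRegressor` is a hypothetical name mirroring `LinearRegressor`) and collects the shape of every entry of its `state_dict`.

```python
import torch

class TinyRegressor(torch.nn.Module):   # hypothetical name, mirrors LinearRegressor
    def __init__(self, n_features):
        super().__init__()
        self.regressor = torch.nn.Linear(in_features=n_features, out_features=1, bias=True)

    def forward(self, X):
        return self.regressor(X)

tiny = TinyRegressor(n_features=4)
shapes = {name: tuple(p.shape) for name, p in tiny.state_dict().items()}
print(shapes)   # {'regressor.weight': (1, 4), 'regressor.bias': (1,)}
```

Note that `Linear` stores its weight with shape `(out_features, in_features)`, so a $P\times 1$ coefficient vector has to be transposed before it can be loaded into the layer.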

We can overwrite the parameters through `state_dict`, re-using the OLS estimates we obtained before.

Note that torch is designed for Deep Learning: it does not provide ready-made routines for other ML problems (just use `sklearn` for those).

Next time, we'll see how we can unleash gradient-based iterative training routines in torch and compare the results w.r.t. the OLS estimators.

In [None]:
state_dict = model.state_dict()
state_dict["regressor.weight"] = W_star[:P].T
state_dict["regressor.bias"] = W_star[P]
model.load_state_dict(state_dict)

In [None]:
X_lin_reg = X[:,:P]
predictions_lin_reg = model(X_lin_reg) # equivalent to model.forward(X_lin_reg)
ols_loss = torch.nn.functional.mse_loss(predictions_lin_reg, Y)
print(ols_loss)

# Training a Neural Network

## Automatic differentiation in torch



PyTorch is built with support for differentiation in mind.
In the end, Deep Learning (for now) is all about differentiation and composing cascades of differentiable functions into complicated multilayer deep neural networks.

Essentially, all PyTorch built-ins support differentiability (unless the function is not differentiable, of course).
Today we will see how to compute derivatives in PyTorch.


Under the hood, each torch `Tensor` has a boolean attribute `requires_grad`, which tells `autograd` whether it should keep track of the operations on the tensor or not. `autograd` is the automatic differentiation engine of PyTorch.

In [None]:
x = torch.rand(3,3)

x

In [None]:
x.requires_grad

We can manually set this to `True` or create directly a Tensor supporting grad.

In [None]:
x.requires_grad = True
# or equivalently x=torch.rand(3, 3, requires_grad=True)
print(x)

Now suppose we are handling a function $f:\mathbb{R}\rightarrow\mathbb{R}$.

For instance, $f(x) = x^2$.

We could apply $f$ to a singleton tensor and compute the derivative.

In [None]:
x = torch.rand(1, requires_grad=True)

print("x:", x)

y = x**2

print("y:", y)

To compute the gradient, we call `backward()` on the Tensor `y`:

In [None]:
y.backward()

We expect the derivative to be $2x$. We can check this is correct by inspecting the gradient of `x`:

In [None]:
x.grad==2*x

Notice that when there is no gradient, `grad` is automatically set to `None` to save memory:

In [None]:
torch.rand(3,3).grad is None

The same operation can be repeated in a slightly more complicated setting, when dealing with a scalar function of more than one variable $f:\mathbb{R}^d\rightarrow\mathbb{R}$

This step is particularly important since the core operation of Deep Learning is computing the gradients of a scalar function (our loss function) with respect to the parameters of the network.

Now `x` will be a vector (or a matrix, it doesn't really matter for our case) and we will apply to it a function which returns a single scalar.

One example may be $f(\mathbf{x})=\sum_{i=1}^d x_i$.

In [None]:
x = torch.rand([5], requires_grad=True)

print(x)

y = x.sum()

y.backward()

In [None]:
x.grad
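The same check works for any scalar function of a vector. As a sketch, take $f(\mathbf{x})=\sum_i x_i^2$, whose gradient is $2\mathbf{x}$ (the name `v` is illustrative); `torch.allclose` is used instead of `==` because it tolerates floating-point rounding.

```python
import torch

v = torch.rand(5, requires_grad=True)
f = (v ** 2).sum()      # f(v) = sum_i v_i^2, so the gradient is 2v
f.backward()

grad_ok = torch.allclose(v.grad, 2 * v.detach())
print(grad_ok)
```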

## Composition of functions

We can also use `backward` to compute the gradient of a composition of functions. For our purposes, it will be very useful to think in terms of a computational graph.

We can view $y=g(f(x))$ as

![](images/compgra1.jpg)

We might extend this and add a hidden node $z$
between $f$ and $g$

![](images/compgra2.jpg)

Supposing $z=f(x)=\log(x)$
and $y=g(z)=z^2$, we can reproduce this example in PyTorch. 

To sum up, $y=\log^2(x)$, and so we expect $\frac{dy}{dx}=\frac{2\log(x)}{x}$ (recall that $x>0$, since the logarithm must be defined).

In [None]:
x = torch.rand(1, requires_grad=True)

print("x:", x)

z = x.log()

y = z**2

print("y:", y)

In [None]:
#z.retain_grad()

y.backward()

x.grad==2*x.log()/x

Now suppose that we want to access $\frac{dy}{dz}=2z$:

In [None]:
z.grad==2*z

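Since `z` is an intermediate (non-leaf) node of the graph, its gradient is not stored by default and `z.grad` is `None`. To store gradients of intermediate computations, we can call `.retain_grad()` on the intermediate node before calling `backward()`.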
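A minimal sketch of this, using illustrative names (`x0`, `z0`, `y0`) and a fixed input so that the result is reproducible:

```python
import torch

x0 = torch.tensor([2.0], requires_grad=True)
z0 = x0.log()
z0.retain_grad()     # ask autograd to keep the gradient of this intermediate node
y0 = z0 ** 2
y0.backward()

dz_ok = torch.allclose(z0.grad, 2 * z0.detach())               # dy/dz = 2z
dx_ok = torch.allclose(x0.grad, (2 * x0.log() / x0).detach())  # dy/dx = 2 log(x) / x
print(dz_ok, dx_ok)
```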

## Gradient accumulation

Let us see another feature of torch differentiation functionalities.

We can call `backward()` multiple times; let's see what happens.

In [None]:
x_1 = torch.tensor([3.0], requires_grad=True)

x_2 = torch.tensor([2.0], requires_grad=True)

In [None]:
c = x_1.cos() * x_2.log()
c.backward()
print(x_1.grad, x_2.grad)

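If you run the cell above more than once, the printed gradients change at every execution. What is happening? Why is the gradient not the same?

PyTorch accumulates (i.e., sums) the gradients across successive calls to `backward()`. To reset a gradient, we must set it to `None`: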
```
x_1.grad = None
x_2.grad = None
```
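The accumulation can also be reproduced within a single cell. In this sketch (with an illustrative tensor `w0`), the function $w^2$ is differentiated twice at $w=3$ and the gradient is then reset.

```python
import torch

w0 = torch.tensor([3.0], requires_grad=True)

(w0 ** 2).sum().backward()
first = w0.grad.clone()       # 2w = 6

(w0 ** 2).sum().backward()    # a second backward pass adds to the stored gradient
second = w0.grad.clone()      # 6 + 6 = 12

w0.grad = None                # reset before the next backward pass
(w0 ** 2).sum().backward()
after_reset = w0.grad.clone() # back to 6

print(first, second, after_reset)
```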

## Optimizers

Now that we have the tools to compute gradients of any function (specifically, we will be interested in loss functions), it is time to figure out what to do with these gradients.

In torch, it is straightforward to use most optimization techniques based on Gradient Descent and its variations, such as SGD, Adam, RMSProp etc.

To do so, you can use the tools of the `torch.optim` package, which implements the most popular training algorithms as subclasses of the `Optimizer` class. To cite some:

*   `torch.optim.SGD`
*   `torch.optim.Adam`
*   `torch.optim.RMSprop`
*   `torch.optim.Adagrad`

To construct an Optimizer you have to give it an iterable containing the parameters to optimize. Then, you can specify optimizer-specific options such as the learning rate, weight decay, etc. For example:

In [None]:
# older PyTorch versions require lr for SGD, while recent ones provide a default; setting it explicitly is good practice
optimizer=torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0)

Using an optimizer is very simple, there are three main steps:

*   delete the current gradient information on the parameters: `optimizer.zero_grad()`
*   compute the derivatives of the loss: `loss.backward()`
*   perform a gradient descent step and update parameters (according to the algorithm in use): `optimizer.step()`

Note that the first step should always be performed since, as we saw, torch accumulates gradients.
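The three steps can be sketched on a toy problem: minimizing $(w-3)^2$ with respect to a single parameter. The names `w_toy`, `toy_optimizer` and `toy_loss` are illustrative and chosen so as not to clash with the variables of the lab.

```python
import torch

w_toy = torch.tensor([0.0], requires_grad=True)   # single parameter to optimize
toy_optimizer = torch.optim.SGD([w_toy], lr=0.1)

for _ in range(100):
    toy_loss = ((w_toy - 3.0) ** 2).sum()   # toy objective, minimized at w = 3
    toy_optimizer.zero_grad()               # 1. clear gradients from the previous step
    toy_loss.backward()                     # 2. compute d(loss)/dw
    toy_optimizer.step()                    # 3. update w in place

print(w_toy.item())   # should be very close to 3
```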


## Adaptive learning rate

At times, it is useful to vary the learning rate during training. For example, you may want a large learning rate in the initial phase to descend the loss quickly, and a smaller one later on to be more precise around a minimum. The easiest way to do so is to use the schedulers that you can find in `torch.optim.lr_scheduler`.

The simplest LR scheduler is `ExponentialLR`, which takes a parameter $\gamma$ and at each step does:

$$
\text{lr}=\gamma \text{lr}
$$

More info on this [here](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate), but in a nutshell all you have to do is something like:

In [None]:
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

#and then, when you want to actually decrement the learning rate
scheduler.step()
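To see the decay in action, the learning rate can be read back with `get_last_lr()`. The sketch below uses a dummy parameter and illustrative names (`dummy`, `demo_optimizer`, `demo_scheduler`), and calls the scheduler once per epoch, which is a common but not mandatory choice.

```python
import torch

dummy = torch.nn.Parameter(torch.zeros(1))   # dummy parameter, only needed to build the optimizer
demo_optimizer = torch.optim.SGD([dummy], lr=0.1)
demo_scheduler = torch.optim.lr_scheduler.ExponentialLR(demo_optimizer, gamma=0.9)

lrs = []
for epoch in range(3):
    lrs.append(demo_scheduler.get_last_lr()[0])   # learning rate used during this epoch
    demo_optimizer.step()                         # the training steps would go here
    demo_scheduler.step()                         # lr <- gamma * lr

print(lrs)   # approximately 0.1, 0.09, 0.081
```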

## Back to our linear regression

Coming back to the linear regression from the previous lab, we now have all the tools to fit its weights and bias with Stochastic Gradient Descent (SGD) instead of using Least Squares.

First of all, let's reset our regressor:

In [None]:
lin_reg=LinearRegressor()

### Definition of a loss function

One key element that we need to train any neural network is a loss function, i.e. a function that quantifies how *good* our fit to the data is and that is differentiable w.r.t. the weights and biases of the network.

We saw some examples of common loss functions in the lecture, and all the main losses used in Deep Learning are already implemented and available in PyTorch, to cite some:

*   `torch.nn.MSELoss`
*   `torch.nn.CrossEntropyLoss`
*   `torch.nn.BCELoss`
*   `torch.nn.KLDivLoss`

You can also define your own custom loss function, and as long as you use built-in torch functions to compute it (and you keep it differentiable), you should be fine.

For example, you could build your own MSE loss like this:


In [None]:
def mseloss(output, target):
    loss = torch.mean((output - target)**2)
    return loss
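A quick way to gain confidence in a custom loss is to compare it with the built-in one and to check that gradients flow through it. The following sketch repeats the definition of `mseloss` so that it is self-contained; `output_demo` and `target_demo` are illustrative random tensors.

```python
import torch

def mseloss(output, target):
    return torch.mean((output - target) ** 2)

torch.manual_seed(0)
output_demo = torch.rand(8, 1, requires_grad=True)
target_demo = torch.rand(8, 1)

custom = mseloss(output_demo, target_demo)
builtin = torch.nn.functional.mse_loss(output_demo, target_demo)
custom.backward()    # works because mseloss only uses differentiable torch operations

print(torch.allclose(custom, builtin), output_demo.grad.shape)
```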

### Definition of a DataLoader object

To train any PyTorch model, it is useful to handle data through a `DataLoader` object. A `DataLoader` is an iterable wrapped around a `Dataset` object that lets you easily run through your data in batches.

For any specific need, you can build your own `Dataset` class. To make it work properly, you always have to implement three methods: `__init__`, `__len__` and `__getitem__`. More info on this [here](https://pytorch.org/docs/stable/data.html).

Starting from a set of `Tensor`s representing features and labels, it is easy to define the `Dataset` and its corresponding `DataLoader`:

In [None]:
dataset=torch.utils.data.TensorDataset(X[:,:-1],Y) # the last column of X is the bias term

We define a `batch_size` of 8. This means that at each training step the network is fed with 8 datapoints from our dataset, on which a gradient descent step is performed.

In [None]:
dataloader=torch.utils.data.DataLoader(dataset, batch_size=8, shuffle=True)

In [None]:
X_0, y_0=next(iter(dataloader))

In [None]:
print(X_0, y_0)
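For the custom `Dataset` route mentioned above, a minimal sketch could look as follows. `PairDataset` is a hypothetical class that simply pairs features with labels (essentially what `TensorDataset` already does), and the sizes are arbitrary.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class PairDataset(Dataset):   # hypothetical name
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        # number of datapoints
        return self.features.shape[0]

    def __getitem__(self, idx):
        # one (features, label) pair
        return self.features[idx], self.labels[idx]

data = PairDataset(torch.rand(20, 3), torch.rand(20, 1))
loader = DataLoader(data, batch_size=8, shuffle=True)

batch_shapes = [tuple(xb.shape) for xb, yb in loader]
print(len(data), len(loader), batch_shapes)   # 20 3 [(8, 3), (8, 3), (4, 3)]
```

With 20 datapoints and a batch size of 8, the last mini-batch is smaller than the others; pass `drop_last=True` to the `DataLoader` if that is undesirable.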

We are doing regression, so our objective is the minimization of the Mean Squared Error between the targets and the output of our model.

In [None]:
loss=torch.nn.MSELoss()

Now we can go on and define our optimizer:

In [None]:
optimizer=torch.optim.SGD(lin_reg.parameters(), lr=1e-3)

The following cell contains the nested loop that we will run to train the network. Its pseudocode is as follows:


```
Loop over epochs:
    Loop over data:
        Perform a forward pass
        Compute the loss
        Erase the past gradients
        Compute gradients performing a backward pass
        Update the parameters
```



SGD training is an iterative process, that we repeat for a (usually) fixed number of *epochs*. In each epoch we traverse the whole dataset by exploiting our `DataLoader` object, that provides us with randomly drawn mini-batches of 8 elements.

We can keep track of the loss so that we can compare it with the loss of the least squares estimate we obtained before.

In [None]:
epochs=100
losses=[]
for epoch in range(epochs):
    epoch_loss=0 # we want to accumulate the loss on all the mini-batches to average and obtain the overall MSE
    for x, y in iter(dataloader):
        out=lin_reg(x)
        l=loss(out, y)
        epoch_loss+=l.item()
        optimizer.zero_grad()
        l.backward()
        optimizer.step()
    losses.append(epoch_loss/len(dataloader)) # len(dataloader) contains the number of mini-batches

The loss rapidly decreases during training, approaching the value obtained with least squares.

In [None]:
plt.title("Linear regression loss")
plt.axhline(ols_loss.detach(), color='r', label='Least Squares')
plt.semilogy(losses, label='SGD')
plt.legend()
plt.xlabel("Optimization epoch")
plt.ylabel("MSE Loss")


# A simple MLP Classifier

We will now try to solve our first *real* problem with Deep Learning. The task is handwritten digit recognition, and we will use the MNIST dataset. It is an extremely popular benchmark in the Machine Learning community, and it is seen as the first and simplest real-world task one can solve with a neural network (aka the *Hello World of Deep Learning*).

First of all, let's download the data and create our DataLoaders.

The transforms we are employing are intended to convert the images into torch Tensors (`ToTensor()`) and to normalize the images to have mean 0 and standard deviation 1 (`Normalize`).

In [None]:
import torchvision

transforms = torchvision.transforms.Compose([
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize((0.1307,), (0.3081,))
    ])

trainset = torchvision.datasets.MNIST('./data/', transform=transforms,  train=True, download=True)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=256, shuffle=True)

testset = torchvision.datasets.MNIST('./data/', transform=transforms, train=False, download=True)
testloader = torch.utils.data.DataLoader(testset, batch_size=512, shuffle=False)


Now, we can visualize our data using `matplotlib`.

In [None]:
x,y=next(iter(trainloader))
print(x.shape)
first_img=x[0]
first_label=y[0]

For grayscale images, `imshow` expects input of shape $H\times W$, so we have to reshape the image:

In [None]:
plt.imshow(first_img.reshape(28,28), cmap='gray')
print("Label: ", first_label)

Now we go on and define our model: an MLP with two hidden layers of 32 and 16 neurons respectively, with ReLU activations. The input layer size is 784, since our images are 28x28, and the output layer has 10 neurons since we have 10 classes.

ReLU is the Rectified Linear Unit, defined as:
$$
\text{ReLU}(x)=\begin{cases}0 & \text{if } x\lt 0\\ x & \text{if } x\ge 0\end{cases}
$$

Note that the `Linear` layer of torch operates on the last dimension of its input, so in the `forward` method we have to remember to flatten each image into a 1-D vector of 784 entries.

In [None]:
class Classifier(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = torch.nn.Linear(in_features=28*28, out_features=32, bias=True)
        self.layer2 = torch.nn.Linear(in_features=32, out_features=16, bias=True)
        self.layer3 = torch.nn.Linear(in_features=16, out_features=10, bias=True)
        self.activation = torch.nn.ReLU()
        # alternatively, self.activation = torch.nn.functional.relu

    def forward(self, x):
        x=x.reshape(-1,28*28)
        x=self.activation(self.layer1(x))
        x=self.activation(self.layer2(x))
        x=self.layer3(x) # we don't need any nonlinearity (i.e. softmax) on the output layer. why?
        return x

model=Classifier()

Now we define our optimizer and the loss function that we want to minimize. Since we are doing 10-class classification we now switch to Cross-Entropy.

You can find the details of this loss in torch [here](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html), but what is really important to point out is that the input to this loss is assumed to contain "raw, unnormalized scores for each class". This means that we shouldn't include any softmax in the network architecture, since it is already applied by this implementation of the CE loss.

In [None]:
optimizer=torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
loss=torch.nn.CrossEntropyLoss()
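The fact that the softmax is already part of the loss can be verified numerically. In this sketch, with illustrative random `logits` and `labels`, the cross-entropy computed on raw scores is compared with the negative log-likelihood computed on log-softmax outputs.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)            # raw scores for 4 samples and 10 classes
labels = torch.tensor([3, 0, 9, 1])    # integer class indices

ce = F.cross_entropy(logits, labels)
manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)

print(torch.allclose(ce, manual))
```

Applying a softmax inside the network as well would therefore normalize the scores twice.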

We define a utility function to compute how accurate our classifier is.

Note the `model.eval()` command and the context manager `with torch.no_grad()`. The former tells the network that we are evaluating it, not training it: in this simple example it makes no practical difference, but we will see cases in which it is crucial. The latter switches off gradient tracking: this makes computations lighter and guards against updating the parameters by mistake.

In [None]:
def get_accuracy(model, dataloader):
    model.eval()
    with torch.no_grad():
        correct=0
        dataset_loss=[]
        for x, y in iter(dataloader):
            out=model(x)
            dataset_loss.append(loss(out, y).item())
            correct+=(torch.argmax(out, dim=1)==y).sum().item() # .item() converts the count to a plain Python number
            #correct+=(out.argmax(-1)==y).sum()
        val_loss=sum(dataset_loss)/len(dataloader)
        return correct/len(dataloader.dataset), val_loss

When training, we have to remember to call `model.train()`, the counterpart of `model.eval()`.
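The effect of these switches can be observed on the `training` flag of the module and on the `requires_grad` flag of its outputs. The sketch below uses a single `Linear` layer called `net` as a hypothetical stand-in for a full model.

```python
import torch

net = torch.nn.Linear(3, 1)    # hypothetical stand-in for a full model

net.eval()
with torch.no_grad():
    out_eval = net(torch.rand(2, 3))
print(net.training, out_eval.requires_grad)    # False False: evaluation, no graph is built

net.train()
out_train = net(torch.rand(2, 3))
print(net.training, out_train.requires_grad)   # True True: training, gradients are tracked
```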

In [None]:
epochs=5
losses=[]
val_losses=[]
for epoch in range(epochs):
    acc, val_loss = get_accuracy(model, testloader)
    val_losses.append(val_loss)
    print("Test accuracy: ", acc)
    model.train()
    print("Epoch: ", epoch)
    batch_losses=[]
    for x, y in iter(trainloader):
        out=model(x)
        l=loss(out, y)
        optimizer.zero_grad()
        l.backward()
        optimizer.step()
        batch_losses.append(l.item())
    losses.append(sum(batch_losses)/len(batch_losses))
print("Final accuracy: ", get_accuracy(model, testloader))

plt.figure()
plt.title("MNIST loss per epoch")
plt.plot(losses, label='training')
plt.plot(val_losses, label='test')
plt.xlabel("Epoch")
plt.ylabel("CE Loss")
plt.legend()

# Homework



1. Carefully read the paper [Learning representations by back-propagating errors](https://www.iro.umontreal.ca/~vincentp/ift3395/lectures/backprop_old.pdf).
2. Reproduce experiment 1 in PyTorch (or any other DL library you like). Try to stay as close as possible to the original protocol regarding network architecture, activation function, training algorithm and parameter initialization.
    - Inspect the weights you obtained and check whether they provide a solution to the problem.
    - Compare your solution to the one reported in the paper.
    - Be careful: don't expect to be able to fully reproduce the results of the paper.