<a href="https://colab.research.google.com/github/maurapintor/ai4dev/blob/main/AI4Dev_04_dnns.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to PyTorch

In this notebook, we will look at PyTorch Tensors and introduce all the components used to implement deep learning one by one. 
Remember that the primary source of information, which also describes all the APIs and classes discussed in this notebook (and many more), is the [PyTorch documentation](https://pytorch.org/docs/stable/index.html).

We could technically use NumPy arrays for learning, but they lack several features that deep learning needs.

The PyTorch `Tensor` class adds the following on top of what standard numeric libraries offer:

* GPU support
* Parallel operations on multiple devices or machines
* Tensors keep track of the graph of computations that created them (needed for gradients)

All these features, especially the last one, are of utmost importance when dealing with deep learning!

The APIs for math operations mostly resemble those of NumPy, but only on the surface...

In [None]:
import torch
x = torch.tensor([1., 2.])
w = torch.tensor([2., 2.])

# indexing operations as usual
# the .item() extracts the element if it's only one number
print("first element of w: ", w[0].item())

... behind the scenes, as mentioned above, Tensors also keep track of the operations in the computational graph. This means that if we perform operations with tensors, we can retrieve all the computations from the end node back to the beginning (the leaves). This is needed to compute gradients automatically!
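
As a minimal sketch of this tracking (one possible way to do it; the names `x_demo`, `w_demo` and `f` are chosen here for illustration), we can mark a tensor with `requires_grad=True`, compute a scalar from it, and call `backward()` to walk the graph in reverse:

```python
import torch

# Illustrative values; w_demo is marked as requiring gradients.
x_demo = torch.tensor([1.0, 2.0])
w_demo = torch.tensor([2.0, 2.0], requires_grad=True)

# f = w . x is a scalar, so backward() needs no extra arguments.
f = (w_demo * x_demo).sum()
print(f.grad_fn)  # the graph node that produced f

# Walk the graph from f back to the leaf w_demo.
f.backward()

# The gradient of f with respect to w is x.
print(w_demo.grad)
```

Since `f` is linear in `w_demo`, the stored gradient should coincide with `x_demo`.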

In [None]:
# set tracking gradients to true, this will
# remember all operations performed on w
...

# operations as usual
...

# compute gradients
...

# gradient of f w.r.t. w is x
print("gradient of f w.r.t. w")
print(...)

## Additional information on Tensor storage

**Remember**: the underlying memory is allocated only once, which makes the view operation very lightweight even for large storages. Even assigning the tensor to another variable only copies the reference!

To retrieve the location of a tensor in memory, we can use the `data_ptr()` method:
- https://pytorch.org/docs/stable/generated/torch.Tensor.data_ptr.html#torch-tensor-data-ptr


In [None]:
a = torch.tensor([1, 2, 3, 4])
b = a[0]  # different Tensor, same storage (points to the same location)
c = a.reshape([2, 2])  # same storage, different stride

print("storage of a == storage of c")
print(a.data_ptr() == c.data_ptr())  # same storage

If we modify the tensor c, we also indirectly modify the tensor a!

To create a copy and duplicate the tensor, we can use the `Tensor.clone()` API.

In [None]:
print("let's modify the first element of c")
c[0, 0] = 10

print("as you can see, a changes too, because c is only a view on the same storage")
print("first element of c", c[0, 0])
print("first element of a", a[0])

print("storage of a == storage of c")
print(a.data_ptr() == c.data_ptr())

In [None]:
a = torch.tensor([1, 2, 3, 4])
d = a.clone()

print("storage of a == storage of d")
print(a.data_ptr() == d.data_ptr())

d[0] = 5
print("first element of d", d[0])
print("first element of a", a[0])

To reshape a Tensor, we have two APIs, `reshape` and `view` (mostly interchangeable for what we do here, but they have subtle differences related to the storage).

To find out more:
- https://pytorch.org/docs/stable/generated/torch.reshape.html#torch.reshape
- https://discuss.pytorch.org/t/whats-the-difference-between-torch-reshape-vs-torch-view/159172
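
One of these differences shows up with non-contiguous tensors. The sketch below (illustrative names, not part of the original notebook) transposes a tensor so that its elements are no longer laid out contiguously: `view` never copies and therefore refuses to flatten it, while `reshape` falls back to making a copy.

```python
import torch

base = torch.arange(6).reshape(2, 3)
transposed = base.t()  # same storage as base, but no longer contiguous

# reshape() copies the data when a view is not possible.
flat = transposed.reshape(6)
copied = flat.data_ptr() != transposed.data_ptr()

# view() never copies, so the same request raises an error.
try:
    transposed.view(6)
    view_failed = False
except RuntimeError:
    view_failed = True

print(transposed.is_contiguous(), copied, view_failed)
```

For contiguous tensors, as in the cell that follows, both calls return a view on the same storage.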

In [None]:
a = torch.tensor([1, 2, 3, 4, 5, 6])
b = a.reshape((2, 3))
c = a.view((3, 2))

print("stride of a")
print(a.stride())  # how many storage items to skip for incrementing each dimension

print("stride of b")
print(b.stride())  # how many storage items to skip for incrementing each dimension

print("stride of c")
print(c.stride())  # how many storage items to skip for incrementing each dimension

## Learning with tensors

We can now use PyTorch to learn from tensors, given its ability to track the operations. 
We will start with a simple **regression** problem, where we try to estimate a value given an underlying distribution of points.

We do this by estimating the parameters of a line that passes through the points, so that once we have the value $x$, we can immediately compute $y = f(x) = w^T x + b$.

First, let's create a dataset of points with scikit-learn. We will use one feature, which will be our `x`. We can wrap NumPy arrays into tensors with the `torch.from_numpy()` function.


In [None]:
import torch
from sklearn import datasets

samples, labels = datasets.make_regression(n_samples=1000, n_features=1, noise=2, random_state=42)

samples, labels = torch.from_numpy(samples.ravel()), torch.from_numpy(labels)

# normalization
samples -= samples.min()
samples /= samples.max()

print(samples[:5], labels[:5])


Let's plot the points with `matplotlib.pyplot.scatter`.

In [None]:
import matplotlib.pyplot as plt

...

Let's define a model that computes $f(x)$. For now, let's stick to plain Python functions. Note that if the inputs of the function are tensors, all operations inside it will also be tracked.

In [None]:
def model(x, w, b):
    return ...

We can plot the model with the usual trick of defining a range of values for x and predicting with our model function the values for y. Then, we use `matplotlib.pyplot.plot` to draw the line.

In [None]:
def plot_line(w, b, alpha=1.0):
    x_axis = torch.linspace(0, 1, 100)
    y_axis = model(x_axis, w, b)
    plt.plot(x_axis.detach().numpy(), y_axis.detach().numpy(), color='r', alpha=alpha)

def plot_points(samples, labels):
    # tensors must be on the CPU to be plotted
    plt.scatter(samples.detach().cpu(), labels.detach().cpu())

w, b = torch.tensor([1.0]), torch.tensor([0.0])

plot_line(w, b)
plot_points(samples, labels)


Then, we should write a loss function to perform gradient descent on the parameters (w and b) of our model.
First, let's compute the gradients "by hand".

$$
\displaystyle L(w, b, x) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - (w^T x_i + b)\right)^2
$$

Then, the derivative of the loss w.r.t. $w$ is:

$$
\frac{\partial{L}}{\partial{w}} = -\frac{2}{n} \sum_{i=1}^{n} \left(y_i - (w^T x_i + b)\right) x_i
$$

While the gradient of $L$ w.r.t. $b$ is:

$$
\frac{\partial{L}}{\partial{b}} = -\frac{2}{n} \sum_{i=1}^{n} \left(y_i - (w^T x_i + b)\right)
$$
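
A quick way to sanity-check hand-derived gradients is to compare them with the ones PyTorch computes. The sketch below uses small synthetic data and illustrative names (`x_chk`, `y_chk`, `w_chk`, `b_chk`), not the notebook's dataset. Differentiating the squared residual $y_i - (w^T x_i + b)$ brings out its inner derivative, which is $-x_i$ for $w$ and $-1$ for $b$.

```python
import torch

torch.manual_seed(0)
x_chk = torch.rand(50)
y_chk = 3.0 * x_chk + 1.0  # illustrative noiseless targets
w_chk = torch.tensor(0.5, requires_grad=True)
b_chk = torch.tensor(-0.2, requires_grad=True)

# Mean squared error, as in the loss above.
residual = y_chk - (w_chk * x_chk + b_chk)
loss_chk = (residual ** 2).mean()
loss_chk.backward()

# Hand-derived gradients: factor -x_i for w and -1 for b.
manual_grad_w = (-2 * residual * x_chk).mean().detach()
manual_grad_b = (-2 * residual).mean().detach()

print(torch.allclose(manual_grad_w, w_chk.grad))
print(torch.allclose(manual_grad_b, b_chk.grad))
```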

In [None]:
def loss_fn(y_pred, y_true):
    ...

def grad_l_w(x, w, b, y):
    ...

def grad_l_b(x, w, b, y):
    ...


In [None]:
w, b = torch.tensor([1.0]), torch.tensor([0.0])

print("initial parameters")
print(w, b)

n_steps = 2000
step_size = 0.1

# plot the points
plot_points(samples, labels)

for i in range(n_steps):
    y_pred = model(samples, w, b)
    loss = loss_fn(y_pred, labels)
    if i % 100 == 0:
        # print loss value and plot line with increasing alpha
        print(f"Iteration {i}, loss: {loss.item()}")
        plot_line(w, b, alpha=min(1.0, 0.2 + i/n_steps))
    w = ...
    b = ...

# plot final line
plot_line(w, b)

print("final parameters")
print(w, b)



If we don't want to compute the gradients by hand, we can also leverage the automatic gradient from PyTorch. The way we use it is:

1. we compute the `forward` pass, that constructs the graph from the original tensors to the end of the graph (the loss node).
2. we run the `backward`, that goes through the graph in reverse and accumulates the gradients

The accumulation is a key concept. Since the graphs usually perform the computation for multiple samples at once, the default behavior of PyTorch is to accumulate all gradients in each node unless specified otherwise (by zero-ing operations).

Let's compute the loss and print its value. We will notice an additional item in the output.

In [None]:
# now let's pack the parameters in one single tensor for convenience
params = ...

loss = loss_fn(model(samples, *params), labels)
print(f"loss: {loss}")

loss.backward()

Notice that the printed tensor also contains a reference to a gradient function (`grad_fn`). This is the previous node of the graph, and it is needed for the chain rule. Remember how we compute the derivative of a composite function:

$$
\frac{\partial{L}}{\partial{z}} = \frac{\partial{L}}{\partial{u}} \frac{\partial{u}}{\partial{z}}
$$

Thus, to compute the full gradient of the loss w.r.t. $w$, we have to compute the corresponding $\frac{\partial{u}}{\partial{z}}$ of every operation and multiply by the first gradient encountered in the backward. Then, to get the next, we have to compute again the partial derivative of the next operation w.r.t. the previous node.
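
To make the chain rule concrete, here is a small sketch on a composite function chosen only for illustration, $L = \sin(u)$ with $u = z^2$. The hand-computed product $\frac{\partial L}{\partial u} \frac{\partial u}{\partial z} = \cos(u) \cdot 2z$ should agree with the value PyTorch stores in `z.grad`.

```python
import math
import torch

z = torch.tensor(1.5, requires_grad=True)
u = z ** 2            # inner operation
L_val = torch.sin(u)  # outer operation
print(L_val.grad_fn)  # backward node of the last operation

L_val.backward()

# Chain rule by hand: dL/du = cos(u), du/dz = 2z.
manual_grad = math.cos(1.5 ** 2) * 2 * 1.5
print(z.grad.item(), manual_grad)
```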

This is all handled by PyTorch, thus we can directly access the gradient in the variable where we need it:


In [None]:
print("gradient of params")
print(params.grad)

Remember also that the gradient stays there until we reset it. This means that if we run a forward and a backward pass again, we will find 2*gradient in this variable.
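
The accumulation can be observed in isolation with a toy loss (a hypothetical `toy_loss`, chosen only because its gradient, $2p$, is easy to verify):

```python
import torch

p = torch.tensor([1.0, 0.0], requires_grad=True)

def toy_loss(p):
    # illustrative loss with gradient 2 * p
    return (p ** 2).sum()

toy_loss(p).backward()
first = p.grad.clone()

# Second backward without resetting: the new gradient is added to the old one.
toy_loss(p).backward()
second = p.grad.clone()

print(first, second)
```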


In [None]:
loss = loss_fn(model(samples, *params), labels)
print(f"loss: {loss}")

loss.backward()

print("gradient accumulated after two backwards")
print(params.grad)

If we want to perform gradient descent, in each iteration we need to:

1. compute the gradient
2. update the parameters with the gradient
3. clear the gradient from the node

At the moment, we can use the inplace operation `tensor.zero_()` to set to zero the gradients after use.
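
The three steps above can be sketched on a tiny synthetic problem. This is only one possible arrangement, with illustrative data and names (`x_gd`, `y_gd`, `p_gd`) that differ from the notebook's own variables; the update is done in place inside `torch.no_grad()` so that it is not recorded in the graph.

```python
import torch

torch.manual_seed(0)
x_gd = torch.rand(200)
y_gd = 3.0 * x_gd + 1.0  # illustrative noiseless line: w=3, b=1
p_gd = torch.tensor([0.0, 0.0], requires_grad=True)
learning_rate = 0.5

for step in range(500):
    w_gd, b_gd = p_gd
    loss_gd = ((w_gd * x_gd + b_gd - y_gd) ** 2).mean()
    loss_gd.backward()                      # 1. compute the gradient
    with torch.no_grad():
        p_gd -= learning_rate * p_gd.grad   # 2. update the parameters in place
    p_gd.grad.zero_()                       # 3. clear the gradient

print(p_gd)
```

Updating in place keeps `p_gd` a leaf tensor that still requires gradients, so the same tensor can be reused in the next iteration.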

In [None]:
# remove the gradients for the next computation
...

print("gradients of params after zero-ing")
print(params.grad)

In [None]:
def training_loop(n_epochs, learning_rate, params, x, y):
    plot_points(x, y)
    for epoch in range(n_epochs):
        y_pred = ...
        loss = ...
        # compute backward
        ...
        if epoch % 100 == 0:
            print("Epoch: %d, Loss %f" % (epoch, float(loss)))
            plot_line(*params, alpha=min(1.0, 0.2 + epoch/n_epochs))
        with torch.no_grad():  # this is required to avoid creating nested graphs
            params = ...
        # reset gradients
        ...
    plot_line(*params)
    return params

params = torch.tensor([1.0, 0.0], requires_grad=True)
print("initial parameters: ")
print(params)
trained_params = training_loop(2000, 1e-1, params, samples, labels)
print("final parameters: ")
print(trained_params)

Then, we can also use the optimizers from PyTorch to perform the update of the parameters.
We have to pass to the constructor the set of parameters (even multiple ones) that we need to update. PyTorch will automatically track the gradients on these and use the gradients to update them, according to the "rules" of the optimizer defined. Let's for example use the Adam optimizer: https://pytorch.org/docs/stable/generated/torch.optim.Adam.html
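
As a rough sketch of the optimizer workflow on illustrative synthetic data (the names `x_opt`, `y_opt`, `p_opt` are not from the notebook), the manual update and reset are replaced by `optimizer.step()` and `optimizer.zero_grad()`:

```python
import torch
from torch.optim import Adam

torch.manual_seed(0)
x_opt = torch.rand(200)
y_opt = 3.0 * x_opt + 1.0  # illustrative noiseless line
p_opt = torch.tensor([0.0, 0.0], requires_grad=True)

# The optimizer receives the list of tensors it is responsible for.
optimizer = Adam([p_opt], lr=0.1)

losses = []
for epoch in range(300):
    w_opt, b_opt = p_opt
    loss_opt = ((w_opt * x_opt + b_opt - y_opt) ** 2).mean()
    loss_opt.backward()
    optimizer.step()       # update using the stored gradients
    optimizer.zero_grad()  # reset the gradients for the next iteration
    losses.append(loss_opt.item())

print(losses[0], losses[-1])
```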


In [None]:
from torch.optim import Adam


def training_loop(n_epochs, learning_rate, params, x, y):
    # define the optimizer
    optimizer = ...
    for epoch in range(n_epochs):
        y_pred = model(x, *params)
        loss = loss_fn(y_pred, y)
        loss.backward()
        if epoch % 100 == 0:
            print('Epoch %d, Loss %f' % (epoch, float(loss)))
            plot_line(*params, alpha=min(1.0, 0.2 + epoch/n_epochs))
        # update the parameters
        ...
        # clears the gradients on the parameters
        ...
    return params

params = torch.tensor([1.0, 0.0], requires_grad=True)

print("initial parameters")
print(params)
training_loop(2000, 1e-1, params, samples, labels)

plot_line(*params)
plot_points(samples, labels)
print("final parameters")
print(params)


Now let's also introduce a scheduler that will reduce the step size during training. We can update the step size by calling `scheduler.step()`. 
We will now use a `MultiStepLR`, which takes a list of milestones (epoch indices) at which it decays the step size by a factor `gamma`.
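
To see the decay rule in isolation, the sketch below records the learning rate at the start of each epoch. The milestones `[2, 4]` and `gamma=0.5` are illustrative values, picked small so that the effect is visible in a few iterations.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR

p_sched = torch.tensor([1.0, 0.0], requires_grad=True)
optimizer = SGD([p_sched], lr=0.1)
# illustrative milestones: halve the step size at epochs 2 and 4
scheduler = MultiStepLR(optimizer, milestones=[2, 4], gamma=0.5)

lrs = []
for epoch in range(6):
    lrs.append(scheduler.get_last_lr()[0])  # learning rate used in this epoch
    loss_sched = (p_sched ** 2).sum()
    loss_sched.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()  # called once per epoch, after optimizer.step()

print(lrs)
```

The recorded values should stay at 0.1 for the first two epochs, then halve at each milestone.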

In [None]:
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR


def training_loop(n_epochs, learning_rate, params, x, y):
    optimizer = SGD([params], lr=learning_rate)
    scheduler = ...

    for epoch in range(n_epochs):
        y_pred = model(x, *params)
        loss = loss_fn(y_pred, y)
        loss.backward()
        if epoch % 100 == 0:
            print('Epoch %d, Loss %f' % (epoch, float(loss)))
            plot_line(*params, alpha=min(1.0, 0.2 + epoch/n_epochs))
        # update the parameters
        optimizer.step()
        # clears the gradients on the parameters
        optimizer.zero_grad()
        # updates the learning rate according to the scheduler rule
        ...
    return params

params = torch.tensor([1.0, 0.0], requires_grad=True)

print("initial parameters")
print(params)
# let's start from a bigger learning rate
training_loop(2000, 1e-1, params, samples, labels)

plot_line(*params)
plot_points(samples, labels)
print("final parameters")
print(params)

Let's now also replace the model function with a model from PyTorch. The simple linear model is implemented in the `torch.nn.Linear` class.

(We have to modify the function to plot the model and the samples and labels slightly)
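
Before plugging it into the training loop, it may help to look at what `torch.nn.Linear` computes. In this sketch (illustrative names), the layer expects inputs of shape `(n_samples, n_features)`, which is why the samples need an extra dimension, and its output matches $x W^T + b$ computed by hand.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(1, 1)  # 1 input feature, 1 output
x_lin = torch.rand(5, 1)       # shape (n_samples, n_features)

out_lin = layer(x_lin)
manual_out = x_lin @ layer.weight.T + layer.bias

print(out_lin.shape)
print(torch.allclose(out_lin, manual_out))
```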

In [None]:
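def plot_module(model, alpha=1.0):
    # build the x axis on the same device as the model parameters
    device = next(model.parameters()).device
    x_axis = torch.linspace(0, 1, 100, device=device).unsqueeze(1)
    y_axis = model(x_axis)
    # move to the CPU before converting to NumPy for plotting
    plt.plot(x_axis.detach().cpu().numpy(), y_axis.detach().cpu().numpy(), color='r', alpha=alpha)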

def training_loop_module(n_epochs, learning_rate, model, x, y):  # changed params to model
    x, y = x.float().unsqueeze(1), y.float().unsqueeze(1)
    optimizer = Adam(...)  # changed [params] to model.parameters()
    scheduler = MultiStepLR(optimizer, milestones=[100, 500, 1000], gamma=0.5)
    criterion = ...

    plot_points(x, y)

    for epoch in range(n_epochs):
        y_pred = model(x)  # changed line
        loss = criterion(y_pred, y)
        loss.backward()
        if epoch % 100 == 0:
            print('Epoch %d, Loss %f' % (epoch, float(loss)))
            plot_module(model, alpha=min(1.0, 0.2 + epoch/n_epochs))
        # update the parameters
        optimizer.step()
        # clears the gradients on the parameters
        optimizer.zero_grad()
        # updates the learning rate according to the scheduler rule
        scheduler.step()
    
    plot_module(model)

    return model

linear_model = torch.nn.Linear(1, 1)

print("initial parameters")
print(linear_model.weight.item(), linear_model.bias.item())

training_loop_module(2000, 1.0, linear_model, samples, labels)

plot_module(linear_model)
print("final parameters")
print(linear_model.weight.item(), linear_model.bias.item())

And then, we can replace the linear layer with a composite (non-linear) PyTorch model. For now, let's use the `torch.nn.Sequential`, that allows us to list the sequence of layers and activations, that will be applied in cascade.
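
As a small illustration of how `torch.nn.Sequential` chains its arguments (the layer sizes below are arbitrary and kept tiny on purpose), each module receives the output of the previous one, and the parameters of all layers are collected by `parameters()`:

```python
import torch

mlp = torch.nn.Sequential(
    torch.nn.Linear(1, 8),  # 8 weights + 8 biases
    torch.nn.ReLU(),        # activation, no parameters
    torch.nn.Linear(8, 1),  # 8 weights + 1 bias
)

out_mlp = mlp(torch.rand(5, 1))
n_params = sum(p.numel() for p in mlp.parameters())

print(out_mlp.shape, n_params)
```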

In [None]:
dnn = torch.nn.Sequential(
    ...
)

training_loop_module(2000, 1e-3, dnn, samples, labels)

The advantage of having a non-linear model is that now we can also perform regression on non-linear distributions of samples.
Let's modify the y so that the function mapping between x and y is not linear anymore, and let's visualize the new distribution.

In [None]:
# create modified ys
y_new = ...
...

plot_points(samples, y_new)

In [None]:
y_new = labels.clone()
y_new[samples>0.5] *= -2

dnn = torch.nn.Sequential(
    torch.nn.Linear(1, 100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 1),
)

training_loop_module(1000, 1e-2, dnn, samples, y_new)

plot_module(dnn)

Finally, let's see how neural networks are implemented in PyTorch.
We write a class that inherits from `torch.nn.Module`, which is the base of all differentiable modules in the library (including the losses). 
In the `__init__` we define the layers, and all parameters will be initialized automatically (also with better distributions than setting them manually). Additionally, all parameters initialized here will be accessible by using `model.parameters()` and inspected recursively. 

The other thing to implement here is the method `forward`, which defines all the operations performed on the input. When we want the output from our model, we call `model(x)`, which calls the `forward` method along with other operations before and after it. These are skipped if you call `model.forward(x)` directly, so as a rule, always use the `__call__` method, i.e. `model(x)`.
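
The difference between `model(x)` and `model.forward(x)` can be observed with a forward hook, which is one of the operations that `__call__` runs around `forward`. The class `TinyRegressor` below is a hypothetical example with arbitrary layer sizes, not the network used in this notebook.

```python
import torch

class TinyRegressor(torch.nn.Module):  # hypothetical example class
    def __init__(self, hidden=8):
        super().__init__()
        self.fc1 = torch.nn.Linear(1, hidden)
        self.fc2 = torch.nn.Linear(hidden, 1)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

net = TinyRegressor()

# Record every call that goes through __call__.
calls = []
net.register_forward_hook(lambda module, inputs, output: calls.append(output.shape))

x_mod = torch.rand(4, 1)
net(x_mod)          # goes through __call__, so the hook fires
net.forward(x_mod)  # bypasses __call__, so the hook is skipped

# Layers assigned in __init__ are registered automatically.
names = [name for name, _ in net.named_parameters()]
print(len(calls), names)
```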

In [None]:
class NeuralNetwork(torch.nn.Module):
    def __init__(self):
        """Defines the layers"""
        super().__init__()
        ...

    def forward(self, x):
        """Defines the operations applied to x"""
        ...

dnn_2 = NeuralNetwork()
training_loop_module(1000, 1e-2, dnn_2, samples, y_new)

plot_module(dnn_2)

## Using GPUs

We can move the computation to a Graphics Processing Unit (GPU) with a few simple lines. Note that if we run any operation on a GPU, all the tensors involved must be on that device. Thus, if we want to use our model on the GPU, we must also move the samples and labels. The rest of the code remains the same.
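
A device-agnostic pattern, sketched here with a stand-in `torch.nn.Linear` model and random inputs (both illustrative), is to pick the device once, move the model and the data to it, and bring results back to the CPU before converting them to NumPy:

```python
import torch

# Falls back to the CPU when no GPU is available.
device_name = "cuda" if torch.cuda.is_available() else "cpu"

toy_model = torch.nn.Linear(1, 1).to(device_name)
x_dev = torch.rand(5, 1).to(device_name)  # inputs must be on the same device

with torch.no_grad():
    out_dev = toy_model(x_dev)

# NumPy only works with CPU memory, so move the result back first.
out_np = out_dev.cpu().numpy()
print(out_dev.device, out_np.shape)
```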

In [None]:
dnn_3 = NeuralNetwork()

device = "cuda" if torch.cuda.is_available() else "cpu"
dnn_3.to(device)
samples, labels, y_new = samples.to(device), labels.to(device), y_new.to(device)

training_loop_module(1000, 1e-2, dnn_3, samples, y_new)
plot_module(dnn_3)


## Saving and loading tensors

Until now, we created tensors only in RAM. At some point, we will want to store a tensor in persistent storage. PyTorch uses `pickle` to **serialize** the tensors. Here is how to store a tensor on disk.

```
torch.save(a, 'tensor.pth')  # note that the extension is arbitrary
```

And to load back the tensor, a similar API is available.

```
b = torch.load('tensor.pth')
```

The `torch.save` API uses `pickle` to write the Python object to disk. In theory, we could issue `torch.save(net, path)` to store the whole model object. However, this has some issues. 

If we save the model directly, we risk problems when reloading it: pickle does not store the model class itself, only a reference to the file that defines it, so loading can break when the code is moved or refactored. You can find the issue well described in the [PyTorch documentation](https://pytorch.org/tutorials/beginner/saving_loading_models.html#save-load-entire-model). The recommended way is to keep the model code in a `.py` file, and use `torch.save` to store only the parameters (the `state_dict`) in a file. Here are the steps for saving and loading a model.
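
To check that a round trip through the file system preserves the parameters, here is a small sketch that uses a stand-in `torch.nn.Linear` model and a temporary directory (the file name `linear_demo.pt` is arbitrary):

```python
import os
import tempfile
import torch

source = torch.nn.Linear(1, 1)
# The state_dict maps parameter names to tensors.
print(list(source.state_dict().keys()))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "linear_demo.pt")
    torch.save(source.state_dict(), path)

    # A fresh instance has different random weights until we load the file.
    restored = torch.nn.Linear(1, 1)
    restored.load_state_dict(torch.load(path))

same = all(
    torch.equal(a, b)
    for a, b in zip(source.parameters(), restored.parameters())
)
print(same)
```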

In [None]:
model_path = 'cifar_model.pt'
torch.save(dnn_3.state_dict(), model_path)

new_model = NeuralNetwork()
new_model.load_state_dict(torch.load(model_path))

plot_points(samples, y_new)
plot_module(new_model)  # plot the reloaded model to check the saved parameters

Notebook with complete code:

<a href="https://colab.research.google.com/github/battistabiggio/ai4dev/blob/main/notebooks/AI4Dev_04_dnns.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>