# Learning rate

**Note:** to use this notebook in Google Colab, create a new cell with
the following line and run it.

``` shell
!pip install git+https://gitlab.in2p3.fr/jbarnier/ateliers_deep_learning.git
```

In [None]:
import plotnine as pn
import torch

from adl.sklearn import skl_regression

pn.theme_set(pn.theme_minimal() + pn.theme(plot_background=pn.element_rect(fill="white")))

In this notebook we will take a look at the effect of the step size, or
learning rate, on the training process. For this we will use a very
simple linear regression example with only one parameter (the slope of
the regression line).

## Data

We start with some random input and target data.

In [None]:
# Input data
x_values = [2.1, 3.4, 1.8, 5.9, 8.3, 9.1, 2.4, 5.6, 7.8]
# Target data
y_values = [4.0, 6.2, 3.0, 13.2, 17.8, 17.9, 5.5, 11.7, 14.3]

We want to predict $y$ values from $x$ with a model of the form
$y = x \times w$, *i.e.* a linear regression without intercept. Our
model has only one parameter, the **weight** $w$, which represents the
slope of the regression line. Our goal is to estimate the optimal value
of $w$ from the data: this optimal value is the one that minimizes the
mean squared error, *i.e.* the mean of the squared differences between
predictions and targets.
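To make the definition of the loss concrete, here is a minimal sketch in
plain Python that computes the mean squared error of the model
$y = x \times w$ for a few candidate values of $w$. The function name
`mse` is an illustrative choice and is not part of the `adl` package.

```python
# Data copied from the cell above
x_values = [2.1, 3.4, 1.8, 5.9, 8.3, 9.1, 2.4, 5.6, 7.8]
y_values = [4.0, 6.2, 3.0, 13.2, 17.8, 17.9, 5.5, 11.7, 14.3]


def mse(w, xs, ys):
    """Mean squared error of the model y = x * w on the given data."""
    squared_errors = [(x * w - y) ** 2 for x, y in zip(xs, ys)]
    return sum(squared_errors) / len(squared_errors)


# The loss is large far from the optimum and small close to it
for w in (0.0, 1.0, 2.0, 3.0):
    print(f"w = {w:.1f}  ->  mse = {mse(w, x_values, y_values):.3f}")
```

Among these candidates, the loss should be lowest for $w = 2$, which
suggests that the optimal slope lies close to that value.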

## Regression with scikit-learn

As a reference, we can compute the optimal $w$ value by doing a simple
linear regression without intercept, for example with `scikit-learn`. We
will use a predefined custom function to do this, and display both the
best $w$ value and the associated minimum loss value (here the loss is
the mean squared error).

In [None]:
reg = skl_regression(x_values, y_values, fit_intercept=False)
print(f"slope: {reg['slope']:.2f}, mse: {reg['mse']:.3f}")
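As a cross-check that does not rely on `skl_regression`, the optimal
slope of a regression without intercept has a closed form,
$w^* = \sum_i x_i y_i / \sum_i x_i^2$. The sketch below computes it in
plain Python. The variable names `w_best` and `mse_best` are
illustrative and are not part of the `adl` package; the printed values
should agree with the scikit-learn result up to rounding.

```python
# Data copied from the notebook
x_values = [2.1, 3.4, 1.8, 5.9, 8.3, 9.1, 2.4, 5.6, 7.8]
y_values = [4.0, 6.2, 3.0, 13.2, 17.8, 17.9, 5.5, 11.7, 14.3]

# Closed-form least-squares slope for a regression through the origin
sum_xy = sum(x * y for x, y in zip(x_values, y_values))
sum_xx = sum(x * x for x in x_values)
w_best = sum_xy / sum_xx

# Mean squared error reached at the optimal slope
mse_best = sum(
    (x * w_best - y) ** 2 for x, y in zip(x_values, y_values)
) / len(x_values)

print(f"slope: {w_best:.2f}, mse: {mse_best:.3f}")
```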

We can plot both our data points and the estimated regression line.

In [None]:
(
    pn.ggplot(mapping=pn.aes(x=x_values, y=y_values))
    + pn.geom_abline(slope=reg["slope"], intercept=0, color="orchid")
    + pn.geom_point(fill="yellowgreen", color="white", size=4)
    + pn.coord_cartesian(xlim=(0, 10), ylim=(0, 20))
)

## Regression with pytorch

We have seen in the previous notebook that we can do the same
computation (finding the value of $w$ that minimizes the mean squared
error) with pytorch. In this case, instead of computing $w$ directly we
will approximate its value with gradient descent.

**Exercise 1**

Using pytorch, write and run the code implementing the training process
to find the value of $w$ which minimizes the mean squared error loss
between true and predicted values.

Run a training process with a step size (or learning rate) of 0.001 for
10 epochs and print the `w` value at each epoch.

## Effect of step size (learning rate)

For convenience, from now on we will use a predefined function that
runs the training process while keeping track of the loss, gradient and
weight values at each training step, so that we can easily compare the
results for different step size values.

Here are the results with a step size of 0.001. The `new_w` column
should have the same values as the output of the code you wrote for the
exercise.

In [None]:
from adl import model_1p

# Convert x and y data to tensors
x = torch.tensor(x_values)
y = torch.tensor(y_values)

model_1p.train(x, y, step_size=0.001, epochs=10)

We can see that the weight $w$ evolves towards the optimum value while
the loss goes down, but the training is quite slow and the optimum is
not reached after 10 iterations.

With a larger step size of 0.01, the optimal $w$ value and the
associated minimal loss are reached after only a few training steps.

In [None]:
model_1p.train(x, y, step_size=0.01, epochs=8)

With an even larger step size of 0.1, the result is completely
different. The loss, instead of going down, is increasing at each step.
Accordingly, the weight value goes farther and farther from the optimal
one.

Our model is *diverging*, and adding more training steps would only make
$w$ go farther from its optimal value.

In [None]:
model_1p.train(x, y, step_size=0.1, epochs=8)
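The three behaviours above can be reproduced without `model_1p`. The
following is a minimal sketch in plain Python, assuming that training
starts from $w = 0$ and that each epoch performs one gradient descent
update on the mean squared error over the full dataset. The helper
names (`mse`, `gradient`, `gradient_descent`) are hypothetical and do
not describe how the `adl` package is implemented.

```python
# Data copied from the notebook
x_values = [2.1, 3.4, 1.8, 5.9, 8.3, 9.1, 2.4, 5.6, 7.8]
y_values = [4.0, 6.2, 3.0, 13.2, 17.8, 17.9, 5.5, 11.7, 14.3]
n = len(x_values)


def mse(w):
    """Mean squared error of the model y = x * w."""
    return sum((x * w - y) ** 2 for x, y in zip(x_values, y_values)) / n


def gradient(w):
    """Derivative of the mean squared error with respect to w."""
    return 2 * sum(x * (x * w - y) for x, y in zip(x_values, y_values)) / n


def gradient_descent(step_size, epochs, w=0.0):
    """Run gradient descent and return the (w, loss) pair after each update."""
    history = []
    for _ in range(epochs):
        w = w - step_size * gradient(w)  # move against the gradient
        history.append((w, mse(w)))
    return history


# Compare the final state for the three step sizes discussed above
for step_size in (0.001, 0.01, 0.1):
    w_last, loss_last = gradient_descent(step_size, epochs=8)[-1]
    print(f"step_size = {step_size}: w = {w_last:.4f}, loss = {loss_last:.4f}")
```

Under these assumptions, the smallest step size is still far from the
optimum after 8 epochs, the intermediate one is very close to it, and
the largest one produces a loss that grows at every update.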

## Graphical representations

The following plot shows the value of the loss function for $w$ values
ranging from -1 to 4. We can see that the loss is minimal when $w$ is
around 2.

In [None]:
model_1p.plot_loss(x, y, wmin=-1, wmax=4, gradient=False)

The next plot is an attempt at visualizing the *gradient* value of the
loss at different $w$ values. The direction of the red arrow at a given
point depends on the sign of the gradient at this point, and it
indicates the “direction” in which we should go if we want the loss to
go up. So, at a given point of the curve, if we want to minimize the
loss value we have to modify $w$ in the direction *opposite* to that of
the arrow.

The length of the arrow is proportional to the absolute value of the
gradient. It represents how strongly the loss changes in the gradient
direction: if the arrow is long, then moving $w$ a bit in this direction
will lead to a larger increase of the loss. If it is short, it will lead
to a smaller increase.

In [None]:
model_1p.plot_loss(x, y, wmin=-1, wmax=4)
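The sign and size of the arrows can also be checked numerically. For
the mean squared error of the model $y = x \times w$, the gradient with
respect to $w$ is $\frac{2}{n} \sum_i x_i (x_i w - y_i)$. The sketch
below evaluates it at a few values of $w$; the function name `gradient`
is illustrative and is not taken from the `adl` package.

```python
# Data copied from the notebook
x_values = [2.1, 3.4, 1.8, 5.9, 8.3, 9.1, 2.4, 5.6, 7.8]
y_values = [4.0, 6.2, 3.0, 13.2, 17.8, 17.9, 5.5, 11.7, 14.3]


def gradient(w):
    """Derivative of the mean squared error with respect to w."""
    n = len(x_values)
    return 2 * sum(x * (x * w - y) for x, y in zip(x_values, y_values)) / n


for w in (-1.0, 1.0, 3.0, 4.0):
    g = gradient(w)
    # A positive gradient means the loss goes up when w increases
    uphill = "towards larger w" if g > 0 else "towards smaller w"
    print(f"w = {w:+.1f}: gradient = {g:+8.2f}, loss increases {uphill}")
```

The gradient is negative to the left of the minimum and positive to its
right, and its absolute value grows as $w$ moves away from the minimum,
which corresponds to longer arrows on the plot.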

We can also try to visualize the *training process*.

The following plot shows the values of $w$ at each step of a training
process starting from $w=0$ and running for 30 epochs with a step size
of 0.001. We see that at each step $w$ follows the loss function curve
towards its minimum, even though the minimum is not reached after 30
epochs. We can also see that the “move” of the $w$ value gets smaller
and smaller at each epoch.

In [None]:
model_1p.plot_train(x, y, step_size=0.001, epochs=30, wmin=-2, wmax=4.5)

If we increase the step size to 0.01, the training process is much
faster: $w$ moves more rapidly towards its optimum, and the minimum is
reached after a few epochs.

In [None]:
model_1p.plot_train(x, y, step_size=0.01, epochs=10, wmin=-2, wmax=4.5)

With a learning rate of 0.025, the training process still works, but a
bit differently: $w$ moves even “faster”, but by doing so it
“overshoots” and goes beyond the minimum. The process nevertheless
converges because, while going from one side of the optimum to the
other, it manages to get closer at each step.

In [None]:
model_1p.plot_train(x, y, step_size=0.025, epochs=8, wmin=-2, wmax=4.5)

If we increase the learning rate a bit more, we can get to a situation
where the training process seems almost stalled: the $w$ value goes from
one side of the minimal value to the other, while barely progressing
towards it even after 20 epochs.

In [None]:
model_1p.plot_train(x, y, step_size=0.02963, epochs=20, wmin=-2, wmax=4.5)

Finally, with an even higher learning rate, we reach a point where $w$
“moves too much” and the process diverges: at each step the loss becomes
higher and $w$ moves farther from its optimum instead of closer to it.

In [None]:
model_1p.plot_train(x, y, step_size=0.031, epochs=8, wmin=-2, wmax=4.5)
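The different regimes seen in these plots can be explained with a short
computation. Because the loss of this one-parameter model is a quadratic
function of $w$, one gradient descent update multiplies the distance
between $w$ and its optimum by the constant factor
$1 - 2 \eta \, \overline{x^2}$, where $\eta$ is the step size and
$\overline{x^2}$ is the mean of the squared inputs. The sketch below
computes this factor for the step sizes used above. The symbol $\eta$
and the names `curvature`, `critical_step` and `factor` are introduced
here for illustration, and the reasoning holds for this specific
quadratic loss, not for gradient descent in general.

```python
# Input data copied from the notebook (the targets are not needed here)
x_values = [2.1, 3.4, 1.8, 5.9, 8.3, 9.1, 2.4, 5.6, 7.8]

# Second derivative of the mean squared error with respect to w
mean_x2 = sum(x * x for x in x_values) / len(x_values)
curvature = 2 * mean_x2

# Step size above which each update increases the distance to the optimum
critical_step = 2 / curvature
print(f"critical step size: {critical_step:.5f}")


def factor(step_size):
    """Multiplier applied to (w - optimum) by one gradient descent update."""
    return 1 - step_size * curvature


for step_size in (0.001, 0.01, 0.025, 0.02963, 0.031):
    f = factor(step_size)
    if abs(f) >= 1:
        behaviour = "diverges"
    elif f < 0:
        behaviour = "converges, overshooting at each step"
    else:
        behaviour = "converges without overshooting"
    print(f"step_size = {step_size}: factor = {f:+.4f} ({behaviour})")
```

A factor between 0 and 1 gives a steady approach, a factor between -1
and 0 gives the oscillating convergence seen with 0.025, a factor just
above -1 gives the near stall seen with 0.02963, and a factor below -1
gives the divergence seen with 0.031.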