# Data normalization

**Note:** to use this notebook in Google Colab, create a new cell with
the following line and run it.

``` shell
!pip install git+https://gitlab.in2p3.fr/jbarnier/ateliers_deep_learning.git
```

In [None]:
import plotnine as pn
import polars as pl
import torch

from adl import model_2p
from adl.sklearn import skl_regression

pl.Config(tbl_rows=30, float_precision=2)
pn.theme_set(pn.theme_minimal() + pn.theme(plot_background=pn.element_rect(fill="white")))


In this notebook we will take a look at a slightly more complicated
model with two parameters, and at the effect of data normalization on
the training process.

## Data

We first create a small nonsensical dataset with two numerical
variables: the temperature and the quantity of ice cream sold at a shop.

In [None]:
temperature = [-1.5, 0.2, 3.4, 4.1, 7.8, 13.4, 18.0, 21.5, 32.0, 33.5]
icecream = [100.5, 110.2, 133.5, 141.2, 172.8, 225.1, 251.0, 278.9, 366.7, 369.9]

In [None]:
(
    pn.ggplot(mapping=pn.aes(x=temperature, y=icecream))
    + pn.geom_hline(yintercept=0, linetype="dotted")
    + pn.geom_vline(xintercept=0, linetype="dotted")
    + pn.geom_point(color="white", fill="yellowgreen", size=4)
    + pn.labs(x="temperature", y="icecream")
)

This time we will try to predict the `icecream` values from the
`temperature` values with a simple linear model with both a slope and an
intercept: $y = w \times x + b$. Our model now has two parameters, a
**weight** $w$, and a **bias** $b$.

## Regression with scikit-learn

As a reference, we first compute the “real” optimal slope and intercept
values with `scikit-learn`.

In [None]:
reg = skl_regression(x=temperature, y=icecream, fit_intercept=True)
print(f"slope: {reg['slope']:.2f}, intercept: {reg['intercept']:.2f}, mse: {reg['mse']:.4f}")
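
For a single input variable, the least-squares optimum also has a simple
closed form. The sketch below is not the implementation of
`skl_regression` (a helper of this workshop whose internals are not shown
here); it only assumes that the helper fits an ordinary least-squares
line. The names `least_squares` and `mse` are illustrative.

```python
# Closed-form least-squares fit of y = w * x + b, in pure Python.
temperature = [-1.5, 0.2, 3.4, 4.1, 7.8, 13.4, 18.0, 21.5, 32.0, 33.5]
icecream = [100.5, 110.2, 133.5, 141.2, 172.8, 225.1, 251.0, 278.9, 366.7, 369.9]


def least_squares(xs, ys):
    """Return (slope, intercept) minimizing the mean squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
    s_xx = sum((xi - mean_x) ** 2 for xi in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mse(xs, ys, w, b):
    """Mean squared error of the line y = w * x + b."""
    return sum((w * xi + b - yi) ** 2 for xi, yi in zip(xs, ys)) / len(xs)


slope, intercept = least_squares(temperature, icecream)
best_mse = mse(temperature, icecream, slope, intercept)
print(f"slope: {slope:.2f}, intercept: {intercept:.2f}, mse: {best_mse:.4f}")
```

These values should agree with the `scikit-learn` ones up to rounding.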

In [None]:
(
    pn.ggplot(mapping=pn.aes(x=temperature, y=icecream))
    + pn.geom_hline(yintercept=0, linetype="dotted")
    + pn.geom_vline(xintercept=0, linetype="dotted")
    + pn.geom_abline(slope=reg["slope"], intercept=reg["intercept"], color="orchid")
    + pn.geom_point(color="white", fill="yellowgreen", size=4)
    + pn.labs(x="temperature", y="icecream")
)

## Regression with pytorch

As we did in the previous notebook, we can do the same computation using
pytorch to search for the $w$ and $b$ values that would minimize the
mean squared error of our model.

**Exercise**

Using pytorch, write and run the code implementing the training process
to find the values of $w$ and $b$ which minimize the mean squared error
loss between true and predicted values.

The code will be quite similar to the one in the previous notebook,
except that we now have two parameters to adjust at each step.

1.  create the input, target and parameter tensors
2.  create a `forward()` function which applies our model to input data
    passed as argument
3.  create a loss function using one of pytorch's predefined methods
4.  implement a training process using a `for` loop

Run this training process for 20 epochs with a step size of 0.001. Print
the epoch, loss, $w$ and $b$ values at each step.
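
One possible solution is sketched below. It assumes plain full-batch
gradient descent, `torch.nn.MSELoss` as the loss, and both parameters
initialized at 0.0; the variable names (`loss_fn`, `losses`, ...) are
illustrative choices, not something imposed by the exercise.

```python
import torch

temperature = [-1.5, 0.2, 3.4, 4.1, 7.8, 13.4, 18.0, 21.5, 32.0, 33.5]
icecream = [100.5, 110.2, 133.5, 141.2, 172.8, 225.1, 251.0, 278.9, 366.7, 369.9]

# 1. Input, target and parameter tensors
x = torch.tensor(temperature)
y = torch.tensor(icecream)
w = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)


# 2. Forward pass: apply the linear model to the input
def forward(x):
    return w * x + b


# 3. Predefined mean squared error loss
loss_fn = torch.nn.MSELoss()

# 4. Training loop
step_size = 0.001
epochs = 20
losses = []
for epoch in range(epochs):
    y_pred = forward(x)
    loss = loss_fn(y_pred, y)
    # Compute the gradients of the loss with respect to w and b
    loss.backward()
    # Update both parameters without tracking these operations
    with torch.no_grad():
        w -= step_size * w.grad
        b -= step_size * b.grad
    # Reset the gradients, otherwise they accumulate across epochs
    w.grad.zero_()
    b.grad.zero_()
    losses.append(loss.item())
    print(f"epoch: {epoch + 1}, loss: {loss.item():.2f}, w: {w.item():.3f}, b: {b.item():.3f}")
```

Note that the printed loss is the one computed before the parameter
update of the same epoch.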

From now on we will use a predefined function for our training process,
which keeps track of the loss, gradient and parameter values at each
training step.

With a step size of 0.001, we see that the weight of our model (the
slope of the regression line) goes up in the first epochs, then starts
to go down very slowly. The bias goes up, but also very slowly.

In [None]:
# Convert x and y data to tensors
x = torch.tensor(temperature)
y = torch.tensor(icecream)

train_params = {"x": x, "y": y, "w_init": 0.0, "b_init": 0.0}
model_2p.train(step_size=0.001, epochs=20, **train_params)

If we increase the step size a bit to 0.002, the loss goes down a bit
faster, but the weight oscillates around the optimal value during the
first epochs. The bias still goes up very slowly.

In [None]:
model_2p.train(step_size=0.002, epochs=10, **train_params)

If we increase the step size to 0.003, the loss goes down a bit more
slowly and regularly, but the weight value oscillates greatly around its
optimum value.

In [None]:
model_2p.train(step_size=0.003, epochs=10, **train_params)

If we increase the step size again to 0.004, the loss no longer goes
down and the training process diverges.

In [None]:
model_2p.train(step_size=0.004, epochs=10, **train_params)
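
The divergence threshold can be estimated directly from the data. With a
mean squared error loss, the loss is a quadratic function of $w$ and $b$
whose Hessian is $2 \times \begin{pmatrix} \overline{x^2} & \bar{x} \\ \bar{x} & 1 \end{pmatrix}$,
and gradient descent with a fixed step size only converges if this step
size is below $2 / \lambda_{max}$, where $\lambda_{max}$ is the largest
eigenvalue of the Hessian. The sketch below assumes that `model_2p.train`
performs plain full-batch gradient descent on this loss;
`critical_step_size` is a hypothetical helper name.

```python
import math

temperature = [-1.5, 0.2, 3.4, 4.1, 7.8, 13.4, 18.0, 21.5, 32.0, 33.5]


def critical_step_size(xs):
    """Largest step size for which plain gradient descent on the MSE still converges."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_x2 = sum(xi * xi for xi in xs) / n
    # Hessian of the MSE with respect to (w, b): [[a, c], [c, d]]
    a, c, d = 2 * mean_x2, 2 * mean_x, 2.0
    # Largest eigenvalue of a symmetric 2x2 matrix (closed form)
    lambda_max = (a + d) / 2 + math.sqrt(((a - d) / 2) ** 2 + c**2)
    return 2 / lambda_max


threshold = critical_step_size(temperature)
print(f"critical step size: {threshold:.5f}")
```

For the raw temperatures this threshold falls between 0.003 and 0.004,
which is consistent with the runs above: a step size of 0.003 still
converges while 0.004 diverges.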

### Graphical representations

To understand why the training process doesn’t seem able to reach the
optimum weight and bias values, we can represent the loss graphically.

In the plot below, the space of possible values for weight $w$ and bias
$b$ is divided into a grid. At each grid point, the loss value is
represented as a circle with a varying radius. The gradient of the loss
function at each point is represented as a red arrow: its orientation
gives the “direction” in which the parameters must be modified for the
loss value to increase as much as possible, and its length is
proportional to the magnitude of this increase. Thus, if we want our
loss value to decrease, we must move in the direction opposite to these
arrows.

The blue dot in the center marks the optimal parameter values, *i.e.*
the values of $w$ and $b$ for which the loss is minimal.

In [None]:
graphic_params = {
    "x": x,
    "y": y,
    "true_weight": reg["slope"],
    "true_bias": reg["intercept"],
    "grad_scale": 6000,
}
model_2p.plot_loss(**graphic_params)

We can see that the gradients are almost all “horizontal”. This is due
to the fact that our two parameters do not have the same scale: a
variation of 1 on $w$ (the slope) has a much larger effect on the loss
value than a variation of 1 on $b$ (the intercept).
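
This difference in scale can be checked numerically. For the mean
squared error, the partial derivatives are
$\frac{\partial L}{\partial w} = \frac{2}{n} \sum x_i (w x_i + b - y_i)$ and
$\frac{\partial L}{\partial b} = \frac{2}{n} \sum (w x_i + b - y_i)$. The
sketch below evaluates them at $w = 0$, $b = 0$; `mse_gradient` is an
illustrative helper, not part of the `adl` package.

```python
temperature = [-1.5, 0.2, 3.4, 4.1, 7.8, 13.4, 18.0, 21.5, 32.0, 33.5]
icecream = [100.5, 110.2, 133.5, 141.2, 172.8, 225.1, 251.0, 278.9, 366.7, 369.9]


def mse_gradient(xs, ys, w, b):
    """Gradient of the mean squared error with respect to (w, b)."""
    n = len(xs)
    residuals = [w * xi + b - yi for xi, yi in zip(xs, ys)]
    # Each residual is weighted by its x value in the w component only
    grad_w = 2 * sum(r * xi for r, xi in zip(residuals, xs)) / n
    grad_b = 2 * sum(residuals) / n
    return grad_w, grad_b


grad_w, grad_b = mse_gradient(temperature, icecream, w=0.0, b=0.0)
print(f"dL/dw: {grad_w:.1f}, dL/db: {grad_b:.1f}, ratio: {grad_w / grad_b:.1f}")
```

Because each residual is multiplied by a temperature value in the $w$
component, that component is more than ten times larger than the $b$
component here, so a gradient step mostly moves $w$.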

We can try to visualise what this means for the training process.

In the next plot, we represent a training process of 10 epochs with a
step size of 0.001 starting at $w = 2$ and $b = 50$.

In [None]:
graphic_params.update({"w_init": 2.0, "b_init": 50.0})
model_2p.plot_train(
    step_size=0.001,
    epochs=10,
    **graphic_params,
)

We see that the gradient descent seems to go only horizontally, slowing
down rapidly after the first epochs.

If we increase the number of epochs, we see that after a while going
horizontally, the gradient descent starts to “turn” into the direction
of the optimum value (but still very slowly).

In [None]:
model_2p.plot_train(
    step_size=0.001,
    epochs=200,
    **graphic_params,
)

We have to increase the number of epochs a lot to see the training
process getting very close to the optimum value.

In [None]:
model_2p.plot_train(
    step_size=0.001,
    epochs=3000,
    **graphic_params,
)

If we increase the step size to 0.003, we can see that the horizontal
gradient descent is more “chaotic”. However the training process gets
close to the optimum a bit faster.

In [None]:
model_2p.plot_train(
    step_size=0.003,
    epochs=1000,
    **graphic_params,
)

Finally, if we increase the step size further to 0.004, we see that the
training process immediately starts to diverge from the optimum.

In [None]:
model_2p.plot_train(
    step_size=0.004,
    epochs=10,
    **graphic_params,
)

## Regression with pytorch on transformed data

One way to improve our training process is to transform our original
data so that weight and bias will be on a more similar “scale”.

### Normalized data

First we will try to normalize the temperature values so that they lie
between 0 and 1, by applying scikit-learn’s `preprocessing.minmax_scale`.

In [None]:
from sklearn import preprocessing

temp_n = preprocessing.minmax_scale(temperature)  # type: ignore
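
Min-max normalization maps each value $x$ to
$(x - x_{min}) / (x_{max} - x_{min})$. A minimal pure Python equivalent
for a one-dimensional list is sketched below; `minmax_scale_manual` is an
illustrative name, and the sketch assumes that the values are not all
identical (otherwise the denominator is zero).

```python
temperature = [-1.5, 0.2, 3.4, 4.1, 7.8, 13.4, 18.0, 21.5, 32.0, 33.5]


def minmax_scale_manual(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lowest, highest = min(values), max(values)
    span = highest - lowest  # assumed to be non-zero
    return [(v - lowest) / span for v in values]


temp_n_manual = minmax_scale_manual(temperature)
print([round(v, 3) for v in temp_n_manual])
```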


In [None]:
(
    pn.ggplot(mapping=pn.aes(x=temp_n, y=icecream))
    + pn.geom_hline(yintercept=0, linetype="dotted")
    + pn.geom_vline(xintercept=0, linetype="dotted")
    + pn.geom_point(color="white", fill="yellowgreen", size=4)
    + pn.labs(x="temp_n", y="icecream")
)

We can compute the new optimum weight and bias values with
`scikit-learn`.

In [None]:
reg_n = skl_regression(temp_n, icecream)
print(f"slope: {reg_n['slope']:.2f}, intercept: {reg_n['intercept']:.2f}")
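
The slope and intercept obtained on normalized data are expressed in
normalized units. Since $x_n = (x - x_{min}) / (x_{max} - x_{min})$, they
can be converted back to the original temperature scale with a bit of
algebra. In the sketch below, `to_original_scale` is a hypothetical
helper and the parameter values are placeholders chosen for
illustration, not the fitted ones.

```python
temperature = [-1.5, 0.2, 3.4, 4.1, 7.8, 13.4, 18.0, 21.5, 32.0, 33.5]


def to_original_scale(w_n, b_n, x_min, x_max):
    """Convert parameters fitted on min-max scaled x back to original x units."""
    span = x_max - x_min
    slope = w_n / span
    intercept = b_n - w_n * x_min / span
    return slope, intercept


x_min, x_max = min(temperature), max(temperature)

# Placeholder parameter values, for illustration only
w_n, b_n = 270.0, 100.0
slope, intercept = to_original_scale(w_n, b_n, x_min, x_max)

# Both parameterizations give the same prediction for a given temperature
t = 20.0
pred_scaled = w_n * (t - x_min) / (x_max - x_min) + b_n
pred_original = slope * t + intercept
print(f"scaled model: {pred_scaled:.3f}, original-scale model: {pred_original:.3f}")
```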

If we run our pytorch implementation on this transformed data, we can
see that with a large step size, the training process seems to start to
converge towards the true values.

In [None]:
x_n = torch.tensor(temp_n, dtype=torch.float)
model_2p.train(x_n, y, step_size=0.4, epochs=20, w_init=0.0, b_init=0.0)

If we plot the loss at different points, we can see that the values and
the gradient orientations are quite different.

In [None]:
graphic_params_n = {
    "x": x_n,
    "y": y,
    "true_weight": reg_n["slope"],
    "true_bias": reg_n["intercept"],
    "grad_scale": 5,
    "b_factor": 4,
}
model_2p.plot_loss(**graphic_params_n)

If we add the visualization of a training process with a step size of
0.4, we can see that the process converges much faster towards the
optimal value, which is reached in about 50 epochs.

In [None]:
graphic_params_n.update({"w_init": 0.0, "b_init": 0.0, "w_factor": 1.0})
model_2p.plot_train(**graphic_params_n, step_size=0.4, epochs=50)

A smaller step size of 0.1 is slower but still reaches the optimum in
about 200 epochs.

In [None]:
model_2p.plot_train(**graphic_params_n, step_size=0.1, epochs=200)

With a step size of 0.75, the training process converges even faster. We
can see that the gradient descent is less smooth as it “oscillates”
between two gradient directions.

In [None]:
model_2p.plot_train(**graphic_params_n, step_size=0.75, epochs=30)

Finally, when the step size is too high, the training process starts
diverging.

In [None]:
model_2p.plot_train(**graphic_params_n, step_size=0.9, epochs=10)

### Scaled data

Another possible transformation of the input data is to scale it by
subtracting its mean and dividing by its standard deviation. This can
be done easily using `scikit-learn`’s `scale` preprocessing.

In [None]:
from sklearn import preprocessing

temp_s = preprocessing.scale(temperature, with_mean=True)
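
The same transformation can be sketched in pure Python with the standard
`statistics` module. It uses the population standard deviation, which is
the estimator `scikit-learn` documents for `scale`; `standardize` is an
illustrative name.

```python
import statistics

temperature = [-1.5, 0.2, 3.4, 4.1, 7.8, 13.4, 18.0, 21.5, 32.0, 33.5]


def standardize(values):
    """Center values on their mean and divide by their population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (ddof=0)
    return [(v - mean) / std for v in values]


temp_s_manual = standardize(temperature)
print([round(v, 3) for v in temp_s_manual])
```

After this transformation the values have a mean of 0 and a standard
deviation of 1, up to floating point error.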


In [None]:
(
    pn.ggplot(mapping=pn.aes(x=temp_s, y=icecream))
    + pn.geom_hline(yintercept=0, linetype="dotted")
    + pn.geom_vline(xintercept=0, linetype="dotted")
    + pn.geom_point(color="white", fill="yellowgreen", size=4)
    + pn.labs(x="temp_s", y="icecream")
)

We can again compute the new optimal weight and bias values with
`scikit-learn`.

In [None]:
reg_s = skl_regression(temp_s, icecream)
print(f"slope: {reg_s['slope']:.2f}, intercept: {reg_s['intercept']:.2f}")

If we run our pytorch implementation on this scaled data, we can see
that with a large step size, the training is able to converge towards
the true values quite rapidly.

In [None]:
x_s = torch.tensor(temp_s, dtype=torch.float)
model_2p.train(x=x_s, y=y, step_size=0.3, epochs=10, w_init=0.0, b_init=0.0)

We can once again try to visualize the loss gradients and values along a
grid of $w$ and $b$ values. We see that the contours of our loss seem
more circular, and the gradients seem to point directly away from the
optimum.

In [None]:
graphic_params_s = {
    "x": x_s,
    "y": y,
    "true_weight": reg_s["slope"],
    "true_bias": reg_s["intercept"],
    "grad_scale": 15,
}
model_2p.plot_loss(**graphic_params_s)
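
The shape of the contours can be summarized by the condition number of
the Hessian of the loss, *i.e.* the ratio between its largest and
smallest eigenvalues: a value close to 1 means near-circular contours,
while a large value means a long narrow valley. The sketch below
compares the three versions of the input, assuming a mean squared error
loss; the helper names are illustrative.

```python
import math
import statistics

temperature = [-1.5, 0.2, 3.4, 4.1, 7.8, 13.4, 18.0, 21.5, 32.0, 33.5]


def condition_number(xs):
    """Ratio of the largest to the smallest eigenvalue of the MSE Hessian."""
    mean_x = statistics.fmean(xs)
    mean_x2 = statistics.fmean([xi * xi for xi in xs])
    # Hessian with respect to (w, b): [[a, c], [c, d]]
    a, c, d = 2 * mean_x2, 2 * mean_x, 2.0
    center = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + c**2)
    return (center + radius) / (center - radius)


# Min-max normalized temperatures (between 0 and 1)
low, high = min(temperature), max(temperature)
temp_minmax = [(t - low) / (high - low) for t in temperature]

# Standardized temperatures (mean 0, population standard deviation 1)
mean, std = statistics.fmean(temperature), statistics.pstdev(temperature)
temp_std = [(t - mean) / std for t in temperature]

cond_raw = condition_number(temperature)
cond_minmax = condition_number(temp_minmax)
cond_std = condition_number(temp_std)
print(f"raw: {cond_raw:.1f}, min-max: {cond_minmax:.1f}, standardized: {cond_std:.1f}")
```

The condition number drops by more than an order of magnitude with
min-max normalization and is essentially 1 for the standardized data,
which matches the more circular contours seen in the plot.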

We can plot the training process with a step size of 0.3. The gradient
descent is straightforward and goes directly to the optimum value, which
is reached in fewer than 10 epochs.

In [None]:
graphic_params_s.update({"w_init": 0.0, "b_init": 0.0, "w_factor": 1.0, "b_factor": 1.0})
model_2p.plot_train(**graphic_params_s, step_size=0.3, epochs=10)

With a larger step size of 0.6, the gradient descent first “overshoots”
the optimum values, but then rapidly converges towards them in a few
epochs.

In [None]:
model_2p.plot_train(**graphic_params_s, step_size=0.6, epochs=5)

And, as before, if the step size is too high the training process starts
to diverge, oscillating farther and farther from the optimum instead of
converging towards it.

In [None]:
model_2p.plot_train(**graphic_params_s, step_size=1.0, epochs=10)