# Overfitting

**Note:** to use this notebook in Google Colab, create a new cell
containing the following line and run it.

``` shell
!pip install git+https://gitlab.in2p3.fr/jbarnier/ateliers_deep_learning.git
```

In [None]:
import copy

import numpy as np
import plotnine as pn
import polars as pl
import torch
from torch import nn
from torchinfo import summary

from adl import cooking

pn.theme_set(pn.theme_minimal())

In this notebook we will try to understand the concept of overfitting
and why it can be problematic.

To illustrate this concept we will use an example dataset. Imagine that
we ask some people to taste and rate a cake recipe while varying the
cooking time from 5 to 120 minutes.

Here is a sample generated dataset of 30 ratings based on the cooking
time.

In [None]:
# Generate random data
np.random.seed(42)
time, score = cooking.generate_data(size=30)
# Add some outliers
score[[2, 15, 26]] = torch.tensor([[9], [4], [8.5]])

# Plot data
cooking.scatter_plot(time, score)

In this plot, each dot represents a score associated with a cooking
time. Overall, the data seems to follow a parabolic distribution, but
there are 3 outliers (introduced on purpose).

We will train a dense neural network on this small dataset to try to
predict the score from the cooking time value.

In [None]:
class RegressionNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(in_features=1, out_features=100),
            nn.Sigmoid(),
            nn.Linear(in_features=100, out_features=100),
            nn.Sigmoid(),
            nn.Linear(in_features=100, out_features=1),
        )

    def forward(self, x):
        # Center data
        x = x - 115 / 2
        return self.model(x)


model = RegressionNetwork()
print(summary(model))

The network is a dense neural network with two hidden layers of 100
units each, followed by a linear output layer. This represents more
than 10 000 parameters, which is (deliberately) a lot for modelling a
dataset of 30 points.
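
The parameter count can be checked by hand from the layer sizes of the
network defined above. The helper below is only an illustrative sketch
(`count_linear_params` is a hypothetical name, not part of `torch` or
`adl`): each linear layer has one weight per input/output pair plus one
bias per output.

```python
# (in_features, out_features) of the three nn.Linear layers defined above
layer_shapes = [(1, 100), (100, 100), (100, 1)]


def count_linear_params(shapes):
    """Hypothetical helper: count the weights and biases of linear layers."""
    # Each layer has n_in * n_out weights and n_out biases
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)


n_params = count_linear_params(layer_shapes)
print(n_params)  # 10401
```

That is roughly 350 parameters for each of the 30 data points.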

We will train this network for 2000 epochs, which, once again, seems
quite a lot.

In [None]:
# Model training step
def train_step(x, y, model, loss_fn, optimizer):
    # Set the model to training mode
    model.train()
    # Reset gradients
    optimizer.zero_grad()
    # Forward pass: compute predicted values
    y_pred = model(x)
    # Compute loss
    loss = loss_fn(y_pred, y)
    # Backpropagation
    loss.backward()
    # Parameters adjustment
    optimizer.step()
    return loss


# Loss function
loss_fn = nn.MSELoss()
# Optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=0.005)

# Run training
epochs = 2000
for epoch in range(epochs):
    loss = train_step(time, score, model, loss_fn, optimizer)
    if (epoch + 1) % 500 == 0:
        print(f"{epoch + 1:5}. loss: {loss:5.3f}")

The training seems to be going fine: the loss goes down steadily and is
rather small after the last epoch.
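
For reference, `nn.MSELoss()` with its default settings returns the mean
squared error over the $n$ points passed to it (here $n = 30$), where
$\hat{y}_i$ is the predicted score and $y_i$ the observed one:

$$
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2
$$

Because the errors are squared, a single point that is far from its
prediction weighs heavily in this loss.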

We can compare our model's predictions with the real scores.

In [None]:
score_pred = model(time)
cooking.scatter_plot_pred(time, score, score_pred)

This seems quite good! Predicted values are not too far from the real
ones, and even the outliers seem to be predicted correctly.

But let’s now visualize what the predicted values would be over the
whole range of cooking times, \[5, 120\] minutes.

In [None]:
cooking.line_plot(model)

This doesn’t look like the parabolic distribution we had intuitively
seen by looking at our dataset. By adjusting to the outliers, the model
is diverging from the “true” data distribution.

Suppose we get a new sample dataset, this time without outliers. By
applying our trained model to these new values we can predict their
scores and compare with the real ones.

In [None]:
# Generate new data
time_new, score_new = cooking.generate_data(size=20)

# Compute and plot predicted scores vs true scores
score_new_pred = model(time_new)
cooking.scatter_plot_pred(time_new, score_new, score_new_pred)

The results are rather good, except for the points whose cooking times
are close to those of the original outliers: in these cases the
predicted values are quite far from the real ones.

This is an example of **overfitting**: by being too close to our
training data, our model doesn’t *generalize* well to new data.

Overfitting is a frequent problem in deep learning, and there are
several methods to try to limit it. Some of these methods are linked to
the network architecture: we can reduce the number of parameters to
prevent the model from learning “too much”, or we can introduce specific
layers such as dropout layers to help the model generalize better.
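
As a rough illustration of these two architectural options, here is a
minimal sketch of a smaller network with a dropout layer. The class name
`SmallRegressionNetwork`, the 16 hidden units and the dropout
probability of 0.2 are arbitrary assumptions chosen for the example, not
values tuned for this dataset.

```python
import torch
from torch import nn


class SmallRegressionNetwork(nn.Module):
    """Hypothetical smaller variant of the network used in this notebook."""

    def __init__(self, hidden=16, dropout=0.2):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(in_features=1, out_features=hidden),
            nn.Sigmoid(),
            # Randomly zeroes activations, but only in training mode
            nn.Dropout(p=dropout),
            nn.Linear(in_features=hidden, out_features=1),
        )

    def forward(self, x):
        # Same centering as in the network defined earlier
        x = x - 115 / 2
        return self.model(x)


small_model = SmallRegressionNetwork()
n_params = sum(p.numel() for p in small_model.parameters())
print(n_params)  # 49
```

Dropout is only active after `model.train()`; calling `model.eval()`
disables it, which is why switching modes matters when computing
predictions.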

In the rest of this notebook we will look at another way to limit
overfitting: not training our model for too long.

## Validation data

One way to limit overfitting is to split our training dataset into two
parts: a training set and a validation set. The model is trained only
on the training set (it never sees the validation data during
training), but after each epoch we compute the loss on both the
training set and the validation set.

Let’s try this in our example by creating a small random validation
dataset without outliers.

In [None]:
time_valid, score_valid = cooking.generate_data(size=20)

We modify our training step to compute the loss on the validation
dataset at the end of each step, after the parameters have been
adjusted.

In [None]:
def train_step_with_validation(x, y, x_valid, y_valid, model, loss_fn, optimizer):
    model.train()
    optimizer.zero_grad()
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    loss.backward()
    optimizer.step()

    # Compute validation loss (evaluation mode, no gradient tracking needed)
    model.eval()
    with torch.no_grad():
        valid_pred = model(x_valid)
        valid_loss = loss_fn(valid_pred, y_valid)

    return loss, valid_loss

We can now run this new training process. The model architecture and
the train set are the same as in our previous example, but we add a
validation set, which is centered in the same way as the train set
since the centering is done inside the model.

We then train the model for 2000 epochs, with the same learning rate as
before.

In [None]:
model = RegressionNetwork()

loss_fn = nn.MSELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=0.005)

results = []

epochs = 2000
for epoch in range(epochs):
    loss, valid_loss = train_step_with_validation(
        time, score, time_valid, score_valid, model, loss_fn, optimizer
    )
    if (epoch + 1) % 200 == 0:
        print(f"{epoch + 1:5}. loss: {loss:5.3f}, valid_loss: {valid_loss:5.3f}")
    if (epoch + 1) % 10 == 0:
        # Store plain Python floats rather than tensors
        results.append(
            {"epoch": epoch + 1, "loss": loss.item(), "valid_loss": valid_loss.item()}
        )

We can see that the train loss keeps going down until epoch 2000. The
validation loss, however, goes down quite fast at first and then starts
to go up after about 1000 epochs.

We can represent both losses in a line plot to better visualize the
process.

In [None]:
d = pl.DataFrame(results).unpivot(index="epoch", on=["loss", "valid_loss"])

(
    pn.ggplot(d, pn.aes(x="epoch", y="value", color="variable"))
    + pn.geom_line()
    + pn.scale_y_continuous(limits=[0, None])  # type: ignore
    + pn.labs(color="")
)

The graph shows that until about epoch 1000, both losses go down fast.
After that, they evolve in opposite directions: the train loss keeps
going down slowly, while the validation loss starts to go up at about
the same speed.

Intuitively, we can imagine that the first 1000 epochs are used to learn
the parabolic distribution, which is shared by both the train and
validation datasets. But after that the only way to lower the train loss
is to adapt to the individual data points in the train set, and in
particular to the three outliers. This improves the train loss, but in
doing so the model moves away from the “real” distribution, and so the
validation loss goes up.

By using both a training and a validation dataset and monitoring their
loss at each epoch, you can determine if and when the validation loss
begins to increase. If this occurs, it indicates overfitting, suggesting
that training should be halted around the corresponding epoch.

A good way to do this is to keep track of the validation loss at each
epoch, and to save the model corresponding to the lowest validation loss
reached during the training process.

The following code shows a way to do this.

In [None]:
model = RegressionNetwork()

loss_fn = nn.MSELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=0.005)

best_model = copy.deepcopy(model)
best_loss = np.inf
best_epoch = None

epochs = 2000
for epoch in range(epochs):
    loss, valid_loss = train_step_with_validation(
        time, score, time_valid, score_valid, model, loss_fn, optimizer
    )
    # We keep track of the model with the best results on the validation dataset
    if valid_loss < best_loss:
        # Store best model
        best_model = copy.deepcopy(model)
        # Keep track of best validation loss
        best_loss = valid_loss
        # Keep track of epoch for best validation loss (1-indexed, as in the logs)
        best_epoch = epoch + 1
    if (epoch + 1) % 200 == 0:
        print(f"{epoch + 1:5}. loss: {loss:5.3f}, valid_loss: {valid_loss:5.3f}")

print(f"\nBest validation loss: {best_loss:.3f}, reached at epoch {best_epoch}")
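
The loop above always runs for the full 2000 epochs and keeps the best
model along the way. A common variant, often called early stopping with
patience, halts training once the validation loss has not improved for a
given number of epochs. The following is a minimal sketch of that
stopping rule only, applied to a plain list of made-up validation
losses; the function name and the patience value are hypothetical and
not part of `torch` or `adl`.

```python
def early_stopping_epoch(valid_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training is considered stopped once the validation loss has not
    improved for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best_loss:
            # New best validation loss: reset the counter
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Patience was never exhausted: training ran to the end
    return len(valid_losses), best_epoch


# Made-up validation losses: they decrease, then start to rise
toy_losses = [5.0, 3.0, 2.0, 1.5, 1.6, 1.8, 2.1, 2.5]
toy_stop, toy_best = early_stopping_epoch(toy_losses, patience=3)
print(toy_stop, toy_best)  # 7 4
```

In a real training loop the same counter would sit next to the
`valid_loss < best_loss` test, with a `break` once the patience is
exhausted. A patience that is too small risks stopping on a temporary
bump in the validation loss.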

If we plot the predicted values of our “best” model, we can see that it
didn’t adjust to the outliers in our train dataset.

In [None]:
best_score_pred = best_model(time)
cooking.scatter_plot_pred(time, score, best_score_pred)

And if we plot the predicted values for the whole range of cooking
times, we can see that we are closer to the previously guessed parabolic
distribution.

In [None]:
cooking.line_plot(best_model)

## Test data

Using a validation dataset is good, but it is still possible for our
model to overfit this validation data. How? Imagine we run the
training process many times in order to modify the network
architecture, evaluate different optimizers, change the learning rate,
etc.

During each training process we will use our validation dataset to
assess the quality of our results. By doing so, we tend to adjust our
hyperparameters and network architecture to this validation dataset,
which can lead to a form of overfitting and degrade the model’s
generalization ability.

To prevent this, we can add a third dataset, called the *test*
dataset. This is the data used to assess the final quality of our
results.

## Exercise

Generate a small test dataset of size 20, and compute the loss of the
best model obtained previously on this test dataset.

So, as a recap, our training (labelled) data can be split into three
datasets:

-   the *train* dataset is used to train the model and adjust its
    parameters
-   the *validation* dataset is used to evaluate the loss at the end of
    each epoch and prevent overfitting by selecting the best model based
    on this validation loss
-   the *test* dataset is used at the end of all the model tuning and
    training processes to assess the final quality of its results based
    on data never seen beforehand
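
In this notebook the validation data is generated separately, but with a
real labelled dataset the three parts are usually obtained by a random
split. Here is a minimal sketch using only the standard library; the
function name `three_way_split` and the 60/20/20 proportions are
assumptions made for illustration.

```python
import random


def three_way_split(n_samples, valid_fraction=0.2, test_fraction=0.2, seed=0):
    """Return three disjoint lists of indices: train, validation, test."""
    indices = list(range(n_samples))
    # Shuffle with a dedicated generator so the split is reproducible
    random.Random(seed).shuffle(indices)
    n_valid = int(n_samples * valid_fraction)
    n_test = int(n_samples * test_fraction)
    test_idx = indices[:n_test]
    valid_idx = indices[n_test : n_test + n_valid]
    train_idx = indices[n_test + n_valid :]
    return train_idx, valid_idx, test_idx


train_idx, valid_idx, test_idx = three_way_split(n_samples=100)
print(len(train_idx), len(valid_idx), len(test_idx))  # 60 20 20
```

The index lists can then be used to select rows from tensors that share
the same first dimension, for example `time[train_idx]` and
`score[train_idx]`.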

Ideally, the test dataset should be used once and only once. It is
sometimes difficult in practice, but in any case it should be used (and
seen by the model) as infrequently as possible.