# Batches, datasets and data loaders

**Note:** to use this notebook in Google Colab, create a new cell with
the following line and run it.

``` shell
!pip install git+https://gitlab.in2p3.fr/jbarnier/ateliers_deep_learning.git
```

In [None]:
import json
import random

import numpy as np
import plotnine as pn
import requests
import torch
from torch import nn

from adl.sklearn import skl_regression
from adl import cooking, model_1p

pn.theme_set(pn.theme_minimal() + pn.theme(plot_background=pn.element_rect(fill="white")))

Up to this point we have only worked on very small toy datasets, but in
real-world deep learning applications the datasets can be huge, because
we have a very large number of data points, because each data point is
itself quite big (like sequences, images or videos), or both. In this
case it is often impossible to apply a training step (forward pass and
backpropagation) to the whole dataset, due to memory limitations.

To overcome this, the training steps are instead applied to
*mini-batches* of data:

-   the dataset is divided into smaller subsets of data points of a
    given size. Each subset is called a *mini-batch*, or a *batch*.
-   the training step is applied sequentially to each batch: the
    forward pass, backpropagation and parameter adjustment are
    performed for each batch, one after the other.
-   an *epoch* is completed when all the batches have been processed and
    the entire training dataset has been seen by the network.
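
As a minimal sketch of this splitting, assuming a hypothetical helper
`batch_slices` (not part of PyTorch or of the `adl` package), the
following pure Python function lists the index ranges of the batches
that make up one epoch. The last batch is simply smaller when the
dataset size is not a multiple of the batch size.

```python
import math


def batch_slices(n_points, batch_size):
    """Return the (start, stop) index range of each batch in one epoch."""
    # Hypothetical helper, for illustration only
    n_batches = math.ceil(n_points / batch_size)
    return [
        (i * batch_size, min((i + 1) * batch_size, n_points))
        for i in range(n_batches)
    ]


# A toy dataset of 25 points split into batches of 10
slices = batch_slices(n_points=25, batch_size=10)
print(slices)
# One epoch is complete once every slice has been processed
print(f"{len(slices)} batches per epoch")
```

Each `(start, stop)` pair can be used directly as a Python slice, since
the `stop` index is excluded.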

To demonstrate this we will reuse the cake recipe rating example from
the *overfitting* notebook, where ratings were predicted based on
cooking time. But this time we will generate a much larger dataset of
500 000 data points.

In [None]:
np.random.seed(1337)
time, score = cooking.generate_data(size=500_000, noise_scale=0.9)

cooking.scatter_plot(time, score, size=1, alpha=0.01)

We define a regression network class and model object.

In [None]:
class RegressionNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(in_features=1, out_features=10),
            nn.Sigmoid(),
            nn.Linear(in_features=10, out_features=1),
        )

    def forward(self, x):
        # Center data
        x = x - 115 / 2
        return self.model(x)


model = RegressionNetwork()


The train step will be the same as previously.

In [None]:
# Model training step
def train_step(x, y, model, loss_fn, optimizer):
    # Set the model to training mode
    model.train()
    # Reset gradients
    optimizer.zero_grad()
    # Forward pass: compute predicted values
    y_pred = model(x)
    # Compute loss
    loss = loss_fn(y_pred, y)
    # Backpropagation
    loss.backward()
    # Parameters adjustment
    optimizer.step()
    return loss


# Loss function
loss_fn = nn.MSELoss()
# Optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=0.005)  # type: ignore

The difference with the previous notebook is the way we will run our
training steps.

Previously, we applied it to the whole dataset at once for each epoch:

``` py
epochs = 10
for epoch in range(epochs):
    loss = train_step(time, score, model, loss_fn, optimizer)
    # Print the loss at every epoch
    print(f"Epoch {epoch + 1:2} - loss: {loss:5.3f}")
```

This time we will introduce a second loop which will iterate through
batches of data points. The training step (loss computation and
parameter adjustment) is applied to one batch after another. At the end
of each epoch we compute the average batch loss as the global epoch
loss.

In [None]:
# Run training

epochs = 10
# Batch size
batch_size = 10_000
# Number of batches
n_batches = len(time) // batch_size
for epoch in range(epochs):
    loss = 0
    for batch in range(n_batches):
        # For each batch, extract the corresponding x and y data
        # (the end index of a slice is excluded, so no "- 1" is needed)
        x_batch = time[batch * batch_size : (batch + 1) * batch_size]
        y_batch = score[batch * batch_size : (batch + 1) * batch_size]
        # Compute loss on this batch
        batch_loss = train_step(x_batch, y_batch, model, loss_fn, optimizer)
        # Accumulate loss between batches
        loss += batch_loss.item()
    # Compute average loss for this epoch
    loss /= n_batches
    print(f"Epoch {epoch + 1:2} - loss: {loss:5.3f}")

## Datasets and DataLoaders

In the previous example we extracted the batches manually, but in
practice this can quickly become complex and cumbersome. PyTorch
provides two tools to make batch processing a bit easier: `Dataset` and
`DataLoader`.

-   A `Dataset` object describes a data source, its size and the way to
    get an item from it. It can give access to data from a NumPy array,
    a torch tensor, a file, or any other resource accessible via Python
    code.
-   A `DataLoader` object loads data from a `Dataset` while handling
    features like batch loading and shuffling.

### Dataset

To demonstrate the use of `Dataset`, we first generate two sample
datasets, one for training and one for validation:

In [None]:
time_train, score_train = cooking.generate_data(size=500_000, noise_scale=0.9)
time_valid, score_valid = cooking.generate_data(size=18_000, noise_scale=0.9)


The next step is to define a `Dataset` class for our data. This is a
Python class which inherits from
[torch.utils.data.Dataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset),
and must implement three methods:

-   `__init__()`, the class constructor
-   `__len__()`, which must return the length of our dataset (the number
    of data points)
-   `__getitem__()`, which, given an integer index as argument, must
    return the `(data, label)` pair corresponding to this index.

Here we will create a class called `RegressionDataset`:

-   the constructor takes two tensors, `time` and `score`, as arguments
    and stores them as attributes.
-   the `__len__()` method returns the length of these tensors.
-   the `__getitem__()` method returns a tuple of the time and score
    values at the given index.

In [None]:
from torch.utils.data import Dataset


class RegressionDataset(Dataset):
    def __init__(self, time, score):
        # Check if both tensors have the same length
        if len(time) != len(score):
            msg = "time and score don't have the same length"
            raise ValueError(msg)
        # Store time and score as attributes
        self.time = time
        self.score = score

    def __len__(self):
        # Returns the number of data points
        return len(self.score)

    def __getitem__(self, index):
        # Returns the time and score values for the given index
        time_index = self.time[index]
        score_index = self.score[index]
        return time_index, score_index

Our `Dataset` is quite simple here as it just stores and retrieves
values from tensors, but it could be more complex. For example the
constructor could get a list of filenames containing images and their
corresponding labels, and the `__getitem__` method would then open and
read the files and preprocess the image data.
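
A minimal sketch of such a file-based dataset, assuming a hypothetical
`FileDataset` class and, to keep the example self-contained, small text
files that each hold one number instead of images. The constructor only
stores paths and labels; a file is opened and parsed when the
corresponding item is requested.

```python
import tempfile
from pathlib import Path

import torch
from torch.utils.data import Dataset


class FileDataset(Dataset):
    """Hypothetical dataset that reads one value per file, on demand."""

    def __init__(self, paths, labels):
        if len(paths) != len(labels):
            msg = "paths and labels don't have the same length"
            raise ValueError(msg)
        # Only the file paths are kept in memory, not the file contents
        self.paths = paths
        self.labels = labels

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        # The file is opened and parsed only when this item is requested
        value = float(Path(self.paths[index]).read_text())
        data = torch.tensor([value])
        label = torch.tensor([self.labels[index]])
        return data, label


# Create three small files to stand in for real data files
tmp_dir = Path(tempfile.mkdtemp())
paths = []
for i, value in enumerate([10.0, 20.0, 30.0]):
    path = tmp_dir / f"sample_{i}.txt"
    path.write_text(str(value))
    paths.append(path)

file_dataset = FileDataset(paths, labels=[1.0, 2.0, 3.0])
print(len(file_dataset))
print(file_dataset[1])
```

With real images, the body of `__getitem__` would decode and preprocess
the image instead of parsing a number, but the structure of the class
would stay the same.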

Now that our `RegressionDataset` class is defined, we can create the
training and validation dataset objects.

In [None]:
train_dataset = RegressionDataset(time_train, score_train)
valid_dataset = RegressionDataset(time_valid, score_valid)

We can apply `len` to one of these objects to get the number of its data
points, or index it to get a (data, label) tuple.

In [None]:
len(train_dataset)

In [None]:
valid_dataset[1000]
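
For data that already lives in tensors, as here, PyTorch also ships a
ready-made `torch.utils.data.TensorDataset` class that behaves much
like our `RegressionDataset`. A short sketch on made-up tensors (the
values and the `toy_` names below are illustrative, not the notebook's
data):

```python
import torch
from torch.utils.data import TensorDataset

# Made-up data: 5 cooking times and 5 scores, as column tensors
toy_time = torch.tensor([[10.0], [20.0], [30.0], [40.0], [50.0]])
toy_score = torch.tensor([[1.0], [2.0], [3.0], [2.5], [1.5]])

# TensorDataset requires all tensors to share the same first dimension
toy_dataset = TensorDataset(toy_time, toy_score)

print(len(toy_dataset))
# Indexing returns a tuple with one entry per tensor
time_2, score_2 = toy_dataset[2]
print(time_2, score_2)
```

Writing our own class remains useful as soon as `__getitem__` has to do
more than index into tensors.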

### DataLoaders

Once our `Dataset` objects are defined, we can create associated
`DataLoader` objects, which will handle batch extraction and
traversal.

For this, we will create
[torch.utils.data.DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)
instances by passing them (among other possible arguments):

-   a `Dataset` object
-   the batch size
-   a `shuffle` argument: if `True`, the data points will be reshuffled
    randomly before each epoch. This means that batches will be
    different from one epoch to another.

We create a training loader and a validation loader, both with a batch
size of 10 000. The training loader is shuffled, so that batches differ
from one epoch to the next. Shuffling is not useful for the validation
loader.

In [None]:
from torch.utils.data import DataLoader

batch_size = 10_000

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=False)

Once created, we can iterate over a `DataLoader`. Each iteration will
return a batch of `(time, score)` data.

For example, we can iterate over our validation loader. This yields two
batches, the first with the requested size of 10 000, and the second
with a size of 8 000 (as there are only 18 000 points in our validation
dataset).

In [None]:
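for batch in valid_loader:
    # Use dedicated names so the full `time` and `score` data is not overwritten
    time_batch, score_batch = batch
    print(time_batch.shape, score_batch.shape)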
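
The number of batches yielded by a loader is the dataset size divided
by the batch size, rounded up. A small sketch on a made-up dataset of
18 points (a scaled-down stand-in for our 18 000 validation points)
shows this, along with the `drop_last` argument of `DataLoader`, which
discards the incomplete final batch:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Made-up dataset of 18 points
toy_dataset = TensorDataset(torch.arange(18.0).view(-1, 1))

# Default behaviour: the last, smaller batch is kept
toy_loader = DataLoader(toy_dataset, batch_size=10)
sizes = [len(batch[0]) for batch in toy_loader]
print(len(toy_loader), sizes)

# With drop_last=True the incomplete batch is discarded
toy_loader_drop = DataLoader(toy_dataset, batch_size=10, drop_last=True)
sizes_drop = [len(batch[0]) for batch in toy_loader_drop]
print(len(toy_loader_drop), sizes_drop)
```

Dropping the last batch means a few data points are skipped, so it is
mostly of interest for training loaders, where shuffling changes which
points are left out at each epoch.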

If we iterate again, we will start a new epoch and get the same batches
again.

In [None]:
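for batch in valid_loader:
    # Use dedicated names so the full `time` and `score` data is not overwritten
    time_batch, score_batch = batch
    print(time_batch.shape, score_batch.shape)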
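
With `shuffle=True`, by contrast, each new iteration draws a new random
order. A sketch on a made-up dataset of 10 integers; the seeded
`torch.Generator` passed through the `generator` argument is only there
to make the example reproducible, and `epoch_batches` is a hypothetical
helper:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

toy_dataset = TensorDataset(torch.arange(10))


def epoch_batches(loader):
    """Collect the batches of one epoch as lists of integers."""
    return [batch[0].tolist() for batch in loader]


# Without shuffling, every epoch yields identical batches
fixed_loader = DataLoader(toy_dataset, batch_size=4, shuffle=False)
fixed_1 = epoch_batches(fixed_loader)
fixed_2 = epoch_batches(fixed_loader)
print(fixed_1, fixed_2)

# With shuffling, the order is redrawn at the start of each epoch
generator = torch.Generator().manual_seed(0)
shuffled_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True, generator=generator)
shuffled_1 = epoch_batches(shuffled_loader)
shuffled_2 = epoch_batches(shuffled_loader)
print(shuffled_1, shuffled_2)
```

Every point still appears exactly once per epoch; only the composition
of the batches changes.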

We can rewrite our training code using our data loaders. Inside each
epoch, we first iterate through our `train_loader` object and run a
training step on the yielded batch. Then, once all training batches have
been processed, we iterate through our `valid_loader`, this time to
compute the validation loss for this epoch.

In [None]:
model = RegressionNetwork()

loss_fn = nn.MSELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=0.005)  # type: ignore

epochs = 5
for epoch in range(epochs):
    loss = 0

    # Iterate through training batches
    for x_batch, y_batch in train_loader:
        # Set model in train mode
        model.train()
        # Run a training step and accumulate loss value
        batch_loss = train_step(x_batch, y_batch, model, loss_fn, optimizer)
        loss += batch_loss.item()

    # Compute average training loss for this epoch
    loss /= len(train_loader)

    # Iterate through validation batches
    valid_loss = 0
    # Set model in evaluation (inference) mode
    model.eval()
    # Disable gradient tracking: validation needs no backpropagation
    with torch.no_grad():
        for x_valid_batch, y_valid_batch in valid_loader:
            # Compute and accumulate the batch loss
            y_valid_pred = model(x_valid_batch)
            valid_batch_loss = loss_fn(y_valid_pred, y_valid_batch)
            valid_loss += valid_batch_loss.item()

    # Compute the average validation loss for this epoch
    valid_loss /= len(valid_loader)

    print(f"{epoch + 1:5}. loss: {loss:5.3f}, valid_loss: {valid_loss:5.3f}")
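
One detail of the loop above: averaging the batch losses gives every
batch the same weight, so when the last batch is smaller (8 000 points
instead of 10 000 in our validation loader) the result differs slightly
from the loss computed over the whole dataset. A possible refinement,
sketched here on made-up data with an untrained linear model standing
in for the network, is to weight each batch loss by its size:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
# Made-up data and an untrained stand-in model
toy_x = torch.rand(18, 1)
toy_y = torch.rand(18, 1)
toy_model = nn.Linear(1, 1)
toy_loss_fn = nn.MSELoss()
toy_loader = DataLoader(TensorDataset(toy_x, toy_y), batch_size=10)

toy_model.eval()
simple_sum = 0.0
weighted_sum = 0.0
with torch.no_grad():
    for x_batch, y_batch in toy_loader:
        batch_loss = toy_loss_fn(toy_model(x_batch), y_batch).item()
        simple_sum += batch_loss
        # Weight the batch loss by the number of points in the batch
        weighted_sum += batch_loss * len(x_batch)
    # Reference value: loss computed on the whole dataset at once
    full_loss = toy_loss_fn(toy_model(toy_x), toy_y).item()

simple_mean = simple_sum / len(toy_loader)
weighted_mean = weighted_sum / len(toy_loader.dataset)
print(f"simple: {simple_mean:.4f}, weighted: {weighted_mean:.4f}, full: {full_loss:.4f}")
```

With equal-sized batches both averages coincide, and with many batches
the difference is usually negligible.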

Training works well, even if it is slower than before. However, the
relative slowdown would be smaller if the `__getitem__()` operation
were more costly (if, for example, we were reading a file). We also get
some nice bonus features, such as automatic handling of the size of the
last batch and shuffling of the training data between epochs.
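
The slowdown comes from the way a loader assembles a batch: with the
default settings it fetches the items one by one through
`__getitem__()` and then stacks them into batch tensors. The sketch
below reproduces this by hand on a made-up dataset; it is a simplified
picture of the default behaviour, not PyTorch's actual implementation.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Made-up dataset of 6 (x, y) pairs
toy_x = torch.arange(6.0).view(-1, 1)
toy_y = 2 * toy_x
toy_dataset = TensorDataset(toy_x, toy_y)

# What the loader yields for the first batch of 4 points
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=False)
x_batch, y_batch = next(iter(toy_loader))

# The same batch assembled by hand: one __getitem__ call per index...
items = [toy_dataset[i] for i in range(4)]
# ...then the individual items are stacked into batch tensors
x_manual = torch.stack([item[0] for item in items])
y_manual = torch.stack([item[1] for item in items])

print(torch.equal(x_batch, x_manual), torch.equal(y_batch, y_manual))
```

By comparison, the manual slicing used at the start of this notebook
extracts a whole batch with a single indexing operation, which is
cheaper for data already held in memory.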

**Exercise**

The following Python function uses the [Open-Meteo
API](https://open-meteo.com/en/docs) to get several daily weather
variables for the past 30 days at the LBBE.

In [None]:
def get_weather_data():
    url = "https://api.open-meteo.com/v1/forecast?latitude=45.78&longitude=4.87&daily=precipitation_sum,temperature_2m_mean,wind_speed_10m_mean&past_days=30"
    data = requests.get(url).json()["daily"]  # noqa: S113
    return data


The returned data is a Python dictionary with the following fields:

-   `wind_speed_10m_mean`: mean daily wind speed
-   `temperature_2m_mean`: mean daily temperature
-   `precipitation_sum`: total daily precipitation

1.  Use the `get_weather_data` function to create a new `data` object
2.  Create a `WeatherDataset` class that converts wind speed,
    temperature and precipitation to tensors and stores them as
    attributes, and returns temperature and wind speed as input data,
    and precipitation as target
3.  Create a `weather_dataset` object from your `WeatherDataset` class
4.  Create a `DataLoader` object for `weather_dataset` with a
    `batch_size` of 10 and no shuffle
5.  Iterate over the data loader and print the batch values

## Effect of batch size on training process

Besides making it possible to train a model on bigger datasets, the use
of mini-batches also has an effect on the training process itself. This
is because, when using batches, the loss function is slightly different
at each training step (and thus the gradients and the parameter
adjustments are also slightly different).

To illustrate this point, we reuse a previous example of linear
regression with only one parameter $w$ (the slope of the line). We first
generate a random dataset of 500 points of `x` and `y` values where `y`
is equal to `x * 2` plus some noise.

In [None]:
n_points = 500
np.random.seed(1337)
# Generate x and y values
x = np.random.uniform(low=0, high=10, size=n_points)
y = x * 2 + np.random.normal(loc=0, scale=3, size=n_points)

# Convert to tensors
xt = torch.tensor(x).view(-1, 1)
yt = torch.tensor(y).view(-1, 1)

# Plot the dataset
(
    pn.ggplot(mapping=pn.aes(x=x, y=y))
    + pn.geom_abline(slope=2, intercept=0, color="red")
    + pn.geom_point(color="royalblue", size=2, alpha=0.5)
    + pn.coord_cartesian(xlim=(0, 10))
)

We can plot the loss function for our whole dataset, *i.e.* the mean
squared error of our model $y = w \times x$ for different values of $w$.

In [None]:
model_1p.plot_loss(xt, yt, wmin=-4, wmax=8, gradient=False, ylim=(0, 2000))

We can see that if $w = 0$ (*i.e.* with a horizontal regression line),
the loss value on our dataset is around 200. As expected, the minimum
value of the loss is reached for $w$ approximately equal to 2.
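
Since the model is $y = w \times x$, the mean squared error is a
parabola in $w$, and setting its derivative to zero gives a closed form
for the minimum: $w^* = \sum_i x_i y_i / \sum_i x_i^2$. A short NumPy
sketch checks that evaluating the loss over a grid of $w$ values finds
the same minimum. The data is regenerated with the same recipe but with
a separate, arbitrarily seeded generator, so the exact values differ
from the notebook's; the `_demo` names and `mse_loss` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, independent of the notebook's data
x_demo = rng.uniform(0, 10, size=500)
y_demo = 2 * x_demo + rng.normal(0, 3, size=500)


def mse_loss(w, x, y):
    """Mean squared error of the model y = w * x."""
    return np.mean((w * x - y) ** 2)


# Closed-form minimum of the parabola
w_star = np.sum(x_demo * y_demo) / np.sum(x_demo**2)

# Brute-force search over a grid of w values (step of 0.01)
w_grid = np.linspace(-4, 8, 1201)
losses = np.array([mse_loss(w, x_demo, y_demo) for w in w_grid])
w_grid_best = w_grid[np.argmin(losses)]

print(f"closed form: {w_star:.3f}, grid search: {w_grid_best:.2f}")
```

Both values should be close to the true slope of 2, up to the noise in
the data and the resolution of the grid.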

But what happens if we compute this loss function not over the whole
dataset, but only over a subset of it, for example of 32 data points?

In [None]:
# Sample 32 distinct indices (random.choices would sample with replacement)
indices = random.sample(range(len(x)), k=32)
x_subset, y_subset = x[indices], y[indices]
reg = skl_regression(x_subset, y_subset, fit_intercept=False)

# Plot the dataset
(
    pn.ggplot()
    + pn.geom_abline(slope=2, intercept=0, color="red", alpha=0.2)
    + pn.geom_point(mapping=pn.aes(x=x, y=y), color="royalblue", size=2, alpha=0.05)
    + pn.geom_abline(slope=reg["slope"], intercept=0, color="red")
    + pn.geom_point(mapping=pn.aes(x=x_subset, y=y_subset), color="royalblue", size=2, alpha=0.9)
    + pn.coord_cartesian(xlim=(0, 10))
)

If we take a subset of our data, we can see that the slope of the
regression line is slightly different. So we can suppose that the
overall loss function values will be different too.

We can visualize this by randomly sampling 32 data points, plotting the
associated loss function (in grey on the plot below) and comparing it
with the loss function of our whole data (the dashed red line).

In [None]:
model_1p.plot_batch_loss(xt, yt, batch_size=32, n_batches=1)

We can see that the loss of our batch is not identical to the “full”
loss. It has about the same shape but its values are not the same.

We can guess that each batch of 32 points will generate a different loss
function. We can visualize this variability by generating many batches
and plotting their losses on the same plot.

In [None]:
model_1p.plot_batch_loss(xt, yt, batch_size=32, n_batches=50)

What happens if we decrease the batch size?

If we create batches of 16 data points, we can see that the variability
around the “full” loss is higher.

In [None]:
model_1p.plot_batch_loss(xt, yt, batch_size=16, n_batches=50)

As an extreme example, with a batch size of 1, the loss function is
calculated for only one data point. The variability of the loss is then
maximal:

In [None]:
model_1p.plot_batch_loss(xt, yt, batch_size=1, n_batches=30)

On the contrary, with a larger batch size of 256, the loss functions
will more closely approximate the “full” loss:

In [None]:
model_1p.plot_batch_loss(xt, yt, batch_size=256, n_batches=50)
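
The effect can also be measured numerically. The NumPy sketch below,
which uses regenerated data with an arbitrary seed and a hypothetical
`batch_loss` helper (it does not rely on `model_1p`), draws many random
batches for several batch sizes and computes the standard deviation of
their losses at one fixed value of $w$:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
x_demo = rng.uniform(0, 10, size=500)
y_demo = 2 * x_demo + rng.normal(0, 3, size=500)


def batch_loss(w, batch_size):
    """MSE of the model y = w * x on one randomly drawn batch."""
    idx = rng.choice(len(x_demo), size=batch_size, replace=False)
    return np.mean((w * x_demo[idx] - y_demo[idx]) ** 2)


w = 1.0  # fixed parameter value at which the losses are compared
full_loss = np.mean((w * x_demo - y_demo) ** 2)
spread = {}
for batch_size in [1, 16, 32, 256]:
    losses = [batch_loss(w, batch_size) for _ in range(1000)]
    # Standard deviation of the batch losses around their mean
    spread[batch_size] = np.std(losses)
    print(f"batch_size={batch_size:3}: mean={np.mean(losses):6.2f}, std={spread[batch_size]:6.2f}")
print(f"full loss: {full_loss:.2f}")
```

The mean of the batch losses stays close to the full loss for every
batch size, while their spread shrinks as the batch size grows, roughly
like the inverse of the square root of the batch size.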

Here is another example with a more complex loss function, still with a
single parameter $w$. We can see the same effect of the batch size on
the variability of the batch losses.

In [None]:
model_1p.plot_sin_loss()

In [None]:
model_1p.plot_sin_batch_loss(batch_size=32, n_batches=20)

In [None]:
model_1p.plot_sin_batch_loss(batch_size=128, n_batches=20)

In [None]:
model_1p.plot_sin_batch_loss(batch_size=8, n_batches=20)

Since the batch size has an effect on the batch loss, it also affects
the training process. Here is an example of a training process on the
same complex loss without using mini-batches: the process here is fully
deterministic and leads to a local minimum.

In [None]:
train_args = {
    "step_size": 0.002,
    "epochs": 10,
    "w_init": 1.0,
}

model_1p.plot_sin_train(**train_args, batch_size=None)

If we use mini-batches during training, here with a batch size of 128,
we can see that the training process is more erratic, and as the batches
are shuffled between epochs, less deterministic.

In the following plot, each point represents a training step, *i.e.*
the loss computation and $w$ adjustment after each batch.

In [None]:
model_1p.plot_sin_train(**train_args, batch_size=128)

With a smaller batch size, the variability between batch losses
increases, and so the training process is even more erratic.

In [None]:
model_1p.plot_sin_train(**train_args, batch_size=32)

The fact that the process is more erratic and less deterministic can be
seen as an issue, but it can also be an advantage. For example, with an
even smaller batch size of 8, most of the training processes manage to
“escape” the local minimum and find another, better one.

In [None]:
model_1p.plot_sin_train(**train_args, batch_size=8)
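
The link between batch size and the noise of the training trajectory
can be reproduced with a few lines of NumPy. The sketch below is not
the implementation behind `model_1p.plot_sin_train`: it is a
hypothetical `train` function running mini-batch gradient descent on
the simpler linear model $y = w \times x$, with regenerated data and an
arbitrary seed. Because this loss is convex, the example cannot show an
escape from a local minimum; it only shows how much $w$ keeps moving
around the minimum for each batch size.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
x_demo = rng.uniform(0, 10, size=500)
y_demo = 2 * x_demo + rng.normal(0, 3, size=500)


def train(batch_size, epochs=200, step_size=0.002, w_init=0.0):
    """Mini-batch gradient descent on y = w * x; returns w after each step."""
    w = w_init
    history = []
    for _ in range(epochs):
        # Reshuffle the data points at the start of each epoch
        order = rng.permutation(len(x_demo))
        for start in range(0, len(x_demo), batch_size):
            idx = order[start : start + batch_size]
            # Gradient of the batch MSE with respect to w
            grad = 2 * np.mean(x_demo[idx] * (w * x_demo[idx] - y_demo[idx]))
            w -= step_size * grad
            history.append(w)
    return np.array(history)


w_star = np.sum(x_demo * y_demo) / np.sum(x_demo**2)  # minimum of the full loss
final_w = {}
tail_std = {}
for batch_size in [len(x_demo), 128, 8]:
    history = train(batch_size)
    # Look at the second half of training, once w is close to the minimum
    tail = history[len(history) // 2 :]
    final_w[batch_size] = history[-1]
    tail_std[batch_size] = np.std(tail)
    print(f"batch_size={batch_size:3}: final w={history[-1]:.3f}, std of w={tail_std[batch_size]:.4f}")
print(f"w_star={w_star:.3f}")
```

With the whole dataset in a single batch, $w$ settles on the minimum
and stops moving. With mini-batches it keeps fluctuating around it, and
the fluctuations grow as the batch size decreases.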

So, as a summary:

-   A **large batch size** demands more memory, as the whole batch must
    be loaded into the computer or GPU memory. However, since the batch
    losses are closer to the “full” loss, the training process will be
    smoother, more deterministic, and faster due to increased
    computational efficiency.

-   A **small batch size** requires less memory but is less
    computationally efficient. The training process will be more erratic
    and less deterministic. However, this can also have positive
    consequences, as the noise in the batch losses allows for a wider
    exploration of the loss function and a greater ability to escape
    local minima. Training will be slower but in some cases can yield
    better results and reduce the risk of overfitting.