# Optimizers and schedulers

In [None]:
import matplotlib.pyplot as plt
import plotnine as pn
import torch

from adl import optimizers

pn.theme_set(
    pn.theme_minimal()
    + pn.theme(
        plot_background=pn.element_rect(fill="white"),
        plot_title=pn.element_text(size=11),
    )
)

Optimizers and schedulers are methods used to improve the training
process and to find better optimum values. The goal of this notebook is
to introduce some ideas about how these methods work.

## Optimizers

In the previous notebooks we modified the parameters of our models from
our gradient values manually, with code like the following:

``` python
w = w - step_size * w.grad
b = b - step_size * b.grad
```

That is, at each training step our parameters (here $w$ and $b$) were
adjusted by subtracting from their value the loss gradient multiplied by
a *step size* (also called *learning rate*). The learning rate controls
how far we move against the gradient direction at each step, which can
roughly be seen as how fast we want to learn from the gradient value at
each step.
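
As a quick check on the direction of the update, here is a minimal sketch of a single gradient descent step. The toy loss $(w - 3)^2$, the starting value and the step size are assumptions chosen for illustration, not values from the previous notebooks.

```python
import torch

# Toy loss (an assumption for illustration): minimal at w = 3
w = torch.tensor(0.0, requires_grad=True)
step_size = 0.1

loss = (w - 3) ** 2
loss.backward()  # d(loss)/dw = 2 * (w - 3), evaluated at w = 0

# Subtracting the scaled gradient moves w towards the minimum
with torch.no_grad():
    w_new = w - step_size * w.grad

print(f"gradient: {w.grad.item():.2f}, updated w: {w_new.item():.2f}")
```

The gradient is negative here, so subtracting it increases `w` and brings it closer to 3.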

This simple method finds the optimum values in the simplest cases, but
it can be greatly improved with different techniques. A method to update
the model parameters from their gradient values is called an
*optimizer*.

To illustrate this, we will use a more complex loss function of two
parameters with several different optima. It is plotted below as a
contour plot: the $x$ and $y$ axes show the values of our two
parameters, and the colored contours represent the loss value at
different points. When the contour is red the loss is high, when it is
dark blue it is low. So our objective here is to start from a random
point and reach the visible minimum, which is around (-3, -2).

In [None]:
plot_sin_loss_args = {
    "loss_fn": optimizers.sin_loss_fn,
    "w1min": -14,
    "w1max": 0,
    "w2min": -6,
    "w2max": 12,
    "nsteps": 100,
}
optimizers.plot_loss(**plot_sin_loss_args)

In the following plot, we represent the gradient descent starting from
(-6.5, 4.5) and using a step size, or learning rate, of 0.01. So at each
step we compute the loss gradient at the current point and we “move”
along this direction according to the learning rate. We can see that in
this case, we reach the minimal value in about 30 steps.

In [None]:
plot_sin_train_args1 = plot_sin_loss_args | {
    "w1_init": -6.5,
    "w2_init": 4.5,
}
optimizers.plot_train(**plot_sin_train_args1, epochs=30, step_size=0.01)

### Stochastic gradient descent

This method is called *gradient descent* (or *stochastic gradient
descent* when each step uses a random batch of the data). Instead of
computing the updates ourselves, we can use the PyTorch optimizer class
`torch.optim.SGD`, which by default does exactly the same thing.

In [None]:
plot_sin_train_args1 = plot_sin_loss_args | {
    "w1_init": -6.5,
    "w2_init": 4.5,
}
optimizers.plot_train(
    **plot_sin_train_args1,
    epochs=30,
    optimizer=torch.optim.SGD,  # type: ignore
    optimizer_params={"lr": 0.01},
)

What would happen if we change the starting point? In the plot below we
start from (-12, 10), and we can see that the gradient descent stops at
another place, which is a local minimum with a higher value than the
best visible one.

This illustrates the fact that the result of training depends on the
starting point: if we start at a random point, *i.e.* with our model
parameters initialized with random values, the descent will differ from
one run to the next and may not lead to the same optimum value.

In [None]:
plot_sin_train_args2 = plot_sin_loss_args | {
    "w1_init": -12.0,
    "w2_init": 10.0,
}
optimizers.plot_train(
    **plot_sin_train_args2,
    epochs=30,
    optimizer=torch.optim.SGD,  # type: ignore
    optimizer_params={"lr": 0.01},
)

This also illustrates a drawback of the gradient descent method: if we
stop at the first place we encounter where the gradient values are zero,
we may be at a minimum, but it could be a local minimum instead of a
global one. In general, the gradient alone gives no way to know whether
a minimum is local or global.

Several techniques have been developed to improve this behavior. One of
them is to add *momentum* to our learning process: this can be seen as a
way to add “inertia” to our descent, which can help it escape from a
local minimum in some cases. In the following example, we have the same
starting point and learning rate as previously, but adding momentum lets
the descent go beyond the first local minimum and towards the lower
value.
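
To make the “inertia” idea more concrete, here is a minimal sketch of the momentum update as documented for `torch.optim.SGD` (default `dampening`, no Nesterov): a buffer accumulates past gradients, and the parameters move along this buffer instead of the raw gradient. The quadratic `toy_loss` and the names `w_manual` and `velocity` are assumptions for illustration; this is not the loss plotted in this notebook.

```python
import torch

lr, momentum = 0.01, 0.6


def toy_loss(w):
    # Illustrative quadratic loss (an assumption, not the notebook's loss)
    return (w**2).sum()


# Manual momentum update
w_manual = torch.tensor([-12.0, 10.0], requires_grad=True)
velocity = torch.zeros_like(w_manual)
for _ in range(5):
    (grad,) = torch.autograd.grad(toy_loss(w_manual), w_manual)
    velocity = momentum * velocity + grad  # accumulate past gradients
    with torch.no_grad():
        w_manual -= lr * velocity  # move along the accumulated direction

# Same settings with the PyTorch optimizer
w_torch = torch.tensor([-12.0, 10.0], requires_grad=True)
optimizer = torch.optim.SGD([w_torch], lr=lr, momentum=momentum)
for _ in range(5):
    optimizer.zero_grad()
    toy_loss(w_torch).backward()
    optimizer.step()

print(w_manual.detach(), w_torch.detach())
```

Because the buffer keeps part of the previous steps, the descent keeps moving for a while even where the current gradient is close to zero.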

In [None]:
optimizers.plot_train(
    **plot_sin_train_args2,
    epochs=25,
    optimizer=torch.optim.SGD,  # type: ignore
    optimizer_params={"lr": 0.01, "momentum": 0.6},
)

Of course, the amount of momentum added is important: too little will
not be enough to escape from a local minimum, and too much can make the
descent too fast and prevent it from reaching the desired optimum, as in
the following example.

In [None]:
optimizers.plot_train(
    **plot_sin_train_args2,
    epochs=15,
    optimizer=torch.optim.SGD,  # type: ignore
    optimizer_params={"lr": 0.01, "momentum": 0.8},
)

### Other optimizers

Another limitation of stochastic gradient descent is that it uses a
fixed learning rate, which is the same for each parameter at each
training step. Other optimizers use adaptive learning rates, *i.e.* they
adjust the learning rate of each parameter based on its previous
gradient values: they lower it for parameters with a history of high
gradient values, and raise it for parameters with low gradient values.
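
As an illustration of what “adaptive” means, here is a minimal sketch of the update rule documented for `torch.optim.RMSprop` with its default settings (`alpha=0.99`, `eps=1e-8`, no momentum, not centered). The loss `toy_loss` and the names `w_manual` and `square_avg` are assumptions made for this example.

```python
import torch

lr, alpha, eps = 0.5, 0.99, 1e-8  # alpha and eps are the RMSprop defaults


def toy_loss(w):
    # Illustrative loss (an assumption), much steeper along the second parameter
    return w[0] ** 2 + 10 * w[1] ** 2


# Manual RMSprop update
w_manual = torch.tensor([-4.5, -2.0], requires_grad=True)
square_avg = torch.zeros_like(w_manual)
for _ in range(5):
    (grad,) = torch.autograd.grad(toy_loss(w_manual), w_manual)
    # Running average of squared gradients, one value per parameter
    square_avg = alpha * square_avg + (1 - alpha) * grad**2
    with torch.no_grad():
        # Large recent gradients give a smaller effective learning rate
        w_manual -= lr * grad / (square_avg.sqrt() + eps)

# Same settings with the PyTorch optimizer
w_torch = torch.tensor([-4.5, -2.0], requires_grad=True)
optimizer = torch.optim.RMSprop([w_torch], lr=lr)
for _ in range(5):
    optimizer.zero_grad()
    toy_loss(w_torch).backward()
    optimizer.step()

print(w_manual.detach(), w_torch.detach())
```

Each parameter is divided by its own gradient history, so the steep direction does not get a larger step than the flat one.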

We will illustrate this with another simpler loss function, still for
two parameters:

In [None]:
plot_reg_loss_args = {
    "loss_fn": optimizers.reg_loss_fn,
    "w1min": -5,
    "w1max": 5,
    "w2min": -5,
    "w2max": 5,
    "nsteps": 100,
}
optimizers.plot_loss(**plot_reg_loss_args)

In the following plot we compare the `torch.optim.SGD` optimizer, with a
fixed learning rate, and another optimizer called `torch.optim.RMSprop`,
with adaptive learning rates. We can see that, for the same global
learning rate, RMSprop will minimize oscillations by reducing the
learning rates when the gradients have high values for several training
steps.

In [None]:
plot_reg_train_args = plot_reg_loss_args | {
    "w1_init": -4.5,
    "w2_init": -2.0,
    "optimum": (0.0, 0.0),
}
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
optimizers.plot_train(
    **plot_reg_train_args,
    epochs=50,
    optimizer=torch.optim.SGD,  # type: ignore
    optimizer_params={"lr": 0.5},
    ax=ax1,
)
optimizers.plot_train(
    **plot_reg_train_args,
    epochs=50,
    optimizer=torch.optim.RMSprop,  # type: ignore
    optimizer_params={"lr": 0.5},
    ax=ax2,
)
plt.show()


Another optimizer with adaptive learning rates is `torch.optim.AdamW`,
which implements both adaptive learning rates and momentum. It also uses
by default the notion of *weight decay*, which is a regularization
technique that encourages smaller values for the parameters.
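
To isolate the effect of weight decay, here is a small sketch, under the assumption of a perfectly flat loss: we set the gradient to zero by hand, so the only change made by `optimizer.step()` comes from the decay term. The parameter values and the learning rate are illustrative.

```python
import torch

lr, weight_decay = 0.1, 0.01  # 0.01 is the default weight decay of AdamW

w = torch.tensor([2.0, -4.0], requires_grad=True)
w_before = w.detach().clone()
optimizer = torch.optim.AdamW([w], lr=lr, weight_decay=weight_decay)

# Pretend the loss is flat at this point: the gradient is exactly zero
w.grad = torch.zeros_like(w)
optimizer.step()

# The values are only shrunk towards zero, by a factor (1 - lr * weight_decay)
print(w_before, w.detach())
```

With a non-zero gradient, this shrinking is applied in addition to the usual adaptive update.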

When AdamW is used on the loss function plotted above with the same
starting learning rate, we can see that it leads to a smoother descent,
even if it is a bit slower at the start.

In [None]:
optimizers.plot_train(
    **plot_reg_train_args,
    epochs=80,
    optimizer=torch.optim.AdamW,  # type: ignore
    optimizer_params={"lr": 0.5},
)

If we increase the learning rate, we can see that the SGD descent starts
to diverge and doesn’t reach the optimum value, whereas RMSprop still
converges, and AdamW has almost the same descent, although a bit faster.
This makes RMSprop and AdamW a bit less dependent on the initial
learning rate value than SGD.

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(18, 5))
optimizers.plot_train(
    **plot_reg_train_args,
    epochs=10,
    optimizer=torch.optim.SGD,  # type: ignore
    optimizer_params={"lr": 0.6},
    ax=ax1,
)
optimizers.plot_train(
    **plot_reg_train_args,
    epochs=50,
    optimizer=torch.optim.RMSprop,  # type: ignore
    optimizer_params={"lr": 0.6},
    ax=ax2,
)
optimizers.plot_train(
    **plot_reg_train_args,
    epochs=70,
    optimizer=torch.optim.AdamW,  # type: ignore
    optimizer_params={"lr": 0.6},
    ax=ax3,
)
plt.show()


Finally, we can compare the training process of SGD, RMSprop and AdamW
on our complex loss example.

First, for SGD, the result is highly dependent on the initial learning
rate. Furthermore, the fixed learning rate can lead to strong
oscillations around the optimum value.

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(18, 5))
optimizers.plot_train(
    **plot_sin_train_args2,
    epochs=60,
    optimizer=torch.optim.SGD,  # type: ignore
    optimizer_params={"lr": 0.03},
    ax=ax1,
)
optimizers.plot_train(
    **plot_sin_train_args2,
    epochs=100,
    optimizer=torch.optim.SGD,  # type: ignore
    optimizer_params={"lr": 0.05},
    ax=ax2,
)
optimizers.plot_train(
    **plot_sin_train_args2,
    epochs=60,
    optimizer=torch.optim.SGD,  # type: ignore
    optimizer_params={"lr": 0.07},
    ax=ax3,
)
plt.show()


RMSprop is also quite dependent on the initial learning rate. If the
value is too low we can be stuck in a local minimum, and if it is too
high we can “escape” the visible global minimum and go beyond it.

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(18, 5))
optimizers.plot_train(
    **plot_sin_train_args2,
    epochs=60,
    optimizer=torch.optim.RMSprop,  # type: ignore
    optimizer_params={"lr": 0.3},
    ax=ax1,
)
optimizers.plot_train(
    **plot_sin_train_args2,
    epochs=60,
    optimizer=torch.optim.RMSprop,  # type: ignore
    optimizer_params={"lr": 0.5},
    ax=ax2,
)
optimizers.plot_train(
    **plot_sin_train_args2,
    epochs=60,
    optimizer=torch.optim.RMSprop,  # type: ignore
    optimizer_params={"lr": 0.6},
    ax=ax3,
)
plt.show()


Finally, `AdamW` is less sensitive to the initial learning rate. Even
with more extreme low and high values than in our RMSprop example, it
manages to reach the visible optimum with a smoother descent.

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(18, 5))
optimizers.plot_train(
    **plot_sin_train_args2,
    epochs=150,
    optimizer=torch.optim.AdamW,  # type: ignore
    optimizer_params={"lr": 0.2},
    ax=ax1,
)
optimizers.plot_train(
    **plot_sin_train_args2,
    epochs=50,
    optimizer=torch.optim.AdamW,  # type: ignore
    optimizer_params={"lr": 0.5},
    ax=ax2,
)
optimizers.plot_train(
    **plot_sin_train_args2,
    epochs=35,
    optimizer=torch.optim.AdamW,  # type: ignore
    optimizer_params={"lr": 1.0},
    ax=ax3,
)
plt.show()


## Schedulers

Schedulers are another set of methods designed to improve the training
process. The goal of a scheduler is to change the learning rate during
the process.

We will start with the following gradient descent example, with an `SGD`
optimizer. We can see that with a learning rate of 0.04, the descent
reaches the area of the visible minimum, but then it starts to oscillate
around the minimum indefinitely instead of really reaching it.

In [None]:
optimizers.plot_train(
    **plot_sin_train_args2,
    epochs=250,
    optimizer=torch.optim.SGD,  # type: ignore
    optimizer_params={"lr": 0.04},
)


This oscillation is due to the fixed value of the learning rate in
`SGD`. The high learning rate is useful at the start of the training
because it avoids a local minimum, but it becomes detrimental at the end
because it prevents the descent from stabilizing and reaching the
optimum value.

One way to work around this issue is to use a *scheduler* which will
regularly decrease the learning rate during the training process. In the
following example we use an `ExponentialLR` scheduler with a `gamma`
argument of 0.95, which means that at every training step the learning
rate will be multiplied by 0.95.

This avoids the oscillation problem at the end of the descent.
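
To see what this decay looks like in numbers, here is a minimal sketch that records the learning rate over 30 steps. The dummy parameter `w` is an assumption: it only exists so that an optimizer can be created, and no real training happens.

```python
import torch

w = torch.tensor(0.0, requires_grad=True)  # dummy parameter, never trained
optimizer = torch.optim.SGD([w], lr=0.04)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

lrs = []
for step in range(30):
    lrs.append(scheduler.get_last_lr()[0])  # learning rate used at this step
    optimizer.step()  # no gradient here, so the parameter is left unchanged
    scheduler.step()  # multiplies the learning rate by gamma

print([round(lr, 4) for lr in lrs[::5]])
```

The learning rate at step $k$ is $0.04 \times 0.95^k$, so it has dropped to less than a quarter of its initial value after 30 steps.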

In [None]:
optimizers.plot_train(
    **plot_sin_train_args2,
    epochs=30,
    optimizer=torch.optim.SGD,  # type: ignore
    optimizer_params={"lr": 0.04},
    scheduler=torch.optim.lr_scheduler.ExponentialLR,
    scheduler_params={"gamma": 0.95},
)


There are many different schedulers available, such as
`ReduceLROnPlateau`, which will reduce the learning rate only if the
loss value hasn’t gone down for a certain number of training steps.

Here we use it with a `factor` of 0.8 and a `patience` of 0, which means
that the learning rate will be multiplied by 0.8 as soon as the loss
value isn’t lower than the one of the previous step.

We can see that this method also avoids the oscillation problem at the
end of training.
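
The behavior of `patience=0` can be checked in isolation with a short sketch. The list `fake_losses` is made up for illustration and the dummy parameter `w` is never trained; we only watch how the learning rate reacts to the values passed to `scheduler.step()`.

```python
import torch

w = torch.tensor(0.0, requires_grad=True)  # dummy parameter, never trained
optimizer = torch.optim.SGD([w], lr=0.04)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, patience=0, factor=0.8
)

# Made-up loss values: the third one is worse than the best seen so far
fake_losses = [1.0, 0.9, 0.95, 0.8]
lrs = []
for loss_value in fake_losses:
    optimizer.step()
    scheduler.step(loss_value)  # this scheduler needs the monitored value
    lrs.append(optimizer.param_groups[0]["lr"])

print(lrs)
```

The learning rate stays at 0.04 while the loss improves, and is multiplied by 0.8 as soon as one value fails to improve on the best one.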

In [None]:
optimizers.plot_train(
    **plot_sin_train_args2,
    epochs=20,
    optimizer=torch.optim.SGD,  # type: ignore
    optimizer_params={"lr": 0.04},
    scheduler=torch.optim.lr_scheduler.ReduceLROnPlateau,
    scheduler_params={"patience": 0, "factor": 0.8},
)


Schedulers can be useful even for optimizers that use adaptive learning
rate. If we use `RMSprop` in the previous example, we can see that we
can still have the oscillation problem shown by `SGD`: at the end of the
process, the model oscillates between two values around the optimum.

In [None]:
optimizers.plot_train(
    **plot_sin_train_args2,
    epochs=150,
    optimizer=torch.optim.RMSprop,  # type: ignore
    optimizer_params={"lr": 0.4},
)


If we use a `ReduceLROnPlateau` scheduler, the oscillation problem goes
away.

In [None]:
optimizers.plot_train(
    **plot_sin_train_args2,
    epochs=30,
    optimizer=torch.optim.RMSprop,  # type: ignore
    optimizer_params={"lr": 0.4},
    scheduler=torch.optim.lr_scheduler.ReduceLROnPlateau,
    scheduler_params={"patience": 0, "factor": 0.9},
)


Finally, we can see that more modern optimizers like `AdamW` don’t seem
to have the same issue as they will better adapt the learning rate by
themselves.

In [None]:
optimizers.plot_train(
    **plot_sin_train_args2,
    epochs=50,
    optimizer=torch.optim.AdamW,  # type: ignore
    optimizer_params={"lr": 0.5},
)


However, other types of schedulers can be used with these optimizers,
especially with bigger models, for example to introduce a warmup period
at the start of training. In this case, the learning rate starts from a
small value and increases gradually to the desired starting learning
rate over a few epochs.
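
One possible way to sketch such a warmup is with `torch.optim.lr_scheduler.LambdaLR`, which multiplies the base learning rate by a factor computed from the epoch number. The linear ramp, the values of `base_lr` and `warmup_epochs` and the dummy parameter `w` are assumptions chosen for illustration.

```python
import torch

base_lr, warmup_epochs = 0.5, 5  # illustrative values

w = torch.tensor(0.0, requires_grad=True)  # dummy parameter, never trained
optimizer = torch.optim.AdamW([w], lr=base_lr)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    # Linear ramp from base_lr / warmup_epochs up to base_lr, then constant
    lr_lambda=lambda epoch: min(1.0, (epoch + 1) / warmup_epochs),
)

lrs = []
for epoch in range(8):
    lrs.append(optimizer.param_groups[0]["lr"])  # learning rate for this epoch
    optimizer.step()  # no gradient here, so the parameter is left unchanged
    scheduler.step()

print([round(lr, 2) for lr in lrs])
```

The learning rate grows linearly during the first five epochs and then stays at its target value.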

## Optimizers and schedulers in PyTorch

### Optimizers

In the previous notebooks we didn’t use PyTorch optimizers but adjusted
our parameters manually.

To use a PyTorch optimizer we must first create an optimizer instance by
calling an optimizer class from `torch.optim`, like `torch.optim.SGD` or
`torch.optim.AdamW`. We pass our model parameters as first argument,
followed by other arguments such as the learning rate.

For example, to create an `SGD` optimizer on two model parameters `w`
and `b` with a learning rate of 0.001:

``` python
optimizer = torch.optim.SGD([w, b], lr=0.001)
```

After that, we get two methods we can use in our training loop:

-   `optimizer.step()` will adjust the values of the parameters based on
    their gradients
-   `optimizer.zero_grad()` will reset the gradient values.
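
These two calls fit into a training loop as in the following minimal sketch. The toy loss, which is minimal at `w = 3` and `b = -1`, and the learning rate are assumptions for illustration; this is a different problem from the regression of the previous notebook.

```python
import torch

# Toy problem (an assumption): minimize (w - 3)^2 + (b + 1)^2
w = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)
optimizer = torch.optim.SGD([w, b], lr=0.1)

for epoch in range(100):
    loss = (w - 3) ** 2 + (b + 1) ** 2
    loss.backward()  # compute the gradients of the loss
    optimizer.step()  # update w and b from their gradients
    optimizer.zero_grad()  # reset the gradients before the next iteration

print(f"weight: {w.item():.3f}, bias: {b.item():.3f}")
```

The optimizer replaces both the manual parameter update and the manual gradient reset.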

**Exercise**

Change the following training code seen in the previous notebook, by
using a `torch.optim.SGD` optimizer with a learning rate of 0.001. Check
that both versions of the code give the same results.

In [None]:
# x and y data
x = torch.tensor([-1.5, 0.2, 3.4, 4.1, 7.8, 13.4, 18.0, 21.5, 32.0, 33.5])
y = torch.tensor([100.5, 110.2, 133.5, 141.2, 172.8, 225.1, 251.0, 278.9, 366.7, 369.9])

# Parameters
w = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)


def forward(x):
    return w * x + b


loss_fn = torch.nn.MSELoss()

step_size = 0.001
epochs = 5
for epoch in range(epochs):
    y_pred = forward(x)
    loss = loss_fn(y_pred, y)
    loss.backward()

    # weight and bias adjustment
    w.data = w.data - step_size * w.grad  # type: ignore
    b.data = b.data - step_size * b.grad  # type: ignore

    # reset gradients
    w.grad.zero_()  # type: ignore
    b.grad.zero_()  # type: ignore

    print(f"Epoch: {epoch}, loss: {loss:.2f}, weight: {w:.3f}, bias: {b:.3f}")

### Schedulers

To use a PyTorch scheduler in our training process, we have to create a
scheduler instance by passing it our optimizer object and additional
arguments. For example, to use a `ReduceLROnPlateau` scheduler:

``` python
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3, factor=0.9)
```

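After that, we can call `scheduler.step()` at the end of each epoch to
update the learning rate of our optimizer. Note that `ReduceLROnPlateau`
needs the monitored value as argument, for example
`scheduler.step(loss)`, whereas most other schedulers are called without
argument.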

**Exercise**

Modify the answer of the previous exercise to use a
`torch.optim.lr_scheduler.ExponentialLR` scheduler with a `gamma`
argument of 0.95.