# Learning Rate Scheduler

Adjusting the size of the weight updates throughout training

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim

The magnitude of the learning rate is important  
A learning rate that is too small slows down convergence and often leads to a suboptimal result  
On the other hand, a learning rate that is too large can make the optimization diverge
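As a small illustration of these two failure modes, here is a minimal sketch of plain gradient descent on the toy function $f(w) = w^2$. The function, the starting point and the three learning rates are assumptions chosen for illustration only.

```python
def gradient_descent(lr, steps=20, w=1.0):
    """Minimise f(w) = w**2 with plain gradient descent (toy example)."""
    for _ in range(steps):
        grad = 2 * w       # derivative of w**2
        w = w - lr * grad  # standard gradient descent update
    return w

# Illustrative learning rates (assumed values, not recommendations)
w_small = gradient_descent(lr=0.01)  # moves slowly toward the minimum at 0
w_good = gradient_descent(lr=0.4)    # converges quickly
w_large = gradient_descent(lr=1.1)   # every step overshoots further: divergence
print(w_small, w_good, w_large)
```

After the same number of steps, the small learning rate is still far from the minimum, while the large one has moved away from it.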

At the beginning of training, the error is large, so we can afford a higher learning rate  
Toward the end of training, the learning rate needs to be reduced to avoid bouncing around the minimum

We can access and modify the learning rate of the optimizer

In [2]:
model = nn.Sequential(nn.Linear(10, 1))

In [3]:
lr = 0.01
optimizer = optim.Adam(model.parameters(), lr=lr)
optimizer.param_groups[0]["lr"] = 0.1
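An optimizer can hold several parameter groups, each with its own learning rate, so a more general way to change it is to loop over `param_groups`. The sketch below assumes a hypothetical helper named `set_lr`, which is not part of PyTorch.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(10, 1))
optimizer = optim.Adam(model.parameters(), lr=0.01)

def set_lr(optimizer, new_lr):
    """Hypothetical helper: set the same learning rate on every parameter group."""
    for group in optimizer.param_groups:
        group["lr"] = new_lr

current_lrs = [group["lr"] for group in optimizer.param_groups]  # read
set_lr(optimizer, 0.1)                                           # write
updated_lrs = [group["lr"] for group in optimizer.param_groups]
print(current_lrs, updated_lrs)
```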

However, modifying it by hand is not practical  
Fortunately, PyTorch provides a number of built-in schedulers: https://pytorch.org/docs/stable/optim.html

A rudimentary approach is to multiply the learning rate by a fixed factor (`gamma`) at specific epochs

In [4]:
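epochs = 100
# Multiply the learning rate by gamma each time a milestone epoch is reached
# Milestones should be increasing epochs; 80 is an assumed value for the second drop
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 80], gamma=0.1)
for epoch in range(epochs):
    # train(...)  (calls optimizer.step(), which must come before scheduler.step())
    # validate(...)
    scheduler.step()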
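To check what a scheduler does, the learning rate can be recorded at every epoch with `get_last_lr()`. The sketch below assumes an SGD optimizer with a base learning rate of 0.1 and milestones at epochs 30 and 80; these values are illustrative.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(10, 1))
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 80], gamma=0.1)

lrs = []
for epoch in range(100):
    lrs.append(scheduler.get_last_lr()[0])  # learning rate used during this epoch
    optimizer.step()   # stands in for the real training step (no gradients here)
    scheduler.step()

print(lrs[0], lrs[30], lrs[80])
```

The learning rate stays constant between milestones and is divided by 10 at each of them.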



The problem is that we would rather have a smooth reduction of the learning rate than sudden drops  
A very common and empirically effective scheduler is the cosine scheduler

<center>
    <img src='images/cosine.png' width='60%'/>
</center>

In [5]:
min_lr = 0.0001
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, epochs, eta_min=min_lr)
for epoch in range(epochs):
    #train(...)
    #validate(...)
    scheduler.step()
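With `T_max` set to the number of epochs, the learning rate at epoch $t$ follows $\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{\pi t}{T_{max}}\right)\right)$. Here is a minimal sketch comparing this closed form with the values produced by the scheduler, assuming an SGD optimizer with a base learning rate of 0.1 (an illustrative choice).

```python
import math
import torch.nn as nn
import torch.optim as optim

base_lr, min_lr, epochs = 0.1, 0.0001, 100
model = nn.Sequential(nn.Linear(10, 1))
optimizer = optim.SGD(model.parameters(), lr=base_lr)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs, eta_min=min_lr)

def cosine_lr(t):
    """Closed-form cosine schedule at epoch t."""
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t / epochs))

max_gap = 0.0  # largest difference between the scheduler and the formula
for epoch in range(epochs):
    gap = abs(scheduler.get_last_lr()[0] - cosine_lr(epoch))
    max_gap = max(max_gap, gap)
    optimizer.step()   # stands in for the real training step
    scheduler.step()

final_lr = scheduler.get_last_lr()[0]  # should be close to min_lr after T_max steps
print(max_gap, final_lr)
```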

In certain special cases, there is a need to "warm up" the learning rate  
For example, if some strong features can dominate the others, a learning rate that is too high at the beginning can cause the neural network to focus only on these features and overfit very quickly

<center>
    <img src='images/warmup.png' width='80%'/>
    <p>Source: <a href='https://huggingface.co/docs/transformers/main_classes/optimizer_schedules'>Hugging Face </a></p>
</center>

In [6]:
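iter_before_restart = 20  # T_0: number of epochs before the first restart
mult_fact_restart = 10    # T_mult: each cycle is this many times longer than the previous one
# Note: a warm restart resets the learning rate to its initial value; it does not ramp it up from a small value
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, iter_before_restart, mult_fact_restart)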

Some recent works have shown that a warm-up period is generally beneficial for very deep neural networks  
And it's crucial for certain architectures (e.g., transformers)
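`CosineAnnealingWarmRestarts` periodically resets the learning rate to its initial value, which is different from ramping it up at the start of training. One possible way to get an actual warm-up phase is to chain `LinearLR` and `CosineAnnealingLR` with `SequentialLR` (available in recent PyTorch versions). This is a minimal sketch; the 10 warm-up epochs, the start factor of 0.1 and the SGD optimizer are assumptions.

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

epochs, warmup_epochs = 100, 10  # assumed values
model = nn.Sequential(nn.Linear(10, 1))
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Ramp linearly from 10% to 100% of the base learning rate...
warmup = LinearLR(optimizer, start_factor=0.1, total_iters=warmup_epochs)
# ...then decay along a cosine curve for the remaining epochs
cosine = CosineAnnealingLR(optimizer, T_max=epochs - warmup_epochs)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs])

lrs = []
for epoch in range(epochs):
    lrs.append(optimizer.param_groups[0]["lr"])  # learning rate used during this epoch
    optimizer.step()   # stands in for the real training step
    scheduler.step()

print(lrs[0], lrs[warmup_epochs], lrs[-1])
```

The learning rate starts at a tenth of its base value, reaches the base value at the end of the warm-up, and then decreases until the end of training.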

One scheduler that works very well empirically is `ReduceLROnPlateau`  
It reduces the learning rate by a given factor (by default $0.1$) after a given number of epochs (`patience`, by default $10$) without improvement of the monitored metric

In [7]:
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min')
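Unlike the previous schedulers, `ReduceLROnPlateau` needs the monitored metric to be passed to `step()`. Here is a minimal sketch with made-up validation losses and a short `patience` of 2 (both are assumptions, chosen so that the reduction is visible after a few epochs).

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(10, 1))
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1, patience=2)

# Made-up validation losses that stop improving after the third epoch
fake_val_losses = [1.0, 0.8, 0.7, 0.7, 0.7, 0.7]

lrs = []
for val_loss in fake_val_losses:
    optimizer.step()          # stands in for the real training step
    scheduler.step(val_loss)  # the monitored metric must be passed to step()
    lrs.append(optimizer.param_groups[0]["lr"])

print(lrs)
```

The learning rate is only reduced once the number of epochs without improvement exceeds `patience`, here on the sixth epoch.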

The optimizer itself holds some state (e.g., momentum buffers or the running averages used by RMSProp and Adam), so it should also be saved and loaded in order to resume training

In [8]:
torch.save(optimizer.state_dict(), 'optimizer.pt')
optimizer.load_state_dict(torch.load('optimizer.pt'))

The scheduler state can also be saved to be loaded later

In [9]:
torch.save(scheduler.state_dict(), 'scheduler.pt')
scheduler.load_state_dict(torch.load('scheduler.pt'))
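In practice, the model, optimizer and scheduler states are often stored together in a single checkpoint. Here is a minimal sketch of that pattern; the `build` helper, the dictionary keys and the file name are hypothetical choices, not a PyTorch convention.

```python
import os
import tempfile
import torch
import torch.nn as nn
import torch.optim as optim

def build():
    """Hypothetical helper: create a fresh model, optimizer and scheduler."""
    model = nn.Sequential(nn.Linear(10, 1))
    optimizer = optim.Adam(model.parameters(), lr=0.01)
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
    return model, optimizer, scheduler

model, optimizer, scheduler = build()
for _ in range(5):  # pretend we trained for 5 epochs
    optimizer.step()
    scheduler.step()

path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")  # hypothetical file name
torch.save({
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scheduler": scheduler.state_dict(),
}, path)

# Later: rebuild the objects first, then restore their state
new_model, new_optimizer, new_scheduler = build()
checkpoint = torch.load(path)
new_model.load_state_dict(checkpoint["model"])
new_optimizer.load_state_dict(checkpoint["optimizer"])
new_scheduler.load_state_dict(checkpoint["scheduler"])
print(new_scheduler.last_epoch, new_optimizer.param_groups[0]["lr"])
```

After loading, the scheduler resumes from the saved epoch and the optimizer from the saved learning rate.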

To go further: https://iconof.com/1cycle-learning-rate-policy/