# On the variance of the adaptive learning rate and beyond (RAdam)

## Motivation
- Adaptive learning rate optimizers have advantages over plain SGD, such as faster convergence and lower sensitivity to hyperparameters.
- Recently, the learning rate warmup heuristic has been widely used with adaptive learning rate optimizers because it stabilizes training, speeds up convergence, and improves generalization.
- In many recent NLP models, the warmup heuristic has been essential for them to train successfully.
- The reasons for this have not been investigated.
- The warmup heuristic comes with extra hyperparameters that must be tuned.
- This paper looks, both theoretically and empirically, at why adaptive learning rate optimizers sometimes fail, and proposes a new optimizer algorithm to alleviate these problems.

## Background

### Adaptive learning rate optimizers
- These optimizers can all be described by the following generic algorithm.

<img src="figs/radam/generic-adaptive-learning-rate-algo.png" width="60%"/>

- In the Adam case we have

<img src="figs/radam/adam-specific-case.png" width="60%"/>
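A minimal sketch of one Adam update for a single scalar parameter, written to mirror the generic form (a momentum term times an adaptive learning rate term). The function name `adam_step`, the default hyperparameters, and the small `eps` added for numerical safety are my assumptions and may not match the figure exactly.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad            # momentum: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad * grad     # EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)                  # bias-corrected momentum
    v_hat = v / (1 - beta2 ** t)                  # bias-corrected second moment
    adaptive_lr = 1.0 / (math.sqrt(v_hat) + eps)  # the adaptive learning rate term
    theta = theta - lr * adaptive_lr * m_hat
    return theta, m, v

# Usage: minimise f(x) = x^2, whose gradient is 2x.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=1e-2)
print(theta)
```

Note that at `t = 1` the adaptive term is just `1 / |grad|`, so the very first step has size `lr` regardless of the gradient magnitude. The estimate of the second moment is based on a single sample at that point.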

### Warmup heuristic
- The learning rate $\alpha_t$ is initially set to a very small value.
- It is then increased (often linearly) to some target learning rate.
- The warmup period usually lasts about one epoch.
- After warmup, other learning rate schedules, such as some form of decay, may be applied.
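A minimal sketch of a linear warmup schedule, assuming the warmup is counted in optimizer steps. The function name `linear_warmup_lr` and the parameter `warmup_steps` are hypothetical names chosen for illustration, not something taken from the paper.

```python
def linear_warmup_lr(step, target_lr, warmup_steps):
    """Learning rate at a 0-based step: linear ramp up, then constant."""
    if step < warmup_steps:
        # Ramp from target_lr / warmup_steps up to target_lr.
        return target_lr * (step + 1) / warmup_steps
    # After warmup a decay schedule could take over; here the rate stays constant.
    return target_lr

# Usage: the first few learning rates of a 10-step warmup towards 0.1.
schedule = [linear_warmup_lr(s, target_lr=0.1, warmup_steps=10) for s in range(12)]
print(schedule)
```

Both `target_lr` and `warmup_steps` are exactly the kind of extra hyperparameters that the motivation section refers to.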

## Analysis
- The figure shows that the distribution of (absolute) gradient values is quickly distorted.
- This is because the moving averages have not yet seen enough samples, so their estimates are poor and have high variance, which in turn affects the adaptive learning rate.
- The warmup heuristic solves this by dampening the step sizes early on when too few samples have been seen.

<img src="figs/radam/radam-gradient-histogram.png" width="40%"/>
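The high early variance can be illustrated with a small simulation. This is a toy sketch under my own assumptions (gradients drawn i.i.d. from a standard normal, Adam-style bias correction, hypothetical function name `adaptive_lr_variance`), not an experiment from the paper. It compares the sample variance of the adaptive learning rate after 2 steps with the variance after 200 steps.

```python
import math
import random

def adaptive_lr_variance(t, trials=2000, beta2=0.999, seed=0):
    """Sample variance of 1/sqrt(v_hat_t) over independent runs with N(0, 1) gradients."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        v = 0.0
        for _ in range(t):
            g = rng.gauss(0.0, 1.0)
            v = beta2 * v + (1 - beta2) * g * g   # EMA of squared gradients
        v_hat = v / (1 - beta2 ** t)              # bias correction
        samples.append(1.0 / math.sqrt(v_hat))    # adaptive learning rate term
    mean = sum(samples) / trials
    return sum((s - mean) ** 2 for s in samples) / (trials - 1)

early = adaptive_lr_variance(t=2)
late = adaptive_lr_variance(t=200)
print(f"variance at t=2: {early:.4f}, at t=200: {late:.4f}")
```

With only a couple of samples, a single small gradient produces a very large adaptive learning rate, which is what inflates the early variance.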

- They verify this with an experiment using a variant of Adam that, for the first 2000 iterations, updates only the adaptive learning rate while the model parameters and momentum variables are kept frozen. This variant avoids the problem.
- They verify it with a second experiment in which they explicitly decrease the variance by increasing $\epsilon$ to a non-negligible value. This also avoids the problem but gives worse performance. $\epsilon$ is usually set higher when training on ImageNet, for example, which I suppose may be related to this?

<img src="figs/radam/adam-verification-experiments.png" width="60%"/>

- They then verify this analytically as well.
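
For reference, these are the key quantities of the analytic argument as I understand it from the paper. The exponential moving average of squared gradients is approximated by a simple moving average whose length $\rho_t$ depends on the step $t$ and on Adam's $\beta_2$:

$$
\rho_\infty = \frac{2}{1 - \beta_2} - 1, \qquad \rho_t = \rho_\infty - \frac{2 t \beta_2^t}{1 - \beta_2^t}
$$

For small $t$, $\rho_t$ is roughly $t$, so very few samples effectively contribute early on. The paper's variance estimate for the adaptive learning rate is only tractable for $\rho_t > 4$, and the variance decreases as $\rho_t$ grows towards $\rho_\infty$.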

## Rectified adaptive learning rate
- Add the rectifier term (TODO: Intuition?)
- Turn the adaptiveness on or off depending on whether the variance diverges (TODO: Intuition?)

<img src="figs/radam/radam-algo.png" width="50%"/>
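A minimal scalar sketch of the two points above, following my reading of the paper's algorithm. The function names `rectification` and `radam_step`, the default hyperparameters, and the `eps` term are my assumptions (the algorithm in the paper is stated without `eps`). While the approximated sample count $\rho_t$ is at most 4, the adaptive term is switched off and the update falls back to SGD with momentum. Afterwards the adaptive step is scaled by the rectifier $r_t$.

```python
import math

def rectification(t, beta2=0.999):
    """Return the rectifier r_t, or None while rho_t <= 4 (variance not tractable)."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** t
    rho_t = rho_inf - 2.0 * t * beta2_t / (1.0 - beta2_t)
    if rho_t <= 4.0:
        return None
    return math.sqrt(((rho_t - 4.0) * (rho_t - 2.0) * rho_inf)
                     / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t))

def radam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One RAdam-style update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # EMA of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad   # EMA of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                # bias-corrected momentum
    r_t = rectification(t, beta2)
    if r_t is None:
        # Adaptiveness off: plain SGD with momentum.
        theta = theta - lr * m_hat
    else:
        # Adaptiveness on: Adam-style step scaled by the rectifier.
        v_hat = v / (1.0 - beta2 ** t)
        theta = theta - lr * r_t * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: the rectifier is undefined at first, then grows towards 1.
for step in (1, 10, 100, 1000, 10000):
    print(step, rectification(step))
```

Because $r_t$ starts small and approaches 1, it acts somewhat like a built-in warmup whose shape is determined by $\beta_2$ alone rather than by extra hyperparameters.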

## Comparison with learning rate warmup
TODO

## Experiments
TODO

<img src="figs/radam/radam-lr-robustness.png" width="80%"/>

## Takeaways
- Removes the need for a learning rate warmup phase, so there are fewer hyperparameters to tune.
- Less sensitive to the choice of learning rate.