## Find the awesome original post here: [Overview of different Optimizers for neural networks](https://medium.com/datadriveninvestor/overview-of-different-optimizers-for-neural-networks-e0ed119440c3)

## Gradient descent

Gradient descent is an iterative optimization algorithm used in machine learning to minimize a cost function. Minimizing the cost helps the model make more accurate predictions.

The gradient points in the direction of steepest increase. Since we want to find the lowest point in the valley, we need to go in the opposite direction: we update the parameters along the negative gradient to minimize the loss.

$$
    \theta = \theta - \eta \nabla J(\theta ; x, y)
$$

<center>
    $\theta$ is the weight parameter, $\eta$ is the learning rate and $\nabla J(\theta ; x, y)$ is the gradient of the loss $J$ with respect to the weight parameter $\theta$
</center>
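The update rule above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the toy loss $J(\theta) = (\theta - 3)^2$, the names `grad_J` and `gradient_descent`, and the values `eta=0.1` and `steps=100` are assumptions chosen for the example.

```python
def grad_J(theta):
    """Gradient of the assumed toy loss J(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


def gradient_descent(theta, eta=0.1, steps=100):
    """Repeatedly apply theta = theta - eta * grad J(theta)."""
    for _ in range(steps):
        theta = theta - eta * grad_J(theta)
    return theta


# Start away from the minimum (theta = 3) and walk downhill.
theta_final = gradient_descent(theta=0.0)
print(theta_final)
```

With these settings the distance to the minimum shrinks by a constant factor on every step, so `theta_final` ends up very close to 3.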

## Types of gradient descent

- Batch Gradient Descent or Vanilla Gradient Descent

- Stochastic Gradient Descent (SGD)

- Mini batch Gradient Descent
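
The three variants differ only in how many training examples are used to compute each gradient: the full dataset (batch), a single example (SGD), or a small subset (mini batch). Here is a minimal sketch, assuming a hypothetical one-parameter model `y_hat = w * x`, a mean squared error loss, and noise-free toy data. The `train` helper and its default values are illustrative only.

```python
import random


def grad_mse(w, batch):
    """Gradient of the mean squared error of y_hat = w * x over a batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def train(data, batch_size, eta=0.05, epochs=50, seed=0):
    """Gradient descent where batch_size selects the variant."""
    rng = random.Random(seed)
    data = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a new order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w -= eta * grad_mse(w, batch)
    return w


# Toy data generated from y = 2x, so the ideal weight is 2.
data = [(k / 10.0, 2.0 * k / 10.0) for k in range(1, 21)]

w_batch = train(data, batch_size=len(data))  # batch: one update per epoch
w_sgd = train(data, batch_size=1)            # SGD: one update per example
w_mini = train(data, batch_size=4)           # mini batch: one update per 4 examples
print(w_batch, w_sgd, w_mini)
```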

## Role of an optimizer

Optimizers update the weight parameters to minimize the loss function. The loss function acts as a guide to the terrain, telling the optimizer whether it is moving in the right direction to reach the bottom of the valley, the global minimum.

## Types of optimizers

### Momentum

Momentum is like a ball rolling downhill. The ball will gain momentum as it rolls down the hill.

![momentum](./momentum.png)

Momentum helps accelerate gradient descent (GD) on surfaces that curve much more steeply in one direction than in another. It also dampens the oscillations, as shown above.

To update the weights, it uses the gradient of the current step together with a decaying accumulation of the gradients from previous time steps. This helps us move faster towards convergence.

The speed-up is most noticeable on such ravine-like surfaces, where plain gradient descent oscillates across the steep direction and makes slow progress along the shallow one.

$$
    \begin{aligned} v_{t} &=\gamma v_{t-1}+\eta \nabla J(\theta ; x, y) \\ \theta &=\theta-v_{t} \end{aligned}
$$

<center>
    Momentum gradient descent takes the gradients of previous time steps into consideration
</center>
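
The two momentum equations translate almost line for line into code. The sketch below assumes a toy loss $J(\theta) = \frac{1}{2}(\theta_0^2 + 10\,\theta_1^2)$, which curves ten times more steeply in one direction than in the other. The names `grad_J` and `momentum_descent` and the values `eta=0.05` and `gamma=0.9` are illustrative assumptions.

```python
def grad_J(theta):
    """Gradient of the assumed toy loss J = 0.5 * (theta[0]**2 + 10 * theta[1]**2)."""
    return [theta[0], 10.0 * theta[1]]


def momentum_descent(theta, eta=0.05, gamma=0.9, steps=200):
    v = [0.0] * len(theta)  # velocity, one entry per parameter
    for _ in range(steps):
        g = grad_J(theta)
        # v_t = gamma * v_{t-1} + eta * grad J(theta)
        v = [gamma * v_i + eta * g_i for v_i, g_i in zip(v, g)]
        # theta = theta - v_t
        theta = [t_i - v_i for t_i, v_i in zip(theta, v)]
    return theta


theta_momentum = momentum_descent([1.0, 1.0])
print(theta_momentum)  # both coordinates end up close to the minimum at 0
```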

### Nesterov accelerated gradient (NAG)

Nesterov accelerated gradient is like a ball rolling down the hill that knows to slow down before the slope of the hill rises again.

We calculate the gradient not with respect to the current position but with respect to the approximate future position of the parameters. We evaluate the gradient at this look-ahead point and then update the weights based on it.

![NAG](./nag.png)

NAG is like going down the hill while being able to look ahead. This lets us correct the descent earlier, and it works slightly better than standard Momentum.

$$
\begin{align*}
    &v_{t} = \gamma v_{t-1} + \eta \nabla J\left(\theta - \gamma v_{t-1}\right) \\
    &\theta = \theta - v_{t} \\
    &\theta - \gamma v_{t-1} \text{ is the look-ahead position at which the gradient is evaluated}
\end{align*}
$$
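
The only change from plain momentum is where the gradient is evaluated. A minimal sketch, assuming the toy loss $J(\theta) = \frac{1}{2}(\theta_0^2 + 10\,\theta_1^2)$; the names `grad_J` and `nag_descent` and the values `eta=0.05` and `gamma=0.9` are illustrative assumptions:

```python
def grad_J(theta):
    """Gradient of the assumed toy loss J = 0.5 * (theta[0]**2 + 10 * theta[1]**2)."""
    return [theta[0], 10.0 * theta[1]]


def nag_descent(theta, eta=0.05, gamma=0.9, steps=200):
    v = [0.0] * len(theta)
    for _ in range(steps):
        # Look ahead to where the accumulated velocity is about to carry us.
        lookahead = [t_i - gamma * v_i for t_i, v_i in zip(theta, v)]
        g = grad_J(lookahead)  # gradient at the look-ahead position
        # v_t = gamma * v_{t-1} + eta * grad J(theta - gamma * v_{t-1})
        v = [gamma * v_i + eta * g_i for v_i, g_i in zip(v, g)]
        # theta = theta - v_t
        theta = [t_i - v_i for t_i, v_i in zip(theta, v)]
    return theta


theta_nag = nag_descent([1.0, 1.0])
print(theta_nag)
```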

### Adagrad — Adaptive Gradient Algorithm

Momentum and NAG both require us to tune the learning rate, which is an expensive process.

Adagrad is an adaptive learning rate method: it adapts the learning rate to each parameter. It performs larger updates for parameters associated with infrequent features and smaller updates for parameters associated with frequent features.

It is well suited to sparse data, as found in large-scale neural networks. GloVe word embeddings are trained with Adagrad, where infrequent words require larger updates and frequent words require smaller updates.

SGD, Momentum, and NAG update all parameters θ at once, using the same learning rate η. Adagrad uses a different learning rate for every parameter θ at every time step t.

$$
\begin{align*}
    &\theta_{t+1} = \theta_{t} - \frac{\eta}{\sqrt{G_{t} + \varepsilon}} \cdot g_{t} \\
    &G_t \text{ is the sum of the squares of the past gradients w.r.t. each parameter } \theta
\end{align*}
$$

Adagrad eliminates the need to manually tune the learning rate.

The denominator accumulates the sum of the squares of the past gradients. Every term is non-negative, so the sum keeps growing and the effective learning rate eventually becomes so small that the algorithm is no longer able to learn. Adadelta, RMSProp, and Adam try to resolve Adagrad’s radically diminishing learning rates.
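
The shrinking effective learning rate is easy to observe in code. The sketch below records $\eta / \sqrt{G_t + \varepsilon}$ for each parameter at every step. The toy loss, the names `grad_J` and `adagrad`, and the values `eta=0.5` and `eps=1e-8` are assumptions made for the illustration.

```python
import math


def grad_J(theta):
    """Gradient of the assumed toy loss J = 0.5 * (theta[0]**2 + 10 * theta[1]**2)."""
    return [theta[0], 10.0 * theta[1]]


def adagrad(theta, eta=0.5, eps=1e-8, steps=100):
    G = [0.0] * len(theta)  # per-parameter sum of squared gradients
    rate_history = []       # effective learning rates, one list per step
    for _ in range(steps):
        g = grad_J(theta)
        G = [G_i + g_i ** 2 for G_i, g_i in zip(G, g)]
        rates = [eta / math.sqrt(G_i + eps) for G_i in G]
        theta = [t_i - r_i * g_i for t_i, r_i, g_i in zip(theta, rates, g)]
        rate_history.append(rates)
    return theta, rate_history


theta_adagrad, rate_history = adagrad([1.0, 1.0])
print(rate_history[0], rate_history[-1])  # the rates never increase
```

Each parameter gets its own rate, and because `G` only ever grows, those rates can only stay the same or decrease over time.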

### Adadelta

- Adadelta is an extension of Adagrad that tries to reduce Adagrad’s aggressive, monotonically decreasing learning rate.

- It does this by restricting the window of accumulated past gradients to some fixed size w. In practice this is done with a running average: the average at time t depends only on the previous average and the current gradient.

- In Adadelta we do not need to set a default learning rate, because the update uses the ratio of the running RMS of the previous parameter updates to the running RMS of the gradients.

$$
\begin{align*}
    &\theta_{t+1} = \theta_{t} + \Delta \theta_{t} \\
    &\Delta \theta_{t} = -\frac{RMS[\Delta \theta]_{t-1}}{RMS[g]_{t}} \cdot g_{t}
\end{align*}
$$
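
A minimal sketch of this update, assuming the usual definition $RMS[x]_t = \sqrt{E[x^2]_t + \varepsilon}$ with an exponentially decaying average $E[x^2]_t$. The decay rate `rho=0.9`, `eps=1e-6`, the toy loss, and all function names are assumptions for the illustration. Note that no learning rate appears anywhere.

```python
import math


def J(theta):
    """Assumed toy loss."""
    return 0.5 * (theta[0] ** 2 + 10.0 * theta[1] ** 2)


def grad_J(theta):
    """Gradient of the toy loss above."""
    return [theta[0], 10.0 * theta[1]]


def adadelta(theta, rho=0.9, eps=1e-6, steps=500):
    avg_sq_grad = [0.0] * len(theta)   # running average E[g^2]
    avg_sq_delta = [0.0] * len(theta)  # running average E[delta_theta^2]
    for _ in range(steps):
        g = grad_J(theta)
        avg_sq_grad = [rho * a + (1 - rho) * g_i ** 2
                       for a, g_i in zip(avg_sq_grad, g)]
        # delta_theta_t = -RMS[delta_theta]_{t-1} / RMS[g]_t * g_t
        delta = [-math.sqrt(d + eps) / math.sqrt(a + eps) * g_i
                 for d, a, g_i in zip(avg_sq_delta, avg_sq_grad, g)]
        avg_sq_delta = [rho * d + (1 - rho) * dl ** 2
                        for d, dl in zip(avg_sq_delta, delta)]
        # theta_{t+1} = theta_t + delta_theta_t
        theta = [t_i + dl for t_i, dl in zip(theta, delta)]
    return theta


start = [1.0, 1.0]
theta_adadelta = adadelta(start)
print(J(start), J(theta_adadelta))
```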


### RMSProp

- RMSProp stands for Root Mean Square Propagation. It was devised by Geoffrey Hinton.

- RMSProp tries to resolve Adagrad’s radically diminishing learning rates by using a moving average of the squared gradient. It uses the magnitude of recent gradients to normalize the current gradient.

- In RMSProp the learning rate is adjusted automatically, with a different effective learning rate for each parameter. RMSProp divides the learning rate by the square root of an exponentially decaying average of squared gradients.

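$$
\begin{aligned}
    E[g^{2}]_{t} &= \gamma E[g^{2}]_{t-1} + (1 - \gamma) g_{t}^{2} \\
    \theta_{t+1} &= \theta_{t} - \frac{\eta}{\sqrt{E[g^{2}]_{t} + \varepsilon}} \cdot g_{t}
\end{aligned}
$$

<center>
    $\gamma$ is the decay term that takes a value from 0 to 1, $g_t$ is the current gradient and $E[g^{2}]_{t}$ is the moving average of squared gradients
</center>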
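
A minimal sketch of RMSProp, written with an explicit running average of squared gradients in which `gamma` weights the previous average and `1 - gamma` weights the newest squared gradient. The toy loss, the names `grad_J` and `rmsprop`, and the values `eta=0.01`, `gamma=0.9` and `eps=1e-8` are assumptions for the illustration.

```python
import math


def grad_J(theta):
    """Gradient of the assumed toy loss J = 0.5 * (theta[0]**2 + 10 * theta[1]**2)."""
    return [theta[0], 10.0 * theta[1]]


def rmsprop(theta, eta=0.01, gamma=0.9, eps=1e-8, steps=500):
    avg_sq_grad = [0.0] * len(theta)  # moving average of squared gradients
    for _ in range(steps):
        g = grad_J(theta)
        avg_sq_grad = [gamma * a + (1 - gamma) * g_i ** 2
                       for a, g_i in zip(avg_sq_grad, g)]
        # Divide the learning rate by the root of the moving average.
        theta = [t_i - eta / math.sqrt(a + eps) * g_i
                 for t_i, a, g_i in zip(theta, avg_sq_grad, g)]
    return theta


theta_rmsprop = rmsprop([1.0, 1.0])
print(theta_rmsprop)
```

Unlike Adagrad's ever-growing sum, the moving average forgets old gradients, so the effective learning rate can recover when gradients become small.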

### Adam — Adaptive Moment Estimation

- Adam is another method that computes an individual adaptive learning rate for each parameter from estimates of the first and second moments of the gradients.

- It also mitigates the radically diminishing learning rates of Adagrad.

- Adam can be viewed as a combination of Adagrad, which works well on sparse gradients, and RMSprop, which works well in online and nonstationary settings.

- Adam scales the learning rate with an exponential moving average of the squared gradients instead of the cumulative sum used in Adagrad. Like Momentum, it also keeps an exponentially decaying average of past gradients.

- Adam is computationally efficient and has low memory requirements.

- The Adam optimizer is one of the most popular gradient descent optimization algorithms.

The Adam algorithm first updates the exponential moving averages of the gradient ($m_t$) and the squared gradient ($v_t$), which are the estimates of the first and second moments.

The hyper-parameters β1, β2 ∈ [0, 1) control the exponential decay rates of these moving averages, as shown below:

$$
\begin{aligned}
    &m_{t} = \beta_{1} m_{t-1} + \left(1 - \beta_{1}\right) g_{t} \\
    &v_{t} = \beta_{2} v_{t-1} + \left(1 - \beta_{2}\right) g_{t}^{2} \\
    &m_{t} \text{ and } v_{t} \text{ are estimates of first and second moment respectively}
\end{aligned}
$$

The moving averages are initialized to 0, so the moment estimates are biased towards zero, especially during the initial time steps. This initialization bias is easily counteracted, resulting in the bias-corrected estimates:

$$
\begin{aligned}
    &\hat{m}_{t}=\frac{m_{t}}{1-\beta_{1}^{t}} \\
    &\hat{v}_{t}=\frac{v_{t}}{1-\beta_{2}^{t}} \\
    &\hat{m}_{t} \text{ and } \hat{v}_{t} \text{ are bias corrected estimates of first and second moment respectively}
\end{aligned}
$$

Finally, we update the parameter as shown below

$$
\theta_{t+1} = \theta_{t} - \frac{\eta \hat{m}_{t}}{\sqrt{\hat{v}_{t}} + \varepsilon}
$$
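
The three stages (moving averages, bias correction, parameter update) can be sketched as follows. Here ε is added outside the square root, as in the Adam paper listed in the references. The toy loss, the names `grad_J` and `adam`, and the step size `eta=0.05` are assumptions for the illustration; `beta1=0.9`, `beta2=0.999` and `eps=1e-8` are the defaults suggested in that paper.

```python
import math


def grad_J(theta):
    """Gradient of the assumed toy loss J = 0.5 * (theta[0]**2 + 10 * theta[1]**2)."""
    return [theta[0], 10.0 * theta[1]]


def adam(theta, eta=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m = [0.0] * len(theta)  # first moment estimate (mean of gradients)
    v = [0.0] * len(theta)  # second moment estimate (mean of squared gradients)
    for t in range(1, steps + 1):
        g = grad_J(theta)
        m = [beta1 * m_i + (1 - beta1) * g_i for m_i, g_i in zip(m, g)]
        v = [beta2 * v_i + (1 - beta2) * g_i ** 2 for v_i, g_i in zip(v, g)]
        # Bias correction compensates for the zero initialization.
        m_hat = [m_i / (1 - beta1 ** t) for m_i in m]
        v_hat = [v_i / (1 - beta2 ** t) for v_i in v]
        theta = [th_i - eta * mh_i / (math.sqrt(vh_i) + eps)
                 for th_i, mh_i, vh_i in zip(theta, m_hat, v_hat)]
    return theta


theta_adam = adam([1.0, 1.0])
print(theta_adam)
```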

### Nadam - Nesterov-accelerated Adaptive Moment Estimation

- Nadam combines NAG and Adam.

- Nadam is employed for noisy gradients or for loss surfaces with high curvature.

- The learning process is accelerated by combining the exponentially decaying moving average of the previous gradients with the current gradient.
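
The notes above do not give a formula for Nadam, so the sketch below makes an assumption: it uses the simplified update from the overview paper in the references (arXiv 1609.04747), in which Adam's step is applied to $\beta_1 \hat{m}_t + \frac{(1 - \beta_1) g_t}{1 - \beta_1^t}$ instead of $\hat{m}_t$ alone. The toy loss, the function names and all hyper-parameter values are likewise illustrative.

```python
import math


def grad_J(theta):
    """Gradient of the assumed toy loss J = 0.5 * (theta[0]**2 + 10 * theta[1]**2)."""
    return [theta[0], 10.0 * theta[1]]


def nadam(theta, eta=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m = [0.0] * len(theta)
    v = [0.0] * len(theta)
    for t in range(1, steps + 1):
        g = grad_J(theta)
        m = [beta1 * m_i + (1 - beta1) * g_i for m_i, g_i in zip(m, g)]
        v = [beta2 * v_i + (1 - beta2) * g_i ** 2 for v_i, g_i in zip(v, g)]
        m_hat = [m_i / (1 - beta1 ** t) for m_i in m]
        v_hat = [v_i / (1 - beta2 ** t) for v_i in v]
        # Nesterov-style step: mix the decayed average with the current gradient.
        step = [beta1 * mh_i + (1 - beta1) * g_i / (1 - beta1 ** t)
                for mh_i, g_i in zip(m_hat, g)]
        theta = [th_i - eta * s_i / (math.sqrt(vh_i) + eps)
                 for th_i, s_i, vh_i in zip(theta, step, v_hat)]
    return theta


theta_nadam = nadam([1.0, 1.0])
print(theta_nadam)
```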

In the diagram below we can see how different optimizers converge to the minimum. Adagrad, Adadelta, and RMSprop head off immediately in the right direction and converge. Momentum and NAG are led off-track, evoking the image of a ball rolling down the hill, although NAG corrects itself quickly.

![performance comparison](./performance_comparison.gif)

## References

[Overview of different Optimizers for neural networks](https://medium.com/datadriveninvestor/overview-of-different-optimizers-for-neural-networks-e0ed119440c3)

[Adam: A Method for Stochastic Optimization by Diederik P. Kingma, Jimmy Ba](https://arxiv.org/pdf/1412.6980.pdf)

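[CS231n: Neural Networks Part 3: Learning and Evaluation](http://cs231n.github.io/neural-networks-3/)

[An overview of gradient descent optimization algorithms by Sebastian Ruder](https://arxiv.org/pdf/1609.04747.pdf)

[Efficient BackProp by Yann LeCun, Léon Bottou, Genevieve B. Orr, Klaus-Robert Müller](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf)

[Comparison and Summary of Optimization Algorithms (Optimizers)](https://zhuanlan.zhihu.com/p/55150256)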