# ADAM Optimization
This paper proposes ADAM, a method for stochastic gradient-based optimization that quickly became very popular. The name stands for adaptive moment estimation, and the method has several properties that are attractive in an optimization algorithm:
* Invariant to rescaling of the gradient.
* Does not require a stationary objective function.
* Performs a form of step size annealing automatically.

Objective functions are often stochastic, either because the data is subsampled (minibatches) or because of inherent noise such as dropout. The paper therefore focuses on optimizing stochastic objectives over high-dimensional parameter spaces.

In short, ADAM maintains estimates of the first and second moments of the gradients (the mean and the uncentered variance) and combines the advantages of AdaGrad, which handles sparse gradients well, and RMSProp, which handles non-stationary objectives well.

## Algorithm
* Have a noisy objective function $f(\theta)$ which is differentiable w.r.t. $\theta$
* Want to minimize $\mathbb{E} \left[ f(\theta) \right]$
* Learning rate $\alpha$
* Keep exponential moving averages over timesteps $t$ of the gradient ($m_t$) and of the squared gradient ($\upsilon_t$).
    * $\beta_1, \beta_2 \in [0, 1)$ control the exponential decay rates of $m_t$ and $\upsilon_t$, respectively.
    * These moving averages are estimates of the first moment (mean) and the second raw moment (uncentered variance) of the gradient.
    * The moving averages are initialized to zero, so they are biased towards zero, especially in the early steps and when the averages decay slowly ($\beta_i \approx 1$).
    * Bias-corrected mean and variance estimates:
        * $\hat{m}_t = m_t\ /\ (1 - \beta_1^t)$
        * $\hat{\upsilon}_t = \upsilon_t\ /\ (1 - \beta_2^t)$
* The update step for minimizing $f(\theta)$ is computed using the bias-corrected gradient mean $\hat{m}_t$ and the bias-corrected gradient variance $\hat{\upsilon}_t$: $\theta_t = \theta_{t-1} - \alpha \cdot \hat{m}_t\ /\ (\sqrt{\hat{\upsilon}_t} + \epsilon)$
    * $\epsilon$ is a small constant for numerical stability.
    * The magnitude of each update step is approximately bounded by the learning rate $\alpha$.
    * The paper calls $\hat{m}_t\ /\ \sqrt{\hat{\upsilon}_t}$ the signal-to-noise ratio (SNR). With a small SNR the update steps are closer to zero, which is desirable because a low SNR means more uncertainty about which direction to move in. The SNR typically becomes small as $\theta$ approaches an optimum, so the update steps shrink there automatically.
    * Invariant to rescaling the gradient by a factor $c > 0$ (ignoring $\epsilon$) because $(c \cdot \hat{m}_t)\ /\ (\sqrt{c^2 \cdot \hat{\upsilon}_t}) = \hat{m}_t\ /\ \sqrt{\hat{\upsilon}_t}$
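
The steps above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the paper's reference implementation: the function names `adam_step` and `minimize` are made up for illustration, parameters are plain lists of floats, and the defaults ($\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) are the values suggested in the paper.

```python
import math


def adam_step(theta, grad, m, v, t,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one ADAM update; t is the 1-based timestep.

    theta, grad, m and v are equal-length lists of floats.
    Returns the updated (theta, m, v) as new lists.
    """
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        # Exponential moving averages of the gradient and squared gradient.
        m_i = beta1 * m_i + (1 - beta1) * g
        v_i = beta2 * v_i + (1 - beta2) * g * g
        # Bias correction for the zero initialization.
        m_hat = m_i / (1 - beta1 ** t)
        v_hat = v_i / (1 - beta2 ** t)
        # Parameter update (minimization).
        new_theta.append(p - alpha * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v


def minimize(grad_fn, theta0, steps, **kwargs):
    """Run ADAM for a fixed number of steps (hypothetical helper)."""
    theta = list(theta0)
    m = [0.0] * len(theta)  # first moment estimate, initialized to zero
    v = [0.0] * len(theta)  # second raw moment estimate, initialized to zero
    for t in range(1, steps + 1):
        theta, m, v = adam_step(theta, grad_fn(theta), m, v, t, **kwargs)
    return theta


if __name__ == "__main__":
    # Toy objective f(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
    print(minimize(lambda th: [2.0 * (th[0] - 3.0)], [0.0],
                   steps=500, alpha=0.1))
```

On the first step the bias-corrected estimates reduce to $\hat{m}_1 = g_1$ and $\hat{\upsilon}_1 = g_1^2$, so the parameter moves by roughly $\alpha$ in the direction opposite to the gradient, whatever the gradient's scale. This is a quick way to check both the step size bound and the scale invariance in the sketch.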

## TODO
sparsity in gradient per steps
why we get zero biased estimates


## Pseudocode
<img src="figs/adam/adam-pseudocode.png" width="60%" height="60%">