Starting off by watching [this](https://www.youtube.com/watch?v=NE88eqLngkg) video before I read the paper:

One thing that I found quite illustrative was his analogy for momentum: imagine the weights w as some sort of particle moving through the loss space. Then we think of $\rho$, the momentum coefficient, as being the mass of the particle, the gradient term as being some sort of impulse on the particle, and the v term as being the current "velocity" of the particle.

## First Pass of Paper

Quickly, a definition of SGD: instead of using all the training data, we select one example or a small batch, get the gradients from that, and then do the update. This is faster, and it adds a little bit of noise that can help get out of local minima.
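Here's a rough sketch of that definition on a toy linear regression, just to make the "small batch, then update" loop concrete. The function name `sgd_minibatch`, the data, and all the hyperparameter values are made up for illustration.

```python
import numpy as np

def sgd_minibatch(X, y, lr=0.1, batch_size=8, steps=500, seed=0):
    """Fit y = X @ w by minibatch SGD on the mean squared error."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        # pick a small random batch instead of using all the training data
        idx = rng.choice(len(X), size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        # gradient of the mean squared error on this batch only
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / batch_size
        w -= lr * grad
    return w

# toy data (assumed): noiseless targets from known weights
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w
w = sgd_minibatch(X, y)
```

Each update only ever sees 8 of the 200 rows, which is where both the speed and the noise come from.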

I've experienced myself how it's sometimes a bit hard to pick a good learning rate. It can depend on so many things: network size, data, which activation functions we use, network architecture, etc. I've kind of been trial-and-erroring to get a good value. SGD also has one global rate. I haven't experienced it myself, but sometimes loss landscapes can have long narrow valleys where SGD oscillates back and forth across the walls.

Before Adam there were 2 ideas that help to combat this. The first goes all the way back to the backprop paper from Rumelhart, which was to add momentum, a moving average of past gradients. The second is adaptive learning rates, where each parameter gets its own learning rate.
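A small sketch of the first idea, on a made-up "long narrow valley" (a quadratic that is steep in one direction and shallow in the other). This uses one common formulation of momentum, `v = rho * v + grad`, which is an assumption on my part and not necessarily the exact form in the backprop paper. It also uses the exact gradient instead of a stochastic one to keep things simple. The curvatures, learning rate, and step count are arbitrary.

```python
import numpy as np

def loss(w, curv):
    """Quadratic bowl 0.5 * sum(curv * w^2); unequal curvatures make a narrow valley."""
    return 0.5 * np.sum(curv * w ** 2)

def run(curv, w0, lr, rho, steps):
    """Gradient descent with momentum; rho = 0 reduces to plain gradient descent."""
    w = w0.copy()
    v = np.zeros_like(w)
    for _ in range(steps):
        grad = curv * w      # gradient of the quadratic
        v = rho * v + grad   # "velocity": decayed past gradients plus the fresh one
        w = w - lr * v
    return w

curv = np.array([1.0, 50.0])  # shallow along w[0], steep along w[1]
w0 = np.array([10.0, 1.0])
loss_plain = loss(run(curv, w0, lr=0.035, rho=0.0, steps=300), curv)
loss_momentum = loss(run(curv, w0, lr=0.035, rho=0.9, steps=300), curv)
```

With the same learning rate, the plain run is held back by the shallow direction, while the momentum run keeps accumulating speed along it and ends at a lower loss in this toy setup.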

Adam combines these two ideas and adds one important correction to make it work properly.

#### Adam Algorithm
As inputs we have:

$\alpha$ the learning rate (recommended 0.001)

$\beta_{1}, \beta_{2}$ the exponential decay rates (recommended 0.9 and 0.999 respectively)

$f(\theta)$ the loss function

$\theta_{0}$ the initial parameters (the weights and biases)

We then initialize a few vectors

m_0, a vector of 0s with the same shape as theta: this is the first moment vector, for momentum

v_0, a vector of 0s with the same shape as theta: this is the second moment vector, for the adaptive learning rates

t = 0 as the timestep counter.

This is the algorithm:

t += 1

g_t = gradient of loss with respect to parameters

m_t = beta_1 * m_t-1 + (1 - beta_1) * g_t

// this allows us to update our weighted moving average (momentum) of the gradients with our fresh ones.

// 0.9 * old gradients + 0.1 * fresh gradients

v_t = beta_2 * v_t-1 + (1 - beta_2) * g_t^2

// this updates our estimate of the uncentered variance of the gradients

// similar structure to m_t update

// we square them to keep track of the magnitude of recent gradients for each parameter -- squaring them makes it sensitive

// to large values, later on when we divide by sqrt vt that means if we're getting a lot of large gradients we'll kind of

// "slow down" and the learning rate will shrink

m_hat_t = m_t / (1 - beta_1 ^ t)

// this is a bias correction step. We initialized our momentum to 0 so it will, at the start, be more biased to 0

// this helps to correct for that, when we have a small t (early in the training) then we'll be dividing by a small number

// making it larger, as t gets larger and larger it will just be the term over 1

v_hat_t = v_t / (1 - beta_2 ^ t)

// same correction as above but for the uncentered variance term

theta_t = theta_t-1 - alpha * m_hat_t / (sqrt(v_hat_t) + epsilon)

// parameter update

// we take the previous parameters and subtract the learning rate multiplied by our special term

// our special term is the m_hat_t, (the momentum bias corrected gradients) divided by the square root of the

// v_hat_t our corrected uncentered variance plus a little epsilon so we don't divide by 0 :)
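The steps above can be sketched in NumPy roughly like this. It's a minimal sketch, not a reference implementation: the function name `adam_step` and the toy loss are made up, and I'm assuming `eps=1e-8` for the epsilon.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new (theta, m, v, t)."""
    t += 1
    m = beta1 * m + (1 - beta1) * grad        # moving average of the gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # moving average of the squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for m
    v_hat = v / (1 - beta2 ** t)              # bias correction for v
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v, t

# toy loss (assumed): f(theta) = sum(theta^2), so the gradient is 2 * theta
theta = np.array([1.0, -3.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
t = 0
theta, m, v, t = adam_step(theta, 2 * theta, m, v, t)
```

On the very first step m_hat is just the gradient and v_hat is the gradient squared, so every parameter moves by almost exactly alpha in the downhill direction, no matter how large its gradient is.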


Notes:

So this is kind of a smart way to combine things and to remove the bias. It makes a lot of sense to me.
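To see the bias being removed, here's a tiny check with a made-up constant gradient of 1.0. The raw moving average starts near 0 and creeps up, while the corrected one sits at the true value from the first step.

```python
beta1 = 0.9
g = 1.0   # assumed constant gradient, purely for illustration
m = 0.0   # initialized to 0, which is where the bias comes from
raw, corrected = [], []
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g
    raw.append(m)                              # biased toward 0 early on
    corrected.append(m / (1 - beta1 ** t))     # bias-corrected estimate
```

Here `raw` starts at about 0.1 and only slowly climbs toward 1, while every entry of `corrected` is 1 up to floating point error.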

We have the momentum that kind of smooths the movement and gives it a more stable direction, so we don't oscillate wildly if there's a weird gradient for the batch. We also have the second moment term that sets our "speed" correctly. If the gradients we're getting are consistently small, then we'll kind of speed up to cover more ground over this flatter area. If the gradients we're getting are consistently large, we're in some steepish area, so we should slow down so we don't jump all over the place.

##### Description / details of algorithm

An important property is the choice of stepsize. Let's assume epsilon = 0 for simplicity. Aside from just the alpha, the core insight is what they call the SNR, or signal-to-noise ratio, which is that special term we multiply alpha by.

The signal is our estimate of the true direction of the gradient. The noise that we divide by is the magnitude of the gradient's fluctuation.
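Written out with epsilon = 0, and using $\Delta_t$ as my own label for the change in the parameters at step t:

$$
\Delta_t = \theta_t - \theta_{t-1} = -\alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t}}
$$

As a sanity check, if every gradient so far has been the same nonzero value $g$, then $\hat{m}_t = g$ and $\hat{v}_t = g^2$, so the ratio is $g / |g| = \pm 1$ and the step has size exactly $\alpha$, whatever the size of $g$.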

One part that confused me that I just understood: I was looking a lot at the "noise term", the uncentered variance, in isolation. It seemed strange to me that this term, which is supposed to capture the fluctuations, treats a bunch of steep gradients with the same sign the same as a bunch of steep gradients with alternating signs. One seems to fluctuate while the other just seems to be a steep gradient. That is to say, [-30, 30, -30, 30] is interpreted the same as [30, 30, 30, 30] because of the squaring. But what I realized is that you also need to consider how that affects the m_t term. If the gradients all agree, the momentum quickly builds up, the magnitude stabilizes, and we get a large update, about the size of alpha. If they fluctuate, the moving average stays around 0 while the magnitude is still large, so the ratio is small and we get a very small update.
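A quick way to check this is to feed both sequences through the two moving averages and look at the ratio, with epsilon = 0. The helper name `adam_ratio` is made up, and the betas are the recommended defaults from above.

```python
import math

def adam_ratio(grads, beta1=0.9, beta2=0.999):
    """Return m_hat / sqrt(v_hat) after a sequence of scalar gradients (epsilon = 0)."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / math.sqrt(v_hat)

consistent = adam_ratio([30, 30, 30, 30])      # ratio of 1: a full step of size alpha
alternating = adam_ratio([-30, 30, -30, 30])   # roughly 0.05: a tiny step
```

The v term comes out identical in both cases, so all of the difference is in the m term on top.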

So that leads to the natural question: why does the v_t term matter so much? From the above scenario you might think that the momentum term on top is more important. That goes back to my previous point about the v term setting the "speed".

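Imagine now a gentle gradient, something like [0.2, 0.2, 0.2]. Then the momentum terms will agree in direction and the v term will be very small, so dividing by its square root makes it a bigger step: the ratio is close to 1 and the step is about alpha, instead of the 0.2 * alpha that a plain gradient step would give.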

If instead we have gradients that are something like [+5, -4, +5, -4], then we'll still have a net direction (the average is +0.5), but because the gradients are so large it's a bit of a red flag, and it will take smaller steps. The net signal is weak and the ground is steep.
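Here are those two scenarios as actual step sizes, compared with what a plain gradient step (alpha times the latest gradient) would do. The helper name `adam_step_size` is made up, epsilon is 0, and the exact number for the noisy case depends on where the sequence stops, so treat it as a rough illustration.

```python
import math

def adam_step_size(grads, alpha=0.001, beta1=0.9, beta2=0.999):
    """Size of the Adam update after a sequence of scalar gradients (epsilon = 0)."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return abs(alpha * m_hat / math.sqrt(v_hat))

alpha = 0.001
gentle = adam_step_size([0.2, 0.2, 0.2], alpha)   # about alpha
noisy = adam_step_size([5, -4, 5, -4], alpha)     # well under a tenth of alpha
plain_gentle = alpha * 0.2                        # plain gradient step on the gentle slope
plain_noisy = alpha * 4                           # plain gradient step on the last noisy gradient
```

So Adam takes a bigger step than the plain update on the gentle, consistent slope and a much smaller one on the steep, noisy one.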

Then in the last one