# Biweekly Report 2: Adam Optimization vs The Rest #
For this biweekly report, I delve into Adam, one of the most cited and most widely used loss-minimization algorithms, so popular that it has become the default in many libraries today. I compare Adam against other minimization techniques, and attempt to break it down to its bare bones. I would like to preface that for my testing/figures, a CNN is used for classification. We have not yet covered such structures, and while I do have experience with them, I mainly wanted to focus on building a complete understanding of these optimizers, which I believe I can confidently say I now have.

# Where to start ? Answer: SGD #
Well, in order to explain Adam, there are a few prerequisites. For example, what exactly is an optimization technique, and what are we even optimizing? To answer that, let us introduce Stochastic Gradient Descent (SGD). With any model, our goal is to minimize our loss in order to improve the accuracy of our model's predictions (while avoiding overfitting and underfitting, of course). The key to improving our model's performance is the optimization of our loss function. The loss function is optimized indirectly, by modifying the weights of our model. This optimization is typically done through gradient descent. In general, gradient descent is defined by:

$W_{t+1} = W_t - \alpha \cdot \nabla \text{L}(W_t)$

Where:

- $W_{t+1}$ is the updated weight

- $W_t$ is the current weight

- $\alpha$ is the learning rate

- $\nabla \text{L}(W_t)$ is the gradient vector of the loss function (the loss itself is often denoted $J(\theta)$)

- $t$ is our iteration

And of course, this is performed on all weights. The fundamental idea is to take the gradient in order to find the direction of our "step", where the step size is determined by alpha, our learning rate. Stochastic Gradient Descent is this process, on all weights, with a constant alpha, where the gradient is estimated from a randomly sampled example or mini-batch rather than the full dataset. It will be the baseline we compare all our optimizers to, including Adam.

$\alpha$, in this case, is a hyperparameter: it isn't something optimized by SGD, only chosen by the machine learning scientist. The learning rate, at its core, is how big a step we take against the direction of the gradient, in the hope of reaching an acceptably low loss.
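
To make the update rule concrete, here is a minimal sketch of plain gradient descent. The toy loss, its minimum at (3, -1), and the names `loss`, `grad_loss`, and `gradient_descent` are all made up for illustration. It also uses the exact gradient, so it shows the update itself rather than the stochastic sampling.

```python
import numpy as np

TARGET = np.array([3.0, -1.0])  # hypothetical minimum of the toy loss

def loss(w):
    # Toy bowl-shaped loss, chosen only for illustration
    return np.sum((w - TARGET) ** 2)

def grad_loss(w):
    # Analytic gradient of the toy loss above
    return 2.0 * (w - TARGET)

def gradient_descent(w0, alpha=0.1, steps=100):
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        # W_{t+1} = W_t - alpha * grad L(W_t)
        w = w - alpha * grad_loss(w)
    return w

w_final = gradient_descent([0.0, 0.0])
print(w_final, loss(w_final))
```

With this particular loss, every step shrinks the distance to the minimum by a constant factor, so the weights settle very close to (3, -1).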
# Drawbacks #
While 100% a *step* (get it?) in the right direction, this introduces a new problem: choosing a good $\alpha$. Our steps can be too small, taking a long time to converge to a minimum, or too big, causing us to overshoot and never settle at the minimum. Let's tackle each issue one at a time, dealing with the time aspect first. It could be that our local minimum is a straight shot from wherever our model currently is; however, if $\alpha$ is too small, we might take significantly more time than needed to get there. That is where we introduce the concept of **Momentum**.
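
Both failure modes can be seen on a one-dimensional toy problem. The sketch below assumes the loss $L(w) = w^2$ (gradient $2w$), and the three learning rates are arbitrary values picked to show slow progress, quick convergence, and divergence.

```python
def run_gd(alpha, w0=1.0, steps=50):
    # Gradient descent on the toy loss L(w) = w**2, whose gradient is 2*w
    w = w0
    for _ in range(steps):
        w = w - alpha * 2.0 * w
    return w

for alpha in (0.01, 0.4, 1.1):
    # Distance from the minimum at w = 0 after 50 steps
    print(f"alpha={alpha}: |w| = {abs(run_gd(alpha)):.3e}")
```

Each step multiplies $w$ by $(1 - 2\alpha)$, so a tiny $\alpha$ barely moves, a moderate one collapses toward 0, and any $\alpha > 1$ makes $|w|$ grow on every step.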
# SGD With Momentum #
The best way to introduce the concept of Momentum in our optimization is to think about it exactly as it sounds. We introduce a *Velocity* vector, a running (exponentially weighted) average of the past gradients of our parameters. We now define SGD with Momentum as:
- $V_{t+1} = \beta \cdot V_{t} + (1-\beta) \nabla W_{t}$
- $W_{t+1} = W_{t} - \alpha \cdot V_{t+1}$

Where:

- $V_{t+1}$ is the updated velocity
- $W_{t+1}$ is the updated weight
- $V_{t}$ is the old velocity
- $\nabla W_{t}$ is the gradient of the loss function with respect to $W_{t}$
- $W_{t}$ is our old parameter
- $\beta$ is a new hyperparameter, usually kept at around 0.9


With this new update rule, we now add a portion of past gradients to our current gradient before an update occurs. This is how we implement the concept of "Momentum", or, "a big step was made in this direction, so let me take a bigger step". What I found extremely interesting about this definition of SGD + Momentum is how, by our definitions, older gradients are naturally considered less and less as we continue. How so? Let us consider specifically the velocity update expression. If we expand $V_{t}$, we obtain:

$V_{t+1} = \beta \cdot (\beta \cdot V_{t-1}+(1-\beta)\nabla W_{t-1}) + (1-\beta) \nabla W_{t}$

$V_{t+1} = (\beta)^2 \cdot V_{t-1}+\beta(1-\beta)\nabla W_{t-1} + (1-\beta) \nabla W_{t}$

Since $\beta$ is less than 1, each step back in time multiplies a past gradient by another factor of $\beta$, so its weight shrinks geometrically until it leaves little impact. Very cool!
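
The decay can be checked numerically. This sketch assumes $V_0 = 0$ and a made-up gradient history, and compares the recursive velocity update with the fully unrolled sum, in which the gradient from $k$ steps ago carries weight $(1-\beta)\beta^k$.

```python
import numpy as np

beta = 0.9
rng = np.random.default_rng(0)
grads = rng.normal(size=20)  # hypothetical gradient history, oldest first

# Recursive form: V_{t+1} = beta * V_t + (1 - beta) * grad_t, with V_0 = 0
v = 0.0
for g in grads:
    v = beta * v + (1 - beta) * g

# Unrolled form: weight (1 - beta) * beta**k on the gradient from k steps ago
k = np.arange(len(grads))
weights = (1 - beta) * beta ** k
v_unrolled = np.sum(weights * grads[::-1])  # reverse so k = 0 is the newest

print(np.isclose(v, v_unrolled))
print(weights[:4])  # newest gradients get the largest weights
```

With $\beta = 0.9$ the newest gradient gets weight 0.1, the one before it 0.09, then 0.081, and so on.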

# Drawbacks, again #
While another great *step* forward, we might have to take a *step* back... (I'll stop now). SGD + Momentum is another improvement over SGD, but it isn't perfect. One of the main drawbacks occurs when the gradient of one parameter is significantly larger than the gradient of another. This causes SGD and SGD + Momentum to effectively optimize one parameter first and then the other, taking smaller and smaller steps as they converge. To address this, we can turn to another optimizer that reuses the running-average idea from above, called RMSProp, or Root Mean Squared Propagation.

# Root Mean What ? #
The goal of RMSProp is to make sure the magnitudes of our gradients are weighed evenly as we perform our descent. This allows the optimizer to converge faster, and again improves upon the past. We define RMSProp as:
- $V_{t+1} = \beta \cdot V_{t} + (1-\beta) (\nabla W_{t})^2$
- $W_{t+1} = W_{t} - \alpha \cdot \frac{\nabla W_{t}}{\sqrt{V_{t+1} + \epsilon}}$

The main difference here is where the gradient of the loss function is placed. $\epsilon$ is just a small number that prevents division by zero. To clarify, $V_{t}$ is now a running average of past *squared* gradients (squaring keeps every entry non-negative, so the square root is well defined). The reason we divide by $\sqrt{V_{t+1} + \epsilon}$ is clever: weights with a history of larger gradients have a larger denominator when updating the weight, hence a smaller update is made. This balances which weights are optimized while descending.
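
A minimal sketch of one RMSProp step, following the two equations above with $\epsilon$ inside the square root. The function name `rmsprop_step`, the hyperparameter values, and the two wildly different gradients are assumptions made for illustration.

```python
import numpy as np

def rmsprop_step(w, grad, v, alpha=0.01, beta=0.9, eps=1e-8):
    # Running average of squared gradients, one entry per parameter
    v = beta * v + (1 - beta) * grad ** 2
    # Each parameter's step is scaled by its own gradient history
    w = w - alpha * grad / np.sqrt(v + eps)
    return w, v

# Hypothetical gradients differing by four orders of magnitude
grad = np.array([100.0, 0.01])
w = np.zeros(2)
v = np.zeros(2)

w_new, v = rmsprop_step(w, grad, v)
rmsprop_steps = np.abs(w_new - w)
sgd_steps = np.abs(0.01 * grad)  # plain SGD steps with the same alpha

print("RMSProp step sizes:", rmsprop_steps)
print("SGD step sizes:    ", sgd_steps)
```

The plain SGD steps differ by a factor of 10,000, while the RMSProp steps come out nearly equal, which is the balancing effect described above.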

# So Who's Adam ? #
Adam, short for **adaptive moment estimation**, is an optimization algorithm that combines intuition from both RMSProp and Momentum. Adam's benefits include:
- Straightforward Implementation
- Computational Efficiency
- Low Memory Requirement

The original paper introducing Adam has almost 200,000 citations, and Adam is the default optimizer in many Python-based ML libraries. Putting together what we discussed before, we will now whittle Adam down to its core. We can define Adam as:

- $M_{t} = \beta_{1} \cdot M_{t-1} + (1-\beta_{1}) \nabla W_{t}$
- $V_{t} = \beta_{2} \cdot V_{t-1} + (1-\beta_{2}) (\nabla W_{t})^2$
- $\hat{M_{t}} = \frac{M_{t}}{1-(\beta_{1})^t}$
- $\hat{V_{t}} = \frac{V_{t}}{1-(\beta_{2})^t}$
- $W_{t+1} = W_{t} - \alpha \cdot \frac{\hat{M_{t}}}{\sqrt{\hat{V_{t}}} + \epsilon}$

$\beta_{1}$ and $\beta_{2}$ are typically set to 0.9 and 0.999 respectively (the values suggested in the original paper; the experiment later in this report uses 0.99 for $\beta_{2}$). Again, to avoid confusion, $\nabla W_{t}$ is the gradient of the loss function with respect to the parameters. Here, we see a few things change in our final weight update, but most of what we see comes from Momentum and RMSProp. We have a new term, $M_{t}$, as well as two new equations for $\hat{M_{t}}$ and $\hat{V_{t}}$. To describe these, I'd like to first quote the original paper's abstract.

# Two Moments Are Better Than One #
"We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments"

This comes directly from the original paper; in fact, it's the very first sentence of the abstract. The key to Adam is its moments, the first tracked by $M$ and the second by $V$.
- $M$ (first moment) represents the mean of the gradients. If the magnitude of the mean is high, the gradients across recent steps have been pointing in approximately the same direction.
- $V$ (second moment) represents the uncentered variance (the mean of the squared gradients), and comes from RMSProp. It helps control the learning rate per parameter, as discussed before.

The two new equations are in place to account for the initial bias when training begins. Both $M$ and $V$ are initialized to 0, so their early values are biased toward zero. Dividing by $1-(\beta)^t$ undoes this, and since $\beta$ < 1, $(\beta)^t$ approaches 0 as $t$ grows, so both denominators approach 1 and the correction fades away.
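
The effect of the correction is easiest to see with a constant gradient. The sketch below assumes a made-up gradient of 2.0 and tracks only the first moment: the raw average starts far below the true value because $M_0 = 0$, while the corrected estimate recovers it from the first step.

```python
beta1 = 0.9
g = 2.0   # hypothetical constant gradient
m = 0.0   # first moment starts at zero, which is the source of the bias

raw, corrected = [], []
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1 ** t)   # bias-corrected estimate
    raw.append(m)
    corrected.append(m_hat)
    print(t, round(m, 4), round(m_hat, 4))
```

Here the raw $M_t$ equals $g(1-\beta_1^t)$, so it creeps up from 0.2 toward 2.0, and dividing by $1-\beta_1^t$ cancels exactly that factor.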

By considering these two moments, building upon RMSProp and Momentum, we obtain an extremely efficient, effective, and cheap optimizer for models. Now that we know exactly where Adam came from, let's compare Adam to the rest.

# Race to Convergence #
To show the difference between the optimizers, we will run gradient descent on a function defined as:

    def f(x, y):
        parabola = 0.075 * x**2 + 0.075 * y**2 / 9
        hole1 = -0.5 * (np.exp(-2*(x)**2) * np.exp(-2*(y)**2))
        hole2 = -0.75 * (np.exp(-4*(x-1)**2) * np.exp(-4*(y-1)**2))
        return parabola + hole1 + hole2

This is an easy-to-interpret surface, with a local and a global minimum. I found this function online, and computed the gradient myself to perform the gradient descent. We will use $\alpha$ = 0.05 and define both SGD and Adam. Our result is an interactive graph where we can see the gains Adam boasts over SGD, just by adding our two moment vectors:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import ipywidgets as widgets
from ipywidgets import interact


# Function f(x, y)
def f(x, y):
    parabola = 0.075 * x**2 + 0.075 * y**2 / 9
    hole1 = -0.5 * (np.exp(-2*(x)**2) * np.exp(-2*(y)**2))
    hole2 = -0.75 * (np.exp(-4*(x-1)**2) * np.exp(-4*(y-1)**2))
    return parabola + hole1 + hole2

# Gradient of f(x, y)
def gradient_f(x, y):
    d_parabola_x = 0.150 * x
    d_parabola_y = 0.150 * y / 9
    
    d_hole1_x = 2 * (x) * np.exp(-2 * (x)**2) * np.exp(-2 * (y)**2)
    d_hole1_y = 2 *(y) * np.exp(-2 * (x)**2) * np.exp(-2 * (y)**2)
    
    d_hole2_x = 6 * (x - 1) * np.exp(-4 * (x - 1)**2) * np.exp(-4 * (y - 1)**2)
    d_hole2_y = 6 * (y - 1) * np.exp(-4 * (x - 1)**2) * np.exp(-4 * (y - 1)**2)
    
    grad_x = d_parabola_x + d_hole1_x + d_hole2_x
    grad_y = d_parabola_y + d_hole1_y + d_hole2_y
    
    return np.array([grad_x, grad_y])

# Gradient descent functions

# SGD
def sgd(x_init, y_init, lr=0.05, epochs=2000, tol=1e-6):
    x, y = x_init, y_init
    trajectory = [(x, y)]
    for i in range(epochs):
        grad = gradient_f(x, y)
        x -= lr * grad[0]
        y -= lr * grad[1]
        trajectory.append((x, y))
        if np.linalg.norm(grad) < tol:  # Check for convergence
            break
    return np.array(trajectory)

# Adam
def adam(x_init, y_init, lr=0.05, epochs=1000, beta1=0.9, beta2=0.99, epsilon=1e-8, tol=1e-6):
    x, y = x_init, y_init
    m_x, m_y = 0, 0
    v_x, v_y = 0, 0
    trajectory = [(x, y)]
    for t in range(1, epochs + 1):
        grad = gradient_f(x, y)
        m_x = beta1 * m_x + (1 - beta1) * grad[0]
        m_y = beta1 * m_y + (1 - beta1) * grad[1]
        v_x = beta2 * v_x + (1 - beta2) * grad[0]**2
        v_y = beta2 * v_y + (1 - beta2) * grad[1]**2

        m_x_hat = m_x / (1 - beta1**t)
        m_y_hat = m_y / (1 - beta1**t)
        v_x_hat = v_x / (1 - beta2**t)
        v_y_hat = v_y / (1 - beta2**t)

        x -= lr * m_x_hat / (np.sqrt(v_x_hat) + epsilon)
        y -= lr * m_y_hat / (np.sqrt(v_y_hat) + epsilon)
        trajectory.append((x, y))
        if np.linalg.norm(grad) < tol:  # Check for convergence
            break
    return np.array(trajectory)

# Initial point
x_init, y_init = -2, -2

# Run optimizers
sgd_path = sgd(x_init, y_init)
adam_path = adam(x_init, y_init)

# Create meshgrid for surface
x = np.linspace(-2, 3, 100)
y = np.linspace(-2, 3, 100)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)

# Function to plot the optimizer paths up to the current step
def plot_with_steps(step):
    # Limit the trajectory to the current step
    sgd_partial_path = sgd_path[:step]
    adam_partial_path = adam_path[:step]

    # Plot surface and paths
    fig = plt.figure(figsize=(12, 8))
    ax = fig.add_subplot(111, projection='3d')
    ax.plot_surface(X, Y, Z, cmap='viridis', alpha=0.7, edgecolor='none')

    # Plot optimizer paths up to the current step
    ax.plot(sgd_partial_path[:, 0], sgd_partial_path[:, 1], f(sgd_partial_path[:, 0], sgd_partial_path[:, 1]), color='red', label='SGD')
    ax.plot(adam_partial_path[:, 0], adam_partial_path[:, 1], f(adam_partial_path[:, 0], adam_partial_path[:, 1]), color='orange', label='Adam')

    # Add labels and legend
    ax.set_title(f'Gradient Descent Paths at Step {step}')
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_zlabel('Z (Loss)')
    ax.legend()

    plt.show()

 

# Create interactive slider for steps
interact(plot_with_steps, step=widgets.IntSlider(value=1, min=1, max=max(len(sgd_path), len(adam_path)), step=1, description='Steps'))
# Print how many steps it took to converge at the last step
print(f"SGD converged in {len(sgd_path)} steps.")
print(f"Adam converged in {len(adam_path)} steps.")

SGD converged in 529 steps.
Adam converged in 285 steps.
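
Since the gradient above was derived by hand, a finite-difference check is a cheap sanity test. In this sketch, `f` and `gradient_f` are copied from the cell above so it runs on its own. The helper `numerical_gradient`, the step size `h`, and the random test points are additions for illustration.

```python
import numpy as np

def f(x, y):
    parabola = 0.075 * x**2 + 0.075 * y**2 / 9
    hole1 = -0.5 * (np.exp(-2 * x**2) * np.exp(-2 * y**2))
    hole2 = -0.75 * (np.exp(-4 * (x - 1)**2) * np.exp(-4 * (y - 1)**2))
    return parabola + hole1 + hole2

def gradient_f(x, y):
    # Hand-derived gradient, same expressions as in the report
    e1 = np.exp(-2 * x**2) * np.exp(-2 * y**2)
    e2 = np.exp(-4 * (x - 1)**2) * np.exp(-4 * (y - 1)**2)
    grad_x = 0.150 * x + 2 * x * e1 + 6 * (x - 1) * e2
    grad_y = 0.150 * y / 9 + 2 * y * e1 + 6 * (y - 1) * e2
    return np.array([grad_x, grad_y])

def numerical_gradient(x, y, h=1e-6):
    # Central differences approximate each partial derivative
    dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return np.array([dx, dy])

rng = np.random.default_rng(0)
points = rng.uniform(-2, 3, size=(5, 2))  # arbitrary test points on the plotted range
max_err = max(
    np.max(np.abs(gradient_f(x, y) - numerical_gradient(x, y)))
    for x, y in points
)
print(f"largest mismatch: {max_err:.2e}")
```

A mismatch close to zero suggests the analytic gradient agrees with the function, while a large one would point to an algebra slip.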
