# 08 - Optimizer Comparisons: SGD, Momentum, RMSProp, Adam

Training large language models (LLMs) and other transformers efficiently and stably depends heavily on the optimizer. In this notebook, you'll scaffold the core logic of four popular optimizers (SGD, SGD with momentum, RMSProp, and Adam) and compare their behavior and relevance to LLM training.

## 🧮 Stochastic Gradient Descent (SGD)

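SGD is the foundation of most optimization algorithms used in deep learning. At each step it estimates the gradient of the loss on a mini-batch and moves the parameters a small distance, set by the learning rate, in the opposite direction.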

**LLM/Transformer Context:**
- All advanced optimizers build on the basic idea of SGD.

### Task:
- Scaffold a function for a single SGD update step.
- Add a docstring explaining its role in LLM training.

In [None]:
def sgd_step(params, grads, lr):
    """
    Perform a single SGD update step.
    In LLMs, this is the basic building block for parameter updates.
    Args:
        params (np.ndarray): Current parameter values.
        grads (np.ndarray): Gradients of the loss w.r.t. parameters.
        lr (float): Learning rate.
    Returns:
        np.ndarray: Updated parameters.
    """
    # TODO: Implement SGD update
    pass
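
One possible way to fill in the scaffold above, sketched with NumPy: subtract the gradient scaled by the learning rate. The name `sgd_step_sketch` and the quadratic toy loss are assumptions made for illustration; they are not part of the notebook's own code.

```python
import numpy as np

def sgd_step_sketch(params, grads, lr):
    """Hypothetical reference version: move params one step against the gradient."""
    return params - lr * grads

# Toy problem (an assumption for illustration): loss(w) = 0.5 * ||w||^2, so grad = w.
w = np.array([1.0, -2.0])
for _ in range(3):
    w = sgd_step_sketch(w, grads=w, lr=0.1)

# On this loss, each step multiplies w by (1 - lr) = 0.9.
print(w)
```

The function returns a new array instead of updating `params` in place, which keeps the caller's copy unchanged and makes side-by-side comparisons easier.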

## 🏃 SGD with Momentum

Momentum helps accelerate SGD by accumulating a velocity vector in the direction of persistent reduction in the loss.

**LLM/Transformer Context:**
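- Momentum damps oscillation along steep directions of the loss surface and speeds up progress along shallow but consistent ones, which often helps deep models converge faster. It can also carry the parameters across small bumps and flat regions.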

### Task:
- Scaffold a function for a single SGD with momentum update step.
- Add a docstring explaining the velocity term and its effect.

In [None]:
def momentum_step(params, grads, velocity, lr, beta):
    """
    Perform a single SGD with momentum update step.
    Momentum helps accelerate convergence in deep networks like LLMs.
    Args:
        params (np.ndarray): Current parameter values.
        grads (np.ndarray): Gradients of the loss w.r.t. parameters.
        velocity (np.ndarray): Current velocity vector.
        lr (float): Learning rate.
        beta (float): Momentum coefficient (0 < beta < 1).
    Returns:
        tuple: (updated_params, updated_velocity)
    """
    # TODO: Implement momentum update
    pass
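
A minimal sketch of one way to complete the momentum scaffold. It assumes the convention `velocity = beta * velocity + grads` followed by `params - lr * velocity`; other texts fold the learning rate into the velocity or weight the gradient by `(1 - beta)`, so treat this as one variant among several. The name `momentum_step_sketch` and the toy loss are hypothetical.

```python
import numpy as np

def momentum_step_sketch(params, grads, velocity, lr, beta):
    """Hypothetical reference version of SGD with momentum."""
    # Accumulate a decaying sum of past gradients.
    velocity = beta * velocity + grads
    # Step against the accumulated direction instead of the raw gradient.
    params = params - lr * velocity
    return params, velocity

# Toy problem (an assumption for illustration): loss(w) = 0.5 * ||w||^2, so grad = w.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(2):
    w, v = momentum_step_sketch(w, grads=w, velocity=v, lr=0.1, beta=0.9)

print(w, v)
```

Because the velocity persists between calls, the caller has to carry it from one step to the next, which is why the function returns it alongside the parameters.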

## 📉 RMSProp

RMSProp adapts the learning rate for each parameter by dividing its gradient by the square root of a running average of recent squared gradients.

**LLM/Transformer Context:**
- RMSProp helps stabilize training by normalizing updates, especially in deep or recurrent networks.

### Task:
- Scaffold a function for a single RMSProp update step.
- Add a docstring explaining the running average and its effect.

In [None]:
def rmsprop_step(params, grads, cache, lr, beta, epsilon=1e-8):
    """
    Perform a single RMSProp update step.
    RMSProp normalizes updates using a running average of squared gradients.
    Args:
        params (np.ndarray): Current parameter values.
        grads (np.ndarray): Gradients of the loss w.r.t. parameters.
        cache (np.ndarray): Running average of squared gradients.
        lr (float): Learning rate.
        beta (float): Decay rate for the running average.
        epsilon (float): Small value to avoid division by zero.
    Returns:
        tuple: (updated_params, updated_cache)
    """
    # TODO: Implement RMSProp update
    pass
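
The running average described above could be sketched as follows. This is one common formulation, with `epsilon` added outside the square root; the name `rmsprop_step_sketch` and the toy gradient are assumptions for illustration.

```python
import numpy as np

def rmsprop_step_sketch(params, grads, cache, lr, beta, epsilon=1e-8):
    """Hypothetical reference version of an RMSProp step."""
    # Exponential moving average of squared gradients, tracked per parameter.
    cache = beta * cache + (1 - beta) * grads ** 2
    # Scale each coordinate by the root of its own running average.
    params = params - lr * grads / (np.sqrt(cache) + epsilon)
    return params, cache

# Toy problem (an assumption for illustration): loss(w) = 0.5 * ||w||^2, so grad = w.
w = np.array([1.0, -2.0])
cache = np.zeros_like(w)
w_new, cache = rmsprop_step_sketch(w, grads=w, cache=cache, lr=0.01, beta=0.9)

# Both coordinates move by nearly the same distance even though
# the second gradient is twice as large as the first.
print(np.abs(w_new - w))
```

The equal step sizes in the toy run illustrate the normalization: the update depends on each gradient relative to its own recent magnitude, not on its absolute scale.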

## 🤖 Adam Optimizer

Adam combines momentum and RMSProp: it maintains running averages of both the gradients and the squared gradients, and corrects each average for its bias toward zero during the first steps.

**LLM/Transformer Context:**
- Adam, often in its decoupled weight-decay variant AdamW, is the most widely used optimizer for training LLMs and transformers due to its stability and efficiency.

### Task:
- Scaffold a function for a single Adam update step.
- Add a docstring explaining the moving averages and bias correction.

In [None]:
def adam_step(params, grads, m, v, lr, beta1, beta2, t, epsilon=1e-8):
    """
    Perform a single Adam optimizer update step.
    Adam is the default optimizer for most LLMs and transformers.
    Args:
        params (np.ndarray): Current parameter values.
        grads (np.ndarray): Gradients of the loss w.r.t. parameters.
        m (np.ndarray): First moment vector (mean of gradients).
        v (np.ndarray): Second moment vector (mean of squared gradients).
        lr (float): Learning rate.
        beta1 (float): Decay rate for first moment.
        beta2 (float): Decay rate for second moment.
        t (int): Time step, starting at 1 (used for bias correction).
        epsilon (float): Small value to avoid division by zero.
    Returns:
        tuple: (updated_params, updated_m, updated_v)
    """
    # TODO: Implement Adam update
    pass
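
A minimal sketch of one way to complete the Adam scaffold, following the usual formulation with bias-corrected first and second moments. It assumes `t` starts at 1, since `t = 0` would make the correction terms divide by zero. The name `adam_step_sketch` and the toy gradient are hypothetical.

```python
import numpy as np

def adam_step_sketch(params, grads, m, v, lr, beta1, beta2, t, epsilon=1e-8):
    """Hypothetical reference version of an Adam step (t is assumed to start at 1)."""
    # Moving averages of the gradient and of the squared gradient.
    m = beta1 * m + (1 - beta1) * grads
    v = beta2 * v + (1 - beta2) * grads ** 2
    # Both averages start at zero, so rescale them to remove the early bias.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + epsilon)
    return params, m, v

# Toy problem (an assumption for illustration): loss(w) = 0.5 * ||w||^2, so grad = w.
w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
w_new, m, v = adam_step_sketch(w, grads=w, m=m, v=v, lr=0.001,
                               beta1=0.9, beta2=0.999, t=1)

# On the first step the bias-corrected ratio is close to sign(grad),
# so every coordinate moves by roughly lr.
print(w_new)
```

Without the bias correction, the first few updates would be scaled by the still-small moving averages instead of by the gradients themselves.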

## 📊 Comparing Optimizer Performance

Different optimizers can converge at different speeds and with different degrees of stability, even on the same problem.

**LLM/Transformer Context:**
- Choosing the right optimizer and hyperparameters is crucial for training large models efficiently.

### Task:
- Scaffold a function to compare the convergence of different optimizers on the same problem.
- Add a docstring explaining how this relates to LLM training.

In [None]:
def compare_optimizers(opt_steps, initial_params, compute_loss_and_grads, epochs):
    """
    Compare the convergence of different optimizers on the same loss function.
    In LLMs, optimizer choice can affect training speed and final performance.
    Args:
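        opt_steps (dict): Mapping from optimizer name to optimizer step function.
            The step functions in this notebook take different state arguments
            (velocity, cache, moments), so they may need a common wrapper.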
        initial_params (np.ndarray): Initial parameter values.
        compute_loss_and_grads (callable): Function returning (loss, grads) for current params.
        epochs (int): Number of training epochs.
    Returns:
        dict: Mapping from optimizer name to list of loss values per epoch.
    """
    # TODO: Run each optimizer and record loss over epochs
    pass
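
One way the comparison could be sketched. Because the step functions in this notebook carry different state, this version assumes each optimizer is wrapped in a closure with the shared signature `step(params, grads) -> params`. The factories `make_sgd` and `make_adam`, the name `compare_optimizers_sketch`, and the `quadratic` toy loss are all hypothetical helpers introduced for illustration.

```python
import numpy as np

def make_sgd(lr):
    """Hypothetical factory: plain SGD behind a step(params, grads) interface."""
    def step(params, grads):
        return params - lr * grads
    return step

def make_adam(lr, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Hypothetical factory: Adam with its state held inside the closure."""
    state = {"m": None, "v": None, "t": 0}
    def step(params, grads):
        if state["m"] is None:
            state["m"] = np.zeros_like(params)
            state["v"] = np.zeros_like(params)
        state["t"] += 1
        state["m"] = beta1 * state["m"] + (1 - beta1) * grads
        state["v"] = beta2 * state["v"] + (1 - beta2) * grads ** 2
        m_hat = state["m"] / (1 - beta1 ** state["t"])
        v_hat = state["v"] / (1 - beta2 ** state["t"])
        return params - lr * m_hat / (np.sqrt(v_hat) + epsilon)
    return step

def compare_optimizers_sketch(opt_steps, initial_params, compute_loss_and_grads, epochs):
    """Run every optimizer from the same starting point and record the loss per epoch."""
    history = {}
    for name, step in opt_steps.items():
        params = initial_params.copy()  # identical start for a fair comparison
        losses = []
        for _ in range(epochs):
            loss, grads = compute_loss_and_grads(params)
            losses.append(float(loss))
            params = step(params, grads)
        history[name] = losses
    return history

def quadratic(params):
    """Toy loss (an assumption): a quadratic bowl that is 10x steeper in one direction."""
    scales = np.array([1.0, 10.0])
    return 0.5 * np.sum(scales * params ** 2), scales * params

history = compare_optimizers_sketch(
    {"sgd": make_sgd(lr=0.05), "adam": make_adam(lr=0.1)},
    initial_params=np.array([1.0, 1.0]),
    compute_loss_and_grads=quadratic,
    epochs=50,
)
for name, losses in history.items():
    print(f"{name}: first loss {losses[0]:.3f}, last loss {losses[-1]:.6f}")
```

Here each "epoch" is a single full-gradient update on the toy loss. Copying `initial_params` for every optimizer matters: without it, a step function that modifies its input in place would leak progress from one run into the next.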

## 🧠 Final Summary: Optimizers in LLMs

- Adaptive optimizers like Adam are the standard choice for training deep, large-scale models like LLMs and transformers.
- Understanding their differences helps you tune and debug training for best results.
- In the next notebook, you'll move on to recurrent neural networks (RNNs) and see how optimization applies to sequence modeling!