# 07 - Gradient Descent Optimization

Gradient descent is the fundamental algorithm for training neural networks, including LLMs and transformers. It updates model parameters to minimize the loss function by following the direction of steepest descent (the negative gradient).

In this notebook, you'll scaffold the steps of gradient descent, from the basic algorithm to its use in deep learning.

## 🧮 What is Gradient Descent?

Gradient descent iteratively updates parameters to minimize a loss function:

$$ \theta \leftarrow \theta - \eta \nabla L(\theta) $$

where $\theta$ is the parameter vector, $\eta > 0$ is the learning rate, and $\nabla L(\theta)$ is the gradient of the loss with respect to $\theta$.
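
To make the update rule concrete, here is a minimal sketch of a single step on a toy loss. The loss $L(\theta) = (\theta - 3)^2$, the starting value, and the learning rate are assumptions chosen for illustration; the names `toy_loss` and `toy_grad` are hypothetical and not part of the notebook's tasks.

```python
# Toy example: one gradient descent step on L(theta) = (theta - 3)^2.
# The loss, starting point, and learning rate are illustrative assumptions.

def toy_loss(theta):
    return (theta - 3.0) ** 2

def toy_grad(theta):
    # dL/dtheta = 2 * (theta - 3)
    return 2.0 * (theta - 3.0)

theta = 0.0   # starting parameter value
eta = 0.1     # learning rate

# theta <- theta - eta * gradient
theta_new = theta - eta * toy_grad(theta)

print(toy_loss(theta), toy_loss(theta_new))
```

The gradient at $\theta = 0$ is $-6$, so the step moves $\theta$ to about $0.6$ and the loss drops from $9$ to about $5.76$.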

**LLM/Transformer Context:**
- Every parameter in an LLM is updated using gradient descent or its variants during training.

### Task:
- Scaffold a function to perform a single gradient descent update step for a parameter vector.
- Add a docstring explaining its role in LLM training.

In [None]:
def gradient_descent_step(params, grads, lr):
    """
    Perform a single gradient descent update step.
    In LLMs, this is used to update weights and biases after computing gradients.
    Args:
        params (np.ndarray): Current parameter values.
        grads (np.ndarray): Gradients of the loss w.r.t. parameters.
        lr (float): Learning rate.
    Returns:
        np.ndarray: Updated parameters.
    """
    # TODO: Update params using gradient descent
    pass

## 🔁 Gradient Descent for Neural Network Training

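In practice, the update step is applied repeatedly: each iteration computes the loss and its gradients, then updates the parameters. An epoch is one full pass over the training data, so with full-batch gradient descent each epoch corresponds to a single update.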

**LLM/Transformer Context:**
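- LLMs are trained by running this loop for a very large number of optimizer steps (often hundreds of thousands or more), each computed on a large mini-batch of data rather than on the full dataset.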
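
The sketch below illustrates the difference between an epoch and a step using mini-batch stochastic gradient descent on a small synthetic linear regression problem. The data, the model `y = w * x + b`, the batch size, and the learning rate are all assumptions made for illustration, and are far smaller than anything used in LLM training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for y = 2x + 1 plus small noise (assumed toy problem).
X = rng.normal(size=(256, 1))
y = 2.0 * X[:, 0] + 1.0 + 0.01 * rng.normal(size=256)

w, b = 0.0, 0.0      # parameters to learn
lr = 0.1             # learning rate
batch_size = 32
num_epochs = 20
steps = 0

for epoch in range(num_epochs):
    order = rng.permutation(len(X))              # reshuffle once per epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx, 0], y[idx]
        err = w * xb + b - yb                    # prediction error on the batch
        # Gradients of the batch loss 0.5 * mean(err^2)
        grad_w = np.mean(err * xb)
        grad_b = np.mean(err)
        w -= lr * grad_w
        b -= lr * grad_b
        steps += 1                               # one step = one parameter update

print(f"steps={steps}, w={w:.3f}, b={b:.3f}")
```

With 256 examples and a batch size of 32, each epoch contains 8 update steps, so 20 epochs give 160 steps in total.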

### Task:
- Scaffold a function to perform gradient descent over multiple epochs for a simple model.
- Add a docstring explaining each step and its relevance to LLM training.

In [None]:
def train_with_gradient_descent(params, compute_loss_and_grads, lr, epochs):
    """
    Train a model using gradient descent for a given number of epochs.
    In LLMs, this loop is repeated for a very large number of steps to optimize all parameters.
    Args:
        params (np.ndarray): Initial parameter values.
        compute_loss_and_grads (callable): Function that returns (loss, grads) for current params.
        lr (float): Learning rate.
        epochs (int): Number of training epochs.
    Returns:
        np.ndarray: Trained parameters.
    """
    # TODO: Implement the training loop (compute loss, compute grads, update params)
    pass
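
To exercise the training loop once it is implemented, it helps to have a `compute_loss_and_grads` callable whose minimum is known in advance. The sketch below is one hypothetical choice: a quadratic bowl centered at an assumed `target` vector. The names `quadratic_loss_and_grads` and `target` are illustrative and not part of the notebook.

```python
import numpy as np

# Hypothetical minimizer of the toy loss (an assumption for illustration).
target = np.array([1.0, -2.0, 0.5])

def quadratic_loss_and_grads(params):
    """Return (loss, grads) for L(params) = 0.5 * ||params - target||^2."""
    diff = params - target
    loss = 0.5 * float(np.dot(diff, diff))
    grads = diff          # gradient of the loss with respect to params
    return loss, grads

loss, grads = quadratic_loss_and_grads(np.zeros(3))
print(loss, grads)
```

Because the loss is zero exactly at `target`, a correct training loop with a reasonable learning rate should return parameters close to that vector.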

## ⚠️ Learning Rate and Convergence

The learning rate ($\eta$) controls the step size in gradient descent. If it is too large, updates can overshoot the minimum and the loss may oscillate or diverge; if it is too small, training converges slowly.

**LLM/Transformer Context:**
- Careful tuning of the learning rate is critical for stable and efficient LLM training.
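
The effect of the step size can be seen on a one-dimensional toy loss. The sketch below assumes $L(\theta) = \tfrac{1}{2} a \theta^2$ with curvature $a = 4$; each update multiplies $\theta$ by $(1 - \eta a)$, so the iterates shrink only when $0 < \eta < 2/a$. The function name `final_loss_after` and the three learning rates are illustrative assumptions.

```python
# Toy loss L(theta) = 0.5 * a * theta^2 with assumed curvature a.
# Each update multiplies theta by (1 - eta * a), so gradient descent
# converges only when |1 - eta * a| < 1, i.e. 0 < eta < 2 / a = 0.5.
a = 4.0

def final_loss_after(eta, steps=50, theta=1.0):
    for _ in range(steps):
        theta = theta - eta * a * theta   # gradient of the loss is a * theta
    return 0.5 * a * theta ** 2

for eta in (0.01, 0.2, 0.6):
    print(f"eta={eta}: final loss = {final_loss_after(eta):.3e}")
```

Starting from a loss of 2, the smallest rate reduces the loss only slowly, the middle rate drives it to essentially zero, and the largest rate exceeds $2/a$ and makes the loss grow.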

### Task:
- Scaffold a function to experiment with different learning rates and observe their effect on convergence.
- Add a docstring explaining why this matters for LLMs.

In [None]:
def experiment_learning_rates(params, compute_loss_and_grads, lrs, epochs):
    """
    Experiment with different learning rates and observe convergence behavior.
    In LLMs, learning rate schedules and tuning are essential for good performance.
    Args:
        params (np.ndarray): Initial parameter values.
        compute_loss_and_grads (callable): Function that returns (loss, grads) for current params.
        lrs (list): List of learning rates to try.
        epochs (int): Number of epochs for each run.
    Returns:
        dict: Mapping from learning rate to final loss.
    """
    # TODO: Run training for each learning rate and record final loss
    pass

## 🧠 Final Summary: Gradient Descent in LLMs

- Gradient descent and its variants are the core optimization algorithms for training neural networks, including LLMs and transformers.
- Understanding how parameters are updated and how learning rate affects convergence is key to successful model training.
- In the next notebook, you'll explore advanced optimizers that improve on basic gradient descent for faster and more stable training!