# 03 - Softmax and Cross-Entropy Loss

Softmax and cross-entropy loss are at the heart of training LLMs and Transformers. Softmax turns logits into probabilities, and cross-entropy measures how well the model predicts the correct class/token.

In this notebook, you'll:
- Implement softmax from scratch
- Explore numerical stability in softmax
- Implement cross-entropy loss
- See how these are used in LLM training and language modeling

## 🔢 What is Softmax?

Softmax converts a vector of raw scores (logits) into probabilities. In LLMs, the output logits for each token position are passed through softmax to get a probability distribution over the vocabulary.

**Formula:**
$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
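
As a quick worked example (the logits are made up for illustration), take $z = (1, 2, 3)$:

$$e^{1} \approx 2.718,\quad e^{2} \approx 7.389,\quad e^{3} \approx 20.086,\quad \sum_j e^{z_j} \approx 30.193$$

$$\text{softmax}(z) \approx \left(\frac{2.718}{30.193},\ \frac{7.389}{30.193},\ \frac{20.086}{30.193}\right) \approx (0.090,\ 0.245,\ 0.665)$$

The outputs are positive, sum to 1, and preserve the ordering of the logits.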

### Task:
- Write a function to compute softmax for a 1D numpy array of logits.
- Add a docstring explaining why softmax is used in LLMs.
- Do NOT use a library softmax. Compute the exponentials, their sum, and the division as separate steps rather than as a single expression in the return statement.

**LLM/Transformer Context:**
- Softmax is applied to the output logits of a transformer language model to turn them into token probabilities.

In [None]:
import numpy as np

def softmax(logits):
    """
    Compute the softmax of a 1D numpy array of logits.
    In LLMs, softmax is used to convert model outputs (logits) into probabilities over the vocabulary.
    Args:
        logits (np.ndarray): 1D array of raw scores.
    Returns:
        np.ndarray: Probabilities summing to 1.
    """
    # TODO: Implement softmax step by step (no np.exp in return statement)
    pass

## ⚠️ Numerical Stability in Softmax

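Large logits can cause overflow in $e^{z_i}$. To prevent this, we subtract the max logit $c = \max_j z_j$ from every logit before exponentiating. The result is unchanged because the common factor $e^{-c}$ cancels between numerator and denominator:

$$\frac{e^{z_i - c}}{\sum_j e^{z_j - c}} = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

After the shift, the largest exponent is $e^{0} = 1$, so nothing can overflow.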

### Task:
- Modify your softmax function to be numerically stable.
- Add a comment explaining why this is important for LLMs (think: large vocabularies, large logits).

**LLM/Transformer Context:**
- A transformer's logits are unbounded, and $e^{z}$ overflows float32 once $z$ exceeds roughly 88 (roughly 709 in float64), so a stable softmax is needed to get valid probabilities.
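
To see the failure concretely, here is a small demonstration, assuming NumPy with float64 arrays. The logits are made up and deliberately large, and the variable names are illustrative only. It compares exponentiating directly against exponentiating after subtracting the max.

```python
import numpy as np

logits = np.array([1000.0, 1001.0, 1002.0])  # made-up, deliberately large

# Naive version: exp overflows to inf, and inf / inf gives nan.
# errstate only silences the overflow/invalid warnings for this demo.
with np.errstate(over="ignore", invalid="ignore"):
    naive_exp = np.exp(logits)
    naive_probs = naive_exp / naive_exp.sum()

# Shifted version: the largest exponent is exp(0) = 1, so nothing overflows.
shifted_exp = np.exp(logits - logits.max())
shifted_probs = shifted_exp / shifted_exp.sum()

print(naive_probs)    # every entry is nan
print(shifted_probs)  # a valid distribution
```

The shifted result matches what softmax would give for the logits $(1, 2, 3)$, since only the differences between logits matter.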

In [None]:
def softmax_stable(logits):
    """
    Compute the numerically stable softmax of a 1D numpy array of logits.
    Subtracts the max logit before exponentiating so that exp cannot overflow
    (important for LLMs, whose logits can be large).
    Args:
        logits (np.ndarray): 1D array of raw scores.
    Returns:
        np.ndarray: Probabilities summing to 1.
    """
    # TODO: Implement numerically stable softmax
    pass

## 🧮 Cross-Entropy Loss

Cross-entropy loss measures how well the predicted probability distribution matches the true distribution. In LLMs, this is used to train the model to predict the next token.

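**Formula (for one-hot targets):**
$$L = -\sum_{i} y_i \log(p_i)$$

Because $y$ is one-hot, only the true class $t$ contributes to the sum, so the loss reduces to $L = -\log(p_t)$.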

### Task:
- Write a function to compute cross-entropy loss given predicted probabilities and a true class index.
- Add a docstring explaining why cross-entropy is used in LLMs.

**LLM/Transformer Context:**
- Cross-entropy is the standard loss for language modeling and next-token prediction.
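
To get a feel for the scale of the loss, the sketch below evaluates it for a few hypothetical probabilities assigned to the correct token (the values 0.9, 0.5, and 0.01 are chosen only for illustration). It uses the natural logarithm, so the loss is measured in nats.

```python
import math

# Hypothetical probabilities that a model assigns to the correct token.
correct_token_probs = [0.9, 0.5, 0.01]

# With a one-hot target, the loss is the negative log of that probability.
losses = [-math.log(p) for p in correct_token_probs]

for p, loss in zip(correct_token_probs, losses):
    print(f"p = {p}: loss = {loss:.3f}")
```

The losses come out to roughly 0.105, 0.693, and 4.605: a confident correct prediction costs almost nothing, while a confident wrong one is penalized heavily.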

In [None]:
def cross_entropy_loss(probs, target_index):
    """
    Compute the cross-entropy loss for a single prediction.
    In LLMs, this is used to measure how well the model predicts the correct next token.
    Args:
        probs (np.ndarray): 1D array of predicted probabilities (output of softmax).
        target_index (int): Index of the true class/token.
    Returns:
        float: Cross-entropy loss value.
    """
    # TODO: Implement cross-entropy loss
    pass

## 🔁 Gradient of Softmax + Cross-Entropy (Backprop in LLMs)

When training LLMs, the combination of softmax and cross-entropy loss has a convenient property: the gradient of the loss with respect to the logits $z$ simplifies to:

$$ \nabla_z L = \hat{y} - y $$

- $\hat{y}$: predicted probability distribution (output of softmax)
- $y$: one-hot encoded true label
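
A short derivation sketch, writing $z$ for the logits and $t$ for the true class index (a symbol introduced here for convenience), with $\hat{y}_i = \text{softmax}(z_i)$:

$$L = -\log \hat{y}_t = -z_t + \log \sum_j e^{z_j}$$

$$\frac{\partial L}{\partial z_i} = -\mathbb{1}[i = t] + \frac{e^{z_i}}{\sum_j e^{z_j}} = \hat{y}_i - y_i$$

The last step uses the fact that $y_i$ is 1 when $i = t$ and 0 otherwise.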

**Why is this important?**
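- Deep learning frameworks typically compute softmax and cross-entropy as one combined operation, so backpropagation through the output layer needs only this subtraction. It never has to form the full softmax Jacobian or divide by small probabilities, which keeps the computation efficient and numerically stable.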

### Task:
- Write a function to compute the gradient of the softmax + cross-entropy loss with respect to the logits, given the logits and the true class index.
- Add a docstring explaining its role in LLM training.
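
One way to sanity-check a gradient implementation is to compare it against central finite differences. The sketch below is a minimal version of that check, assuming NumPy with float64 arrays. The helper names `reference_loss` and `numerical_grad`, the step size `eps`, and the example logits are all illustrative choices, not part of the exercise.

```python
import numpy as np

def reference_loss(logits, target_index):
    """Softmax + cross-entropy for one example, via a shifted log-sum-exp."""
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[target_index]

def numerical_grad(logits, target_index, eps=1e-6):
    """Central-difference estimate of dL/dz, one logit at a time."""
    grad = np.zeros_like(logits)
    for i in range(logits.size):
        bump = np.zeros_like(logits)
        bump[i] = eps
        loss_plus = reference_loss(logits + bump, target_index)
        loss_minus = reference_loss(logits - bump, target_index)
        grad[i] = (loss_plus - loss_minus) / (2 * eps)
    return grad

logits = np.array([2.0, -1.0, 0.5])  # made-up logits
target_index = 0

# Analytic gradient: softmax(logits) - one_hot(target_index)
probs = np.exp(logits - logits.max())
probs /= probs.sum()
one_hot = np.zeros_like(logits)
one_hot[target_index] = 1.0
analytic = probs - one_hot

numeric = numerical_grad(logits, target_index)
print(np.max(np.abs(analytic - numeric)))  # should be tiny
```

The gradient entries also sum to zero, because both $\hat{y}$ and $y$ sum to 1. That gives a second cheap check.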

In [None]:
def softmax_cross_entropy_grad(logits, target_index):
    """
    Compute the gradient of the softmax + cross-entropy loss with respect to the logits.
    In LLMs, this is used during backpropagation to efficiently compute gradients for the output layer.
    Args:
        logits (np.ndarray): 1D array of raw model outputs (logits).
        target_index (int): Index of the true class/token.
    Returns:
        np.ndarray: Gradient vector (same shape as logits).
    """
    # TODO: Implement the gradient: grad = softmax(logits) - one_hot(target_index)
    pass

## 🔗 Softmax + Cross-Entropy in Language Modeling

In LLMs, the model outputs a vector of logits for each token position. These are passed through softmax to get probabilities, and cross-entropy loss is computed against the true next token.

### Task:
- Given a batch of logits (2D array: batch_size x vocab_size) and true target indices, outline how you would compute the average cross-entropy loss for the batch.
- Add comments explaining each step and its relevance to LLM training.

**LLM/Transformer Context:**
- This is the core of the training loop for every transformer-based language model.
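
The per-example loop can also be written in vectorized form. The sketch below is one possible approach, not the required solution. It assumes NumPy, and the function name `mean_nll_sketch` and the toy inputs are hypothetical. It also illustrates a common sanity check: with all-equal logits, the prediction is uniform and the loss should equal $\log(\text{vocab\_size})$.

```python
import numpy as np

def mean_nll_sketch(logits_batch, target_indices):
    """Average cross-entropy over a batch, computed without a Python loop."""
    # Shift each row by its own max so exp cannot overflow.
    shifted = logits_batch - logits_batch.max(axis=1, keepdims=True)
    # Log-softmax per row: shifted logits minus the row's log-sum-exp.
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick out the log-probability of the true token in each row.
    rows = np.arange(logits_batch.shape[0])
    return float(-log_probs[rows, target_indices].mean())

vocab_size = 5
uniform_logits = np.zeros((3, vocab_size))  # every token equally likely
targets = np.array([0, 2, 4])

uniform_loss = mean_nll_sketch(uniform_logits, targets)
print(uniform_loss, np.log(vocab_size))  # the two values should agree
```

A freshly initialized model that has learned nothing should land near this value, which makes it a useful baseline when reading training curves.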

In [None]:
def batch_cross_entropy_loss(logits_batch, target_indices):
    """
    Compute the average cross-entropy loss for a batch of logits and target indices.
    In LLMs, this is used to train on multiple sequences/tokens at once.
    Args:
        logits_batch (np.ndarray): 2D array (batch_size x vocab_size) of logits.
        target_indices (np.ndarray): 1D array of true token indices (batch_size,).
    Returns:
        float: Average cross-entropy loss over the batch.
    """
    # TODO: For each example in the batch:
    #   1. Compute softmax (numerically stable) on logits
    #   2. Compute cross-entropy loss for the true target
    #   3. Average the losses
    pass

## 🧠 Final Summary: Why Softmax and Cross-Entropy Matter for LLMs

- **Softmax** turns model outputs into probabilities over the vocabulary, enabling sampling and evaluation.
- **Cross-entropy loss** is the objective that drives learning in language models: minimizing it means assigning higher probability to the correct next token.
- Transformer language models use these operations in their output layer and training loop.

In the next notebook, you'll use these concepts to build and train a simple feedforward neural network for classification!