# 05 - Manual Backpropagation in Feedforward Networks

Backpropagation is the algorithm that computes the gradients LLMs and other transformers use to update their weights during training. In this notebook, you'll manually compute gradients for a simple feedforward network, mirroring the process that happens in every transformer block during training.

## 🔢 Forward Pass Recap

Before computing gradients, let's recall the forward pass for a single-layer feedforward network, a linear transformation followed by an activation:

$$ y = \text{activation}(W x + b) $$

**LLM Context:**
- This is the basic computation in every transformer feedforward block.

### Task:
- Scaffold a function for the forward pass (linear transformation + activation).
- Add a docstring explaining its role in LLMs.

```python
def forward_pass(x, W, b, activation_fn):
    """
    Compute the forward pass: y = activation_fn(Wx + b).
    In LLMs, this is used in the feedforward block of each transformer layer.
    Args:
        x (np.ndarray): Input vector.
        W (np.ndarray): Weight matrix.
        b (np.ndarray): Bias vector.
        activation_fn (callable): Activation function (e.g., GELU, ReLU).
    Returns:
        np.ndarray: Output vector.
    """
    # TODO: Implement the forward pass
    pass
```
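
If you want something to compare your attempt against, here is a minimal sketch of one way to fill in the scaffold. It assumes NumPy, 1-D input vectors, and a `W` of shape `(out_features, in_features)`. The `relu` helper and the example values are illustrative and not part of the original notebook.

```python
import numpy as np

def relu(z):
    # Illustrative activation: element-wise max(z, 0).
    return np.maximum(z, 0.0)

def forward_pass(x, W, b, activation_fn):
    # Linear transformation followed by an element-wise activation.
    z = W @ x + b
    return activation_fn(z)

# Small example with made-up values.
x = np.array([1.0, -2.0, 0.5])
W = np.array([[0.5, 0.0, -1.0],
              [1.0, 1.0, 2.0]])
b = np.array([0.1, -0.2])

y = forward_pass(x, W, b, relu)
# Pre-activation is [0.1, -0.2], so ReLU gives [0.1, 0.0].
print(y)
```

Passing the activation as a callable keeps the function reusable: swapping ReLU for GELU only changes the argument, not the function body.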

## 🔗 Manual Gradient Calculation (Backpropagation)

To train a neural network, we need to compute the gradients of the loss with respect to each parameter (weights and biases). This is done using the chain rule.
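
As a worked sketch of that chain rule, write the pre-activation as $z = W x + b$ and the output as $y = \text{activation}(z)$. The symbols $z$ and $\delta$ are introduced here for illustration. Given the upstream gradient $\partial L / \partial y$:

$$ \delta = \frac{\partial L}{\partial y} \odot \text{activation}'(z) $$

$$ \frac{\partial L}{\partial W} = \delta \, x^\top, \qquad \frac{\partial L}{\partial b} = \delta, \qquad \frac{\partial L}{\partial x} = W^\top \delta $$

Here $\odot$ is element-wise multiplication, which assumes an element-wise activation such as ReLU or GELU.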

**LLM Context:**
- Every parameter in a transformer is updated using gradients computed via backpropagation.

### Task:
- Scaffold functions to compute the gradients of the loss with respect to W, b, and x for a single-layer network.
- Add docstrings explaining the role of each gradient in LLM training.

```python
def backward_pass(dy, x, W, b, activation_fn, activation_grad_fn):
    """
    Compute gradients for a single-layer feedforward network.
    Args:
        dy (np.ndarray): Gradient of the loss with respect to the output (dL/dy).
        x (np.ndarray): Input vector.
        W (np.ndarray): Weight matrix.
        b (np.ndarray): Bias vector.
        activation_fn (callable): Activation function used in the forward pass.
        activation_grad_fn (callable): Derivative of the activation function.
    Returns:
        dW (np.ndarray): Gradient w.r.t. W.
        db (np.ndarray): Gradient w.r.t. b.
        dx (np.ndarray): Gradient w.r.t. x (for chaining to previous layers).
    """
    # TODO: Implement manual backpropagation for a single-layer network
    pass
```
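
One possible completion of this scaffold is sketched below, assuming NumPy, 1-D vectors, and an element-wise activation. It treats `activation_grad_fn` as a function of the pre-activation `z = Wx + b`, which is an assumption the scaffold does not spell out. The `tanh_grad` helper, the loss `L = sum(y)`, and the finite-difference check are illustrative additions.

```python
import numpy as np

def backward_pass(dy, x, W, b, activation_fn, activation_grad_fn):
    # activation_fn is unused here; it is kept to match the scaffold's signature.
    z = W @ x + b                      # recompute the pre-activation
    dz = dy * activation_grad_fn(z)    # chain rule through the activation
    dW = np.outer(dz, x)               # dL/dW has the same shape as W
    db = dz.copy()                     # dL/db has the same shape as b
    dx = W.T @ dz                      # gradient passed to the previous layer
    return dW, db, dx

def tanh_grad(z):
    # Derivative of tanh, evaluated at the pre-activation.
    return 1.0 - np.tanh(z) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W = rng.normal(size=(2, 3))
b = rng.normal(size=2)

# Assume the loss is L = sum(y), so dL/dy is a vector of ones.
dy = np.ones(2)
dW, db, dx = backward_pass(dy, x, W, b, np.tanh, tanh_grad)

# Central finite-difference check on a single weight entry.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 1] += eps
W_minus[0, 1] -= eps
numeric = (np.tanh(W_plus @ x + b).sum() - np.tanh(W_minus @ x + b).sum()) / (2 * eps)
print(np.isclose(dW[0, 1], numeric, atol=1e-6))
```

Comparing an analytic gradient against a finite-difference estimate like this is a common way to catch sign or transpose mistakes in hand-written backward passes.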

## 🧮 Backpropagation Through a Two-Layer Feedforward Block

The feedforward block in a standard transformer layer uses two linear layers with a non-linearity in between. Let's scaffold the backward pass for this structure:

$$ h = \text{activation}(W_1 x + b_1) $$
$$ y = W_2 h + b_2 $$

**LLM Context:**
- This is the structure of the feedforward block in the original transformer. Some newer LLMs use gated variants, but the gradient logic is the same.

### Task:
- Scaffold a function to compute gradients for all parameters in a two-layer feedforward block.
- Add a docstring explaining how this mirrors the transformer architecture.

```python
def backprop_two_layer(x, W1, b1, W2, b2, activation_fn, activation_grad_fn, dy):
    """
    Compute gradients for a two-layer feedforward block (as in transformers).
    Args:
        x (np.ndarray): Input vector.
        W1, b1: First layer weights and bias.
        W2, b2: Second layer weights and bias.
        activation_fn (callable): Activation function.
        activation_grad_fn (callable): Derivative of activation function.
        dy (np.ndarray): Gradient of loss w.r.t. output.
    Returns:
        dW1, db1, dW2, db2, dx: Gradients for all parameters and input.
    """
    # TODO: Implement manual backpropagation for two-layer feedforward block
    pass
```
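
The sketch below shows one way this scaffold could be completed, assuming NumPy, 1-D vectors, and an element-wise activation whose derivative is evaluated at the first layer's pre-activation. The `relu` and `relu_grad` helpers and the sizes `d_model` and `d_ff` are illustrative choices, not values from the notebook.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    # 1 where z > 0, else 0 (the subgradient at 0 is taken to be 0).
    return (z > 0).astype(z.dtype)

def backprop_two_layer(x, W1, b1, W2, b2, activation_fn, activation_grad_fn, dy):
    # Recompute the forward pass to get the intermediate values.
    z1 = W1 @ x + b1
    h = activation_fn(z1)

    # Second layer: y = W2 h + b2
    dW2 = np.outer(dy, h)
    db2 = dy.copy()
    dh = W2.T @ dy

    # Back through the activation, then the first layer.
    dz1 = dh * activation_grad_fn(z1)
    dW1 = np.outer(dz1, x)
    db1 = dz1.copy()
    dx = W1.T @ dz1
    return dW1, db1, dW2, db2, dx

rng = np.random.default_rng(0)
d_model, d_ff = 4, 8  # illustrative sizes
x = rng.normal(size=d_model)
W1 = rng.normal(size=(d_ff, d_model))
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_model, d_ff))
b2 = np.zeros(d_model)
dy = np.ones(d_model)

grads = backprop_two_layer(x, W1, b1, W2, b2, relu, relu_grad, dy)
# Each gradient has the same shape as the quantity it belongs to:
# [(8, 4), (8,), (4, 8), (4,), (4,)]
print([g.shape for g in grads])
```

Note that the backward pass visits the layers in reverse order: the gradient `dh` produced by the second layer is the upstream gradient for the first.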

## 🔁 Gradient Flow and Residual Connections

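In the original (post-LN) transformer, the output of the feedforward block is added to the input (residual connection) before layer normalization. Many recent LLMs normalize before the block instead (pre-LN), but the addition behaves the same way in the backward pass. It's important to understand how gradients flow through this addition.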

**LLM Context:**
- The addition passes the upstream gradient to the input unchanged, giving gradients a direct path backward through many layers and helping deep LLMs train effectively.
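
To make this precise, write the residual block as $\text{out} = x + F(x)$, where $F$ stands for the feedforward block (notation assumed here for illustration). The chain rule then gives:

$$ \frac{\partial L}{\partial x} = \frac{\partial L}{\partial \text{out}} + \left( \frac{\partial F}{\partial x} \right)^{\top} \frac{\partial L}{\partial \text{out}} $$

The first term is the skip path, which carries the gradient through unchanged. The second term is the path through the feedforward block.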

### Task:
- Scaffold a function to compute the gradient of the loss with respect to the input when a residual connection is used.
- Add a docstring explaining why this is important for LLMs.

```python
def residual_backward(dout):
    """
    Compute the gradient flow through a residual (skip) connection.
    In LLMs, this ensures both the input and the feedforward block receive gradients.
    Args:
        dout (np.ndarray): Gradient of loss w.r.t. output (after addition).
    Returns:
        dx_input (np.ndarray): Gradient w.r.t. the original input.
        dx_ffn (np.ndarray): Gradient w.r.t. the feedforward output.
    """
    # TODO: Implement gradient flow through residual connection
    pass
```
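
A minimal sketch of one possible completion follows, assuming NumPy and that `dout` is the gradient with respect to the sum `x + ffn(x)`. The matrix `A` is a hypothetical linear stand-in for the feedforward block, used only to show how the two branches recombine.

```python
import numpy as np

def residual_backward(dout):
    # Addition copies the upstream gradient to both branches unchanged.
    dx_input = dout.copy()
    dx_ffn = dout.copy()
    return dx_input, dx_ffn

dout = np.array([0.5, -1.0, 2.0])
dx_input, dx_ffn = residual_backward(dout)

# Hypothetical stand-in for the feedforward block: F(x) = A x,
# whose backward pass maps dx_ffn to A.T @ dx_ffn.
A = np.diag([1.0, 2.0, 3.0])
dx_total = dx_input + A.T @ dx_ffn
# dx_total is [1.0, -3.0, 8.0]: the skip path plus the path through F.
print(dx_total)
```

Even if the path through the block contributed nothing (for example, `A` all zeros), `dx_total` would still equal `dout`, which is why the skip path keeps gradients from vanishing across many layers.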

## 🧠 Final Summary: Manual Backprop in LLMs

- Manual backpropagation builds intuition for how gradients flow through each layer and parameter in a transformer.
- Every weight in an LLM is updated using these principles, just at a much larger scale.
- Understanding this process is key to debugging and designing new architectures.

In the next notebook, you'll use these gradients to train a network on a real task!