# 04 - Feedforward Neural Networks (FFN) in LLMs

Feedforward neural networks (FFNs) are one of the two core sublayers of every transformer block, alongside self-attention. Each block contains a position-wise FFN that processes each token's hidden state independently after self-attention.

In this notebook, you'll scaffold the components of a feedforward neural network, step by step, as they appear in LLMs.

## 🔢 Linear Transformation (Affine Layer)

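The first step in a feedforward layer is a linear (affine) transformation: $y = Wx + b$. For a batch whose rows are the inputs, this becomes $Y = XW^\top + b$, with $b$ added to every row.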

**LLM Context:**
- In transformers, this projects each token's hidden state to a higher-dimensional space (commonly 4× the model dimension) before the non-linearity is applied.

### Task:
- Scaffold a function to compute the linear transformation for a batch of inputs.
- Add a docstring explaining its role in LLMs.

In [None]:
def linear_forward(X, W, b):
    """
    Compute the linear transformation y = Wx + b for each row x of X (Y = X @ W.T + b).
    In LLMs, this projects token embeddings or hidden states to a new space.
    Args:
        X (np.ndarray): Input batch (batch_size x input_dim)
        W (np.ndarray): Weight matrix (output_dim x input_dim)
        b (np.ndarray): Bias vector (output_dim,)
    Returns:
        np.ndarray: Output batch (batch_size x output_dim)
    """
    # TODO: Implement the linear transformation
    pass
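
One possible way to fill in the stub above is sketched below. It assumes the shapes given in the docstring (rows of `X` are inputs, `W` is `output_dim x input_dim`), so the batched form is `X @ W.T + b`. The toy values are made up for illustration.

```python
import numpy as np

def linear_forward(X, W, b):
    """Row-wise affine map: each row x of X becomes W @ x + b."""
    # X: (batch_size, input_dim), W: (output_dim, input_dim), b: (output_dim,)
    # X @ W.T has shape (batch_size, output_dim); b broadcasts across the rows.
    return X @ W.T + b

# Toy example (assumed values): 2 inputs, input_dim=3, output_dim=4
X = np.array([[1.0, 2.0, 3.0],
              [0.0, -1.0, 0.5]])
W = np.ones((4, 3))   # every output unit sums the input features
b = np.zeros(4)

Y = linear_forward(X, W, b)
print(Y.shape)  # (2, 4)
```

Transposing `W` inside the function keeps the `output_dim x input_dim` storage convention from the docstring while still letting a whole batch go through one matrix multiply.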

## 🧮 Non-Linearity (Activation Function)

After the linear transformation, a non-linear activation is applied. Transformers often use GELU or ReLU.

**LLM Context:**
- Non-linearity allows the model to learn complex patterns beyond what a single linear layer can represent.

### Task:
- Scaffold a function for the GELU activation (used in models such as BERT and GPT).
- Add a docstring explaining why GELU is preferred in transformers.

In [None]:
def gelu(x):
    """
    Compute the GELU activation function.
    GELU is the non-linearity used in many transformer-based LLMs (e.g., BERT, GPT).
    Args:
        x (np.ndarray): Input array.
    Returns:
        np.ndarray: Output after applying GELU.
    """
    # TODO: Implement the GELU activation
    pass
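
A minimal sketch of one way to implement the stub above follows. It uses the common tanh approximation of GELU rather than the exact form $x \cdot \Phi(x)$, because NumPy alone has no vectorized `erf`. That choice is an assumption of this sketch, not something the notebook prescribes.

```python
import numpy as np

def gelu(x):
    """GELU via the tanh approximation (an approximation, not the exact x * Phi(x))."""
    x = np.asarray(x, dtype=float)
    # 0.5 * x * (1 + tanh( sqrt(2/pi) * (x + 0.044715 * x^3) ))
    inner = np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + np.tanh(inner))

# Compare against ReLU on a few sample points (assumed values)
x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
gelu_out = gelu(x)
relu_out = np.maximum(x, 0.0)
print(gelu_out)
print(relu_out)
```

Unlike ReLU, GELU is smooth around zero and lets small negative inputs through with a small negative output, which is one commonly cited reason for using it in transformers.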

## 🔁 Stacking Layers: The Feedforward Block

A feedforward block in a transformer consists of two linear layers with a non-linearity in between:

$$\text{FFN}(x) = W_2 \cdot \text{GELU}(W_1 x + b_1) + b_2$$

**LLM Context:**
- This block is applied independently to each token position after self-attention in every transformer layer.

### Task:
- Scaffold a function for the full feedforward block (two linear layers + GELU).
- Add a docstring explaining its role in the transformer architecture.

In [None]:
def feedforward_block(x, W1, b1, W2, b2):
    """
    Compute the transformer feedforward block: two linear layers with GELU in between.
    This is applied position-wise to each token embedding in LLMs.
    Args:
        x (np.ndarray): Input (batch_size x input_dim)
        W1, b1: First layer weights (hidden_dim x input_dim) and bias (hidden_dim,)
        W2, b2: Second layer weights (output_dim x hidden_dim) and bias (output_dim,)
    Returns:
        np.ndarray: Output (batch_size x output_dim)
    """
    # TODO: Implement the feedforward block
    pass
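
The formula above can be sketched end to end as follows. This is one possible solution, not the only one: it inlines a tanh-approximate GELU, stores weights as `out_dim x in_dim` (matching `linear_forward`'s docstring), and uses a hypothetical `d_model=4`, `d_ff=16` with random weights purely for illustration.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (assumed here; the exact form uses erf)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def feedforward_block(x, W1, b1, W2, b2):
    """FFN(x) = W2 . GELU(W1 x + b1) + b2, applied to every row of x."""
    hidden = gelu(x @ W1.T + b1)   # (batch, d_ff): expand, then non-linearity
    return hidden @ W2.T + b2      # (batch, d_model): project back down

# Hypothetical sizes: model dimension 4, hidden dimension 16 (4x expansion)
rng = np.random.default_rng(0)
d_model, d_ff = 4, 16
W1 = rng.normal(scale=0.1, size=(d_ff, d_model))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.1, size=(d_model, d_ff))
b2 = np.zeros(d_model)

x = rng.normal(size=(3, d_model))  # 3 token positions
out = feedforward_block(x, W1, b1, W2, b2)
print(out.shape)  # (3, 4)

# Position-wise: running one token alone gives the same row as in the batch
single = feedforward_block(x[1:2], W1, b1, W2, b2)
print(np.allclose(out[1], single[0]))  # True
```

The last check illustrates what "position-wise" means: no information moves between rows, so each token's output depends only on that token's input.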

## 🧠 Residual Connections (Skip Connections)

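Transformers add the input to the output of the feedforward block (residual connection). In the original transformer this sum is then passed through layer normalization (post-LN); many later LLMs, such as GPT-2, instead normalize the input to the block and add the residual afterwards (pre-LN).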

**LLM Context:**
- Residual connections help gradients flow and enable training of very deep models like GPT-3.
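
To see why the skip path helps gradients, consider a short derivation. It assumes a single residual block $y = x + \text{FFN}(x)$ and ignores layer normalization for simplicity:

$$\frac{\partial y}{\partial x} = I + \frac{\partial\, \text{FFN}(x)}{\partial x}
\qquad\Longrightarrow\qquad
\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y}\left(I + \frac{\partial\, \text{FFN}(x)}{\partial x}\right)$$

The identity term $I$ passes the upstream gradient $\partial \mathcal{L} / \partial y$ through unchanged, so even when the FFN's own Jacobian is small, the gradient reaching earlier layers does not vanish.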

### Task:
- Scaffold a function that adds a residual connection: output = input + feedforward_output.
- Add a docstring explaining why this is critical for LLMs.

In [None]:
def add_residual(input_tensor, output_tensor):
    """
    Add a residual (skip) connection: output = input_tensor + output_tensor.
    Residual connections are critical for training deep LLMs and transformers.
    Args:
        input_tensor (np.ndarray): Original input.
        output_tensor (np.ndarray): Output from feedforward block.
    Returns:
        np.ndarray: Result after adding residual connection.
    """
    # TODO: Add input_tensor and output_tensor
    pass
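
Below is a minimal sketch of the stub above, followed by layer normalization in the "add, then normalize" order. The `layer_norm` helper is hypothetical and simplified (no learnable scale or shift), and the explicit shape check is a design choice of this sketch, added because NumPy broadcasting would otherwise accept some mismatched shapes silently.

```python
import numpy as np

def add_residual(input_tensor, output_tensor):
    """Residual connection: input_tensor + output_tensor (shapes must match)."""
    if input_tensor.shape != output_tensor.shape:
        raise ValueError(
            f"shape mismatch: {input_tensor.shape} vs {output_tensor.shape}"
        )
    return input_tensor + output_tensor

def layer_norm(x, eps=1e-5):
    """Simplified layer norm over the last axis (hypothetical helper, no gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# Toy values standing in for a block input and its feedforward output
x = np.array([[1.0, 2.0, 3.0, 4.0]])
ffn_out = np.array([[0.5, -0.5, 0.5, -0.5]])

summed = add_residual(x, ffn_out)  # [[1.5, 1.5, 3.5, 3.5]]
y = layer_norm(summed)             # each row now has zero mean
print(summed)
print(y)
```

Because the residual is a plain element-wise sum, the feedforward block's output dimension has to equal its input dimension, which is why the second linear layer projects back down to the model dimension.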

## 🧠 Final Summary: Feedforward NNs in LLMs

- Feedforward blocks are the main non-attention computation in every transformer layer.
- They consist of two linear layers with a non-linearity (often GELU) in between, wrapped in a residual connection.
- Mastering these steps is essential for understanding and building LLMs.

In the next notebook, you'll learn how to compute gradients and backpropagate through these layers!