# 22 - Transformer Block from Scratch

The transformer block is the basic building unit of transformer-based LLMs. It combines two sublayers, multi-head self-attention and a position-wise feedforward network, each wrapped in a residual connection and followed by layer normalization.

In this notebook, you'll scaffold the steps to build a transformer block, as used in GPT, BERT, and other LLMs.

## 🔢 Multi-Head Self-Attention Layer

The first step in a transformer block is multi-head self-attention, which lets each token attend to every token in the sequence. In causal models such as GPT, a mask restricts each token to itself and earlier positions.

**LLM/Transformer Context:**
- This is the core mechanism for capturing relationships between tokens in LLMs.

### Task:
- Scaffold a function to compute multi-head self-attention for input embeddings.
- Add a docstring explaining its role.

In [None]:
def multihead_self_attention(X, params):
    """
    Compute multi-head self-attention for input embeddings.
    Args:
        X (np.ndarray): Input embeddings (seq_len x d_model)
        params (dict): All attention parameters (projection matrices, etc.)
    Returns:
        np.ndarray: Output of multi-head attention (seq_len x d_model)
    """
    # TODO: Implement multi-head self-attention
    pass
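
One possible way to fill in the scaffold above is sketched below. It is a minimal NumPy sketch, not a reference solution: the `params` keys (`W_q`, `W_k`, `W_v`, `W_o`, `n_heads`, `causal`) and the `softmax` helper are assumptions made for this example, since the notebook does not fix a parameter layout. Biases and dropout are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multihead_self_attention(X, params):
    """Sketch of multi-head self-attention. The params keys are assumed names."""
    seq_len, d_model = X.shape
    n_heads = params["n_heads"]
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    d_head = d_model // n_heads

    # Project inputs to queries, keys, and values: each (seq_len, d_model).
    Q = X @ params["W_q"]
    K = X @ params["W_k"]
    V = X @ params["W_v"]

    def split_heads(M):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split_heads(Q), split_heads(K), split_heads(V)

    # Scaled dot-product scores per head: (n_heads, seq_len, seq_len).
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)

    # Optional causal mask (assumed flag): block attention to later positions.
    if params.get("causal", False):
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)

    weights = softmax(scores, axis=-1)
    heads = weights @ Vh  # (n_heads, seq_len, d_head)

    # Concatenate heads back to (seq_len, d_model) and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ params["W_o"]

# Usage with small random parameters.
rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 8, 2
X = rng.normal(size=(seq_len, d_model))
params = {"n_heads": n_heads, "causal": False}
for name in ("W_q", "W_k", "W_v", "W_o"):
    params[name] = rng.normal(scale=0.1, size=(d_model, d_model))

out = multihead_self_attention(X, params)
print(out.shape)  # (4, 8)
```

Splitting `d_model` across heads keeps the total cost close to that of a single full-width head while letting each head learn a different attention pattern.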

## 🔗 Residual Connection and Layer Normalization (Post-Attention)

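After self-attention, the sublayer input is added back to the attention output (the residual connection) and the sum is layer-normalized: `LayerNorm(X + Attention(X))`. This is the post-norm arrangement used in the original Transformer and BERT; GPT-2 and many later models normalize before each sublayer instead.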

**LLM/Transformer Context:**
- Residuals and normalization stabilize training and enable deep stacking of transformer blocks.

### Task:
- Scaffold a function to add a residual connection and apply layer normalization after attention.
- Add a docstring explaining its importance.

In [None]:
def add_residual_and_layernorm(X, sublayer_output, layernorm_params):
    """
    Add a residual connection and apply layer normalization.
    Args:
        X (np.ndarray): Input to the sublayer (seq_len x d_model)
        sublayer_output (np.ndarray): Output from the sublayer (seq_len x d_model)
        layernorm_params (dict): Parameters for layer normalization (gamma, beta)
    Returns:
        np.ndarray: Output after residual and layer normalization (seq_len x d_model)
    """
    # TODO: Add residual and apply layer normalization
    pass
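
The scaffold above could be completed roughly as follows. This is a sketch under assumptions: the `layer_norm` helper and the `eps` default are introduced here for illustration, and `layernorm_params` is assumed to hold `gamma` and `beta` arrays of length `d_model`, as the docstring suggests.

```python
import numpy as np

def layer_norm(X, gamma, beta, eps=1e-5):
    """Normalize each token (row) to zero mean and unit variance, then scale and shift."""
    mean = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return gamma * (X - mean) / np.sqrt(var + eps) + beta

def add_residual_and_layernorm(X, sublayer_output, layernorm_params):
    """Post-norm residual step: LayerNorm(X + sublayer_output)."""
    return layer_norm(
        X + sublayer_output,
        layernorm_params["gamma"],
        layernorm_params["beta"],
    )

# Usage: with gamma = 1 and beta = 0, every token ends up with mean ~0 and variance ~1.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
sublayer_output = rng.normal(size=(4, 8))
ln_params = {"gamma": np.ones(8), "beta": np.zeros(8)}

out = add_residual_and_layernorm(X, sublayer_output, ln_params)
print(out.shape)  # (4, 8)
```

Normalization runs over the feature axis, so each token is normalized independently of the others in the sequence.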

## 🧮 Feedforward Network (FFN)

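The feedforward network consists of two linear layers with a non-linearity (usually GELU or ReLU) in between. The first layer expands each token from `d_model` to a wider hidden size (4x `d_model` in the original Transformer), and the second projects it back to `d_model`.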

**LLM/Transformer Context:**
- The FFN increases the model's capacity and is applied independently to each token position.

### Task:
- Scaffold a function for the position-wise feedforward network.
- Add a docstring explaining its role.

In [None]:
def transformer_feedforward(X, params):
    """
    Position-wise feedforward network in the transformer block.
    Args:
        X (np.ndarray): Input (seq_len x d_model)
        params (dict): FFN parameters (weights, biases)
    Returns:
        np.ndarray: Output of the feedforward network (seq_len x d_model)
    """
    # TODO: Implement the feedforward network
    pass
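
A minimal sketch of the position-wise FFN is shown below. The `params` keys (`W1`, `b1`, `W2`, `b2`), the hidden size `d_ff = 4 * d_model`, and the choice of the tanh approximation of GELU are assumptions for this example; ReLU would work equally well here.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def transformer_feedforward(X, params):
    """Two linear layers with GELU in between, applied to each token (row) separately."""
    hidden = gelu(X @ params["W1"] + params["b1"])  # (seq_len, d_ff)
    return hidden @ params["W2"] + params["b2"]     # (seq_len, d_model)

# Usage with small random parameters.
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
d_ff = 4 * d_model  # assumed expansion factor
X = rng.normal(size=(seq_len, d_model))
ffn_params = {
    "W1": rng.normal(scale=0.1, size=(d_model, d_ff)),
    "b1": np.zeros(d_ff),
    "W2": rng.normal(scale=0.1, size=(d_ff, d_model)),
    "b2": np.zeros(d_model),
}

out = transformer_feedforward(X, ffn_params)
print(out.shape)  # (4, 8)
```

Because the same weights are applied to every row, running the function on a single token gives the same result as the corresponding row of the full output.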

## 🔗 Residual Connection and Layer Normalization (Post-FFN)

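A second residual connection and layer normalization are applied after the feedforward network. The operation is the same as the post-attention step, `LayerNorm(X + FFN(X))`, but it uses its own `gamma` and `beta` parameters.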

**LLM/Transformer Context:**
- This further stabilizes training and enables deep stacking.

### Task:
- Scaffold a function to add a residual connection and apply layer normalization after the FFN.
- Add a docstring explaining its importance.

In [None]:
def add_residual_and_layernorm_ffn(X, ffn_output, layernorm_params):
    """
    Add a residual connection and apply layer normalization after the FFN.
    Args:
        X (np.ndarray): Input to the FFN (seq_len x d_model)
        ffn_output (np.ndarray): Output from the FFN (seq_len x d_model)
        layernorm_params (dict): Parameters for layer normalization (gamma, beta)
    Returns:
        np.ndarray: Output after residual and layer normalization (seq_len x d_model)
    """
    # TODO: Add residual and apply layer normalization
    pass
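
Since this step has the same shape as the post-attention one, one option is to route both through a shared helper. The sketch below assumes a hypothetical `residual_layernorm` helper (not part of the notebook) and uses non-default `gamma` and `beta` values to show what the learned scale and shift do.

```python
import numpy as np

def residual_layernorm(X, sublayer_output, gamma, beta, eps=1e-5):
    """Hypothetical shared helper: LayerNorm(X + sublayer_output), per token."""
    Z = X + sublayer_output
    mean = Z.mean(axis=-1, keepdims=True)
    var = Z.var(axis=-1, keepdims=True)
    return gamma * (Z - mean) / np.sqrt(var + eps) + beta

def add_residual_and_layernorm_ffn(X, ffn_output, layernorm_params):
    """Post-FFN residual step, delegating to the shared helper."""
    return residual_layernorm(
        X, ffn_output, layernorm_params["gamma"], layernorm_params["beta"]
    )

# Usage: gamma rescales each normalized token, beta shifts it.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
ffn_output = rng.normal(size=(4, 8))
ln2_params = {"gamma": np.full(8, 2.0), "beta": np.full(8, 0.5)}

out = add_residual_and_layernorm_ffn(X, ffn_output, ln2_params)
print(out.mean(axis=-1))  # each entry is close to beta = 0.5
```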

## 🧱 Full Transformer Block

Combine all steps: multi-head attention, residual + layer norm, feedforward, residual + layer norm.

**LLM/Transformer Context:**
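- This is the core building block of transformer-based LLMs and is stacked many times in deep models; because the output shape matches the input shape, one block's output can feed directly into the next.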

### Task:
- Scaffold a function for the full transformer block.
- Add a docstring explaining the workflow.

In [None]:
def transformer_block(X, attn_params, ffn_params, ln1_params, ln2_params):
    """
    Full transformer block: multi-head attention, residual + layer norm, feedforward, residual + layer norm.
    Args:
        X (np.ndarray): Input embeddings (seq_len x d_model)
        attn_params (dict): Attention parameters
        ffn_params (dict): Feedforward network parameters
        ln1_params (dict): Layer norm parameters after attention
        ln2_params (dict): Layer norm parameters after FFN
    Returns:
        np.ndarray: Output of the transformer block (seq_len x d_model)
    """
    # TODO: Implement the full transformer block
    pass
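
The workflow described above can be sketched end to end as follows. This is a compact, self-contained sketch rather than a definitive implementation: it inlines simplified versions of the earlier steps, uses ReLU in the FFN, omits masking, biases in attention, and dropout, and assumes the parameter key names shown (`W_q`, `W_k`, `W_v`, `W_o`, `n_heads`, `W1`, `b1`, `W2`, `b2`, `gamma`, `beta`).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(X, gamma, beta, eps=1e-5):
    mean = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return gamma * (X - mean) / np.sqrt(var + eps) + beta

def multihead_self_attention(X, params):
    seq_len, d_model = X.shape
    n_heads = params["n_heads"]
    d_head = d_model // n_heads

    def split_heads(M):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q = split_heads(X @ params["W_q"])
    K = split_heads(X @ params["W_k"])
    V = split_heads(X @ params["W_v"])
    weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_head))
    concat = (weights @ V).transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ params["W_o"]

def transformer_feedforward(X, params):
    hidden = np.maximum(0.0, X @ params["W1"] + params["b1"])  # ReLU
    return hidden @ params["W2"] + params["b2"]

def transformer_block(X, attn_params, ffn_params, ln1_params, ln2_params):
    """Post-norm block: attention -> add & norm -> FFN -> add & norm."""
    attn_out = multihead_self_attention(X, attn_params)
    H = layer_norm(X + attn_out, ln1_params["gamma"], ln1_params["beta"])
    ffn_out = transformer_feedforward(H, ffn_params)
    return layer_norm(H + ffn_out, ln2_params["gamma"], ln2_params["beta"])

# Usage with small random parameters.
rng = np.random.default_rng(0)
seq_len, d_model, n_heads, d_ff = 5, 16, 4, 64
X = rng.normal(size=(seq_len, d_model))

attn_params = {"n_heads": n_heads}
for name in ("W_q", "W_k", "W_v", "W_o"):
    attn_params[name] = rng.normal(scale=0.1, size=(d_model, d_model))
ffn_params = {
    "W1": rng.normal(scale=0.1, size=(d_model, d_ff)),
    "b1": np.zeros(d_ff),
    "W2": rng.normal(scale=0.1, size=(d_ff, d_model)),
    "b2": np.zeros(d_model),
}
ln1_params = {"gamma": np.ones(d_model), "beta": np.zeros(d_model)}
ln2_params = {"gamma": np.ones(d_model), "beta": np.zeros(d_model)}

Y = transformer_block(X, attn_params, ffn_params, ln1_params, ln2_params)
print(Y.shape)  # (5, 16)
```

Because the output has the same shape as the input, `Y` can be passed straight into another call to `transformer_block`, which is how blocks are stacked.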

## 🧠 Final Summary: The Transformer Block in LLMs

- The transformer block is the fundamental unit of transformer-based LLMs, combining attention, a feedforward network, layer normalization, and residual connections.
- Stacking these blocks enables models to learn deep, contextual representations of language.
- Mastering the transformer block is essential for building and understanding LLMs.

In the next notebook, you'll use this block to build a mini transformer language model!