# 20 - Multi-Head Attention from Scratch

Multi-head attention is a core component of transformers and LLMs. It lets the model jointly attend to information from different representation subspaces at different positions, which makes it more expressive than a single attention head.

In this notebook, you'll scaffold the steps to implement multi-head self-attention, building up to the full multi-head attention layer.

## 🔢 What is Multi-Head Attention?

Multi-head attention runs several self-attention operations ("heads") in parallel, each with its own set of learned projections. The outputs are concatenated and projected to the final output space.

**LLM/Transformer Context:**
- Multi-head attention enables transformers to capture diverse relationships and patterns in the input sequence, which is critical for LLM performance.
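
The description above can be written compactly as follows. This is a sketch assuming the standard scaled dot-product form, which this notebook does not spell out; $h$ stands for `num_heads`, and $Q_i$, $K_i$, $V_i$ are assumed to be the $i$-th `d_head`-wide column slices of $XW_q$, $XW_k$, $XW_v$.

```latex
\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_{\mathrm{head}}}}\right) V_i,
\qquad
\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W_o,
\qquad
d_{\mathrm{head}} = \frac{d_{\mathrm{model}}}{h}
```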

### Task:
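- Scaffold a function that splits embeddings of shape (seq_len x d_model) into `num_heads` slices of width `d_head = d_model / num_heads`. The Q, K, V projections themselves are applied in the full layer later in this notebook.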
- Add a docstring explaining its role.

In [None]:
def split_heads(X, num_heads):
    """
    Split input embeddings into multiple heads for multi-head attention.
    Args:
        X (np.ndarray): Input embeddings (seq_len x d_model)
        num_heads (int): Number of attention heads
    Returns:
        np.ndarray: Split embeddings (num_heads x seq_len x d_head), with d_head = d_model // num_heads
    """
    # TODO: Split X into num_heads along the last dimension
    pass
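
One possible way to fill in this stub is sketched here. It assumes each head takes a contiguous `d_head`-wide slice of the last dimension and that `d_model` is divisible by `num_heads`. The name `split_heads_sketch` is hypothetical and is used only so it does not overwrite your own `split_heads`.

```python
import numpy as np

def split_heads_sketch(X, num_heads):
    """Reshape (seq_len, d_model) into (num_heads, seq_len, d_head)."""
    seq_len, d_model = X.shape
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    # (seq_len, d_model) -> (seq_len, num_heads, d_head) -> (num_heads, seq_len, d_head)
    return X.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

# Toy input: seq_len=3, d_model=8, split into 2 heads of width 4
X = np.arange(24, dtype=float).reshape(3, 8)
heads = split_heads_sketch(X, num_heads=2)
print(heads.shape)  # (2, 3, 4)
```

With this layout, `heads[0]` holds the first four columns of `X` and `heads[1]` the last four.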

## 🧮 Per-Head Self-Attention

Each head performs self-attention independently on its projected Q, K, V.

**LLM/Transformer Context:**
- Each head can focus on different types of relationships in the sequence.

### Task:
- Scaffold a function to compute self-attention for each head.
- Add a docstring explaining the per-head computation.

In [None]:
def multihead_attention_per_head(Q_heads, K_heads, V_heads):
    """
    Compute self-attention for each head independently.
    Args:
        Q_heads, K_heads, V_heads (np.ndarray): (num_heads x seq_len x d_head)
    Returns:
        np.ndarray: Attention outputs for each head (num_heads x seq_len x d_head)
    """
    # TODO: Compute self-attention for each head
    pass
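
A minimal sketch of the per-head computation follows. It assumes scaled dot-product attention with no masking, which is the usual choice but is not specified in this notebook. The names `softmax` and `per_head_attention_sketch` are hypothetical helpers for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def per_head_attention_sketch(Q_heads, K_heads, V_heads):
    """Scaled dot-product attention, batched over the leading head axis."""
    d_head = Q_heads.shape[-1]
    # scores: (num_heads, seq_len, seq_len), one score matrix per head
    scores = Q_heads @ K_heads.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V_heads            # (num_heads, seq_len, d_head)

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 3, 4))
K = rng.normal(size=(2, 3, 4))
V = rng.normal(size=(2, 3, 4))
out = per_head_attention_sketch(Q, K, V)
print(out.shape)  # (2, 3, 4)
```

Because the matrix products broadcast over the first axis, each head only ever sees its own Q, K, and V; no information crosses heads at this stage.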

## 🔗 Concatenation and Output Projection

The outputs of all heads are concatenated and projected to the final output dimension.

**LLM/Transformer Context:**
- This step combines information from all heads, allowing the model to integrate multiple perspectives.

### Task:
- Scaffold a function to concatenate head outputs and apply the final linear projection.
- Add a docstring explaining its role.

In [None]:
def combine_heads_and_project(head_outputs, W_o):
    """
    Concatenate outputs from all heads and project to the output dimension.
    Args:
        head_outputs (np.ndarray): (num_heads x seq_len x d_head)
        W_o (np.ndarray): Output projection matrix (d_model x d_model)
    Returns:
        np.ndarray: Final multi-head attention output (seq_len x d_model)
    """
    # TODO: Concatenate and project head outputs
    pass
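
The concatenation step is the inverse of the head split, followed by one matrix multiply. The sketch below assumes heads are laid out as contiguous column blocks; `combine_heads_and_project_sketch` is a hypothetical name chosen to avoid clashing with your own function.

```python
import numpy as np

def combine_heads_and_project_sketch(head_outputs, W_o):
    """Concatenate heads along the feature axis, then apply W_o."""
    num_heads, seq_len, d_head = head_outputs.shape
    # (num_heads, seq_len, d_head) -> (seq_len, num_heads, d_head) -> (seq_len, d_model)
    concat = head_outputs.transpose(1, 0, 2).reshape(seq_len, num_heads * d_head)
    return concat @ W_o

# Toy input: 2 heads, seq_len=3, d_head=4; identity W_o leaves the concat unchanged
head_outputs = np.arange(24, dtype=float).reshape(2, 3, 4)
W_o = np.eye(8)
combined = combine_heads_and_project_sketch(head_outputs, W_o)
print(combined.shape)  # (3, 8)
```

Using the identity for `W_o` makes it easy to check that head 0 lands in the first `d_head` columns and head 1 in the next block.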

## 🧮 Full Multi-Head Attention Layer

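Combine all steps: project X to Q/K/V, split each into heads, compute per-head attention, concatenate, and apply the output projection. Because `W_q`, `W_k`, and `W_v` are (d_model x d_model), the projection is applied to the full-width input before the split.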

**LLM/Transformer Context:**
- This is the core of every transformer block in LLMs.

### Task:
- Scaffold a function for the full multi-head attention layer.
- Add a docstring explaining the workflow.

In [None]:
def multihead_attention_layer(X, W_q, W_k, W_v, W_o, num_heads):
    """
    Full multi-head attention layer: project Q/K/V, split into heads, compute attention, combine, and project.
    Args:
        X (np.ndarray): Input embeddings (seq_len x d_model)
        W_q, W_k, W_v (np.ndarray): Projection matrices (d_model x d_model)
        W_o (np.ndarray): Output projection matrix (d_model x d_model)
        num_heads (int): Number of attention heads
    Returns:
        np.ndarray: Multi-head attention output (seq_len x d_model)
    """
    # TODO: Implement the full multi-head attention layer
    pass
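
Putting the pieces together, one possible end-to-end version is sketched below. It assumes scaled dot-product attention, no masking, no bias terms, and `d_model` divisible by `num_heads`. The function name and the random weights are illustrative, not part of the notebook.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multihead_attention_layer_sketch(X, W_q, W_k, W_v, W_o, num_heads):
    """Project, split into heads, attend per head, concatenate, project."""
    seq_len, d_model = X.shape
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads

    def split(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Project the full-width input first, then split each result into heads
    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, seq_len, seq_len)
    heads = softmax(scores, axis=-1) @ V                 # (num_heads, seq_len, d_head)
    # Undo the split: (num_heads, seq_len, d_head) -> (seq_len, d_model)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multihead_attention_layer_sketch(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)  # (5, 8)
```

A quick sanity check is to set `num_heads=1`, in which case the result should match ordinary single-head attention followed by `W_o`.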

## 🧠 Final Summary: Multi-Head Attention in LLMs

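- Multi-head attention runs several attention heads in parallel, each over its own learned projection of the input, so different heads can capture different relationships in the sequence.
- The layer projects the input to Q/K/V, splits each into heads, attends per head, concatenates the results, and applies the output projection `W_o` to integrate them.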
- Mastering multi-head attention is essential for understanding and building transformer-based LLMs.

In the next notebook, you'll see how positional encoding and multi-head attention are combined in the full transformer block!