# 19 - Self-Attention from Scratch

Self-attention is the core mechanism behind transformers and LLMs. It allows each token to attend to every other token in the sequence, enabling the model to capture long-range dependencies and context.

In this notebook, you'll scaffold the steps to implement self-attention from scratch, building up to a full single-head self-attention layer, the core of the transformer block.

## 🔢 What is Self-Attention?

Self-attention computes, for each token, a weighted sum of the value vectors of all tokens in the sequence. The weights come from the similarity (dot product) between that token's query vector and every token's key vector.

**LLM/Transformer Context:**
- Self-attention enables transformers to model relationships between all tokens, regardless of their distance in the sequence.
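
For reference, the whole computation can be written as one formula. This is the standard scaled dot-product form; the $\sqrt{d_k}$ scaling is an assumption taken from common transformer practice, since the first scaffold in this notebook only asks for raw dot products. The symbols $X$, $W_q$, $W_k$, $W_v$ match the argument names used in the functions in this notebook.

$$
Q = X W_q, \qquad K = X W_k, \qquad V = X W_v
$$

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

Here the softmax is applied row by row, so row $i$ of the result is a weighted average of the rows of $V$, weighted by how strongly query $i$ matches each key.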

### Task:
- Scaffold a function to compute self-attention scores (dot-product attention) from the query and key matrices of a sequence.
- Add a docstring explaining its role.

In [None]:
import numpy as np  # used by the docstring types and by your implementation

def compute_self_attention_scores(Q, K):
    """
    Compute self-attention scores (dot products) between all query and key vectors.
    Args:
        Q (np.ndarray): Query matrix (seq_len x d_k)
        K (np.ndarray): Key matrix (seq_len x d_k)
    Returns:
        np.ndarray: Attention scores (seq_len x seq_len)
    """
    # TODO: Compute dot-product attention scores
    pass
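
One possible way to fill in this step is sketched below. The function name `attention_scores_sketch` and its `scale` flag are hypothetical, not part of the notebook's scaffold. With `scale=False` it returns raw dot products, as the docstring above describes; with `scale=True` it divides by the square root of `d_k`, as in scaled dot-product attention.

```python
import numpy as np

def attention_scores_sketch(Q, K, scale=False):
    """Hypothetical helper: dot products between every query and every key."""
    # Swapping the last two axes transposes K, and also works if a leading
    # batch dimension is present: (..., seq_len, d_k) @ (..., d_k, seq_len).
    scores = Q @ np.swapaxes(K, -1, -2)
    if scale:
        # Assumption: optional 1/sqrt(d_k) scaling, as in scaled dot-product attention.
        scores = scores / np.sqrt(K.shape[-1])
    return scores

# Toy example: 3 tokens, d_k = 2.
Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
scores = attention_scores_sketch(Q, K)
print(scores.shape)  # (3, 3): one score per (query, key) pair
print(scores)
```

Entry `[i, j]` of the result is the dot product of query `i` with key `j`, so each row holds one token's scores against the whole sequence.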

## 🧮 Softmax Normalization

The raw attention scores are normalized with softmax to produce attention weights (probabilities).

**LLM/Transformer Context:**
- Softmax turns each row of scores into a probability distribution: for each query token, the weights over all key tokens are non-negative and sum to 1, allowing the model to focus on the most relevant tokens.

### Task:
- Scaffold a function to apply softmax to the attention scores along the correct axis.
- Add a docstring explaining its use.

In [None]:
def softmax_attention_weights(attn_scores):
    """
    Apply softmax to attention scores to get attention weights.
    Args:
        attn_scores (np.ndarray): Raw attention scores (seq_len x seq_len)
    Returns:
        np.ndarray: Attention weights (seq_len x seq_len)
    """
    # TODO: Apply softmax along the correct axis
    pass
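
A minimal sketch of this step follows, assuming the scores are laid out with queries along rows and keys along columns (as in the docstring above). The name `softmax_rows_sketch` is hypothetical. Subtracting the row maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax_rows_sketch(attn_scores):
    """Hypothetical helper: softmax over the last axis (the key axis)."""
    # Shift each row so its largest entry is 0; this avoids overflow in exp.
    shifted = attn_scores - attn_scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    # Normalize each row so its entries sum to 1.
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

scores = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 2.0, 1.0],
    [1000.0, 1000.0, 1000.0],  # large values would overflow without the shift
])
weights = softmax_rows_sketch(scores)
print(weights.sum(axis=-1))  # each row sums to 1 (up to floating-point error)
```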

## 🔗 Weighted Sum: Computing the Output

The output of self-attention is a weighted sum of the value vectors, using the attention weights.

**LLM/Transformer Context:**
- This step produces the new representation for each token, incorporating information from the entire sequence.

### Task:
- Scaffold a function to compute the weighted sum of value vectors using the attention weights.
- Add a docstring explaining its role.

In [None]:
def compute_attention_output(attn_weights, V):
    """
    Compute the output of self-attention as a weighted sum of value vectors.
    Args:
        attn_weights (np.ndarray): Attention weights (seq_len x seq_len)
        V (np.ndarray): Value matrix (seq_len x d_v)
    Returns:
        np.ndarray: Self-attention output (seq_len x d_v)
    """
    # TODO: Compute weighted sum of value vectors
    pass
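
The weighted sum reduces to a single matrix product. The sketch below uses a hypothetical name, `attention_output_sketch`, and hand-picked weights so the result is easy to check by eye.

```python
import numpy as np

def attention_output_sketch(attn_weights, V):
    """Hypothetical helper: each output row is a weighted average of V's rows."""
    # (seq_len, seq_len) @ (seq_len, d_v) -> (seq_len, d_v)
    return attn_weights @ V

# Hand-picked weights (each row sums to 1) for 3 tokens.
weights = np.array([
    [1.0, 0.0, 0.0],  # token 0 attends only to itself
    [0.5, 0.5, 0.0],  # token 1 averages tokens 0 and 1
    [0.0, 0.0, 1.0],  # token 2 attends only to itself
])
V = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
out = attention_output_sketch(weights, V)
print(out)
```

Row 1 of the output is the midpoint of the first two value vectors, which shows how the weights mix information across tokens.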

## 🧮 Full Self-Attention Layer

Combine the steps: project input embeddings to Q, K, V, compute attention scores, normalize, and compute the output.

**LLM/Transformer Context:**
- This is the core computation in every transformer block in LLMs.

### Task:
- Scaffold a function for the full self-attention layer (single head).
- Add a docstring explaining the workflow.

In [None]:
def self_attention_layer(X, W_q, W_k, W_v):
    """
    Full self-attention layer (single head): project to Q, K, V, compute attention, and output.
    Args:
        X (np.ndarray): Input embeddings (seq_len x d_model)
        W_q, W_k (np.ndarray): Query and key projection matrices (d_model x d_k)
        W_v (np.ndarray): Value projection matrix (d_model x d_v)
    Returns:
        np.ndarray: Self-attention output (seq_len x d_v)
    """
    # TODO: Implement the full self-attention layer
    pass
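
Putting the steps together, a single-head layer might look like the sketch below. It is self-contained rather than built on the scaffolds above, and the name `self_attention_sketch` is hypothetical. Two assumptions are worth flagging: the scores are scaled by the square root of `d_k` (standard in transformers, but not required by the scaffold), and the function also returns the attention weights for inspection.

```python
import numpy as np

def self_attention_sketch(X, W_q, W_k, W_v):
    """Hypothetical single-head self-attention: returns (output, weights)."""
    # 1. Project the input embeddings to queries, keys, and values.
    Q = X @ W_q                      # (seq_len, d_k)
    K = X @ W_k                      # (seq_len, d_k)
    V = X @ W_v                      # (seq_len, d_v)
    # 2. Dot-product scores; the 1/sqrt(d_k) scaling is an assumption.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # 3. Row-wise softmax, shifted by the row max for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # 4. Weighted sum of the value vectors.
    return weights @ V, weights

# Random toy inputs; the sizes are arbitrary illustration values.
rng = np.random.default_rng(0)
seq_len, d_model, d_k, d_v = 4, 8, 3, 5
X = rng.normal(size=(seq_len, d_model))
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_v))

output, weights = self_attention_sketch(X, W_q, W_k, W_v)
print(output.shape)   # (4, 5): one d_v-dimensional vector per token
print(weights.shape)  # (4, 4): one weight per (query, key) pair
```

Using a different `d_v` from `d_k` here makes it visible that the output width is set by the value projection, not by the query and key projections.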

## 🧠 Final Summary: Self-Attention and LLMs

- Self-attention is the key innovation that enables transformers and LLMs to model complex, long-range dependencies in language.
- Every token can attend to every other token, allowing for rich contextual representations.
- Mastering self-attention is essential for understanding and building transformer-based LLMs.

In the next notebook, you'll extend this to multi-head attention, a core feature of modern transformers!