# Notebook 03: Attention Mechanism (CPU Implementation)

## Understanding the Heart of Transformers

Welcome! Attention is the core mechanism of the Transformer architecture behind most modern NLP models. In this notebook, you'll learn:

1. **What is Attention?** - The intuition behind the mechanism
2. **Query, Key, Value** - The three fundamental components
3. **Scaled Dot-Product Attention** - The mathematical formula
4. **Attention Visualization** - See what the model learns

We'll start with CPU implementations to understand the concepts clearly before optimizing with GPU in the next notebook.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Tuple, Optional

# Set random seed for reproducibility
np.random.seed(42)

# Plotting setup
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## Part 1: The Intuition Behind Attention

### The Problem: Context Matters

Consider the sentence: **"The animal didn't cross the street because it was too tired."**

What does "it" refer to?
- **The animal** (correct!)
- The street (wrong)

**Attention** lets the model focus on relevant words when processing each word.

### The Solution: Weighted Representation

Instead of treating all words equally, attention computes:
- A **relevance score** between the current word and every word in the sequence
- A **weighted sum** of all word representations, using those scores as weights

### Analogy: Information Retrieval

Think of attention like a search engine:
1. **Query:** "What are you looking for?" (current word)
2. **Keys:** "What does each document contain?" (all words)
3. **Values:** "The actual content" (word representations)
4. **Attention:** Match query to keys, retrieve weighted values
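
The search-engine analogy can be made concrete with a tiny "soft lookup". The sketch below is illustrative only: the names `query`, `keys`, `values` and the hand-picked numbers are assumptions made for this example, not data used elsewhere in the notebook.

```python
import numpy as np

# Hypothetical toy data: one query and three "documents".
query = np.array([1.0, 0.0])                 # what we are looking for
keys = np.array([[1.0, 0.0],                 # document 0: matches the query
                 [0.0, 1.0],                 # document 1: orthogonal to the query
                 [-1.0, 0.0]])               # document 2: opposite of the query
values = np.array([[10.0], [20.0], [30.0]])  # content stored in each document

# Match the query against every key (dot-product similarity).
scores = keys @ query                        # shape (3,)

# Turn scores into positive weights that sum to 1 (softmax).
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Retrieve a blend of the values, weighted by relevance.
retrieved = weights @ values                 # shape (1,)

print("scores:   ", scores)
print("weights:  ", np.round(weights, 3))
print("retrieved:", np.round(retrieved, 3))
```

Unlike a hard dictionary lookup, every document contributes to the result; the best-matching key simply contributes the most.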

## Part 2: Scaled Dot-Product Attention

### Mathematical Definition

Given:
- $Q$ (Query): What we're looking for - shape $(n_{queries}, d_k)$
- $K$ (Keys): What we're comparing against - shape $(n_{keys}, d_k)$
- $V$ (Values): What we retrieve - shape $(n_{keys}, d_v)$

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

### Step by Step:

1. **Compute Similarity:** $S = QK^T$ (how similar is each query to each key?)
2. **Scale:** $S_{scaled} = S / \sqrt{d_k}$ (keep scores in a range where softmax does not saturate)
3. **Normalize:** $A = \text{softmax}(S_{scaled})$ (convert to probabilities)
4. **Aggregate:** $\text{Output} = AV$ (weighted sum of values)
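
To make the four steps tangible, here is a small sketch with values simple enough to check by hand. The matrices are made-up toy inputs chosen for this illustration; the variable names follow the symbols used in the list above.

```python
import numpy as np

# Toy inputs (assumed values): 2 queries, 3 keys, d_k = 2, d_v = 2.
Q = np.array([[1.0, 0.0],
              [0.0, 1.0]])
K = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
d_k = Q.shape[-1]

# 1. Compute similarity: one row per query, one column per key.
S = Q @ K.T                                   # shape (2, 3)

# 2. Scale by sqrt(d_k).
S_scaled = S / np.sqrt(d_k)

# 3. Normalize each row with a numerically stable softmax.
A = np.exp(S_scaled - S_scaled.max(axis=-1, keepdims=True))
A = A / A.sum(axis=-1, keepdims=True)         # each row sums to 1

# 4. Aggregate: weighted sum of the value rows.
output = A @ V                                # shape (2, 2)

print("S =\n", S)
print("A =\n", np.round(A, 3))
print("Output =\n", np.round(output, 3))
```

Query 0 is equally similar to keys 0 and 2, so those two keys receive equal weight in the first row of `A`.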

In [None]:
def scaled_dot_product_attention(Q: np.ndarray, 
                                  K: np.ndarray, 
                                  V: np.ndarray,
                                  mask: Optional[np.ndarray] = None) -> Tuple[np.ndarray, np.ndarray]:
    """
    Scaled Dot-Product Attention.
    
    Args:
        Q: Queries of shape (n_queries, d_k)
        K: Keys of shape (n_keys, d_k)
        V: Values of shape (n_keys, d_v)
        mask: Optional boolean mask of shape (n_queries, n_keys);
              True = may attend, False = blocked
    
    Returns:
        output: Attention output of shape (n_queries, d_v)
        attention_weights: Attention weights of shape (n_queries, n_keys)
    """
    d_k = K.shape[-1]
    
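    # Steps 1-2: Compute similarity scores and scale by sqrt(d_k)
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)
    
    # Apply mask if provided: blocked positions (mask == False) get a large
    # negative score (-1e9), so softmax assigns them effectively zero weight
    if mask is not None:
        scores = np.where(mask, scores, -1e9)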
    
    # Step 3: Apply softmax to get attention weights
    attention_weights = softmax(scores, axis=-1)
    
    # Step 4: Weighted sum of values
    output = np.matmul(attention_weights, V)
    
    return output, attention_weights

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    x_max = np.max(x, axis=axis, keepdims=True)
    exp_x = np.exp(x - x_max)
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

print("✅ Attention function defined!")

### Simple Example: Self-Attention on 3 Words

In [None]:
# Create simple example with 3 words, embedding dimension = 4
seq_len = 3
d_model = 4

# Input embeddings (random for demonstration)
X = np.random.randn(seq_len, d_model).astype(np.float32)

print("Input embeddings (3 words, 4 dimensions each):")
print(X)

# For self-attention: Q = K = V = X
output, attention_weights = scaled_dot_product_attention(X, X, X)

print("\nAttention Weights (how much each word attends to others):")
print(attention_weights)
print(f"\nEach row sums to 1: {attention_weights.sum(axis=-1)}")

print("\nOutput (context-aware representations):")
print(output)

# Visualize attention
plt.figure(figsize=(8, 6))
sns.heatmap(attention_weights, annot=True, fmt='.3f', cmap='YlOrRd',
            xticklabels=[f'Word {i+1}' for i in range(seq_len)],
            yticklabels=[f'Word {i+1}' for i in range(seq_len)],
            cbar_kws={'label': 'Attention Weight'})
plt.title('Self-Attention Weights Matrix', fontsize=14, fontweight='bold')
plt.xlabel('Keys (attending to)', fontsize=12)
plt.ylabel('Queries (attending from)', fontsize=12)
plt.show()

## Part 3: Understanding Each Component

### What do Q, K, V actually do?

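In practice, $Q$, $K$, and $V$ are **learned linear projections** of the input $X$: $Q = XW_Q$, $K = XW_K$, $V = XW_V$. The cell below uses random matrices in place of learned ones: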

In [None]:
def create_qkv_projections(d_model: int, d_k: int, d_v: int) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Create random projection matrices (in real transformers, these are learned)."""
    W_Q = np.random.randn(d_model, d_k).astype(np.float32) * 0.1
    W_K = np.random.randn(d_model, d_k).astype(np.float32) * 0.1
    W_V = np.random.randn(d_model, d_v).astype(np.float32) * 0.1
    return W_Q, W_K, W_V

# Create projection matrices
d_model = 8  # Input embedding dimension
d_k = 4      # Query/Key dimension
d_v = 4      # Value dimension
seq_len = 5  # Sequence length

W_Q, W_K, W_V = create_qkv_projections(d_model, d_k, d_v)

# Create input sequence
X = np.random.randn(seq_len, d_model).astype(np.float32)

# Project to Q, K, V
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

print(f"Input shape: {X.shape}")
print(f"Query shape: {Q.shape}")
print(f"Key shape: {K.shape}")
print(f"Value shape: {V.shape}")

# Apply attention
output, attn_weights = scaled_dot_product_attention(Q, K, V)

print(f"\nOutput shape: {output.shape}")
print(f"Attention weights shape: {attn_weights.shape}")

## Part 4: Why Scale by √d_k?

### The Problem: Dot Products Can Be Large

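When $d_k$ is large, dot products have large variance: if the components of a query $q$ and a key $k$ are independent with zero mean and unit variance, then $q \cdot k$ has variance $d_k$. Large scores push softmax into saturation, where one weight is close to 1 and gradients become very small. Dividing by $\sqrt{d_k}$ brings the variance back to about 1.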
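
The variance argument can be checked empirically. The sketch below assumes query and key components are independent standard normal values, which is an idealization rather than a property of trained models; `n_samples` and the local generator `rng` are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator, independent of the notebook's global seed
n_samples = 10_000

variances = {}
for d_k in (16, 64, 256):
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    dots = np.sum(q * k, axis=1)               # one dot product per (q, k) pair
    variances[d_k] = (dots.var(), (dots / np.sqrt(d_k)).var())
    print(f"d_k={d_k:>3}: var(q.k) = {variances[d_k][0]:7.2f}, "
          f"var(q.k / sqrt(d_k)) = {variances[d_k][1]:.2f}")
```

Under these assumptions the unscaled variance should come out close to $d_k$, while the scaled variance should stay close to 1 regardless of $d_k$.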

In [None]:
def compare_scaling(d_k_values: list) -> None:
    """Demonstrate the effect of scaling on attention."""
    fig, axes = plt.subplots(1, len(d_k_values), figsize=(15, 4))
    
    for idx, d_k in enumerate(d_k_values):
        # Create random Q and K (V is not needed to inspect the weights)
        Q = np.random.randn(5, d_k).astype(np.float32)
        K = np.random.randn(5, d_k).astype(np.float32)
        
        # Without scaling
        scores_no_scale = Q @ K.T
        attn_no_scale = softmax(scores_no_scale, axis=-1)
        
        # With scaling
        scores_scaled = (Q @ K.T) / np.sqrt(d_k)
        attn_scaled = softmax(scores_scaled, axis=-1)
        
        # Plot
        ax = axes[idx]
        width = 0.35
        x = np.arange(2)
        
        # Measure how "peaked" the distributions are: mean entropy per query row
        entropy_no_scale = -np.sum(attn_no_scale * np.log(attn_no_scale + 1e-9), axis=-1).mean()
        entropy_scaled = -np.sum(attn_scaled * np.log(attn_scaled + 1e-9), axis=-1).mean()
        
        ax.bar(x, [entropy_no_scale, entropy_scaled], width)
        ax.set_ylabel('Entropy (higher = less peaked)', fontsize=10)
        ax.set_title(f'd_k = {d_k}', fontsize=12, fontweight='bold')
        ax.set_xticks(x)
        ax.set_xticklabels(['No Scale', 'With Scale'])
        ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.suptitle('Effect of Scaling on Attention Entropy', fontsize=14, fontweight='bold', y=1.02)
    plt.show()
    
    print("💡 Key Insight: Scaling prevents attention from becoming too peaked")
    print("   Higher entropy = more distributed attention = better gradient flow")

compare_scaling([16, 64, 256])

## Part 5: Masked Attention (Causal Attention)

### The Need for Masking

In language modeling, we predict the **next** word. The model shouldn't see future words!

**Causal Mask:** Prevent position $i$ from attending to positions $j > i$
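
One consequence of this rule is that changing a later token cannot affect the outputs at earlier positions. The sketch below checks that property; `causal_self_attention` is a hypothetical helper written for this example (with $Q = K = V = X$ and no learned projections), and it uses `-np.inf` for blocked scores as an illustrative choice.

```python
import numpy as np

def causal_self_attention(X):
    """Self-attention with Q = K = V = X and a lower-triangular mask (illustrative helper)."""
    n, d = X.shape
    scores = (X @ X.T) / np.sqrt(d)
    allowed = np.tril(np.ones((n, n), dtype=bool))   # position i may see j <= i
    scores = np.where(allowed, scores, -np.inf)      # blocked positions get zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ X, weights

rng = np.random.default_rng(0)
X_orig = rng.standard_normal((5, 4))
out_a, w_a = causal_self_attention(X_orig)

# Perturb only the last token and recompute.
X_changed = X_orig.copy()
X_changed[-1] += 10.0
out_b, _ = causal_self_attention(X_changed)

print("Earlier outputs unchanged:", np.allclose(out_a[:-1], out_b[:-1]))
print("Last output changed:      ", not np.allclose(out_a[-1], out_b[-1]))
```

Every row keeps at least its diagonal entry, so the row maximum is finite and the softmax stays well defined even with `-np.inf` entries.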

In [None]:
def create_causal_mask(seq_len: int) -> np.ndarray:
    """Create lower-triangular causal mask."""
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    return mask

# Example with sequence length 6
seq_len = 6
mask = create_causal_mask(seq_len)

print("Causal Mask (1 = allowed, 0 = blocked):")
print(mask.astype(int))

# Visualize
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Show mask
# Cast the boolean mask to int so the 'd' annotation format applies cleanly
sns.heatmap(mask.astype(int), annot=True, fmt='d', cmap='RdYlGn',
            xticklabels=[f't{i}' for i in range(seq_len)],
            yticklabels=[f't{i}' for i in range(seq_len)],
            cbar=False, ax=ax1)
ax1.set_title('Causal Mask', fontsize=12, fontweight='bold')
ax1.set_xlabel('Keys (past → future)', fontsize=11)
ax1.set_ylabel('Queries', fontsize=11)

# Apply attention with mask
X = np.random.randn(seq_len, 8).astype(np.float32)
output, attn_weights = scaled_dot_product_attention(X, X, X, mask=mask)

sns.heatmap(attn_weights, annot=True, fmt='.2f', cmap='YlOrRd',
            xticklabels=[f't{i}' for i in range(seq_len)],
            yticklabels=[f't{i}' for i in range(seq_len)],
            ax=ax2)
ax2.set_title('Masked Attention Weights', fontsize=12, fontweight='bold')
ax2.set_xlabel('Keys (past)', fontsize=11)
ax2.set_ylabel('Queries', fontsize=11)

plt.tight_layout()
plt.show()

print("\n💡 Note: Each position only attends to itself and previous positions")

## Part 6: Real Example - Sentence Attention

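Let's apply attention to an actual sentence! Because the embeddings and projections here are random rather than learned, the resulting pattern illustrates the mechanics, not real linguistic structure.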

In [None]:
# Simulate word embeddings (in reality, these come from a learned embedding layer)
sentence = ["The", "cat", "sat", "on", "the", "mat"]
d_model = 16

# Random embeddings, one row per token (pretend these are learned)
embeddings = np.random.randn(len(sentence), d_model).astype(np.float32)

# Create Q, K, V projections
W_Q, W_K, W_V = create_qkv_projections(d_model, d_k=8, d_v=8)

Q = embeddings @ W_Q
K = embeddings @ W_K
V = embeddings @ W_V

# Compute attention
output, attn_weights = scaled_dot_product_attention(Q, K, V)

# Visualize attention pattern
plt.figure(figsize=(10, 8))
sns.heatmap(attn_weights, annot=True, fmt='.3f', cmap='viridis',
            xticklabels=sentence, yticklabels=sentence,
            cbar_kws={'label': 'Attention Weight'})
plt.title('Self-Attention: "The cat sat on the mat"', fontsize=14, fontweight='bold')
plt.xlabel('Keys (attending to)', fontsize=12)
plt.ylabel('Queries (attending from)', fontsize=12)
plt.show()

print("\nInterpretation:")
for i, word in enumerate(sentence):
    top_attn = np.argsort(attn_weights[i])[-3:][::-1]
    print(f"'{word}' pays most attention to: {[sentence[j] for j in top_attn]}")

## Part 7: Batched Attention

In practice, we process multiple sequences simultaneously:

In [None]:
def batched_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """
    Batched scaled dot-product attention.
    
    Args:
        Q: (batch, n_queries, d_k)
        K: (batch, n_keys, d_k)
        V: (batch, n_keys, d_v)
    
    Returns:
        output: (batch, n_queries, d_v)
        attention_weights: (batch, n_queries, n_keys)
    """
    d_k = Q.shape[-1]
    
    # Batched matrix multiplication: (batch, n_q, d_k) @ (batch, d_k, n_k)
    scores = np.matmul(Q, K.transpose(0, 2, 1)) / np.sqrt(d_k)
    
    # Softmax over last dimension
    attention_weights = softmax(scores, axis=-1)
    
    # Weighted sum: (batch, n_q, n_k) @ (batch, n_k, d_v)
    output = np.matmul(attention_weights, V)
    
    return output, attention_weights

# Test batched attention
batch_size = 8
seq_len = 10
d_model = 16

Q = np.random.randn(batch_size, seq_len, d_model).astype(np.float32)
K = np.random.randn(batch_size, seq_len, d_model).astype(np.float32)
V = np.random.randn(batch_size, seq_len, d_model).astype(np.float32)

output, attn = batched_attention(Q, K, V)

print(f"Batch size: {batch_size}")
print(f"Sequence length: {seq_len}")
print(f"\nInput Q shape: {Q.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention shape: {attn.shape}")
print(f"\nAttention weights sum per query: {attn[0, 0].sum():.6f} (should be 1.0)")
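
As a sanity check on batching, the batched computation should match running attention on each sequence separately. The sketch below defines its own small helpers (`attention_2d` and `attention_3d`, hypothetical names for this example) so that it runs on its own; the shapes are arbitrary choices, with `n_keys` and `d_v` deliberately different from `n_queries` and `d_k`.

```python
import numpy as np

def _softmax_last(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_2d(Q, K, V):
    """Attention for one sequence: Q (n_q, d_k), K (n_k, d_k), V (n_k, d_v)."""
    return _softmax_last((Q @ K.T) / np.sqrt(Q.shape[-1])) @ V

def attention_3d(Q, K, V):
    """Attention for a batch: the leading axis indexes sequences."""
    scores = (Q @ np.swapaxes(K, -1, -2)) / np.sqrt(Q.shape[-1])
    return _softmax_last(scores) @ V

rng = np.random.default_rng(0)
Qb = rng.standard_normal((4, 6, 8))   # (batch, n_queries, d_k)
Kb = rng.standard_normal((4, 7, 8))   # (batch, n_keys, d_k)
Vb = rng.standard_normal((4, 7, 5))   # (batch, n_keys, d_v)

batched = attention_3d(Qb, Kb, Vb)
looped = np.stack([attention_2d(Qb[i], Kb[i], Vb[i]) for i in range(Qb.shape[0])])

print("Batched shape:", batched.shape)
print("Matches per-sequence loop:", np.allclose(batched, looped))
```

`np.matmul` (the `@` operator) treats leading axes as batch dimensions, which is why the batched version needs no explicit loop.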

## Exercise Section

### Exercise 1: Cross-Attention
Implement cross-attention where queries come from one sequence and keys/values from another:
- Useful in encoder-decoder architectures
- Decoder attends to encoder outputs

In [None]:
# TODO: Implement cross-attention
def cross_attention(Q_decoder, K_encoder, V_encoder):
    """
    Implement cross-attention between decoder and encoder.
    Q comes from decoder, K and V from encoder.
    """
    pass

### Exercise 2: Attention Pattern Analysis
Create different attention patterns and analyze their properties:
1. Uniform attention (all weights equal)
2. Peaked attention (one dominant weight)
3. Local attention (nearby positions)

Compute entropy for each and visualize.

In [None]:
# TODO: Analyze different attention patterns

### Exercise 3: Attention Dropout
Implement attention with dropout:
- Randomly zero out some attention weights
- Helps prevent overfitting
- Compare with and without dropout

In [None]:
# TODO: Implement attention with dropout

## Summary

### Key Takeaways

✅ **Attention Mechanism:**
- Computes weighted combinations based on relevance
- Query-Key similarity determines weights
- Values are aggregated using these weights

✅ **Components:**
- **Q (Query):** What are we looking for?
- **K (Keys):** What do we have?
- **V (Values):** What information to retrieve?

✅ **Important Details:**
- Scaling by √d_k prevents saturation
- Masking enables causal (autoregressive) attention
- Softmax ensures weights sum to 1

✅ **Applications:**
- Self-attention: relate positions within sequence
- Cross-attention: relate two different sequences
- Causal attention: for language modeling

### Next Steps

In **Notebook 04**, we'll:
- Implement attention on GPU with PyTorch
- Benchmark CPU vs GPU performance
- Optimize for large-scale processing
- Handle batching efficiently

## Further reading (Archive.org)

For conceptual background on attention mechanisms and sequence models, try Archive.org searches such as:

- "neural networks attention"
- "sequence to sequence models"
- "deep learning attention tutorial"

Combine these with modern Transformer introductions (for example, general surveys of Transformer models) to reinforce your understanding of queries, keys, values, and the scaled dot-product attention formula implemented in this notebook.