# 📘 Day 1: Attention Mechanisms

**🎯 Goal:** Master attention mechanisms - the foundation of modern AI (ChatGPT, Claude, GPT-4)

**⏱️ Time:** 90-120 minutes

**🌟 Why This Matters for AI:**
- Attention is at the core of essentially all modern large language models (GPT-4, Claude, Gemini, ChatGPT)
- Foundation of Transformers - the architecture behind LLMs
- Enables models to "focus" on relevant information (loosely like humans do)
- Critical for RAG systems, multi-modal AI, and Agentic AI
- Used in systems such as Google Translate and GitHub Copilot, and in many text-to-image models
- Understanding attention is the first step toward understanding how ChatGPT processes text

---

## 🤔 What is Attention?

**Attention = Focusing on what's important**

**Human Analogy:**
When you read this sentence, you don't give equal attention to every word. You focus on KEY words that carry meaning.

**Example:**
- Sentence: "The **cat** sat on the **mat**"
- Your brain focuses on: "cat" and "mat" (nouns carrying the main meaning)
- Less attention to: "the", "sat", "on" (structural words)

**In AI:**
- **Problem:** A plain RNN encoder-decoder squeezes the whole input into one fixed-size vector → important context gets lost
- **Solution:** Attention learns to focus on relevant words
- **Result:** Better understanding, better predictions!

### 🎯 Real-World Applications (2024-2025)

**Where attention is used:**
1. **ChatGPT/Claude:** Attention determines which previous words matter for next prediction
2. **Machine Translation:** Aligns source and target words ("cat" → "gato")
3. **RAG Systems:** Attention helps retrieve and focus on relevant document chunks
4. **Question Answering:** Focuses on the part of context that answers the question
5. **Multimodal AI:** Attention between image regions and text descriptions

**The Revolution:**
- **2017:** "Attention Is All You Need" paper introduced Transformers
- **2018:** BERT revolutionized NLP
- **2020:** GPT-3 showed massive scaling power
- **2022:** ChatGPT launched, built on the GPT-3.5 series (successors to the 175B-parameter GPT-3), all using attention!
- **2024-2025:** GPT-4, Claude, Gemini - all built on attention mechanisms

Let's build attention from scratch! 👇

In [None]:
# Import essential libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.nn as nn
import torch.nn.functional as F
from IPython.display import Image, display

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

# Make plots beautiful
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✅ Libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print("Let's build attention mechanisms from scratch! 🚀")

## ❌ The Problem Without Attention

**Scenario:** Machine Translation

**Old Approach (RNN/LSTM without attention):**
```
English: "The cat sat on the mat"
         ↓ ↓ ↓ ↓ ↓ ↓ (RNN processes sequentially)
         🧠 (Single fixed-size vector - bottleneck!)
         ↓
Spanish: "El gato se sentó en la alfombra"
```

**Problems:**
1. **Information Bottleneck:** Entire sentence compressed into one vector
2. **Long-Range Dependencies:** Forgets early words in long sentences
3. **No Selective Focus:** The decoder cannot look back at individual source words, so "cat" gets no more emphasis than "the" or "on"

**With Attention:**
```
English: "The cat sat on the mat"
         ↓   ↓   ↓   ↓   ↓   ↓
         [All word representations preserved]
         ↓
When translating "gato": Focus 90% attention on "cat", 10% on others
When translating "alfombra": Focus 90% attention on "mat", 10% on others
```

**Benefits:**
- ✅ No information bottleneck
- ✅ Can attend to any word (no distance limit)
- ✅ Learns what to focus on automatically
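To make the contrast concrete, here is a minimal NumPy sketch. The encoder states are random stand-ins (not the output of a real RNN), and the 90%/10% split is the illustrative number from the diagram above, not a learned value.

```python
import numpy as np

rng = np.random.default_rng(0)
src_words = ["The", "cat", "sat", "on", "the", "mat"]

# Hypothetical encoder states: one 8-dimensional vector per source word
encoder_states = rng.normal(size=(len(src_words), 8))

# Without attention: the decoder only sees one fixed-size summary
# (the last state stands in for an RNN's final hidden state)
bottleneck = encoder_states[-1]

# With attention: when producing "gato", put 90% of the weight on "cat"
# and spread the remaining 10% evenly over the other five words
weights_gato = np.full(len(src_words), 0.10 / (len(src_words) - 1))
weights_gato[src_words.index("cat")] = 0.90

# The context vector is a weighted average of ALL encoder states
context_gato = weights_gato @ encoder_states

print("weights sum to:", weights_gato.sum())
print("bottleneck shape:", bottleneck.shape, "| context shape:", context_gato.shape)
```

Both vectors have the same size, but the context vector is rebuilt for every output word, so each target word can draw on a different part of the source sentence.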

Let's see this in action!

## 🧮 Attention Mathematics (Simplified)

**Core Idea:** Calculate "similarity" between words, then focus on similar ones

### Step-by-Step Process:

**1. Query, Key, Value (QKV)**
- **Query (Q):** "What am I looking for?"
- **Key (K):** "What do I have to offer?"
- **Value (V):** "What information do I carry?"

**Analogy:** YouTube Search
- **Query:** Your search term ("how to cook pasta")
- **Keys:** Video titles/tags
- **Values:** Actual video content
- **Attention:** Match query to keys → retrieve values
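The search analogy can be sketched as a "soft" lookup. Everything here is made up for illustration: the 2-D tag vectors, the three videos, and the scalar "content" values are hypothetical stand-ins.

```python
import numpy as np

# Hypothetical 2-D "tag" vectors for three videos (keys) and a scalar
# stand-in for each video's content (values)
video_keys = np.array([
    [1.0, 0.0],   # pasta recipe
    [0.9, 0.2],   # pizza recipe
    [0.0, 1.0],   # guitar lesson
])
video_values = np.array([10.0, 20.0, 30.0])

search_query = np.array([1.0, 0.1])  # "how to cook pasta"

# Hard lookup: return only the single best-matching video
match_scores = video_keys @ search_query
hard_result = video_values[np.argmax(match_scores)]

# Soft lookup (attention): blend all videos, weighted by how well they match
soft_weights = np.exp(match_scores) / np.exp(match_scores).sum()
soft_result = soft_weights @ video_values

print("match scores:", match_scores)
print("hard lookup:", hard_result)
print("soft weights:", soft_weights.round(3), "-> soft lookup:", round(float(soft_result), 2))
```

A search engine returns the best match; attention returns a blend in which better matches contribute more. That blending is what makes the operation differentiable and therefore trainable.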

**2. Attention Formula:**

```
Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V
```

**Breaking it down:**
1. **Q · K^T:** Calculate similarity between query and all keys (dot product)
2. **/ √d_k:** Scale down (prevents large values that make softmax too sharp)
3. **softmax(...):** Convert to probabilities (attention weights sum to 1)
4. **· V:** Weighted sum of values (focus on relevant info)
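Step 2 is easier to appreciate with numbers. The sketch below uses random vectors and an assumed key dimension of 512 (chosen only to make the effect visible) to compare softmax on raw versus scaled scores.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_k = 512      # assumed key dimension
n_keys = 6

q = rng.normal(size=d_k)
K = rng.normal(size=(n_keys, d_k))

raw_scores = K @ q                          # spread grows roughly like sqrt(d_k)
scaled_scores = raw_scores / np.sqrt(d_k)   # spread stays around 1

print("std of raw scores:   ", raw_scores.std().round(2))
print("std of scaled scores:", scaled_scores.std().round(2))
print("softmax(raw):   ", softmax(raw_scores).round(3))
print("softmax(scaled):", softmax(scaled_scores).round(3))
```

With unscaled scores, softmax tends to put nearly all its weight on a single key. Dividing by √d_k keeps the distribution softer, which in turn keeps gradients usable during training.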

**Example:**
```
Sentence: "The cat sat"
Question: What did the cat do?

Query = "cat"
Keys = ["The", "cat", "sat"]
Similarities = [0.1, 0.8, 0.7]  (cat is similar to itself and "sat")
Attention weights after softmax ≈ [0.21, 0.42, 0.38]
Output = 0.21*V[The] + 0.42*V[cat] + 0.38*V[sat]
       = Focus mostly on "cat" and "sat"
```

Let's implement this!

In [None]:
# Simple Attention Mechanism from Scratch

def simple_attention(Q, K, V):
    """
    Compute attention scores and output
    
    Args:
        Q: Query matrix (1, d_k)
        K: Key matrix (n, d_k) where n = number of words
        V: Value matrix (n, d_v)
    
    Returns:
        output: Attention-weighted values (1, d_v)
        attention_weights: Attention scores (1, n)
    """
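    # Step 1: Calculate similarity scores (Q · K^T), scaled by sqrt(d_k)
    d_k = K.shape[-1]
    scores = np.dot(Q, K.T) / np.sqrt(d_k)  # Scaled dot product
    
    # Step 2: Apply softmax to get attention weights
    # (subtracting the row max first keeps exp() numerically stable)
    exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    attention_weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)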
    
    # Step 3: Weighted sum of values
    output = np.dot(attention_weights, V)
    
    return output, attention_weights

# Example: Simple sentence
# Sentence: "The cat sat on the mat"
# Let's use 4-dimensional embeddings for simplicity

# Word embeddings (simplified - in reality, these are learned)
words = ["The", "cat", "sat", "on", "the", "mat"]
embeddings = np.array([
    [0.1, 0.2, 0.1, 0.0],  # The
    [0.8, 0.3, 0.9, 0.7],  # cat (noun - rich features)
    [0.3, 0.9, 0.4, 0.6],  # sat (verb - different features)
    [0.1, 0.1, 0.2, 0.1],  # on (preposition)
    [0.1, 0.2, 0.1, 0.0],  # the
    [0.7, 0.4, 0.8, 0.6],  # mat (noun - similar to cat)
])

# For simplicity: Q = K = V = embeddings
# In practice, these are learned linear transformations

# Question: What should we focus on when understanding "cat"?
query = embeddings[1:2]  # "cat" as query (shape: 1, 4)
keys = embeddings        # All words as keys (shape: 6, 4)
values = embeddings      # All words as values (shape: 6, 4)

# Compute attention
output, attention_weights = simple_attention(query, keys, values)

print("🎯 Attention Analysis: Focus on 'cat'\n" + "="*50)
print("\nAttention Weights (where does 'cat' look?):")
for word, weight in zip(words, attention_weights[0]):
    bar = "█" * int(weight * 50)
    print(f"  {word:6s}: {weight:.3f} {bar}")

print("\n💡 Interpretation:")
max_attention_idx = np.argmax(attention_weights[0])
print(f"  'cat' pays most attention to: '{words[max_attention_idx]}'")
print("  With Q = K, a word with a large embedding tends to attend strongly to itself")
print("  Secondary attention goes to 'mat' (similar noun) and 'sat' (verb describing cat)")

In [None]:
# Visualize Attention Heatmap

def visualize_attention(words, attention_matrix, title="Attention Heatmap"):
    """
    Create attention heatmap visualization
    """
    plt.figure(figsize=(10, 8))
    sns.heatmap(attention_matrix, 
                xticklabels=words, 
                yticklabels=words,
                annot=True, 
                fmt='.2f',
                cmap='YlOrRd',
                cbar_kws={'label': 'Attention Weight'},
                linewidths=0.5)
    plt.title(title, fontsize=14, fontweight='bold', pad=20)
    plt.xlabel('Keys (Attending TO)', fontsize=12)
    plt.ylabel('Queries (Attending FROM)', fontsize=12)
    plt.tight_layout()
    plt.show()

# Compute attention for all words
attention_matrix = np.zeros((len(words), len(words)))

for i in range(len(words)):
    query = embeddings[i:i+1]
    _, weights = simple_attention(query, keys, values)
    attention_matrix[i] = weights[0]

visualize_attention(words, attention_matrix, 
                   "🎯 Self-Attention: 'The cat sat on the mat'")

print("\n📊 How to Read This Heatmap:")
print("  - Each row = one word's query")
print("  - Each column = what that query attends to")
print("  - Diagonal = how much a word attends to itself (often high, but not always the row maximum)")
print("  - Dark red cells = strong attention (important relationships)")
print("  - Pale yellow cells = weak attention (less relevant)")
print("\n💡 Notice: 'cat' and 'mat' attend to each other (similar nouns!)")

## 🔄 Self-Attention Explained

**What is Self-Attention?**
- Attention mechanism where queries, keys, and values all come from the SAME sequence
- Each word attends to all other words (including itself)
- Learns relationships between words in a sentence

**Why "Self"?**
- Regular attention: Source → Target (e.g., English → Spanish)
- Self-attention: Sequence → Itself (e.g., sentence attends to sentence)
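The difference shows up directly in the shape of the attention matrix. This is a minimal sketch with random stand-in vectors; the helper `attend` and the token counts (6 English, 7 Spanish) are assumptions for illustration.

```python
import numpy as np

def attend(queries, keys, values):
    """Scaled dot-product attention for 2-D arrays (rows are tokens)."""
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

rng = np.random.default_rng(0)
english = rng.normal(size=(6, 4))  # hypothetical source states: 6 English tokens
spanish = rng.normal(size=(7, 4))  # hypothetical target states: 7 Spanish tokens

# Regular (cross) attention: queries from the target, keys/values from the source
_, cross_weights = attend(spanish, english, english)

# Self-attention: queries, keys, and values all come from the same sequence
_, self_weights = attend(english, english, english)

print("cross-attention weights:", cross_weights.shape)  # (7, 6)
print("self-attention weights: ", self_weights.shape)   # (6, 6)
```

The same function handles both cases; the only thing that changes is where the queries come from.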

### 🎯 Real-World Example: Understanding Context

**Sentence:** "The animal didn't cross the street because it was too tired."

**Question:** What does "it" refer to?

**Without attention:** 
- RNN might guess "street" (closer to "it")

**With self-attention:**
- "it" attends strongly to "animal" (semantically correct!)
- Why? Because "tired" is an animal property
- Self-attention learns these relationships!

### The Process:

1. **Input:** Word embeddings for entire sentence
2. **Transform:** Create Q, K, V using learned weight matrices
   - Q = X · W_Q
   - K = X · W_K
   - V = X · W_V
3. **Compute Attention:** Each word attends to all words
4. **Output:** Context-aware representations

**Key Insight:**
- Before attention: "bank" has same embedding in "river bank" and "money bank"
- After attention: "bank" has different representations based on context!
- Self-attention creates **contextualized embeddings**
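The "bank" example can be sketched with toy numbers. The embeddings and the projection matrices below are random stand-ins (nothing is trained), so the sketch only shows that the output for "bank" depends on its neighbor, not that the result is meaningful.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over the rows of X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
dim = 8

# Hypothetical static embeddings: one fixed vector per word
vocab = {w: rng.normal(size=dim) for w in ["river", "money", "bank"]}

# Random stand-ins for the learned matrices W_Q, W_K, W_V
W_q, W_k, W_v = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))

river_bank = np.stack([vocab["river"], vocab["bank"]])
money_bank = np.stack([vocab["money"], vocab["bank"]])

bank_in_river = self_attention(river_bank, W_q, W_k, W_v)[1]
bank_in_money = self_attention(money_bank, W_q, W_k, W_v)[1]

print("same input embedding for 'bank':", np.allclose(river_bank[1], money_bank[1]))
print("same output after attention:    ", np.allclose(bank_in_river, bank_in_money))
```

The input vector for "bank" is identical in both phrases, but the output mixes in information from the neighboring word, so the two outputs differ.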

In [None]:
# Self-Attention Layer Implementation (PyTorch)

class SelfAttention(nn.Module):
    def __init__(self, embed_dim):
        """
        Self-Attention Layer
        
        Args:
            embed_dim: Embedding dimension (e.g., 512)
        """
        super(SelfAttention, self).__init__()
        
        self.embed_dim = embed_dim
        
        # Learned transformation matrices
        self.W_q = nn.Linear(embed_dim, embed_dim, bias=False)  # Query
        self.W_k = nn.Linear(embed_dim, embed_dim, bias=False)  # Key
        self.W_v = nn.Linear(embed_dim, embed_dim, bias=False)  # Value
        
    def forward(self, x):
        """
        Args:
            x: Input embeddings (batch_size, seq_len, embed_dim)
        
        Returns:
            output: Attention output (batch_size, seq_len, embed_dim)
            attention_weights: (batch_size, seq_len, seq_len)
        """
        # Step 1: Create Q, K, V
        Q = self.W_q(x)  # (batch, seq_len, embed_dim)
        K = self.W_k(x)
        V = self.W_v(x)
        
        # Step 2: Calculate attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1))  # (batch, seq_len, seq_len)
        scores = scores / (self.embed_dim ** 0.5)  # Scale by sqrt(d_k); here d_k = embed_dim
        
        # Step 3: Apply softmax
        attention_weights = F.softmax(scores, dim=-1)
        
        # Step 4: Weighted sum of values
        output = torch.matmul(attention_weights, V)
        
        return output, attention_weights

# Test the self-attention layer
embed_dim = 64
seq_len = 6  # "The cat sat on the mat"
batch_size = 1

# Create self-attention layer
self_attn = SelfAttention(embed_dim)

# Random input embeddings (in practice, these come from embedding layer)
x = torch.randn(batch_size, seq_len, embed_dim)

# Forward pass
output, attn_weights = self_attn(x)

print("✅ Self-Attention Layer Test")
print(f"\nInput shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attn_weights.shape}")
print("\n💡 Notice: Output has same shape as input!")
print("   But now each word embedding is context-aware!")

# Visualize attention weights
# (the layer is untrained and x is random, so the word labels are illustrative only)
plt.figure(figsize=(8, 6))
sns.heatmap(attn_weights[0].detach().numpy(), 
            annot=True, fmt='.2f', cmap='Blues',
            xticklabels=words,
            yticklabels=words)
plt.title('🎯 Self-Attention Weights (Untrained, Random Init)', fontsize=14, fontweight='bold')
plt.xlabel('Key (Attending TO)')
plt.ylabel('Query (Attending FROM)')
plt.tight_layout()
plt.show()

## 🎭 Multi-Head Attention

**Problem with Single Attention:**
- A single head produces ONE set of attention weights per word, so every type of relationship has to share it
- Example: it is hard to track subject-verb AND adjective-noun links sharply at the same time

**Solution: Multi-Head Attention**
- Run multiple attention mechanisms in parallel ("heads")
- Each head can learn DIFFERENT relationships!
- Combine outputs for richer representation

### 🎯 Analogy: Multiple Perspectives

**Sentence:** "The quick brown fox jumps over the lazy dog"

**Different attention heads might learn:**
- **Head 1:** Syntax - "fox" → "jumps" (subject-verb)
- **Head 2:** Attributes - "fox" → "quick, brown" (adjective-noun)
- **Head 3:** Objects - "jumps" → "dog" (verb-object)
- **Head 4:** Spatial - "jumps" → "over" (action-preposition)

**Each head provides different insight!**

### Architecture:

```
Input Embeddings
    ↓
    ├─→ Head 1 (Q₁, K₁, V₁) → Attention Output 1
    ├─→ Head 2 (Q₂, K₂, V₂) → Attention Output 2
    ├─→ Head 3 (Q₃, K₃, V₃) → Attention Output 3
    └─→ Head 4 (Q₄, K₄, V₄) → Attention Output 4
    ↓
Concatenate all heads
    ↓
Linear transformation
    ↓
Final Output
Final Output
```
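The diagram can be traced shape by shape in NumPy. This is only a sketch: the sizes (6 tokens, 16 dimensions, 4 heads) are assumed toy values and the weight matrices are random stand-ins for learned parameters.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, embed_dim, num_heads = 6, 16, 4   # assumed toy sizes
head_dim = embed_dim // num_heads

X = rng.normal(size=(seq_len, embed_dim))  # input embeddings
# Random stand-ins for the learned projections
W_q, W_k, W_v, W_o = (
    rng.normal(scale=embed_dim ** -0.5, size=(embed_dim, embed_dim)) for _ in range(4)
)

def split_heads(M):
    # (seq_len, embed_dim) -> (num_heads, seq_len, head_dim)
    return M.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)

Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)

# Each head runs its own scaled dot-product attention
weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(head_dim))  # (heads, seq, seq)
per_head = weights @ V                                             # (heads, seq, head_dim)

# Concatenate the heads, then apply the final linear transformation
concat = per_head.transpose(1, 0, 2).reshape(seq_len, embed_dim)
output = concat @ W_o

print(weights.shape, per_head.shape, output.shape)
```

Note that the heads do not add width: each one works in a `head_dim`-sized slice, and concatenation restores the original `embed_dim`.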

### 🌟 In Modern LLMs:

**GPT-3:**
- 96 attention heads per layer!
- 96 layers total
- = 9,216 total attention heads

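**GPT-4 (unofficial estimates):**
- OpenAI has not published GPT-4's architecture
- Figures such as 128+ heads per layer and 120+ layers circulate online but are unconfirmed
- Treat any "total heads" number for GPT-4 as speculation

**Why so many?**
- Different heads can specialize in different linguistic patterns
- More heads give the model more ways to relate words (at the cost of a smaller dimension per head for a fixed embedding size)
- Together with depth, this supports complex reasoning and context understanding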

In [None]:
# Multi-Head Attention Implementation

class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        """
        Multi-Head Attention Layer
        
        Args:
            embed_dim: Total embedding dimension (must be divisible by num_heads)
            num_heads: Number of attention heads
        """
        super(MultiHeadAttention, self).__init__()
        
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads  # Dimension per head
        
        # Q, K, V projections for all heads (combined)
        self.W_q = nn.Linear(embed_dim, embed_dim)
        self.W_k = nn.Linear(embed_dim, embed_dim)
        self.W_v = nn.Linear(embed_dim, embed_dim)
        
        # Output projection
        self.W_o = nn.Linear(embed_dim, embed_dim)
        
    def split_heads(self, x):
        """Split embedding into multiple heads"""
        batch_size, seq_len, embed_dim = x.shape
        # Reshape: (batch, seq_len, num_heads, head_dim)
        x = x.view(batch_size, seq_len, self.num_heads, self.head_dim)
        # Transpose: (batch, num_heads, seq_len, head_dim)
        return x.transpose(1, 2)
    
    def combine_heads(self, x):
        """Combine multiple heads back"""
        batch_size, num_heads, seq_len, head_dim = x.shape
        # Transpose: (batch, seq_len, num_heads, head_dim)
        x = x.transpose(1, 2)
        # Reshape: (batch, seq_len, embed_dim)
        return x.contiguous().view(batch_size, seq_len, self.embed_dim)
    
    def forward(self, x):
        """
        Args:
            x: Input (batch_size, seq_len, embed_dim)
        
        Returns:
            output: (batch_size, seq_len, embed_dim)
            attention_weights: (batch_size, num_heads, seq_len, seq_len)
        """
        batch_size, seq_len, embed_dim = x.shape
        
        # Step 1: Linear projections
        Q = self.W_q(x)  # (batch, seq_len, embed_dim)
        K = self.W_k(x)
        V = self.W_v(x)
        
        # Step 2: Split into multiple heads
        Q = self.split_heads(Q)  # (batch, num_heads, seq_len, head_dim)
        K = self.split_heads(K)
        V = self.split_heads(V)
        
        # Step 3: Scaled dot-product attention (for each head)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.head_dim ** 0.5)
        attention_weights = F.softmax(scores, dim=-1)
        attention_output = torch.matmul(attention_weights, V)
        
        # Step 4: Combine heads
        attention_output = self.combine_heads(attention_output)
        
        # Step 5: Final linear projection
        output = self.W_o(attention_output)
        
        return output, attention_weights

# Test Multi-Head Attention
embed_dim = 64
num_heads = 8  # Same head count as the original Transformer base model
seq_len = 6
batch_size = 1

# Create multi-head attention
mha = MultiHeadAttention(embed_dim, num_heads)

# Random input
x = torch.randn(batch_size, seq_len, embed_dim)

# Forward pass
output, attn_weights = mha(x)

print("✅ Multi-Head Attention Test")
print(f"\nConfiguration:")
print(f"  Embedding dimension: {embed_dim}")
print(f"  Number of heads: {num_heads}")
print(f"  Dimension per head: {embed_dim // num_heads}")
print(f"\nShapes:")
print(f"  Input: {x.shape}")
print(f"  Output: {output.shape}")
print(f"  Attention weights: {attn_weights.shape}")
print(f"\n💡 Each of the {num_heads} heads can learn different patterns!")

In [None]:
# Visualize Different Attention Heads

fig, axes = plt.subplots(2, 4, figsize=(16, 8))
fig.suptitle('🎭 Multi-Head Attention: Each Head Produces Its Own Pattern (Untrained)', 
             fontsize=16, fontweight='bold')

for head_idx in range(num_heads):
    ax = axes[head_idx // 4, head_idx % 4]
    
    # Get attention weights for this head
    head_weights = attn_weights[0, head_idx].detach().numpy()
    
    # Plot heatmap
    sns.heatmap(head_weights, 
                annot=True, fmt='.2f', 
                cmap='viridis',
                xticklabels=words,
                yticklabels=words,
                ax=ax,
                cbar=False,
                square=True)
    
    ax.set_title(f'Head {head_idx + 1}', fontweight='bold')
    if head_idx // 4 == 1:
        ax.set_xlabel('Key', fontsize=10)
    if head_idx % 4 == 0:
        ax.set_ylabel('Query', fontsize=10)

plt.tight_layout()
plt.show()

print("\n📊 Observations:")
print("  - Each head shows a DIFFERENT attention pattern")
print("  - Here the differences come only from random initialization (the model is untrained)")
print("  - After training, some heads tend to focus on nearby words (local patterns)")
print("  - Other heads attend to distant words (long-range dependencies)")
print("  - In trained models, heads have been observed to specialize in:")
print("    • Syntax (grammar rules)")
print("    • Semantics (meaning relationships)")
print("    • Positional patterns (word order)")
print("    • Coreference (pronoun resolution)")
print("\n🌟 This diversity is why multi-head attention is so powerful!")

## 📍 Positional Encoding

**Problem: Attention Has No Sense of Order!**

**Without position information, these sentences produce the SAME attention scores (just in a different order):**
- "The cat chased the dog"
- "The dog chased the cat"

**Why?** Attention scores depend only on the word vectors, not on where the words sit!
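This order-blindness can be checked directly. The sketch below uses random stand-in embeddings and the simplest possible self-attention (Q = K = V, no learned weights); `self_attention_no_pos` is a hypothetical helper name.

```python
import numpy as np

def self_attention_no_pos(X):
    """Self-attention with Q = K = V = X and no positional information."""
    scores = X @ X.T / np.sqrt(X.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ X

rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=4) for w in ["the", "cat", "chased", "dog"]}

sent_a = ["the", "cat", "chased", "the", "dog"]
sent_b = ["the", "dog", "chased", "the", "cat"]
X_a = np.stack([vocab[w] for w in sent_a])
X_b = np.stack([vocab[w] for w in sent_b])

out_a = self_attention_no_pos(X_a)
out_b = self_attention_no_pos(X_b)

# "cat" gets the same output vector whether it is the chaser or the chased
print(np.allclose(out_a[sent_a.index("cat")], out_b[sent_b.index("cat")]))  # True
# "chased" cannot tell the two sentences apart either
print(np.allclose(out_a[2], out_b[2]))  # True
```

Each word sees the same bag of neighbors in both sentences, so its output is identical. Who chased whom is invisible to the model until position information is added.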

**But word order matters!**
- "Dog bites man" ≠ "Man bites dog"
- "I will not go" ≠ "Will I not go?"

### Solution: Positional Encoding

**Add position information to embeddings:**
```
Final_Embedding = Word_Embedding + Position_Encoding
```

**Two Approaches:**

**1. Learned Positional Embeddings (GPT, BERT)**
- Train position embeddings like word embeddings
- Position 0, 1, 2, ... each has learnable vector
- Pro: Flexible, learns optimal positions
- Con: Fixed maximum length

**2. Sinusoidal Positional Encoding (Original Transformer)**
- Use sine/cosine functions of different frequencies
- Mathematical formula (don't worry about details):
  ```
  PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
  PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
  ```
- Pro: Can be computed for any sequence length (no learned table to outgrow)
- Con: Not learned (fixed pattern)

### 🎯 Why It Works:

**Positional encoding creates a unique "fingerprint" for each position (values shown for d_model = 4):**
- Position 0: [0.00, 1.00, 0.00, 1.00]
- Position 1: [0.84, 0.54, 0.01, 1.00]
- Position 2: [0.91, -0.42, 0.02, 1.00]

**Now the model knows:**
- Which words come first/last
- Distance between words
- Relative positions
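The "distance between words" point can be checked numerically. In this sketch, `sinusoidal_pe` is a hypothetical helper that follows the sine/cosine formula above; with d_model = 4 it reproduces the fingerprint values listed, and it shows that the dot product between two encodings depends only on how far apart the positions are.

```python
import numpy as np

def sinusoidal_pe(num_positions, d_model):
    """Sinusoidal positional encodings, shape (num_positions, d_model)."""
    pos = np.arange(num_positions)[:, None]
    freqs = 1.0 / 10000 ** (np.arange(0, d_model, 2) / d_model)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(pos * freqs)   # even dimensions
    pe[:, 1::2] = np.cos(pos * freqs)   # odd dimensions
    return pe

pe = sinusoidal_pe(num_positions=50, d_model=4)
print("First three positions:")
print(pe[:3].round(2))

# Dot product between position p and position p + offset, for several p
offset = 3
dots = [pe[p] @ pe[p + offset] for p in (0, 10, 25, 40)]
print("Dot products at a fixed offset of 3:", np.round(dots, 4))
```

All four dot products agree because sin(a)·sin(b) + cos(a)·cos(b) = cos(a − b): the absolute position cancels and only the offset remains.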

Let's implement it!

In [None]:
# Positional Encoding Implementation

def get_positional_encoding(seq_len, d_model):
    """
    Generate sinusoidal positional encodings
    
    Args:
        seq_len: Sequence length
        d_model: Embedding dimension
    
    Returns:
        pos_encoding: (seq_len, d_model)
    """
    position = np.arange(seq_len)[:, np.newaxis]  # (seq_len, 1)
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    
    pos_encoding = np.zeros((seq_len, d_model))
    pos_encoding[:, 0::2] = np.sin(position * div_term)  # Even indices
    pos_encoding[:, 1::2] = np.cos(position * div_term)  # Odd indices
    
    return pos_encoding

# Generate positional encodings
seq_len = 50  # Up to 50 words
d_model = 128

pos_encoding = get_positional_encoding(seq_len, d_model)

print("✅ Positional Encoding Generated")
print(f"\nShape: {pos_encoding.shape}")
print(f"Sequence length: {seq_len}")
print(f"Embedding dimension: {d_model}")

# Visualize positional encodings
plt.figure(figsize=(14, 6))

# Plot 1: Heatmap of positional encodings
plt.subplot(1, 2, 1)
plt.imshow(pos_encoding, cmap='RdBu', aspect='auto')
plt.colorbar(label='Encoding Value')
plt.xlabel('Embedding Dimension', fontsize=12)
plt.ylabel('Position in Sequence', fontsize=12)
plt.title('📍 Sinusoidal Positional Encoding Heatmap', fontsize=13, fontweight='bold')

# Plot 2: Encoding values for specific positions
plt.subplot(1, 2, 2)
for pos in [0, 10, 20, 30, 40]:
    plt.plot(pos_encoding[pos, :50], label=f'Position {pos}', alpha=0.7)
plt.xlabel('Dimension Index', fontsize=12)
plt.ylabel('Encoding Value', fontsize=12)
plt.title('📊 Positional Encoding Patterns', fontsize=13, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Key Observations:")
print("  - Each position has a UNIQUE pattern (fingerprint)")
print("  - Different frequencies capture both local and global positions")
print("  - Low-frequency components: track position in long sequences")
print("  - High-frequency components: distinguish nearby positions")
print("\n🌟 This allows transformers to understand word order!")

In [None]:
# Demonstrate Position Similarity

def cosine_similarity(a, b):
    """Compute cosine similarity between two vectors"""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Compute similarity between all position pairs
similarity_matrix = np.zeros((seq_len, seq_len))

for i in range(seq_len):
    for j in range(seq_len):
        similarity_matrix[i, j] = cosine_similarity(pos_encoding[i], pos_encoding[j])

# Visualize
plt.figure(figsize=(10, 8))
sns.heatmap(similarity_matrix, cmap='coolwarm', center=0,
            xticklabels=10, yticklabels=10)
plt.xlabel('Position', fontsize=12)
plt.ylabel('Position', fontsize=12)
plt.title('📍 Positional Encoding Similarity Matrix\n(How similar are different positions?)',
          fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n📊 Analysis:")
print("  - Diagonal (same position) = 1.0 (perfect similarity)")
print("  - Nearby positions = high similarity (but not identical!)")
print("  - Distant positions = generally lower similarity (the decay is not perfectly smooth)")
print("  - Pattern helps model learn: 'Position 5 is close to 6, far from 40'")
print("\n🎯 This relative position information is crucial for understanding!")

# Example: How different are position 5 and position 6?
pos_5 = pos_encoding[5]
pos_6 = pos_encoding[6]
pos_30 = pos_encoding[30]

sim_5_6 = cosine_similarity(pos_5, pos_6)
sim_5_30 = cosine_similarity(pos_5, pos_30)

print(f"\nConcrete Example:")
print(f"  Similarity(pos 5, pos 6): {sim_5_6:.4f} (neighbors → high)")
print(f"  Similarity(pos 5, pos 30): {sim_5_30:.4f} (distant → lower)")

## 🌟 Real AI Example: Building Attention from Scratch for Sentiment Analysis

**Task:** Classify movie reviews as positive or negative

**Why Attention Helps:**
- Some words are MORE important: "amazing", "terrible", "love", "hate"
- Attention learns to focus on sentiment-bearing words
- Ignores filler words: "the", "a", "is"

**Pipeline:**
1. Convert words to embeddings
2. Add positional encoding
3. Apply self-attention (focus on important words)
4. Aggregate the attended word vectors (mean pooling in this demo)
5. Classify sentiment

Let's build it!

In [None]:
# Simple Attention-Based Sentiment Classifier

class AttentionSentimentClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_heads, num_classes=2):
        """
        Sentiment classifier using attention
        
        Args:
            vocab_size: Size of vocabulary
            embed_dim: Embedding dimension
            num_heads: Number of attention heads
            num_classes: Number of output classes (2 for binary)
        """
        super(AttentionSentimentClassifier, self).__init__()
        
        # Word embeddings
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        
        # Multi-head attention
        self.attention = MultiHeadAttention(embed_dim, num_heads)
        
        # Classification head
        self.fc = nn.Linear(embed_dim, num_classes)
        
        self.embed_dim = embed_dim
        
    def add_positional_encoding(self, x):
        """Add positional encoding to embeddings"""
        batch_size, seq_len, embed_dim = x.shape
        
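        # Generate positional encoding on the same device and dtype as the embeddings
        pos_enc = torch.as_tensor(
            get_positional_encoding(seq_len, embed_dim),
            dtype=x.dtype, device=x.device
        )
        
        # Add to embeddings (broadcasts over the batch dimension)
        return x + pos_enc.unsqueeze(0)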
    
    def forward(self, x):
        """
        Args:
            x: Input token IDs (batch_size, seq_len)
        
        Returns:
            logits: (batch_size, num_classes)
            attention_weights: For visualization
        """
        # 1. Word embeddings
        embeddings = self.embedding(x)  # (batch, seq_len, embed_dim)
        
        # 2. Add positional encoding
        embeddings = self.add_positional_encoding(embeddings)
        
        # 3. Self-attention
        attn_output, attn_weights = self.attention(embeddings)
        
        # 4. Aggregate (mean pooling)
        pooled = attn_output.mean(dim=1)  # (batch, embed_dim)
        
        # 5. Classification
        logits = self.fc(pooled)  # (batch, num_classes)
        
        return logits, attn_weights

print("✅ Attention-based Sentiment Classifier built!")
print("\n🎯 This is a heavily simplified, single-layer cousin of how BERT-style models classify text!")
print("\nKey Components:")
print("  1. Word Embeddings (learned here; similar in spirit to Word2Vec/GloVe)")
print("  2. Positional Encoding (adds order information)")
print("  3. Multi-Head Self-Attention (focuses on important words)")
print("  4. Classification Head (final sentiment prediction)")

# Create model
vocab_size = 1000  # Small vocabulary for demo
embed_dim = 64
num_heads = 4

model = AttentionSentimentClassifier(vocab_size, embed_dim, num_heads)

# Example input (batch of 2 sentences, each 10 tokens)
example_input = torch.randint(0, vocab_size, (2, 10))

# Forward pass
logits, attn_weights = model(example_input)

print(f"\nModel Test:")
print(f"  Input shape: {example_input.shape}")
print(f"  Output logits shape: {logits.shape}")
print(f"  Attention weights shape: {attn_weights.shape}")
print("\n✅ Model works! Ready to train on real data.")

## 🎯 Interactive Exercises

Test your understanding of attention mechanisms!

### Exercise 1: Implement Scaled Dot-Product Attention

**Task:** Complete the function to compute attention scores

**Steps:**
1. Compute Q · K^T
2. Scale by √d_k
3. Apply softmax
4. Multiply by V

In [None]:
def scaled_dot_product_attention(Q, K, V):
    """
    Implement scaled dot-product attention
    
    Args:
        Q: Query (batch, seq_len, d_k)
        K: Key (batch, seq_len, d_k)
        V: Value (batch, seq_len, d_v)
    
    Returns:
        output: Attention output
        attention_weights: Attention scores
    """
    # YOUR CODE HERE
    # Step 1: Compute scores = Q · K^T
    # Step 2: Scale by sqrt(d_k)
    # Step 3: Apply softmax
    # Step 4: Multiply by V
    
    pass

# Test your implementation
Q = torch.randn(1, 5, 8)
K = torch.randn(1, 5, 8)
V = torch.randn(1, 5, 8)

# output, weights = scaled_dot_product_attention(Q, K, V)
# print(f"Output shape: {output.shape}")
# print(f"Attention weights shape: {weights.shape}")

<details>
<summary>üìñ Click here for solution</summary>

```python
def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    
    # Step 1: Compute scores
    scores = torch.matmul(Q, K.transpose(-2, -1))
    
    # Step 2: Scale
    scores = scores / np.sqrt(d_k)
    
    # Step 3: Softmax
    attention_weights = F.softmax(scores, dim=-1)
    
    # Step 4: Weighted sum
    output = torch.matmul(attention_weights, V)
    
    return output, attention_weights
```
</details>

### Exercise 2: Calculate Attention Weights Manually

**Given:**
- Query: "cat" = [0.5, 0.8]
- Keys: {"cat": [0.5, 0.8], "dog": [0.4, 0.7], "mat": [0.3, 0.2]}

**Task:** Calculate attention weights (by hand or code)

In [None]:
# YOUR SOLUTION HERE
query = np.array([0.5, 0.8])
keys = np.array([
    [0.5, 0.8],  # cat
    [0.4, 0.7],  # dog
    [0.3, 0.2]   # mat
])

# Calculate attention weights
# Step 1: Dot products
# Step 2: Softmax
# Which word gets highest attention?

<details>
<summary>üìñ Click here for solution</summary>

```python
# Step 1: Compute dot products (similarities)
scores = np.dot(query, keys.T)
print(f"Scores: {scores}")  # [0.89, 0.76, 0.31]

# Step 2: Apply softmax
attention_weights = np.exp(scores) / np.sum(np.exp(scores))
print(f"Attention weights: {attention_weights}")

# Result: approximately [0.41, 0.36, 0.23]
# Highest attention on "cat" itself (0.41)
```
</details>

### Exercise 3: Why Multi-Head Attention?

**Question:** Explain in your own words why we use multiple attention heads instead of just one.

**Think about:**
- What different relationships exist in language?
- Can one attention head capture all patterns?
- How does this relate to ChatGPT/GPT-4?

<details>
<summary>üìñ Click here for answer</summary>

**Why Multi-Head Attention:**

1. **Different Linguistic Patterns:**
   - Head 1 might learn syntax (subject-verb agreement)
   - Head 2 might learn semantics (word meanings)
   - Head 3 might learn coreference (pronouns)
   - Head 4 might learn long-range dependencies

2. **Richer Representations:**
   - Single head = one perspective
   - Multiple heads = diverse perspectives → better understanding

3. **Real Examples:**
   - Sentence: "The cat, which was sleeping, woke up"
   - Head A: "cat" → "was sleeping" (descriptive clause)
   - Head B: "cat" → "woke up" (main action)
   - Head C: "which" → "cat" (relative pronoun reference)

4. **In GPT-3:**
   - 96 heads per layer × 96 layers = 9,216 attention heads
   - Different heads can pick up different patterns
   - Combined = a much richer model of language
</details>

## 🎓 Key Takeaways

**You just learned:**

### 1. **What is Attention?**
   - ✅ Mechanism for focusing on important information
   - ✅ Solves information bottleneck in RNNs
   - ✅ Enables long-range dependencies
   - **Use when:** Processing sequences (text, time-series, DNA)

### 2. **Self-Attention**
   - ✅ Each word attends to all words (including itself)
   - ✅ Creates context-aware embeddings
   - ✅ Formula: Attention(Q,K,V) = softmax(QK^T/√d_k)V
   - **Powers:** All modern LLMs (GPT, BERT, Claude)

### 3. **Multi-Head Attention**
   - ✅ Multiple attention mechanisms in parallel
   - ✅ Each head can learn different patterns
   - ✅ Richer representations
   - **Used in:** GPT-3 (96 heads per layer), Claude, Gemini

### 4. **Positional Encoding**
   - ✅ Adds word order information
   - ✅ Sinusoidal or learned embeddings
   - ✅ Critical for understanding sequences
   - **Without it:** "dog bites man" = "man bites dog"

### 🌟 Real-World Impact (2024-2025):

**What You Can Build:**
- 🤖 **Chatbots** using attention-based models
- 🌐 **Translation Systems** (Google Translate uses this!)
- 📝 **Text Summarization** for RAG systems
- 🎯 **Question Answering** (like ChatGPT)
- 📊 **Sentiment Analysis** with attention weights

**Modern Applications:**
- **ChatGPT/GPT-4:** Transformer-based, with attention as the core mechanism
- **Claude:** Anthropic's AI using attention
- **Gemini:** Google's multimodal AI
- **GitHub Copilot:** Code generation with attention
- **Midjourney/DALL-E:** Text-to-image models, where attention typically links the text prompt to image features (Midjourney's architecture is not public)

### 📊 Attention vs RNNs:

| Feature | RNN/LSTM | Attention/Transformer |
|---------|----------|----------------------|
| Speed | ❌ Sequential (slow to train) | ✅ Parallel across positions (fast on GPUs) |
| Long sequences | ❌ Tends to forget early words | ✅ Direct access to all words (at quadratic cost in sequence length) |
| Interpretability | ❌ Hidden state is hard to inspect | ✅ Attention weights can be inspected (a partial view) |
| Scalability | ❌ Limited | ✅ Scales to billions of parameters |
| Used in 2024 | ❌ Mostly replaced | ✅ State-of-the-art |

---

**🎉 Congratulations!** You now understand the core mechanism behind:
- ChatGPT
- Claude
- GPT-4
- BERT
- Every modern LLM!

**Next:** We'll build the full Transformer architecture! 🚀

## 🚀 Next Steps

**Practice Exercises:**
1. Modify the number of attention heads (2, 4, 8, 16) - what changes?
2. Experiment with different positional encoding strategies
3. Visualize attention weights on real sentences
4. Implement masking for padding tokens
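For exercise 4, one possible starting point is sketched below. It is not the only way to do it: the function name `masked_attention` and the boolean `key_is_padding` argument are hypothetical, and the inputs are random stand-ins. The idea is to set the scores of padding keys to negative infinity before the softmax.

```python
import numpy as np

def masked_attention(Q, K, V, key_is_padding):
    """
    Scaled dot-product attention that ignores padding keys.

    key_is_padding: boolean array of shape (n_keys,), True where the key is padding
    """
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Padding positions get -inf, so softmax assigns them exactly zero weight
    scores = np.where(key_is_padding[None, :], -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))                       # 3 real tokens + 2 padding slots
is_pad = np.array([False, False, False, True, True])

_, masked_weights = masked_attention(tokens, tokens, tokens, is_pad)
print(masked_weights.round(3))
```

The last two columns come out as zero, so no query draws information from padding. The rows belonging to padding queries are still computed here; in practice they are usually ignored downstream, for example by excluding them from pooling.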

**Coming Next:**
- **Day 2:** Complete Transformer Architecture (Encoder, Decoder, the full "Attention Is All You Need" paper)
- **Day 3:** Modern LLMs (BERT, GPT, T5) and Fine-tuning with HuggingFace

---

**💡 Deep Dive Resources:**
- "Attention Is All You Need" paper (Vaswani et al., 2017)
- The Illustrated Transformer (Jay Alammar)
- HuggingFace Transformers course
- Fast.ai course on NLP

---

*Remember: Attention is all you need! This simple mechanism sits at the heart of the Transformer models behind most major AI breakthroughs from 2017 to 2025.* 🌟

**🎯 You now understand how ChatGPT "pays attention" to your prompts!**