# Positional Encoding: Teaching Transformers About Position

Transformers process all positions in parallel, which is great for speed but creates a problem: **how does the model know about word order?** Self-attention on its own treats its input as an unordered set of tokens, so without positional information "cat sat on mat" and "mat on sat cat" would carry exactly the same information!

## What You'll Learn

1. **The Position Problem** - Why transformers need positional information
2. **Sinusoidal Encoding** - The original transformer's elegant solution
3. **Learned Embeddings** - A simple alternative approach
4. **Relative Positions** - Modern improvements
5. **Visualizing Patterns** - Understanding how position encoding works

Let's solve the position puzzle!

In [None]:
import sys
import os
sys.path.append('..')

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Tuple, Optional
import math

# Set style for better plots
plt.style.use('default')
sns.set_palette("husl")

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print("Environment setup complete!")

## 1. The Position Problem

Let's demonstrate why transformers need positional encoding by showing that, without position information, two different orderings of the same words collapse to the same representation.

In [None]:
def demonstrate_position_problem():
    """Show how transformers without position encoding can't distinguish word order."""
    
    # Create simple word embeddings
    vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
    d_model = 4
    
    # Simple embeddings (just for demonstration)
    embeddings = torch.tensor([
        [1.0, 0.0, 0.0, 0.0],  # "the"
        [0.0, 1.0, 0.0, 0.0],  # "cat"
        [0.0, 0.0, 1.0, 0.0],  # "sat"
        [0.0, 0.0, 0.0, 1.0],  # "on"
        [1.0, 1.0, 0.0, 0.0],  # "mat"
    ])
    
    # Two sentences with different word order
    sentence1 = ["the", "cat", "sat", "on", "mat"]  # Normal order
    sentence2 = ["cat", "the", "mat", "sat", "on"]  # Scrambled order
    
    # Convert to embeddings
    emb1 = torch.stack([embeddings[vocab[word]] for word in sentence1])
    emb2 = torch.stack([embeddings[vocab[word]] for word in sentence2])
    
    print("Sentence 1:", " ".join(sentence1))
    print("Sentence 2:", " ".join(sentence2))
    print()
    
    # Show that without position, both sentences have same information
    print("Word embeddings (same for both sentences):")
    unique_words = sorted(set(sentence1))
    for word in unique_words:
        print(f"{word}: {embeddings[vocab[word]].tolist()}")
    print()
    
    # Compare sentence representations (sum of embeddings)
    sum1 = emb1.sum(dim=0)
    sum2 = emb2.sum(dim=0)
    
    print("Sum of embeddings (bag-of-words):")
    print(f"Sentence 1: {sum1.tolist()}")
    print(f"Sentence 2: {sum2.tolist()}")
    print(f"Are they equal? {torch.allclose(sum1, sum2)}")
    print()
    print("Problem: Without position info, transformers can't distinguish word order!")
    
    return emb1, emb2, vocab, embeddings

emb1, emb2, vocab, word_embeddings = demonstrate_position_problem()
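
The bag-of-words sum above is a stand-in for the underlying issue: self-attention itself is permutation-equivariant. The following is a minimal sketch, assuming a single attention head with random projection matrices (`self_attention`, `w_q`, `w_k` and `w_v` are illustrative names, not part of this notebook's codebase). Shuffling the input rows simply shuffles the output rows in the same way, so no output can tell where its token sat in the sequence.

```python
import torch

torch.manual_seed(0)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention, no positional information."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

d_model = 4
x = torch.randn(5, d_model)                       # 5 token embeddings (random, illustrative)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))

perm = torch.tensor([1, 0, 4, 2, 3])              # same scramble as sentence2 above
out = self_attention(x, w_q, w_k, w_v)
out_perm = self_attention(x[perm], w_q, w_k, w_v)

# Permuting the input only permutes the output rows
is_equivariant = torch.allclose(out[perm], out_perm, atol=1e-5)
print(f"Attention output just follows the shuffle? {is_equivariant}")
```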

## 2. Requirements for a Solution

Before jumping to the mathematical formula, let's think about what properties we need for a good positional encoding:

### What Makes a Good Position "Signature"? 🔍

1. **Uniqueness**: Each position should have a unique signature
2. **Boundedness**: Values shouldn't grow unboundedly (position 1000 shouldn't be 1000!)
3. **Smoothness**: Nearby positions should have similar signatures  
4. **Consistency**: The pattern should work for any sequence length
5. **Relative relationships**: The encoding should help capture position relationships

### Why Simple Counting Fails ❌

Let's see why we can't just use position numbers (0, 1, 2, 3...):

```python
# Simple approach: just use position numbers
simple_positions = [0, 1, 2, 3, 4, 5, 100, 1000]
print("Simple counting approach:", simple_positions)
print("Problems:")
print("• Values grow unbounded (1000 is huge!)")
print("• One raw scalar offers a single scale, so fine and coarse position differences are hard to separate")
print("• Model would struggle with long sequences")
print("• Hard to learn patterns from raw numbers")
```

### The Sine/Cosine Solution ✨

**Why do sine and cosine functions solve these problems?**

Think of it like pendulums or waves:
- **Bounded**: Sine and cosine always stay between -1 and 1
- **Smooth**: The functions change gradually  
- **Periodic**: They create repeating patterns the model can learn
- **Different frequencies**: Fast and slow "pendulums" capture different time scales

### The Pendulum Analogy 🎪

Imagine a series of pendulums swinging at different speeds:
- **Fast pendulum**: Distinguishes between nearby positions (1 vs 2)
- **Slow pendulum**: Captures broader patterns (beginning vs end)
- **Multiple pendulums**: Each dimension is like a different pendulum speed

The combination creates a unique "barcode" for each position!

## Sinusoidal Positional Encoding Formula

Now the mathematical formula makes intuitive sense:

For position $pos$ and dimension $i$:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

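**Breaking it down:**
- **Different frequencies**: The divisor $10000^{2i/d_{model}}$ sets each pair's "pendulum speed"; the wavelengths grow geometrically from $2\pi$ toward $10000 \cdot 2\pi$
- **Sine for even dimensions, cosine for odd**: Each frequency gets a sine/cosine pair, so shifting by a fixed offset $k$ rotates that pair, which makes $PE_{pos+k}$ a linear function of $PE_{pos}$
- **Position scaling**: Each dimension pair oscillates at its own frequency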
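
To make the formula concrete, here is a small sketch in plain Python that evaluates it one entry at a time (the helper `pe_value` is an illustrative name, not part of the notebook). It uses `d_model = 8`, chosen here only to keep the output short, and also prints the wavelength of each sine/cosine pair.

```python
import math

def pe_value(pos: int, dim: int, d_model: int) -> float:
    """Evaluate the sinusoidal formula for a single (position, dimension) entry."""
    pair = dim // 2                                   # the "i" in the formula
    angle = pos / (10000 ** (2 * pair / d_model))
    return math.sin(angle) if dim % 2 == 0 else math.cos(angle)

d_model = 8
for pos in (0, 1, 2):
    row = [round(pe_value(pos, dim, d_model), 4) for dim in range(d_model)]
    print(f"pos {pos}: {row}")

# Wavelength (in positions) of each sine/cosine pair: 2*pi * 10000^(2i/d_model)
wavelengths = [2 * math.pi * 10000 ** (2 * i / d_model) for i in range(d_model // 2)]
print("wavelengths:", [round(w, 1) for w in wavelengths])
```

Position 0 always encodes as alternating 0 and 1, since $\sin(0) = 0$ and $\cos(0) = 1$ at every frequency.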

In [None]:
def build_mathematical_intuition():
    """Build intuition before showing the implementation."""
    
    print("🎯 BUILDING INTUITION: Why Sine/Cosine Works")
    print("=" * 55)
    
    # Show why simple counting fails
    print("❌ Simple counting approach:")
    positions = np.arange(10)
    simple_encoding = positions
    print(f"Positions 0-9: {simple_encoding.tolist()}")
    print("Problems:")
    print("  • Values grow unbounded (position 1000 = 1000!)")
    print("  • One raw scalar offers a single scale, so fine and coarse differences are hard to separate")
    print("  • Hard for model to learn patterns")
    print()
    
    print("✅ Sine/Cosine approach:")
    print("  • Always between -1 and 1 (bounded)")
    print("  • Smooth and continuous (nearby positions similar)")
    print("  • Periodic patterns that models can learn")
    print("  • Different frequencies capture different scales")
    print()

    # Show the pendulum analogy with visualization
    pos = np.arange(50)
    frequencies = [1/10000**(2*i/8) for i in range(4)]
    
    plt.figure(figsize=(15, 8))
    
    plt.subplot(2, 2, 1)
    plt.title("The 'Pendulum' Analogy\nDifferent dimensions swing at different speeds")
    for i, freq in enumerate(frequencies):
        wave = np.sin(pos * freq)
        plt.plot(pos, wave + i*2.5, label=f'Dim {i*2} (freq={freq:.4f})')
    plt.xlabel('Position')
    plt.ylabel('Dimension (offset for clarity)')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    # Show uniqueness
    plt.subplot(2, 2, 2)
    plt.title("Each Position Gets a Unique 'Barcode'")
    # Create simple example
    d_model = 8
    pe_sample = create_sinusoidal_encoding(10, d_model)
    for i in range(0, 10, 2):
        plt.plot(range(d_model), pe_sample[i], 'o-', label=f'Pos {i}', alpha=0.8)
    plt.xlabel('Dimension')
    plt.ylabel('Encoding Value') 
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    # Show different time scales
    plt.subplot(2, 2, 3)
    plt.title('Multi-Scale Pattern Capture')
    pe_example = create_sinusoidal_encoding(30, 8)
    plt.plot(pos[:30], pe_example[:30, 0], 'o-', label='Dim 0 (fast)', linewidth=2)
    plt.plot(pos[:30], pe_example[:30, 6], 's-', label='Dim 6 (slow)', linewidth=2)
    plt.xlabel('Position')
    plt.ylabel('Encoding Value')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    # Show similarity decay
    plt.subplot(2, 2, 4)
    plt.title('Position Similarity Decay')
    pe_sim = create_sinusoidal_encoding(20, 8)
    similarities = []
    ref_pos = pe_sim[0]  # Reference position 0
    for i in range(20):
        sim = F.cosine_similarity(ref_pos.unsqueeze(0), pe_sim[i].unsqueeze(0)).item()
        similarities.append(sim)
    plt.plot(similarities, 'o-', linewidth=2)
    plt.xlabel('Position')
    plt.ylabel('Similarity to Position 0')
    plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("🔑 KEY INSIGHTS:")
    print("1. Different dimensions = different pendulum speeds")
    print("2. Fast frequencies distinguish nearby positions")
    print("3. Slow frequencies capture broader relationships")
    print("4. Every position gets a unique signature")
    print("5. Similar positions have similar signatures")

def create_sinusoidal_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """
    Create sinusoidal positional encoding with detailed explanations.
    
    Args:
        max_len: Maximum sequence length
        d_model: Model dimension (must be even)
    
    Returns:
        Positional encoding tensor [max_len, d_model]
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even for sinusoidal encoding")
    
    print(f"\n🔧 CREATING SINUSOIDAL ENCODING")
    print(f"Max length: {max_len}, Model dimension: {d_model}")
    
    # Step 1: Create position indices
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
    print(f"Position shape: {position.shape} (positions 0 to {max_len-1})")
    
    # Step 2: Create frequency terms
    # This is the key: different frequencies for different dimension pairs
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * 
                        (-math.log(10000.0) / d_model))
    print(f"Frequency terms shape: {div_term.shape}")
    print(f"Frequencies: {div_term[:4].tolist()!r}... (fast to slow)")
    
    # Step 3: Apply sine to even dimensions (0, 2, 4, ...)
    pe[:, 0::2] = torch.sin(position * div_term)
    print(f"Applied sine to dimensions: 0, 2, 4, ..., {d_model-2}")
    
    # Step 4: Apply cosine to odd dimensions (1, 3, 5, ...)  
    pe[:, 1::2] = torch.cos(position * div_term)
    print(f"Applied cosine to dimensions: 1, 3, 5, ..., {d_model-1}")
    
    print("✅ Sinusoidal encoding created!")
    return pe

# Build intuition now that create_sinusoidal_encoding() is defined
# (the plots inside build_mathematical_intuition() call it)
build_mathematical_intuition()

# Create positional encoding with detailed explanation
max_len, d_model = 50, 8
pos_encoding = create_sinusoidal_encoding(max_len, d_model)

print(f"\n📊 ENCODING ANALYSIS")
print(f"Shape: {pos_encoding.shape}")
print(f"Value range: [{pos_encoding.min():.3f}, {pos_encoding.max():.3f}]")
print(f"First 3 positions:")
for i in range(3):
    print(f"  Position {i}: {pos_encoding[i][:4].tolist()!r}...")

# Enhanced visualization
plt.figure(figsize=(16, 10))

# 1. Heatmap of encoding patterns
plt.subplot(2, 3, 1)
sns.heatmap(pos_encoding[:20].T, cmap='RdBu_r', center=0, 
            xticklabels=range(20), yticklabels=range(d_model))
plt.title('Positional Encoding Heatmap\n(Blue=negative, Red=positive)')
plt.xlabel('Position')
plt.ylabel('Dimension')

# 2. Individual dimensions over position
plt.subplot(2, 3, 2)
for dim in [0, 1, 6, 7]:  # Show fast and slow dimensions
    plt.plot(pos_encoding[:30, dim], label=f'Dim {dim}', linewidth=2)
plt.title('Encoding Values by Position')
plt.xlabel('Position')  
plt.ylabel('Value')
plt.legend()
plt.grid(True, alpha=0.3)

# 3. Frequency spectrum
plt.subplot(2, 3, 3)
frequencies = []
for i in range(0, d_model, 2):
    freq = 1 / (10000 ** (i / d_model))
    frequencies.append(freq)
plt.bar(range(len(frequencies)), frequencies, color='skyblue')
plt.title('Frequencies by Dimension Pair')
plt.xlabel('Dimension Pair (i//2)')
plt.ylabel('Frequency (log scale)')
plt.yscale('log')

# 4. Position similarity matrix
plt.subplot(2, 3, 4)
similarity_matrix = torch.zeros(20, 20)
for i in range(20):
    for j in range(20):
        similarity_matrix[i, j] = F.cosine_similarity(
            pos_encoding[i].unsqueeze(0), pos_encoding[j].unsqueeze(0)
        ).item()

sns.heatmap(similarity_matrix, cmap='coolwarm', center=0, square=True)
plt.title('Position Similarity Matrix\n(Red=similar, Blue=dissimilar)')
plt.xlabel('Position')
plt.ylabel('Position')

# 5. Similarity decay from position 0
plt.subplot(2, 3, 5)
similarities = [similarity_matrix[0, i].item() for i in range(20)]
plt.plot(similarities, 'o-', linewidth=2, markersize=6)
plt.title('Similarity to Position 0')
plt.xlabel('Position')
plt.ylabel('Cosine Similarity')
plt.grid(True, alpha=0.3)

# 6. Sine vs cosine pairs (each pair satisfies sin^2 + cos^2 = 1)
plt.subplot(2, 3, 6)
even_dims = pos_encoding[:, 0::2]  # Sine dimensions
odd_dims = pos_encoding[:, 1::2]   # Cosine dimensions
plt.scatter(even_dims[:20].flatten(), odd_dims[:20].flatten(), alpha=0.6)
plt.title('Sine vs Cosine Dimensions\n(Points lie on the unit circle)')
plt.xlabel('Sine Values')
plt.ylabel('Cosine Values')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n🎓 KEY PROPERTIES DEMONSTRATED:")
print("1. Each position has a unique encoding pattern (heatmap)")
print("2. Different dimensions oscillate at different frequencies") 
print("3. Fast frequencies distinguish nearby positions")
print("4. Slow frequencies capture long-range relationships")
print("5. Similarity tends to fall off with distance (though not strictly monotonically)")
print("6. Each sine/cosine pair traces the unit circle at its own frequency")
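
The "relative relationships" property can be checked directly from the formula. Below is a minimal sketch in plain Python (the helpers `sinusoidal_row` and `dot` are illustrative names, not part of the notebook). Because $\sin a \sin b + \cos a \cos b = \cos(a - b)$, the dot product between two encodings depends only on their offset, not on where in the sequence they sit, which is why the similarity matrix shows bands along its diagonals.

```python
import math

def sinusoidal_row(pos: int, d_model: int) -> list:
    """Encoding for one position, laid out as [sin, cos, sin, cos, ...]."""
    row = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        row += [math.sin(angle), math.cos(angle)]
    return row

def dot(a, b) -> float:
    return sum(x * y for x, y in zip(a, b))

d_model, offset = 8, 3

# Same offset, three different starting positions
dots = [dot(sinusoidal_row(p, d_model), sinusoidal_row(p + offset, d_model))
        for p in (0, 10, 40)]

# Closed form: sum over frequencies of cos(offset * frequency)
closed_form = sum(math.cos(offset / (10000 ** (2 * i / d_model)))
                  for i in range(d_model // 2))

print("dot products:", [round(d, 6) for d in dots])
print("closed form: ", round(closed_form, 6))
```

Every row also has the same squared norm, `d_model / 2`, so the cosine similarity plotted above depends only on the offset as well.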

## 3. Learned Positional Embeddings

An alternative approach is to learn positional embeddings just like word embeddings. This is simpler but requires a fixed maximum sequence length during training.

In [None]:
class LearnedPositionalEmbedding(nn.Module):
    """Learned positional embeddings."""
    
    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        self.embedding = nn.Embedding(max_len, d_model)
        self.max_len = max_len
        
        # Initialize with small random values
        nn.init.normal_(self.embedding.weight, std=0.02)
    
    def forward(self, seq_len: int) -> torch.Tensor:
        """Get positional embeddings for sequence length."""
        if seq_len > self.max_len:
            raise ValueError(f"Sequence length {seq_len} exceeds max_len {self.max_len}")
        
        # Build indices on the same device as the embedding table
        positions = torch.arange(seq_len, device=self.embedding.weight.device)
        return self.embedding(positions)

# Create learned positional embeddings
learned_pos = LearnedPositionalEmbedding(max_len=50, d_model=8)

# Get embeddings for a sequence
seq_len = 10
learned_encoding = learned_pos(seq_len)

print(f"Learned positional encoding shape: {learned_encoding.shape}")
print(f"First few positions (before training):")
for i in range(5):
    print(f"Position {i}: {learned_encoding[i][:4].detach().tolist()!r}...")  # Show first 4 dimensions

# Compare sinusoidal vs learned (before training)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Sinusoidal encoding
sns.heatmap(pos_encoding[:seq_len].T, cmap='RdBu_r', center=0, ax=ax1,
            xticklabels=range(seq_len), yticklabels=range(d_model))
ax1.set_title('Sinusoidal Positional Encoding')
ax1.set_xlabel('Position')
ax1.set_ylabel('Dimension')

# Learned encoding (random initialization)
sns.heatmap(learned_encoding.detach().T, cmap='RdBu_r', center=0, ax=ax2,
            xticklabels=range(seq_len), yticklabels=range(d_model))
ax2.set_title('Learned Positional Encoding\n(Random Initialization)')
ax2.set_xlabel('Position')
ax2.set_ylabel('Dimension')

plt.tight_layout()
plt.show()

print("\nComparison:")
print("Sinusoidal Encoding:")
print("  ✓ Can be computed for any sequence length")
print("  ✓ Has mathematical structure (relative positions)")
print("  ✓ No additional parameters")
print("\nLearned Encoding:")
print("  ✓ Can adapt to specific tasks")
print("  ✓ Performs comparably in practice (the original Transformer paper reports nearly identical results)")
print("  ✗ Limited to the max_len positions in its table")
print("  ✗ Requires additional parameters")
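
The two drawbacks listed for learned embeddings follow directly from how `nn.Embedding` works. The sketch below uses a bare `nn.Embedding` table rather than the `LearnedPositionalEmbedding` class above, with the same `max_len=50` and `d_model=8`; the variable names are illustrative. It shows that the table adds `max_len * d_model` parameters and that a position past the end of the table has no row to look up.

```python
import torch
import torch.nn as nn

max_len, d_model = 50, 8
table = nn.Embedding(max_len, d_model)

# Extra parameters: one d_model-sized row per position
n_params = sum(p.numel() for p in table.parameters())
print(f"Positional table parameters: {n_params} (= {max_len} * {d_model})")

# Every position inside the table can be looked up
inside = table(torch.arange(max_len))
print(f"Lookup for positions 0..{max_len - 1}: {tuple(inside.shape)}")

# Position max_len is one past the end of the table
try:
    table(torch.tensor([max_len]))
    lookup_failed = False
except (IndexError, RuntimeError):  # exact exception type may depend on device/version
    lookup_failed = True
print(f"Lookup for position {max_len} failed? {lookup_failed}")
```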

## 4. The Critical Question: Why ADD Position to Word Embeddings?

This is one of the most important but often overlooked concepts in transformers. Why do we **add** positional encodings instead of **concatenating** them?

### Addition vs Concatenation Explained 🤔

In [None]:
def explain_addition_vs_concatenation():
    """Explain the fundamental choice of addition over concatenation."""
    
    print("🔑 FUNDAMENTAL QUESTION: Why ADD positional encodings?")
    print("=" * 60)
    
    # Create example embeddings
    d_model = 4
    word_emb = torch.tensor([1.0, 0.5, -0.3, 0.8])  # Word: "cat"
    pos_emb = torch.tensor([0.1, -0.2, 0.4, -0.1])  # Position: 2
    
    # Addition (what transformers do)
    added = word_emb + pos_emb
    
    # Concatenation (alternative approach)
    concatenated = torch.cat([word_emb, pos_emb])
    
    print("Example: Word 'cat' at position 2")
    print(f"Word embedding:       {word_emb.tolist()}")
    print(f"Position embedding:   {pos_emb.tolist()}")
    print()
    print("ADDITION RESULT:")
    print(f"Combined embedding:   {added.tolist()} (shape: {added.shape})")
    print()
    print("CONCATENATION RESULT:")
    print(f"Combined embedding:   {concatenated.tolist()} (shape: {concatenated.shape})")
    print()
    
    print("🎯 WHY ADDITION WINS:")
    print("✅ Keeps same dimensionality (efficient)")
    print("✅ Word and position info are 'blended' in same vector space")
    print("✅ Attention can jointly consider word + position features")
    print("✅ Both types of info can influence each attention head differently")
    print("✅ Model learns to balance word meaning vs positional importance")
    print()
    
    print("❌ PROBLEMS WITH CONCATENATION:")
    print("❌ Doubles the embedding dimension (expensive!)")
    print("❌ Word and position info are separated into different 'zones'")
    print("❌ Attention heads must explicitly learn to combine them")
    print("❌ Less parameter efficient")
    print("❌ Creates artificial separation between content and position")
    print()
    
    # Show the effect on attention
    print("🔍 IMPACT ON ATTENTION:")
    print("With Addition:")
    print("  • Attention sees unified 'word-at-position' concept")
    print("  • Query/Key dot products consider both word and position")
    print("  • Model can learn: 'focus on nouns in early positions'")
    print()
    print("With Concatenation:")
    print("  • Attention sees word features and position features separately") 
    print("  • Must learn to correlate across different embedding regions")
    print("  • Less natural for joint word-position reasoning")
    
    return added, concatenated

def visualize_addition_effect():
    """Visualize how addition creates blended representations."""
    
    print("\n📊 VISUALIZING THE BLENDING EFFECT")
    print("=" * 40)
    
    # Create multiple word-position combinations
    words = ["the", "cat", "sat", "on", "mat"]
    word_embeddings = torch.tensor([
        [1.0, 0.0, 0.0, 0.0],  # "the"
        [0.0, 1.0, 0.0, 0.0],  # "cat"
        [0.0, 0.0, 1.0, 0.0],  # "sat"
        [0.0, 0.0, 0.0, 1.0],  # "on"
        [1.0, 1.0, 0.0, 0.0],  # "mat"
    ])
    
    # Get positional encodings
    pos_enc = create_sinusoidal_encoding(5, 4)
    
    # Show the blending effect
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))
    
    # Word embeddings only
    sns.heatmap(word_embeddings.T, annot=True, cmap='Blues', ax=axes[0, 0], 
                cbar=False, xticklabels=words, yticklabels=['Dim0', 'Dim1', 'Dim2', 'Dim3'])
    axes[0, 0].set_title('Pure Word Embeddings\n(No Position Info)')
    axes[0, 0].set_xlabel('Words')
    axes[0, 0].set_ylabel('Dimensions')
    
    # Positional encodings only
    sns.heatmap(pos_enc.T, annot=True, cmap='Reds', ax=axes[0, 1], fmt='.2f',
                cbar=False, xticklabels=range(5), yticklabels=['Dim0', 'Dim1', 'Dim2', 'Dim3'])
    axes[0, 1].set_title('Pure Positional Encodings\n(No Word Info)')
    axes[0, 1].set_xlabel('Positions')
    axes[0, 1].set_ylabel('Dimensions')
    
    # Combined (added) embeddings
    combined = word_embeddings + pos_enc
    sns.heatmap(combined.T, annot=True, cmap='Greens', ax=axes[0, 2], fmt='.2f',
                cbar=False, xticklabels=words, yticklabels=['Dim0', 'Dim1', 'Dim2', 'Dim3'])
    axes[0, 2].set_title('Combined Embeddings\n(Word + Position)')
    axes[0, 2].set_xlabel('Words at Positions')
    axes[0, 2].set_ylabel('Dimensions')
    
    # Show what concatenation would look like
    concatenated = torch.cat([word_embeddings, pos_enc], dim=1)  # 8 dimensions
    sns.heatmap(concatenated.T, annot=True, cmap='Purples', ax=axes[1, 0], fmt='.2f',
                cbar=False, xticklabels=words)
    axes[1, 0].set_title('Concatenated Approach\n(8 dimensions)')
    axes[1, 0].set_xlabel('Words')
    axes[1, 0].set_ylabel('Word Dims (0-3) + Pos Dims (4-7)')
    
    # Show attention compatibility
    # Compare raw dot-product scores (no learned Q/K projections) for both schemes
    q_added = combined[1]  # "cat" at position 1
    k_added = combined[3]  # "on" at position 3
    q_concat = concatenated[1]
    k_concat = concatenated[3]
    
    attention_score_added = torch.dot(q_added, k_added).item()
    attention_score_concat = torch.dot(q_concat, k_concat).item()
    
    axes[1, 1].bar(['Addition', 'Concatenation'], 
                   [attention_score_added, attention_score_concat], 
                   color=['green', 'purple'])
    axes[1, 1].set_title('Raw Dot-Product Score\n("cat" attending to "on")')
    axes[1, 1].set_ylabel('Dot Product (before softmax)')
    
    # Show the "blending" concept
    axes[1, 2].plot(range(4), word_embeddings[1], 'o-', label='Word: "cat"', linewidth=2)
    axes[1, 2].plot(range(4), pos_enc[1], 's-', label='Position: 1', linewidth=2) 
    axes[1, 2].plot(range(4), combined[1], '^-', label='Combined', linewidth=3)
    axes[1, 2].set_title('Dimension-wise Blending\n("cat" at position 1)')
    axes[1, 2].set_xlabel('Dimension')
    axes[1, 2].set_ylabel('Value')
    axes[1, 2].legend()
    axes[1, 2].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("🎓 KEY TAKEAWAYS FROM VISUALIZATION:")
    print("1. Addition creates true 'word-at-position' representations")
    print("2. Each dimension contains blended word + position information")
    print("3. Attention can jointly reason about content and position")
    print("4. More parameter efficient than concatenation")
    print("5. Creates richer interaction between semantic and positional features")

# Run the explanations
added_result, concat_result = explain_addition_vs_concatenation()
visualize_addition_effect()

def solve_position_problem_with_encoding():
    """Show how positional encoding solves the word order problem."""
    
    print("\n🎯 SOLVING THE ORIGINAL PROBLEM")
    print("=" * 40)
    
    # Use our previous example
    sentence1 = ["the", "cat", "sat", "on", "mat"]
    sentence2 = ["cat", "the", "mat", "sat", "on"]  # Scrambled!
    
    # Get word embeddings (same as before)
    vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
    word_embeddings = torch.tensor([
        [1.0, 0.0, 0.0, 0.0],  # "the"
        [0.0, 1.0, 0.0, 0.0],  # "cat"
        [0.0, 0.0, 1.0, 0.0],  # "sat"
        [0.0, 0.0, 0.0, 1.0],  # "on"
        [1.0, 1.0, 0.0, 0.0],  # "mat"
    ])
    
    # Create positional encoding
    pos_enc = create_sinusoidal_encoding(5, 4)
    
    # Process both sentences
    def sentence_to_embeddings(sentence):
        word_embs = torch.stack([word_embeddings[vocab[word]] for word in sentence])
        pos_embs = pos_enc[:len(sentence)]
        combined = word_embs + pos_embs  # THE MAGIC: Addition!
        return word_embs, pos_embs, combined
    
    word_emb1, pos_emb1, combined1 = sentence_to_embeddings(sentence1)
    word_emb2, pos_emb2, combined2 = sentence_to_embeddings(sentence2)
    
    print("📝 TESTING OUR SOLUTION:")
    print(f"Sentence 1: {' '.join(sentence1)}")
    print(f"Sentence 2: {' '.join(sentence2)} (scrambled)")
    print()
    
    # Show that combined embeddings are now different
    print("Word at each position:")
    print("Position | Sentence 1 | Sentence 2 | Same Embedding?")
    print("-" * 55)
    
    for i in range(5):
        word1 = sentence1[i]
        word2 = sentence2[i] 
        emb1 = combined1[i]
        emb2 = combined2[i]
        same = torch.allclose(emb1, emb2, atol=1e-6)
        print(f"{i:8} | {word1:^10} | {word2:^10} | {same}")
    
    # Compare sentence-level representations
    sum1 = combined1.sum(dim=0) 
    sum2 = combined2.sum(dim=0)
    are_same = torch.allclose(sum1, sum2, atol=1e-6)
    
    print(f"\nSentence-level representations:")
    print(f"Sentence 1 sum: {sum1.tolist()!r}")
    print(f"Sentence 2 sum: {sum2.tolist()!r}")
    print(f"Are sentences identical? {are_same}")
    
    if not are_same:
        print("\n🎉 SUCCESS! Positional encoding solved the problem!")
        print("✅ Different word orders → Different representations")
        print("✅ Model can now distinguish word order")
        print("✅ Same words, different positions → Different meanings")
    else:
        print("❌ Something went wrong...")
    
    return combined1, combined2

# Demonstrate the solution
final_combined1, final_combined2 = solve_position_problem_with_encoding()
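
Returning to the addition-versus-concatenation comparison, it helps to expand the dot product that attention computes. This is a minimal sketch with made-up 4-dimensional vectors (all values and the `dot` helper are illustrative), and it deliberately leaves out the learned query/key projections that a real attention layer applies, which can mix word and position features under either scheme. With addition, the raw score contains word-position cross terms; with concatenation, it is just the word term plus the position term.

```python
def dot(a, b) -> float:
    return sum(x * y for x, y in zip(a, b))

# Illustrative word (w) and position (p) vectors for two tokens a and b
w_a, p_a = [0.0, 1.0, 0.0, 0.0], [0.8, 0.5, 0.1, 1.0]
w_b, p_b = [0.0, 0.0, 0.0, 1.0], [0.1, -0.4, 0.3, 0.9]

# Addition: (w_a + p_a) . (w_b + p_b)
added = dot([w + p for w, p in zip(w_a, p_a)],
            [w + p for w, p in zip(w_b, p_b)])

# Concatenation: [w_a ; p_a] . [w_b ; p_b]  (list + list concatenates)
concatenated = dot(w_a + p_a, w_b + p_b)

content = dot(w_a, w_b)                   # word-word term
position = dot(p_a, p_b)                  # position-position term
cross = dot(w_a, p_b) + dot(p_a, w_b)     # word-position cross terms

print(f"added        = {added:.2f} = content + cross + position")
print(f"concatenated = {concatenated:.2f} = content + position")
print(f"cross terms  = {cross:.2f}")
```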

## 5. Relative Positional Encoding

Modern transformers often use relative positional encoding, which focuses on the *distance* between positions rather than absolute positions. This can be more effective for tasks where relative relationships matter more than absolute positions.

In [None]:
def demonstrate_relative_positions():
    """Show the concept of relative positional encoding."""
    
    seq_len = 6
    
    # Create a matrix of relative distances
    # (kept as integers so the heatmap below can annotate cells with fmt='d')
    positions = torch.arange(seq_len)
    relative_distances = positions.unsqueeze(0) - positions.unsqueeze(1)  # [i, j] = j - i
    
    print("Relative Position Matrix:")
    print("(Each cell shows position j relative to position i)")
    print(relative_distances.numpy().astype(int))
    print()
    
    # Create simple relative positional encoding
    # In practice, this would be more sophisticated
    max_relative_distance = seq_len - 1
    relative_embeddings = torch.randn(2 * max_relative_distance + 1, 4)  # d_model = 4
    
    print("Relative embeddings for different distances:")
    for i in range(-max_relative_distance, max_relative_distance + 1):
        idx = i + max_relative_distance
        print(f"Distance {i:2}: {relative_embeddings[idx][:2].tolist()!r}...")
    
    # Visualize relative distances
    plt.figure(figsize=(12, 5))
    
    plt.subplot(1, 2, 1)
    sns.heatmap(relative_distances, annot=True, cmap='RdBu_r', center=0, 
                fmt='d', cbar_kws={'label': 'Relative Distance'})
    plt.title('Relative Position Matrix')
    plt.xlabel('Position j')
    plt.ylabel('Position i')
    
    plt.subplot(1, 2, 2)
    distances = range(-max_relative_distance, max_relative_distance + 1)
    # Plot first dimension of relative embeddings
    plt.plot(distances, relative_embeddings[:, 0], 'o-', label='Dimension 0')
    plt.plot(distances, relative_embeddings[:, 1], 's-', label='Dimension 1')
    plt.title('Relative Embeddings by Distance')
    plt.xlabel('Relative Distance')
    plt.ylabel('Embedding Value')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("\nAdvantages of Relative Positioning:")
    print("• Focuses on relationships between positions")
    print("• Can generalize to longer sequences")
    print("• More natural for many language tasks")
    print("• Used in modern models like Transformer-XL, T5")

demonstrate_relative_positions()
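
The demo above builds the distance matrix and an embedding table but stops short of using them. One common way to use relative positions is to add a distance-dependent bias to the attention scores. The sketch below is a simplified illustration of that idea, not a reproduction of Transformer-XL or T5: it assumes one scalar bias per clipped distance, random values in place of learned ones, and a single head (`bias_table`, `max_distance` and the other names are illustrative).

```python
import torch

torch.manual_seed(0)
seq_len, d_k, max_distance = 6, 4, 2

q = torch.randn(seq_len, d_k)   # queries (random, illustrative)
k = torch.randn(seq_len, d_k)   # keys

# One scalar bias per clipped relative distance in [-max_distance, max_distance]
# (random here; a real model would learn these values)
bias_table = torch.randn(2 * max_distance + 1)

positions = torch.arange(seq_len)
relative = positions.unsqueeze(0) - positions.unsqueeze(1)           # [i, j] = j - i
index = relative.clamp(-max_distance, max_distance) + max_distance   # shift into [0, 2*max_distance]

# Content score plus a bias that depends only on the (clipped) distance j - i
scores = q @ k.T / d_k ** 0.5 + bias_table[index]
weights = torch.softmax(scores, dim=-1)

print("Bias index per (i, j) pair:")
print(index)
print("Each row of attention weights sums to 1:", weights.sum(dim=-1))
```

Because the bias is indexed by distance rather than by absolute position, pairs such as (0, 1) and (3, 4) share the same entry, and any distance beyond `max_distance` reuses the outermost entry.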

## 6. Building a Complete Positional Embedding Layer

Let's implement a complete positional embedding layer that can use either sinusoidal or learned embeddings.

In [None]:
from src.model.embeddings import GPTEmbedding

class PositionalEmbedding(nn.Module):
    """Complete positional embedding layer with multiple options."""
    
    def __init__(self, vocab_size: int, max_len: int, d_model: int, 
                 pos_type: str = "learned", dropout: float = 0.1):
        super().__init__()
        self.d_model = d_model
        self.pos_type = pos_type
        
        # Token embeddings
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        
        # Positional embeddings
        if pos_type == "learned":
            self.pos_embedding = nn.Embedding(max_len, d_model)
        elif pos_type == "sinusoidal":
            # Register as buffer (not a parameter)
            pos_encoding = create_sinusoidal_encoding(max_len, d_model)
            self.register_buffer('pos_encoding', pos_encoding)
        else:
            raise ValueError(f"Unknown pos_type: {pos_type}")
        
        self.dropout = nn.Dropout(dropout)
        
        # Initialize token embeddings
        nn.init.normal_(self.token_embedding.weight, std=0.02)
        if pos_type == "learned":
            nn.init.normal_(self.pos_embedding.weight, std=0.02)
    
    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        """
        Forward pass: combine token and positional embeddings.
        
        Args:
            token_ids: Token IDs [batch_size, seq_len]
            
        Returns:
            Combined embeddings [batch_size, seq_len, d_model]
        """
        batch_size, seq_len = token_ids.shape
        
        # Token embeddings
        token_emb = self.token_embedding(token_ids)  # [batch_size, seq_len, d_model]
        
        # Positional embeddings
        if self.pos_type == "learned":
            positions = torch.arange(seq_len, device=token_ids.device)
            pos_emb = self.pos_embedding(positions)  # [seq_len, d_model]
            pos_emb = pos_emb.unsqueeze(0).expand(batch_size, -1, -1)  # [batch_size, seq_len, d_model]
        else:  # sinusoidal
            pos_emb = self.pos_encoding[:seq_len].unsqueeze(0)  # [1, seq_len, d_model]
            pos_emb = pos_emb.expand(batch_size, -1, -1)  # [batch_size, seq_len, d_model]
        
        # Combine embeddings
        embeddings = token_emb + pos_emb
        
        # Apply dropout
        embeddings = self.dropout(embeddings)
        
        return embeddings

# Test both types of positional embedding
vocab_size, max_len, d_model = 100, 20, 8
batch_size, seq_len = 2, 10

# Create test input
token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))

# Test learned embeddings (eval mode disables dropout, so outputs are deterministic)
learned_emb = PositionalEmbedding(vocab_size, max_len, d_model, pos_type="learned").eval()
output_learned = learned_emb(token_ids)

# Test sinusoidal embeddings
sinusoidal_emb = PositionalEmbedding(vocab_size, max_len, d_model, pos_type="sinusoidal").eval()
output_sinusoidal = sinusoidal_emb(token_ids)

print(f"Input token IDs shape: {token_ids.shape}")
print(f"Output embeddings shape: {output_learned.shape}")
print()

# Compare parameter counts
learned_params = sum(p.numel() for p in learned_emb.parameters())
sinusoidal_params = sum(p.numel() for p in sinusoidal_emb.parameters())

print(f"Parameter comparison:")
print(f"Learned embeddings:    {learned_params:,} parameters")
print(f"Sinusoidal embeddings: {sinusoidal_params:,} parameters")
print(f"Difference: {learned_params - sinusoidal_params:,} (positional embedding table)")

# Visualize the combined (token + position) embeddings for the first sequence in the batch
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Token + learned positional embeddings
sns.heatmap(output_learned[0].detach().T, cmap='viridis', ax=ax1,
            xticklabels=range(seq_len), yticklabels=range(d_model))
ax1.set_title('Token + Learned Positional Embeddings')
ax1.set_xlabel('Position')
ax1.set_ylabel('Dimension')

# Token + sinusoidal positional embeddings
sns.heatmap(output_sinusoidal[0].detach().T, cmap='viridis', ax=ax2,
            xticklabels=range(seq_len), yticklabels=range(d_model))
ax2.set_title('Token + Sinusoidal Positional Embeddings')
ax2.set_xlabel('Position')
ax2.set_ylabel('Dimension')

plt.tight_layout()
plt.show()

print("\n✅ Both embedding types work and produce the same output shape!")
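
The parameter difference reported above comes from how the sinusoidal table is stored: `register_buffer` keeps a tensor on the module without making it trainable. The sketch below uses a tiny stand-in module (`BufferDemo` is an illustrative name, not part of the notebook's codebase) to show that a buffer is absent from `parameters()` but is still saved in `state_dict()`.

```python
import torch
import torch.nn as nn

class BufferDemo(nn.Module):
    """Tiny stand-in: one learned table and one fixed (buffer) table."""

    def __init__(self, max_len: int = 4, d_model: int = 2):
        super().__init__()
        self.learned = nn.Embedding(max_len, d_model)                    # trainable parameter
        self.register_buffer("fixed", torch.zeros(max_len, d_model))     # not trainable

demo = BufferDemo()
param_names = [name for name, _ in demo.named_parameters()]
buffer_names = [name for name, _ in demo.named_buffers()]
state_keys = sorted(demo.state_dict().keys())

print("Parameters (seen by the optimizer):", param_names)
print("Buffers (fixed tensors):           ", buffer_names)
print("Saved in state_dict:               ", state_keys)
print("Buffer requires grad?", demo.fixed.requires_grad)
```

Buffers also follow the module when it is moved with `.to(device)`, which is why the sinusoidal branch of `forward` does not need to handle devices explicitly.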

## 7. Position Encoding in Practice

Let's see how our transformer implementation uses positional encoding and test it with a real example.

In [None]:
from src.model.transformer import GPTModel, create_model_config
from src.data.tokenizer import create_tokenizer

def test_positional_encoding_in_transformer():
    """Test how positional encoding affects transformer behavior."""
    
    # Create a small model
    config = create_model_config("tiny")
    model = GPTModel(**config)
    
    # Create tokenizer
    tokenizer = create_tokenizer("simple")
    
    # Test sentences with different word orders
    sentences = [
        "The cat sat on the mat",
        "The mat sat on the cat",  # Different meaning!
        "Cat the sat mat on the",  # Scrambled
    ]
    
    print("Testing positional encoding in transformer:")
    print("=" * 50)
    
    model.eval()
    with torch.no_grad():
        for i, sentence in enumerate(sentences):
            # Tokenize
            tokens = tokenizer.encode(sentence, add_special_tokens=False)
            token_ids = torch.tensor(tokens).unsqueeze(0)  # Add batch dimension
            
            # Get model output
            logits, attention_weights = model(token_ids, return_attention=True)
            
            # Get the embedding layer output (after positional encoding)
            embeddings = model.embedding(token_ids)
            
            print(f"\nSentence {i+1}: {sentence}")
            print(f"Token IDs: {tokens}")
            print(f"Embeddings shape: {embeddings.shape}")
            print(f"Logits shape: {logits.shape}")
            
            # Compare each sentence with the first one. The comparison is position by
            # position, so it only works when both sentences have the same token count.
            if i == 0:  # First sentence
                first_embeddings = embeddings.clone()
                print(f"First word embedding: {embeddings[0, 0, :3].tolist()!r}...")  # First 3 dims
            elif embeddings.shape == first_embeddings.shape:
                embedding_diff = torch.norm(embeddings - first_embeddings).item()
                print(f"Difference from first sentence: {embedding_diff:.3f}")
            else:
                print("Token counts differ; skipping comparison with first sentence")
    
    print("\n" + "=" * 50)
    print("Key Observations:")
    print("• Same words in different positions have different embeddings")
    print("• Different word orders produce different representations")
    print("• The model can distinguish between sentences with same words")
    print("• Positional encoding enables understanding of word order!")

test_positional_encoding_in_transformer()
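
The first observation above can also be checked in isolation, without the full model. The following is a minimal sketch that assumes a hypothetical toy vocabulary and standalone `nn.Embedding` layers (not the `GPTModel` embedding layer). It looks up the word "the", which occurs at positions 0 and 4, and compares its two vectors before and after position information is added.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 8
toy_vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}  # hypothetical vocabulary
words = "the cat sat on the mat".split()
token_ids = torch.tensor([toy_vocab[w] for w in words])
positions = torch.arange(len(words))

token_emb = nn.Embedding(len(toy_vocab), d_model)
pos_emb = nn.Embedding(len(words), d_model)  # learned-style position table (assumed)

with torch.no_grad():
    without_pos = token_emb(token_ids)
    with_pos = without_pos + pos_emb(positions)

# "the" appears at positions 0 and 4
same_without = torch.equal(without_pos[0], without_pos[4])
same_with = torch.equal(with_pos[0], with_pos[4])

print(f"Identical without position info: {same_without}")
print(f"Identical with position info:    {same_with}")
```

Without position information the two occurrences are the same table row, so the model has no way to tell them apart.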

## 🚨 Common Pitfalls and Debugging

Let's address the most common issues beginners face with positional encoding:

### Issue 1: Positional Encoding Overwhelms Word Embeddings

```python
def debug_magnitude_imbalance():
    """Show what happens when positional encoding is too strong."""
    
    print("⚠️  COMMON ISSUE: Magnitude Imbalance")
    print("=" * 45)
    
    # Create embeddings with different scales
    d_model = 4
    word_emb_small = torch.randn(5, d_model) * 0.1  # Small word embeddings
    pos_emb_large = create_sinusoidal_encoding(5, d_model) * 2.0  # Large pos encodings
    
    combined_imbalanced = word_emb_small + pos_emb_large
    
    word_magnitude = word_emb_small.norm(dim=1).mean()
    pos_magnitude = pos_emb_large.norm(dim=1).mean()
    combined_magnitude = combined_imbalanced.norm(dim=1).mean()
    
    print(f"Word embedding magnitude:      {word_magnitude:.3f}")
    print(f"Positional encoding magnitude: {pos_magnitude:.3f}")
    print(f"Combined magnitude:            {combined_magnitude:.3f}")
    print()
    print("🚫 PROBLEM: Position dominates word meaning!")
    print("💡 SOLUTION: Bring the two magnitudes into a similar range")
    print("   • Scale token embeddings up by sqrt(d_model), as in the original Transformer")
    print("   • Or multiply pos_enc by a small constant (e.g. 0.1)")
    print("   • Or use learned position embeddings with proper initialization")
    
    return word_magnitude, pos_magnitude

debug_magnitude_imbalance()
```
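
One widely used remedy comes from the original Transformer, which multiplies the token embeddings by `sqrt(d_model)` before adding the positional encoding. The sketch below compares row norms before and after that scaling. It assumes token embeddings initialized with standard deviation `d_model ** -0.5` (an illustrative choice, not necessarily what this notebook's model uses) and builds the sinusoidal table inline instead of calling `create_sinusoidal_encoding`.

```python
import math
import torch

torch.manual_seed(0)
d_model, seq_len = 512, 10

# Assumed initialization: std d_model**-0.5, so each row has a norm close to 1
token_emb = torch.randn(seq_len, d_model) * d_model ** -0.5

# Standard sinusoidal table; every row has norm sqrt(d_model / 2)
position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
div_term = torch.exp(
    torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
)
pos_enc = torch.zeros(seq_len, d_model)
pos_enc[:, 0::2] = torch.sin(position * div_term)
pos_enc[:, 1::2] = torch.cos(position * div_term)

pos_norm = pos_enc.norm(dim=1).mean().item()
raw_norm = token_emb.norm(dim=1).mean().item()
scaled_norm = (token_emb * math.sqrt(d_model)).norm(dim=1).mean().item()

print(f"Positional encoding norm:     {pos_norm:.2f}")
print(f"Token norm (unscaled):        {raw_norm:.2f}")
print(f"Token norm (x sqrt(d_model)): {scaled_norm:.2f}")
```

Because each sine/cosine pair contributes exactly 1 to the squared norm, the positional rows always have norm `sqrt(d_model / 2)`, which gives a fixed reference point when choosing the token embedding scale.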

### Issue 2: Sequence Length Mismatch

```python
def debug_sequence_length():
    """Show sequence length issues."""
    
    print("\n⚠️  COMMON ISSUE: Sequence Length Problems")
    print("=" * 50)
    
    max_len_trained = 50
    actual_sequence_len = 75
    
    print(f"Model trained with max_len: {max_len_trained}")
    print(f"Actual sequence length:     {actual_sequence_len}")
    print()
    
    if actual_sequence_len > max_len_trained:
        print("🚫 PROBLEM: Sequence longer than training length!")
        print("💡 SOLUTIONS:")
        print("   • Sinusoidal encoding: Can be computed for any length ✅ (quality on unseen lengths may still drop)")
        print("   • Learned embeddings: No entries beyond max_len; enlarge the table and retrain ❌")
        print("   • Relative position encoding: Often extrapolates better ✅")
        print("   • Sliding window: Process in chunks ⚠️")
    
    print("\n🔧 EXAMPLE FIXES:")
    print("# For learned embeddings:")
    print("pos_emb = LearnedPositionalEmbedding(max_len=100, d_model=512)  # Increase max_len, then retrain")
    print()
    print("# For sinusoidal (already works):")
    print("pos_enc = create_sinusoidal_encoding(any_length, d_model)  # ✅ Flexible")

debug_sequence_length()
```
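
The difference between the two approaches is easy to reproduce. In this sketch a learned table with 50 rows is indexed with 75 positions, which fails on CPU with an `IndexError`, while a sinusoidal table is simply recomputed at the longer length. The helper `sinusoidal_table` is a stand-in written for this example, not the notebook's `create_sinusoidal_encoding`.

```python
import math
import torch
import torch.nn as nn

max_len_trained, actual_len, d_model = 50, 75, 8
positions = torch.arange(actual_len)

# Learned table: only rows 0..49 exist, so positions 50..74 are out of range
learned_pos = nn.Embedding(max_len_trained, d_model)
try:
    learned_pos(positions)
    learned_ok = True
except IndexError as err:
    learned_ok = False
    print(f"Learned table failed: {err}")

def sinusoidal_table(length: int, d_model: int) -> torch.Tensor:
    """Sine/cosine table of shape (length, d_model); d_model must be even."""
    position = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    table = torch.zeros(length, d_model)
    table[:, 0::2] = torch.sin(position * div_term)
    table[:, 1::2] = torch.cos(position * div_term)
    return table

# Sinusoidal table: computed on demand for whatever length is needed
sin_enc = sinusoidal_table(actual_len, d_model)
print(f"Sinusoidal encoding shape: {tuple(sin_enc.shape)}")
```

Being able to compute the encoding does not guarantee that a model trained on short sequences behaves well on long ones, so longer inputs should still be evaluated explicitly.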

### Issue 3: Forgetting Positional Encoding

```python
def debug_missing_position():
    """Show what happens without positional encoding."""
    
    print("\n⚠️  COMMON ISSUE: Forgot to Add Positional Encoding")
    print("=" * 55)
    
    print("Symptoms:")
    print("  • Model can't distinguish word order")
    print("  • Attention patterns ignore position")
    print("  • Poor performance on order-dependent tasks")
    print("  • 'The cat sat' ≈ 'sat cat the' (same representation, up to reordering)")
    print()
    
    print("🔍 DEBUG CHECKLIST:")
    print("□ Are positional encodings being ADDED (not concatenated)?")
    print("□ Are word and position embedding magnitudes balanced?") 
    print("□ Is max_len sufficient for your sequences?")
    print("□ Are you using the right encoding type for your task?")
    print("□ Do attention patterns show positional awareness?")
    print()
    
    print("🔧 QUICK TEST (pseudocode; get_embeddings and cosine_similarity are placeholders):")
    print("sentences = ['A B C', 'C B A']")
    print("embeddings = model.get_embeddings(sentences)  # one vector per sentence")
    print("similar = cosine_similarity(embeddings[0], embeddings[1])")
    print("if similar > 0.9:  # rough threshold, tune for your model")
    print("    print('❌ Missing positional encoding!')")
    print("else:")
    print("    print('✅ Position encoding working!')")

debug_missing_position()
```
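
A runnable variant of the quick test relies on a property of self-attention: without position information, reversing the input tokens only reverses the output rows. The sketch below checks this for a single `nn.MultiheadAttention` layer with toy sizes. The `encode` helper, the token IDs standing in for "A B C", and the learned-style position table are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, seq_len = 8, 3

token_emb = nn.Embedding(10, d_model)
pos_emb = nn.Embedding(seq_len, d_model)  # learned-style position table (assumed)
attn = nn.MultiheadAttention(d_model, num_heads=2, batch_first=True)
attn.eval()

def encode(token_ids: torch.Tensor, use_position: bool) -> torch.Tensor:
    """Embed tokens, optionally add positions, then apply one self-attention layer."""
    x = token_emb(token_ids)
    if use_position:
        x = x + pos_emb(torch.arange(token_ids.size(1)))
    out, _ = attn(x, x, x)
    return out

abc = torch.tensor([[1, 2, 3]])  # stands in for "A B C"
cba = abc.flip(1)                # "C B A"

order_blind = {}
with torch.no_grad():
    for use_position in (False, True):
        # Order-blind means: output for reversed input == reversed output for original input
        order_blind[use_position] = torch.allclose(
            encode(cba, use_position), encode(abc, use_position).flip(1), atol=1e-5
        )

print(f"Order-blind without positions: {order_blind[False]}")
print(f"Order-blind with positions:    {order_blind[True]}")
```

If a full model still passes the order-blind check after training, position information is probably not reaching the attention layers.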

## 🎓 Summary and Best Practices

In this notebook, we've explored how transformers understand position:

### Key Concepts Mastered:

1. **The Position Problem** - Why transformers need explicit positional information
2. **Mathematical Intuition** - Why sine/cosine functions are well suited to position encoding
3. **Addition vs Concatenation** - The critical design choice and why addition wins
4. **Sinusoidal vs Learned** - Trade-offs between different approaches
5. **Debugging Skills** - How to identify and fix common issues

### Design Principles:

- **Uniqueness**: Each position gets a unique signature
- **Boundedness**: Values stay in reasonable ranges
- **Smoothness**: Similar positions have similar encodings
- **Efficiency**: Addition preserves dimensionality
- **Flexibility**: Choose encoding type based on task requirements

### Best Practices:

✅ **Use sinusoidal encoding** when sequences may exceed the lengths seen in training  
✅ **Use learned embeddings** when lengths stay within `max_len` and task-specific position patterns may help  
✅ **Balance magnitudes** between word and position embeddings  
✅ **Prefer addition over concatenation** to keep the model dimension unchanged  
✅ **Test with scrambled sentences** to verify position encoding works  

### Common Misconceptions Corrected:

❌ "Position encoding is just added noise" → ✅ It's structured information that enables order understanding  
❌ "Concatenation is more explicit" → ✅ Addition creates richer blended representations  
❌ "Learned embeddings are always better" → ✅ Depends on task and sequence length requirements  
❌ "Position encoding works like human position perception" → ✅ It's mathematical encoding optimized for transformers  

### 🔬 Try This Yourself!

Before moving on, experiment with:
1. What happens if you remove positional encoding entirely?
2. How do different d_model sizes affect position discrimination?
3. Can you visualize attention patterns with/without position encoding?
4. What's the effect of different positional encoding magnitudes?

**Next**: We'll explore how to train transformers and see these concepts in action!