# Positional Encoding: Teaching Transformers About Position

Transformers process all positions in parallel, which creates a problem: **how does the model know about word order?**

Without positional information, self-attention sees only a bag of tokens, so "cat sat on mat" and "mat on sat cat" would be indistinguishable!

## What You'll Learn

1. **The Position Problem** - Why transformers need positional information
2. **Sinusoidal Solution** - The elegant mathematical approach
3. **Addition vs Concatenation** - Why we add instead of concatenate
4. **Implementation** - Building positional encoding from scratch

Let's solve the position puzzle!

In [None]:
import sys
import os
sys.path.append('..')

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Tuple, Optional
import math

# Set style for better plots
plt.style.use('default')
sns.set_palette("husl")

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print("Environment setup complete!")

## 1. The Position Problem

Let's see why transformers need positional encoding:

In [None]:
def demonstrate_position_problem():
    """Show how transformers without position encoding can't distinguish word order."""
    
    print("🚨 THE POSITION PROBLEM")
    
    # Two sentences with different word order
    sentence1 = ["cat", "sat", "on", "mat"]
    sentence2 = ["mat", "on", "sat", "cat"]  # Scrambled!
    
    print(f"Sentence 1: {' '.join(sentence1)}")
    print(f"Sentence 2: {' '.join(sentence2)}")
    print()
    
    # Simple word embeddings (just for demonstration)
    word_embeddings = {
        "cat": [1, 0, 0],
        "sat": [0, 1, 0], 
        "on": [0, 0, 1],
        "mat": [1, 1, 0]
    }
    
    # Without position info, both sentences have same words
    words1 = [word_embeddings[word] for word in sentence1]
    words2 = [word_embeddings[word] for word in sentence2]
    
    # Sum of embeddings (bag-of-words)
    sum1 = [sum(x) for x in zip(*words1)]
    sum2 = [sum(x) for x in zip(*words2)]
    
    print("Without position encoding:")
    print(f"Sentence 1 representation: {sum1}")
    print(f"Sentence 2 representation: {sum2}")
    print(f"Are they identical? {sum1 == sum2}")
    print()
    print("❌ Problem: Can't tell word order apart!")
    print("✅ Solution: Add position information to each word")

demonstrate_position_problem()
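
The bag-of-words sum above is a simplification. A closer statement of the problem is that self-attention on its own is permutation-equivariant: shuffling the input tokens only shuffles the output rows. The sketch below checks this in plain Python, assuming a stripped-down attention with identity projections (Q = K = V = x). The helper `self_attention` is illustrative only and is not part of this notebook's code.

```python
import math

def self_attention(x):
    """Single-head self-attention with identity projections (Q = K = V = x)."""
    d = len(x[0])
    out = []
    for q in x:
        # Scaled dot-product score of this query against every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        # Numerically stable softmax over the scores
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

# Same toy embeddings as above, in the order: cat, sat, on, mat
x = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
perm = [3, 2, 1, 0]                 # "mat on sat cat"
x_perm = [x[i] for i in perm]

out = self_attention(x)
out_perm = self_attention(x_perm)

# Row i of the permuted output should equal row perm[i] of the original output
equivariant = all(
    math.isclose(a, b, abs_tol=1e-9)
    for i, p in enumerate(perm)
    for a, b in zip(out_perm[i], out[p])
)
print(f"Permuting the input only permutes the output: {equivariant}")
```

Each token gets the same output vector wherever it sits in the sequence, so nothing in the output reveals the original order. That is the gap positional encoding fills.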

## 2. Sinusoidal Positional Encoding: The Solution

The original transformer paper used **sinusoidal positional encoding**. Here's why it's brilliant:

### Why Sine and Cosine? 🌊

**Requirements for position encoding:**
- ✅ **Unique**: Each position needs a different "signature"
- ✅ **Bounded**: Values shouldn't grow with position (position 1000 should not be encoded as the raw value 1000)
- ✅ **Smooth**: Nearby positions should be similar
- ✅ **Consistent**: Works for any sequence length

**Sine and cosine meet all requirements:**
- Always between -1 and 1 (bounded)
- Smooth, continuous functions
- Create unique patterns when combined at different frequencies

### The Formula

For position `pos` and dimension-pair index `i` (so `i` runs from `0` to `d_model/2 - 1`):

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

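**Translation:**
- Even dimensions (0, 2, 4...) get sine waves
- Odd dimensions (1, 3, 5...) get cosine waves
- Each pair of dimensions `(2i, 2i+1)` shares one frequency, and the frequency drops as `i` grows, so low dimensions oscillate quickly and high dimensions slowly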
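
To make the formula concrete, here is a minimal sketch that evaluates single entries directly, without any tensor library. The helper name `pe_value` is made up for this illustration.

```python
import math

def pe_value(pos: int, dim: int, d_model: int) -> float:
    """Evaluate one entry PE[pos, dim] straight from the formula."""
    i = dim // 2  # pair index: dimensions 2i and 2i+1 share a frequency
    angle = pos / (10000 ** (2 * i / d_model))
    # Even dimensions use sine, odd dimensions use cosine
    return math.sin(angle) if dim % 2 == 0 else math.cos(angle)

d_model = 4
for pos in range(3):
    row = [pe_value(pos, dim, d_model) for dim in range(d_model)]
    print(f"pos={pos}: {[round(v, 4) for v in row]}")

# The formula is closed-form, so any position can be evaluated on demand
far_row = [pe_value(5000, dim, d_model) for dim in range(d_model)]
print(f"pos=5000: {[round(v, 4) for v in far_row]}")
```

Position 0 always encodes to alternating 0 and 1, because `sin(0) = 0` and `cos(0) = 1` at every frequency.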

In [None]:
import math

def create_sinusoidal_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Create sinusoidal positional encoding."""
    
    print(f"Creating positional encoding for max_len={max_len}, d_model={d_model}")
    
    # Initialize encoding matrix
    pe = torch.zeros(max_len, d_model)
    
    # Create position indices [0, 1, 2, ..., max_len-1]
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
    
    # Create frequency terms for different dimensions
    # Each dimension pair gets a different frequency
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * 
                        (-math.log(10000.0) / d_model))
    
    # Apply sine to even dimensions (0, 2, 4, ...)
    pe[:, 0::2] = torch.sin(position * div_term)
    
    # Apply cosine to odd dimensions (1, 3, 5, ...)
    pe[:, 1::2] = torch.cos(position * div_term)
    
    print(f"✅ Created encoding with shape: {pe.shape}")
    print(f"Value range: [{pe.min():.3f}, {pe.max():.3f}]")
    
    return pe

# Create positional encoding
max_len, d_model = 10, 8
pos_encoding = create_sinusoidal_encoding(max_len, d_model)

# Show the first few positions
print(f"\nFirst 5 position encodings:")
for i in range(5):
    print(f"Position {i}: {pos_encoding[i].tolist()!r}")

# Visualize the encoding pattern
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
# Heatmap showing the pattern
sns.heatmap(pos_encoding.T, cmap='RdBu_r', center=0, 
            xticklabels=range(max_len), yticklabels=range(d_model))
plt.title('Positional Encoding Heatmap')
plt.xlabel('Position')
plt.ylabel('Dimension')

plt.subplot(1, 2, 2) 
# Show how different dimensions change over position
for dim in [0, 1, 6, 7]:  # Show a few dimensions
    plt.plot(pos_encoding[:, dim], label=f'Dim {dim}', linewidth=2)
plt.title('Encoding Values by Position')
plt.xlabel('Position')
plt.ylabel('Value')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n🔑 Key observations:")
print("• Each position gets a unique pattern")
print("• Values stay bounded between -1 and 1") 
print("• Different dimensions oscillate at different speeds")
print("• Pattern creates unique 'fingerprint' for each position")
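
The requirements listed earlier (unique, bounded, smooth) can be checked numerically. The sketch below does so in plain Python for `d_model=8` and 10 positions; `sinusoidal_row` and `dot` are hypothetical helpers written for this check. It relies on the identity `sin(a)sin(b) + cos(a)cos(b) = cos(a - b)`, which means the dot product of two encodings depends only on the offset between the positions.

```python
import math

def sinusoidal_row(pos: int, d_model: int) -> list:
    """Encoding for one position: [sin, cos, sin, cos, ...] over d_model/2 frequencies."""
    row = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        row.extend([math.sin(angle), math.cos(angle)])
    return row

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

d_model, max_len = 8, 10
pe = [sinusoidal_row(p, d_model) for p in range(max_len)]

# Bounded: every value lies in [-1, 1]
bounded = all(-1.0 <= v <= 1.0 for row in pe for v in row)
# Unique: no two positions share the same vector
unique = len({tuple(row) for row in pe}) == max_len
# Similarity with a neighbour vs. a position 8 steps away
sim_near = dot(pe[0], pe[1])
sim_far = dot(pe[0], pe[8])

print(f"Bounded: {bounded}, unique: {unique}")
print(f"dot(pos 0, pos 1) = {sim_near:.3f}")
print(f"dot(pos 0, pos 8) = {sim_far:.3f}")
print(f"Offset 3 from pos 0: {dot(pe[0], pe[3]):.6f}, from pos 2: {dot(pe[2], pe[5]):.6f}")
```

For these offsets the neighbouring position is the more similar one. Similarity is not guaranteed to fall monotonically with distance in general, since each term is a cosine of the offset.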

## 3. Why ADD Instead of Concatenate?

This is crucial! Why do we **add** positional encodings to word embeddings instead of **concatenating** them?

In [None]:
def explain_addition_vs_concatenation():
    """Show why we add instead of concatenate."""
    
    print("🤔 ADDITION VS CONCATENATION")
    print("=" * 40)
    
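    # Example embeddings
    d_model = 4
    word_emb = torch.tensor([1.0, 0.5, -0.3, 0.8])  # Word: "cat"
    # Build a d_model=4 encoding so the shapes match the word embedding
    # (the global pos_encoding has d_model=8 and cannot be added to a 4-dim vector)
    pos_emb = create_sinusoidal_encoding(2, d_model)[1]  # Position 1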
    
    print(f"Word embedding:     {word_emb.tolist()}")
    print(f"Position embedding: {pos_emb.tolist()!r}")
    print()
    
    # Addition (what transformers do)
    added = word_emb + pos_emb
    print("ADDITION APPROACH:")
    print(f"Combined: {added.tolist()!r} (shape: {added.shape})")
    print("✅ Same dimensionality as original")
    print("✅ Word and position info blended together")
    print("✅ Attention sees unified 'word-at-position' representation")
    print()
    
    # Concatenation (alternative approach)  
    concatenated = torch.cat([word_emb, pos_emb])
    print("CONCATENATION APPROACH:")
    print(f"Combined: {concatenated.tolist()!r} (shape: {concatenated.shape})")
    print("❌ Doubles the dimensionality (expensive!)")
    print("❌ Word and position info separated")
    print("❌ Attention must learn to combine them")
    print()
    
    print("🎯 WHY ADDITION WINS:")
    print("• More efficient (same dimensions)")
    print("• Creates richer word-position interactions")  
    print("• Attention naturally considers both word + position")
    print("• Model learns to balance word meaning vs position")

explain_addition_vs_concatenation()

# Show the solution in action
def solve_position_problem():
    """Show how position encoding solves the original problem."""
    
    print("\n🎉 SOLVING THE ORIGINAL PROBLEM")
    print("=" * 35)
    
    # Our problem sentences
    sentence1 = ["cat", "sat", "on", "mat"] 
    sentence2 = ["mat", "on", "sat", "cat"]
    
    # Simple word embeddings
    word_embeddings = torch.tensor([
        [1, 0, 0, 0],  # cat
        [0, 1, 0, 0],  # sat  
        [0, 0, 1, 0],  # on
        [1, 1, 0, 0],  # mat
    ])
    
    # Map words to embeddings
    word_to_idx = {"cat": 0, "sat": 1, "on": 2, "mat": 3}
    
    # Get word embeddings for both sentences
    emb1 = torch.stack([word_embeddings[word_to_idx[word]] for word in sentence1])
    emb2 = torch.stack([word_embeddings[word_to_idx[word]] for word in sentence2])
    
    # Add positional encoding
    pos_enc_small = create_sinusoidal_encoding(4, 4)
    combined1 = emb1.float() + pos_enc_small
    combined2 = emb2.float() + pos_enc_small
    
    print(f"Sentence 1: {' '.join(sentence1)}")
    print(f"Sentence 2: {' '.join(sentence2)}")
    print()
    
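    # Compare the vector for the same word ("cat") in each sentence.
    # Summing over positions would not work here: sum(emb + pos) equals
    # sum(emb) + sum(pos), which is identical for any word order.
    cat1 = combined1[sentence1.index("cat")]  # "cat" at position 0
    cat2 = combined2[sentence2.index("cat")]  # "cat" at position 3
    are_different = not torch.allclose(cat1, cat2, atol=1e-6)
    
    print("After adding positional encoding:")
    print(f"'cat' in sentence 1: {cat1.tolist()!r}")
    print(f"'cat' in sentence 2: {cat2.tolist()!r}")
    print(f"Are they different? {are_different}")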
    
    if are_different:
        print("\n✅ SUCCESS! Position encoding solved the problem!")
        print("✅ Different word orders → Different representations")
        print("✅ Model can now understand word order")
    else:
        print("❌ Something went wrong...")

solve_position_problem()
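
One way to see how addition relates to concatenation: adding and then applying a linear projection gives the same result as concatenating and then applying a projection whose two halves are tied to the same weights. The sketch below checks this in plain Python. The projection matrix `W` holds arbitrary made-up values, and `matvec` is a helper written for this illustration.

```python
import math

def matvec(vec, mat):
    """Row vector times matrix (vec @ mat) using plain lists."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

word = [1.0, 0.5, -0.3, 0.8]  # toy word embedding, d_model = 4
# Sinusoidal encoding of position 1 for d_model = 4
pos = [math.sin(1.0), math.cos(1.0), math.sin(0.01), math.cos(0.01)]

# Hypothetical 4x2 projection (values chosen arbitrarily for the demo)
W = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1], [0.0, 0.7]]

# Addition: project the 4-dim sum
added = [w + p for w, p in zip(word, pos)]
via_add = matvec(added, W)

# Concatenation: project the 8-dim vector with W stacked on top of itself
concat = word + pos   # list concatenation, length 8
W_tied = W + W        # 8x2 matrix, both halves share the same weights
via_concat = matvec(concat, W_tied)

print(f"Add then project:    {[round(v, 6) for v in via_add]}")
print(f"Concat then project: {[round(v, 6) for v in via_concat]}")
```

Under this view, addition is concatenation with a weight-sharing constraint: it gives up the freedom to project word and position separately in exchange for half the input width.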

## 4. Complete Implementation

Let's create a simple positional embedding layer:

In [None]:
class PositionalEmbedding(nn.Module):
    """Combines word embeddings with sinusoidal positional encoding."""
    
    def __init__(self, vocab_size: int, max_len: int, d_model: int):
        super().__init__()
        
        # Word embeddings
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        
        # Sinusoidal positional encoding (not learned)
        pos_encoding = create_sinusoidal_encoding(max_len, d_model)
        self.register_buffer('pos_encoding', pos_encoding)
    
    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        """
        Args:
            token_ids: [batch_size, seq_len]
        Returns:
            embeddings: [batch_size, seq_len, d_model]
        """
        batch_size, seq_len = token_ids.shape
        
        # Get word embeddings
        word_emb = self.token_embedding(token_ids)  # [batch_size, seq_len, d_model]
        
        # Get positional encoding for this sequence length
        pos_emb = self.pos_encoding[:seq_len].unsqueeze(0)  # [1, seq_len, d_model]
        pos_emb = pos_emb.expand(batch_size, -1, -1)  # [batch_size, seq_len, d_model]
        
        # Add them together (the key insight!)
        embeddings = word_emb + pos_emb
        
        return embeddings

# Test the complete implementation
vocab_size, max_len, d_model = 100, 20, 8
pos_emb_layer = PositionalEmbedding(vocab_size, max_len, d_model)

# Create test input
batch_size, seq_len = 2, 5
token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))

# Forward pass
embeddings = pos_emb_layer(token_ids)

print(f"Input shape: {token_ids.shape}")
print(f"Output shape: {embeddings.shape}")
print("✅ Complete positional embedding layer working!")

# Show what happens to the same token at different positions
token_id = 42  # Arbitrary token
positions = [0, 1, 2, 3]

print(f"\nSame token (ID={token_id}) at different positions:")
for pos in positions:
    # Create input with our token at this position
    test_input = torch.full((1, pos + 1), token_id, dtype=torch.long)
    test_output = pos_emb_layer(test_input)
    final_embedding = test_output[0, pos]  # Get embedding at the position
    
    print(f"Position {pos}: {final_embedding[:3].tolist()!r}...") # Show first 3 dims

print("\n🔑 Notice: Same token gets different embeddings at different positions!")

## Summary

You've mastered positional encoding - the key to teaching transformers about word order!

### Key Concepts:
1. **The Problem**: Transformers process all positions in parallel and need explicit position information
2. **Sinusoidal Solution**: Sine and cosine functions create unique, bounded position signatures  
3. **Addition > Concatenation**: Adding keeps the representation at `d_model` dimensions while still giving each token a position-dependent vector
4. **Implementation**: A fixed sinusoidal table (or a learned one) is added to the token embeddings

### What's Next?
Now you understand all the core transformer components:
- **Tokenization** (notebook 0) - Text → numbers
- **Attention** (notebook 1) - How to focus on relevant information
- **Transformer blocks** (notebook 2) - Complete processing units
- **Position encoding** (notebook 3) - Understanding word order

Ready to see it all working together in a complete transformer! 🚀

In [None]:
from src.model.embeddings import GPTEmbedding

class PositionalEmbedding(nn.Module):
    """Complete positional embedding layer with multiple options."""
    
    def __init__(self, vocab_size: int, max_len: int, d_model: int, 
                 pos_type: str = "learned", dropout: float = 0.1):
        super().__init__()
        self.d_model = d_model
        self.pos_type = pos_type
        
        # Token embeddings
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        
        # Positional embeddings
        if pos_type == "learned":
            self.pos_embedding = nn.Embedding(max_len, d_model)
        elif pos_type == "sinusoidal":
            # Register as buffer (not a parameter)
            pos_encoding = create_sinusoidal_encoding(max_len, d_model)
            self.register_buffer('pos_encoding', pos_encoding)
        else:
            raise ValueError(f"Unknown pos_type: {pos_type}")
        
        self.dropout = nn.Dropout(dropout)
        
        # Initialize token embeddings
        nn.init.normal_(self.token_embedding.weight, std=0.02)
        if pos_type == "learned":
            nn.init.normal_(self.pos_embedding.weight, std=0.02)
    
    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        """
        Forward pass: combine token and positional embeddings.
        
        Args:
            token_ids: Token IDs [batch_size, seq_len]
            
        Returns:
            Combined embeddings [batch_size, seq_len, d_model]
        """
        batch_size, seq_len = token_ids.shape
        
        # Token embeddings
        token_emb = self.token_embedding(token_ids)  # [batch_size, seq_len, d_model]
        
        # Positional embeddings
        if self.pos_type == "learned":
            positions = torch.arange(seq_len, device=token_ids.device)
            pos_emb = self.pos_embedding(positions)  # [seq_len, d_model]
            pos_emb = pos_emb.unsqueeze(0).expand(batch_size, -1, -1)  # [batch_size, seq_len, d_model]
        else:  # sinusoidal
            pos_emb = self.pos_encoding[:seq_len].unsqueeze(0)  # [1, seq_len, d_model]
            pos_emb = pos_emb.expand(batch_size, -1, -1)  # [batch_size, seq_len, d_model]
        
        # Combine embeddings
        embeddings = token_emb + pos_emb
        
        # Apply dropout
        embeddings = self.dropout(embeddings)
        
        return embeddings

# Test both types of positional embedding
vocab_size, max_len, d_model = 100, 20, 8
batch_size, seq_len = 2, 10

# Create test input
token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))

# Test learned embeddings
learned_emb = PositionalEmbedding(vocab_size, max_len, d_model, pos_type="learned")
output_learned = learned_emb(token_ids)

# Test sinusoidal embeddings
sinusoidal_emb = PositionalEmbedding(vocab_size, max_len, d_model, pos_type="sinusoidal")
output_sinusoidal = sinusoidal_emb(token_ids)

print(f"Input token IDs shape: {token_ids.shape}")
print(f"Output embeddings shape: {output_learned.shape}")
print()

# Compare parameter counts
learned_params = sum(p.numel() for p in learned_emb.parameters())
sinusoidal_params = sum(p.numel() for p in sinusoidal_emb.parameters())

print(f"Parameter comparison:")
print(f"Learned embeddings:    {learned_params:,} parameters")
print(f"Sinusoidal embeddings: {sinusoidal_params:,} parameters")
print(f"Difference: {learned_params - sinusoidal_params:,} (positional embedding table)")

# Visualize the embeddings for first batch
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Learned variant: the heatmap shows the combined token + position output
# (the modules are in training mode here, so dropout zeroes some entries)
sns.heatmap(output_learned[0].detach().T, cmap='viridis', ax=ax1,
            xticklabels=range(seq_len), yticklabels=range(d_model))
ax1.set_title('Token + Learned Positional Embeddings')
ax1.set_xlabel('Position')
ax1.set_ylabel('Dimension')

# Sinusoidal variant: again the combined token + position output
sns.heatmap(output_sinusoidal[0].detach().T, cmap='viridis', ax=ax2,
            xticklabels=range(seq_len), yticklabels=range(d_model))
ax2.set_title('Token + Sinusoidal Positional Embeddings')
ax2.set_xlabel('Position')
ax2.set_ylabel('Dimension')

plt.tight_layout()
plt.show()

print("\n✅ Both embedding types work and produce the same output shape!")
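
The parameter comparison above can be reproduced by hand. This is a small arithmetic sketch, assuming the only trainable tensors are the embedding tables (dropout has no parameters, and a registered buffer is not counted by `parameters()`).

```python
vocab_size, max_len, d_model = 100, 20, 8

# Token table: one d_model-sized row per vocabulary entry
token_table = vocab_size * d_model
# Learned position table: one d_model-sized row per position
position_table = max_len * d_model

learned_total = token_table + position_table
sinusoidal_total = token_table  # the sinusoidal table is a fixed buffer

print(f"Learned:    {learned_total:,} parameters")
print(f"Sinusoidal: {sinusoidal_total:,} parameters")
print(f"Difference: {learned_total - sinusoidal_total:,}")
```

The difference is exactly `max_len * d_model`, so it grows with the maximum sequence length the learned variant has to support.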