# Positional Encoding: Teaching Transformers About Position

Transformers process all positions in parallel, which is great for speed but creates a problem: **how does the model know about word order?** Self-attention on its own treats its input as an unordered set of tokens, so without positional information "cat sat on mat" and "mat on sat cat" would be the same bag of words to the model!

## What You'll Learn

1. **The Position Problem** - Why transformers need positional information
2. **Sinusoidal Encoding** - The original transformer's elegant solution
3. **Learned Embeddings** - A simple alternative approach
4. **Relative Positions** - Modern improvements
5. **Visualizing Patterns** - Understanding how position encoding works

Let's solve the position puzzle!

In [None]:
import sys
import os
sys.path.append('..')

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Tuple, Optional
import math

# Set style for better plots
plt.style.use('default')
sns.set_palette("husl")

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print("Environment setup complete!")

## 1. The Position Problem

Let's demonstrate why transformers need positional encoding: without position information, a sentence reduces to an unordered bag of word embeddings.

In [None]:
def demonstrate_position_problem():
    """Show how transformers without position encoding can't distinguish word order."""
    
    # Create simple word embeddings
    vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
    d_model = 4
    
    # Simple embeddings (just for demonstration)
    embeddings = torch.tensor([
        [1.0, 0.0, 0.0, 0.0],  # "the"
        [0.0, 1.0, 0.0, 0.0],  # "cat"
        [0.0, 0.0, 1.0, 0.0],  # "sat"
        [0.0, 0.0, 0.0, 1.0],  # "on"
        [1.0, 1.0, 0.0, 0.0],  # "mat"
    ])
    
    # Two sentences with different word order
    sentence1 = ["the", "cat", "sat", "on", "mat"]  # Normal order
    sentence2 = ["cat", "the", "mat", "sat", "on"]  # Scrambled order
    
    # Convert to embeddings
    emb1 = torch.stack([embeddings[vocab[word]] for word in sentence1])
    emb2 = torch.stack([embeddings[vocab[word]] for word in sentence2])
    
    print("Sentence 1:", " ".join(sentence1))
    print("Sentence 2:", " ".join(sentence2))
    print()
    
    # Show that without position, both sentences have same information
    print("Word embeddings (same for both sentences):")
    unique_words = sorted(set(sentence1))
    for word in unique_words:
        print(f"{word}: {embeddings[vocab[word]].tolist()}")
    print()
    
    # Compare sentence representations (sum of embeddings)
    sum1 = emb1.sum(dim=0)
    sum2 = emb2.sum(dim=0)
    
    print("Sum of embeddings (bag-of-words):")
    print(f"Sentence 1: {sum1.tolist()}")
    print(f"Sentence 2: {sum2.tolist()}")
    print(f"Are they equal? {torch.allclose(sum1, sum2)}")
    print()
    print("Problem: Without position info, transformers can't distinguish word order!")
    
    return emb1, emb2, vocab, embeddings

emb1, emb2, vocab, word_embeddings = demonstrate_position_problem()
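
The bag-of-words sum is only a proxy for what happens inside a transformer. As a minimal sketch (single head, random weights, no masking; `self_attention` and the weight names are illustrative and not part of this notebook's codebase), the following checks that self-attention without positional information is permutation-equivariant: shuffling the input rows only shuffles the output rows.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention on x of shape [seq_len, d]."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

d = 4
x = torch.randn(5, d)                      # 5 tokens, no positional information
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
perm = torch.tensor([1, 0, 4, 2, 3])       # same scramble as sentence2 above

out = self_attention(x, w_q, w_k, w_v)
out_perm = self_attention(x[perm], w_q, w_k, w_v)

# Attention on the shuffled input equals the shuffled attention output
equivariant = torch.allclose(out[perm], out_perm, atol=1e-4)
print(f"Permutation-equivariant? {equivariant}")
```

Each token receives the same output vector regardless of where it sits in the sequence, which is why order has to be injected separately.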

## 2. Sinusoidal Positional Encoding

The original transformer paper introduced sinusoidal positional encoding. The idea is elegant: use sine and cosine functions of different frequencies to create unique position signatures.

For position $pos$ and dimension-pair index $i$ (with $0 \le i < d_{model}/2$):

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
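
To make the formula concrete, here is a small worked example that evaluates it entry by entry with plain Python. The helper name `pe_value` is hypothetical; it takes the raw dimension index and derives the pair index $i$ from it.

```python
import math

def pe_value(pos: int, dim: int, d_model: int) -> float:
    """Evaluate one entry PE[pos, dim] directly from the formula."""
    i = dim // 2                                   # pair index shared by dims 2i and 2i+1
    angle = pos / (10000 ** (2 * i / d_model))
    return math.sin(angle) if dim % 2 == 0 else math.cos(angle)

d_model = 4
row0 = [pe_value(0, dim, d_model) for dim in range(d_model)]
row1 = [pe_value(1, dim, d_model) for dim in range(d_model)]

print(row0)  # [0.0, 1.0, 0.0, 1.0] since sin(0) = 0 and cos(0) = 1
print(row1)  # [sin(1), cos(1), sin(0.01), cos(0.01)]
```

With `d_model = 4` there are two frequency pairs: the first advances by 1 radian per position, the second by only 0.01 radian.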

In [None]:
def create_sinusoidal_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """
    Create sinusoidal positional encoding.
    
    Args:
        max_len: Maximum sequence length
        d_model: Model dimension (must be even)
    
    Returns:
        Positional encoding tensor [max_len, d_model]
    """
    if d_model % 2 != 0:
        raise ValueError(f"d_model must be even, got {d_model}")
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
    
    # Create the div_term for the sinusoidal pattern
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * 
                        (-math.log(10000.0) / d_model))
    
    # Apply sine to even dimensions
    pe[:, 0::2] = torch.sin(position * div_term)
    
    # Apply cosine to odd dimensions
    pe[:, 1::2] = torch.cos(position * div_term)
    
    return pe

# Create positional encoding
max_len, d_model = 50, 8
pos_encoding = create_sinusoidal_encoding(max_len, d_model)

print(f"Positional encoding shape: {pos_encoding.shape}")
print(f"First few positions:")
for i in range(5):
    print(f"Position {i}: {pos_encoding[i][:4].tolist()!r}...")  # Show first 4 dimensions

# Visualize the positional encoding
plt.figure(figsize=(12, 8))

# Plot the full encoding as a heatmap
plt.subplot(2, 2, 1)
sns.heatmap(pos_encoding[:20].T, cmap='RdBu_r', center=0, 
            xticklabels=range(20), yticklabels=range(d_model))
plt.title('Positional Encoding Heatmap\n(First 20 positions)')
plt.xlabel('Position')
plt.ylabel('Dimension')

# Plot individual dimensions
plt.subplot(2, 2, 2)
for dim in [0, 1, 2, 3]:
    plt.plot(pos_encoding[:30, dim], label=f'Dim {dim}')
plt.title('Individual Dimensions Over Position')
plt.xlabel('Position')
plt.ylabel('Value')
plt.legend()
plt.grid(True, alpha=0.3)

# Show the frequency pattern
plt.subplot(2, 2, 3)
frequencies = []
for i in range(0, d_model, 2):
    freq = 1 / (10000 ** (i / d_model))
    frequencies.append(freq)

plt.bar(range(len(frequencies)), frequencies)
plt.title('Frequencies by Dimension Pair')
plt.xlabel('Dimension Pair (i//2)')
plt.ylabel('Frequency')
plt.yscale('log')

# Show similarity between positions
plt.subplot(2, 2, 4)
# Compute cosine similarity between positions
similarities = []
pos_0 = pos_encoding[0]
for i in range(20):
    pos_i = pos_encoding[i]
    similarity = F.cosine_similarity(pos_0.unsqueeze(0), pos_i.unsqueeze(0))
    similarities.append(similarity.item())

plt.plot(similarities, 'o-')
plt.title('Similarity to Position 0')
plt.xlabel('Position')
plt.ylabel('Cosine Similarity')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nKey Properties of Sinusoidal Encoding:")
print("• Each position has a unique signature")
print("• Different frequencies capture different granularities")
print("• Low frequencies change slowly (long-range patterns)")
print("• High frequencies change quickly (fine-grained positions)")
print("• PE(pos + k) is a linear function of PE(pos), which helps the model attend by relative position")
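
The relative-position property follows from the angle-addition identities: shifting a position by `k` rotates each (sine, cosine) pair by a fixed angle that depends only on `k`. The sketch below checks this numerically; `sinusoidal` and `shift_matrix` are illustrative helpers written for this example, not functions from the notebook.

```python
import math
import torch

def sinusoidal(max_len: int, d_model: int):
    """Return the encoding table [max_len, d_model] and the per-pair frequencies."""
    pos = torch.arange(max_len, dtype=torch.float64).unsqueeze(1)
    freq = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float64)
                     * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model, dtype=torch.float64)
    pe[:, 0::2] = torch.sin(pos * freq)
    pe[:, 1::2] = torch.cos(pos * freq)
    return pe, freq

def shift_matrix(k: int, freq: torch.Tensor) -> torch.Tensor:
    """Block-diagonal matrix M with PE[pos + k] = M @ PE[pos] for every pos."""
    d_model = 2 * len(freq)
    m = torch.zeros(d_model, d_model, dtype=torch.float64)
    for j, w in enumerate(freq.tolist()):
        c, s = math.cos(k * w), math.sin(k * w)
        # sin(a + b) =  c * sin(a) + s * cos(a)
        # cos(a + b) = -s * sin(a) + c * cos(a),  with b = k * w
        m[2 * j, 2 * j], m[2 * j, 2 * j + 1] = c, s
        m[2 * j + 1, 2 * j], m[2 * j + 1, 2 * j + 1] = -s, c
    return m

pe, freq = sinusoidal(max_len=20, d_model=8)
k = 3
m = shift_matrix(k, freq)

# One matrix, independent of pos, maps every row to the row k steps later
shift_is_linear = torch.allclose(pe[:-k] @ m.T, pe[k:])
print(f"PE[pos + {k}] == M @ PE[pos] for all pos? {shift_is_linear}")
```

Because `M` does not depend on `pos`, a linear layer could in principle represent "look `k` tokens back" with a single set of weights.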

## 3. Learned Positional Embeddings

An alternative approach is to learn positional embeddings just like word embeddings. This is simpler, but the embedding table has a fixed size, so the model cannot handle sequences longer than the `max_len` chosen before training.

In [None]:
class LearnedPositionalEmbedding(nn.Module):
    """Learned positional embeddings."""
    
    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        self.embedding = nn.Embedding(max_len, d_model)
        self.max_len = max_len
        
        # Initialize with small random values
        nn.init.normal_(self.embedding.weight, std=0.02)
    
    def forward(self, seq_len: int) -> torch.Tensor:
        """Get positional embeddings for sequence length."""
        if seq_len > self.max_len:
            raise ValueError(f"Sequence length {seq_len} exceeds max_len {self.max_len}")
        
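        # Build position indices on the same device as the embedding table
        positions = torch.arange(seq_len, device=self.embedding.weight.device)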
        return self.embedding(positions)

# Create learned positional embeddings
learned_pos = LearnedPositionalEmbedding(max_len=50, d_model=8)

# Get embeddings for a sequence
seq_len = 10
learned_encoding = learned_pos(seq_len)

print(f"Learned positional encoding shape: {learned_encoding.shape}")
print(f"First few positions (before training):")
for i in range(5):
    print(f"Position {i}: {learned_encoding[i][:4].detach().tolist()!r}...")  # Show first 4 dimensions

# Compare sinusoidal vs learned (before training)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Sinusoidal encoding
sns.heatmap(pos_encoding[:seq_len].T, cmap='RdBu_r', center=0, ax=ax1,
            xticklabels=range(seq_len), yticklabels=range(d_model))
ax1.set_title('Sinusoidal Positional Encoding')
ax1.set_xlabel('Position')
ax1.set_ylabel('Dimension')

# Learned encoding (random initialization)
sns.heatmap(learned_encoding.detach().T, cmap='RdBu_r', center=0, ax=ax2,
            xticklabels=range(seq_len), yticklabels=range(d_model))
ax2.set_title('Learned Positional Encoding\n(Random Initialization)')
ax2.set_xlabel('Position')
ax2.set_ylabel('Dimension')

plt.tight_layout()
plt.show()

print("\nComparison:")
print("Sinusoidal Encoding:")
print("  ✓ Can be computed for any sequence length")
print("  ✓ Has mathematical structure (relative positions)")
print("  ✓ No additional parameters")
print("\nLearned Encoding:")
print("  ✓ Can adapt to specific tasks")
print("  ✓ Can work better in practice, depending on the task")
print("  ✗ Limited to max_len positions")
print("  ✗ Requires additional parameters")
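
The length limit of learned embeddings is easy to see with a bare `nn.Embedding` table. This is a small sketch, with the table size of 8 chosen arbitrarily for illustration; it shows that the table has no row for position 8, so looking it up fails on CPU with an `IndexError`.

```python
import torch
import torch.nn as nn

pos_table = nn.Embedding(8, 4)          # rows exist only for positions 0..7

ok = pos_table(torch.arange(8))         # every position is inside the table

try:
    pos_table(torch.arange(9))          # position 8 has no row
    overflowed = False
except IndexError:
    overflowed = True

print(f"Lookup within table shape: {tuple(ok.shape)}")
print(f"Lookup past the table failed: {overflowed}")
```

A sinusoidal table, by contrast, can simply be recomputed with a larger `max_len`, although that alone does not guarantee the model behaves well at lengths it never saw during training.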

## 4. Combining Word and Position Embeddings

Now let's see how positional encoding solves the word order problem by adding it to word embeddings.

In [None]:
def solve_position_problem_with_encoding():
    """Show how positional encoding solves the word order problem."""
    
    # Use our previous example
    sentence1 = ["the", "cat", "sat", "on", "mat"]
    sentence2 = ["cat", "the", "mat", "sat", "on"]
    
    # Get word embeddings
    vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
    d_model = 4
    
    word_embeddings = torch.tensor([
        [1.0, 0.0, 0.0, 0.0],  # "the"
        [0.0, 1.0, 0.0, 0.0],  # "cat"
        [0.0, 0.0, 1.0, 0.0],  # "sat"
        [0.0, 0.0, 0.0, 1.0],  # "on"
        [1.0, 1.0, 0.0, 0.0],  # "mat"
    ])
    
    # Create positional encoding for sequence length
    pos_enc = create_sinusoidal_encoding(5, d_model)
    
    # Convert sentences to embeddings
    def sentence_to_embeddings(sentence):
        word_embs = torch.stack([word_embeddings[vocab[word]] for word in sentence])
        pos_embs = pos_enc[:len(sentence)]
        combined = word_embs + pos_embs  # Add positional encoding
        return word_embs, pos_embs, combined
    
    # Process both sentences
    word_emb1, pos_emb1, combined1 = sentence_to_embeddings(sentence1)
    word_emb2, pos_emb2, combined2 = sentence_to_embeddings(sentence2)
    
    print("Sentence 1:", " ".join(sentence1))
    print("Sentence 2:", " ".join(sentence2))
    print()
    
    # Compare the combined (word + position) vector at each position
    print("Word embeddings + Positional encoding:")
    print("Position | Word1 | Word2 | Same?")
    print("-" * 35)
    
    for i in range(5):
        word1 = sentence1[i]
        word2 = sentence2[i]
        emb1 = combined1[i]
        emb2 = combined2[i]
        same = torch.allclose(emb1, emb2)
        print(f"{i:8} | {word1:5} | {word2:5} | {same}")
    
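    # Note: a plain sum is still order-invariant, because it equals
    # (sum of word embeddings) + (sum of positional encodings).
    sum1 = combined1.sum(dim=0)
    sum2 = combined2.sum(dim=0)
    print(f"\nSums equal? {torch.allclose(sum1, sum2)} (summing discards order)")
    
    # What changes is the vector each word carries: it now depends on position
    print("\nSame word, different position:")
    for word in sentence1:
        v1 = combined1[sentence1.index(word)]
        v2 = combined2[sentence2.index(word)]
        print(f"{word:5} | same vector in both sentences? {torch.allclose(v1, v2)}")
    
    print("\n✅ Success! Each word's vector now encodes where it appears.")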
    
    return combined1, combined2, pos_emb1

combined1, combined2, pos_embeddings = solve_position_problem_with_encoding()

# Visualize the effect
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

# Word embeddings
word_emb1 = combined1 - pos_embeddings  # Extract word embeddings
sns.heatmap(word_emb1.T, annot=True, cmap='Blues', ax=axes[0, 0], cbar=False)
axes[0, 0].set_title('Word Embeddings\n(Sentence 1)')
axes[0, 0].set_xlabel('Position')
axes[0, 0].set_ylabel('Dimension')

# Positional embeddings
sns.heatmap(pos_embeddings.T, annot=True, cmap='Reds', ax=axes[0, 1], cbar=False)
axes[0, 1].set_title('Positional Embeddings')
axes[0, 1].set_xlabel('Position')
axes[0, 1].set_ylabel('Dimension')

# Combined embeddings
sns.heatmap(combined1.T, annot=True, cmap='Greens', ax=axes[0, 2], cbar=False)
axes[0, 2].set_title('Combined Embeddings\n(Word + Position)')
axes[0, 2].set_xlabel('Position')
axes[0, 2].set_ylabel('Dimension')

# Sentence 2
word_emb2 = combined2 - pos_embeddings
sns.heatmap(word_emb2.T, annot=True, cmap='Blues', ax=axes[1, 0], cbar=False)
axes[1, 0].set_title('Word Embeddings\n(Sentence 2 - Scrambled)')
axes[1, 0].set_xlabel('Position')
axes[1, 0].set_ylabel('Dimension')

sns.heatmap(pos_embeddings.T, annot=True, cmap='Reds', ax=axes[1, 1], cbar=False)
axes[1, 1].set_title('Positional Embeddings\n(Same for both)')
axes[1, 1].set_xlabel('Position')
axes[1, 1].set_ylabel('Dimension')

sns.heatmap(combined2.T, annot=True, cmap='Greens', ax=axes[1, 2], cbar=False)
axes[1, 2].set_title('Combined Embeddings\n(Different from Sentence 1!)')
axes[1, 2].set_xlabel('Position')
axes[1, 2].set_ylabel('Dimension')

plt.tight_layout()
plt.show()
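
As a rough check that adding positions changes what attention computes, the sketch below runs a single random self-attention head on the same tokens in two orders, with and without sinusoidal encodings, and compares the mean-pooled outputs. The helpers `self_attention` and `sinusoidal` and the random weights are assumptions made for this example only.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention on x of shape [seq_len, d]."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    return F.softmax(q @ k.T / (k.shape[-1] ** 0.5), dim=-1) @ v

def sinusoidal(max_len, d_model):
    """Sinusoidal table [max_len, d_model], same formula as in section 2."""
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
    freq = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)
    pe[:, 1::2] = torch.cos(pos * freq)
    return pe

d_model, seq_len = 8, 5
x = torch.randn(seq_len, d_model)
# Scale the random weights down so the softmax does not saturate
w_q, w_k, w_v = (torch.randn(d_model, d_model) * d_model ** -0.5 for _ in range(3))
perm = torch.tensor([1, 0, 4, 2, 3])
pe = sinusoidal(seq_len, d_model)

# Without positions: both word orders pool to the same vector
plain = self_attention(x, w_q, w_k, w_v).mean(dim=0)
plain_perm = self_attention(x[perm], w_q, w_k, w_v).mean(dim=0)
order_invisible = torch.allclose(plain, plain_perm, atol=1e-4)

# With positions: position p always receives pe[p], whichever token sits there
with_pos = self_attention(x + pe, w_q, w_k, w_v).mean(dim=0)
with_pos_perm = self_attention(x[perm] + pe, w_q, w_k, w_v).mean(dim=0)
order_visible = not torch.allclose(with_pos, with_pos_perm, atol=1e-4)

print(f"Order invisible without positions: {order_invisible}")
print(f"Order visible with positions:      {order_visible}")
```

Pooling the raw inputs would still hide the order; the difference appears because attention is a nonlinear function of the position-dependent token vectors.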

## 5. Relative Positional Encoding

Modern transformers often use relative positional encoding, which focuses on the *distance* between positions rather than absolute positions. This can be more effective for tasks where relative relationships matter more than absolute positions.

In [None]:
def demonstrate_relative_positions():
    """Show the concept of relative positional encoding."""
    
    seq_len = 6
    
    # Create a matrix of relative distances (integer dtype so fmt='d' works in the heatmap)
    positions = torch.arange(seq_len)
    # Entry [i, j] is the position of j relative to i, i.e. j - i
    relative_distances = positions.unsqueeze(0) - positions.unsqueeze(1)
    
    print("Relative Position Matrix:")
    print("(Each cell shows position j relative to position i)")
    print(relative_distances.numpy().astype(int))
    print()
    
    # Create simple relative positional embeddings: one vector per distance.
    # Random values stand in for embeddings that a real model would learn.
    max_relative_distance = seq_len - 1
    relative_embeddings = torch.randn(2 * max_relative_distance + 1, 4)  # d_model = 4
    
    print("Relative embeddings for different distances:")
    for i in range(-max_relative_distance, max_relative_distance + 1):
        idx = i + max_relative_distance
        print(f"Distance {i:2}: {relative_embeddings[idx][:2].tolist()!r}...")
    
    # Visualize relative distances
    plt.figure(figsize=(12, 5))
    
    plt.subplot(1, 2, 1)
    sns.heatmap(relative_distances, annot=True, cmap='RdBu_r', center=0, 
                fmt='d', cbar_kws={'label': 'Relative Distance'})
    plt.title('Relative Position Matrix')
    plt.xlabel('Position j')
    plt.ylabel('Position i')
    
    plt.subplot(1, 2, 2)
    distances = range(-max_relative_distance, max_relative_distance + 1)
    # Plot the first two dimensions of the relative embeddings
    plt.plot(distances, relative_embeddings[:, 0], 'o-', label='Dimension 0')
    plt.plot(distances, relative_embeddings[:, 1], 's-', label='Dimension 1')
    plt.title('Relative Embeddings by Distance')
    plt.xlabel('Relative Distance')
    plt.ylabel('Embedding Value')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("\nAdvantages of Relative Positioning:")
    print("• Focuses on relationships between positions")
    print("• Can generalize to longer sequences")
    print("• More natural for many language tasks")
    print("• Used in modern models like Transformer-XL, T5")

demonstrate_relative_positions()
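
The demo above only builds the distance matrix and a lookup table. One way to actually use them, sketched here under simplifying assumptions, is to add a learned scalar bias per clipped relative distance to the attention scores. This is loosely in the spirit of T5-style relative biases but omits bucketing and multiple heads; the class name `RelativeBiasAttention` and its parameters are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeBiasAttention(nn.Module):
    """Single-head self-attention with a learned scalar bias per relative distance."""

    def __init__(self, d_model: int, max_distance: int):
        super().__init__()
        self.max_distance = max_distance
        self.qkv = nn.Linear(d_model, 3 * d_model)
        # One learnable scalar for each clipped distance in [-max_distance, max_distance]
        self.rel_bias = nn.Embedding(2 * max_distance + 1, 1)

    def bias_matrix(self, seq_len: int) -> torch.Tensor:
        idx = torch.arange(seq_len, device=self.rel_bias.weight.device)
        # rel[i, j] = j - i, clipped so distant offsets share one bias
        rel = (idx[None, :] - idx[:, None]).clamp(-self.max_distance, self.max_distance)
        return self.rel_bias(rel + self.max_distance).squeeze(-1)  # [seq_len, seq_len]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [seq_len, d_model]; no absolute positions are added anywhere
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.T / (q.size(-1) ** 0.5) + self.bias_matrix(x.size(0))
        return F.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
layer = RelativeBiasAttention(d_model=8, max_distance=3)
out = layer(torch.randn(10, 8))   # sequence longer than max_distance still works
bias = layer.bias_matrix(10)
print(out.shape, bias.shape)
```

Because the bias depends only on `j - i`, the same parameters apply at every absolute position, and clipping lets the layer accept sequences longer than `max_distance`.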

## 6. Building a Complete Positional Embedding Layer

Let's implement a complete positional embedding layer that can use either sinusoidal or learned embeddings.

In [None]:
from src.model.embeddings import GPTEmbedding

class PositionalEmbedding(nn.Module):
    """Complete positional embedding layer with multiple options."""
    
    def __init__(self, vocab_size: int, max_len: int, d_model: int, 
                 pos_type: str = "learned", dropout: float = 0.1):
        super().__init__()
        self.d_model = d_model
        self.pos_type = pos_type
        
        # Token embeddings
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        
        # Positional embeddings
        if pos_type == "learned":
            self.pos_embedding = nn.Embedding(max_len, d_model)
        elif pos_type == "sinusoidal":
            # Register as buffer (not a parameter)
            pos_encoding = create_sinusoidal_encoding(max_len, d_model)
            self.register_buffer('pos_encoding', pos_encoding)
        else:
            raise ValueError(f"Unknown pos_type: {pos_type}")
        
        self.dropout = nn.Dropout(dropout)
        
        # Initialize token embeddings
        nn.init.normal_(self.token_embedding.weight, std=0.02)
        if pos_type == "learned":
            nn.init.normal_(self.pos_embedding.weight, std=0.02)
    
    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        """
        Forward pass: combine token and positional embeddings.
        
        Args:
            token_ids: Token IDs [batch_size, seq_len]
            
        Returns:
            Combined embeddings [batch_size, seq_len, d_model]
        """
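        batch_size, seq_len = token_ids.shape
        
        # Fail early with a clear message if the sequence is longer than the position table
        max_len = (self.pos_embedding.num_embeddings if self.pos_type == "learned"
                   else self.pos_encoding.size(0))
        if seq_len > max_len:
            raise ValueError(f"Sequence length {seq_len} exceeds max_len {max_len}")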
        
        # Token embeddings
        token_emb = self.token_embedding(token_ids)  # [batch_size, seq_len, d_model]
        
        # Positional embeddings
        if self.pos_type == "learned":
            positions = torch.arange(seq_len, device=token_ids.device)
            pos_emb = self.pos_embedding(positions)  # [seq_len, d_model]
            pos_emb = pos_emb.unsqueeze(0).expand(batch_size, -1, -1)  # [batch_size, seq_len, d_model]
        else:  # sinusoidal
            pos_emb = self.pos_encoding[:seq_len].unsqueeze(0)  # [1, seq_len, d_model]
            pos_emb = pos_emb.expand(batch_size, -1, -1)  # [batch_size, seq_len, d_model]
        
        # Combine embeddings
        embeddings = token_emb + pos_emb
        
        # Apply dropout
        embeddings = self.dropout(embeddings)
        
        return embeddings

# Test both types of positional embedding
vocab_size, max_len, d_model = 100, 20, 8
batch_size, seq_len = 2, 10

# Create test input
token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))

# Test learned embeddings (eval mode disables dropout so the outputs are deterministic)
learned_emb = PositionalEmbedding(vocab_size, max_len, d_model, pos_type="learned").eval()
output_learned = learned_emb(token_ids)

# Test sinusoidal embeddings
sinusoidal_emb = PositionalEmbedding(vocab_size, max_len, d_model, pos_type="sinusoidal").eval()
output_sinusoidal = sinusoidal_emb(token_ids)

print(f"Input token IDs shape: {token_ids.shape}")
print(f"Output embeddings shape: {output_learned.shape}")
print()

# Compare parameter counts
learned_params = sum(p.numel() for p in learned_emb.parameters())
sinusoidal_params = sum(p.numel() for p in sinusoidal_emb.parameters())

print(f"Parameter comparison:")
print(f"Learned embeddings:    {learned_params:,} parameters")
print(f"Sinusoidal embeddings: {sinusoidal_params:,} parameters")
print(f"Difference: {learned_params - sinusoidal_params:,} (positional embedding table)")

# Visualize the embeddings for first batch
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Learned embeddings
sns.heatmap(output_learned[0].detach().T, cmap='viridis', ax=ax1,
            xticklabels=range(seq_len), yticklabels=range(d_model))
ax1.set_title('Token + Learned Positional Embeddings')
ax1.set_xlabel('Position')
ax1.set_ylabel('Dimension')

# Sinusoidal embeddings
sns.heatmap(output_sinusoidal[0].detach().T, cmap='viridis', ax=ax2,
            xticklabels=range(seq_len), yticklabels=range(d_model))
ax2.set_title('Token + Sinusoidal Positional Embeddings')
ax2.set_xlabel('Position')
ax2.set_ylabel('Dimension')

plt.tight_layout()
plt.show()

print("\n✅ Both embedding types work and produce the same output shape!")
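
The parameter counts differ because `register_buffer` stores a tensor on the module without making it trainable. A minimal sketch of that distinction, using a toy module `WithBuffer` invented for this example:

```python
import torch
import torch.nn as nn

class WithBuffer(nn.Module):
    """Toy module holding one fixed buffer and one learned embedding table."""

    def __init__(self):
        super().__init__()
        self.register_buffer("pos_encoding", torch.zeros(4, 2))  # fixed, not trained
        self.pos_embedding = nn.Embedding(4, 2)                   # trained

m = WithBuffer()
param_names = [name for name, _ in m.named_parameters()]
buffer_names = [name for name, _ in m.named_buffers()]
state_keys = list(m.state_dict().keys())

print("Parameters:", param_names)   # only the embedding weight
print("Buffers:   ", buffer_names)  # only the fixed table
print("State dict:", state_keys)    # both are saved with the model
```

The optimizer never sees the buffer, but it is still saved in checkpoints and moved along with the module by calls such as `.to(device)`.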

## 7. Position Encoding in Practice

Let's see how our transformer implementation uses positional encoding and test it with a real example.

In [None]:
from src.model.transformer import GPTModel, create_model_config
from src.data.tokenizer import create_tokenizer

def test_positional_encoding_in_transformer():
    """Test how positional encoding affects transformer behavior."""
    
    # Create a small model
    config = create_model_config("tiny")
    model = GPTModel(**config)
    
    # Create tokenizer
    tokenizer = create_tokenizer("simple")
    
    # Test sentences with different word orders
    sentences = [
        "The cat sat on the mat",
        "The mat sat on the cat",  # Different meaning!
        "Cat the sat mat on the",  # Scrambled
    ]
    
    print("Testing positional encoding in transformer:")
    print("=" * 50)
    
    model.eval()
    with torch.no_grad():
        for i, sentence in enumerate(sentences):
            # Tokenize
            tokens = tokenizer.encode(sentence, add_special_tokens=False)
            token_ids = torch.tensor(tokens).unsqueeze(0)  # Add batch dimension
            
            # Get model output
            logits, attention_weights = model(token_ids, return_attention=True)
            
            # Get the embedding layer output (after positional encoding)
            embeddings = model.embedding(token_ids)
            
            print(f"\nSentence {i+1}: {sentence}")
            print(f"Token IDs: {tokens}")
            print(f"Embeddings shape: {embeddings.shape}")
            print(f"Logits shape: {logits.shape}")
            
            # Compare each sentence's embeddings against those of the first sentence
            if i == 0:  # First sentence
                first_embeddings = embeddings.clone()
                print(f"First word embedding: {embeddings[0, 0, :3].tolist()!r}...")  # First 3 dims
            elif embeddings.shape == first_embeddings.shape:
                # Only comparable when both sentences have the same number of tokens
                embedding_diff = torch.norm(embeddings - first_embeddings).item()
                print(f"Difference from first sentence: {embedding_diff:.3f}")
            else:
                print("Token count differs from first sentence; skipping comparison")
    
    print("\n" + "=" * 50)
    print("Key Observations:")
    print("• Same words in different positions have different embeddings")
    print("• Different word orders produce different representations")
    print("• The model can distinguish between sentences with same words")
    print("• Positional encoding enables understanding of word order!")

test_positional_encoding_in_transformer()

## Summary

In this notebook, we've explored how transformers understand position:

1. **The Position Problem** - Transformers need explicit positional information
2. **Sinusoidal Encoding** - Elegant mathematical solution using sine/cosine waves
3. **Learned Embeddings** - Simple alternative that learns position representations
4. **Relative Positions** - Modern approaches focusing on relationships
5. **Complete Implementation** - Building a full positional embedding layer

### Key Insights:

- **Position is crucial**: Without it, "cat sat on mat" = "mat on sat cat"
- **Sinusoidal encoding**: Each position gets a unique frequency signature
- **Learned embeddings**: Can work better on some tasks but are limited to `max_len` positions
- **Relative positions**: Focus on relationships rather than absolute positions
- **Addition works**: Simply adding positional to word embeddings is effective

### Design Choices:

- **Sinusoidal**: Use when you want no extra parameters and an encoding that can be computed for any length
- **Learned**: Use when task-specific position representations help and the maximum length is fixed
- **Relative**: Use for tasks where relationships matter more than absolute position

### Modern Trends:

- Earlier models such as BERT and GPT-2 use learned absolute positional embeddings
- Relative positioning is used in models such as T5 and Transformer-XL
- Many recent large language models use rotary positional encoding (RoPE), which makes attention scores depend on relative position
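
For readers curious about RoPE, here is a minimal sketch of the idea: rotate each pair of query/key features by an angle proportional to the token's position, so that the query-key dot product depends only on the offset between the two positions. The function name `rope`, the adjacent-pair layout, and the base of 10000 are assumptions for illustration; real implementations differ in layout and details.

```python
import math
import torch

def rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate adjacent feature pairs of x [seq_len, d] (d even) by position-dependent angles."""
    d = x.size(-1)
    freq = torch.exp(torch.arange(0, d, 2, dtype=x.dtype) * (-math.log(base) / d))  # [d/2]
    angles = positions.to(x.dtype)[:, None] * freq[None, :]                          # [seq_len, d/2]
    cos, sin = angles.cos(), angles.sin()
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin   # standard 2D rotation of each pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

torch.manual_seed(0)
q = torch.randn(1, 8, dtype=torch.float64)
k = torch.randn(1, 8, dtype=torch.float64)

def score(q_pos: int, k_pos: int) -> torch.Tensor:
    """Dot product between q placed at q_pos and k placed at k_pos."""
    return (rope(q, torch.tensor([q_pos])) * rope(k, torch.tensor([k_pos]))).sum()

# Same offset (2) at different absolute positions gives the same score
same_offset_same_score = torch.allclose(score(3, 1), score(10, 8))
print(f"Score depends only on the offset? {same_offset_same_score}")
```

Unlike the additive schemes above, nothing is added to the token embeddings; the position enters only through the rotation applied to queries and keys.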

Next, we'll explore the training process and see how transformers learn from data!