# Positional Embeddings: ROPE, Sinusoidal, and Learned

This notebook explores different positional embedding techniques used in transformer models.

We'll cover:
1. **Sinusoidal Positional Encodings** (original Transformer)
2. **Learned Positional Embeddings** (BERT, GPT)
3. **Rotary Position Embeddings (ROPE)** (modern LLMs like LLaMA, GPT-NeoX)
4. **Comparison and visualizations**

## Why Positional Embeddings?

Transformers process all tokens in parallel, and self-attention by itself has no notion of token order: permuting the input tokens simply permutes the outputs. Positional embeddings inject the missing order information into the model.
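
A quick way to see the problem is to shuffle the tokens and watch attention ignore the change. The sketch below is a simplified illustration, not the attention layer used later: `self_attention` is a hypothetical single-head helper with identity projections, and the sizes are arbitrary.

```python
import torch

torch.manual_seed(0)

def self_attention(x):
    """Single-head attention with identity Q/K/V projections (illustrative only)."""
    scores = x @ x.transpose(-2, -1) / x.size(-1) ** 0.5
    return torch.softmax(scores, dim=-1) @ x

x = torch.randn(5, 16)        # 5 tokens, 16 features, no position information
perm = torch.randperm(5)      # a random reordering of the tokens

out = self_attention(x)
out_shuffled = self_attention(x[perm])

# Shuffling the inputs only shuffles the outputs in the same way,
# so the layer cannot tell the two orderings apart.
print(torch.allclose(out[perm], out_shuffled, atol=1e-5))
```

Without an extra position signal, every ordering of the same tokens yields the same set of output vectors.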

## 1. Import Libraries

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import math

# Set random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

## 2. Sinusoidal Positional Encodings

The original Transformer paper used fixed sinusoidal functions:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
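
To get a feel for the formula, it helps to compute the wavelength of each sin/cos pair, i.e. how many positions it takes to complete one cycle. This is a small worked example using only the standard library; `d_model = 128` is chosen to match the example value used later in this notebook.

```python
import math

d_model = 128

# The pair at dimensions (2i, 2i+1) has angular frequency 1 / 10000^(2i/d_model),
# so its wavelength in positions is 2*pi * 10000^(2i/d_model).
wavelengths = [2 * math.pi * 10000 ** (2 * i / d_model) for i in range(d_model // 2)]

print(f"pair 0:  {wavelengths[0]:.2f} positions per cycle")
print(f"pair {d_model // 2 - 1}: {wavelengths[-1]:.0f} positions per cycle")
```

The wavelengths grow geometrically with the dimension index, starting at 2π for the first pair and approaching 10000 · 2π for the last.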

**Benefits:**
- No learned parameters
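- Defined for every position, so encodings can be computed for sequences longer than those seen in training (model quality at those lengths is not guaranteed)
- Encodes relative positions through linear transformations: for a fixed offset k, PE(pos + k) is a linear function of PE(pos)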
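
The last point can be verified numerically. The sketch below rebuilds a small encoding table with the same formula and constructs a block-diagonal matrix `M` from the offset `k` alone. The sizes and the name `M` are illustrative assumptions, not part of the original notebook.

```python
import math
import torch

d_model, max_len, k = 8, 20, 3   # small illustrative sizes; k is the offset

position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

# One 2x2 block per (sin, cos) pair; each block depends on k but not on pos.
M = torch.zeros(d_model, d_model)
for i, w in enumerate(div_term.tolist()):
    c, s = math.cos(k * w), math.sin(k * w)
    M[2 * i:2 * i + 2, 2 * i:2 * i + 2] = torch.tensor([[c, s], [-s, c]])

shifted = pe[:-k] @ M.T          # apply M to PE(pos) for every pos
print(torch.allclose(shifted, pe[k:], atol=1e-4))
```

Each block is a 2D rotation by the angle `k * w`, which follows from the angle-addition identities for sine and cosine.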

In [None]:
class SinusoidalPositionalEncoding(nn.Module):
    """
    Sinusoidal positional encoding as in 'Attention is All You Need'.
    """
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        self.d_model = d_model
        
        # Create position encodings matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        
        # Create division term for wavelengths
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        
        # Apply sin to even indices, cos to odd indices
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        # Register as buffer (not a parameter)
        self.register_buffer('pe', pe)
    
    def forward(self, x):
        """
        Args:
            x: Tensor of shape (batch_size, seq_len, d_model)
        Returns:
            x with positional encoding added
        """
        seq_len = x.size(1)
        return x + self.pe[:seq_len, :].unsqueeze(0)

# Example usage
d_model = 128
seq_len = 50
batch_size = 4

sinusoidal_pe = SinusoidalPositionalEncoding(d_model)
x = torch.randn(batch_size, seq_len, d_model)
x_with_pe = sinusoidal_pe(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {x_with_pe.shape}")
print(f"Positional encoding shape: {sinusoidal_pe.pe.shape}")

### Visualize Sinusoidal Encodings

In [None]:
# Visualize the positional encoding matrix
plt.figure(figsize=(14, 6))

# Show first 100 positions and all dimensions
pe_matrix = sinusoidal_pe.pe[:100, :].numpy()

plt.subplot(1, 2, 1)
plt.imshow(pe_matrix.T, aspect='auto', cmap='RdBu')
plt.xlabel('Position')
plt.ylabel('Dimension')
plt.title('Sinusoidal Positional Encoding Heatmap')
plt.colorbar(label='Value')

# Show some specific dimensions over position
plt.subplot(1, 2, 2)
positions = np.arange(100)
for dim in [0, 1, 10, 20, 50]:
    plt.plot(positions, pe_matrix[:, dim], label=f'dim {dim}')
plt.xlabel('Position')
plt.ylabel('Encoding Value')
plt.title('Positional Encoding Curves')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Notice: Different dimensions have different wavelengths.")
print("Lower dimensions change quickly (short wavelength), higher dimensions change slowly (long wavelength).")

## 3. Learned Positional Embeddings

Models like BERT and GPT-2 use learned positional embeddings:
- Simple lookup table of learned vectors
- Each position has its own learned embedding
- More flexible but limited to training sequence length

**Trade-offs:**
- ✅ Can learn task-specific position patterns
- ❌ Cannot extrapolate beyond max_len seen during training
- ❌ Requires storing parameters for each position
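
Both drawbacks are easy to see on a toy lookup table. This is a minimal sketch with made-up sizes (`max_len = 8`, `d_model = 4`), run on CPU; it only illustrates how `nn.Embedding` behaves, not any particular model.

```python
import torch
import torch.nn as nn

max_len, d_model = 8, 4                   # tiny illustrative sizes
table = nn.Embedding(max_len, d_model)    # one trainable row per position

num_params = sum(p.numel() for p in table.parameters())
print(f"Parameters: {num_params}")        # max_len * d_model

try:
    table(torch.tensor([max_len]))        # valid indices are 0 .. max_len - 1
    lookup_failed = False
except IndexError as err:
    lookup_failed = True
    print(f"IndexError: {err}")
```

The table has no row for position `max_len`, so the lookup fails instead of producing an encoding.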

In [None]:
class LearnedPositionalEmbedding(nn.Module):
    """
    Learned positional embeddings as in BERT and GPT.
    """
    def __init__(self, d_model, max_len=512):
        super().__init__()
        self.positional_embeddings = nn.Embedding(max_len, d_model)
        self.max_len = max_len
    
    def forward(self, x):
        """
        Args:
            x: Tensor of shape (batch_size, seq_len, d_model)
        Returns:
            x with positional embedding added
        """
        batch_size, seq_len, d_model = x.shape
        
        if seq_len > self.max_len:
            raise ValueError(f"Sequence length {seq_len} exceeds max_len {self.max_len}")
        
        # Create position indices
        positions = torch.arange(seq_len, device=x.device).unsqueeze(0)
        pos_emb = self.positional_embeddings(positions)
        
        return x + pos_emb

# Example usage
learned_pe = LearnedPositionalEmbedding(d_model, max_len=512)
x = torch.randn(batch_size, seq_len, d_model)
x_with_learned_pe = learned_pe(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {x_with_learned_pe.shape}")
print(f"Number of parameters: {sum(p.numel() for p in learned_pe.parameters())}")

## 4. Rotary Position Embeddings (ROPE)

ROPE is used in modern LLMs (LLaMA, GPT-NeoX, PaLM). It rotates query and key vectors based on position.

**Key idea:** Apply position-dependent rotation to Q and K:
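- The Q·K dot product depends on the two positions only through their offset m − n
- The rotation is defined for any position, so there is no hard maximum length
- Extends beyond the training length more gracefully than learned embeddings, which have no entries for unseen positions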

**Mathematical formulation:**
$$f_q(x_m, m) = (W_q x_m) e^{im\theta}$$
$$f_k(x_n, n) = (W_k x_n) e^{in\theta}$$

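Here each pair of dimensions is treated as one complex number. The attention score is the real part of the Hermitian inner product (the key is conjugated), so the positions enter only through the offset $m - n$:
$$\langle f_q(x_m, m), f_k(x_n, n) \rangle = \mathrm{Re}\left[(W_q x_m)(W_k x_n)^* e^{i(m-n)\theta}\right]$$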

**Benefits:**
- ✅ Encodes relative position in attention scores
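- ✅ Can be applied at any sequence length (quality beyond the training length often needs scaling methods such as position interpolation)
- ✅ No extra parameters
- ✅ Acts on pairs of dimensions, each treated as a complex number (a 2D rotation)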
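
The relative-position property can be checked numerically. The sketch below is illustrative only: it uses Python's standard `cmath` module rather than PyTorch, treats each pair of dimensions as one complex number, and picks arbitrary values for `half`, `q`, and `k` (stand-ins for the projected vectors $W_q x_m$ and $W_k x_n$). The score is taken as the real part of the product with the key conjugated.

```python
import cmath
import random

random.seed(0)
half = 4                                                 # number of dimension pairs
thetas = [10000.0 ** (-i / half) for i in range(half)]   # one base angle per pair

# Random complex vectors standing in for the projected query and key
q = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(half)]
k = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(half)]

def rotate(vec, pos):
    """Multiply each complex component by e^{i * pos * theta}."""
    return [v * cmath.exp(1j * pos * t) for v, t in zip(vec, thetas)]

def score(m, n):
    """Real inner product of q rotated to position m and k rotated to position n."""
    return sum((a * b.conjugate()).real for a, b in zip(rotate(q, m), rotate(k, n)))

# Both position pairs are 2 apart, so the scores should agree up to rounding.
print(score(3, 1), score(12, 10))
```

Shifting both positions by the same amount leaves the score unchanged, because the common phase cancels between the query and the conjugated key.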

In [None]:
class RotaryPositionalEmbedding(nn.Module):
    """
    Rotary Position Embedding (ROPE) as used in LLaMA and other modern LLMs.
    """
    def __init__(self, dim, max_seq_len=2048, base=10000):
        super().__init__()
        self.dim = dim
        self.max_seq_len = max_seq_len
        self.base = base
        
        # Compute frequencies for each dimension pair
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer('inv_freq', inv_freq)
        
        # Cache cos and sin values
        self._set_cos_sin_cache(max_seq_len)
    
    def _set_cos_sin_cache(self, seq_len):
        """Precompute cos and sin values for efficiency."""
        self.max_seq_len_cached = seq_len
        t = torch.arange(seq_len, device=self.inv_freq.device).type_as(self.inv_freq)
        
        # Compute position * frequency for all positions and frequencies
        freqs = torch.einsum('i,j->ij', t, self.inv_freq)
        
        # Concatenate to match dimension (each freq used for 2 dimensions)
        emb = torch.cat((freqs, freqs), dim=-1)
        
        self.register_buffer('cos_cached', emb.cos())
        self.register_buffer('sin_cached', emb.sin())
    
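    def forward(self, x, seq_dim=1):
        """
        Apply rotary embeddings to input tensor.
        
        Args:
            x: Input tensor of shape (..., seq_len, ..., dim), e.g.
               (batch, seq_len, dim) or (batch, seq_len, num_heads, dim)
            seq_dim: Dimension index for sequence length
        
        Returns:
            Tensor with rotary embeddings applied (same shape as x)
        """
        seq_len = x.shape[seq_dim]
        
        if seq_len > self.max_seq_len_cached:
            self._set_cos_sin_cache(seq_len)
        
        # The cache stores every angle twice (cat((freqs, freqs))); the first
        # half holds exactly one angle per (even, odd) dimension pair.
        half = self.dim // 2
        cos = self.cos_cached[:seq_len, :half]
        sin = self.sin_cached[:seq_len, :half]
        
        # Reshape so cos/sin broadcast along seq_dim and the last dim only;
        # extra axes such as num_heads then broadcast correctly.
        shape = [1] * x.dim()
        shape[seq_dim] = seq_len
        shape[-1] = half
        return self.apply_rotary_emb(x, cos.reshape(shape), sin.reshape(shape))
    
    @staticmethod
    def apply_rotary_emb(x, cos, sin):
        """Rotate each (even, odd) pair of x; cos/sin must broadcast to x[..., ::2]."""
        # Split x into pair components: x1 = [x0, x2, ...], x2 = [x1, x3, ...]
        x1 = x[..., ::2]
        x2 = x[..., 1::2]
        
        # 2D rotation with the same angle for both components of a pair:
        # [cos * x1 - sin * x2, sin * x1 + cos * x2]
        x_rotated = torch.stack([
            x1 * cos - x2 * sin,
            x1 * sin + x2 * cos
        ], dim=-1)
        
        # Interleave the rotated pairs back into the original layout
        return x_rotated.flatten(-2)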

# Example usage
head_dim = 64  # Typical attention head dimension
rope = RotaryPositionalEmbedding(head_dim)

# Apply to query and key
q = torch.randn(batch_size, seq_len, 8, head_dim)  # 8 heads
k = torch.randn(batch_size, seq_len, 8, head_dim)

print(f"Query shape: {q.shape}")
print(f"Key shape: {k.shape}")

# Apply ROPE to each head
q_rotated = rope(q, seq_dim=1)
k_rotated = rope(k, seq_dim=1)

print(f"Query with ROPE shape: {q_rotated.shape}")
print(f"Key with ROPE shape: {k_rotated.shape}")

### Visualize ROPE Frequencies

In [None]:
# Visualize ROPE cos and sin values
plt.figure(figsize=(14, 10))

# Show cos values
plt.subplot(3, 2, 1)
cos_matrix = rope.cos_cached[:100, :].cpu().numpy()
plt.imshow(cos_matrix.T, aspect='auto', cmap='RdBu')
plt.xlabel('Position')
plt.ylabel('Dimension')
plt.title('ROPE: Cosine Values')
plt.colorbar(label='Value')

# Show sin values
plt.subplot(3, 2, 2)
sin_matrix = rope.sin_cached[:100, :].cpu().numpy()
plt.imshow(sin_matrix.T, aspect='auto', cmap='RdBu')
plt.xlabel('Position')
plt.ylabel('Dimension')
plt.title('ROPE: Sine Values')
plt.colorbar(label='Value')

# Show frequency spectrum
plt.subplot(3, 2, 3)
inv_freq = rope.inv_freq.cpu().numpy()
plt.plot(inv_freq)
plt.xlabel('Dimension Pair Index')
plt.ylabel('Rotation Frequency (rad / position)')
plt.title('ROPE: Frequency Spectrum')
plt.yscale('log')
plt.grid(True, alpha=0.3)

# Show cos for specific dimensions
plt.subplot(3, 2, 4)
positions = np.arange(100)
for dim in [0, 5, 15, 31]:
    plt.plot(positions, cos_matrix[:, dim], label=f'dim {dim}')
plt.xlabel('Position')
plt.ylabel('Cosine Value')
plt.title('ROPE: Cosine Curves by Dimension')
plt.legend()
plt.grid(True, alpha=0.3)

# Demonstrate position encoding effect
plt.subplot(3, 2, 5)
# Sample vector
sample_vec = torch.randn(1, 1, head_dim)
positions_to_show = [0, 10, 20, 30, 40]
rotated_vecs = []
for pos in positions_to_show:
    sample_at_pos = torch.zeros(1, 50, head_dim)
    sample_at_pos[:, pos, :] = sample_vec
    rotated = rope(sample_at_pos, seq_dim=1)
    rotated_vecs.append(rotated[0, pos, :].cpu().detach().numpy())

for i, pos in enumerate(positions_to_show):
    plt.plot(rotated_vecs[i], label=f'pos {pos}', alpha=0.7)
plt.xlabel('Dimension')
plt.ylabel('Value after ROPE')
plt.title('Same Vector at Different Positions')
plt.legend()
plt.grid(True, alpha=0.3)

# Relative position encoding visualization
plt.subplot(3, 2, 6)
# Compute dot product between the same vector placed at different relative positions
base_pos = 10
relative_positions = list(range(-10, 11))
similarities = []

# Use one fixed vector at every position so that only the rotation differs
base_vec = torch.randn(head_dim)
q_vec = base_vec.expand(1, 50, head_dim)
k_vec = q_vec.clone()  # Same vector
q_rot = rope(q_vec, seq_dim=1)
k_rot = rope(k_vec, seq_dim=1)

for rel_pos in relative_positions:
    target_pos = base_pos + rel_pos
    if 0 <= target_pos < 50:
        sim = (q_rot[0, base_pos] @ k_rot[0, target_pos]).item()
        similarities.append(sim)
    else:
        similarities.append(np.nan)

plt.plot(relative_positions, similarities, marker='o')
plt.xlabel('Relative Position')
plt.ylabel('Dot Product (Similarity)')
plt.title('ROPE Preserves Relative Position Information')
plt.axvline(x=0, color='red', linestyle='--', label='Same position')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nKey observations:")
print("1. Different dimensions rotate at different frequencies")
print("2. Lower dimensions (high frequency) capture fine-grained position info")
print("3. Higher dimensions (low frequency) capture long-range position info")
print("4. For fixed vectors, the dot product after rotation depends only on relative position")

## 5. ROPE in Attention Mechanism

Let's implement a complete attention layer with ROPE to see how it's used in practice.

In [None]:
class RoPEAttention(nn.Module):
    """
    Multi-head attention with Rotary Position Embeddings (ROPE).
    """
    def __init__(self, d_model, num_heads, dropout=0.1):
        super().__init__()
        assert d_model % num_heads == 0
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        
        # Linear projections
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        
        # ROPE for Q and K
        self.rope = RotaryPositionalEmbedding(self.head_dim)
        
        self.dropout = nn.Dropout(dropout)
        self.scale = math.sqrt(self.head_dim)
    
    def forward(self, x, mask=None):
        """
        Args:
            x: Input tensor (batch_size, seq_len, d_model)
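            mask: Optional attention mask, broadcastable to
                (batch_size, num_heads, seq_len, seq_len); entries equal to 0 are blocked
        Returns:
            Tuple (output, attn_weights):
                output: (batch_size, seq_len, d_model)
                attn_weights: (batch_size, num_heads, seq_len, seq_len)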
        """
        batch_size, seq_len, d_model = x.shape
        
        # Project to Q, K, V
        q = self.q_proj(x)  # (batch, seq_len, d_model)
        k = self.k_proj(x)
        v = self.v_proj(x)
        
        # Reshape for multi-head attention
        q = q.view(batch_size, seq_len, self.num_heads, self.head_dim)
        k = k.view(batch_size, seq_len, self.num_heads, self.head_dim)
        v = v.view(batch_size, seq_len, self.num_heads, self.head_dim)
        
        # Apply ROPE to Q and K (NOT to V!)
        q = self.rope(q, seq_dim=1)
        k = self.rope(k, seq_dim=1)
        
        # Transpose for attention computation: (batch, num_heads, seq_len, head_dim)
        q = q.transpose(1, 2)
        k = k.transpose(1, 2)
        v = v.transpose(1, 2)
        
        # Compute attention scores
        scores = torch.matmul(q, k.transpose(-2, -1)) / self.scale
        
        # Apply mask if provided
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        
        # Apply softmax and dropout
        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)
        
        # Apply attention to values
        out = torch.matmul(attn_weights, v)
        
        # Reshape and project output
        out = out.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)
        out = self.out_proj(out)
        
        return out, attn_weights

# Example usage
d_model = 256
num_heads = 8
seq_len = 32
batch_size = 4

rope_attn = RoPEAttention(d_model, num_heads)
x = torch.randn(batch_size, seq_len, d_model)

output, attn_weights = rope_attn(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attn_weights.shape}")
print(f"\nNumber of parameters: {sum(p.numel() for p in rope_attn.parameters()):,}")
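
The `mask` argument of `RoPEAttention.forward` is not exercised in the example above. The following self-contained sketch shows the convention that the `masked_fill(mask == 0, ...)` line implies (nonzero means "may attend"), using a causal mask on random scores. The tensor sizes are arbitrary, and `causal_mask` is an illustrative name.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len = 4
scores = torch.randn(1, 2, seq_len, seq_len)    # (batch, heads, query, key)

# Lower-triangular mask: 1 = may attend, 0 = blocked (future positions)
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

# Blocked entries become -inf and therefore get zero weight after softmax
masked = scores.masked_fill(causal_mask == 0, float('-inf'))
weights = F.softmax(masked, dim=-1)

print(weights[0, 0])
```

A mask of shape `(seq_len, seq_len)` broadcasts over the batch and head axes, so the same tensor could be passed as `mask` to the layer.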

## 6. Comparison: Sinusoidal vs Learned vs ROPE

Let's compare the three approaches on key metrics.

In [None]:
# Comparison table
import pandas as pd

comparison_data = {
    'Property': [
        'Parameters',
        'Computable Beyond max_len',
        'Relative Position',
        'Memory Cost',
        'Computation Cost',
        'Length Generalization',
        'Used In'
    ],
    'Sinusoidal': [
        'None (fixed)',
        'Yes (rebuild table)',
        'Implicit',
        'Low (cache)',
        'Low',
        'Limited',
        'Original Transformer'
    ],
    'Learned': [
        'max_len × d_model',
        'No',
        'Absolute only',
        'High',
        'Low',
        'Poor',
        'BERT, GPT-2'
    ],
    'ROPE': [
        'None (fixed)',
        'Yes (extend cache)',
        'Explicit in QK^T',
        'Low (cache)',
        'Medium',
        'Moderate (better with scaling)',
        'LLaMA, GPT-NeoX, PaLM'
    ]
}

df = pd.DataFrame(comparison_data)
print("\n=== Positional Embedding Comparison ===")
print(df.to_string(index=False))

# Visualize parameter counts
plt.figure(figsize=(10, 5))

max_len = 512
d_model_val = 768

param_counts = {
    'Sinusoidal': 0,
    'Learned': max_len * d_model_val,
    'ROPE': 0
}

plt.bar(param_counts.keys(), param_counts.values(), color=['blue', 'orange', 'green'])
plt.ylabel('Number of Parameters')
plt.title(f'Parameter Count Comparison\n(max_len={max_len}, d_model={d_model_val})')
plt.xticks(rotation=15)

# Add value labels on bars
for i, (name, count) in enumerate(param_counts.items()):
    plt.text(i, count, f'{count:,}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

## 7. Length Extrapolation Test

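Check whether each method can run on sequences longer than the training length. This only tests that the forward pass succeeds; it says nothing about how well a trained model would perform at the longer length.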

In [None]:
def test_length_extrapolation():
    """
    Test how positional encodings handle sequences longer than training length.
    """
    d_model = 128
    train_len = 64
    test_len = 128  # Double the training length
    
    print("Testing length extrapolation...")
    print(f"Training length: {train_len}")
    print(f"Test length: {test_len}\n")
    
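    # Sinusoidal: the table is a fixed function of position, so it can be
    # rebuilt at any length without retraining
    sin_pe = SinusoidalPositionalEncoding(d_model, max_len=train_len)
    try:
        # sin_pe only stores train_len rows, so build a longer table instead
        sin_pe_extended = SinusoidalPositionalEncoding(d_model, max_len=test_len)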
        x_test = torch.randn(1, test_len, d_model)
        out = sin_pe_extended(x_test)
        print("✅ Sinusoidal: Successfully handled longer sequence")
        print(f"   Output shape: {out.shape}")
    except Exception as e:
        print(f"❌ Sinusoidal failed: {e}")
    
    # Learned: Fails without retraining
    learned_pe = LearnedPositionalEmbedding(d_model, max_len=train_len)
    try:
        x_test = torch.randn(1, test_len, d_model)
        out = learned_pe(x_test)
        print("✅ Learned: Successfully handled longer sequence")
    except Exception as e:
        print("❌ Learned: Failed with longer sequence")
        print(f"   Error: {type(e).__name__}: {e}")
    
    # ROPE: Works fine
    rope = RotaryPositionalEmbedding(d_model, max_seq_len=train_len)
    try:
        x_test = torch.randn(1, test_len, d_model)
        out = rope(x_test, seq_dim=1)
        print("✅ ROPE: Successfully handled longer sequence")
        print(f"   Output shape: {out.shape}")
        print(f"   Cache automatically extended to {rope.max_seq_len_cached}")
    except Exception as e:
        print(f"❌ ROPE failed: {e}")

test_length_extrapolation()

## 8. Key Takeaways

### Sinusoidal Positional Encoding
- ✅ No learnable parameters
- ✅ Computable for any sequence length without retraining
- ✅ Computationally efficient
- ❌ Less flexible than learned embeddings

### Learned Positional Embeddings
- ✅ Can learn task-specific patterns
- ✅ Simple to implement
- ❌ Poor extrapolation beyond training length
- ❌ Requires storage for each position

### Rotary Position Embeddings (ROPE)
- ✅ No learnable parameters
- ✅ Extends to longer sequences without new parameters, though quality can degrade past the training length
- ✅ Explicitly encodes relative positions
- ✅ State-of-the-art for long context LLMs
- ⚠️ Slightly more complex to implement

### When to Use Each?

1. **Use Sinusoidal** for:
   - Research and experimentation
   - When you need simple, parameter-free positional encoding
   - Standard sequence lengths

2. **Use Learned** for:
   - Fixed-length sequences (e.g., BERT-style models)
   - When you want the model to learn position patterns
   - Classification tasks with fixed input size

3. **Use ROPE** for:
   - Modern LLMs and autoregressive models
   - Long context windows
   - When relative position is more important than absolute
   - Production LLM systems

### Modern Trends
- **ROPE is dominant** in state-of-the-art LLMs (LLaMA, GPT-NeoX, PaLM)
- **ALiBi** (Attention with Linear Biases) is another modern alternative
- Research continues on even better position encoding methods

## 9. Summary

In this notebook, we explored three major positional embedding techniques:

1. **Sinusoidal Encodings**: Fixed, parameter-free, good for standard transformers
2. **Learned Embeddings**: Flexible but limited to training sequence lengths
3. **ROPE**: Modern, efficient, excellent for long-context LLMs

ROPE has become the de facto standard for modern large language models due to its:
- Zero additional parameters
- Ability to run at longer sequence lengths without new parameters
- Explicit relative position encoding
- Strong empirical performance

Understanding these positional embedding methods is crucial for working with modern transformer architectures and large language models.