# Positional Encoding: Teaching Transformers About Position

Transformers process all positions in parallel, creating a problem: **how does the model know about word order?**

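Without positional information, self-attention treats its input as an unordered set of vectors, so "cat sat on mat" and "mat on sat cat" would give every word exactly the same representation!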

## The Solution
**Positional encoding** adds a unique position signature to each word embedding, giving the transformer a way to tell positions apart.

## What You'll Learn
1. **The Position Problem** - Why order matters
2. **Sinusoidal Encoding** - Mathematical solution using sine/cosine
3. **Addition vs Concatenation** - Why we add position info
4. **Complete Implementation** - Building the full embedding layer

In [ ]:
# Set up imports, plotting style, and random seeds
import sys
import os
sys.path.append('..')

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
from typing import Tuple, Optional

plt.style.use('default')
sns.set_palette("husl")
torch.manual_seed(42)
np.random.seed(42)
print("Environment setup complete!")

In [ ]:
# Demonstrate the position problem
sentence1 = ["cat", "sat", "on", "mat"]
sentence2 = ["mat", "on", "sat", "cat"]

word_embeddings = {
    "cat": [1, 0, 0], "sat": [0, 1, 0], 
    "on": [0, 0, 1], "mat": [1, 1, 0]
}

emb1 = [word_embeddings[word] for word in sentence1]
emb2 = [word_embeddings[word] for word in sentence2]

# Sum embeddings: an order-insensitive aggregate, a stand-in for what position-free attention sees
sum1 = [sum(x) for x in zip(*emb1)]
sum2 = [sum(x) for x in zip(*emb2)]

print(f"Sentence 1: {' '.join(sentence1)}")
print(f"Sentence 2: {' '.join(sentence2)}")
print(f"Aggregated representation 1: {sum1}")
print(f"Aggregated representation 2: {sum2}")
print(f"Identical? {sum1 == sum2}")
print("❌ Problem: Can't distinguish word order!")
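
The sum above is only a stand-in for attention. As a minimal sketch of the same issue with attention itself, the plain-Python snippet below runs an unparameterized self-attention over both word orders. It assumes queries, keys, and values are all the raw embeddings (real layers use learned projections), and the helper names `softmax` and `self_attention` are illustrative, not part of this notebook's code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Unparameterized self-attention: queries = keys = values = inputs."""
    outputs = []
    for q in vectors:
        # Dot-product score between this query and every key
        scores = [sum(a * b for a, b in zip(q, k)) for k in vectors]
        weights = softmax(scores)
        # Weighted average of the value vectors
        outputs.append([sum(w * v[d] for w, v in zip(weights, vectors))
                        for d in range(len(q))])
    return outputs

toy_embeddings = {"cat": [1.0, 0.0, 0.0], "sat": [0.0, 1.0, 0.0],
                  "on": [0.0, 0.0, 1.0], "mat": [1.0, 1.0, 0.0]}
order_a = ["cat", "sat", "on", "mat"]
order_b = ["mat", "on", "sat", "cat"]

out_a = dict(zip(order_a, self_attention([toy_embeddings[w] for w in order_a])))
out_b = dict(zip(order_b, self_attention([toy_embeddings[w] for w in order_b])))

# Compare the output vector each word receives in the two orderings
same_per_word = all(
    math.isclose(x, y, abs_tol=1e-9)
    for w in order_a
    for x, y in zip(out_a[w], out_b[w])
)
print(f"Per-word attention outputs identical across orderings? {same_per_word}")
```

Shuffling the input only shuffles the output rows; no word's vector changes, so nothing downstream can recover the original order.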

## Sinusoidal Positional Encoding

Use sine and cosine functions to create unique, bounded position signatures.
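
For reference, the standard sinusoidal formulation, which the function in the next cell appears to follow, is written below. The pair index $i$ (running from $0$ to $d_{\text{model}}/2 - 1$) is notation introduced here, not a name used in the code.

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$

The code's `div_term` computes the same frequencies through an exponential:

$$
\exp\!\left(2i \cdot \frac{-\ln 10000}{d_{\text{model}}}\right) = \frac{1}{10000^{2i/d_{\text{model}}}}
$$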

In [ ]:
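def create_sinusoidal_encoding(max_len, d_model):
    """Return a (max_len, d_model) tensor of sinusoidal position encodings.

    Assumes d_model is even so that sine and cosine columns pair up.
    """
    pe = torch.zeros(max_len, d_model)
    # Column vector of positions 0..max_len-1, shape (max_len, 1)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
    
    # Frequencies 1 / 10000^(2i / d_model), one per sine/cosine pair
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * 
                        (-math.log(10000.0) / d_model))
    
    pe[:, 0::2] = torch.sin(position * div_term)  # Even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # Odd dimensions
    
    return pe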

# Create and visualize positional encoding
max_len, d_model = 10, 8
pos_encoding = create_sinusoidal_encoding(max_len, d_model)

print(f"Positional encoding shape: {pos_encoding.shape}")
print(f"Value range: [{pos_encoding.min():.3f}, {pos_encoding.max():.3f}]")

# Show first few positions
for i in range(3):
    print(f"Position {i}: {[round(x, 3) for x in pos_encoding[i].tolist()]}")

# Visualize the encoding pattern
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
sns.heatmap(pos_encoding.T.numpy(), cmap='RdBu_r', center=0)
plt.title('Positional Encoding Pattern')
plt.xlabel('Position')
plt.ylabel('Dimension')

plt.subplot(1, 2, 2)
for dim in [0, 1, 6, 7]:
    plt.plot(pos_encoding[:, dim].numpy(), label=f'Dim {dim}')
plt.title('Values by Position')
plt.xlabel('Position')
plt.ylabel('Value')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("✅ Each position gets a unique, bounded pattern!")
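
The "unique, bounded" claim can be checked directly rather than read off the plot. The sketch below recomputes the same table in plain Python (the helper `sinusoidal_row` is a hypothetical re-derivation, independent of `create_sinusoidal_encoding`, and assumes an even `d_model`). It also checks a third property: the dot product between two encodings depends only on the offset between their positions.

```python
import math

def sinusoidal_row(pos, d_model):
    """Encoding for one position: sine on even dimensions, cosine on odd ones."""
    row = []
    for i in range(0, d_model, 2):
        angle = pos / (10000.0 ** (i / d_model))
        row.extend([math.sin(angle), math.cos(angle)])
    return row

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

max_len, d_model = 10, 8
table = [sinusoidal_row(p, d_model) for p in range(max_len)]

# Bounded: every value lies in [-1, 1] because it is a sine or cosine
bounded = all(-1.0 <= v <= 1.0 for row in table for v in row)

# Unique: no two positions share the same (rounded) row
unique = len({tuple(round(v, 6) for v in row) for row in table}) == max_len

# Offset-only: PE[p] . PE[p + k] is the same for every p at a fixed k
offset = 3
dots = [dot(table[p], table[p + offset]) for p in range(max_len - offset)]
offset_only = all(math.isclose(d, dots[0], abs_tol=1e-9) for d in dots)

print(f"Bounded in [-1, 1]? {bounded}")
print(f"All {max_len} rows unique? {unique}")
print(f"Dot product depends only on offset? {offset_only}")
```

The offset property follows from the identity sin(a)sin(b) + cos(a)cos(b) = cos(a - b) applied to each sine/cosine pair, which is one reason this encoding makes relative positions easy for attention to pick up.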

## Addition vs Concatenation

Compare adding positional encoding to word embeddings versus concatenating them.

In [ ]:
# Compare addition vs concatenation
word_emb = torch.tensor([1.0, 0.5, -0.3, 0.8])
pos_emb = pos_encoding[1][:4]  # Position 1, first 4 dimensions

print(f"Word embedding:     {word_emb.tolist()}")
print(f"Position embedding: {[round(x, 3) for x in pos_emb.tolist()]}")

# Addition (what transformers use)
added = word_emb + pos_emb
print(f"\nADDITION:")
print(f"Result: {[round(x, 3) for x in added.tolist()]} (shape: {added.shape})")
print("✅ Same dimensionality, blended word-position representation")

# Concatenation (alternative)
concatenated = torch.cat([word_emb, pos_emb])
print(f"\nCONCATENATION:")
print(f"Result: {[round(x, 3) for x in concatenated.tolist()]} (shape: {concatenated.shape})")
print("❌ Double dimensions, separated information")

# Show how addition solves the original problem
print(f"\n🎉 SOLVING THE POSITION PROBLEM:")
emb1 = torch.tensor([[1,0,0,0], [0,1,0,0], [0,0,1,0], [1,1,0,0]]).float()  # cat sat on mat
emb2 = torch.tensor([[1,1,0,0], [0,0,1,0], [0,1,0,0], [1,0,0,0]]).float()  # mat on sat cat

pos_enc_4 = create_sinusoidal_encoding(4, 4)
combined1 = emb1 + pos_enc_4
combined2 = emb2 + pos_enc_4

# A plain sum over positions is linear, so it stays order-invariant even after
# adding positions. Compare how one word is represented instead:
# "cat" is row 0 in sentence 1 and row 3 in sentence 2.
cat_in_1, cat_in_2 = combined1[0], combined2[3]
are_different = not torch.allclose(cat_in_1, cat_in_2, atol=1e-6)

print(f"'cat' represented differently at position 0 vs. position 3? {are_different}")
print("✅ Each word now carries its position, so word order is recoverable!")
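
One way to see why addition gives up little compared with concatenation: adding and then applying a linear projection `W` equals concatenating and then applying the tiled matrix `[W | W]`. The sketch below checks this in plain Python. The position vector and `W` are arbitrary illustrative values, and `matvec` is a hypothetical helper.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

word = [1.0, 0.5, -0.3, 0.8]
position = [0.2, -0.4, 0.9, 0.1]   # illustrative values, not a real encoding
W = [[0.5, -1.0, 0.25, 2.0],       # illustrative 2x4 projection
     [1.5,  0.0, -0.5, 1.0]]

# Path 1: add, then project with W
added_then_projected = matvec(W, [w + p for w, p in zip(word, position)])

# Path 2: concatenate, then project with the 2x8 matrix [W | W]
W_tiled = [row + row for row in W]
concat_then_projected = matvec(W_tiled, word + position)

equivalent = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(added_then_projected, concat_then_projected)
)
print(f"Add-then-project equals concat-then-project with tied weights? {equivalent}")
```

So addition behaves like concatenation with the word and position halves of the projection tied together. Concatenation with untied halves is more general, but it doubles the input width of every layer that reads the embedding.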

## Complete Positional Embedding Layer

Build a complete neural network layer that combines token embeddings with positional encoding.

In [ ]:
class PositionalEmbedding(nn.Module):
    def __init__(self, vocab_size, max_len, d_model):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        
        # Create and register positional encoding as buffer (not trainable)
        pos_encoding = create_sinusoidal_encoding(max_len, d_model)
        self.register_buffer('pos_encoding', pos_encoding)
    
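    def forward(self, token_ids):
        batch_size, seq_len = token_ids.shape
        if seq_len > self.pos_encoding.size(0):
            raise ValueError(
                f"Sequence length {seq_len} exceeds max_len {self.pos_encoding.size(0)}"
            )
        
        # Get token embeddings: (batch_size, seq_len, d_model)
        word_emb = self.token_embedding(token_ids)
        
        # Get positional encoding for this sequence length
        pos_emb = self.pos_encoding[:seq_len].unsqueeze(0)
        pos_emb = pos_emb.expand(batch_size, -1, -1)
        
        # Add them together
        return word_emb + pos_emb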

# Test the complete embedding layer
vocab_size, max_len, d_model = 100, 20, 8
pos_emb_layer = PositionalEmbedding(vocab_size, max_len, d_model)

batch_size, seq_len = 2, 5
token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
embeddings = pos_emb_layer(token_ids)

print(f"Input token IDs shape: {token_ids.shape}")
print(f"Output embeddings shape: {embeddings.shape}")

# Show same token at different positions gets different embeddings
token_id = 42
print(f"\nSame token (ID={token_id}) at different positions:")
for pos in range(4):
    test_input = torch.full((1, pos+1), token_id, dtype=torch.long)
    test_output = pos_emb_layer(test_input)
    final_embedding = test_output[0, pos]
    print(f"Position {pos}: {[round(x, 3) for x in final_embedding[:3].tolist()]}...")

print("\n✅ Same token gets different embeddings at different positions!")

## Summary

You've mastered positional encoding - the key to teaching transformers about word order!

**Key Concepts**:
- **The Problem**: Transformers process all positions in parallel, so word order must be supplied explicitly
- **Sinusoidal Solution**: Sine/cosine functions at different frequencies give each position a unique pattern with values in [-1, 1]
- **Addition**: Adding position to word embeddings keeps the dimensionality at `d_model`, while concatenation would double it
- **Implementation**: A token embedding plus a fixed, non-trainable sinusoidal buffer sliced to the sequence length

**What's Next**: Now you understand all core transformer components:
- Tokenization (notebook 0) - Text → numbers
- Attention (notebook 1) - Information routing  
- Transformer blocks (notebook 2) - Complete processing units
- Position encoding (notebook 3) - Word-order awareness

Ready to see it all working together in training! 🚀
