# 📚 Tutorial 1: Embeddings & Positional Encoding

**Learning Objectives:**
- Understand why we need embeddings in deep learning
- Learn how token embeddings convert discrete symbols to continuous vectors
- Explore positional encodings and why they matter for Transformers
- Visualize embedding spaces and positional patterns
- Incorporate DeepSeek-R1 insights on representation learning

---

## Table of Contents
1. [Introduction: From Symbols to Vectors](#intro)
2. [Token Embeddings: The Foundation](#embeddings)
3. [Positional Encoding: Adding Order](#positional)
4. [DeepSeek Insights: Why This Matters](#deepseek)
5. [Hands-On Implementation](#implementation)
6. [Visualization & Analysis](#visualization)

---

In [None]:
# Setup: Import required libraries
import sys
import os
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Add parent directory to path to import our modules
sys.path.insert(0, str(Path.cwd().parent))

from src.modules.embeddings import TokenEmbedding
from src.modules.positional_encoding import PositionalEncoding, LearnedPositionalEncoding

# Set style for better visualizations
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Set random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print("✅ Imports successful!")
print(f"PyTorch version: {torch.__version__}")
print(f"Device: {'cuda' if torch.cuda.is_available() else 'cpu'}")

---

## 1. Introduction: From Symbols to Vectors <a id="intro"></a>

### The Problem: Neural Networks Need Numbers

Neural networks can't directly process words, characters, or symbols. They work with **continuous numerical vectors** (tensors). This creates a fundamental challenge:

**How do we convert discrete symbols into meaningful continuous representations?**

Consider the sentence: `"The cat sat on the mat"`

- **Input:** `["The", "cat", "sat", "on", "the", "mat"]` (discrete tokens)
- **Need:** `[[0.2, -0.5, ...], [0.1, 0.8, ...], ...]` (continuous vectors)
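
Before any vectors appear, each token has to be mapped to an integer index, which is what an embedding layer consumes. A minimal sketch of that step, assuming a hypothetical whitespace tokenizer with lowercasing and a toy vocabulary built from the sentence itself (real pipelines typically use a trained subword tokenizer):

```python
# Toy illustration: tokens -> integer IDs (the input an embedding layer expects).
# The vocabulary and the lowercasing rule are assumptions for this example only.
sentence = "The cat sat on the mat"
tokens = sentence.lower().split()

# Assign IDs in order of first appearance; 0 is reserved for padding.
vocab = {"<pad>": 0}
for tok in tokens:
    vocab.setdefault(tok, len(vocab))

token_ids = [vocab[tok] for tok in tokens]
print(tokens)     # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(token_ids)  # [1, 2, 3, 4, 1, 5]
```

Both occurrences of "the" share ID 1, so they will later receive the same token embedding.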

### Why Not Just Use One-Hot Encoding?

Let's see why simple one-hot encoding fails for large vocabularies:

```python
# One-hot encoding example
vocab = ["cat", "dog", "bird", "fish"]
# "cat" = [1, 0, 0, 0]
# "dog" = [0, 1, 0, 0]
```

**Problems with One-Hot:**
1. ❌ **Sparse & Inefficient:** For vocab_size=50,000, each word is a 50,000-dimensional vector (99.998% zeros!)
2. ❌ **No Semantic Meaning:** Every pair of words is equally far apart, so "cat" is no closer to "dog" than it is to "democracy"
3. ❌ **Cannot Capture Relations:** No way to represent that "king" - "man" + "woman" ≈ "queen"
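
The second problem can be checked directly. The following sketch (toy vocabulary, NumPy only) computes every pairwise dot product and Euclidean distance between one-hot vectors:

```python
import numpy as np

vocab = ["cat", "dog", "bird", "fish"]
one_hot = np.eye(len(vocab))  # row i is the one-hot vector for vocab[i]

# Dot product between every pair of one-hot vectors.
dots = one_hot @ one_hot.T
# Euclidean distance between every pair.
dists = np.linalg.norm(one_hot[:, None, :] - one_hot[None, :, :], axis=-1)

print(dots)   # identity matrix: distinct words always have dot product 0
print(dists)  # every off-diagonal entry equals sqrt(2)
```

Since all distinct pairs are orthogonal and equidistant, the geometry carries no information about which words are related.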

**Solution: Dense Embeddings** 🎯
- Map each token to a **learned dense vector** (typically 256-1024 dimensions)
- Similar words have similar vectors
- Captures semantic relationships

In [None]:
# Let's compare one-hot vs dense embeddings
vocab_size = 10000
d_model = 512  # embedding dimension

# One-hot: 10000 dimensions per token (sparse)
one_hot_size = vocab_size
print(f"One-hot encoding size: {one_hot_size:,} dimensions per token")
print(f"  → For a 100-token sentence: {one_hot_size * 100:,} total values")
print(f"  → Memory: ~{(one_hot_size * 100 * 4) / 1024:.1f} KB (float32)")

print("\n" + "="*50 + "\n")

# Dense embedding: 512 dimensions per token (dense)
dense_size = d_model
print(f"Dense embedding size: {dense_size} dimensions per token")
print(f"  → For a 100-token sentence: {dense_size * 100:,} total values")
print(f"  → Memory: ~{(dense_size * 100 * 4) / 1024:.1f} KB (float32)")

print("\n" + "="*50 + "\n")
# Use true division: integer division would report 19.0x instead of 19.5x
print(f"✅ Dense embeddings are {one_hot_size / dense_size:.1f}x more compact!")
print(f"✅ Plus they capture semantic meaning!")

---

## 2. Token Embeddings: The Foundation <a id="embeddings"></a>

### Mathematical Foundation

An embedding layer is essentially a **learnable lookup table**:

$$\text{Embedding}: \mathbb{Z}_{V} \rightarrow \mathbb{R}^{d_{model}}$$

Where:
- $V$ = vocabulary size (e.g., 50,000 words)
- $d_{model}$ = embedding dimension (e.g., 512)

For a token with index $i$, the embedding is the $i$-th row of the embedding matrix $E \in \mathbb{R}^{V \times d_{model}}$:

$$\text{emb}(i) = E[i, :] \in \mathbb{R}^{d_{model}}$$
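
The "lookup table" view and the "one-hot times matrix" view are the same operation. A small NumPy sketch, with toy sizes and a random matrix standing in for a learned $E$:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_model = 6, 4                     # toy sizes, chosen for illustration
E = rng.normal(size=(V, d_model))     # stand-in for a learned embedding matrix

token_ids = np.array([3, 0, 3, 5])

# Lookup: pick rows of E by index.
looked_up = E[token_ids]              # shape (4, d_model)

# Equivalent (but wasteful) formulation: one-hot rows times E.
one_hot = np.eye(V)[token_ids]        # shape (4, V)
via_matmul = one_hot @ E              # shape (4, d_model)

print(np.allclose(looked_up, via_matmul))  # True
```

Indexing avoids materializing the $V$-dimensional one-hot vectors, which is why embedding layers are implemented as lookups.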

### The Attention Paper's Scaling Factor

In "Attention Is All You Need", embeddings are **scaled by** $\sqrt{d_{model}}$:

$$\text{scaled\_emb}(i) = \sqrt{d_{model}} \cdot E[i, :]$$

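**Why?** The paper states the scaling without justifying it. The commonly given explanations are:
1. With a small-variance initialization, raw embeddings are much smaller than the positional encodings (whose entries lie in $[-1, 1]$), so scaling keeps the token signal from being drowned out when the two are added
2. The paper shares the embedding matrix with the pre-softmax linear layer, and the scaling lets the same weights work at a suitable magnitude in both roles

This is a different scaling from the $1/\sqrt{d_k}$ factor inside attention, which is the one that keeps attention logits from growing too large.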
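
The magnitude argument can be checked numerically. The sketch below assumes embeddings initialized from $\mathcal{N}(0, 1/d_{model})$; this initialization is an assumption for illustration, not necessarily what `TokenEmbedding` uses:

```python
import numpy as np

d_model = 512
rng = np.random.default_rng(0)

# Assumed initialization: N(0, 1/d_model), so a raw embedding has norm close to 1.
emb = rng.normal(scale=d_model ** -0.5, size=d_model)

# Sinusoidal encoding for one position (pos = 7, chosen arbitrarily).
pos = 7
freqs = 1.0 / 10000 ** (np.arange(0, d_model, 2) / d_model)
pe = np.empty(d_model)
pe[0::2] = np.sin(pos * freqs)
pe[1::2] = np.cos(pos * freqs)

print(f"raw embedding norm:    {np.linalg.norm(emb):.2f}")
print(f"scaled embedding norm: {np.linalg.norm(emb * np.sqrt(d_model)):.2f}")
print(f"positional enc. norm:  {np.linalg.norm(pe):.2f}")  # sqrt(d_model / 2) = 16.00
```

Each sine/cosine pair contributes exactly 1 to the squared norm, so the positional encoding always has norm $\sqrt{d_{model}/2}$. Under the assumed initialization the unscaled embedding is roughly 16 times smaller. With PyTorch's default `nn.Embedding` initialization ($\mathcal{N}(0, 1)$) the raw embeddings are already large, so the effect of the scaling depends on how the layer is initialized.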

### Implementation Details

Let's see our `TokenEmbedding` class in action:

In [None]:
# Create a simple embedding layer
vocab_size = 1000
d_model = 128
padding_idx = 0  # Reserve index 0 for padding

embedding = TokenEmbedding(vocab_size=vocab_size, d_model=d_model, padding_idx=padding_idx)

print(f"📦 Created TokenEmbedding:")
print(f"  - Vocabulary size: {vocab_size:,}")
print(f"  - Embedding dimension: {d_model}")
print(f"  - Padding index: {padding_idx}")
print(f"  - Total parameters: {vocab_size * d_model:,}")
print(f"  - Scaling factor: √{d_model} = {np.sqrt(d_model):.2f}")

# Test with a sequence of tokens
token_ids = torch.tensor([[1, 42, 7, 99, 0],    # First sentence (0 is padding)
                          [15, 88, 0, 0, 0]])    # Second sentence (padded)

print(f"\n📝 Input token IDs shape: {token_ids.shape}")
print(f"Token IDs:\n{token_ids}")

# Get embeddings
embedded = embedding(token_ids)

print(f"\n🎯 Output embeddings shape: {embedded.shape}")
print(f"  - Batch size: {embedded.shape[0]}")
print(f"  - Sequence length: {embedded.shape[1]}")
print(f"  - Embedding dimension: {embedded.shape[2]}")

# Check padding is zeroed out
print(f"\n🔍 Padding check (should be all zeros):")
print(f"  - Position [0, 4] (padding): {embedded[0, 4, :5].detach().numpy()}")
print(f"  - Position [1, 2] (padding): {embedded[1, 2, :5].detach().numpy()}")

### Visualizing the Embedding Space

Let's visualize how embeddings distribute in high-dimensional space. We'll use PCA to reduce dimensions for visualization:

In [None]:
from sklearn.decomposition import PCA

# Get embedding matrix (excluding padding token)
embedding_matrix = embedding.embedding.weight.detach().numpy()[1:500]  # Rows 1-499: a sample of 499 tokens

# Reduce to 2D using PCA
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embedding_matrix)

# Plot
plt.figure(figsize=(12, 8))
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], alpha=0.5, s=30)
plt.title(f'Token Embedding Space (PCA projection)\n{len(embedding_matrix)} tokens → {d_model}D → 2D', fontsize=14)
plt.xlabel(f'First Principal Component ({pca.explained_variance_ratio_[0]*100:.1f}% variance)')
plt.ylabel(f'Second Principal Component ({pca.explained_variance_ratio_[1]*100:.1f}% variance)')
plt.grid(True, alpha=0.3)

# Highlight a few random tokens
highlight_indices = np.random.choice(len(embeddings_2d), 10, replace=False)
plt.scatter(embeddings_2d[highlight_indices, 0], 
           embeddings_2d[highlight_indices, 1], 
           c='red', s=100, marker='*', label='Sample tokens', zorder=5)

plt.legend()
plt.tight_layout()
plt.show()

print(f"📊 Total variance explained by 2 components: {sum(pca.explained_variance_ratio_)*100:.1f}%")
print(f"💡 These untrained embeddings form an unstructured cloud; semantic clusters only emerge with training")

---

## 3. Positional Encoding: Adding Order <a id="positional"></a>

### The Problem: Transformers Have No Inherent Sense of Order

Unlike RNNs which process sequences step-by-step, **Transformers process all tokens in parallel**. This creates a problem:

```
"The cat chased the dog" 
vs 
"The dog chased the cat"
```

Without positional information, these would look identical to the Transformer! 🤔
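
This order-blindness can be demonstrated with a stripped-down self-attention. The sketch below uses identity projections instead of learned query/key/value weights (a simplification made for this example) and random vectors as stand-ins for the five token embeddings:

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention with identity projections (no learned weights)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))          # 5 token embeddings, no positional info
perm = np.array([0, 4, 2, 3, 1])     # swap positions 1 and 4 ("cat" <-> "dog")

out = self_attention(x)
out_perm = self_attention(x[perm])

# Reordering the input just reorders the output: each token's result is unchanged.
print(np.allclose(out_perm, out[perm]))  # True
```

Each token ends up with the same output vector in both orderings, so nothing in the result records who chased whom.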

### Solution: Positional Encoding

We **add** positional information to the embeddings:

$$\text{input} = \text{token\_embedding} + \text{positional\_encoding}$$

### Two Approaches

#### 1. **Sinusoidal Positional Encoding** (Original Paper)

Uses fixed sine/cosine functions:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Where:
- $pos$ = position in sequence (0, 1, 2, ...)
- $i$ = index of the sine/cosine pair (0 to $d_{model}/2 - 1$)
- Even dimensions use sine, odd dimensions use cosine

**Advantages:**
- ✅ No learned parameters (the encoding is defined for any position, so the model can in principle handle longer sequences)
- ✅ Allows model to learn relative positions ($PE_{pos+k}$ is a linear function of $PE_{pos}$, by the sin/cos addition formulas)
- ✅ Each position has a unique encoding
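
The two formulas above translate almost line by line into NumPy. This is a standalone sketch for illustration, not the code of the `PositionalEncoding` module used in this notebook:

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal encodings; d_model must be even."""
    positions = np.arange(max_len)[:, None]                  # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]                # 2i for i = 0 .. d_model/2 - 1
    angles = positions / 10000 ** (two_i / d_model)          # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_encoding(max_len=100, d_model=128)
print(pe.shape)   # (100, 128)
print(pe[0, :4])  # [0. 1. 0. 1.]  (sin(0) = 0, cos(0) = 1)
```

Low dimension indices oscillate quickly with position, while high indices change very slowly, which is what gives each position a distinct pattern.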

#### 2. **Learned Positional Encoding**

Treats positional encodings as learnable parameters (like embeddings).

**Advantages:**
- ✅ Can adapt to task-specific patterns
- ✅ Sometimes performs better on fixed-length tasks

**Disadvantages:**
- ❌ Cannot extrapolate beyond max training length
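
The extrapolation limit follows from the data structure: a learned encoding is a table with one row per position. A toy NumPy stand-in (the class name is hypothetical, and random values take the place of trained ones; this is not the notebook's `LearnedPositionalEncoding`):

```python
import numpy as np

class LearnedPositionTable:
    """Toy stand-in for a learned positional encoding (no training shown)."""

    def __init__(self, max_len, d_model, seed=0):
        rng = np.random.default_rng(seed)
        # One trainable row per position; random values stand in for learned ones.
        self.table = rng.normal(scale=0.02, size=(max_len, d_model))

    def __call__(self, x):
        seq_len = x.shape[1]
        if seq_len > self.table.shape[0]:
            raise ValueError(
                f"sequence length {seq_len} exceeds max_len {self.table.shape[0]}"
            )
        return x + self.table[:seq_len]   # broadcast over the batch dimension

pe = LearnedPositionTable(max_len=16, d_model=8)
print(pe(np.zeros((2, 10, 8))).shape)   # (2, 10, 8)

try:
    pe(np.zeros((2, 20, 8)))            # longer than max_len: no rows to look up
except ValueError as err:
    print("Error:", err)
```

Positions beyond `max_len` simply have no row in the table, whereas the sinusoidal formula can be evaluated at any position.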

Let's implement and visualize both!

In [None]:
# Create sinusoidal positional encoding
max_len = 100
pe_sinusoidal = PositionalEncoding(d_model=d_model, max_len=max_len, dropout=0.0)

# Create learned positional encoding
pe_learned = LearnedPositionalEncoding(d_model=d_model, max_len=max_len, dropout=0.0)

# Create dummy input
dummy_input = torch.randn(1, 50, d_model)  # (batch, seq_len, d_model)

# Apply both types
output_sin = pe_sinusoidal(dummy_input)
output_learned = pe_learned(dummy_input)

print("📍 Positional Encoding Comparison:")
print(f"\nSinusoidal PE:")
print(f"  - Parameters: 0 (fixed)")
print(f"  - Max sequence length: {max_len}")
print(f"  - Can extrapolate: Yes")

print(f"\nLearned PE:")
print(f"  - Parameters: {max_len * d_model:,}")
print(f"  - Max sequence length: {max_len}")
print(f"  - Can extrapolate: No")

print(f"\n✅ Both add positional information to the {d_model}-dimensional embeddings")

### Visualizing Positional Encodings

Let's visualize the sinusoidal patterns:

In [None]:
# Get the positional encoding matrix
pe_matrix = pe_sinusoidal.pe[0].detach().numpy()  # Shape: (max_len, d_model)

# Create visualizations
fig, axes = plt.subplots(2, 2, figsize=(16, 10))

# 1. Heatmap of positional encodings
ax1 = axes[0, 0]
im = ax1.imshow(pe_matrix[:50].T, aspect='auto', cmap='RdBu_r', vmin=-1, vmax=1)
ax1.set_xlabel('Position in Sequence')
ax1.set_ylabel('Embedding Dimension')
ax1.set_title('Sinusoidal Positional Encoding Heatmap\n(First 50 positions)')
plt.colorbar(im, ax=ax1)

# 2. Wave patterns for different dimensions
ax2 = axes[0, 1]
positions = np.arange(max_len)
for dim in [0, 1, 8, 16, 32, 64]:
    if dim < d_model:
        ax2.plot(positions, pe_matrix[:, dim], label=f'Dim {dim}', alpha=0.7)
ax2.set_xlabel('Position')
ax2.set_ylabel('Encoding Value')
ax2.set_title('Positional Encoding Waves\n(Different dimensions have different frequencies)')
ax2.legend()
ax2.grid(True, alpha=0.3)

# 3. Encoding for specific positions
ax3 = axes[1, 0]
for pos in [0, 10, 25, 50]:
    ax3.plot(pe_matrix[pos], label=f'Position {pos}', alpha=0.7)
ax3.set_xlabel('Dimension')
ax3.set_ylabel('Encoding Value')
ax3.set_title('Positional Encoding Vectors\n(Each position has unique pattern)')
ax3.legend()
ax3.grid(True, alpha=0.3)

# 4. Distance between positions (cosine similarity)
ax4 = axes[1, 1]
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity(pe_matrix[:50])
im2 = ax4.imshow(similarities, cmap='viridis')
ax4.set_xlabel('Position')
ax4.set_ylabel('Position')
ax4.set_title('Cosine Similarity Between Positions\n(Brighter = More similar)')
plt.colorbar(im2, ax=ax4)

plt.tight_layout()
plt.show()

print("📊 Key Observations:")
print("  1. Different dimensions oscillate at different frequencies")
print("  2. Nearby positions have similar encodings (smooth gradient)")
print("  3. Each position gets a unique pattern across all dimensions")
print("  4. The pattern allows the model to learn relative positions")

---

## 4. DeepSeek Insights: Why This Matters <a id="deepseek"></a>

### 🔬 DeepSeek-R1 Perspective on Embeddings

**DeepSeek-R1**, with its advanced reasoning capabilities, highlights several key insights about embeddings and positional encodings:

#### 1. **Representation Learning is Foundation**

> "High-quality representations are the bedrock of reasoning. Without rich, semantic embeddings, downstream attention mechanisms cannot discover meaningful patterns."

**What this means:**
- Embeddings create the "vocabulary" of features that attention operates on
- Poor embeddings → Poor attention → Poor reasoning
- The $\sqrt{d_{model}}$ scaling keeps token embeddings on a scale comparable to the positional encodings

#### 2. **Positional Encodings Enable Relational Reasoning**

> "Absolute position is less important than relative relationships. Sinusoidal encodings allow the model to learn: 'this word is 3 positions after that word' regardless of their absolute positions."

**Example:**
```
"The cat sat on the mat"
     ↓   ↓
"cat" and "sat" are always adjacent
"cat" and "mat" are always 4 positions apart
```

The model can learn these **relative patterns** because:

$$PE_{pos+k} = f(PE_{pos}, k)$$

Due to sine/cosine addition formulas.
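
Writing $\omega_i = 1/10000^{2i/d_{model}}$ for the frequency of the $i$-th sine/cosine pair, the angle-addition formulas give $f$ explicitly as a rotation that depends only on the offset $k$:

$$\begin{pmatrix} \sin(\omega_i (pos+k)) \\ \cos(\omega_i (pos+k)) \end{pmatrix} = \begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix} \begin{pmatrix} \sin(\omega_i \, pos) \\ \cos(\omega_i \, pos) \end{pmatrix}$$

The matrix does not contain $pos$, so one fixed linear map per offset $k$ takes the encoding of any position to the encoding $k$ steps later.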

#### 3. **Emergent Structure in Embedding Space**

> "Well-trained embeddings naturally cluster by semantic similarity, part-of-speech, and syntactic function, even though we never explicitly trained for this!"

This is **representation learning** at work:
- Similar contexts → Similar embeddings
- The model discovers linguistic structure automatically

In [None]:
# Demonstrate relative position learning
# The key property: PE(pos + k) can be represented as a linear function of PE(pos)

pos1 = 10
pos2 = 13  # 3 positions later

pe1 = pe_matrix[pos1]
pe2 = pe_matrix[pos2]

# Calculate the "difference" in positional encoding
diff = pe2 - pe1

print("🔍 DeepSeek Insight: Relative Position Encoding")
print(f"\nPosition {pos1} encoding (first 10 dims): {pe1[:10]}")
print(f"Position {pos2} encoding (first 10 dims): {pe2[:10]}")
print(f"\nDifference (first 10 dims): {diff[:10]}")

# Check another pair with same relative distance
pos3 = 25
pos4 = 28  # Also 3 positions later

pe3 = pe_matrix[pos3]
pe4 = pe_matrix[pos4]
diff2 = pe4 - pe3

print(f"\n✨ Same relative distance (3 positions):")
print(f"Position {pos3} → {pos4} difference (first 10 dims): {diff2[:10]}")

# The differences should have similar patterns (not identical due to non-linearity, but related)
correlation = np.corrcoef(diff, diff2)[0, 1]
print(f"\n📊 Correlation between the two 3-step differences: {correlation:.4f}")
print(f"💡 The model can learn: 'words that are 3 apart have this relationship'")
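
The subtraction above only hints at the relationship, since the two difference vectors are correlated but not equal. A more direct check of the linear property is to build the shift matrix for $k = 3$ and apply it to both starting positions. The sketch below is self-contained and rebuilds the sinusoidal table in NumPy; the helper names are hypothetical:

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """(max_len, d_model) sinusoidal table: sin on even dims, cos on odd dims."""
    pos = np.arange(max_len)[:, None]
    freqs = 1.0 / 10000 ** (np.arange(0, d_model, 2) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(pos * freqs)
    pe[:, 1::2] = np.cos(pos * freqs)
    return pe, freqs

def shift_matrix(k, freqs):
    """Block-diagonal matrix M_k with PE[pos + k] = M_k @ PE[pos] for every pos."""
    d_model = 2 * len(freqs)
    m = np.zeros((d_model, d_model))
    for i, w in enumerate(freqs):
        c, s = np.cos(w * k), np.sin(w * k)
        m[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, s], [-s, c]]
    return m

pe, freqs = sinusoidal_encoding(max_len=100, d_model=128)
m3 = shift_matrix(3, freqs)

# The same matrix maps 10 -> 13 and 25 -> 28.
print(np.allclose(m3 @ pe[10], pe[13]))  # True
print(np.allclose(m3 @ pe[25], pe[28]))  # True
```

One matrix handles every starting position, which is the sense in which a "3 positions later" relationship is available to the model as a linear map.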

---

## 5. Hands-On Implementation <a id="implementation"></a>

### Building a Complete Input Pipeline

Let's put it all together: **Embeddings + Positional Encoding = Transformer Input**

In [None]:
class TransformerInputEmbedding(nn.Module):
    """
    Complete input embedding for Transformer:
    1. Token embedding (with scaling)
    2. Positional encoding
    3. Dropout
    """
    def __init__(self, vocab_size, d_model, max_len=5000, dropout=0.1, padding_idx=0):
        super().__init__()
        self.d_model = d_model
        self.token_embedding = TokenEmbedding(vocab_size, d_model, padding_idx)
        self.positional_encoding = PositionalEncoding(d_model, max_len, dropout)
        
    def forward(self, x):
        """
        Args:
            x: Token IDs [batch_size, seq_len]
        Returns:
            Embedded input [batch_size, seq_len, d_model]
        """
        # Step 1: Convert tokens to embeddings (includes √d_model scaling)
        token_emb = self.token_embedding(x)  # [batch, seq_len, d_model]
        
        # Step 2: Add positional encoding
        output = self.positional_encoding(token_emb)  # [batch, seq_len, d_model]
        
        return output


# Create the complete input embedding
input_embedding = TransformerInputEmbedding(
    vocab_size=10000,
    d_model=512,
    max_len=200,
    dropout=0.1,
    padding_idx=0
)

# Test with a sample sentence
# Let's simulate: "The cat sat on the mat <PAD> <PAD>"
sample_tokens = torch.tensor([[45, 123, 87, 56, 89, 234, 0, 0]])  # [1, 8]

print("🎬 Complete Input Pipeline Demo")
print(f"\nInput tokens shape: {sample_tokens.shape}")
print(f"Input tokens: {sample_tokens[0].tolist()}")

# Pass through the pipeline
embedded_input = input_embedding(sample_tokens)

print(f"\n✨ Final embedded input shape: {embedded_input.shape}")
print(f"   - Ready for Transformer encoder/decoder!")

# Verify the components
print(f"\n📦 What happened:")
print(f"   1. Tokens → Dense vectors ({input_embedding.d_model}D)")
print(f"   2. Scaled by √{input_embedding.d_model} = {np.sqrt(input_embedding.d_model):.2f}")
print(f"   3. Added position information")
print(f"   4. Applied dropout (p=0.1)")

# Show first few dimensions of first two tokens
print(f"\n🔍 First token embedding (first 8 dims):")
print(f"   {embedded_input[0, 0, :8].detach().numpy()}")
print(f"\n🔍 Second token embedding (first 8 dims):")
print(f"   {embedded_input[0, 1, :8].detach().numpy()}")

---

## 6. Visualization & Analysis <a id="visualization"></a>

### Comparing Token Embeddings vs. Token + Position

In [None]:
# Create a sequence and compare before/after positional encoding
test_tokens = torch.tensor([[10, 20, 30, 40, 50]])  # 5 different tokens

# Turn off dropout and gradient tracking so the tensors can be converted to NumPy
input_embedding.eval()
with torch.no_grad():
    # Token embeddings only (no positional encoding)
    token_emb_only = input_embedding.token_embedding(test_tokens)
    # Complete embeddings (token + position)
    complete_emb = input_embedding(test_tokens)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Before: Token embeddings only
ax1 = axes[0]
im1 = ax1.imshow(token_emb_only[0].T.numpy(), aspect='auto', cmap='coolwarm')
ax1.set_xlabel('Position in Sequence')
ax1.set_ylabel('Embedding Dimension')
ax1.set_title('Token Embeddings Only\n(No position information)')
ax1.set_xticks(range(5))
ax1.set_xticklabels(['Pos 0', 'Pos 1', 'Pos 2', 'Pos 3', 'Pos 4'])
plt.colorbar(im1, ax=ax1)

# After: Token + Positional embeddings
ax2 = axes[1]
im2 = ax2.imshow(complete_emb[0].T.numpy(), aspect='auto', cmap='coolwarm')
ax2.set_xlabel('Position in Sequence')
ax2.set_ylabel('Embedding Dimension')
ax2.set_title('Token + Positional Embeddings\n(Position information encoded)')
ax2.set_xticks(range(5))
ax2.set_xticklabels(['Pos 0', 'Pos 1', 'Pos 2', 'Pos 3', 'Pos 4'])
plt.colorbar(im2, ax=ax2)

plt.tight_layout()
plt.show()

print("📊 Key Differences:")
print("  Left: Only semantic (token) information")
print("  Right: Semantic + positional information")
print("  → Now each position has a unique signature!")

---

## 🎯 Summary & Key Takeaways

### What We Learned

1. **Token Embeddings**
   - Convert discrete tokens to continuous vectors
   - Much more efficient and expressive than one-hot encoding
   - Scaled by √d_model to keep them on a scale comparable to the positional encodings
   - Learn semantic relationships automatically

2. **Positional Encoding**
   - Essential for Transformers (which process in parallel)
   - Sinusoidal: Fixed, generalizes to any length, enables relative position learning
   - Learned: Adaptive but limited to training length
   - Added to token embeddings, not concatenated

3. **DeepSeek Insights**
   - High-quality representations enable reasoning
   - Relative positions more important than absolute
   - Emergent structure from optimization

### Mathematical Formulation

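$$\text{TransformerInput}(x) = \sqrt{d_{model}} \cdot \text{TokenEmb}(x) + \text{PosEnc}(0, \dots, L-1)$$

Where:
- $x \in \mathbb{Z}^{B \times L}$ (batch of token sequences)
- $\text{PosEnc}$ depends only on the positions $0, \dots, L-1$, not on the token values, and is broadcast over the batch
- Output $\in \mathbb{R}^{B \times L \times d_{model}}$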

### Next Steps

In **Tutorial 2: Attention Mechanisms**, we'll see how these rich embeddings are used by:
- **Scaled Dot-Product Attention**: Computing relevance between positions
- **Multi-Head Attention**: Parallel attention with different perspectives
- **Self-Attention vs Cross-Attention**: Different attention patterns

The embeddings we created here are the **input** to those attention mechanisms!

---

## 🧪 Exercises

Try these experiments to deepen your understanding:

1. **Change d_model**: Try 256, 512, 1024. How does it affect the representation capacity?

2. **Visualize Learned PE**: Train the `LearnedPositionalEncoding` on a dummy task and visualize the learned patterns

3. **Embedding Similarity**: Use `TokenEmbedding.compute_embedding_similarity()` to find which tokens are most similar

4. **Longer Sequences**: Test sinusoidal PE on sequences longer than those seen during training (increase `max_len` if the precomputed table is too short)

5. **2D Positional Encoding**: Extend to 2D (for images) by encoding both row and column positions