# Lab 4: Building a Small Language Model (SLM-1)
## Next-Token Prediction from Scratch

**Duration**: ~3 hours

### Learning Objectives
By the end of this lab, you will be able to:
1. Understand character-level language modeling
2. Build a bigram model (simplest language model)
3. Create embeddings from scratch
4. Train a neural language model
5. Generate new text!

### Prerequisites
- Completed Lab 3 (PyTorch basics)

### The Big Idea

**Language models predict the next token given previous tokens.**

```
Input: "The cat sat on the"
Model predicts: "mat" (or "floor", "chair", etc.)
```

This is the same core idea behind ChatGPT, Claude, and other LLMs: they are trained to predict the next token, just at massive scale!
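
To make the idea concrete, here is a minimal sketch that turns one sentence into (context, next token) training pairs. It splits on whitespace, which is a simplification for illustration and not how real tokenizers work, and the helper name `next_token_pairs` is made up for this example.

```python
def next_token_pairs(text):
    """Return (context, next_token) pairs for every position in the text."""
    tokens = text.split()  # assumption: whitespace tokens, for illustration only
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = next_token_pairs("The cat sat on the mat")
for context, target in pairs:
    print(f"{' '.join(context)!r} -> {target!r}")
```

Every position in the text yields one training example, which is why even a short text produces many examples.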

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
import numpy as np

# Set random seed
torch.manual_seed(42)
np.random.seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

---

# Part 1: The Autocomplete Game

## Your Phone Already Does This!

Every time you type on your phone, it suggests the next word:

```
You type: "I'll be there in"
Suggestions: [5 minutes] [a bit] [an hour]
```

**This is next-token prediction!**

Let's start even simpler: predict the next **character** instead of word.

In [None]:
# SOLVED EXAMPLE: Interactive Prediction Game

# What comes after these?
test_cases = [
    "Harr",      # Harry? Harrison?
    "Michae",    # Michael!
    "qu",        # Probably 'e' or 'i'
    "th",        # 'e' is very common
]

print("Guess the next character!")
print("=" * 40)

for text in test_cases:
    print(f"  '{text}' + ? = ???")

print("\n(We'll train a model to do this automatically!)")

## Our Dataset: Names

We'll use the dataset of ~32,000 names from Andrej Karpathy's makemore project.

Why names?
- Short and simple
- Clear patterns (names follow conventions)
- Fun to generate new names!

In [None]:
# Download the names dataset
import urllib.request
import os

url = "https://raw.githubusercontent.com/karpathy/makemore/master/names.txt"
filename = "names.txt"

if not os.path.exists(filename):
    urllib.request.urlretrieve(url, filename)
    print(f"Downloaded {filename}")
else:
    print(f"{filename} already exists")

# Load names
with open(filename, 'r') as f:
    names = f.read().splitlines()

print(f"\nLoaded {len(names)} names")
print(f"\nFirst 10 names: {names[:10]}")
print(f"Random names: {np.random.choice(names, 5).tolist()}")

In [None]:
# SOLVED EXAMPLE: Explore the Dataset

# Length distribution
lengths = [len(name) for name in names]

print(f"Shortest name: {min(lengths)} characters")
print(f"Longest name: {max(lengths)} characters")
print(f"Average length: {np.mean(lengths):.1f} characters")

# Character frequency
all_chars = ''.join(names).lower()
char_counts = {}
for c in all_chars:
    char_counts[c] = char_counts.get(c, 0) + 1

# Sort by frequency
sorted_chars = sorted(char_counts.items(), key=lambda x: -x[1])

print(f"\nMost common characters:")
for char, count in sorted_chars[:10]:
    print(f"  '{char}': {count} ({count/len(all_chars)*100:.1f}%)")

In [None]:
# SOLVED EXAMPLE: Build Vocabulary

# Get all unique characters
chars = sorted(list(set(''.join(names).lower())))

# Add special tokens
# '.' represents start/end of name
chars = ['.'] + chars

vocab_size = len(chars)
print(f"Vocabulary: {chars}")
print(f"Vocabulary size: {vocab_size}")

# Create character <-> index mappings
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for i, c in enumerate(chars)}

print(f"\nExamples:")
print(f"  'a' -> {char_to_idx['a']}")
print(f"  'z' -> {char_to_idx['z']}")
print(f"  '.' -> {char_to_idx['.']}")

---

# Part 2: The Bigram Model

## Simplest Language Model Ever

**Bigram**: Look at ONE character, predict the NEXT character.

```
BIGRAM MODEL:

    Previous      →    Next
    Character          Character
    
       'a'        →    'n' (30%)
                  →    'l' (15%)
                  →    'r' (12%)
                  →    ...
```

We just count how often each pair of characters appears!
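
Before running this on the full dataset, it may help to see the counting on three names. This is a small sketch using `collections.Counter` instead of a tensor; the list `toy_names` is a made-up sample, not the real dataset.

```python
from collections import Counter

toy_names = ["emma", "ava", "mia"]  # made-up sample, not the real dataset

pair_counts = Counter()
for name in toy_names:
    padded = "." + name + "."  # '.' marks both the start and the end of a name
    for ch1, ch2 in zip(padded, padded[1:]):
        pair_counts[(ch1, ch2)] += 1

for (ch1, ch2), count in pair_counts.most_common(3):
    print(f"{ch1}{ch2}: {count}")
```

All three toy names end in 'a', so the pair ('a', '.') is the most frequent one here.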

In [None]:
# SOLVED EXAMPLE: Build Bigram Counts

# Count all character pairs
bigram_counts = torch.zeros((vocab_size, vocab_size), dtype=torch.int32)

for name in names:
    # Add start/end tokens
    name = '.' + name.lower() + '.'
    
    for ch1, ch2 in zip(name, name[1:]):
        idx1 = char_to_idx[ch1]
        idx2 = char_to_idx[ch2]
        bigram_counts[idx1, idx2] += 1

print(f"Bigram counts shape: {bigram_counts.shape}")
print(f"Total bigrams counted: {bigram_counts.sum().item()}")

In [None]:
# SOLVED EXAMPLE: Visualize Bigram Counts

plt.figure(figsize=(16, 16))
plt.imshow(bigram_counts, cmap='Blues')

# Add labels
for i in range(vocab_size):
    for j in range(vocab_size):
        count = bigram_counts[i, j].item()
        if count > 0:
            plt.text(j, i, chars[i] + chars[j], ha='center', va='bottom', fontsize=6)
            plt.text(j, i, str(count), ha='center', va='top', fontsize=6, color='gray')

plt.xlabel('Second Character')
plt.ylabel('First Character')
plt.xticks(range(vocab_size), chars)
plt.yticks(range(vocab_size), chars)
plt.title('Bigram Counts (First → Second Character)')
plt.tight_layout()
plt.show()

In [None]:
# SOLVED EXAMPLE: Convert Counts to Probabilities

# Normalize rows to get probabilities
# P(next | current) = count(current, next) / count(current, *)

# Add-one (Laplace) smoothing: no pair ends up with probability zero,
# and no row can sum to zero when we normalize
bigram_probs = (bigram_counts + 1).float()
bigram_probs = bigram_probs / bigram_probs.sum(dim=1, keepdim=True)

print("Probability of next character after 'a':")
probs_after_a = bigram_probs[char_to_idx['a']]
top_k = probs_after_a.topk(5)

for prob, idx in zip(top_k.values, top_k.indices):
    print(f"  'a' → '{chars[idx]}': {prob.item():.2%}")
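
The effect of add-one smoothing is easiest to see on a tiny example. The sketch below uses hypothetical counts (not taken from the names dataset) and a made-up helper `normalize`. Without smoothing, a pair that never appeared gets probability 0, so it can never be sampled and its log-probability is negative infinity.

```python
# Hypothetical counts of what follows some character in a tiny corpus
counts = {"u": 8, "a": 2, "z": 0}

def normalize(counts, alpha=0):
    """Turn counts into probabilities, adding alpha to every count first."""
    total = sum(counts.values()) + alpha * len(counts)
    return {ch: (c + alpha) / total for ch, c in counts.items()}

raw = normalize(counts)                # 'z' gets probability 0
smoothed = normalize(counts, alpha=1)  # 'z' gets a small non-zero probability

print("raw:     ", raw)
print("smoothed:", smoothed)
```

With roughly 32,000 names the +1 barely changes the frequent pairs, but it matters for pairs that never occur.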

In [None]:
# SOLVED EXAMPLE: Generate Names with Bigram Model

def generate_bigram(max_len=20):
    """Generate a name using the bigram model."""
    name = []
    current_char = '.'  # Start token
    
    for _ in range(max_len):
        # Get probability distribution
        probs = bigram_probs[char_to_idx[current_char]]
        
        # Sample next character
        next_idx = torch.multinomial(probs, 1).item()
        next_char = chars[next_idx]
        
        # Check for end token
        if next_char == '.':
            break
            
        name.append(next_char)
        current_char = next_char
    
    return ''.join(name).capitalize()

# Generate some names
print("Names generated by bigram model:")
print("=" * 40)

for i in range(20):
    name = generate_bigram()
    print(f"  {name}")
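
The key step in `generate_bigram` is sampling from a probability distribution rather than always taking the most likely character. Here is a rough standard-library analogue of that step, using `random.choices` in place of `torch.multinomial`; the three-character distribution is invented for illustration.

```python
import random

# Hypothetical next-character distribution (not taken from the real data)
next_chars = ["a", "e", "."]
next_probs = [0.5, 0.3, 0.2]

rng = random.Random(0)  # seeded so the run is repeatable

# Greedy decoding: always picks the single most likely character
greedy = next_chars[next_probs.index(max(next_probs))]

# Sampling: characters appear roughly in proportion to their probability
samples = rng.choices(next_chars, weights=next_probs, k=1000)
freqs = {c: samples.count(c) / len(samples) for c in next_chars}

print("greedy:", greedy)
print("sampled frequencies:", freqs)
```

Sampling is why each call produces a different name; greedy decoding would return the same name every time.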

### Bigram Model: Observations

**Pros:**
- Simple and fast
- Captures basic character patterns
- Some names look reasonable!

**Cons:**
- No memory beyond 1 character
- Can't learn longer patterns
- Many names are gibberish

**Let's use a neural network to do better!**

## Question 1.1

What character is most likely to:
1. Start a name (come after '.')?
2. End a name (come before '.')?

Find these using the bigram probability matrix.

In [None]:
# YOUR CODE HERE

# Most likely first character
# Hint: Look at row for '.' (start token)

# Most likely last character
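# Hint: Look at column for '.' (end token).
# Note: bigram_probs is normalized per row, so this column gives P('.' | char).
# For the most frequent last character, use the same column of bigram_counts.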


---

# Part 3: Embeddings

## Why Embeddings?

**Problem:** How do we represent characters as numbers for a neural network?

**Bad idea: One-hot encoding**
```
'.' → [1, 0, 0, 0, ... 0]  (27 dimensions!)
'a' → [0, 1, 0, 0, ... 0]
'b' → [0, 0, 1, 0, ... 0]
```

Problems:
- Very sparse (mostly zeros)
- All characters equidistant (no similarity)

**Better idea: Learned embeddings**
```
'a' → [0.2, -0.5, 0.1, ...]  (dense, learned)
'b' → [0.3, -0.4, 0.2, ...]
'c' → [0.25, -0.45, 0.15, ...]  (similar to 'a' and 'b'!)
```
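
The two ideas are closely related: looking up a row in an embedding table gives the same result as multiplying a one-hot vector by that table, only without the wasted multiplications by zero. The sketch below checks this with a made-up 3-character vocabulary and hand-picked numbers; `one_hot` and `vec_mat` are helper names invented here.

```python
# Tiny made-up vocabulary and embedding table (3 characters x 2 dimensions)
vocab = [".", "a", "b"]
table = [
    [0.0, 0.0],   # '.'
    [0.2, -0.5],  # 'a'
    [0.3, -0.4],  # 'b'
]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_mat(vec, mat):
    """Multiply a row vector by a matrix stored as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

idx = vocab.index("a")
via_one_hot = vec_mat(one_hot(idx, len(vocab)), table)
via_lookup = table[idx]

print(via_one_hot, via_lookup)
```

In the learned version, the rows of the table are parameters that get updated during training like any other weights.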

In [None]:
# SOLVED EXAMPLE: Creating Embeddings

# Embedding layer: maps indices to vectors
embed_dim = 2  # Start with 2D for visualization
embedding = nn.Embedding(vocab_size, embed_dim)

print(f"Embedding table shape: {embedding.weight.shape}")
print(f"  {vocab_size} characters × {embed_dim} dimensions")

# Look up some characters
idx = torch.tensor([char_to_idx['a'], char_to_idx['b'], char_to_idx['c']])
vectors = embedding(idx)

print(f"\nEmbedding for 'a', 'b', 'c':")
print(vectors)

In [None]:
# SOLVED EXAMPLE: Visualize 2D Embeddings

# Get all embeddings
all_idx = torch.arange(vocab_size)
all_embeddings = embedding(all_idx).detach().numpy()

plt.figure(figsize=(10, 10))
plt.scatter(all_embeddings[:, 0], all_embeddings[:, 1], s=100)

# Add labels
for i, char in enumerate(chars):
    plt.annotate(char, (all_embeddings[i, 0], all_embeddings[i, 1]), 
                 fontsize=12, ha='center', va='center')

plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.title('Character Embeddings (Random Initialization)\n(Will learn meaningful positions during training!)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

---

# Part 4: Neural Language Model

## The Architecture

```
NEURAL LANGUAGE MODEL:

Input: "emm" (predict 'a')

  'e'  'm'  'm'
   │    │    │
   ▼    ▼    ▼
┌─────────────────┐
│   Embeddings    │  (lookup vectors)
└─────────────────┘
         │
         ▼
┌─────────────────┐
│    Concat       │  (combine vectors)
└─────────────────┘
         │
         ▼
┌─────────────────┐
│   Hidden MLP    │  (learn patterns)
└─────────────────┘
         │
         ▼
┌─────────────────┐
│    Output       │  (probability for each char)
└─────────────────┘
         │
         ▼
    P(a) = 0.85
    P(b) = 0.02
    ...
```
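
To see how the shapes change from box to box, here is a sketch of one forward pass in plain Python with random weights and toy sizes. It is not the lab's model: biases are omitted, the sizes are arbitrary, and `rand_matrix` and `linear` are helper names made up for this illustration.

```python
import math
import random

rng = random.Random(0)
vocab_size, embed_dim, context_length, hidden_dim = 27, 2, 3, 4  # toy sizes

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def linear(x, weights):
    """x has length n_in, weights is n_in x n_out (bias omitted for brevity)."""
    return [sum(xi * row[j] for xi, row in zip(x, weights))
            for j in range(len(weights[0]))]

embedding = rand_matrix(vocab_size, embed_dim)
w1 = rand_matrix(context_length * embed_dim, hidden_dim)
w2 = rand_matrix(hidden_dim, vocab_size)

context = [5, 13, 13]  # indices of 'e', 'm', 'm' when '.'=0, 'a'=1, 'b'=2, ...
vectors = [embedding[i] for i in context]            # Embeddings: 3 vectors of length 2
concat = [v for vec in vectors for v in vec]         # Concat: length 3 * 2 = 6
hidden = [math.tanh(h) for h in linear(concat, w1)]  # Hidden MLP: length 4
logits = linear(hidden, w2)                          # Output scores: length 27

top = max(logits)                                    # subtract max for stability
exps = [math.exp(l - top) for l in logits]
total = sum(exps)
probs = [e / total for e in exps]                    # softmax: probabilities sum to 1

print(len(concat), len(hidden), len(probs), round(sum(probs), 6))
```

With random weights the probabilities are meaningless; training is what moves probability toward the correct next character.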

In [None]:
# SOLVED EXAMPLE: Prepare Training Data

# Context length: how many previous characters to look at
context_length = 3

def build_dataset(names, context_length):
    """Build training data: (context) -> (next_char)"""
    X, Y = [], []
    
    for name in names:
        # Add padding and end token
        name = '.' * context_length + name.lower() + '.'
        
        for i in range(len(name) - context_length):
            context = name[i:i+context_length]
            target = name[i+context_length]
            
            # Convert to indices
            context_idx = [char_to_idx[c] for c in context]
            target_idx = char_to_idx[target]
            
            X.append(context_idx)
            Y.append(target_idx)
    
    return torch.tensor(X), torch.tensor(Y)

# Build dataset
X, Y = build_dataset(names, context_length)

print(f"Dataset size: {len(X)} examples")
print(f"X shape: {X.shape}")  # (N, context_length)
print(f"Y shape: {Y.shape}")  # (N,)

# Show some examples
print(f"\nExamples:")
for i in range(5):
    context = ''.join([chars[idx] for idx in X[i].tolist()])
    target = chars[Y[i].item()]
    print(f"  '{context}' → '{target}'")

In [None]:
# SOLVED EXAMPLE: Train/Val/Test Split

# Shuffle
torch.manual_seed(42)
perm = torch.randperm(len(X))
X, Y = X[perm], Y[perm]

# Split: 80% train, 10% val, 10% test
n1 = int(0.8 * len(X))
n2 = int(0.9 * len(X))

X_train, Y_train = X[:n1], Y[:n1]
X_val, Y_val = X[n1:n2], Y[n1:n2]
X_test, Y_test = X[n2:], Y[n2:]

print(f"Training: {len(X_train)} examples")
print(f"Validation: {len(X_val)} examples")
print(f"Test: {len(X_test)} examples")

In [None]:
# SOLVED EXAMPLE: Define the Model

class CharLM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, context_length):
        super().__init__()
        
        self.context_length = context_length
        
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        
        # MLP layers
        self.fc1 = nn.Linear(context_length * embed_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, vocab_size)
        
    def forward(self, x):
        # x shape: (batch, context_length)
        
        # Get embeddings
        emb = self.embedding(x)  # (batch, context_length, embed_dim)
        
        # Flatten embeddings
        emb = emb.view(emb.size(0), -1)  # (batch, context_length * embed_dim)
        
        # MLP
        h = torch.tanh(self.fc1(emb))
        logits = self.fc2(h)
        
        return logits

# Create model
embed_dim = 10
hidden_dim = 100

model = CharLM(vocab_size, embed_dim, hidden_dim, context_length).to(device)
print(model)

n_params = sum(p.numel() for p in model.parameters())
print(f"\nTotal parameters: {n_params:,}")

In [None]:
# SOLVED EXAMPLE: Training Loop

# Move data to device
X_train = X_train.to(device)
Y_train = Y_train.to(device)
X_val = X_val.to(device)
Y_val = Y_val.to(device)

# Training settings
batch_size = 32
learning_rate = 0.01
num_epochs = 20

optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()

# Training history
train_losses = []
val_losses = []

print("Training...")
print("=" * 50)

for epoch in range(num_epochs):
    model.train()
    
    # Mini-batch training
    total_loss = 0
    num_batches = 0
    
    for i in range(0, len(X_train), batch_size):
        # Get batch
        X_batch = X_train[i:i+batch_size]
        Y_batch = Y_train[i:i+batch_size]
        
        # Forward pass
        logits = model(X_batch)
        loss = criterion(logits, Y_batch)
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
        num_batches += 1
    
    avg_train_loss = total_loss / num_batches
    train_losses.append(avg_train_loss)
    
    # Validation loss
    model.eval()
    with torch.no_grad():
        val_logits = model(X_val)
        val_loss = criterion(val_logits, Y_val).item()
    val_losses.append(val_loss)
    
    if (epoch + 1) % 5 == 0:
        print(f"Epoch {epoch+1:3d}: Train Loss = {avg_train_loss:.4f}, Val Loss = {val_loss:.4f}")

print("\nTraining complete!")

In [None]:
# SOLVED EXAMPLE: Plot Training Curves

plt.figure(figsize=(10, 5))
plt.plot(train_losses, 'o-', label='Train')
plt.plot(val_losses, 's-', label='Validation')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Progress')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
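
When reading the loss curve, a useful reference point is the loss of a model that guesses uniformly over all 27 characters. The sketch below computes it with the standard library; `cross_entropy` is a helper written for this example, and 0.85 is just the illustrative probability from the architecture diagram.

```python
import math

vocab_size = 27  # 26 letters plus the '.' token, as in this lab

def cross_entropy(prob_of_correct_char):
    """Loss for one example: negative log of the probability given to the true next character."""
    return -math.log(prob_of_correct_char)

uniform_loss = cross_entropy(1 / vocab_size)  # a model that knows nothing
confident_loss = cross_entropy(0.85)          # illustrative value, not a measured result

print(f"uniform guess:         {uniform_loss:.3f}")
print(f"confident and correct: {confident_loss:.3f}")
```

`nn.CrossEntropyLoss` averages this quantity over the batch, so a trained model's loss should sit clearly below the uniform value of ln(27).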

In [None]:
# SOLVED EXAMPLE: Generate Names with Neural Model

def generate_neural(model, max_len=20, temperature=1.0):
    """Generate a name using the neural language model."""
    model.eval()
    
    # Start with padding
    context = [char_to_idx['.']] * context_length
    name = []
    
    with torch.no_grad():
        for _ in range(max_len):
            # Prepare input
            x = torch.tensor([context]).to(device)
            
            # Get prediction
            logits = model(x)
            
            # Apply temperature
            probs = F.softmax(logits / temperature, dim=-1)
            
            # Sample
            next_idx = torch.multinomial(probs, 1).item()
            next_char = chars[next_idx]
            
            # Check for end
            if next_char == '.':
                break
            
            name.append(next_char)
            
            # Update context (sliding window)
            context = context[1:] + [next_idx]
    
    return ''.join(name).capitalize()

# Generate names
print("Names generated by neural model:")
print("=" * 40)

for i in range(20):
    name = generate_neural(model, temperature=0.8)
    print(f"  {name}")

In [None]:
# SOLVED EXAMPLE: Temperature Comparison

print("Effect of Temperature on Generation:")
print("=" * 50)

for temp in [0.5, 0.8, 1.0, 1.5]:
    print(f"\nTemperature = {temp}:")
    # Use a new variable name so we don't overwrite the `names` dataset
    samples = [generate_neural(model, temperature=temp) for _ in range(5)]
    for sample in samples:
        print(f"  {sample}")

print("\n" + "=" * 50)
print("Low temp (0.5): More conservative, common names")
print("High temp (1.5): More creative/random")
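
To see why temperature has this effect, here is a small standard-library sketch of the same computation as `F.softmax(logits / temperature, dim=-1)`. The three logits are made-up scores, and `softmax_with_temperature` is a helper name chosen for this example.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [l / temperature for l in logits]
    top = max(scaled)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three characters

for t in [0.5, 1.0, 1.5]:
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Dividing by a temperature below 1 widens the gaps between logits, so the top choice dominates; a temperature above 1 shrinks the gaps and flattens the distribution.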

## Question 2.1

Compare the bigram model vs neural model:
1. Generate 50 names from each
2. Count how many "look like real names" (subjective, just guess)
3. Which model produces better results?

In [None]:
# YOUR CODE HERE

# Generate 50 names from bigram model
bigram_names = [generate_bigram() for _ in range(50)]

# Generate 50 names from neural model
neural_names = [generate_neural(model, temperature=0.8) for _ in range(50)]

# Print and compare
print("Bigram names:")
print(bigram_names[:20])

print("\nNeural names:")
print(neural_names[:20])

---

# Part 5: Visualizing Learned Embeddings

In [None]:
# SOLVED EXAMPLE: Visualize Learned Embeddings

# Get trained embeddings
embeddings = model.embedding.weight.detach().cpu().numpy()

# Use PCA to reduce to 2D (if embed_dim > 2)
from sklearn.decomposition import PCA

if embeddings.shape[1] > 2:
    pca = PCA(n_components=2)
    embeddings_2d = pca.fit_transform(embeddings)
else:
    embeddings_2d = embeddings

# Plot
plt.figure(figsize=(12, 10))

# Color vowels differently
vowels = set('aeiou')
colors = ['red' if c in vowels else 'blue' for c in chars]

for i, char in enumerate(chars):
    plt.scatter(embeddings_2d[i, 0], embeddings_2d[i, 1], 
                c=colors[i], s=200, alpha=0.7)
    plt.annotate(char, (embeddings_2d[i, 0], embeddings_2d[i, 1]),
                 fontsize=14, ha='center', va='center', fontweight='bold')

plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.title('Learned Character Embeddings\n(Red = vowels, Blue = consonants)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nNotice: Similar characters should be closer together!")
print("(Vowels may cluster, common consonants may cluster, etc.)")
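
"Closer together" can also be checked numerically instead of by eye. The sketch below finds each character's nearest neighbor by Euclidean distance; the vectors in `toy_embeddings` are invented for illustration (real ones would come from `model.embedding.weight`), and `nearest` is a helper name made up here.

```python
import math

# Made-up 2D embeddings for illustration, not trained values
toy_embeddings = {
    "a": (0.9, 0.8),
    "e": (1.0, 0.7),
    "o": (0.7, 1.0),
    "k": (-0.7, 0.1),
    "t": (-0.6, -0.2),
}

def nearest(char, embeddings):
    """Return the other character whose vector is closest in Euclidean distance."""
    others = [c for c in embeddings if c != char]
    return min(others, key=lambda c: math.dist(embeddings[char], embeddings[c]))

for ch in toy_embeddings:
    print(ch, "->", nearest(ch, toy_embeddings))
```

One caveat: distances in a 2D PCA projection can differ from distances in the full embedding space, so it is safer to compute neighbors on the original vectors.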

---

# Challenge Problems

## Challenge 1: Longer Context

Increase the context length from 3 to 5 or 8 characters.
Does the model generate better names?

In [None]:
# YOUR CODE HERE


## Challenge 2: Deeper Network

Add more hidden layers to the network.
Does it improve the loss?

In [None]:
# YOUR CODE HERE


## Challenge 3: Train on Different Text

Try training on a different dataset, like:
- Pokemon names
- City names
- Made-up words

Just create a text file with one item per line!

In [None]:
# YOUR CODE HERE


---

# Summary

## What We Learned

| Concept | Description |
|---------|-------------|
| **Language Model** | Predicts next token given previous tokens |
| **Bigram Model** | Simplest LM: only looks at 1 previous character |
| **Embeddings** | Learned vector representations for characters |
| **Neural LM** | Uses MLP to learn complex patterns |
| **Temperature** | Controls randomness in generation |

## Key Code Patterns

```python
# Embedding layer
embedding = nn.Embedding(vocab_size, embed_dim)

# Neural LM architecture
class CharLM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, context_length):
        super().__init__()  # required before registering submodules
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.fc1 = nn.Linear(context_length * embed_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, vocab_size)
        
    def forward(self, x):
        emb = self.embedding(x).flatten(1)
        h = torch.tanh(self.fc1(emb))
        return self.fc2(h)

# Sampling with temperature
probs = F.softmax(logits / temperature, dim=-1)
next_idx = torch.multinomial(probs, 1)
```

## What's Next?

In **Lab 5**, we'll:
- Save our trained model
- Build a simple web interface with Gradio
- Deploy our name generator!