# 05: Language Modeling Fundamentals

**Duration:** 2-3 hours | **Difficulty:** Intermediate

## Learning Objectives

By the end of this notebook, you will understand:
- Language modeling fundamentals and applications
- N-gram vs neural language models
- Character-level modeling with LSTM
- Perplexity evaluation and text generation

## Table of Contents
1. [Introduction to Language Modeling](#1-introduction)
2. [N-gram Language Models](#2-ngram-models)
3. [Neural Character-Level Model](#3-neural-model)
4. [Text Generation and Evaluation](#4-generation)
5. [Practical Exercise](#5-exercise)

In [None]:
# Import required libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

import numpy as np
import matplotlib.pyplot as plt
from collections import Counter, defaultdict
import math
import random
from typing import List, Dict, Tuple, Optional

# Import our custom utilities
import sys
sys.path.append('../')
from utils.model_helpers import get_device, count_parameters
from configs.training_configs import get_training_config

# Set device and random seeds
device = get_device("auto")
print(f"Using device: {device}")

torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

## 1. Introduction to Language Modeling {#1-introduction}

**Language modeling** assigns a probability to the next token (a word or a character) given the tokens that precede it. It is fundamental to:
- Text generation and chatbots
- Speech recognition and machine translation
- Auto-completion systems
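
Predicting the next token one step at a time is what lets a model score a whole sequence: the sequence probability is the product of the next-token probabilities (the chain rule). Below is a minimal sketch of that idea. The table `toy_next_token_probs` and the helper `sequence_log_prob` are hypothetical, hand-written for illustration, and are not derived from the corpus used in this notebook.

```python
import math

# Hypothetical next-token distributions, keyed by the previous token.
# A real model conditions on more context; this is a toy table.
toy_next_token_probs = {
    "<START>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "dog": 0.7},
    "cat": {"<END>": 1.0},
    "dog": {"<END>": 1.0},
}

def sequence_log_prob(tokens):
    """Sum log P(token_i | token_{i-1}) over the sequence (chain rule)."""
    log_prob = 0.0
    prev = "<START>"
    for tok in tokens + ["<END>"]:
        log_prob += math.log(toy_next_token_probs[prev][tok])
        prev = tok
    return log_prob

# P("the cat") = P(the | <START>) * P(cat | the) * P(<END> | cat)
p = math.exp(sequence_log_prob(["the", "cat"]))
print(f"P('the cat') = {p:.2f}")
```

Working in log space, as above, avoids numerical underflow once sequences get long.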

### Types:
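1. **N-gram Models**: Count-based statistical models that estimate the next token from how often it followed the previous N-1 tokens
2. **Neural Models**: Networks (here, an LSTM) that learn dense representations and can condition on longer contexts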

In [None]:
# Load and prepare text corpus
with open('../data/corpora/ml_text_corpus.txt', 'r', encoding='utf-8') as f:
    corpus_text = f.read()

print(f"Corpus length: {len(corpus_text)} characters")
print(f"Sample text: {corpus_text[:200]}...")

# Character statistics
chars = sorted(list(set(corpus_text)))
vocab_size = len(chars)
print(f"\nUnique characters: {vocab_size}")
print(f"Characters: {''.join(chars[:30])}...")

## 2. N-gram Language Models {#2-ngram-models}

N-gram models predict the next word from how often it followed the previous N-1 words in the training text.
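
To make the counting concrete, here is a minimal sketch of a bigram (N=2) estimate with add-one smoothing on a made-up sentence. The token list and the helper `bigram_prob` are illustrative assumptions, not part of the notebook's utilities.

```python
from collections import Counter

# Made-up training text for illustration
tokens = "the cat sat on the mat the cat ran".split()
vocab = set(tokens)

# Count each (previous word, word) pair and each context word
bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])

def bigram_prob(word, prev, k=1):
    """P(word | prev) with add-k smoothing (k=1 is add-one smoothing)."""
    numerator = bigram_counts[(prev, word)] + k
    denominator = context_counts[prev] + k * len(vocab)
    return numerator / denominator

print(bigram_prob("cat", "the"))  # bigram seen twice in the text
print(bigram_prob("ran", "the"))  # unseen bigram still gets non-zero mass
```

Smoothing matters because any bigram that never occurred in training would otherwise get probability zero, which makes whole sequences impossible under the model.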

In [None]:
class SimpleNGramModel:
    """Simple N-gram language model for demonstration."""
    
    def __init__(self, n: int = 3):
        self.n = n
        self.ngram_counts = defaultdict(int)
        self.context_counts = defaultdict(int)
        self.vocab = set()
    
    def train(self, text: str):
        """Train model on text."""
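        words = text.lower().split()
        self.vocab.update(words)
        # '<END>' must be a candidate token so that generate() can stop and
        # the smoothed probabilities sum to 1 over everything that can follow
        self.vocab.add('<END>')
        
        # Pad with start markers and close with an end marker
        # (the whole text is treated as a single sequence)
        words = ['<START>'] * (self.n - 1) + words + ['<END>']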
        
        # Count N-grams
        for i in range(len(words) - self.n + 1):
            ngram = tuple(words[i:i + self.n])
            context = ngram[:-1]
            
            self.ngram_counts[ngram] += 1
            self.context_counts[context] += 1
    
    def get_probability(self, word: str, context: Tuple) -> float:
        """Get probability of word given context."""
        ngram = context + (word,)
        
        # Add-one smoothing
        ngram_count = self.ngram_counts.get(ngram, 0) + 1
        context_count = self.context_counts.get(context, 0) + len(self.vocab)
        
        return ngram_count / context_count
    
    def generate(self, prompt: str = "", max_length: int = 20) -> str:
        """Generate text using the model."""
        words = prompt.lower().split()
        # Left-pad with <START> so the context always holds n-1 tokens
        context = (['<START>'] * (self.n - 1) + words)[-(self.n - 1):]
        
        # Keep the whole prompt in the output, not just its last n-1 words
        generated = list(words)
        
        for _ in range(max_length):
            candidates = list(self.vocab)
            probs = [self.get_probability(word, tuple(context)) for word in candidates]
            
            if not probs:
                break
            
            # Sample next word
            probs = np.array(probs)
            probs = probs / probs.sum()
            next_word = np.random.choice(candidates, p=probs)
            
            if next_word == '<END>':
                break
            
            generated.append(next_word)
            context = context[1:] + [next_word]
        
        return ' '.join([w for w in generated if w != '<START>'])

# Train and test N-gram models
print("Training N-gram models...")
models = {}
for n in [2, 3]:
    model = SimpleNGramModel(n=n)
    model.train(corpus_text)
    models[n] = model
    print(f"{n}-gram: {len(model.vocab)} vocab, {len(model.ngram_counts)} n-grams")

# Generate examples
print("\nGenerated text examples:")
for n, model in models.items():
    text = model.generate("machine learning", 10)
    print(f"{n}-gram: {text}")

## 3. Neural Character-Level Model {#3-neural-model}

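Unlike an n-gram model with its fixed window of N-1 tokens, a recurrent network carries a hidden state across the sequence, so it can capture longer dependencies. It also learns an embedding for each character.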
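
A character-level model is trained on pairs of windows: the input is a run of characters and the target is the same run shifted one position to the right. Here is a minimal sketch of that pairing in plain Python. The text, `seq_length = 4`, and the helpers `make_pair` and `decode` are illustrative assumptions.

```python
text = "hello world"

# Build the character vocabulary: sorted unique characters
chars = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for ch, i in char_to_idx.items()}
encoded = [char_to_idx[ch] for ch in text]

seq_length = 4  # illustrative; much shorter than a realistic window

def make_pair(start):
    """Input window and the same window shifted one step to the right."""
    x = encoded[start:start + seq_length]
    y = encoded[start + 1:start + seq_length + 1]
    return x, y

def decode(indices):
    return "".join(idx_to_char[i] for i in indices)

# Every start position that leaves room for a full target window
num_pairs = len(encoded) - seq_length

for start in range(3):
    x, y = make_pair(start)
    print(repr(decode(x)), "->", repr(decode(y)))
```

At every position the model is asked to predict `y[t]` from `x[:t + 1]`, so a single window provides `seq_length` training signals.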

In [None]:
class CharDataset(Dataset):
    """Character-level dataset for language modeling."""
    
    def __init__(self, text: str, seq_length: int = 50):
        self.seq_length = seq_length
        
        # Build vocabulary
        self.chars = sorted(list(set(text)))
        self.vocab_size = len(self.chars)
        self.char_to_idx = {ch: i for i, ch in enumerate(self.chars)}
        self.idx_to_char = {i: ch for i, ch in enumerate(self.chars)}
        
        # Convert to indices
        self.data = [self.char_to_idx[ch] for ch in text]
        print(f"Dataset: {len(self.data)} chars, vocab_size: {self.vocab_size}")
    
    def __len__(self):
        return len(self.data) - self.seq_length
    
    def __getitem__(self, idx):
        x = torch.tensor(self.data[idx:idx + self.seq_length], dtype=torch.long)
        y = torch.tensor(self.data[idx + 1:idx + self.seq_length + 1], dtype=torch.long)
        return x, y
    
    def decode(self, indices):
        return ''.join([self.idx_to_char[idx] for idx in indices])

class CharLSTM(nn.Module):
    """LSTM-based character language model."""
    
    def __init__(self, vocab_size: int, embed_dim: int = 64, 
                 hidden_dim: int = 128, num_layers: int = 2, dropout: float = 0.3):
        super().__init__()
        
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers,
                           dropout=dropout if num_layers > 1 else 0, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.output = nn.Linear(hidden_dim, vocab_size)
    
    def forward(self, x, hidden=None):
        embedded = self.embedding(x)
        lstm_out, hidden = self.lstm(embedded, hidden)
        output = self.output(self.dropout(lstm_out))
        return output, hidden
    
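    def init_hidden(self, batch_size):
        # Allocate on the model's own device instead of relying on a global
        param_device = next(self.parameters()).device
        h0 = torch.zeros(self.num_layers, batch_size, self.hidden_dim, device=param_device)
        c0 = torch.zeros(self.num_layers, batch_size, self.hidden_dim, device=param_device)
        return (h0, c0)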

# Create dataset and model
dataset = CharDataset(corpus_text, seq_length=50)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

model = CharLSTM(
    vocab_size=dataset.vocab_size,
    embed_dim=64,
    hidden_dim=128,
    num_layers=2
).to(device)

print(f"Model parameters: {count_parameters(model)['total']:,}")

# Test forward pass
sample_x, sample_y = next(iter(dataloader))
with torch.no_grad():
    output, _ = model(sample_x.to(device))
    print(f"Input shape: {sample_x.shape}, Output shape: {output.shape}")

## 4. Text Generation and Evaluation {#4-generation}

Let's train the model, sample text from it at several temperatures, and evaluate it with perplexity.
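
Two quantities do most of the work in this section: temperature-scaled sampling probabilities and perplexity (the exponential of the mean negative log-likelihood). Here is a minimal sketch of both in plain Python, kept separate from the PyTorch code. The logits and probabilities are made-up numbers, and `softmax_with_temperature` and `perplexity` are hypothetical helper names.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; T < 1 sharpens, T > 1 flattens."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def perplexity(token_probs):
    """exp of the mean negative log-probability of the observed tokens."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate characters
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")

# A model that gives each observed token probability 0.25 is as uncertain
# as a uniform choice among four options, so its perplexity is about 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Lower perplexity means the model was, on average, less surprised by the text it was scored on.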

In [None]:
def train_char_model(model, dataloader, epochs=3, lr=0.002):
    """Train character-level language model."""
    optimizer = optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    
    losses = []
    model.train()
    
    for epoch in range(epochs):
        epoch_loss = 0
        
        for batch_idx, (data, target) in enumerate(dataloader):
            data, target = data.to(device), target.to(device)
            
            optimizer.zero_grad()
            output, _ = model(data)
            
            # Flatten (batch, seq, vocab) to (batch*seq, vocab) for the loss;
            # take the vocab size from the output rather than a global dataset
            loss = criterion(output.reshape(-1, output.size(-1)), target.reshape(-1))
            loss.backward()
            
            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            
            epoch_loss += loss.item()
            
            if batch_idx % 10 == 0:
                print(f'Epoch {epoch+1}/{epochs}, Batch {batch_idx}, Loss: {loss.item():.4f}')
        
        avg_loss = epoch_loss / len(dataloader)
        losses.append(avg_loss)
        print(f'Epoch {epoch+1} Average Loss: {avg_loss:.4f}')
    
    return losses

def generate_text(model, dataset, prompt="", length=100, temperature=1.0):
    """Generate text using trained model."""
    model.eval()
    
    if prompt:
        # Characters not in the vocabulary fall back to index 0
        input_seq = [dataset.char_to_idx.get(ch, 0) for ch in prompt]
    else:
        input_seq = [random.randint(0, dataset.vocab_size - 1)]
    
    generated = input_seq.copy()
    hidden = None
    # Feed the whole prompt once; after that the hidden state carries the
    # context, so only the newest character is fed at each step
    x = torch.tensor([input_seq], dtype=torch.long).to(device)
    
    with torch.no_grad():
        for _ in range(length):
            output, hidden = model(x, hidden)
            
            # Apply temperature and sample
            logits = output[0, -1, :] / temperature
            probs = F.softmax(logits, dim=0)
            next_char = torch.multinomial(probs, 1).item()
            
            generated.append(next_char)
            x = torch.tensor([[next_char]], dtype=torch.long).to(device)
    
    return dataset.decode(generated)

def calculate_perplexity(model, dataloader):
    """Calculate perplexity (exp of mean cross-entropy) over a dataloader."""
    model.eval()
    criterion = nn.CrossEntropyLoss()
    total_loss = 0
    total_tokens = 0
    
    with torch.no_grad():
        for data, target in dataloader:
            data, target = data.to(device), target.to(device)
            output, _ = model(data)
            
            loss = criterion(output.reshape(-1, output.size(-1)), target.reshape(-1))
            total_loss += loss.item() * target.numel()
            total_tokens += target.numel()
    
    avg_loss = total_loss / total_tokens
    perplexity = math.exp(avg_loss)
    return perplexity

# Train the model
print("Training character-level language model...")
training_losses = train_char_model(model, dataloader, epochs=3)

# Plot training loss
plt.figure(figsize=(8, 4))
plt.plot(training_losses, 'b-', linewidth=2)
plt.title('Training Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.grid(True, alpha=0.3)
plt.show()

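# Calculate perplexity. This uses the training data because the notebook has
# no held-out split, so the number is an optimistic estimate.
perplexity = calculate_perplexity(model, dataloader)
print(f"\nTraining-set perplexity: {perplexity:.2f}")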

# Generate text samples
print("\nGenerated text samples:")
prompts = ["machine learning", "neural network", "deep learning"]
temperatures = [0.5, 1.0, 1.5]

for prompt in prompts:
    for temp in temperatures:
        text = generate_text(model, dataset, prompt, 80, temp)
        print(f"Prompt: '{prompt}', Temp: {temp}")
        print(f"Generated: {text[:100]}...\n")

## 5. Practical Exercise {#5-exercise}

**Exercise**: Experiment with different model configurations and generation strategies.

### Tasks:
1. Try different LSTM hidden sizes (64, 256)
2. Experiment with generation temperatures (0.1, 2.0)
3. Compare character vs word-level modeling
4. Implement beam search for generation

### Questions:
1. How does temperature affect generation quality?
2. What are the trade-offs between character-level and word-level models?
3. How does model size impact perplexity and generation?

### Extension Ideas:
- Add attention mechanism to the LSTM
- Try GRU instead of LSTM
- Implement top-k and nucleus sampling
- Fine-tune on domain-specific text
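
As a starting point for the top-k and nucleus sampling extension, here is a minimal sketch that filters a probability list before sampling. It works on plain Python lists rather than model outputs; the names `top_k_filter`, `nucleus_filter` and `sample`, and the example distribution, are assumptions for illustration.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable token indices and renormalise."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def nucleus_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def sample(filtered, rng=random):
    """Draw one token index from a {index: probability} mapping."""
    indices = list(filtered)
    weights = [filtered[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]  # made-up next-character distribution
print(top_k_filter(probs, 2))      # only indices 0 and 1 survive
print(nucleus_filter(probs, 0.9))  # indices 0, 1 and 2 cover at least 0.9
print(sample(nucleus_filter(probs, 0.9)))
```

Both filters cut off the low-probability tail that tends to produce garbled output at high temperatures. Top-k keeps a fixed number of candidates, while nucleus sampling adapts the number to how peaked the distribution is.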

In [None]:
# Exercise: Try different configurations
print("Exercise: Experimenting with model configurations")

# 1. Different hidden sizes
hidden_sizes = [64, 256]
for hidden_size in hidden_sizes:
    print(f"\nTesting hidden_size = {hidden_size}")
    
    test_model = CharLSTM(
        vocab_size=dataset.vocab_size,
        hidden_dim=hidden_size,
        num_layers=1
    ).to(device)
    
    params = count_parameters(test_model)
    print(f"Parameters: {params['total']:,}")
    
    # Quick training
    train_char_model(test_model, dataloader, epochs=1)
    
    # Generate sample
    sample = generate_text(test_model, dataset, "neural", 50, 1.0)
    print(f"Sample: {sample[:80]}...")

# 2. Temperature experiments
print("\n\nTemperature effects on generation:")
temperatures = [0.1, 0.5, 1.0, 1.5, 2.0]

for temp in temperatures:
    text = generate_text(model, dataset, "machine", 60, temp)
    print(f"Temp {temp}: {text[:70]}...")

print("\n=== Language Modeling Complete ===")
print("Key Concepts Learned:")
print("• N-gram vs neural language models")
print("• Character-level LSTM implementation")
print("• Text generation strategies")
print("• Perplexity evaluation")
print("• Temperature effects on sampling")