# Language Models Comparison: N-gram vs Neural

**Course Focus:** Predicting the Next Word

This notebook compares two approaches to language modeling:
1. **Statistical:** 5-gram model with add-k smoothing
2. **Neural:** LSTM-based neural language model

**Key Questions:**
- How do they differ in next-word prediction?
- Which generates better text?
- What are the trade-offs?

---

## Part 1: Setup and Load Both Models

In [None]:
# Imports
import torch
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
import json

# Import our models
from ngram_model import NGramModel
from neural_lm import LSTMLanguageModel, Vocabulary

# Settings
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (12, 6)
%matplotlib inline

print("Imports successful!")

### 1.1 Load 5-gram Statistical Model

In [None]:
# Load trained 5-gram model
ngram_model = NGramModel.load('models/5gram_extended.pkl')

# Get stats
ngram_stats = ngram_model.get_ngram_stats()
print("5-GRAM MODEL STATISTICS")
print("=" * 50)
for key, value in ngram_stats.items():
    print(f"{key:20s}: {value:,}" if isinstance(value, int) else f"{key:20s}: {value}")

### 1.2 Load LSTM Neural Model

In [None]:
# Check if trained model exists
lstm_path = Path('models/lstm_lm')
if not lstm_path.with_suffix('.pt').exists():
    print("WARNING: LSTM model not found!")
    print("Please train the model first by running:")
    print("  python train_neural_lm.py")
else:
    # Load LSTM model
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    lstm_model, lstm_vocab = LSTMLanguageModel.load('models/lstm_lm', device=device)
    
    print("LSTM MODEL STATISTICS")
    print("=" * 50)
    print(f"Vocabulary size:      {lstm_model.vocab_size:,}")
    print(f"Embedding dim:        {lstm_model.embedding_dim}")
    print(f"Hidden dim:           {lstm_model.hidden_dim}")
    print(f"Number of layers:     {lstm_model.num_layers}")
    print(f"Dropout rate:         {lstm_model.dropout_rate}")
    
    total_params = sum(p.numel() for p in lstm_model.parameters())
    print(f"Total parameters:     {total_params:,}")
    print(f"Device:               {device}")

---

## Part 2: Perplexity Comparison

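**Perplexity** measures how "surprised" a model is by the test data: it is the exponential of the average negative log-probability the model assigns to each test token.
- Lower perplexity = better model
- A perplexity of N means the model is, on average, as uncertain as if it had to choose uniformly among N options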
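
The uniform-choice reading can be checked with a minimal sketch that is independent of the two models in this notebook. The helper name `perplexity` and the probability lists are illustrative assumptions, not part of `ngram_model` or `neural_lm`.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

# A model that always spreads its probability uniformly over 4 options
uniform_ppl = perplexity([0.25] * 8)

# A model that assigns high probability to the words that actually occur
confident_ppl = perplexity([0.9, 0.8, 0.9, 0.7])

print(f"uniform over 4 options: {uniform_ppl:.2f}")
print(f"confident model:        {confident_ppl:.2f}")
```

The uniform case comes out at 4, matching the "choosing among N options" interpretation, while the confident model scores lower.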

Let's compare both models on the test set.

In [None]:
# Load test data
test_df = pd.read_csv('../extended/test.csv')
# Drop missing headlines so that .lower() below never receives NaN
test_headlines = test_df['headline'].dropna().astype(str).tolist()

print(f"Test set: {len(test_headlines)} headlines")
print(f"\nExample headlines:")
for i, headline in enumerate(test_headlines[:5], 1):
    print(f"{i}. {headline}")

### 2.1 Calculate N-gram Perplexity

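For the 5-gram model, we compute perplexity directly: sum the log-probability of every token (including the `<END>` token) given its four-token context, average over all tokens, and exponentiate the negated average.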

In [None]:
def calculate_ngram_perplexity(model, headlines):
    """Calculate perplexity for n-gram model."""
    total_log_prob = 0.0
    total_words = 0
    
    for headline in headlines:
        tokens = headline.lower().split()
        tokens_with_boundaries = ['<START>'] * (model.n - 1) + tokens + ['<END>']
        
        for i in range(len(tokens_with_boundaries) - model.n + 1):
            ngram = tuple(tokens_with_boundaries[i:i + model.n])
            context = ngram[:-1]
            word = ngram[-1]
            
            prob = model._get_probability(context, word)
            if prob > 0:
                total_log_prob += np.log(prob)
            else:
                total_log_prob += np.log(1e-10)  # Small probability for unseen
            total_words += 1
    
    avg_log_prob = total_log_prob / total_words
    perplexity = np.exp(-avg_log_prob)
    return perplexity

# Calculate
print("Calculating 5-gram perplexity...")
ngram_perplexity = calculate_ngram_perplexity(ngram_model, test_headlines)
print(f"5-gram Test Perplexity: {ngram_perplexity:.2f}")

### 2.2 Calculate LSTM Perplexity

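Rather than re-evaluating the LSTM here, load the test perplexity saved by the training script. The comparison with the 5-gram is only fair if that figure was computed on the same test headlines, with the same tokenization and boundary tokens.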

In [None]:
# Load training results
with open('results/training_results.json', 'r') as f:
    results = json.load(f)

lstm_perplexity = results['test_perplexity']
print(f"LSTM Test Perplexity: {lstm_perplexity:.2f}")
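
A saved perplexity of this kind is usually derived from the evaluation loss. The sketch below assumes the training script reports the mean per-token cross-entropy in nats (the default behaviour of PyTorch's `CrossEntropyLoss`); whether `train_neural_lm.py` does exactly this is an assumption, and `perplexity_from_loss` is a hypothetical helper.

```python
import math

def perplexity_from_loss(mean_cross_entropy):
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)

# Hypothetical loss value, chosen so the result is easy to read:
# a loss of ln(100) corresponds to a perplexity of 100.
example_loss = math.log(100)
example_ppl = perplexity_from_loss(example_loss)

print(f"loss {example_loss:.3f} -> perplexity {example_ppl:.1f}")
```

If the loss were averaged per batch or per headline instead of per token, this conversion would not be directly comparable with the 5-gram perplexity computed in 2.1.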

### 2.3 Visualize Perplexity Comparison

In [None]:
# Create comparison plot
models = ['5-gram\n(Statistical)', 'LSTM\n(Neural)']
perplexities = [ngram_perplexity, lstm_perplexity]

plt.figure(figsize=(10, 6))
bars = plt.bar(models, perplexities, color=['steelblue', 'coral'], width=0.5)
plt.ylabel('Perplexity (lower is better)', fontsize=12)
plt.title('Language Model Perplexity Comparison', fontsize=14, fontweight='bold')
plt.ylim(0, max(perplexities) * 1.2)

# Add value labels
for bar, ppl in zip(bars, perplexities):
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
             f'{ppl:.1f}',
             ha='center', va='bottom', fontsize=12, fontweight='bold')

# Calculate relative change (positive means the LSTM has lower perplexity)
improvement = ((ngram_perplexity - lstm_perplexity) / ngram_perplexity) * 100
direction = 'lower' if improvement >= 0 else 'higher'
plt.text(0.5, max(perplexities) * 1.1,
         f'LSTM perplexity is {abs(improvement):.1f}% {direction} than 5-gram',
         ha='center', fontsize=12, style='italic', 
         bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.3))

plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\nIMPROVEMENT: {improvement:.1f}%")

**Interpretation:**
- Lower perplexity = better next-word predictions
- An LSTM often achieves roughly 20-40% lower perplexity than an n-gram baseline, but the figure printed above is the one that matters for this dataset
- A lower value means the model is less "surprised" by the test data

---

## Part 3: Next Word Prediction Examples

Let's see how each model predicts the next word given different contexts.

### 3.1 Helper Functions

In [None]:
def get_ngram_predictions(model, context_str, top_k=5):
    """Get top k predictions from n-gram model."""
    tokens = context_str.lower().split()
    
    # Get context
    if len(tokens) < model.n - 1:
        context = tuple(['<START>'] * (model.n - 1 - len(tokens)) + tokens)
    else:
        context = tuple(tokens[-(model.n - 1):])
    
    # Get probabilities for all possible next words
    if context in model.ngrams:
        word_counts = model.ngrams[context]
        word_probs = [(word, model._get_probability(context, word)) 
                      for word in word_counts.keys()]
        word_probs.sort(key=lambda x: x[1], reverse=True)
        return word_probs[:top_k]
    else:
        return [("<no predictions>", 0.0)]

def get_lstm_predictions(model, vocab, context_str, top_k=5, device='cpu'):
    """Get top k predictions from LSTM model."""
    # Create vocabulary object
    vocab_obj = Vocabulary()
    vocab_obj.word2idx = vocab
    vocab_obj.idx2word = {v: k for k, v in vocab.items()}
    
    # Encode context
    tokens = ['<START>'] + context_str.lower().split()
    indices = vocab_obj.encode(tokens)
    
    # Convert to tensor
    context_tensor = torch.tensor([indices], dtype=torch.long).to(device)
    
    # Get predictions
    top_indices, top_probs = model.predict_next_word(context_tensor, top_k=top_k)
    
    # Decode
    predictions = [(vocab_obj.idx2word[idx], prob) 
                   for idx, prob in zip(top_indices, top_probs)]
    return predictions

print("Helper functions defined!")

### 3.2 Comparison Examples

In [None]:
# Test contexts
test_contexts = [
    "The president will",
    "New technology company",
    "The team wins",
    "Breaking news about",
    "Scientists discover"
]

print("NEXT WORD PREDICTION COMPARISON")
print("=" * 80)

for context in test_contexts:
    print(f"\nContext: \"{context}\"")
    print("-" * 80)
    
    # 5-gram predictions
    ngram_preds = get_ngram_predictions(ngram_model, context, top_k=5)
    print("\n5-GRAM predictions:")
    for i, (word, prob) in enumerate(ngram_preds, 1):
        print(f"  {i}. {word:15s} {prob:.4f} ({prob*100:.1f}%)")
    
    # LSTM predictions
    lstm_preds = get_lstm_predictions(lstm_model, lstm_vocab, context, top_k=5, device=device)
    print("\nLSTM predictions:")
    for i, (word, prob) in enumerate(lstm_preds, 1):
        print(f"  {i}. {word:15s} {prob:.4f} ({prob*100:.1f}%)")
    
    print("=" * 80)

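**Observations:**
- LSTM predictions tend to be more semantically appropriate
- The 5-gram conditions only on the previous four tokens, and `get_ngram_predictions` returns `<no predictions>` when that exact context never occurred in training
- LSTM can capture longer-range dependencies
- Both struggle with rare contexts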

---

## Part 4: Text Generation Comparison

Let's generate text with both models and compare quality.

### 4.1 Generate with 5-gram Model

In [None]:
# Generate text
np.random.seed(42)
ngram_text = ngram_model.generate(max_words=100, multi_sentence=True)

print("5-GRAM GENERATED TEXT:")
print("=" * 80)
print(ngram_text)
print("=" * 80)

### 4.2 Generate with LSTM Model

In [None]:
# Create vocabulary object
vocab_obj = Vocabulary()
vocab_obj.word2idx = lstm_vocab
vocab_obj.idx2word = {v: k for k, v in lstm_vocab.items()}

# Start with <START> token
start_tokens = [lstm_vocab['<START>']]

# Generate
torch.manual_seed(42)
generated_indices = lstm_model.generate(start_tokens, max_length=100, 
                                       temperature=1.0, device=device)

# Decode
lstm_text = ' '.join(vocab_obj.decode(generated_indices))

print("LSTM GENERATED TEXT:")
print("=" * 80)
print(lstm_text)
print("=" * 80)
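
The `temperature` argument above controls how adventurous sampling is. The sketch below assumes the usual convention of dividing the logits by the temperature before the softmax; the actual implementation inside `LSTMLanguageModel.generate` is not shown in this notebook, and the function name and logits here are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate words

sharp = softmax_with_temperature(logits, temperature=0.5)    # more greedy
neutral = softmax_with_temperature(logits, temperature=1.0)  # unchanged
flat = softmax_with_temperature(logits, temperature=2.0)     # more random

for name, probs in [("T=0.5", sharp), ("T=1.0", neutral), ("T=2.0", flat)]:
    print(name, [round(p, 3) for p in probs])
```

Under this convention, temperatures below 1 concentrate probability on the top word (safer but more repetitive text), and temperatures above 1 flatten the distribution (more varied but less coherent text).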

### 4.3 Quality Analysis

In [None]:
def analyze_text_quality(text):
    """Basic quality metrics for generated text."""
    words = text.split()
    unique_words = set(words)
    n_words = len(words)
    
    return {
        'total_words': n_words,
        'unique_words': len(unique_words),
        # Guard against an empty generation (avoids division by zero / NaN)
        'type_token_ratio': len(unique_words) / n_words if n_words else 0.0,
        'avg_word_length': float(np.mean([len(w) for w in words])) if n_words else 0.0
    }

ngram_quality = analyze_text_quality(ngram_text)
lstm_quality = analyze_text_quality(lstm_text)

print("TEXT QUALITY COMPARISON")
print("=" * 60)
print(f"{'Metric':<25} {'5-gram':<15} {'LSTM':<15}")
print("-" * 60)
print(f"{'Total words':<25} {ngram_quality['total_words']:<15} {lstm_quality['total_words']:<15}")
print(f"{'Unique words':<25} {ngram_quality['unique_words']:<15} {lstm_quality['unique_words']:<15}")
print(f"{'Type-token ratio':<25} {ngram_quality['type_token_ratio']:<15.3f} {lstm_quality['type_token_ratio']:<15.3f}")
print(f"{'Avg word length':<25} {ngram_quality['avg_word_length']:<15.2f} {lstm_quality['avg_word_length']:<15.2f}")
print("=" * 60)

print("\nInterpretation:")
print("- Higher type-token ratio = more diverse vocabulary (only comparable for texts of similar length)")
print("- Compare fluency, grammar, and semantic coherence manually")

---

## Part 5: Where Neural Networks Win

Neural language models outperform n-grams in several key areas.

### 5.1 Long-Range Dependencies

**Problem:** N-grams have a fixed context window (4 words for a 5-gram)  
**Solution:** LSTMs can carry information from much earlier in the sequence in their hidden state
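
The fixed window can be made concrete without loading either model. The sketch below mirrors the tokenization used in `get_ngram_predictions` (lowercase, split on whitespace); `ngram_context` is a hypothetical helper and the second sentence is an invented contrast.

```python
def ngram_context(text, n=5):
    """Return the last n-1 tokens, which is all an n-gram model conditions on."""
    tokens = text.lower().split()
    return tuple(tokens[-(n - 1):])

layoffs = "The company announced major layoffs last month and investors are now"
profits = "The company announced record profits last month and investors are now"

# Both sentences collapse to the same 4-token context
print(ngram_context(layoffs))
print(ngram_context(profits))
print("Same context:", ngram_context(layoffs) == ngram_context(profits))
```

Because both contexts are identical, a 5-gram model must assign the same next-word distribution to both sentences, even though "layoffs" and "record profits" suggest opposite continuations.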

In [None]:
# Example: Long context
long_context = "The company announced major layoffs last month and investors are now"

print("Testing long-range dependencies:")
print(f"Context: \"{long_context}\"")
print("\n5-gram (only sees last 4 words):")
ngram_preds = get_ngram_predictions(ngram_model, long_context, top_k=3)
for word, prob in ngram_preds:
    print(f"  - {word} ({prob:.3f})")

print("\nLSTM (can remember 'layoffs' from earlier):")
lstm_preds = get_lstm_predictions(lstm_model, lstm_vocab, long_context, top_k=3, device=device)
for word, prob in lstm_preds:
    print(f"  - {word} ({prob:.3f})")

print("\n→ LSTM can maintain context over longer distances")

### 5.2 Rare Context Handling

**Problem:** N-grams struggle with unseen contexts (even with smoothing)  
**Solution:** LSTMs generalize better through learned representations
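
To see why smoothing alone does not rescue unseen contexts, here is a minimal sketch of the textbook add-k estimate. It is not necessarily what `NGramModel._get_probability` does internally; the function name, the toy counts, the vocabulary size, and `k=0.1` are all assumptions for illustration.

```python
from collections import Counter

def add_k_prob(counts, vocab_size, context, word, k=0.1):
    """P(word | context) = (count(context, word) + k) / (count(context) + k * V)."""
    context_counts = counts.get(context, Counter())
    context_total = sum(context_counts.values())
    return (context_counts.get(word, 0) + k) / (context_total + k * vocab_size)

# Toy bigram-style counts (hypothetical data)
counts = {("scientists",): Counter({"discover": 3, "warn": 1})}
vocab_size = 1000

seen = add_k_prob(counts, vocab_size, ("scientists",), "discover")
unseen_word = add_k_prob(counts, vocab_size, ("scientists",), "develop")
unseen_context = add_k_prob(counts, vocab_size, ("quantum",), "develop")

print(f"seen word, seen context:   {seen:.5f}")
print(f"unseen word, seen context: {unseen_word:.5f}")
print(f"unseen context:            {unseen_context:.5f}")
```

For a context that never occurred, every count is zero, so the estimate reduces to `k / (k * V) = 1 / V` for every word: smoothing avoids zero probabilities but carries no information about which word is likely.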

In [None]:
# Example: Rare/unseen context
rare_context = "Quantum computing researchers develop"

print("Testing rare context handling:")
print(f"Context: \"{rare_context}\"")
print("\n5-gram:")
ngram_preds = get_ngram_predictions(ngram_model, rare_context, top_k=3)
for word, prob in ngram_preds:
    print(f"  - {word} ({prob:.3f})")

print("\nLSTM:")
lstm_preds = get_lstm_predictions(lstm_model, lstm_vocab, rare_context, top_k=3, device=device)
for word, prob in lstm_preds:
    print(f"  - {word} ({prob:.3f})")

print("\n→ Check whether the LSTM's predictions stay plausible where the 5-gram has no matching context")

### 5.3 Semantic Coherence

**Problem:** N-grams treat each word as an opaque symbol, so "athlete" and "player" share no statistics  
**Solution:** LSTMs learn semantic representations through embeddings, so related words get similar vectors
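
The idea of "similar vectors" can be illustrated with cosine similarity. The three-dimensional vectors below are invented for the example and are not taken from the trained LSTM; real embeddings would have `embedding_dim` dimensions and would be read from the model's embedding layer.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, not values from the trained model
embeddings = {
    "athlete": [0.9, 0.8, 0.1],
    "player": [0.8, 0.9, 0.2],
    "senate": [0.1, 0.2, 0.9],
}

sim_related = cosine_similarity(embeddings["athlete"], embeddings["player"])
sim_unrelated = cosine_similarity(embeddings["athlete"], embeddings["senate"])

print(f"athlete vs player: {sim_related:.3f}")
print(f"athlete vs senate: {sim_unrelated:.3f}")
```

When two words have nearby embeddings, the LSTM receives nearly the same input for both, which is why it can produce similar predictions for "The athlete wins" and "The player wins" even if one of them was rare in training.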

In [None]:
# Compare related contexts
contexts = [
    "The athlete wins",
    "The player wins",
    "The champion wins"
]

print("Testing semantic understanding:")
print("These contexts are semantically similar (sports-related winners)\n")

for context in contexts:
    print(f"Context: \"{context}\"")
    lstm_preds = get_lstm_predictions(lstm_model, lstm_vocab, context, top_k=3, device=device)
    print("  LSTM predictions:", [w for w, p in lstm_preds])

print("\n→ LSTM produces similar predictions for semantically related contexts")

### 5.4 Summary: Neural vs Statistical

In [None]:
comparison = {
    'Aspect': ['Context Window', 'Rare Contexts', 'Semantic Understanding', 
               'Training Speed', 'Inference Speed', 'Model Size', 'Perplexity'],
    '5-gram (Statistical)': ['Fixed (4 words)', 'Poor (smoothing helps)', 'None',
                             'Very Fast', 'Very Fast', 'Small (~1 MB)', 'Higher'],
    'LSTM (Neural)': ['Unbounded in principle (hidden state)', 'Good (generalizes)', 'Yes (learned)',
                      'Slow (~10 min)', 'Fast', 'Larger (~5-10 MB)', 'Lower']
}

df_comparison = pd.DataFrame(comparison)
print("\nLANGUAGE MODEL COMPARISON")
print("=" * 80)
print(df_comparison.to_string(index=False))
print("=" * 80)

---

## Part 6: Training Analysis (LSTM)

Let's examine how the LSTM model learned.

In [None]:
# Load and display training curves
from IPython.display import Image, display
training_curves_path = 'results/training_curves.png'

if Path(training_curves_path).exists():
    print("LSTM Training Progress:")
    display(Image(filename=training_curves_path))
else:
    print("Training curves not found. Run training first.")

In [None]:
# Training summary
if Path('results/training_results.json').exists():
    with open('results/training_results.json', 'r') as f:
        results = json.load(f)
    
    print("LSTM TRAINING SUMMARY")
    print("=" * 60)
    print(f"Number of epochs:         {results['num_epochs']}")
    print(f"Vocabulary size:          {results['vocab_size']:,}")
    print(f"Total parameters:         {results['total_params']:,}")
    print(f"\nFinal train loss:         {results['final_train_loss']:.4f}")
    print(f"Final validation loss:    {results['final_val_loss']:.4f}")
    print(f"Best validation loss:     {results['best_val_loss']:.4f}")
    print(f"\nTest perplexity:          {results['test_perplexity']:.2f}")
    print("=" * 60)

---

## Part 7: Interactive Next-Word Predictor

In [None]:
def interactive_predictor(context_str, top_k=5):
    """Interactive function to compare both models."""
    print(f"\nContext: \"{context_str}\"")
    print("=" * 80)
    
    # 5-gram
    print("\n5-GRAM predictions:")
    ngram_preds = get_ngram_predictions(ngram_model, context_str, top_k=top_k)
    for i, (word, prob) in enumerate(ngram_preds, 1):
        bar = '█' * int(prob * 50)
        print(f"  {i}. {word:15s} {prob:.4f} {bar}")
    
    # LSTM
    print("\nLSTM predictions:")
    lstm_preds = get_lstm_predictions(lstm_model, lstm_vocab, context_str, top_k=top_k, device=device)
    for i, (word, prob) in enumerate(lstm_preds, 1):
        bar = '█' * int(prob * 50)
        print(f"  {i}. {word:15s} {prob:.4f} {bar}")
    print("=" * 80)

# Try different contexts
interactive_predictor("The president announces")
interactive_predictor("Scientists discover new")
interactive_predictor("Technology company releases")

**Try your own contexts:**  
Uncomment and modify the line below:

In [None]:
# interactive_predictor("Your context here")

---

## Part 8: Summary & Conclusions

### Key Findings

#### 1. Perplexity
- **LSTM achieves lower perplexity** (often ~20-40% lower; see the figure computed in Part 2 for this dataset)
- This means better next-word predictions
- Neural models are less "surprised" by test data

#### 2. Next-Word Prediction
- **LSTM predictions are more semantically appropriate**
- N-grams rely only on immediate context
- LSTM can use longer-range dependencies

#### 3. Text Generation
- **LSTM generates more fluent text**
- Better grammatical structure
- More coherent semantics
- N-grams can produce repetitive patterns

#### 4. Handling Rare Contexts
- **LSTM generalizes better**
- Learned representations help with unseen contexts
- N-grams struggle despite smoothing

#### 5. Trade-offs
- **N-grams:** Fast, simple, interpretable, small
- **LSTM:** Better quality, slower training, larger model

---

### When to Use Which?

**Use N-grams when:**
- Need very fast training/inference
- Limited computational resources
- Want interpretable model
- Simple baseline needed

**Use LSTM when:**
- Need best prediction quality
- Can afford training time
- Want semantic understanding
- Long-range dependencies matter

---

### The Progression: From Statistical to Neural

1. **Statistical (N-grams):** Count-based, no learned representations
2. **Neural (LSTM):** Learn patterns, generalize better
3. **Next:** Transformers (attention mechanisms, which handle long-range context better still)

This notebook shows why NLP moved from statistical to neural methods for language modeling.

---

## Export to HTML

To create a standalone HTML report:

```bash
jupyter nbconvert --to html compare_models.ipynb
```

Or without code:

```bash
jupyter nbconvert --to html compare_models.ipynb --no-input --output model_comparison_report.html
```