# 5-Gram Language Model: Training and Text Generation

This notebook trains a 5-gram language model on the extended headlines dataset (10,000 headlines) and uses it to generate text.

## What is an N-Gram Language Model?

An n-gram model predicts the next word from the previous (n-1) words. Without smoothing, the estimate is a ratio of counts:

**P(word | context) = Count(context, word) / Count(context)**

For a 5-gram model:
- Context = 4 previous words
- Model predicts the 5th word
- Example: Given `president announces new reform`, predict the next word
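
The count ratio above can be sketched in a few lines. This is a minimal illustration on a toy corpus, not the `NGramModel` implementation: the helper names `count_ngrams` and `mle_probability` are hypothetical, and padding each headline with `<start>` tokens is an assumption about how contexts at the beginning of a headline are formed.

```python
from collections import Counter, defaultdict

def count_ngrams(sentences, n=5, pad="<start>"):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Assumed convention: pad the front so the first word has a full context
        tokens = [pad] * (n - 1) + sentence.lower().split()
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            counts[context][tokens[i]] += 1
    return counts

def mle_probability(counts, context, word):
    """Count(context, word) / Count(context); 0.0 if the context was never seen."""
    followers = counts.get(tuple(context))
    if not followers:
        return 0.0
    return followers[word] / sum(followers.values())

toy = [
    "president announces new reform plan today",
    "president announces new reform bill today",
    "president announces new reform plan tomorrow",
]
counts = count_ngrams(toy, n=5)
context = ("president", "announces", "new", "reform")
p_plan = mle_probability(counts, context, "plan")
p_bill = mle_probability(counts, context, "bill")
print(f"P(plan | context) = {p_plan:.3f}")
print(f"P(bill | context) = {p_bill:.3f}")
```

In the toy corpus the context is followed by `plan` twice and `bill` once, so the two estimates are 2/3 and 1/3. Any word never seen after the context gets probability 0, which is the gap that smoothing fills.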

## Key Concepts:
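1. **Smoothing**: Add-k smoothing adds a small constant k (0.01 in this notebook) to every count, so unseen n-grams get a small nonzero probability
2. **Backoff**: If no matching 5-gram is found, fall back to the 4-gram, then the 3-gram, and so on
3. **Generation**: Sample each next word from the conditional distribution, then slide the context window forward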
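
As a rough sketch of the first two concepts, the code below applies the textbook add-k formula, (Count(context, word) + k) / (Count(context) + k * V), and a simple "longest seen suffix" backoff. The function names and the `tables` layout are assumptions made for illustration; the actual `NGramModel` may combine smoothing and backoff differently.

```python
from collections import Counter

def add_k_probability(followers, word, vocab_size, k=0.01):
    """Add-k estimate: (Count(context, word) + k) / (Count(context) + k * V)."""
    total = sum(followers.values())
    return (followers[word] + k) / (total + k * vocab_size)

def backoff_followers(tables, context):
    """Return (order, follower counts) for the longest suffix of `context` seen in training.

    Hypothetical layout: tables[m] maps an m-word context tuple to a Counter of
    next words, and tables[0] holds the unigram counts under the empty tuple.
    """
    context = tuple(context)
    for m in range(len(context), -1, -1):
        suffix = context[len(context) - m:]
        followers = tables.get(m, {}).get(suffix)
        if followers:
            return m, followers
    return 0, Counter()

# Toy tables with a three-word vocabulary
tables = {
    0: {(): Counter({"plan": 3, "bill": 1, "today": 2})},
    1: {("reform",): Counter({"plan": 2, "bill": 1})},
}
vocab = ["plan", "bill", "today"]

# ("new", "reform") was never seen as a 2-word context, so we back off to ("reform",)
order, followers = backoff_followers(tables, ("new", "reform"))
p_plan = add_k_probability(followers, "plan", vocab_size=len(vocab), k=0.01)
p_today = add_k_probability(followers, "today", vocab_size=len(vocab), k=0.01)
total = sum(add_k_probability(followers, w, len(vocab), 0.01) for w in vocab)
print(f"Backed off to a {order}-word context")
print(f"P(plan)  = {p_plan:.4f}")
print(f"P(today) = {p_today:.4f}  (unseen after this context, but nonzero)")
```

With add-k, the smoothed probabilities over the whole vocabulary still sum to 1, which is a quick sanity check on any implementation.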

## 1. Setup and Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
import sys
sys.path.append('.')

from ngram_model import NGramModel

plt.style.use('ggplot')
%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 6)

## 2. Load Dataset

In [None]:
df = pd.read_csv('../extended/news_headlines_extended.csv')

print("Dataset loaded successfully!")
print(f"Total headlines: {len(df):,}")
print("\nCategory distribution:")
print(df['category'].value_counts())
print("\nFirst 5 headlines:")
print(df[['headline', 'category']].head())

In [None]:
headlines = df['headline'].tolist()

print("Sample headlines:")
for i, headline in enumerate(headlines[:10], 1):
    print(f"{i:2d}. {headline}")

## 3. Train 5-Gram Model

In [None]:
print("Training 5-gram model...\n")

model = NGramModel(n=5, smoothing_k=0.01)
model.train(headlines, verbose=True)

## 4. Model Statistics

In [None]:
stats = model.get_ngram_stats()

print("Model Statistics:")
print("=" * 60)
for key, value in stats.items():
    print(f"{key:20s}: {value:,}" if isinstance(value, int) else f"{key:20s}: {value}")

## 5. Top N-Grams Analysis

In [None]:
top_ngrams = model.get_top_ngrams(k=30)

print("Top 30 Most Frequent 5-Grams:")
print("=" * 80)
print(f"{'Rank':<6} {'Context (4 words)':<50} {'Next Word':<15} {'Count':<6}")
print("=" * 80)

for i, (context, word, count) in enumerate(top_ngrams, 1):
    context_str = ' '.join(context)
    print(f"{i:<6} {context_str:<50} {word:<15} {count:<6}")

## 6. Visualization: N-Gram Frequency Distribution

In [None]:
ngram_counts = [count for _, _, count in top_ngrams]
ranks = list(range(1, len(ngram_counts) + 1))

fig, axes = plt.subplots(1, 2, figsize=(15, 5))

axes[0].bar(ranks, ngram_counts, color='steelblue', alpha=0.7)
axes[0].set_xlabel('Rank', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Top 30 5-Gram Frequencies', fontsize=14, fontweight='bold')
axes[0].grid(axis='y', alpha=0.3)

axes[1].loglog(ranks, ngram_counts, 'o-', markersize=6, color='steelblue', alpha=0.7)
axes[1].set_xlabel('Rank (log scale)', fontsize=12)
axes[1].set_ylabel('Frequency (log scale)', fontsize=12)
axes[1].set_title('5-Gram Frequency (Log-Log)', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3, which='both')

plt.tight_layout()
plt.show()
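
A log-log plot like the one above is usually read by its slope: a roughly straight line suggests a power-law relationship between rank and frequency. The sketch below estimates that slope with an ordinary least-squares fit. The helper `loglog_slope` is hypothetical and is shown on synthetic counts, not on this notebook's results; with only 30 points from templated headlines, any fitted value should be treated as descriptive at best.

```python
import math

def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank), with ranks starting at 1."""
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance

# Synthetic counts following count = 1000 / rank, so the slope should be close to -1
synthetic = [1000 / rank for rank in range(1, 31)]
slope = loglog_slope(synthetic)
print(f"Estimated log-log slope: {slope:.3f}")
```

To apply it here, one could pass `ngram_counts` from the cell above, assuming all counts are positive.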

## 7. Conditional Probability Examples

In [None]:
sample_contexts = [
    ('president', 'announces', 'new', 'reform'),
    ('team', 'wins', 'championship', 'after'),
    ('company', 'launches', 'new', 'device'),
    ('million', 'for', 'new', 'technology'),
    ('<start>', '<start>', '<start>', '<start>'),
]

print("Conditional Probability Examples:")
print("=" * 70)

for context in sample_contexts:
    print(f"\nContext: {' '.join(context)}")
    if context not in model.ngrams:
        print("  Context not found in model (will use backoff)")
        continue

    # Score every word observed after this context, highest probability first
    word_probs = [(word, model._get_probability(context, word))
                  for word in model.ngrams[context]]
    word_probs.sort(key=lambda x: x[1], reverse=True)

    print("  Top 5 most likely next words:")
    for word, prob in word_probs[:5]:
        print(f"    P({word:15s} | context) = {prob:.4f}")

## 8. Text Generation: Multiple Samples

In [None]:
print("Generating text samples...\n")
print("=" * 70)

for i in range(3):
    print(f"\n--- Sample {i+1} (50 words) ---\n")
    generated = model.generate(max_words=50)
    print(generated)
    print()
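
For intuition about what a call like `model.generate(max_words=50)` does, here is a minimal sampling loop over a count table. It is a simplified sketch, not the `NGramModel` code: the function names are hypothetical, the `<start>` padding is assumed, and it simply stops at an unseen context instead of applying smoothing or backoff.

```python
import random
from collections import Counter

def sample_next(followers, rng):
    """Draw one word with probability proportional to its count."""
    words = list(followers)
    weights = [followers[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]

def generate(counts, n, max_words, rng, pad="<start>"):
    """Generate up to max_words words from an n-gram count table (n >= 2)."""
    context = [pad] * (n - 1)
    output = []
    for _ in range(max_words):
        followers = counts.get(tuple(context))
        if not followers:
            break  # simplification: a full model would back off here
        word = sample_next(followers, rng)
        output.append(word)
        context = context[1:] + [word]  # slide the window forward by one word
    return " ".join(output)

# Toy trigram table (n=3, so each context holds 2 words)
toy_counts = {
    ("<start>", "<start>"): Counter({"team": 1}),
    ("<start>", "team"): Counter({"wins": 1}),
    ("team", "wins"): Counter({"championship": 2, "title": 1}),
}
text = generate(toy_counts, n=3, max_words=10, rng=random.Random(0))
print(text)
```

Passing a seeded `random.Random` makes a run repeatable, which helps when comparing samples across models.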

## 9. Generate Half a Page (~200 words)

In [None]:
print("Generating longer text (half a page, ~200 words)...\n")
print("=" * 70)

long_text = model.generate(max_words=200)
word_count = len(long_text.split())

print(f"[Generated {word_count} words]\n")
print(long_text)

## 10. Generate with Seed Text

In [None]:
seeds = [
    "president announces new reform",
    "team wins championship after",
    "company launches new device"
]

print("Text Generation with Seed Phrases:\n")
print("=" * 70)

for seed in seeds:
    print(f"\nSeed: '{seed}'\n")
    generated = model.generate(max_words=50, seed=seed)
    print(generated)
    print()

## 11. Comparison: Different N-Gram Sizes

In [None]:
print("Training models with different n-gram sizes...\n")

models = {}
for n in [3, 4, 5]:
    print(f"Training {n}-gram model...")
    m = NGramModel(n=n, smoothing_k=0.01)
    m.train(headlines, verbose=False)
    models[n] = m
    print(f"  Vocabulary: {m.vocab_size}, Contexts: {len(m.ngrams)}")

print("\n" + "=" * 70)
print("Comparison of Generated Text (50 words each)")
print("=" * 70)

for n in [3, 4, 5]:
    print(f"\n--- {n}-gram model ---\n")
    generated = models[n].generate(max_words=50)
    print(generated)
    print()
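
Reading generated samples side by side is subjective. One simple complementary check, sketched below under stated assumptions, is the fraction of n-grams in a generated text that also occur verbatim in the training headlines; a value near 1.0 suggests the model is mostly replaying training data. The helper names are hypothetical and the example uses toy strings, not output from the models above.

```python
def ngram_set(texts, n):
    """Collect every n-gram (as a tuple of lowercased tokens) that occurs in the texts."""
    seen = set()
    for text in texts:
        tokens = text.lower().split()
        seen.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return seen

def copied_fraction(generated, training_ngrams, n):
    """Share of n-grams in `generated` that also appear in the training set."""
    tokens = generated.lower().split()
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return sum(gram in training_ngrams for gram in grams) / len(grams)

training = ["team wins championship after long season"]
sample = "team wins championship after short break"

training_4grams = ngram_set(training, n=4)
fraction = copied_fraction(sample, training_4grams, n=4)
print(f"Fraction of 4-grams copied from training: {fraction:.2f}")
```

In the notebook, one could build `ngram_set(headlines, n)` once and score each model's output, keeping in mind that generated text spanning several headlines will contain n-grams that cross headline boundaries.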

## 12. Save the Model

In [None]:
import os

model_path = 'models/5gram_extended.pkl'
os.makedirs(os.path.dirname(model_path), exist_ok=True)

model.save(model_path)
print("\nModel successfully saved!")
print(f"You can load it later using: model = NGramModel.load('{model_path}')")
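
The `.pkl` extension suggests the model is serialized with Python's `pickle` module, although that is an assumption about `NGramModel.save`. The sketch below shows a save-and-reload round trip on a plain dictionary standing in for the model state; the `state` layout is hypothetical.

```python
import os
import pickle
import tempfile
from collections import Counter

# Hypothetical stand-in for the model's internal state
context = ("team", "wins", "championship", "after")
state = {
    "n": 5,
    "smoothing_k": 0.01,
    "ngrams": {context: Counter({"dramatic": 2, "long": 1})},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "models", "toy.pkl")
    os.makedirs(os.path.dirname(path), exist_ok=True)

    with open(path, "wb") as f:
        pickle.dump(state, f)

    with open(path, "rb") as f:
        restored = pickle.load(f)

print("Round trip preserved the state:", restored == state)
```

Pickle files can execute arbitrary code when loaded, so only load model files from sources you trust.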

## 13. Summary

### Key Findings:

1. **Model Characteristics**:
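   - The 5-gram model conditions on local context only (the 4 previous words)
   - Vocabulary size is determined by the 10,000-headline extended dataset (see the statistics in Section 4)
   - Add-k smoothing (k = 0.01) gives unseen n-grams a small nonzero probability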

2. **Text Generation Quality**:
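   - Higher n conditions on more context, which tends to produce more locally coherent text, at the cost of sparser counts and more verbatim reuse of training headlines
   - The 5-gram model tends to maintain better local structure than the 3-gram model
   - Generated text follows the pattern of news headlines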

3. **Limitations**:
   - Template-based training data creates repetitive patterns
   - No long-range dependencies (only local context)
   - May generate grammatically incorrect or semantically odd sequences

4. **Applications**:
   - Text completion and suggestion
   - Spelling correction
   - Baseline for more complex language models
   - Educational tool for understanding language modeling