# Statistical Language Models 

This notebook covers the fundamental concepts of **Statistical Language Models**, which form the foundation of modern NLP and LLMs.

Statistical Language Models assign probabilities to sequences of words, helping us understand:
- How likely a sentence is in a given language
- Which word should come next in a sequence
- How to evaluate model performance

### Key Concepts Covered

1. **Unigram Models** - Single word probabilities
2. **Bigram Models** - Word pair probabilities  
3. **Sentence Probability** - Computing likelihood of sequences
4. **Smoothing Techniques** - Handling unseen word combinations
5. **Perplexity** - Evaluating model quality

---

📌 **Sample Dataset Resources**
- Project Gutenberg texts: https://www.gutenberg.org/
- Example file: https://www.gutenberg.org/files/1342/1342-0.txt (Pride and Prejudice)

You can download a text file and place it in the same directory as this notebook.
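
A minimal sketch of reading such a file, assuming it was saved locally as `1342-0.txt` (the filename and the helper name `load_corpus` are illustrative, not part of the notebook's code). The `utf-8-sig` codec is used so that a leading byte-order mark, if present, is dropped.

```python
from pathlib import Path

def load_corpus(path, encoding="utf-8-sig"):
    """Read a plain-text file and return its contents as one string."""
    return Path(path).read_text(encoding=encoding)

# Hypothetical local filename; adjust to wherever the download was saved.
corpus_path = Path("1342-0.txt")
if corpus_path.exists():
    raw_text = load_corpus(corpus_path)
    print(f"Loaded {len(raw_text):,} characters")
else:
    raw_text = ""
    print(f"{corpus_path} not found; download it first")
```

The resulting string can be passed to the `preprocess` function defined in the next section in place of the short sample sentence.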

## 1. Load and Preprocess Text

Text preprocessing is the first critical step in building a language model. We need to:
- **Normalize** the text (convert to lowercase)
- **Remove** digits, punctuation, and other non-alphabetic characters
- **Tokenize** the text into individual words

This creates a clean, standardized corpus for analysis.

In [1]:
import re
from collections import Counter, defaultdict

def preprocess(text):
    """
    Preprocess text by:
    1. Converting to lowercase
    2. Removing non-alphabetic characters
    3. Splitting into tokens (words)
    """
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    return text.split()

# Sample text for demonstration
sample_text = "I love learning NLP and I love coding"
tokens = preprocess(sample_text)
print(f"Original text: {sample_text}")
print(f"Tokens: {tokens}")
print(f"Number of tokens: {len(tokens)}")

Original text: I love learning NLP and I love coding
Tokens: ['i', 'love', 'learning', 'nlp', 'and', 'i', 'love', 'coding']
Number of tokens: 8


## 2. Unigram Language Model

A **unigram model** treats each word independently and calculates the probability of a word based on its frequency in the corpus.

**Formula:**  
$$P(w) = \frac{\text{count}(w)}{\text{total words}}$$

This is the simplest language model where:
- Each word's probability is independent of context
- Probabilities sum to 1 across the vocabulary
- More frequent words have higher probabilities

In [2]:
# Count each word's occurrences
unigram_counts = Counter(tokens)
total_tokens = sum(unigram_counts.values())

# Calculate probability for each word
unigram_probs = {w: c / total_tokens for w, c in unigram_counts.items()}

print("Unigram Counts:")
for word, count in unigram_counts.items():
    print(f"  {word}: {count}")

print(f"\nTotal tokens: {total_tokens}")
print("\nUnigram Probabilities:")
for word, prob in unigram_probs.items():
    print(f"  P({word}) = {prob:.4f}")

# Verify probabilities sum to 1
print(f"\nSum of probabilities: {sum(unigram_probs.values()):.4f}")

Unigram Counts:
  i: 2
  love: 2
  learning: 1
  nlp: 1
  and: 1
  coding: 1

Total tokens: 8

Unigram Probabilities:
  P(i) = 0.2500
  P(love) = 0.2500
  P(learning) = 0.1250
  P(nlp) = 0.1250
  P(and) = 0.1250
  P(coding) = 0.1250

Sum of probabilities: 1.0000
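
To make "independent of context" concrete, the sketch below scores a sentence under the unigram model by multiplying per-word probabilities. The helper name `unigram_sentence_probability` is hypothetical, and the token list is copied from the sample output above. Because multiplication is order-independent, a shuffled sentence gets exactly the same score.

```python
from collections import Counter
from math import prod

tokens = ["i", "love", "learning", "nlp", "and", "i", "love", "coding"]
counts = Counter(tokens)
total = sum(counts.values())
unigram_probs = {w: c / total for w, c in counts.items()}

def unigram_sentence_probability(words, probs):
    """Product of per-word probabilities; unseen words contribute 0."""
    return prod(probs.get(w, 0.0) for w in words)

p_natural = unigram_sentence_probability(["i", "love", "coding"], unigram_probs)
p_shuffled = unigram_sentence_probability(["coding", "love", "i"], unigram_probs)

# Both orders give 0.25 * 0.25 * 0.125
print(p_natural, p_shuffled)
```

This is the main weakness that the bigram model in the next section addresses.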


## 3. Bigram Language Model

A **bigram model** considers word pairs (consecutive words) and calculates the probability of a word given the previous word.

**Formula:**  
$$P(w_2 | w_1) = \frac{\text{count}(w_1, w_2)}{\text{count}(w_1)}$$

This captures:
- **Context dependency** - the probability of a word depends on the previous word
- **Local patterns** - common word sequences in the language
- **Better predictions** than a unigram model, because word order now affects the probability

In [3]:
# Count bigram occurrences
bigrams = defaultdict(int)

for i in range(len(tokens) - 1):
    bigrams[(tokens[i], tokens[i+1])] += 1

print("Bigram Counts:")
for bigram, count in bigrams.items():
    print(f"  {bigram}: {count}")

# Calculate conditional probabilities P(w2|w1)
bigram_probs = {}
for (w1, w2), count in bigrams.items():
    bigram_probs[(w1, w2)] = count / unigram_counts[w1]

print("\nBigram Probabilities:")
for bigram, prob in bigram_probs.items():
    print(f"  P({bigram[1]} | {bigram[0]}) = {prob:.4f}")

Bigram Counts:
  ('i', 'love'): 2
  ('love', 'learning'): 1
  ('learning', 'nlp'): 1
  ('nlp', 'and'): 1
  ('and', 'i'): 1
  ('love', 'coding'): 1

Bigram Probabilities:
  P(love | i) = 1.0000
  P(learning | love) = 0.5000
  P(nlp | learning) = 1.0000
  P(and | nlp) = 1.0000
  P(i | and) = 1.0000
  P(coding | love) = 0.5000
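
The same counts can answer "which word comes next?". The sketch below regroups the bigram counts by their first word and returns the most likely followers. The names `next_counts` and `predict_next` are illustrative, and the token list is copied from the preprocessing output above.

```python
from collections import Counter, defaultdict

tokens = ["i", "love", "learning", "nlp", "and", "i", "love", "coding"]

# next_counts[w1][w2] = number of times w2 directly follows w1
next_counts = defaultdict(Counter)
for w1, w2 in zip(tokens, tokens[1:]):
    next_counts[w1][w2] += 1

def predict_next(word, k=3):
    """Return up to k (next_word, probability) pairs, most likely first."""
    followers = next_counts.get(word)
    if not followers:
        return []  # the word never appeared as a context
    total = sum(followers.values())
    return [(w, c / total) for w, c in followers.most_common(k)]

print(predict_next("i"))       # only "love" ever follows "i"
print(predict_next("love"))    # "learning" and "coding" are tied
print(predict_next("coding"))  # last token of the corpus: no followers
```

Note that `coding` has no followers because it is the final token, so this tiny corpus gives the model nothing to predict after it.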


## 4. Sentence Probability (Bigram Model)

To calculate the probability of an entire sentence, we apply the **chain rule** and then make the bigram (first-order Markov) assumption that each word depends only on the word immediately before it:

**Formula:**  
$$P(\text{sentence}) = P(w_1) \times P(w_2|w_1) \times P(w_3|w_2) \times ... \times P(w_n|w_{n-1})$$

This tells us:
- How likely a sentence is according to our language model
- Which sentences are more "natural" in the language
- Can be used for tasks like sentence ranking or text generation

In [4]:
def sentence_probability(sentence, unigram_probs, bigram_probs):
    """
    Calculate the probability of a sentence using bigram model.
    P(sentence) = P(w1) * P(w2|w1) * P(w3|w2) * ...
    """
    words = preprocess(sentence)
    if not words:
        return 0.0  # No tokens left after preprocessing

    # Start with first word probability
    prob = unigram_probs.get(words[0], 0)
    
    # Multiply by conditional probabilities
    for i in range(len(words) - 1):
        prob *= bigram_probs.get((words[i], words[i+1]), 0)
    
    return prob

# Test with different sentences
test_sentences = [
    "I love coding",
    "I love learning",
    "coding love I"  # Unnatural word order
]

print("Sentence Probabilities:")
for sent in test_sentences:
    prob = sentence_probability(sent, unigram_probs, bigram_probs)
    print(f"  '{sent}': {prob:.6f}")

Sentence Probabilities:
  'I love coding': 0.125000
  'I love learning': 0.125000
  'coding love I': 0.000000
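
Multiplying many probabilities quickly underflows on longer texts, so a common alternative is to add log-probabilities instead. The following is a sketch under a few assumptions: the helper name `sentence_log_probability` is hypothetical, the input is an already-tokenized list, and a zero factor is reported as `-inf` rather than raising an error.

```python
import math
from collections import Counter

tokens = ["i", "love", "learning", "nlp", "and", "i", "love", "coding"]
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_probs = {w: c / len(tokens) for w, c in unigram_counts.items()}
bigram_probs = {
    (w1, w2): c / unigram_counts[w1] for (w1, w2), c in bigram_counts.items()
}

def sentence_log_probability(words):
    """Sum of log P(w1) and log P(w_i | w_{i-1}); -inf if any factor is zero."""
    if not words:
        raise ValueError("need at least one token")
    factors = [unigram_probs.get(words[0], 0.0)]
    factors += [bigram_probs.get(pair, 0.0) for pair in zip(words, words[1:])]
    if any(p == 0.0 for p in factors):
        return float("-inf")
    return sum(math.log(p) for p in factors)

log_p = sentence_log_probability(["i", "love", "coding"])
print(f"log P = {log_p:.4f}, P = {math.exp(log_p):.6f}")
print(sentence_log_probability(["coding", "love", "i"]))
```

Exponentiating the first result recovers the 0.125 shown above, and the unnatural ordering maps to `-inf`, the log-space counterpart of probability 0.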


## 5. Laplace (Add-One) Smoothing

**The Problem:** What happens when we encounter a bigram we've never seen before? The probability is 0, which makes the entire sentence probability 0!

**The Solution:** **Smoothing** techniques move a small amount of probability mass from seen events to unseen ones.

### Laplace (Add-One) Smoothing

**Formula:**  
$$P_{\text{smoothed}}(w_2|w_1) = \frac{\text{count}(w_1, w_2) + 1}{\text{count}(w_1) + V}$$

Where $V$ is the vocabulary size.

This ensures:
- No zero probabilities for unseen bigrams
- All bigrams have at least a small probability
- Better generalization to new text

In [5]:
# Get vocabulary
vocab = set(tokens)
V = len(vocab)

print(f"Vocabulary size: {V}")
print(f"Vocabulary: {vocab}")

# Apply Laplace smoothing to all possible bigrams
smoothed_bigram_probs = {}
for w1 in vocab:
    for w2 in vocab:
        count = bigrams.get((w1, w2), 0)
        smoothed_bigram_probs[(w1, w2)] = (count + 1) / (unigram_counts[w1] + V)

# Compare original vs smoothed probabilities
print("\nComparison (Original vs Smoothed):")
sample_bigrams = [('i', 'love'), ('love', 'learning'), ('coding', 'nlp')]  # Last one is unseen

for bg in sample_bigrams:
    orig_prob = bigram_probs.get(bg, 0.0)
    smooth_prob = smoothed_bigram_probs.get(bg, 0.0)
    print(f"  {bg}:")
    print(f"    Original: {orig_prob:.6f}")
    print(f"    Smoothed: {smooth_prob:.6f}")

Vocabulary size: 6
Vocabulary: {'nlp', 'i', 'and', 'learning', 'coding', 'love'}

Comparison (Original vs Smoothed):
  ('i', 'love'):
    Original: 1.000000
    Smoothed: 0.375000
  ('love', 'learning'):
    Original: 0.500000
    Smoothed: 0.250000
  ('coding', 'nlp'):
    Original: 0.000000
    Smoothed: 0.142857
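
A useful sanity check is that the smoothed probabilities $P(w_2|w_1)$ sum to 1 over all $w_2$ for each $w_1$. The sketch below runs that check two ways: with the unigram count of $w_1$ in the denominator (as in the formula above), and with the number of times $w_1$ is actually followed by another word. The names `context_counts` and `row_sum` are illustrative. On this corpus the two differ only for `coding`, which is the final token and therefore never acts as a context.

```python
from collections import Counter

tokens = ["i", "love", "learning", "nlp", "and", "i", "love", "coding"]
vocab = sorted(set(tokens))
V = len(vocab)

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
# How often each word is followed by another word (excludes the last token).
context_counts = Counter(tokens[:-1])

def row_sum(w1, denominator_counts):
    """Sum of add-one probabilities P(w2 | w1) over every w2 in the vocabulary."""
    return sum(
        (bigram_counts[(w1, w2)] + 1) / (denominator_counts[w1] + V)
        for w2 in vocab
    )

for w1 in vocab:
    print(
        f"{w1:>8}: unigram denominator {row_sum(w1, unigram_counts):.4f}, "
        f"context denominator {row_sum(w1, context_counts):.4f}"
    )
```

With the unigram denominator the row for `coding` sums to 6/7 rather than 1. Using context counts is one way to restore normalization; on a large corpus the discrepancy affects only the final token and is usually negligible.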


## 6. Perplexity Calculation

**Perplexity** is a standard metric for evaluating language models. It measures how "surprised" the model is by the test data.

**Formula:**  
$$\text{Perplexity} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i|w_{i-1})\right)$$

**Interpretation:**
- **Lower perplexity** = better model (less surprised by the data)
- **Higher perplexity** = worse model (more uncertain)
- Perplexity of $k$ means the model is as uncertain as if it had to choose uniformly from $k$ possibilities

Example: A perplexity of 50 means the model is as uncertain as if it had to randomly choose from 50 words at each position.
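
The uniform-choice interpretation can be checked numerically. In this sketch (the helper name `perplexity_from_probs` is hypothetical) a model that assigns probability $1/k$ to every observed word ends up with perplexity $k$, regardless of how many words are scored.

```python
import math

def perplexity_from_probs(probs):
    """exp of the negative mean log-probability of the per-step probabilities."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

k = 50
uniform_steps = [1 / k] * 10    # model gives 1/k to each of 10 observed words
confident_steps = [0.9] * 10    # model gives 0.9 to each observed word

print(f"uniform over {k} words: {perplexity_from_probs(uniform_steps):.2f}")
print(f"confident model:        {perplexity_from_probs(confident_steps):.2f}")
```

A model that is right with high confidence approaches the minimum perplexity of 1.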

In [6]:
import math

def perplexity(sentence, bigram_probs):
    """
    Calculate perplexity of a sentence using the bigram model.
    Lower perplexity = better model fit
    """
    words = preprocess(sentence)
    N = len(words) - 1  # Number of bigrams
    if N < 1:
        raise ValueError("perplexity needs at least two tokens")
    log_prob = 0

    # Sum log probabilities to avoid numerical underflow
    for i in range(N):
        prob = bigram_probs.get((words[i], words[i+1]), 1e-10)  # Small value for unseen bigrams
        log_prob += math.log(prob)

    # Calculate perplexity
    return math.exp(-log_prob / N)

# Test on various sentences
test_sentences = [
    "I love learning",
    "I love coding",
    "learning and coding"
]

print("Perplexity Scores (using smoothed model):")
for sent in test_sentences:
    pp = perplexity(sent, smoothed_bigram_probs)
    print(f"  '{sent}': {pp:.2f}")

Perplexity Scores (using smoothed model):
  'I love learning': 3.27
  'I love coding': 3.27
  'learning and coding': 7.00
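
As a worked check of the first and last scores, using the add-one probabilities from the smoothing section ($P(\text{love}|\text{i}) = 0.375$, $P(\text{learning}|\text{love}) = 0.25$, and $1/7$ for an unseen bigram whose first word occurs once):

$$\text{Perplexity}(\text{I love learning}) = \exp\left(-\frac{1}{2}\left[\ln 0.375 + \ln 0.25\right]\right) = (0.09375)^{-1/2} \approx 3.27$$

$$\text{Perplexity}(\text{learning and coding}) = \exp\left(-\frac{1}{2}\left[\ln \tfrac{1}{7} + \ln \tfrac{1}{7}\right]\right) = 7$$

Both bigrams in the second sentence are unseen, so its score equals the perplexity of always assigning $1/7$.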


## Key Takeaways

### What We Learned

1. **Unigram Models** 
   - Simplest model: treats words independently
   - Good baseline but ignores context

2. **Bigram Models**
   - Captures local context (previous word)
   - Usually predicts natural language better than a unigram model
   - Can be extended to trigrams, n-grams

3. **Smoothing**
   - Essential for handling unseen word combinations
   - Laplace smoothing is simple, but it shifts a lot of probability mass to unseen bigrams (P(love | i) dropped from 1.0 to 0.375 above)
   - Other methods: Good-Turing, Kneser-Ney

4. **Perplexity**
   - Standard evaluation metric
   - Lower is better
   - Allows comparison between models

### Limitations of Statistical LMs

- **Data sparsity**: Need huge corpora for good coverage
- **Limited context**: Bigrams only see 1 word back
- **No semantic understanding**: Purely based on co-occurrence
- **Storage**: Need to store counts for all n-grams
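
The sparsity and storage points can be illustrated on the sample sentence itself. The sketch below uses a hypothetical helper, `ngram_counts`, that generalizes the bigram counting loop to any order $n$, then compares the number of distinct n-grams observed with the $V^n$ that are possible.

```python
from collections import Counter

tokens = ["i", "love", "learning", "nlp", "and", "i", "love", "coding"]
V = len(set(tokens))

def ngram_counts(words, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

for n in (1, 2, 3):
    observed = len(ngram_counts(tokens, n))
    possible = V ** n
    print(f"n={n}: {observed} distinct n-grams observed out of {possible} possible")
```

The number of possible n-grams grows exponentially with $n$, while the number observed is bounded by the corpus length, so higher-order models see an ever smaller fraction of what they would need.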

### Modern Evolution

Neural language models (GPT, BERT, etc.) address many of these limitations through:
- **Neural networks**: Learn distributed representations
- **Long-range context**: Attention mechanisms can condition on every token in the context window
- **Semantic generalization**: Similar words get similar representations, so evidence is shared across related contexts
- **Compact storage**: A fixed set of parameters instead of explicit count tables

But statistical LMs remain important for understanding the foundations!