# Smoothing in n-gram LMs

- 📺 **Video:** [https://youtu.be/Yfug5eIQh5w](https://youtu.be/Yfug5eIQh5w)

## Overview
- Tackle zero probabilities in n-gram models using add-k and interpolated smoothing.
- Interpret how smoothing redistributes probability mass from seen events to unseen ones.
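
To make the redistribution concrete, here is a minimal sketch that applies add-one smoothing to the follower counts of a single context and measures how much mass moves to unseen words. The counts, the six-word vocabulary, and the helper name `add_k_distribution` are illustrative assumptions, not taken from the lecture.

```python
from collections import Counter

# Hypothetical counts of words observed after one fixed context.
counts = Counter({"sat": 2, "chased": 1, "slept": 1})
# Assumed vocabulary: three seen words plus three that never followed the context.
vocab = ["sat", "chased", "slept", "ate", "ran", "</s>"]

def add_k_distribution(counts, vocab, k=1.0):
    """Return P(w | context) for every w in vocab under add-k smoothing."""
    total = sum(counts.values()) + k * len(vocab)
    return {w: (counts[w] + k) / total for w in vocab}

smoothed = add_k_distribution(counts, vocab, k=1.0)

# Split the smoothed distribution into mass on seen and on unseen words.
seen_mass = sum(p for w, p in smoothed.items() if counts[w] > 0)
unseen_mass = sum(p for w, p in smoothed.items() if counts[w] == 0)

print(f"total mass  = {sum(smoothed.values()):.4f}")
print(f"seen mass   = {seen_mass:.4f} (1.0000 under MLE)")
print(f"unseen mass = {unseen_mass:.4f} (0.0000 under MLE)")
```

The distribution still sums to one; smoothing only moves mass. With four observations and six vocabulary entries, add-one shifts a large share of the mass to unseen words, which is why smaller pseudo-counts are often preferred on sparse data.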

## Key ideas
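- **Additive smoothing:** add a pseudo-count k to every n-gram count so that no event gets zero probability; k = 1 gives add-one (Laplace) smoothing.
- **Interpolation:** take a weighted average of higher- and lower-order models, so a sparse trigram estimate is backed by better-supported bigram and unigram estimates.
- **Discounting:** subtract a small amount of mass from seen events and allocate it to unseen ones.
- **Perplexity impact:** smoothing usually lowers perplexity on held-out data, because an unsmoothed model assigns zero probability (infinite perplexity) to any text containing an unseen n-gram.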
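
The first three ideas can be sketched side by side. This is a simplified illustration, not the lecture's implementation: it uses a bigram model backed by a unigram model rather than trigrams, and the toy corpus, function names, and the values `k=0.5`, `lam=0.7`, and `d=0.75` are all assumptions chosen for readability.

```python
from collections import Counter

# Hypothetical toy corpus; "<s>" and "</s>" mark sentence boundaries.
sentences = ["the cat sat", "the dog sat", "the cat ran"]

unigrams, bigrams, contexts = Counter(), Counter(), Counter()
for s in sentences:
    tokens = ["<s>"] + s.split() + ["</s>"]
    for prev, word in zip(tokens, tokens[1:]):
        unigrams[word] += 1           # predicted tokens only, so "<s>" is excluded
        bigrams[(prev, word)] += 1
        contexts[prev] += 1

vocab = sorted(unigrams)              # every token the model can predict
total_tokens = sum(unigrams.values())

def unigram_prob(word):
    return unigrams[word] / total_tokens

def mle_bigram(prev, word):
    return bigrams[(prev, word)] / contexts[prev] if contexts[prev] else 0.0

def add_k_bigram(prev, word, k=0.5):
    """Additive smoothing: add k to the count and k * |V| to the denominator."""
    return (bigrams[(prev, word)] + k) / (contexts[prev] + k * len(vocab))

def interpolated(prev, word, lam=0.7):
    """Linear interpolation: lam * P_bigram + (1 - lam) * P_unigram."""
    return lam * mle_bigram(prev, word) + (1 - lam) * unigram_prob(word)

def absolute_discount(prev, word, d=0.75):
    """Subtract d from each seen count; hand the freed mass to the unigram model."""
    ctx = contexts[prev]
    if ctx == 0:
        return unigram_prob(word)
    seen_types = sum(1 for (p, _), c in bigrams.items() if p == prev and c > 0)
    reserved = d * seen_types / ctx   # total mass removed from seen bigrams
    return max(bigrams[(prev, word)] - d, 0) / ctx + reserved * unigram_prob(word)

for prev, word in [("the", "cat"), ("dog", "ran")]:
    print(f"P({word}|{prev}) MLE={mle_bigram(prev, word):.4f} "
          f"add-k={add_k_bigram(prev, word):.4f} "
          f"interp={interpolated(prev, word):.4f} "
          f"discount={absolute_discount(prev, word):.4f}")
```

The bigram `dog ran` never occurs in this corpus, so its MLE estimate is zero while all three smoothed estimates are positive. Note that the interpolated and discounted variants still give zero to a word that is missing from the unigram model; handling fully unknown words needs a separate mechanism.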

## Demo
Compare maximum-likelihood and add-one smoothed probabilities on sparse trigrams to mirror the lecture (https://youtu.be/rXySwxZMaf0).

```python
from collections import defaultdict

corpus = [
    'the cat sat on the mat',
    'the dog chased the cat',
    'the cat chased the mouse',
    'the mouse ate the cheese',
    'the dog sat on the rug'
]

# Vocabulary of predictable tokens: every word type plus the end marker.
# '<s>' is left out because it is only conditioned on, never predicted.
vocab = sorted(set(' '.join(corpus).split())) + ['</s>']
trigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)  # how often each two-word context occurs
for sentence in corpus:
    # Two start markers give the first word a full trigram context.
    tokens = ['<s>', '<s>'] + sentence.split() + ['</s>']
    for i in range(2, len(tokens)):
        trigram = (tokens[i-2], tokens[i-1], tokens[i])
        bigram = (tokens[i-2], tokens[i-1])
        trigram_counts[trigram] += 1
        bigram_counts[bigram] += 1

def mle_prob(w1, w2, w3):
    """Maximum-likelihood estimate: count(w1 w2 w3) / count(w1 w2)."""
    count = trigram_counts[(w1, w2, w3)]
    base = bigram_counts[(w1, w2)]
    # An unseen context has no estimate; report 0.0 instead of dividing by zero.
    return count / base if base else 0.0

def add_one_prob(w1, w2, w3):
    """Add-one estimate: (count + 1) / (context count + |V|)."""
    count = trigram_counts[(w1, w2, w3)] + 1
    base = bigram_counts[(w1, w2)] + len(vocab)
    return count / base

examples = [
    ('the', 'cat', 'chased'),
    ('the', 'cat', 'ate'),
    ('dog', 'sat', 'on'),
    ('sat', 'on', 'the'),
    ('mouse', 'ate', 'the')
]

for w1, w2, w3 in examples:
    print(f"P({w3}|{w1} {w2}) MLE={mle_prob(w1, w2, w3):.4f} | add-one={add_one_prob(w1, w2, w3):.4f}")
```


```text
P(chased|the cat) MLE=0.3333 | add-one=0.1333
P(ate|the cat) MLE=0.0000 | add-one=0.0667
P(on|dog sat) MLE=1.0000 | add-one=0.1538
P(the|sat on) MLE=1.0000 | add-one=0.2143
P(the|mouse ate) MLE=1.0000 | add-one=0.1538
```
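
To connect the demo to held-out evaluation, the sketch below computes trigram perplexity for one held-out sentence under several values of k. It rebuilds the demo's counts so that it runs on its own; the held-out sentence, the `perplexity` helper, and the chosen k values are assumptions for illustration, and k = 0 stands in for the unsmoothed MLE model.

```python
import math
from collections import defaultdict

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the cat chased the mouse",
    "the mouse ate the cheese",
    "the dog sat on the rug",
]
held_out = "the dog ate the cheese"  # hypothetical held-out sentence

vocab = sorted(set(" ".join(corpus).split())) + ["</s>"]
trigram_counts = defaultdict(int)
context_counts = defaultdict(int)
for sentence in corpus:
    tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    for i in range(2, len(tokens)):
        trigram_counts[tuple(tokens[i - 2:i + 1])] += 1
        context_counts[tuple(tokens[i - 2:i])] += 1

def add_k_prob(w1, w2, w3, k):
    """Add-k trigram estimate; k = 0 reduces to the MLE."""
    num = trigram_counts.get((w1, w2, w3), 0) + k
    den = context_counts.get((w1, w2), 0) + k * len(vocab)
    return num / den if den else 0.0

def perplexity(sentence, k):
    """exp of the average negative log-probability per predicted token."""
    tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    n_predicted = len(tokens) - 2
    log_prob = 0.0
    for i in range(2, len(tokens)):
        p = add_k_prob(tokens[i - 2], tokens[i - 1], tokens[i], k)
        if p == 0.0:
            return math.inf  # one zero-probability event makes perplexity infinite
        log_prob += math.log(p)
    return math.exp(-log_prob / n_predicted)

for k in (0.0, 1.0, 0.1):
    print(f"k={k}: perplexity={perplexity(held_out, k):.2f}")
```

The held-out sentence contains the unseen trigram `the dog ate`, so the unsmoothed model reports infinite perplexity while both smoothed runs stay finite. On data this small the comparison between k values is only suggestive; in practice k is tuned on a separate development set.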


## Try it
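- Modify the demo: replace add-one with add-k (for example k = 0.1) and compare how much probability the seen trigrams lose.
- Add a tiny dataset or counter-example, such as a trigram whose two-word context never appears in the corpus, and check what each estimator returns.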


## References
- [Eisenstein 6.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.2](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.3](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.4](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [[Blog] Understanding LSTMs](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)
- [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)
- [[Blog] The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
- [Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation](https://arxiv.org/abs/2108.12409)
- [The Impact of Positional Encoding on Length Generalization in Transformers](https://arxiv.org/abs/2305.19466)


*Links only; we do not redistribute slides or papers.*