# n-gram LMs

- 📺 **Video:** [https://youtu.be/J-yHbD8LYCM](https://youtu.be/J-yHbD8LYCM)

## Overview
- Introduce n-gram language models that predict the next token from fixed-length histories.
- Understand data sparsity challenges and motivations for smoothing.

## Key ideas
- **Markov assumption:** condition the next word on only the previous n-1 words.
- **Count-based models:** maximum likelihood estimates are relative frequencies of n-gram counts; they work well when data is abundant.
- **Sparsity:** unseen n-grams receive zero probability without smoothing.
- **Applications:** n-gram LMs score sentences and provide baselines for speech and MT.
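
The first two ideas can be written as formulas. This is the standard textbook formulation; the notation ($w_i$ for the $i$-th token, $c(\cdot)$ for a count in the training corpus) is chosen here for illustration and is not taken from the lecture. Under the Markov assumption, the probability of a sequence factorizes over fixed-length histories:

$$
P(w_1, \dots, w_T) \approx \prod_{i=1}^{T} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
$$

The maximum likelihood estimate of each factor is a ratio of counts:

$$
\hat{P}_{\mathrm{MLE}}(w_i \mid w_{i-n+1}, \dots, w_{i-1}) = \frac{c(w_{i-n+1}, \dots, w_i)}{c(w_{i-n+1}, \dots, w_{i-1})}
$$

For a trigram model ($n = 3$) this reduces to $c(w_{i-2}, w_{i-1}, w_i) / c(w_{i-2}, w_{i-1})$. An n-gram that never occurs in training has a zero numerator, which is the sparsity problem named above.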

## Demo
Estimate a trigram language model on a mini corpus and sample sentences, connecting to the lecture (https://youtu.be/-9urhJp9G68).

In [1]:
from collections import defaultdict
import numpy as np

corpus = [
    'the cat sat on the mat',
    'the dog chased the cat',
    'the cat chased the mouse',
    'the mouse ate the cheese',
    'the dog sat on the rug'
]

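# Count each trigram and its two-word context. Despite its name, bigram_counts
# holds context counts (how often a word pair is followed by any token), which
# is the denominator of the maximum likelihood estimate.
trigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)
for sentence in corpus:
    # Pad with two start symbols so the first real word has a full context.
    tokens = ['<s>', '<s>'] + sentence.split() + ['</s>']
    for i in range(2, len(tokens)):
        trigram = (tokens[i-2], tokens[i-1], tokens[i])
        bigram = (tokens[i-2], tokens[i-1])
        trigram_counts[trigram] += 1
        bigram_counts[bigram] += 1

def trigram_prob(w1, w2, w3):
    # Maximum likelihood estimate: count(w1 w2 w3) / count(w1 w2).
    # .get avoids inserting zero-count keys into the defaultdicts on lookup.
    count = trigram_counts.get((w1, w2, w3), 0)
    base = bigram_counts.get((w1, w2), 0)
    if base == 0:
        # Unseen context: no estimate is available, so return 0.
        return 0.0
    return count / base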

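def sample_sentence(max_len=15):
    # Start from the two start-of-sentence padding symbols.
    w1, w2 = '<s>', '<s>'
    sentence = []
    for _ in range(max_len):
        # Candidate next tokens: every w3 observed after the context (w1, w2).
        vocab = sorted({t3 for (t1, t2, t3) in trigram_counts if t1 == w1 and t2 == w2})
        if not vocab:
            break
        probs = np.array([trigram_prob(w1, w2, w3) for w3 in vocab])
        probs /= probs.sum()
        # Convert NumPy's string type back to a plain Python str.
        w3 = str(np.random.choice(vocab, p=probs))
        if w3 == '</s>':
            break
        sentence.append(w3)
        w1, w2 = w2, w3
    return ' '.join(sentence)

print('Probability of "the cat chased":', trigram_prob('the', 'cat', 'chased'))
for _ in range(3):
    print('Sampled:', sample_sentence())


Probability of "the cat chased": 0.3333333333333333

The three `Sampled:` lines vary from run to run because sampling is random. Each one is a sentence assembled from trigrams seen in the corpus; for example, `the dog sat on the mat` is one possible sample.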
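
To make the sparsity point concrete, here is a minimal sketch of add-k (Laplace) smoothing combined with sentence scoring in log space. It is one simple smoothing scheme among several, not necessarily the one used in the lecture. The names `smoothed_prob`, `sentence_logprob`, `context_counts`, and the parameter `k` are illustrative assumptions, and the toy corpus is repeated so that the block runs on its own.

```python
import math
from collections import Counter

# Same toy corpus as the demo, repeated so this block is self-contained.
corpus = [
    'the cat sat on the mat',
    'the dog chased the cat',
    'the cat chased the mouse',
    'the mouse ate the cheese',
    'the dog sat on the rug',
]

trigram_counts = Counter()
context_counts = Counter()  # counts of the two-word history of each trigram
vocab = set()               # every token the model can predict, including </s>
for sentence in corpus:
    tokens = ['<s>', '<s>'] + sentence.split() + ['</s>']
    vocab.update(tokens[2:])
    for i in range(2, len(tokens)):
        trigram_counts[tuple(tokens[i-2:i+1])] += 1
        context_counts[tuple(tokens[i-2:i])] += 1

def smoothed_prob(w1, w2, w3, k=1.0):
    """Add-k estimate: (c(w1 w2 w3) + k) / (c(w1 w2) + k * |V|).

    With k=0 this falls back to the unsmoothed maximum likelihood estimate.
    """
    denom = context_counts[(w1, w2)] + k * len(vocab)
    if denom == 0:
        return 0.0  # unseen context and no smoothing: no probability mass
    return (trigram_counts[(w1, w2, w3)] + k) / denom

def sentence_logprob(sentence, k=1.0):
    """Sum of log trigram probabilities; -inf if any factor is zero."""
    tokens = ['<s>', '<s>'] + sentence.split() + ['</s>']
    total = 0.0
    for i in range(2, len(tokens)):
        p = smoothed_prob(tokens[i-2], tokens[i-1], tokens[i], k)
        if p == 0.0:
            return float('-inf')
        total += math.log(p)
    return total

# "the cat ate" never occurs in the corpus, so the unsmoothed model assigns
# the whole sentence zero probability (log-probability of -inf).
unseen = 'the cat ate the cheese'
print('MLE log-prob:     ', sentence_logprob(unseen, k=0))
print('Add-one log-prob: ', sentence_logprob(unseen, k=1))
```

Adding `k` to every count moves probability mass from seen to unseen trigrams, so the seen trigram "the cat chased" drops from 1/3 under maximum likelihood to (1 + 1) / (3 + 12) under add-one smoothing with this 12-token vocabulary. Summing log-probabilities instead of multiplying probabilities avoids numerical underflow on longer sentences.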


## Try it
- Modify the demo
- Add a tiny dataset or counter-example


## References
- [Eisenstein 6.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.2](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.3](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.4](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [[Blog] Understanding LSTMs](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)
- [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)
- [[Blog] The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
- [Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation](https://arxiv.org/abs/2108.12409)
- [The Impact of Positional Encoding on Length Generalization in Transformers](https://arxiv.org/abs/2305.19466)


*Links only; we do not redistribute slides or papers.*