# Smoothing in n-gram LMs

- 📺 **Video:** [https://youtu.be/Yfug5eIQh5w](https://youtu.be/Yfug5eIQh5w)

## Overview
Covers how to handle zero probabilities and improve estimates in n-gram models via smoothing techniques. The video likely starts with the simplest method, Laplace (add-one) smoothing: we pretend every possible n-gram was seen one extra time, so we add 1 to every count and add the vocabulary size V to the denominator so that the probabilities still sum to 1.
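
To make the add-one idea concrete, here is a minimal sketch for bigrams. The toy corpus and the names `p_mle` and `p_laplace` are illustrative assumptions, not taken from the video.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
tokens = "the cat sat on the mat the cat ate".split()
vocab = sorted(set(tokens))
V = len(vocab)  # vocabulary size

# Count each bigram, and how often each word appears as a bigram context.
bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])

def p_mle(word, prev):
    """Unsmoothed estimate: count(prev, word) / count(prev)."""
    return bigram_counts[(prev, word)] / context_counts[prev]

def p_laplace(word, prev):
    """Add-one estimate: (count(prev, word) + 1) / (count(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + V)

# "the sat" never occurs, so the unsmoothed estimate is exactly zero,
# while the add-one estimate is small but positive.
print(p_mle("sat", "the"), p_laplace("sat", "the"))

# The smoothed distribution over the vocabulary still sums to 1.
print(sum(p_laplace(w, "the") for w in vocab))
```

Adding V to the denominator is what keeps the smoothed values a proper distribution; with a large vocabulary, add-one also moves a lot of mass to unseen bigrams, which is one reason the more refined methods below exist.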

In [None]:
import os, random
random.seed(0)
CI = os.environ.get('CI') == 'true'

## Key ideas
- It then likely moves to more sophisticated methods such as Good-Turing discounting (which reallocates some probability mass from seen n-grams to unseen ones) and interpolation/backoff strategies.
- Backoff: if an n-gram has zero count, fall back to the (n-1)-gram probability.
- Interpolation: always combine the trigram, bigram, and unigram probabilities, using weights that sum to 1.
- The video probably emphasizes Kneser-Ney smoothing, one of the best-performing methods. Its intuition is that not all words and contexts are equally novel: it redistributes probability by considering how likely a word is to appear as a novel continuation, that is, after how many distinct contexts it has been seen.
- The treatment may stay high-level, since the full Kneser-Ney details are complex, but the key idea is that it also adjusts for how likely a context is to be followed by new words.
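
The backoff and interpolation bullets above can be sketched as follows. This is a simplified illustration under stated assumptions: the toy corpus, the function names, and the fixed weights `(0.6, 0.3, 0.1)` are hypothetical (in practice the weights are tuned on held-out data), and the naive backoff shown here applies no discounting, so its scores are not a normalized distribution.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
tokens = "the cat sat on the mat the cat ate".split()
vocab = sorted(set(tokens))
N = len(tokens)

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
# Context counts: how often a history was followed by some next word.
unigram_ctx = Counter(tokens[:-1])
bigram_ctx = Counter(zip(tokens[:-2], tokens[1:-1]))

def ratio(num, den):
    """Relative frequency, defined as 0 when the context was never seen."""
    return num / den if den else 0.0

def p_interp(w, u, v, lambdas=(0.6, 0.3, 0.1)):
    """Interpolation: always mix trigram, bigram, and unigram estimates."""
    l3, l2, l1 = lambdas  # assumed weights; they must sum to 1
    p3 = ratio(trigrams[(u, v, w)], bigram_ctx[(u, v)])
    p2 = ratio(bigrams[(v, w)], unigram_ctx[v])
    p1 = unigrams[w] / N
    return l3 * p3 + l2 * p2 + l1 * p1

def p_backoff(w, u, v):
    """Naive backoff: use the highest-order estimate with a nonzero count.

    No discounting is applied, so these scores do not sum to 1.
    """
    if trigrams[(u, v, w)] > 0:
        return trigrams[(u, v, w)] / bigram_ctx[(u, v)]
    if bigrams[(v, w)] > 0:
        return bigrams[(v, w)] / unigram_ctx[v]
    return unigrams[w] / N

# "the cat mat" is unseen as a trigram and as a bigram, yet both
# strategies still assign it nonzero mass via the unigram estimate.
print(p_interp("mat", "the", "cat"), p_backoff("mat", "the", "cat"))
```

The design difference is visible in the code: interpolation consults all three orders every time, while backoff only drops to a lower order when the higher-order count is zero.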

## Demo

In [None]:
print('Try the exercises below and follow the linked materials.')
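
As a small demo of the Kneser-Ney intuition, the sketch below implements a bigram version of interpolated Kneser-Ney. It is a minimal illustration, not a full implementation: the toy corpus, the function names, and the discount `d=0.75` are assumptions chosen for the example.

```python
from collections import Counter, defaultdict

# Toy corpus (hypothetical, for illustration only).
tokens = "the cat sat on the mat the cat ate".split()
vocab = sorted(set(tokens))

bigrams = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])

# For each word: the distinct words seen before it and after it.
preceders = defaultdict(set)
followers = defaultdict(set)
for (v, w) in bigrams:
    preceders[w].add(v)
    followers[v].add(w)
num_bigram_types = len(bigrams)

def p_continuation(w):
    """Share of distinct bigram types that end in w (not raw frequency)."""
    return len(preceders[w]) / num_bigram_types

def p_kn(w, v, d=0.75):
    """Interpolated Kneser-Ney for bigrams with an assumed discount d."""
    c_v = context_counts[v]
    if c_v == 0:
        # Unseen context: fall back to the continuation distribution.
        return p_continuation(w)
    discounted = max(bigrams[(v, w)] - d, 0) / c_v
    # Mass freed by discounting; larger when v is followed by many word types.
    lam = d * len(followers[v]) / c_v
    return discounted + lam * p_continuation(w)

# "cat" is frequent but only ever follows "the", so its continuation
# probability is lower than its raw unigram frequency would suggest.
print(p_continuation("cat"), tokens.count("cat") / len(tokens))

# An unseen bigram still gets nonzero probability.
print(p_kn("sat", "the"))
```

The weight `lam` is the part that adjusts for how likely a context is to be followed by new words: a context seen with many distinct followers hands more mass to the continuation distribution.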

## Try it
- Modify the demo
- Add a tiny dataset or counter-example


## References
- [Eisenstein 6.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.2](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.4](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.3](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [[Blog] Understanding LSTMs](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)
- [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)
- [[Blog] The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
- [Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation](https://arxiv.org/abs/2108.12409)
- [The Impact of Positional Encoding on Length Generalization in Transformers](https://arxiv.org/abs/2305.19466)


*Links only; we do not redistribute slides or papers.*