This project implements a statistical language model based on N-grams, trained on a text corpus from The Lord of the Rings (LOTR). The approach relies on estimating word-sequence probabilities from frequency counts and applying backoff and pruning strategies to improve robustness.
An N-gram model estimates the probability of a word given a fixed-length context of the previous $n - 1$ words:

$$P(w_i \mid w_{i-n+1}, \dots, w_{i-1}) = \frac{C(w_{i-n+1}, \dots, w_{i-1}, w_i)}{C(w_{i-n+1}, \dots, w_{i-1})}$$

where $C(\cdot)$ denotes the number of times a word sequence occurs in the training corpus. By varying $n$, models of different orders are obtained:
- Bigram model: $n = 2$
- Trigram model: $n = 3$
- General N-gram model: $n \geq 2$
Higher-order models capture more context but suffer more from data sparsity.
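As a rough illustration of the counting and maximum likelihood step, a minimal Python sketch might look like the following. The function names and the toy sentence are illustrative, not taken from the project code:

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every n-gram and its (n-1)-word context in a token list."""
    ngram_counts = Counter()
    context_counts = Counter()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        ngram_counts[ngram] += 1
        context_counts[ngram[:-1]] += 1
    return ngram_counts, context_counts

def mle_probability(ngram, ngram_counts, context_counts):
    """Maximum likelihood estimate: C(context, w) / C(context)."""
    context = ngram[:-1]
    if context_counts[context] == 0:
        return 0.0
    return ngram_counts[ngram] / context_counts[context]

# Toy example: "the" is followed by "ring" in 1 of its 2 occurrences.
tokens = "the ring was made in the fires of mount doom".split()
ngram_counts, context_counts = count_ngrams(tokens, n=2)
print(mle_probability(("the", "ring"), ngram_counts, context_counts))  # 0.5
```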
To address sparsity, a backoff mechanism is used. When a higher-order N-gram is not observed in the training data, the model falls back to a lower-order model.
For example, in a trigram model, if the trigram $(w_{i-2}, w_{i-1}, w_i)$ was never observed, the model backs off to the bigram estimate:

$$P(w_i \mid w_{i-2}, w_{i-1}) \approx P(w_i \mid w_{i-1})$$
This process can continue recursively down to unigrams if necessary. Backoff ensures that the model can always produce a probability estimate, even for unseen sequences.
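The text does not specify the exact backoff scheme, so the sketch below uses the simplest variant: fall through to the next lower order whenever the higher-order count is zero, with unigram relative frequency as the base case. Unlike Katz backoff, this does not discount or redistribute probability mass, so it yields a score rather than a properly normalized distribution:

```python
from collections import Counter

def build_counts(tokens, max_n):
    """Count k-grams for every order k = 1..max_n."""
    return {k: Counter(tuple(tokens[i:i + k])
                       for i in range(len(tokens) - k + 1))
            for k in range(1, max_n + 1)}

def backoff_probability(ngram, counts):
    """Fall back from order len(ngram) to shorter contexts until the
    n-gram has been observed; unigram relative frequency is the base case."""
    n = len(ngram)
    if n == 1:
        total = sum(counts[1].values())
        return counts[1][ngram] / total if total else 0.0
    if counts[n][ngram] > 0:
        # Observed at this order: plain MLE ratio against the context count.
        return counts[n][ngram] / counts[n - 1][ngram[:-1]]
    # Unseen at this order: drop the oldest context word and recurse.
    return backoff_probability(ngram[1:], counts)

tokens = "one ring to rule them all and in the darkness bind them".split()
counts = build_counts(tokens, max_n=3)
print(backoff_probability(("rule", "them", "all"), counts))   # observed trigram: 1.0
print(backoff_probability(("frodo", "them", "all"), counts))  # backs off to bigram: 0.5
```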
To reduce noise and improve efficiency, low-frequency N-grams are removed from the model. This is done by applying a threshold: an N-gram is discarded whenever

$$C(w_{i-n+1}, \dots, w_i) < \theta$$

where $\theta$ is the minimum count required for an N-gram to be kept.
Pruning reduces memory usage and removes unreliable transitions, but increases reliance on backoff.
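Count-based pruning amounts to a single filtering pass over the table. A minimal sketch is shown below; the threshold value of 2 is illustrative, since the text does not state the value of $\theta$ actually used:

```python
from collections import Counter

def prune_ngrams(ngram_counts, min_count=2):
    """Drop n-grams seen fewer than min_count times; queries for the
    pruned entries will then fall through to the backoff mechanism."""
    return Counter({g: c for g, c in ngram_counts.items() if c >= min_count})

# Example: anything observed only once is removed.
counts = Counter({("the", "ring"): 7, ("my", "precious"): 3, ("second", "breakfast"): 1})
pruned = prune_ngrams(counts, min_count=2)
print(pruned)  # ("second", "breakfast") has been pruned
```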
The model is trained on text from The Lord of the Rings. This corpus provides a consistent narrative style and vocabulary, allowing the model to learn domain-specific word patterns and structures.
The model combines:
- Maximum likelihood estimation for N-gram probabilities
- Backoff to handle unseen contexts
- Pruning to reduce sparsity and noise
This results in a probabilistic language model that captures local word dependencies within the LOTR corpus.