In this project, I have implemented a trigram language model for Spanish, trained on texts by Cervantes.
pip install git+https://github.com/leyresv/Ngram_Language_Model.git
pip install -r requirements.txt
To use the Language Model on your own data, open a terminal in the root directory and run the following command:
python src/main/main.py
We can compute the probability of an entire sequence of words $w_1, \dots, w_n$ using the chain rule of probability:
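$$P(w_1, \dots, w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \dots, w_{n-1}) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})$$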
A language model computes the probability of a sequence of words occurring in a given language. However, estimating the chain-rule terms directly from counts would give a probability of 0 to every sentence not present in the training set, because most long histories are never observed. To overcome this problem, we can use an n-gram model with the Markov assumption (the probability of a word depends only on a fixed number of preceding words): instead of computing the probability of a word given its entire history (all the previous words in the sentence), we approximate it by its conditional probability given only the preceding $N-1$ words:
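$$P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-N+1}, \dots, w_{i-1})$$

For the trigram model used in this project ($N = 3$), each word is conditioned only on the two preceding words.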
We can then compute the probability of a full sequence:
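$$P(w_1, \dots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-N+1}, \dots, w_{i-1})$$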
To avoid numerical underflow when multiplying many small probabilities, we compute the log probability instead:
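$$\log P(w_1, \dots, w_n) \approx \sum_{i=1}^{n} \log P(w_i \mid w_{i-N+1}, \dots, w_{i-1})$$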
We can estimate the n-gram probabilities from a corpus by counting how often each n-gram appears and normalizing by the count of its context (maximum likelihood estimation):
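$$P(w_i \mid w_{i-N+1}, \dots, w_{i-1}) = \frac{C(w_{i-N+1}, \dots, w_{i-1}, w_i)}{C(w_{i-N+1}, \dots, w_{i-1})}$$

where $C(\cdot)$ is the number of times the n-gram occurs in the training corpus.

As an illustration for the trigram case, counting n-grams with `collections.Counter` could look roughly like this (a minimal sketch, not the code in `src/main/main.py`; the function names and the `<s>`/`</s>` padding tokens are assumptions):

```python
from collections import Counter

def train_trigram_counts(sentences):
    """Count trigrams and their bigram contexts from tokenized sentences."""
    trigram_counts, context_counts = Counter(), Counter()
    for tokens in sentences:
        # Pad each sentence so the first words also have a full context
        padded = ["<s>", "<s>"] + tokens + ["</s>"]
        for i in range(len(padded) - 2):
            context = tuple(padded[i:i + 2])
            trigram = tuple(padded[i:i + 3])
            trigram_counts[trigram] += 1
            context_counts[context] += 1
    return trigram_counts, context_counts

def mle_prob(word, context, trigram_counts, context_counts):
    """Maximum likelihood estimate of P(word | context) from raw counts."""
    if context_counts[context] == 0:
        return 0.0
    return trigram_counts[context + (word,)] / context_counts[context]
```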
There are different options to keep a language model from assigning zero probabilities to unseen contexts:
We assign a small probability mass to the unseen n-grams by adding a constant $k$ to every count (add-k smoothing):
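$$P_{\text{add-}k}(w_i \mid w_{i-N+1}, \dots, w_{i-1}) = \frac{C(w_{i-N+1}, \dots, w_i) + k}{C(w_{i-N+1}, \dots, w_{i-1}) + k\,V}$$

With: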
- $k$: smoothing factor
- $V$: vocabulary size
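Continuing the counting sketch above (again hypothetical, reusing the `trigram_counts` and `context_counts` counters; not the repository's actual implementation), the smoothed estimate becomes:

```python
def add_k_prob(word, context, trigram_counts, context_counts, vocab_size, k=1.0):
    """Add-k smoothed estimate of P(word | context); never returns 0."""
    return (trigram_counts[context + (word,)] + k) / (context_counts[context] + k * vocab_size)
```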
We use the n-gram if there is sufficient evidence for it; otherwise we back off to the (n-1)-gram, then the (n-2)-gram, and so on, down to the unigram.
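One common formulation is stupid backoff (shown here only as an illustration; the discount factor 0.4 is the conventional choice, not necessarily the one used in this project):

$$S(w_i \mid w_{i-2}, w_{i-1}) = \begin{cases} \dfrac{C(w_{i-2}, w_{i-1}, w_i)}{C(w_{i-2}, w_{i-1})} & \text{if } C(w_{i-2}, w_{i-1}, w_i) > 0 \\ 0.4 \, S(w_i \mid w_{i-1}) & \text{otherwise} \end{cases}$$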
We mix the probability estimates from all the n-gram estimators:
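For the trigram case:

$$\hat{P}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1\, P(w_i) + \lambda_2\, P(w_i \mid w_{i-1}) + \lambda_3\, P(w_i \mid w_{i-2}, w_{i-1})$$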
With:
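$$\lambda_1 + \lambda_2 + \lambda_3 = 1, \qquad \lambda_j \ge 0$$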
We use a probability-based metric to evaluate our language model: the perplexity, computed as the inverse probability of the test set, normalized by the number of words:
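$$PP(W) = P(w_1, w_2, \dots, w_n)^{-\frac{1}{n}} = \sqrt[n]{\prod_{i=1}^{n} \frac{1}{P(w_i \mid w_1, \dots, w_{i-1})}}$$

where $n$ is the number of words in the test set. The lower the perplexity, the better the model fits the test data.

In terms of the summed log probability computed earlier, perplexity can be obtained as in this small sketch (a hypothetical helper, assuming natural logarithms; not the repository's code):

```python
import math

def perplexity(total_log_prob, num_words):
    """Perplexity of a test set from its total (natural) log probability."""
    return math.exp(-total_log_prob / num_words)
```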