This project contains Python classes for making word- and character-level predictions based on an N-gram language model. The word prediction class predicts words based on the current prefix of a word and an optional left context. The character prediction class predicts the most probable next characters based on an optional left context.
- pip
- Python 2.7
- KenLM
Language model queries are performed using the KenLM library. Use the package manager pip to install KenLM. We maintain a branch of the original KenLM repo; its only change is that several scripts were modified to compile KenLM with support for language models of up to 12-grams. This is required by the example character language model provided here.
pip install https://github.com/kdv123/kenlm/archive/master.zip
The examples directory in the repository root contains the following example scripts:
- Find most probable word for a given context
- Find a list of probable words for a given prefix and context
- Add a new vocabulary
There are three Python scripts, each defining one class used for prediction. The predictor.py script contains the WordPredictor class and the chracter_predictor.py script contains the CharacterPredictor class. The vocabtrie.py script contains the VocabTrie class, which the WordPredictor class uses to build a trie data structure.
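For readers unfamiliar with tries, the following is a minimal sketch of how a prefix tree can store a vocabulary and list the words that start with a given prefix. The SimpleTrie class and its method names are hypothetical and written only for illustration; they are not the project's VocabTrie and make no claim about how it is implemented.

```python
class SimpleTrie(object):
    """Toy prefix tree; a hypothetical stand-in, not the project's VocabTrie."""

    def __init__(self):
        self.children = {}    # maps a character to the child node
        self.is_word = False  # True if a vocabulary word ends at this node

    def add(self, word):
        """Insert a word one character at a time, creating nodes as needed."""
        node = self
        for ch in word:
            node = node.children.setdefault(ch, SimpleTrie())
        node.is_word = True

    def words_with_prefix(self, prefix):
        """Return every stored word starting with prefix, sorted alphabetically."""
        node = self
        for ch in prefix:
            if ch not in node.children:
                return []  # no stored word has this prefix
            node = node.children[ch]
        found = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_word:
                found.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(found)


trie = SimpleTrie()
for word in ["the", "then", "there", "this", "cat"]:
    trie.add(word)

completions = trie.words_with_prefix("the")
print(completions)
```

Walking down the trie costs one step per prefix character, so finding the candidate words does not require scanning the whole vocabulary.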
To use the WordPredictor class, first import it:
from predictor import WordPredictor
Then specify the path to a language model file and a vocabulary file. Example language models and vocabulary files are provided in the resources sub-directory.
lm_filename = 'resources/lm_word_medium.kenlm'
vocab_filename = 'resources/vocab_100k'
word_predictor = WordPredictor(lm_filename, vocab_filename)
There are three methods to predict the most probable word or a list of probable words:
- The first method takes a prefix, a vocab_id, a number of predictions and a minimum log probability as arguments and returns a list of probable words without considering any context:
def get_words(prefix, vocab_id, num_predictions, min_log_prob)
When a WordPredictor object is instantiated, it creates a trie data structure with the default vocab_id = ''. A list of the characters in the vocabulary is also created on instantiation, and the method returns a list of probable words starting with the prefix and each character of the character list. The default value for the parameter num_predictions is 0, in which case the method returns all the predictions, ordered from the most probable to the least. The default value for the parameter min_log_prob is -float('inf').
- The second method is similar to the previous one, but it also takes a context into account when predicting the list of probable words:
def get_words(prefix, context, vocab_id, num_predictions, min_log_prob)
- The third method takes arguments similar to those of the first and second methods, but it returns only the most probable word for a given context and prefix:
def get_most_probable_word(prefix, context, vocab_id, num_predictions, min_log_prob)
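The roles of prefix, context, num_predictions and min_log_prob described above can be illustrated with a small self-contained sketch. It does not use KenLM or the WordPredictor class: the functions rank_words and most_probable_word, the toy table of log probabilities, the single-word context and the (word, log probability) return format are all assumptions made for this example.

```python
def rank_words(prefix, context, log_probs, num_predictions=0,
               min_log_prob=-float('inf')):
    """Rank words that start with prefix, given the previous word as context.

    log_probs is a hypothetical toy table mapping (previous_word, word) to a
    log probability; a real setup would query a language model instead.
    """
    scored = []
    for (previous, word), log_prob in log_probs.items():
        if previous != context or not word.startswith(prefix):
            continue
        if log_prob < min_log_prob:
            continue  # drop words below the probability floor
        scored.append((word, log_prob))
    # Most probable first; ties are broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    if num_predictions > 0:
        scored = scored[:num_predictions]  # 0 means "keep everything"
    return scored


def most_probable_word(prefix, context, log_probs):
    """Return the single best (word, log_prob) pair, or None if nothing matches."""
    ranked = rank_words(prefix, context, log_probs, num_predictions=1)
    return ranked[0] if ranked else None


# Made-up log probabilities, for illustration only.
toy_log_probs = {
    ("the", "cat"): -1.0,
    ("the", "car"): -1.5,
    ("the", "cab"): -3.0,
    ("the", "dog"): -0.8,
    ("a", "cab"): -0.5,
    ("a", "cat"): -2.0,
}

all_words = rank_words("ca", "the", toy_log_probs)
top_two = rank_words("ca", "the", toy_log_probs, num_predictions=2)
above_floor = rank_words("ca", "the", toy_log_probs, min_log_prob=-2.0)
best_after_a = most_probable_word("ca", "a", toy_log_probs)
```

With the same prefix, changing the context from "the" to "a" changes which word ranks first, which is the effect the context argument is meant to capture.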
This material is based upon work supported by the National Science Foundation under Grant No. 1750193. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.