This Python program uses tri-grams to complete a cloze (a passage with blanked-out words). The user provides a cloze document, a list of candidate words, a corpus, and a lexicon. The program processes this data, builds a matrix of probabilities, and at each step selects the candidate with the highest probability to fill the cloze.
- Uses tri-grams stored in dictionaries of dictionaries for efficient lookup.
- Requires four inputs from the user: a cloze document, a candidates list, a corpus, and a lexicon.
- Processes data to create a matrix of probabilities.
- Selects the highest probability at each step to complete the cloze.
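The approach described above can be sketched roughly as follows. This is a minimal illustration, not the program's actual code: the function names (`count_trigrams`, `trigram_prob`, `fill_cloze`), the blank marker `__________`, the add-one smoothing, and the rule that each candidate is used at most once are all assumptions made for the example. It scores each candidate only against the two words to the left of a blank.

```python
from collections import defaultdict

def count_trigrams(sentences):
    """Build a dictionary of dictionaries: counts[(w1, w2)][w3] -> frequency."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            counts[(w1, w2)][w3] += 1
    return counts

def trigram_prob(counts, w1, w2, w3, vocab_size):
    """Add-one smoothed estimate of P(w3 | w1, w2) (smoothing is an assumption)."""
    context = counts.get((w1, w2), {})
    return (context.get(w3, 0) + 1) / (sum(context.values()) + vocab_size)

def fill_cloze(tokens, candidates, counts, vocab_size, blank="__________"):
    """Greedily fill blanks left to right with the best-scoring unused candidate."""
    tokens = list(tokens)
    remaining = list(candidates)
    for i, tok in enumerate(tokens):
        # Skip non-blanks, blanks without two words of left context,
        # and blanks left over once the candidates run out.
        if tok != blank or i < 2 or not remaining:
            continue
        best = max(
            remaining,
            key=lambda c: trigram_prob(counts, tokens[i - 2], tokens[i - 1], c, vocab_size),
        )
        tokens[i] = best
        remaining.remove(best)
    return tokens

# Toy corpus, already tokenized (hypothetical data for illustration only).
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["the", "cat", "sat", "on", "the", "mat"],
]
counts = count_trigrams(corpus)
cloze = ["the", "cat", "__________", "on", "the", "__________"]
result = fill_cloze(cloze, ["mat", "sat"], counts, vocab_size=10)
print(" ".join(result))
```

A greedy left-to-right pass keeps the example short; a fuller version might also score the words to the right of each blank before committing to a candidate.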
- Clone the repository to your local machine.
- Create a configuration file named config.json with the following structure:
{
"input_filename": "data/document.cloze.txt",
"candidates_filename": "data/candidate.words.txt",
"lexicon_filename": "data/lexicon.txt",
"corpus": "data/en.wikipedia2018.10M.txt"
}
- Update the file paths in the configuration file according to your setup.
- Run the program.
- Review the completed cloze list generated by the program.
Note: Ensure that the specified files in the configuration file exist and have the required data.
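One way to check the configuration before running is sketched below. It assumes the four keys shown in the config.json example above; the `load_config` helper and the `REQUIRED_KEYS` constant are hypothetical names introduced for this illustration and are not part of the program.

```python
import json
from pathlib import Path

# Keys taken from the config.json example; the constant itself is hypothetical.
REQUIRED_KEYS = ("input_filename", "candidates_filename", "lexicon_filename", "corpus")

def load_config(path="config.json"):
    """Load the JSON config and fail early if a key or a referenced file is missing."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)

    missing_keys = [key for key in REQUIRED_KEYS if key not in config]
    if missing_keys:
        raise KeyError(f"config is missing keys: {missing_keys}")

    missing_files = [config[key] for key in REQUIRED_KEYS if not Path(config[key]).is_file()]
    if missing_files:
        raise FileNotFoundError(f"files not found: {missing_files}")

    return config
```

Calling `load_config()` from the directory that holds config.json returns the parsed dictionary, or raises an error naming the missing keys or files.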
The program was evaluated with the following setup and result:
- Corpus: 10 million lines from Wikipedia.
- Lexicon: The 50,000 most frequent English words.
- Success Rate: 83.33%
This project is licensed under the MIT License - see the LICENSE file for details.