nirpr/cloze_completion

N-Gram Cloze Completion

Overview

This Python program uses tri-grams to complete a given cloze (a text with missing words). The user provides four inputs: a cloze, a candidates list, a corpus, and a lexicon. The program processes these inputs, stores the resulting probabilities in a matrix, and fills in the cloze by selecting the highest probability at each step.

Features

  • Uses tri-grams and nested dictionaries (dictionaries of dictionaries) for efficient processing.
  • Requires four user inputs: a cloze, a candidates list, a corpus, and a lexicon.
  • Processes data to create a matrix of probabilities.
  • Selects the highest probability at each step to complete the cloze.
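
The features above can be illustrated with a minimal sketch. It is not the repository's actual code: the function names, the blank marker, the `<s>` padding token, the use of only the two preceding words as context, and the add-k smoothing are all assumptions made for illustration.

```python
from collections import defaultdict

BLANK = "__________"  # hypothetical blank marker; the real input format may differ


def build_trigram_counts(sentences):
    """Count tri-grams in a dictionary of dictionaries: counts[(w1, w2)][w3]."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        # Pad with two start tokens so the first words also have a context.
        tokens = ["<s>", "<s>"] + sentence.lower().split()
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            counts[(w1, w2)][w3] += 1
    return counts


def trigram_probability(counts, w1, w2, w3, vocab_size, k=1.0):
    """Add-k smoothed estimate of P(w3 | w1, w2) (smoothing is an assumption)."""
    context = counts.get((w1, w2), {})
    total = sum(context.values())
    return (context.get(w3, 0) + k) / (total + k * vocab_size)


def complete_cloze(tokens, candidates, counts, vocab_size):
    """Fill blanks greedily: repeatedly take the highest cell in the matrix."""
    padded = ["<s>", "<s>"] + [t.lower() for t in tokens]
    blanks = [i for i, t in enumerate(padded) if t == BLANK]
    # matrix[blank_position][candidate] = P(candidate | two preceding words)
    matrix = {
        i: {
            c: trigram_probability(counts, padded[i - 2], padded[i - 1], c, vocab_size)
            for c in candidates
        }
        for i in blanks
    }
    answers = {}
    remaining = list(candidates)
    while matrix and remaining:
        # Pick the (blank, candidate) pair with the highest probability.
        pos, cand = max(
            ((i, c) for i in matrix for c in remaining),
            key=lambda pair: matrix[pair[0]][pair[1]],
        )
        answers[pos] = cand
        del matrix[pos]          # this blank is now filled
        remaining.remove(cand)   # each candidate is used at most once
    return [answers[i] for i in blanks if i in answers]
```

In this sketch each candidate is used at most once and the result is returned in the order of the blanks. Whether the program enforces the same constraints is not stated in this README.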

How to Use with Configuration File

  1. Clone the repository to your local machine.
  2. Create a configuration file named config.json with the following structure:
{
  "input_filename":  "data/document.cloze.txt",
  "candidates_filename": "data/candidate.words.txt",
  "lexicon_filename": "data/lexicon.txt",
  "corpus": "data/en.wikipedia2018.10M.txt"
}
  3. Update the file paths in the configuration file according to your setup.
  4. Run the program.
  5. Review the completed cloze list generated by the program.

Note: Ensure that the specified files in the configuration file exist and have the required data.
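
To check a configuration file before running the program, a small helper along these lines may be useful. It is only a sketch: the function name `load_config` and the error handling are hypothetical, and only the four key names are taken from the structure shown above.

```python
import json
from pathlib import Path

# Key names taken from the config.json structure shown above.
REQUIRED_KEYS = ("input_filename", "candidates_filename", "lexicon_filename", "corpus")


def load_config(path="config.json"):
    """Load the configuration and verify that every referenced file exists."""
    with open(path, encoding="utf-8") as fh:
        config = json.load(fh)

    missing_keys = [key for key in REQUIRED_KEYS if key not in config]
    if missing_keys:
        raise KeyError(f"config is missing keys: {missing_keys}")

    missing_files = [config[key] for key in REQUIRED_KEYS
                     if not Path(config[key]).is_file()]
    if missing_files:
        raise FileNotFoundError(f"files not found: {missing_files}")

    return config
```

Calling `load_config("config.json")` returns the parsed dictionary, or raises an error naming the missing keys or files.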

Testing

The program was evaluated with the following setup and result:

  • Corpus: 10 million lines from Wikipedia.
  • Lexicon: The 50,000 most frequent English words.
  • Success Rate: 83.33%
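
This README does not define how the success rate is computed. One plausible reading, assumed here, is the percentage of blanks filled with the expected word. The function name and the toy data below are hypothetical and are not the project's test data.

```python
def success_rate(predicted, expected):
    """Return the percentage of positions where the predicted word matches."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    correct = sum(p == e for p, e in zip(predicted, expected))
    return 100.0 * correct / len(expected)


# Toy example: three of four blanks are filled correctly.
rate = success_rate(["cat", "mat", "dog", "sat"], ["cat", "mat", "dog", "ran"])
print(f"{rate:.2f}%")  # prints "75.00%"
```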

License

This project is licensed under the MIT License - see the LICENSE file for details.
