Mastodon

A simple next-word prediction engine

Quick start

# Fetch a sample corpus
$ mkdir data
$ mkdir data/samples
$ curl http://norvig.com/big.txt -o data/samples/big.txt

# Generate stats using NSP
$ mkdir data/output
$ cd scripts
$ ./generate_stats.sh ../data/samples/big.txt ../data/output/

# Create binary dictionaries
$ cd ..
$ mkdir dictionaries
$ mkdir dictionaries/test
$ cd scripts
$ python makedict.py -u ../data/output/unigrams.txt -n ../data/output/ngrams2.ll,../data/output/ngrams3.ll,../data/output/ngrams4.ll -o ../dictionaries/test/big.dict

# Create binary dictionaries for unit tests
$ python makedict.py -t
$ python unittests.py
$ cd ../cpp
$ make test

Generating statistics

To create a binary dictionary, we need data produced by the N-Gram Statistics Package (NSP), available at http://www.d.umn.edu/~tpederse/nsp.html. The script generate_stats.sh in the scripts/ folder serves this purpose.

A sample corpus can be found at http://norvig.com/big.txt.

$ curl http://norvig.com/big.txt -o data/samples/big.txt

We can generate the desired statistics in the following way:

$ cd scripts
$ ./generate_stats.sh INPUT_FILE OUTPUT_DIR

Unigrams

The script generates a simple word frequency list unigrams.txt in OUTPUT_DIR, in which each line is of the form weight unigram. Example output:

79377 the
39997 of
38076 and
28604 to
21780 in
20910 a
...

The weight is simply the number of occurrences of the corresponding word in the corpus.
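
As a rough illustration of where these weights come from, the sketch below reproduces the same weight unigram format with a plain word count (NSP applies its own tokenization rules, so the numbers will not match exactly; the output filename unigrams_rough.txt is just a placeholder):

import re
from collections import Counter

# Count word occurrences in the sample corpus (simplistic tokenization).
with open('data/samples/big.txt') as f:
    words = re.findall(r"[a-z']+", f.read().lower())

# Write one "weight unigram" line per word, most frequent first.
counts = Counter(words)
with open('data/output/unigrams_rough.txt', 'w') as out:
    for word, weight in counts.most_common():
        out.write('%d %s\n' % (weight, word))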

N-grams

The script then generates lists of bi-, tri-, and four-grams (ngrams2.ll, ngrams3.ll, ngrams4.ll, also located in OUTPUT_DIR) of the form unigram<>unigram<>...<>rank weight (we ignore rank for now). Example output:

of<>the<>2 25053.6988
in<>the<>6 10335.9606
did<>not<>8 9798.6723
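
If you need to read these files from your own code, a minimal parser for the format shown above might look like this (the handling of header lines and trailing fields is a guess at robustness, not part of the project):

# Parse an NSP .ll file: each data line is "w1<>w2<>...<>rank weight".
def parse_ngram_line(line):
    fields = line.split()
    parts = fields[0].split('<>')
    words, rank = parts[:-1], int(parts[-1])
    weight = float(fields[1])
    return words, rank, weight

with open('data/output/ngrams2.ll') as f:
    for line in f:
        if '<>' not in line:
            continue  # skip any header or blank lines
        words, rank, weight = parse_ngram_line(line)
        # e.g. words=['of', 'the'], rank=2, weight=25053.6988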

Generating dictionaries

To generate a binary dictionary from the NSP output, the script makedict.py in the scripts/ folder is available. Example usage:

$ python makedict.py -u UNIGRAM_FILE -n BIGRAM_FILE,TRIGRAM_FILE,FOURGRAM_FILE -o OUTPUT_FILE

Using dictionaries

Implementations in Python and C++ are currently available for loading a binary dictionary and querying it for:

  • Corrections
  • Completions (Python only)
  • Next-word predictions

Python

Here is a simple usage example in Python:

bindict = BinaryDictionary.from_file('../dictionaries/test/test.dict')
bindict.get_predictions(['hello']) # => [('there',10),('sir',3)]
bindict.get_corrections('yuur')    # => ['your','you','year']
bindict.get_completions('yo', 2)   # => ['you','your']
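
As a toy example of how get_predictions might be used (assuming the (word, weight) return format shown above), one can greedily extend a phrase by repeatedly taking the top prediction:

# Greedily extend a phrase with the highest-weighted next word.
bindict = BinaryDictionary.from_file('../dictionaries/test/test.dict')

phrase = ['hello']
for _ in range(3):
    predictions = bindict.get_predictions(phrase)
    if not predictions:
        break
    phrase.append(predictions[0][0])  # take the top-weighted word
print(' '.join(phrase))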

C++

Here is a simple usage example in C++:

BinaryDictionary bindict;
bindict.fromFile("../dictionaries/test/test.dict");

// Next-word predictions for the two-word phrase "how are"
string phrase[] = {"how", "are"};
vector<weighted_string> holder;
vector<weighted_string> predictions = bindict.getPredictions(phrase, 2, holder, 4);

// Corrections for "you" (reusing the holder vector)
holder.clear();
vector<weighted_string> corrections = bindict.getCorrections("you", holder, 100);

Note that querying for word completions is not yet implemented in C++.

Unit tests

The unit tests are designed to be used with a simple dictionary, located at dictionaries/test/test.dict, and generated using the -t option:

$ python makedict.py -t

Python

The Python unit tests use the unittest module and are available in scripts/unittests.py:

$ python unittests.py
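
For orientation, a test in this style (written against the small test dictionary and the API shown earlier; the import path and asserted properties are illustrative, not taken from the project) might look roughly like:

import unittest
from bindict import BinaryDictionary  # hypothetical import path

class TestBinaryDictionary(unittest.TestCase):
    def setUp(self):
        self.bindict = BinaryDictionary.from_file('../dictionaries/test/test.dict')

    def test_predictions_are_weighted_words(self):
        # Each prediction should be a (word, weight) pair with a positive weight.
        for word, weight in self.bindict.get_predictions(['hello']):
            self.assertIsInstance(word, str)
            self.assertGreater(weight, 0)

if __name__ == '__main__':
    unittest.main()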

C++

The C++ unit tests, located at cpp/tests/unit/test.cpp, are based on the UnitTest++ framework (included). Simply use the provided Makefile in the cpp folder to run the tests:

$ make test


License

Mastodon is released under the MIT license. See LICENSE.md.