# Language Modeling using KenLM

First off, you need to install KenLM if you haven't already. 
- Download stable release and unzip: http://kheafield.com/code/kenlm.tar.gz
- Need Boost >= 1.42.0 and bjam
    - Ubuntu: `sudo apt-get install libboost-all-dev`
    - Mac: `brew install boost; brew install bjam`
- Run within kenlm directory:
    ```bash
    mkdir -p build
    cd build
    cmake ..
    make -j 4
    ```
- `pip install https://github.com/kpu/kenlm/archive/master.zip`
- For more information on KenLM see: https://github.com/kpu/kenlm and http://kheafield.com/code/kenlm/

You've already been provided with the binaries for 2 trained language models, `nli_5gram.binary` and `sentiment.binary`.

Now, load the models!

In [1]:
import kenlm
import os
import re  # needed by get_ppl for tokenization

In [2]:
nli_model = kenlm.Model("nli_5gram.binary")  # or the correct path to the binary
sent_model = kenlm.Model("sentiment.binary")

Functions to calculate perplexity and find out-of-vocabulary (OOV) words,

In [3]:
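def get_ppl(lm, sentences):
    """
    Corpus-level perplexity of `sentences` under `lm`.
    Assumes sentences is a list of strings (space-delimited sentences).
    """
    total_log_prob = 0.0  # lm.score returns log10 probabilities
    total_wc = 0
    for sent in sentences:
        # Separate punctuation from words so each becomes its own token
        sent = re.sub(r"([\w/'+$\s-]+|[^\w/'+$\s-]+)\s*", r"\1 ", sent).strip()
        words = sent.split()
        # No <s> or </s> is added, so the token count is just len(words)
        total_log_prob += lm.score(sent, bos=False, eos=False)
        total_wc += len(words)
    # Perplexity = 10 ** (-average log10 probability per word)
    ppl = 10 ** -(total_log_prob / total_wc)
    return ppl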

def get_oov(model, data):
    """Return (set of OOV words, set of all words) for a list of sentences."""
    oov = set()
    vocab = set()
    for sent in data:
        words = sent.split()
        vocab.update(words)
        # Find out-of-vocabulary words
        for w in words:
            if w not in model:
                oov.add(w)
    return oov, vocab
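
For reference, the quantity that `get_ppl` returns can be written out as below. The notation is introduced here and is not KenLM's: $s_1, \dots, s_M$ are the input sentences, $|s_i|$ is the number of whitespace-separated tokens in $s_i$ after the regex step, and $P(s_i)$ is the probability the model assigns to $s_i$ with no `<s>` or `</s>` added.

$$
\mathrm{PPL} = 10^{-\frac{1}{N}\sum_{i=1}^{M}\log_{10} P(s_i)}, \qquad N = \sum_{i=1}^{M} |s_i|
$$

The base is 10 because `lm.score` returns log10 probabilities.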

Function to load data,

In [4]:
def load_data(path):
    """Read one sentence per line, dropping the trailing newline."""
    with open(path) as f:
        return [line.rstrip("\n") for line in f]

Come up with whatever sentences you like and see how your models perform on them. Note that `score` returns the log10 probability of the sentence, not its perplexity:

In [8]:
sentence = "I am a chipmanzee ."
print(nli_model.score(sentence))
print(sent_model.score(sentence))

-12.8444375992
-20.9499568939
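
To turn one of these scores into a per-token perplexity, divide by the number of tokens and exponentiate. The sketch below is a minimal illustration, not part of KenLM: `sentence_perplexity` and its `count_eos` parameter are hypothetical names. It assumes the score came from `score` with its default arguments, which add `<s>` and `</s>`, so `</s>` is counted as one extra predicted token.

```python
def sentence_perplexity(log10_prob, sentence, count_eos=True):
    """Convert a log10 sentence probability into per-token perplexity.

    log10_prob: value returned by model.score(sentence)
    count_eos:  hypothetical flag; count </s> as a token when the score
                was computed with eos=True (assumed here to be the default)
    """
    n_tokens = len(sentence.split()) + (1 if count_eos else 0)
    if n_tokens == 0:
        raise ValueError("cannot compute perplexity of an empty sentence")
    return 10.0 ** (-log10_prob / n_tokens)


# Score copied from the nli_model output above
print(sentence_perplexity(-12.8444375992, "I am a chipmanzee ."))
```

Lower is better: the model that assigns the higher log probability to a sentence also gives it the lower perplexity. The Python wrapper also appears to expose a `perplexity` method on `Model` that performs the same conversion, which is worth checking in your installed version.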


Some datasets you could try out:

Billion word dataset: http://www.statmt.org/lm-benchmark/

Quaker historical corpus: https://www.woodbrooke.org.uk/resource-library/quaker-historical-corpus/

All of Shakespeare: http://norvig.com/ngrams/

IMDB: http://ai.stanford.edu/~amaas/data/sentiment/

The SNLI test set and MultiNLI dev set (hypothesis sentences only) are in the data folder.