<a href="https://colab.research.google.com/github/kyunghyuncho/ammi-2019-nlp/blob/master/01-day-LM/ken_lm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# KenLM Framework for Language Modeling


**Install KenLM**

Download the stable release and unpack it: http://kheafield.com/code/kenlm.tar.gz

KenLM requires Boost >= 1.42.0 and bjam:
*   Ubuntu: sudo apt-get install libboost-all-dev
*   Mac: brew install boost; brew install bjam

Build from within the kenlm directory:

*   mkdir -p build
*   cd build
*   cmake ..
*   make -j 4

Then install the Python module:

*   pip install https://github.com/kpu/kenlm/archive/master.zip

For more information on KenLM see: https://github.com/kpu/kenlm and http://kheafield.com/code/kenlm/


In [1]:
import sys
sys.path.append('utils/')

In [2]:
import kenlm
import os
import re
import utils.ngram_utils as ngram_utils


In [3]:
path = '/home/roberta/ammi-2019-nlp/data/'
os.chdir(path)


## 3-gram model with KenLM

In [4]:
!cat train.txt | /home/roberta/kenlm/bin/lmplz -o 3 > amazonLM3.arpa

=== 1/5 Counting and sorting n-grams ===
File stdin isn't normal.  Using slower read() instead of mmap().  No progress bar.
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:279660 2:75231117312 3:141058342912
Statistics:
1 23305 D1=0.646943 D2=0.974739 D3+=1.27322
2 289522 D1=0.731498 D2=1.05898 D3+=1.40125
3 817369 D1=0.806321 D2=1.13568 D3+=1.32916
Memory estimate for binary LM:
type       kB
probing 21745 assuming -p 1.5
probing 23532 assuming -r models -p 1.5
trie     8599 without quantization
trie     4646 assuming -q 8 -b 8 quantization 
trie     8189 assuming -a 22 array pointer compression
trie     4237 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:279660 2:4632352 3:16347380
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
#############################################################################################
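
The same training step can also be driven from Python instead of a shell pipeline. This is a minimal sketch, assuming the caller passes in the path to the lmplz binary; the helper names `lmplz_command` and `train_arpa` are illustrative and not part of KenLM. It passes the corpus file to lmplz as stdin directly rather than through `cat`.

```python
import subprocess


def lmplz_command(lmplz_path, order):
    """Build the argument list for an lmplz run of the given n-gram order."""
    return [lmplz_path, "-o", str(order)]


def train_arpa(lmplz_path, order, text_path, arpa_path):
    """Run lmplz with the corpus on stdin and write the ARPA model from stdout."""
    with open(text_path, "rb") as src, open(arpa_path, "wb") as dst:
        # check=True raises CalledProcessError if lmplz exits with an error
        subprocess.run(lmplz_command(lmplz_path, order),
                       stdin=src, stdout=dst, check=True)


# Usage with the paths from this notebook (not executed here):
# train_arpa('/home/roberta/kenlm/bin/lmplz', 3, 'train.txt', 'amazonLM3.arpa')
```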

In [5]:
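# Optionally convert the ARPA file to KenLM's binary format with build_binary (lmplz only estimates models)
# path = '/home/roberta'
# os.chdir(path)
# !kenlm/bin/build_binary amazonLM.arpa amazonLM.klm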

In [6]:
import kenlm
model_3n = kenlm.LanguageModel('amazonLM3.arpa')


In [7]:
# Read data from .txt files and create lists of reviews

train_data = []
# create a list of all the reviews 
with open('../data/train.txt', 'r') as f:
    train_data = [review for review in f.read().split('\n') if review]
    
valid_data = []
# create a list of all the reviews 
with open('../data/valid.txt', 'r') as f:
    valid_data = [review for review in f.read().split('\n') if review]
    

In [8]:
# Tokenize the Datasets
# TODO: this takes a really long time !! why?
train_data_tokenized, all_tokens_train = ngram_utils.tokenize_dataset(train_data)
valid_data_tokenized, all_tokens_valid = ngram_utils.tokenize_dataset(valid_data)


In [19]:
train_data = []
for t in train_data_tokenized:
    train_data.append(' '.join(t))
train_data[:3]

['this is a great tutu and at a really great price .',
 "it doesn ' t look cheap at all .",
 "i ' m so glad i looked on amazon and found such an affordable tutu that isn ' t made poorly ."]

In [20]:
valid_data = []
for t in valid_data_tokenized:
    valid_data.append(' '.join(t))
valid_data[:3]

['these are not sized right .',
 'a 3x is always big on me and these r cut wrong !',
 "i ' m returning them ."]

#### KenLM's score is the log10 probability of a sentence (a negative number), not perplexity, so we convert the scores and report perplexity. The functions below compute perplexity (get_ppl) and collect all out-of-vocabulary (OOV) words (get_oov).

#### Perplexity is defined as $$ PPL = b^{- \frac{1}{N} \sum_{i=1}^N \log_b q(x_i)} $$ KenLM works in log base 10, so $b = 10$ and $$ PPL = 10^{-\frac{1}{N} \log_{10} P} $$ where $\log_{10} P = \sum_{i=1}^N \log_{10} q(x_i)$ is the total log10 probability of the data and $N$ is the word count.
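
As a worked instance of the formula, the sketch below computes perplexity from a list of per-token log10 probabilities. The function name and the toy values are assumptions made for illustration; no KenLM model is involved.

```python
def perplexity_from_log10(log10_probs):
    """Perplexity of a token sequence given per-token log10 probabilities."""
    n = len(log10_probs)
    if n == 0:
        raise ValueError("need at least one token")
    # Average log10 probability per token, then undo the log with base 10
    avg_log10 = sum(log10_probs) / n
    return 10 ** (-avg_log10)


# Hypothetical tokens with probabilities 0.1, 0.01 and 0.1
example = [-1.0, -2.0, -1.0]
ppl = perplexity_from_log10(example)  # equals 10 ** (4 / 3)

# If every token has probability 0.01, the perplexity is 100
uniform_ppl = perplexity_from_log10([-2.0, -2.0, -2.0])
```

A model that spreads its probability evenly over 100 choices at every step has perplexity 100, which is why perplexity is often read as an effective branching factor.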

In [25]:
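def get_ppl(lm, sentences):
    """
    Corpus-level perplexity of sentences under the KenLM model lm.
    Assume sentences is a list of strings (space delimited sentences)
    """
    total_log10_prob = 0.0  # KenLM scores are log10 probabilities (<= 0)
    total_wc = 0
    for sent in sentences:
        # Put a space between runs of word characters and runs of punctuation
        sent = re.sub(r"([\w/'+$\s-]+|[^\w/'+$\s-]+)\s*", r"\1 ", sent)
        words = sent.strip().split()
        # No <s> or </s> is added, so exactly len(words) tokens are scored
        total_log10_prob += lm.score(sent, bos=False, eos=False)
        total_wc += len(words)
    # PPL = 10 ** (-(total log10 probability) / (word count))
    return 10 ** (-total_log10_prob / total_wc)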
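
The function above scores sentences without the `<s>` and `</s>` boundary tokens. A variant that includes them is sketched below, under the assumption that the end-of-sentence token counts as one extra predicted token per sentence. The helper name `ppl_with_boundaries` is hypothetical; `lm` is expected to behave like a `kenlm` model, whose `score` takes `bos` and `eos` keyword arguments.

```python
def ppl_with_boundaries(lm, sentences):
    """Corpus perplexity where each sentence is scored with <s> and </s>."""
    total_log10_prob = 0.0
    total_tokens = 0
    for sent in sentences:
        total_log10_prob += lm.score(sent, bos=True, eos=True)
        # </s> is predicted by the model, so it counts as one more token
        total_tokens += len(sent.split()) + 1
    return 10 ** (-total_log10_prob / total_tokens)


# Usage with the model from this notebook (not executed here):
# ppl_with_boundaries(model_3n, valid_data)
```

Numbers from the two conventions are not directly comparable, so it helps to state which one a reported perplexity uses.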


In [27]:
train_ppl = get_ppl(model_3n, train_data)
train_ppl

29.274380639797194

In [29]:
valid_ppl = get_ppl(model_3n, valid_data)
valid_ppl

104.2331865732742

### Score Sentences

In [43]:
sentences = ['i like pandas']
ppl = get_ppl(model_3n, sentences)
ppl

208.9574544349147

Function for loading the data

In [44]:
sentences = ['i like this tutu']
ppl = get_ppl(model_3n, sentences)
ppl

198.57890045778882

In [46]:
# Each one-word string is scored as its own sentence, without any context
sentences = ['this', 'is', 'a', 'great', 'tutu', 'and', 'at', 'a', 'really', 'great', 'price', '.']
ppl = get_ppl(model_3n, sentences)
ppl

446.33827823247873

In [47]:
sentences = ['.']
ppl = get_ppl(model_3n, sentences)
ppl

37.44112415772661

In [48]:
sentences = ['who wants dinner?']
ppl = get_ppl(model_3n, sentences)
ppl

2118.7494704899073

In [50]:
sentences = ['i want to get a refund']
ppl = get_ppl(model_3n, sentences)
ppl

44.450639257705284

In [51]:
sentences = ['this watch is not what i expected']
ppl = get_ppl(model_3n, sentences)
ppl

20.246407945986885

In [57]:
sentences = ['this fits me perfectly .']
ppl = get_ppl(model_3n, sentences)
ppl

25.158339446498

In [58]:
sentences = ['this coat fits me perfectly ?']
ppl = get_ppl(model_3n, sentences)
ppl

289.3983397318885
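
To see which words drive a high perplexity, such as the 'who wants dinner?' example above, the per-word scores can be inspected. This is a minimal sketch, assuming `lm` behaves like a `kenlm` model whose `full_scores` yields one `(log10_prob, ngram_length, is_oov)` tuple per word; the helper name `per_word_scores` is illustrative.

```python
def per_word_scores(lm, sentence):
    """Pair each token with its log10 probability, matched n-gram length and OOV flag."""
    words = sentence.split()
    # With bos=False and eos=False there is exactly one tuple per word
    scores = lm.full_scores(sentence, bos=False, eos=False)
    return [(w, lp, n, oov) for w, (lp, n, oov) in zip(words, scores)]


# Usage with the model from this notebook (not executed here):
# for row in per_word_scores(model_3n, 'who wants dinner ?'):
#     print(row)
```

A short matched n-gram length or a set OOV flag marks the positions where the model had to back off.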

## 5-gram model with KenLM

In [55]:
!cat train.txt | /home/roberta/kenlm/bin/lmplz -o 5 > amazonLM5.arpa

=== 1/5 Counting and sorting n-grams ===
File stdin isn't normal.  Using slower read() instead of mmap().  No progress bar.
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:279660 2:21101410304 3:39565144064 4:63304228864 5:92318670848
Statistics:
1 23305 D1=0.646943 D2=0.974739 D3+=1.27322
2 289522 D1=0.731498 D2=1.05898 D3+=1.40125
3 817369 D1=0.829874 D2=1.1484 D3+=1.34578
4 1241735 D1=0.904845 D2=1.216 D3+=1.45795
5 1455284 D1=0.933032 D2=1.34075 D3+=1.46117
Memory estimate for binary LM:
type    MB
probing 79 assuming -p 1.5
probing 92 assuming -r models -p 1.5
trie    36 without quantization
trie    19 assuming -q 8 -b 8 quantization 
trie    32 assuming -a 22 array pointer compression
trie    15 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:279660 2:4632352 3:16347380 4:29801640 5:40747952
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---8

In [59]:
model_5n = kenlm.LanguageModel('amazonLM5.arpa')


In [60]:
train_ppl = get_ppl(model_5n, train_data)
train_ppl

6.429997779629135

In [61]:
valid_ppl = get_ppl(model_5n, valid_data)
valid_ppl

101.14058521657815

In [62]:
sentences = ['i like pandas']
ppl = get_ppl(model_5n, sentences)
ppl

134.44366930774507

In [63]:
sentences = ['i like this tutu']
ppl = get_ppl(model_5n, sentences)
ppl

220.46198893739796

In [64]:
# Each one-word string is scored as its own sentence, without any context
sentences = ['this', 'is', 'a', 'great', 'tutu', 'and', 'at', 'a', 'really', 'great', 'price', '.']
ppl = get_ppl(model_5n, sentences)
ppl

446.33827823247873

In [65]:
sentences = ['who wants dinner?']
ppl = get_ppl(model_5n, sentences)
ppl

1897.4791957979626

In [66]:
sentences = ['i want to get a refund']
ppl = get_ppl(model_5n, sentences)
ppl

46.47910811364677

In [67]:
sentences = ['this watch is not what i expected']
ppl = get_ppl(model_5n, sentences)
ppl

23.111172724610682

In [68]:
sentences = ['this fits me perfectly .']
ppl = get_ppl(model_5n, sentences)
ppl

28.47332659107708

In [69]:
sentences = ['this coat fits me perfectly ?']
ppl = get_ppl(model_5n, sentences)
ppl

284.9399861757416

In [30]:
def load_data(path):
    """Read a text file into a list of lines (trailing newlines are kept)."""
    with open(path) as f:
        return list(f)

In [31]:
def get_oov(model, data):
    """Return (set of OOV words, set of all words) for whitespace-tokenized sentences."""
    oov = set()
    vocab = set()
    for sent in data:
        words = sent.split()
        vocab.update(words)
        # Find out-of-vocabulary words
        oov.update(w for w in words if w not in model)
    return oov, vocab

In [32]:
path_to_train = '/home/roberta/ammi-2019-nlp/data/train.txt'
train_data = load_data(path_to_train)
train_data[:3]

["this is a great tutu and at a really great price . it doesn ' t look cheap at all . i ' m so glad i looked on amazon and found such an affordable tutu that isn ' t made poorly . a + + \n",
 'i bought this for my 4 yr old daughter for dance class , she wore it today for the first time and the teacher thought it was adorable . i bought this to go with a light blue long sleeve leotard and was happy the colors matched up great . price was very good too since some of these go for over $ 15 . 00 dollars . \n',
 'what can i say . . . my daughters have it in orange , black , white and pink and i am thinking to buy for they the fuccia one . it is a very good way for exalt a dancer outfit : great colors , comfortable , looks great , easy to wear , durables and little girls love it . i think it is a great buy for costumer and play too . \n']

In [33]:
# get_oov returns two sets: the OOV words and the full vocabulary of the data
oov, vocab = get_oov(model_3n, train_data)
# sorted(oov)[:10]
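
get_oov collects the distinct OOV words. A token-level OOV rate is often more informative, and a minimal sketch follows. The helper name `oov_rate` and the toy vocabulary are assumptions for illustration; `vocab` can be any object that supports `in`, so a plain set stands in for a language model here.

```python
def oov_rate(vocab, sentences):
    """Fraction of whitespace-separated tokens in sentences that are not in vocab."""
    total = 0
    missing = 0
    for sent in sentences:
        for w in sent.split():
            total += 1
            if w not in vocab:
                missing += 1
    # Guard against empty input
    return missing / total if total else 0.0


toy_vocab = {"this", "is", "a", "great", "tutu"}
rate = oov_rate(toy_vocab, ["this is a great tutu", "a great panda"])  # 1 of 8 tokens
```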