<a href="https://colab.research.google.com/github/kyunghyuncho/ammi-2019-nlp/blob/master/01-day-LM/ken_lm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# KenLM Framework for Language Modeling

## Install KenLM

### Reference: https://github.com/kpu/kenlm

In [1]:
# !pip install https://github.com/kpu/kenlm/archive/master.zip

In [2]:
import sys
sys.path.append('utils/')

In [3]:
import kenlm
import os
import re
import utils.ngram_utils as ngram_utils
import numpy as np

In [4]:
# Read data from .txt files and create lists of reviews (one review per line)

with open('../data/train.txt', 'r') as f:
    train_data = [review for review in f.read().split('\n') if review]

with open('../data/valid.txt', 'r') as f:
    valid_data = [review for review in f.read().split('\n') if review]

In [5]:
# Tokenize the Datasets
# TODO: this takes a really long time !! why?
train_data_tokenized, all_tokens_train = ngram_utils.tokenize_dataset(train_data)
valid_data_tokenized, all_tokens_valid = ngram_utils.tokenize_dataset(valid_data)


In [6]:
vocab = list(set(all_tokens_train))
len(vocab)

71102

In [7]:
train_data = []
for t in train_data_tokenized:
    train_data.append(' '.join(t))
train_data[:3]

['this is a great tutu and at a really great price .',
 "it doesn ' t look cheap at all .",
 "i ' m so glad i looked on amazon and found such an affordable tutu that isn ' t made poorly ."]

In [8]:
valid_data = []
for t in valid_data_tokenized:
    valid_data.append(' '.join(t))
valid_data[:3]

['good value .',
 'not super cheap material .',
 'at first , i was absolutely delighted with these peds . . .']

In [9]:
len(train_data), len(valid_data)

(1069724, 124185)

In [10]:
# # Change directory where you have the data
# path_to_data = '../data/'
# os.chdir(path_to_data)
# path_to_data

## 3-gram model with KenLM

In [14]:
!cat ../data/train.txt | ../../kenlm/build/bin/lmplz -o 3 > amazonLM3.arpa

=== 1/5 Counting and sorting n-grams ===
File stdin isn't normal.  Using slower read() instead of mmap().  No progress bar.
Unigram tokens 16393736 types 71696
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:860352 2:75230912512 3:141057966080
Statistics:
1 71696 D1=0.690098 D2=0.962667 D3+=1.22676
2 1239185 D1=0.712943 D2=1.05296 D3+=1.36242
3 4834597 D1=0.772513 D2=1.0869 D3+=1.33918
Memory estimate for binary LM:
type     MB
probing 113 assuming -p 1.5
probing 120 assuming -r models -p 1.5
trie     44 without quantization
trie     24 assuming -q 8 -b 8 quantization 
trie     42 assuming -a 22 array pointer compression
trie     22 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:860352 2:19826960 3:96691940
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
#####################################################################

In [15]:
!../../kenlm/build/bin/build_binary amazonLM3.arpa amazonLM3.klm

Reading amazonLM3.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS


In [17]:
model_3n = kenlm.LanguageModel('amazonLM3.klm')
model_3n

<Model from b'amazonLM3.klm'>

## 5-gram KenLM

In [18]:
!cat ../data/train.txt | ../../kenlm/build/bin/lmplz -o 5 > amazonLM5.arpa

=== 1/5 Counting and sorting n-grams ===
File stdin isn't normal.  Using slower read() instead of mmap().  No progress bar.
Unigram tokens 16393736 types 71696
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:860352 2:21101352960 3:39565037568 4:63304056832 5:92318425088
Statistics:
1 71696 D1=0.690098 D2=0.962667 D3+=1.22676
2 1239185 D1=0.712943 D2=1.05296 D3+=1.36242
3 4834597 D1=0.796199 D2=1.09701 D3+=1.35908
4 9215190 D1=0.868874 D2=1.16401 D3+=1.3733
5 12376562 D1=0.898907 D2=1.2197 D3+=1.36975
Memory estimate for binary LM:
type     MB
probing 564 assuming -p 1.5
probing 651 assuming -r models -p 1.5
trie    261 without quantization
trie    142 assuming -q 8 -b 8 quantization 
trie    232 assuming -a 22 array pointer compression
trie    112 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:860352 2:19826960 3:96691940 4:221164560 5:346543736
----5---10---15---20---25---3

In [19]:
!../../kenlm/build/bin/build_binary amazonLM5.arpa amazonLM5.klm

Reading amazonLM5.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS


In [20]:
model_5n = kenlm.LanguageModel('amazonLM5.klm')
model_5n

<Model from b'amazonLM5.klm'>

## Perplexity (Train + Valid Data)

### KenLM's `score` returns a sentence's total log probability in base 10, not a perplexity, so we convert the scores and report perplexity. The following function calculates the perplexity.

### Perplexity is defined as $$ PPL = b^{- \frac{1}{N} \sum_{i=1}^N \log_b q(x_i)} $$

### All scores here are in log base 10, so with $b = 10$ the conversion is

### $$PPL = 10^{-L / N} $$

### where $L = \sum_{i=1}^N \log_{10} q(x_i)$ is the total log probability (so $-L$ is the total NLL), and $N$ is the word count.
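
As a quick check of the conversion, here is a minimal sketch with made-up numbers. The function name `perplexity_from_log10` and the example scores are illustrative and not part of KenLM; the sketch assumes each score is the base-10 log probability of one sentence.

```python
def perplexity_from_log10(log10_scores, word_counts):
    """Convert per-sentence log10 probabilities into a corpus-level perplexity."""
    total_log10 = sum(log10_scores)   # summed log10 probability (<= 0)
    total_words = sum(word_counts)    # N in the formula
    return 10.0 ** (-total_log10 / total_words)

# Hypothetical scores: two sentences with 2 and 4 words
example_scores = [-2.0, -4.0]
example_counts = [2, 4]
example_ppl = perplexity_from_log10(example_scores, example_counts)
print(example_ppl)  # 10.0, since -(-6) / 6 = 1 and 10**1 = 10
```

An average of one order of magnitude of probability lost per word corresponds to a perplexity of 10.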

In [21]:
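def get_ppl(lm, sentences):
    """
    Corpus-level perplexity of `lm` on `sentences`.
    Assume sentences is a list of strings (space delimited sentences)
    """
    total_log10_prob = 0.0  # lm.score returns log10 probabilities (<= 0)
    total_wc = 0
    for sent in sentences:
        # Put a space between runs of word-like characters and runs of punctuation
        sent = re.sub(r"([\w/'+$\s-]+|[^\w/'+$\s-]+)\s*", r"\1 ", sent)
        words = sent.strip().split()
        # No <s> or </s> is added, so the word count matches the scored tokens
        total_log10_prob += lm.score(sent, bos=False, eos=False)
        total_wc += len(words)
    # PPL = 10^(-total log10 probability / word count)
    ppl = 10**-(total_log10_prob/total_wc)
    return ppl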
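
To see what the `re.sub` call in `get_ppl` does before the words are counted, the sketch below applies the same pattern to a few made-up strings. The helper name `split_words` is hypothetical and only wraps the two lines from the function above.

```python
import re

# Same pattern as in get_ppl: a run of word-like characters (letters, digits,
# underscore, / ' + $ -, whitespace) or a run of anything else, each followed
# by optional whitespace.
PATTERN = r"([\w/'+$\s-]+|[^\w/'+$\s-]+)\s*"

def split_words(sent):
    """Insert a space after each run, then split on whitespace."""
    spaced = re.sub(PATTERN, r"\1 ", sent)
    return spaced.strip().split()

tokens_a = split_words("good value.")
tokens_b = split_words("a , b")
tokens_c = split_words("it's great!!")
print(tokens_a)  # ['good', 'value', '.']
print(tokens_b)  # ['a', ',', 'b']
print(tokens_c)  # ["it's", 'great', '!!']
```

Apostrophes stay attached to their word and consecutive punctuation marks stay together as one token, so the word count used for perplexity follows this splitting rather than the tokenizer in `ngram_utils`.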


In [22]:
# 3-gram
train_ppl = get_ppl(model_3n, train_data)
valid_ppl = get_ppl(model_3n, valid_data)
train_ppl, valid_ppl

(39.03832072741626, 73.11449196507486)

In [23]:
# 5-gram
train_ppl = get_ppl(model_5n, train_data)
valid_ppl = get_ppl(model_5n, valid_data)
train_ppl, valid_ppl

(14.550236138321127, 68.47091425220557)

## Score Sentences

In [24]:
sentences = ['i like this product very much .']
ppl3 = get_ppl(model_3n, sentences)
ppl5 = get_ppl(model_5n, sentences)
ppl3, ppl5

(37.1090241223962, 23.193163391871256)

In [25]:
sentences = ['i like pandas']
ppl3 = get_ppl(model_3n, sentences)
ppl5 = get_ppl(model_5n, sentences)
ppl3, ppl5

(11495.161728574109, 5921.243703148126)

In [26]:
sentences = ['this color is very ugly']
ppl3 = get_ppl(model_3n, sentences)
ppl5 = get_ppl(model_5n, sentences)
ppl3, ppl5

(176.24740921086848, 201.38290355857487)

In [27]:
sentences = ['kigali is an awesome city !']
ppl3 = get_ppl(model_3n, sentences)
ppl5 = get_ppl(model_5n, sentences)
ppl3, ppl5

(1315.5265486330818, 1509.4421649339931)

In [28]:
sentences = ['i want to get a refund']
ppl3 = get_ppl(model_3n, sentences)
ppl5 = get_ppl(model_5n, sentences)
ppl3, ppl5

(38.63318152095029, 40.45785877491473)

In [29]:
sentences = ['this watch is not what i expected']
ppl3 = get_ppl(model_3n, sentences)
ppl5 = get_ppl(model_5n, sentences)
ppl3, ppl5

(22.704919858886626, 28.244409713562888)

In [30]:
sentences = ['this dress fits me perfectly !']
ppl3 = get_ppl(model_3n, sentences)
ppl5 = get_ppl(model_5n, sentences)
ppl3, ppl5

(30.16364100643393, 28.5398056711585)

In [31]:
sentences = ['my wife loves this ring']
ppl3 = get_ppl(model_3n, sentences)
ppl5 = get_ppl(model_5n, sentences)
ppl3, ppl5

(53.479550924271464, 79.24041877941873)
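
The perplexities above vary widely from sentence to sentence. One way to see which tokens drive a high value is a per-token breakdown. The sketch below assumes the model object exposes `full_scores(sentence, bos, eos)`, which in the KenLM Python module yields one `(log10 probability, n-gram length, is OOV)` tuple per token. The helper `per_token_report` and the `StubLM` class are hypothetical; the stub only mimics the shape of that output so the example runs without a trained model.

```python
def per_token_report(lm, sentence, bos=False, eos=False):
    """Pair each word with its (log10 prob, matched n-gram length, OOV flag)."""
    words = sentence.split()
    rows = []
    # zip stops at the last word, so an extra </s> entry (when eos=True) is dropped
    for word, (log10_prob, ngram_len, is_oov) in zip(
            words, lm.full_scores(sentence, bos=bos, eos=eos)):
        rows.append((word, log10_prob, ngram_len, is_oov))
    return rows

class StubLM:
    """Hypothetical stand-in for a KenLM model; the scores are made up."""
    known = {"i": -1.0, "like": -1.5}

    def full_scores(self, sentence, bos=True, eos=True):
        for word in sentence.split():
            if word in self.known:
                yield (self.known[word], 1, False)
            else:
                yield (-6.0, 1, True)   # unknown word: low score, OOV flag set
        if eos:
            yield (-1.0, 1, False)      # entry for </s>

report = per_token_report(StubLM(), "i like pandas")
oov_words = [word for word, _, _, is_oov in report if is_oov]
print(report)
print(oov_words)  # ['pandas'] for the stub above
```

With the models trained in this notebook, the call would be `per_token_report(model_3n, 'i like pandas')`; the actual values depend on the trained model.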

## Generate Sentences

In [67]:
def generate(lm, context='<s>', max_num_tokens=20):
    """
    Greedy generation: at each step, append the vocabulary token that gives
    the highest-scoring sentence. Uses the global `vocab` list.
    """
    generated_tokens = []
    cur_sent = context
    for _ in range(max_num_tokens):
        scores = []
        for token in vocab:
            sent = cur_sent + ' ' + token
            # Only add the end-of-sentence transition when the candidate is '</s>'
            eos = (token == '</s>')
            scores.append(lm.score(sent, bos=True, eos=eos))
        best_token = vocab[np.argmax(scores)]
        generated_tokens.append(best_token)
        cur_sent = cur_sent + ' ' + best_token
        # Stop early if the model chooses to end the sentence
        if best_token == '</s>':
            break
    return generated_tokens
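
The greedy search in `generate` can be exercised without a trained model. The sketch below is a variant, not the notebook's exact function: it takes the vocabulary as an argument and treats end-of-sentence as an explicit candidate scored with `eos=True`, instead of relying on `'</s>'` being in `vocab`. `ToyBigramLM` is a hypothetical stand-in with a KenLM-like `score(sentence, bos, eos)` method and made-up bigram scores.

```python
class ToyBigramLM:
    """Hypothetical bigram scorer; unseen bigrams get a fixed floor score."""
    def __init__(self, bigram_log10, floor=-5.0):
        self.bigram_log10 = bigram_log10
        self.floor = floor

    def score(self, sentence, bos=True, eos=True):
        words = sentence.split()
        if bos:
            words = ['<s>'] + words
        if eos:
            words = words + ['</s>']
        return sum(self.bigram_log10.get(pair, self.floor)
                   for pair in zip(words, words[1:]))

def greedy_generate(lm, vocab, context='', max_num_tokens=20):
    """Repeatedly append the candidate that maximizes the sentence score."""
    generated = []
    cur_sent = context
    for _ in range(max_num_tokens):
        best_token, best_score = None, float('-inf')
        for token in list(vocab) + ['</s>']:
            if token == '</s>':
                # Ending here: score the current sentence with </s> appended
                score = lm.score(cur_sent, bos=True, eos=True)
            else:
                score = lm.score((cur_sent + ' ' + token).strip(),
                                 bos=True, eos=False)
            if score > best_score:
                best_token, best_score = token, score
        if best_token == '</s>':
            break
        generated.append(best_token)
        cur_sent = (cur_sent + ' ' + best_token).strip()
    return generated

toy_lm = ToyBigramLM({('<s>', 'i'): -0.1, ('i', 'love'): -0.2,
                      ('love', 'it'): -0.3, ('it', '</s>'): -0.1})
toy_output = greedy_generate(toy_lm, ['i', 'love', 'it'])
print(toy_output)  # ['i', 'love', 'it']
```

Like `generate`, this rescores the whole sentence for every candidate, so each step costs one `score` call per vocabulary entry. Because the choice at every step is the single best token, the output is deterministic, which is consistent with the repeated phrases in the generated samples.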

In [68]:
s3 = generate(model_3n)
s5 = generate(model_5n)
print(' '.join(word for word in s3))
print(' '.join(word for word in s5))

i bought this for my husband and he loves them . i ' m not sure if i had to
i bought this for my husband and he loves them . they are very comfortable and i love the color


In [63]:
context = '<s> i will'
s3 = generate(model_3n, context=context)
s5 = generate(model_5n, context=context)
print(' '.join(word for word in s3))
print(' '.join(word for word in s5))

be buying more . i ' m not sure if i had to return them . i ' m not
be returning them . i will be buying more of these . i ' m a size 8 and they


In [58]:
context = '<s> i like'
s3 = generate(model_3n, context=context)
s5 = generate(model_5n, context=context)
print(' '.join(word for word in s3))
print(' '.join(word for word in s5))

the picture . i ' m not sure if i had to return them . i ' m not sure
the look of the shoe . i have a pair of these for my husband and he loves them .


In [59]:
context = '<s> i am'
s3 = generate(model_3n, context=context)
s5 = generate(model_5n, context=context)
print(' '.join(word for word in s3))
print(' '.join(word for word in s5))

a size 8 . 5 and they are very comfortable . i ' m not sure if i had to
a size 8 and they fit perfectly . i have a pair of these for my husband and he loves


In [60]:
context = '<s> this'
s3 = generate(model_3n, context=context)
s5 = generate(model_5n, context=context)
print(' '.join(word for word in s3))
print(' '.join(word for word in s5))

is a little bit of a good price . i ' m not sure if i had to return them
is a great product . i have a pair of these for my husband and he loves them . they


In [61]:
context = '<s> this dress'
s3 = generate(model_3n, context=context)
s5 = generate(model_5n, context=context)
print(' '.join(word for word in s3))
print(' '.join(word for word in s5))

is very comfortable . i ' m not sure if i had to return them . i ' m not
is very cute . i love the color and the fit is perfect . i have a pair of these


In [62]:
context = '<s> this animal'
s3 = generate(model_3n, context=context)
s5 = generate(model_5n, context=context)
print(' '.join(word for word in s3))
print(' '.join(word for word in s5))

print . i ' m not sure if i had to return them . i ' m not sure if
print ! really cute and the color is a little darker than the picture . i ' m a size


In [64]:
context = '<s> what'
s3 = generate(model_3n, context=context)
s5 = generate(model_5n, context=context)
print(' '.join(word for word in s3))
print(' '.join(word for word in s5))

can i say ? i ' m not sure if i had to return them . i ' m not
can i say ? i love this watch . it is a little big , but i ' m not
