<a href="https://colab.research.google.com/github/kyunghyuncho/ammi-2019-nlp/blob/master/01-day-LM/ngram_lm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Language Modeling

### Goal: compute a probability distribution over all possible sentences:


### $$p(W) = p(w_1, w_2, ..., w_T)$$

### This unsupervised learning problem can be framed as a sequence of supervised learning problems:

### $$p(W) = p(w_1) * p(w_2|w_1) * ... * p(w_T|w_1, ..., w_{T-1})$$

### If we have K sentences, where the j-th sentence has $T_j$ words for all j from 1 to K, then we want to maximize:

### $$\log p(W) = \sum_{j = 1}^K \sum_{i=1}^{T_j} \log p(w_i | w_{<i})$$
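As a small illustration of the chain-rule factorization, the sketch below scores one toy sentence by summing log conditional probabilities. The table `cond_prob` and all of its values are made up for illustration; they are not estimated from the notebook's data.

```python
import math

# Hypothetical conditional probabilities p(w_i | w_<i) for one toy sentence.
# Keys are (history, word); the numbers are invented for illustration.
cond_prob = {
    ((), "i"): 0.2,
    (("i",), "love"): 0.1,
    (("i", "love"), "this"): 0.3,
}

def sentence_log_prob(words, cond_prob):
    """Sum of log p(w_i | w_<i) over the sentence (chain rule in log space)."""
    total = 0.0
    for i, word in enumerate(words):
        history = tuple(words[:i])
        total += math.log(cond_prob[(history, word)])
    return total

log_p = sentence_log_prob(["i", "love", "this"], cond_prob)
# exp(log_p) recovers the product 0.2 * 0.1 * 0.3 of the conditionals
print(log_p, math.exp(log_p))
```

Working in log space turns the product into a sum, which avoids numerical underflow on long sentences.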




# N-gram language model

### Goal: estimate the n-gram probabilities using counts of sequences of n consecutive words

### Given a sequence of words $w$, we want to compute

###  $$P(w_i|w_{i−1}, w_{i−2}, …, w_{i−n+1})$$

### Where $w_i$ is the i-th word of the sequence.

### $$P(w_i|w_{i−n+1}, ..., w_{i−2}, w_{i−1}) = \frac{p(w_{i−n+1}, ..., w_{i−2}, w_{i−1}, w_i)}{\sum_{w \in V} p(w_{i−n+1}, ..., w_{i−2}, w_{i−1}, w)}$$

### Key Idea: We can estimate the probabilities using counts of n-grams in our dataset 


## N-gram Probabilities

## $$P(w_i|w_{i−n+1}, ..., w_{i−2}, w_{i−1}) \approx \frac{c(w_{i−n+1}, ..., w_{i−2}, w_{i−1}, w_i)}{\sum_{w \in V} c(w_{i−n+1}, ..., w_{i−2}, w_{i−1}, w)}$$


## Bigram Probabilities

## $$p(w_i | w_{i-1}) = \frac{c(w_{i-1}, w_i)}{\sum_{w \in V} c(w_{i-1}, w)} $$
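The count-based estimate can be sketched in a few lines with `collections.Counter`. This is a minimal illustration on a made-up three-sentence corpus, not the notebook's `NgramLM` implementation; the helper names `ngram_counts` and `mle_prob` are hypothetical.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count every n-gram (as a tuple) in a list of tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def mle_prob(ngram, counts):
    """c(context, w) / sum_w' c(context, w'); 0.0 if the context is unseen."""
    context = ngram[:-1]
    context_total = sum(c for g, c in counts.items() if g[:-1] == context)
    return counts[ngram] / context_total if context_total else 0.0

toy = [["i", "like", "this"], ["i", "like", "that"], ["i", "love", "this"]]
bigrams = ngram_counts(toy, n=2)
print(mle_prob(("i", "like"), bigrams))     # 2 of the 3 bigrams starting with "i"
print(mle_prob(("like", "this"), bigrams))  # 1 of the 2 bigrams starting with "like"
```

Scanning all counts for every query is slow; a real implementation would cache the context totals.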


In [76]:
import os
import sys
sys.path.append('utils/')
from utils import ngram_utils as ngram_utils
import utils.global_variables as gl
import torch
import random
from utils.ngram_utils import NgramLM

In [77]:
torch.manual_seed(1)


<torch._C.Generator at 0x7fd7d0b902b0>

### Load Data from .txt Files

In [80]:
# Read data from .txt files and create lists of reviews

train_data = []
# create a list of all the reviews 
with open('../data/small_train.txt', 'r') as f:
    train_data = [review for review in f.read().split('\n') if review]
    
valid_data = []
# create a list of all the reviews 
with open('../data/small_valid.txt', 'r') as f:
    valid_data = [review for review in f.read().split('\n') if review]
    

In [81]:
# type(train_data), len(train_data), \
# type(train_data[0]), len(train_data[0]), \
# type(train_data[0][0]), len(train_data[0][0])

In [82]:
train_data[0], train_data[0][0]


("this is a great tutu and at a really great price . it doesn ' t look cheap at all . i ' m so glad i looked on amazon and found such an affordable tutu that isn ' t made poorly . a + + ",
 't')

### Process the Data

In [83]:
# Tokenize the Datasets
# TODO: this takes a really long time !! why?
train_data_tokenized, all_tokens_train = ngram_utils.tokenize_dataset(train_data)
valid_data_tokenized, all_tokens_valid = ngram_utils.tokenize_dataset(valid_data)






Let's look at the tokenized data!

In [89]:
# # Number of All Tokens
# len(all_tokens_train), all_tokens_train[0], \
len(train_data_tokenized), train_data_tokenized[0]

(10832,
 ['this',
  'is',
  'a',
  'great',
  'tutu',
  'and',
  'at',
  'a',
  'really',
  'great',
  'price',
  '.'])

In [90]:
train_ngram_lm = NgramLM(train_data_tokenized, all_tokens_train, n=3, smoothing=None)
valid_ngram_lm = NgramLM(valid_data_tokenized, all_tokens_valid, n=3, smoothing=None)

In [91]:
train_ngram_lm.n, train_ngram_lm.frac_vocab, train_ngram_lm.num_all_tokens

(3, 0.9, 165996)

In [94]:
valid_ngram_lm.vocabulary[:3], valid_ngram_lm.raw_data[:3]

(['thbis', 'mistake', 'ipods'],
 ['got',
  'this',
  'as',
  'a',
  'gift',
  'for',
  'my',
  'husband',
  'and',
  'we',
  'both',
  'love',
  'them',
  '.'])

In [95]:
valid_ngram_lm.vocab_ngram[:3], valid_ngram_lm.count_ngram[:3]

((('.', '<eos>', '<eos>'), ('<sos>', '<sos>', 'i'), ('<sos>', '<sos>', 'the')),
 (1370, 357, 148))

In [96]:
valid_ngram_lm.vocab_unigram[:3], valid_ngram_lm.count_unigram[:3]

((('.',), ('the',), ('i',)), (1470, 1026, 815))

In [97]:
valid_ngram_lm.vocab_bigram[:3], valid_ngram_lm.count_bigram[:3]

((('.', '<eos>'), ('<sos>', 'i'), ('<sos>', 'the')), (1370, 357, 148))

In [98]:
valid_ngram_lm.vocab_prev_ngram[:3], valid_ngram_lm.count_prev_ngram[:3]

((('.', '<eos>'), ('<sos>', 'i'), ('<sos>', 'the')), (1370, 357, 148))

In [106]:
valid_ngram_lm.id2token[:10]

['<pad>',
 '<unk>',
 '<sos>',
 '<eos>',
 ('.', '<eos>', '<eos>'),
 ('<sos>', '<sos>', 'i'),
 ('<sos>', '<sos>', 'the'),
 ('!', '<eos>', '<eos>'),
 ('<sos>', '<sos>', 'it'),
 ('<sos>', '<sos>', 'this')]

In [107]:
valid_ngram_lm.token2id['<pad>']

0

In [108]:
valid_ngram_lm.token2id[('.', '<eos>', '<eos>')]

4

#### Build the Vocabulary 


In [109]:
# Build a vocabulary using all the tokens found in train data (90% of most common ones)
vocabulary = train_ngram_lm.vocabulary
print('Word vocabulary size: {} words'.format(len(vocabulary)))        

Word vocabulary size: 7710 words
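The vocabulary is built inside `NgramLM`, whose code is not shown here. The sketch below is one plausible word-level version, assuming the four special tokens seen in `id2token` come first and that only the most frequent fraction of token types is kept. `build_vocab` and its `frac` parameter are hypothetical names.

```python
from collections import Counter

SPECIAL_TOKENS = ["<pad>", "<unk>", "<sos>", "<eos>"]

def build_vocab(all_tokens, frac=0.9):
    """Keep the `frac` most frequent token types, placed after the special tokens."""
    counts = Counter(all_tokens)
    num_keep = int(frac * len(counts))
    kept = [tok for tok, _ in counts.most_common(num_keep)]
    id2token = SPECIAL_TOKENS + kept
    token2id = {tok: idx for idx, tok in enumerate(id2token)}
    return id2token, token2id

tokens = "a a a b b c c d e f".split()
id2token, token2id = build_vocab(tokens, frac=0.5)
print(id2token)
# Tokens outside the vocabulary map to the <unk> id.
print(token2id.get("zzz", token2id["<unk>"]))
```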


### CORPUS ANALYSIS (Train + Valid Data)

#### Number of Tokens in the Corpus Data


In [110]:
print("Number of All Tokens ", train_ngram_lm.num_all_tokens)

Number of All Tokens  165996


In [111]:
print("Number of All UNIQUE Tokens ", len(vocabulary))

Number of All UNIQUE Tokens  7710


#### Number of Sentences in the Train Data


In [112]:
print("Number of Sentences ", len(train_ngram_lm.raw_data))

Number of Sentences  10832


## N-grams

In [113]:
n = 3 # trigrams

### Function for padding the sentences with special markers for sentence beginning and end, i.e. `<sos>` and `<eos>`
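A minimal sketch of the padding step, assuming n - 1 markers on each side, which matches the two `<sos>` and two `<eos>` tokens in the `train_padded[0]` output for n = 3. The function name `pad_sentence` is hypothetical; the notebook's own padding code lives in `NgramLM`.

```python
def pad_sentence(tokens, n, sos="<sos>", eos="<eos>"):
    """Add n-1 start markers and n-1 end markers around a tokenized sentence."""
    return [sos] * (n - 1) + list(tokens) + [eos] * (n - 1)

padded = pad_sentence(["this", "is", "a", "great", "tutu"], n=3)
print(padded)
```

With this padding, the first real word has a full (n-1)-word context made of start markers.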

In [114]:
train_padded = train_ngram_lm.padded_data
train_ngram = train_ngram_lm.ngram_data
vocab_ngram = train_ngram_lm.vocab_ngram
count_ngram = train_ngram_lm.count_ngram 

In [115]:
train_padded[0]

['<sos>',
 '<sos>',
 'this',
 'is',
 'a',
 'great',
 'tutu',
 'and',
 'at',
 'a',
 'really',
 'great',
 'price',
 '.',
 '<eos>',
 '<eos>']

### Function for finding all N-grams
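Extracting n-grams amounts to sliding a window of length n over the padded sentence. The sketch below shows the idea on a short made-up sentence; `find_ngrams` is a hypothetical name, not necessarily what `NgramLM` uses internally.

```python
def find_ngrams(padded_tokens, n):
    """Return every window of n consecutive tokens as a tuple."""
    return [tuple(padded_tokens[i:i + n]) for i in range(len(padded_tokens) - n + 1)]

padded = ["<sos>", "<sos>", "this", "is", "great", ".", "<eos>", "<eos>"]
for gram in find_ngrams(padded, n=3):
    print(gram)
```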

In [116]:
train_ngram[0]

[('<sos>', '<sos>', 'this'),
 ('<sos>', 'this', 'is'),
 ('this', 'is', 'a'),
 ('is', 'a', 'great'),
 ('a', 'great', 'tutu'),
 ('great', 'tutu', 'and'),
 ('tutu', 'and', 'at'),
 ('and', 'at', 'a'),
 ('at', 'a', 'really'),
 ('a', 'really', 'great'),
 ('really', 'great', 'price'),
 ('great', 'price', '.'),
 ('price', '.', '<eos>'),
 ('.', '<eos>', '<eos>')]

In [117]:
vocab_ngram[0]

('.', '<eos>', '<eos>')

In [118]:
count_ngram[0]

9630

In [125]:
trie_ngram = train_ngram_lm.trie_ngram
# trie_ngram
# trie_prev_ngram = train_ngram_lm.trie_prev_ngram

In [131]:
trie_ngram['./<eos>/<eos>']

9630

In [175]:
# trie_ngram  # (ngram, number_of_times_ngram_appears_in_data)

In [133]:
id2token_ngram = train_ngram_lm.id2token
token2id_ngram = train_ngram_lm.token2id

In [134]:
random_token_id = random.randint(0, len(id2token_ngram) - 1)
random_token = id2token_ngram[random_token_id]

print ("Token id {} ; token {}".format(random_token_id, id2token_ngram[random_token_id]))
print ("Token {}; token id {}".format(random_token, token2id_ngram[random_token]))

Token id 91753 ; token ('an', 'older', 'coat')
Token ('an', 'older', 'coat'); token id 91753


### Ngram Count & Probability

In [None]:
# TODO: print the words for which the pd is nonzero !!! -- more intuitive than a list of numbers

In [136]:
vocab_ngram[:10], count_ngram[:10]

((('.', '<eos>', '<eos>'),
  ('<sos>', '<sos>', 'i'),
  ('<sos>', '<sos>', 'the'),
  ('!', '<eos>', '<eos>'),
  ('<sos>', '<sos>', 'they'),
  ('<sos>', '<sos>', 'it'),
  ('.', '.', '.'),
  ('<sos>', '<sos>', 'this'),
  ('.', '.', '<eos>'),
  ('<sos>', '<sos>', 'these')),
 (9630, 2692, 893, 840, 635, 543, 501, 411, 385, 374))

In [196]:
c = train_ngram_lm.get_ngram_count(('an', 'older', 'coat'))
p = train_ngram_lm.get_ngram_prob(('an', 'older', 'coat'))

p1 = train_ngram_lm.get_ngram_prob(('an', 'older', 'pc'))
p2 = train_ngram_lm.get_ngram_prob(('an', 'older', 'lady'))
p3 = train_ngram_lm.get_ngram_prob(('an', 'older', 'watch'))

pd = train_ngram_lm.get_prob_distr_ngram(('an', 'older'))

c, p, p1, p2, p3, sum(pd)#, pd

(1, 0.25, 0.25, 0.25, 0.0, 1.0)

In [197]:
c = train_ngram_lm.get_ngram_count(('really', 'great', 'price'))
p = train_ngram_lm.get_ngram_prob(('really', 'great', 'price'))
pd = train_ngram_lm.get_prob_distr_ngram(('really', 'great'))

c, p, sum(pd)#, pd 

(1, 0.14285714285714285, 1.0000000000000002)

In [198]:
c = train_ngram_lm.get_ngram_count(('really', 'great'))

c

0

In [199]:
c = train_ngram_lm.get_ngram_count(('.', '<eos>', '<eos>'))
p = train_ngram_lm.get_ngram_prob(('.', '<eos>', '<eos>'))
pd = train_ngram_lm.get_prob_distr_ngram(('.', '<eos>'))

c, p, sum(pd)#, pd

(9630, 1.0, 1.0)

In [200]:
c = train_ngram_lm.get_ngram_count(('.', '<sos>', '<sos>'))

c

0

In [204]:
c = train_ngram_lm.get_ngram_count(('i', 'like', 'pandas'))
p = train_ngram_lm.get_ngram_prob(('i', 'like', 'pandas'))
pd = train_ngram_lm.get_prob_distr_ngram(('i', 'like'))

c, p, sum(pd)#, pd

(0, 0, 1.0)

In [218]:
c = train_ngram_lm.get_ngram_count(('i', 'like', 'this'))
p = train_ngram_lm.get_ngram_prob(('i', 'like', 'this'))
pd = train_ngram_lm.get_prob_distr_ngram(('i', 'like'))

c, p, sum(pd)#, pd

(9, 0.0703125, 1.0)

In [220]:
c = train_ngram_lm.get_ngram_count(('is', 'a', 'great'))
p = train_ngram_lm.get_ngram_prob(('is', 'a', 'great'))
pd = train_ngram_lm.get_prob_distr_ngram(('is', 'a'))

c, p, sum(pd)#, pd

(26, 0.09885931558935361, 1.0000000000000007)

In [230]:
c = train_ngram_lm.get_ngram_count(('send', 'it', 'back'))
p = train_ngram_lm.get_ngram_prob(('send', 'it', 'back'))
pd = train_ngram_lm.get_prob_distr_ngram(('send', 'it', 'back'))

c, p, sum(pd)#, pd

(2, 1.0, 0)

In [214]:
c = train_ngram_lm.get_ngram_count(('i', 'like', 'these', 'pictures'))
p = train_ngram_lm.get_ngram_prob(('i', 'like', 'these', 'pictures'))
pd = train_ngram_lm.get_prob_distr_ngram(('i', 'like', 'these'))

c, p, sum(pd)#, pd

(0, 0, 0)

## Add-One Smoothing

## $$P(w_i|w_{i−n+1}, ..., w_{i−2}, w_{i−1}) \approx \frac{1 + c(w_{i−n+1}, ..., w_{i−2}, w_{i−1}, w_i)}{\mid V\mid + \sum_{w \in V} c(w_{i−n+1}, ..., w_{i−2}, w_{i−1}, w)}$$
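The add-one values printed in this section can be reproduced by hand from counts reported earlier in the notebook. The sketch assumes a vocabulary size of 6943, which is inferred from the printed probabilities rather than stated anywhere, so treat it as an assumption. The context count 128 for `('i', 'like')` follows from the reported count 9 and unsmoothed probability 0.0703125 of `('i', 'like', 'this')`. `add_one_prob` is a hypothetical helper.

```python
import math

def add_one_prob(ngram_count, context_count, vocab_size):
    """(1 + c(context, w)) / (|V| + sum_w' c(context, w'))."""
    return (1 + ngram_count) / (vocab_size + context_count)

V = 6943  # assumption: inferred from the notebook's outputs, not stated in it

# ('i', 'like', 'this'): count 9; context count 128, since 9 / 128 = 0.0703125
p_seen = add_one_prob(9, 128, V)
# ('i', 'like', 'pandas'): unseen trigram with the same context
p_unseen = add_one_prob(0, 128, V)
# ('.', '<eos>', '<eos>'): count 9630 and unsmoothed probability 1.0
p_eos = add_one_prob(9630, 9630, V)

print(p_seen, p_unseen, p_eos)
```

Note how much mass add-one smoothing takes from a frequent n-gram: the unsmoothed estimate of 1.0 drops to roughly 0.58.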


In [225]:
p = train_ngram_lm.get_ngram_prob_add_one_smoothing(('.', '<sos>', '<sos>'))
p

0.00014402995823131211

In [226]:
p = train_ngram_lm.get_ngram_prob_add_one_smoothing(('i', 'like', 'pandas'))
p

0.0001414227124876255

In [227]:
p = train_ngram_lm.get_ngram_prob_add_one_smoothing(('i', 'like', 'this'))
p

0.001414227124876255

In [228]:
p = train_ngram_lm.get_ngram_prob_add_one_smoothing(('really', 'great', 'price'))
p

0.00028776978417266187

In [231]:
p = train_ngram_lm.get_ngram_prob_add_one_smoothing(('send', 'it', 'back'))
p

0.00043196544276457883

In [237]:
p = train_ngram_lm.get_ngram_prob_add_one_smoothing(('.', '<eos>', '<eos>'))
p

0.5811259277137513

## Additive Smoothing

## $$P(w_i|w_{i−n+1}, ..., w_{i−2}, w_{i−1}) \approx \frac{\delta + c(w_{i−n+1}, ..., w_{i−2}, w_{i−1}, w_i)}{\delta\mid V\mid + \sum_{w \in V} c(w_{i−n+1}, ..., w_{i−2}, w_{i−1}, w)}$$
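A minimal sketch of additive smoothing on a made-up four-word vocabulary, showing that the smoothed distribution still sums to one for any $\delta$ and that $\delta = 1$ recovers add-one smoothing. The counts and the helper `additive_prob` are hypothetical.

```python
from collections import Counter

def additive_prob(ngram, counts, vocab, delta):
    """(delta + c(context, w)) / (delta * |V| + sum_w' c(context, w'))."""
    context = ngram[:-1]
    context_total = sum(counts[context + (w,)] for w in vocab)
    return (delta + counts[ngram]) / (delta * len(vocab) + context_total)

vocab = ["good", "bad", "fine", "great"]
counts = Counter({("very", "good"): 3, ("very", "bad"): 1})

for delta in (0.01, 0.5, 1.0):
    dist = [additive_prob(("very", w), counts, vocab, delta) for w in vocab]
    # Print delta, the smoothed distribution, and its sum (which should be 1)
    print(delta, [round(p, 3) for p in dist], round(sum(dist), 6))
```

A smaller `delta` keeps the estimate closer to the raw counts, while a larger one flattens it toward the uniform distribution.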


In [243]:
p = train_ngram_lm.get_ngram_prob_additive_smoothing(('.', '<sos>', '<sos>'), delta = 0.5)
p

0.00014402995823131211

In [244]:
p = train_ngram_lm.get_ngram_prob_additive_smoothing(('i', 'like', 'pandas'), delta = 0.5)
p

0.00013890818169190166

In [245]:
p = train_ngram_lm.get_ngram_prob_additive_smoothing(('i', 'like', 'this'), delta = 0.5)
p

0.0026392554521461314

In [246]:
p = train_ngram_lm.get_ngram_prob_additive_smoothing(('really', 'great', 'price'), delta = 0.5)
p

0.00043122035360068997

In [247]:
p = train_ngram_lm.get_ngram_prob_additive_smoothing(('send', 'it', 'back'), delta = 0.5)
p

0.0007197351374694113

In [248]:
p = train_ngram_lm.get_ngram_prob_additive_smoothing(('.', '<eos>', '<eos>'), delta = 0.5)
p

0.7350685036064573

### Changing the Parameter $\delta$

In [251]:
# small delta --> closer to no smoothing  (1.0)
p = train_ngram_lm.get_ngram_prob_additive_smoothing(('.', '<eos>', '<eos>'), delta = 0.1)
p

0.9327605745667988

In [252]:
# large delta --> closer to add-one smoothing (0.58)
p = train_ngram_lm.get_ngram_prob_additive_smoothing(('.', '<eos>', '<eos>'), delta = 0.9)
p

0.6065295017854105

## Linear Interpolation Smoothing (Jelinek-Mercer)

### $$P_{interp}(w_i|w_{i-n+1}, ..., w_{i-2}, w_{i-1}) = \alpha_n P(w_i|w_{i-n+1}, ..., w_{i-2}, w_{i-1}) + (1 - \alpha_n) P(w_i|w_{i-n+2}, ..., w_{i-2}, w_{i-1})$$
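The sketch below interpolates a trigram estimate with the bigram estimate obtained by dropping the oldest context word. It covers one level of interpolation only, uses made-up counts, and is not the notebook's `get_ngram_prob_interpolation_smoothing`; the helpers `mle` and `interpolated_prob` are hypothetical.

```python
from collections import Counter

def mle(ngram, counts):
    """Unsmoothed estimate c(context, w) / c(context, *); 0.0 for unseen contexts."""
    context = ngram[:-1]
    total = sum(c for g, c in counts.items()
                if len(g) == len(ngram) and g[:-1] == context)
    return counts[ngram] / total if total else 0.0

def interpolated_prob(ngram, counts, alpha):
    """alpha * P(w | full context) + (1 - alpha) * P(w | context minus its oldest word)."""
    return alpha * mle(ngram, counts) + (1 - alpha) * mle(ngram[1:], counts)

# Toy counts holding both trigrams and bigrams in one Counter.
counts = Counter({
    ("i", "like", "this"): 1, ("i", "like", "that"): 1,
    ("like", "this"): 2, ("like", "that"): 1, ("like", "pandas"): 1,
})

print(interpolated_prob(("i", "like", "this"), counts, alpha=0.8))
# The trigram is unseen, but the bigram ("like", "pandas") keeps the result above zero.
print(interpolated_prob(("i", "like", "pandas"), counts, alpha=0.8))
```

The second call shows the point of interpolation: an unseen trigram still gets probability mass whenever its shorter suffix was observed.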


In [263]:
p = train_ngram_lm.get_ngram_prob_interpolation_smoothing(('.', '<sos>', '<sos>'), alpha = 0.8)
p

0.0

In [264]:
p = train_ngram_lm.get_ngram_prob_interpolation_smoothing(('i', 'like', 'pandas'), alpha = 0.8)
p

0.0

In [265]:
p = train_ngram_lm.get_ngram_prob_interpolation_smoothing(('i', 'like', 'this'), alpha = 0.8)
p

0.05625

In [266]:
p = train_ngram_lm.get_ngram_prob_interpolation_smoothing(('really', 'great', 'price'), alpha = 0.8)
p

0.11428571428571428

In [267]:
p = train_ngram_lm.get_ngram_prob_interpolation_smoothing(('send', 'it', 'back'), alpha = 0.8)
p

0.8

### Changing the Parameter $\alpha$

In [268]:
# large alpha --> closer to the unsmoothed trigram estimate (1.0)
p = train_ngram_lm.get_ngram_prob_interpolation_smoothing(('.', '<eos>', '<eos>'), alpha = 0.8)
p

0.8

In [271]:
# alpha = 0.5 --> equal weight on the trigram and the lower-order estimate
p = train_ngram_lm.get_ngram_prob_interpolation_smoothing(('.', '<eos>', '<eos>'), alpha = 0.5)
p

0.5

In [272]:
# small alpha --> most of the weight moves to the lower-order estimate
p = train_ngram_lm.get_ngram_prob_interpolation_smoothing(('.', '<eos>', '<eos>'), alpha = 0.2)
p

0.2

## Linear Interpolation with Absolute Discounting

### $$p_{bi}(w|v) = \max\left(\frac{N(v, w) - b_{bi}}{N(v)}, 0\right) + b_{bi} \frac{V - N_0(v, \cdot)}{N(v)} p_{uni}(w)$$

### $$p_{uni}(w) = \max\left(\frac{N(w) - b_{uni}}{N}, 0\right) + b_{uni} \frac{V - N_0(\cdot)}{N} \frac{1}{V}$$

### $$b_{bi} = \frac{N_1(\cdot, \cdot)}{N_1(\cdot, \cdot) + 2*N_2(\cdot, \cdot)}$$

### $$b_{uni} = \frac{N_1(\cdot)}{N_1(\cdot) + 2*N_2(\cdot)}$$


### $$N_r(\cdot) = \sum_{w: N(w) = r} 1$$

### $$N_r(\cdot, \cdot) = \sum_{v, w: N(v, w) = r} 1$$

### $$N_r(v, \cdot) = \sum_{w: N(v, w) = r} 1$$

### V is the number of words in the vocabulary

### $N_r(\cdot, \cdot)$ and $N_r(\cdot)$ are the count-counts for bigrams and unigrams, respectively


### Remember to check that probabilities sum up to one:
### $$\sum_w p_{bi}(w|v) = \sum_w p_{uni}(w) = 1$$
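The normalization check can be sketched on a toy corpus. This is a minimal illustration of the formulas above, assuming $N(v)$ is the number of bigrams that start with $v$ (which keeps the distribution normalized) and that the corpus contains at least one item seen exactly once. All names (`discount`, `p_uni`, `p_bi`) and the toy data are hypothetical.

```python
from collections import Counter

def discount(counts):
    """b = N_1 / (N_1 + 2 * N_2), from the number of items seen once and twice."""
    n1 = sum(1 for c in counts.values() if c == 1)
    n2 = sum(1 for c in counts.values() if c == 2)
    return n1 / (n1 + 2 * n2)

def p_uni(w, uni, vocab, b_uni):
    N = sum(uni.values())
    seen = sum(1 for x in vocab if uni[x] > 0)        # V - N_0(.)
    return max((uni[w] - b_uni) / N, 0.0) + b_uni * seen / N / len(vocab)

def p_bi(w, v, bi, uni, vocab, b_bi, b_uni):
    n_v = sum(bi[(v, x)] for x in vocab)              # N(v); must be > 0
    seen = sum(1 for x in vocab if bi[(v, x)] > 0)    # V - N_0(v, .)
    backoff = b_bi * seen / n_v * p_uni(w, uni, vocab, b_uni)
    return max((bi[(v, w)] - b_bi) / n_v, 0.0) + backoff

tokens = "a b a b a c b a d c".split()
vocab = sorted(set(tokens)) + ["e"]   # "e" never occurs in the toy corpus
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
b_uni, b_bi = discount(uni), discount(bi)

# Both sums should be 1 up to floating-point error.
print(sum(p_uni(w, uni, vocab, b_uni) for w in vocab))
print(sum(p_bi(w, "a", bi, uni, vocab, b_bi, b_uni) for w in vocab))
```

The mass removed by the discount from seen events is exactly what the second term redistributes, which is why both sums come out to one.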



In [273]:
y = "m"
x = "'"

z = train_ngram_lm.get_p_bi(y, x)
z

567.685815486679

In [276]:
train_ngram[:3]

[[('<sos>', '<sos>', 'this'),
  ('<sos>', 'this', 'is'),
  ('this', 'is', 'a'),
  ('is', 'a', 'great'),
  ('a', 'great', 'tutu'),
  ('great', 'tutu', 'and'),
  ('tutu', 'and', 'at'),
  ('and', 'at', 'a'),
  ('at', 'a', 'really'),
  ('a', 'really', 'great'),
  ('really', 'great', 'price'),
  ('great', 'price', '.'),
  ('price', '.', '<eos>'),
  ('.', '<eos>', '<eos>')],
 [('<sos>', '<sos>', 'it'),
  ('<sos>', 'it', 'doesn'),
  ('it', 'doesn', "'"),
  ('doesn', "'", 't'),
  ("'", 't', 'look'),
  ('t', 'look', 'cheap'),
  ('look', 'cheap', 'at'),
  ('cheap', 'at', 'all'),
  ('at', 'all', '.'),
  ('all', '.', '<eos>'),
  ('.', '<eos>', '<eos>')],
 [('<sos>', '<sos>', 'i'),
  ('<sos>', 'i', "'"),
  ('i', "'", 'm'),
  ("'", 'm', 'so'),
  ('m', 'so', 'glad'),
  ('so', 'glad', 'i'),
  ('glad', 'i', 'looked'),
  ('i', 'looked', 'on'),
  ('looked', 'on', 'amazon'),
  ('on', 'amazon', 'and'),
  ('amazon', 'and', 'found'),
  ('and', 'found', 'such'),
  ('found', 'such', 'an'),
  ('such', 'an', 'af

## Kneser-Ney Smoothing (best to use in practice!) http://smithamilli.com/blog/kneser-ney/

### Bigram LM
###  $$p(s) = \prod_{i = 1} ^ {N + 1} p(w_i | w_{i-1})$$

## Likelihood of a Sentence

### Bigram LM: $$ p(i \; love \; this \; light) = p(i|\cdot) \; p(love|i)\;  p(this|love)\;  p(light|this) \\
\approx \frac{c(\cdot, \; i)}{\sum_w c(\cdot, \; w)} \; \frac{c(i, \; love)}{\sum_w c(i, \; w)}\;  \frac{c(love, \; this)}{\sum_w c(love, \;w)}\;  \frac{c(this, \; light)}{\sum_w c(this, \;w)}$$ 

### Trigram LM: $$ p(i \; love \; this  \;light) = p(i|\cdot, \cdot) \; p(love|\cdot, i) \; p(this|i, love)\;  p(light|love, this)$$ 
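A minimal sketch of the bigram sentence likelihood on a made-up three-sentence corpus. It pads with one `<sos>` and one `<eos>` marker and multiplies the count ratios; `bigram_sentence_prob` is a hypothetical helper, not the notebook's `get_prob_sentence`.

```python
from collections import Counter

def bigram_sentence_prob(sentence, bigram_counts, sos="<sos>", eos="<eos>"):
    """Product of c(prev, w) / sum_w' c(prev, w') over the padded sentence."""
    padded = [sos] + list(sentence) + [eos]
    prob = 1.0
    for prev, word in zip(padded, padded[1:]):
        context_total = sum(c for (p, _), c in bigram_counts.items() if p == prev)
        if context_total == 0 or bigram_counts[(prev, word)] == 0:
            return 0.0  # a single unseen bigram zeroes the whole product
        prob *= bigram_counts[(prev, word)] / context_total
    return prob

corpus = [["i", "love", "this", "light"], ["i", "love", "this"],
          ["i", "like", "this", "light"]]
bigram_counts = Counter()
for s in corpus:
    padded = ["<sos>"] + s + ["<eos>"]
    bigram_counts.update(zip(padded, padded[1:]))

print(bigram_sentence_prob(["i", "love", "this", "light"], bigram_counts))
print(bigram_sentence_prob(["i", "love", "pandas"], bigram_counts))
```

The second sentence scores exactly zero because `("love", "pandas")` never occurs, which is the problem smoothing is meant to address.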



### Score Sentences

In [281]:
n = 3
sentence = [['this', 'is', 'a', 'great', 'tutu']]
print(sentence)
ps = train_ngram_lm.get_prob_sentence(sentence)
ps

[['this', 'is', 'a', 'great', 'tutu']]


0.0

## Sentence Generation
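Generation from an n-gram model can be sketched as repeated sampling: look up the last n - 1 words, draw the next word in proportion to its count, and append it. The code below is an illustration on a made-up corpus, assuming a single `<eos>` marker and stopping as soon as it is drawn; it is not the notebook's `generate_sentence`, and `build_successors` and `generate` are hypothetical names.

```python
import random
from collections import Counter, defaultdict

def build_successors(sentences, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    successors = defaultdict(Counter)
    for tokens in sentences:
        padded = ["<sos>"] * (n - 1) + tokens + ["<eos>"]
        for i in range(len(padded) - n + 1):
            successors[tuple(padded[i:i + n - 1])][padded[i + n - 1]] += 1
    return successors

def generate(successors, n, num_tokens, context=(), rng=random):
    """Sample up to num_tokens words, each drawn in proportion to its count."""
    history = ["<sos>"] * (n - 1) + list(context)
    generated = []
    for _ in range(num_tokens):
        options = successors.get(tuple(history[len(history) - (n - 1):]))
        if not options:
            break  # unseen context: nothing to sample from
        word = rng.choices(list(options), weights=list(options.values()))[0]
        if word == "<eos>":
            break
        generated.append(word)
        history.append(word)
    return " ".join(generated)

corpus = [["i", "like", "the", "shoes"], ["i", "like", "the", "price"],
          ["the", "shoes", "fit", "well"]]
successors = build_successors(corpus, n=3)
print(generate(successors, n=3, num_tokens=5,
               context=("i", "like", "the"), rng=random.Random(1)))
```

Only the last n - 1 words of the context influence the next word, so a longer context gives a trigram model no extra information.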

#### No Context

In [283]:
num_tokens = 5
generated_sentence = train_ngram_lm.generate_sentence(num_tokens)
generated_sentence


i
i can
i can see
i can see better
i can see better ,


'i can see better ,'

In [284]:
num_tokens = 10
generated_sentence = train_ngram_lm.generate_sentence(num_tokens, context=('i', 'like', 'the'))
generated_sentence


covered
covered elastic
covered elastic legs
covered elastic legs and
covered elastic legs and they
covered elastic legs and they '
covered elastic legs and they ' re
covered elastic legs and they ' re wearing
covered elastic legs and they ' re wearing body
covered elastic legs and they ' re wearing body armor


"covered elastic legs and they ' re wearing body armor"

In [282]:
num_tokens = 20
generated_sentence = train_ngram_lm.generate_sentence(num_tokens)
generated_sentence


i
i love
i love the
i love the two
i love the two outer
i love the two outer pockets
i love the two outer pockets .
i love the two outer pockets . <
i love the two outer pockets . < 65
i love the two outer pockets . < 65 word
i love the two outer pockets . < 65 word onlinel
i love the two outer pockets . < 65 word onlinel camo
i love the two outer pockets . < 65 word onlinel camo hightop
i love the two outer pockets . < 65 word onlinel camo hightop wears
i love the two outer pockets . < 65 word onlinel camo hightop wears mymerrell
i love the two outer pockets . < 65 word onlinel camo hightop wears mymerrell tissues
i love the two outer pockets . < 65 word onlinel camo hightop wears mymerrell tissues with
i love the two outer pockets . < 65 word onlinel camo hightop wears mymerrell tissues with lbs
i love the two outer pockets . < 65 word onlinel camo hightop wears mymerrell tissues with lbs hands
i love the two outer pockets . < 65 word onlinel camo hightop wears mymerrell tissues with lbs hands snuggly

'i love the two outer pockets . < 65 word onlinel camo hightop wears mymerrell tissues with lbs hands snuggly'

#### With Context

In [286]:
num_tokens = 5
generated_sentence = train_ngram_lm.generate_sentence(num_tokens, context=('i', 'like', 'the'))
generated_sentence


shoes
shoes .
shoes . <
shoes . < antenna
shoes . < antenna teva


'shoes . < antenna teva'

In [285]:
num_tokens = 10
generated_sentence = train_ngram_lm.generate_sentence(num_tokens, context=('i', 'like', 'the'))
generated_sentence


currency
currency compartment
currency compartment there
currency compartment there are
currency compartment there are only
currency compartment there are only two
currency compartment there are only two factors
currency compartment there are only two factors which
currency compartment there are only two factors which keep
currency compartment there are only two factors which keep the


'currency compartment there are only two factors which keep the'

In [288]:
num_tokens = 5
generated_sentence = train_ngram_lm.generate_sentence(num_tokens, context=('i', 'like', 'the'))
generated_sentence

picture
picture and
picture and very
picture and very handy
picture and very handy for


'picture and very handy for'

In [290]:
num_tokens = 5
generated_sentence = train_ngram_lm.generate_sentence(num_tokens, context=('i', 'will', 'buy'))
generated_sentence

more
more modules
more modules is
more modules is really
more modules is really too


'more modules is really too'

In [291]:
num_tokens = 5
generated_sentence = train_ngram_lm.generate_sentence(num_tokens, context=('i',))
generated_sentence


like
like the
like the way
like the way i
like the way i could


'like the way i could'

In [294]:
num_tokens = 10
generated_sentence = train_ngram_lm.generate_sentence(num_tokens, context=('i',))
generated_sentence


have
have short
have short hair
have short hair and
have short hair and i
have short hair and i have
have short hair and i have to
have short hair and i have to start
have short hair and i have to start my
have short hair and i have to start my shopping


'have short hair and i have to start my shopping'

In [296]:
num_tokens = 10
generated_sentence = train_ngram_lm.generate_sentence(num_tokens, context=('this', 'is', 'the', 'best', 'i'))
generated_sentence


'
' ve
' ve ever
' ve ever owned
' ve ever owned .
' ve ever owned . <
' ve ever owned . < ecco
' ve ever owned . < ecco sunglasses
' ve ever owned . < ecco sunglasses spec
' ve ever owned . < ecco sunglasses spec hiking


"' ve ever owned . < ecco sunglasses spec hiking"

In [297]:
num_tokens = 5
generated_sentence = train_ngram_lm.generate_sentence(num_tokens, context=('this', 'is', 'the', 'best', 'i'))
generated_sentence


'
' ve
' ve found
' ve found the
' ve found the band


"' ve found the band"

In [298]:
num_tokens = 5
generated_sentence = train_ngram_lm.generate_sentence(num_tokens, context=('this', 'is', 'not', 'what', 'i'))
generated_sentence


don
don '
don ' t
don ' t just
don ' t just purchase


"don ' t just purchase"

## Log-Likelihood (n-gram)
## $$LL = \sum_{j=1}^{K} \sum_{i=1}^{T_j + 1} \log p(w_{j, i} | w_{j, i - n + 1}, \ldots, w_{j, i - 2}, w_{j, i - 1})$$

## Perplexity
## $$PP = \exp\left(-\frac{LL}{\sum_j(T_j + 1)}\right)$$
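A minimal sketch of the perplexity formula, taking per-token probabilities as input. The probabilities are made up, and the choice to return infinity for a zero-probability token is an assumption of this sketch; the notebook's `get_perplexity` is not shown and reports very large finite values for the unsmoothed model, so it presumably handles zeros differently.

```python
import math

def perplexity(token_probs_per_sentence):
    """exp(-LL / number of predicted tokens), LL being the summed log-probabilities."""
    log_likelihood = 0.0
    num_tokens = 0
    for probs in token_probs_per_sentence:
        for p in probs:
            if p == 0.0:
                return math.inf  # one zero-probability token makes LL = -infinity
            log_likelihood += math.log(p)
            num_tokens += 1
    return math.exp(-log_likelihood / num_tokens)

# Hypothetical per-token probabilities; each inner list covers T_j words plus <eos>.
uniform = [[0.25, 0.25, 0.25], [0.25, 0.25]]
print(perplexity(uniform))            # a uniform choice among 4 words gives perplexity 4
print(perplexity([[0.5, 0.0, 0.5]]))  # an unseen n-gram without smoothing
```

Perplexity can be read as the effective number of equally likely choices the model faces at each token, so lower is better.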

In [300]:
ppl_train = train_ngram_lm.get_perplexity(train_data_tokenized)
ppl_valid = train_ngram_lm.get_perplexity(valid_data_tokenized)


In [301]:
ppl_valid, ppl_train

(4.003496405825398e+17, 725.1099435816803)

### Let's Compare Different Smoothing Techniques

In [302]:
# No Smoothing
train_ngram_lm_no_smoothing = NgramLM(train_data_tokenized, all_tokens_train, n=3)
valid_ngram_lm_no_smoothing = NgramLM(valid_data_tokenized, all_tokens_valid, n=3)

ppl_train_no_smoothing = train_ngram_lm_no_smoothing.get_perplexity(train_data_tokenized)
ppl_valid_no_smoothing = train_ngram_lm_no_smoothing.get_perplexity(valid_data_tokenized)

ppl_valid_no_smoothing, ppl_train_no_smoothing


(4.003496405825398e+17, 725.1099435816803)

In [303]:
# Additive Smoothing
train_ngram_lm_additive = NgramLM(train_data_tokenized, all_tokens_train, n=3, smoothing='additive', delta=0.5)
valid_ngram_lm_additive = NgramLM(valid_data_tokenized, all_tokens_valid, n=3, smoothing='additive', delta=0.5)

ppl_train_no_additive = train_ngram_lm_additive.get_perplexity(train_data_tokenized)
ppl_valid_no_additive = train_ngram_lm_additive.get_perplexity(valid_data_tokenized)

ppl_valid_no_additive, ppl_train_no_additive


(2245.2347120600725, 1078.232279802221)

In [317]:
# Additive Smoothing
train_ngram_lm_additive_d2 = NgramLM(train_data_tokenized, all_tokens_train, n=3, smoothing='additive', delta=0.2)
valid_ngram_lm_additive_d2 = NgramLM(valid_data_tokenized, all_tokens_valid, n=3, smoothing='additive', delta=0.2)

ppl_train_no_additive_d2 = train_ngram_lm_additive_d2.get_perplexity(train_data_tokenized)
ppl_valid_no_additive_d2 = train_ngram_lm_additive_d2.get_perplexity(valid_data_tokenized)

ppl_valid_no_additive_d2, ppl_train_no_additive_d2


(1708.6807985656146, 556.8908309552986)

In [311]:
# Additive Smoothing
train_ngram_lm_additive_d8 = NgramLM(train_data_tokenized, all_tokens_train, n=3, smoothing='additive', delta=0.8)
valid_ngram_lm_additive_d8 = NgramLM(valid_data_tokenized, all_tokens_valid, n=3, smoothing='additive', delta=0.8)

ppl_train_no_additive_d8 = train_ngram_lm_additive_d8.get_perplexity(train_data_tokenized)
ppl_valid_no_additive_d8 = train_ngram_lm_additive_d8.get_perplexity(valid_data_tokenized)

ppl_valid_no_additive_d8, ppl_train_no_additive_d8


(2587.7558131859264, 1468.8550211850015)

In [314]:
# Add-One Smoothing
train_ngram_lm_add1 = NgramLM(train_data_tokenized, all_tokens_train, n=3, smoothing='add-one')
valid_ngram_lm_add1 = NgramLM(valid_data_tokenized, all_tokens_valid, n=3, smoothing='add-one')

ppl_train_no_add1 = train_ngram_lm_add1.get_perplexity(train_data_tokenized)
ppl_valid_no_add1 = train_ngram_lm_add1.get_perplexity(valid_data_tokenized)

ppl_valid_no_add1, ppl_train_no_add1


(2767.1331172664704, 1685.1666310399155)

In [315]:
# Interpolation Smoothing
train_ngram_lm_interp_a2 = NgramLM(train_data_tokenized, all_tokens_train, n=3, smoothing='interpolation', alpha=0.2)
valid_ngram_lm_interp_a2 = NgramLM(valid_data_tokenized, all_tokens_valid, n=3, smoothing='interpolation', alpha=0.2)

ppl_train_no_interp_a2 = train_ngram_lm_interp_a2.get_perplexity(train_data_tokenized)
ppl_valid_no_interp_a2 = train_ngram_lm_interp_a2.get_perplexity(valid_data_tokenized)

ppl_valid_no_interp_a2, ppl_train_no_interp_a2


(4.120249201591078e+17, 3260.7947470396057)

In [319]:
# Interpolation Smoothing
train_ngram_lm_interp_a8 = NgramLM(train_data_tokenized, all_tokens_train, n=3, smoothing='interpolation', alpha=0.8)
valid_ngram_lm_interp_a8 = NgramLM(valid_data_tokenized, all_tokens_valid, n=3, smoothing='interpolation', alpha=0.8)

ppl_train_no_interp_a8 = train_ngram_lm_interp_a8.get_perplexity(train_data_tokenized)
ppl_valid_no_interp_a8 = train_ngram_lm_interp_a8.get_perplexity(valid_data_tokenized)

ppl_valid_no_interp_a8, ppl_train_no_interp_a8


(4.0194841118222176e+17, 893.1597201353827)

In [320]:
# Interpolation Smoothing
train_ngram_lm_interp_a5 = NgramLM(train_data_tokenized, all_tokens_train, n=3, smoothing='interpolation', alpha=0.5)
valid_ngram_lm_interp_a5 = NgramLM(valid_data_tokenized, all_tokens_valid, n=3, smoothing='interpolation', alpha=0.5)

ppl_train_no_interp_a5 = train_ngram_lm_interp_a5.get_perplexity(train_data_tokenized)
ppl_valid_no_interp_a5 = train_ngram_lm_interp_a5.get_perplexity(valid_data_tokenized)

ppl_valid_no_interp_a5, ppl_train_no_interp_a5


(4.053367926970881e+17, 1385.4824458368155)

In [305]:
# # Discounted Interpolation Smoothing
# train_ngram_lm_discount = NgramLM(train_data_tokenized, all_tokens_train, n=3, smoothing='discounting')
# valid_ngram_lm_discount = NgramLM(valid_data_tokenized, all_tokens_valid, n=3, smoothing='discounting')

# ppl_train_no_discount = train_ngram_lm_discount.get_perplexity(train_data_tokenized)
# ppl_valid_no_discount = train_ngram_lm_discount.get_perplexity(valid_data_tokenized)

# ppl_valid_no_discount, ppl_train_no_discount


### Vary n in the n-grams

In [306]:
# Interpolation Smoothing, N = 2
train_ngram_lm_interp2 = NgramLM(train_data_tokenized, all_tokens_train, n=2, smoothing='interpolation')
valid_ngram_lm_interp2 = NgramLM(valid_data_tokenized, all_tokens_valid, n=2, smoothing='interpolation')

ppl_train_no_interp2 = train_ngram_lm_interp2.get_perplexity(train_data_tokenized)
ppl_valid_no_interp2 = train_ngram_lm_interp2.get_perplexity(valid_data_tokenized)

ppl_valid_no_interp2, ppl_train_no_interp2


(1.5296236594505036e+16, 4396.493002816503)

In [307]:
# Interpolation Smoothing, N = 3
train_ngram_lm_interp3 = NgramLM(train_data_tokenized, all_tokens_train, n=3, smoothing='interpolation')
valid_ngram_lm_interp3 = NgramLM(valid_data_tokenized, all_tokens_valid, n=3, smoothing='interpolation')

ppl_train_no_interp3 = train_ngram_lm_interp3.get_perplexity(train_data_tokenized)
ppl_valid_no_interp3 = train_ngram_lm_interp3.get_perplexity(valid_data_tokenized)

ppl_valid_no_interp3, ppl_train_no_interp3


(4.0194841118222176e+17, 893.1597201353827)

In [308]:
# Interpolation Smoothing, N = 5
train_ngram_lm_interp5 = NgramLM(train_data_tokenized, all_tokens_train, n=5, smoothing='interpolation')
valid_ngram_lm_interp5 = NgramLM(valid_data_tokenized, all_tokens_valid, n=5, smoothing='interpolation')

ppl_train_no_interp5 = train_ngram_lm_interp5.get_perplexity(train_data_tokenized)
ppl_valid_no_interp5 = train_ngram_lm_interp5.get_perplexity(valid_data_tokenized)

ppl_valid_no_interp5, ppl_train_no_interp5


(1.1201894610153348e+18, 227.42393815798513)

In [309]:
# Interpolation Smoothing, N = 10
train_ngram_lm_interp10 = NgramLM(train_data_tokenized, all_tokens_train, n=10, smoothing='interpolation')
valid_ngram_lm_interp10 = NgramLM(valid_data_tokenized, all_tokens_valid, n=10, smoothing='interpolation')

ppl_train_no_interp10 = train_ngram_lm_interp10.get_perplexity(train_data_tokenized)
ppl_valid_no_interp10 = train_ngram_lm_interp10.get_perplexity(valid_data_tokenized)

ppl_valid_no_interp10, ppl_train_no_interp10


(1.1524271783534917e+18, 196.61911426440474)

In [318]:
# Interpolation Smoothing, N = 20
train_ngram_lm_interp20 = NgramLM(train_data_tokenized, all_tokens_train, n=20, smoothing='interpolation')
valid_ngram_lm_interp20 = NgramLM(valid_data_tokenized, all_tokens_valid, n=20, smoothing='interpolation')

ppl_train_no_interp20 = train_ngram_lm_interp20.get_perplexity(train_data_tokenized)
ppl_valid_no_interp20 = train_ngram_lm_interp20.get_perplexity(valid_data_tokenized)

ppl_valid_no_interp20, ppl_train_no_interp20


(1.1552387706052442e+18, 208.770758577245)

### Exercise: do the above for additive smoothing with delta 0.1 or smaller

In [321]:
# Additive Smoothing (delta = 0.1), N = 5
train_ngram_lm_add5 = NgramLM(train_data_tokenized, all_tokens_train, n=5, smoothing='additive', delta=0.1)
valid_ngram_lm_add5 = NgramLM(valid_data_tokenized, all_tokens_valid, n=5, smoothing='additive', delta=0.1)

ppl_train_no_add5 = train_ngram_lm_add5.get_perplexity(train_data_tokenized)
ppl_valid_no_add5 = train_ngram_lm_add5.get_perplexity(valid_data_tokenized)

ppl_valid_no_add5, ppl_train_no_add5


(7537.898310997279, 1041.5790008573922)

### Sentence Probabilities

In [322]:
sentence = [['this', 'is', 'a', 'great', 'tutu']]
print(sentence)
ps = train_ngram_lm.get_prob_sentence(sentence)
ps

[['this', 'is', 'a', 'great', 'tutu']]


0.0

In [323]:
sentence = [['this', 'is', 'a', 'great', 'tutu']]
print(sentence)
ps = train_ngram_lm_interp3.get_prob_sentence(sentence)
ps

[['this', 'is', 'a', 'great', 'tutu']]


0.0

In [324]:
sentence = [['this', 'is', 'a', 'great', 'tutu']]
print(sentence)
ps = train_ngram_lm_interp5.get_prob_sentence(sentence)
ps

[['this', 'is', 'a', 'great', 'tutu']]


0.0

In [325]:
sentence = [['this', 'is', 'a', 'great', 'tutu']]
print(sentence)
ps = train_ngram_lm_interp10.get_prob_sentence(sentence)
ps

[['this', 'is', 'a', 'great', 'tutu']]


0.0

In [326]:
sentence = [['this', 'is', 'a', 'great', 'tutu']]
print(sentence)
ps = train_ngram_lm_additive.get_prob_sentence(sentence)
ps

[['this', 'is', 'a', 'great', 'tutu']]


1.4486052229717558e-18

In [327]:
sentence = [['this', 'is', 'a', 'great', 'tutu']]
print(sentence)
ps = train_ngram_lm_add5.get_prob_sentence(sentence)
ps

[['this', 'is', 'a', 'great', 'tutu']]


3.735148080988107e-24

### Sentence Generation

In [336]:
num_tokens = 10
generated_sentence = train_ngram_lm_interp5.generate_sentence(num_tokens)
generated_sentence


definitely
definitely not
definitely not for
definitely not for the
definitely not for the airports
definitely not for the airports .
definitely not for the airports . <
definitely not for the airports . < minutes
definitely not for the airports . < minutes vintage
definitely not for the airports . < minutes vintage polka


'definitely not for the airports . < minutes vintage polka'

In [337]:
num_tokens = 10
generated_sentence = train_ngram_lm_interp3.generate_sentence(num_tokens)
generated_sentence


but
but for
but for the
but for the rib
but for the rib cage
but for the rib cage ,
but for the rib cage , but
but for the rib cage , but the
but for the rib cage , but the price
but for the rib cage , but the price .


'but for the rib cage , but the price .'

In [338]:
num_tokens = 10
generated_sentence = train_ngram_lm_additive.generate_sentence(num_tokens)
generated_sentence


i
i have
i have wanted
i have wanted to
i have wanted to own
i have wanted to own one
i have wanted to own one so
i have wanted to own one so i
i have wanted to own one so i bought
i have wanted to own one so i bought this


'i have wanted to own one so i bought this'

In [339]:
num_tokens = 10
generated_sentence = train_ngram_lm_interp10.generate_sentence(num_tokens)
generated_sentence


out
out of
out of the
out of the box
out of the box ,
out of the box , there
out of the box , there '
out of the box , there ' s
out of the box , there ' s no
out of the box , there ' s no break


"out of the box , there ' s no break"

In [340]:
num_tokens = 10
generated_sentence = train_ngram_lm_interp2.generate_sentence(num_tokens)
generated_sentence


i
i have
i have wide
i have wide ,
i have wide , please
i have wide , please even
i have wide , please even the
i have wide , please even the places
i have wide , please even the places to
i have wide , please even the places to have


'i have wide , please even the places to have'

In [341]:
num_tokens = 5
generated_sentence = train_ngram_lm_interp10.generate_sentence(num_tokens)
generated_sentence


fit
fit well
fit well .
fit well . <
fit well . < 00


'fit well . < 00'

In [342]:
num_tokens = 20
generated_sentence = train_ngram_lm_interp10.generate_sentence(num_tokens)
generated_sentence


the
the band
the band fits
the band fits perfect
the band fits perfect but
the band fits perfect but the
the band fits perfect but the cup
the band fits perfect but the cup size
the band fits perfect but the cup size seems
the band fits perfect but the cup size seems a
the band fits perfect but the cup size seems a tad
the band fits perfect but the cup size seems a tad small
the band fits perfect but the cup size seems a tad small ,
the band fits perfect but the cup size seems a tad small , but
the band fits perfect but the cup size seems a tad small , but i
the band fits perfect but the cup size seems a tad small , but i don
the band fits perfect but the cup size seems a tad small , but i don '
the band fits perfect but the cup size seems a tad small , but i don ' t
the band fits perfect but the cup size seems a tad small , but i don ' t want
the band fits perfect but the cup size seems a tad small , but i don ' t want to


"the band fits perfect but the cup size seems a tad small , but i don ' t want to"