<a href="https://colab.research.google.com/github/kyunghyuncho/ammi-2019-nlp/blob/master/01-day-LM/ngram_lm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Language Modeling

### Goal: compute a probability distribution over all possible sentences:


### $$p(W) = p(w_1, w_2, ..., w_T)$$

### This unsupervised learning problem can be framed as a sequence of supervised learning problems:

### $$p(W) = p(w_1) * p(w_2|w_1) * ... * p(w_T|w_1, ..., w_{T-1})$$

### If we have K sentences, where the j-th sentence has T_j words for all j from 1 to K, then we want to maximize:

### $$\log p(W) = \sum_{j = 1}^K \sum_{i=1}^{T_j} \log p(w_i | w_{<i})$$
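
The objective above can be sketched in a few lines. This is only an illustration: `cond_prob(word, history)` is a hypothetical callable standing in for $p(w_i | w_{<i})$, and the uniform model is a placeholder, not the model built later in this notebook.

```python
import math

def corpus_log_likelihood(sentences, cond_prob):
    """Sum log p(w_i | w_<i) over every word of every sentence."""
    total = 0.0
    for sentence in sentences:
        for i, word in enumerate(sentence):
            history = tuple(sentence[:i])  # w_1, ..., w_{i-1}
            total += math.log(cond_prob(word, history))
    return total

# Hypothetical placeholder model: uniform over a 4-word vocabulary.
vocab = ["i", "like", "this", "tutu"]

def uniform(word, history):
    return 1.0 / len(vocab)

sentences = [["i", "like", "this"], ["this", "tutu"]]
ll = corpus_log_likelihood(sentences, uniform)
print(ll)  # 5 words, each contributing log(1/4)
```

Working with sums of logs rather than products of probabilities keeps the value in a numerically safe range even for long sentences.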




# N-gram language model

### Goal: estimate the n-gram probabilities using counts of sequences of n consecutive words

### Given a sequence of words $w$, we want to compute

### $$P(w_i|w_{i-n+1}, \ldots, w_{i-2}, w_{i-1})$$

### Where $w_i$ is the i-th word of the sequence.

### $$P(w_i|w_{i-n+1}, \ldots, w_{i-2}, w_{i-1}) = \frac{p(w_{i-n+1}, \ldots, w_{i-2}, w_{i-1}, w_i)}{\sum_{w \in V} p(w_{i-n+1}, \ldots, w_{i-2}, w_{i-1}, w)}$$

### Key Idea: We can estimate the probabilities using counts of n-grams in our dataset 


## N-gram Probabilities

## $$P(w_i|w_{i-n+1}, \ldots, w_{i-2}, w_{i-1}) \approx \frac{c(w_{i-n+1}, \ldots, w_{i-2}, w_{i-1}, w_i)}{\sum_{w \in V} c(w_{i-n+1}, \ldots, w_{i-2}, w_{i-1}, w)}$$


## Bigram Probabilities

## $$p(w_i | w_{i-1}) = \frac{c(w_{i-1}, w_i)}{\sum_{w \in V} c(w_{i-1}, w)} $$
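
As a rough sketch of the bigram estimate, the counts can be collected with `collections.Counter`. The function name `bigram_probs` and the toy corpus are made up for illustration; the notebook itself relies on the `NgramLM` class from `utils`.

```python
from collections import Counter

def bigram_probs(sentences):
    """Maximum-likelihood bigram estimates c(v, w) / sum_w' c(v, w')."""
    bigram_counts = Counter()
    context_counts = Counter()
    for sentence in sentences:
        for prev, word in zip(sentence, sentence[1:]):
            bigram_counts[(prev, word)] += 1
            context_counts[prev] += 1
    return {
        (prev, word): count / context_counts[prev]
        for (prev, word), count in bigram_counts.items()
    }

toy_corpus = [
    ["<sos>", "i", "like", "this", "<eos>"],
    ["<sos>", "i", "like", "it", "<eos>"],
    ["<sos>", "i", "love", "this", "<eos>"],
]
probs = bigram_probs(toy_corpus)
print(probs[("i", "like")])     # 2 of the 3 bigrams that start with "i"
print(probs[("like", "this")])  # 1 of the 2 bigrams that start with "like"
```

Any bigram that never occurs in the corpus is simply absent from the dictionary, which is the zero-probability problem that the smoothing methods later in the notebook address.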


In [1]:
import os
import sys
sys.path.append('utils/')
from utils import ngram_utils as ngram_utils
import utils.global_variables as gl
import torch
import random
from utils.ngram_utils import NgramLM

In [2]:
torch.manual_seed(1)


<torch._C.Generator at 0x7ff937baf2b0>

### Load Data from .txt Files

In [3]:
# Read data from .txt files and create lists of reviews

train_data = []
# create a list of all the reviews 
with open('../data/small_train.txt', 'r') as f:
    train_data = [review for review in f.read().split('\n') if review]
    
valid_data = []
# create a list of all the reviews 
with open('../data/small_valid.txt', 'r') as f:
    valid_data = [review for review in f.read().split('\n') if review]
    

In [4]:
# type(train_data), len(train_data), \
# type(train_data[0]), len(train_data[0]), \
# type(train_data[0][0]), len(train_data[0][0])

In [5]:
train_data[0], train_data[0][0], len(train_data)


("this is a great tutu and at a really great price . it doesn ' t look cheap at all . i ' m so glad i looked on amazon and found such an affordable tutu that isn ' t made poorly . a + + ",
 't',
 2227)

### Process the Data

In [6]:
# Tokenize the Datasets
# TODO: this takes a really long time !! why?
train_data_tokenized, all_tokens_train = ngram_utils.tokenize_dataset(train_data)
valid_data_tokenized, all_tokens_valid = ngram_utils.tokenize_dataset(valid_data)


Let's look at the tokenized data!

In [7]:
# # Number of All Tokens
# len(all_tokens_train), all_tokens_train[0], \
len(train_data_tokenized), train_data_tokenized[0]

(10832,
 ['this',
  'is',
  'a',
  'great',
  'tutu',
  'and',
  'at',
  'a',
  'really',
  'great',
  'price',
  '.'])

In [8]:
train_ngram_lm = NgramLM(train_data_tokenized, all_tokens_train, n=3, smoothing=None)
valid_ngram_lm = NgramLM(valid_data_tokenized, all_tokens_valid, n=3, smoothing=None)

In [9]:
train_ngram_lm.trie_ngram['./<eos>/<eos>']

9630

In [10]:
train_ngram_lm.n, train_ngram_lm.frac_vocab

(3, 0.9)

In [11]:
valid_ngram_lm.id2token[0:10]

['<unk>', '<sos>', '<eos>', '.', 'the', 'i', ',', 'and', 'it', 'a']

In [12]:
valid_ngram_lm.token2id['<unk>'], valid_ngram_lm.token2id['<sos>'], valid_ngram_lm.token2id['the']

(0, 1, 4)

In [13]:
valid_ngram_lm.vocab_ngram[:10], valid_ngram_lm.count_ngram[:10]

((('.', '<eos>', '<eos>'),
  ('<sos>', '<sos>', 'i'),
  ('<sos>', '<sos>', 'the'),
  ('!', '<eos>', '<eos>'),
  ('<sos>', '<sos>', 'it'),
  ('<sos>', '<sos>', 'this'),
  ('it', "'", 's'),
  ('.', '.', '.'),
  ('<sos>', '<sos>', 'they'),
  ('.', '.', '<eos>')),
 (1370, 357, 148, 138, 122, 83, 57, 53, 52, 46))

In [14]:
valid_ngram_lm.vocab_bigram[:10], valid_ngram_lm.count_bigram[:10]

((('.', '<eos>'),
  ('<sos>', 'i'),
  ('<sos>', 'the'),
  ('!', '<eos>'),
  ("'", 's'),
  ('<sos>', 'it'),
  ("'", 't'),
  (',', 'and'),
  ('.', '.'),
  ('i', "'")),
 (1370, 357, 148, 138, 124, 122, 121, 115, 99, 90))

In [15]:
valid_ngram_lm.vocab_unigram[:10], valid_ngram_lm.count_unigram[:10]

((('.',),
  ('the',),
  ('i',),
  (',',),
  ('and',),
  ('it',),
  ('a',),
  ('to',),
  ('is',),
  ("'",)),
 (1470, 1026, 815, 780, 690, 586, 549, 486, 380, 376))

In [16]:
valid_ngram_lm.vocab_prev_ngram[:10], valid_ngram_lm.count_prev_ngram[:10]

((('.', '<eos>'),
  ('<sos>', 'i'),
  ('<sos>', 'the'),
  ('!', '<eos>'),
  ("'", 's'),
  ('<sos>', 'it'),
  ("'", 't'),
  (',', 'and'),
  ('.', '.'),
  ('i', "'")),
 (1370, 357, 148, 138, 124, 122, 121, 115, 99, 90))

In [17]:
valid_ngram_lm.id2token_ngram[:10]

[('.', '<eos>', '<eos>'),
 ('<sos>', '<sos>', 'i'),
 ('<sos>', '<sos>', 'the'),
 ('!', '<eos>', '<eos>'),
 ('<sos>', '<sos>', 'it'),
 ('<sos>', '<sos>', 'this'),
 ('it', "'", 's'),
 ('.', '.', '.'),
 ('<sos>', '<sos>', 'they'),
 ('.', '.', '<eos>')]

In [18]:
valid_ngram_lm.token2id_ngram[('.', '<eos>', '<eos>')], valid_ngram_lm.token2id_ngram[('.', '.', '<eos>')]

(0, 9)

#### Build the Vocabulary 


In [19]:
# Build a vocabulary using all the tokens found in train data (90% of most common ones)
print('Word vocabulary size: {} words'.format(len(train_ngram_lm.token2id)))        

Word vocabulary size: 6942 words


### CORPUS ANALYSIS (Train + Valid Data)

#### Number of Tokens in the Corpus Data


In [20]:
print("Number of All Tokens ", len(all_tokens_train))

Number of All Tokens  165996


#### Number of Sentences in the Train Data


In [21]:
print("Number of Sentences ", len(train_ngram_lm.raw_data))

Number of Sentences  10832


## N-grams

In [22]:
n = 3 # trigrams

### Function for padding the sentences with special markers for the sentence beginning and end, i.e. $<sos>$ and $<eos>$

In [23]:
train_padded = train_ngram_lm.padded_data
train_ngram = train_ngram_lm.ngram_data
vocab_ngram = train_ngram_lm.vocab_ngram
count_ngram = train_ngram_lm.count_ngram 

In [24]:
train_padded[0]

['<sos>',
 '<sos>',
 'this',
 'is',
 'a',
 'great',
 'tutu',
 'and',
 'at',
 'a',
 'really',
 'great',
 'price',
 '.',
 '<eos>',
 '<eos>']
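
The padded output above is consistent with adding n - 1 start markers and n - 1 end markers to each sentence. A minimal sketch under that assumption follows; `pad_sentence` is a hypothetical helper, not the `NgramLM` implementation.

```python
def pad_sentence(tokens, n, sos="<sos>", eos="<eos>"):
    """Add n - 1 start markers and n - 1 end markers around a sentence."""
    return [sos] * (n - 1) + list(tokens) + [eos] * (n - 1)

padded = pad_sentence(["this", "is", "a", "great", "tutu"], n=3)
print(padded)
```

With this padding, the first real word of a sentence already has a full (n - 1)-token context, so it can be scored like any other word.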

### Function for finding all N-grams

In [25]:
train_ngram[0]

[('<sos>', '<sos>', 'this'),
 ('<sos>', 'this', 'is'),
 ('this', 'is', 'a'),
 ('is', 'a', 'great'),
 ('a', 'great', 'tutu'),
 ('great', 'tutu', 'and'),
 ('tutu', 'and', 'at'),
 ('and', 'at', 'a'),
 ('at', 'a', 'really'),
 ('a', 'really', 'great'),
 ('really', 'great', 'price'),
 ('great', 'price', '.'),
 ('price', '.', '<eos>'),
 ('.', '<eos>', '<eos>')]
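
Extracting the n-grams amounts to sliding a window of n tokens over the padded sentence. The sketch below assumes the sentence is already padded; `find_ngrams` is an illustrative name rather than the function used inside `ngram_utils`.

```python
def find_ngrams(padded_tokens, n):
    """Return every window of n consecutive tokens as a tuple."""
    return [
        tuple(padded_tokens[i:i + n])
        for i in range(len(padded_tokens) - n + 1)
    ]

padded = ["<sos>", "<sos>", "this", "is", "great", "<eos>", "<eos>"]
trigrams = find_ngrams(padded, n=3)
print(trigrams[0], trigrams[-1])
```

A padded sentence of length L yields L - n + 1 n-grams, which matches the 14 trigrams listed above for the 16-token padded sentence.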

In [26]:
vocab_ngram[0]

('.', '<eos>', '<eos>')

In [27]:
count_ngram[0]

9630

In [28]:
trie_ngram = train_ngram_lm.trie_ngram
# trie_ngram
# trie_prev_ngram = train_ngram_lm.trie_prev_ngram

In [29]:
trie_ngram['./<eos>/<eos>']

9630

In [30]:
id2token = train_ngram_lm.id2token
token2id = train_ngram_lm.token2id

In [31]:
id2token_ngram = train_ngram_lm.id2token_ngram
token2id_ngram = train_ngram_lm.token2id_ngram

In [32]:
random_token_id = random.randint(0, len(id2token_ngram) - 1)
random_token = id2token_ngram[random_token_id]

print ("Token id {} ; token {}".format(random_token_id, id2token_ngram[random_token_id]))
print ("Token {}; token id {}".format(random_token, token2id_ngram[random_token]))

Token id 15988 ; token ('wear', 'these', 'in')
Token ('wear', 'these', 'in'); token id 15988


### Ngram Count & Probability

In [33]:
# TODO: print the words for which the pd is nonzero !!! -- more intuitive than a list of numbers

In [34]:
vocab_ngram[:10], count_ngram[:10]

((('.', '<eos>', '<eos>'),
  ('<sos>', '<sos>', 'i'),
  ('<sos>', '<sos>', 'the'),
  ('!', '<eos>', '<eos>'),
  ('<sos>', '<sos>', 'they'),
  ('<sos>', '<sos>', 'it'),
  ('.', '.', '.'),
  ('<sos>', '<sos>', 'this'),
  ('.', '.', '<eos>'),
  ('<sos>', '<sos>', 'these')),
 (9630, 2692, 893, 840, 635, 543, 501, 411, 385, 374))

In [35]:
c = train_ngram_lm.get_ngram_count(('an', 'older', 'coat'))
p = train_ngram_lm.get_ngram_prob(('an', 'older', 'coat'))

p1 = train_ngram_lm.get_ngram_prob(('an', 'older', 'pc'))
p2 = train_ngram_lm.get_ngram_prob(('an', 'older', 'lady'))
p3 = train_ngram_lm.get_ngram_prob(('an', 'older', 'watch'))

pd = train_ngram_lm.get_prob_distr_ngram(('an', 'older'))

c, p, p1, p2, p3, sum(pd)#, pd

(1, 0.25, 0.25, 0.25, 0.0, 1.0)

In [36]:
c = train_ngram_lm.get_ngram_count(('really', 'great', 'price'))
p = train_ngram_lm.get_ngram_prob(('really', 'great', 'price'))
pd = train_ngram_lm.get_prob_distr_ngram(('really', 'great'))

c, p, sum(pd)#, pd 

(1, 0.14285714285714285, 1.0000000000000002)

In [37]:
c = train_ngram_lm.get_ngram_count(('really', 'great'))

c

0

In [38]:
c = train_ngram_lm.get_ngram_count(('.', '<eos>', '<eos>'))
p = train_ngram_lm.get_ngram_prob(('.', '<eos>', '<eos>'))
pd = train_ngram_lm.get_prob_distr_ngram(('.', '<eos>'))

c, p, sum(pd)#, pd

(9630, 1.0, 1.0)

In [39]:
c = train_ngram_lm.get_ngram_count(('.', '<sos>', '<sos>'))

c

0

In [40]:
c = train_ngram_lm.get_ngram_count(('i', 'like', 'pandas'))
p = train_ngram_lm.get_ngram_prob(('i', 'like', 'pandas'))
pd = train_ngram_lm.get_prob_distr_ngram(('i', 'like'))

c, p, sum(pd)#, pd

(0, 0, 1.0)

In [41]:
c = train_ngram_lm.get_ngram_count(('is', 'a', 'great'))
p = train_ngram_lm.get_ngram_prob(('is', 'a', 'great'))
pd = train_ngram_lm.get_prob_distr_ngram(('is', 'a'))

c, p, sum(pd)#, pd

(26, 0.09885931558935361, 1.0000000000000007)

In [42]:
c = train_ngram_lm.get_ngram_count(('send', 'it', 'back'))
p = train_ngram_lm.get_ngram_prob(('send', 'it', 'back'))
pd = train_ngram_lm.get_prob_distr_ngram(('send', 'it', 'back'))

c, p, sum(pd)#, pd

(2, 1.0, 0.9999999999999207)

In [43]:
c = train_ngram_lm.get_ngram_count(('i', 'like', 'these', 'pictures'))
p = train_ngram_lm.get_ngram_prob(('i', 'like', 'these', 'pictures'))
pd = train_ngram_lm.get_prob_distr_ngram(('i', 'like', 'these'))

c, p, sum(pd)#, pd

(0, 0, 0.9999999999999207)

## Add-One Smoothing

## $$P(w_i|w_{i-n+1}, \ldots, w_{i-2}, w_{i-1}) \approx \frac{1 + c(w_{i-n+1}, \ldots, w_{i-2}, w_{i-1}, w_i)}{\mid V\mid + \sum_{w \in V} c(w_{i-n+1}, \ldots, w_{i-2}, w_{i-1}, w)}$$


In [44]:
p = train_ngram_lm.get_ngram_prob_add_one_smoothing(('.', '<sos>', '<sos>'))
p

0.00014405070584845865

In [45]:
p = train_ngram_lm.get_ngram_prob_add_one_smoothing(('i', 'like', 'pandas'))
p

0.00014144271570014144

In [46]:
p = train_ngram_lm.get_ngram_prob_add_one_smoothing(('i', 'like', 'this'))
p

0.0014144271570014145

In [47]:
p = train_ngram_lm.get_ngram_prob_add_one_smoothing(('really', 'great', 'price'))
p

0.0002878111958555188

In [48]:
p = train_ngram_lm.get_ngram_prob_add_one_smoothing(('send', 'it', 'back'))
p

0.0004320276497695853

In [49]:
p = train_ngram_lm.get_ngram_prob_add_one_smoothing(('.', '<eos>', '<eos>'))
p

0.5811609944484672

## Additive Smoothing

## $$P(w_i|w_{i-n+1}, \ldots, w_{i-2}, w_{i-1}) \approx \frac{\delta + c(w_{i-n+1}, \ldots, w_{i-2}, w_{i-1}, w_i)}{\delta\mid V\mid + \sum_{w \in V} c(w_{i-n+1}, \ldots, w_{i-2}, w_{i-1}, w)}$$
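
The formula can be checked by hand with numbers reported earlier in this notebook: a vocabulary of 6942 words and a count of 9630 for `('.', '<eos>', '<eos>')`. The helper `additive_prob` is a hypothetical stand-in for the `NgramLM` methods, and the context count is inferred from the unsmoothed probability of 1.0 rather than read from the model.

```python
def additive_prob(ngram_count, context_count, vocab_size, delta=1.0):
    """(delta + c(context, w)) / (delta * |V| + c(context))."""
    return (delta + ngram_count) / (delta * vocab_size + context_count)

# Assumption: the context ('.', '<eos>') occurs 9630 times, because the
# trigram count is 9630 and its unsmoothed probability is 1.0.
vocab_size = 6942
p_add_one = additive_prob(9630, 9630, vocab_size, delta=1.0)
p_half = additive_prob(9630, 9630, vocab_size, delta=0.5)
p_unseen = additive_prob(0, 0, vocab_size, delta=0.5)
print(p_add_one, p_half, p_unseen)  # about 0.581, 0.735 and 0.000144
```

Setting `delta=1.0` recovers add-one smoothing, and an n-gram whose context was never seen gets the uniform probability 1 / |V| regardless of `delta`.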


In [50]:
p = train_ngram_lm.get_ngram_prob_additive_smoothing(('.', '<sos>', '<sos>'), delta = 0.5)
p

0.00014405070584845865

In [51]:
p = train_ngram_lm.get_ngram_prob_additive_smoothing(('i', 'like', 'pandas'), delta = 0.5)
p

0.0001389274798555154

In [52]:
p = train_ngram_lm.get_ngram_prob_additive_smoothing(('i', 'like', 'this'), delta = 0.5)
p

0.002639622117254793

In [53]:
p = train_ngram_lm.get_ngram_prob_additive_smoothing(('really', 'great', 'price'), delta = 0.5)
p

0.0004312823461759632

In [54]:
p = train_ngram_lm.get_ngram_prob_additive_smoothing(('send', 'it', 'back'), delta = 0.5)
p

0.0007198387561186294

In [55]:
p = train_ngram_lm.get_ngram_prob_additive_smoothing(('.', '<eos>', '<eos>'), delta = 0.5)
p

0.7350965575146935

### Changing the Parameter $\delta$

In [56]:
# small delta --> closer to no smoothing  (1.0)
p = train_ngram_lm.get_ngram_prob_additive_smoothing(('.', '<eos>', '<eos>'), delta = 0.1)
p

0.9327696092675462

In [57]:
# large delta --> closer to add-one smoothing (0.58)
p = train_ngram_lm.get_ngram_prob_additive_smoothing(('.', '<eos>', '<eos>'), delta = 0.9)
p

0.6065638816460719

## Linear Interpolation Smoothing (Jelinek-Mercer)

### $$P(w_i|w_{i-n+1}, \ldots, w_{i-2}, w_{i-1}) \approx \alpha_n P(w_i|w_{i-n+1}, \ldots, w_{i-2}, w_{i-1}) + (1 - \alpha_n) P(w_i|w_{i-n+2}, \ldots, w_{i-2}, w_{i-1})$$
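
A minimal sketch of the interpolation for a trigram model, mixing the unsmoothed trigram and bigram estimates. All names here (`mle`, `interpolated_trigram_prob`, the toy sentences) are illustrative assumptions; how `NgramLM` computes its lower-order term may differ.

```python
from collections import Counter

def mle(counts, context_counts, ngram):
    """Unsmoothed estimate c(context, w) / c(context); 0 if the context is unseen."""
    context = ngram[:-1]
    if context_counts[context] == 0:
        return 0.0
    return counts[ngram] / context_counts[context]

def interpolated_trigram_prob(trigram, tri, tri_ctx, bi, bi_ctx, alpha=0.8):
    """alpha * P(w | u, v) + (1 - alpha) * P(w | v)."""
    p_tri = mle(tri, tri_ctx, trigram)
    p_bi = mle(bi, bi_ctx, trigram[1:])
    return alpha * p_tri + (1 - alpha) * p_bi

sentences = [
    ["<sos>", "<sos>", "i", "like", "this", "<eos>"],
    ["<sos>", "<sos>", "we", "like", "it", "<eos>"],
]
tri, tri_ctx, bi, bi_ctx = Counter(), Counter(), Counter(), Counter()
for s in sentences:
    for i in range(len(s) - 2):
        tri[tuple(s[i:i + 3])] += 1
        tri_ctx[tuple(s[i:i + 2])] += 1
    for i in range(len(s) - 1):
        bi[tuple(s[i:i + 2])] += 1
        bi_ctx[tuple(s[i:i + 1])] += 1

# ("we", "like", "this") never occurs, but the bigram ("like", "this") does.
p = interpolated_trigram_prob(("we", "like", "this"), tri, tri_ctx, bi, bi_ctx)
print(p)
```

The unseen trigram still receives probability mass through the bigram term, which is the point of interpolating with a lower-order model.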


In [58]:
p = train_ngram_lm.get_ngram_prob_interpolation_smoothing(('.', '<sos>', '<sos>'), alpha = 0.8)
p

0.0

In [59]:
p = train_ngram_lm.get_ngram_prob_interpolation_smoothing(('i', 'like', 'pandas'), alpha = 0.8)
p

0.0

In [60]:
p = train_ngram_lm.get_ngram_prob_interpolation_smoothing(('i', 'like', 'this'), alpha = 0.8)
p

0.05625

In [61]:
p = train_ngram_lm.get_ngram_prob_interpolation_smoothing(('really', 'great', 'price'), alpha = 0.8)
p

0.11428571428571428

In [62]:
p = train_ngram_lm.get_ngram_prob_interpolation_smoothing(('send', 'it', 'back'), alpha = 0.8)
p

0.8

### Changing the Parameter $\alpha$

In [63]:
# large alpha --> more weight on the trigram estimate, closer to no smoothing (1.0)
p = train_ngram_lm.get_ngram_prob_interpolation_smoothing(('.', '<eos>', '<eos>'), alpha = 0.8)
p

0.8

In [64]:
# alpha = 0.5 --> equal weight on the trigram and lower-order estimates
p = train_ngram_lm.get_ngram_prob_interpolation_smoothing(('.', '<eos>', '<eos>'), alpha = 0.5)
p

0.5

In [65]:
# small alpha --> more weight on the lower-order estimate
p = train_ngram_lm.get_ngram_prob_interpolation_smoothing(('.', '<eos>', '<eos>'), alpha = 0.2)
p

0.2

## Linear Interpolation with Absolute Discounting

### $$p_{bi}(w|v) = \max\left(\frac{N(v, w) - b_{bi}}{N(v)}, 0\right) + b_{bi} \frac{V - N_0(v, \cdot)}{N(v)} p_{uni}(w)$$

### $$p_{uni}(w) = \max\left(\frac{N(w) - b_{uni}}{N}, 0\right) + b_{uni} \frac{V - N_0(\cdot)}{N} \frac{1}{V}$$

### $$b_{bi} = \frac{N_1(\cdot, \cdot)}{N_1(\cdot, \cdot) + 2*N_2(\cdot, \cdot)}$$

### $$b_{uni} = \frac{N_1(\cdot)}{N_1(\cdot) + 2*N_2(\cdot)}$$


### $$N_r(\cdot) = \sum_{w: N(w) = r} 1$$

### $$N_r(\cdot, \cdot) = \sum_{v, w: N(v, w) = r} 1$$

### $$N_r(v, \cdot) = \sum_{w: N(v, w) = r} 1$$

### V is the number of words in the vocabulary

### $N_r(\cdot, \cdot)$ and $N_r(\cdot)$ are the count-counts for bigrams and unigrams respectively


### Remember to check that probabilities sum up to one:
### $$\sum_w p_{bi}(w|v) = \sum_w p_{uni}(w) = 1$$
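
The formulas can be tried out on a toy corpus to confirm that both distributions sum to one. This is a sketch under stated assumptions: the token sequence and vocabulary are invented, $N(v)$ is taken to be $\sum_w N(v, w)$, and the fallback to the unigram model for a context with $N(v) = 0$ is an added choice that the formulas above do not specify.

```python
from collections import Counter

tokens = "a b a c a b d".split()
vocab = ["a", "b", "c", "d", "e"]  # "e" never occurs in the toy data
V = len(vocab)

uni = Counter(tokens)                  # N(w)
bi = Counter(zip(tokens, tokens[1:]))  # N(v, w)
ctx = Counter(tokens[:-1])             # N(v) = sum_w N(v, w)
N = sum(uni.values())

def discount(counts):
    """b = N_1 / (N_1 + 2 * N_2), computed from the count-counts."""
    count_counts = Counter(counts.values())
    n1, n2 = count_counts[1], count_counts[2]
    return n1 / (n1 + 2 * n2)

b_uni, b_bi = discount(uni), discount(bi)

def p_uni(w):
    seen_types = len(uni)  # V - N_0(.)
    return max((uni[w] - b_uni) / N, 0) + b_uni * seen_types / N / V

def p_bi(w, v):
    if ctx[v] == 0:
        # Assumption: use the unigram model for an unseen context,
        # which avoids dividing by N(v) = 0.
        return p_uni(w)
    seen_after_v = sum(1 for (u, _) in bi if u == v)  # V - N_0(v, .)
    return (max((bi[(v, w)] - b_bi) / ctx[v], 0)
            + b_bi * seen_after_v / ctx[v] * p_uni(w))

print(sum(p_uni(w) for w in vocab))
print(sum(p_bi(w, "a") for w in vocab))
```

Both sums should come out at 1 up to floating-point error, since the mass removed by the discount is exactly what the second term redistributes.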



In [66]:
y = "m"
x = "'"

z = train_ngram_lm.get_p_bi(y, x)
z

  b_bi * (W - N_0) / N_v * p_uni


inf

In [67]:
train_ngram[:3]

[[('<sos>', '<sos>', 'this'),
  ('<sos>', 'this', 'is'),
  ('this', 'is', 'a'),
  ('is', 'a', 'great'),
  ('a', 'great', 'tutu'),
  ('great', 'tutu', 'and'),
  ('tutu', 'and', 'at'),
  ('and', 'at', 'a'),
  ('at', 'a', 'really'),
  ('a', 'really', 'great'),
  ('really', 'great', 'price'),
  ('great', 'price', '.'),
  ('price', '.', '<eos>'),
  ('.', '<eos>', '<eos>')],
 [('<sos>', '<sos>', 'it'),
  ('<sos>', 'it', 'doesn'),
  ('it', 'doesn', "'"),
  ('doesn', "'", 't'),
  ("'", 't', 'look'),
  ('t', 'look', 'cheap'),
  ('look', 'cheap', 'at'),
  ('cheap', 'at', 'all'),
  ('at', 'all', '.'),
  ('all', '.', '<eos>'),
  ('.', '<eos>', '<eos>')],
 [('<sos>', '<sos>', 'i'),
  ('<sos>', 'i', "'"),
  ('i', "'", 'm'),
  ("'", 'm', 'so'),
  ('m', 'so', 'glad'),
  ('so', 'glad', 'i'),
  ('glad', 'i', 'looked'),
  ('i', 'looked', 'on'),
  ('looked', 'on', 'amazon'),
  ('on', 'amazon', 'and'),
  ('amazon', 'and', 'found'),
  ('and', 'found', 'such'),
  ('found', 'such', 'an'),
  ('such', 'an', 'af

## Kneser-Ney Smoothing (best to use in practice!) http://smithamilli.com/blog/kneser-ney/

### Bigram LM
###  $$p(s) = \prod_{i = 1} ^ {N + 1} p(w_i | w_{i-1})$$

## Likelihood of a Sentence

### Bigram LM: $$ p(i \; love \; this \; light) = p(i|\cdot) \; p(love|i)\;  p(this|love)\;  p(light|this) \\
\approx \frac{c(\cdot, i)}{\sum_w c(\cdot, \; w)} \; \frac{c(i, love)}{\sum_w c(i, \; w)}\;  \frac{c(love, this)}{\sum_w c(love, \;w)}\;  \frac{c(this, light)}{\sum_w c(this, \;w)}$$ 

### Trigram LM: $$ p(i \; love \; this  \;light) = p(i|\cdot, \cdot) \; p(love|\cdot, i) \; p(this|i, love)\;  p(light|love, this)$$ 



### Score Sentences

In [68]:
n = 3
sentence = [['this', 'is', 'a', 'great', 'tutu']]
print(sentence)
ps = train_ngram_lm.get_prob_sentence(sentence)
ss =  train_ngram_lm.get_score_sentence(sentence)
ps, ss

[['this', 'is', 'a', 'great', 'tutu']]


(0.0, 3.870959633291363e+77)

In [69]:
n = 3
sentence = [['this', 'is', 'a', 'great', 'tutu', 'and', 'at', 'a', 'really', 'great', 'price', '.']]
print(sentence)
ps = train_ngram_lm.get_prob_sentence(sentence)
ss = train_ngram_lm.get_score_sentence(sentence)
ps, ss

[['this', 'is', 'a', 'great', 'tutu', 'and', 'at', 'a', 'really', 'great', 'price', '.']]


(5.573729307593221e-11, 4.82603286530514)

## Sentence Generation

#### No Context

In [70]:
num_tokens = 5
generated_sentence = train_ngram_lm.generate_sentence(num_tokens)
generated_sentence


the
the soles
the soles are
the soles are soft
the soles are soft and


'the soles are soft and'
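
Generation of this kind can be sketched as repeated sampling of the next word in proportion to its count after the current context. Everything below (`build_successors`, `generate`, the toy corpus) is a hypothetical illustration; unlike the outputs in this notebook, the sketch stops as soon as it samples `<eos>`.

```python
import random
from collections import Counter, defaultdict

def build_successors(sentences, n):
    """Map each (n-1)-token context to a Counter of the words that follow it."""
    successors = defaultdict(Counter)
    for tokens in sentences:
        padded = ["<sos>"] * (n - 1) + tokens + ["<eos>"]
        for i in range(len(padded) - n + 1):
            context = tuple(padded[i:i + n - 1])
            successors[context][padded[i + n - 1]] += 1
    return successors

def generate(successors, n, num_tokens, rng):
    """Sample up to num_tokens words, one at a time, from the count-based model."""
    context = ("<sos>",) * (n - 1)
    generated = []
    for _ in range(num_tokens):
        counts = successors.get(context)
        if not counts:
            break  # unseen context: nothing to sample from
        words = list(counts)
        word = rng.choices(words, weights=[counts[w] for w in words])[0]
        if word == "<eos>":
            break
        generated.append(word)
        context = (context + (word,))[1:]  # keep the last n - 1 tokens
    return " ".join(generated)

toy_corpus = [
    ["the", "soles", "are", "soft"],
    ["the", "soles", "are", "great"],
    ["the", "price", "is", "great"],
]
successors = build_successors(toy_corpus, n=3)
print(generate(successors, n=3, num_tokens=5, rng=random.Random(1)))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible without touching the global random state.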

In [71]:
num_tokens = 10
generated_sentence = train_ngram_lm.generate_sentence(num_tokens, context=('i', 'like', 'the'))
generated_sentence


adjustment
adjustment on
adjustment on the
adjustment on the underwire
adjustment on the underwire and
adjustment on the underwire and a
adjustment on the underwire and a sundry
adjustment on the underwire and a sundry thing
adjustment on the underwire and a sundry thing or
adjustment on the underwire and a sundry thing or 2


'adjustment on the underwire and a sundry thing or 2'

In [72]:
num_tokens = 20
generated_sentence = train_ngram_lm.generate_sentence(num_tokens)
generated_sentence


they
they make
they make my
they make my legs
they make my legs you
they make my legs you '
they make my legs you ' ll
they make my legs you ' ll have
they make my legs you ' ll have to
they make my legs you ' ll have to sew
they make my legs you ' ll have to sew all
they make my legs you ' ll have to sew all of
they make my legs you ' ll have to sew all of my
they make my legs you ' ll have to sew all of my wardrobe
they make my legs you ' ll have to sew all of my wardrobe ,
they make my legs you ' ll have to sew all of my wardrobe , or
they make my legs you ' ll have to sew all of my wardrobe , or used
they make my legs you ' ll have to sew all of my wardrobe , or used as
they make my legs you ' ll have to sew all of my wardrobe , or used as a
they make my legs you ' ll have to sew all of my wardrobe , or used as a mistake


"they make my legs you ' ll have to sew all of my wardrobe , or used as a mistake"

#### With Context

In [73]:
num_tokens = 5
generated_sentence = train_ngram_lm.generate_sentence(num_tokens, context=('i', 'like', 'the'))
generated_sentence


picture
picture you
picture you think
picture you think that
picture you think that with


'picture you think that with'

In [74]:
num_tokens = 10
generated_sentence = train_ngram_lm.generate_sentence(num_tokens, context=('i', 'like', 'the'))
generated_sentence


material
material washes
material washes beautlfully
material washes beautlfully .
material washes beautlfully . <eos>
material washes beautlfully . <eos> <eos>
material washes beautlfully . <eos> <eos> inside
material washes beautlfully . <eos> <eos> inside or
material washes beautlfully . <eos> <eos> inside or counted
material washes beautlfully . <eos> <eos> inside or counted boobs


'material washes beautlfully . <eos> <eos> inside or counted boobs'

In [75]:
num_tokens = 5
generated_sentence = train_ngram_lm.generate_sentence(num_tokens, context=('i', 'like', 'the'))
generated_sentence

well
well made
well made as
well made as the
well made as the price


'well made as the price'

In [76]:
num_tokens = 5
generated_sentence = train_ngram_lm.generate_sentence(num_tokens, context=('i', 'will', 'buy'))
generated_sentence

more
more in
more in pink
more in pink ,
more in pink , no


'more in pink , no'

In [77]:
num_tokens = 5
generated_sentence = train_ngram_lm.generate_sentence(num_tokens, context=('i',))
generated_sentence


got
got this
got this when
got this when i
got this when i did


'got this when i did'

In [78]:
num_tokens = 10
generated_sentence = train_ngram_lm.generate_sentence(num_tokens, context=('i',))
generated_sentence


am
am so
am so impressed
am so impressed with
am so impressed with it
am so impressed with it she
am so impressed with it she loved
am so impressed with it she loved carrying
am so impressed with it she loved carrying it
am so impressed with it she loved carrying it around


'am so impressed with it she loved carrying it around'

In [79]:
num_tokens = 10
generated_sentence = train_ngram_lm.generate_sentence(num_tokens, context=('this', 'is', 'the', 'best', 'i'))
generated_sentence


'
' d
' d highly
' d highly recommend
' d highly recommend .
' d highly recommend . <eos>
' d highly recommend . <eos> <eos>
' d highly recommend . <eos> <eos> permapress
' d highly recommend . <eos> <eos> permapress studio
' d highly recommend . <eos> <eos> permapress studio absolutely


"' d highly recommend . <eos> <eos> permapress studio absolutely"

In [80]:
num_tokens = 5
generated_sentence = train_ngram_lm.generate_sentence(num_tokens, context=('this', 'is', 'the', 'best', 'i'))
generated_sentence


'
' d
' d ever
' d ever worn
' d ever worn .


"' d ever worn ."

In [81]:
num_tokens = 5
generated_sentence = train_ngram_lm.generate_sentence(num_tokens, context=('this', 'is', 'not', 'what', 'i'))
generated_sentence


really
really like
really like it
really like it .
really like it . <eos>


'really like it . <eos>'

## Log-Likelihood (n-gram)
## $$LL = \sum_{j=1}^{K} \sum_{i=1}^{T_j + 1} \log p(w_{j, i} | w_{j, i - n + 1}, \ldots, w_{j, i - 2}, w_{j, i - 1})$$

## Perplexity
## $$PP = \exp\left(-\frac{LL}{\sum_j(T_j + 1)}\right)$$
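
To make the definition concrete, here is a small sketch that computes perplexity from a list of per-token probabilities, one entry for each of the $T_j + 1$ predictions in every sentence. The function `perplexity` is illustrative and is not the `get_perplexity` method used below.

```python
import math

def perplexity(token_probs):
    """exp(-LL / count), where LL is the sum of log-probabilities."""
    log_likelihood = sum(math.log(p) for p in token_probs)
    return math.exp(-log_likelihood / len(token_probs))

# A model that assigns probability 1/4 to every token has perplexity 4.
uniform_ppl = perplexity([0.25] * 10)
# A single very small probability inflates the result sharply.
skewed_ppl = perplexity([0.25] * 9 + [1e-12])
print(uniform_ppl, skewed_ppl)
```

Because the average is taken over log-probabilities, a handful of near-zero predictions on held-out data can dominate the perplexity, which is one reason unsmoothed models score poorly on validation text.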

In [82]:
ppl_train = train_ngram_lm.get_perplexity(train_data_tokenized)
ppl_valid = train_ngram_lm.get_perplexity(valid_data_tokenized)


In [83]:
ppl_valid, ppl_train

(4.003496405825398e+17, 725.1099435816803)

### Let's Compare Different Smoothing Techniques

In [84]:
# No Smoothing
train_ngram_lm_no_smoothing = NgramLM(train_data_tokenized, all_tokens_train, n=3)
valid_ngram_lm_no_smoothing = NgramLM(valid_data_tokenized, all_tokens_valid, n=3)

ppl_train_no_smoothing = train_ngram_lm_no_smoothing.get_perplexity(train_data_tokenized)
ppl_valid_no_smoothing = train_ngram_lm_no_smoothing.get_perplexity(valid_data_tokenized)

ppl_valid_no_smoothing, ppl_train_no_smoothing


(4.003496405825398e+17, 725.1099435816803)

In [85]:
# Additive Smoothing
train_ngram_lm_additive = NgramLM(train_data_tokenized, all_tokens_train, n=3, smoothing='additive', delta=0.5)
valid_ngram_lm_additive = NgramLM(valid_data_tokenized, all_tokens_valid, n=3, smoothing='additive', delta=0.5)

ppl_train_no_additive = train_ngram_lm_additive.get_perplexity(train_data_tokenized)
ppl_valid_no_additive = train_ngram_lm_additive.get_perplexity(valid_data_tokenized)

ppl_valid_no_additive, ppl_train_no_additive


(2244.9288157583414, 1078.084826605658)

In [86]:
# Additive Smoothing
train_ngram_lm_additive_d2 = NgramLM(train_data_tokenized, all_tokens_train, n=3, smoothing='additive', delta=0.2)
valid_ngram_lm_additive_d2 = NgramLM(valid_data_tokenized, all_tokens_valid, n=3, smoothing='additive', delta=0.2)

ppl_train_no_additive_d2 = train_ngram_lm_additive_d2.get_perplexity(train_data_tokenized)
ppl_valid_no_additive_d2 = train_ngram_lm_additive_d2.get_perplexity(valid_data_tokenized)

ppl_valid_no_additive_d2, ppl_train_no_additive_d2


(1708.4563801973802, 556.8177307958886)

In [87]:
# Additive Smoothing
train_ngram_lm_additive_d8 = NgramLM(train_data_tokenized, all_tokens_train, n=3, smoothing='additive', delta=0.8)
valid_ngram_lm_additive_d8 = NgramLM(valid_data_tokenized, all_tokens_valid, n=3, smoothing='additive', delta=0.8)

ppl_train_no_additive_d8 = train_ngram_lm_additive_d8.get_perplexity(train_data_tokenized)
ppl_valid_no_additive_d8 = train_ngram_lm_additive_d8.get_perplexity(valid_data_tokenized)

ppl_valid_no_additive_d8, ppl_train_no_additive_d8


(2587.3964397085933, 1468.6501405429465)

In [88]:
# Add-One Smoothing
train_ngram_lm_add1 = NgramLM(train_data_tokenized, all_tokens_train, n=3, smoothing='add-one')
valid_ngram_lm_add1 = NgramLM(valid_data_tokenized, all_tokens_valid, n=3, smoothing='add-one')

ppl_train_no_add1 = train_ngram_lm_add1.get_perplexity(train_data_tokenized)
ppl_valid_no_add1 = train_ngram_lm_add1.get_perplexity(valid_data_tokenized)

ppl_valid_no_add1, ppl_train_no_add1


(2766.7454147375506, 1684.9294317536671)

In [89]:
# Interpolation Smoothing
train_ngram_lm_interp_a2 = NgramLM(train_data_tokenized, all_tokens_train, n=3, smoothing='interpolation', alpha=0.2)
valid_ngram_lm_interp_a2 = NgramLM(valid_data_tokenized, all_tokens_valid, n=3, smoothing='interpolation', alpha=0.2)

ppl_train_no_interp_a2 = train_ngram_lm_interp_a2.get_perplexity(train_data_tokenized)
ppl_valid_no_interp_a2 = train_ngram_lm_interp_a2.get_perplexity(valid_data_tokenized)

ppl_valid_no_interp_a2, ppl_train_no_interp_a2


(4.120249201591078e+17, 3260.7947470396057)

In [90]:
# Interpolation Smoothing
train_ngram_lm_interp_a8 = NgramLM(train_data_tokenized, all_tokens_train, n=3, smoothing='interpolation', alpha=0.8)
valid_ngram_lm_interp_a8 = NgramLM(valid_data_tokenized, all_tokens_valid, n=3, smoothing='interpolation', alpha=0.8)

ppl_train_no_interp_a8 = train_ngram_lm_interp_a8.get_perplexity(train_data_tokenized)
ppl_valid_no_interp_a8 = train_ngram_lm_interp_a8.get_perplexity(valid_data_tokenized)

ppl_valid_no_interp_a8, ppl_train_no_interp_a8


(4.0194841118222176e+17, 893.1597201353827)

In [91]:
# Interpolation Smoothing
train_ngram_lm_interp_a5 = NgramLM(train_data_tokenized, all_tokens_train, n=3, smoothing='interpolation', alpha=0.5)
valid_ngram_lm_interp_a5 = NgramLM(valid_data_tokenized, all_tokens_valid, n=3, smoothing='interpolation', alpha=0.5)

ppl_train_no_interp_a5 = train_ngram_lm_interp_a5.get_perplexity(train_data_tokenized)
ppl_valid_no_interp_a5 = train_ngram_lm_interp_a5.get_perplexity(valid_data_tokenized)

ppl_valid_no_interp_a5, ppl_train_no_interp_a5


(4.053367926970881e+17, 1385.4824458368155)

In [92]:
# # Discounted Interpolation Smoothing
# train_ngram_lm_discount = NgramLM(train_data_tokenized, all_tokens_train, n=3, smoothing='discounting')
# valid_ngram_lm_discount = NgramLM(valid_data_tokenized, all_tokens_valid, n=3, smoothing='discounting')

# ppl_train_no_discount = train_ngram_lm_discount.get_perplexity(train_data_tokenized)
# ppl_valid_no_discount = train_ngram_lm_discount.get_perplexity(valid_data_tokenized)

# ppl_valid_no_discount, ppl_train_no_discount


### Vary n in the n-grams

In [93]:
# Interpolation Smoothing, N = 2
train_ngram_lm_interp2 = NgramLM(train_data_tokenized, all_tokens_train, n=2, smoothing='interpolation')
valid_ngram_lm_interp2 = NgramLM(valid_data_tokenized, all_tokens_valid, n=2, smoothing='interpolation')

ppl_train_no_interp2 = train_ngram_lm_interp2.get_perplexity(train_data_tokenized)
ppl_valid_no_interp2 = train_ngram_lm_interp2.get_perplexity(valid_data_tokenized)

ppl_valid_no_interp2, ppl_train_no_interp2


(1.5296236594505036e+16, 4396.493002816503)

In [94]:
# Interpolation Smoothing, N = 3
train_ngram_lm_interp3 = NgramLM(train_data_tokenized, all_tokens_train, n=3, smoothing='interpolation')
valid_ngram_lm_interp3 = NgramLM(valid_data_tokenized, all_tokens_valid, n=3, smoothing='interpolation')

ppl_train_no_interp3 = train_ngram_lm_interp3.get_perplexity(train_data_tokenized)
ppl_valid_no_interp3 = train_ngram_lm_interp3.get_perplexity(valid_data_tokenized)

ppl_valid_no_interp3, ppl_train_no_interp3


(4.0194841118222176e+17, 893.1597201353827)

In [95]:
# Interpolation Smoothing, N = 5
train_ngram_lm_interp5 = NgramLM(train_data_tokenized, all_tokens_train, n=5, smoothing='interpolation')
valid_ngram_lm_interp5 = NgramLM(valid_data_tokenized, all_tokens_valid, n=5, smoothing='interpolation')

ppl_train_no_interp5 = train_ngram_lm_interp5.get_perplexity(train_data_tokenized)
ppl_valid_no_interp5 = train_ngram_lm_interp5.get_perplexity(valid_data_tokenized)

ppl_valid_no_interp5, ppl_train_no_interp5


(1.1201894610153348e+18, 227.42393815798513)

In [96]:
# Interpolation Smoothing, N = 10
train_ngram_lm_interp10 = NgramLM(train_data_tokenized, all_tokens_train, n=10, smoothing='interpolation')
valid_ngram_lm_interp10 = NgramLM(valid_data_tokenized, all_tokens_valid, n=10, smoothing='interpolation')

ppl_train_no_interp10 = train_ngram_lm_interp10.get_perplexity(train_data_tokenized)
ppl_valid_no_interp10 = train_ngram_lm_interp10.get_perplexity(valid_data_tokenized)

ppl_valid_no_interp10, ppl_train_no_interp10


(1.1524271783534917e+18, 196.61911426440474)

In [97]:
# Interpolation Smoothing, N = 20
train_ngram_lm_interp20 = NgramLM(train_data_tokenized, all_tokens_train, n=20, smoothing='interpolation')
valid_ngram_lm_interp20 = NgramLM(valid_data_tokenized, all_tokens_valid, n=20, smoothing='interpolation')

ppl_train_no_interp20 = train_ngram_lm_interp20.get_perplexity(train_data_tokenized)
ppl_valid_no_interp20 = train_ngram_lm_interp20.get_perplexity(valid_data_tokenized)

ppl_valid_no_interp20, ppl_train_no_interp20


(1.1552387706052442e+18, 208.770758577245)

### Exercise: do the above for additive smoothing with delta 0.1 or smaller

In [98]:
# Additive Smoothing, N = 5, delta = 0.1
train_ngram_lm_add5 = NgramLM(train_data_tokenized, all_tokens_train, n=5, smoothing='additive', delta=0.1)
valid_ngram_lm_add5 = NgramLM(valid_data_tokenized, all_tokens_valid, n=5, smoothing='additive', delta=0.1)

ppl_train_no_add5 = train_ngram_lm_add5.get_perplexity(train_data_tokenized)
ppl_valid_no_add5 = train_ngram_lm_add5.get_perplexity(valid_data_tokenized)

ppl_valid_no_add5, ppl_train_no_add5


(7536.774576449134, 1041.422926258175)

### Sentence Probabilities

In [99]:
sentence = [['this', 'is', 'a', 'great', 'tutu']]
print(sentence)
ps = train_ngram_lm.get_prob_sentence(sentence)
ss = train_ngram_lm.get_score_sentence(sentence)
ps, ss

[['this', 'is', 'a', 'great', 'tutu']]


(0.0, 3.870959633291363e+77)

In [100]:
sentence = [['this', 'is', 'a', 'great', 'tutu', 'and', 'at', 'a', 'really', 'great', 'price', '.']]
print(sentence)
ps = train_ngram_lm_interp3.get_prob_sentence(sentence)
ss = train_ngram_lm_interp3.get_score_sentence(sentence)
ps, ss

[['this', 'is', 'a', 'great', 'tutu', 'and', 'at', 'a', 'really', 'great', 'price', '.']]


(2.45135207350985e-12, 5.943463782512477)

In [101]:
sentence = [['this', 'is', 'a', 'great', 'tutu', 'and', 'at', 'a', 'really', 'great', 'price', '.']]
print(sentence)
ps = train_ngram_lm_interp5.get_prob_sentence(sentence)
ss = train_ngram_lm_interp5.get_score_sentence(sentence)
ps, ss

[['this', 'is', 'a', 'great', 'tutu', 'and', 'at', 'a', 'really', 'great', 'price', '.']]


(2.3425643029349238e-06, 2.1438510902596253)

In [102]:
sentence = [['this', 'is', 'a', 'great', 'tutu', 'and', 'at', 'a', 'really', 'great', 'price', '.']]
print(sentence)
ps = train_ngram_lm_interp10.get_prob_sentence(sentence)
ss = train_ngram_lm_interp10.get_score_sentence(sentence)
ps, ss

[['this', 'is', 'a', 'great', 'tutu', 'and', 'at', 'a', 'really', 'great', 'price', '.']]


(9.423439850812819e-07, 1.8788822770702536)

In [103]:
sentence = [['this', 'is', 'a', 'great', 'tutu']]
print(sentence)
ps = train_ngram_lm_interp10.get_prob_sentence(sentence)
ss = train_ngram_lm_interp10.get_score_sentence(sentence)
ps, ss

[['this', 'is', 'a', 'great', 'tutu']]


(0.0, 7.761435089767163e+184)

In [104]:
sentence = [['this', 'is', 'a', 'great', 'tutu', 'and', 'at', 'a', 'really', 'great', 'price', '.']]
print(sentence)
ps = train_ngram_lm_additive.get_prob_sentence(sentence)
ss = train_ngram_lm_additive.get_score_sentence(sentence)
ps, ss

[['this', 'is', 'a', 'great', 'tutu', 'and', 'at', 'a', 'really', 'great', 'price', '.']]


(6.192884995700733e-35, 190.78334792964762)

In [105]:
sentence = [['this', 'is', 'a', 'great', 'tutu', 'and', 'at', 'a', 'really', 'great', 'price', '.']]
print(sentence)
ps = train_ngram_lm_add5.get_prob_sentence(sentence)
ss = train_ngram_lm_add5.get_score_sentence(sentence)
ps, ss

[['this', 'is', 'a', 'great', 'tutu', 'and', 'at', 'a', 'really', 'great', 'price', '.']]


(5.744821411697747e-34, 90.2270326215453)

In [106]:
sentence = [['this', 'is', 'a', 'great', 'tutu']]
print(sentence)
ps = train_ngram_lm_add5.get_prob_sentence(sentence)
ss = train_ngram_lm_add5.get_score_sentence(sentence)
ps, ss

[['this', 'is', 'a', 'great', 'tutu']]


(3.7391449700741474e-24, 220.15206814674661)

### Sentence Generation

In [107]:
num_tokens = 10
generated_sentence = train_ngram_lm_interp5.generate_sentence(num_tokens)
generated_sentence

i
i am
i am very
i am very happy
i am very happy !
i am very happy ! <eos>
i am very happy ! <eos> <eos>
i am very happy ! <eos> <eos> <eos>
i am very happy ! <eos> <eos> <eos> <eos>
i am very happy ! <eos> <eos> <eos> <eos> yaaay


'i am very happy ! <eos> <eos> <eos> <eos> yaaay'
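
The repeated `<eos>` tokens in the output above suggest that generation keeps sampling after the end-of-sentence token until `num_tokens` words have been produced. As a minimal sketch of the alternative, the toy bigram sampler below draws each next word in proportion to its count and stops as soon as `<eos>` is sampled. The corpus, the `next_counts` table, and the `generate` function are hypothetical illustrations, not the notebook's `generate_sentence` implementation.

```python
import random
from collections import Counter, defaultdict

# Toy corpus; hypothetical data, not the notebook's review dataset.
corpus = [
    ['i', 'am', 'very', 'happy', '!'],
    ['i', 'like', 'these', '.'],
    ['they', 'are', 'very', 'attractive', '.'],
]

# next_counts[prev][cur] = number of times cur followed prev.
next_counts = defaultdict(Counter)
for sent in corpus:
    tokens = ['<sos>'] + sent + ['<eos>']
    for prev, cur in zip(tokens, tokens[1:]):
        next_counts[prev][cur] += 1

def generate(max_tokens, rng, context='<sos>'):
    """Sample up to max_tokens words from the bigram counts, stopping at <eos>."""
    generated = []
    prev = context
    for _ in range(max_tokens):
        candidates = next_counts.get(prev, Counter())
        if not candidates:
            break  # context never seen: nothing to sample from
        words = list(candidates)
        weights = [candidates[w] for w in words]
        cur = rng.choices(words, weights=weights, k=1)[0]
        if cur == '<eos>':
            break  # end the sentence instead of emitting <eos> repeatedly
        generated.append(cur)
        prev = cur
    return ' '.join(generated)

rng = random.Random(1)  # fixed seed so the sample is reproducible
sample = generate(10, rng)
print(sample)
```

Passing a different `context`, for example `generate(10, rng, context='i')`, conditions the first sampled word on that context instead of on `<sos>`.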

In [108]:
num_tokens = 10
generated_sentence = train_ngram_lm_interp3.generate_sentence(num_tokens)
generated_sentence


i
i ordered
i ordered but
i ordered but other
i ordered but other materials
i ordered but other materials work
i ordered but other materials work great
i ordered but other materials work great once
i ordered but other materials work great once they
i ordered but other materials work great once they are


'i ordered but other materials work great once they are'

In [109]:
num_tokens = 10
generated_sentence = train_ngram_lm_additive.generate_sentence(num_tokens)
generated_sentence


i
i bought
i bought about
i bought about 10
i bought about 10 minutes
i bought about 10 minutes walking
i bought about 10 minutes walking .
i bought about 10 minutes walking . <eos>
i bought about 10 minutes walking . <eos> <eos>
i bought about 10 minutes walking . <eos> <eos> sensor


'i bought about 10 minutes walking . <eos> <eos> sensor'

In [110]:
num_tokens = 10
generated_sentence = train_ngram_lm_interp10.generate_sentence(num_tokens)
generated_sentence


they
they are
they are very
they are very attractive
they are very attractive and
they are very attractive and you
they are very attractive and you can
they are very attractive and you can '
they are very attractive and you can ' t
they are very attractive and you can ' t go


"they are very attractive and you can ' t go"

In [111]:
num_tokens = 10
generated_sentence = train_ngram_lm_interp2.generate_sentence(num_tokens)
generated_sentence


these
these right
these right out
these right out after
these right out after using
these right out after using this
these right out after using this for
these right out after using this for the
these right out after using this for the prices
these right out after using this for the prices are


'these right out after using this for the prices are'

In [112]:
num_tokens = 5
generated_sentence = train_ngram_lm_interp10.generate_sentence(num_tokens)
generated_sentence


bought
bought these
bought these to
bought these to hold
bought these to hold up


'bought these to hold up'

In [113]:
num_tokens = 20
generated_sentence = train_ngram_lm_interp10.generate_sentence(num_tokens)
generated_sentence


i
i like
i like these
i like these .
i like these . <eos>
i like these . <eos> <eos>
i like these . <eos> <eos> <eos>
i like these . <eos> <eos> <eos> <eos>
i like these . <eos> <eos> <eos> <eos> <eos>
i like these . <eos> <eos> <eos> <eos> <eos> <eos>
i like these . <eos> <eos> <eos> <eos> <eos> <eos> <eos>
i like these . <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos>
i like these . <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos>
i like these . <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> home
i like these . <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> home delighted
i like these . <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> home delighted prefer
i like these . <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> home delighted prefer done
i like these . <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> home delighted prefer done quilted
i like these . <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> home delighted prefer done quilted fyi
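i like these . <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> home delighted prefer done quilted fyi determination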

'i like these . <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> home delighted prefer done quilted fyi determination'

In [1]:
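# Requires train_ngram_lm_interp10 from the training cells above;
# re-run them first if the kernel has been restarted.
num_tokens = 20
generated_sentence = train_ngram_lm_interp10.generate_sentence(num_tokens, context=('i', 'hate'))
generated_sentence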


NameError: name 'train_ngram_lm_interp10' is not defined