<a href="https://colab.research.google.com/github/kyunghyuncho/ammi-2019-nlp/blob/master/01-day-LM/ngram_lm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Language Modeling

### Goal: compute a probability distribution over all possible sentences:


### $$p(W) = p(w_1, w_2, ..., w_T)$$

### This unsupervised learning problem can be framed as a sequence of supervised learning problems:

### $$p(W) = p(w_1) * p(w_2|w_1) * ... * p(w_T|w_1, ..., w_{T-1})$$

### If we have N sentences $W^1, ..., W^N$, each with T words / tokens, then we want to maximize the log-likelihood:

### $$\sum_{n = 1}^N \log p(W^n) = \sum_{n = 1}^N \sum_{i=1}^{T} \log p(w_i^n | w_{<i}^n)$$




# N-gram language model

### Goal: estimate the n-gram probabilities using counts of sequences of n consecutive words

### Given a sequence of words $w$, we want to compute

###  $$P(w_i|w_{i-1}, w_{i-2}, ..., w_{i-n+1})$$

### Where $w_i$ is the i-th word of the sequence.

### $$P(w_i|w_{i-n+1}, ..., w_{i-2}, w_{i-1}) = \frac{p(w_{i-n+1}, ..., w_{i-2}, w_{i-1}, w_i)}{\sum_{w \in V} p(w_{i-n+1}, ..., w_{i-2}, w_{i-1}, w)}$$

### Key Idea: We can estimate the probabilities using counts of n-grams in our dataset 


In [27]:
# TODOs
#: implement the neural LM with concat instead of summation -- so that you have a fixed input etc.
# make a separate
# create some slides with pictures maybe explaining the model visualizations -- line by line
# get google cloud working
# make it work on gpu
# show them kenlm and how to use to do different stuff with it
# use the same sentences to generation and testing etc.
# explain perplexity
# ngram, ff, rnn, rnn+attention
# do sentence generation
# do long sentences
# compare different n-grams -- 2,3,more

In [1]:
import os
import sys
sys.path.append('utils/')

### Install if needed

TODO: should we install as needed and import as needed or all at once?

### Imports

In [46]:
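import random  # used below to pick a random n-gram id

import torch  # used below for torch.manual_seed

from utils import ngram_utils as ngram_utils
import utils.global_variables as gl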

In [21]:
torch.manual_seed(1)


<torch._C.Generator at 0x7f5b0c0bfb50>

### Load Data from .txt Files

In [22]:
# Read data from .txt files and create lists of reviews

train_data = []
# create a list of all the reviews 
with open('../data/amazon_reviews_clothing_train.txt', 'r') as f:
    train_data = [review for review in f.read().split('\n') if review]

test_data = []
# create a list of all the reviews 
with open('../data/amazon_reviews_clothing_test.txt', 'r') as f:
    test_data = [review for review in f.read().split('\n') if review]
    
valid_data = []
# create a list of all the reviews 
with open('../data/amazon_reviews_clothing_valid.txt', 'r') as f:
    valid_data = [review for review in f.read().split('\n') if review]
    

In [23]:
# type(train_data), len(train_data), \
# type(train_data[0]), len(train_data[0]), \
# type(train_data[0][0]), len(train_data[0][0])

In [24]:
train_data[0], train_data[0][0]


("this is a great tutu and at a really great price . it doesn ' t look cheap at all . i ' m so glad i looked on amazon and found such an affordable tutu that isn ' t made poorly . a + + ",
 't')

### Process the Data

In [26]:
# # TODO: for now only work with small subset of the data -- switch to all data later
# train_data = train_data[:800]
# test_data = test_data[:100]
# valid_data = valid_data[:100]

In [27]:
type(train_data), type(train_data[0]), type(train_data[0][0])

(list, str, str)

In [28]:
# Tokenize the Datasets
# TODO: this takes a really long time !! why?
train_data_tokenized, all_tokens_train = ngram_utils.tokenize_dataset(train_data)
test_data_tokenized, all_tokens_test = ngram_utils.tokenize_dataset(test_data)
valid_data_tokenized, all_tokens_valid = ngram_utils.tokenize_dataset(valid_data)


Let's look at the tokenized data!

In [29]:
# # Number of All Tokens
# len(all_tokens_train), all_tokens_train[0], \
# len(train_data_tokenized), train_data_tokenized[0]

#### Build the Vocabulary 


In [30]:
# Build a vocabulary from all the unique tokens found in the train data
voc = list(set(all_tokens_train))
print('Word vocabulary size: {} words'.format(len(voc)))        

Word vocabulary size: 71128 words


### CORPUS ANALYSIS (Train Data)

#### Number of Tokens in the Corpus Data


In [31]:
print("Number of All Tokens ", len(all_tokens_train))

Number of All Tokens  16412250


In [32]:
print("Number of All UNIQUE Tokens ", len(voc))

Number of All UNIQUE Tokens  71128


#### Number of Sentences in the Train Data


In [33]:
print("Number of Sentences ", len(train_data_tokenized))

Number of Sentences  222919


## N-grams

In [34]:
n = 3 # trigrams

### Function for padding the sentences with special markers for the sentence beginning and end, i.e. $<bos>$ and $<eos>$

In [35]:
train_padded = ngram_utils.pad_sentences(train_data_tokenized, n)
valid_padded = ngram_utils.pad_sentences(valid_data_tokenized, n)
test_padded = ngram_utils.pad_sentences(test_data_tokenized, n)
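
The source of `ngram_utils.pad_sentences` is not shown in this notebook. As a rough sketch only, assuming the common convention of prepending n-1 beginning markers and appending one end marker (the name `pad_sentences_sketch` is hypothetical and the real helper may behave differently):

```python
def pad_sentences_sketch(tokenized_sentences, n, bos="<bos>", eos="<eos>"):
    """Pad each tokenized sentence with n-1 <bos> markers and one <eos> marker.

    Assumed convention for illustration, not necessarily what ngram_utils does.
    """
    return [[bos] * (n - 1) + list(sentence) + [eos] for sentence in tokenized_sentences]


padded_example = pad_sentences_sketch([["i", "like", "this"]], n=3)
print(padded_example)
```

With this convention every real word, including the first one, has a full context of n-1 preceding tokens.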

In [36]:
# train_padded[:2]

### Function for finding all N-grams

### Function for Getting N-gram counts for already tokenized data

In [37]:
n = 3
train_padded = ngram_utils.pad_sentences(train_data_tokenized, n)
train_ngram = ngram_utils.find_ngrams(train_padded, n)
vocab_ngram, count_ngram = ngram_utils.ngram_counts(train_ngram)
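
To make the two steps above concrete, here is a small sketch of n-gram extraction and counting on a toy sentence. The function `find_ngrams_sketch` is hypothetical, and a `collections.Counter` is swapped in for the separate `vocab` / `count` objects returned by `ngram_utils.ngram_counts`, whose internals are not shown here.

```python
from collections import Counter


def find_ngrams_sketch(padded_sentences, n):
    """Return every run of n consecutive tokens, as tuples, across all sentences."""
    ngrams = []
    for sentence in padded_sentences:
        # zip over n shifted views of the sentence yields its sliding windows
        ngrams.extend(zip(*(sentence[i:] for i in range(n))))
    return ngrams


toy_padded = [["<bos>", "<bos>", "i", "like", "this", "<eos>"]]
toy_trigrams = find_ngrams_sketch(toy_padded, 3)
toy_counts = Counter(toy_trigrams)  # maps each n-gram tuple to its frequency

print(toy_trigrams)
print(toy_counts[("i", "like", "this")], toy_counts[("i", "like", "pandas")])
```

A `Counter` returns 0 for an n-gram that never occurred, which matches the behaviour one would want from a count lookup.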

In [38]:
# train_padded, train_ngram, vocab_ngram, count_ngram

#### Trigrams, Bigrams, Unigrams

In [39]:
train_padded_trigram = ngram_utils.pad_sentences(train_data_tokenized, 3)
train_trigram = ngram_utils.find_ngrams(train_padded_trigram, 3)
vocab_trigram, count_trigram = ngram_utils.ngram_counts(train_trigram)

train_padded_bigram = ngram_utils.pad_sentences(train_data_tokenized, 2)
train_bigram = ngram_utils.find_ngrams(train_padded_bigram, 2)
vocab_bigram, count_bigram = ngram_utils.ngram_counts(train_bigram)

train_padded_unigram = ngram_utils.pad_sentences(train_data_tokenized, 1)
train_unigram = ngram_utils.find_ngrams(train_padded_unigram, 1)
vocab_unigram, count_unigram = ngram_utils.ngram_counts(train_unigram)


In [40]:
# vocab_bigram[:3], count_bigram[:3]

In [41]:
# vocab_unigram[:3], count_unigram[:3]

### Function for Getting N-gram Dict

In [43]:
id2token_ngram, token2id_ngram = ngram_utils.ngram_dict(vocab_ngram)

In [57]:
# id2token_ngram[:10], \
# token2id_ngram['<unk>'], token2id_ngram['<eos>'], token2id_ngram[('rosetta', 'stone')]

In [58]:
random_token_id = random.randint(0, len(id2token_ngram) - 1)
random_token = id2token_ngram[random_token_id]

print ("Token id {} ; token {}".format(random_token_id, id2token_ngram[random_token_id]))
print ("Token {}; token id {}".format(random_token, token2id_ngram[random_token]))

Token id 1910618 ; token ('protects', 'my', 'baby')
Token ('protects', 'my', 'baby'); token id 1910618


In [59]:
# # Function that combines all the above and goes from tokenized data to the ngram dataset
# def create_id_dataset(data, n):
#     padded_data = pad_sentences(data, n)
#     ngram_data = find_ngrams(padded_data, n)
    
#     vocab, count = ngram_counts(ngram_data, n)    
#     id2token, token2id = ngram_dict(vocab)
    
#     data_id = create_data_id(ngram_data, token2id)
#     data_id_merged = create_data_id_merged(data_id, token2id, n)
    
#     return data_id, data_id_merged

In [60]:
# all_data_id, all_data_id_merged = create_id_dataset(train_data_tokenized, n)

### Ngram Counts

In [61]:
# vocab_ngram[:10], count_ngram[:10]

In [62]:
c = ngram_utils.get_ngram_count(('i', 'like', 'this'), vocab_ngram, count_ngram)
c

1081

In [63]:
c = ngram_utils.get_ngram_count(('i', 'like', 'pandas'), vocab_ngram, count_ngram)
c

0

### Function for computing the probability of a sentence

## N-gram Probabilities

## $$P(w_i|w_{i-n+1}, ..., w_{i-2}, w_{i-1}) \approx \frac{c(w_{i-n+1}, ..., w_{i-2}, w_{i-1}, w_i)}{\sum_{w \in V} c(w_{i-n+1}, ..., w_{i-2}, w_{i-1}, w)}$$


## Bigram Probabilities

## $$p(w_i | w_{i-1}) = \frac{c(w_{i-1}, w_i)}{\sum_{w \in V} c(w_{i-1}, w)} $$
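
A minimal sketch of this count-based estimate on a toy corpus (all names here are hypothetical and independent of `ngram_utils`). The denominator depends only on the previous word, so it is accumulated once into a dictionary; lookups are then constant time instead of a scan over the whole vocabulary.

```python
from collections import Counter

toy_sentences = [
    ["<bos>", "i", "like", "this", "<eos>"],
    ["<bos>", "i", "like", "that", "<eos>"],
    ["<bos>", "i", "love", "this", "<eos>"],
]

bigram_counts = Counter(
    (prev, word) for s in toy_sentences for prev, word in zip(s, s[1:])
)
# sum_w c(prev, w) depends only on prev, so precompute it once per context
context_counts = Counter()
for (prev, _), c in bigram_counts.items():
    context_counts[prev] += c


def bigram_prob_mle(prev, word):
    """c(prev, word) / sum_w c(prev, w); returns 0.0 for an unseen context."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]


p_seen = bigram_prob_mle("i", "like")      # 2 of the 3 bigrams starting with "i"
p_unseen = bigram_prob_mle("i", "pandas")  # never observed, so the estimate is zero
print(p_seen, p_unseen)
```

Returning 0.0 for an unseen context is a choice made for this sketch; the zero estimates for unseen events are exactly what the smoothing methods in the following sections address.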


In [64]:
p = ngram_utils.get_ngram_prob(('rosetta', 'stone', 'is'), vocab_ngram, count_ngram)
p

# p = get_ngram_prob(('i', 'am', 'rosetta'), vocab_ngram, count_ngram)
# p

# p = get_ngram_prob(('it', "'", 's'), vocab_ngram, count_ngram)
# p

# p = get_ngram_prob(('i', "like", 'this'), vocab_ngram, count_ngram)
# p, 1/(2+1+1+1+1)

KeyboardInterrupt: 

In [None]:
p = ngram_utils.get_ngram_prob(('am', 'rosetta', 'stone'), vocab_ngram, count_ngram)
p

## Additive Smoothing

In [None]:
p = ngram_utils.get_ngram_prob_addditive_smoothing(('am', 'rosetta', 'stone'), vocab_ngram, count_ngram, delta=0.5)
p

## Add-One Smoothing

In [None]:
p = ngram_utils.get_ngram_prob_add_one_smoothing(('am', 'rosetta', 'stone'), vocab_ngram, count_ngram)
p
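
For reference, additive smoothing adds a constant $\delta$ to every count, giving $p(w | context) = \frac{c(context, w) + \delta}{\sum_{w'} c(context, w') + \delta |V|}$, and add-one smoothing is the special case $\delta = 1$. The sketch below illustrates the formula on hand-made counts; the function `additive_prob` and its arguments are hypothetical and are not the signatures used in `ngram_utils`.

```python
from collections import Counter


def additive_prob(ngram, ngram_counts, context_counts, vocab_size, delta=0.5):
    """(c(ngram) + delta) / (c(context) + delta * |V|)."""
    context = ngram[:-1]
    return (ngram_counts[ngram] + delta) / (context_counts[context] + delta * vocab_size)


toy_vocab = ["i", "like", "love", "this", "that", "<eos>"]
toy_counts = Counter({("i", "like"): 2, ("i", "love"): 1})
toy_context = Counter({("i",): 3})  # total count of bigrams starting with "i"

seen = additive_prob(("i", "like"), toy_counts, toy_context, len(toy_vocab), delta=0.5)
unseen = additive_prob(("i", "this"), toy_counts, toy_context, len(toy_vocab), delta=0.5)
add_one = additive_prob(("i", "this"), toy_counts, toy_context, len(toy_vocab), delta=1.0)
print(seen, unseen, add_one)
```

Unseen n-grams now receive a small non-zero probability, and the distribution over the vocabulary still sums to one.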

### Linear Interpolation Smoothing

#### TODO: add formula

In [None]:
# train_padded_trigram = ngram_utils.pad_sentences(train_data_tokenized, 3)
# train_trigram = ngram_utils.find_ngrams(train_padded_trigram, 3)
# vocab_trigram, count_trigram = ngram_utils.ngram_counts(train_trigram)

# train_padded_bigram = ngram_utils.pad_sentences(train_data_tokenized, 2)
# train_bigram = ngram_utils.find_ngrams(train_padded_bigram, 2)
# vocab_bigram, count_bigram = ngram_utils.ngram_counts(train_bigram)

# train_padded_unigram = ngram_utils.pad_sentences(train_data_tokenized, 1)
# train_unigram = ngram_utils.find_ngrams(train_padded_unigram, 1)
# vocab_unigram, count_unigram = ngram_utils.ngram_counts(train_unigram)

In [None]:
p = ngram_utils.get_ngram_prob_interpolation_smoothing(('am', 'rosetta', 'stone'), vocab_trigram, count_trigram, vocab_bigram, count_bigram, alpha=0.8)
p
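
One common form of linear interpolation mixes the higher-order and lower-order estimates with a single weight, $p_{interp}(w | u, v) = \alpha \, p_{tri}(w | u, v) + (1 - \alpha) \, p_{bi}(w | v)$. Whether `get_ngram_prob_interpolation_smoothing` uses exactly this form is an assumption, since its source is not shown here; the sketch below only illustrates the mixing step with hypothetical inputs.

```python
def interpolate(p_trigram, p_bigram, alpha=0.8):
    """Mix a trigram estimate with a bigram estimate; alpha weights the trigram."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * p_trigram + (1.0 - alpha) * p_bigram


# Hypothetical estimates: the trigram was never seen, the bigram was.
p_mixed = interpolate(p_trigram=0.0, p_bigram=0.25, alpha=0.8)
print(p_mixed)
```

Because both inputs are proper distributions over the next word, any $\alpha$ in [0, 1] gives a proper distribution as well.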

### Smoothing: Linear Interpolation with Absolute Discounting

### $$p_{bi}(w|v) = \max \left( \frac{N(v, w) - b_{bi}}{N(v)}, 0 \right) + b_{bi} \frac{V - N_0(v, \cdot)}{N(v)} p_{uni}(w)$$

### $$p_{uni}(w) = \max \left( \frac{N(w) - b_{uni}}{N}, 0 \right) + b_{uni} \frac{V - N_0(\cdot)}{N} \frac{1}{V}$$

### $$b_{bi} = \frac{N_1(\cdot, \cdot)}{N_1(\cdot, \cdot) + 2*N_2(\cdot, \cdot)}$$

### $$b_{uni} = \frac{N_1(\cdot)}{N_1(\cdot) + 2*N_2(\cdot)}$$


### $$N_r(\cdot) = \sum_{w: N(w) = r} 1$$

### $$N_r(\cdot, \cdot) = \sum_{v, w: N(v, w) = r} 1$$

### $$N_r(v, \cdot) = \sum_{w: N(v, w) = r} 1$$

### V is the number of words in the vocabulary

### $N_r(\cdot, \cdot)$ and $N_r(\cdot)$ are the count-counts for bigrams and unigrams respectively


In [None]:
x = 'stone'
y = 'rosetta'

z = get_p_bi(y, x)
z

### Let's check that the probabilities sum up to one
### $$\sum_w p_{bi}(w|v) = \sum_w p_{uni}(w) = 1$$



TODO: add this check or leave as homework
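
One way to run this check is on a toy corpus. The sketch below implements the formulas for $p_{bi}$, $p_{uni}$, $b_{bi}$ and $b_{uni}$ given above with plain dictionaries; every name in it is hypothetical and independent of `ngram_utils`, and $N(v)$ is taken to be $\sum_w N(v, w)$.

```python
from collections import Counter

toy_corpus = [["i", "like", "this"], ["i", "like", "that"], ["i", "love", "this"]]
padded = [["<bos>"] + s + ["<eos>"] for s in toy_corpus]

bigram_counts = Counter((v, w) for s in padded for v, w in zip(s, s[1:]))
unigram_counts = Counter(w for s in padded for w in s[1:])  # tokens that get predicted
context_counts = Counter()  # N(v) = sum_w N(v, w)
successors = Counter()      # V - N_0(v, .): distinct words seen after v
for (v, w), c in bigram_counts.items():
    context_counts[v] += c
    successors[v] += 1

vocab = sorted(unigram_counts)
V = len(vocab)
N = sum(unigram_counts.values())


def discount(counts):
    """b = N_1 / (N_1 + 2 * N_2), computed from the count-counts of `counts`."""
    n1 = sum(1 for c in counts.values() if c == 1)
    n2 = sum(1 for c in counts.values() if c == 2)
    return n1 / (n1 + 2 * n2)  # assumes some count equals 1 or 2


b_bi, b_uni = discount(bigram_counts), discount(unigram_counts)


def p_uni(w):
    seen_types = len(unigram_counts)  # V - N_0(.)
    return max((unigram_counts[w] - b_uni) / N, 0.0) + b_uni * seen_types / N / V


def p_bi(w, v):
    """p_bi(w | v) for a context v that occurs in the training data."""
    discounted = max((bigram_counts[(v, w)] - b_bi) / context_counts[v], 0.0)
    backoff_weight = b_bi * successors[v] / context_counts[v]
    return discounted + backoff_weight * p_uni(w)


uni_total = sum(p_uni(w) for w in vocab)
bi_totals = {v: sum(p_bi(w, v) for w in vocab) for v in context_counts}
print(uni_total, bi_totals)
```

The probability mass removed by the discount from the seen events is exactly the mass handed to the lower-order distribution, which is why each total comes out as 1 up to floating-point error.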

### Bigram LM
###  $$p(s) = \prod_{i = 1} ^ {N + 1} p(w_i | w_{i-1})$$

### Likelihood of a Sentence

In [None]:
n = 3
sentence = [['this', 'is', 'a', 'great', 'tutu']]
print(sentence)
ps = ngram_utils.get_prob_sentence(sentence, vocab_ngram, count_ngram, n)
ps

### Examples
### Bigram LM: $$ p(i \; love \; this \; light) = p(i|\cdot) \; p(love|i)\;  p(this|love)\;  p(light|this) \\
\approx \frac{c(\cdot, \; i)}{\sum_w c(\cdot, \; w)} \; \frac{c(i, \; love)}{\sum_w c(i, \; w)}\;  \frac{c(love, \; this)}{\sum_w c(love, \; w)}\;  \frac{c(this, \; light)}{\sum_w c(this, \; w)}$$ 

### Trigram LM: $$ p(i \; love \; this  \;light) = p(i|\cdot, \cdot) \; p(love|\cdot, i) \; p(this|i, love)\;  p(light|love, this)$$ 
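
The bigram example can be reproduced on a toy corpus. This is a sketch with hypothetical names, using unsmoothed count ratios and summing log-probabilities to avoid underflow on long sentences. It also includes the final $p(<eos> | light)$ term, following the product up to $N + 1$ in the Bigram LM formula.

```python
import math
from collections import Counter

toy_sentences = [
    ["<bos>", "i", "love", "this", "light", "<eos>"],
    ["<bos>", "i", "love", "this", "dress", "<eos>"],
    ["<bos>", "i", "like", "this", "light", "<eos>"],
]
bigram_counts = Counter((v, w) for s in toy_sentences for v, w in zip(s, s[1:]))
context_counts = Counter()
for (v, _), c in bigram_counts.items():
    context_counts[v] += c


def sentence_log_prob(tokens):
    """Sum of log p(w_i | w_{i-1}) over a sentence padded with <bos> and <eos>."""
    padded = ["<bos>"] + tokens + ["<eos>"]
    total = 0.0
    for v, w in zip(padded, padded[1:]):
        count = bigram_counts[(v, w)]
        if count == 0:
            return float("-inf")  # unsmoothed: one unseen bigram zeroes the product
        total += math.log(count / context_counts[v])
    return total


log_p = sentence_log_prob(["i", "love", "this", "light"])
log_p_unseen = sentence_log_prob(["i", "hate", "this"])
print(math.exp(log_p), log_p_unseen)
```

Here the only factors below one are $p(love | i) = 2/3$ and $p(light | this) = 2/3$, so the sentence probability is their product.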



In [44]:
# prob distr for the word following prev_tokens (i.e. tutu) 
# over all the words in the vocabulary 

# prev_tokens = train_data_tokenized[0][4] #[0]
prev_tokens = vocab_ngram[3][1:]   # drop the first word so the remaining n-1 words form a valid context
print(prev_tokens)
pd = ngram_utils.get_prob_distr_ngram(prev_tokens, vocab_ngram, count_ngram, voc, print_nonzero_probs=True)
sum(pd)#, pd

("'", 's')


ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3291, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-44-5f5789e3ef63>", line 7, in <module>
    pd = utils.get_prob_distr_ngram(prev_tokens, vocab_ngram, count_ngram, voc, print_nonzero_probs=True)
  File "/home/jupyter/AMMI/ammi-2019-nlp/01-day-LM/utils/ngram_utils.py", line 262, in get_prob_distr_ngram
    pd[idx] = get_ngram_prob(token_ngram, vocab_ngram, count_ngram)
  File "/home/jupyter/AMMI/ammi-2019-nlp/01-day-LM/utils/ngram_utils.py", line 106, in get_ngram_prob
    all_counts += get_ngram_count(t, vocab, count)
  File "/home/jupyter/AMMI/ammi-2019-nlp/01-day-LM/utils/ngram_utils.py", line 94, in get_ngram_count
    if ngram in vocab:
KeyboardInterrupt

KeyboardInterrupt: 

In [None]:
print(prev_tokens)
next_token = ngram_utils.sample_from_pd(prev_tokens, vocab_ngram, count_ngram, voc, print_nonzero_probs=True)
next_token

### Sentence Generation

In [None]:
num_tokens = 5
generated_sentence = ngram_utils.generate_sentence(num_tokens, vocab_ngram, count_ngram, voc, n)
generated_sentence

In [None]:
num_tokens = 5
generated_sentence = ngram_utils.generate_sentence(num_tokens, vocab_ngram, count_ngram, voc, n)
generated_sentence
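
Generation amounts to repeatedly sampling the next word from the conditional distribution given the previous words. Below is a minimal bigram sketch on a toy corpus; the names are hypothetical and `ngram_utils.generate_sentence` may work differently. Sampling directly from the observed successor counts avoids building a distribution over the entire vocabulary at every step.

```python
import random
from collections import Counter, defaultdict

toy_sentences = [
    ["<bos>", "i", "love", "this", "light", "<eos>"],
    ["<bos>", "i", "love", "this", "dress", "<eos>"],
    ["<bos>", "i", "like", "this", "light", "<eos>"],
]
next_counts = defaultdict(Counter)  # next_counts[prev][word] = c(prev, word)
for s in toy_sentences:
    for prev, word in zip(s, s[1:]):
        next_counts[prev][word] += 1


def generate_sketch(max_tokens, rng):
    """Sample words one at a time from p(. | previous word) until <eos> or max_tokens."""
    prev, words = "<bos>", []
    for _ in range(max_tokens):
        candidates = list(next_counts[prev].keys())
        weights = list(next_counts[prev].values())  # counts act as unnormalized probabilities
        prev = rng.choices(candidates, weights=weights, k=1)[0]
        if prev == "<eos>":
            break
        words.append(prev)
    return words


generated = generate_sketch(5, random.Random(1))
print(generated)
```

Passing an explicit `random.Random` instance keeps the sampling reproducible without touching the global random state.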


In [None]:
# TODOs
# show rank for each word in a sentence
# explain perplexity 

### Log-Likelihood
### $LL = \sum_{k=1}^{K} \sum_{n=1}^{N_k + 1} \log p_{bi}(w_{k,n} | w_{k,n-1})$

### Perplexity

### $PP = \exp(-\frac{LL}{\sum_k(N_k + 1)})$
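
As a sanity check on the formula, a model that assigns probability 1/4 to every prediction should have a perplexity of 4. The sketch below computes perplexity from per-sentence log-likelihoods; the function and its inputs are hypothetical and unrelated to `ngram_utils.get_perplexity`.

```python
import math


def perplexity(sentence_log_probs, sentence_lengths):
    """exp(-LL / sum_k (N_k + 1)); the +1 counts the <eos> prediction of each sentence."""
    total_ll = sum(sentence_log_probs)
    total_predictions = sum(length + 1 for length in sentence_lengths)
    return math.exp(-total_ll / total_predictions)


# Hypothetical model that assigns probability 1/4 to every prediction.
lengths = [3, 5]
log_probs = [(n + 1) * math.log(0.25) for n in lengths]
ppl = perplexity(log_probs, lengths)
print(ppl)
```

Perplexity can therefore be read as the effective number of equally likely choices the model faces at each step, with lower values indicating a better fit.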

In [137]:
ppl_test = ngram_utils.get_perplexity(test_data_tokenized, vocab_ngram, count_ngram)
ppl_valid = ngram_utils.get_perplexity(valid_data_tokenized, vocab_ngram, count_ngram)
ppl_train = ngram_utils.get_perplexity(train_data_tokenized, vocab_ngram, count_ngram)


KeyboardInterrupt: 

In [65]:
ppl_test, ppl_valid, ppl_train
# TODO check whether this makes sense -- maybe it seems too good?

NameError: name 'ppl_test' is not defined

#### Let's look at some examples and see if they make sense