Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and NetID below:

In [1]:
NAME = "Punit K. Jha"
NetID = "punit2"

---

Lab 3: Language Modeling
=============
In this problem set, your objective is to train a language model, evaluate it and explore how it can be used for language generation. Towards that end you will:

- Train an n-gram language model.
- Use that language model to generate representative sentences.
- Study the effect of training data size, and language model complexity (n-gram size), on the modeling capacity of a language model.

- **To submit this assignment, rename the whole directory as your NetID. Compress the whole directory using tar or zip, and submit ```Your_NetID.tgz``` or ```Your_NetID.zip``` on Compass.**

Total points: 100 points

# 0. Setup

In order to develop this assignment, you will need [python 3.6](https://www.python.org/downloads/) and the following libraries. Most if not all of these are part of [anaconda](https://www.continuum.io/downloads), so a good starting point would be to install that.

- [jupyter](http://jupyter.readthedocs.org/en/latest/install.html)
- [nosetests](https://nose.readthedocs.org/en/latest/)
- [nltk](https://www.nltk.org)

Here is some help on installing packages in python: https://packaging.python.org/installing/. You can use ```pip install --user``` to install locally without sudo.

In [2]:
import sys
from importlib import reload
from collections import defaultdict

In [3]:
print('My Python version')

print('python: {}'.format(sys.version))

My Python version
python: 3.7.4 (default, Aug 13 2019, 20:35:49) 
[GCC 7.3.0]


In [4]:
import nose
import nltk

In [5]:
print('My library versions')

print('nose: {}'.format(nose.__version__))
print('nltk: {}'.format(nltk.__version__))

My library versions
nose: 1.3.7
nltk: 3.4.5


To test whether your libraries are the right version, run:

`nosetests tests/test_environment.py`

In [6]:
! nosetests tests/test_environment.py

.
----------------------------------------------------------------------
Ran 1 test in 0.001s

OK


# 1. Training a language model

Let us first train a 3-gram language model. We need a monolingual corpus, which we will get using nltk.

Total: 40 points

Let us first extract two corpora from nltk's Reuters corpus, one for each of two domains (here, subject areas): the food industry and the natural resources industry.

In [7]:
import nltk

food = ['barley', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copra-cake', 'grain', 'groundnut', 'groundnut-oil', 'potato', 'soy-meal', 'soy-oil', 'soybean', 'sugar', 'sun-meal', 'sun-oil', 'sunseed', 'tea', 'veg-oil', 'wheat']
natural_resources = ['alum', 'fuel', 'gas', 'gold', 'iron-steel', 'lead', 'nat-gas', 'palladium', 'propane', 'tin', 'zinc']
corpus = nltk.corpus.reuters
food_corpus = corpus.raw(categories=food)
natr_corpus = corpus.raw(categories=natural_resources)

## Tokenization

Your first task is to tokenize the raw text into a list of sentences, each of which is in turn a list of words. No other preprocessing, such as lowercasing, is needed.

- **Deliverable 1.1**: Complete the function `ece365lib.train.tokenize_corpus`. (5 points)
- **Test**: `nosetests tests/test_train.py:test_d1_1_tk`
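As a rough illustration of the target shape (a list of sentences, each a list of word tokens), here is a minimal sketch. It uses simple regular expressions as a stand-in, which is an assumption made only to keep the example self-contained; it is not meant to reproduce the graded function, and the name `simple_tokenize` is hypothetical. A real tokenizer such as the one shipped with nltk handles abbreviations like 'U.S.' far better.

```python
import re

def simple_tokenize(raw_text):
    """Split raw text into sentences, then each sentence into word tokens."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", raw_text.strip())
    # Words may keep internal hyphens, periods and apostrophes;
    # any other punctuation mark becomes its own token.
    token_pattern = re.compile(r"\w+(?:[-.']\w+)*|[^\w\s]")
    return [token_pattern.findall(s) for s in sentences if s]

print(simple_tokenize("Wheat prices rose. Exports fell 5 pct."))
```

Note that no lowercasing is applied, in line with the task description.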

In [8]:
from ece365lib import train
# nltk.download('punkt')
reload(train);

In [9]:
food_corpus_tk = train.tokenize_corpus(food_corpus)
natr_corpus_tk = train.tokenize_corpus(natr_corpus)
# food_corpus

In [10]:
! nosetests tests/test_train.py:test_d1_1_tk

.
----------------------------------------------------------------------
Ran 1 test in 4.986s

OK


## Padding

Your second task is to pad your sentences with the start-of-sentence symbol `'<s>'` and end-of-sentence symbol `'</s>'`. These symbols are necessary to model the probability of words that usually start a sentence and those that usually end a sentence.

- **Deliverable 1.2**: Complete the function `ece365lib.train.pad_corpus`. (5 points)
- **Test**: `nosetests tests/test_train.py:test_d1_2_pad`
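The padding step can be sketched in a few lines. This is only an illustration under the assumption that one start symbol and one end symbol are added per sentence; the function name `pad_sentences` is hypothetical and is not part of `ece365lib`.

```python
def pad_sentences(sentences, start='<s>', end='</s>'):
    """Return new sentences wrapped in start and end symbols (input is not modified)."""
    return [[start] + list(sentence) + [end] for sentence in sentences]

toy = [['Wheat', 'exports', 'rose', '.']]
print(pad_sentences(toy))
```

Building new lists, rather than calling `insert` and `append` on the originals, keeps the unpadded corpus available for later cells.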

In [11]:
reload(train);
# for x in food_corpus_tk:
#     x.insert(0, '<s>')
#     x.append('</s>')
#     print(x)

In [12]:
food_corpus_tk_pd = train.pad_corpus(food_corpus_tk)
natr_corpus_tk_pd = train.pad_corpus(natr_corpus_tk)

In [13]:
print(len(food_corpus_tk_pd[45]))
print(len(food_corpus_tk[45]))
print(food_corpus_tk_pd[45])
print(food_corpus_tk[45])

23
21
['<s>', 'Australia', 'and', 'Canada', 'could', 'then', 'increase', 'their', 'wheat', 'exports', 'as', 'they', 'are', 'more', 'competitive', 'than', 'the', 'U.S.', ',', 'He', 'said', '.', '</s>']
['Australia', 'and', 'Canada', 'could', 'then', 'increase', 'their', 'wheat', 'exports', 'as', 'they', 'are', 'more', 'competitive', 'than', 'the', 'U.S.', ',', 'He', 'said', '.']


In [14]:
! nosetests tests/test_train.py:test_d1_2_pad

.
----------------------------------------------------------------------
Ran 1 test in 5.155s

OK


## Train-Test Split

Your third task is to split the corpora into train, for training the language model, and test, for testing the language model. We will go with the traditional 80% (train), 20% (test) split. The first `floor(0.8*num_of_tokens)` should constitute the training corpus, and the rest should constitute the test corpus.

- **Deliverable 1.3**: Complete the function `ece365lib.train.split_corpus`. (5 points)
- **Test**: `nosetests tests/test_train.py:test_d1_3_spc`
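A minimal sketch of an 80/20 split follows. It assumes the cut is made over the list of sentences, mirroring the slicing used later in this notebook; whether the graded function cuts on sentences or on individual tokens is not confirmed here, and the name `split_80_20` is hypothetical.

```python
import math

def split_80_20(sentences, train_fraction=0.8):
    """Return (train, test), where train is the first floor(fraction * N) items."""
    cut = math.floor(train_fraction * len(sentences))
    return sentences[:cut], sentences[cut:]

train_part, test_part = split_80_20([['s%d' % i] for i in range(10)])
print(len(train_part), len(test_part))
```

Taking the first part of the corpus rather than a random sample keeps the split deterministic, so results are repeatable across runs.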

In [15]:
reload(train);
# print(len(food_corpus_tk_pd))
# total=0
# for x in food_corpus_tk_pd:
#     total+=len(x)
# print(total)
# print(food_corpus_tk_pd)

In [16]:
food_corpus_tr, food_corpus_te = train.split_corpus(food_corpus_tk_pd)
natr_corpus_tr, natr_corpus_te = train.split_corpus(natr_corpus_tk_pd)

In [17]:
! nosetests tests/test_train.py:test_d1_3_spc

.
----------------------------------------------------------------------
Ran 1 test in 5.192s

OK


## Splitting into n-grams

Your fourth task is to count n-grams in the text up to a specific order.

- **Deliverable 1.4**: Complete the function `ece365lib.train.count_ngrams`. (20 points)
- **Test**: `nosetests tests/test_train.py:test_d1_4_cn`
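Counting n-grams "up to a specific order" means counting all 1-grams, 2-grams, and so on up to that order. Below is a minimal sketch, assuming counts are stored in a dictionary keyed by tuples of words and that the vocabulary is the set of distinct tokens. The name `count_up_to` is hypothetical, and the exact return types expected by the tests may differ.

```python
from collections import defaultdict

def count_up_to(sentences, max_order):
    """Count every n-gram of order 1..max_order; also collect the vocabulary."""
    counts = defaultdict(int)
    vocab = set()
    for sentence in sentences:
        vocab.update(sentence)
        for n in range(1, max_order + 1):
            # Slide a window of width n over the sentence.
            for i in range(len(sentence) - n + 1):
                counts[tuple(sentence[i:i + n])] += 1
    return counts, vocab

toy = [['<s>', 'a', 'b', 'a', 'b', '</s>']]
toy_counts, toy_vocab = count_up_to(toy, 2)
print(toy_counts[('a', 'b')], len(toy_vocab))
```

Using a `defaultdict` means that looking up an unseen n-gram returns 0 instead of raising a `KeyError`.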

In [72]:
reload(train);
from nltk import ngrams

# n = 3
# sixgrams = ngrams(food_corpus_tk[1].split(), n)
# unique_data = [list(x) for x in set(tuple(x) for x in food_corpus_tr)]
# len(unique_data)
# print(unique_data)
# flat_list = [item for sublist in natr_corpus_tr for item in sublist]
# vocab=list(set(flat_list))
# print(len(vocab))
# print(vocab)
# print(food_corpus_tk[1])
# print(sixgrams)
allo=[]
n = 2
for x in food_corpus_tk:
    allo.append(list(ngrams(x, n)))
flat_list = [item for sublist in allo for item in sublist]
print(len(allo[1]))
print(allo[1])
# for x in allo:
#     print(x)


38
[('It', 'also'), ('also', 'said'), ('said', 'that'), ('that', 'each'), ('each', 'year'), ('year', '1.575'), ('1.575', 'mln'), ('mln', 'tonnes'), ('tonnes', ','), (',', 'or'), ('or', '25'), ('25', 'pct'), ('pct', ','), (',', 'of'), ('of', 'China'), ('China', "'s"), ("'s", 'fruit'), ('fruit', 'output'), ('output', 'are'), ('are', 'left'), ('left', 'to'), ('to', 'rot'), ('rot', ','), (',', 'and'), ('and', '2.1'), ('2.1', 'mln'), ('mln', 'tonnes'), ('tonnes', ','), (',', 'or'), ('or', 'up'), ('up', 'to'), ('to', '30'), ('30', 'pct'), ('pct', ','), (',', 'of'), ('of', 'its'), ('its', 'vegetables'), ('vegetables', '.')]


In [19]:
# import collections
# freqqq = collections.Counter(flat_list)
# z=0
# for x,y in dict(freqqq).items():
#     z+=1
#     print(x,y,z)

In [73]:
food_ngrams, food_vocab = train.count_ngrams(food_corpus_tr, 3)
natr_ngrams, natr_vocab = train.count_ngrams(natr_corpus_tr, 3)

In [64]:
! nosetests tests/test_train.py:test_d1_4_cn

.
----------------------------------------------------------------------
Ran 1 test in 6.366s

OK


## Estimating n-gram probability

Your last task in this part of the problem set is to estimate the n-gram probabilities $p(w_i \mid w_{i-n+1}, w_{i-n+2}, \ldots, w_{i-1})$. For the purposes of this exercise we will use the maximum likelihood estimate and perform no smoothing.

- **Deliverable 1.5**: Complete the function `ece365lib.train.estimate`. (5 points)
- **Test**: `nosetests tests/test_train.py:test_d1_5_es`
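The maximum likelihood estimate is a ratio of counts: the count of the full n-gram divided by the total count of all n-grams that share the same context, $p(w \mid h) = c(h, w) / \sum_{w'} c(h, w')$. The sketch below illustrates this on toy counts; the function name `mle_probability` and the toy numbers are assumptions for illustration, not the graded implementation.

```python
def mle_probability(counts, word, context):
    """Unsmoothed MLE of p(word | context) from a dict of n-gram counts."""
    context = tuple(context)
    numerator = counts.get(context + (word,), 0)
    # Sum the counts of every n-gram that extends the context by one word.
    denominator = sum(
        c for ngram, c in counts.items()
        if len(ngram) == len(context) + 1 and ngram[:-1] == context
    )
    return numerator / denominator if denominator else 0.0

toy_counts = {
    ('producer', 'of', 'palm'): 2,
    ('producer', 'of', 'wheat'): 6,
    ('of', 'palm'): 11,
}
print(mle_probability(toy_counts, 'palm', ['producer', 'of']))
```

Without smoothing, any n-gram that never occurred in training gets probability 0, which is why a smoothed model is used for evaluation later on.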

In [65]:
reload(train);
from collections import defaultdict
# print(natr_ngrams)

# Manually collect the counts behind p('palm' | 'producer', 'of') as a sanity check.
N = 0.0        # total count of all unigrams
one_g = 1      # count of ('palm',)
two_g = 1      # count of ('of', 'palm')
N_two = 0.0    # total count of bigrams starting with 'of'
three_g = 1    # count of ('producer', 'of', 'palm')
N_three = 0.0  # total count of trigrams starting with ('producer', 'of')
for i, j in food_ngrams.items():
    if len(i) == 1:
        N += j
        if i[0] == 'palm':
            one_g = j
    if len(i) == 2:
        if i[0] == 'of':
            N_two += j
            if i[1] == 'palm':
                two_g = j
    if len(i) == 3:
        if i[0] == 'producer' and i[1] == 'of':
            N_three += j
            if i[2] == 'palm':
                three_g = j

print(N)
print(one_g)
print("one_g/N", one_g / N)
print("N_two", N_two)
print("two_g", two_g)
print("two_g/N_two", two_g / N_two)
print("N_three", N_three)
print("three_g", three_g)
print("three_g/N_three", three_g / N_three)
print(len(food_ngrams))

371778.0
194
one_g/N 0.0005218167831340209
N_two 3960.0
two_g 11
two_g/N_two 0.002777777777777778
N_three 8.0
three_g 2
three_g/N_three 0.25
219782


In [66]:
print(train.estimate(food_ngrams, ['palm'], ['producer', 'of']))
print(train.estimate(natr_ngrams, ['basis'], ['tested', 'the']))

8.0 2
0.25
2.0 1
0.5


In [67]:
! nosetests tests/test_train.py:test_d1_5_es

.
----------------------------------------------------------------------
Ran 1 test in 6.274s

OK


Application: a speech recognition system takes human speech as its input and outputs text. If two words are pronounced similarly, a language model can help decide which one to choose!

In [75]:
# print(type(food_ngrams))
print(food_ngrams[('there', 'is', 'no')])
print(food_ngrams[('their', 'is', 'no')])

11
0


Given the count of 'there is no' and 'their is no', which word ('there' or 'their') is more likely to be taken as the output? 

A language model is helpful not only in speech recognition, but also in text generation (*e.g.*, machine translation, summarization, image captioning), spelling correction, and so on.
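The 'there' versus 'their' comparison above can be sketched as a tiny decision rule: among candidate words that sound alike, keep the one whose n-gram was seen most often. The counts below are the two values printed above; the helper name `pick_by_count` is hypothetical, and a real recognizer would combine language model probabilities with acoustic scores rather than use raw counts alone.

```python
# Counts copied from the notebook output above.
trigram_counts = {('there', 'is', 'no'): 11, ('their', 'is', 'no'): 0}

def pick_by_count(candidates, following, counts):
    """Choose the candidate whose n-gram with the following words is most frequent."""
    following = tuple(following)
    return max(candidates, key=lambda w: counts.get((w,) + following, 0))

print(pick_by_count(['there', 'their'], ['is', 'no'], trigram_counts))
```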

## Training a language model

Now we will combine everything together and train our language model! One way to see what the language model has learned is to see the sentences it can generate.

For the sake of simplicity, and for the purposes of later parts in this problem set, we use nltk's lm module to train a language model.

In [76]:
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline
size_ngram = 3

food_train, food_vocab = padded_everygram_pipeline(size_ngram, food_corpus_tk[:int(0.8*len(food_corpus_tk))])
natr_train, natr_vocab = padded_everygram_pipeline(size_ngram, natr_corpus_tk[:int(0.8*len(natr_corpus_tk))])

food_test = sum([['<s>'] + x + ['</s>'] for x in food_corpus_tk[int(0.8*len(food_corpus_tk)):]],[])
natr_test = sum([['<s>'] + x + ['</s>'] for x in natr_corpus_tk[int(0.8*len(natr_corpus_tk)):]],[])

food_lm = Laplace(size_ngram)
natr_lm = Laplace(size_ngram)

food_lm.fit(food_train, food_vocab)
natr_lm.fit(natr_train, natr_vocab)
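Unlike the unsmoothed estimate from Deliverable 1.5, the `Laplace` model used here applies add-one smoothing, so n-grams that were never seen in training still receive a small non-zero probability. In its textbook form the estimate is $(c(h, w) + 1) / (c(h) + V)$, where $V$ is the vocabulary size. The sketch below only illustrates that formula; the function name and the toy numbers are assumptions and are not taken from nltk.

```python
def add_one_probability(ngram_count, context_count, vocab_size):
    """Add-one (Laplace) smoothed estimate of p(word | context)."""
    return (ngram_count + 1) / (context_count + vocab_size)

# Toy numbers: the n-gram was seen 2 times, its context 8 times, vocabulary of 10 words.
print(add_one_probability(2, 8, 10))
# An unseen n-gram after the same context still gets some probability mass.
print(add_one_probability(0, 8, 10))
```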

Now let's ask our language model to generate a sentence. 

In [77]:
# This might take some time
n_words = 10
print(food_lm.generate(n_words, random_seed=3))  # random_seed makes the random sampling part of generation reproducible. 
print(natr_lm.generate(n_words, random_seed=3))

['<s>', 'Coffee', 'does', 'not', 'know', 'how', 'much', 'they', 'planted', '.']
['<s>', 'Currently', ',', 'there', 'was', 'a', '15.3', 'pct', 'increase', ',']


# 2. Evaluating a language model

Next, we evaluate our language models using the perplexity measure, and draw conclusions on how a change of domains (here, subject areas) can affect the performance of a language model. Perplexity measures how well a language model predicts the sentences in a test corpus; the lower the perplexity, the better the model fits that corpus.

Total: 20 points

- **Deliverable 2.1**: Complete the function `ece365lib.evaluate.get_perplexity`. (10 points)
- **Test**: `nosetests tests/test_train.py:test_d2_1_gp`
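Perplexity is the inverse probability of the test data, normalized by its length: $2^{-\frac{1}{N}\sum_{i} \log_2 p_i}$, where $p_i$ is the probability the model assigns to the $i$-th test item given its context. The sketch below computes it from a plain list of probabilities; it is an illustration of the formula only, and how the graded `get_perplexity` obtains those probabilities from the nltk model is not shown here.

```python
import math

def perplexity(probabilities):
    """Perplexity of a sequence given the model probability of each item."""
    log_sum = sum(math.log2(p) for p in probabilities)
    return 2 ** (-log_sum / len(probabilities))

# A model that always assigns probability 1/4 is as uncertain as a fair 4-way choice.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Working in log space avoids the numerical underflow that would result from multiplying thousands of small probabilities directly.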

In [80]:
from ece365lib import evaluate
reload(evaluate);

In [81]:
# This might take some time
print(evaluate.get_perplexity(food_lm, food_test[:5000]))
print(evaluate.get_perplexity(food_lm, natr_test[:5000]))
print(evaluate.get_perplexity(natr_lm, natr_test[:5000]))
print(evaluate.get_perplexity(natr_lm, food_test[:5000]))

8396.70521548381
8546.64191689775
5428.123517045601
5490.520376861629


In [82]:
! nosetests tests/test_train.py:test_d2_1_gp

.
----------------------------------------------------------------------
Ran 1 test in 106.045s

OK


- **Deliverable 2.2**: What observations can you make on the results? Is the domain shift affecting the performance of the language model? What are possible explanations? (10 points)


<b> The higher the probability a model assigns to a word sequence, the lower its perplexity. When we train on one corpus and evaluate on a test corpus from an entirely different domain, the perplexity increases, i.e., the test-set probability decreases: the food model scores 8396.7 on the food test set but 8546.6 on the natural resources test set, and the natural resources model scores 5428.1 in-domain but 5490.5 on the food test set. So yes, the domain shift is affecting the performance of the language model. A model trained on the "food" domain may never have seen many of the words and word sequences in the "natr" domain, so its predictions there are less reliable.</b>


# 3. Data size and model complexity

Let us now see how the size of the training data and the complexity of the model we choose affects the quality of our language model.

Total: 40 points

For this part we'd like to see the difference between a 2-gram model and a 3-gram model. Typically, with a larger n, the n-gram model gives us more information about the word sequence and has lower perplexity. 

To keep the running time manageable, we'll test on only 5% of the corpus instead of the full 20% test split.

- **Deliverable 3.1**: Complete the function `ece365lib.train.vary_ngram`. (40 points)
- **Test**: `nosetests tests/test_train.py:test_d3_1_vary`

In [84]:
from ece365lib import train
reload(train);

In [85]:
n_gram_orders = [2, 3]

train_corpus = natr_corpus_tk[:int(0.8*len(natr_corpus_tk))]
test_corpus = natr_corpus_tk[int(0.8*len(natr_corpus_tk)): int(0.85*len(natr_corpus_tk))]

results = train.vary_ngram(train_corpus, test_corpus, n_gram_orders)

print(results)

{2: 5393.67203181578, 3: 5424.76053819262}


In [86]:
! nosetests tests/test_train.py:test_d3_1_vary

.
----------------------------------------------------------------------
Ran 1 test in 241.152s

OK


However, we notice that the 3-gram language model actually performs worse than the 2-gram language model (perplexity 5424.8 versus 5393.7). This is due to the small size of the training corpus: a 3-gram language model is too complex a model for so little training data. If our training data were larger, we would see the opposite. For example, 1-gram, 2-gram, and 3-gram models trained on 38 million words from the Wall Street Journal reach perplexities of 962, 170, and 109, respectively, on a test set of 1.5 million words.
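One way to make the data-sparsity argument concrete is to measure what fraction of test n-grams never occurred in training, for each order. The sketch below does this on toy sentences; the function names and the toy data are assumptions for illustration, and the unseen rate is only a diagnostic, not the perplexity itself.

```python
def ngrams_of(sentence, n):
    """All contiguous n-grams of a tokenized sentence, as tuples."""
    return [tuple(sentence[i:i + n]) for i in range(len(sentence) - n + 1)]

def unseen_rate(train_sentences, test_sentences, n):
    """Fraction of test n-grams of order n that never occur in the training data."""
    seen = {g for s in train_sentences for g in ngrams_of(s, n)}
    test_grams = [g for s in test_sentences for g in ngrams_of(s, n)]
    if not test_grams:
        return 0.0
    return sum(g not in seen for g in test_grams) / len(test_grams)

toy_train = [['<s>', 'gold', 'prices', 'rose', '</s>']]
toy_test = [['<s>', 'gold', 'prices', 'fell', '</s>']]
for order in (2, 3):
    print(order, unseen_rate(toy_train, toy_test, order))
```

On a small corpus the unseen rate tends to grow with the order, so a higher-order model leans more heavily on smoothing for its test probabilities.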

Now let's look at a few examples of the most frequent n-grams. Let's start with unigrams.

In [87]:
natr_ngrams, natr_vocab = train.count_ngrams(natr_corpus_tr, 3)

top_ngram = []
count = 0
for i in sorted(natr_ngrams.items(), key=lambda x: x[1], reverse=True):
    if len(i[0]) == 1:
        top_ngram.append(i[0])
        count += 1
    if count >=20:
        break
print(top_ngram)

[(',',), ('the',), ('<s>',), ('</s>',), ('.',), ('of',), ('to',), ('and',), ('said',), ('in',), ('a',), ('for',), ('The',), ('from',), ('pct',), ('mln',), ('at',), ('on',), ("'s",), ('is',)]


Do you think a unigram model captures any grammatical information? How well do you think it captures the language information?

<b>The unigram model does capture some grammatical information, in the sense that the comma, the full stop, the sentence start and end symbols, prepositions, conjunctions, etc. are its most frequently occurring tokens. However, it captures language information and sentence construction poorly, as reflected in its higher perplexity scores compared to bigram and trigram models.</b>

Now let's look at bigrams and trigrams.

In [88]:
top_ngram = []
count = 0
for i in sorted(natr_ngrams.items(), key=lambda x: x[1], reverse=True):
    if len(i[0]) == 2:
        top_ngram.append(i[0])
        count += 1
    if count >=20:
        break
print(top_ngram)

top_ngram = []
count = 0
for i in sorted(natr_ngrams.items(), key=lambda x: x[1], reverse=True):
    if len(i[0]) == 3:
        top_ngram.append(i[0])
        count += 1
    if count >=20:
        break
print(top_ngram)

[('.', '</s>'), ('said', '.'), ('<s>', 'The'), ('in', 'the'), ('of', 'the'), ('&', 'lt'), ('lt', ';'), (',', 'the'), ('said', 'it'), ('said', 'the'), ('<s>', '``'), (',', "''"), (',', 'which'), ('to', 'the'), ('for', 'the'), (',', 'a'), ('on', 'the'), (',', 'and'), ('mln', 'dlrs'), ('<s>', 'It')]
[('said', '.', '</s>'), ('&', 'lt', ';'), ('.', "''", '</s>'), ('<s>', 'The', 'company'), ('<s>', 'It', 'said'), ('he', 'said', '.'), ('ounces', 'of', 'gold'), ('year', '.', '</s>'), ('The', 'company', 'said'), ('...', '...', '...'), ('added', '.', '</s>'), ('oil', 'and', 'gas'), (',', 'it', 'said'), ('pct', '.', '</s>'), (',', "''", 'he'), (',', 'he', 'said'), ('it', 'said', '.'), ('sources', 'said', '.'), ('is', 'expected', 'to'), ('<s>', 'He', 'said')]
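The three loops above follow the same pattern: filter the counts by n-gram order, then keep the most frequent entries. That pattern can be sketched as one small helper. The name `top_ngrams` and the toy counts are hypothetical; the helper assumes the counts are a dictionary keyed by word tuples, as in the cells above.

```python
import heapq

def top_ngrams(counts, order, k=20):
    """Return the k most frequent n-grams of the given order."""
    of_order = ((ngram, c) for ngram, c in counts.items() if len(ngram) == order)
    # nlargest avoids sorting the entire dictionary just to keep k items.
    return [ngram for ngram, _ in heapq.nlargest(k, of_order, key=lambda item: item[1])]

toy_counts = {('a',): 5, ('b',): 3, ('c',): 7, ('a', 'b'): 4, ('b', 'a'): 1}
print(top_ngrams(toy_counts, 1, k=2))
print(top_ngrams(toy_counts, 2, k=1))
```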


Compared with unigrams, bigrams and trigrams capture more information. 
A bigram language model can already capture some grammatical information, such as 'in the' and 'of the'. However, the power of bigrams is still limited. 
Trigrams yield more meaningful short phrases, such as 'ounces of gold', 'The company said', and 'oil and gas'. 

Therefore, given enough training data, an n-gram model with a larger n typically contains more information about the word sequence and thus has lower perplexity. The tradeoff is higher computational and memory cost. 