# Language Modeling Using N-grams

In this exercise, we are going to create a bigram language model and several variations of it. We will build one model of each of the following types and calculate its perplexity:
- Unigram Model
- Bigram Model
- Bigram Model with Laplace smoothing
- Bigram Model with Interpolation
- Bigram Model with Kneser-Ney Interpolation

We will also use NLTK, a natural language processing library for Python, to make our lives easier.



In [49]:
# #download corpus
# !wget --no-check-certificate https://github.com/ekapolc/nlp_2019/raw/master/HW4/BEST2010.zip
# !unzip BEST2010.zip

In [50]:
# !wget https://www.dropbox.com/s/jajdlqnp5h0ywvo/tokenized_wiki_sample.csv

In [51]:
# First, we import the necessary libraries and helpers, such as math, nltk (including bigrams), and collections.
%pip install -q nltk
import math
import nltk
import io
import random
from random import shuffle
from nltk import bigrams, trigrams
from collections import Counter, defaultdict

random.seed(999)

Note: you may need to restart the kernel to use updated packages.


BEST2010 is a free Thai NLP dataset by NECTEC, commonly used as a standard benchmark for various NLP tasks, including language modeling. It is split into 4 domains: article, encyclopedia, news, and novel. The data is already tokenized, using '|' as a separator.

For example,

ตาม|ที่|นางประนอม ทองจันทร์| |กับ| |ด.ช.กิตติพงษ์ แหลมผักแว่น| |และ| |ด.ญ.กาญจนา กรองแก้ว| |ป่วย|สงสัย|ติด|เชื้อ|ไข้|ขณะ|นี้|ยัง|ไม่|ดี|ขึ้น|
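
A minimal sketch of how such a line can be turned into a list of tokens. The helper name `split_best_line` is hypothetical, and the sample line uses English placeholders rather than real BEST2010 text. Note that a single space is itself a token in this format.

```python
def split_best_line(line: str) -> list[str]:
    """Split a '|'-separated line into tokens, dropping the trailing separator."""
    line = line.strip()
    # Remove the trailing "|" only if it is present,
    # so a line without it keeps its last character.
    if line.endswith("|"):
        line = line[:-1]
    return line.split("|")


tokens = split_best_line("the|patient| |is|still|sick|\n")
print(tokens)
```

Checking for the trailing separator before slicing is a small safety measure, in case a line does not end with '|'.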

In [52]:
total_word_count = 0
best2010 = []
with open("BEST2010/news.txt", "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        line = line.strip()[:-1]  # remove the trailing |
        total_word_count += len(line.split("|"))
        best2010.append(line)

In [53]:
# For simplicity, we assume that each line is a sentence.
print(f"Total sentences in BEST2010 news dataset :\t{len(best2010)}")
print(f"Total word counts in BEST2010 news dataset :\t{total_word_count}")

Total sentences in BEST2010 news dataset :	30969
Total word counts in BEST2010 news dataset :	1660190


We split the input into two sets, train and test, with a 70:30 ratio.

In [54]:
sentences = best2010
# The data is split into train and test sets with a 70:30 ratio.
train = sentences[: int(len(sentences) * 0.7)]
test = sentences[int(len(sentences) * 0.7) :]

# Training data
train_word_count = 0
for line in train:
    for word in line.split("|"):
        train_word_count += 1
print("Total sentences in BEST2010 news training dataset :\t" + str(len(train)))
print("Total word counts in BEST2010 news training dataset :\t" + str(train_word_count))

Total sentences in BEST2010 news training dataset :	21678
Total word counts in BEST2010 news training dataset :	1042797


Here we load the data from Wikipedia, which is also already tokenized. It will be used for answering questions in MyCourseVille.

In [55]:
import pandas as pd

wiki_data = pd.read_csv("tokenized_wiki_sample.csv")

## Data Preprocessing

Before training any language model, the first step is always to process the data into a format suited to the LM.

For this exercise, we will use NLTK to help process our data.

In [56]:
from nltk.lm.preprocessing import pad_both_ends, flatten
from nltk.lm.vocabulary import Vocabulary
from nltk import ngrams

We begin by "tokenizing" our training set. Note that the data is already tokenized so we can just split it.

In [57]:
tokenized_train = [["<s>"] + t.split("|") + ["</s>"] for t in train]  # "tokenize" each sentence

Next we create a vocabulary with the ```Vocabulary``` class from NLTK. It accepts a list of tokens, so we first flatten our sentences into one long list of tokens.
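
To make the cutoff idea concrete, here is a minimal stand-in built on `collections.Counter`. It is not NLTK's implementation. The names `build_vocab`, `toy_tokens`, and `toy_vocab` are made up for illustration, and we assume that a token seen fewer than `unk_cutoff` times is treated as unknown.

```python
from collections import Counter

UNK = "<UNK>"


def build_vocab(tokens, unk_cutoff=3):
    """Return the set of tokens seen at least `unk_cutoff` times, plus the unknown label."""
    counts = Counter(tokens)
    return {tok for tok, c in counts.items() if c >= unk_cutoff} | {UNK}


toy_tokens = ["a", "a", "a", "b", "b", "c"]
toy_vocab = build_vocab(toy_tokens, unk_cutoff=3)

# Tokens below the cutoff are replaced by the unknown label.
mapped = [tok if tok in toy_vocab else UNK for tok in toy_tokens]
print(sorted(toy_vocab))
print(mapped)
```

Only "a" reaches the cutoff here, so "b" and "c" are both mapped to the unknown label.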







In [58]:
flat_tokens = list(flatten(tokenized_train))  # join all sentences into one long sentence
vocab = Vocabulary(
    flat_tokens, unk_cutoff=3
)  # Words with frequency **below** 3 (not exactly 3) will not be considered in our vocab and will be converted to <UNK>.

Then we replace low-frequency words with \<UNK\>. Each sentence was already padded with \<s\> at the front and \</s\> at the back when we tokenized it above.

Now *each* sentence is going to look something like this:
\["\<s\>", "hello", "my", "name", "is", "\<UNK\>", "\</s\>" \]

In [59]:
tokenized_train = [[token if token in vocab else "<UNK>" for token in sentence] for sentence in tokenized_train]

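Finally, we split the test set and the wiki dataset and map their out-of-vocabulary words to \<UNK\> in the same way. Note that, unlike the training set, these sentences are not padded with \<s\> and \</s\> in this cell.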

In [60]:
tokenized_test = [t.split("|") for t in test]
tokenized_test = [[token if token in vocab else "<UNK>" for token in sentence] for sentence in tokenized_test]

tokenized_wiki_test = [t.split("|") for t in wiki_data["tokenized"].tolist()]
tokenized_wiki_test = [[token if token in vocab else "<UNK>" for token in sentence] for sentence in tokenized_wiki_test]

# Unigram

In this section, we will demonstrate how to build a unigram language model <br>
**Important note:** <br>
**\<s\>** = sentence start symbol <br>
**\</s\>** = sentence end symbol

# VERY IMPORTANT:
- In this notebook, we will *not* default the unknown token probability to ```1/len(vocab)``` but instead will treat it as a normal word and let the model learn its probability so that we can compare our results to NLTK.
- **Also make sure that the code in this notebook can be executed without any problem. If we find that you used NLTK to answer questions in MyCourseVille and did not finish the assignment, you will receive a grade of 0 for this assignment.**
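
The unigram estimate is just a relative frequency, with `<UNK>` counted like any other token. A minimal sketch on a toy corpus follows. The names are illustrative and not part of the assignment, and for simplicity this sketch decides "unknown" from the counts rather than from an NLTK vocabulary.

```python
from collections import Counter

toy_sentences = [
    ["<s>", "a", "b", "<UNK>", "</s>"],
    ["<s>", "a", "</s>"],
]

counts = Counter(tok for sent in toy_sentences for tok in sent)
total = sum(counts.values())


def unigram_prob(word):
    """Relative frequency; words never seen fall back to the <UNK> count."""
    key = word if word in counts else "<UNK>"
    return counts[key] / total


print(unigram_prob("a"))       # 2 of 8 tokens
print(unigram_prob("unseen"))  # same as <UNK>: 1 of 8 tokens
```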

In [61]:
class UnigramModel:
    def __init__(self, data, vocab):
        self.unigram_count = defaultdict(lambda: 0.0)
        self.word_count = 0
        self.vocab = vocab
        for sentence in data:
            for w in sentence:  # [(word1, ), (word2, ), (word3, )...]
                w = w[0]
                self.unigram_count[w] += 1.0
                self.word_count += 1

    def __getitem__(self, w):
        w = w[0]  # [(word1, ), (word2, ), (word3, )...]
        if w in self.vocab:
            return self.unigram_count[w] / (self.word_count)
        else:
            return self.unigram_count["<UNK>"] / (self.word_count)

In [62]:
train_unigrams = [list(ngrams(sent, n=1)) for sent in tokenized_train]  # creating the unigrams by setting n=1
model = UnigramModel(train_unigrams, vocab)

In [63]:
def getLnValue(x):
    if x == 0:
        return -math.inf
    return math.log(x)

In [64]:
# log probability of 'นายก' ("prime minister", short form)
print(getLnValue(model[("นายก",)]))

# for example, the log probability of 'นายกรัฐมนตรี' ("prime minister", full form), which is an unknown word, is equal to
print(getLnValue(model[("นายกรัฐมนตรี",)]))

# probability of the sentence 'นายก' 'ได้' 'ให้' 'สัมภาษณ์' 'กับ' 'สื่อ' ("the prime minister gave an interview to the media")
prob = (
    getLnValue(model[("นายก",)])
    + getLnValue(model[("ได้",)])
    + getLnValue(model[("ให้",)])
    + getLnValue(model[("สัมภาษณ์",)])
    + getLnValue(model[("กับ",)])
    + getLnValue(model[("สื่อ",)])
    + getLnValue(model[("</s>",)])
)
print("Probability of a sentence", math.exp(prob))

-6.571687039690381
-3.952132570275872
Probability of a sentence 4.877889285183675e-18


# Perplexity

To compare language models, we need to calculate perplexity. In this task, you should write the perplexity calculation for the unigram model. The resulting perplexity should be around 420.67 and 345.12 on the train and test data, respectively.
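
For reference, perplexity is the exponential of the average negative log probability per token. The sketch below works on a hypothetical list of per-token probabilities rather than on a real model. The function name `perplexity_from_probs` is made up for illustration.

```python
import math


def perplexity_from_probs(probs):
    """exp of the mean negative log probability; infinite if any probability is 0."""
    if any(p == 0 for p in probs):
        return math.inf
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))


# A model that gives every token probability 1/4 is as uncertain as a
# uniform choice among 4 words, so its perplexity is approximately 4.
print(perplexity_from_probs([0.25, 0.25, 0.25, 0.25]))
```

Working in log space avoids the underflow that multiplying many small probabilities would cause.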

## TODO #1 Calculate perplexity

In [65]:
def getLnValue(x):
    if x == 0:
        return -math.inf
    return math.log(x)


def calculate_sentence_ln_prob(sentence, model):
    """Calculate the log probability of a sentence given a language model."""
    ln_prob = 0
    for i in range(len(sentence)):
        ln_prob += getLnValue(model[sentence[i]])
    return ln_prob


def perplexity(test, model):
    """Compute perplexity of the test set with a language model using sentence ln prob."""
    word_count = 0
    ln_total = 0
    for sentence in test:
        ln_total += calculate_sentence_ln_prob(sentence, model)
        word_count += len(sentence)
    return math.exp(-ln_total / word_count)

In [66]:
test_unigrams = [list(ngrams(sent, n=1)) for sent in tokenized_test]

In [67]:
print(perplexity(train_unigrams, model))
print(perplexity(test_unigrams, model))

448.89690751824827
392.74028966757214


## Q1 MCV
Calculate the perplexity of the model on the wiki test set and answer in MyCourseVille

In [68]:
wiki_test_unigrams = [list(ngrams(sent, n=1)) for sent in tokenized_wiki_test]

In [69]:
print(perplexity([list(flatten(wiki_test_unigrams))], model))

485.7336366066887


# Bigram

Next, you will create a better language model than the unigram model (which is not much of a baseline). Counting every pair of words that occurs in our corpus by hand is tedious. Luckily, NLTK provides simple helpers that simplify the process.

In [70]:
# example of NLTK usage for bigrams
sentence = "I always search google for an answer ."
padded_sentence = list(pad_both_ends(sentence.split(), n=2))

print("This is how NLTK generates bigrams.")
for w1, w2 in bigrams(padded_sentence):
    print(w1, w2)
print("\n<s> and </s> are used as the start-of-sentence and end-of-sentence symbols, respectively.")

This is how NLTK generates bigrams.
<s> I
I always
always search
search google
google for
for an
an answer
answer .
. </s>

<s> and </s> are used as the start-of-sentence and end-of-sentence symbols, respectively.


Now you should be able to implement a bigram model yourself. You must also write a new perplexity calculation for bigrams. The resulting perplexity should be around 56.46 and 85.38 on the train and test data, respectively.
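
As a warm-up, the maximum-likelihood bigram estimate is count(w1, w2) / count(w1). The sketch below computes it on a toy corpus using only the standard library. It is an illustration with made-up names, not the reference solution.

```python
from collections import Counter

toy_sentences = [
    ["<s>", "a", "b", "</s>"],
    ["<s>", "a", "c", "</s>"],
]

bigram_counts = Counter()
context_counts = Counter()
for sent in toy_sentences:
    # zip(sent, sent[1:]) pairs each token with its successor.
    for w1, w2 in zip(sent, sent[1:]):
        bigram_counts[(w1, w2)] += 1
        context_counts[w1] += 1


def bigram_prob(w1, w2):
    """Maximum-likelihood estimate; 0 if w1 was never seen as a context."""
    if context_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / context_counts[w1]


print(bigram_prob("<s>", "a"))  # 2 / 2
print(bigram_prob("a", "b"))    # 1 / 2
print(bigram_prob("b", "c"))    # unseen bigram
```

A `Counter` returns 0 for a missing key without inserting it, so looking up unseen bigrams does not change the stored counts.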

## TODO #3 Write Bigram Model

In [71]:
class BigramModel:
    def __init__(self, data, vocab):
        self.bigram_count = defaultdict(lambda: 0.0)
        self.unigram_count = defaultdict(lambda: 0.0)
        self.total_word_count = 0
        self.vocab = vocab
        for sentence in data:
            for w1, w2 in sentence:  # [(word1, word2), (word2, word3), (word3, word4)...]
                self.bigram_count[(w1, w2)] += 1.0
                self.unigram_count[w1] += 1.0
                self.total_word_count += 1

    def __getitem__(self, bigram):
        """
        Return the probability of a given bigram.
        Note: Returns 0 if the bigram is not in the model, so a single unseen bigram makes the perplexity infinite.
        """
        w1, w2 = bigram
        if bigram in self.bigram_count:
            return self.bigram_count[bigram] / self.unigram_count[w1]
        else:
            return 0

## TODO #4 Write Perplexity for Bigram Model

Sum perplexity score at a sentence level, instead of word level

In [72]:
from tqdm import tqdm


def perplexity(bigram_data, model):
    """Compute perplexity of the test set with a language model using bigram ln prob.

    `bigram_data` is a one-element list wrapping the flattened list of bigrams.
    """
    sum_ln_prob = 0
    looper = tqdm(bigram_data[0], position=0, leave=True, desc="Calculating perplexity")
    for index, (w1, w2) in enumerate(looper):
        sum_ln_prob += getLnValue(model[(w1, w2)])
        if index % 10000 == 0:
            looper.set_postfix({"current perplexity": math.exp(-sum_ln_prob / (index + 1))})

    return math.exp(-sum_ln_prob / len(bigram_data[0]))

In [73]:
train_bigrams = [list(ngrams(sent, n=2)) for sent in tokenized_train]
test_bigrams = [list(ngrams(sent, n=2)) for sent in tokenized_test]

In [74]:
bigram_model_scratch = BigramModel(train_bigrams, vocab)

In [75]:
print(perplexity([list(flatten(train_bigrams))], bigram_model_scratch))
print(perplexity([list(flatten(test_bigrams))[:17]], bigram_model_scratch))
print(perplexity([list(flatten(test_bigrams))], bigram_model_scratch))

Calculating perplexity: 100%|██████████| 1064475/1064475 [00:00<00:00, 2186708.50it/s, current perplexity=56.5]


56.45504870219316


Calculating perplexity: 100%|██████████| 17/17 [00:00<00:00, 52622.26it/s, current perplexity=1.94]


inf


Calculating perplexity: 100%|██████████| 608102/608102 [00:00<00:00, 2155564.01it/s, current perplexity=inf]

inf





## Q2 MCV

In [76]:
wiki_test_bigrams = [list(ngrams(sent, n=2)) for sent in tokenized_wiki_test]

In [77]:
print(perplexity([list(flatten(wiki_test_bigrams))], bigram_model_scratch))

Calculating perplexity: 100%|██████████| 214930/214930 [00:00<00:00, 2152941.36it/s, current perplexity=inf]

inf





# Smoothing

N-gram models usually suffer from sparsity: the training data does not contain every possible n-gram. Smoothing techniques can alleviate this problem. In this section, you will implement three basic smoothing methods: Laplace smoothing, interpolation for bigrams, and Kneser-Ney smoothing.
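
To see the sparsity problem on a small scale, the sketch below counts how many bigrams of a toy test set never occur in a toy training set. Both sets are made up for illustration. Under an unsmoothed model, each of these unseen bigrams gets probability 0.

```python
train_bigrams_toy = {("<s>", "a"), ("a", "b"), ("b", "</s>")}
test_bigrams_toy = [("<s>", "a"), ("a", "c"), ("c", "</s>")]

# Bigrams in the test data that the training data never contained.
unseen = [bg for bg in test_bigrams_toy if bg not in train_bigrams_toy]
unseen_rate = len(unseen) / len(test_bigrams_toy)

print(unseen)
print(unseen_rate)
```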

## TODO #5 Write Bigram with Laplace smoothing (Add-One Smoothing)

The resulting perplexity on the training and test data should be:

    370.23, 361.25 for Laplace smoothing
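
Add-one smoothing adds 1 to every bigram count and adds the vocabulary size V to the denominator, so that the probabilities for each context still sum to 1. A minimal sketch on a toy corpus, with illustrative names and a hand-written toy vocabulary:

```python
from collections import Counter

toy_sentences = [
    ["<s>", "a", "b", "</s>"],
    ["<s>", "a", "c", "</s>"],
]
vocab_toy = {"<s>", "</s>", "a", "b", "c"}

bigram_counts = Counter()
context_counts = Counter()
for sent in toy_sentences:
    for w1, w2 in zip(sent, sent[1:]):
        bigram_counts[(w1, w2)] += 1
        context_counts[w1] += 1


def laplace_prob(w1, w2):
    """(count(w1, w2) + 1) / (count(w1) + V), where V is the vocabulary size."""
    return (bigram_counts[(w1, w2)] + 1) / (context_counts[w1] + len(vocab_toy))


print(laplace_prob("a", "b"))  # seen bigram: (1 + 1) / (2 + 5)
print(laplace_prob("a", "a"))  # unseen bigram: (0 + 1) / (2 + 5)
```

With a large vocabulary, the added V in the denominator moves a lot of probability mass to unseen bigrams, which helps explain why Laplace smoothing can give a much higher perplexity than the other methods.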

In [78]:
class BigramWithLaplaceSmoothing(BigramModel):
    def __getitem__(self, bigram):
        w1, w2 = bigram
        return (self.bigram_count[(w1, w2)] + 1) / (self.unigram_count[w1] + len(self.vocab))


model = BigramWithLaplaceSmoothing(train_bigrams, vocab)
print(perplexity([list(flatten(train_bigrams))], model))
print(perplexity([list(flatten(test_bigrams))], model))

Calculating perplexity: 100%|██████████| 1064475/1064475 [00:00<00:00, 1977111.84it/s, current perplexity=371]


370.28232024056035


Calculating perplexity: 100%|██████████| 608102/608102 [00:00<00:00, 1625136.93it/s, current perplexity=369]

369.1605485652555





## Q3 MCV

In [79]:
print(perplexity([list(flatten(wiki_test_bigrams))], model))

Calculating perplexity: 100%|██████████| 214930/214930 [00:00<00:00, 1739308.31it/s, current perplexity=730]

735.0253999215859





## TODO #6 Write Bigram with Interpolation
Set the lambda values to 0.7 for the bigram, 0.25 for the unigram, and 0.05 for the unknown word.

The resulting perplexity on the training and test data should be:

    70.07, 102.67 for Interpolation
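
Interpolation mixes three estimates: the bigram estimate, the unigram estimate, and a uniform 1/V term for unknown words. A minimal sketch on a toy corpus follows. The names are illustrative, and we assume the vocabulary is simply the set of token types in the toy data.

```python
from collections import Counter

toy_sentences = [
    ["<s>", "a", "b", "</s>"],
    ["<s>", "a", "c", "</s>"],
]

unigram_counts = Counter(tok for sent in toy_sentences for tok in sent)
bigram_counts = Counter(bg for sent in toy_sentences for bg in zip(sent, sent[1:]))
total_tokens = sum(unigram_counts.values())
vocab_size = len(unigram_counts)

# Weights from the text above; they sum to 1.
L_BIGRAM, L_UNIGRAM, L_UNIFORM = 0.7, 0.25, 0.05


def interpolated_prob(w1, w2):
    """Weighted mix of the bigram, unigram, and uniform estimates."""
    p_bigram = bigram_counts[(w1, w2)] / unigram_counts[w1] if unigram_counts[w1] else 0.0
    p_unigram = unigram_counts[w2] / total_tokens
    return L_BIGRAM * p_bigram + L_UNIGRAM * p_unigram + L_UNIFORM / vocab_size


print(interpolated_prob("a", "b"))  # seen bigram
print(interpolated_prob("b", "a"))  # unseen bigram, still non-zero
```

Because the three weights sum to 1 and each component is a proper distribution for a seen context, the mixed probabilities also sum to 1 over the vocabulary.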

In [80]:
class BigramWithInterpolation:
    def __init__(self, data, vocab, l=0.7):
        self.unigram_count = defaultdict(lambda: 0.0)
        self.bigram_count = defaultdict(lambda: 0.0)
        self.total_word_count = 0
        self.vocab = vocab
        self.l = l  # l for lambda
        for sentence in data:
            for w1, w2 in sentence:
                self.bigram_count[(w1, w2)] += 1.0
                self.unigram_count[w1] += 1.0
                self.total_word_count += 1

            # account for the last word of each sentence
            self.unigram_count[w2] += 1.0
            self.total_word_count += 1

    def __getitem__(self, bigram):
        w1, w2 = bigram
        unigram_prob = self.unigram_count[w2] / self.total_word_count
        if w1 not in self.vocab:
            w1 = "<UNK>"
        if self.unigram_count[w1] == 0:
            bigram_prob = 0
        else:
            bigram_prob = self.bigram_count[(w1, w2)] / self.unigram_count[w1]

        return 0.7 * bigram_prob + 0.25 * unigram_prob + 0.05 * (1 / len(self.vocab))


model = BigramWithInterpolation(train_bigrams, vocab)
print(perplexity([list(flatten(train_bigrams))], model))
print(perplexity([list(flatten(test_bigrams))], model))

Calculating perplexity: 100%|██████████| 1064475/1064475 [00:00<00:00, 1364681.55it/s, current perplexity=70.1]


70.06754793707988


Calculating perplexity: 100%|██████████| 608102/608102 [00:00<00:00, 1257695.96it/s, current perplexity=103]

102.99295310997627





## Q4 MCV

In [81]:
print(perplexity([list(flatten(wiki_test_bigrams))], model))

Calculating perplexity: 100%|██████████| 214930/214930 [00:00<00:00, 1101110.98it/s, current perplexity=251]

251.60709095654997





## Language modeling on multiple domains

Sometimes, we do not have enough data to create a language model for a new domain. In that case, we can improvise by combining several models to improve results on the new domain.

In this exercise you will try to merge two language models from news and article domains to create a language model for the encyclopedia domain.

In [82]:
# create encyclopedia data (test data)
encyclo_data = []
with open("BEST2010/encyclopedia.txt", "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        encyclo_data.append(line.strip()[:-1])

(news) First, calculate the perplexity of your bigram model with interpolation on the encyclopedia data. The perplexity should be around 236.329.

In [83]:
encyclopedia_bigrams = [list(ngrams(pad_both_ends(sent.split("|"), 2), n=2)) for sent in encyclo_data]

In [84]:
# 1) news only on "encyclopedia"
print(perplexity([list(flatten(encyclopedia_bigrams))], model))

Calculating perplexity: 100%|██████████| 1214496/1214496 [00:01<00:00, 1167541.89it/s, current perplexity=468]

467.7718689489432





## TODO #7 - Language Modeling on Multiple Domains
Combine news and article datasets to create another bigram model and evaluate it on the encyclopedia data.
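
One simple way to combine two domains is to pool their sentences before counting. The sketch below, on made-up toy data with a hypothetical helper `count_bigrams`, shows that pooling the sentences gives the same bigram counts as adding the per-domain counts.

```python
from collections import Counter


def count_bigrams(sentences):
    """Count adjacent token pairs over a list of tokenized sentences."""
    return Counter(bg for sent in sentences for bg in zip(sent, sent[1:]))


news_toy = [["<s>", "a", "b", "</s>"]]
article_toy = [["<s>", "a", "c", "</s>"], ["<s>", "b", "</s>"]]

pooled = count_bigrams(news_toy + article_toy)
summed = count_bigrams(news_toy) + count_bigrams(article_toy)

print(pooled == summed)
print(pooled[("<s>", "a")])
```

The vocabulary should likewise be rebuilt from the pooled data, since a word that is rare in each domain alone may pass the cutoff once the counts are combined.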



(article) For your information, a bigram model with interpolation trained on the article data and tested on the encyclopedia data has a perplexity of 218.55.

In [85]:
# 2) article only on "encyclopedia"
best2010_article = []
with open("BEST2010/article.txt", "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        best2010_article.append(line.strip()[:-1])

combined_total_word_count = 0
for line in best2010_article:
    combined_total_word_count += len(line.split("|"))

article_bigrams = [list(ngrams(pad_both_ends(sent.split("|"), 2), n=2)) for sent in best2010_article]
article_vocab = Vocabulary(list(flatten([sent.split("|") for sent in best2010_article])), unk_cutoff=3)

model = BigramWithInterpolation(article_bigrams, article_vocab)

In [86]:
print(
    "Perplexity of the bigram model using article data with interpolation smoothing on encyclopedia test data",
    perplexity([list(flatten(encyclopedia_bigrams))], model),
)

Calculating perplexity: 100%|██████████| 1214496/1214496 [00:00<00:00, 1233036.06it/s, current perplexity=426]

Perplexity of the bigram model using article data with interpolation smoothing on encyclopedia test data 426.31585617805587





In [87]:
# 3) train on news + article, test on "encyclopedia"
best2010_article_and_news = best2010_article.copy()
with open("BEST2010/news.txt", "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        best2010_article_and_news.append(line.strip()[:-1])

combined_bigrams = [list(ngrams(pad_both_ends(sent.split("|"), 2), n=2)) for sent in best2010_article_and_news]
combined_vocab = Vocabulary(list(flatten([sent.split("|") for sent in best2010_article_and_news])), unk_cutoff=3)

combined_model = BigramWithInterpolation(combined_bigrams, combined_vocab)
print(
    "Perplexity of the combined Bigram model with interpolation smoothing on encyclopedia test data",
    perplexity([list(flatten(encyclopedia_bigrams))], combined_model),
)

Calculating perplexity: 100%|██████████| 1214496/1214496 [00:01<00:00, 1115484.63it/s, current perplexity=398]

Perplexity of the combined Bigram model with interpolation smoothing on encyclopedia test data 398.50315555053845





## Q5 MCV

Did you get a better or worse result when using combined data?

## TODO #8 - Kneser-Ney on "News"

<!-- Reimplement equation 4.33 in SLP textbook (https://lagunita.stanford.edu/c4x/Engineering/CS-224N/asset/slp4.pdf) -->

Implement a bigram Kneser-Ney LM. The resulting perplexity should be around 65.81 and 92.88 on the train and test data, respectively. Be careful not to mix up the vocab with the one from the section above!
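
As a reference point, interpolated Kneser-Ney for bigrams subtracts a discount d from each seen bigram count and redistributes that mass according to a continuation probability, which measures how many distinct words precede w2. The sketch below is a toy illustration under these assumptions, not the reference solution. It precomputes the follower and predecessor sets once, so each lookup avoids scanning all bigrams.

```python
from collections import Counter, defaultdict

toy_sentences = [
    ["<s>", "a", "b", "</s>"],
    ["<s>", "a", "c", "</s>"],
    ["<s>", "b", "c", "</s>"],
]
DISCOUNT = 0.75  # same default discount as in the exercise

bigram_counts = Counter(bg for sent in toy_sentences for bg in zip(sent, sent[1:]))
context_counts = Counter()
followers = defaultdict(set)     # w1 -> distinct words seen after w1
predecessors = defaultdict(set)  # w2 -> distinct words seen before w2
for (w1, w2), c in bigram_counts.items():
    context_counts[w1] += c
    followers[w1].add(w2)
    predecessors[w2].add(w1)
num_bigram_types = len(bigram_counts)


def kneser_ney_prob(w1, w2):
    """Interpolated Kneser-Ney estimate for a bigram whose context w1 was seen."""
    c1 = context_counts[w1]
    if c1 == 0:
        return 0.0  # unseen context: left unhandled in this sketch
    discounted = max(bigram_counts[(w1, w2)] - DISCOUNT, 0) / c1
    backoff_weight = DISCOUNT * len(followers.get(w1, ())) / c1
    p_continuation = len(predecessors.get(w2, ())) / num_bigram_types
    return discounted + backoff_weight * p_continuation


print(kneser_ney_prob("a", "b"))  # seen bigram
print(kneser_ney_prob("a", "</s>"))  # unseen bigram, non-zero via continuation
```

Using a `Counter` and `.get` means lookups never insert new keys, so the number of bigram types stays fixed while the model is being evaluated.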


In [88]:
from tqdm import tqdm


class BigramKneserNey:
    def __init__(self, data, vocab, discount=0.75):
        self.bigram_count = defaultdict(lambda: 0.0)
        self.unigram_count = defaultdict(lambda: 0.0)
        self.total_word_count = 0
        self.vocab = vocab
        self.discount = discount
        self.bigram_types = defaultdict(set)  # For each word, the set of distinct words that precede it

        # Count unigrams and bigrams
        for sentence in tqdm(data, position=0, leave=True, desc="Counting bigrams"):
            for w1, w2 in sentence:  # sentence = [(word1, word2), (word2, word3), ...]
                self.bigram_count[(w1, w2)] += 1.0
                self.unigram_count[w1] += 1.0
                self.bigram_types[w2].add(w1)  # w2 follows w1
                self.total_word_count += 1

    def continuation_probability(self, word):
        """
        Compute continuation probability for a word based on how many unique bigrams end with it.
        """
        return len(self.bigram_types[word]) / len(self.bigram_count)

    def kneser_ney_probability(self, bigram):
        """
        Return the smoothed probability for a given bigram using Kneser-Ney smoothing.
        """
        w1, w2 = bigram
        bigram_count = self.bigram_count[bigram]
        unigram_count = self.unigram_count[w1]

        # Discounted bigram probability
        if unigram_count > 0:
            p_bigram = max(bigram_count - self.discount, 0) / unigram_count
        else:
            p_bigram = 0

        # Continuation probability
        p_continuation = self.continuation_probability(w2)

        # Weight of backoff
        lambda_w1 = (
            (self.discount / unigram_count) * len([w for w in self.bigram_count if w[0] == w1])
            if unigram_count > 0
            else 0
        )

        return p_bigram + lambda_w1 * p_continuation

    def __getitem__(self, bigram):
        """
        Return the Kneser-Ney smoothed probability of a given bigram.
        """
        return self.kneser_ney_probability(bigram)


model = BigramKneserNey(train_bigrams, vocab)
# print(perplexity([list(flatten(train_bigrams))],model))
print(perplexity([list(flatten(train_bigrams))[:1000]], model))
print(perplexity([list(flatten(test_bigrams))[:1000]], model))
# print(perplexity([list(flatten(test_bigrams))], model))

Counting bigrams: 100%|██████████| 21678/21678 [00:00<00:00, 56863.12it/s]
Calculating perplexity: 100%|██████████| 1000/1000 [00:05<00:00, 190.92it/s, current perplexity=5.76e+3]


50.40541747450013


Calculating perplexity: 100%|██████████| 1000/1000 [00:05<00:00, 192.74it/s, current perplexity=1.93]

89.82129031989595





## Q6 MCV

In [89]:
print(perplexity([list(flatten(wiki_test_bigrams))], model))

Calculating perplexity: 100%|██████████| 214930/214930 [19:55<00:00, 179.75it/s, current perplexity=246]

247.7732037817852



