### Natcha Jangphiphatnawakit 63340500031

# Language Modeling using Ngram

In this exercise, you will use NLTK, a natural language processing library for Python, to create a bigram language model and its variations. You will build one model of each of the following types and calculate its perplexity:
- Unigram Model
- Bigram Model
- Bigram Model with add-one estimation
- Bigram Model with Interpolation
- Bigram Model with Kneser-Ney Interpolation
- Neural LM



In [1]:
# from google.colab import drive
# drive.mount('/content/drive')

In [2]:
#download corpus
# import shutil
# shutil.copy("/content/drive/MyDrive/FRA 501 IntroNLP&DL/Dataset/BEST2010.zip", "/content/BEST2010.zip")
# !unzip BEST2010.zip

In [3]:
# First, import the necessary libraries such as math, nltk (bigrams), and collections.
import math
import nltk
import io
import random
from random import shuffle
from nltk import bigrams, trigrams
from collections import Counter, defaultdict
random.seed(999)

BEST2010 is a free Thai NLP dataset by NECTEC that is commonly used as a standard benchmark for various NLP tasks, including language modeling. BEST2010 is separated into 4 domains: article, encyclopedia, news, and novel. The data is already tokenized using '|' as a separator.

For example,

ตาม|ที่|นางประนอม ทองจันทร์| |กับ| |ด.ช.กิตติพงษ์ แหลมผักแว่น| |และ| |ด.ญ.กาญจนา กรองแก้ว| |ป่วย|สงสัย|ติด|เชื้อ|ไข้|ขณะ|นี้|ยัง|ไม่|ดี|ขึ้น|
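
As a rough illustration of the format, the sketch below splits one `|`-delimited line into tokens. The helper name `split_tokens` is hypothetical and not part of the notebook; it only assumes that each line ends with a trailing separator, as in the example above.

```python
# Hypothetical helper (not part of the notebook): split one BEST2010 line into tokens.
def split_tokens(line, sep='|'):
    # Drop surrounding whitespace and one trailing separator, then split.
    line = line.strip()
    if line.endswith(sep):
        line = line[:-len(sep)]
    return line.split(sep)

example = 'ป่วย|สงสัย|ติด|เชื้อ|ไข้|'
tokens = split_tokens(example)
print(tokens)
print(len(tokens))
```

Note that a single space between two separators is kept as its own token, which matches how spaces appear in the sample line.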

In [4]:
# We choose news domain as our dataset
best2010=[]
fp= io.open('BEST2010/news.txt','r',encoding='utf-8')
for i,line in enumerate(fp):
    best2010.append(line.strip()[:-1])
fp.close()
all_vocabulary =set()
total_word_count =0
for line in best2010:
    for word in line.split('|'):        
        all_vocabulary.add(word)
        total_word_count+=1

In [5]:
# For simplicity, we assume that each line is a sentence.
print ('Total sentences in BEST2010 news dataset :\t'+ str(len(best2010)))
print ('Total word counts in BEST2010 news dataset :\t'+ str(total_word_count))
print ('Total vocabulary in BEST2010 news dataset :\t'+ str(len(all_vocabulary)))

Total sentences in BEST2010 news dataset :	30969
Total word counts in BEST2010 news dataset :	1660190
Total vocabulary in BEST2010 news dataset :	35488


We separate our input into 2 sets, train and test data, with a 70:30 ratio.

In [6]:
sentences = best2010
# The data is separated to train and test set with 70:30 ratio.
train = sentences[:int(len(sentences)*0.7)]
test = sentences[int(len(sentences)*0.7):]

#Training data
train_vocabulary =set()
train_word_count =0
for line in train:
    for word in line.split('|'):        
        train_vocabulary.add(word)
        train_word_count+=1
print ('Total sentences in BEST2010 news training dataset :\t'+ str(len(train)))
print ('Total word counts in BEST2010 news training dataset :\t'+ str(train_word_count))
print ('Total vocabuary in BEST2010 news training dataset :\t'+ str(len(train_vocabulary)))
# We will use 1/vocab_size as a default value for unknown word
unk_value = math.pow(len(train_vocabulary),-1)

Total sentences in BEST2010 news training dataset :	21678
Total word counts in BEST2010 news training dataset :	1042797
Total vocabuary in BEST2010 news training dataset :	26240


# Unigram

In this section, we will demonstrate how to build a unigram language model <br>
**Important note:** <br>
**\<s\>** = sentence start symbol <br>
**\</s\>** = sentence end symbol 
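
Before the full implementation, a minimal sketch of maximum-likelihood unigram estimation on toy data may help. The function name `unigram_mle` and the toy sentences are assumptions made for illustration; the idea is simply count divided by total count, with one `</s>` appended per sentence.

```python
from collections import Counter

# Hypothetical toy version of a unigram model: P(w) = count(w) / total tokens.
def unigram_mle(sentences):
    counts = Counter()
    for tokens in sentences:
        # Append the sentence end symbol; <s> is skipped because P(<s>) = 1.
        counts.update(tokens + ['</s>'])
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

toy = [['a', 'b', 'a'], ['b', 'c']]
probs = unigram_mle(toy)
print(probs['a'])           # count('a') = 2 out of 7 tokens
print(sum(probs.values()))  # probabilities sum to 1 (up to rounding)
```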

In [7]:
def getUnigramModel(data):
    model = defaultdict(lambda: 0)
    word_count =0
    for sentence in data:
        sentence +=  u'|</s>' #for unigram model we can always ignore <s>, since p(w0=<s>)=1
        for w1 in sentence.split('|'):
            model[w1] +=1.0
            word_count+=1
    for w1 in model:
        model[w1] = model[w1]/(word_count)
    return model

In [8]:
uni_model = getUnigramModel(train)

In [9]:
def getLnValue(x):
    if x >0.0:
        return math.log(x)
    else:
        return math.log(unk_value)

In [10]:
# probability of 'นายก'
print(getLnValue(uni_model[u'นายก']))
# for example, the probability of 'นายกรัฐมนตรี', which is an unknown word, is equal to
print(getLnValue(uni_model[u'นายกรัฐมนตรี']))
# probability of 'นายก' 'ได้' 'ให้' 'สัมภาษณ์' 'กับ' 'สื่อ'
prob = getLnValue(uni_model[u'นายก'])+getLnValue(uni_model[u'ได้'])+ getLnValue(uni_model[u'ให้'])+getLnValue(uni_model[u'สัมภาษณ์'])+getLnValue(uni_model[u'กับ'])+getLnValue(uni_model[u'สื่อ'])+getLnValue(uni_model['</s>'])
print ('Problability of a sentence', math.exp(prob))

-6.551526663995246
-10.175040243058024
Problability of a sentence 5.617210748667918e-18


## TODO #1 **Calculate perplexity**

To compare language models, we need to calculate perplexity. In this task, you should write the perplexity calculation code for the unigram model. The resulting perplexity should be around 556.39 and 476.07 on train and test data.
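
As a reminder of the quantity involved, perplexity is the exponential of the negative average log-probability per token. The sketch below, with the hypothetical name `perplexity_from_probs`, computes it from a plain list of per-token probabilities rather than from a model.

```python
import math

# Perplexity = exp(-(1/N) * sum(ln p_i)) over N token probabilities.
def perplexity_from_probs(probs):
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / len(probs))

# A model that assigns 1/4 to every token is as uncertain as a 4-way choice.
uniform = perplexity_from_probs([0.25] * 4)
# Perplexity is the inverse geometric mean, so 0.5 and 0.125 average to 0.25.
mixed = perplexity_from_probs([0.5, 0.125])
print(uniform, mixed)
```

Working in log space, as the notebook does with `getLnValue`, avoids underflow when many small probabilities are multiplied.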

In [11]:
import numpy as np

def calculate_sentence_ln_prob(sentence, model):
    word = sentence.split('|')
    ln_prob = 0

    # Loop over each word in the sentence and sum the log-probabilities.
    # Note: no '</s>' token is appended here, although getUnigramModel counts one per sentence.
    for w in word:
        ln_prob += getLnValue(model[w])

    return ln_prob

def perplexity(test,model):
    ln_prob = 0
    word_count = 0

    # Loop over each sentence, compute its log-probability, and sum over all sentences.
    for s in test:
        ln_prob += calculate_sentence_ln_prob(s, model)
        word_count += len(s.split('|')) #count only number of words

    return np.exp(-ln_prob/word_count)

In [12]:
print(perplexity(train,uni_model))
print(perplexity(test,uni_model))

585.1911467875568
492.6246258090756


# Bigram

Next, you will create a better language model than the unigram model (which is a weak baseline). Counting every pair of words that occurs in the corpus by hand is tedious, so we use the helper that NLTK provides for this.

In [13]:
#example of nltk usage for bigram
sentence = 'I always search google for an answer .'

print('This is how nltk generate bigram.')
for w1,w2 in bigrams(sentence.split(), pad_right=True, pad_left=True):
    print (w1,w2)
print('None is used as a start and end of sentence symbol.')

This is how nltk generate bigram.
None I
I always
always search
search google
google for
for an
an answer
answer .
. None
None is used as a start and end of sentence symbol.


Now, you should be able to implement a bigram model by yourself. You must also create a new perplexity calculation for the bigram model. The resulting perplexity should be around 58.78 and 146.26 on train and test data.
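
For orientation, here is a minimal sketch of maximum-likelihood bigram estimation on toy data. It uses plain `zip` with `None` padding to mimic the padded output shown above instead of calling NLTK, and the name `bigram_mle` is an assumption made for this illustration.

```python
from collections import Counter

# Hypothetical toy version: P(w | w_prev) = count(w_prev, w) / count(w_prev).
def bigram_mle(sentences):
    context_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        # None marks both the start and the end of a sentence.
        padded = [None] + tokens + [None]
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            pair_counts[(prev, cur)] += 1
    return {pair: c / context_counts[pair[0]] for pair, c in pair_counts.items()}

toy = [['a', 'b'], ['a', 'c']]
model = bigram_mle(toy)
print(model[(None, 'a')])  # both sentences start with 'a'
print(model[('a', 'b')])   # 'a' is followed by 'b' in one of two cases
```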

## TODO #2 **Create a Bigram Model**

In [14]:
def getBigramModel(data):
    ###FILL YOUR CODE HERE###
    unigram_count = defaultdict(lambda: 0.0)
    bigram_count = defaultdict(lambda: 0.0)
    model = defaultdict(lambda: 0.0)

    # for each sentence
    #   for each bigram generated from its tokens
    #     bigram_count[?] = ?
    #     unigram_count[?] = ?

    # for each bigram seen in the data
    #   model[?] = ?

    for s in data:
        for w_p, w in bigrams(s.split('|'), pad_right=True, pad_left=True):
            
            b = (w_p, w)
            unigram_count[w_p] += 1
            bigram_count[b] += 1

    for b in bigram_count.keys():
        model[b] = bigram_count[b]/unigram_count[b[0]]

    return bigram_count, unigram_count, model

bigram_count2, unigram_count, bi_model = getBigramModel(train)

## TODO #3 **Calculate Perplexity for Bigram Model**



In [15]:
import numpy as np

def calculate_sentence_ln_prob(sentence, model, mode, count_table):
    # Probabilities come from getBigramModel; do not add <s> or </s> manually!
    word = sentence.split('|')
    ln_prob = 0

    # Loop over each bigram in the sentence and sum the log-probabilities.

    for w_p, w in bigrams(word, pad_right=True, pad_left=True):
        if mode == 0 or model[(w_p, w)] != 0:
            ln_prob += getLnValue(model[(w_p, w)])
        elif mode == 1:
            ln_prob += getLnValue(1/(count_table[0][w_p]+len(count_table[0])))
        elif mode == 2:
            ln_prob += getLnValue((0.25*(count_table[0])[w]) + (0.05*(1/len(count_table[0]))))
        elif mode == 3:
            ln_prob += getLnValue(0)
            
        # (0.75*len(x[b[0]])/unigram_count[b[0]]) * (len(y[b[1]])/len(bigram_count.keys()))

    return ln_prob

def perplexity(test, model, mode=0, count_table=({}, {}, {}, {})):
    ln_prob = 0
    word_count = 0

    # Loop over each sentence, compute its log-probability, and sum over all sentences.
    for s in test:
        ln_prob += calculate_sentence_ln_prob(s, model, mode, count_table)
        word_count += len(s.split('|')) + 1
        
    print('word count for compute perplexity:', word_count)

    return np.exp(-ln_prob/word_count)

In [16]:
print(perplexity(train, bi_model) )
print(perplexity(test, bi_model))

# 58.78942889767147
# 146.26539331038614

word count for compute perplexity: 1064475
58.78942889767147
word count for compute perplexity: 626684
146.26539331038614


# Smoothing

N-gram models usually suffer from a sparsity problem: the training data does not contain every possible n-gram, so unseen n-grams receive zero probability. Smoothing techniques can alleviate this problem. In this section, you will implement two basic smoothing methods for the bigram model: Laplace (add-one) smoothing and interpolation.
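
To make the effect of smoothing concrete, the sketch below applies add-one estimation to hand-written toy counts. The function name `add_one_prob` and the counts are hypothetical; the formula is the standard (count + 1) / (context count + vocabulary size).

```python
from collections import Counter

# Add-one (Laplace) estimate: every bigram gets one extra pseudo-count.
def add_one_prob(prev, cur, pair_counts, context_counts, vocab_size):
    return (pair_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size)

# Toy counts, assumed for illustration only.
pair_counts = Counter({('a', 'b'): 3, ('a', 'c'): 1})
context_counts = Counter({'a': 4})
vocab = ['a', 'b', 'c']

seen = add_one_prob('a', 'b', pair_counts, context_counts, len(vocab))
unseen = add_one_prob('a', 'a', pair_counts, context_counts, len(vocab))
total = sum(add_one_prob('a', w, pair_counts, context_counts, len(vocab)) for w in vocab)
print(seen, unseen, total)
```

The unseen bigram now has a small nonzero probability, and the distribution over the vocabulary still sums to one.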

## TODO #4 **Bigram with add-one estimation**

In [17]:
#Laplace Smoothing
def getBigramWithAddOneEstimation(data):

    unigram_count = defaultdict(lambda: 0.0)
    bigram_count = defaultdict(lambda: 0.0)
    model = defaultdict(lambda : 0.0)
    V = set()

    for s in data:
        for w_p, w in bigrams(s.split('|'), pad_right=True, pad_left=True):
            V.add(w)
            b = (w_p, w)
            unigram_count[w] += 1
            bigram_count[b] += 1

    for b in bigram_count.keys():
        model[b] = (bigram_count[b]+1)/(unigram_count[b[0]]+len(V))

    return unigram_count, model

unigram_count_bi, bi_est_model = getBigramWithAddOneEstimation(train)
print(perplexity(train, bi_est_model, mode=1, count_table=(unigram_count_bi, {}, {}, {})))
print(perplexity(test, bi_est_model, mode=1, count_table=(unigram_count_bi, {}, {}, {})))

# 974.8134581679766
# 1098.1622194979489

word count for compute perplexity: 1064475
974.8134581679766
word count for compute perplexity: 626684
1160.6274349620871


## TODO #5 **Bigram with Interpolation**
The lambda values are 0.7 for the bigram term, 0.25 for the unigram term, and 0.05 for the unknown-word term.
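
A minimal sketch of how these three weights combine is shown below. The function name `interpolated_prob` and the input probabilities are assumptions for illustration; the unknown-word term is taken to be 1 / vocabulary size, as in the rest of this notebook.

```python
# Interpolated estimate: weighted sum of bigram, unigram, and uniform terms.
def interpolated_prob(p_bigram, p_unigram, vocab_size, lambdas=(0.7, 0.25, 0.05)):
    l_bi, l_uni, l_unk = lambdas
    return l_bi * p_bigram + l_uni * p_unigram + l_unk / vocab_size

seen = interpolated_prob(0.5, 0.1, 1000)    # bigram observed in training
unseen = interpolated_prob(0.0, 0.1, 1000)  # bigram unseen, word known
oov = interpolated_prob(0.0, 0.0, 1000)     # word never seen at all
print(seen, unseen, oov)
```

Because the weights sum to 1, the result is still a valid probability, and even an out-of-vocabulary word keeps a small nonzero value.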

In [18]:
#interpolation
def getBigramWithInterpolation(data):
    unigram_count = defaultdict(lambda: 0.0)
    bigram_count = defaultdict(lambda: 0.0)
    model = defaultdict(lambda: 0.0)

    # for each sentence
    #   for each bigram generated from its tokens
    #     bigram_count[?] = ?
    #     unigram_count[?] = ?

    # for each key in bigrams
    #   bigram_prob
    #   unigram_prob
    #   model[key] = formula combining bigram, unigram, and unk_value (1/vocab)

    V = set()

    for s in data:
        for w_p, w in bigrams(s.split('|'), pad_right=True, pad_left=True):
            V.add(w)
            b = (w_p, w)
            unigram_count[w] += 1
            bigram_count[b] += 1

    uni_model = getUnigramModel(data)
    _, _, bi_model = getBigramModel(data)

    for b in bigram_count.keys():
        model[b] = (0.7*bi_model[b])+(0.25*uni_model[b[1]])+(0.05*(1/len(V)))

    return uni_model, model

uni_model_inter, inter_model = getBigramWithInterpolation(train)

print(perplexity(train, inter_model, mode=2, count_table=(uni_model_inter, {}, {}, {})))        
print(perplexity(test, inter_model, mode=2, count_table=(uni_model_inter, {}, {}, {})))

# 73.38409869825665
# 172.67485908813356

word count for compute perplexity: 1064475
73.91948792641367
word count for compute perplexity: 626684
153.79532246166463


# Language modeling on multiple domains

Sometimes, we do not have enough data to create a language model for a new domain. In that case, we can improvise by combining several models to improve the result on the new domain.

In this exercise you will try to merge two language models from news and article domains to create a language model for the encyclopedia domain.

In [19]:
# create encyclopedia data
encyclo_data=[]
fp= io.open('BEST2010/encyclopedia.txt','r',encoding='utf-8')
for i,line in enumerate(fp):
    encyclo_data.append(line.strip()[:-1])
fp.close()

First, calculate the perplexity of your bigram model with interpolation, trained on "news data" (train), on "encyclopedia data" (test). The resulting perplexity should be around 727.35.

For your information, a bigram model with interpolation trained on "article data" (train) and tested on "encyclopedia data" (test) has a perplexity of 505.79.

In [20]:
# print perplexity of the news bigram model with interpolation on encyclopedia data
# 727.3502637212223
print(perplexity(encyclo_data, inter_model))

word count for compute perplexity: 1214496
734.6596778416083


## TODO #6 
Write a model that produces a perplexity of 450.0 or less on encyclopedia data without using data from the encyclopedia as training data. (Hint: try combining a model trained on news data with a model trained on article data.)
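
The hint can be read in more than one way. Pooling the sentences of both domains before training is one option; another is to linearly mix two separately trained models. The sketch below illustrates only the second reading, with a hypothetical `mix_models` helper, toy probabilities, and an arbitrary weight of 0.5 that would need tuning in practice.

```python
# Hypothetical helper: linear mixture of two bigram probability tables.
def mix_models(model_a, model_b, weight_a=0.5):
    keys = set(model_a) | set(model_b)
    return {
        k: weight_a * model_a.get(k, 0.0) + (1 - weight_a) * model_b.get(k, 0.0)
        for k in keys
    }

# Toy conditional distributions for the context 'a' in two domains.
news = {('a', 'b'): 0.8, ('a', 'c'): 0.2}
article = {('a', 'b'): 0.4, ('a', 'd'): 0.6}

mixed = mix_models(news, article, weight_a=0.5)
print(mixed[('a', 'b')], mixed[('a', 'c')], mixed[('a', 'd')])
```

A bigram seen in only one domain keeps part of its probability mass, which is the main reason a mixture can generalize better to a third domain.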

In [21]:
# Fill code here
article_data=[]
fp = io.open('BEST2010/article.txt','r',encoding='utf-8')
for i,line in enumerate(fp):
    article_data.append(line.strip()[:-1])
fp.close()

new_sentences = best2010+article_data
# The data is separated to train and test set with 70:30 ratio.
new_train = new_sentences[:int(len(new_sentences)*0.7)]
new_test = new_sentences[int(len(new_sentences)*0.7):]

#Training data
new_train_vocabulary =set()
new_train_word_count = 0
for line in new_train:
    for word in line.split('|'):        
        new_train_vocabulary.add(word)
        new_train_word_count+=1
print ('Total sentences in BEST2010 news training dataset :\t'+ str(len(new_train)))
print ('Total word counts in BEST2010 news training dataset :\t'+ str(new_train_word_count))
print ('Total vocabuary in BEST2010 news training dataset :\t'+ str(len(new_train_vocabulary)))
# We will use 1/vocab_size as a default value for unknown word
new_unk_value = math.pow(len(new_train_vocabulary),-1)

# Train a plain bigram model (getBigramModel) on all news + article sentences.
_, _, combined_model = getBigramModel(new_sentences)

# 428.85251789073953 (on combined data)
print('Perplexity of combine Bigram model with interpolation smoothing on encyclopedia test data',perplexity(encyclo_data, combined_model))

Total sentences in BEST2010 news training dataset :	33571
Total word counts in BEST2010 news training dataset :	1859676
Total vocabuary in BEST2010 news training dataset :	39198
word count for compute perplexity: 1214496
Perplexity of combine Bigram model with interpolation smoothing on encyclopedia test data 378.8947982046678


## TODO #7 
## Kneser-Ney on "News"

<!-- Reimplement equation 4.33 in SLP textbook (https://lagunita.stanford.edu/c4x/Engineering/CS-224N/asset/slp4.pdf) -->

Implement a bigram Kneser-Ney LM. The resulting perplexity should be around 71.14054002208687 and 174.02464248000433 on train and test data.
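
As a reference point, the sketch below implements interpolated Kneser-Ney for bigrams on toy data, with a fixed discount of 0.75. The function name `kneser_ney_bigram` and the toy sentences are assumptions; the code uses plain `zip` with `None` padding instead of NLTK and expects the context word to have been seen in training.

```python
from collections import Counter, defaultdict

# Hypothetical toy version of interpolated Kneser-Ney for bigrams.
def kneser_ney_bigram(sentences, d=0.75):
    pair_counts = Counter()
    context_counts = Counter()
    followers = defaultdict(set)  # prev -> distinct words that follow it
    preceders = defaultdict(set)  # cur  -> distinct words that precede it
    for tokens in sentences:
        padded = [None] + tokens + [None]
        for prev, cur in zip(padded, padded[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
            followers[prev].add(cur)
            preceders[cur].add(prev)
    n_types = len(pair_counts)  # number of distinct bigram types

    def prob(prev, cur):
        # Discounted bigram estimate (zero for unseen bigrams).
        discounted = max(pair_counts[(prev, cur)] - d, 0) / context_counts[prev]
        # Mass freed by discounting, redistributed via continuation probability.
        backoff_weight = d * len(followers[prev]) / context_counts[prev]
        p_continuation = len(preceders.get(cur, ())) / n_types
        return discounted + backoff_weight * p_continuation

    return prob

toy = [['a', 'b'], ['a', 'c'], ['b', 'c']]
prob = kneser_ney_bigram(toy)
total = sum(prob('a', w) for w in [None, 'a', 'b', 'c'])
print(prob('a', 'b'), prob('a', 'a'), total)
```

The unseen bigram `('a', 'a')` receives a nonzero probability through the continuation term, and the distribution after `'a'` still sums to one.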


In [22]:
# Fill code here

#-------------------------------------------
# Create unigram and bigram counting table
#-------------------------------------------
# The built-in max() is used below; importing numpy's max would shadow it
# and treat the second argument as an axis.

def getBigramWithKneser(data):
    unigram_count = defaultdict(lambda: 0.0)
    bigram_count = defaultdict(lambda: 0.0)
    model = defaultdict(lambda: 0.0)

    x = defaultdict(set)  # w_p -> set of words that follow it
    y = defaultdict(set)  # w -> set of words that precede it

    for s in data:
        for w_p, w in bigrams(s.split('|'), pad_left=True, pad_right=True):
            x[w_p].add(w)
            y[w].add(w_p)

            b = (w_p, w)
            unigram_count[w] += 1
            bigram_count[b] += 1

    for b in bigram_count.keys():
        p = (max(bigram_count[b]-0.75, 0)/unigram_count[b[0]])
        model[b] = p + (0.75*len(x[b[0]])/unigram_count[b[0]]) * (len(y[b[1]])/len(bigram_count.keys()))
        
    return unigram_count, bigram_count, x, y, model

unigram_count_kneser, bigram_count_kneser, x, y, kneser_model = getBigramWithKneser(train)
print (perplexity(train, kneser_model, mode=3, count_table=(unigram_count_kneser, bigram_count_kneser, x, y)))        
print (perplexity(test, kneser_model, mode=3, count_table=(unigram_count_kneser, bigram_count_kneser, x, y)))

# 71.14054002208687
# 174.02464248000433

word count for compute perplexity: 1064475
71.14054002208687
word count for compute perplexity: 626684
155.09274968738495


## TODO #8
## Neural LM 
Do this on the news corpus that we split into train and test sets at the beginning of this exercise.
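
For a neural LM, perplexity can be obtained from the average cross-entropy of the predicted distributions: it is the exponential of the mean negative log-probability assigned to the true next word. The sketch below shows this on hand-written toy predictions; the name `perplexity_from_predictions` and the numbers are assumptions, not output of any trained model.

```python
import math

# Perplexity = exp(mean negative log-likelihood of the true next words).
def perplexity_from_predictions(pred_probs, target_ids):
    nll = [-math.log(row[t]) for row, t in zip(pred_probs, target_ids)]
    return math.exp(sum(nll) / len(nll))

# Two toy softmax outputs over a 3-word vocabulary and their true word ids.
preds = [[0.5, 0.25, 0.25], [0.1, 0.8, 0.1]]
targets = [0, 1]

ppl = perplexity_from_predictions(preds, targets)
print(ppl)
```

This is the same quantity as `exp(loss)` when the loss is the mean categorical cross-entropy measured in nats.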

In [23]:
#find the perplexity of the model
#there are many ways to do this. e.g.:
#https://machinelearningmastery.com/develop-word-based-neural-language-models-python-keras/

In [78]:
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.models import Model
from keras.layers import Dense, LSTM, Embedding, Reshape
from keras import Input
from keras.callbacks import ModelCheckpoint
from keras import models

# generate a sequence from the model
def generate_seq(model, tokenizer, seed_text, n_words):
	in_text, result = seed_text, seed_text
	# generate a fixed number of words
	for _ in range(n_words):
		# encode the text as integer
		encoded = tokenizer.texts_to_sequences([in_text])[0]
		encoded = np.array(encoded)
		# predict a word in the vocabulary
		predict_x = model.predict(encoded)
		yhat = np.argmax(predict_x, axis=1)
		# map predicted word index to word
		out_word = ''
		for word, index in tokenizer.word_index.items():
			if index == yhat:
				out_word = word
				break
		# append to input
		in_text, result = out_word, result + ' ' + out_word
	return result


In [115]:
def prepareDataTrain(data):
	corpus = [s.split('|') for s in data]
	data_train = ""
	list_data_train = []
	percentage = int(0.5*len(corpus))
	for s in corpus[:percentage]:
		data_train += ' '.join(s)
		list_data_train.append([' '.join(s)])

	tokenizer = Tokenizer()
	tokenizer.fit_on_texts([data_train])
	vocab_size = len(tokenizer.word_index) + 1

	encoded = list()
	for s in  list_data_train:
		encoded.append(tokenizer.texts_to_sequences(s)[0])

	sequences = list()
	for s in encoded:
		for w, p in bigrams(s):
			sequences.append([w, p])
			
	print('Vocabulary Size: %d' % vocab_size)
	print('Total Sequences: %d' % len(sequences))

	sequences = np.array(sequences)
	X, y = sequences[:,0],sequences[:,1]
	# one hot encode outputs
	y = to_categorical(y, num_classes=vocab_size)

	return tokenizer, vocab_size, X, y

In [90]:
def prepareDataTest(data, tokenizer):
	corpus = [s.split('|') for s in data]
	data_train = ""
	list_data_train = []
	percentage = int(0.01*len(corpus))
	for s in corpus[:percentage]:
		data_train += ' '.join(s)
		list_data_train.append([' '.join(s)])

	vocab_size = len(tokenizer.word_index) + 1

	encoded = list()
	for s in  list_data_train:
		encoded.append(tokenizer.texts_to_sequences(s)[0])

	sequences = list()
	for s in encoded:
		for w, p in bigrams(s):
			sequences.append([w, p])
			
	print('Vocabulary Size: %d' % vocab_size)
	print('Total Sequences: %d' % len(sequences))

	sequences = np.array(sequences)
	X, y = sequences[:,0],sequences[:,1]
	# one hot encode outputs
	y = to_categorical(y, num_classes=vocab_size)

	return X, y

In [None]:
tokenizer, vocab_size, x_train, y_train = prepareDataTrain(train)

In [None]:
x_test, y_test = prepareDataTest(test, tokenizer)

In [51]:
# define model
def LM_model():
  input1 = Input(shape=(1,))  # shape must be a tuple; (1) is just the integer 1
  x = Embedding(vocab_size, 50, input_length=1)(input1)
  x = LSTM(60)(x)
  x = Dense(vocab_size, activation='softmax')(x)
  model = Model(inputs=input1, outputs=x)
  model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
  print(model.summary())
  return model


In [53]:
NN_LM_model = LM_model()

Model: "model_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_7 (InputLayer)        [(None, 1)]               0         
                                                                 
 embedding_7 (Embedding)     (None, 1, 50)             1089400   
                                                                 
 lstm_7 (LSTM)               (None, 60)                26640     
                                                                 
 dense_6 (Dense)             (None, 21788)             1329068   
                                                                 
Total params: 2,445,108
Trainable params: 2,445,108
Non-trainable params: 0
_________________________________________________________________
None


In [55]:
weight_path ='lm2.h5'

callback = [
        ModelCheckpoint(
            weight_path,
            save_best_only=True,
            save_weights_only=True,
            # fit() below passes no validation data, so monitor the training loss
            monitor='loss',
            mode='min',
            verbose=1
        )]

NN_LM_model.fit(x_train, y_train, epochs=20, callbacks=callback, verbose=1)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x2a6064e1ac0>

In [56]:
NN_LM_model.save_weights(weight_path)

In [96]:
NN_LM_model2 = LM_model()
NN_LM_model2.load_weights(weight_path)

Model: "model_10"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_11 (InputLayer)       [(None, 1)]               0         
                                                                 
 embedding_11 (Embedding)    (None, 1, 50)             1089400   
                                                                 
 lstm_11 (LSTM)              (None, 60)                26640     
                                                                 
 dense_10 (Dense)            (None, 21788)             1329068   
                                                                 
Total params: 2,445,108
Trainable params: 2,445,108
Non-trainable params: 0
_________________________________________________________________
None


In [99]:
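J, _ = NN_LM_model2.evaluate(x_test, y_test, batch_size=400)
# J is the mean categorical cross-entropy (in nats), so exp(J) is the perplexity.
# Use a separate name so the perplexity() function defined earlier is not shadowed.
nn_perplexity = np.exp(J)
print('perplexity', nn_perplexity)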

perplexity 184.36359829140466


In [114]:
# try to generate next word
print(generate_seq(NN_LM_model, tokenizer, 'ดี', 1))
print(generate_seq(NN_LM_model, tokenizer, 'แต่', 1))
print(generate_seq(NN_LM_model, tokenizer, 'นก', 1))

ดี ขึ้น
แต่ ก็
นก ใน
