In [6]:
! nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



In [7]:
# ! wget --no-check-certificate https://raw.githubusercontent.com/ekapolc/nlp_course/master/HW2/BEST2010.zip
# ! unzip BEST2010.zip

# Language Modeling using Ngram

In this exercise, you will use NLTK, a natural language processing library for Python, to create a bigram language model and its variations. You will build one model of each of the following types and calculate its perplexity:
- Unigram Model
- Bigram Model
- Bigram Model with Laplace smoothing
- Bigram Model with Interpolation

As a reminder,
### Don't forget to shut down your instance on Gcloud when you are not using it ###


In [8]:
# First, we import the necessary libraries, such as math, nltk (for bigrams), and collections.
import math
import nltk
import io
import random
from random import shuffle
from nltk import bigrams, trigrams
from collections import Counter, defaultdict
random.seed(999)
from IPython.core.debugger import set_trace

BEST2010 is a free Thai NLP dataset from NECTEC that is commonly used as a standard benchmark for various NLP tasks, including language modeling. BEST2010 is separated into 4 domains: article, encyclopedia, news, and novel. The data is already tokenized using '|' as a separator.

For example,

ตาม|ที่|นางประนอม ทองจันทร์| |กับ| |ด.ช.กิตติพงษ์ แหลมผักแว่น| |และ| |ด.ญ.กาญจนา กรองแก้ว| |ป่วย|สงสัย|ติด|เชื้อ|ไข้|ขณะ|นี้|ยัง|ไม่|ดี|ขึ้น|
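
As a small illustration of this format, the sketch below splits a fragment of the example line on '|'. It suggests why the loading code drops the last character of each line: the trailing separator would otherwise produce an empty final token. The variable names here are illustrative only.

```python
# A fragment of the example line above, including its trailing separator.
line = 'ป่วย|สงสัย|ติด|เชื้อ|ไข้|'

# Splitting the raw line leaves an empty string after the final '|'.
raw_tokens = line.split('|')

# Dropping the trailing separator first leaves only real tokens.
tokens = line[:-1].split('|')

print(raw_tokens)
print(tokens)
```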

In [9]:
# We choose news domain as our dataset
best2010=[]
fp= io.open('BEST2010/news.txt','r',encoding='utf-8')
for i,line in enumerate(fp):
    best2010.append(line.strip()[:-1])
fp.close()
vocabuary = set()
total_word_count =0
for line in best2010:
    for word in line.split('|'):        
        vocabuary.add(word)
        total_word_count+=1

In [10]:
best2010[:5]

['สงสัย|ติด|หวัด|นก| |อีก|คน|ยัง|น่า|ห่วง',
 'ตาม|ที่|นางประนอม ทองจันทร์| |กับ| |ด.ช.กิตติพงษ์ แหลมผักแว่น| |และ| |ด.ญ.กาญจนา กรองแก้ว| |ป่วย|สงสัย|ติด|เชื้อ|ไข้|ขณะ|นี้|ยัง|ไม่|ดี|ขึ้น',
 'หลัง|เข้า|เยี่ยม|ดู|อาการ|ผู้|ป่วย|แล้ว| |น.พ.จรัล|ประชุม|ร่วม|กับ|เจ้าหน้าที่|ทุก|ฝ่าย| |เพื่อ|สรุป|ผล|การ|ดำเนิน|การ| |รวม|ทั้ง|สอบสวน|โรค|ก่อน|ที่|ผู้|ป่วย|จะ|ถูก|ส่ง|มา|รักษา|ตัว| |จาก|นั้น|ร่วม|กัน|แถลง|ข่าว| |โดย| |น.พ.จรัล|กล่าว|ว่า| |ขณะ|นี้|ผู้|ป่วย|ทั้ง| |3| |ราย| |อาการ|ยัง|ทรง| |โดย|ใน|ราย|ของ| |ด.ช.กิตติพงษ์| |กับ| |ด.ญ.กาญจนา| |ปอด|หาย|เป็น|ปกติ|แล้ว| |คาด|ว่า|จะ|กลับ|บ้าน|ได้|ใน|ไม่|ช้า|นี้| |แต่|ใน|ราย|ของ|นางประนอม|อาการ|ยัง|น่า|เป็นห่วง| |ซึ่ง|ทั้ง| |3| |ราย| |ใน|ชั้น|นี้|ถือ|ว่า|เป็น|ผู้|ป่วย|อยู่|ใน|ขั้น|น่า|สงสัย|อาจ|ติด|เชื้อ|ไข้|หวัด|นก| |เพราะ|ตรวจ|พบ|ผู้|ป่วย|มี|อาการ|ปอด|บวม|ปอด|อักเสบ| |เนื่อง|จาก|ติด|เชื้อ|ไวรัส| |แต่|ยัง|สรุป|ไม่|ได้|ว่า|ติด|เชื้อ|ไข้|หวัด|นก|แน่ชัด|หรือ|ไม่| |ต้อง|รอ|ผล|ตรวจ|จาก|ห้อง|ปฏิบัติการ|ที่|ได้|ส่ง|ตัวอย่าง|เลือด| |ไป|ตรวจ|พิสูจน์|ที่|กรมวิทยาศ

In [11]:
# For simplicity, we assume that each line is a sentence.
print ('Total sentences in BEST2010 news dataset :\t'+ str(len(best2010)))
print ('Total word counts in BEST2010 news dataset :\t'+ str(total_word_count))
print ('Total vocabuary in BEST2010 news dataset :\t'+ str(len(vocabuary)))

Total sentences in BEST2010 news dataset :	30969
Total word counts in BEST2010 news dataset :	1660190
Total vocabuary in BEST2010 news dataset :	35488


We separate our input into 2 sets, train and test data, with a 70:30 ratio.

In [12]:
sentences = best2010
# The data is separated to train and test set with 70:30 ratio.
train = sentences[:int(len(sentences)*0.7)]
test = sentences[int(len(sentences)*0.7):]

# We will use 1/vocabulary size as the default probability for an unknown word
unk_value = math.pow(len(vocabuary),-1)

# Unigram

In this section, we will demonstrate how to build a unigram language model

In [13]:
def getUnigramModel(data):
    model = defaultdict(lambda: 0)
    total_word_count =0
    for sentence in data:
        sentence +=  u'|</s>'
        for w1 in sentence.split('|'):
            model[w1] +=1.0
            total_word_count+=1
    for w1 in model:
        model[w1] = model[w1]/(total_word_count)
    return model

In [14]:
model = getUnigramModel(train)

In [15]:
len(model), list(model.items())[100:110]

(26241,
 [('ห้อง', 0.0005251415016792316),
  ('ปฏิบัติการ', 5.35475234270415e-05),
  ('ตัวอย่าง', 0.00014655111674769251),
  ('เลือด', 0.00013903567486319546),
  ('ไป', 0.008119495525963503),
  ('พิสูจน์', 0.00011085276779633152),
  ('กรมวิทยาศาสตร์การแพทย์กระทรวงสาธารณสุข', 9.394302355621316e-07),
  ('รพ.ศิริราช', 1.78491744756805e-05),
  ('ทราบ', 0.0010878602127809484),
  ('วัน', 0.0037736912562530826)])

In [16]:
def getLnValue(x):
    if x >0.0:
        return math.log(x)
    else:
        return math.log(unk_value)

In [17]:
# log probability of 'นายก'
print(getLnValue(model[u'นายก']))
# for example, the log probability of 'นายกรัฐมนตรี', which is an unknown word, is equal to
print(getLnValue(model[u'นายกรัฐมนตรี']))
# probability of 'นายก' 'ได้' 'ให้' 'สัมภาษณ์' 'กับ' 'สื่อ'
prob = getLnValue(model[u'นายก'])+getLnValue(model[u'ได้'])+ getLnValue(model[u'ให้'])+getLnValue(model[u'สัมภาษณ์'])+getLnValue(model[u'กับ'])+getLnValue(model[u'สื่อ'])
print ('Problability of a sentence', math.exp(prob))

-6.551526663995246
-10.476949890150093
Problability of a sentence 2.75827124812635e-16


# Perplexity

To compare language models, we need to calculate perplexity. In this task, you should write the perplexity calculation code for the unigram model. The resulting perplexity should be around 513.97 and 452.66 on the train and test data, respectively.
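
As a reminder of the quantity involved, perplexity is the exponential of the negative average log probability per token. The sketch below computes it for a toy unigram model. The probabilities and sentences are hypothetical and are not taken from BEST2010. The function name `toy_perplexity` is likewise only illustrative.

```python
import math

# Hypothetical toy unigram probabilities (not taken from BEST2010).
toy_model = {'a': 0.5, 'b': 0.25, '</s>': 0.25}
toy_sentences = [['a', 'b', '</s>'], ['a', 'a', '</s>']]

def toy_perplexity(sentences, model):
    """exp(-(1/N) * sum of ln P(w)) over all N tokens."""
    ln_prob = 0.0
    n_tokens = 0
    for tokens in sentences:
        for tok in tokens:
            ln_prob += math.log(model[tok])
            n_tokens += 1
    return math.exp(-ln_prob / n_tokens)

toy_pp = toy_perplexity(toy_sentences, toy_model)
print(toy_pp)
```

A lower value means the model is, on average, less surprised by each token. A model that assigned probability 1 to every token would reach the minimum perplexity of 1.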

## TODO #1 Calculate perplexity

In [18]:
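def calculate_sentence_ln_prob(sentence, model):
    res = 0.0
    # [:-1] skips the last token: the '</s>' appended by perplexity(),
    # or the final word when called on a sentence without that marker.
    for word in sentence.split('|')[:-1]:
        res += getLnValue(model[word])
    return res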

def perplexity(test,model):
    ln_prob = 0.0
    count = 0
    for sentence in test:
        sentence += u'|</s>'
        ln_prob += calculate_sentence_ln_prob(sentence, model) 
        count += len(sentence.split('|'))
    return  math.exp( -ln_prob/count)

In [19]:
w1 = calculate_sentence_ln_prob("นายก|ได้|ให้|สัมภาษณ์|กับ|สื่อ", model)
w1, math.exp(w1)

(-29.07324334845302, 2.3640183468996036e-13)

In [20]:
# train

In [21]:
print(perplexity(train,model))
print(perplexity(test,model))

513.9747870973313
452.65524668351003


## TODO #2 Please explain why this model gives us such a high perplexity.

**Your answer**:  .........

# Bigram

Next, you will create a better language model than the unigram model (which is not much of a baseline to compare with). Counting every pair of words that occurs in our corpus by hand would be very tedious, so NLTK provides a simple helper that does it for us.

In [22]:
#example of nltk usage for bigram
sentence = 'I always search google for an answer .'

print('This is how nltk generate bigram.')
for w1,w2 in bigrams(sentence.split(), pad_right=True, pad_left=True):
    print (w1,w2)
print('None is used as a start and end of sentence symbol.')

This is how nltk generate bigram.
None I
I always
always search
search google
google for
for an
an answer
answer .
. None
None is used as a start and end of sentence symbol.


Now, you should be able to implement a bigram model by yourself. You must also write a new perplexity calculation for the bigram model. The resulting perplexity should be around 58.54 and 153.36 on the train and test data, respectively.
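
As a rough guide to the estimate involved, the maximum-likelihood bigram probability is P(w2 | w1) = count(w1, w2) / count(w1, *). The sketch below applies it to a hypothetical toy corpus. To stay self-contained it pads each sentence with `None` by hand, mirroring the pairs that `bigrams(..., pad_left=True, pad_right=True)` printed above. The function name `toy_bigram_model` is an assumption for illustration.

```python
from collections import Counter, defaultdict

def toy_bigram_model(sentences):
    """MLE bigram estimates P(w2 | w1) = count(w1, w2) / count(w1, *)."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        padded = [None] + tokens + [None]  # None marks sentence start/end
        for w1, w2 in zip(padded, padded[1:]):
            counts[w1][w2] += 1
    model = {}
    for w1, following in counts.items():
        total = sum(following.values())
        model[w1] = {w2: c / total for w2, c in following.items()}
    return model

# Hypothetical three-sentence corpus.
toy_bigram = toy_bigram_model([['a', 'b'], ['a', 'c'], ['b', 'a']])
print(toy_bigram['a'])
print(toy_bigram[None])
```

Each row of the resulting table sums to 1, because the counts for a context are divided by that context's total.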

## TODO #3 Write Bigram Model

In [23]:
# bigrams??

In [24]:
def getBigramModel(data):
    model = defaultdict(lambda: defaultdict(lambda: 0.0))
    
    for sentence in data:        
        for w1,w2 in bigrams(sentence.split("|"), pad_right=True, pad_left=True):
            model[w1][w2] +=1.0

    for w1 in model:
        unigram = float(sum(model[w1].values()))
        for w2 in model[w1]:
            model[w1][w2] /= unigram
    return model

model = getBigramModel(train)

In [25]:
len(model), list(model['กิน'])[:10]

(26241, ['ซาก', 'ไก่', 'สัตว์', None, ' ', 'ไข่', 'มา', 'ไม่', 'อาหาร', 'มาก'])

## TODO #4 Write Perplexity for Bigram Model

Sum the log probabilities at the sentence level rather than the word level.

In [26]:
def calculate_sentence_ln_prob(sentence, model):
    res = 0.0
    for w1, w2 in bigrams(sentence.split("|"), pad_right=True, pad_left=True):
        res += getLnValue(model[w1][w2])
    return res

def perplexity(test,model):
    ln_prob = 0.0
    count = 0    
    for sentence in test:        
        ln_prob += calculate_sentence_ln_prob(sentence, model) 
        count += len(sentence.split("|")) + 1         

    return  math.exp( -ln_prob / count)

In [27]:
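# Note: nltk pads with the None object, not the string u'None', so the first and
# last lookups below do not hit the padding symbol and fall back to the default 0.0.
model[u'None'][u'นายก'], model[u'นายก'][u'ได้'], model[u'ได้'][u'ให้'], model[u'ให้'][u'สัมภาษณ์'], model[u'สัมภาษณ์'][u'กับ'], model[u'กับ'][u'สื่อ'], model[u'สื่อ'][u'None']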

(0.0,
 0.0,
 0.012016638422431059,
 0.02084212302759261,
 0.010452961672473868,
 0.004985459077690071,
 0.0)

In [28]:
w1 = calculate_sentence_ln_prob("นายก|ได้|ให้|สัมภาษณ์|กับ|สื่อ", model)
w1, math.exp(w1)

(-39.281485033665355, 8.715008377345561e-18)

In [29]:
print (perplexity(train,model) )
print (perplexity(test, model))

58.78942889767147
153.76867837707394


# Smoothing

N-gram models usually suffer from a sparsity problem: the training data does not contain every possible n-gram, so many valid word sequences receive zero probability. Smoothing techniques can alleviate this problem. In this section, you will implement two basic smoothing methods for bigrams: Laplace smoothing and interpolation.
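
To make the idea concrete, add-one (Laplace) smoothing replaces the raw estimate with (count(w1, w2) + 1) / (count(w1, *) + V), where V is the vocabulary size. The sketch below applies the formula at lookup time, which is one way to make sure unseen pairs also receive some probability. The counts, vocabulary, and function name are hypothetical.

```python
from collections import Counter

def laplace_bigram_prob(counts, vocab_size, w1, w2):
    """Add-one estimate: (count(w1, w2) + 1) / (count(w1, *) + V)."""
    row = counts.get(w1, {})
    return (row.get(w2, 0) + 1.0) / (sum(row.values()) + vocab_size)

# Hypothetical toy counts: 'a' was followed by 'b' twice and by 'c' once.
toy_counts = {'a': Counter({'b': 2, 'c': 1})}
toy_vocab = ['a', 'b', 'c', 'd']

p_seen = laplace_bigram_prob(toy_counts, len(toy_vocab), 'a', 'b')
p_unseen = laplace_bigram_prob(toy_counts, len(toy_vocab), 'a', 'd')
print(p_seen, p_unseen)
```

Summed over the whole vocabulary, the smoothed probabilities for a context still add up to 1. The price is that seen bigrams lose a large share of their mass when V is big, which tends to hurt perplexity.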

## TODO #5 Write Bigram with Laplace smoothing (Add-One Smoothing)

In [37]:
#Laplace Smoothing
def getBigramWithLaplaceSmoothing(data):
    model = defaultdict(lambda: defaultdict(lambda: 0.0))
    
    for sentence in data:        
        for w1,w2 in bigrams(sentence.split("|"), pad_right=True, pad_left=True):
            model[w1][w2] += 1.0
            
    for w1 in model:
        unigram = float(sum(model[w1].values()))
        for w2 in model[w1]:            
            model[w1][w2] = (model[w1][w2] + 1.0) / (unigram + len(vocabuary))            
    return model

In [38]:
model = getBigramWithLaplaceSmoothing(train)
print (perplexity(train,model))
print (perplexity(test, model))    

1242.629211428872
1398.9466787861493


## TODO #6 Write Bigram with Interpolation
The lambda values are 0.7 for the bigram, 0.25 for the unigram, and 0.05 for unknown words.
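
A minimal sketch of how these weights might combine is given below. It assumes the usual linear interpolation, in which the unigram term is the probability of the second word and the unknown-word term is a uniform 1/V. That reading of the formula is an assumption, and the function name and toy numbers are hypothetical.

```python
def interpolated_prob(p_bigram, p_unigram, vocab_size,
                      lambdas=(0.7, 0.25, 0.05)):
    """Linear interpolation of bigram, unigram, and uniform estimates.

    p_bigram  : MLE estimate of P(w2 | w1), 0.0 if the pair is unseen
    p_unigram : MLE estimate of P(w2), 0.0 if the word is unseen
    """
    l_bi, l_uni, l_unk = lambdas
    return l_bi * p_bigram + l_uni * p_unigram + l_unk / vocab_size

# Hypothetical values with a vocabulary of 100 words.
p_seen_pair = interpolated_prob(0.5, 0.1, 100)
p_unseen_pair = interpolated_prob(0.0, 0.1, 100)
p_unknown_word = interpolated_prob(0.0, 0.0, 100)
print(p_seen_pair, p_unseen_pair, p_unknown_word)
```

Because the three weights sum to 1 and each component is itself a distribution, the result stays a valid probability, and even a completely unknown word gets a small non-zero value.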

In [49]:
#interpolation
def getBigramWithInterpolation(data):
    model = defaultdict(lambda: defaultdict(lambda: 0.0))
    
    for sentence in data:        
        for w1,w2 in bigrams(sentence.split("|"), pad_right=True, pad_left=True):
            model[w1][w2] += 1.0
            
    for w1 in model:
        unigram = float(sum(model[w1].values()))
        for w2 in model[w1]:            
            model[w1][w2] = 0.7 * model[w1][w2] / unigram  +  0.25 * unigram / total_word_count + 0.05 * unk_value          
    return model

In [50]:
model = getBigramWithInterpolation(train)
print (perplexity(train,model))        
print (perplexity(test,model))

44.68441323825217
118.71873335281947


The resulting perplexity on the training and test data should be

    1231.14, 1390.85 for Laplace smoothing
    39.90, 107.27 for Interpolation

# Language modeling on multiple domains

Sometimes, we do not have enough data to create a language model for a new domain. In that case, we can improvise by combining several models to improve results on the new domain.

In this exercise, you will try to merge two language models, from the news and article domains, to create a language model for the encyclopedia domain.

In [51]:
# create encyclopedia data
encyclo_data=[]
fp= io.open('BEST2010/encyclopedia.txt','r',encoding='utf-8')
for i,line in enumerate(fp):
    encyclo_data.append(line.strip()[:-1])
fp.close()

First, you should try to calculate the perplexity of your bigram model with interpolation on the encyclopedia data. The resulting perplexity should be around 466.17.

For your information, a bigram model with interpolation trained on article data and tested on encyclopedia data has a perplexity of 281.24.

In [52]:
# print perplexity of bigram with interpolation on encyclopedia data
print (perplexity(encyclo_data,model))

562.2437986285148


## TODO #7 
Write a model that produces a perplexity of 260.0 or less on the encyclopedia data without using data from the encyclopedia domain as training data. (Hint: Try combining a model trained on news data with a model trained on article data.)
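
One possible reading of the hint is a linear mixture of the two domain models, sketched below on hypothetical toy tables. The function name `mixture_prob` and the parameter `weight_a` are assumptions for illustration. The weight would have to be tuned on held-out data that does not come from the encyclopedia domain. Pooling the two training sets into a single model is another option.

```python
def mixture_prob(model_a, model_b, w1, w2, weight_a=0.5):
    """Weighted average of two bigram models' estimates of P(w2 | w1)."""
    p_a = model_a.get(w1, {}).get(w2, 0.0)
    p_b = model_b.get(w1, {}).get(w2, 0.0)
    return weight_a * p_a + (1.0 - weight_a) * p_b

# Hypothetical bigram tables standing in for the news and article models.
toy_news = {'x': {'y': 0.2}}
toy_article = {'x': {'y': 0.6, 'z': 0.4}}

p_xy = mixture_prob(toy_news, toy_article, 'x', 'y')
p_xz = mixture_prob(toy_news, toy_article, 'x', 'z')
p_xy_skewed = mixture_prob(toy_news, toy_article, 'x', 'y', weight_a=0.25)
print(p_xy, p_xz, p_xy_skewed)
```

The mixture lets a bigram that is missing from one domain still receive probability from the other, which is the main reason combining models can help on a third domain.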

# Kneser-Ney Smoothing



## TODO #8 (optional: Bonus 0.5 points) 
Implement Kneser-Ney smoothing with news and article data.
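
As a starting point, the sketch below shows interpolated Kneser-Ney for bigrams on a hypothetical toy corpus. It subtracts a fixed discount from each seen bigram count and redistributes that mass according to a continuation probability, that is, how many distinct contexts a word follows. The discount of 0.75, the function names, and the manual `None` padding are assumptions, and words never seen in training still get zero probability here, so a fallback such as `unk_value` would be needed on real data.

```python
from collections import Counter, defaultdict

def train_kneser_ney(sentences, discount=0.75):
    """Return a function prob(w1, w2) using interpolated Kneser-Ney."""
    bigram_counts = defaultdict(Counter)
    for tokens in sentences:
        padded = [None] + tokens + [None]  # None marks sentence boundaries
        for w1, w2 in zip(padded, padded[1:]):
            bigram_counts[w1][w2] += 1

    # Continuation count: number of distinct contexts each word follows.
    continuation = Counter()
    for w1, following in bigram_counts.items():
        for w2 in following:
            continuation[w2] += 1
    n_bigram_types = sum(continuation.values())

    def prob(w1, w2):
        p_cont = continuation[w2] / n_bigram_types
        following = bigram_counts.get(w1)
        if not following:
            return p_cont  # unseen context: rely on continuation only
        total = sum(following.values())
        discounted = max(following[w2] - discount, 0.0) / total
        backoff_weight = discount * len(following) / total
        return discounted + backoff_weight * p_cont

    return prob

# Hypothetical three-sentence corpus.
kn_prob = train_kneser_ney([['a', 'b'], ['a', 'c'], ['b', 'a']])
print(kn_prob('a', 'b'), kn_prob('c', 'a'))
```

Unlike add-one smoothing, the mass given to an unseen pair depends on how versatile the second word is, not only on the vocabulary size.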

In [None]:
print (perplexity(encyclo_data,model))