# Assignment 1
You should submit the **UniversityNumber.ipynb** file and your final prediction file **UniversityNumber.test.out** to moodle. Make sure your code does not use your local files and that the results are reproducible. Before submitting, please **run your notebook and keep all running logs** so that we can check.

## 1 $n$-gram Language Model

In [None]:
!wget -O train.txt https://raw.githubusercontent.com/ranpox/comp7607-fall2022/main/assignments/A1/data/lm/train.txt
!wget -O dev.txt https://raw.githubusercontent.com/ranpox/comp7607-fall2022/main/assignments/A1/data/lm/dev.txt
!wget -O test.txt https://raw.githubusercontent.com/ranpox/comp7607-fall2022/main/assignments/A1/data/lm/test.txt

### 1.1 Building vocabulary

#### Code

In [2]:
import numpy as np
from collections import defaultdict

# So, build the model based on make the vocabulary set on training set
f = open("./data/lm/train.txt",'r+',encoding="utf-8")
lines = []
s = set(["<s>","</s>","<UNK>"])
sentences = []
for line in f.readlines():
    # some lines are still end with \n, need to remove \n
    line = line.strip()
    tmp = line.split(' ')
    sentences.append(tmp)
    for word in tmp:
        s.add(word)
f.close()

word_list = sorted(list(s))

word_count = defaultdict(int)


# count every word
for sen in sentences:
    for word in sen:
        word_count[word]+=1

# calculate <UNK>
word_count_dic = {"<UNK>":0}
for key in word_list:
    value = word_count[key]
    if value<3:
        word_count_dic["<UNK>"]+=1
    else:
        word_count_dic[key]=value

word_list = sorted(word_count_dic.keys())
word_dic = {key:idx for (idx,key) in enumerate(word_list)}
tmp = []
for key in word_dic:
    tmp.append(word_count_dic[key])
word_count = np.array(tmp)

# change sentence to id list

tmp_sentences = []
for sentence in sentences:
    tmp = []
    for word in sentence:
        if word in word_dic:
            tmp.append(word_dic[word])
        else:
            tmp.append(word_dic["<UNK>"])
    tmp_sentences.append(tmp[:])
sentences = tmp_sentences

#vocabulary size
print("vocabulary size: %d"%len(word_list))

del word_list
del word_count_dic
del tmp


# word_dic : the map between words and id, 
# word_count : the number of each word organized with id
# sentences : sentences in training data

vocabulary size: 22630


#### Discussion

Please show the vocabulary size and discuss the number of parameters of n-gram models.

The vocabulary size is 22,630, not including `<s>` and `</s>` characters. So, as for the number of parameters of n-gram models, when the n increases, the number of parameters increases sharply. For example, when n is 1 in this case, the number of parameters is 22,629 which is the vocabulary size. When n is 2, the vocabulary size is 512,116,900 which is the square of vocabulary size. For n equals 3, the vocabulary size is the cube of vocabulary size : 11,589,205,447,000. If we use a float variable to store a single probability, it would use around 84TB memory. So, the bigger n we have, more memory would be used by probability matrix. 

### 1.2 $n$-gram Language Modeling

After preparing your vocabulary, you are expected to build bigram and unigram language models and report their perplexity on the training set, and dev set. Please discuss your experimental results. If you encounter any problems, please analyze them and explain why.

#### Code

In [3]:
from nltk.util import bigrams
from collections import Counter
import math
import string


##### Model Definition

In [3]:
# unigram language model
# as for n=1, I only need to calculate every words' count and then calculate the probability matrix
# which is quite easy to write
# the probability matrix would look like this:
#           [P_0, P_1, P_2, P_3, ...... , P_n, P_start, P_end]
#           P_start, P_end are the probability of start padding and end padding

class UnigramModel:
    def __init__(self, sentences,word_dic,word_count,file = ""):
        if file!="":
            self.uniProb = np.load(file)
        else:
            self.uniProb = self.__cal_prob(sentence,word_dic,word_count)
        self.word_dic = word_dic
        
    # calculate the probability array
    def __cal_prob(self,sentences,word_dic,word_count):
        startpad, endpad = len(sentences),len(sentences)
        count = startpad+endpad
        for nu in word_count:
            count += nu

        uniProb = [0.0] * (len(word_dic)+2)
        for i in range(len(word_dic)):
            uniProb[i] = word_count[i] / count
        uniProb[-2] = startpad / count
        uniProb[-1] = endpad / count
        return np.array(uniProb)

In [12]:
# run the model and save the results
# uni = UnigramModel(sentences,word_dic,word_count)
# np.save("./unigram.npy",uni.uniProb)

In [7]:
uni = UnigramModel(sentences,word_dic,word_count,file="./unigram.npy")

In [4]:
# bigram language model
# the probability matrix would look like this:
#    first word(i) /second word(j)
#           [w_0, w_1, w_2, w_3, ...... , w_n, startpad, endpad]
#     [w_0]
#     ......
#     [startpad]
#     [endpad] this line is meaningless, deleted
#                       matrix[i][j] = P(j|i)

class BigramModel:
    def __init__(self,sentences,word_dic,word_count,file = ""):
        if file != "":
            self.biProb = np.load(file)
        else:
            # self.uniProb = UnigramModel(sentences,word_dic,word_count).uniProb
            self.biProb = self.__cal_prob(sentences,word_dic,word_count)
        self.word_dic = word_dic
    
    # calculate the probability matrix
    def __cal_prob(self,sentences,word_dic,word_count):
        count_matrix = np.zeros((len(word_dic)+1,len(word_dic)+2),dtype=float)
        startpad,endpad = len(word_dic),len(word_dic)+1
        for sen in sentences:
            for (before,after) in bigrams(sen,pad_left=True, pad_right=True, left_pad_symbol=startpad, right_pad_symbol=endpad):
                count_matrix[before][after] += 1
        
        # add <s> at the end of word_count, which is equal to the number of sentences
        word_count = np.append(word_count,np.array([len(sentences)]))
        for i in range(len(count_matrix)):
            for j in range(len(count_matrix[0])):
                count_matrix[i][j] /= word_count[i]
        return count_matrix



In [14]:
# calculate the model and save it to disk
bi = BigramModel(sentences,word_dic,word_count)    

np.save("./bigram.npy",bi.biProb)

In [7]:
# read model from file
bi = BigramModel(sentences,word_dic,word_count,"./bigram.npy")

##### calculate perplexity 

In [5]:
# calculate ppl of uni-gram model
def uni_ppl(un:UnigramModel,filePath:string):
    f = open(filePath,"r",encoding="utf-8")
    su = 0.0
    unkidx = word_dic["<UNK>"]
    M = 0
    for line in f.readlines():
        line = line.strip()
        words = line.split(" ")
        
        M += len(words)
        fail = False
        for word in words:
            if word in word_dic:
                su += math.log2(un.uniProb[word_dic[word]])
            else:
                su += math.log2(un.uniProb[unkidx])
                
        
        # for </s> multiply this probability
        su += math.log2(un.uniProb[-1])

    f.close()

    l = su/M
    # print("this is l",l)
    ppl = math.pow(2,-l)
    print("the perplexity of this uni-gram model on %s is:%f"%(filePath,ppl))

In [6]:
un = UnigramModel(sentences,word_dic,word_count,"./unigram.npy")
uni_ppl(un,"./data/lm/train.txt")
uni_ppl(un,"./data/lm/dev.txt")

the perplexity of this uni-gram model on ./data/lm/train.txt is:1520.742049
the perplexity of this uni-gram model on ./data/lm/dev.txt is:1453.860233


In [6]:
# calculate perplexity of bi-gram model
def bi_ppl(bi:BigramModel, filePath:string):
    f = open(filePath,"r",encoding="utf-8")
    su = 0.0
    unkidx = word_dic["<UNK>"]
    M = 0
    for line in f.readlines():
        line=line.strip()
        words = line.split(" ")
        M += (len(words)+2)
        # cur state at <s> which is -1 index
        cur = -1
        p = 0.0
        fail = False
        for word in words:
            if word not in word_dic:
                nextstep = unkidx
            else:
                nextstep = word_dic[word]
            if bi.biProb[cur][nextstep]==0:
                fail = True
                break
            else:
                p += math.log2(bi.biProb[cur][nextstep])
                cur = nextstep
                
        if not fail and bi.biProb[cur][-1]!=0:
            p += math.log2(bi.biProb[cur][-1]) # for </s> multiply this probability
            su += p
        else:
            su += float("-inf")
    f.close()
    l = su/M
    ppl = math.pow(2,-l)
    print("the perplexity of this bi-gram model on %s is:%f"%(filePath,ppl))

In [12]:
bi = BigramModel(sentences,word_dic,word_count,"./bigram.npy")
bi_ppl(bi,"./data/lm/train.txt")
bi_ppl(bi,"./data/lm/dev.txt")

the perplexity of this bi-gram model on ./data/lm/train.txt is:57.813127
the perplexity of this bi-gram model on ./data/lm/dev.txt is:inf


As for the first version of the program, p mutipled a lot times which leads to p equal to 0. As for the result, in the second version of the program, I change the way to calculate su.

sometimes, biProb[i][j]==0, whichi means test fail, at that time perplexity of this sentence would be float('-inf').

#### Discussion

##### Problem Encountered

During this process, I meet 2 diffient problems both related to how to calculate the perplexity. The one problem is occurred when calculate $p(x^{(i)})$, where $x^{(i)}$ is the $i^{th}$ sentence in the test set. The other problem happens when calculate the perplexity of bi-gran model.

+ Question 1

    The main issue of the first problem is mainly related to the represent ability of a float number, because Python uses float type to store decimal numbers. When a dicimal number is smaller than 0.00000001, the result of this number would be zero and this would lead to value error when I run log2 function. To resolve this problem, I use another way to calculate the perplexity. The definition of perplexity is listed below:
    $$ l = {{{1} \over {M}}  \sum^{m}_{i=1} \log_{2}{p(x^{(i)})}}  $$
    $$ ppl = 2^{-l} $$
    $$ {{p(x^{(i)})} = \prod_{j=0}^n p(w_j|w_{j-1})} \text{\quad where $w_j$ is the $j^{th}$ word in $sentence_i$, n=len($sentence_i$)} $$ 
    Because if I calculate $p(x^{(i)})$ roughly, $p(x^{(i)})$ would be too small to store in a float type number. So I decide to use this   function to calculate $l$:
    $$ l = {{{1} \over {M}}  \sum^{m}_{i=1} \sum^{n}_{j=0} \log_{2}{p(w_j|w_{j-1})}} $$
    $$\text{\quad where $w_j$ is the $j^{th}$ word in $sentence_i$, $w_{-1}$="<s>", n=len($sentence_i$)},m=len(sentences) $$
    With this function, the result of calculating perplexity would not be too small, so that the perplexity can be calculated successfully. 

+ Question 2 

    As for the second problem, it occored when $p(w_j|w_{j-1})$ equals 0 during testing bi-gram model. The mainly reason could be that the train set is not large enough to make sure that every possible $p(w_j|w_{j-1})$ is included. If a word set ($w_{j-1}$,$w_j$) is met and the probability $p(w_j|w_{j-1})$ equals zero, then the test prosess would be terminated because there is no probability to generate a sentence like the test sentence. 

    In this case, the $l$ should be self defigned in the program because calculating $\log_{2}0$ is actually an error mathmatically. So in the actual program, when then program test a sentence failed because $p(w_j|w_{j-1})=0$, it is better to add a punishment to the attribute $l$, I choose to add $-\infty$ as the result of $\sum^{n}_{j=0} \log_{2}{p(w_j|w_{j-1})}$ because $\lim_{x\rightarrow{0}}\log_{2}x=-\infty$.

##### Result Discussion

The perplexity of uni-gram model is around $1520.742049$ on training set and $1453.860233$ on dev set, and the perplexity of the bi-gram model is $57.813127$ on training set and $\infty$ on dev set.

As for the uni-gram model, the perplexity is quite high, which means this model cannot predict the sentences in the set. I believe the main reason is that this model only is contained the probability of each words and every sentence is generated in a certain probability. There are some uknown words in the dev set, in perplexity calculate process these words are be selected to <UNK> word, as a result the perplexity on dev set would be smaller than that on train set.

In [None]:
TODO:bigramppl有变 需要更改

As for the bi-gram model, the perplexity on dev set is infinite, which means this model is terrible.  Actually, in the probability matrix generated by the bi-gram model, nearly 99.9% of the conditional probability are 0. This would caused by the size of training set, if the training set is not large enough, it is possible that the probability matrix calculated by the bi-gram model could not contains every possible two word set in the real world. Therefore, the perplexity could be infinite.

In [10]:
le,wid = len(bi.biProb),len(bi.biProb[0])
t = le*wid
for i in range(le):
    for j in range(wid):
        if bi.biProb[i][j]!=0:
            t-=1
print("the probability is 0 occupied: %f%%"%(t/le/wid*100))

the probability is 0 occupied: 99.901397%


However, I believe that there is some ways that can improve this model. Firstly, increasing the size of training data. Secondly, using smoothing to give little probability to every zero members in the probability matrix. Finally, combining the bi-gram model with the uni-gram model.

### 1.3 Smoothing

#### 1.3.1 Add-one (Laplace) smoothing

##### Code

In [17]:
class AddOneBigramModel:
    def __init__(self,sentences,word_dic,word_count,file = ""):
        if file != "":
            self.biProb = np.load(file)
        else:
            # self.uniProb = UnigramModel(sentences,word_dic,word_count).uniProb
            self.biProb = self.__cal_prob(sentences,word_dic,word_count)
        self.word_dic = word_dic
    
    # with add-one smoothing
    def __cal_prob(self,sentences,word_dic,word_count):
        # for the one in add one smoothing
        count_matrix = np.ones((len(word_dic)+1,len(word_dic)+2),dtype=float)
        startpad,endpad = len(word_dic),len(word_dic)+1
        for sen in sentences:
            for (before,after) in bigrams(sen,pad_left=True, pad_right=True, left_pad_symbol=startpad, right_pad_symbol=endpad):
                count_matrix[before][after] += 1
        
        V = len(word_dic)+2
        # add <s> at the end of word_count, which is equal to the number of sentences
        word_count = np.append(word_count,np.array([len(sentences)]))

        for i in range(len(count_matrix)):
            for j in range(len(count_matrix[0])):
                if count_matrix[i][j] != 0:
                    count_matrix[i][j] /= (word_count[i]+V)
        return count_matrix


In [18]:
AddOneSmoothing = AddOneBigramModel(sentences,word_dic,word_count)
np.save("addone.npy",AddOneSmoothing.biProb)


In [19]:
AddOneSmoothing = AddOneBigramModel(sentences,word_dic,word_count,file="./addone.npy")
bi_ppl(AddOneSmoothing,"./data/lm/train.txt")
bi_ppl(AddOneSmoothing,"./data/lm/dev.txt")

the perplexity of this bi-gram model on ./data/lm/train.txt is:680.127225
the perplexity of this bi-gram model on ./data/lm/dev.txt is:773.448825


##### Discussion

The perplexity of this bi-gram model after add-one smoothing are $6613410.735104$ on training data and $7462337.935784$ on dev data.

This is the function to calculate the perplexity with add-one smoothing:
$$ P^{*}_{Add-one}(w_n|w_{n-1})={{c(w_n,w_{n-1})+1} \over {c(w_{n-1})+V}}$$

For every $P^{*}_{Add-one}(w_n|w_{n-1})$ that $c(w_n,w_{n-1})$ equals zero, the probability is $ {1} \over {c(w_{n-1})+V}$ rather than 0, which means that the model could also have the ability to predict those unseen sentences. Therefore, we can see from the result, the perplexity on the training data is increased significantly and that on the dev data is decreased a lot.

##### Optional: Add-k smoothing

##### Code

In [14]:
from collections import defaultdict

class AddKBigramModel:
    def __init__(self,sentences,word_dic,word_count,k:float =1.0):
        
        self.word_dic = word_dic
        self.count1 = word_count
        # self.count1[len(self.word_dic)],self.count1[len(self.word_dic)+1] = len(sentences),len(sentences)
        self.count1 = np.append(self.count1,[len(sentences),len(sentences)])
        self.V = len(self.count1)
        self.k = k
        self.count2 = self.__count(sentences)
    
    # with add-one smoothing
    def __count(self,sentences):
        # for the k
        # count_matrix = np.full((len(word_dic)+1,len(word_dic)+2),k,dtype=float)
        # startpad,endpad = len(word_dic),len(word_dic)+1
        # for sen in sentences:
        #     for (before,after) in bigrams(sen,pad_left=True, pad_right=True, left_pad_symbol=startpad, right_pad_symbol=endpad):
        #         count_matrix[before][after] += 1
        
        # V = (len(word_dic)+1)*(len(word_dic)+2)*k
        # # add <s> at the end of word_count, which is equal to the number of sentences
        # word_count = np.append(word_count,np.array([len(sentences)]))

        # for i in range(len(count_matrix)):
        #     for j in range(len(count_matrix[0])):
        #         if count_matrix[i][j] != 0:
        #             count_matrix[i][j] /= (word_count[i]+V)
        # return count_matrix
        count = defaultdict(int)
        startpad,endpad = len(word_dic),len(word_dic)+1

        for sen in sentences:
            for (before,after) in bigrams(sen,pad_left=True, pad_right=True, left_pad_symbol=startpad, right_pad_symbol=endpad):
                count[(before,after)] += 1
        return count
    
    def P(self,before,after):
        c = self.count2[(before,after)]+self.k
        
        return c/(self.count1[before]+self.k*self.V)
        
    
    def ppl(self,filePath = ""):
        f = open(filePath,"r",encoding="utf-8")
        su = 0.0
        unkidx = word_dic["<UNK>"]
        M = 0
        start_idx = len(self.word_dic)
        endidx = start_idx+1
        
        for line in f.readlines():
            line=line.strip()
            words = line.split(" ")
            M += (len(words)+2)
            # cur state at <s> which is start_idx
            cur = start_idx
            p = 0.0
            fail = False
            for word in words:
                if word not in word_dic:
                    nextstep = unkidx
                else:
                    nextstep = word_dic[word]
                try:
                    p += math.log2(self.P(cur,nextstep))
                except ValueError:
                    fail = True
                    break
                cur = nextstep
            if not fail and self.P(cur,endidx)!=0:
                p += math.log2(self.P(cur,endidx)) # for </s> multiply this probability
                su += p
            else:
                su += float("-inf")
        f.close()
        l = su/M
        ppl = math.pow(2,-l)
        print("the perplexity of this bi-gram model on %s is:%f"%(filePath,ppl))
            
        

try k=0.5, 0.05, 0.01 in the next block

In [43]:
klist = [0.5,0.05,0.01]

# for k in klist:
#     t = AddKBigramModel(sentences,word_dic,word_count,k=k)
#     t.ppl("./data/lm/train.txt")
#     t.ppl("./data/lm/dev.txt")
#     del t # to save memory
# print()
t = AddKBigramModel(sentences,word_dic,word_count,k=0.001)
t.ppl("./data/lm/train.txt")
t.ppl("./data/lm/dev.txt")

the perplexity of this bi-gram model on ./data/lm/train.txt is:65.850263
the perplexity of this bi-gram model on ./data/lm/dev.txt is:203.603063


#### Discussion

||k=1(Add-One smoothing)|k=0.5|k=0.05|k=0.01|k=0(Original Bigram Model)|
|----------------|--------------|-----------------|-----------------|-----------|------|
|ppl on train set|$680.127225$|$467.120613$|$158.389028$|$95.195821$|$57.813127$|
|ppl on dev set|$773.448825$|$571.122468$|$266.324065$|$205.331177$|$\infty$|

In [None]:
#TODO: 后续更进

According to the data shown in the table above, the smaller k is, the lower perplexity is and the better model I have,especially when k=0.01. In bigram model, the perplexity is extremely high because in most cases the predict process is failed as the conditional probability is zero. After add-one smoothing, the algorithm move some probability to the zero part. But it seems that this probability is so big that it interferes the normal predict process. Hence, add-k smoothing decreases the moved probability by using a parameter k to make sure that every conditional probability could have a minimun probability which is ${{k} \over {c(w_{n-1})+kV}}$. And in this case, the model works best when $k=0.01$.

#### 1.3.2 Linear Interpolation

##### Code

In [4]:
# a probability of a word is r1P1+r2P2+r3P3
from nltk.util import trigrams
from collections import defaultdict

class LinearNGram:
    def __init__(self, sentences:list, word_dic:dict,word_count:np.array,r1,r2,r3) -> None:
        self.word_dic = word_dic
        R = r1+r2+r3
        self.r = [r1/R,r2/R,r3/R]

        self.count_matrix1 = defaultdict(int)
        self.sum1 = sum(word_count)
        for (id,num) in enumerate(word_count):
            self.count_matrix1[id]=num        

        self.count_matrix2 = defaultdict(int)
        self.count_matrix3 = defaultdict(int)
        self.__genCount(sentences)
    
    def __genCount(self,sentences) :
        startpad,endpad = len(self.word_dic),len(self.word_dic)+1

        self.count_matrix1[startpad] = len(sentences)
        self.count_matrix1[endpad] = len(sentences)

        self.count_matrix2[(startpad,startpad)] = len(sentences)

        for sen in sentences:
            for (before,after) in bigrams(sen,pad_left=True, pad_right=True, left_pad_symbol=startpad, right_pad_symbol=endpad):
                self.count_matrix2[(before,after)] += 1
            for (before1,before2,after) in trigrams(sen,pad_left=True, pad_right=True, left_pad_symbol=startpad, right_pad_symbol=endpad):
                self.count_matrix3[(before1,before2,after)] += 1

    def _P1(self,word):
        return self.count_matrix1[word]/self.sum1
    
    def _P2(self, w_before,w_after):
        return self.count_matrix2[(w_before,w_after)]/self.count_matrix1[w_before]
    
    def _P3(self, w_before1,w_before2,w_after):
        if self.count_matrix3[(w_before1,w_before2,w_after)]==0:
            return 0
        # print(self.count_matrix3[(w_before1,w_before2,w_after)],"/",self.count_matrix2[(w_before1,w_before2)])
        return self.count_matrix3[(w_before1,w_before2,w_after)]/self.count_matrix2[(w_before1,w_before2)]
    
    def P(self,w_before1,w_before2,w_after):
        return self.r[0]*self._P1(w_after) + self.r[1]*self._P2(w_before2,w_after) + \
            self.r[2]*self._P3(w_before1,w_before2,w_after)
    
    def changeR(self,r1,r2,r3):
        R = r1+r2+r3
        self.r = [r1/R,r2/R,r3/R]

    def cal_ppl(self,filePath = ""):
        f = open(filePath,'r',encoding='utf-8')
        unkidx = word_dic["<UNK>"]
        su = 0.0
        M = 0
        start_status = [len(self.word_dic)]*2
        end_idx = len(self.word_dic)+1
        for line in f.readlines():
            line = line.strip()    # remove \n
            words = line.split(" ")
            M += (len(words)+4)
            p = 0.0
            before = start_status
            for word in line.split(" "):
                if word in self.word_dic:
                    nextstep = self.word_dic[word]
                else:
                    nextstep = unkidx
                
                P = self.P(before[0],before[1],nextstep)
                p += math.log2(P)
                before[0],before[1] = before[1],nextstep
            
            # (word_n-1,word_n,end)
            p += math.log2(self.P(before[0],before[1],end_idx))
            # (word_n,end,end)
            p += math.log2(self.P(before[1],end_idx,end_idx))
            su += p
        f.close()
        l = su/M
        ppl = math.pow(2,-l)
        print("the perplexity of Linear Interpolation on %s is:%f"%(filePath,ppl))
        return ppl
    

In [30]:
lin = LinearNGram(sentences,word_dic,word_count,1,1,1)
_ = lin.cal_ppl("./data/lm/train.txt")
_ = lin.cal_ppl("./data/lm/dev.txt")


the perplexity of Linear Interpolation on ./data/lm/train.txt is:16.263836
the perplexity of Linear Interpolation on ./data/lm/dev.txt is:116.733965


In [33]:
lin.changeR(0.2,0.5,0.3)
_ = lin.cal_ppl("./data/lm/train.txt")
_ = lin.cal_ppl("./data/lm/dev.txt")

the perplexity of Linear Interpolation on ./data/lm/train.txt is:16.427578
the perplexity of Linear Interpolation on ./data/lm/dev.txt is:115.295325


In [40]:
lin.changeR(0.4,0.4,0.2)
_ = lin.cal_ppl("./data/lm/train.txt")
_ = lin.cal_ppl("./data/lm/dev.txt")

the perplexity of Linear Interpolation on ./data/lm/train.txt is:21.154220
the perplexity of Linear Interpolation on ./data/lm/dev.txt is:116.853827


In [38]:
lin.changeR(0.3,0.3,0.4)
_ = lin.cal_ppl("./data/lm/train.txt")
_ = lin.cal_ppl("./data/lm/dev.txt")

the perplexity of Linear Interpolation on ./data/lm/train.txt is:14.740797
the perplexity of Linear Interpolation on ./data/lm/dev.txt is:118.667031


In [42]:
lin.changeR(0.3,0.4,0.3)
_ = lin.cal_ppl("./data/lm/train.txt")
_ = lin.cal_ppl("./data/lm/dev.txt")

the perplexity of Linear Interpolation on ./data/lm/train.txt is:16.900205
the perplexity of Linear Interpolation on ./data/lm/dev.txt is:114.909482


In [45]:
lin.changeR(0.99,0.005 ,0.005)
_ = lin.cal_ppl("./data/lm/train.txt")
_ = lin.cal_ppl("./data/lm/dev.txt")

the perplexity of Linear Interpolation on ./data/lm/train.txt is:179.900565
the perplexity of Linear Interpolation on ./data/lm/dev.txt is:336.230913


In [46]:
lin.changeR(0.005,0.99,0.005)
_ = lin.cal_ppl("./data/lm/train.txt")
_ = lin.cal_ppl("./data/lm/dev.txt")

the perplexity of Linear Interpolation on ./data/lm/train.txt is:54.274330
the perplexity of Linear Interpolation on ./data/lm/dev.txt is:241.058748


In [47]:
lin.changeR(0.005,0.005,0.99)
_ = lin.cal_ppl("./data/lm/train.txt")
_ = lin.cal_ppl("./data/lm/dev.txt")

the perplexity of Linear Interpolation on ./data/lm/train.txt is:10.893156
the perplexity of Linear Interpolation on ./data/lm/dev.txt is:778.714180


In [50]:
lin.changeR(0.3,0.4,0.3)
_ = lin.cal_ppl("./data/lm/train.txt")
_ = lin.cal_ppl("./data/lm/dev.txt")

the perplexity of Linear Interpolation on ./data/lm/train.txt is:16.900205
the perplexity of Linear Interpolation on ./data/lm/dev.txt is:114.909482


##### Discussion

When R=[0.3,0.4,0.3], the perplexity on train set and dev set are smallest, which are:16.427578 and 115.295325 respectively. Actually, during my test, I find that the perplexity on train set is mainly influenced by trigram model and bigram model, while the trigram model seems to have a more significant effect on that. And unigram model seems to have a more significant effect on the perplexity on dev set, which means that unigram have a effect on predicting unseen words. With the limitation of bigram model, the perplexity on dev set could decrease to 114.909482 in the case R=[0.3,0.4,0.3]. 

##### Optional: Optimization

梯度下降，对ppl进行求导
Three are several ways to get the local optimum paramaters like parametric sweep, gradient descent search and line search. Here I would like to use gradient descent in my model.

In [5]:
class LinearNGramWIthGrad(LinearNGram):
    def __init__(self, sentences:list, word_dic:dict,word_count:np.array,r1,r2,r3,learning_rate:float = 0.00001):
        super(LinearNGramWIthGrad, self).__init__(sentences, word_dic, word_count, r1,r2,r3)
        self.learning_rate = learning_rate
        self.partial = [0.0,0.0,0.0]
    
    def __addpartical(self,w_before1,w_before2,w_after):
        P = self.P(w_before1,w_before2,w_after)
        ln2 = math.log(2)
        P1=self._P1(w_after)
        P2=self._P2(w_before2,w_after)
        P3=self._P3(w_before1,w_before2,w_after)
        self.partial[0]+=((P1-P2)/P/ln2)
        self.partial[1]+=((P2-P3)/P/ln2)
        self.partial[2]+=((P3-P1)/P/ln2)
    
    def calDescent(self,ppl,M):
        tmp = ppl*math.log(2)*(-1/M)*self.learning_rate
        descent = [tmp*each for each in self.partial]
        self.partial = [0.0,0.0,0.0]
        return descent
    
    def adjustRwithDesc(self,descent):
        print("Original R:",self.r,end="--->")
        self.r[0] -= descent[0]
        self.r[1] -= descent[1]
        self.r[2] -= descent[2]
        R = sum(self.r)
        for i in range(3):
            self.r[i] /= R
        print(self.r)
    
    def cal_ppl(self,filePath = ""):
        f = open(filePath,'r',encoding='utf-8')
        unkidx = word_dic["<UNK>"]
        su = 0.0
        M = 0
        start_status = [len(self.word_dic)]*2
        end_idx = len(self.word_dic)+1
        for line in f.readlines():
            line = line.strip()    # remove \n
            words = line.split(" ")
            M += (len(words)+4)
            p = 0.0
            before = start_status
            for word in line.split(" "):
                if word in self.word_dic:
                    nextstep = self.word_dic[word]
                else:
                    nextstep = unkidx
                
                P = self.P(before[0],before[1],nextstep)
                p += math.log2(P)

                self.__addpartical(before[0],before[1],nextstep)

                before[0],before[1] = before[1],nextstep
            
            # (word_n-1,word_n,end)
            p += math.log2(self.P(before[0],before[1],end_idx))
            self.__addpartical(before[0],before[1],end_idx)
            # (word_n,end,end)
            p += math.log2(self.P(before[1],end_idx,end_idx))
            self.__addpartical(before[1],end_idx,end_idx)
            su += p
        f.close()
        l = su/M
        ppl = math.pow(2,-l)
        print("the perplexity of Linear Interpolation on %s is:%f"%(filePath,ppl))

        descent = self.calDescent(ppl,M)
        print("descent gradient descent search result:", descent)
    
        return (ppl,descent)
    


In [10]:
grad = LinearNGramWIthGrad(sentences,word_dic,word_count,1,1,1,0.001)
bestR = []
minppldev = float('inf')
ppltrain = 0
bestT = 0
for i in range(9):
    print("Turn ",i,)
    ppl,descent = grad.cal_ppl("./data/lm/train.txt")
    grad.adjustRwithDesc(descent)
    ppl_dev,_ = grad.cal_ppl("./data/lm/dev.txt")
    if ppl_dev<minppldev:
        bestT = i
        bestR = grad.r
        minppldev = ppl_dev
        ppltrain = ppl
print("Best Turn %d\n\rppl on train set %f\n\rppl on dev set %f\n\rhyper parameter is "%(bestT,ppltrain,minppldev), end="")
print(bestR)



Turn  0
the perplexity of Linear Interpolation on ./data/lm/train.txt is:16.263836
descent gradient descent search result: [0.005773954942551029, 0.023184705785425735, -0.02895866072802317]
Original R: [0.3333333333333333, 0.3333333333333333, 0.3333333333333333]--->[0.3275593783907671, 0.3101486275478932, 0.36229199406133966]
the perplexity of Linear Interpolation on ./data/lm/dev.txt is:117.735373
descent gradient descent search result: [0.038300759000789455, -0.046801384764369, 0.008500625763541]
Turn  1
the perplexity of Linear Interpolation on ./data/lm/train.txt is:15.599342
descent gradient descent search result: [0.005444389703770653, 0.020552056120207517, -0.02599644582395413]
Original R: [0.3275593783907671, 0.3101486275478932, 0.36229199406133966]--->[0.3221149886870042, 0.2895965714276927, 0.38828843988530315]
the perplexity of Linear Interpolation on ./data/lm/dev.txt is:118.853514
descent gradient descent search result: [0.041801761197270776, -0.05608472524574322, 0.014282

##### Discussion

## 2 Preposition Prediction

In [None]:
!wget -O dev.in https://raw.githubusercontent.com/ranpox/comp7607-fall2022/main/assignments/A1/data/prep/dev.in
!wget -O dev.out https://raw.githubusercontent.com/ranpox/comp7607-fall2022/main/assignments/A1/data/prep/dev.out

In [20]:
import random


def predict(model :LinearNGramWIthGrad,FilePath:string):
    word_predict = ["at","in","of","on","for"]
    f = open(FilePath,"r",encoding="utf-8")
    startpad,endpad = len(model.word_dic),len(model.word_dic)+1
    ret = []
    for line in f.readlines():
        line = line.strip()
        words = line.split(" ")
        
        line_pre = []

        for (idx,w) in enumerate(words):
            if w != "<PREP>":
                continue
            before1,before2 = 0,0
            if idx == 0:
                before1,before2 = startpad,startpad
            elif idx ==1:
                if words[idx-1] in model.word_dic:
                    before1,before2 = startpad,model.word_dic[words[idx-1]]
                else:
                    before1,before2 = startpad,model.word_dic["<UNK>"]
            else:
                if words[idx-2] in model.word_dic:
                    before1 = model.word_dic[words[idx-2]]
                else:
                    before1 = model.word_dic["<UNK>"]
                if words[idx-1] in model.word_dic:
                    before2 = model.word_dic[words[idx-1]]
                else:
                    before2 = model.word_dic["<UNK>"]
            predic = ""
            # choose max P
            maxp = float('-inf')
            for wp in word_predict:
                if maxp < model.P(before1,before2,model.word_dic[wp]):
                    # print(before1,before2,wp)
                    maxp = model.P(before1,before2,model.word_dic[wp])
                    predic = wp

            # random choose
            # weight = [model.P(before1,before2,model.word_dic[wp]) for wp in word_predict]
            # predic = random.choices(word_predict,weight,k=1)[0]
            
            line_pre.append(predic)
        ret.append(line_pre[:])
    f.close()
    return ret

def cal_correct(preResult, OutPath:string):
    f = open(OutPath,'r',encoding = 'utf-8')
    i = 0
    total_num = 0
    correct_num = 0
    for (idx,line) in enumerate(f.readlines()):
        line = line.strip()
        words = line.split(" ")
        if len(words)!=len(preResult[idx]):
            raise Exception("Invalaid input: number of input incorrect")
        for i in range(len(words)):
            total_num += 1
            if words[i] == preResult[idx][i]:
                correct_num +=1
    print("correct rate:",correct_num/total_num)
    f.close()
        


In [29]:
model = LinearNGramWIthGrad(sentences,word_dic,word_count,0.2888295493805419, 0.1856826522594521, 0.5254877983600059)
result = predict(model,FilePath="./data/prep/dev.in")
# print(result)
cal_correct(result,OutPath="./data/prep/dev.out")

correct rate: 0.5989829277152198
