# Assignment 1
You should submit the **UniversityNumber.ipynb** file and your final prediction file **UniversityNumber.test.out** to moodle. Make sure your code does not use your local files and that the results are reproducible. Before submitting, please **run your notebook and keep all running logs** so that we can check.

## 1 $n$-gram Language Model

In [None]:
!wget -O train.txt https://raw.githubusercontent.com/ranpox/comp7607-fall2022/main/assignments/A1/data/lm/train.txt
!wget -O dev.txt https://raw.githubusercontent.com/ranpox/comp7607-fall2022/main/assignments/A1/data/lm/dev.txt
!wget -O test.txt https://raw.githubusercontent.com/ranpox/comp7607-fall2022/main/assignments/A1/data/lm/test.txt

### 1.1 Building vocabulary

#### Code

In [1]:
import numpy as np
from collections import defaultdict

# So, build the model based on make the vocabulary set on training set
f = open("./data/lm/train.txt",'r+',encoding="utf-8")
lines = []
s = set(["<s>","</s>","<UNK>"])
sentences = []
for line in f.readlines():
    # some lines are still end with \n, need to remove \n
    line = line.strip()
    tmp = line.split(' ')
    sentences.append(tmp)
    for word in tmp:
        s.add(word.lower())
f.close()

word_list = sorted(list(s))

word_count = defaultdict(int)


# count every word
for sen in sentences:
    for word in sen:
        word_count[word.lower()]+=1

# calculate <UNK>
word_count_dic = {"<UNK>":0}
for key in word_list:
    value = word_count[key]
    if value<3:
        word_count_dic["<UNK>"]+=1
    else:
        word_count_dic[key]=value

word_list = sorted(word_count_dic.keys())
word_dic = {key:idx for (idx,key) in enumerate(word_list)}
tmp = []
for key in word_dic:
    tmp.append(word_count_dic[key])
word_count = np.array(tmp)

# change sentence to id list

tmp_sentences = []
for sentence in sentences:
    tmp = []
    for word in sentence:
        word = word.lower()
        if word in word_dic:
            tmp.append(word_dic[word])
        else:
            tmp.append(word_dic["<UNK>"])
    tmp_sentences.append(tmp[:])
sentences = tmp_sentences

#vocabulary size
print("vocabulary size: %d"%len(word_list))

del word_list
del word_count_dic
del tmp


# word_dic : the map between words and id, 
# word_count : the number of each word organized with id
# sentences : sentences in training data

vocabulary size: 20661


#### Discussion

Please show the vocabulary size and discuss the number of parameters of n-gram models.

The vocabulary size is 20,661, not including `<s>` and `</s>` characters. So, as for the number of parameters of n-gram models, when the n increases, the number of parameters increases sharply. For example, when n is 1 in this case, the number of parameters is 22,629 which is the vocabulary size. When n is 2, the vocabulary size is 426,876,921 which is the square of vocabulary size. For n equals 3, the vocabulary size is the cube of vocabulary size : 11,589,205,447,000. If we use a float variable to store a single probability, it would use around 32TB memory. So, the bigger n we have, more memory would be used by probability matrix. 

### 1.2 $n$-gram Language Modeling

After preparing your vocabulary, you are expected to build bigram and unigram language models and report their perplexity on the training set, and dev set. Please discuss your experimental results. If you encounter any problems, please analyze them and explain why.

#### Code

In [2]:
from nltk.util import bigrams
from collections import Counter
import math
import string


##### Model Definition

In [3]:
# unigram language model
# as for n=1, I only need to calculate every words' count and then calculate the probability matrix
# which is quite easy to write
# the probability matrix would look like this:
#           [P_0, P_1, P_2, P_3, ...... , P_n, P_start, P_end]
#           P_start, P_end are the probability of start padding and end padding

class UnigramModel:
    def __init__(self, sentences,word_dic,word_count,file = ""):
        if file!="":
            self.uniProb = np.load(file)
        else:
            self.uniProb = self.__cal_prob(sentence,word_dic,word_count)
        self.word_dic = word_dic
        
    # calculate the probability array
    def __cal_prob(self,sentences,word_dic,word_count):
        startpad, endpad = len(sentences),len(sentences)
        count = startpad+endpad
        for nu in word_count:
            count += nu

        uniProb = [0.0] * (len(word_dic)+2)
        for i in range(len(word_dic)):
            uniProb[i] = word_count[i] / count
        uniProb[-2] = startpad / count
        uniProb[-1] = endpad / count
        return np.array(uniProb)

In [4]:
# run the model and save the results
uni = UnigramModel(sentences,word_dic,word_count)
np.save("./unigram.npy",uni.uniProb)

In [5]:
uni = UnigramModel(sentences,word_dic,word_count,file="./unigram.npy")

In [6]:
# bigram language model
# the probability matrix would look like this:
#    first word(i) /second word(j)
#           [w_0, w_1, w_2, w_3, ...... , w_n, startpad, endpad]
#     [w_0]
#     ......
#     [startpad]
#     [endpad] this line is meaningless, deleted
#                       matrix[i][j] = P(j|i)

class BigramModel:
    def __init__(self,sentences,word_dic,word_count,file = ""):
        if file != "":
            self.biProb = np.load(file)
        else:
            # self.uniProb = UnigramModel(sentences,word_dic,word_count).uniProb
            self.biProb = self.__cal_prob(sentences,word_dic,word_count)
        self.word_dic = word_dic
    
    # calculate the probability matrix
    def __cal_prob(self,sentences,word_dic,word_count):
        count_matrix = np.zeros((len(word_dic)+1,len(word_dic)+2),dtype=float)
        startpad,endpad = len(word_dic),len(word_dic)+1
        for sen in sentences:
            for (before,after) in bigrams(sen,pad_left=True, pad_right=True, left_pad_symbol=startpad, right_pad_symbol=endpad):
                count_matrix[before][after] += 1
        
        # add <s> at the end of word_count, which is equal to the number of sentences
        word_count = np.append(word_count,np.array([len(sentences)]))
        for i in range(len(count_matrix)):
            for j in range(len(count_matrix[0])):
                count_matrix[i][j] /= word_count[i]
        return count_matrix



In [7]:
# calculate the model and save it to disk
bi = BigramModel(sentences,word_dic,word_count)    

np.save("./bigram.npy",bi.biProb)

In [7]:
# read model from file
bi = BigramModel(sentences,word_dic,word_count,"./bigram.npy")

##### calculate perplexity 

In [8]:
# calculate ppl of uni-gram model
def uni_ppl(un:UnigramModel,filePath:string):
    f = open(filePath,"r",encoding="utf-8")
    su = 0.0
    unkidx = word_dic["<UNK>"]
    M = 0
    for line in f.readlines():
        line = line.strip()
        words = line.split(" ")
        
        M += len(words)
        fail = False
        for word in words:
            word = word.lower()
            if word in word_dic:
                su += math.log2(un.uniProb[word_dic[word]])
            else:
                su += math.log2(un.uniProb[unkidx])
                
        
        # for </s> multiply this probability
        su += math.log2(un.uniProb[-1])

    f.close()

    l = su/M
    # print("this is l",l)
    ppl = math.pow(2,-l)
    print("the perplexity of this uni-gram model on %s is:%f"%(filePath,ppl))

In [9]:
un = UnigramModel(sentences,word_dic,word_count,"./unigram.npy")
uni_ppl(un,"./data/lm/train.txt")
uni_ppl(un,"./data/lm/dev.txt")

the perplexity of this uni-gram model on ./data/lm/train.txt is:1342.968477
the perplexity of this uni-gram model on ./data/lm/dev.txt is:1291.994081


In [10]:
# calculate perplexity of bi-gram model
def bi_ppl(bi:BigramModel, filePath:string):
    f = open(filePath,"r",encoding="utf-8")
    su = 0.0
    unkidx = word_dic["<UNK>"]
    M = 0
    for line in f.readlines():
        line=line.strip()
        words = line.split(" ")
        M += (len(words)+2)
        # cur state at <s> which is -1 index
        cur = -1
        p = 0.0
        fail = False
        for word in words:
            word = word.lower()
            if word not in word_dic:
                nextstep = unkidx
            else:
                nextstep = word_dic[word]
            if bi.biProb[cur][nextstep]==0:
                fail = True
                break
            else:
                p += math.log2(bi.biProb[cur][nextstep])
                cur = nextstep
                
        if not fail and bi.biProb[cur][-1]!=0:
            p += math.log2(bi.biProb[cur][-1]) # for </s> multiply this probability
            su += p
        else:
            su += float("-inf")
    f.close()
    l = su/M
    ppl = math.pow(2,-l)
    print("the perplexity of this bi-gram model on %s is:%f"%(filePath,ppl))

In [11]:
bi = BigramModel(sentences,word_dic,word_count,"./bigram.npy")
bi_ppl(bi,"./data/lm/train.txt")
bi_ppl(bi,"./data/lm/dev.txt")

the perplexity of this bi-gram model on ./data/lm/train.txt is:60.886884
the perplexity of this bi-gram model on ./data/lm/dev.txt is:inf


#### Discussion

##### Problem Encountered

During this process, I meet 2 diffient problems both related to how to calculate the perplexity. The one problem is occurred when calculate $p(x^{(i)})$, where $x^{(i)}$ is the $i^{th}$ sentence in the test set. The other problem happens when calculate the perplexity of bi-gran model.

+ Question 1

    The main issue of the first problem is mainly related to the represent ability of a float number, because Python uses float type to store decimal numbers. When a dicimal number is smaller than 0.00000001, the result of this number would be zero and this would lead to value error when I run log2 function. To resolve this problem, I use another way to calculate the perplexity. The definition of perplexity is listed below:
    $$ l = {{{1} \over {M}}  \sum^{m}_{i=1} \log_{2}{p(x^{(i)})}}  $$
    $$ ppl = 2^{-l} $$
    $$ {{p(x^{(i)})} = \prod_{j=0}^n p(w_j|w_{j-1})} \text{\quad where $w_j$ is the $j^{th}$ word in $sentence_i$, n=len($sentence_i$)} $$ 
    Because if I calculate $p(x^{(i)})$ roughly, $p(x^{(i)})$ would be too small to store in a float type number. So I decide to use this   function to calculate $l$:
    $$ l = {{{1} \over {M}}  \sum^{m}_{i=1} \sum^{n}_{j=0} \log_{2}{p(w_j|w_{j-1})}} $$
    $$\text{\quad where $w_j$ is the $j^{th}$ word in $sentence_i$, $w_{-1}$="<s>", n=len($sentence_i$)},m=len(sentences) $$
    With this function, the result of calculating perplexity would not be too small, so that the perplexity can be calculated successfully. 

+ Question 2 

    As for the second problem, it occored when $p(w_j|w_{j-1})$ equals 0 during testing bi-gram model. The mainly reason could be that the train set is not large enough to make sure that every possible $p(w_j|w_{j-1})$ is included. If a word set ($w_{j-1}$,$w_j$) is met and the probability $p(w_j|w_{j-1})$ equals zero, then the test prosess would be terminated because there is no probability to generate a sentence like the test sentence. 

    In this case, the $l$ should be self defigned in the program because calculating $\log_{2}0$ is actually an error mathmatically. So in the actual program, when then program test a sentence failed because $p(w_j|w_{j-1})=0$, it is better to add a punishment to the attribute $l$, I choose to add $-\infty$ as the result of $\sum^{n}_{j=0} \log_{2}{p(w_j|w_{j-1})}$ because $\lim_{x\rightarrow{0}}\log_{2}x=-\infty$.

##### Result Discussion

The perplexity of uni-gram model is around $1342.968477$ on training set and $1291.994081$ on dev set, and the perplexity of the bi-gram model is $60.886884$ on training set and $\infty$ on dev set.

As for the uni-gram model, the perplexity is quite high, which means this model cannot predict the sentences in the set. I believe the main reason is that this model only is contained the probability of each words and every sentence is generated in a certain probability. There are some uknown words in the dev set, in perplexity calculate process these words are be selected to <UNK> word, as a result the perplexity on dev set would be smaller than that on train set.

As for the bi-gram model, the perplexity on dev set is infinite, which means this model is terrible.  Actually, in the probability matrix generated by the bi-gram model, nearly 99.9% of the conditional probability are 0. This would caused by the size of training set, if the training set is not large enough, it is possible that the probability matrix calculated by the bi-gram model could not contains every possible two word set in the real world. Therefore, the perplexity could be infinite.

In [12]:
le,wid = len(bi.biProb),len(bi.biProb[0])
t = le*wid
for i in range(le):
    for j in range(wid):
        if bi.biProb[i][j]!=0:
            t-=1
print("the probability is 0 occupied: %f%%"%(t/le/wid*100))

the probability is 0 occupied: 99.885346%


However, I believe that there is some ways that can improve this model. Firstly, increasing the size of training data. Secondly, using smoothing to give little probability to every zero members in the probability matrix. Finally, combining the bi-gram model with the uni-gram model.

### 1.3 Smoothing

#### 1.3.1 Add-one (Laplace) smoothing

##### Code

In [13]:
class AddOneBigramModel:
    def __init__(self,sentences,word_dic,word_count,file = ""):
        if file != "":
            self.biProb = np.load(file)
        else:
            # self.uniProb = UnigramModel(sentences,word_dic,word_count).uniProb
            self.biProb = self.__cal_prob(sentences,word_dic,word_count)
        self.word_dic = word_dic
    
    # with add-one smoothing
    def __cal_prob(self,sentences,word_dic,word_count):
        # for the one in add one smoothing
        count_matrix = np.ones((len(word_dic)+1,len(word_dic)+2),dtype=float)
        startpad,endpad = len(word_dic),len(word_dic)+1
        for sen in sentences:
            for (before,after) in bigrams(sen,pad_left=True, pad_right=True, left_pad_symbol=startpad, right_pad_symbol=endpad):
                count_matrix[before][after] += 1
        
        V = len(word_dic)+2
        # add <s> at the end of word_count, which is equal to the number of sentences
        word_count = np.append(word_count,np.array([len(sentences)]))

        for i in range(len(count_matrix)):
            for j in range(len(count_matrix[0])):
                if count_matrix[i][j] != 0:
                    count_matrix[i][j] /= (word_count[i]+V)
        return count_matrix


In [14]:
AddOneSmoothing = AddOneBigramModel(sentences,word_dic,word_count)
np.save("addone.npy",AddOneSmoothing.biProb)


In [15]:
AddOneSmoothing = AddOneBigramModel(sentences,word_dic,word_count,file="./addone.npy")
bi_ppl(AddOneSmoothing,"./data/lm/train.txt")
bi_ppl(AddOneSmoothing,"./data/lm/dev.txt")

the perplexity of this bi-gram model on ./data/lm/train.txt is:616.761618
the perplexity of this bi-gram model on ./data/lm/dev.txt is:701.757009


##### Discussion

The perplexity of this bi-gram model after add-one smoothing are $616.761618$ on training data and $701.757009$ on dev data.

This is the function to calculate the perplexity with add-one smoothing:
$$ P^{*}_{Add-one}(w_n|w_{n-1})={{c(w_n,w_{n-1})+1} \over {c(w_{n-1})+V}}$$

For every $P^{*}_{Add-one}(w_n|w_{n-1})$ that $c(w_n,w_{n-1})$ equals zero, the probability is $ {1} \over {c(w_{n-1})+V}$ rather than 0, which means that the model could also have the ability to predict those unseen sentences. Therefore, we can see from the result, the perplexity on the training data is increased significantly and that on the dev data is decreased a lot.

##### Optional: Add-k smoothing

##### Code

In [16]:
from collections import defaultdict

class AddKBigramModel:
    def __init__(self,sentences,word_dic,word_count,k:float =1.0):
        
        self.word_dic = word_dic
        self.count1 = word_count
        self.count1 = np.append(self.count1,[len(sentences),len(sentences)])
        self.V = len(self.count1)
        self.k = k
        self.count2 = self.__count(sentences)
    
    # with add-one smoothing
    def __count(self,sentences):
        count = defaultdict(int)
        startpad,endpad = len(word_dic),len(word_dic)+1

        for sen in sentences:
            for (before,after) in bigrams(sen,pad_left=True, pad_right=True, left_pad_symbol=startpad, right_pad_symbol=endpad):
                count[(before,after)] += 1
        return count
    
    def P(self,before,after):
        c = self.count2[(before,after)]+self.k
        
        return c/(self.count1[before]+self.k*self.V)
        
    
    def ppl(self,filePath = ""):
        f = open(filePath,"r",encoding="utf-8")
        su = 0.0
        unkidx = word_dic["<UNK>"]
        M = 0
        start_idx = len(self.word_dic)
        endidx = start_idx+1
        
        for line in f.readlines():
            line=line.strip()
            words = line.split(" ")
            M += (len(words)+2)
            # cur state at <s> which is start_idx
            cur = start_idx
            p = 0.0
            fail = False
            for word in words:
                word = word.lower()
                if word not in word_dic:
                    nextstep = unkidx
                else:
                    nextstep = word_dic[word]
                try:
                    p += math.log2(self.P(cur,nextstep))
                except ValueError:
                    fail = True
                    break
                cur = nextstep
            if not fail and self.P(cur,endidx)!=0:
                p += math.log2(self.P(cur,endidx)) # for </s> multiply this probability
                su += p
            else:
                su += float("-inf")
        f.close()
        l = su/M
        ppl = math.pow(2,-l)
        print("the perplexity of this bi-gram model on %s is:%f"%(filePath,ppl))
            
        

try k=0.5, 0.05, 0.01 in the next block

In [18]:
klist = [0.5,0.05,0.01]

for k in klist:
    print("k=",k)
    t = AddKBigramModel(sentences,word_dic,word_count,k=k)
    t.ppl("./data/lm/train.txt")
    t.ppl("./data/lm/dev.txt")
    del t # to save memory
    print()
# t = AddKBigramModel(sentences,word_dic,word_count,k=0.001)
# t.ppl("./data/lm/train.txt")
# t.ppl("./data/lm/dev.txt")

k= 0.5
the perplexity of this bi-gram model on ./data/lm/train.txt is:428.730279
the perplexity of this bi-gram model on ./data/lm/dev.txt is:523.154802

k= 0.05
the perplexity of this bi-gram model on ./data/lm/train.txt is:152.856003
the perplexity of this bi-gram model on ./data/lm/dev.txt is:253.474545

k= 0.01
the perplexity of this bi-gram model on ./data/lm/train.txt is:95.180805
the perplexity of this bi-gram model on ./data/lm/dev.txt is:200.570883



#### Discussion

||k=1(Add-One smoothing)|k=0.5|k=0.05|k=0.01|k=0(Original Bigram Model)|
|----------------|--------------|-----------------|-----------------|-----------|------|
|ppl on train set|$616.761618$|$428.730279$|$152.856003$|$95.180805$|$60.886884$|
|ppl on dev set|$701.757009$|$523.154802$|$253.474545$|$200.570883$|$\infty$|

According to the data shown in the table above, the smaller k is, the lower perplexity is and the better model I have,especially when k=0.01. In bigram model, the perplexity is extremely high because in most cases the predict process is failed as the conditional probability is zero. After add-one smoothing, the algorithm move some probability to the zero part. But it seems that this probability is so big that it interferes the normal predict process. Hence, add-k smoothing decreases the moved probability by using a parameter k to make sure that every conditional probability could have a minimun probability which is ${{k} \over {c(w_{n-1})+kV}}$. And in this case, the model works best when $k=0.01$.

#### 1.3.2 Linear Interpolation

##### Code

In [3]:
# a probability of a word is r1P1+r2P2+r3P3
from nltk.util import trigrams
from collections import defaultdict

class LinearNGram:
    def __init__(self, sentences:list, word_dic:dict,word_count:np.array,
                r1,r2,r3) -> None:
        self.word_dic = word_dic
        R = r1+r2+r3
        self.r = [r1/R,r2/R,r3/R]

        self.count_matrix1 = defaultdict(int)
        self.sum1 = sum(word_count)
        for (id,num) in enumerate(word_count):
            self.count_matrix1[id]=num        
        self.V2 = len(self.count_matrix1)+2
        self.V3 = self.V2**2

        self.count_matrix2 = defaultdict(int)
        self.count_matrix3 = defaultdict(int)
        self.__genCount(sentences)
    
    def __genCount(self,sentences) :
        startpad,endpad = len(self.word_dic),len(self.word_dic)+1

        self.count_matrix1[startpad] = len(sentences)
        self.count_matrix1[endpad] = len(sentences)

        self.count_matrix2[(startpad,startpad)] = len(sentences)

        for sen in sentences:
            for (before,after) in bigrams(sen,pad_left=True, pad_right=True, left_pad_symbol=startpad, right_pad_symbol=endpad):
                self.count_matrix2[(before,after)] += 1
            for (before1,before2,after) in trigrams(sen,pad_left=True, pad_right=True, left_pad_symbol=startpad, right_pad_symbol=endpad):
                self.count_matrix3[(before1,before2,after)] += 1

    def _P1(self,word):
        return self.count_matrix1[word]/self.sum1
    
    def _P2(self, w_before,w_after):
        return self.count_matrix2[(w_before,w_after)]/self.count_matrix1[w_before]
    
    def _P3(self, w_before1,w_before2,w_after):
        if self.count_matrix2[(w_before1,w_before2)] == 0:
            return 0 
        else:
            return self.count_matrix3[(w_before1,w_before2,w_after)]/self.count_matrix2[(w_before1,w_before2)]

    def P(self,w_before1,w_before2,w_after):
        return self.r[0]*self._P1(w_after) + self.r[1]*self._P2(w_before2,w_after) + \
            self.r[2]*self._P3(w_before1,w_before2,w_after)
    
    def changeR(self,r1,r2,r3):
        R = r1+r2+r3
        self.r = [r1/R,r2/R,r3/R]

    def cal_ppl(self,filePath = ""):
        f = open(filePath,'r',encoding='utf-8')
        unkidx = word_dic["<UNK>"]
        su = 0.0
        M = 0
        start_status = [len(self.word_dic)]*2
        end_idx = len(self.word_dic)+1
        for line in f.readlines():
            line = line.strip()    # remove \n
            words = line.split(" ")
            M += (len(words)+4)
            p = 0.0
            before = start_status
            for word in line.split(" "):
                word = word.lower()
                if word in self.word_dic:
                    nextstep = self.word_dic[word]
                else:
                    nextstep = unkidx
                
                P = self.P(before[0],before[1],nextstep)
                p += math.log2(P)
                before[0],before[1] = before[1],nextstep
            
            # (word_n-1,word_n,end)
            p += math.log2(self.P(before[0],before[1],end_idx))
            # (word_n,end,end)
            p += math.log2(self.P(before[1],end_idx,end_idx))
            su += p
        f.close()
        l = su/M
        ppl = math.pow(2,-l)
        print("the perplexity of Linear Interpolation on %s is:%f"%(filePath,ppl))
        return ppl
    

In [4]:
lin = LinearNGram(sentences,word_dic,word_count,1,1,1)
_ = lin.cal_ppl("./data/lm/train.txt")
_ = lin.cal_ppl("./data/lm/dev.txt")

the perplexity of Linear Interpolation on ./data/lm/train.txt is:16.259040
the perplexity of Linear Interpolation on ./data/lm/dev.txt is:109.440165


In [13]:
lin.changeR(0.2,0.5,0.3)
_ = lin.cal_ppl("./data/lm/train.txt")
_ = lin.cal_ppl("./data/lm/dev.txt")

the perplexity of Linear Interpolation on ./data/lm/train.txt is:16.454264
the perplexity of Linear Interpolation on ./data/lm/dev.txt is:108.226245


In [6]:
lin.changeR(0.4,0.4,0.2)
_ = lin.cal_ppl("./data/lm/train.txt")
_ = lin.cal_ppl("./data/lm/dev.txt")

the perplexity of Linear Interpolation on ./data/lm/train.txt is:21.097855
the perplexity of Linear Interpolation on ./data/lm/dev.txt is:109.763287


In [7]:
lin.changeR(0.3,0.3,0.4)
_ = lin.cal_ppl("./data/lm/train.txt")
_ = lin.cal_ppl("./data/lm/dev.txt")

the perplexity of Linear Interpolation on ./data/lm/train.txt is:14.748860
the perplexity of Linear Interpolation on ./data/lm/dev.txt is:111.143467


In [8]:
lin.changeR(0.99,0.005 ,0.005)
_ = lin.cal_ppl("./data/lm/train.txt")
_ = lin.cal_ppl("./data/lm/dev.txt")

the perplexity of Linear Interpolation on ./data/lm/train.txt is:170.210268
the perplexity of Linear Interpolation on ./data/lm/dev.txt is:308.730019


In [9]:
lin.changeR(0.005,0.99,0.005)
_ = lin.cal_ppl("./data/lm/train.txt")
_ = lin.cal_ppl("./data/lm/dev.txt")

the perplexity of Linear Interpolation on ./data/lm/train.txt is:54.929393
the perplexity of Linear Interpolation on ./data/lm/dev.txt is:227.670777


In [10]:
lin.changeR(0.005,0.005,0.99)
_ = lin.cal_ppl("./data/lm/train.txt")
_ = lin.cal_ppl("./data/lm/dev.txt")

the perplexity of Linear Interpolation on ./data/lm/train.txt is:10.944311
the perplexity of Linear Interpolation on ./data/lm/dev.txt is:709.194908


In [11]:
lin.changeR(0.05,0.45,0.5)
_ = lin.cal_ppl("./data/lm/train.txt")
_ = lin.cal_ppl("./data/lm/dev.txt")

the perplexity of Linear Interpolation on ./data/lm/train.txt is:12.848475
the perplexity of Linear Interpolation on ./data/lm/dev.txt is:131.527135


In [20]:
lin.changeR(0.3,0.4,0.3)
_ = lin.cal_ppl("./data/lm/train.txt")
_ = lin.cal_ppl("./data/lm/dev.txt")

the perplexity of Linear Interpolation on ./data/lm/train.txt is:16.900118
the perplexity of Linear Interpolation on ./data/lm/dev.txt is:107.828938


##### Discussion

When R=[0.3,0.4,0.3], the perplexity on train set and dev set are smallest, which are:$16.900118$ and $107.828938$ respectively. Actually, during my test, I find that the perplexity on train set is mainly influenced by trigram model and bigram model, while the trigram model seems to have a more significant effect on that. And unigram model seems to have a more significant effect on the perplexity on dev set, which means that unigram have a effect on predicting unseen words. With the limitation of bigram model, the perplexity on dev set could decrease to 114.909482 in the case R=[0.3,0.4,0.3]. 

##### Optional: Optimization

In [4]:
class LinearNGramWIthGrad(LinearNGram):
    def __init__(self, sentences:list, word_dic:dict,word_count:np.array,r1,r2,r3,learning_rate:float = 0.00001):
        super(LinearNGramWIthGrad, self).__init__(sentences, word_dic, word_count, r1,r2,r3)
        self.learning_rate = learning_rate
        self.partial = [0.0,0.0,0.0]
    
    def __addpartical(self,w_before1,w_before2,w_after):
        P = self.P(w_before1,w_before2,w_after)
        ln2 = math.log(2)
        P1=self._P1(w_after)
        P2=self._P2(w_before2,w_after)
        P3=self._P3(w_before1,w_before2,w_after)
        self.partial[0]+=((P1-P2)/P/ln2)
        self.partial[1]+=((P2-P3)/P/ln2)
        self.partial[2]+=((P3-P1)/P/ln2)
    
    def calDescent(self,ppl,M):
        tmp = ppl*(-1/M)*math.log(2)*self.learning_rate
        descent = [tmp*each for each in self.partial]
        self.partial = [0.0,0.0,0.0]
        return descent
    
    def adjustRwithDesc(self,descent):
        print("Original R:",self.r,end="--->")
        self.r[0] -= descent[0]
        self.r[1] -= descent[1]
        self.r[2] -= descent[2]
        R = sum(self.r)
        for i in range(3):
            self.r[i] /= R
        print(self.r)
    
    def cal_ppl(self,filePath = ""):
        f = open(filePath,'r',encoding='utf-8')
        unkidx = word_dic["<UNK>"]
        su = 0.0
        M = 0
        start_status = [len(self.word_dic)]*2
        end_idx = len(self.word_dic)+1
        for line in f.readlines():
            line = line.strip()    # remove \n
            words = line.split(" ")
            M += (len(words)+4)
            p = 0.0
            before = start_status
            for word in line.split(" "):
                word = word.lower()
                if word in self.word_dic:
                    nextstep = self.word_dic[word]
                else:
                    nextstep = unkidx
                
                prop = self.P(before[0],before[1],nextstep)
                p += math.log2(prop)

                self.__addpartical(before[0],before[1],nextstep)

                before[0],before[1] = before[1],nextstep
            
            # (word_n-1,word_n,end)
            p += math.log2(self.P(before[0],before[1],end_idx))
            self.__addpartical(before[0],before[1],end_idx)
            # (word_n,end,end)
            p += math.log2(self.P(before[1],end_idx,end_idx))
            self.__addpartical(before[1],end_idx,end_idx)
            su += p
        f.close()
        l = su/M
        ppl = math.pow(2,-l)
        print("the perplexity of Linear Interpolation on %s is:%f"%(filePath,ppl))

        descent = self.calDescent(ppl,M)
        print("descent gradient descent search result:", descent)
    
        return (ppl,descent)
    


In [37]:
def  gradientDescent(sentences,word_dic,word_count,r1,r2,r3,learning_rate,times):

    grad = LinearNGramWIthGrad(sentences,word_dic,word_count,r1,r2,r3,learning_rate)
    bestR = [r1,r2,r3]
    minppldev,_ = grad.cal_ppl("./data/lm/dev.txt")
    ppltrain ,_ = grad.cal_ppl("./data/lm/train.txt")
    mul_ppl = (minppldev+ppltrain)/2
    grad.partial = [0.0]*3
    bestT = 0
    for i in range(times):
        print("Turn ",i,)
        ppl,descent = grad.cal_ppl("./data/lm/train.txt")
        grad.adjustRwithDesc(descent)
        ppl_dev,_ = grad.cal_ppl("./data/lm/dev.txt")
        grad.partial = [0.0]*3
        if (ppl_dev+ppl)/2<mul_ppl:
            bestT = i
            bestR = grad.r[:]
            minppldev = ppl_dev
            ppltrain = ppl
            mul_ppl = (minppldev+ppltrain)/2
    print("-------------------------------------------------")
    print("Best Turn %d\n\rppl on train set %f\n\rppl on dev set %f\n\rhyper parameter is "%(bestT,ppltrain,minppldev),bestR)
    print("\n\nTest on test set")
    grad.changeR(bestR[0],bestR[1],bestR[2])
    ppl,_ = grad.cal_ppl("./data/lm/test.txt")
    print("ppl on test set is: %f" % ppl)

gradientDescent(sentences,word_dic,word_count,0.3,0.4,0.3,0.01,10)

the perplexity of Linear Interpolation on ./data/lm/dev.txt is:107.828938
descent gradient descent search result: [0.15601839878983037, -0.17813680738560103, 0.022118408595770973]
the perplexity of Linear Interpolation on ./data/lm/train.txt is:16.900118
descent gradient descent search result: [0.05309769626447645, 0.2630433320565012, -0.3161410283220823]
Turn  0
the perplexity of Linear Interpolation on ./data/lm/train.txt is:16.900118
descent gradient descent search result: [0.05309769626447645, 0.2630433320565012, -0.3161410283220823]
Original R: [0.3, 0.4, 0.3]--->[0.2469023037352508, 0.13695666794334752, 0.6161410283214016]
the perplexity of Linear Interpolation on ./data/lm/dev.txt is:129.821888
descent gradient descent search result: [0.9711665641411176, -1.7783583689473146, 0.8071918048059038]
Turn  1
the perplexity of Linear Interpolation on ./data/lm/train.txt is:11.866648
descent gradient descent search result: [0.036861793066130855, 0.08333531036534615, -0.12019710343122825

##### Discussion

Three are several ways to get the local optimum paramaters like parametric sweep, gradient descent search, or using $\lambda$ matrix. Here I would like to use gradient descent in my model.
Actually, the function to calculate the perplexity is:
$$ ppl = 2^{-l} $$
$$ l = {{{1} \over {M}}  \sum^{m}_{i=1} \sum^{n}_{j=0} \log_{2}{p^{*}(w_j|w_{j-1},w_{j-2})}} $$
$$ p^{*}(w_j|w_{j-1},w_{j-2}) = \lambda_1P(w_j)+\lambda_2P(w_j|w_{j-1})+ \lambda_3P(w_j|w_{j-1},w_{j-2})$$ 
$$ \lambda_1+\lambda_2+\lambda_3=1 $$
So, we can get the partial derivative of perplexity about $\lambda_1,\lambda_2,\lambda_3$ like this:
$$ {{\partial ppl} \over {\partial \lambda_1}} = -{{1}\over M}*ppl*\log 2*\sum^{m}_{i=1} \sum^{n}_{j=0} {{P(w_j)-P(w_j|w_{j-1})} \over {p^{*}(w_j|w_{j-1},w_{j-2})*\log 2}} $$
$$ {{\partial ppl} \over {\partial \lambda_2}} = -{{1}\over M}*ppl*\log 2*\sum^{m}_{i=1} \sum^{n}_{j=0} {{P(w_j|w_{j-1})-P(w_j|w_{j-1},w_{j-2})} \over {p^{*}(w_j|w_{j-1},w_{j-2})*\log 2}} $$
$$ {{\partial ppl} \over {\partial \lambda_3}} = -{{1}\over M}*ppl*\log 2*\sum^{m}_{i=1} \sum^{n}_{j=0} {{P(w_j|w_{j-1},w_{j-2})-P(w_j)} \over {p^{*}(w_j|w_{j-1},w_{j-2})*\log 2}} $$
And it is easy to know that ${{\partial ppl} \over {\partial \lambda_1}}+{{\partial ppl} \over {\partial \lambda_2}}+{{\partial ppl} \over {\partial \lambda_3}}=0$, which can make sure that the adjustment would not break the normalization. The adjust functional is listed below:
$$ \lambda_1^{,} = \lambda_1-LearningRate*{{\partial ppl} \over {\partial \lambda_1}} $$
$$ \lambda_2^{,} = \lambda_2-LearningRate*{{\partial ppl} \over {\partial \lambda_2}} $$
$$ \lambda_3^{,} = \lambda_3-LearningRate*{{\partial ppl} \over {\partial \lambda_3}} $$

During the examination of the gradient descent process, the perplexity on training set decreases in every turn and the perplexity on the test set have a different trend. So at last I choose a group of superparameters that has the lowest perplexity on test set. The hyperparameters I choose are $[0.2589453005798403, 0.23756411382677395, 0.5034905855933857]$.
<!!!!>

## 2 Preposition Prediction

In [None]:
!wget -O dev.in https://raw.githubusercontent.com/ranpox/comp7607-fall2022/main/assignments/A1/data/prep/dev.in
!wget -O dev.out https://raw.githubusercontent.com/ranpox/comp7607-fall2022/main/assignments/A1/data/prep/dev.out

In [5]:
import random


def predict(model :LinearNGramWIthGrad,FilePath:string):
    word_predict = ["at","in","of","on","for"]
    f = open(FilePath,"r",encoding="utf-8")
    startpad,endpad = len(model.word_dic),len(model.word_dic)+1
    ret = []
    for line in f.readlines():
        line = line.strip()
        words = line.split(" ")
        
        line_pre = []
        for (idx,w) in enumerate(words):
            if w != "<PREP>":
                continue
            w = w.lower()
            before1,before2 = 0,0
            if idx == 0:
                before1,before2 = startpad,startpad
            elif idx ==1:
                if words[idx-1].lower() in model.word_dic:
                    before1,before2 = startpad,model.word_dic[words[idx-1].lower()]
                else:
                    before1,before2 = startpad,model.word_dic["<UNK>"]
            else:
                if words[idx-2].lower() in model.word_dic:
                    before1 = model.word_dic[words[idx-2].lower()]
                else:
                    before1 = model.word_dic["<UNK>"]
                if words[idx-1].lower() in model.word_dic:
                    before2 = model.word_dic[words[idx-1].lower()]
                else:
                    before2 = model.word_dic["<UNK>"]
            predic = ""
            # choose max P
            maxp = float('-inf')
            for wp in word_predict:
                if maxp < model.P(before1,before2,model.word_dic[wp]):
                    # print(before1,before2,wp)
                    maxp = model.P(before1,before2,model.word_dic[wp])
                    predic = wp

            # random choose
            # weight = [model.P(before1,before2,model.word_dic[wp]) for wp in word_predict]
            # predic = random.choices(word_predict,weight,k=1)[0]
            
            line_pre.append(predic)
        ret.append(line_pre[:])
    f.close()
    return ret

def cal_correct(model :LinearNGramWIthGrad,FilePath:string, OutPath:string):
    preResult = predict(model,FilePath)
    f = open(OutPath,'r',encoding = 'utf-8')
    i = 0
    total_num = 0
    correct_num = 0
    for (idx,line) in enumerate(f.readlines()):
        line = line.strip()
        words = line.split(" ")
        if len(words)!=len(preResult[idx]):
            raise Exception("Invalaid input: number of input incorrect")
        for i in range(len(words)):
            total_num += 1
            if words[i] == preResult[idx][i]:
                correct_num +=1
    print("correct rate:",correct_num/total_num)
    f.close()
    return correct_num/total_num

def saveFile(preResult,outPath:string):
    f = open(outPath,'w+',encoding = 'utf-8')
    for res in preResult:
        f.write(" ".join(res)+"\n")
    f.close()

def  gradientDescentforPredict(sentences,word_dic,word_count,r1,r2,r3,k2,k3,learning_rate,times):

    grad = LinearNGramWIthGrad(sentences,word_dic,word_count,r1,r2,r3,learning_rate=learning_rate)
    bestR = [r1,r2,r3]
    InPath = "./data/prep/dev.in"
    OutData = "./data/prep/dev.out"
    corr = cal_correct(grad,InPath,OutData)
    bestT = 0
    for i in range(times):
        print("Turn ",i,)
        ppl,descent = grad.cal_ppl("./data/lm/train.txt")
        grad.adjustRwithDesc(descent)
        
        co = cal_correct(grad,InPath,OutData)
        if co>corr:
            bestT = i
            bestR = grad.r[:]
            corr = co
    print("-------------------------------------------------")
    print("Best Turn %d\n\rhyper parameter is "%(bestT),bestR)
    print("correct_rate",corr)

In [11]:
gradientDescentforPredict(sentences,word_dic,word_count,r1=0.3,r2=0.4,r3=0.3,k2=0.01,k3=0.0001,learning_rate=0.005,times = 15)

correct rate: 0.7072284780239738
Turn  0
the perplexity of Linear Interpolation on ./data/lm/train.txt is:16.900118
descent gradient descent search result: [0.026548848132238224, 0.1315216660282506, -0.15807051416104115]
Original R: [0.3, 0.4, 0.3]--->[0.2734511518676107, 0.2684783339716011, 0.4580705141607881]
correct rate: 0.7104976389393389
Turn  1
the perplexity of Linear Interpolation on ./data/lm/train.txt is:13.716543
descent gradient descent search result: [0.019583908319163765, 0.071345477856925, -0.09092938617597808]
Original R: [0.2734511518676107, 0.2684783339716011, 0.4580705141607881]--->[0.25386724354847506, 0.19713285611469794, 0.548999900336827]
correct rate: 0.7101343988376316
Turn  2
the perplexity of Linear Interpolation on ./data/lm/train.txt is:12.524823
descent gradient descent search result: [0.017756487807274807, 0.052780434024197374, -0.07053692183135526]
Original R: [0.25386724354847506, 0.19713285611469794, 0.548999900336827]--->[0.23611075574122783, 0.14435

In [8]:
# use 0.2734511518676107, 0.2684783339716011, 0.4580705141607881
model = LinearNGramWIthGrad(sentences,word_dic,word_count,0.2734511518676107, 0.2684783339716011, 0.4580705141607881)
result = predict(model,"./data/prep/test.in")
saveFile(result,"./test.out")
cal_correct(model,"./data/prep/test.in","C:/Users/84200/Documents/WeChat Files/wxid_lt32y9km4l7s21/FileStorage/File/2022-10/test_pred.out")

correct rate: 0.9569701280227596


0.9569701280227596