# Assignment 1
You should submit the **UniversityNumber.ipynb** file and your final prediction file **UniversityNumber.test.out** to Moodle. Make sure your code does not use your local files and that the results are reproducible. Before submitting, please **run your notebook and keep all running logs** so that we can check them.

## 1 $n$-gram Language Model

In [None]:
!wget -O train.txt https://raw.githubusercontent.com/ranpox/comp7607-fall2022/main/assignments/A1/data/lm/train.txt
!wget -O dev.txt https://raw.githubusercontent.com/ranpox/comp7607-fall2022/main/assignments/A1/data/lm/dev.txt
!wget -O test.txt https://raw.githubusercontent.com/ranpox/comp7607-fall2022/main/assignments/A1/data/lm/test.txt

### 1.1 Building vocabulary

#### Code

You will download and preprocess the tokenized training data to build the vocabulary. To handle out-of-vocabulary (OOV) words, you will convert tokens that occur fewer than three times in the training data into a special unknown token 〈UNK〉. You should also add start-of-sentence 〈s〉 and end-of-sentence 〈/s〉 tokens.

In [1]:
import numpy as np
from collections import defaultdict

# build the vocabulary set from the training set
f = open("./data/lm/train.txt",'r',encoding="utf-8")
lines = []
s = set(["<s>","</s>","<UNK>"])
sentences = []
for line in f.readlines()[:-1]:
    # some lines are still end with \n, need to remove \n
    if line.endswith("\n"):
        line = line[:-1]
    tmp = line.split(' ')
    sentences.append(tmp)
    for word in tmp:
        s.add(word)
f.close()

word_list = sorted(list(s))

word_count = defaultdict(int)


# count every word
for sen in sentences:
    for word in sen:
        word_count[word]+=1

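# build the final counts: words seen fewer than 3 times are merged into <UNK>
word_count_dic = {"<UNK>":0}
for key in word_list:
    value = word_count[key]
    if value<3:
        # add the token count (not 1 per word type) so <UNK> holds its true frequency
        word_count_dic["<UNK>"]+=value
    else:
        word_count_dic[key]=value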

word_list = sorted(word_count_dic.keys())
word_dic = {key:idx for (idx,key) in enumerate(word_list)}
tmp = []
for key in word_dic:
    tmp.append(word_count_dic[key])
word_count = np.array(tmp)

# change sentence to id list

tmp_sentences = []
for sentence in sentences:
    tmp = []
    for word in sentence:
        if word in word_dic:
            tmp.append(word_dic[word])
        else:
            tmp.append(word_dic["<UNK>"])
    tmp_sentences.append(tmp[:])
sentences = tmp_sentences

#vocabulary size
print("vocabulary size: %d"%len(word_list))

del word_list
del word_count_dic
del tmp


# word_dic : the map between words and id, 
# word_count : the number of each word organized with id
# sentences : sentences in training data

vocabulary size: 22629


#### Discussion

Please show the vocabulary size and discuss the number of parameters of n-gram models.

The vocabulary size is 22,629, not counting the `<s>` and `</s>` tokens. The number of parameters of an n-gram model grows exponentially with n. For n = 1, the model has 22,629 parameters, one per vocabulary entry. For n = 2, it has 22,629² = 512,071,641 parameters, and for n = 3 it has 22,629³ = 11,587,669,164,189. Storing each trigram probability as an 8-byte float in a dense table would take about 86,335 GiB of memory. So the larger n is, the more memory the probability table needs.
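
The arithmetic above can be reproduced with a short sketch. The helper name `ngram_table_size` and the default of 8 bytes per entry are assumptions for illustration; with that entry size, the result matches the memory figure quoted for n = 3.

```python
def ngram_table_size(vocab_size: int, n: int, bytes_per_entry: int = 8):
    """Return (entries, GiB) for a dense n-gram probability table.

    Hypothetical helper: assumes one stored value per possible n-gram.
    """
    entries = vocab_size ** n
    gib = entries * bytes_per_entry / 2 ** 30
    return entries, gib


for n in (1, 2, 3):
    entries, gib = ngram_table_size(22629, n)
    print(f"n={n}: {entries:,} entries, {gib:,.2f} GiB")
```

A sparse representation (for example, a dictionary holding only the observed n-grams) would need far less memory, because most of these entries are zero.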

### 1.2 $n$-gram Language Modeling

After preparing your vocabulary, you are expected to build bigram and unigram language models and report their perplexity on the training set, and dev set. Please discuss your experimental results. If you encounter any problems, please analyze them and explain why.
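
For reference, the quantities this section asks for are usually defined as follows. This is a sketch in standard notation, not taken from the assignment text: $c(\cdot)$ denotes a count in the training data, $N$ the number of training tokens, and $M$ the number of predicted tokens in the evaluated text (including 〈/s〉). These symbols are assumptions introduced here.

$$
P(w_i) = \frac{c(w_i)}{N}, \qquad P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}
$$

$$
\mathrm{PPL} = 2^{-\frac{1}{M} \sum_{i=1}^{M} \log_2 P(w_i \mid w_{i-1})}
$$

For the unigram model, the conditional probability is replaced by $P(w_i)$. Because the exponent contains $\log_2 P$, a single zero-probability n-gram makes the perplexity infinite.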

#### Code

In [2]:
from nltk.util import bigrams
from collections import Counter
import math

In [3]:
# unigram language model
# for n=1, I only need each word's count, from which the probability array follows directly
# the probability matrix would look like this:
#           [P_0, P_1, P_2, P_3, ...... , P_n, P_start, P_end]
#           P_start, P_end are the probability of start padding and end padding

class UnigramModel:
    def __init__(self, sentences,word_dic,word_count,file = ""):
        if file!="":
            self.uniProb = np.load(file)
        else:
            self.uniProb = self.__cal_prob(sentences,word_dic,word_count)
        
    # calculate the probability array
    def __cal_prob(self,sentences,word_dic,word_count):
        startpad, endpad = len(sentences),len(sentences)
        count = startpad+endpad
        for nu in word_count:
            count += nu

        uniProb = [0.0] * (len(word_dic)+2)
        for i in range(len(word_dic)):
            uniProb[i] = word_count[i] / count
        uniProb[-2] = startpad / count
        uniProb[-1] = endpad / count
        return np.array(uniProb)

In [4]:
uni = UnigramModel(sentences,word_dic,word_count)
np.save("./unigram.npy",uni.uniProb)

In [4]:
uni = UnigramModel(sentences,word_dic,word_count,file="./unigram.npy")

In [6]:
# bigram language model
# the probability matrix would look like this:
#    first word(i) /second word(j)
#           [w_0, w_1, w_2, w_3, ...... , w_n, startpad, endpad]
#     [w_0]
#     ......
#     [startpad]
#     [endpad] this row would be meaningless, so it is omitted
#                       matrix[i][j] = P(j|i)

class BigramModel:
    def __init__(self,sentences,word_dic,word_count,file = ""):
        if file != "":
            self.biProb = np.load(file)
        else:
            self.uniProb = UnigramModel(sentences,word_dic,word_count).uniProb
            self.biProb = self.__cal_prob(sentences,word_dic)
    
    # calculate the probability matrix
    def __cal_prob(self,sentences,word_dic):
        count_matrix = np.zeros((len(word_dic)+1,len(word_dic)+2),dtype=float)
        startpad,endpad = len(word_dic),len(word_dic)+1
        count = 0
        for sen in sentences:
            for (before,after) in bigrams(sen,pad_left=True, pad_right=True, left_pad_symbol=startpad, right_pad_symbol=endpad):
                count_matrix[before][after] += 1
                count += 1
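        # convert counts to conditional probabilities P(j|i) by dividing each row
        # by its total; vectorized, so it avoids the slow element-wise Python loop
        row_sums = count_matrix.sum(axis=1, keepdims=True)
        row_sums[row_sums == 0] = 1  # leave all-zero rows (unseen contexts) as zeros
        count_matrix /= row_sums
        return count_matrix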



In [6]:
# calculate the model and save it to disk
bi = BigramModel(sentences,word_dic,word_count)    
#print(list(bi.probability))
np.save("./bigram.npy",bi.biProb)

In [7]:
# read model from file
bi = BigramModel(sentences,word_dic,word_count,"./bigram.npy")

In [8]:
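# calculate perplexity of the bigram model
# NOTE: an unseen bigram has probability 0, so math.log2 raises
# "ValueError: math domain error" without smoothing (see Section 1.3)
f = open("./data/lm/test.txt","r",encoding="utf-8")
su = 0.0   # sum of log2 probabilities over all predicted tokens
unkidx = word_dic["<UNK>"]
M = 0      # number of predicted tokens, including </s>
for line in f.readlines():
    # cur state at <s>, which is the last row (index -1)
    cur = -1
    # split the line into words; iterating over the string would yield characters
    for word in line.strip().split(' '):
        nextstep = word_dic.get(word, unkidx)
        # sum log probabilities instead of multiplying, to avoid underflow
        su += math.log2(bi.biProb[cur][nextstep])
        cur = nextstep
        M += 1
    su += math.log2(bi.biProb[cur][-1]) # log probability of </s>
    M += 1
f.close()
l = su/M
ppl = math.pow(2,-l)
print("the perplexity of this bi-gram model is :%f"%ppl)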
            


ValueError: math domain error

#### Discussion
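
`math.log2` raises `ValueError: math domain error` when its argument is 0. In the perplexity cell this happens whenever a sentence contains a bigram that never occurred in the training data, and it can also happen when a product of many small probabilities underflows to 0. The sketch below shows one way to compute perplexity in log space and to report zero-probability events as infinite perplexity instead of crashing. The function name `perplexity` and its list-of-probabilities interface are assumptions for illustration, not part of the assignment code.

```python
import math


def perplexity(token_probs):
    """Perplexity of a text given the model probability of each predicted token.

    Hypothetical helper: returns math.inf if any token has probability 0,
    which is what an unsmoothed model assigns to unseen n-grams.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0.0 for p in token_probs):
        return math.inf
    # Summing logs avoids the underflow that multiplying probabilities causes.
    log_sum = sum(math.log2(p) for p in token_probs)
    return 2.0 ** (-log_sum / len(token_probs))


print(perplexity([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 outcomes
print(perplexity([0.5, 0.0, 0.5]))           # one unseen event
print(perplexity([1e-5] * 200))              # direct product would underflow to 0
```

Smoothing, as in Section 1.3, removes the zero probabilities so that the perplexity on unseen text becomes finite.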

### 1.3 Smoothing

#### 1.3.1 Add-one (Laplace) smoothing

##### Code

##### Discussion

##### Optional: Add-k smoothing

###### Code

###### Discussion

#### 1.3.2 Linear Interpolation

##### Code

##### Discussion

##### Optional: Optimization

###### Discussion

## 2 Preposition Prediction

In [None]:
!wget -O dev.in https://raw.githubusercontent.com/ranpox/comp7607-fall2022/main/assignments/A1/data/prep/dev.in
!wget -O dev.out https://raw.githubusercontent.com/ranpox/comp7607-fall2022/main/assignments/A1/data/prep/dev.out