# 01.112 Machine Learning Design Project

## About the Project

We have 4 datasets in the `/data` folder. For each dataset, there is: 
- a labelled training set train, 
- an unlabelled development set `dev.in`
- a labelled development set `dev.out` 

The labelled data has the format of: `token` `\t` `tag`
- one token per line
- token and tag separated by tab 
- single empty lines that separates sentences

For the labels, they are slightly different for different datasets.
- SG, CN (Entity):
    - B-*: Beginning of entity
    - I-*: Inside of entity
    - O: Outside of any entity
- EN, AL (Phrase):
    - B-VP: Beginning of Verb Phrase
    - I-VP: Inside of Verb Phrase
    - *-NP: Noun Phrase
    - *PP: Propositional Phrase
    - O: Outside of any phrase

*Goal*: Build sequence labelling systems from training data (x) and use it to predict tag sequences for new sentences (y).

## Team members 
- Andri Setiawan Susanto
- Eldon Lim 
- Tey Siew Wen

## Part 1
Already completed individually.

## Part 2

a) Write a function that estimates the emission parameters from the training set using MLE (maximum likelihood estimation):

b)

1. Make a modified training set by replacing those words that appear $<k$ times in the training set with a special word token `#UNK#` before training.
2. During testing phase, ifaworddoesnot appear in the modified training set, we also replace that wordwith `#UNK#`.
3. Compute Emission Paramters with the function in (a)

For all the four datasets EN, AL, CN, and SG, learn these parameters with `train`, and evaluate your
system on the development set `dev.in` for each of the dataset. Write your output to `dev.p2.out`
for the four datasets respectively. Compare your outputs and the gold-standard outputs in `dev.out`
and report the precision, recall and F scores of such a baseline system for each dataset.

In [6]:
import numpy as np
import pandas as pd
from collections import defaultdict, Counter

def emissionPara(arr, k, replaceWord):
    x_counter = defaultdict(int)
    y_counter = defaultdict(int)
    xy_counter = defaultdict(int)
    x_labels = defaultdict(list)
    emission_params = {}
    xy_dict = {}

    for x_y in arr:
        x, y = x_y[0].split(" ")
        x_counter[x] += 1
        y_counter[y] += 1
        xy_counter[x,y] += 1
        if y not in x_labels[x]:
            x_labels[x].append(y)

    x_to_remove = [x for x in x_counter if x_counter[x] < k]

    for r in x_to_remove:
        count = x_counter[r]
        for label in x_labels[r]:
            x_labels[replaceWord] += x_labels[r]
            xy_counter[replaceWord, label] += count
        del x_labels[r]
        del x_counter[r]

    for x_y, x_y_count in xy_counter.items():
        y = x_y[1]
        emission_params[x_y] = x_y_count / y_counter[y]

    # get best labels
    for x, labels in x_labels.items():
        emission_probs = np.zeros(len(labels))
        for i in range(len(labels)):
            label = labels[i]
            if (x,label) in xy_counter:
                emission_probs[i] = xy_counter[x,label]
        xy_dict[x] = labels[np.argmax(emission_probs)]
    
    return emission_params, x_counter.keys(), xy_dict

def predict(data, xy_dict, replaceWord):
    print("Predicting labels")
    start = time.process_time()
    result = pd.DataFrame()
    
    def replace_string(x):
        if x not in xy_dict:
            return "{} {}".format(replaceWord, xy_dict[replaceWord])
        else:
            return "{} {}".format(x, xy_dict[x])
         
    return data["x"].apply(lambda s: replace_string(s) if str(s) != "nan" else " ")

In [39]:
for k,v in emission_dict.items():
    if "#UNK#" in k:
        print(k,v)

('#UNK#', 'B-NP') 0.09402811542120283
('#UNK#', 'I-NP') 0.1504094081441996
('#UNK#', 'I-ADJP') 0.2961672473867596
('#UNK#', 'B-ADVP') 0.0726507713884993
('#UNK#', 'B-VP') 0.08871365204534254
('#UNK#', 'I-VP') 0.12324047642484497
('#UNK#', 'B-ADJP') 0.1901770416904626
('#UNK#', 'O') 0.0030160857908847183
('#UNK#', 'B-PP') 0.0028280850600968075
('#UNK#', 'I-ADVP') 0.09917355371900827
('#UNK#', 'B-INTJ') 0.4230769230769231
('#UNK#', 'I-UCP') 0.5
('#UNK#', 'B-SBAR') 0.001579778830963665
('#UNK#', 'I-INTJ') 0.5714285714285714
('#UNK#', 'B-LST') 0.18181818181818182


In [94]:
from csv import QUOTE_NONE
import time

k = 3
replaceWord = "#UNK#"
data_folders = ["AL", "EN","CN","SG"]
for x in ["SG"]:
    print("Performing sentiment analysis for data folder ", x)
    train_data = "./data/{}/train".format(x)
    test_data = "./data/{}/dev.in".format(x)
    test_result = "./data/{}/dev.out".format(x)
    
    train_data = pd.read_csv(train_data, sep='\r\n', names=['x_y'],index_col=False, engine="python", encoding="UTF-8").to_numpy()
    emission_dict, valid_x, xy_dict = emissionPara(train_data, k, replaceWord)

    test_data = pd.read_csv(test_data, sep='\r\n', names=['x'],index_col=False,skip_blank_lines=False, engine="python", encoding="UTF-8")
    testdf = predict(test_data, xy_dict, replaceWord)
    print(testdf.head(3))
    
    print("Writing the final result to dev.p2.out...")
    testdf.to_csv('./output/{}/dev.p2.out'.format(x), header=False, index=False, na_rep="", sep="\n", quoting=QUOTE_NONE)

Performing sentiment analysis for data folder  SG
Predicting labels
0                Tour O
1    Scotland B-neutral
2           followers O
Name: x, dtype: object
Writing the final result to dev.p2.out...


## Part 3

Write a function that estimates the transition parameters from the training set using MLE (maximum likelihood estimation):

In [41]:
def split_into_columns(df_column):
    new = df_column.str.split(" ", n=1, expand=True)
    return new[0], new[1]

In [53]:
from collections import Counter, defaultdict

def transitionPara(data):
    train_data_blank=pd.read_csv(data, sep='/n', delimiter=None, names=['original'],index_col=False, engine="python", skip_blank_lines=False)
    x, y = split_into_columns(train_data_blank["original"])
    xy_dic = dict(zip(x, y))
    
    # Get bottom count (Count(yi))
    y_count = Counter(y)
    del y_count[np.nan]
    # Get top count (Count(yi-1, yi))
    subseq_count = defaultdict(int)
    for i in range(len(y)-1):    
        y1 = y[i]
        y2 = y[i+1]
        
        if i == 0:
            subseq_count[("START", y1)] +=1
            y_count["START"] +=1
        if str(y1) == "nan":
            subseq_count[("START", y2)] +=1
            y_count["START"] +=1
        elif i == len(y)-1 or str(y2) == "nan":
            subseq_count[(y1, "END")] +=1
            y_count["END"] +=1
        else:
            subseq_count[y1,y2] += 1
    
    # Calculation of transition params
    transition_dict = {}
    
    for k,v in subseq_count.items():
        y1 = k[0]
        y2 = k[1]
        transition_dict[y1,y2] = subseq_count[y1,y2] / y_count[y1]
        
    del y_count["START"]
    del y_count["END"]
    del y_count[None]
     
    return transition_dict, subseq_count, y_count

In [54]:
transition_dict, subseq_count, y_count = transitionPara("./data/SG/train")
y_count.keys()

dict_keys(['O', 'B-positive', 'I-positive', 'B-neutral', 'I-neutral', 'B-negative', 'I-negative'])

# Viterbi

In [95]:
def viterbi(unique_word_list):
    #This is for the starting for viterbi
    global nodes
    
    num_nodes_per_col = len(nodes)
    store=np.zeros(num_nodes_per_col)   #store = the storage for scores for all the nodes. 
    scorelist=np.zeros((num_nodes_per_col, len(unique_word_list) + 1))
    
    for i in range(num_nodes_per_col):
#         print("Calculating for first column")
#         print(f"Node {i}, {nodes[i]}, {unique_word_list[0]}")
        emission_score = emission(nodes[i],unique_word_list[0])
        transition_score = transition("START",nodes[i])
        if transition_score == 0 or emission_score == 0:
            store[i] = np.NINF
        else:
            store[i] = np.log(emission_score)+np.log(transition_score)
#             print(store[i])
    scorelist[:,0] = store
    store = np.zeros(num_nodes_per_col)
    score_per_node=np.zeros(num_nodes_per_col)
    
    #This is for the middle portion for viterbi
    #score per node = prevnode*emission*transition

    if len(unique_word_list)>1:
        for j in range(len(unique_word_list)-1): #for the whole length in sentence
            for k in range(num_nodes_per_col): #for each node
                for l in range(num_nodes_per_col): #for 1 node, transition from prev node to current node
                    prev_node = scorelist[l][j]
                    emission_score = emission(nodes[k],unique_word_list[j+1])
                    transition_score = transition(nodes[l],nodes[k])
                    
                    if transition_score == 0 or emission_score == 0:
                        score_per_node[l] = np.NINF
                    else:   
                        score_per_node[l] = prev_node+np.log(emission_score)+np.log(transition_score) 
                
                store[k] = np.max(score_per_node) # max path
                score_per_node=np.zeros(num_nodes_per_col)
            
            scorelist[:,j+1] = store
            store = np.zeros(num_nodes_per_col)
                      
        score_at_stop=np.zeros(num_nodes_per_col)
        
        #This is for the STOP for viterbi
        for m in range(num_nodes_per_col):
            score_at_stop[m] = np.log(transition(nodes[m],"END")) + scorelist[m][len(unique_word_list)-1]
        scorelist[:,-1] = np.full(num_nodes_per_col,np.max(score_at_stop))
        
    return scorelist
  
def viterbi_backtrack(scorelist):
    """
    back tracking for viterbi
    node value*transition = array, then find max, then find position. use position for next step.
    np.argmax returns index of max in the element.
    The final score on the score list is for end
    """ 
    global nodes
    
    scorelist = np.flip(scorelist,axis=1) #reverse the score list so easier to calculate.
    
    max_node_index = 0 
    num_obs = scorelist.shape[1]
    num_nodes = scorelist.shape[0]
    node_holder = np.zeros(num_nodes)
    path = []

    if (num_obs == 1):
        for k in range (num_nodes):
            calculate_max_node = scorelist[0][k] + np.log(transition(nodes[k],"END"))
            node_holder[i] = calculate_max_node
        path.append(nodes[np.argmax(node_holder)])
        return(path[::-1])

    for i in range (1,num_obs): # for length of sentence
        for j in range(num_nodes): #for each node
            if (i==1):
                calculate_max_node = scorelist[j][i] + np.log(transition(nodes[j],"END"))
                node_holder[j] = calculate_max_node
            else:
                calculate_max_node = scorelist[j][i] + np.log(transition(nodes[j],nodes[max_node_index]))
                node_holder[j] = calculate_max_node
        
        max_node_index=np.argmax(node_holder)
        path.append(nodes[np.argmax(node_holder)])
        node_holder=np.zeros(num_nodes)

    return(path[::-1])

def emission(node,word):
    global emission_dict
    global nodes
    pair = word,node
    detector = 0 # this is used to find if word exist in the dictionary
    if pair not in emission_dict.keys(): #if the combination cannot be found in the dictionary
                                         #Either the word exists, or word is new. 
        for o in nodes:
            missing_pair = word,o
            if missing_pair in emission_dict.keys(): #
                detector = 1 # to detect if word exist in dictionary.
                break
        if detector == 1:
            score=0   #this means that this node is not the correct node.
        else:
            replaced_text = "#UNK#",node
            if replaced_text in emission_dict.keys():
                score = emission_dict[replaced_text] #if label have #unk#
                
            else:
                score = 0   #if label does not have #unk#, then set to 0.
    else:
        score = emission_dict[pair]
    return score

def transition(x1,x2):
    global transition_dic
    #will use this to search the transition from x1 to x2
    pair = x1,x2
    if pair not in transition_dic.keys():
        score = 0
    else:
        score = transition_dic[x1,x2]
    return score

In [96]:
# FOR TESTING ONLY.

print("Performing Viterbi")
start = time.time()
for i in range(1):
    viterbioutput=viterbi(lines[i])
    log_array.append(viterbioutput)
    sequence_log.append(viterbi_backtrack(viterbioutput))
end = time.time()

print("Time taken for Viterbi and Backtrack", end - start)

Performing Viterbi
Time taken for Viterbi and Backtrack 0.4466869831085205




In [44]:
def preprocess_training_blank_row(data):
    start = time.process_time()   
    
    df= pd.read_csv(data, sep='/n', delimiter=None, names=['original'],index_col=False,engine="python",skip_blank_lines=False)
    # dropping null value columns to avoid errors 
    
    # new data frame with split value columns 
    df["x"], df["y"] = split_into_columns(df["original"])
    return df

In [45]:
def sentenceList(data):
    lines=[]
    line=[]
    x= data
    for label in x['x']:
        if pd.isnull(label)==False:
            line.append(label)
        else:
            lines.append(line)
            line = []
    return lines

In [46]:
def finalresult(sequence_log,predata_blank):
    dataframe = []
    count=0
    for i in range(len(sequence_log)):
        for text in sequence_log[i]:
            dataframe.append(text)
            count+=1
        dataframe.append("")
    dftest=pd.DataFrame(dataframe)
    final = pd.DataFrame()
    final['result'] = predata_blank['x'] + " " +dftest[0]
    return final

In [98]:
import time

data_folders = ["AL", "EN","CN","SG"]
for x in ["SG"]:
    print("Performing sentiment analysis for data folder ", x)
    train_data = "./data/{}/train".format(x)
    test_data = "./data/{}/dev.in".format(x)
    
    transition_dic, subseq_count, y_count = transitionPara(train_data)
    predata_blank=preprocess_training_blank_row(train_data)
    nodes = list(y_count.keys())
    testdf_unprocess = pd.read_csv(test_data, sep='/n', delimiter=None, names=['x'],index_col=False,skip_blank_lines=False, engine="python")
    lines= sentenceList(testdf_unprocess)
    
    log_array =[]
    sequence_log=[]
    
    print("Performing Viterbi")
    start = time.time()
    for i in range(len(lines)):
        viterbioutput=viterbi(lines[i])
        log_array.append(viterbioutput)
        sequence_log.append(viterbi_backtrack(viterbioutput))
    end = time.time()
    
    print("Time taken for Viterbi and Backtrack", end - start)
    
    result = finalresult(sequence_log,testdf_unprocess)
    
    print("Writing the final result to dev.p3.out...")
    f = open('./output/{}/dev.p3.out'.format(x) ,'w')
    for word in result['result']:
        if pd.isnull(word) == False:
            f.write(word + '\n')
        else:
            f.write("" +"\n")
    f.close()
    print("Finished Writing.")

Performing sentiment analysis for data folder  SG
Performing Viterbi


NameError: name 'curr_emission' is not defined

In [68]:
# for debugging
for key,val in emission_dict.items():
    if key[0].startswith('Tour') :
        print(key,val)

('Tour', 'O') 3.9368669696861246e-05
('Tourism', 'O') 3.5789699724419313e-06
('Tour', 'I-neutral') 0.00010585931297305881
('Touring', 'O') 7.1579399448838625e-06
('Tour', 'B-positive') 0.00013412017167381974
("Tour's", 'O') 3.5789699724419313e-06
