# 01.112 Machine Learning Design Project

## About the Project

The `/data` folder contains 4 datasets (EN, AL, CN and SG). Each dataset has:
- a labelled training set `train`,
- an unlabelled development set `dev.in`,
- a labelled development set `dev.out`.

The labelled data has the format `token` `\t` `tag`:
- one token per line
- token and tag separated by a tab
- a single empty line separates sentences
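
The format above can be read into sentences with a few lines of Python. This is a minimal sketch, not the loader used later in this notebook: the name `read_labelled` is hypothetical, and splitting on the last run of whitespace is an assumption made so that the same code accepts both tab-separated and space-separated files.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]

def read_labelled(text: str) -> List[Sentence]:
    """Split labelled text into sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        # The tag is always the last whitespace-separated field (assumption)
        token, tag = line.rsplit(None, 1)
        current.append((token, tag))
    if current:
        # The file may not end with a blank line
        sentences.append(current)
    return sentences

sample = "HBO\tB-NP\nhas\tB-VP\n\nclose\tB-ADJP\n"
print(read_labelled(sample))
```

Keeping sentences as separate lists makes it easy to add the START and END transitions needed in Part 3.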

The labels differ slightly between datasets:
- SG, CN (Entity):
    - B-*: Beginning of entity
    - I-*: Inside of entity
    - O: Outside of any entity
- EN, AL (Phrase):
    - B-VP: Beginning of Verb Phrase
    - I-VP: Inside of Verb Phrase
    - *-NP: Noun Phrase
    - *-PP: Prepositional Phrase
    - O: Outside of any phrase
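
To make the B/I/O scheme concrete, the sketch below groups a tag sequence into labelled spans. The function name `extract_chunks` and the span convention (start inclusive, end exclusive) are illustrative assumptions, as is the choice to treat a stray `I-*` tag with no preceding `B-*` as the start of a new chunk.

```python
def extract_chunks(tags):
    """Return a set of (start, end, type) spans from a BIO tag sequence."""
    chunks = set()
    start, kind = None, None
    # A trailing "O" sentinel flushes a chunk that runs to the end
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and kind is not None and label == kind
        if not continues:
            if kind is not None:
                # Close the chunk that was open up to position i
                chunks.add((start, i, kind))
                start, kind = None, None
            if prefix in ("B", "I"):
                # Assumption: a stray I- tag also opens a chunk
                start, kind = i, label
    return chunks

print(sorted(extract_chunks(["B-NP", "I-NP", "B-VP", "O", "B-NP"])))
```

Spans like these are a natural unit for comparing predictions against `dev.out`, since a chunk only counts as correct when both its boundaries and its type match.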

*Goal*: Build sequence labelling systems from the training data and use them to predict tag sequences (y) for new sentences (x).

## Team members 
- Andri Setiawan Susanto
- Eldon Lim 
- Tey Siew Wen

## Part 1
Already completed individually.

## Part 2

a) Write a function that estimates the emission parameters from the training set using MLE (maximum likelihood estimation).
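
The MLE estimate of an emission parameter is the relative frequency e(x | y) = Count(y → x) / Count(y). As a small worked illustration on toy data (the helper name `emission_mle` and the example pairs are made up for this sketch and are not part of the project code):

```python
from collections import Counter

def emission_mle(pairs):
    """Estimate e(x | y) = Count(y -> x) / Count(y) from (token, tag) pairs."""
    pair_counts = Counter(pairs)
    tag_counts = Counter(tag for _, tag in pairs)
    return {(x, y): c / tag_counts[y] for (x, y), c in pair_counts.items()}

# Toy data: "the" is emitted by B-NP in 2 of the 3 B-NP occurrences
pairs = [("the", "B-NP"), ("cat", "I-NP"), ("the", "B-NP"),
         ("dog", "I-NP"), ("a", "B-NP")]
params = emission_mle(pairs)
print(params[("the", "B-NP")])
```

For each tag, the estimates over all words emitted by that tag sum to 1, which is a quick sanity check on any implementation.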

b)

1. Make a modified training set by replacing words that appear fewer than $k$ times in the training set with a special word token `#UNK#` before training.
2. During the testing phase, if a word does not appear in the modified training set, we also replace that word with `#UNK#`.
3. Compute the emission parameters with the function in (a).
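
The first two steps can be sketched on a toy token list as follows. The helper names `replace_rare` and `replace_unseen` are hypothetical; the notebook code below performs the same replacement with pandas.

```python
from collections import Counter

UNK = "#UNK#"

def replace_rare(tokens, k):
    """Replace tokens seen fewer than k times; also return the kept vocabulary."""
    counts = Counter(tokens)
    vocab = {w for w, c in counts.items() if c >= k}
    return [w if w in vocab else UNK for w in tokens], vocab

def replace_unseen(tokens, vocab):
    """At test time, map every token outside the training vocabulary to #UNK#."""
    return [w if w in vocab else UNK for w in tokens]

train = ["a", "a", "a", "b", "c", "c", "c"]
modified_train, vocab = replace_rare(train, k=3)
print(modified_train)
print(replace_unseen(["a", "z", "b"], vocab))
```

Note that a word such as `b`, which occurs in the raw training data but fewer than `k` times, is treated as unknown at test time as well.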

For all four datasets EN, AL, CN, and SG, learn these parameters with `train`, and evaluate your
system on the development set `dev.in` of each dataset. Write your output to `dev.p2.out`
for each of the four datasets. Compare your outputs with the gold-standard outputs in `dev.out`
and report the precision, recall and F scores of this baseline system for each dataset.
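
For reference, precision, recall and F can be computed from the sets of gold and predicted chunks as sketched below. This assumes exact-match scoring over `(start, end, type)` spans; the function name `prf` is hypothetical, and the scorer actually used for the project may count matches differently.

```python
def prf(gold, predicted):
    """Precision, recall and F score over two collections of chunks."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# Toy chunks as (start, end, type) spans
gold = {(0, 2, "NP"), (2, 3, "VP"), (4, 5, "NP")}
pred = {(0, 2, "NP"), (2, 3, "NP")}
print(prf(gold, pred))
```

Here one of the two predicted chunks is correct and one of the three gold chunks is found, so precision and recall differ and F sits between them.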

In [66]:
import numpy as np
import pandas as pd
from collections import defaultdict, Counter
import time

def emissionPara(df, k, replaceWord):
    """Return 3 values
    
    emissions_dict => { (x,y) : <prob> }
    valid_x => List of x that were not replaced 
    xy_dict => { "x" : "y" } given that x is the obs, y is the most likely label
    """
    
    # Each row is "<token> <tag>"; split once into the observation x and the label y
    xy = df['x_y'].str.split(" ", n=1,expand=True)
    x, y = xy[0], xy[1]
    
    value_counts = x.value_counts()
    invalid_x = value_counts[value_counts < k].index.to_list()
    valid_x = value_counts[value_counts >= k].index.to_list()
    print("There are ", len(x), "observations", "and ", len(invalid_x), "is to be replaced.")
    
    print("Performing replacement...")
    # A set lookup avoids rebuilding the boolean mask (value_counts < k) for every row
    rare_x = set(invalid_x)
    x = x.apply(lambda word: replaceWord if word in rare_x else word)
    
    print("Calculating Emission...")
    replaced_df = x + " " + y
    x_y_tuples = replaced_df.apply(lambda x: tuple(x.split()))
    x_y_counter = Counter(x_y_tuples)
    y_counter = Counter(y)    
    
    emission_params = {}
    
    for x_y, x_y_count in x_y_counter.items():
        label = x_y[1]
        emission_params[x_y] = x_y_count / y_counter[label]
    
    print("Getting the most likely labels for each observation")
    # One pass over the parameters instead of rescanning all of them for every token
    xy_dict = {}
    best_prob = defaultdict(float)
    for key, prob in emission_params.items():
        obs, label = key[0], key[1]
        if prob > best_prob[obs]:
            best_prob[obs] = prob
            xy_dict[obs] = label
    
    return emission_params, valid_x, xy_dict

def predict(data, xy_dict, replaceWord):
    print("Predicting labels")
    start = time.process_time()
    
    def replace_string(x):
        if x not in xy_dict:
            return "{} {}".format(replaceWord, xy_dict[replaceWord])
        else:
            return "{} {}".format(x, xy_dict[x])
         
    # Blank lines (read as NaN) separate sentences and are passed through unlabelled
    result = data["x"].apply(lambda s: replace_string(s) if str(s) != "nan" else " ") 

    print("Time taken for test data: ",time.process_time() - start)
    return result

In [68]:
from csv import QUOTE_NONE

k = 3
replaceWord = "#UNK#"
data_folders = ["AL", "EN","CN","SG"]
for x in ["EN"]:
    print("Performing sentiment analysis for data folder ", x)
    train_data = "./data/{}/train".format(x)
    test_data = "./data/{}/dev.in".format(x)
    test_result = "./data/{}/dev.out".format(x)
    
    train_data = pd.read_csv(train_data, sep='\r\n', names=['x_y'],index_col=False, engine="python")
    emission_dict, valid_x, xy_dict = emissionPara(train_data, k, replaceWord)

    test_data = pd.read_csv(test_data, sep='\r\n', names=['x'],index_col=False,skip_blank_lines=False, engine="python")
    testdf = predict(test_data, xy_dict, replaceWord)
    print(testdf.head(3))
    
    print("Writing the final result to dev.p2..out...")
    testdf.to_csv('./output/{}/dev.p2.out'.format(x), header=False, index=False, na_rep="", sep="\n", quoting=QUOTE_NONE)

Performing sentiment analysis for data folder  EN
There are  181628 observations
Out of those observations,  12026 is to be replaced.
Performing replacement...
Calculating Emission...
Getting the most likely labels for each observation
Predicting labels
0        HBO B-NP
1        has B-VP
2    close B-ADJP
Name: x, dtype: object
Writing the final result to dev.p2..out...


## Part 3

Write a function that estimates the transition parameters from the training set using MLE (maximum likelihood estimation):
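
The MLE estimate of a transition parameter is q(y_i | y_{i-1}) = Count(y_{i-1}, y_i) / Count(y_{i-1}), where every sentence is padded with a `START` and an `END` state. The sketch below illustrates this on toy tag sequences; the helper name `transition_mle` and the list-of-lists input are assumptions for illustration, whereas the notebook function further down works directly on the training file.

```python
from collections import Counter

def transition_mle(tag_sequences):
    """Estimate q(cur | prev) = Count(prev, cur) / Count(prev)."""
    bigrams, prev_counts = Counter(), Counter()
    for tags in tag_sequences:
        # Pad each sentence so START and END transitions are counted too
        padded = ["START"] + list(tags) + ["END"]
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            prev_counts[prev] += 1
    return {pair: c / prev_counts[pair[0]] for pair, c in bigrams.items()}

q = transition_mle([["B-NP", "I-NP", "O"], ["O", "B-NP"]])
print(q[("START", "B-NP")], q[("I-NP", "O")])
```

Because the denominator is the count of the previous tag, the outgoing probabilities of every state sum to 1.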

In [17]:
def split_into_columns(df_column):
    new = df_column.str.split(" ", n=1, expand=True)
    return new[0], new[1]

In [18]:
from collections import Counter, defaultdict

def transitionPara(data):
    train_data_blank=pd.read_csv(data, sep='/n', delimiter=None, names=['original'],index_col=False, engine="python", skip_blank_lines=False)
    x, y = split_into_columns(train_data_blank["original"])
    xy_dic = dict(zip(x, y))
    
    # Denominator of the MLE: Count(y_{i-1}), plus START once per sentence
    y_count = Counter(y)
    
    # Numerator of the MLE: Count(y_{i-1}, y_i)
    subseq_count = defaultdict(int)
    for i in range(len(y)-1):    
        y1 = y[i]
        y2 = y[i+1]
        
        if i == 0:
            # The first line of the file opens the first sentence
            subseq_count[("START", y1)] +=1
            y_count["START"] +=1
        if pd.isna(y1):
            # A blank line means the next tag opens a new sentence
            subseq_count[("START", y2)] +=1
            y_count["START"] +=1
        elif pd.isna(y2):
            # A tag followed by a blank line closes its sentence
            subseq_count[(y1, "END")] +=1
            y_count["END"] +=1
        else:
            subseq_count[y1,y2] += 1
    
    # Calculation of transition params: q(y2 | y1) = Count(y1, y2) / Count(y1)
    transition_dict = {}
    
    for (y1, y2), count in subseq_count.items():
        transition_dict[y1,y2] = count / y_count[y1]
     
    return transition_dict, subseq_count, y_count

# transition_dic, subseq_count, y_count = transitionPara(test)

# Viterbi

In [19]:
import numpy as np

def viterbi(unique_word_list):
    #This is for the starting for viterbi
    store=[]   #store = the storage for scores for all the nodes.
    scorelist=[]
    
    #This is for the start
    for i in range(len(nodes)):
        emission_score = emission(nodes[i],unique_word_list[0], valid_x)
        transition_score = transition("START",nodes[i])
        score_at_start = np.log(emission_score)+np.log(transition_score)
        store.append(score_at_start)    
        
    scorelist.append(store)
    store=[]
    score_per_node=[]
    
    #This is for the middle portion for viterbi
    if len(unique_word_list) > 1:
        for j in range(len(unique_word_list) - 1): #for the whole length in sentence
            for k in range(len(nodes)): # for each node
                #score per node = prevnode*emission*transition
                
                for l in range(len(nodes)): # l = iterate thru previous node, k= iterate thru current node, j= iterate thru sentence
                    # This is to calculate the current node scores.
                    prev_node = scorelist[j][l]
                    emission_score = emission(nodes[k],unique_word_list[j+1], valid_x)
                    curr_node = nodes[k]
                    score_per_node.append(prev_node+np.log(emission_score)+np.log(transition(nodes[l],curr_node)))
                    
                store.append(max(score_per_node)) # found max path
                
#                 max_score = max(score_per_node)
#                 max_index = np.argmax(score_per_node)                
#                 label_max = labels[max_index]
                score_per_node=[]
            
            #print(store)
            scorelist.append(store) # store the scores for nodes
            store=[]

                      
        score_at_stop=[]
        #This is for the stop for viterbi
        for m in range(len(nodes)):
            score_at_stop.append(np.log(transition(nodes[m],"END"))+ (scorelist[len(unique_word_list)-1][m])) #at stop.
        scorelist.append(max(score_at_stop))
     
    return scorelist

def emission(node,word,x_dict):
    global emission_dict
    pair = word,node
    detector = 0 # this is used to find if word exist in the dictionary
    if pair not in emission_dict.keys(): #if the combination cannot be found in the dictionary
                                         #Either the word exists, or word is new. 
        if word in x_dict:
            score=0   #this means that this node is not the correct node.
        else:
            replaced_text = "#UNK#",node
            if replaced_text in emission_dict.keys():
                score = emission_dict[replaced_text] #if label have #unk#
            else:
                score = 0   #if label does not have #unk#, then set to 0.
    else:
        score = emission_dict[pair]
    return score

# def emission(node,word,x_dict):
#     global emission_dict
#     global nodes
#     pair = word,node
#     detector = 0 # this is used to find if word exist in the dictionary
#     if pair not in emission_dict.keys(): #if the combination cannot be found in the dictionary
#                                          #Either the word exists, or word is new. 
#         for o in nodes:
#             missing_pair = word,o
#             if missing_pair in emission_dict.keys(): #
#                 detector = 1 # to detect if word exist in dictionary.
#                 break
#         if detector == 1:
#             score=0   #this means that this node is not the correct node.
#         else:
#             replaced_text = "#UNK#",node
#             if replaced_text in emission_dict.keys():
#                 score = emission_dict[replaced_text] #if label have #unk#
#             else:
#                 score = 0   #if label does not have #unk#, then set to 0.
#     else:
#         score = emission_dict[pair]
#     return score



def transition(x1,x2):
    global transition_dic
    #will use this to search the transition from x1 to x2
    pair = x1,x2
    if pair not in transition_dic.keys():
        score = 0
    else:
        score = transition_dic[x1,x2]
    return score

In [21]:
def viterbi_backtrack(scorelist):
    ####### back tracking for viterbi
    # node value*transition = array, then find max, then find position. use position for next step.
    #np.argmax returns index of max in the element.
    # The final score on the score list is for end
    scorelist = scorelist[::-1] #reverse the score list so easier to calculate.
    node_holder=[]
    path = []
    max_node_index=0
    length_of_scorelist=len(scorelist)
    length_of_nodes=len(nodes)

    if (length_of_scorelist == 1):
        for k in range (length_of_nodes):
            calculate_max_node = (scorelist[0][k]) + np.log(transition(nodes[k],"END"))
            node_holder.append(calculate_max_node)
        path.append(nodes[np.argmax(node_holder)])
        node_holder=[]
        return(path[::-1])

    for i in range (1,length_of_scorelist): # for length of sentence

        for j in range(length_of_nodes): #for each node
            #each node*own path, find max
            if (i==1):
                calculate_max_node = (scorelist[i][j]) + np.log(transition(nodes[j],"END"))
                node_holder.append(calculate_max_node)
                #print(np.exp(calculate_max_node))
            else:
                
                calculate_max_node = (scorelist[i][j]) + np.log(transition(nodes[j],nodes[max_node_index]))#
                node_holder.append(calculate_max_node)
        
        max_node_index=(np.argmax(node_holder))
        path.append(nodes[np.argmax(node_holder)])
        node_holder=[]
        

    return(path[::-1])
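
For comparison, here is a compact Viterbi sketch that stores backpointers during the forward pass instead of recomputing the argmax in a separate backtracking function. It is a variant of the approach above, not the code used to produce the results: `viterbi_decode`, `safe_log` and the toy `emit` and `trans` dictionaries are hypothetical names, and missing dictionary keys are assumed to mean probability 0.

```python
import math

def safe_log(p):
    """log(p), with log(0) mapped to -inf instead of raising an error."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi_decode(words, tags, emit, trans):
    """Return the most likely tag sequence for a sentence."""
    if not words:
        return []
    # scores[i][t]: best log-score of any path ending in tag t at position i
    scores = [{t: safe_log(trans.get(("START", t), 0))
                  + safe_log(emit.get((words[0], t), 0)) for t in tags}]
    back = []  # back[i][t]: best previous tag for tag t at position i + 1
    for word in words[1:]:
        prev, cur, ptr = scores[-1], {}, {}
        for t in tags:
            best = max(tags, key=lambda p: prev[p] + safe_log(trans.get((p, t), 0)))
            cur[t] = (prev[best] + safe_log(trans.get((best, t), 0))
                      + safe_log(emit.get((word, t), 0)))
            ptr[t] = best
        scores.append(cur)
        back.append(ptr)
    # Pick the final tag using the transition into END, then follow backpointers
    last = max(tags, key=lambda t: scores[-1][t] + safe_log(trans.get((t, "END"), 0)))
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

tags = ["N", "V"]
trans = {("START", "N"): 0.8, ("START", "V"): 0.2, ("N", "V"): 0.7,
         ("N", "N"): 0.1, ("N", "END"): 0.2, ("V", "N"): 0.4, ("V", "END"): 0.6}
emit = {("dogs", "N"): 0.5, ("bark", "V"): 0.6, ("bark", "N"): 0.1}
print(viterbi_decode(["dogs", "bark"], tags, emit, trans))
```

Mapping zero probabilities to `-inf` explicitly avoids the divide-by-zero warnings that `np.log(0)` emits, while keeping impossible paths out of the maximum.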

In [22]:
def preprocess_training_blank_row(data):
    # Blank lines are kept (skip_blank_lines=False) so that sentence boundaries survive as NaN rows
    df= pd.read_csv(data, sep='/n', delimiter=None, names=['original'],index_col=False,engine="python",skip_blank_lines=False)
    
    # new data frame with split value columns 
    df["x"], df["y"] = split_into_columns(df["original"])
    return df

In [23]:
def sentenceList(data):
    lines=[]
    line=[]
    for label in data['x']:
        if pd.isnull(label)==False:
            line.append(label)
        else:
            lines.append(line)
            line = []
    # Keep the last sentence when the file does not end with a blank line
    if line:
        lines.append(line)
    return lines

In [24]:
def finalresult(sequence_log,predata_blank):
    dataframe = []
    count=0
    for i in range(len(sequence_log)):
        for text in sequence_log[i]:
            dataframe.append(text)
            count+=1
        dataframe.append("")
    dftest=pd.DataFrame(dataframe)
    final = pd.DataFrame()
    final['result'] = predata_blank['x'] + " " +dftest[0]
    return final

In [27]:
data_folders = ["AL", "EN","CN","SG"]
for x in ["SG"]:
#     print("Performing sentiment analysis for data folder ", x)
    train_data = "./data/{}/train".format(x)
    test_data = "./data/{}/dev.in".format(x)
#     test_result = "./data/{}/dev.out".format(x)
    
# ##############################PART 3########################################################
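    # NOTE: emission() and viterbi() read the globals emission_dict and valid_x,
    # which must come from emissionPara() run on this same dataset's train file
    transition_dic, subseq_count, y_count = transitionPara(train_data)
    predata_blank=preprocess_training_blank_row(train_data)
    # Hidden states are the real tags only: drop START, END and the NaN key left by blank lines
    node = [tag for tag in y_count.keys() if isinstance(tag, str) and tag not in ("START", "END")]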
#     print(testdf_unprocess)
    testdf_unprocess = pd.read_csv(test_data, sep='/n', delimiter=None, names=['x'],index_col=False,skip_blank_lines=False, engine="python")
    lines= sentenceList(testdf_unprocess)
    
    nodes = node
    log_array =[]
    sequence_log=[]

    for i in range(len(lines)):
        viterbioutput=viterbi(lines[i])
        log_array.append(viterbioutput)
        sequence_log.append(viterbi_backtrack(viterbioutput))
    print(sequence_log)
    
    result = finalresult(sequence_log,testdf_unprocess)
    print(result)
    
    print("Writing the final result to dev.p3.out...")
#     f = open('./dev.p3.out'.format(x) ,'w')
    f = open('./output/{}/dev.p3.out'.format(x) ,'w')
    for word in result['result']:
        if pd.isnull(word) == False:
            f.write(word + '\n')
        else:
            f.write("" +"\n")
    f.close()



[['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', '