# 01.112 Machine Learning Design Project

## About the Project

We have 4 datasets in the `/data` folder. For each dataset, there is: 
- a labelled training set train, 
- an unlabelled development set `dev.in`
- a labelled development set `dev.out` 

The labelled data has the format of: `token` `\t` `tag`
- one token per line
- token and tag separated by tab 
- single empty lines that separates sentences

For the labels, they are slightly different for different datasets.
- SG, CN (Entity):
    - B-*: Beginning of entity
    - I-*: Inside of entity
    - O: Outside of any entity
- EN, AL (Phrase):
    - B-VP: Beginning of Verb Phrase
    - I-VP: Inside of Verb Phrase
    - *-NP: Noun Phrase
    - *PP: Propositional Phrase
    - O: Outside of any phrase

*Goal*: Build sequence labelling systems from training data (x) and use it to predict tag sequences for new sentences (y).

## Team members 
- Andri Setiawan Susanto
- Eldon Lim 
- Tey Siew Wen

## Part 1
Already completed individually.

## Part 2

a) Write a function that estimates the emission parameters from the training set using MLE (maximum likelihood estimation):

b)

1. Make a modified training set by replacing those words that appear $<k$ times in the training set with a special word token `#UNK#` before training.
2. During testing phase, ifaworddoesnot appear in the modified training set, we also replace that wordwith `#UNK#`.
3. Compute Emission Paramters with the function in (a)

For all the four datasets EN, AL, CN, and SG, learn these parameters with `train`, and evaluate your
system on the development set `dev.in` for each of the dataset. Write your output to `dev.p2.out`
for the four datasets respectively. Compare your outputs and the gold-standard outputs in `dev.out`
and report the precision, recall and F scores of such a baseline system for each dataset.

In [1]:
## FAST BUT NOT STABLE, ONLY FIX IF HAVE SPARE TIME ##

import numpy as np
import pandas as pd
from collections import defaultdict, Counter

def emissionPara(arr, k, replaceWord):
    """
    calculates the emission probabilities given a numpy array containing arrays of [(x,y)]
    
    returns:
    emission_dict => a dictionary in the form of {(x,y): probability}
    valid_x => list of x that has occurences more or equal to k times.
    xy_dic => a dictionary in the form of {x: y} where y is the most probable label given x
    """
    x_counter = defaultdict(int)
    y_counter = defaultdict(int)
    xy_counter = defaultdict(int)
    x_labels = defaultdict(list)
    emission_params = {}
    xy_dict = {}

    obs = 0
    for x_y in arr:
        x, y = x_y[0].split(" ")
        obs += 1
        x_counter[x] += 1
        y_counter[y] += 1
        xy_counter[x,y] += 1
        if y not in x_labels[x]:
            x_labels[x].append(y)
        
    x_to_replace = [x for x in x_counter if x_counter[x] < k]
    print(f"There are {obs} observations")
    print(f"{len(x_to_replace)} values of x is to be replaced ")

    # removing x occurences less than k times from x_counter and its labels
    for r in x_to_replace:
        for label in x_labels[r]:
            if label not in x_labels[replaceWord]:
                x_labels[replaceWord].append(label)
            xy_counter[replaceWord, label] += xy_counter[r, label]
            
            del xy_counter[r,label]
        del x_counter[r]
        del x_labels[r]
        
    # calculation of emission probabilities        
    for x_y, x_y_count in xy_counter.items():
        y = x_y[1]
        emission_params[x_y] = x_y_count / y_counter[y]

    # get best labels
    for x, labels in x_labels.items():
        emission_probs = np.zeros(len(labels))
        for i in range(len(labels)):
            label = labels[i]
            emission_probs[i] = emission_params[x,label]
        xy_dict[x] = labels[np.argmax(emission_probs)]
    
    return emission_params, x_counter.keys(), xy_dict

def predict(data, xy_dict, replaceWord):
    print("Predicting labels")
    result = pd.DataFrame()
    
    def replace_string(x):
        if x not in xy_dict:
            return "{} {}".format(replaceWord, xy_dict[replaceWord])
        else:
            return "{} {}".format(x, xy_dict[x])
         
    return data["x"].apply(lambda s: replace_string(s) if str(s) != "nan" else " ")

In [22]:
from csv import QUOTE_NONE
import time

k = 3
replaceWord = "#UNK#"
data_folders = ["AL", "EN","CN","SG"]
for x in ["AL"]:
    print("Performing sentiment analysis for data folder ", x)
    train_data = "./data/{}/train".format(x)
    test_data = "./data/{}/dev.in".format(x)
    test_result = "./data/{}/dev.out".format(x)
    
    train_data = pd.read_csv(train_data, sep='\r\\n', index_col=False, engine="python", encoding="UTF-8").to_numpy()
    emission_dict, valid_x, xy_dict = emissionPara(train_data, k, replaceWord)

    test_data = pd.read_csv(test_data, sep='\r\n', names=['x'],index_col=False,skip_blank_lines=False, engine="python", encoding="UTF-8")
    testdf = predict(test_data, xy_dict, replaceWord)
    print(testdf.head(3))
    
    print("Writing the final result to dev.p2.out...")
    testdf.to_csv('./output/{}/dev.p2.out'.format(x), header=False, index=False, na_rep="", sep="\n", quoting=QUOTE_NONE)

Performing sentiment analysis for data folder  AL
There are 174948 observations
2624 values of x is to be replaced 
Predicting labels
0    杭 B-CITY
1    州 I-CITY
2    市 I-CITY
Name: x, dtype: object
Writing the final result to dev.p2.out...


In [3]:
df = "./data/EN/train"
tes = pd.read_csv(df, sep='\r\n', names=['x_y'],index_col=False, engine="python", encoding="UTF-8").to_numpy()
emission_dict, valid_x, xy_dict = emissionPara(tes, k, replaceWord)

# for debugging
for key,val in emission_dict.items():
    if key[0].startswith('close'):
        print(key,val)

# emission_dict

There are 181628 observations
12026 values of x is to be replaced 
('closed', 'B-VP') 0.0030118832484529873
('closed', 'B-ADJP') 0.0028555111364934323
('closed', 'I-VP') 0.0006890441972635102
('closely', 'B-VP') 0.00010952302721647227
('close', 'I-VP') 0.0018702628211438135
('closed', 'B-NP') 2.1139414438220062e-05
('closely', 'I-NP') 9.159018885896943e-05
('close', 'I-NP') 0.00031140664212049606
('close', 'I-ADVP') 0.0027548209366391185
('closely', 'B-ADVP') 0.0008415147265077139
('close', 'B-NP') 8.455765775288025e-05
('close', 'B-ADJP') 0.0034266133637921186
('closer', 'B-ADVP') 0.0005610098176718093
('closer', 'B-NP') 4.2278828876440124e-05
('close', 'O') 4.1890080428954424e-05
('close', 'I-ADJP') 0.0017421602787456446
('closely', 'I-VP') 9.84348853233586e-05
('closely', 'I-ADVP') 0.0027548209366391185
('close', 'B-ADVP') 0.0005610098176718093
('closer', 'B-ADJP') 0.0005711022272986865
('closer', 'I-ADJP') 0.0017421602787456446
('closed', 'I-NP') 1.8318037771793885e-05
('closely', 

In [4]:
!python3 evalResult.py


 ----------- Evaluation Results for AL
##Part 2##

#Entity in gold data: 8408
#Entity in prediction: 19472

#Correct Entity : 1804
Entity  precision: 0.0926
Entity  recall: 0.2146
Entity  F: 0.1294

#Correct Sentiment : 1061
Sentiment  precision: 0.0545
Sentiment  recall: 0.1262
Sentiment  F: 0.0761

##Part3##

#Entity in gold data: 8408
#Entity in prediction: 0

Traceback (most recent call last):
  File "evalResult.py", line 248, in <module>
    compare_observed_to_predicted(observed, predicted)
  File "evalResult.py", line 205, in compare_observed_to_predicted
    prec = correct_entity/total_predicted
ZeroDivisionError: float division by zero


In [5]:
emission_dict

{('bonds', 'I-NP'): 0.0018134857394075947,
 ('are', 'B-VP'): 0.03707354471277586,
 ('generally', 'B-ADVP'): 0.0033660589060308557,
 ('a', 'B-ADJP'): 0.0017133066818960593,
 ('bit', 'I-ADJP'): 0.003484320557491289,
 ('than', 'B-PP'): 0.006961440147930603,
 ('corporate', 'B-NP'): 0.0005919036042701618,
 ('in', 'B-PP'): 0.15565345080763582,
 ('a', 'B-NP'): 0.0758693584187718,
 ('recession', 'I-NP'): 0.000641131322012786,
 (',', 'O'): 0.36465315013404825,
 ('but', 'O'): 0.012231903485254691,
 ('not', 'B-ADJP'): 0.0034266133637921186,
 ('as', 'I-ADJP'): 0.012195121951219513,
 ('safe', 'I-ADJP'): 0.003484320557491289,
 ('as', 'B-PP'): 0.019796595420677653,
 ('bonds', 'B-NP'): 0.0004439277032026213,
 ('issued', 'B-VP'): 0.0007666611905153059,
 ('by', 'B-PP'): 0.047207266003154405,
 ('the', 'B-NP'): 0.1639572983828348,
 ('federal', 'I-NP'): 0.001117400304079427,
 ('government', 'I-NP'): 0.002234800608158854,
 ('.', 'O'): 0.3120392091152815,
 ('He', 'B-NP'): 0.0032343304090476695,
 ('added', 'B

## Part 3

Write a function that estimates the transition parameters from the training set using MLE (maximum likelihood estimation):

In [6]:
def split_into_columns(df_column):
    new = df_column.str.split(" ", n=1, expand=True)
    return new[0], new[1]

In [15]:
from collections import Counter, defaultdict

def transitionPara(data):
    train_data_blank=pd.read_csv(data, sep='/r/n', delimiter=None, names=['original'],index_col=False, engine="python", skip_blank_lines=False, encoding="UTF-8")
    x, y = split_into_columns(train_data_blank["original"])
    xy_dic = dict(zip(x, y))
    
    # Get bottom count (Count(yi))
    y_count = Counter(y)
    del y_count[np.nan]
    
    # Get top count (Count(yi-1, yi))
    subseq_count = defaultdict(int)
    for i in range(len(y)-1):    
        y1 = y[i]
        y2 = y[i+1]
        
        if i == 0:
            subseq_count[("START", y1)] +=1
            y_count["START"] +=1
        if pd.isna(y1):
            subseq_count[("START", y2)] +=1
            y_count["START"] +=1
        elif i == len(y)-1 or pd.isna(y2):
            subseq_count[(y1, "END")] +=1
            y_count["END"] +=1
        else:
            subseq_count[y1,y2] += 1
    
    # Calculation of transition params
    result = np.empty(len(y)+2)
    transition_dict = {}
    
    for k,v in subseq_count.items():
        y1 = k[0]
        y2 = k[1]
        transition_dict[y1,y2] = subseq_count[y1,y2] / y_count[y2]
     
    return transition_dict, subseq_count, y_count


In [8]:
# from collections import Counter, defaultdict

# def transitionPara(data):
#     y_count = defaultdict(int)
#     subseq_count = defaultdict(int)
    
#     for i in range(len(data) - 1):
#         curr_xy = data[i][0]
#         next_xy = data[i+1][0]
        
#         if i == 0:
#             x, y = curr_xy.split(" ")
#             subseq_count["START", y] += 1
#             y_count[y] += 1
#             y_count["START"] += 1
#             continue
        
#         elif curr_xy == None or curr_xy == np.nan or str(curr_xy) == "nan":
#             x, y = next_xy.split(" ")
#             subseq_count["START", y] += 1
#             y_count[y] +=1
#             y_count["START"] += 1
        
#         elif i+1 == len(data) or next_xy == None or next_xy == np.nan or str(next_xy) == "nan":
#             x, y = curr_xy.split(" ")
#             subseq_count[y,"END"] +=1
#             y_count[y] += 1
#             y_count["END"] += 1
            
#         else:
#             y1 = curr_xy.split()[1]
#             y2 = next_xy.split()[1]
#             subseq_count[y1,y2] +=1
#             y_count[y1] += 1
        
#         # Calculation of transition params
#     transition_dict = {}

#     for k,v in subseq_count.items():
#         y1 = k[0]
#         y2 = k[1]
#         transition_dict[y1,y2] = subseq_count[y1,y2] / y_count[y1]

#     return transition_dict, subseq_count, y_count



In [16]:
def viterbi(unique_word_list):
    #This is for the starting for viterbi
    global nodes
    
    num_nodes_per_col = len(nodes)
    store=np.zeros(num_nodes_per_col)   #store = the storage for scores for all the nodes. 
    scorelist=np.zeros((num_nodes_per_col, len(unique_word_list) + 1))
    
    for i in range(num_nodes_per_col):
        emission_score = emission(nodes[i],unique_word_list[0])
        transition_score = transition("START",nodes[i])
        if transition_score == 0 or emission_score == 0:
            store[i] = np.NINF
        else:
            store[i] = np.log(emission_score)+np.log(transition_score)
            
    scorelist[:,0] = store
    store = np.zeros(num_nodes_per_col)
    score_per_node=np.zeros(num_nodes_per_col)
    
    #This is for the middle portion for viterbi
    #score per node = prevnode*emission*transition

    if len(unique_word_list)>1:
        for j in range(len(unique_word_list)-1): #for the whole length in sentence
            for k in range(num_nodes_per_col): #for each node
                for l in range(num_nodes_per_col): #for 1 node, transition from prev node to current node
                    prev_node = scorelist[l][j]
                    emission_score = emission(nodes[k],unique_word_list[j+1])
                    transition_score = transition(nodes[l],nodes[k])
                    
                    if transition_score == 0 or emission_score == 0:
                        score_per_node[l] = np.NINF
                    else:   
                        score_per_node[l] = prev_node+np.log(emission_score)+np.log(transition_score) 
                
                store[k] = np.max(score_per_node) # max path
                score_per_node=np.zeros(num_nodes_per_col)
            
            scorelist[:,j+1] = store
            store = np.zeros(num_nodes_per_col)
                      
        score_at_stop=np.zeros(num_nodes_per_col)
        
        #This is for the STOP for viterbi
        for m in range(num_nodes_per_col):
            score_at_stop[m] = np.log(transition(nodes[m],"END")) + scorelist[m][len(unique_word_list)-1]
        scorelist[:,-1] = np.full(num_nodes_per_col,np.max(score_at_stop))
        
    return scorelist
  
def viterbi_backtrack(scorelist):
    """
    back tracking for viterbi
    node value*transition = array, then find max, then find position. use position for next step.
    np.argmax returns index of max in the element.
    The final score on the score list is for end
    """ 
    global nodes
    
    scorelist = np.flip(scorelist,axis=1) #reverse the score list so easier to calculate.
    
    max_node_index = 0 
    num_obs = scorelist.shape[1]
    num_nodes = scorelist.shape[0]
    node_holder = np.zeros(num_nodes)
    path = []

    if (num_obs == 1):
        for k in range (num_nodes):
            calculate_max_node = scorelist[0][k] + np.log(transition(nodes[k],"END"))
            node_holder[i] = calculate_max_node
        path.append(nodes[np.argmax(node_holder)])
        return(path[::-1])

    for i in range (1,num_obs): # for length of sentence
        for j in range(num_nodes): #for each node
            if (i==1):
                calculate_max_node = scorelist[j][i] + np.log(transition(nodes[j],"END"))
                node_holder[j] = calculate_max_node
            else:
                calculate_max_node = scorelist[j][i] + np.log(transition(nodes[j],nodes[max_node_index]))
                node_holder[j] = calculate_max_node
        
        max_node_index=np.argmax(node_holder)
        path.append(nodes[np.argmax(node_holder)])
        node_holder=np.zeros(num_nodes)

    return(path[::-1])

def emission(node,word):
    global emission_dict
    global nodes
    pair = word,node
    detector = 0 # this is used to find if word exist in the dictionary
    if pair not in emission_dict.keys(): #if the combination cannot be found in the dictionary
                                         #Either the word exists, or word is new. 
        for o in nodes:
            missing_pair = word,o
            if missing_pair in emission_dict.keys(): #
                detector = 1 # to detect if word exist in dictionary.
                break
        if detector == 1:
            score=0   #this means that this node is not the correct node.
        else:
            replaced_text = "#UNK#",node
            if replaced_text in emission_dict.keys():
                score = emission_dict[replaced_text] #if label have #unk#
                
            else:
                score = 0   #if label does not have #unk#, then set to 0.
    else:
        score = emission_dict[pair]
    return score


def transition(x1,x2):
    global transition_dic
    #will use this to search the transition from x1 to x2
    pair = x1,x2
    if pair not in transition_dic.keys():
        score = 0
    else:
        score = transition_dic[x1,x2]
    return score

In [17]:
def preprocess_training_blank_row(data):
    start = time.process_time()   
    
    df= pd.read_csv(data, sep='/n', delimiter=None, names=['original'],index_col=False,engine="python",skip_blank_lines=False)
    # dropping null value columns to avoid errors 
    
    # new data frame with split value columns 
    df["x"], df["y"] = split_into_columns(df["original"])
    return df

In [18]:
def sentenceList(data):
    lines=[]
    line=[]
    x= data
    for label in x['x']:
        if pd.isnull(label)==False:
            line.append(label)
        else:
            lines.append(line)
            line = []
    return lines

In [19]:
def finalresult(sequence_log,predata_blank):
    dataframe = []
    count=0
    for i in range(len(sequence_log)):
        for text in sequence_log[i]:
            dataframe.append(text)
            count+=1
        dataframe.append("")
    dftest=pd.DataFrame(dataframe)
    final = pd.DataFrame()
    final['result'] = predata_blank['x'] + " " +dftest[0]
    return final

In [23]:
import time

data_folders = ["AL", "EN","CN","SG"]
for x in ["AL"]:
    print("Performing sentiment analysis for data folder ", x)
    train_data = "./data/{}/train".format(x)
    test_data = "./data/{}/dev.in".format(x)
    
    train_df = train_data
#     train_df = pd.read_csv(train_data, sep='/r/n' ,index_col=False, engine="python", skip_blank_lines=False, encoding="UTF-8").to_numpy()
    transition_dic, subseq_count, y_count = transitionPara(train_df)
    predata_blank=preprocess_training_blank_row(train_data)
    nodes = list(y_count.keys())
    testdf_unprocess = pd.read_csv(test_data, sep='/n', delimiter=None, names=['x'],index_col=False,skip_blank_lines=False, engine="python")
    lines= sentenceList(testdf_unprocess)
    
    
    log_array =[]
    sequence_log=[]
    
    print("Performing Viterbi")
    start = time.time()
    for i in range(len(lines)):
        viterbioutput=viterbi(lines[i])
        log_array.append(viterbioutput)
        sequence_log.append(viterbi_backtrack(viterbioutput))
    end = time.time()
    
    print("Time taken for Viterbi and Backtrack", end - start)
    
    result = finalresult(sequence_log,testdf_unprocess)
    
    print("Writing the final result to dev.p3.out...")
    f = open('./output/{}/dev.p3.out'.format(x) ,'w')
    for word in result['result']:
        if pd.isnull(word) == False:
            f.write(word + '\n')
        else:
            f.write("" +"\n")
    f.close()
    print("Finished Writing.")

Performing sentiment analysis for data folder  AL
Performing Viterbi




Time taken for Viterbi and Backtrack 205.58896827697754
Writing the final result to dev.p3.out...
Finished Writing.


In [None]:
# # FOR TESTING ONLY.

# print("Performing Viterbi")
# sequence_log = []
# start = time.time()
# for i in range(1):
#     viterbioutput=viterbi(lines[i])
#     log_array.append(viterbioutput)
#     sequence_log.append(viterbi_backtrack(viterbioutput))
# end = time.time()

# print("Time taken for Viterbi and Backtrack", end - start)

# print(sequence_log)

In [None]:
transition_dic

In [24]:
!python3 evalResult.py


 ----------- Evaluation Results for AL
##Part 2##

#Entity in gold data: 8408
#Entity in prediction: 19472

#Correct Entity : 1804
Entity  precision: 0.0926
Entity  recall: 0.2146
Entity  F: 0.1294

#Correct Sentiment : 1061
Sentiment  precision: 0.0545
Sentiment  recall: 0.1262
Sentiment  F: 0.0761

##Part3##

#Entity in gold data: 8408
#Entity in prediction: 8514

#Correct Entity : 3275
Entity  precision: 0.3847
Entity  recall: 0.3895
Entity  F: 0.3871

#Correct Sentiment : 2404
Sentiment  precision: 0.2824
Sentiment  recall: 0.2859
Sentiment  F: 0.2841

 ----------- Evaluation Results for EN
##Part 2##

#Entity in gold data: 13179
#Entity in prediction: 19406

#Correct Entity : 9152
Entity  precision: 0.4716
Entity  recall: 0.6944
Entity  F: 0.5617

#Correct Sentiment : 7644
Sentiment  precision: 0.3939
Sentiment  recall: 0.5800
Sentiment  F: 0.4692

##Part3##
dev.p3.out not present

 ----------- Evaluation Results for CN
##Part 2##

#Entity in gold data: 1478
#Entity in prediction

In [None]:
result

In [None]:
train_df = pd.read_csv("./data/EN/train", sep='/r/n' ,index_col=False, engine="python", skip_blank_lines=False, encoding="UTF-8").to_numpy()
transition_dic, subseq_count, y_count = transitionPara(train_df)
# for debugging
for key,val in transition_dic.items():
    if key[1] == "END":
        print(key, val)