In [1]:
import numpy as np
import pandas as pd
import time

# 01.112 Machine Learning Design Project

## About the Project

We have 4 datasets in the `/data` folder. For each dataset, there is: 
- a labelled training set train, 
- an unlabelled development set `dev.in`
- a labelled development set `dev.out` 

The labelled data has the format of: `token` `\t` `tag`
- one token per line
- token and tag separated by tab 
- single empty lines that separates sentences

For the labels, they are slightly different for different datasets.
- SG, CN (Entity):
    - B-*: Beginning of entity
    - I-*: Inside of entity
    - O: Outside of any entity
- EN, AL (Phrase):
    - B-VP: Beginning of Verb Phrase
    - I-VP: Inside of Verb Phrase
    - *-NP: Noun Phrase
    - *PP: Propositional Phrase
    - O: Outside of any phrase

*Goal*: Build sequence labelling systems from training data (x) and use it to predict tag sequences for new sentences (y).

## Team members 
- Andri Setiawan Susanto
- Eldon Lim 
- Tey Siew Wen

## Part 1
Already completed individually.

## Part 2

a) Write a function that estimates the emission parameters from the training set using MLE (maximum likelihood estimation):

In [2]:
def emissionPara(xy,y):
    e_x_y= xy/y
    
    return e_x_y

b)

1. Make a modified training set by replacing those words that appear $<k$ times in the training set with a special word token `#UNK#` before training.
2. During testing phase, ifaworddoesnot appear in the modified training set, we also replace that wordwith `#UNK#`.
3. Compute Emission Paramters with the function in (a)

In [3]:
k = 3
replaceWord = "#UNK#" 

In [4]:
def split_into_columns(df_column):
    new = df_column.str.split(" ", n=1, expand=True)
    return new[0], new[1]

In [5]:
def preprocess_training(data):
    global replaceWord
    
      
    x_dic = {}
    
    # dropping null value columns e.g. index_col to avoid errors 
    df= pd.read_csv(data, sep='/n', delimiter=None, names=['original'],index_col=False, engine="python")
    
    # new data frame with split value columns 
    df["x"], df["y"] = split_into_columns(df["original"])
    
    return df
   

In [6]:
def uniqueCount(df,k,replaceWord):
    start = time.process_time() 
    x_dic={}
     # df display: record x value and replace y values with replaceWord when necessary, in respective dictionaries
    uniqueX, uniqueCountX= np.unique(df['x'].astype(str),return_counts=True)
    for i in range(len(uniqueX)):
        x_dic[uniqueX[i]] = uniqueCountX[i]

    for i, text in enumerate(df['x']):
        if x_dic[text] < k:
            df['x'][i] = replaceWord
            df['original'][i]=df['original'][i].replace(text,replaceWord, 1)
    
    y_dic={}
    
    uniqueY, uniqueCountY= np.unique(df['y'].astype(str),return_counts=True)
    for i in range(len(uniqueY)):
        y_dic[uniqueY[i]] = uniqueCountY[i]
        
    xy_dic = {}
    df1= df.copy()
    
    # Get a tuple of unique values & their count from a numpy array
    df1.dropna(inplace = True) 
    uniqueXY, uniqueCountXY= np.unique(df1['original'].astype(str),return_counts=True)

    for i in range(len(uniqueXY)):
        xy_dic[uniqueXY[i]] = uniqueCountXY[i]
    # print('Unique Values : ', uniqueValues)
    
    # print('Count of Unique Values : ', uniqueCount)
    dft = pd.DataFrame([uniqueXY,uniqueCountXY]).T
    dft=dft.rename({0:'x_y',1:'count_x_y'},axis='columns')
    
    dft['count_y']=0
    for i,text in enumerate(dft['x_y']):
        data = text.split(" ")
        dft['count_y'][i]=y_dic[data[1]]
    print("Time taken for train data: ", time.process_time() - start)
    return dft

In [7]:
def emissionCalcu(df):
    emission_dict={}
    df['emission']=emissionPara(df['count_x_y'], df['count_y'])
    for i in range(df.shape[0]):
        emission_dict[df['x_y'][i]]= df['emission'][i]
    return df,emission_dict

def xyPrediction(dft):
    # new data frame with split value columns 
    dft1 = dft.copy()
    dft1["x"], dft1["y"] = split_into_columns(dft1["x_y"])
    
    xy_pred_dic = {}

    for word in dft1['x']:
        index = pd.Series.idxmax((dft1.loc[dft1['x'] == word]['emission']).astype(float))
        xy_pred_dic[word]=dft1['y'][index] 
    
    return xy_pred_dic

In [8]:
def preprocess_test(data,k):
    global replaceWord
    
    start = time.process_time()   

    testdf1= pd.read_csv(data, sep='/n', delimiter=None, names=['original'],index_col=False, engine="python")
    testdf= pd.read_csv(data, sep='/n', delimiter=None, names=['original'],index_col=False,skip_blank_lines=False, engine="python")

    x_dic = {}

    uniqueX, uniqueCountX= np.unique(testdf1['original'].astype(str),return_counts=True)
    for i in range(len(uniqueX)):
        x_dic[uniqueX[i]] = uniqueCountX[i]

    testdf['modified']=''
#     print(testdf)
    for i, text in enumerate(testdf['original']):
    #         df['x'][i] = replaceWord
        try:
            if text not in xy_pred_dic:
            
                testdf['modified'][i]=testdf['original'][i].replace(text,replaceWord)
            else:
                testdf['modified'][i]=testdf['original'][i]
        except:
            continue
    testdf['predict_label']=''
    for index, word in enumerate(testdf['modified']):
#     print(word)
        try:
            testdf['predict_label'][index]= xy_pred_dic[word]
        except:
            continue
    print("Time taken for test data: ",time.process_time() - start)
    return testdf

For all the four datasets EN, AL, CN, and SG, learn these parameters with `train`, and evaluate your
system on the development set `dev.in` for each of the dataset. Write your output to `dev.p2.out`
for the four datasets respectively. Compare your outputs and the gold-standard outputs in `dev.out`
and report the precision, recall and F scores of such a baseline system for each dataset.

## Part 3

Write a function that estimates the transition parameters from the training set using MLE (maximum likelihood estimation):

In [9]:
# test = pd.read_csv("./data/EN/train", sep='/n', delimiter=None, names=['original'],index_col=False, engine="python", skip_blank_lines=False)
# test.replace(np.nan, None, inplace=True)
# test.isnull().sum()

In [10]:
from collections import Counter, defaultdict

def transitionPara(data):
    train_data_blank=pd.read_csv(data, sep='/n', delimiter=None, names=['original'],index_col=False, engine="python", skip_blank_lines=False)
    x, y = split_into_columns(train_data_blank["original"])
    xy_dic = dict(zip(x, y))
    
    # Get bottom count (Count(yi))
    y_count = Counter(y)
    
    # Get top count (Count(yi-1, yi))
    subseq_count = defaultdict(int)
    for i in range(len(y)-1):    
        y1 = y[i]
        y2 = y[i+1]
        
        if i == 0:
            subseq_count[("START", y1)] +=1
            y_count["START"] +=1
        if pd.isna(y1):
            subseq_count[("START", y2)] +=1
            y_count["START"] +=1
        elif i == len(y)-1 or pd.isna(y2):
            subseq_count[(y1, "END")] +=1
            y_count["END"] +=1
        else:
            subseq_count[y1,y2] += 1
    
    # Calculation of transition params
    result = np.empty(len(y)+2)
    transition_dict = {}
    
    for k,v in subseq_count.items():
        y1 = k[0]
        y2 = k[1]
        transition_dict[y1,y2] = subseq_count[y1,y2] / y_count[y2]
     
    return transition_dict, subseq_count, y_count

# transition_dic, subseq_count, y_count = transitionPara(test)

# Viterbi

In [11]:
import numpy as np

def viterbi(unique_word_list):
    #This is for the starting for viterbi
    store=[]   #store = the storage for scores for all the nodes.
    scorelist=[]
    
    
    
    #This is for the start
    for i in range(len(nodes)):
        emission_score = emission(nodes[i],unique_word_list[0])
        transition_score = transition("START",nodes[i])
        score_at_start = np.log(emission_score)+np.log(transition_score)
        store.append(score_at_start)    
        
    scorelist.append(store)
    store=[]
    score_per_node=[]
    
    #This is for the middle portion for viterbi
    if len(unique_word_list) > 1:
        for j in range(len(unique_word_list) - 1): #for the whole length in sentence
            for k in range(len(nodes)): # for each node
                #score per node = prevnode*emission*transition
                
                for l in range(len(nodes)): # l = iterate thru previous node, k= iterate thru current node, j= iterate thru sentence
                    # This is to calculate the current node scores.
                    prev_node = scorelist[j][l]
                    emission_score = emission(nodes[k],unique_word_list[j+1])
                    curr_node = nodes[k]
                    score_per_node.append(prev_node+np.log(emission_score)+np.log(transition(nodes[l],curr_node)))
                    
                store.append(max(score_per_node)) # found max path
                
#                 max_score = max(score_per_node)
#                 max_index = np.argmax(score_per_node)                
#                 label_max = labels[max_index]
                score_per_node=[]
            
            #print(store)
            scorelist.append(store) # store the scores for nodes
            store=[]

                      
        score_at_stop=[]
        #This is for the stop for viterbi
        for m in range(len(nodes)):
            score_at_stop.append(np.log(transition(nodes[m],"END"))+ (scorelist[len(unique_word_list)-1][m])) #at stop.
        scorelist.append(max(score_at_stop))
     
    return scorelist

    
def emission(node,word):
    global emission_dict
    # Will use this to search the emission score for the given word
    #emission_dict={"Athe":0.9,"Bthe":0.1,"Adog":0.1,"Bdog":0.9,"Astop":0}   # takes out from dictionary
    pair = word+" "+node
    if pair not in emission_dict.keys():
        replaced_text = "#UNK# "+ node
        if replaced_text in emission_dict.keys():
            score = 0.01*emission_dict[replaced_text]
        else:
            score = 0 # our train data does not have this node
    else:
        score = emission_dict[pair]
    
    return score

def transition(x1,x2):
    global transition_dic
    #will use this to search the transition from x1 to x2
    pair = x1,x2
    if pair not in transition_dic.keys():
        score = 0
    else:
        score = transition_dic[x1,x2]
    return score







In [12]:
def viterbi_backtrack(scorelist):
    ####### back tracking for viterbi
    # node value*transition = array, then find max, then find position. use position for next step.
    #np.argmax returns index of max in the element.
    # The final score on the score list is for end
    scorelist = scorelist[::-1] #reverse the score list so easier to calculate.
    node_holder=[]
    path = []
    max_node_index=0
    length_of_scorelist=len(scorelist)
    length_of_nodes=len(nodes)

    if (length_of_scorelist == 1):
        for k in range (length_of_nodes):
            calculate_max_node = (scorelist[0][k]) + np.log(transition(nodes[k],"END"))
            node_holder.append(calculate_max_node)
        path.append(nodes[np.argmax(node_holder)])
        node_holder=[]
        return(path[::-1])

    for i in range (1,length_of_scorelist): # for length of sentence

        for j in range(length_of_nodes): #for each node
            #each node*own path, find max
            if (i==1):
                calculate_max_node = (scorelist[i][j]) + np.log(transition(nodes[j],"END"))
                node_holder.append(calculate_max_node)
                #print(np.exp(calculate_max_node))
            else:
                
                calculate_max_node = (scorelist[i][j]) + np.log(transition(nodes[j],nodes[max_node_index]))#
                node_holder.append(calculate_max_node)
        
        max_node_index=(np.argmax(node_holder))
        path.append(nodes[np.argmax(node_holder)])
        node_holder=[]
        

    return(path[::-1])

In [13]:
def preprocess_training_blank_row(data):
    start = time.process_time()   
    
    df= pd.read_csv(data, sep='/n', delimiter=None, names=['original'],index_col=False,engine="python",skip_blank_lines=False)
    # dropping null value columns to avoid errors 
    
    # new data frame with split value columns 
    df["x"], df["y"] = split_into_columns(df["original"])
    return df

In [14]:
def sentenceList(data):
    lines=[]
    line=[]
    x= data
    for label in x['x']:
        if pd.isnull(label)==False:
            line.append(label)
        else:
    #         line += ' stop'
            lines.append(line)
            line = []
    return lines


In [15]:
def finalresult(sequence_log,predata_blank):
    dataframe = []
    count=0
    for i in range(len(sequence_log)):
        for text in sequence_log[i]:
            dataframe.append(text)
            count+=1
        dataframe.append("")
    dftest=pd.DataFrame(dataframe)
    final = pd.DataFrame()
    final['result'] = predata_blank['x'] + " " +dftest[0]
    return final

In [16]:
data_folders = ["AL", "EN","CN","SG"]
for x in ["EN"]:
    print("Performing sentiment analysis for data folder ", x)
    train_data = "./data/{}/train".format(x)
    test_data = "./data/{}/dev.in".format(x)
    test_result = "./data/{}/dev.out".format(x)
    
    predata = preprocess_training(train_data)
    countData=uniqueCount(predata,k,replaceWord)
    emissiondf = emissionCalcu(countData)
    emission_dict = emissiondf[1]
    xy_pred_dic = xyPrediction(emissiondf[0])
    testdf_unprocess = pd.read_csv(test_data, sep='/n', delimiter=None, names=['x'],index_col=False,skip_blank_lines=False, engine="python")
    testdf = preprocess_test(test_data,k)
    
    testresultdf = pd.read_csv(test_result, sep='/n', delimiter=None, names=['original'],index_col=False, engine="python")
    new = testresultdf["original"].str.split(" ", n=1,expand=True) 

    # making separate first name column from new data frame 
    testresultdf["x"]= new[0] 

    # making separate last name column from new data frame 
    testresultdf["y"]= new[1]
    final = pd.DataFrame()
    
    final['result'] = testdf['modified'] + ' ' + testdf['predict_label']
#     print(final.head(3))
    
    print("Writing the final result to dev.out...")
    f = open('./output/{}/dev.p2.out'.format(x) ,'w')
    for word in final['result']:
        f.write(word + '\n')
    f.close()
    
##############################PART 3########################################################
    transition_dic, subseq_count, y_count = transitionPara(train_data)
    predata_blank=preprocess_training_blank_row(train_data)
    node = np.unique(predata['y'].astype(str))
    print(testdf_unprocess)
    lines= sentenceList(testdf_unprocess)
    
    nodes = node
    log_array =[]
    sequence_log=[]

    for i in range(len(lines)):
        viterbioutput=viterbi(lines[i])
        log_array.append(viterbioutput)
        sequence_log.append(viterbi_backtrack(viterbioutput))
    
    result = finalresult(sequence_log,testdf_unprocess)
    print(result)
    
    print("Writing the final result to dev.p3.out...")
#     f = open('./dev.p3.out'.format(x) ,'w')
    f = open('./output/{}/dev.p3.out'.format(x) ,'w')
    for word in result['result']:
        if pd.isnull(word) == False:
            f.write(word + '\n')
        else:
            f.write("" +"\n")
    f.close()

Performing sentiment analysis for data folder  EN


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Time taken for train data:  299.703125
Time taken for test data:  9.46875
Writing the final result to dev.out...
                 x
0              HBO
1              has
2            close
3               to
4               24
5          million
6      subscribers
7               to
8              its
9              HBO
10             and
11         Cinemax
12        networks
13               ,
14           while
15        Showtime
16             and
17             its
18          sister
19         service
20               ,
21             The
22           Movie
23         Channel
24               ,
25            have
26            only
27           about
28              10
29         million
...            ...
27195           is
27196          due
27197         only
27198       partly
27199           to
27200          the
27201    austerity
27202      program
27203     launched
27204           in
27205    September
27206         1988
27207           to
27208         cool
27209        

  


                 result
0            HBO B-ADVP
1              has B-VP
2            close I-VP
3               to B-PP
4               24 B-NP
5          million I-NP
6      subscribers B-VP
7               to I-VP
8              its B-NP
9              HBO B-VP
10                and O
11         Cinemax B-NP
12        networks I-NP
13                  , O
14         while B-SBAR
15        Showtime B-NP
16                and O
17             its B-NP
18          sister I-NP
19         service I-NP
20                  , O
21             The B-NP
22           Movie I-NP
23         Channel I-NP
24                  , O
25            have B-VP
26            only I-VP
27           about B-PP
28              10 B-NP
29         million I-NP
...                 ...
27195           is B-VP
27196        due B-ADJP
27197       only I-ADJP
27198     partly I-ADJP
27199           to B-PP
27200          the B-NP
27201    austerity I-NP
27202      program I-NP
27203     launched I-NP
27204           