# Mandatory Exercise - Session 5

### Students: Nafis Banirazi & Jan Carbonell

### Lab Objective:
The objective of this lab is to categorize the given pairs and print their most frequent WordNet synset, their corresponding least common subsumer (LCS), and their similarity according to the following measures:

- Path Similarity
- Leacock-Chodorow Similarity
- Wu-Palmer Similarity
- Lin Similarity


In [1]:
# initial imports. Could also be done in the PC
import nltk
import numpy as np
import pandas as pd
import itertools

#additional set of imports
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic


#given set of pairs
pairs = [('the', 'DT'), ('man', 'NN'), ('swim', 'VB'), \
         ('with', 'PR'), ('a', 'DT'), ('girl', 'NN'), \
         ('and', 'CC'), ('a', 'DT'), ('boy', 'NN'), \
         ('whilst', 'PR'), ('the', 'DT'), ('woman', 'NN'), \
         ('walk', 'VB')]
n = {}
v = {}
aux = {}
freq = []
lcs = []
definition = []

Now, for each pair, we look up its most frequent WordNet synset, which the code below takes to be the first listed sense (`.01`). The documentation also lists adjectives and adverbs as options, but they are not used in this set of pairs. For reference: https://wordnet.princeton.edu/documentation/wn1wn
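As a rough illustration of this lookup, the sketch below maps a POS tag to a WordNet part-of-speech letter and builds the name of the first sense. The tag mapping and the helper name `first_sense_name` are assumptions made for this example, not part of the lab code; the `JJ` and `RB` entries only show where adjectives and adverbs would fit.

```python
# Hypothetical helper: map a POS tag to the WordNet name of a word's first sense.
# The tag-to-letter mapping below is an assumption made for this illustration.
TAG_TO_WN_POS = {
    'NN': 'n',  # noun
    'VB': 'v',  # verb
    'JJ': 'a',  # adjective (not present in the given pairs)
    'RB': 'r',  # adverb (not present in the given pairs)
}

def first_sense_name(word, tag):
    """Return a name such as 'man.n.01', or None for tags with no mapping."""
    wn_pos = TAG_TO_WN_POS.get(tag)
    if wn_pos is None:
        return None
    return f"{word}.{wn_pos}.01"

sample = [('the', 'DT'), ('man', 'NN'), ('swim', 'VB')]
names = {word: first_sense_name(word, tag) for word, tag in sample}
print(names)
```

A name built this way can be passed to `wn.synset(...)`; determiners, prepositions and conjunctions yield `None` and can simply be skipped.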

In [2]:
# keep the first WordNet sense of every noun and verb; other tags are skipped
for word, tag in pairs:
    if word not in n and word not in v:
        if tag == 'NN':
            n[word] = wn.synset(word + '.n.01')
        elif tag == 'VB':
            v[word] = wn.synset(word + '.v.01')

# verify that the synsets were stored properly
for key, value in n.items():
    print(key, value)

for key, value in v.items():
    print(key, value)
            

man Synset('man.n.01')
girl Synset('girl.n.01')
boy Synset('male_child.n.01')
woman Synset('woman.n.01')
swim Synset('swim.v.01')
walk Synset('walk.v.01')


For each of the words that was found in WordNet, we now compute the least common subsumer (LCS) with every other word of the same class. The LCS is the most specific common ancestor (hypernym) of two concepts in a given ontology. For example, the LCS of moose and kangaroo in WordNet is mammal.
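The idea can be sketched on a hand-made hypernym tree. The tree below is a toy assumption and not the real WordNet hierarchy (WordNet synsets can have several hypernyms, which this single-parent sketch ignores); the function names are hypothetical.

```python
# Toy hypernym tree (child -> parent); it is NOT the real WordNet hierarchy.
PARENT = {
    'animal': 'entity',
    'mammal': 'animal',
    'bird': 'animal',
    'moose': 'mammal',
    'kangaroo': 'mammal',
    'sparrow': 'bird',
}

def ancestors(node):
    """Return the chain from node up to the root, starting with node itself."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def least_common_subsumer(a, b):
    """Return the first node on a's chain that also subsumes b."""
    b_chain = set(ancestors(b))
    for node in ancestors(a):
        if node in b_chain:
            return node
    return None

print(least_common_subsumer('moose', 'kangaroo'))  # mammal
print(least_common_subsumer('moose', 'sparrow'))   # animal
```

Because each chain starts with the node itself, a concept that is an ancestor of the other is returned as the LCS, which is also what the lab output shows for woman and girl.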

In [3]:
for key in n:
    freq.append([key,n[key], 'noun'])

for key in v:
    freq.append([key,v[key], 'verb'])

In [4]:
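# build the LCS matrix by comparing every synset with every other synset
for key in freq:
    row = []
    for alt in freq:
        # an LCS is only looked up within the same class ('noun' or 'verb')
        if key[2] == alt[2]:
            row.append(key[1].lowest_common_hypernyms(alt[1]))
        else:
            row.append('-')
    lcs.append(row)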

print(lcs)

[[[Synset('man.n.01')], [Synset('adult.n.01')], [Synset('male.n.02')], [Synset('adult.n.01')], '-', '-'], [[Synset('adult.n.01')], [Synset('girl.n.01')], [Synset('person.n.01')], [Synset('woman.n.01')], '-', '-'], [[Synset('male.n.02')], [Synset('person.n.01')], [Synset('male_child.n.01')], [Synset('person.n.01')], '-', '-'], [[Synset('adult.n.01')], [Synset('woman.n.01')], [Synset('person.n.01')], [Synset('woman.n.01')], '-', '-'], ['-', '-', '-', '-', [Synset('swim.v.01')], [Synset('travel.v.01')]], ['-', '-', '-', '-', [Synset('travel.v.01')], [Synset('walk.v.01')]]]


In [8]:
lcs_np = np.array(lcs)
label = [i[0] for i in freq]
pd.DataFrame(lcs_np, columns = label, index= label)

| | man | girl | boy | woman | swim | walk |
|---|---|---|---|---|---|---|
| man | [Synset('man.n.01')] | [Synset('adult.n.01')] | [Synset('male.n.02')] | [Synset('adult.n.01')] | - | - |
| girl | [Synset('adult.n.01')] | [Synset('girl.n.01')] | [Synset('person.n.01')] | [Synset('woman.n.01')] | - | - |
| boy | [Synset('male.n.02')] | [Synset('person.n.01')] | [Synset('male_child.n.01')] | [Synset('person.n.01')] | - | - |
| woman | [Synset('adult.n.01')] | [Synset('woman.n.01')] | [Synset('person.n.01')] | [Synset('woman.n.01')] | - | - |
| swim | - | - | - | - | [Synset('swim.v.01')] | [Synset('travel.v.01')] |
| walk | - | - | - | - | [Synset('travel.v.01')] | [Synset('walk.v.01')] |


Now we proceed to run and evaluate the similarity algorithms.
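For orientation, the four measures can be written as small arithmetic functions. This is a minimal sketch using the formulas as they are commonly stated, with all inputs (path length, depths, information content) supplied by hand; library implementations may count depths slightly differently. The function names and the taxonomy depth of 19 are assumptions for this example.

```python
import math

# Formulas as commonly defined for these measures; all inputs are supplied by hand.
def path_sim(dist):
    """Path similarity: 1 / (shortest path length + 1)."""
    return 1.0 / (dist + 1)

def lch_sim(dist, max_depth):
    """Leacock-Chodorow: -log((path length + 1) / (2 * taxonomy depth))."""
    return -math.log((dist + 1) / (2.0 * max_depth))

def wup_sim(depth_lcs, depth_a, depth_b):
    """Wu-Palmer: 2 * depth(LCS) / (depth(a) + depth(b))."""
    return 2.0 * depth_lcs / (depth_a + depth_b)

def lin_sim(ic_lcs, ic_a, ic_b):
    """Lin: 2 * IC(LCS) / (IC(a) + IC(b)), with IC the information content."""
    return 2.0 * ic_lcs / (ic_a + ic_b)

# Assumed example: two nouns three edges apart in a taxonomy of depth 19.
dist, max_depth = 3, 19
print(path_sim(dist))
# Normalize by the self-similarity (distance 0), mirroring a division by the maximum.
print(round(lch_sim(dist, max_depth) / lch_sim(0, max_depth), 3))
```

Under these assumed inputs the two printed values are 0.25 and 0.619. Path, Wu-Palmer and Lin already reach 1 for identical concepts, so only Leacock-Chodorow changes when divided by its self-similarity.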

In [34]:
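# sanity check on two nouns whose only common ancestor is the root
plantation = wn.synset('plantation.n.01')
dinosaur = wn.synset('dinosaur.n.01')
common = plantation.lowest_common_hypernyms(dinosaur)
lch_value = wn.lch_similarity(plantation, dinosaur)
print(common, lch_value)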

[Synset('entity.n.01')] 0.6418538861723948


In [31]:
semcor_ic = wordnet_ic.ic('ic-semcor.dat')

path, lch, wup, lin = [], [], [], []
v1, v2, v3, v4 = 0, 0, 0, 0

# initializing two words only connected by entity
a = wn.synset('plantation.n.01')
b = wn.synset('dinosaur.n.01')
relation = a.lowest_common_hypernyms(b)

#calculating max values as the result of running the algorithms on the same element
max1 = wn.path_similarity(a, a)
max2 = wn.lch_similarity(a, a)
max3 = wn.wup_similarity(a, a)
max4 = a.lin_similarity(a, semcor_ic)

#calculating min value of the algorithms with words only connected by entity
min1 = wn.path_similarity(a, b)
min2 = wn.lch_similarity(a, b)
min3 = wn.wup_similarity(a, b)
min4 = a.lin_similarity(b, semcor_ic)

#max and min done outside the loop to preserve as we loop through elements

for key in freq:
    
    #initializing the rows of the matrices
    row1 = []
    row2 = []
    row3 = []
    row4 = []
    
    for alt in freq:
        #adding only if they belong to the same class 'noun' // 'verb'
        if key[2] == alt[2]:
            #calculating the result of the algorithm
            v1 = wn.path_similarity(key[1], alt[1])
            v2 = wn.lch_similarity(key[1], alt[1])
            v3 = wn.wup_similarity(key[1], alt[1])
            v4 = key[1].lin_similarity(alt[1], semcor_ic)
            
            #normalizing it, rounding 3 decimals and appending to row
            row1.append(round(v1/max1, 3))
            row2.append(round(v2/max2, 3))
            row3.append(round(v3/max3, 3))
            row4.append(round(v4/max4, 3))

        else:
            row1.append('-')
            row2.append('-')
            row3.append('-')
            row4.append('-')
    path.append(row1)
    lch.append(row2)
    wup.append(row3)
    lin.append(row4)

print(path)
print(lch)
print(wup)
print(lin)

[[1.0, 0.25, 0.333, 0.333, '-', '-'], [0.25, 1.0, 0.167, 0.5, '-', '-'], [0.333, 0.167, 1.0, 0.2, '-', '-'], [0.333, 0.5, 0.2, 1.0, '-', '-'], ['-', '-', '-', '-', 1.0, 0.333], ['-', '-', '-', '-', 0.333, 1.0]]
[[1.0, 0.619, 0.698, 0.698, '-', '-'], [0.619, 1.0, 0.507, 0.809, '-', '-'], [0.698, 0.507, 1.0, 0.558, '-', '-'], [0.698, 0.809, 0.558, 1.0, '-', '-'], ['-', '-', '-', '-', 1.0, 0.663], ['-', '-', '-', '-', 0.663, 1.0]]
[[1.0, 0.632, 0.667, 0.667, '-', '-'], [0.632, 1.0, 0.632, 0.632, '-', '-'], [0.667, 0.632, 1.0, 0.667, '-', '-'], [0.667, 0.947, 0.667, 1.0, '-', '-'], ['-', '-', '-', '-', 1.0, 0.333], ['-', '-', '-', '-', 0.333, 1.0]]
[[1.0, 0.702, 0.798, 0.802, '-', '-'], [0.702, 1.0, 0.274, 0.882, '-', '-'], [0.798, 0.274, 1.0, 0.308, '-', '-'], [0.802, 0.882, 0.308, 1.0, '-', '-'], ['-', '-', '-', '-', 1.0, 0.466], ['-', '-', '-', '-', 0.466, 1.0]]

<b>Which model would you select?</b> Justify the answer.

Based on the initial graph, both CRF and PER seem to be the better-performing algorithms. If we had to decide based on this data alone, we would pick the Perceptron model.

To pick the best overall model, another relevant measure is the speed of the system. For this reason, we have also timed each of the models and plot the results to see which algorithms were more efficient.
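One way such a timer could look is sketched below. The helper `time_call`, the stand-in function `dummy_train` and the sample sizes are all hypothetical and only illustrate the measurement pattern; they are not the lab's actual models or data.

```python
from time import perf_counter

def time_call(fn, *args):
    """Run fn(*args) once and return (result, elapsed seconds)."""
    start = perf_counter()
    result = fn(*args)
    return result, perf_counter() - start

# Stand-in for a model's training routine; purely illustrative.
def dummy_train(n_sentences):
    return sum(i * i for i in range(n_sentences))

# One list of timings per model, one entry per training-set size.
timings = {'dummy': []}
for size in (500, 1000, 1500):
    _, elapsed = time_call(dummy_train, size)
    timings['dummy'].append(elapsed)

print(timings)
```

Importing `perf_counter` directly, rather than the whole `time` module, avoids a clash with a dictionary that is itself named `time`.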

In [None]:
#preparing and plotting the execution time graph
import matplotlib.pyplot as plt

# train_stop and time are assumed to be defined in the earlier timing cells
x = train_stop
plt.figure()
plt.plot(x, time['HMM'], label='HMM')
plt.plot(x, time['TnT'], label='TnT')
plt.plot(x, time['PER'], label='PER')
plt.plot(x, time['CRF'], label='CRF')

#adding the labels and legend, then showing the plot
plt.xlabel('Number of Sentences')
plt.ylabel('Model Execution Time')
plt.title('Part Of Speech Models')
plt.legend()
plt.show()

In this case, PER seems to be the most effective algorithm in terms of accuracy and the second most time-efficient. After this additional execution time analysis, <b>we are reassured in our conclusion to pick the Perceptron model.</b>

## Conclusions
Over the course of this lab, we implemented four different POS models, trained them on one set of segments of the treebank corpus and tested them on another. From those results, we plotted their performance based on accuracy and execution time and selected the best-performing model for this specific case, which turned out to be the Perceptron model.

Another point worth highlighting is that the Hidden Markov Model looks very efficient in terms of time. It is likely that the amount of data was simply not in the range it needs to perform well.
