## Mandatory Exercise - Session 5
### Olga Valls & Lavanya Mandadapu

We import the packages that we need

In [1]:
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic
import pandas as pd
import numpy as np

Given the following (lemma, category) <b>pairs</b>:

In [2]:
pairs = [('the', 'DT'), ('man', 'NN'), ('swim', 'VB'), ('with', 'PR'), ('a', 'DT'), ('girl', 'NN'), ('and', 'CC'), ('a', 'DT'), ('boy', 'NN'), ('whilst', 'PR'), ('the', 'DT'), ('woman', 'NN'), ('walk', 'VB')]

We select the words whose <b>POS tag</b> is NN (<u>noun</u>) or VB (<u>verb</u>) and collect them into one list of nouns and one list of verbs, so that we can later look up the synsets of those words.

In [3]:
nouns = []
verbs = []

for p in pairs:
    if p[1][0] == 'N':
        nouns.append(p[0])
    elif p[1][0] == 'V':
        verbs.append(p[0])

Create:
- the list "types", with the WordNet POS codes that we will use (n for nouns, v for verbs)
- the dictionary "diccio", which maps each POS code to its list of lemmas, so that we can pass the matching code (n/v) to wn.synsets() when looking up each word
- the dictionary "diccio_synsets", which stores the most frequent synset of each noun and of each verb under its own POS code, in addition to the combined list of most frequent WordNet synsets

In [92]:
types = ['n', 'v']
diccio = {'n': nouns, 'v': verbs}
diccio_synsets = {'n': [], 'v': []}

### 1. Most frequent WordNet synsets
For each POS type (noun/verb), we look up the synsets of every lemma in the corresponding list.<br>
For each lemma we keep the first synset in the returned list, which is the <u>most frequent</u> sense.

In [5]:
freq_synsets = []
for t in types:
    print('{}: {} '.format(t,diccio[t]))
    for x in diccio[t]:
        x_synsets = wn.synsets(x, t)
        freq_synsets.append(x_synsets[0])
        diccio_synsets[t].append(x_synsets[0])
        print('Most frequent {}: {}'.format(x, x_synsets[0]))

n: ['man', 'girl', 'boy', 'woman'] 
Most frequent man: Synset('man.n.01')
Most frequent girl: Synset('girl.n.01')
Most frequent boy: Synset('male_child.n.01')
Most frequent woman: Synset('woman.n.01')
v: ['swim', 'walk'] 
Most frequent swim: Synset('swim.v.01')
Most frequent walk: Synset('walk.v.01')


### 2. Least Common Subsumer (LCS)
For every pair of synsets in the list of most frequent WordNet synsets, we look for the Least Common Subsumer (LCS).<br>
It is computed with WordNet's lowest_common_hypernyms() method, which returns the lowest hypernym(s) shared by two given synsets.<br>
<i>http://www.nltk.org/howto/wordnet_lch.html</i>

In [6]:
llcs = []
for i in range(len(freq_synsets)):
    rows = []
    for j in range(len(freq_synsets)):
        rows.append(freq_synsets[i].lowest_common_hypernyms(freq_synsets[j]))
    llcs.append(rows)

<b>LCS</b> among each pair of lemmas:

In [14]:
llcs_np = np.array(llcs)
pd.DataFrame(llcs_np, columns=freq_synsets, index=freq_synsets)

Unnamed: 0,Synset('man.n.01'),Synset('girl.n.01'),Synset('male_child.n.01'),Synset('woman.n.01'),Synset('swim.v.01'),Synset('walk.v.01')
Synset('man.n.01'),[Synset('man.n.01')],[Synset('adult.n.01')],[Synset('male.n.02')],[Synset('adult.n.01')],[],[]
Synset('girl.n.01'),[Synset('adult.n.01')],[Synset('girl.n.01')],[Synset('person.n.01')],[Synset('woman.n.01')],[],[]
Synset('male_child.n.01'),[Synset('male.n.02')],[Synset('person.n.01')],[Synset('male_child.n.01')],[Synset('person.n.01')],[],[]
Synset('woman.n.01'),[Synset('adult.n.01')],[Synset('woman.n.01')],[Synset('person.n.01')],[Synset('woman.n.01')],[],[]
Synset('swim.v.01'),[],[],[],[],[Synset('swim.v.01')],[Synset('travel.v.01')]
Synset('walk.v.01'),[],[],[],[],[Synset('travel.v.01')],[Synset('walk.v.01')]


There is no Least Common Subsumer for any noun/verb pair, because nouns and verbs belong to separate hierarchies.

The two verbs are related through the synset travel.v.01, which denotes movement: walking is travelling on foot, and swimming is travelling through water.

The four nouns are related in different ways; the subsumer that they all share is person, which sits higher in the hierarchy.<br>
For example, man and woman are both adults. Girl is a hyponym of woman (their LCS is woman itself), so WordNet also places it under adult, which is why man and girl share adult. Boy (male_child) shares male with man, but only person with girl and woman, so it is not under adult as man is.
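
To make the idea of a Least Common Subsumer concrete without WordNet, here is a minimal sketch on a hand-written toy taxonomy. The `PARENT` map and the helper names are hypothetical, and each node has a single parent, which is a simplification (WordNet synsets can have several hypernyms, so this toy tree does not reproduce every cell of the table above).

```python
# Hypothetical toy taxonomy: child -> parent (single inheritance for simplicity).
PARENT = {
    "person": "entity",
    "adult": "person",
    "male_child": "person",
    "man": "adult",
    "woman": "adult",
    "girl": "woman",
    # Verbs live in a separate tree with its own root.
    "swim": "travel",
    "walk": "travel",
}

def ancestors(node):
    """Return the chain [node, parent, ..., root]."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def lowest_common_subsumer(a, b):
    """First node on a's chain to the root that also lies on b's chain."""
    b_chain = set(ancestors(b))
    for node in ancestors(a):
        if node in b_chain:
            return node
    return None  # the two nodes are in different trees

print(lowest_common_subsumer("man", "girl"))         # adult
print(lowest_common_subsumer("girl", "woman"))       # woman
print(lowest_common_subsumer("girl", "male_child"))  # person
print(lowest_common_subsumer("swim", "walk"))        # travel
print(lowest_common_subsumer("man", "swim"))         # None
```

The last call returns `None` for the same reason the noun/verb cells of the table are empty: the two nodes have no ancestor in common.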

### 3. Similarity Value
We separate nouns and verbs and compute the similarities within each category.<br>
For each similarity measure we build a list that holds one matrix of similarities for the nouns and one for the verbs.

For the Lin Similarity we also need Information Content (IC) values, which we load from an information content file of the wordnet_ic corpus.

Comparing a synset with itself should return 1. Therefore, if a measure returns values higher than 1, we will have to normalise them to the range between 0 and 1.

In [8]:
brown_ic = wordnet_ic.ic('ic-brown.dat')

sim_path_all = []
sim_lch_all = []
sim_wup_all = []
sim_lin_all = []

for x, t in enumerate(types):
    sim_path = []
    sim_lch = []
    sim_wup = []
    sim_lin = []
    for i in range(len(diccio_synsets[t])):
        rows_path = []
        rows_lch = []
        rows_wup = []
        rows_lin = []
        for j in range(len(diccio_synsets[t])):
            rows_path.append(diccio_synsets[t][i].path_similarity(diccio_synsets[t][j]))
            rows_lch.append(diccio_synsets[t][i].lch_similarity(diccio_synsets[t][j]))
            rows_wup.append(diccio_synsets[t][i].wup_similarity(diccio_synsets[t][j]))
            rows_lin.append(diccio_synsets[t][i].lin_similarity(diccio_synsets[t][j], brown_ic))
        sim_path.append(rows_path)
        sim_lch.append(rows_lch)
        sim_wup.append(rows_wup)
        sim_lin.append(rows_lin)
    sim_path_all.append(sim_path)
    sim_lch_all.append(sim_lch)
    sim_wup_all.append(sim_wup)
    sim_lin_all.append(sim_lin)

#### Path Similarity
<i>Returns a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1. By default, there is now a fake root node added to verbs so for cases where previously a path could not be found---and None was returned---it should return a value. The old behavior can be achieved by setting simulate_root to be False. A score of 1 represents identity i.e. comparing a sense with itself will return 1.<br>
synset1.path_similarity(synset2)</i>

Sim(s1, s2) = 1 / (1 + SPL(s1,s2)) ,

<i>where SPL(s1,s2) = Shortest Path Length from s1 to s2</i>

<u>Path Similarity for nouns</u>:

In [55]:
path_n_np = np.array(sim_path_all[0])
pd.DataFrame(path_n_np, columns=diccio_synsets['n'], index=diccio_synsets['n'])

Unnamed: 0,Synset('man.n.01'),Synset('girl.n.01'),Synset('male_child.n.01'),Synset('woman.n.01')
Synset('man.n.01'),1.0,0.25,0.333333,0.333333
Synset('girl.n.01'),0.25,1.0,0.166667,0.5
Synset('male_child.n.01'),0.333333,0.166667,1.0,0.2
Synset('woman.n.01'),0.333333,0.5,0.2,1.0


Highest value: girl/woman: 0.5<br>
Lowest value: girl/boy: 0.166667
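
The scores in the noun table can be read back as path lengths by inverting the formula. The sketch below is only an arithmetic check: the helper names are hypothetical and the scores are copied from the table, with SPL counted in edges.

```python
def path_similarity(spl):
    """Sim = 1 / (1 + SPL), with SPL the number of edges on the shortest path."""
    return 1 / (1 + spl)

def path_length(sim):
    """Invert the formula: SPL = 1 / Sim - 1, rounded to the nearest integer."""
    return round(1 / sim - 1)

# Scores copied from the noun table above.
scores = {
    "girl/woman": 0.5,
    "man/woman": 0.333333,
    "man/girl": 0.25,
    "boy/woman": 0.2,
    "girl/boy": 0.166667,
}

for pair, sim in scores.items():
    print(pair, path_length(sim))
# girl/woman 1
# man/woman 2
# man/girl 3
# boy/woman 4
# girl/boy 5
```

A score of 0.5 therefore means that the two synsets are direct neighbours in the hierarchy, and each extra edge lowers the score.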

<u>Path Similarity for verbs</u>:

In [1]:
path_v_np = np.array(sim_path_all[1])
pd.DataFrame(path_v_np, columns=diccio_synsets['v'], index=diccio_synsets['v'])

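NameError: name 'np' is not defined

<i>Note: the execution counter In [1] suggests that this cell was run in a fresh kernel, before the imports, so the table for the verbs was not produced.</i>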

#### Leacock-Chodorow Similarity
<i>Returns a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur. The relationship is given as -log(p/2d) where p is the shortest path length and d the taxonomy depth.
synset1.lch_similarity(synset2)</i>

Sim(s1, s2) = −log(SPL(s1,s2) / (2·MaxDepth)) ,

<i>where depth(s) = depth of s in the ontology<br>MaxDepth = max<sub>s∈WN</sub> depth(s)</i>

<u>Leacock-Chodorow Similarity for nouns</u>:

In [57]:
lch_n_np = np.array(sim_lch_all[0])
pd.DataFrame(lch_n_np, columns=diccio_synsets['n'], index=diccio_synsets['n'])

Unnamed: 0,Synset('man.n.01'),Synset('girl.n.01'),Synset('male_child.n.01'),Synset('woman.n.01')
Synset('man.n.01'),3.637586,2.251292,2.538974,2.538974
Synset('girl.n.01'),2.251292,3.637586,1.845827,2.944439
Synset('male_child.n.01'),2.538974,1.845827,3.637586,2.028148
Synset('woman.n.01'),2.538974,2.944439,2.028148,3.637586


<u>Leacock-Chodorow Similarity for verbs</u>:

In [58]:
lch_v_np = np.array(sim_lch_all[1])
pd.DataFrame(lch_v_np, columns=diccio_synsets['v'], index=diccio_synsets['v'])

Unnamed: 0,Synset('swim.v.01'),Synset('walk.v.01')
Synset('swim.v.01'),3.258097,2.159484
Synset('walk.v.01'),2.159484,3.258097
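
As a plausibility check of the formula, the values in the two tables above are reproduced if the path length is counted in nodes (edges + 1), which is also why a synset compared with itself gets a finite score instead of −log(0). The depth constants below are assumptions inferred from the self-similarity values (exp(3.637586) ≈ 38 and exp(3.258097) ≈ 26), not read from the library, and the function name is hypothetical.

```python
import math

def lch(spl_edges, max_depth):
    """-log(p / (2 * max_depth)), with p counted in nodes (edges + 1)."""
    return -math.log((spl_edges + 1) / (2 * max_depth))

NOUN_DEPTH = 19  # assumption: inferred from exp(3.637586) ~ 38 = 2 * 19
VERB_DEPTH = 13  # assumption: inferred from exp(3.258097) ~ 26 = 2 * 13

print(round(lch(0, NOUN_DEPTH), 6))  # 3.637586 (a noun with itself)
print(round(lch(1, NOUN_DEPTH), 6))  # 2.944439 (girl/woman)
print(round(lch(5, NOUN_DEPTH), 6))  # 1.845827 (girl/boy)
print(round(lch(2, VERB_DEPTH), 6))  # 2.159484 (swim/walk)
```

Because the score depends only on the path length and a constant, Leacock-Chodorow ranks the pairs in the same order as Path Similarity.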


As we can see, the Leacock-Chodorow results need to be <b>normalised</b> to [0,1]: comparing a synset with itself should return 1, but it returns 3.637586 for nouns and 3.258097 for verbs.

We look for the minimum and maximum similarity values for the nouns and for the verbs (separately). With those values we apply the following formula to each element of the corresponding category:<br>
<i>x = (x - min) / (max - min)</i>

In [73]:
lch_norm = []

for i, t in enumerate(types):
    mins = []
    maxs = []
    for l in sim_lch_all[i]:
        mins.append(min(l))
        maxs.append(max(l))
    minim = min(mins)
    maxim = max(maxs)
    print('min {}: {}'.format(t,minim))
    print('max {}: {}'.format(t,maxim))

    temp = []
    for l in sim_lch_all[i]:
        x_norm = []
        for x in l:
            x = (x - minim) / (maxim - minim)
            x_norm.append(x)
        temp.append(x_norm)
    
    lch_norm.append(temp)

min n: 1.845826690498331
max n: 3.6375861597263857
min v: 2.159484249353372
max v: 3.258096538021482


<u>Normalised Leacock-Chodorow Similarity for nouns</u>:

In [81]:
lch_n_np = np.array(lch_norm[0])
pd.DataFrame(lch_n_np, columns=diccio_synsets['n'], index=diccio_synsets['n'])

Unnamed: 0,Synset('man.n.01'),Synset('girl.n.01'),Synset('male_child.n.01'),Synset('woman.n.01')
Synset('man.n.01'),1.0,0.226294,0.386853,0.386853
Synset('girl.n.01'),0.226294,1.0,0.0,0.613147
Synset('male_child.n.01'),0.386853,0.0,1.0,0.101756
Synset('woman.n.01'),0.386853,0.613147,0.101756,1.0


Highest value: girl/woman: 0.613147<br>
Lowest value: girl/boy: 0.000000

<u>Normalised Leacock-Chodorow Similarity for verbs</u>:

In [82]:
lch_v_np = np.array(lch_norm[1])
pd.DataFrame(lch_v_np, columns=diccio_synsets['v'], index=diccio_synsets['v'])

Unnamed: 0,Synset('swim.v.01'),Synset('walk.v.01')
Synset('swim.v.01'),1.0,0.0
Synset('walk.v.01'),0.0,1.0
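
The 0.0 for swim/walk is a direct consequence of min-max scaling rather than a statement about the two verbs: the smallest observed value always maps to 0, and the verb matrix contains only two distinct values. A compact sketch of the same normalisation, with a hypothetical helper name and the verb values copied from the table above:

```python
def min_max(matrix):
    """Rescale all entries of a matrix to [0, 1] using its global min and max."""
    flat = [x for row in matrix for x in row]
    lo, hi = min(flat), max(flat)
    return [[(x - lo) / (hi - lo) for x in row] for row in matrix]

# Leacock-Chodorow values for the verbs, copied from the table above.
verbs = [[3.258097, 2.159484],
         [2.159484, 3.258097]]

print(min_max(verbs))  # [[1.0, 0.0], [0.0, 1.0]]
```

Dividing every value by the self-similarity would be one alternative that keeps the off-diagonal value above 0; it is not what this notebook does.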


#### Wu-Palmer Similarity
Returns a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node). Note that at this time the scores given do _not_ always agree with those given by Pedersen's Perl implementation of Wordnet Similarity.
<i>synset1.wup_similarity(synset2)</i>

Sim(s1, s2) = 2·depth(LCS(s1,s2)) / (depth(s1)+depth(s2)) ,

<i>where LCS(s1,s2) = Lowest Common Subsumer of s1 and s2</i>

<u>Wu-Palmer Similarity for nouns</u>:

In [84]:
wup_n_np = np.array(sim_wup_all[0])
pd.DataFrame(wup_n_np, columns=diccio_synsets['n'], index=diccio_synsets['n'])

Unnamed: 0,Synset('man.n.01'),Synset('girl.n.01'),Synset('male_child.n.01'),Synset('woman.n.01')
Synset('man.n.01'),1.0,0.631579,0.666667,0.666667
Synset('girl.n.01'),0.631579,1.0,0.631579,0.631579
Synset('male_child.n.01'),0.666667,0.631579,1.0,0.666667
Synset('woman.n.01'),0.666667,0.947368,0.666667,1.0


Highest value: woman/girl: 0.947368<br>
Lowest value: girl/man, man/girl, boy/girl, girl/boy, girl/woman: 0.631579
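
As a quick arithmetic check of the Wu-Palmer formula, the sketch below plugs in depth values. The depths (9 for the subsumer and for one synset, 10 for the other) are illustrative assumptions, chosen because they reproduce the 0.947368 in the table; they are not read from WordNet. It also shows that the formula itself gives the same result when the two synsets are swapped.

```python
def wup(depth_lcs, depth_s1, depth_s2):
    """Wu-Palmer formula: 2 * depth(LCS) / (depth(s1) + depth(s2))."""
    return 2 * depth_lcs / (depth_s1 + depth_s2)

# Assumed depths: the LCS is one of the two synsets (depth 9),
# and the other synset sits one level below it (depth 10).
forward = wup(9, 9, 10)
backward = wup(9, 10, 9)

print(round(forward, 6))    # 0.947368
print(forward == backward)  # True
```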

<u>Wu-Palmer Similarity for verbs</u>:

In [85]:
wup_v_np = np.array(sim_wup_all[1])
pd.DataFrame(wup_v_np, columns=diccio_synsets['v'], index=diccio_synsets['v'])

Unnamed: 0,Synset('swim.v.01'),Synset('walk.v.01')
Synset('swim.v.01'),1.0,0.333333
Synset('walk.v.01'),0.333333,1.0


#### Lin Similarity
<i>Returns a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).
synset1.lin_similarity(synset2, ic)</i>

Sim(s1,s2) = 2·IC(LCS(s1,s2)) / (IC(s1)+IC(s2)) ,

<i>where IC(s) = −log<sub>2</sub> P(s) = information content of s (from frequencies in a corpus)</i>
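
A small sketch of how the formula behaves, using made-up corpus counts (the `COUNTS` values, `TOTAL` and the helper names are hypothetical, not taken from the Brown IC file). Each count is assumed to include the occurrences of the concept's descendants, so a subsumer is never rarer than the synsets below it. The sketch also checks that the base of the logarithm cancels out in the ratio.

```python
import math

def information_content(count, total, log=math.log):
    """IC(s) = -log P(s), with P(s) estimated as count / total."""
    return -log(count / total)

def lin(ic_lcs, ic_s1, ic_s2):
    """Lin formula: 2 * IC(LCS) / (IC(s1) + IC(s2))."""
    return 2 * ic_lcs / (ic_s1 + ic_s2)

# Hypothetical corpus counts (each count includes the concept's descendants).
TOTAL = 1000
COUNTS = {"woman": 20, "girl": 10}

def lin_from_counts(lcs, s1, s2, log=math.log):
    ic = {name: information_content(c, TOTAL, log) for name, c in COUNTS.items()}
    return lin(ic[lcs], ic[s1], ic[s2])

# Here "woman" plays the role of the LCS of "woman" and "girl".
sim_ln = lin_from_counts("woman", "woman", "girl")
sim_log2 = lin_from_counts("woman", "woman", "girl", log=math.log2)

print(round(sim_ln, 3))
print(math.isclose(sim_ln, sim_log2))  # True: the log base cancels out
```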

<u>Lin Similarity for nouns</u>:

In [88]:
lin_n_np = np.array(sim_lin_all[0])
pd.DataFrame(lin_n_np, columns=diccio_synsets['n'], index=diccio_synsets['n'])

Unnamed: 0,Synset('man.n.01'),Synset('girl.n.01'),Synset('male_child.n.01'),Synset('woman.n.01')
Synset('man.n.01'),1.0,0.713511,0.729472,0.787084
Synset('girl.n.01'),0.713511,1.0,0.292728,0.90678
Synset('male_child.n.01'),0.729472,0.292728,1.0,0.318423
Synset('woman.n.01'),0.787084,0.90678,0.318423,1.0


Highest value: woman/girl, girl/woman: 0.906780<br>
Lowest value: boy/girl, girl/boy: 0.292728

<u>Lin Similarity for verbs</u>:

In [89]:
lin_v_np = np.array(sim_lin_all[1])
pd.DataFrame(lin_v_np, columns=diccio_synsets['v'], index=diccio_synsets['v'])

Unnamed: 0,Synset('swim.v.01'),Synset('walk.v.01')
Synset('swim.v.01'),1.0,0.491005
Synset('walk.v.01'),0.491005,1.0


### What similarity seems better?

All four measures agree that the highest similarity is for the pair woman/girl. The highest value is given by Wu-Palmer, with 0.947368.
The weird thing is that, for Wu-Palmer, the reversed pair girl/woman has the lowest value, with 0.631579. This is confusing, because the formula is symmetric in the two synsets, so the difference presumably comes from the implementation rather than from the measure itself.

All the methods agree that the lowest value is for the pair girl/boy, and Leacock-Chodorow (after normalisation) scores it lowest, with 0.
Wu-Palmer also gives the pair girl/man one of the lowest values, with 0.631579.

The differences between the highest and the lowest values for each method (nouns, excluding the diagonal) are:
- Path: 0.5 - 0.166667 = 0.333333
- Leacock-Chodorow (normalised): 0.613147 - 0 = 0.613147
- Wu-Palmer: 0.947368 - 0.631579 = 0.315789
- Lin: 0.906780 - 0.292728 = 0.614052

As a <b><u>conclusion</u></b> we would say that:<br>
<b>Wu-Palmer</b> is <u>confusing</u> because, as we can see below, it gives a <u>different result</u> for woman/girl than for girl/woman, and both results are high.

In [97]:
wn.synset('woman.n.01').wup_similarity(wn.synset('girl.n.01'))

0.9473684210526315

In [98]:
wn.synset('girl.n.01').wup_similarity(wn.synset('woman.n.01'))

0.631578947368421

<b>Path</b> and <b>Leacock-Chodorow</b> produce a <u>quite similar distribution of weights</u>.<br>
However, <b>Lin</b> gives <u>lower</u> weights to <u>differences</u> and <u>higher</u> weights to <u>similarities</u>, and, in our example, its weights are more spread out than those of the other methods.

So we would choose the <b><u>Lin</u></b> method.