# IHLT - Lab 5

- Given the following (lemma, category) pairs:

`('the','DT'), ('man','NN'), ('swim','VB'), ('with', 'PR'), ('a', 'DT'),
('girl','NN'), ('and', 'CC'), ('a', 'DT'), ('boy', 'NN'), ('whilst', 'PR'),
('the', 'DT'), ('woman', 'NN'), ('walk', 'VB')`

- For each pair, when possible, print their most frequent WordNet synset, their corresponding least common subsumer (LCS) and their similarity value, using the following functions:

    - Path Similarity

    - Leacock-Chodorow Similarity

    - Wu-Palmer Similarity

    - Lin Similarity

Normalize similarity values when necessary. Which similarity seems better?

## Imports

In [1]:
from nltk.corpus import wordnet_ic, wordnet as wn
import itertools

## 1. Data preparation

Synsets are obtained from the lemma and the category of each given pair. The returned synsets are sorted by frequency, so the first one is the most frequent synset for that pair.

In [2]:
pairs = [('the','DT'), ('man','NN'), ('swim','VB'), ('with', 'PR'), ('a', 'DT'),
         ('girl','NN'), ('and', 'CC'), ('a', 'DT'), ('boy', 'NN'), ('whilst', 'PR'),
         ('the', 'DT'), ('woman', 'NN'), ('walk', 'VB')]

synsets = list()
for lemma, category in pairs:
    try:
        synset = wn.synsets(lemma, category[0].lower())[0]
        print("Appending most frequent synset for lemma '" + lemma + "' (category: '" + category + "'):\n" + str(synset) + "\n")
        synsets.append(synset)
    except Exception:  # raised when WordNet has no synset for this lemma/PoS
        print("Lemma '" + lemma + "' cannot be a synset because its category is " + category + ".\n")

Lemma 'the' cannot be a synset because its category is DT.

Appending most frequent synset for lemma 'man' (category: 'NN'):
Synset('man.n.01')

Appending most frequent synset for lemma 'swim' (category: 'VB'):
Synset('swim.v.01')

Lemma 'with' cannot be a synset because its category is PR.

Lemma 'a' cannot be a synset because its category is DT.

Appending most frequent synset for lemma 'girl' (category: 'NN'):
Synset('girl.n.01')

Lemma 'and' cannot be a synset because its category is CC.

Lemma 'a' cannot be a synset because its category is DT.

Appending most frequent synset for lemma 'boy' (category: 'NN'):
Synset('male_child.n.01')

Lemma 'whilst' cannot be a synset because its category is PR.

Lemma 'the' cannot be a synset because its category is DT.

Appending most frequent synset for lemma 'woman' (category: 'NN'):
Synset('woman.n.01')

Appending most frequent synset for lemma 'walk' (category: 'VB'):
Synset('walk.v.01')
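
The cell above passes the lowercased first letter of each tag to `wn.synsets` and relies on the exception handler to skip unsupported categories. A more explicit alternative, sketched below, maps each tag to a WordNet part-of-speech code before any lookup. The helper name `tag_to_wordnet_pos` and the mapping table are illustrative assumptions, not part of the original solution.

```python
# Hypothetical helper: map a Penn-Treebank-style tag to a WordNet PoS code.
# WordNet only covers nouns, verbs, adjectives and adverbs.
TAG_PREFIX_TO_POS = {
    "NN": "n",  # nouns
    "VB": "v",  # verbs
    "JJ": "a",  # adjectives
    "RB": "r",  # adverbs
}


def tag_to_wordnet_pos(tag):
    """Return the WordNet PoS code for a tag, or None if the category is not covered."""
    return TAG_PREFIX_TO_POS.get(tag[:2].upper())


sample_pairs = [("the", "DT"), ("man", "NN"), ("swim", "VB"), ("with", "PR"), ("and", "CC")]
for lemma, category in sample_pairs:
    # Lemmas whose category maps to None would simply be skipped.
    print(lemma, category, tag_to_wordnet_pos(category))
```

With such a mapping, the synset lookup would only be attempted for lemmas whose category is not `None`, so the `try`/`except` would only have to cover lemmas that are missing from WordNet.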



## 2. Pairs of most frequent WordNet synsets
To compare the most frequent WordNet synsets with each other, only the unordered combinations of two distinct synsets are generated. This relies on the following properties: `LCS(a,b) = LCS(b,a)`, `sim(a,b) = sim(b,a)`, `LCS(a,a) = a` and `sim(a,a) = max(sim)`, and it avoids redundant computations in the remaining sections of this assignment.

In [3]:
mf_synset = list(itertools.combinations(synsets, 2))
print("Pairs of most frequent WordNet synsets:")
mf_synset

Pairs of most frequent WordNet synsets:


[(Synset('man.n.01'), Synset('swim.v.01')),
 (Synset('man.n.01'), Synset('girl.n.01')),
 (Synset('man.n.01'), Synset('male_child.n.01')),
 (Synset('man.n.01'), Synset('woman.n.01')),
 (Synset('man.n.01'), Synset('walk.v.01')),
 (Synset('swim.v.01'), Synset('girl.n.01')),
 (Synset('swim.v.01'), Synset('male_child.n.01')),
 (Synset('swim.v.01'), Synset('woman.n.01')),
 (Synset('swim.v.01'), Synset('walk.v.01')),
 (Synset('girl.n.01'), Synset('male_child.n.01')),
 (Synset('girl.n.01'), Synset('woman.n.01')),
 (Synset('girl.n.01'), Synset('walk.v.01')),
 (Synset('male_child.n.01'), Synset('woman.n.01')),
 (Synset('male_child.n.01'), Synset('walk.v.01')),
 (Synset('woman.n.01'), Synset('walk.v.01'))]
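
As a quick check of how much work the symmetry properties save, the following sketch compares the number of unordered pairs with the number of ordered pairs (self-pairs included). Plain strings stand in for the synsets, so it runs without WordNet; the variable names are illustrative.

```python
import itertools

# Plain labels standing in for the six synsets found in section 1.
labels = ["man", "swim", "girl", "male_child", "woman", "walk"]

# Unordered pairs of distinct items: n * (n - 1) / 2 of them.
unordered = list(itertools.combinations(labels, 2))

# Every ordered pair, self-pairs included: n * n of them.
ordered_with_self = list(itertools.product(labels, repeat=2))

n = len(labels)
print(len(unordered), n * (n - 1) // 2)
print(len(ordered_with_self), n * n)
```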

## 3. Least Common Subsumer (LCS)
To find the LCS of each pair of synsets, their hypernym paths must be computed. The LCS is then the common hypernym that lies farthest from the root (*i.e.* the lowest common hypernym). In this section, the `nltk` function `synset_1.lowest_common_hypernyms(synset_2)` will be used.

It is not possible to get the LCS, and therefore the similarity, of synsets with different parts of speech. Hence, each time the function returns an empty list, the pair is deleted from the list of pairs of synsets.

A pair made of the same synset twice has that synset as its LCS (`LCS(a,a) = a`), so these pairs have not been computed.

In [4]:
del_i = []

for index, pair in enumerate(mf_synset):
    lcs = pair[0].lowest_common_hypernyms(pair[1])
    if lcs:
        print("LCS of pair of synsets: " 
              + str(pair[0]) + " vs. " 
              + str(pair[1]) + ": " + 
              str(lcs[0]) + "\n")
    else:
        print("Deleting the pair of synsets: "
              + str(pair[0]) + " vs. " 
              + str(pair[1]) + " as they are from different PoS.\n")
        del_i.append(index)
        
mf_synset = [pair for index, pair in enumerate(mf_synset) if index not in del_i]

print("Final pairs of synsets to compare:")
mf_synset

Deleting the pair of synsets: Synset('man.n.01') vs. Synset('swim.v.01') as they are from different PoS.

LCS of pair of synsets: Synset('man.n.01') vs. Synset('girl.n.01'): Synset('adult.n.01')

LCS of pair of synsets: Synset('man.n.01') vs. Synset('male_child.n.01'): Synset('male.n.02')

LCS of pair of synsets: Synset('man.n.01') vs. Synset('woman.n.01'): Synset('adult.n.01')

Deleting the pair of synsets: Synset('man.n.01') vs. Synset('walk.v.01') as they are from different PoS.

Deleting the pair of synsets: Synset('swim.v.01') vs. Synset('girl.n.01') as they are from different PoS.

Deleting the pair of synsets: Synset('swim.v.01') vs. Synset('male_child.n.01') as they are from different PoS.

Deleting the pair of synsets: Synset('swim.v.01') vs. Synset('woman.n.01') as they are from different PoS.

LCS of pair of synsets: Synset('swim.v.01') vs. Synset('walk.v.01'): Synset('travel.v.01')

LCS of pair of synsets: Synset('girl.n.01') vs. Synset('male_child.n.01'): Synset('person.n.

[(Synset('man.n.01'), Synset('girl.n.01')),
 (Synset('man.n.01'), Synset('male_child.n.01')),
 (Synset('man.n.01'), Synset('woman.n.01')),
 (Synset('swim.v.01'), Synset('walk.v.01')),
 (Synset('girl.n.01'), Synset('male_child.n.01')),
 (Synset('girl.n.01'), Synset('woman.n.01')),
 (Synset('male_child.n.01'), Synset('woman.n.01'))]
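
To illustrate the idea behind the LCS, here is a minimal sketch on a hand-made toy hierarchy. The hierarchy, the node names and the helpers `ancestors`, `depth` and `toy_lcs` are assumptions made for this example only; they are much simpler than WordNet's real structure and do not reproduce how `nltk` implements `lowest_common_hypernyms`.

```python
# Hand-made toy hierarchy (NOT WordNet's real structure): node -> list of hypernyms.
TOY_HYPERNYMS = {
    "entity": [],
    "person": ["entity"],
    "adult": ["person"],
    "male": ["person"],
    "man": ["adult", "male"],
    "woman": ["adult"],
    "girl": ["woman"],
    "boy": ["male"],
}


def ancestors(node):
    """Return the node itself plus every hypernym reachable from it."""
    seen = {node}
    for parent in TOY_HYPERNYMS[node]:
        seen |= ancestors(parent)
    return seen


def depth(node):
    """Length of the longest hypernym path from the node up to the root."""
    parents = TOY_HYPERNYMS[node]
    return 0 if not parents else 1 + max(depth(p) for p in parents)


def toy_lcs(a, b):
    """Deepest node that subsumes both a and b, or None if they share no hypernym."""
    common = ancestors(a) & ancestors(b)
    return max(common, key=depth) if common else None


for a, b in [("man", "girl"), ("man", "boy"), ("girl", "boy")]:
    print(a, b, toy_lcs(a, b))
```

In this toy hierarchy, nodes from disconnected trees (the analogue of a noun and a verb) would share no hypernym, and `toy_lcs` would return `None`, which mirrors the empty list used above to discard mixed-PoS pairs.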

## 4. Similarity Values
In this section, several similarity functions will be used to measure the level of equivalence between each pair of synsets.

Note that a synset compared with itself is expected to have the maximum similarity, so such pairs have not been included in the list of pairs. Moreover, synsets with different PoS cannot be compared using these measures, so their similarity is taken to be `0` (these pairs have not been included in the list either).

In [5]:
brown_ic = wordnet_ic.ic('ic-brown.dat')

lch = list()
for pair in mf_synset:
    lch_value = round(pair[0].lch_similarity(pair[1]), 3)
    lch.append(lch_value)
    print("Similarity values between " + str(pair[0]) + " and " + str(pair[1]) + ":") 
    print("- Path Similarity: " + str(round(pair[0].path_similarity(pair[1]), 3)))
    print("- Leacock-Chodorow Similarity: " + str(lch_value))
    print("- Wu-Palmer Similarity: " + str(round(pair[0].wup_similarity(pair[1]), 3)))
    print("- Lin Similarity: " + str(round(pair[0].lin_similarity(pair[1], brown_ic), 3)) + "\n")

Similarity values between Synset('man.n.01') and Synset('girl.n.01'):
- Path Similarity: 0.25
- Leacock-Chodorow Similarity: 2.251
- Wu-Palmer Similarity: 0.632
- Lin Similarity: 0.714

Similarity values between Synset('man.n.01') and Synset('male_child.n.01'):
- Path Similarity: 0.333
- Leacock-Chodorow Similarity: 2.539
- Wu-Palmer Similarity: 0.667
- Lin Similarity: 0.729

Similarity values between Synset('man.n.01') and Synset('woman.n.01'):
- Path Similarity: 0.333
- Leacock-Chodorow Similarity: 2.539
- Wu-Palmer Similarity: 0.667
- Lin Similarity: 0.787

Similarity values between Synset('swim.v.01') and Synset('walk.v.01'):
- Path Similarity: 0.333
- Leacock-Chodorow Similarity: 2.159
- Wu-Palmer Similarity: 0.333
- Lin Similarity: 0.491

Similarity values between Synset('girl.n.01') and Synset('male_child.n.01'):
- Path Similarity: 0.167
- Leacock-Chodorow Similarity: 1.846
- Wu-Palmer Similarity: 0.632
- Lin Similarity: 0.293

Similarity values between Synset('girl.n.01') and S

All the similarities lie within the interval [0, 1], except the *Leacock-Chodorow Similarity*, which has no fixed upper bound. To normalize it, each value is divided by the maximum *Leacock-Chodorow Similarity* attainable for that pair, i.e. the similarity of the first synset of the pair with itself.

In [6]:
for index, pair in enumerate(mf_synset):
    print("Similarity values between " + str(pair[0]) + " and " + str(pair[1]) + ":") 
    print("- Leacock-Chodorow Similarity: " 
          + str(round(lch[index]/pair[0].lch_similarity(pair[0]), 3)) + "\n")

Similarity values between Synset('man.n.01') and Synset('girl.n.01'):
- Leacock-Chodorow Similarity: 0.619

Similarity values between Synset('man.n.01') and Synset('male_child.n.01'):
- Leacock-Chodorow Similarity: 0.698

Similarity values between Synset('man.n.01') and Synset('woman.n.01'):
- Leacock-Chodorow Similarity: 0.698

Similarity values between Synset('swim.v.01') and Synset('walk.v.01'):
- Leacock-Chodorow Similarity: 0.663

Similarity values between Synset('girl.n.01') and Synset('male_child.n.01'):
- Leacock-Chodorow Similarity: 0.507

Similarity values between Synset('girl.n.01') and Synset('woman.n.01'):
- Leacock-Chodorow Similarity: 0.809

Similarity values between Synset('male_child.n.01') and Synset('woman.n.01'):
- Leacock-Chodorow Similarity: 0.558
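
The normalized values above can be reproduced by hand from the shortest path distance `d` between two synsets, assuming the usual definitions `path = 1 / (d + 1)` and `lch = -log((d + 1) / (2 * D))`, where `D` is the maximum depth of the taxonomy for that part of speech. The depths used below (19 for nouns, 13 for verbs) are not stated in this notebook: they are assumptions inferred from the printed values, and the function names are illustrative.

```python
import math

# Assumed maximum taxonomy depths, inferred from the values printed above
# (they are not stated anywhere in the notebook).
ASSUMED_MAX_DEPTH = {"n": 19, "v": 13}


def path_sim(distance):
    """Path similarity from a shortest path distance."""
    return 1.0 / (distance + 1)


def lch_sim(distance, pos):
    """Leacock-Chodorow similarity from a shortest path distance."""
    return -math.log((distance + 1) / (2.0 * ASSUMED_MAX_DEPTH[pos]))


def lch_normalized(distance, pos):
    """LCH divided by its maximum, i.e. the value at distance 0 (a synset vs. itself)."""
    return lch_sim(distance, pos) / lch_sim(0, pos)


# A path similarity of 0.333 corresponds to a shortest path distance of 2.
for pos in ("n", "v"):
    print(pos, round(path_sim(2), 3), round(lch_sim(2, pos), 3), round(lch_normalized(2, pos), 3))
```

Because the maximum depends on the part of speech, the same distance yields different normalized values for nouns and verbs, which is why `Swim vs. Walk` and `Man vs. Woman` share a path similarity but not a normalized Leacock-Chodorow similarity.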



**Which similarity seems better?**

Which similarity performs better depends on how similar each pair of synsets is believed to be.
Consider the following expected ranking, from most to least similar (the index of each pair in the list above is given in brackets):

1, 2) Man vs Male child [2] / Girl vs Woman [6] (Same gender)

3, 4) Girl vs Male child [5] / Man vs Woman [3] (Same age)

5, 6) Man vs girl [1] / Male child vs Woman [7] (Human being)

7) Swim vs Walk [4] (Movement)

Taking into account that the ordering produced by each measure\* (from the values shown in the previous section) is:
- Path Similarity: [6, 2/3/4, 1, 7, 5]
    - It rates `Swim vs. Walk` as similar as `Man vs. Woman` or `Man vs. Male child`.
    - Narrow range of values (only from 0.167 to 0.5).
- Leacock-Chodorow Similarity: [6, 2/3, 4, 1, 7, 5]
    - It gives `Man vs. Male child` the same similarity as `Man vs. Woman`.
    - Narrow range of values after normalization (only from 0.507 to 0.809).
- Wu-Palmer Similarity: [2/3/7, 1/5/6, 4]
    - Narrow range of values (from 0.333 to 0.667).
        - Only 3 distinct similarity values are obtained.
        - The first 2 groups have nearly the same similarity (0.667 vs. 0.632).
- Lin Similarity: [6, 3, 2, 1, 4, 7, 5]
    - Except for the `Male child vs. Woman` and `Girl vs. Male child` cases, the similarities behave sensibly.
    - Wide range of values, with clear gaps between them.
    
\* *The magnitude of the similarity values produced by each measure has been taken into account as well.*
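
An ordering such as the ones listed above can be derived mechanically from the printed values. The sketch below does so for the normalized Leacock-Chodorow values, the only measure whose output is shown in full for all seven pairs; the helper `rank_with_ties` is an illustrative name, and pairs with equal values are grouped together.

```python
from itertools import groupby

# Normalized Leacock-Chodorow values copied from the output of section 4,
# keyed by the index of each pair in the final list of pairs.
lch_normalized_values = {1: 0.619, 2: 0.698, 3: 0.698, 4: 0.663, 5: 0.507, 6: 0.809, 7: 0.558}


def rank_with_ties(scores):
    """Return pair indices from most to least similar, grouping equal scores."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [
        [index for index, _ in group]
        for _, group in groupby(ordered, key=lambda item: item[1])
    ]


print(rank_with_ties(lch_normalized_values))
```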
    
**Lin Similarity** seems to perform best. The likely reason is how it is computed: rather than relying only on the Shortest Path Length (SPL), as *Path Similarity* and *Leacock-Chodorow Similarity* do, it combines the LCS with the Information Content (IC) of the synsets, which is estimated from corpus frequencies (here, the Brown file loaded through `wordnet_ic`). This explains why its values are more spread out than those of the other measures.

The formula that this similarity computes is the following:

$ Sim(s_1, s_2) = \frac{2 \cdot IC(LCS(s_1, s_2))}{IC(s_1) + IC(s_2)} $
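
The formula can be tried out on a small example, assuming the common definition `IC(c) = -log(p(c))`, where `p(c)` is the relative frequency of concept `c` in a corpus. The counts below are hypothetical numbers chosen for illustration, not values from the Brown IC file, and the function names are assumptions of this sketch.

```python
import math

# Hypothetical concept frequencies (illustrative only, NOT taken from ic-brown.dat).
TOY_COUNTS = {"person": 1000, "adult": 200, "man": 60, "woman": 50}
TOTAL_COUNT = 10000


def information_content(concept):
    """IC(c) = -log(p(c)): rarer concepts carry more information."""
    return -math.log(TOY_COUNTS[concept] / TOTAL_COUNT)


def lin_similarity(s1, s2, lcs):
    """Lin similarity of s1 and s2, given their least common subsumer."""
    return 2 * information_content(lcs) / (information_content(s1) + information_content(s2))


# A more specific (rarer) LCS yields a higher similarity than a generic one.
print(round(lin_similarity("man", "woman", "adult"), 3))
print(round(lin_similarity("man", "woman", "person"), 3))
```

Since the result depends on how informative the LCS is and not only on how many edges separate the two synsets, two pairs at the same path distance can still receive different Lin similarities.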