# Solution

# IHLT Lab Exercise 5
## This file contains code to complete the exercise for the fifth lab session of IHLT
Authors:


*   Kacper Poniatowski (kacper.krzysztof.poniatowski@estudiantat.upc.edu)
*   Pau Blanco (pablo.blanco@estudiantat.upc.edu)

### Notes


# Requirements

In [None]:
# Imports & downloads required for this lab

import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn
nltk.download('omw-1.4')

# Lemma category pairs provided for lab

lemma_pairs = [
    ('the', 'DT'),
    ('man', 'NN'),
    ('swim', 'VB'),
    ('with', 'IN'),
    ('a', 'DT'),
    ('girl', 'NN'),
    ('and', 'CC'),
    ('a', 'DT'),
    ('boy', 'NN'),
    ('whilst', 'IN'),
    ('the', 'DT'),
    ('woman', 'NN'),
    ('walk', 'VB')
]

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


# Exercise

## Find the most frequent synset of each word

WordNet only covers four categories of words: nouns, verbs, adjectives and adverbs. We therefore expect to find no synsets for words outside these categories (determiners, prepositions and conjunctions in our sentence).
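The tag-to-category mapping used in this exercise only matches the exact tags `NN`, `VB`, `JJ` and `RB`. As a minimal sketch, assuming the input uses Penn Treebank tags, inflected variants such as `NNS` or `VBD` could be covered by matching on the tag prefix. The helper name `penn_to_wordnet` is hypothetical, and the single-letter codes are assumed to equal NLTK's `wn.NOUN`, `wn.VERB`, `wn.ADJ` and `wn.ADV` constants.

```python
# Hypothetical helper: map a Penn Treebank tag to a WordNet POS code by prefix.
# Assumption: 'n', 'v', 'a', 'r' equal wn.NOUN, wn.VERB, wn.ADJ, wn.ADV in NLTK.
PREFIX_TO_WORDNET = {'NN': 'n', 'VB': 'v', 'JJ': 'a', 'RB': 'r'}

def penn_to_wordnet(tag):
    """Return the WordNet POS code for a tag, or None if WordNet has no such category."""
    return PREFIX_TO_WORDNET.get(tag[:2])

# Tags outside the four WordNet categories (DT, IN, CC) map to None
for tag in ['NN', 'NNS', 'VBD', 'JJR', 'RB', 'DT', 'IN', 'CC']:
    print(f"{tag:<4} -> {penn_to_wordnet(tag)}")
```

For the lemma/category pairs of this lab the result is the same as with the exact-match dictionary, since only base tags appear.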

In [None]:
def get_most_frequent_synset(lemma, category):
    # If the category is not a noun, verb, adjective, or adverb, WordNet is not applicable
    if category not in pos_map:
        return None

    pos = pos_map[category]
    synsets = wn.synsets(lemma, pos=pos)

    if synsets:
        return synsets[0] # First in list -> most frequent

    return None

# Map wordnet tags to category tags
pos_map = {
    'RB': wn.ADV,
    'VB': wn.VERB,
    'NN': wn.NOUN,
    'JJ': wn.ADJ
}

comparable = []

for lemma, category in lemma_pairs:
    synset = get_most_frequent_synset(lemma, category)
    if synset:
        print(f"{'*' * 10} Lemma: {lemma} | Category: {category} {'*' * 10}")
        print(f"{'Most Frequent Synset':<25}: {synset.name()}")
        print(f"{'Definition':<25}: {synset.definition()}")
        print(f"{'*' * 42}\n")
        comparable.append({'lemma': lemma, 'category': category, 'synset': synset})
    else:
        print(f"{'*' * 10} Lemma: {lemma} | Category: {category} {'*' * 10}")
        print(f"{'No Synset Found':<25}")
        print(f"{'*' * 42}\n")


********** Lemma: the | Category: DT **********
No Synset Found          
******************************************

********** Lemma: man | Category: NN **********
Most Frequent Synset     : man.n.01
Definition               : an adult person who is male (as opposed to a woman)
******************************************

********** Lemma: swim | Category: VB **********
Most Frequent Synset     : swim.v.01
Definition               : travel through water
******************************************

********** Lemma: with | Category: IN **********
No Synset Found          
******************************************

********** Lemma: a | Category: DT **********
No Synset Found          
******************************************

********** Lemma: girl | Category: NN **********
Most Frequent Synset     : girl.n.01
Definition               : a young woman
******************************************

********** Lemma: and | Category: CC **********
No Synset Found          
*****************

## Find similarities

For each pair of words pertaining to the same category, compute the similarity using the 4 specified metrics.

We will save the values in a DataFrame for further analysis.

**Note**: we also compare each word with itself. This gives us the maximum similarity in that category, which is not known a priori for Leacock-Chodorow (it is 1 for the rest of the metrics).
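To see why the self-comparison yields the maximum, it helps to write both path-based metrics in terms of the shortest path distance `d` between two synsets. The sketch below assumes the usual definitions, path = 1 / (d + 1) and Leacock-Chodorow = -log((d + 1) / (2 D)), where D is the maximum depth of the taxonomy for that category. The value D = 19 for nouns is an assumption, chosen because it reproduces the self-similarity of 3.6376 printed for 'man' in the output of the next cell; the function names are ours.

```python
import math

NOUN_MAX_DEPTH = 19  # assumption: inferred from the printed output, not read from WordNet

def path_similarity(distance):
    """Path similarity from the shortest path distance (in edges)."""
    return 1.0 / (distance + 1)

def lch_similarity(distance, max_depth=NOUN_MAX_DEPTH):
    """Leacock-Chodorow similarity from the shortest path distance (in edges)."""
    return -math.log((distance + 1) / (2.0 * max_depth))

# Both metrics decrease as the distance grows; d = 0 is a synset compared with itself
for d in range(4):
    print(f"d={d}  path={path_similarity(d):.4f}  lch={lch_similarity(d):.4f}")
```

With d = 0 Leacock-Chodorow reaches its maximum, log(2 D), which depends on the category. With d = 3 the two functions give 0.2500 and 2.2513, the values printed for man/girl.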

In [None]:
# Additional downloads
nltk.download('wordnet_ic')
from nltk.corpus import wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat')
import pandas as pd

def compute_pair(cmp1, cmp2, ics, df):
  s1, s2 = cmp1['synset'], cmp2['synset']
  try:
    # Calculating similarity values using the different functions
    lcs = s1.lowest_common_hypernyms(s2)
    path_s = s1.path_similarity(s2)
    leacock_s = s1.lch_similarity(s2)
    wu_s = s1.wup_similarity(s2)
    lin_s = s1.lin_similarity(s2, ics)

  except Exception as e:
    # Print exception if similarity value calculation failed
    print(f"Error calculating similarity: {e}")
    lcs, path_s, leacock_s, wu_s, lin_s = None, None, None, None, None

  def fmt(value):
    # A metric can be None (failed or undefined), which ':.4f' cannot format
    return f"{value:>10.4f}" if value is not None else f"{'N/A':>10}"

  # Format information in a neat, presentable manner
  print(f"{'*' * 8} {cmp1['lemma']} <==> {cmp2['lemma']} {'*' * 8}")
  print(f"{'Metric':<30} {'Value':>10}")
  print(f"{'-' * 42}")
  print(f"{'Least Common Subsumer':<30} {str(lcs):>10}")
  print(f"{'Path Similarity':<30} {fmt(path_s)}")
  print(f"{'Leacock-Chodorow Similarity':<30} {fmt(leacock_s)}")
  print(f"{'Wu-Palmer Similarity':<30} {fmt(wu_s)}")
  print(f"{'Lin Similarity':<30} {fmt(lin_s)}")
  print(f"{'*' * 42}\n")

  df.loc[len(df.index)] = [cmp1['lemma'], cmp2['lemma'], cmp1['category'], path_s, leacock_s, wu_s, lin_s]


df = pd.DataFrame(columns=["word1", "word2", "category", "path", "lch", "wup", "lin"])

# Iterate over pairs and compute similarities
nc = len(comparable)
for f in range(nc):
  for s in range(f, nc):
    if (comparable[f]['category'] == comparable[s]['category']):
      compute_pair(comparable[f], comparable[s], brown_ic, df)

print(df);

[nltk_data] Downloading package wordnet_ic to /root/nltk_data...
[nltk_data]   Package wordnet_ic is already up-to-date!


******** man <==> man ********
Metric                              Value
------------------------------------------
Least Common Subsumer          [Synset('man.n.01')]
Path Similarity                    1.0000
Leacock-Chodorow Similarity        3.6376
Wu-Palmer Similarity               1.0000
Lin Similarity                     1.0000
******************************************

******** man <==> girl ********
Metric                              Value
------------------------------------------
Least Common Subsumer          [Synset('adult.n.01')]
Path Similarity                    0.2500
Leacock-Chodorow Similarity        2.2513
Wu-Palmer Similarity               0.6316
Lin Similarity                     0.7135
******************************************

******** man <==> boy ********
Metric                              Value
------------------------------------------
Least Common Subsumer          [Synset('male.n.02')]
Path Similarity                    0.3333
Leacock-Chodorow Similarit

## Normalize Leacock-Chodorow values
Unlike the other metrics, the similarity value that Leacock-Chodorow outputs is not in the range [0, 1]. This is a problem if we want to compare this metric with the rest, which are normalized to this range.

The output above gives us the maximum similarity that Leacock-Chodorow returns for nouns and verbs: it is the value obtained for 'man <==> man' and 'swim <==> swim' respectively. We will normalize the values to the [0, 1] range using these maximum values.


In [None]:
max_NN = df[(df['word1'] == 'man') & (df['word2'] == 'man')]['lch'].values[0]
max_VB = df[(df['word1'] == 'swim') & (df['word2'] == 'swim')]['lch'].values[0]

nn_rows = df['category'] == 'NN'
df.loc[nn_rows, 'lch'] = df.loc[nn_rows, 'lch'] / max_NN

vb_rows = df['category'] == 'VB'
df.loc[vb_rows, 'lch'] = df.loc[vb_rows, 'lch'] / max_VB

print(df)


    word1  word2 category      path       lch       wup       lin
0     man    man       NN  1.000000  1.000000  1.000000  1.000000
1     man   girl       NN  0.250000  0.618897  0.631579  0.713511
2     man    boy       NN  0.333333  0.697983  0.666667  0.729472
3     man  woman       NN  0.333333  0.697983  0.666667  0.787084
4    swim   swim       VB  1.000000  1.000000  1.000000  1.000000
5    swim   walk       VB  0.333333  0.662805  0.333333  0.491005
6    girl   girl       NN  1.000000  1.000000  1.000000  1.000000
7    girl    boy       NN  0.166667  0.507432  0.631579  0.292728
8    girl  woman       NN  0.500000  0.809449  0.631579  0.906780
9     boy    boy       NN  1.000000  1.000000  1.000000  1.000000
10    boy  woman       NN  0.200000  0.557553  0.666667  0.318423
11  woman  woman       NN  1.000000  1.000000  1.000000  1.000000
12   walk   walk       VB  1.000000  1.000000  1.000000  1.000000
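The normalization above hard-codes the reference pairs ('man'/'man' and 'swim'/'swim'). A minimal sketch of a more general variant, assuming every category contains at least one self-comparison, takes the maximum from the self-pairs of each category. It uses plain dictionaries instead of the DataFrame so that it runs on its own, the two raw values are the rounded noun results printed earlier, and `normalize_lch` is a hypothetical helper.

```python
def normalize_lch(rows):
    """Divide each 'lch' value by the self-similarity of its category."""
    max_by_category = {}
    for row in rows:
        # A word compared with itself gives the maximum for its category
        if row['word1'] == row['word2']:
            max_by_category[row['category']] = row['lch']
    return [dict(row, lch=row['lch'] / max_by_category[row['category']]) for row in rows]

# Raw Leacock-Chodorow values copied (rounded) from the printed output
rows = [
    {'word1': 'man', 'word2': 'man', 'category': 'NN', 'lch': 3.6376},
    {'word1': 'man', 'word2': 'girl', 'category': 'NN', 'lch': 2.2513},
]

for row in normalize_lch(rows):
    print(f"{row['word1']} <==> {row['word2']}: {row['lch']:.4f}")
```

The man/girl value comes out at roughly 0.619, in line with the table.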


## Comparing metrics

We need a methodology to compare the different metrics. Drawing conclusions directly from the values in the table is difficult.

We can see that, in general, Path Similarity gives lower values than the other three metrics. In the case of Wu-Palmer, the similarity for swim/walk is significantly lower than the similarities between nouns.

But is this kind of analysis really meaningful? Which value best describes the similarity between 'man' and 'woman': 0.33, 0.70, 0.67 or 0.79? Analyzing each pair in isolation doesn't give us much insight.

So, our analysis will instead look for coherence within each metric, by comparing pairs of words that intuitively should have the same (or very close) similarity.

We will compare three sets of two pairs:
* 1: man/woman and boy/girl (comparing gender)
* 2: man/boy and woman/girl (comparing age)
* 3: man/girl and woman/boy (comparing gender and age)

Intuitively, each pair in the same group should have the same similarity, and similarities in group 3 must be significantly lower than those in groups 1 and 2.

In [None]:
def v(sim, df, w1, w2):
  return df[(df['word1'] == w1) & (df['word2'] == w2)][sim].values[0]

def p(sim, df, w1, w2):
  ret = v(sim, df, w1, w2)
  print(f"{w1} <==> {w2}: {ret}")
  return ret

def compare_sims(sim, df, title):
  print("###### " + title + " similarity ######")
  v1 = p(sim, df, 'man', 'woman')
  v2 = p(sim, df, 'girl', 'boy')
  acum = abs(v1 - v2)
  print(f'Difference: {abs(v1 - v2)}')
  print()
  v1 = p(sim, df, 'man', 'boy')
  v2 = p(sim, df, 'girl', 'woman')
  acum += abs(v1 - v2)
  print(f'Difference: {abs(v1 - v2)}')
  print()
  v1 = p(sim, df, 'man', 'girl')
  v2 = p(sim, df, 'boy', 'woman')
  acum += abs(v1 - v2)
  print(f'Difference: {abs(v1 - v2)}')
  print()
  print(f'{title} mean: {acum / 3}')
  print()
  print()

compare_sims('path', df, 'Path')
compare_sims('lch', df, 'Leacock-Chodorow')
compare_sims('wup', df, 'Wu-Palmer')
compare_sims('lin', df, 'Lin')

###### Path similarity ######
man <==> woman: 0.3333333333333333
girl <==> boy: 0.16666666666666666
Difference: 0.16666666666666666

man <==> boy: 0.3333333333333333
girl <==> woman: 0.5
Difference: 0.16666666666666669

man <==> girl: 0.25
boy <==> woman: 0.2
Difference: 0.04999999999999999

Path mean: 0.1277777777777778


###### Leacock-Chodorow similarity ######
man <==> woman: 0.6979831568441128
girl <==> boy: 0.5074317444173395
Difference: 0.19055141242677331

man <==> boy: 0.6979831568441128
girl <==> woman: 0.8094485875732267
Difference: 0.11146543072911386

man <==> girl: 0.6188971751464533
boy <==> woman: 0.5575533219658061
Difference: 0.06134385318064717

Leacock-Chodorow mean: 0.12112023211217811


###### Wu-Palmer similarity ######
man <==> woman: 0.6666666666666666
girl <==> boy: 0.631578947368421
Difference: 0.03508771929824561

man <==> boy: 0.6666666666666666
girl <==> woman: 0.631578947368421
Difference: 0.03508771929824561

man <==> girl: 0.631578947368421
boy <==> wom

## First conclusions

From the above output we can see that the mean differences for each metric are:

* Path: 0.128
* Leacock-Chodorow: 0.121
* Wu-Palmer: 0.035
* Lin: 0.356

We can say that Wu-Palmer provides the most coherent values, followed by Path and Leacock-Chodorow and finally Lin (by a considerable margin).
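The means listed above can be reproduced directly from the normalized table. The sketch below copies the six relevant similarities per metric from that table (rounded to six decimals) and computes the mean absolute difference over the three groups; the names `PAIRS`, `GROUP_SIMS` and `mean_group_difference` are ours.

```python
# Pair order: two pairs per group, values listed in the same order below
PAIRS = [('man', 'woman'), ('girl', 'boy'),   # group 1: gender
         ('man', 'boy'), ('girl', 'woman'),   # group 2: age
         ('man', 'girl'), ('boy', 'woman')]   # group 3: gender and age

# Similarities copied from the normalized table, one list per metric
GROUP_SIMS = {
    'path': [0.333333, 0.166667, 0.333333, 0.500000, 0.250000, 0.200000],
    'lch':  [0.697983, 0.507432, 0.697983, 0.809449, 0.618897, 0.557553],
    'wup':  [0.666667, 0.631579, 0.666667, 0.631579, 0.631579, 0.666667],
    'lin':  [0.787084, 0.292728, 0.729472, 0.906780, 0.713511, 0.318423],
}

def mean_group_difference(values):
    """Mean absolute difference between the two pairs of each group."""
    diffs = [abs(values[i] - values[i + 1]) for i in range(0, len(values), 2)]
    return sum(diffs) / len(diffs)

for metric, values in GROUP_SIMS.items():
    print(f"{metric:<5} {mean_group_difference(values):.3f}")
```

A lower mean indicates that the metric treats the two pairs of each group more consistently.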

Looking at the values, similarities in group 3 are in general lower than those in groups 1 and 2. Surprisingly, though, for every metric the similarity between boy and woman is higher than the similarity between girl and boy. How is this possible?
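This observation can be checked against the normalized table. The short sketch below copies the girl/boy and boy/woman values for each metric (rounded to six decimals) and compares them; the dictionary name is ours.

```python
# (girl/boy, boy/woman) similarities per metric, copied from the normalized table
GIRL_BOY_VS_BOY_WOMAN = {
    'path': (0.166667, 0.200000),
    'lch':  (0.507432, 0.557553),
    'wup':  (0.631579, 0.666667),
    'lin':  (0.292728, 0.318423),
}

for metric, (girl_boy, boy_woman) in GIRL_BOY_VS_BOY_WOMAN.items():
    print(f"{metric:<5} girl/boy={girl_boy:.3f}  boy/woman={boy_woman:.3f}  "
          f"boy/woman higher: {boy_woman > girl_boy}")
```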

This leads to the next question: are we using the right synsets to compare the words? We are using the most common synset of each word, as specified in the exercise instructions, and this is also what an automatic tool would most likely do. However, is this the best set of synsets for our specific case?

## Analyzing Synsets
Let's take a closer look at the synsets available for each of our 4 lemmas:

In [None]:
lemma_pairs = [
    ('man', 'NN'),
    ('woman', 'NN'),
    ('boy', 'NN'),
    ('girl', 'NN'),
]

def get_most_frequent_synset(lemma, category):
    # If the category is not a noun, verb, adjective, or adverb, WordNet is not applicable
    if category not in pos_map:
        return None

    pos = pos_map[category]
    synsets = wn.synsets(lemma, pos=pos)

    if synsets:
        for s in synsets:
          print(f"{s.name()} / {s.definition()}")
        return synsets[0] # First in list -> most frequent

    return None

# Map wordnet tags to category tags
pos_map = {
    'RB': wn.ADV,
    'VB': wn.VERB,
    'NN': wn.NOUN,
    'JJ': wn.ADJ
}

for lemma, category in lemma_pairs:
    print(f"******** {lemma}")
    synset = get_most_frequent_synset(lemma, category)

******** man
man.n.01 / an adult person who is male (as opposed to a woman)
serviceman.n.01 / someone who serves in the armed forces; a member of a military force
man.n.03 / the generic use of the word to refer to any human being
homo.n.02 / any living or extinct member of the family Hominidae characterized by superior intelligence, articulate speech, and erect carriage
man.n.05 / a male subordinate
man.n.06 / an adult male person who has a manly character (virile and courageous competent)
valet.n.01 / a manservant who acts as a personal attendant to his employer
man.n.08 / a male person who plays a significant role (husband or lover or boyfriend) in the life of a particular woman
man.n.09 / one of the British Isles in the Irish Sea
man.n.10 / game equipment consisting of an object used in playing certain board games
world.n.08 / all of the living human inhabitants of the earth
******** woman
woman.n.01 / an adult female person (as opposed to a man)
woman.n.02 / a female person who pla

So far we have used:

* man: man.n.01 / an adult person who is male (as opposed to a woman)
* woman: woman.n.01 / an adult female person (as opposed to a man)
* boy: male_child.n.01 / a youthful male person
* girl: girl.n.01 / a young woman

In the case of man/woman we are using synsets with parallel definitions, but this is not the case with boy and girl: 'boy' is defined as a youthful male person, whereas 'girl' is defined as a young woman.

Let's repeat our study using a synset comparable to male_child.n.01: female_child.n.01.

In [None]:
# Prepare the good synsets
comparable = []
comparable.append({'lemma': 'man', 'category': 'NN', 'synset': wn.synset('man.n.01')})
comparable.append({'lemma': 'girl', 'category': 'NN', 'synset': wn.synset('female_child.n.01')})
comparable.append({'lemma': 'boy', 'category': 'NN', 'synset': wn.synset('male_child.n.01')})
comparable.append({'lemma': 'woman', 'category': 'NN', 'synset': wn.synset('woman.n.01')})

# Create the new dataframe
df = pd.DataFrame(columns=["word1", "word2", "category", "path", "lch", "wup", "lin"])

# Iterate over pairs and compute similarities
nc = len(comparable)
for f in range(nc):
  for s in range(f, nc):
    if (comparable[f]['category'] == comparable[s]['category']):
      compute_pair(comparable[f], comparable[s], brown_ic, df)

# Normalize lch
max_NN = df[(df['word1'] == 'man') & (df['word2'] == 'man')]['lch'].values[0]
nn_rows = df['category'] == 'NN'
df.loc[nn_rows, 'lch'] = df.loc[nn_rows, 'lch'] / max_NN

# Show results
print(df);
print()
print()

compare_sims('path', df, 'Path')
compare_sims('lch', df, 'Leacock-Chodorow')
compare_sims('wup', df, 'Wu-Palmer')
compare_sims('lin', df, 'Lin')

******** man <==> man ********
Metric                              Value
------------------------------------------
Least Common Subsumer          [Synset('man.n.01')]
Path Similarity                    1.0000
Leacock-Chodorow Similarity        3.6376
Wu-Palmer Similarity               1.0000
Lin Similarity                     1.0000
******************************************

******** man <==> girl ********
Metric                              Value
------------------------------------------
Least Common Subsumer          [Synset('person.n.01')]
Path Similarity                    0.2000
Leacock-Chodorow Similarity        2.0281
Wu-Palmer Similarity               0.6667
Lin Similarity                     0.3167
******************************************

******** man <==> boy ********
Metric                              Value
------------------------------------------
Least Common Subsumer          [Synset('male.n.02')]
Path Similarity                    0.3333
Leacock-Chodorow Similari

## Conclusions
These are the means achieved this time:

* Path: 0.045
* Leacock-Chodorow: 0.047
* Wu-Palmer: 0.000
* Lin: 0.196

We can see improvements across all metrics, and the overall ranking is essentially unchanged: Wu-Palmer still performs best, followed by Path and Leacock-Chodorow in a second group, and Lin performs worst.

Besides this, for all the metrics except Lin, the similarities in the third group are lower than or equal to those in groups 1 and 2, as expected. For Lin, boy/woman is more similar than boy/girl, which we find incoherent.

It is also striking that, in the case of Wu-Palmer, the two pairs in each group have exactly the same similarity. We expected the third group to have a lower similarity than the other two, but the results turned out to be the same.

With all these results we can conclude that, for our four words, the best metric is Wu-Palmer.

It's important to note that these results cannot be extrapolated to the language as a whole. We have used only 4 words of one category, which is not statistically significant.

Finally, we have learnt how important it is to use the right synsets when comparing words. When developing an automatic tool, the way synsets are chosen matters: picking the most common one is not enough.

Hopefully we will learn more complex skills in the course to select the synsets based on the context or other more suitable criteria.