# Lab session 5 (Lexical Semantics) - IHLT

**Students:** Lauren Tucker & Mario Rosas

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1_LPYWCgYj3PtId08KN_aBmRxuiXMU4Pv?usp=sharing)

### Importing libraries

In [1]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mario\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\mario\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [7]:
from nltk.corpus import wordnet as wn
from collections import Counter

### Manual filtering of available PoS tags for synsets

In [3]:
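# Map Penn Treebank tags to WordNet PoS codes.
# A dict cannot hold the key 'JJ' twice (the later value silently replaces the earlier one),
# so adjectives are mapped to "a" only.
match = {'JJ': "a", 'RB': "r", 'NN': "n", 'VB': "v"}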
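
The dictionary above covers only the four base tags. Penn Treebank tags also come in variants (`NNS`, `VBD`, `JJR`, ...), so a prefix lookup is one way to generalize it. The helper below is a minimal sketch: `penn_to_wordnet` and `PENN_PREFIX_TO_WN` are hypothetical names, not part of NLTK, and the code assumes the tags follow the Penn Treebank convention.

```python
# Hypothetical helper (not part of NLTK): map a Penn Treebank tag to a WordNet PoS code.
PENN_PREFIX_TO_WN = {"JJ": "a", "RB": "r", "NN": "n", "VB": "v"}

def penn_to_wordnet(tag):
    """Return the WordNet PoS code for a Penn Treebank tag, or None if unsupported."""
    # The first two characters identify the word class (NN, NNS, NNP all start with NN).
    return PENN_PREFIX_TO_WN.get(tag[:2])

print(penn_to_wordnet("NNS"))  # plural noun
print(penn_to_wordnet("VBD"))  # past-tense verb
print(penn_to_wordnet("DT"))   # determiner: no WordNet PoS, so None
```

Returning `None` for unsupported tags lets the caller skip words that WordNet does not cover instead of raising a `KeyError`.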

In [4]:
data = [('man', 'NN'), ('swim','VB'), ('girl','NN'), ('boy', 'NN'),  ('woman', 'NN'), ('walk', 'VB')] 

### Getting the most common lemma

In [8]:
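# For each (word, tag) pair, count the lemmas over all matching synsets
# and keep the single most frequent one.
counts = [
  Counter(lemma for synset in wn.synsets(word, match[tag]) for lemma in synset.lemmas()).most_common(1)
  for word, tag in data
]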
counts

[[(Lemma('man.n.01.man'), 10)],
 [(Lemma('swim.v.01.swim'), 5)],
 [(Lemma('girl.n.01.girl'), 5)],
 [(Lemma('male_child.n.01.boy'), 4)],
 [(Lemma('woman.n.01.woman'), 4)],
 [(Lemma('walk.v.01.walk'), 10)]]

### Definition of the most common lemma

In [9]:
res = {}
for word, tag in data:
  # Keep the first synset WordNet lists for the word with the matching PoS.
  res[word] = wn.synsets(word, match[tag])[0]
  print(res[word], res[word].definition())

Synset('man.n.01') an adult person who is male (as opposed to a woman)
Synset('swim.v.01') travel through water
Synset('girl.n.01') a young woman
Synset('male_child.n.01') a youthful male person
Synset('woman.n.01') an adult female person (as opposed to a man)
Synset('walk.v.01') use one's feet to advance; advance by steps


### Preparing data to calculate similarity metrics

In [10]:
import numpy as np
import pandas as pd

In [11]:
pairs = np.array(np.meshgrid(list(res.keys()), list(res.keys()))).T.reshape(-1,2)
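
The `meshgrid` expression above builds every ordered pair of words, with the first word of the pair varying slowest. As a sketch of what it produces, the same ordering can be obtained with `itertools.product` from the standard library. The `words` list below simply repeats the keys of `res`, and `pairs_list` is an illustrative name.

```python
from itertools import product

# Same word order as the keys of `res` in the notebook.
words = ["man", "swim", "girl", "boy", "woman", "walk"]

# Cartesian product of the list with itself: all ordered pairs,
# with the first element varying slowest.
pairs_list = list(product(words, repeat=2))

print(len(pairs_list))
print(pairs_list[:3])
```

Because the first element varies slowest, reshaping the per-pair results to 6 x 6 puts the first word on the rows and the second word on the columns.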

### Getting the least common subsumer (LCS)

In [12]:
lcs = []
for i, pair in enumerate(pairs):
  lcs.append(res[pair[0]].lowest_common_hypernyms(res[pair[1]]))

lcs = np.array(lcs, dtype=object).reshape((6,6))

df_lcs = pd.DataFrame(lcs)
df_lcs.columns = list(res.keys())
df_lcs.index = list(res.keys())

df_lcs

| | man | swim | girl | boy | woman | walk |
|---|---|---|---|---|---|---|
| man | `[Synset('man.n.01')]` | `[]` | `[Synset('adult.n.01')]` | `[Synset('male.n.02')]` | `[Synset('adult.n.01')]` | `[]` |
| swim | `[]` | `[Synset('swim.v.01')]` | `[]` | `[]` | `[]` | `[Synset('travel.v.01')]` |
| girl | `[Synset('adult.n.01')]` | `[]` | `[Synset('girl.n.01')]` | `[Synset('person.n.01')]` | `[Synset('woman.n.01')]` | `[]` |
| boy | `[Synset('male.n.02')]` | `[]` | `[Synset('person.n.01')]` | `[Synset('male_child.n.01')]` | `[Synset('person.n.01')]` | `[]` |
| woman | `[Synset('adult.n.01')]` | `[]` | `[Synset('woman.n.01')]` | `[Synset('person.n.01')]` | `[Synset('woman.n.01')]` | `[]` |
| walk | `[]` | `[Synset('travel.v.01')]` | `[]` | `[]` | `[]` | `[Synset('walk.v.01')]` |


### Getting the Path Similarity

In [13]:
path_sim = []
for i, pair in enumerate(pairs):
  path_sim.append(res[pair[0]].path_similarity(res[pair[1]]))

In [14]:
path_sim = np.array(path_sim, dtype=object).reshape((6,6))

df_path_sim = pd.DataFrame(path_sim)
df_path_sim.columns = list(res.keys())
df_path_sim.index = list(res.keys())

df_path_sim

| | man | swim | girl | boy | woman | walk |
|---|---|---|---|---|---|---|
| man | 1.0 | 0.1 | 0.25 | 0.333333 | 0.333333 | 0.1 |
| swim | 0.1 | 1.0 | 0.090909 | 0.1 | 0.1 | 0.333333 |
| girl | 0.25 | 0.090909 | 1.0 | 0.166667 | 0.5 | 0.090909 |
| boy | 0.333333 | 0.1 | 0.166667 | 1.0 | 0.2 | 0.1 |
| woman | 0.333333 | 0.1 | 0.5 | 0.2 | 1.0 | 0.1 |
| walk | 0.1 | 0.333333 | 0.090909 | 0.1 | 0.1 | 1.0 |
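
Path similarity is 1 / (1 + d), where d is the number of edges on the shortest path between the two synsets in the hypernym hierarchy. The sketch below inverts a few values from the table to recover d. The function names are illustrative only.

```python
def path_similarity_from_distance(distance):
    """Path similarity for a shortest path of `distance` edges."""
    return 1.0 / (1 + distance)

def distance_from_path_similarity(score):
    """Invert the formula to recover the number of edges (rounded to an integer)."""
    return round(1.0 / score - 1)

# Scores copied from the table above.
print(distance_from_path_similarity(0.5))       # girl / woman
print(distance_from_path_similarity(0.333333))  # man / boy
print(distance_from_path_similarity(0.25))      # man / girl
```

Since the score falls off as 1 / (1 + d), a few extra edges are enough to push it toward 0.1, which is why pairs that feel closely related can still score low.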


### Getting the Leacock-Chodorow Similarity

In [15]:
from nltk.corpus.reader.wordnet import WordNetError

lch = []
for pair in pairs:
  try:
    lch.append(res[pair[0]].lch_similarity(res[pair[1]]))
  except WordNetError:
    # lch_similarity is only defined for synsets with the same PoS; record 0 otherwise
    lch.append(0)

lch = np.array(lch, dtype=object).reshape((6,6))

df_lch = pd.DataFrame(lch)
df_lch.columns = list(res.keys())
df_lch.index = list(res.keys())

for column in df_lch.columns:
  df_lch[column] /= df_lch[column].max()

df_lch

| | man | swim | girl | boy | woman | walk |
|---|---|---|---|---|---|---|
| man | 1.0 | 0.0 | 0.618897 | 0.697983 | 0.697983 | 0.0 |
| swim | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.662805 |
| girl | 0.618897 | 0.0 | 1.0 | 0.507432 | 0.809449 | 0.0 |
| boy | 0.697983 | 0.0 | 0.507432 | 1.0 | 0.557553 | 0.0 |
| woman | 0.697983 | 0.0 | 0.809449 | 0.557553 | 1.0 | 0.0 |
| walk | 0.0 | 0.662805 | 0.0 | 0.0 | 0.0 | 1.0 |
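
Leacock-Chodorow similarity is -log(p / 2D), where p is the shortest path length counted in nodes (edges + 1) and D is the maximum depth of the taxonomy for that part of speech. Dividing each column by its maximum, as done above, divides by the self-similarity log(2D). The sketch below assumes D = 19 for nouns. That value is an assumption chosen because it reproduces the noun entries of the table; the notebook itself never prints it.

```python
import math

def lch_similarity(edges, max_depth):
    """Leacock-Chodorow score for a shortest path of `edges` edges."""
    return -math.log((edges + 1) / (2.0 * max_depth))

def normalized_lch(edges, max_depth):
    """Score divided by the self-similarity (edges = 0), as in the table above."""
    return lch_similarity(edges, max_depth) / lch_similarity(0, max_depth)

NOUN_DEPTH = 19  # assumption: chosen because it reproduces the noun values in the table

print(round(normalized_lch(1, NOUN_DEPTH), 6))  # girl / woman
print(round(normalized_lch(2, NOUN_DEPTH), 6))  # man / boy
print(round(normalized_lch(3, NOUN_DEPTH), 6))  # man / girl
```

The logarithm compresses the effect of path length, which is why these normalized scores sit closer together than the corresponding path similarities.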


### Getting the Wu-Palmer Similarity

In [16]:
wup = []
for i, pair in enumerate(pairs):
  wup.append(res[pair[0]].wup_similarity(res[pair[1]]))

wup = np.array(wup, dtype=object).reshape((6,6))

df_wup = pd.DataFrame(wup)
df_wup.columns = list(res.keys())
df_wup.index = list(res.keys())

df_wup

| | man | swim | girl | boy | woman | walk |
|---|---|---|---|---|---|---|
| man | 1.0 | 0.181818 | 0.631579 | 0.666667 | 0.666667 | 0.181818 |
| swim | 0.181818 | 1.0 | 0.166667 | 0.181818 | 0.181818 | 0.333333 |
| girl | 0.631579 | 0.166667 | 1.0 | 0.631579 | 0.631579 | 0.166667 |
| boy | 0.666667 | 0.181818 | 0.631579 | 1.0 | 0.666667 | 0.181818 |
| woman | 0.666667 | 0.181818 | 0.947368 | 0.666667 | 1.0 | 0.181818 |
| walk | 0.181818 | 0.333333 | 0.166667 | 0.181818 | 0.181818 | 1.0 |
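
Wu-Palmer similarity relates the depth of the least common subsumer to the depths of the two synsets: 2 * depth(LCS) / (depth(s1) + depth(s2)). The sketch below uses hypothetical depths chosen for illustration. They were not read from WordNet, so any match with the table is only suggestive.

```python
def wu_palmer(depth_lcs, depth_a, depth_b):
    """Wu-Palmer similarity from the depth of the LCS and of the two synsets."""
    return 2.0 * depth_lcs / (depth_a + depth_b)

# Hypothetical depths, for illustration only.
print(wu_palmer(9, 9, 10))  # parent and child: the LCS is the parent itself (18/19)
print(wu_palmer(4, 6, 6))   # two synsets that share an ancestor two levels up (8/12)
print(wu_palmer(5, 5, 5))   # a synset compared with itself
```

Because the depths in the numerator and denominator are of the same order, pairs that share a reasonably deep ancestor all land in a narrow band, which is consistent with the many values between 0.63 and 0.67 in the table.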


### Getting the Lin Similarity

In [17]:
nltk.download('wordnet_ic')
from nltk.corpus import wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat')
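from nltk.corpus.reader.wordnet import WordNetError

lin = []
for pair in pairs:
  try:
    lin.append(res[pair[0]].lin_similarity(res[pair[1]], brown_ic))
  except WordNetError:
    # lin_similarity is only defined for synsets with the same PoS; record 0 otherwise
    lin.append(0)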

lin = np.array(lin, dtype=object).reshape((6,6))

df_lin = pd.DataFrame(lin)
df_lin.columns = list(res.keys())
df_lin.index = list(res.keys())

df_lin

[nltk_data] Downloading package wordnet_ic to
[nltk_data]     C:\Users\mario\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet_ic.zip.


| | man | swim | girl | boy | woman | walk |
|---|---|---|---|---|---|---|
| man | 1.0 | 0.0 | 0.713511 | 0.729472 | 0.787084 | 0.0 |
| swim | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.491005 |
| girl | 0.713511 | 0.0 | 1.0 | 0.292728 | 0.90678 | 0.0 |
| boy | 0.729472 | 0.0 | 0.292728 | 1.0 | 0.318423 | 0.0 |
| woman | 0.787084 | 0.0 | 0.90678 | 0.318423 | 1.0 | 0.0 |
| walk | 0.0 | 0.491005 | 0.0 | 0.0 | 0.0 | 1.0 |
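
Lin similarity uses information content (IC) rather than path length: 2 * IC(LCS) / (IC(s1) + IC(s2)), with IC(c) = -log P(c) estimated from a corpus (here the Brown counts in `ic-brown.dat`). The sketch below uses made-up probabilities purely to show how the formula behaves. They are not taken from `ic-brown.dat`.

```python
import math

def information_content(probability):
    """IC(c) = -log P(c), written as log(1 / P(c))."""
    return math.log(1.0 / probability)

def lin_similarity(p_lcs, p_a, p_b):
    """Lin similarity from the corpus probabilities of the LCS and the two concepts."""
    return 2.0 * information_content(p_lcs) / (
        information_content(p_a) + information_content(p_b)
    )

# Hypothetical probabilities, for illustration only.
print(lin_similarity(0.001, 0.001, 0.001))  # a concept compared with itself
print(lin_similarity(0.01, 0.001, 0.001))   # fairly specific shared ancestor
print(lin_similarity(1.0, 0.001, 0.001))    # only the root in common: IC(LCS) = 0
```

A very general least common subsumer carries little information, so the score drops sharply even when the two words feel related. This is one possible reading of the low boy/girl value, whose LCS in the table above is `person.n.01`.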


## Conclusion:
*Choosing the best algorithm for calculating similarity is a subjective decision, since judging performance depends on the user's own interpretation of which words are the most similar. For example, are "man" and "woman" more similar to each other than "man" and "boy"? Using Path Similarity, "man" and "boy" have a similarity score of 0.333333, even though in our opinion these two words are more than just slightly similar. Leacock-Chodorow gives a higher (normalized) value for this pairing, 0.697983, but the difference between this value and the one for "man" and "girl" (0.618897) is small. Such a small difference makes it harder to distinguish word pairs numerically. Wu-Palmer and Lin share this issue, with even smaller differences between their two respective values (0.666667 vs. 0.631579 and 0.729472 vs. 0.713511), making the two pairs even harder to tell apart. Nevertheless, the Lin method more often gives the highest similarity values for the words that we personally believe are extremely similar. One counterexample is "boy" and "girl": we would expect them to be very similar, yet their Lin score is only 0.292728, compared with the higher values in the table, which are closer to 0.7.*

*Path Similarity and Wu-Palmer were the only two algorithms that allowed a direct comparison between nouns and verbs. This could be advantageous when you need to distinguish between words from different classes in a numerical analysis, or when a word changes class depending on context. In other cases it could give misleading similarity values, since some pairs of words of the same type also receive low values.*

*While the least common subsumer (LCS) does not provide a numerical similarity value, it is helpful in context analysis: it identifies a common concept shared by the two words, which provides additional context.*

*It is also interesting to notice that, in every table, the woman/girl similarity (read from the `woman` row) is higher than the man/boy similarity, even though both cases compare an adult and a child of the same gender.*
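
To check this observation against the tables, the sketch below collects the woman/girl and man/boy scores for each measure (copied from the `woman` and `man` rows) and compares them. It also includes the Wu-Palmer value from the `girl` row, because that table is not symmetric for this pair. The variable names are illustrative.

```python
# Scores copied from the tables above: (woman/girl, man/boy) per measure.
scores = {
    "path": (0.5, 0.333333),
    "lch": (0.809449, 0.697983),
    "wup": (0.947368, 0.666667),
    "lin": (0.90678, 0.729472),
}

for measure, (woman_girl, man_boy) in scores.items():
    # Print whether woman/girl is higher, and by how much.
    print(measure, woman_girl > man_boy, round(woman_girl - man_boy, 6))

# The Wu-Palmer table lists 0.631579 in the girl row, woman column.
wup_girl_row = 0.631579
print("wup (girl row)", wup_girl_row > scores["wup"][1])
```

Read from the `woman` row, the observation holds for all four measures. Read from the `girl` row, the Wu-Palmer value falls below man/boy, so the conclusion for that measure depends on which cell is used.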

*In conclusion, the algorithm that appears to perform best is Wu-Palmer, since its similarity values align most closely with our own perception of the word similarities. Although Leacock-Chodorow performs similarly, Wu-Palmer has the advantage that it can directly compare nouns and verbs.*

