Written by Santiago del Rey Juarez and Nikita Belooussov

Given the following (lemma, category) pairs:

```python
('the','DT'), ('man','NN'), ('swim','VB'), ('with', 'PR'), ('a', 'DT'),
('girl','NN'), ('and', 'CC'), ('a', 'DT'), ('boy', 'NN'), ('whilst', 'PR'),
('the', 'DT'), ('woman', 'NN'), ('walk', 'VB')
```

For each pair, when possible, print their most frequent WordNet synset, their corresponding least common subsumer (LCS) and their similarity value, using the following functions:

* Path Similarity
* Leacock-Chodorow Similarity
* Wu-Palmer Similarity
* Lin Similarity
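
For reference, the four measures are commonly defined as sketched below. This is written from the standard definitions rather than taken from the exercise text, and the notation is introduced here as an assumption: len(s1, s2) is the number of edges on the shortest path between two synsets, D is the maximum depth of the taxonomy for their part of speech, depth(s) is the depth of a synset, lcs is the least common subsumer, and IC is the information content estimated from a corpus. NLTK's exact conventions for counting depth may differ slightly.

```latex
\begin{align*}
\mathrm{sim}_{\mathrm{path}}(s_1, s_2) &= \frac{1}{1 + \mathrm{len}(s_1, s_2)} \\
\mathrm{sim}_{\mathrm{lch}}(s_1, s_2) &= -\log \frac{\mathrm{len}(s_1, s_2) + 1}{2D} \\
\mathrm{sim}_{\mathrm{wup}}(s_1, s_2) &= \frac{2\,\mathrm{depth}(\mathrm{lcs}(s_1, s_2))}{\mathrm{depth}(s_1) + \mathrm{depth}(s_2)} \\
\mathrm{sim}_{\mathrm{lin}}(s_1, s_2) &= \frac{2\,\mathrm{IC}(\mathrm{lcs}(s_1, s_2))}{\mathrm{IC}(s_1) + \mathrm{IC}(s_2)}
\end{align*}
```

Path, Wu-Palmer and Lin similarity are bounded by 1 for identical synsets, while the Leacock-Chodorow score is unbounded in principle and depends on D, so it is the one that needs normalizing.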

In [None]:
import nltk
import numpy as np
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\santi\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
nltk.download('wordnet_ic')
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic
brown_ic=wordnet_ic.ic('ic-brown.dat')
# The pairs from the exercise, plus three extra nouns (dog, table, lamp)
# added to compare different kinds of nouns
words=[('the','DT'), ('man','NN'), ('swim','VB'), ('with', 'PR'), ('a', 'DT'),
('girl','NN'), ('and', 'CC'), ('a', 'DT'), ('boy', 'NN'), ('whilst', 'PR'),
('the', 'DT'), ('woman', 'NN'), ('walk', 'VB'),('dog','NN'),('table','NN'),('lamp','NN')]

[nltk_data] Downloading package wordnet_ic to
[nltk_data]     C:\Users\santi\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet_ic is already up-to-date!


In [None]:
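syn=[]
for word in words:
    try:
        # The first letter of the tag is used as the WordNet POS ('n', 'v', ...);
        # the first synset returned is taken as the most frequent sense
        synset=wn.synsets(word[0],word[1].lower()[0])[0]
    except Exception:
        # No synset for this tag (e.g. determiners, conjunctions): skip the word
        continue
    syn.append(synset)
    print(f"Most common synset of {word[0]}:")
    print(synset)
    print ("\n")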

Most common synset of man:
Synset('man.n.01')


Most common synset of swim:
Synset('swim.v.01')


Most common synset of girl:
Synset('girl.n.01')


Most common synset of boy:
Synset('male_child.n.01')


Most common synset of woman:
Synset('woman.n.01')


Most common synset of walk:
Synset('walk.v.01')


Most common synset of dog:
Synset('dog.n.01')


Most common synset of table:
Synset('table.n.01')


Most common synset of lamp:
Synset('lamp.n.01')




In [None]:
# Seed each list with a self-similarity; lSimA[0] (the Leacock-Chodorow
# self-similarity of the noun man.n.01) is reused as the normalization constant
pSimA=[round(syn[0].path_similarity(syn[0]),3)]
wSimA=[round(syn[0].wup_similarity(syn[0]),3)]
liSimA=[round(syn[0].lin_similarity(syn[0],brown_ic),3)]
lSimA=[round(syn[0].lch_similarity(syn[0]),3)]

for i,word in enumerate(syn):
    for j,word2 in enumerate(syn[i+1:]):
        print("\nPath Similarity:")
        pSim=round(word.path_similarity(word2),3)
        pSimA.append(pSim)
        print(str(word)+" "+str(word2))
        print(pSim)

        print("\nLeacock Chodorow Similarity:")
        try:
            # Normalize by the noun self-similarity stored in lSimA[0]
            lSim=round(word.lch_similarity(word2)/lSimA[0],3)
            lSimA.append(lSim)
            print(str(word)+" "+str(word2))
            print(lSim)
        except Exception:
            # Raised e.g. when the two synsets have different parts of speech
            lSimA.append(0)
            print("None")

        print("\nWu-Palmer Similarity:")
        wSim=round(word.wup_similarity(word2),3)
        wSimA.append(wSim)
        print(str(word)+" "+str(word2))
        print(wSim)

        print("\nlin Similarity:")
        try:
            liSim=round(word.lin_similarity(word2,brown_ic),3)
            liSimA.append(liSim)
            print(str(word)+" "+str(word2))
            # Print the Lin value itself (not the path similarity)
            print(liSim)
        except Exception:
            # Raised e.g. when the two synsets have different parts of speech
            liSimA.append(0)
            print("None")

        print("Least common subsumer")
        print(word.lowest_common_hypernyms(word2))

        print("\n\n")

print ("Normalized Vectors:")
print([pSimA,wSimA,liSimA,lSimA])


Path Similarity:
Synset('man.n.01') Synset('swim.v.01')
0.1

Leacock Chodorow Similarity:
None

Wu-Palmer Similarity:
Synset('man.n.01') Synset('swim.v.01')
0.182

lin Similarity:
None
Least common subsumer
[]




Path Similarity:
Synset('man.n.01') Synset('girl.n.01')
0.25

Leacock Chodorow Similarity:
Synset('man.n.01') Synset('girl.n.01')
0.619

Wu-Palmer Similarity:
Synset('man.n.01') Synset('girl.n.01')
0.632

lin Similarity:
Synset('man.n.01') Synset('girl.n.01')
0.714
Least common subsumer
[Synset('adult.n.01')]




Path Similarity:
Synset('man.n.01') Synset('male_child.n.01')
0.333

Leacock Chodorow Similarity:
Synset('man.n.01') Synset('male_child.n.01')
0.698

Wu-Palmer Similarity:
Synset('man.n.01') Synset('male_child.n.01')
0.667

lin Similarity:
Synset('man.n.01') Synset('male_child.n.01')
0.729
Least common subsumer
[Synset('male.n.02')]




Path Similarity:
Synset('man.n.01') Synset('woman.n.01')
0.333

Leacock Chodorow Similarity:
Synset('man.n.01') Synset('woman.n.01')

In [None]:
size = len(syn)

# Pairwise matrices: LCS names plus one similarity matrix per measure
subsumerMat =[[None]*size for _ in range(size)]
pSimMat=np.zeros((size,size))
wSimMat=np.zeros((size,size))
lSimMat=np.zeros((size,size))
liSimMat=np.zeros((size,size))

for i,word in enumerate(syn):
    for j,word2 in enumerate(syn):
        pSimMat[i,j]=word.path_similarity(word2)

        try:
            # Normalized by lSimA[0], the noun self-similarity from the previous cell
            lSimMat[i,j]=word.lch_similarity(word2)/lSimA[0]
        except Exception:
            # Raised e.g. for synsets with different parts of speech
            lSimMat[i,j]=0

        wSimMat[i,j]=word.wup_similarity(word2)

        try:
            liSimMat[i,j]=word.lin_similarity(word2,brown_ic)
        except Exception:
            # Raised e.g. for synsets with different parts of speech
            liSimMat[i,j]=0

        # Store the name of the first least common subsumer, if any
        subsumers = word.lowest_common_hypernyms(word2)
        if len(subsumers) > 0:
            subsumerMat[i][j] = subsumers[0].name()


In [None]:
import pandas as pd

word_list = ["man", "swim", "girl", "boy", "woman", "walk", "dog", "table", "lamp"]

pd.set_option('display.float_format', lambda x: '%.3f' % x)

df_subsummer = pd.DataFrame(subsumerMat, index = word_list, columns = word_list)
print("Matrix of least common subsumers")
print(df_subsummer)

Matrix of least common subsumers
                 man         swim           girl              boy  \
man         man.n.01         None     adult.n.01        male.n.02   
swim            None    swim.v.01           None             None   
girl      adult.n.01         None      girl.n.01      person.n.01   
boy        male.n.02         None    person.n.01  male_child.n.01   
woman     adult.n.01         None     woman.n.01      person.n.01   
walk            None  travel.v.01           None             None   
dog    organism.n.01         None  organism.n.01    organism.n.01   
table    entity.n.01         None    entity.n.01      entity.n.01   
lamp      whole.n.02         None     whole.n.02       whole.n.02   

               woman         walk            dog        table         lamp  
man       adult.n.01         None  organism.n.01  entity.n.01   whole.n.02  
swim            None  travel.v.01           None         None         None  
girl      woman.n.01         None  organism.n

In [None]:
df_path = pd.DataFrame(pSimMat, index = word_list, columns = word_list)
print("Matrix of Path Similarity Method")
print(df_path)

Matrix of Path Similarity Method
        man  swim  girl   boy  woman  walk   dog  table  lamp
man   1.000 0.100 0.250 0.333  0.333 0.100 0.143  0.091 0.091
swim  0.100 1.000 0.091 0.100  0.100 0.333 0.083  0.111 0.083
girl  0.250 0.091 1.000 0.167  0.500 0.091 0.125  0.083 0.083
boy   0.333 0.100 0.167 1.000  0.200 0.100 0.143  0.091 0.091
woman 0.333 0.100 0.500 0.200  1.000 0.100 0.143  0.091 0.091
walk  0.100 0.333 0.091 0.100  0.100 1.000 0.083  0.111 0.083
dog   0.143 0.083 0.125 0.143  0.143 0.083 1.000  0.071 0.091
table 0.091 0.111 0.083 0.091  0.091 0.111 0.071  1.000 0.071
lamp  0.091 0.083 0.083 0.091  0.091 0.083 0.091  0.071 1.000


In [None]:
df_wu = pd.DataFrame(wSimMat, index = word_list, columns = word_list)
print("Matrix of Wu-Palmer Similarity Method")
print(df_wu)

Matrix of Wu-Palmer Similarity Method
        man  swim  girl   boy  woman  walk   dog  table  lamp
man   1.000 0.182 0.632 0.667  0.667 0.182 0.667  0.167 0.444
swim  0.182 1.000 0.167 0.182  0.182 0.333 0.154  0.200 0.154
girl  0.632 0.167 1.000 0.632  0.632 0.167 0.632  0.154 0.421
boy   0.667 0.182 0.632 1.000  0.667 0.182 0.667  0.167 0.444
woman 0.667 0.182 0.947 0.667  1.000 0.182 0.667  0.167 0.444
walk  0.182 0.333 0.167 0.182  0.182 1.000 0.154  0.200 0.154
dog   0.667 0.154 0.632 0.667  0.667 0.154 0.929  0.133 0.444
table 0.167 0.200 0.154 0.167  0.167 0.200 0.133  1.000 0.133
lamp  0.444 0.154 0.421 0.444  0.444 0.154 0.444  0.133 1.000


In [None]:
df_l = pd.DataFrame(lSimMat, index = word_list, columns = word_list)
print("Matrix of Leacock Chodorow Similarity Method")
print(df_l)

Matrix of Leacock Chodorow Similarity Method
        man  swim  girl   boy  woman  walk   dog  table  lamp
man   1.000 0.000 0.619 0.698  0.698 0.000 0.465  0.341 0.341
swim  0.000 0.896 0.000 0.000  0.000 0.594 0.000  0.000 0.000
girl  0.619 0.000 1.000 0.507  0.809 0.000 0.428  0.317 0.317
boy   0.698 0.000 0.507 1.000  0.557 0.000 0.465  0.341 0.341
woman 0.698 0.000 0.809 0.557  1.000 0.000 0.465  0.341 0.341
walk  0.000 0.594 0.000 0.000  0.000 0.896 0.000  0.000 0.000
dog   0.465 0.000 0.428 0.465  0.465 0.000 1.000  0.274 0.341
table 0.341 0.000 0.317 0.341  0.341 0.000 0.274  1.000 0.274
lamp  0.341 0.000 0.317 0.341  0.341 0.000 0.341  0.274 1.000


In [None]:
df_lin = pd.DataFrame(liSimMat, index = word_list, columns = word_list)
print("Matrix of Lin Similarity Method")
print(df_lin)

Matrix of Lin Similarity Method
         man  swim   girl    boy  woman  walk    dog  table   lamp
man    1.000 0.000  0.714  0.729  0.787 0.000  0.292 -0.000  0.200
swim   0.000 1.000  0.000  0.000  0.000 0.491  0.000  0.000  0.000
girl   0.714 0.000  1.000  0.293  0.907 0.000  0.269 -0.000  0.184
boy    0.729 0.000  0.293  1.000  0.318 0.000  0.256 -0.000  0.175
woman  0.787 0.000  0.907  0.318  1.000 0.000  0.291 -0.000  0.199
walk   0.000 0.491  0.000  0.000  0.000 1.000  0.000  0.000  0.000
dog    0.292 0.000  0.269  0.256  0.291 0.000  1.000 -0.000  0.169
table -0.000 0.000 -0.000 -0.000 -0.000 0.000 -0.000  1.000 -0.000
lamp   0.200 0.000  0.184  0.175  0.199 0.000  0.169 -0.000  1.000


Normalize similarity values when necessary. Which similarity measure seems better?

According to our results, the Lin and Leacock-Chodorow similarities were the best to use. We added a few more words (dog, table and lamp) to get a better idea of how each measure compares different types of nouns, for example animals and furniture. Leacock-Chodorow appears to be the preferred method for nouns, because it places the human nouns closer to each other than to the animal noun: man, girl, boy and woman score between 0.507 and 0.809 against each other, but at most 0.465 against dog. The Lin method is better for verbs. In the Leacock-Chodorow matrix a verb compared with itself does not produce 1 (swim and walk both score 0.896), whereas every noun compared with itself does, and we would expect the verbs to mirror this. This appears to be a consequence of our normalization: we divide every score by the self-similarity of a noun (man.n.01), and the Leacock-Chodorow score depends on the depth of the taxonomy, which differs between nouns and verbs. In the Lin method every word compared with itself produces 1, so it is better suited for comparing verbs. The other two methods produce results such as a boy being practically as similar to a dog as to a girl (path similarity 0.143 versus 0.167, Wu-Palmer 0.667 versus 0.632), which in most cases would not be considered true.
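
The effect of the normalization on the verb diagonal can be checked with a little arithmetic. The sketch below does not call NLTK; it assumes the Leacock-Chodorow form -log((len + 1) / (2D)) and taxonomy depths of 19 for nouns and 13 for verbs, which are values we inferred from the scores printed above rather than read from WordNet. The helper names `lch` and `lch_normalized` are hypothetical.

```python
import math

# Assumed maximum taxonomy depths, inferred from the outputs above
# (not queried from NLTK in this sketch)
NOUN_DEPTH = 19
VERB_DEPTH = 13


def lch(path_len, max_depth):
    """Leacock-Chodorow score for a shortest path of `path_len` edges."""
    return -math.log((path_len + 1) / (2.0 * max_depth))


def lch_normalized(path_len, max_depth):
    """Divide by the self-similarity (path_len == 0) of the same taxonomy."""
    return lch(path_len, max_depth) / lch(0, max_depth)


# Verb self-similarity divided by the noun self-similarity, as in our cells
verb_self_by_noun = lch(0, VERB_DEPTH) / lch(0, NOUN_DEPTH)

# Verb self-similarity divided by the verb self-similarity instead
verb_self_by_verb = lch_normalized(0, VERB_DEPTH)

print(round(verb_self_by_noun, 3))  # matches the 0.896 on the verb diagonal
print(verb_self_by_verb)
print(round(lch_normalized(3, NOUN_DEPTH), 3))  # man vs. girl, path length 3
```

Under these assumptions, dividing each score by the self-similarity of a synset with the same part of speech would put 1 on the whole diagonal, verbs included. We did not rerun the matrices this way, so the remaining verb values would change as well.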