## IHLT S5: Lexical semantics

Jordi Armengol, Joan Llop.

In [16]:
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic
from math import log
import nltk
# nltk.download('wordnet_ic')

In [4]:
# information content from frequencies in the Brown Corpus
brown_ic = wordnet_ic.ic('ic-brown.dat')

# (lemma, category) pairs
data = ('the','DT'), ('man','NN'), ('swim','VB'), ('with', 'PR'), ('a', 'DT'), \
        ('girl','NN'), ('and', 'CC'), ('a', 'DT'), ('boy', 'NN'), ('whilst', 'PR'), \
        ('the', 'DT'), ('woman', 'NN'), ('walk', 'VB')

#### Converting the NLTK PoS tags into WordNet PoS tags

The PoS tags in our data are named differently from WordNet's PoS constants, so we need to map one onto the other. If the tag is not a noun, verb, adjective or adverb, we return None, because WordNet only covers these four categories (words of any other category have no synsets).

In [5]:
def nltk_pos_to_wordnet_pos(nltk_pos):
    # Map our coarse tags to WordNet PoS constants; any other tag yields None
    mapping = {'NN': wn.NOUN, 'JJ': wn.ADJ, 'VB': wn.VERB, 'RB': wn.ADV}
    return mapping.get(nltk_pos)
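
The mapping above only covers the four coarse tags used in our data. As a minimal sketch, assuming the input carried full Penn Treebank tags instead (NNS, VBD, JJR and so on), matching on the first two characters of the tag would be enough. The function name `penn_tag_to_wordnet_pos` is hypothetical, and the single-letter values are written out literally so that the snippet runs without NLTK; they are the values of wn.NOUN, wn.VERB, wn.ADJ and wn.ADV.

```python
# Single-letter values equal wn.NOUN, wn.VERB, wn.ADJ and wn.ADV in NLTK.
PREFIX_TO_WORDNET = {'NN': 'n', 'VB': 'v', 'JJ': 'a', 'RB': 'r'}


def penn_tag_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet PoS letter, or None if there is none."""
    # Hypothetical helper: only the first two characters decide the category,
    # so NN, NNS, NNP and NNPS all map to the noun category.
    return PREFIX_TO_WORDNET.get(tag[:2])


for tag in ['NNS', 'VBD', 'JJR', 'RB', 'DT']:
    print(tag, '->', penn_tag_to_wordnet_pos(tag))
```

Tags outside the four categories, such as DT, still yield None, so the rest of the notebook would not need to change.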

#### For each pair, when possible, print their most frequent WordNet synset
The only categories that have synsets are nouns, verbs, adjectives and adverbs.

In [6]:
saved_synsets = []
print('Most frequent synsets:\n')
print()
for lemma, category in data:
    wordnet_category = nltk_pos_to_wordnet_pos(category)
    if wordnet_category is not None:
        word_synsets = wn.synsets(lemma, wordnet_category)
        if len(word_synsets) > 0:
            most_freq_synset = word_synsets[0] # The most frequent synset is the first one
            print(tuple((lemma, category)), 'is', most_freq_synset)
            saved_synsets.append(most_freq_synset)
        else:
            # Just in case
            print(tuple((lemma, category)), "doesn't have an assigned synset")
    else:
        # The lemma is not a noun, nor a verb, nor an adjective nor an adverb
        print(tuple((lemma, category)), "doesn't have an assigned synset")

Most frequent synsets:


('the', 'DT') doesn't have an assigned synset
('man', 'NN') is Synset('man.n.01')
('swim', 'VB') is Synset('swim.v.01')
('with', 'PR') doesn't have an assigned synset
('a', 'DT') doesn't have an assigned synset
('girl', 'NN') is Synset('girl.n.01')
('and', 'CC') doesn't have an assigned synset
('a', 'DT') doesn't have an assigned synset
('boy', 'NN') is Synset('male_child.n.01')
('whilst', 'PR') doesn't have an assigned synset
('the', 'DT') doesn't have an assigned synset
('woman', 'NN') is Synset('woman.n.01')
('walk', 'VB') is Synset('walk.v.01')


#### Similarities

For each pair of the found synsets, we are going to print their corresponding least common subsumer (LCS) and their similarity value, using the following functions:

- Path Similarity
- Leacock-Chodorow Similarity
- Wu-Palmer Similarity
- Lin Similarity

Note that computing these similarities is not always possible, typically because the two synsets don't have the same PoS. In a first version we handled the errors case by case, but we found that checking whether both synsets share the same category avoids them altogether. In WordNet, it doesn't make sense to compute similarities between, say, a verb and a noun: since they belong to different categories, they don't share a common ancestor.
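
To illustrate why synsets from different categories cannot be compared, here is a minimal sketch on a hand-built toy hierarchy. The `PARENT` dictionary and the helper names are hypothetical and do not reproduce the real WordNet structure; the Wu-Palmer formula used is the usual 2 * depth(LCS) / (depth(a) + depth(b)), with depth counted in nodes.

```python
# Hypothetical toy hypernym trees (child -> parent); NOT the real WordNet hierarchy.
PARENT = {
    'entity': None,      # root of the toy noun tree
    'person': 'entity',
    'adult': 'person',
    'man': 'adult',
    'woman': 'adult',
    'move': None,        # root of the toy verb tree
    'swim': 'move',
    'walk': 'move',
}


def ancestors(node):
    """Return the path from node up to its root, node first."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def lowest_common_subsumer(a, b):
    """Deepest node subsuming both a and b, or None if they live in different trees."""
    ancestors_b = set(ancestors(b))
    for node in ancestors(a):  # walks upward, so the first hit is the deepest one
        if node in ancestors_b:
            return node
    return None


def depth(node):
    """Depth counted in nodes, so a root has depth 1."""
    return len(ancestors(node))


def wu_palmer(a, b):
    """Wu-Palmer similarity, or None when there is no common subsumer."""
    lcs = lowest_common_subsumer(a, b)
    if lcs is None:
        return None
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(lowest_common_subsumer('man', 'woman'), wu_palmer('man', 'woman'))
print(lowest_common_subsumer('man', 'swim'), wu_palmer('man', 'swim'))
```

In this toy setting the noun pair shares the subsumer 'adult', while the noun and the verb sit in separate trees, so neither an LCS nor a similarity can be computed for them.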

Leacock-Chodorow similarity has to be normalized. We divide it by the maximum possible similarity, that is, the value obtained when the shortest path is 1, taking the maximum depth of WordNet to be 20.
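
As a rough sketch of the arithmetic behind the two path-based measures, the functions below compute them directly from a shortest-path length. The function names and `ASSUMED_DEPTH` are hypothetical. The value 19 is an assumption, chosen only because it reproduces the noun values printed in this notebook; the normalizer is the same constant used in the cell below.

```python
from math import log


def path_similarity(distance):
    """Path similarity for a shortest path of `distance` edges."""
    return 1 / (distance + 1)


def lch_similarity(distance, taxonomy_depth):
    """Leacock-Chodorow: -log((distance + 1) / (2 * taxonomy_depth))."""
    return -log((distance + 1) / (2 * taxonomy_depth))


# Same normalizing constant as in the notebook (maximum depth taken as 20).
NORMALIZER = -log(1 / (2 * 20))

# Assumption: depth value that reproduces the noun results reported here.
ASSUMED_DEPTH = 19

for distance in (2, 3):
    normalized_lch = lch_similarity(distance, ASSUMED_DEPTH) / NORMALIZER
    print(distance, path_similarity(distance), normalized_lch)
```

Under these assumptions, a distance of 3 gives a path similarity of 0.25 and a normalized Leacock-Chodorow value of about 0.610, and a distance of 2 gives about 0.333 and 0.688, in line with the results reported for man vs girl and man vs woman.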

In [17]:
from itertools import combinations

print('Similarities between pairs of synsets:')
print()
# combinations() yields each unordered pair once, in the same order as a nested index loop
for synset1, synset2 in combinations(saved_synsets, 2):
    if synset1.pos() != synset2.pos():
        print('Skipping', synset1, 'and', synset2, 'because they have different PoS')
        print()
        continue
    print('Similarities between', synset1, synset2)
    lcs = synset1.lowest_common_hypernyms(synset2)
    print('LCS:', lcs[0])
    path_sim = synset1.path_similarity(synset2)
    print('Path similarity:', path_sim)
    leacock_chodorow_sim = synset1.lch_similarity(synset2)/-log(1/(2*20)) # Normalize by max similarity
    print('Leacock-Chodorow Similarity', leacock_chodorow_sim)
    wu_palmer_sim = synset1.wup_similarity(synset2)
    print('Wu-Palmer Similarity:', wu_palmer_sim)
    lin_sim = synset1.lin_similarity(synset2, brown_ic)
    print('Lin similarity', lin_sim)
    print()

Similarities between pairs of synsets:

Skipping Synset('man.n.01') and Synset('swim.v.01') because they have different PoS

Similarities between Synset('man.n.01') Synset('girl.n.01')
LCS: Synset('adult.n.01')
Path similarity: 0.25
Leacock-Chodorow Similarity 0.6102915062989643
Wu-Palmer Similarity: 0.631578947368421
Lin similarity 0.7135111237276783

Similarities between Synset('man.n.01') Synset('male_child.n.01')
LCS: Synset('male.n.02')
Path similarity: 0.3333333333333333
Leacock-Chodorow Similarity 0.6882778097361639
Wu-Palmer Similarity: 0.6666666666666666
Lin similarity 0.7294717876200584

Similarities between Synset('man.n.01') Synset('woman.n.01')
LCS: Synset('adult.n.01')
Path similarity: 0.3333333333333333
Leacock-Chodorow Similarity 0.6882778097361639
Wu-Palmer Similarity: 0.6666666666666666
Lin similarity 0.7870841372982784

Skipping Synset('man.n.01') and Synset('walk.v.01') because they have different PoS

Skipping Synset('swim.v.01') and Synset('girl.n.01') because they have different PoS

[...]

#### Conclusions: Which similarity seems better?

Each metric may have its own usefulness, but as far as we are concerned, we observe that:
- LCS, even if it's not a similarity measure, could be used as a qualitative measure of proximity.
- Path similarity seems to give rather low similarities (for instance, woman and girl only reach a similarity of 0.5). It is the simplest metric, so it may be missing finer distinctions.
- Leacock-Chodorow seems to be relatively robust, but gives similar results across the different cases.
- Wu-Palmer gives similar results in all cases, which may be accurate (all the examples are relatively close semantically), but it is perhaps not that useful if what we really want is a fine-grained comparison of words from the same domain, which is what we will want in many cases.
- Lin exhibits high variance, but at the same time it seems to give very high similarities when it needs to (woman and girl, for instance). It doesn't seem very consistent (e.g. shouldn't man vs woman have the same similarity as male_child vs girl? 0.78 vs 0.29!)
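
One possible reason for Lin's variance is that, unlike the other three measures, it depends on corpus frequencies through the information content (IC). The sketch below uses the standard definition, 2 * IC(LCS) / (IC(a) + IC(b)) with IC(c) = -log p(c). All counts and the corpus size are hypothetical and are not taken from the Brown Corpus.

```python
from math import log


def information_content(count, total):
    """IC(c) = -log p(c), with p(c) estimated as count / total."""
    return -log(count / total)


def lin_similarity(ic_a, ic_b, ic_lcs):
    """Lin similarity: 2 * IC(lcs) / (IC(a) + IC(b))."""
    return 2 * ic_lcs / (ic_a + ic_b)


TOTAL = 1_000_000  # hypothetical corpus size

# Two hypothetical pairs whose members are equally frequent (500 and 400
# occurrences); only the frequency of their common subsumer differs.
ic_a = information_content(500, TOTAL)
ic_b = information_content(400, TOTAL)

specific_lcs = lin_similarity(ic_a, ic_b, information_content(2_000, TOTAL))
generic_lcs = lin_similarity(ic_a, ic_b, information_content(200_000, TOTAL))

print('LCS is a rare (specific) concept:', specific_lcs)
print('LCS is a frequent (generic) concept:', generic_lcs)
```

With these made-up counts the score drops sharply when the common subsumer is a frequent, generic concept, even though nothing else about the pair changes. A pair like male_child vs girl could score low for this kind of reason, although that would have to be checked against the actual IC values.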

In our view, Leacock-Chodorow seems to be the most consistent (woman vs man and girl vs male_child give similar results) and robust measure, although its results are too close to each other.