# IHLT-MAI S6: Word sense disambiguation

In [25]:
import nltk
from nltk.metrics import jaccard_distance
from scipy.stats import pearsonr
from nltk.wsd import lesk
from nltk import pos_tag
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer
#nltk.download('punkt')
#nltk.download('wordnet')
#nltk.download('universal_tagset')

We start by reading the trial set sentences and its respective gold standard.

In [26]:
with open('trial/STS.input.txt') as fp:
    data = fp.readlines()
    
with open('trial/STS.gs.txt') as e:
    gs = e.readlines()

We define auxiliary function to map the NLTK pos tags to the ones used by WordNet.

In [27]:
def nltk_pos_to_wordnet_pos(nltk_pos):
    mapping = {'NOUN': wn.NOUN, 'ADJ': wn.ADJ, 'VERB': wn.VERB, 'ADP': wn.ADV}
    if nltk_pos in mapping:
        return mapping[nltk_pos]
    else:
        return None

We define auxiliary functions to get the disambiguated synsets from a given context (ie. sentence) and
the part-of-speech, respectively. For getting the disambiguated synsets, we use the Lesk algorithm for
word sense disambiguation, which requires part-of-speech.

In [28]:
def get_synsets(context):
    words = [word for (word, pos) in context]
    return {lesk(words, w, p) for w, p in context if lesk(words, w, p)}


def get_pos(sent):
    tok_sent = nltk.word_tokenize(sent)
    pos_sent = pos_tag(tok_sent, tagset='universal')
    w_pos_sent = [(word, nltk_pos_to_wordnet_pos(pos)) for (word, pos) in pos_sent]
    filtered_pos_sent = [(word, pos) for (word, pos) in w_pos_sent if pos is not None]
    return filtered_pos_sent

Finally, we define two functions for the two different ways we have investigated to compute the Jaccard distance:
- Synsets (eval_synsets): Compute the Jaccard distance between the set of the disambiguated synsets of the first
    sentence and the set of disambiguated synsets of the second sentence.
- Definitions (eval_definitions): Compute the Jaccard distance between the set of words in the **definitions** of the disambiguated synsets of the first sentence and the ones in the second sentence.

In [29]:
def eval_synsets(sent1, sent2):
    pos_sent1 = get_pos(sent1)
    pos_sent2 = get_pos(sent2)
    synsets1 = get_synsets(pos_sent1)
    synsets2 = get_synsets(pos_sent2)
    return jaccard_distance(synsets1, synsets2)

def eval_definitions(sent1, sent2):
    pos_sent1 = get_pos(sent1)
    pos_sent2 = get_pos(sent2)
    synsets1 = get_synsets(pos_sent1)
    synsets2 = get_synsets(pos_sent2)
    definitions1 = set([])
    for synset in synsets1:
        for word in nltk.word_tokenize(synset.definition()):
            definitions1.add(word)
    definitions2 = set([])
    for synset in synsets2:
        for word in nltk.word_tokenize(synset.definition()):
            definitions2.add(word)
    return jaccard_distance(definitions1, definitions2)
    

jaccard_synsets = []
jaccard_definitions = []
gold = []
for index, line in enumerate(data):
    (num, sent1, sent2) = line.split('\t')
    jaccard_synsets.append(eval_synsets(sent1, sent2))
    jaccard_definitions.append(eval_definitions(sent1, sent2))
    gold.append(int(gs[index].split('\t')[1][0]))

print('Pearson correlation between gold and jaccard distance using synsets:', pearsonr(gold, jaccard_synsets)[0])
print('Pearson correlation between gold and jaccard distance using words in the definition:', pearsonr(gold, jaccard_definitions)[0])

Pearson correlation between gold and jaccard distance using synsets: 0.4697360281253835
Pearson correlation between gold and jaccard distance using words in the definition: 0.45389450800788006


### Comparison and conclusions
Recall that in session 2 (Document) we obtained a correlation of 0.3962389776119233 when doing the same exercise but considering words themselves and not applying any word sense disambiguation technique. This time, by applying word sense disambiguation techniques, we have slightly improved the result.

However, the result obtained in this session is slightly worse than the one obtained in session 3 (Morphology), when we used lemmas instead of the original words. Lemmatization gives us the canonical form of words. Therefore, all the words derived from a root word will be considered the same. Since we are keen on measuring semantical distances, this is useful, because in word-level settings, morphological information (at least in English) introduces "noise".

Still, using word sense disambiguation techniques should obtain better results than just using the lemmas, which is a simpler technique, at least intuitevely. The reason why we believe this is happening is that the Lesk algorithm is too simple and has a relatively low accuracy. As a future work, more complex algorithms for word sense disambiguation could be investigated.