# IHLT-MAI S6: Word sense disambiguation

In [12]:
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.metrics import jaccard_distance
from scipy.stats import pearsonr

# Download the NLTK resources used in this session
nltk.download('maxent_ne_chunker')
nltk.download('conll2000')
nltk.download('words')
#nltk.download('punkt')
#nltk.download('wordnet')
#nltk.download('universal_tagset')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home2/users/alumnes/1202114/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package conll2000 to
[nltk_data]     /home2/users/alumnes/1202114/nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     /home2/users/alumnes/1202114/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

We start by reading the trial set sentences and their respective gold standard.

In [9]:
with open('trial/STS.input.txt') as fp:
    data = fp.readlines()
    
with open('trial/STS.gs.txt') as e:
    gs = e.readlines()

We define an auxiliary function (get_nes) that tokenizes and POS-tags a given sentence, runs the NLTK
named entity chunker on it, and returns a set in which every named entity chunk is replaced by its label
and every other token is kept as the word itself. The chunker requires part-of-speech tags as input.

In [81]:
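def get_nes(sentence, binary):
    """Return the set of words in the sentence, with each named entity replaced by its label."""
    tagged = pos_tag(word_tokenize(sentence))
    chunks = ne_chunk(tagged, binary=binary)
    res = set()
    for chunk in chunks:
        if isinstance(chunk, nltk.tree.Tree):
            # Named entity subtree: keep only its label
            res.add(chunk.label())
        else:
            # Plain (word, tag) pair: keep the word (chunk[1] would be the POS tag)
            res.add(chunk[0])
    return res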
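
To make the behaviour of get_nes concrete without running the chunker, the sketch below applies the same rule (entity chunks become their label, other tokens stay as words) to hand-written chunk lists. The helper `labels_and_words` and the chunk lists are hypothetical illustrations, not real `ne_chunk` output; entities are represented as plain dicts instead of `nltk.tree.Tree` objects so that the example needs no NLTK models.

```python
def labels_and_words(chunks):
    """Mimic get_nes on hand-written chunks (hypothetical helper).

    A plain token is a (word, pos) tuple; a named entity is a dict with
    a "label" and its "leaves", standing in for an nltk Tree.
    """
    result = set()
    for chunk in chunks:
        if isinstance(chunk, dict):
            result.add(chunk["label"])   # entity: keep only the label
        else:
            word, _pos = chunk
            result.add(word)             # plain token: keep the word
    return result

# Hand-written stand-ins for the sentence "John Smith visited Paris"
non_binary = [
    {"label": "PERSON", "leaves": [("John", "NNP"), ("Smith", "NNP")]},
    ("visited", "VBD"),
    {"label": "GPE", "leaves": [("Paris", "NNP")]},
]
binary = [
    {"label": "NE", "leaves": [("John", "NNP"), ("Smith", "NNP")]},
    ("visited", "VBD"),
    {"label": "NE", "leaves": [("Paris", "NNP")]},
]

non_binary_set = labels_and_words(non_binary)
binary_set = labels_and_words(binary)
print(sorted(non_binary_set))  # ['GPE', 'PERSON', 'visited']
print(sorted(binary_set))      # ['NE', 'visited']
```

In both settings the surface words of an entity are dropped. With binary=True all entities collapse into the single element NE, whereas with binary=False two sentences share an entity element only when the entity types coincide.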

Finally, we compute the Jaccard distance between the two sentences of each pair (jaccard_with_nes) under the two chunker configurations that we have investigated:
- Binary (binary=True): every named entity is replaced by the generic label NE, regardless of its type.
- Non-binary (binary=False): every named entity is replaced by its type label (e.g. PERSON, ORGANIZATION, GPE).
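
As a reminder of what the Jaccard distance measures, here is a minimal pure-Python sketch of the formula $1 - |A \cap B| / |A \cup B|$. The function name `jaccard_distance_sketch` and the two example sets are made up for illustration; the notebook itself uses `nltk.metrics.jaccard_distance`.

```python
def jaccard_distance_sketch(a, b):
    """Jaccard distance between two sets (assumes their union is non-empty)."""
    union = a | b
    return (len(union) - len(a & b)) / len(union)

# Two hand-written token sets that differ in a single word
s1 = {"NE", "visited", "the", "museum"}
s2 = {"NE", "toured", "the", "museum"}

d = jaccard_distance_sketch(s1, s2)
print(d)  # 0.4 (union has 5 elements, intersection has 3)
```

Since this is a distance, identical sets give 0 and disjoint sets give 1.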

In [80]:
def jaccard_with_nes(sent1, sent2, binary):
    nes_sent1 = get_nes(sent1, binary)
    nes_sent2 = get_nes(sent2, binary)
    return jaccard_distance(nes_sent1, nes_sent2)

jaccard_binary = []
jaccard_non_binary = []
gold = []
for index, line in enumerate(data):
    (num, sent1, sent2) = line.split('\t')
    jaccard_binary.append(jaccard_with_nes(sent1, sent2, binary=True))
    jaccard_non_binary.append(jaccard_with_nes(sent1, sent2, binary=False))
    gold.append(int(gs[index].split('\t')[1][0]))

print('Pearson correlation between gold and jaccard distance using NEs with binary to True:', pearsonr(gold, jaccard_binary)[0])
print('Pearson correlation between gold and jaccard distance using NEs with binary to false:', pearsonr(gold, jaccard_non_binary)[0])

Pearson correlation between gold and jaccard distance using NEs with binary to True: 0.5320054206469561
Pearson correlation between gold and jaccard distance using NEs with binary to false: 0.5141109167924274


### Comparison and conclusions
Recall that in session 2 (Document) we obtained a correlation of 0.3962389776119233 when doing the same exercise but considering words themselves and not applying any word sense disambiguation technique. This time, by applying word sense disambiguation techniques, we have slightly improved the result.

However, the result obtained in this session is slightly worse than the one obtained in session 3 (Morphology), when we used lemmas instead of the original words. Lemmatization gives us the canonical form of each word, so all the words derived from the same root are considered the same. Since we are interested in measuring semantic distances, this is useful: in word-level settings, morphological information (at least in English) introduces "noise".
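
The effect of lemmatization on the Jaccard distance can be illustrated with a toy example. The lemma table below is hand-written for this sketch (hypothetical, not the output of a real lemmatizer), and `lemmatize` and `jaccard` are helper names introduced only here.

```python
# Hypothetical, hand-written lemma table (assumption for illustration only)
TOY_LEMMAS = {"dogs": "dog", "running": "run", "runs": "run", "ran": "run"}

def lemmatize(tokens):
    """Map each token to its toy lemma, leaving unknown tokens unchanged."""
    return {TOY_LEMMAS.get(t, t) for t in tokens}

def jaccard(a, b):
    """Jaccard distance between two sets (assumes a non-empty union)."""
    union = a | b
    return (len(union) - len(a & b)) / len(union)

words1 = {"the", "dogs", "ran"}
words2 = {"the", "dog", "runs"}

raw = jaccard(words1, words2)
lemmatized = jaccard(lemmatize(words1), lemmatize(words2))
print(raw)         # 0.8 (only "the" is shared among 5 distinct words)
print(lemmatized)  # 0.0 (both sets become {"the", "dog", "run"})
```

The two sentences differ only in inflection, so the distance on raw words is high while the distance on lemmas drops to zero.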

Still, word sense disambiguation should, at least intuitively, obtain better results than lemmatization alone, which is a simpler technique. We believe this happens because the Lesk algorithm is too simple and has relatively low accuracy. As future work, more complex algorithms for word sense disambiguation could be investigated.