# IHLT - Lab 6

1. Read all pairs of sentences of the trial set within the evaluation framework of the project.

2. Apply Lesk’s algorithm to the words in the sentences.

3. Compute their similarities using senses and the Jaccard coefficient.

4. Compare the results with those in session 2 (document) and 3 (morphology) in which words and lemmas were considered.

5. Compare the results with the gold standard by computing the Pearson correlation between them.

## Imports

In [1]:
import nltk

from nltk import pos_tag
from nltk.wsd import lesk
from nltk.metrics import jaccard_distance
from scipy.stats import pearsonr

## 1. Data Preparation
As in the previous lab sessions, all pairs of sentences of the trial set are read and stored in a variable.

In [2]:
# The two sentences are the second and third tab-separated fields of each line.
pairs = []
with open('trial/STS.input.txt', 'r') as f:
    for line in f:
        fields = line.strip().split('\t')
        pairs.append((fields[1], fields[2]))

for index, pair in enumerate(pairs, 1):
    print(str(index) + ".", pair)

1. ('The bird is bathing in the sink.', 'Birdie is washing itself in the water basin.')
2. ('In May 2010, the troops attempted to invade Kabul.', 'The US army invaded Kabul on May 7th last year, 2010.')
3. ('John said he is considered a witness but not a suspect.', '"He is not a suspect anymore." John said.')
4. ('They flew out of the nest in groups.', 'They flew into the nest together.')
5. ('The woman is playing the violin.', 'The young lady enjoys listening to the guitar.')
6. ('John went horse back riding at dawn with a whole group of friends.', 'Sunrise at dawn is a magnificent view to take in if you wake up early enough for it.')


Then, each sentence is tokenized into words.

In [3]:
pairs = [(nltk.word_tokenize(p[0]), nltk.word_tokenize(p[1])) for p in pairs]

for index, pair in enumerate(pairs):
    print(str(index + 1) + ".", pair, '\n')

1. (['The', 'bird', 'is', 'bathing', 'in', 'the', 'sink', '.'], ['Birdie', 'is', 'washing', 'itself', 'in', 'the', 'water', 'basin', '.']) 

2. (['In', 'May', '2010', ',', 'the', 'troops', 'attempted', 'to', 'invade', 'Kabul', '.'], ['The', 'US', 'army', 'invaded', 'Kabul', 'on', 'May', '7th', 'last', 'year', ',', '2010', '.']) 

3. (['John', 'said', 'he', 'is', 'considered', 'a', 'witness', 'but', 'not', 'a', 'suspect', '.'], ['``', 'He', 'is', 'not', 'a', 'suspect', 'anymore', '.', "''", 'John', 'said', '.']) 

4. (['They', 'flew', 'out', 'of', 'the', 'nest', 'in', 'groups', '.'], ['They', 'flew', 'into', 'the', 'nest', 'together', '.']) 

5. (['The', 'woman', 'is', 'playing', 'the', 'violin', '.'], ['The', 'young', 'lady', 'enjoys', 'listening', 'to', 'the', 'guitar', '.']) 

6. (['John', 'went', 'horse', 'back', 'riding', 'at', 'dawn', 'with', 'a', 'whole', 'group', 'of', 'friends', '.'], ['Sunrise', 'at', 'dawn', 'is', 'a', 'magnificent', 'view', 'to', 'take', 'in', 'if', 'you', '

Since Lesk's algorithm performs better when it is given the PoS of each word, the PoS tags are computed first:

In [4]:
pairs_pos = [(pos_tag(p[0]), pos_tag(p[1])) for p in pairs]

for index, pair in enumerate(pairs_pos, 1):
    print(str(index) + ".", pair, '\n')

1. ([('The', 'DT'), ('bird', 'NN'), ('is', 'VBZ'), ('bathing', 'VBG'), ('in', 'IN'), ('the', 'DT'), ('sink', 'NN'), ('.', '.')], [('Birdie', 'NNP'), ('is', 'VBZ'), ('washing', 'VBG'), ('itself', 'PRP'), ('in', 'IN'), ('the', 'DT'), ('water', 'NN'), ('basin', 'NN'), ('.', '.')]) 

2. ([('In', 'IN'), ('May', 'NNP'), ('2010', 'CD'), (',', ','), ('the', 'DT'), ('troops', 'NNS'), ('attempted', 'VBD'), ('to', 'TO'), ('invade', 'VB'), ('Kabul', 'NNP'), ('.', '.')], [('The', 'DT'), ('US', 'NNP'), ('army', 'NN'), ('invaded', 'VBD'), ('Kabul', 'NNP'), ('on', 'IN'), ('May', 'NNP'), ('7th', 'CD'), ('last', 'JJ'), ('year', 'NN'), (',', ','), ('2010', 'CD'), ('.', '.')]) 

3. ([('John', 'NNP'), ('said', 'VBD'), ('he', 'PRP'), ('is', 'VBZ'), ('considered', 'VBN'), ('a', 'DT'), ('witness', 'NN'), ('but', 'CC'), ('not', 'RB'), ('a', 'DT'), ('suspect', 'NN'), ('.', '.')], [('``', '``'), ('He', 'PRP'), ('is', 'VBZ'), ('not', 'RB'), ('a', 'DT'), ('suspect', 'NN'), ('anymore', 'RB'), ('.', '.'), ("''", "''
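The Penn Treebank tags shown above do not use the same inventory as WordNet, which only distinguishes nouns, verbs, adjectives and adverbs. The following is a minimal sketch of one possible mapping; `penn_to_wordnet` is a hypothetical helper introduced for illustration and is not the helper used later in this notebook.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet PoS letter, or None if there is none."""
    if tag.startswith('NN'):
        return 'n'  # nouns: NN, NNS, NNP, NNPS
    if tag.startswith('VB'):
        return 'v'  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith('JJ'):
        return 'a'  # adjectives: JJ, JJR, JJS
    if tag.startswith('RB'):
        return 'r'  # adverbs: RB, RBR, RBS
    return None     # closed-class words and punctuation


# Tagged tokens copied from the first sentence of pair 1.
tagged = [('The', 'DT'), ('bird', 'NN'), ('is', 'VBZ'), ('bathing', 'VBG'),
          ('in', 'IN'), ('the', 'DT'), ('sink', 'NN'), ('.', '.')]

# Keep only the words that have a WordNet PoS.
open_class = [(word, penn_to_wordnet(tag)) for word, tag in tagged
              if penn_to_wordnet(tag) is not None]
print(open_class)
```

Only *bird*, *is*, *bathing* and *sink* survive the filter, which matches the four words that can receive a synset in this sentence.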

## 2. Lesk's Algorithm
`lesk(context_sentence, ambiguous_word, pos=None, ...)` from `nltk` applies Lesk's algorithm for Word Sense Disambiguation (WSD), choosing among the senses of the ambiguous word by means of their definitions. Lesk's algorithm uses the following formula:

$ Lesk(w) = \arg\max_{s_i \in S(\{w\})} \sum_{s_j \in S(C(w))} |Def(s_i) \cap Def(s_j)| $

Where:

- $S(X)$ is the set of senses of all the lemmas in the set $X$.
- $C(w)$ is the set of lemmas in the context of the word $w$.
- $Def(s)$ is the set of lemmas in the definition of the sense $s$.

If `lesk(context_sentence, ambiguous_word, pos=None, ...)` returns `None` for a word (because its PoS is not an open-class category or because no synsets are available for it), that word is left out of the sentence's list of senses.
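To make the idea of definition overlap concrete, here is a minimal sketch of a simplified Lesk variant that compares the context words directly with each candidate gloss, rather than with the glosses of the context words as in the formula above. It is not NLTK's implementation: `simplified_lesk` and the toy glosses are hypothetical and written by hand for illustration only.

```python
def simplified_lesk(context_tokens, candidate_senses):
    """Pick the sense whose gloss shares the most words with the context.

    candidate_senses maps a sense name to its gloss (a definition string).
    Returns None when there are no candidate senses.
    """
    if not candidate_senses:
        return None
    context = {token.lower() for token in context_tokens}

    def overlap(sense):
        gloss_words = set(candidate_senses[sense].lower().split())
        return len(context & gloss_words)

    # max() keeps the first sense on ties, so dictionary order breaks ties.
    return max(candidate_senses, key=overlap)


# Toy glosses for illustration only; these are NOT WordNet definitions.
toy_senses = {
    'bank_river': 'sloping land beside a body of water',
    'bank_money': 'an institution that accepts deposits of money',
}
context = 'she sat on the land beside the water'.split()
print(simplified_lesk(context, toy_senses))
```

The river sense wins because its gloss shares *land*, *beside* and *water* with the context, while the money sense shares nothing. With short sentences and short glosses, such overlaps are often tiny, which is one reason the chosen sense can look arbitrary.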

In [5]:
# The lowercased first letter of a Penn Treebank tag is used as the WordNet PoS.
# 'n', 'v' and 'r' already match; adjectives ('j') must be translated to 'a'.
def get_correct_pos(pos):
    if pos == 'j':
        return 'a'
    return pos

# Disambiguate every word of a sentence and drop the words for which
# lesk() returns None (no synset for that word and PoS).
def disambiguate(tokens, tagged):
    synsets = [lesk(tokens, word, pos=get_correct_pos(tag[0].lower())) for word, tag in tagged]
    return [synset for synset in synsets if synset is not None]

pairs_lesk = [(disambiguate(tokens[0], tagged[0]), disambiguate(tokens[1], tagged[1]))
              for tokens, tagged in zip(pairs, pairs_pos)]
    
for index, pair in enumerate(pairs_lesk, 1):
    print(str(index) + ".", pair, '\n')

1. ([Synset('bird.n.02'), Synset('be.v.12'), Synset('bathe.v.01'), Synset('sinkhole.n.01')], [Synset('shuttlecock.n.01'), Synset('be.v.12'), Synset('wash.v.09'), Synset('body_of_water.n.01'), Synset('washbasin.n.01')]) 

2. ([Synset('whitethorn.n.01'), Synset('troop.n.02'), Synset('undertake.v.01'), Synset('invade.v.01'), Synset('kabul.n.01')], [Synset('uranium.n.01'), Synset('united_states_army.n.01'), Synset('invade.v.03'), Synset('kabul.n.01'), Synset('whitethorn.n.01'), Synset('last.a.02'), Synset('year.n.02')]) 

3. ([Synset('whoremaster.n.01'), Synset('suppose.v.01'), Synset('embody.v.02'), Synset('view.v.02'), Synset('witness.n.05'), Synset('not.r.01'), Synset('defendant.n.01')], [Synset('embody.v.02'), Synset('not.r.01'), Synset('defendant.n.01'), Synset('anymore.r.01'), Synset('whoremaster.n.01'), Synset('suppose.v.01')]) 

4. ([Synset('fly.v.12'), Synset('group.n.02')], [Synset('fly.v.10'), Synset('together.r.04')]) 

5. ([Synset('woman.n.02'), Synset('be.v.01'), Synset('play

## 3. Similarities with Senses and *Jaccard Coefficient*

As in the previous lab sessions, the *Jaccard Coefficient* is used to compute the similarity between the two sentences of each pair. The difference is that the sets being compared now contain senses (synsets) instead of words or lemmas.

In [6]:
similarities = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in pairs_lesk]

print("Similarities (considering senses):\n")
for index, similarity in enumerate(similarities, 1):
    print(str(index) + ".", similarity)

Similarities (considering senses):

1. 0.125
2. 0.19999999999999996
3. 0.625
4. 0.0
5. 0.0
6. 0.0
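As a sanity check on the values above, the score of the second pair can be reproduced by hand. The sketch below uses plain strings in place of `Synset` objects (copied from the Lesk output for pair 2); `jaccard_similarity` is a hypothetical helper, equivalent to `1 - jaccard_distance(...)` for non-empty sets.

```python
def jaccard_similarity(a, b):
    """Return |a ∩ b| / |a ∪ b| for two collections of hashable items."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sense sets are treated as dissimilar
    return len(a & b) / len(union)


# Sense names copied from the Lesk output for pair 2.
first = {'whitethorn.n.01', 'troop.n.02', 'undertake.v.01',
         'invade.v.01', 'kabul.n.01'}
second = {'uranium.n.01', 'united_states_army.n.01', 'invade.v.03',
          'kabul.n.01', 'whitethorn.n.01', 'last.a.02', 'year.n.02'}

# 2 shared senses out of 10 distinct ones.
print(jaccard_similarity(first, second))
```

Because `invade.v.01` and `invade.v.03` are different set elements, they contribute nothing to the intersection even though they come from the same lemma.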


## 4. Comparison with the previous results (words & lemmas)
To compare with the previous results, the similarities obtained in sessions 2 and 3 are listed below:

1) Similarities using words
```
1. 0.3076923076923077
2. 0.26315789473684215
3. 0.4666666666666667
4. 0.4545454545454546
5. 0.23076923076923073
6. 0.13793103448275867
```

2) Similarities using lemmas
```
1. 0.3076923076923077
2. 0.33333333333333337
3. 0.4666666666666667
4. 0.4545454545454546
5. 0.23076923076923073
6. 0.13793103448275867
```

Moving from words to lemmas only changes the second pair: lemmatization maps `invaded` to `invade`, so its similarity rises from 0.26 to 0.33. Consequently, the *Pearson Correlation Coefficient* improves from about 0.40 (similarities using words) to 0.49 (similarities using lemmas).

Nevertheless, when senses (*i.e.* synsets) are used, as in this session, the results are very different. The underlying reason is that the two sentences of a pair rarely share exactly the same synset. As a result, three pairs have no sense in common and get a similarity of 0 (pairs 4, 5 and 6), one pair gets a good similarity (pair 3), and two pairs that are supposed to be highly similar get a low score (pairs 1 and 2).

For example, the fourth pair of sentences can be studied:

1. *They flew out of the nest in groups.*
2. *They flew into the nest together.*

After applying Lesk's algorithm, they are transformed into the following synsets:

1. *[Synset('fly.v.12'), Synset('group.n.02')]*
2. *[Synset('fly.v.10'), Synset('together.r.04')]*

With words or lemmas, *flew* (or its lemma *fly*) is shared by both sentences and contributes to the similarity score. After sense disambiguation, however, each sentence gets a different synset for it.

Their definitions are compared below:

In [7]:
pairs_lesk[3][0][0].definition()

'travel over (an area of land or sea) in an aircraft'

In [8]:
pairs_lesk[3][1][0].definition()

'display in the air or cause to float'

The original sentences do differ somewhat in meaning (*fly out of the nest* vs. *fly into the nest*), but both use *fly* in essentially the same way, so a similarity score of `0` is too low for this example.

The same happens with *groups* vs. *together*.

The second pair of sentences, which does get a nonzero score using senses, produces the following *synsets* from these original sentences:

1. *In May 2010, the troops attempted to invade Kabul.*
2. *The US army invaded Kabul on May 7th last year, 2010.*


1. *[**Synset('whitethorn.n.01')**, Synset('troop.n.02'), Synset('undertake.v.01'), Synset('invade.v.01'), **Synset('kabul.n.01')**]*.
2. *[Synset('uranium.n.01'), Synset('united_states_army.n.01'), Synset('invade.v.03'), **Synset('kabul.n.01')**, **Synset('whitethorn.n.01')**, Synset('last.a.02'), Synset('year.n.02')]* 

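As can be seen, the synsets *kabul.n.01* and *whitethorn.n.01* appear in both sentences, which raises the similarity of this pair, whereas *invade* is assigned a different sense in each sentence (*invade.v.01* vs. *invade.v.03*). Note that *whitethorn.n.01* is the sense selected for *May*, so this match comes from the same questionable disambiguation on both sides. Nevertheless, a similarity score of `0.20` is also lower than expected in this case.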

## 5. Comparison to the Gold Standard

Finally, as in the other lab sessions being compared, the similarities are compared to the Gold Standard using the *Pearson Correlation Coefficient*.

According to `00-readme.txt`, two sentences that are completely equivalent should have a similarity of 1/1 (or 5/5), and two sentences that are on different topics should have a similarity of 0/1 (or 0/5). For that reason, the reference similarities should be [1.0, 0.8, 0.6, 0.4, 0.2, 0] or, proportionally, [5, 4, 3, 2, 1, 0].
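Either scale can be used, because the Pearson coefficient does not change when one of the variables is multiplied by a positive constant. The sketch below checks this in plain Python with a hypothetical `pearson` helper (the notebook itself uses `scipy.stats.pearsonr`), using the sense-based similarities obtained earlier with the second value rounded to 0.2.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Sense-based similarities from section 3 (second value rounded to 0.2).
sense_similarities = [0.125, 0.2, 0.625, 0.0, 0.0, 0.0]
gold_0_to_5 = [5, 4, 3, 2, 1, 0]
gold_0_to_1 = [g / 5 for g in gold_0_to_5]

r_five = pearson(gold_0_to_5, sense_similarities)
r_one = pearson(gold_0_to_1, sense_similarities)
print(round(r_five, 4), round(r_one, 4))
```

Both calls give the same coefficient (about 0.4065), so reading the gold standard as integers from 0 to 5 is enough.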

The values in the gold standard file `STS.gs.txt` are stored in reverse order, so they are read and then reversed to align them with the sentence pairs:

In [9]:
gs = list()
with open('trial/STS.gs.txt','r') as f:
    lines = f.readlines()
    for line in lines:
        line = nltk.TabTokenizer().tokenize(line.strip())
        gs.append(int(line[1]))

gs.reverse()
print("Gold standard:", gs)
print("Pearson correlation (sense):", pearsonr(gs, similarities)[0])

Gold standard: [5, 4, 3, 2, 1, 0]
Pearson correlation (sense): 0.40653613588298887


For comparison, these are the *Pearson correlation coefficients* obtained in the previous sessions:

```
Pearson correlation (words): 0.3962389776119232
Pearson correlation (lemmas): 0.490670810375692
```

These values suggest that, on this trial set, comparing pairs of sentences by word sense (0.41) is slightly better than comparing them using just words (0.40) and worse than comparing them using lemmas (0.49).

Nevertheless, it must be noted that the similarities obtained in this lab session are unstable. As explained before, the similarity sometimes drops to `0` (fourth pair of sentences) only because the words are assigned slightly different senses. Moreover, Lesk's algorithm occasionally fails to find any shared lemma between the definitions of the candidate senses and the context of the original word.