# Mandatory exercise

In [1]:
import nltk
from nltk.wsd import lesk
from nltk import pos_tag
from nltk.corpus import wordnet as wn
from nltk.metrics import jaccard_distance
from scipy.stats import pearsonr

## 1. Read all pairs of sentences of the trial set within the evaluation framework of the project.

First we read the trial set, which is located at `../trial/STS.input.txt`:

In [2]:
with open('../trial/STS.input.txt','r') as f:
    raw_text = f.read()

Once we have the raw text, it is better to convert it into a list:

In [3]:
#First we will separate each line of the text
formatted_ids = raw_text.split(sep='\n')
#Removing the final blank
formatted_ids.remove('')

#Then we are going to split once again each line to get 
#the sentences we need
formatted_words = [lst.lower().split(sep='\t') for lst in formatted_ids]
formatted_words

[['id1',
  'the bird is bathing in the sink.',
  'birdie is washing itself in the water basin.'],
 ['id2',
  'in may 2010, the troops attempted to invade kabul.',
  'the us army invaded kabul on may 7th last year, 2010.'],
 ['id3',
  'john said he is considered a witness but not a suspect.',
  '"he is not a suspect anymore." john said.'],
 ['id4',
  'they flew out of the nest in groups.',
  'they flew into the nest together.'],
 ['id5',
  'the woman is playing the violin.',
  'the young lady enjoys listening to the guitar.'],
 ['id6',
  'john went horse back riding at dawn with a whole group of friends.',
  'sunrise at dawn is a magnificent view to take in if you wake up early enough for it.']]

## 2. Apply Lesk’s algorithm to the words in the sentences.

In [1]:
#First we need the words:
test_words = [[formatted_words[i][0],nltk.word_tokenize(formatted_words[i][1]),
               nltk.word_tokenize(formatted_words[i][2])] for i in range(len(formatted_words))]
#Then their POS:
test_pos = [[i[0], pos_tag(i[1]), pos_tag(i[2])] for i in test_words]
print(test_pos)

NameError: name 'formatted_words' is not defined

Since Lesk's algorithm returns WordNet synsets, it can only be applied to words whose part of speech WordNet covers: nouns, verbs, adjectives and adverbs. Any other word is kept as it is.

In [6]:
valid_pairs = []
for i in range(len(test_pos)):
    valid_pairs.append([])
    valid_pairs[i].append(test_pos[i][0])
    for j in range(2):
        valid_pairs[i].append([])
        valid_pairs[i][j+1].extend(lesker_sentence(test_pos[i][j+1]))

In [5]:
def lesker_sentence(pos_tag_sentence):
    """
    Returns the given sentence with every disambiguated word replaced by
    the name of the synset chosen by Lesk's algorithm.
    The input sentence must be a POS-tagged sentence (e.g. [('The', 'DT'),
    ('sun', 'NN')]).
    """
    # First letter of the Penn Treebank tag -> WordNet part of speech
    # (noun, verb, adjective, adverb)
    tag_to_pos = {'N': wn.NOUN, 'V': wn.VERB, 'J': wn.ADJ, 'R': wn.ADV}
    sentence = [word for word, _ in pos_tag_sentence]
    final_sentence = []
    for word, tag in pos_tag_sentence:
        pos = tag_to_pos.get(tag[:1])
        # lesk returns None when it finds no synset for the word
        synset = lesk(sentence, word, pos) if pos else None
        # Keep the original word when no synset is available
        final_sentence.append(synset.name() if synset is not None else word)
    return final_sentence

## 3. Compute their similarities by considering senses and Jaccard coefficient.

Now that every word with a synset has been replaced by its most probable sense, we can compute the similarities:

In [7]:
similarities = [1.-jaccard_distance(set(i[1]), set(i[2])) for i in valid_pairs]
similarities

[0.33333333333333337,
 0.33333333333333337,
 0.5714285714285714,
 0.33333333333333337,
 0.16666666666666663,
 0.09999999999999998]
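
As a quick sanity check of what `1. - jaccard_distance(...)` measures, the sketch below computes the Jaccard coefficient directly from its set definition, the size of the intersection divided by the size of the union. The helper name `jaccard_similarity` and the two toy token lists are assumptions made for illustration; they are not part of the trial set.

```python
def jaccard_similarity(a, b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty sets count as identical
        return 1.0
    return len(a & b) / len(union)


# Toy token lists (hypothetical, not taken from the trial set)
first = ['the', 'bird', 'is', 'bathing']
second = ['the', 'bird', 'is', 'singing', 'loudly']

# 3 shared tokens out of 6 distinct tokens
print(jaccard_similarity(first, second))
```

Because the inputs are converted to sets, repeated tokens in a sentence count only once, which is also the case in the cell above.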

## 4. Compare the results with those in session 2 (document) and 3 (morphology) in which words and lemmas were considered.

In session 2 we got the following results, in which raw words were considered:
* **'id1': 0.3076923076923077**
* **'id2': 0.26315789473684215**
* **'id3': 0.4666666666666667**
* **'id4': 0.4545454545454546**
* **'id5': 0.23076923076923073**
* **'id6': 0.13793103448275867**

In session 3, using lemma similarity, we got:
* **'id1': 0.33333333333333337**
* **'id2': 0.4117647058823529**
* **'id3': 0.5714285714285714**
* **'id4': 0.4545454545454546**
* **'id5': 0.16666666666666663**
* **'id6': 0.13793103448275867**

As we can see, using synsets changes the similarities of several pairs. The main observations are:
* Same similarity in the first pair as with lemmas: even when synsets are considered, the corresponding words in the two sentences were not assigned the same synsets. If two words don't have the same sense, they don't count as a match.
* Lower similarity in the last pair: with word sense disambiguation, these two sentences have become less similar. Although they share a word, its senses differ in each sentence, so it no longer counts as a match.

The similarity of the second pair has also decreased compared to the lemma-based result. Finally, the similarity of the fourth pair has decreased as well, which seems a more appropriate score.
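
To illustrate why a shared word can stop counting as a match, the toy sketch below compares two token lists before and after sense tagging. The synset labels `dawn.n.01` and `dawn.n.02` are hypothetical stand-ins chosen for illustration; they are not the labels Lesk actually returned for the sixth pair.

```python
def overlap(a, b):
    """Jaccard coefficient of two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Raw words: 'dawn' appears in both lists and counts as a match
words_1 = ['riding', 'at', 'dawn']
words_2 = ['sunrise', 'at', 'dawn']

# Hypothetical sense-tagged versions: the two 'dawn' tokens get different labels
senses_1 = ['riding', 'at', 'dawn.n.01']
senses_2 = ['sunrise', 'at', 'dawn.n.02']

print(overlap(words_1, words_2))    # 2 shared tokens out of 4 distinct ones
print(overlap(senses_1, senses_2))  # 1 shared token out of 5 distinct ones
```

The sense-tagged lists share fewer tokens and have a larger union, so the coefficient drops on both counts.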

## 5. Compare the results with gold standard by giving the Pearson correlation between them.

In [8]:
#Reading the gold standard:
with open('../trial/STS.gs.txt','r') as f:
    plain_gs = f.read()

#First we will separate each line of the text
formatted_gs = plain_gs.split(sep='\n')
#Removing the final blank
formatted_gs.remove('')

words_gs = [lst.split(sep='\t') for lst in formatted_gs]
gold_standard = [int(i[1]) for i in words_gs]
gs_list = list(gold_standard)

In [9]:
pearsonr(gs_list, similarities)[0]

0.6206712050540985

As we can see, the correlation with the gold standard has increased noticeably compared to the previous results (the previous best was 0.579). Word sense disambiguation seems a good approach to semantic textual similarity. Even so, it is not good enough: this process keeps assigning a low similarity to the first pair.
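
For reference, the coefficient reported by `pearsonr` can be reproduced from its definition: the covariance of the two score lists divided by the product of their standard deviations. The sketch below is a minimal pure-Python version; the function name `pearson` and the two score lists are hypothetical and only show that the coefficient does not depend on the scale of either list, which is why Jaccard values can be compared with gold scores given on a different scale.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Unnormalised covariance and variances; the 1/n factors cancel out
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores on two different scales, perfectly linearly related
gold = [0, 1, 2, 3, 4, 5]
predicted = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
print(pearson(gold, predicted))
```

The sketch assumes neither list is constant; a constant list has zero variance and would cause a division by zero.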

To increase the similarity of those sentences, we could use more of the information in the synsets that we have just extracted. Comparing distances in the synset graph could give us a better similarity measure. For instance, two similar words might have different synsets, but those synsets would still be close to each other in the graph if the words are close in meaning.
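
The idea of graph distance can be sketched on a toy taxonomy. The example below scores two nodes as 1 / (1 + length of the shortest path between them), found with a breadth-first search. The edges, node names and function names are hypothetical and not taken from WordNet; on the real graph, NLTK's WordNet interface offers measures of this kind, for example `Synset.path_similarity`.

```python
from collections import deque

# Hypothetical is-a edges (child, parent); not real WordNet data
EDGES = [
    ('violin', 'string_instrument'),
    ('guitar', 'string_instrument'),
    ('string_instrument', 'instrument'),
    ('drum', 'instrument'),
]


def build_graph(edges):
    """Undirected adjacency map built from (child, parent) edges."""
    graph = {}
    for child, parent in edges:
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def path_similarity(graph, start, goal):
    """1 / (1 + shortest path length), or None if the nodes are unconnected."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return 1 / (1 + dist)
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


graph = build_graph(EDGES)
print(path_similarity(graph, 'violin', 'guitar'))  # path of length 2
print(path_similarity(graph, 'violin', 'drum'))    # path of length 3
```

With a score like this, two different but related senses would contribute a partial match instead of counting as completely different tokens, as they do with the Jaccard coefficient.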