## Mandatory Exercise - Session 6
### Olga Valls & Lavanya Mandadapu

Import the packages that we need:

In [1]:
import nltk
import re
from nltk.text import Text
from nltk.wsd import lesk
from nltk.metrics import jaccard_distance
from scipy import stats

Define a function to remove punctuation from the texts and convert them into lowercase:

In [2]:
def preprocess(line):
    # Remove all the digits and punctuation from data
    # decimal digit or not word+space = space
    line = re.sub(r'(\d|[^\w ])', ' ', line)
    # Remove spaces at the start of the sentence
    # spaces at the start (^) (+: one or more repetitions)
    line = re.sub(r'^[ ]+', '', line)
    # Convert all the texts to lower case
    line = line.lower()
    # Replace continuous white spaces by a single one
    line = re.sub(r'[ ]+', ' ', line)
    return line

Create empty lists to fill with the results of:
- Jaccard distances
- Similarities
- Golden records (extracted from file "STS.gs.txt")

In [3]:
jaccards = []
similarities = []
golden = []

### Tasks 1-3:
1. Read all pairs of sentences of the **trial set** within the evaluation framework of the project.
2. Apply **Lesk’s algorithm** to the words in the sentences.
3. Compute their **similarities** by considering senses and **Jaccard coefficient**.

For each line of the file
- Split the sentences
- Remove punctuation for each sentence and convert into lowercase
- Tokenize to create a list of words
- Create POS Tag pairs
- Apply Lesk algorithm for Word Sense Disambiguation
- Compute Jaccard Distance and Similarities between the sets of disambiguated synsets

Jaccard is a similarity between to sets (intersection / union)
As the python function is jaccard_distance(set1,set2), we compute the similarity:
D = 1 / (1+S), where S = Jaccard distance

In [4]:
txt_file = open('../00_data/trial/STS.input.txt', 'r')

for i, line in enumerate(txt_file):
    sentences = nltk.sent_tokenize(line)
    temp = sentences[0].split('\t')
    temp.append(sentences[1])
    # For each sentence, remove punctuation and convert to lowercase
    sent = []
    for t in temp:
        pt = preprocess(t)
        sent.append(pt)
    print("** Line{}: {}".format((i + 1), temp))

    # tokenizer
    words = [nltk.word_tokenize(s) for s in sent]
    sentence1 = words[1]
    sentence2 = words[2]
    print('- Sentence1: {}'.format(sentence1))
    print('- Sentence2: {}'.format(sentence2))

    # Apply Lesk’s algorithm to the words in the sentences.
    # POS tagging for each sentence
    pairs1 = nltk.pos_tag(sentence1)
    pairs2 = nltk.pos_tag(sentence2)
    print('- Pairs 1: {}'.format(pairs1))
    print('- Pairs 2: {}'.format(pairs2))

    # Apply Lesk Algortithm to both sentences
    temp1 = [lesk(sentence1, p[0], p[1][0].lower()) for p in pairs1]
    temp2 = [lesk(sentence2, p[0], p[1][0].lower()) for p in pairs2]
    # We keep those different from None (no duplicates)
    sentlesk1 = [t for t in temp1 if t is not None]
    sentlesk2 = [t for t in temp2 if t is not None]
    print('- Sentlesk 1: {}'.format(sentlesk1))
    print('- Sentlesk 2: {}'.format(sentlesk2))

    # Compute their similarities by considering senses and Jaccard coefficient.
    # jaccard_distance() of the Lesk WSD of each sentence
    jd = jaccard_distance(set(sentlesk1), set(sentlesk2))
    jaccards.append(jd)
    print('> Jaccard Distance: {}'.format(jd))
    # similarities: 1 / (1+S) (where S = Jaccard distance)
    sim = 1 / (1 + jd)
    sim2 = 1 - jd
    similarities.append(sim)
    print('> Similarities (1 / (1 + jd)): {}'.format(sim))
    
txt_file.close()

** Line1: ['id1', 'The bird is bathing in the sink.', 'Birdie is washing itself in the water basin.']
- Sentence1: ['the', 'bird', 'is', 'bathing', 'in', 'the', 'sink']
- Sentence2: ['birdie', 'is', 'washing', 'itself', 'in', 'the', 'water', 'basin']
- Pairs 1: [('the', 'DT'), ('bird', 'NN'), ('is', 'VBZ'), ('bathing', 'VBG'), ('in', 'IN'), ('the', 'DT'), ('sink', 'NN')]
- Pairs 2: [('birdie', 'NN'), ('is', 'VBZ'), ('washing', 'VBG'), ('itself', 'PRP'), ('in', 'IN'), ('the', 'DT'), ('water', 'NN'), ('basin', 'NN')]
- Sentlesk 1: [Synset('bird.n.02'), Synset('be.v.12'), Synset('bathe.v.01'), Synset('sinkhole.n.01')]
- Sentlesk 2: [Synset('shuttlecock.n.01'), Synset('be.v.12'), Synset('wash.v.09'), Synset('body_of_water.n.01'), Synset('washbasin.n.01')]
> Jaccard Distance: 0.875
> Similarities (1 / (1 + jd)): 0.5333333333333333
** Line2: ['id2', 'In May 2010, the troops attempted to invade Kabul.', 'The US army invaded Kabul on May 7th last year, 2010.']
- Sentence1: ['in', 'may', 'the',

In [5]:
print('Jaccard distances: {}'.format(jaccards))
print('Similarities (1 / (1 + jd)): {}'.format(similarities))

Jaccard distances: [0.875, 0.8, 0.625, 1.0, 1.0, 1.0]
Similarities (1 / (1 + jd)): [0.5333333333333333, 0.5555555555555556, 0.6153846153846154, 0.5, 0.5, 0.5]


### 4. Compare the results with those in session 2 (document) and 3 (morphology) in which words and lemmas were considered.

Being the <u>pairs of sentences</u>:<br>
id1	- The bird is bathing in the sink. - Birdie is washing itself in the water basin.<br>
id2	- In May 2010, the troops attempted to invade Kabul. - The US army invaded Kabul on May 7th last year, 2010.<br>
id3	- John said he is considered a witness but not a suspect. - "He is not a suspect anymore." John said.<br>
id4	- They flew out of the nest in groups. - They flew into the nest together.<br>
id5	- The woman is playing the violin. - The young lady enjoys listening to the guitar.<br>
id6	- John went horse back riding at dawn with a whole group of friends. - Sunrise at dawn is a magnificent view to take in if you wake up early enough for it.

We have to take into account the <u>pre-processing</u> that we have made to the sentences:
- remove punctuation
- convert to lowercase
For all cases, we have computed the Similarity as 1 / (1 + jd) , being jd: Jaccard Distance.

From this <u>formula</u> we understand that:
- when jd = 0, the similarity gets the maximum value, which is 1.
- when jd = 1, the similarity gets the minimum value, which is 0.5
Thus, two words/lemmas/sense that are equal should get a similarity value of 1; and 0.5 when they are totally different.

If we can compare the results obtained, we can see that the values for words/lemmas are always higher than the values for senses.

For **words** and **lemmas** we get the same results:<br>
<u>Jaccard distances</u>: [0.7272727272727273, 0.8, 0.5454545454545454, 0.6, 0.9090909090909091, 0.8928571428571429<br>
<u>Similarities (1 / (1 + jd))</u>: [0.5789473684210527, 0.5555555555555556, 0.6470588235294118, 0.625, 0.5238095238095238, 0.5283018867924528]<br>

The <u>ranking</u> for the sentences would be: id3, id4, id1, id2, id6, id5

For disambiguated **senses** we get the following results:<br>
<u>Jaccard distances</u>: [0.875, 0.8, 0.625, 1.0, 1.0, 1.0]<br>
<u>Similarities (1 / (1 + jd))</u>: [0.5333333333333333, 0.5555555555555556, 0.6153846153846154, 0.5, 0.5, 0.5]<br>

The <u>ranking</u> for the sentences would be: id3, id2, id1, id4/id5/id6<br>

Here we can see that, acording to Lesk, the last three pairs of sentences are totally different between them.

id3 always get the best ranking position, id1 is always in the middle of the ranking, and id5 in the lower position.

### 5. Compare the results with gold standard by giving the pearson correlation between them.

Open the Golden Records file to extract the values:

In [6]:
# Golden Records file
golden_file = open('../00_data/trial/STS.gs.txt', 'r')

for line in golden_file:
    cols = nltk.sent_tokenize(line)
    columnes = cols[0].split('\t')
    golden.append(columnes[1])

golden_file.close()

golden = list(map(int, golden))  # convert int strings into int

In [7]:
print('Golden Standards: {}'.format(golden))

Golden Standards: [0, 1, 2, 3, 4, 5]


**Pearson Correlation**<br>
It shows the linear relationship between two sets of data. That means: the strength of the association between the two variables.
It has a value between +1 and −1, where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation

Coefficient Value -- Strength of Association<br>
0.1 < | r | < .3 -- small correlation<br>
0.3 < | r | < .5 -- medium/moderate correlation<br>
       | r | > .5 -- large/strong correlation

In [8]:
# Pearson correlation, with golden standards and similarities
pearson_s = stats.pearsonr(similarities, golden)[0]
print('==> Pearson Correlation (with Similarities (1 / (1 + jd))): {}'.format(pearson_s))

==> Pearson Correlation (with Similarities (1 / (1 + jd))): -0.5219919882301676


Being the pairs of sentences:<br>
id1	- The bird is bathing in the sink. - Birdie is washing itself in the water basin.<br>
id2	- In May 2010, the troops attempted to invade Kabul. - The US army invaded Kabul on May 7th last year, 2010.<br>
id3	- John said he is considered a witness but not a suspect. - "He is not a suspect anymore." John said.<br>
id4	- They flew out of the nest in groups. - They flew into the nest together.<br>
id5	- The woman is playing the violin. - The young lady enjoys listening to the guitar.<br>
id6	- John went horse back riding at dawn with a whole group of friends. - Sunrise at dawn is a magnificent view to take in if you wake up early enough for it.

According to **Golden Records file**, the ranking from "most similar" to "less similar" sentences would be:<br>
id1, id2, id3, id4, id5, id6

We have seen that we have been getting the following **Similarities**:
- for <u>Words</u> and <u>Lemmas</u>: id3, id4, id1, id2, id6, id5
- for <u>Senses</u>: id3, id2, id1, id4/id5/id6

and the following **Pearson Correlations**:
- for <u>Words</u> and <u>Lemmas</u>: 0.3902992700114095
- for <u>Senses</u>: 0.5219919882301676

For **words** and **lemmas** we have a <u>small/medium</u> correlation, whereas using Lesk for disambiguating of **senses** we have a <u>moderate</u> correlation.