# IHLT - Lab 2

1. Read all pairs of sentences of the trial set within the evaluation framework of the project.

2. Compute their similarities by considering words and Jaccard distance. A distance should be obtained for each pair of sentences (a vector of similarities).

3. Compare the previous results with the gold standard by giving the Pearson correlation between them. Only a global measure should be obtained from all previous distances.

Notes:

- Read the file 00-readme.txt of the trial data set to prepare the exercise.

- Justify the answer.

## Imports

In [1]:
import re

import nltk

from nltk.metrics import jaccard_distance
from scipy.stats import pearsonr
from nltk.corpus import stopwords

## 1. Reading the sentence pairs

First, the sentence pairs are read from `trial/STS.input.txt`.

In [4]:
pairs = []
with open('trial/STS.input.txt', 'r') as f:
    for line in f:
        # Split each line on tabs; fields 1 and 2 hold the two sentences.
        fields = nltk.TabTokenizer().tokenize(line.strip())
        pairs.append((fields[1], fields[2]))

# Print the pairs numbered from 1.
for index, pair in enumerate(pairs):
    print(str(index + 1) + ".", pair)

1. ('The bird is bathing in the sink.', 'Birdie is washing itself in the water basin.')
2. ('In May 2010, the troops attempted to invade Kabul.', 'The US army invaded Kabul on May 7th last year, 2010.')
3. ('John said he is considered a witness but not a suspect.', '"He is not a suspect anymore." John said.')
4. ('They flew out of the nest in groups.', 'They flew into the nest together.')
5. ('The woman is playing the violin.', 'The young lady enjoys listening to the guitar.')
6. ('John went horse back riding at dawn with a whole group of friends.', 'Sunrise at dawn is a magnificent view to take in if you wake up early enough for it.')


Then, each sentence is tokenized into words with `nltk.word_tokenize`.

In [9]:
pairs = [(nltk.word_tokenize(p[0]), nltk.word_tokenize(p[1])) for p in pairs]

for index, pair in enumerate(pairs):
    print(str(index + 1) + ".", pair, '\n')

1. (['The', 'bird', 'is', 'bathing', 'in', 'the', 'sink', '.'], ['Birdie', 'is', 'washing', 'itself', 'in', 'the', 'water', 'basin', '.']) 

2. (['In', 'May', '2010', ',', 'the', 'troops', 'attempted', 'to', 'invade', 'Kabul', '.'], ['The', 'US', 'army', 'invaded', 'Kabul', 'on', 'May', '7th', 'last', 'year', ',', '2010', '.']) 

3. (['John', 'said', 'he', 'is', 'considered', 'a', 'witness', 'but', 'not', 'a', 'suspect', '.'], ['``', 'He', 'is', 'not', 'a', 'suspect', 'anymore', '.', "''", 'John', 'said', '.']) 

4. (['They', 'flew', 'out', 'of', 'the', 'nest', 'in', 'groups', '.'], ['They', 'flew', 'into', 'the', 'nest', 'together', '.']) 

5. (['The', 'woman', 'is', 'playing', 'the', 'violin', '.'], ['The', 'young', 'lady', 'enjoys', 'listening', 'to', 'the', 'guitar', '.']) 

6. (['John', 'went', 'horse', 'back', 'riding', 'at', 'dawn', 'with', 'a', 'whole', 'group', 'of', 'friends', '.'], ['Sunrise', 'at', 'dawn', 'is', 'a', 'magnificent', 'view', 'to', 'take', 'in', 'if', 'you', '

## 2. Similarities computation

The [*Jaccard distance*](https://www.nltk.org/api/nltk.metrics.html#nltk.metrics.distance.jaccard_distance) between the token sets of the two sentences is used to compute the similarity of each pair:

$ Similarity = 1 - Jaccard_{Distance} $
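
As a sanity check on the formula, the same similarity can be computed directly from set operations: the size of the intersection divided by the size of the union. The sketch below is a minimal illustration using the tokens of the fourth pair shown earlier; the helper name `jaccard_similarity` is hypothetical and not part of NLTK.

```python
def jaccard_similarity(a, b):
    """Return |a & b| / |a | b| for two token collections (1 - Jaccard distance)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(a & b) / len(a | b)

# Tokens of the fourth pair, as printed in the tokenization output.
s1 = ['They', 'flew', 'out', 'of', 'the', 'nest', 'in', 'groups', '.']
s2 = ['They', 'flew', 'into', 'the', 'nest', 'together', '.']

# 5 shared tokens out of 11 distinct ones.
print(jaccard_similarity(s1, s2))
```

Converting each sentence to a set discards word order and repeated words, so only the presence or absence of each token matters.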

In [10]:
similarities = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in pairs]

print("Similarities:\n")
for index, similarity in enumerate(similarities):
    print(str(index + 1) + ".", similarity)

Similarities:

1. 0.3076923076923077
2. 0.26315789473684215
3. 0.4666666666666667
4. 0.4545454545454546
5. 0.23076923076923073
6. 0.13793103448275867


## 3. Results comparison with gold standard

Finally, the similarities are compared with the gold standard to measure how well they agree with the human judgments.

The similarity between two completely equivalent sentences should be 1, and the similarity between two sentences on different topics should be 0. According to `STS.gs.txt` and `00-readme.txt`, the reference values for the six trial pairs are therefore `5, 4, 3, 2, 1, 0`, or equivalently `1.0, 0.8, 0.6, 0.4, 0.2, 0` when rescaled to the range of the computed similarities. Both versions are proportional, so they yield the same Pearson correlation.

In [11]:
# Gold-standard values for the six trial pairs: first on the 0-5 scale, then rescaled to [0, 1].
print("Pearson correlation:", pearsonr([5, 4, 3, 2, 1, 0], similarities)[0])
print("Pearson correlation:", pearsonr([1, 0.8, 0.6, 0.4, 0.2, 0], similarities)[0])

Pearson correlation: 0.3962389776119232
Pearson correlation: 0.3962389776119233


The [*Pearson correlation coefficient*](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html) varies between -1 and +1, with 0 implying no linear correlation. Values of -1 and +1 imply an exact linear relationship.
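
To make the definition concrete, the coefficient can be recomputed by hand as the covariance of the two vectors divided by the product of their standard deviations. The following is a minimal sketch in plain Python using the similarities printed in section 2; the function name `pearson` is hypothetical, and `scipy.stats.pearsonr` remains the function actually used in this notebook.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    # Unnormalized covariance divided by the product of unnormalized deviations.
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

# Similarities copied from the output of section 2.
sims = [0.3076923076923077, 0.26315789473684215, 0.4666666666666667,
        0.4545454545454546, 0.23076923076923073, 0.13793103448275867]

r_5 = pearson([5, 4, 3, 2, 1, 0], sims)
r_1 = pearson([1, 0.8, 0.6, 0.4, 0.2, 0], sims)
print(round(r_5, 4), round(r_1, 4))
```

Both calls agree with the value reported by `pearsonr`, because the coefficient is unchanged when one vector is rescaled by a positive constant.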


The correlation between the obtained results and the gold standard is low (about 0.40) because the *Jaccard distance* over tokenized words only captures exact word overlap, which is not enough to decide whether two sentences mean the same thing.

For example, the third pair gets a good result (0.47 vs. a gold value of 0.6) because several words appear in both sentences:

1. '**John said he is** considered a witness but **not a suspect.**'
2. '"**He is not a suspect** anymore". **John said.**'

Another example is the fifth pair (0.23 vs. a gold value of 0.2), where the result is also good because only a few words appear in both sentences:

1. '**The** woman is playing **the** violin.'
2. '**The** young lady enjoys listening to **the** guitar.'

Nevertheless, when two sentences mean the same thing but use different words (as in the first pair, 0.31 vs. a gold value of 1), relying on the *Jaccard distance* alone fails:

1. 'The bird **is** bathing **in the** sink.'
2. 'Birdie **is** washing itself **in the** water basin.'
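
The low score can be traced to the token sets themselves. As a small illustration (the variable names are hypothetical), the snippet below lists which tokens of the first pair are shared. The comparison is case-sensitive, so `'The'` and `'the'` count as different tokens, and the final period counts as a shared token.

```python
# Token sets of the first pair, taken from the tokenization output.
s1 = {'The', 'bird', 'is', 'bathing', 'in', 'the', 'sink', '.'}
s2 = {'Birdie', 'is', 'washing', 'itself', 'in', 'the', 'water', 'basin', '.'}

shared = s1 & s2
print(sorted(shared))
print(len(shared), "shared out of", len(s1 | s2), "distinct tokens")
```

None of the content words match (bird/Birdie, bathing/washing, sink/water basin), so only function words and punctuation contribute to the overlap.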

Other elements, such as sentiment analysis or sentence context, should be studied in order to get a higher Pearson correlation coefficient between the obtained and expected results.

Finally, what would the value of the *Pearson correlation coefficient* be if every sentence were lowercased, stopwords were removed, and only tokens containing word characters were kept?

In [None]:
sw = set(stopwords.words('english'))

# Lowercase each token; drop punctuation-only tokens and English stopwords.
new_pairs = list()
for pair in pairs:
    new_pairs.append(([w.lower() for w in pair[0] if re.search(r"\w", w) and w.lower() not in sw],
                     [w.lower() for w in pair[1] if re.search(r"\w", w) and w.lower() not in sw]))

# Recompute the Jaccard similarities on the filtered tokens.
similarities = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in new_pairs]

for index, pair in enumerate(new_pairs):
    print(str(index + 1) + ".", pair, '\n')

print("Similarities:\n")
for index, similarity in enumerate(similarities):
    print(str(index + 1) + ".", similarity)

print("\nPearson correlation:", pearsonr([5, 4, 3, 2, 1, 0], similarities)[0])


1. (['bird', 'bathing', 'sink'], ['birdie', 'washing', 'water', 'basin']) 

2. (['may', '2010', 'troops', 'attempted', 'invade', 'kabul'], ['us', 'army', 'invaded', 'kabul', 'may', '7th', 'last', 'year', '2010']) 

3. (['john', 'said', 'considered', 'witness', 'suspect'], ['suspect', 'anymore', 'john', 'said']) 

4. (['flew', 'nest', 'groups'], ['flew', 'nest', 'together']) 

5. (['woman', 'playing', 'violin'], ['young', 'lady', 'enjoys', 'listening', 'guitar']) 

6. (['john', 'went', 'horse', 'back', 'riding', 'dawn', 'whole', 'group', 'friends'], ['sunrise', 'dawn', 'magnificent', 'view', 'take', 'wake', 'early', 'enough']) 

Similarities:

1. 0.0
2. 0.25
3. 0.5
4. 0.5
5. 0.0
6. 0.0625

Pearson correlation: 0.09894548898363074


As the correlation coefficient shows, the result is worse: it drops from about 0.40 to about 0.10. Consequently, keeping stopwords and other elements such as punctuation can improve the results in these cases, as they provide shared structure that helps to compare the sentences. Nevertheless, as mentioned above, other features should be analyzed to get a higher *Pearson correlation coefficient*.