# IHLT - Lab 3

1. Read all pairs of sentences of the trial set within the evaluation framework of the project.

2. Compute their similarities by considering lemmas and Jaccard distance.

3. Compare the results with those in session 2 (document structure) in which words were considered.

4. Compare the results with the gold standard by giving the Pearson correlation between them.

5. Questions (justify the answers):

    - Which is better: words or lemmas?

    - Do you think that could perform better for any pair of texts?

## Imports

In [3]:
import nltk, re

from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.metrics import jaccard_distance
from nltk.stem import WordNetLemmatizer
from scipy.stats import pearsonr

## 1. Data preparation

As in the previous lab session, all sentence pairs of the trial set are read and stored in a list.

In [12]:
pairs = list()
with open('STS.input.txt','r') as f:
    lines = f.readlines()
    for line in lines:
        line = nltk.TabTokenizer().tokenize(line.strip())
        pairs.append((line[1], line[2]))
        
for index, pair in enumerate(pairs):
    print(str(index + 1) + ".", pair)

1. ('The bird is bathing in the sink.', 'Birdie is washing itself in the water basin.')
2. ('In May 2010, the troops attempted to invade Kabul.', 'The US army invaded Kabul on May 7th last year, 2010.')
3. ('John said he is considered a witness but not a suspect.', '"He is not a suspect anymore." John said.')
4. ('They flew out of the nest in groups.', 'They flew into the nest together.')
5. ('The woman is playing the violin.', 'The young lady enjoys listening to the guitar.')
6. ('John went horse back riding at dawn with a whole group of friends.', 'Sunrise at dawn is a magnificent view to take in if you wake up early enough for it.')


In [13]:
print(pairs[0][0])

The bird is bathing in the sink.


Then, each sentence is tokenized into words.

In [14]:
pairs = [(nltk.word_tokenize(p[0]), nltk.word_tokenize(p[1])) for p in pairs]

for index, pair in enumerate(pairs):
    print(str(index + 1) + ".", pair, '\n')

1. (['The', 'bird', 'is', 'bathing', 'in', 'the', 'sink', '.'], ['Birdie', 'is', 'washing', 'itself', 'in', 'the', 'water', 'basin', '.']) 

2. (['In', 'May', '2010', ',', 'the', 'troops', 'attempted', 'to', 'invade', 'Kabul', '.'], ['The', 'US', 'army', 'invaded', 'Kabul', 'on', 'May', '7th', 'last', 'year', ',', '2010', '.']) 

3. (['John', 'said', 'he', 'is', 'considered', 'a', 'witness', 'but', 'not', 'a', 'suspect', '.'], ['``', 'He', 'is', 'not', 'a', 'suspect', 'anymore', '.', "''", 'John', 'said', '.']) 

4. (['They', 'flew', 'out', 'of', 'the', 'nest', 'in', 'groups', '.'], ['They', 'flew', 'into', 'the', 'nest', 'together', '.']) 

5. (['The', 'woman', 'is', 'playing', 'the', 'violin', '.'], ['The', 'young', 'lady', 'enjoys', 'listening', 'to', 'the', 'guitar', '.']) 

6. (['John', 'went', 'horse', 'back', 'riding', 'at', 'dawn', 'with', 'a', 'whole', 'group', 'of', 'friends', '.'], ['Sunrise', 'at', 'dawn', 'is', 'a', 'magnificent', 'view', 'to', 'take', 'in', 'if', 'you', '

In this session, lemmas are considered instead of words. They are computed with the class `WordNetLemmatizer` from `nltk`, which takes the part of speech of each token as input, so the tokens are POS-tagged first.

In [4]:
wnl = WordNetLemmatizer()

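def lemmatize(p, lower=False):
    # p is a (token, Penn Treebank tag) pair produced by pos_tag.
    try:
        # WordNet expects 'n', 'v', 'a', 'r' or 's'; the first letter of the
        # Penn tag is used as an approximation. Successful lookups are
        # always lowercased.
        return wnl.lemmatize(p[0].lower(), pos=p[1][0].lower())
    except Exception:
        # Tags with no WordNet equivalent (e.g. 'DT', 'IN', punctuation)
        # end up here and the token is returned without lemmatization.
        if lower:
            return p[0].lower()
        return p[0]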
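
The helper above passes the first letter of the Penn Treebank tag to WordNet, so a tag such as `JJ` yields `j`, which is not one of the WordNet part-of-speech codes, and the token is returned without lemmatization. A minimal sketch of an explicit mapping is given here, assuming the usual WordNet codes `n`, `v`, `a` and `r`; the name `penn_to_wordnet` is hypothetical and is not used elsewhere in this lab:

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, or None if there is no match."""
    if tag.startswith('NN'):
        return 'n'   # nouns: NN, NNS, NNP, NNPS
    if tag.startswith('VB'):
        return 'v'   # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith('JJ'):
        return 'a'   # adjectives: JJ, JJR, JJS
    if tag.startswith('RB'):
        return 'r'   # adverbs: RB, RBR, RBS
    return None      # determiners, prepositions, punctuation, etc.


for tag in ['NNS', 'VBD', 'JJ', 'RB', 'DT']:
    print(tag, '->', penn_to_wordnet(tag))
```

With such a mapping, the lemmatizer would only be called when the result is not `None`. Adjectives would then be lemmatized too, which may change the similarities reported in this notebook.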

In [5]:
pairs = [(pos_tag(p[0]), pos_tag(p[1])) for p in pairs]

for index, pair in enumerate(pairs):
    print(str(index + 1) + ".", pair, '\n')

1. ([('The', 'DT'), ('bird', 'NN'), ('is', 'VBZ'), ('bathing', 'VBG'), ('in', 'IN'), ('the', 'DT'), ('sink', 'NN'), ('.', '.')], [('Birdie', 'NNP'), ('is', 'VBZ'), ('washing', 'VBG'), ('itself', 'PRP'), ('in', 'IN'), ('the', 'DT'), ('water', 'NN'), ('basin', 'NN'), ('.', '.')]) 

2. ([('In', 'IN'), ('May', 'NNP'), ('2010', 'CD'), (',', ','), ('the', 'DT'), ('troops', 'NNS'), ('attempted', 'VBD'), ('to', 'TO'), ('invade', 'VB'), ('Kabul', 'NNP'), ('.', '.')], [('The', 'DT'), ('US', 'NNP'), ('army', 'NN'), ('invaded', 'VBD'), ('Kabul', 'NNP'), ('on', 'IN'), ('May', 'NNP'), ('7th', 'CD'), ('last', 'JJ'), ('year', 'NN'), (',', ','), ('2010', 'CD'), ('.', '.')]) 

3. ([('John', 'NNP'), ('said', 'VBD'), ('he', 'PRP'), ('is', 'VBZ'), ('considered', 'VBN'), ('a', 'DT'), ('witness', 'NN'), ('but', 'CC'), ('not', 'RB'), ('a', 'DT'), ('suspect', 'NN'), ('.', '.')], [('``', '``'), ('He', 'PRP'), ('is', 'VBZ'), ('not', 'RB'), ('a', 'DT'), ('suspect', 'NN'), ('anymore', 'RB'), ('.', '.'), ("''", "''

In [6]:
sw = set(stopwords.words('english'))

t_pairs = list()
t_pairs_lower = list()
t_pairs_lower_no_sw = list()
t_pairs_lower_ow = list()

for pair in pairs:
    t_pairs.append(([lemmatize(word) for word in pair[0]], [lemmatize(word) for word in pair[1]]))
    t_pairs_lower.append(([lemmatize(word, True) for word in pair[0]],
                          [lemmatize(word, True) for word in pair[1]]))
    t_pairs_lower_no_sw.append(([lemmatize(word, True) for word in pair[0] if word[0].lower() not in sw],
                                [lemmatize(word, True) for word in pair[1] if word[0].lower() not in sw]))
    t_pairs_lower_ow.append(([lemmatize(word, True) for word in pair[0] if re.search(r"\w", word[0])],
                             [lemmatize(word, True) for word in pair[1] if re.search(r"\w", word[0])]))
    
for index, t_pair in enumerate(t_pairs):
    print(str(index + 1) + ".", t_pair, '\n')

1. (['The', 'bird', 'be', 'bath', 'in', 'the', 'sink', '.'], ['birdie', 'be', 'wash', 'itself', 'in', 'the', 'water', 'basin', '.']) 

2. (['In', 'may', '2010', ',', 'the', 'troop', 'attempt', 'to', 'invade', 'kabul', '.'], ['The', 'u', 'army', 'invade', 'kabul', 'on', 'may', '7th', 'last', 'year', ',', '2010', '.']) 

3. (['john', 'say', 'he', 'be', 'consider', 'a', 'witness', 'but', 'not', 'a', 'suspect', '.'], ['``', 'He', 'be', 'not', 'a', 'suspect', 'anymore', '.', "''", 'john', 'say', '.']) 

4. (['They', 'fly', 'out', 'of', 'the', 'nest', 'in', 'group', '.'], ['They', 'fly', 'into', 'the', 'nest', 'together', '.']) 

5. (['The', 'woman', 'be', 'play', 'the', 'violin', '.'], ['The', 'young', 'lady', 'enjoy', 'listen', 'to', 'the', 'guitar', '.']) 

6. (['john', 'go', 'horse', 'back', 'rid', 'at', 'dawn', 'with', 'a', 'whole', 'group', 'of', 'friend', '.'], ['sunrise', 'at', 'dawn', 'be', 'a', 'magnificent', 'view', 'to', 'take', 'in', 'if', 'you', 'wake', 'up', 'early', 'enough',

## 2. Similarities computation

After lemmatizing the tokens, the similarities are computed following the same steps as in the previous lab session: the [*Jaccard distance*](https://www.nltk.org/api/nltk.metrics.html#nltk.metrics.distance.jaccard_distance) between the two token sets of each pair is converted into a similarity.

$ \text{Similarity} = 1 - \text{Jaccard}_{\text{distance}} $

In [7]:
similarities_l = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in t_pairs]
similarities_l_lower = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in t_pairs_lower]
similarities_l_lower_no_sw = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in t_pairs_lower_no_sw]
similarities_l_lower_ow = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in t_pairs_lower_ow]

print("Similarities (considering lemmas):\n")
for index, similarity in enumerate(similarities_l):
    print(str(index + 1) + ".", similarity)

Similarities (considering lemmas):

1. 0.3076923076923077
2. 0.33333333333333337
3. 0.4666666666666667
4. 0.4545454545454546
5. 0.23076923076923073
6. 0.13793103448275867


### 2.1 Similarities comparison with Session 2

If each similarity is computed using the *Jaccard distance* of pairs tokenized by words (Session 2), the following results are obtained:

In [8]:
# `pairs` now holds (token, tag) tuples, so only the tokens are compared.
similarities_w = [1 - jaccard_distance({w for w, _ in p[0]}, {w for w, _ in p[1]}) for p in pairs]

print("Similarities (considering words):\n")
for index, similarity in enumerate(similarities_w):
    print(str(index + 1) + ".", similarity)

Similarities (considering words):

1. 0.3076923076923077
2. 0.26315789473684215
3. 0.4666666666666667
4. 0.4545454545454546
5. 0.23076923076923073
6. 0.13793103448275867


As the results show, only the similarity of the second pair increases when lemmas are used, from 0.26 to 0.33. This is an improvement, since the expected value for this pair is 0.8 on a 0 to 1 scale (4 on a 0 to 5 scale).

The reason is that the second pair is the only one in which lemmatization adds a token to the intersection of the two sets: *invaded* becomes *invade*, which matches *invade* in the first sentence:

**Original Sentences**
- 'In May 2010, the troops attempted to *invade* Kabul.'
- 'The US army *invaded* Kabul on May 7th last year, 2010.'

**Sentences tokenized by lemmas**
- 'In may 2010, the troop attempt to **invade** kabul.'
- 'The u army **invade** kabul on may 7th last year, 2010. '
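
The effect on the second pair can be checked by counting shared and distinct tokens directly. The following sketch uses plain Python sets instead of `nltk`; the helper name `jaccard_similarity` is an assumption made for this illustration, and the token lists are copied from the outputs shown earlier:

```python
def jaccard_similarity(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Second pair, tokenized by words and by lemmas.
words_1 = ['In', 'May', '2010', ',', 'the', 'troops', 'attempted', 'to', 'invade', 'Kabul', '.']
words_2 = ['The', 'US', 'army', 'invaded', 'Kabul', 'on', 'May', '7th', 'last', 'year', ',', '2010', '.']
lemmas_1 = ['In', 'may', '2010', ',', 'the', 'troop', 'attempt', 'to', 'invade', 'kabul', '.']
lemmas_2 = ['The', 'u', 'army', 'invade', 'kabul', 'on', 'may', '7th', 'last', 'year', ',', '2010', '.']

print(sorted(set(words_1) & set(words_2)))     # 5 shared tokens out of 19 distinct
print(sorted(set(lemmas_1) & set(lemmas_2)))   # 6 shared tokens out of 18 distinct
print(jaccard_similarity(words_1, words_2))    # 5/19
print(jaccard_similarity(lemmas_1, lemmas_2))  # 6/18
```

The extra shared token is `invade`, which moves the similarity from 5/19 to 6/18.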

## 3. Results comparison with gold standard
Finally, as in the previous session, the similarities are compared with the gold standard to assess how well they match the reference scores and whether lemmatization improves the results.

According to `00-readme.txt`, the similarity between two sentences that are completely equivalent should be 1/1 (or 5/5) and the similarity between two sentences that are on different topics should be 0/1 (or 0/5). For that reason, the reference similarities should be `[1.0, 0.8, 0.6, 0.4, 0.2, 0]` or its proportional values `[5, 4, 3, 2, 1, 0]`. 

The values in the gold standard file `STS.gs.txt` are listed in reverse order, so they are read and then reversed to align them with the sentence pairs:

In [9]:
gs = list()
with open('trial/STS.gs.txt','r') as f:
    lines = f.readlines()
    for line in lines:
        line = nltk.TabTokenizer().tokenize(line.strip())
        gs.append(int(line[1]))

gs.reverse()
print("Gold standard:", gs)
print("Pearson correlation (words):", pearsonr(gs, similarities_w)[0])
print("Pearson correlation (lemmas):", pearsonr(gs, similarities_l)[0])

Gold standard: [5, 4, 3, 2, 1, 0]
Pearson correlation (words): 0.3962389776119232
Pearson correlation (lemmas): 0.490670810375692


In the next cell, the [*Pearson correlation coefficients*](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html) are computed for several preprocessing variants applied together with lemmatization:

In [10]:
print("Pearson correlation (lemmas + lowercase):", 
      pearsonr(gs, similarities_l_lower)[0])
print("Pearson correlation (lemmas + lowercase + no stopwords):", 
      pearsonr(gs, similarities_l_lower_no_sw)[0])
print("Pearson correlation (lemmas + lowercase + only words (no punctuation marks)):", 
      pearsonr(gs, similarities_l_lower_ow)[0])

Pearson correlation (lemmas + lowercase): 0.5790860088205632
Pearson correlation (lemmas + lowercase + no stopwords): 0.2294630424388053
Pearson correlation (lemmas + lowercase + only words (no punctuation marks)): 0.47220783380206716


As can be seen, the preprocessing that works best for these examples is lowercasing the sentences while keeping stopwords and punctuation marks. Stopwords and punctuation tokens carry little meaning, yet removing them lowers the correlation. The underlying reason is that tokens such as "." and "the" appear in both sentences of the first pair, which has the highest gold score, so they account for most of the overlap between them.
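
A small sketch of this effect on the first pair follows. The stopword set is a hand-picked subset chosen for illustration and is applied to the lemmas for brevity, whereas the notebook filters the original words against the NLTK English list; `jaccard_similarity` is a hypothetical helper:

```python
def jaccard_similarity(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# First pair as lowercased lemmas, copied from the notebook output.
s1 = ['the', 'bird', 'be', 'bath', 'in', 'the', 'sink', '.']
s2 = ['birdie', 'be', 'wash', 'itself', 'in', 'the', 'water', 'basin', '.']

# Assumed stopword subset, for illustration only.
stop = {'the', 'be', 'in', 'itself'}

no_stop_1 = [t for t in s1 if t not in stop]
no_stop_2 = [t for t in s2 if t not in stop]

print(jaccard_similarity(s1, s2))                # 4/12
print(jaccard_similarity(no_stop_1, no_stop_2))  # 1/8, only '.' is still shared
```

Under these assumptions the pair with the highest expected score drops from 4/12 to 1/8 once stopwords are removed.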

These are the pairs of sentences in the lemmas + lowercase computation:

In [11]:
for index, t_pair in enumerate(t_pairs_lower):
    print(str(index + 1) + ".", t_pair, '\n')

1. (['the', 'bird', 'be', 'bath', 'in', 'the', 'sink', '.'], ['birdie', 'be', 'wash', 'itself', 'in', 'the', 'water', 'basin', '.']) 

2. (['in', 'may', '2010', ',', 'the', 'troop', 'attempt', 'to', 'invade', 'kabul', '.'], ['the', 'u', 'army', 'invade', 'kabul', 'on', 'may', '7th', 'last', 'year', ',', '2010', '.']) 

3. (['john', 'say', 'he', 'be', 'consider', 'a', 'witness', 'but', 'not', 'a', 'suspect', '.'], ['``', 'he', 'be', 'not', 'a', 'suspect', 'anymore', '.', "''", 'john', 'say', '.']) 

4. (['they', 'fly', 'out', 'of', 'the', 'nest', 'in', 'group', '.'], ['they', 'fly', 'into', 'the', 'nest', 'together', '.']) 

5. (['the', 'woman', 'be', 'play', 'the', 'violin', '.'], ['the', 'young', 'lady', 'enjoy', 'listen', 'to', 'the', 'guitar', '.']) 

6. (['john', 'go', 'horse', 'back', 'rid', 'at', 'dawn', 'with', 'a', 'whole', 'group', 'of', 'friend', '.'], ['sunrise', 'at', 'dawn', 'be', 'a', 'magnificent', 'view', 'to', 'take', 'in', 'if', 'you', 'wake', 'up', 'early', 'enough',

And these are its similarities:

In [12]:
for index, similarity in enumerate(similarities_l_lower):
    print(str(index + 1) + ".", similarity)

1. 0.33333333333333337
2. 0.4117647058823529
3. 0.5714285714285714
4. 0.4545454545454546
5. 0.16666666666666663
6. 0.13793103448275867


The reason for this improvement is that some tokens appear in both sentences but differ only in the case of their first letter. For example, in the third pair of sentences:

- 'John said **he** is considered a witness but not a suspect.'
- '"**He** is not a suspect anymore." John said.'
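
The change for the third pair can be verified by counting shared and distinct tokens before and after lowercasing. This is an illustrative sketch; the helper name `shared_and_total` is an assumption, and the token lists are copied from the lemma output shown earlier:

```python
def shared_and_total(a, b):
    """Return (number of shared tokens, number of distinct tokens overall)."""
    a, b = set(a), set(b)
    return len(a & b), len(a | b)


# Third pair, lemmatized without forced lowercasing.
lemmas_1 = ['john', 'say', 'he', 'be', 'consider', 'a', 'witness', 'but', 'not', 'a', 'suspect', '.']
lemmas_2 = ['``', 'He', 'be', 'not', 'a', 'suspect', 'anymore', '.', "''", 'john', 'say', '.']

lower_1 = [t.lower() for t in lemmas_1]
lower_2 = [t.lower() for t in lemmas_2]

print(shared_and_total(lemmas_1, lemmas_2))  # (7, 15)
print(shared_and_total(lower_1, lower_2))    # (8, 14)
```

Lowercasing merges `he` and `He`, which adds one shared token and removes one distinct token, so the similarity goes from 7/15 to 8/14.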

### Which is better: words or lemmas?

Lemmas perform better than words: the *Pearson correlation coefficient* rises by about 0.09 (from 0.40 to 0.49), moving closer to 1.0 (an exact linear relationship). This increase comes entirely from the second pair of sentences, as explained above (*invaded* becomes *invade*).

The correlation between the obtained and the expected results is still low. As explained in the last session, measuring the similarity of two sentences with the *Jaccard distance* alone (even with lemmas instead of words) is not enough to get the expected results.

For example, the sentences of the first pair (0.31 obtained vs 1.0 expected) mean the same thing but use different words, so a technique based on token overlap fails:

- 'The bird **is** bathing **in the** sink.'
- 'Birdie **is** washing itself **in the** water basin.'

Other elements (for example, sentiment analysis or sentence context) should be studied to get a higher *Pearson correlation coefficient* between obtained results and expected results.



### Do you think that could perform better for any pair of texts?

Lemmatization performs better than word tokenization only for pairs in which the same lemma appears with different inflections. For example, as shown for the second pair of sentences:

**Original Sentences**
- 'In May 2010, the troops attempted to *invade* Kabul.'
- 'The US army *invaded* Kabul on May 7th last year, 2010.'

**Sentences tokenized by lemmas**
- 'In may 2010, the troop attempt to **invade** kabul.'
- 'The u army **invade** kabul on may 7th last year, 2010. '

The performance is better because comparing lemmas instead of word forms lets the system recognize that both sentences refer to an invasion.

Nevertheless, lemmatization does not always perform better, as can be seen in the first pair of sentences:

**Original Sentences**
- 'The bird **is** bathing in the sink.' 
- 'Birdie **is** washing itself in the water basin.'

**Sentences tokenized by lemmas**
- 'The bird **be** bath in the sink.'
- 'birdie **be** wash itself in the water basin.'

In this case, the shared verb already has the same form (*is*) in both sentences, so the *Jaccard distance* is the same with words and with lemmas.

Moreover, inflectional morphemes sometimes carry information that is relevant to the similarity of a pair of texts. For example:

**Original Sentences Example**
- 'He **went** to the supermarket'
- 'He **is going** to the supermarket'

Tokenizing these sentences by words and by lemmas gives:

In [13]:
new_pair_w = ((nltk.word_tokenize('He went to the supermarket'), 
               nltk.word_tokenize('He is going to the supermarket')))

new_pair_l = ([lemmatize(word) for word in pos_tag(['He', 'went', 'to', 'the', 'supermarket'])],
             [lemmatize(word) for word in pos_tag(['He', 'is', 'going', 'to', 'the', 'supermarket'])])

print("Sentences tokenized by words:", new_pair_w)
print("Sentences tokenized by lemmas:", new_pair_l)

Sentences tokenized by words: (['He', 'went', 'to', 'the', 'supermarket'], ['He', 'is', 'going', 'to', 'the', 'supermarket'])
Sentences tokenized by lemmas: (['He', 'go', 'to', 'the', 'supermarket'], ['He', 'be', 'go', 'to', 'the', 'supermarket'])


In [14]:
print("Similarity between sentences tokenized by words:", 
     (1 - jaccard_distance(set(new_pair_w[0]), set(new_pair_w[1]))))

print("Similarity between sentences tokenized by lemmas:", 
     (1 - jaccard_distance(set(new_pair_l[0]), set(new_pair_l[1]))))

Similarity between sentences tokenized by words: 0.5714285714285714
Similarity between sentences tokenized by lemmas: 0.8333333333333334


In this case, the similarity computed on words is more accurate: the two sentences differ in tense, and lemmatization discards that difference, so the lemma-based similarity overstates how close they are.