# Solution

# IHLT Lab Exercise 3
## This file contains code to complete the exercise for the third lab session of IHLT
Authors:


*   Kacper Poniatowski (kacper.krzysztof.poniatowski@estudiantat.upc.edu)
*   Pau Blanco (pablo.blanco@estudiantat.upc.edu)


In [2]:
# Pre-req code - provided from Lab 2

from google.colab import drive
drive.mount('/content/drive')

import pandas as pd

dt = pd.read_csv('/content/drive/MyDrive/Notebooks/IHLT/Week3/test-gold/STS.input.SMTeuroparl.txt',sep='\t',header=None)
dt['gs'] = pd.read_csv('/content/drive/MyDrive/Notebooks/IHLT/Week3/test-gold/STS.gs.SMTeuroparl.txt',sep='\t',header=None)

dt.head()

Mounted at /content/drive


Unnamed: 0,0,1,gs
0,The leaders have now been given a new chance a...,The leaders benefit aujourd' hui of a new luck...,4.5
1,Amendment No 7 proposes certain changes in the...,Amendment No 7 is proposing certain changes in...,5.0
2,Let me remind you that our allies include ferv...,I would like to remind you that among our alli...,4.25
3,The vote will take place today at 5.30 p.m.,The vote will take place at 5.30pm,4.5
4,"The fishermen are inactive, tired and disappoi...","The fishermen are inactive, tired and disappoi...",5.0


In [3]:
# Imports
import spacy
from scipy.stats import pearsonr
from nltk.metrics import jaccard_distance

nlp = spacy.load("en_core_web_sm")

In [None]:
# Prepare new columns in the DataFrame
dt['jac_lemma'] = 0.0
dt['jac_low_lemma'] = 0.0
dt['jac_low_stop_lemma'] = 0.0
dt['jac_punct_lemma'] = 0.0

rowNums = dt.shape[0]

# Helper function to remove punctuation
def remove_punctuation(tokens):
    return [token for token in tokens if not token.is_punct]

for i in range(rowNums):

    # Get the sentences
    sentence0 = dt.at[i, 0]
    sentence1 = dt.at[i, 1]

    # Process sentences using spaCy
    doc0 = nlp(sentence0)
    doc1 = nlp(sentence1)

    # 1. Original Jaccard Similarity using lemmatized words
    lemma0 = [token.lemma_ for token in doc0]
    lemma1 = [token.lemma_ for token in doc1]
    dt.at[i, 'jac_lemma'] = 1 - jaccard_distance(set(lemma0), set(lemma1))

    # 2. Lowercase Jaccard Similarity using lemmatized words
    lemma0_low = [token.lemma_.lower() for token in doc0]
    lemma1_low = [token.lemma_.lower() for token in doc1]
    dt.at[i, 'jac_low_lemma'] = 1 - jaccard_distance(set(lemma0_low), set(lemma1_low))

    # 3. Lowercase and Remove Stopwords using lemmatized words
    lemma0_low_stop = [token.lemma_.lower() for token in doc0 if not token.is_stop]
    lemma1_low_stop = [token.lemma_.lower() for token in doc1 if not token.is_stop]
    dt.at[i, 'jac_low_stop_lemma'] = 1 - jaccard_distance(set(lemma0_low_stop), set(lemma1_low_stop))

    # 4. Remove Punctuation (after lowercase) using lemmatized words
    tokens0_no_punct = remove_punctuation(doc0)
    tokens1_no_punct = remove_punctuation(doc1)
    lemma0_no_punct = [token.lemma_.lower() for token in tokens0_no_punct]
    lemma1_no_punct = [token.lemma_.lower() for token in tokens1_no_punct]
    dt.at[i, 'jac_punct_lemma'] = 1 - jaccard_distance(set(lemma0_no_punct), set(lemma1_no_punct))

# Helper to compute, print, and return the Pearson correlation against the gold standard
def compute_pearsonr(column_name, label):
    correlation = pearsonr(dt['gs'], dt[column_name])[0]
    print(f'Pearson correlation for {label} ({column_name}): {correlation:.6f}')
    return correlation

average_jac_lemma = dt['jac_lemma'].mean()
average_jac_low_lemma = dt['jac_low_lemma'].mean()
average_jac_low_stop_lemma = dt['jac_low_stop_lemma'].mean()
average_jac_punct_lemma = dt['jac_punct_lemma'].mean()

print('Jaccard similarity mean for each preprocessing testcase: ')
print(f'Average Jaccard similarity for raw sentences (lemmatized): {average_jac_lemma:.6f}')
print(f'Average Jaccard similarity for sentences in lower case (lemmatized): {average_jac_low_lemma:.6f}')
print(f'Average Jaccard similarity for sentences in lower case without stop words (lemmatized): {average_jac_low_stop_lemma:.6f}')
print(f'Average Jaccard similarity for sentences in lower case without punctuation signs (lemmatized): {average_jac_punct_lemma:.6f}')

print('\nPearson correlation between gold values and Jaccard similarity using lemmas (to 6 decimal places):')
# Store correlations in a dictionary for later use
correlations = {
    'Original': compute_pearsonr('jac_lemma', 'raw sentences (lemmatized)'),
    'Lowercase': compute_pearsonr('jac_low_lemma', 'sentences in lower case (lemmatized)'),
    'Lowercase without Stopwords': compute_pearsonr('jac_low_stop_lemma', 'sentences in lower case without stop words (lemmatized)'),
    'Without Punctuation': compute_pearsonr('jac_punct_lemma', 'sentences in lower case without punctuation signs (lemmatized)')
}

dt.head()


Jaccard similarity mean for each preprocessing testcase: 
Average Jaccard similarity for raw sentences (lemmatized): 0.552004
Average Jaccard similarity for sentences in lower case (lemmatized): 0.557334
Average Jaccard similarity for sentences in lower case without stop words (lemmatized): 0.581773
Average Jaccard similarity for sentences in lower case without punctuation signs (lemmatized): 0.530221

Pearson correlation between gold values and Jaccard similarity using lemmas (to 6 decimal places):
Pearson correlation for raw sentences (lemmatized) (jac_lemma): 0.476777
Pearson correlation for sentences in lower case (lemmatized) (jac_low_lemma): 0.481508
Pearson correlation for sentences in lower case without stop words (lemmatized) (jac_low_stop_lemma): 0.495948
Pearson correlation for sentences in lower case without punctuation signs (lemmatized) (jac_punct_lemma): 0.499157


Unnamed: 0,0,1,gs,jac_lemma,jac_low_lemma,jac_low_stop_lemma,jac_punct_lemma
0,The leaders have now been given a new chance a...,The leaders benefit aujourd' hui of a new luck...,4.5,0.4,0.4,0.384615,0.391304
1,Amendment No 7 proposes certain changes in the...,Amendment No 7 is proposing certain changes in...,5.0,0.923077,0.923077,1.0,0.916667
2,Let me remind you that our allies include ferv...,I would like to remind you that among our alli...,4.25,0.454545,0.454545,0.363636,0.45
3,The vote will take place today at 5.30 p.m.,The vote will take place at 5.30pm,4.5,0.6,0.6,0.333333,0.6
4,"The fishermen are inactive, tired and disappoi...","The fishermen are inactive, tired and disappoi...",5.0,1.0,1.0,1.0,1.0


# Comparison of Jaccard similarities


### Mean Jaccard similarity for each preprocessing test case from Lab 2 (words):
- Average Jaccard similarity for raw sentences: 0.500240
- Average Jaccard similarity for sentences in lower case: 0.526905
- Average Jaccard similarity for sentences in lower case without stop words : 0.531768
- Average Jaccard similarity for sentences in lower case without punctuation signs: 0.518007


### Mean Jaccard similarity for each preprocessing test case (lemmas):
- Average Jaccard similarity for raw sentences (lemmatized): 0.552004
- Average Jaccard similarity for sentences in lower case (lemmatized): 0.557334
- Average Jaccard similarity for sentences in lower case without stop words (lemmatized): 0.581773
- Average Jaccard similarity for sentences in lower case without punctuation signs (lemmatized): 0.530221

| Preprocessing Step | Using Words | Using Lemmas | Relative increase with lemmas |
|-----------------|-----------------|-----------------|---|
| Raw Sentences | 0.500240 | 0.552004 | 10.35% |
| Lower Case | 0.526905 | 0.557334 | 5.78% |
| Lower Case + No Stopwords | 0.531768 | 0.581773 | 9.40% |
| Lower Case + No Punctuation | 0.518007 | 0.530221 | 2.36% |

In every preprocessing test case, lemmas gave a higher mean Jaccard similarity than surface words. Lemmatization collapses inflected forms of the same word into a single token, which increases the overlap between the two sentences of a pair and therefore raises the similarity score.
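
The percentage differences in the table can be reproduced from the reported means. The following is a minimal sketch that assumes the percentages are relative increases over the word-based score; the names `mean_similarity` and `relative_increase` are ours and are not part of the lab code.

```python
# Mean Jaccard similarities copied from the two lists above: (words, lemmas).
mean_similarity = {
    "Raw Sentences": (0.500240, 0.552004),
    "Lower Case": (0.526905, 0.557334),
    "Lower Case + No Stopwords": (0.531768, 0.581773),
    "Lower Case + No Punctuation": (0.518007, 0.530221),
}


def relative_increase(words_score, lemmas_score):
    """Return the increase of lemmas_score over words_score, in percent."""
    return (lemmas_score - words_score) / words_score * 100


# Relative increase for every preprocessing step.
increases = {
    step: relative_increase(words, lemmas)
    for step, (words, lemmas) in mean_similarity.items()
}

for step, pct in increases.items():
    print(f"{step}: {pct:.2f}%")
```

Reporting the change relative to the word-based score makes the four rows comparable even though their baselines differ.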

Although the library usually returns lemmas in lowercase, the scores still change when the lemmas are explicitly lowercased. When the library tags a word as a proper noun, it returns the lemma with the first letter capitalized. If the same word is tagged as a proper noun in one sentence but not in the other, the two lemmas differ only in case and do not match until both are lowercased.
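
The effect of a capitalized proper-noun lemma can be illustrated without running the tagger. In this sketch the two lemma lists are written by hand to mimic a case where one sentence yields `Parliament` and the other `parliament`; they are illustrative assumptions, not actual spaCy output, and `jaccard_similarity` is a hypothetical helper rather than the NLTK function used in the lab.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity between two collections, compared as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Hand-written lemma lists (assumed for illustration, not produced by spaCy).
lemmas0 = ["the", "Parliament", "vote", "today"]
lemmas1 = ["the", "parliament", "vote", "tomorrow"]

# Comparing the lemmas as returned: "Parliament" and "parliament" do not match.
case_sensitive = jaccard_similarity(lemmas0, lemmas1)

# Lowercasing first makes the two forms identical.
case_insensitive = jaccard_similarity(
    [lemma.lower() for lemma in lemmas0],
    [lemma.lower() for lemma in lemmas1],
)

print(f"Case-sensitive:   {case_sensitive:.3f}")
print(f"Case-insensitive: {case_insensitive:.3f}")
```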

# Comparison of Pearson Correlations

### Pearson correlation between gold values and Jaccard similarity using words from Lab 2 (to 6 decimal places):
- Pearson correlation for raw sentences: 0.450498
- Pearson correlation for sentences in lower case: 0.462495
- Pearson correlation for sentences in lower case without stop words: 0.445160
- Pearson correlation for sentences in lower case without punctuation signs: 0.458716


### Pearson correlation between gold values and Jaccard similarity using lemmas (to 6 decimal places):
- Pearson correlation for raw sentences (lemmatized): 0.476777
- Pearson correlation for sentences in lower case (lemmatized): 0.481508
- Pearson correlation for sentences in lower case without stop words (lemmatized): 0.495948
- Pearson correlation for sentences in lower case without punctuation signs (lemmatized): 0.499157


To visualise the difference between using words and lemmas, let's use a table to compare test results between the two:

| Preprocessing Step | Using Words | Using Lemmas | Relative increase in correlation with lemmas |
|-----------------|-----------------|-----------------|---|
| Raw Sentences | 0.450498 | 0.476777 | 5.83% |
| Lower Case | 0.462495 | 0.481508 | 4.11% |
| Lower Case + No Stopwords | 0.445160 | 0.495948 | 11.41% |
| Lower Case + No Punctuation | 0.458716 | 0.499157 | 8.82% |

In all four test cases carried over from last week, using lemmas instead of words increased the Pearson correlation between the gold standard and the Jaccard similarity of each sentence pair. The largest relative improvement was in the lowercase without stopwords test case, with an 11.41% increase.

In absolute terms, the best Pearson correlation (0.499157) is obtained when using lowercased lemmas with punctuation removed; stopwords were kept in that test case. This is the option we would recommend at this point.
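
The absolute comparison can be checked by ranking the eight correlations reported above. This is a minimal sketch; the dictionary name `reported_correlations` is ours, and the values are copied from the two lists rather than recomputed.

```python
# Pearson correlations copied from the lists above,
# keyed by (token type, preprocessing step).
reported_correlations = {
    ("words", "Raw Sentences"): 0.450498,
    ("words", "Lower Case"): 0.462495,
    ("words", "Lower Case + No Stopwords"): 0.445160,
    ("words", "Lower Case + No Punctuation"): 0.458716,
    ("lemmas", "Raw Sentences"): 0.476777,
    ("lemmas", "Lower Case"): 0.481508,
    ("lemmas", "Lower Case + No Stopwords"): 0.495948,
    ("lemmas", "Lower Case + No Punctuation"): 0.499157,
}

# Configuration with the highest absolute correlation.
best_config = max(reported_correlations, key=reported_correlations.get)
best_value = reported_correlations[best_config]

print(f"Best configuration: {best_config[0]}, {best_config[1]} ({best_value:.6f})")
```

With these figures the highest value is 0.499157, for lemmas with lower case and punctuation removed.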


# Methodology
- To compare the Jaccard similarities fairly, we computed the mean over all sentence pairs for the Lab 2 results and did the same for the results of this lab.
- Some sentence pairs in the sample text file contain non-English words. Lemmatization cannot be performed on these words, so their surface token form was used instead. We did not remove them, in order to preserve the integrity and context of the sentences; removing them could have caused information loss and skewed the similarity scores.


# Conclusions
The results of this lab exercise show that lemmatization consistently improves the correlation between Jaccard similarity and the gold standard, compared with using surface word forms.

This was expected, because words that share a lemma usually have similar meanings. Lemmatization makes these words identical, so they count as overlap and reduce the distance between the sentences.

### Q5a) Which is better: words or lemmas?
Based on our results, lemmas are generally better than words for text similarity tasks. In every preprocessing test case we ran, the Pearson correlation between the Jaccard similarity and the gold standard was higher with lemmas than with words. This suggests that lemmatization is useful for aligning word forms that would otherwise be treated as distinct tokens (e.g. "swimming" and "swim").
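
The alignment effect can be seen on a toy pair. This sketch uses a tiny hand-written lemma table (`TOY_LEMMAS`, an assumption standing in for a real lemmatizer such as spaCy) and a hypothetical `jaccard_similarity` helper; the two sentences are invented for illustration and do not come from the dataset.

```python
# Hand-written lemma table (assumed); a real lemmatizer would replace this.
TOY_LEMMAS = {"swimming": "swim", "swims": "swim", "is": "be"}


def lemmatize(tokens):
    """Map each token to its toy lemma, falling back to the token itself."""
    return [TOY_LEMMAS.get(token, token) for token in tokens]


def jaccard_similarity(a, b):
    """Jaccard similarity between two collections, compared as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


words0 = "she is swimming".split()
words1 = "she swims".split()

# Surface words: only "she" is shared.
similarity_words = jaccard_similarity(words0, words1)

# Lemmas: "swimming" and "swims" both become "swim", so the overlap grows.
similarity_lemmas = jaccard_similarity(lemmatize(words0), lemmatize(words1))

print(f"Words:  {similarity_words:.3f}")
print(f"Lemmas: {similarity_lemmas:.3f}")
```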

### Q5b) Do you think that could perform better for any pair of texts?
Lemmatization generally improves text similarity tasks, but there are scenarios where the non-lemmatized form of a word might perform better. For example, if two texts contain terms whose meaning depends on the form (e.g. "running" as an activity vs. "run" as in a stock market run), lemmatizing maps them to the same token, which increases the similarity and incorrectly reduces the Jaccard distance.

For instance, we can have two sentences with totally different meanings (the similarity should be as low as possible) but containing words that, despite being different, share the same lemma:

*   "The last show from Cirque du Soleil looks great"
*   "John is finally showing his paintings at the museum"

Lemmatization is also harmful when the structure and style of the sentence or document matter, because that information is lost. For example, consider the sentences "I bought apples" and "I buy apples". After lemmatization their Jaccard similarity is 1, which suggests that the sentences are identical even though they differ in tense.

The next cell shows that the two example sentences listed above (the Cirque du Soleil show and the museum paintings) have a Jaccard similarity of 0.059 with surface words, whereas with lemmas the similarity rises to 0.125. The increase comes from "show" and "showing", which have different meanings here but share the same lemma.


In [None]:
import nltk
nltk.download('punkt')

# Test sentences
sentence0 = "The last show from Cirque du Soleil looks great"
sentence1 = "John is finally showing his paintings at the museum"

# Lemmas
doc0 = nlp(sentence0)
doc1 = nlp(sentence1)
lemma0 = [token.lemma_.lower() for token in doc0 ]
lemma1 = [token.lemma_.lower() for token in doc1 ]

# Words
st0 = nltk.word_tokenize(sentence0)
st1 = nltk.word_tokenize(sentence1)
word0 = [w.lower() for w in st0 ]
word1 = [w.lower() for w in st1 ]

# Jacc similarity
jac_lemmas = 1 - jaccard_distance(set(lemma0), set(lemma1))
jac_words = 1 - jaccard_distance(set(word0), set(word1))

print("Lemmas: ", lemma0, "/", lemma1)
print("Words", word0, "/", word1)
print("Words similarity:", jac_words)
print("Lemmas similarity:", jac_lemmas)



Lemmas:  ['the', 'last', 'show', 'from', 'cirque', 'du', 'soleil', 'look', 'great'] / ['john', 'be', 'finally', 'show', 'his', 'painting', 'at', 'the', 'museum']
Words ['the', 'last', 'show', 'from', 'cirque', 'du', 'soleil', 'looks', 'great'] / ['john', 'is', 'finally', 'showing', 'his', 'paintings', 'at', 'the', 'museum']
Words similarity: 0.05882352941176472
Lemmas similarity: 0.125


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
