# IHLT Lab Exercise 2
## This file contains code to complete the exercise for the second lab session of IHLT
Authors:


*   Kacper Poniatowski (kacper.krzysztof.poniatowski@estudiantat.upc.edu)
*   Pau Blanco (pablo.blanco@estudiantat.upc.edu)


# Provided code
## Paraphrases Template

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd

In [3]:
dt = pd.read_csv('/content/drive/MyDrive/Notebooks/IHLT/Week2/test-gold/STS.input.SMTeuroparl.txt',sep='\t',header=None)

In [4]:
dt.head()

| | 0 | 1 |
|---|---|---|
| 0 | The leaders have now been given a new chance a... | The leaders benefit aujourd' hui of a new luck... |
| 1 | Amendment No 7 proposes certain changes in the... | Amendment No 7 is proposing certain changes in... |
| 2 | Let me remind you that our allies include ferv... | I would like to remind you that among our alli... |
| 3 | The vote will take place today at 5.30 p.m. | The vote will take place at 5.30pm |
| 4 | The fishermen are inactive, tired and disappoi... | The fishermen are inactive, tired and disappoi... |


In [5]:
dt['gs'] = pd.read_csv('/content/drive/MyDrive/Notebooks/IHLT/Week2/test-gold/STS.gs.SMTeuroparl.txt',sep='\t',header=None)

In [6]:
dt.shape

(459, 3)

In [7]:
dt.head()

| | 0 | 1 | gs |
|---|---|---|---|
| 0 | The leaders have now been given a new chance a... | The leaders benefit aujourd' hui of a new luck... | 4.5 |
| 1 | Amendment No 7 proposes certain changes in the... | Amendment No 7 is proposing certain changes in... | 5.0 |
| 2 | Let me remind you that our allies include ferv... | I would like to remind you that among our alli... | 4.25 |
| 3 | The vote will take place today at 5.30 p.m. | The vote will take place at 5.30pm | 4.5 |
| 4 | The fishermen are inactive, tired and disappoi... | The fishermen are inactive, tired and disappoi... | 5.0 |


### TODO
1. Compute the Jaccard similarity between the two sentences of every paraphrase pair and add it as a column *jaccard* to the *dt* variable.
2. Compute the Pearson correlation as: <br>
```
from scipy.stats import pearsonr
pearsonr(dt['gs'], dt['jaccard'])[0]
```
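
As a rough illustration of the quantity asked for in step 1, the sketch below computes the Jaccard similarity of two token collections as the size of their intersection divided by the size of their union. The helper name `jaccard_similarity`, the whitespace tokenization, and the convention for two empty sets are assumptions made for this example only; the solution below uses NLTK's tokenizer and `jaccard_distance` instead, so its values differ.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Return |A ∩ B| / |A ∪ B| for two collections of tokens."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty token sets are treated as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


# Row 3 of the data, split on whitespace for illustration only.
s0 = "The vote will take place today at 5.30 p.m.".split()
s1 = "The vote will take place at 5.30pm".split()

# 6 shared tokens out of 10 distinct tokens overall.
print(round(jaccard_similarity(s0, s1), 4))  # 0.6
```

With NLTK tokenization the same pair scores 0.545455 in the solution, because the tokenizer splits the sentences differently from a plain whitespace split.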

# Solution

In [8]:
# Imports
import nltk
from nltk.metrics import jaccard_distance
from scipy.stats import pearsonr
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')

stopwords = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [9]:
# Prepare new columns in the DataFrame
dt['jac'] = 0.0
dt['jac_low'] = 0.0
dt['jac_low_stop'] = 0.0
dt['jac_punct'] = 0.0

rowNums = dt.shape[0]

# Helper function to remove punctuation
def remove_punctuation(sentence):
    return [word for word in sentence if word.isalpha()]  # Only keep alphabetic words

for i in range(rowNums):
  # Tokenize both sentences into words (word_tokenize relies on the Punkt models)
  sentence0 = nltk.word_tokenize(dt.at[i, 0])
  sentence1 = nltk.word_tokenize(dt.at[i, 1])

  # 1. Original Jaccard Similarity (similarity = 1 - distance)
  dt.at[i, 'jac'] = 1 - jaccard_distance(set(sentence0), set(sentence1))

  # 2. Lowercase Jaccard Similarity
  sentence0_low = [w.lower() for w in sentence0]
  sentence1_low = [w.lower() for w in sentence1]
  dt.at[i, 'jac_low'] = 1 - jaccard_distance(set(sentence0_low), set(sentence1_low))

  # 3. Lowercase and Remove Stopwords
  sentence0_low_stop = [w for w in sentence0_low if w not in stopwords]
  sentence1_low_stop = [w for w in sentence1_low if w not in stopwords]
  dt.at[i, 'jac_low_stop'] = 1 - jaccard_distance(set(sentence0_low_stop), set(sentence1_low_stop))

  # 4. Remove Punctuation (after lowercase)
  sentence0_no_punct = remove_punctuation(sentence0_low)
  sentence1_no_punct = remove_punctuation(sentence1_low)
  dt.at[i, 'jac_punct'] = 1 - jaccard_distance(set(sentence0_no_punct), set(sentence1_no_punct))

# Function to compute and store Pearson correlations
def compute_pearsonr(column_name, label):
    correlation = pearsonr(dt['gs'], dt[column_name])[0]
    print(f'Pearson correlation for {label} ({column_name}): {correlation:.6f}')
    return correlation

average_jac = dt['jac'].mean()
average_jac_low = dt['jac_low'].mean()
average_jac_low_stop = dt['jac_low_stop'].mean()
average_jac_punct = dt['jac_punct'].mean()

print('Mean Jaccard similarity for each preprocessing test case:')
print(f'Average Jaccard similarity for raw sentences: {average_jac:.6f}')
print(f'Average Jaccard similarity for sentences in lower case: {average_jac_low:.6f}')
print(f'Average Jaccard similarity for sentences in lower case without stop words: {average_jac_low_stop:.6f}')
print(f'Average Jaccard similarity for sentences in lower case without punctuation signs: {average_jac_punct:.6f}')

print('Pearson correlation between gold values and Jaccard similarities (to 6 decimal places):')

# Store correlations in a dictionary for later use
correlations = {
    'Original': compute_pearsonr('jac', 'raw sentences'),
    'Lowercase': compute_pearsonr('jac_low', 'sentences in lower case'),
    'Lowercase without Stopwords': compute_pearsonr('jac_low_stop', 'sentences in lower case without stop words'),
    'Without Punctuation': compute_pearsonr('jac_punct', 'sentences in lower case without punctuation signs')
}

dt.head()


Mean Jaccard similarity for each preprocessing test case:
Average Jaccard similarity for raw sentences: 0.500240
Average Jaccard similarity for sentences in lower case: 0.526905
Average Jaccard similarity for sentences in lower case without stop words: 0.531768
Average Jaccard similarity for sentences in lower case without punctuation signs: 0.518007
Pearson correlation between gold values and Jaccard similarities (to 6 decimal places):
Pearson correlation for raw sentences (jac): 0.450498
Pearson correlation for sentences in lower case (jac_low): 0.462495
Pearson correlation for sentences in lower case without stop words (jac_low_stop): 0.445160
Pearson correlation for sentences in lower case without punctuation signs (jac_punct): 0.458716


| | 0 | 1 | gs | jac | jac_low | jac_low_stop | jac_punct |
|---|---|---|---|---|---|---|---|
| 0 | The leaders have now been given a new chance a... | The leaders benefit aujourd' hui of a new luck... | 4.5 | 0.346154 | 0.346154 | 0.3125 | 0.347826 |
| 1 | Amendment No 7 proposes certain changes in the... | Amendment No 7 is proposing certain changes in... | 5.0 | 0.785714 | 0.785714 | 0.777778 | 0.75 |
| 2 | Let me remind you that our allies include ferv... | I would like to remind you that among our alli... | 4.25 | 0.391304 | 0.391304 | 0.307692 | 0.380952 |
| 3 | The vote will take place today at 5.30 p.m. | The vote will take place at 5.30pm | 4.5 | 0.545455 | 0.545455 | 0.375 | 0.857143 |
| 4 | The fishermen are inactive, tired and disappoi... | The fishermen are inactive, tired and disappoi... | 5.0 | 1.0 | 1.0 | 1.0 | 1.0 |


# Methodology
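We were asked to correlate the gold standard scores with the Jaccard similarity of each sentence pair, which we compute as 1 minus NLTK's `jaccard_distance`. Initially, we tokenized the raw text with NLTK's `word_tokenize`, which relies on the Punkt models; this showed that a correlation exists, but only a modest one. Since the method behind the gold standard's creation wasn't explained, we tried preprocessing the sentences to improve the results: converting all text to lowercase, removing stop words, and removing punctuation. Each subsequent test builds on the earlier variant with the highest correlation, so the stop-word and punctuation tests are both applied to the lowercased tokens. Note that the punctuation filter keeps only purely alphabetic tokens, so it also discards numeric tokens such as "7" or "5.30".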
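
The four preprocessing variants can be summarized as one helper with optional steps. This is only a sketch: `preprocess`, `VARIANTS`, and the tiny `DEMO_STOP_WORDS` set are hypothetical names introduced here, and the stop-word list is a stand-in for NLTK's English list used in the solution.

```python
# Hypothetical, tiny stop-word list for illustration; the notebook uses
# NLTK's English stop-word list instead.
DEMO_STOP_WORDS = {"the", "a", "at", "will", "have", "to", "and"}


def preprocess(tokens, lowercase=False, drop_stop_words=False, drop_non_alpha=False):
    """Apply the selected preprocessing steps and return the token set."""
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if drop_stop_words:
        tokens = [t for t in tokens if t not in DEMO_STOP_WORDS]
    if drop_non_alpha:
        # Keeps purely alphabetic tokens, so numbers are dropped as well.
        tokens = [t for t in tokens if t.isalpha()]
    return set(tokens)


# Column name -> preprocessing options, mirroring the four tests.
VARIANTS = {
    "jac": {},
    "jac_low": {"lowercase": True},
    "jac_low_stop": {"lowercase": True, "drop_stop_words": True},
    "jac_punct": {"lowercase": True, "drop_non_alpha": True},
}

# Hand-written token list for illustration.
tokens = ["The", "vote", "will", "take", "place", "at", "5.30pm", "."]
for name, options in VARIANTS.items():
    print(name, sorted(preprocess(tokens, **options)))
```

Keeping the options in a dictionary makes it explicit that stop-word removal and punctuation removal are two separate branches on top of lowercasing, not a single cumulative chain.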

It should also be noted that gold standard values range from 0 to 5 while the Jaccard similarity ranges from 0 to 1. The differing scales do not affect the result, because the Pearson correlation is invariant to positive linear rescaling of either variable; we therefore decided against normalizing.
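
The scale argument can be checked numerically. The sketch below uses a hand-written `pearson` helper (an assumption for this example, standing in for `scipy.stats.pearsonr`) on the five rows shown in the solution output, and compares the coefficient before and after mapping the gold scores to the 0 to 1 range.

```python
import math


def pearson(xs, ys):
    """Plain-Python Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# First five rows of the solution output: gold scores and raw Jaccard values.
gs = [4.5, 5.0, 4.25, 4.5, 5.0]
jac = [0.346154, 0.785714, 0.391304, 0.545455, 1.0]

r_original = pearson(gs, jac)
r_rescaled = pearson([g / 5 for g in gs], jac)  # gold scores mapped to [0, 1]

# The two coefficients agree up to floating-point rounding.
print(math.isclose(r_original, r_rescaled))
```

Dividing by 5 scales the covariance and the standard deviation of the gold scores by the same factor, so the ratio that defines the coefficient is unchanged.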


# Conclusions
In the four cases we tested, we verified that there is indeed some correlation between the gold standard and the Jaccard similarity. However, the results are modest, between 0.445 and 0.462, indicating a certain degree of correlation but not a high one.

Lowercasing the text improved the correlation slightly (from 0.4505 to 0.4625), because it merges case variants of the same word, which shrinks the union and increases the overlap between similar phrases (e.g. 'The leaders...' and 'Leaders...'). However, removing stop words from the lowercased text worsened the result (0.4452, below even the raw baseline), and removing punctuation from the lowercased text gave 0.4587, a minor improvement over the raw baseline but slightly below lowercasing alone. We believe this inconsistency arises because the gold standard potentially reflects more advanced analysis, such as sentence structure and synonym usage, while the Jaccard similarity simply measures the proportion of distinct tokens that the two sentences share.
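
The effect of lowercasing can be reproduced on one of the pairs listed in the annex. In the sketch below the token lists are written by hand to mirror what the tokenizer is expected to produce for that pair (an assumption), and `jaccard_similarity` is a helper defined only for this example.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Return |A ∩ B| / |A ∪ B| for two non-empty token collections."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    return len(set_a & set_b) / len(set_a | set_b)


# Hand-written tokens for the annex pair
# "(Parliament adopted the legislative resolution)" /
# "(The Parliament adopts legislative resolution)".
s0 = ["(", "Parliament", "adopted", "the", "legislative", "resolution", ")"]
s1 = ["(", "The", "Parliament", "adopts", "legislative", "resolution", ")"]

raw = jaccard_similarity(s0, s1)
low = jaccard_similarity([t.lower() for t in s0], [t.lower() for t in s1])

# Raw: 5 shared tokens out of 9; lowercased: "the" now matches, 6 out of 8.
print(f"{raw:.4f} {low:.4f}")  # 0.5556 0.7500
```

Lowercasing turns "The" and "the" into a single token, which adds one element to the intersection and removes one from the union at the same time.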

**Main Conclusion**
The Jaccard similarity, based only on the proportion of shared words, is too simplistic to reliably gauge sentence similarity.
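
A small, hypothetical illustration of this limitation: the sentence pairs below are invented for the example and do not come from the dataset. A paraphrase that uses synonyms scores far lower than a pair with opposite meanings that shares most of its words.

```python
def jaccard_similarity(sentence_a, sentence_b):
    """Jaccard similarity of two sentences split on whitespace (illustration only)."""
    set_a, set_b = set(sentence_a.split()), set(sentence_b.split())
    return len(set_a & set_b) / len(set_a | set_b)


# Invented paraphrase: same meaning, almost no shared words.
paraphrase = jaccard_similarity("the meeting starts soon",
                                "the session begins shortly")

# Invented contradiction: opposite meaning, almost all words shared.
negation = jaccard_similarity("the vote will take place today",
                              "the vote will not take place today")

print(f"paraphrase: {paraphrase:.4f}, negation: {negation:.4f}")
```

Word overlap alone would rank the contradicting pair as the more similar one, which is the kind of mismatch with human similarity judgments that the modest correlations suggest.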


# Annex
Comparing sentence pairs whose Jaccard similarity changes when the text is converted to lower case

In [None]:
# Filter and print rows where the jac_low value exceeds the jac value
# by more than 0.001.

# The aim is to highlight cases where preprocessing noticeably changes
# the similarity scores, which supports our argument that the Jaccard similarity
# is too simplistic to capture sentence similarity accurately.

# Printing these cases offers concrete examples that show why the Jaccard
# approach doesn't align well with the more complex gold standard.

dif_low = dt[(dt['jac_low'] - dt['jac']) > 0.001]
print(f"Number of rows where 'jac_low' differs from 'jac' by more than 0.001: {dif_low.shape[0]}")

for row in dif_low.itertuples():
    print(f"Sentence 1: {row[1]}")
    print(f"Sentence 2: {row[2]}")
    print(f"Original Jaccard: {row.jac:.4f}, Preprocessed Jaccard (Lowercased): {row.jac_low:.4f}")
    print("-" * 80)

Number of rows where 'jac_low' differs from 'jac' by more than 0.001: 117
Sentence 1: Neither was there a qualified majority within this House to revert to Article 272.
Sentence 2: There was not a majority voting in Parliament to go back to Article 272.
Original Jaccard: 0.3333, Preprocessed Jaccard (Lowercased): 0.4000
--------------------------------------------------------------------------------
Sentence 1: The leaders have now been given a new chance and let us hope they seize it.
Sentence 2: Leaders now have another chance to let them and therefore take.
Original Jaccard: 0.2609, Preprocessed Jaccard (Lowercased): 0.3182
--------------------------------------------------------------------------------
Sentence 1: (Parliament adopted the legislative resolution)
Sentence 2: (The Parliament adopts legislative resolution)
Original Jaccard: 0.5556, Preprocessed Jaccard (Lowercased): 0.7500
--------------------------------------------------------------------------------
Sentence 1: As I