# Solution

# IHLT Lab Exercise 7
## This file contains code to complete the exercise for the seventh lab session of IHLT
Authors:


*   Kacper Poniatowski (kacper.krzysztof.poniatowski@estudiantat.upc.edu)
*   Pau Blanco (pablo.blanco@estudiantat.upc.edu)



## Methodology

In previous exercises, we applied different preprocessing techniques to try to improve the Pearson correlation. As we want to compare the results of this laboratory with the previous ones, we need to decide what preprocessing to apply to make the results as comparable as possible.

We observed in the past that removing stop words and punctuation generally improved the results. However, applying such techniques in this case could interfere with the algorithm to extract Named Entities (NEs). For instance, lowercasing or removing punctuation could prevent identifying "Apple" (and its use in "Apple Inc.") as a company versus "apple," the fruit.

Consequently, for this exercise we have decided to work with the raw sentences and compare the lemmas of the words and the NEs extracted. We will then compare the results with those from previous exercises whose setup is closest to this one.




## Pre-reqs

In [None]:
from google.colab import drive
import sys

drive.mount('/content/drive', force_remount=True)
sys.path.insert(0, '/content/drive/MyDrive/Notebooks/IHLT/Week6')
import pandas as pd

dt = pd.read_csv('/content/drive/MyDrive/Notebooks/IHLT/Week6/test-gold/STS.input.SMTeuroparl.txt',sep='\t',header=None)
dt['gs'] = pd.read_csv('/content/drive/MyDrive/Notebooks/IHLT/Week6/test-gold/STS.gs.SMTeuroparl.txt',sep='\t',header=None)

dt.head()

Mounted at /content/drive


Unnamed: 0,0,1,gs
0,The leaders have now been given a new chance a...,The leaders benefit aujourd' hui of a new luck...,4.5
1,Amendment No 7 proposes certain changes in the...,Amendment No 7 is proposing certain changes in...,5.0
2,Let me remind you that our allies include ferv...,I would like to remind you that among our alli...,4.25
3,The vote will take place today at 5.30 p.m.,The vote will take place at 5.30pm,4.5
4,"The fishermen are inactive, tired and disappoi...","The fishermen are inactive, tired and disappoi...",5.0


In [None]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

from nltk.metrics import jaccard_distance
from nltk.tokenize import word_tokenize
from scipy.stats import pearsonr
import spacy
from typing import List
import string

nlp = spacy.load('en_core_web_sm')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


## Exercise

### Functions

In [None]:
# Function to compute and store Pearson correlations
def calculate_pearsonr(column_name: str, label: str) -> float:
    """
    Calculate Pearson correlation coefficient between a column and the gold standard scores.
    """
    correlation = pearsonr(dt['gs'], dt[column_name])[0]
    print(f'Pearson correlation for {label} ({column_name}): {correlation:.6f}')
    return correlation


def get_nes_tokens(sent):
    """
    Return the lemmas of a raw sentence, with each Named Entity (NE)
    merged into a single token.
    """
    # We work with the raw sentence because lowercasing or removing
    # non-alphanumeric tokens can interfere with NE word sequences
    doc = nlp(sent)
    with doc.retokenize() as retokenizer:
        tokens = [token for token in doc]
        for ent in doc.ents:
            # The merged token uses the entity's surface text as its lemma
            retokenizer.merge(doc[ent.start:ent.end],
                attrs={"LEMMA": " ".join([tokens[i].text
                                    for i in range(ent.start, ent.end)])})

    # We only need the token lemma
    return [token.lemma_ for token in doc]

def calculate_similarity_scores(df):
    """Calculate similarity scores for all sentence pairs"""
    similarities = []

    for i in range(df.shape[0]):
        sent1 = str(df.at[i, 0])
        sent2 = str(df.at[i, 1])

        # Get the words and NEs
        words_nes1 = get_nes_tokens(sent1)
        words_nes2 = get_nes_tokens(sent2)
        #print(words_nes1)

        # Calculate similarity
        similarity = 1 - jaccard_distance(set(words_nes1), set(words_nes2))
        similarities.append(similarity)

    return similarities

def lemmatise_tokens(tokens: List[str]) -> List[str]:
    """
    Lowercase the given tokens and drop those that are not purely
    alphabetic. Note that no actual lemmatisation is performed here.
    """
    return [w.lower() for w in tokens if w.isalpha()]

### Compute Jaccard Similarities and Pearson Correlation

In [None]:
dt_jac = dt
dt_jac['jac_word_nes'] = calculate_similarity_scores(dt_jac)

average_similarity = dt_jac['jac_word_nes'].mean()
print(f"Average Jaccard similarity: {average_similarity:.6f}")

pearson_corr = calculate_pearsonr('jac_word_nes', 'Words and NEs')

Average Jaccard similarity: 0.527462
Pearson correlation for Words and NEs (jac_word_nes): 0.430006


## Conclusions

In this exercise we have obtained a Pearson correlation of 0.430006, indicating that there is some degree of correlation with the gold standard.
To evaluate this result, we compare it with those obtained in previous labs.
These are the relevant correlations obtained so far:

| Comparison method | Mean Jaccard Similarity | Pearson Correlation | Methodology |
|-|-|-|-|
| Words (Lab 2) | 0.500240 | 0.450498 | Raw sentences, without preprocessing |
| Lemmas (Lab 3) | 0.552004 | 0.476777 | Raw sentences using lemmas |
| Senses (Lab 6) | 0.457442 | 0.412781 | Lower case, no punctuation, using lemmas |
| Lemmas + NEs | 0.527462 | **0.430006** | Raw sentences using lemmas, with NEs merged into single tokens |

We can see that the method tested in this lab does not outperform the most basic methods used in Labs 2 and 3. It only improves on the poor results obtained in Lab 6.


Putting aside the results of Lab 6 (where we already saw that the low result was due to the assignment of different synsets to the same word), the fact that the results are worse than in the first labs makes sense given how the Jaccard measure works.

As we know, the Jaccard similarity is the number of distinct tokens that appear in both sentences divided by the total number of distinct tokens across the two sentences. With this simple calculation, each matching word adds to the score. If we group matching words into a single Named Entity (NE), instead of having one match per word we will have only one match, which lowers the Jaccard similarity.
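Written as a formula, with $A$ and $B$ standing for the token sets of the two sentences (the symbols are our own notation, introduced here only for illustration):

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad d_J(A, B) = 1 - J(A, B)
$$

This is why the code computes the similarity as `1 - jaccard_distance(...)`.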

For instance, let's compare these two sentences:
* "The European Union will grow"
* "The European Union is expanding"

If we directly compare the words, we will have a Jaccard score of 3 / 7 = 0.43. However, if we consider "The European Union" as a single token, the result will be 1 / 5 = 0.20.
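The arithmetic of this example can be checked with a minimal sketch in plain Python. The helper `jaccard_similarity` and the hand-merged token lists are illustrative assumptions; they do not call spaCy, so the tokens spaCy would actually produce for these sentences may differ.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Shared distinct tokens divided by all distinct tokens (hypothetical helper)."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    return len(set_a & set_b) / len(set_a | set_b)


# Word-level tokens: "The", "European" and "Union" each count as a match.
words_1 = ["The", "European", "Union", "will", "grow"]
words_2 = ["The", "European", "Union", "is", "expanding"]

# NE-level tokens: the entity is merged by hand into a single token.
merged_1 = ["The European Union", "will", "grow"]
merged_2 = ["The European Union", "is", "expanding"]

word_score = jaccard_similarity(words_1, words_2)      # 3 shared / 7 distinct
merged_score = jaccard_similarity(merged_1, merged_2)  # 1 shared / 5 distinct

print(f"Word tokens:   {word_score:.2f}")    # 0.43
print(f"Merged tokens: {merged_score:.2f}")  # 0.20
```

The non-matching words are identical in both cases; only the number of matches contributed by the entity changes.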

An interesting observation arises when comparing the results of Lemmas + NEs against the plain words from Lab 2. The mean Jaccard similarity using lemmas + NEs is higher (0.527462) than using just words (0.500240), yet the corresponding Pearson correlation is lower (0.430006 against 0.450498). A possible explanation is that lemmatisation creates more matches by normalising word forms (as we have observed on multiple occasions throughout our lab work), while lemmatisation combined with NE merging may make the scores of different pairs less varied, thus reducing the correlation with the more varied gold standard scores.

Without lemmatisation + NEs: each word of a multi-word entity, such as "European" and "Union", counts as a separate match, so pairs that share an entity stand out more clearly from pairs that do not. This may contribute to a higher Pearson correlation.

With lemmatisation + NEs: "European Union" becomes one merged token and contributes only a single match.
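That a higher mean similarity does not have to come with a higher Pearson correlation can be illustrated with a small sketch. All numbers below are made-up toy values, not results from any lab, and `pearson` is a hand-written helper used here in place of `scipy.stats.pearsonr` to keep the example dependency-free.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy gold standard and two hypothetical sets of similarity scores.
gold = [1.0, 2.0, 3.0, 4.0, 5.0]
scores_low_mean = [0.1, 0.2, 0.3, 0.4, 0.5]   # low values, but they follow gold closely
scores_high_mean = [0.6, 0.5, 0.7, 0.6, 0.7]  # higher values, but less varied and noisier

mean_low = sum(scores_low_mean) / len(scores_low_mean)
mean_high = sum(scores_high_mean) / len(scores_high_mean)
r_low = pearson(gold, scores_low_mean)
r_high = pearson(gold, scores_high_mean)

print(f"Low-mean scores:  mean={mean_low:.2f}, r={r_low:.3f}")
print(f"High-mean scores: mean={mean_high:.2f}, r={r_high:.3f}")
```

Pearson correlation only measures how well the scores track the gold standard linearly, so it is insensitive to the overall level of the scores. This is compatible with, but does not prove, the explanation suggested above.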

As we can see, simply applying the extraction of NEs without combining it with other techniques generally worsens the results.