# Solution

# IHLT Lab Exercise 6
## This file contains code to complete the exercise for the sixth lab session of IHLT
Authors:


*   Kacper Poniatowski (kacper.krzysztof.poniatowski@estudiantat.upc.edu)
*   Pau Blanco (pablo.blanco@estudiantat.upc.edu)


## Methodology
In Lab 2 and Lab 3 we compared pairs of sentences using Jaccard distance and then compared the obtained results with the provided gold standard values using the Pearson correlation.

In Lab 2 we used words while in Lab 3 the corresponding lemmas were used.
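
As a reminder of the two measures involved, the following is a minimal pure-Python sketch. The notebook itself relies on `nltk.metrics.jaccard_distance` and `scipy.stats.pearsonr`; the helper names, sentence pairs and gold scores below are illustrative assumptions, not data from the labs.

```python
import math

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two non-empty collections of tokens."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Toy sentence pairs and toy gold scores (assumptions for illustration only)
pairs = [
    ("the vote will take place today", "the vote will take place"),
    ("the fishermen are tired", "the leaders are inactive"),
    ("we agree", "they object"),
]
gold = [4.5, 2.0, 0.5]

# Similarity = 1 - distance, computed on whitespace-separated tokens
sims = [1 - jaccard_distance(s0.split(), s1.split()) for s0, s1 in pairs]
print("Similarities:", sims)
print("Pearson vs. gold:", pearson(sims, gold))
```

The correlation only measures how well the similarities track the gold scores linearly, so the different scales (0 to 1 versus 0 to 5) do not matter.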

In each lab, we computed the values using four different variations:
* Raw Sentences
* Sentences in lower case
* Sentences in lower case removing stop words
* Sentences in lower case removing punctuation (keeping only alphanumeric words)
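
The four variations can be sketched on a single toy pair as follows. This is a simplified illustration, assuming pre-tokenized input and a tiny hand-written stop-word set (`STOP_WORDS` is hypothetical; the labs used a full stop-word list).

```python
# Tiny illustrative stop-word list (an assumption, not the list used in the labs)
STOP_WORDS = {"the", "will", "at", "a", "of"}

def variants(tokens):
    """Return the token set of a sentence under each preprocessing variant."""
    lower = [t.lower() for t in tokens]
    return {
        "raw": set(tokens),
        "lower": set(lower),
        "lower_no_stop": {t for t in lower if t not in STOP_WORDS},
        "lower_alnum": {t for t in lower if t.isalnum()},
    }

def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| for two non-empty sets."""
    return len(a & b) / len(a | b)

s0 = ["The", "vote", "will", "take", "place", "today", "."]
s1 = ["the", "vote", "will", "take", "place", "."]

v0, v1 = variants(s0), variants(s1)
scores = {name: jaccard_similarity(v0[name], v1[name]) for name in v0}
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```

Even on this toy pair the variants give different scores, because each one changes which tokens can match ("The" versus "the") and how large the union is.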

Following is the Pearson correlation we got for each variant:

| Preprocessing Step | Using Words| Using Lemmas | Difference (%) |
|-----------------|-----------------|-----------------|---|
| Raw Sentences | 0.450498 | 0.476777 | 5.83% correlation increase using lemmas |
| Lower Case | 0.462495 | 0.481508 | 4.11% correlation increase using lemmas |
| Lower Case + No Stopwords | 0.445160 | 0.495948 | 11.41% correlation increase using lemmas |
| Lower Case + No Punctuation | 0.458716 | 0.499157 | 8.82% correlation increase using lemmas |

As seen in the table above, the best correlation using words was obtained when only applying lowercase. But when using lemmas, the best results were obtained when the punctuation was removed after lowercasing.
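
The last column of the table is the relative change of the lemma-based correlation with respect to the word-based one. A short sketch that recomputes it from the values in the table (the function name `relative_increase` is ours):

```python
# Pearson correlations copied from the table: (using words, using lemmas)
results = {
    "Raw Sentences": (0.450498, 0.476777),
    "Lower Case": (0.462495, 0.481508),
    "Lower Case + No Stopwords": (0.445160, 0.495948),
    "Lower Case + No Punctuation": (0.458716, 0.499157),
}

def relative_increase(words, lemmas):
    """Percentage change of the lemma score relative to the word score."""
    return (lemmas - words) / words * 100

for step, (words, lemmas) in results.items():
    print(f"{step}: {relative_increase(words, lemmas):.2f}%")
```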

In this exercise we are required to carry out a similar study, but using the synsets obtained by disambiguating the words instead of the words or lemmas themselves. Our reference value is 0.499157, achieved using lemmas after lowercasing and removing punctuation. Just as lemmatisation improved on raw word tokens, we anticipate that incorporating word sense information will yield a better correlation than this reference value.

On a final note: when disambiguating with Lesk, we get word senses for the four open-class categories only (nouns, verbs, adjectives and adverbs). This means a significant portion of the words will be removed from the sentences. To analyse the impact of these words, we compute the Jaccard distances and the correlation in two ways: using only the words that received a sense, and additionally including the lemmas of the words belonging to the remaining categories.
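
The difference between the two strategies can be illustrated with a toy example. The `(token, sense)` pairs below are hand-written assumptions standing in for the output of a disambiguator (`None` means no sense was found); they are not actual Lesk output.

```python
def to_labels(tagged, keep_unsensed):
    """Build the label set of a sentence from (token, sense-or-None) pairs."""
    labels = set()
    for token, sense in tagged:
        if sense is not None:
            labels.add(sense)   # word with a sense: use the synset name
        elif keep_unsensed:
            labels.add(token)   # word without a sense: fall back to its lemma
    return labels

def jaccard_similarity(a, b):
    return len(a & b) / len(a | b)

# Hypothetical disambiguated sentences (sense names are illustrative)
s0 = [("leader", "leader.n.01"), ("of", None), ("chance", "chance.n.01"), ("they", None)]
s1 = [("leader", "leader.n.01"), ("of", None), ("luck", "luck.n.01"), ("they", None)]

sim_senses_only = jaccard_similarity(to_labels(s0, False), to_labels(s1, False))
sim_with_lemmas = jaccard_similarity(to_labels(s0, True), to_labels(s1, True))
print(f"Senses only:     {sim_senses_only:.4f}")
print(f"Senses + lemmas: {sim_with_lemmas:.4f}")
```

In this toy case the shared closed-class words raise the similarity once they are kept. Whether that helps or hurts the correlation with the gold standard is what the two variants are meant to show.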

It is worth mentioning that our intention was to disambiguate the words using both Lesk and UKB. Unfortunately, we were unable to obtain the UKB results because of TextServer instability or quota limits. The cells and code we intended to use are included before the Conclusions section.

## Pre-reqs

In [None]:
from google.colab import drive
import sys

drive.mount('/content/drive', force_remount=True)
sys.path.insert(0, '/content/drive/MyDrive/Notebooks/IHLT/Week6')
import pandas as pd

dt = pd.read_csv('/content/drive/MyDrive/Notebooks/IHLT/Week6/test-gold/STS.input.SMTeuroparl.txt',sep='\t',header=None)
dt['gs'] = pd.read_csv('/content/drive/MyDrive/Notebooks/IHLT/Week6/test-gold/STS.gs.SMTeuroparl.txt',sep='\t',header=None)

dt.head()

Mounted at /content/drive


Unnamed: 0,0,1,gs
0,The leaders have now been given a new chance a...,The leaders benefit aujourd' hui of a new luck...,4.5
1,Amendment No 7 proposes certain changes in the...,Amendment No 7 is proposing certain changes in...,5.0
2,Let me remind you that our allies include ferv...,I would like to remind you that among our alli...,4.25
3,The vote will take place today at 5.30 p.m.,The vote will take place at 5.30pm,4.5
4,"The fishermen are inactive, tired and disappoi...","The fishermen are inactive, tired and disappoi...",5.0


In [None]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')
nltk.download("punkt")

from nltk.metrics import jaccard_distance
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from scipy.stats import pearsonr

# from textserver import TextServer

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Exercise

### Common Functions

In [None]:
# Function to compute, print and return the Pearson correlation against the gold standard
def compute_pearsonr(column_name, label):
    correlation = pearsonr(dt['gs'], dt[column_name])[0]
    print(f'Pearson correlation for {label} ({column_name}): {correlation:.6f}')
    return correlation

def preProcessTokens(tokens):
  # Lowercase and keep only purely alphabetic tokens (drops numbers and punctuation)
  return [w.lower() for w in tokens if w.isalpha()]
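
Note that `str.isalpha()` is stricter than an alphanumeric check: any token containing a digit, a dot or an apostrophe is dropped. The sketch below shows the effect on a hand-written token list (an assumption about what a tokenizer might return; `pre_process_tokens` mirrors `preProcessTokens` above).

```python
def pre_process_tokens(tokens):
    # Same logic as preProcessTokens: lowercase, keep purely alphabetic tokens
    return [w.lower() for w in tokens if w.isalpha()]

# Hand-written tokens for illustration, not actual word_tokenize output
tokens = ["The", "vote", "will", "take", "place", "today", "at", "5.30", "p.m", "."]
kept = pre_process_tokens(tokens)
dropped = [t for t in tokens if not t.isalpha()]

print("Kept:   ", kept)
print("Dropped:", dropped)
```

Times, numbers and abbreviations therefore never contribute to the Jaccard score in this notebook.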

### Lesk in NLTK

In [None]:
# Functions and pre-work
lemmatizer = WordNetLemmatizer()

dt_jac = dt  # Alias, not a copy: columns added here are also visible through dt (used by compute_pearsonr)
dt_jac['jac_lesk_remove_no_sense'] = 0.0
dt_jac['jac_lesk_lemmatise_no_sense'] = 0.0

def disambiguate_sentence(sentence, remove_word, verbose = False):
    tokens = preProcessTokens(word_tokenize(sentence))
    senses = []
    if verbose:
      print("Tokens:", tokens)
    for word in tokens:
      sense = nltk.wsd.lesk(tokens, word)  # Apply Lesk for Word Sense Disambiguation
      if sense:
        if verbose:
          print(word, '->', sense.name(), ':', sense.definition())
        senses.append(sense.name())  # Get the synset name (e.g., 'bank.n.01')
      else:
        if remove_word:
          continue  # If no synset is found, remove word in this case
        else:
          lemma = lemmatizer.lemmatize(word)
          senses.append(lemma)  # Append the lemma to the list
    return senses

def calculate_lesk_jaccard(remove_word, verbose = False):
  # Column that receives the similarity for the chosen strategy
  column = 'jac_lesk_remove_no_sense' if remove_word else 'jac_lesk_lemmatise_no_sense'

  for i in range(dt.shape[0]):
    sentence0 = dt.at[i, 0]
    sentence1 = dt.at[i, 1]

    if verbose:
      print("#################")
      print("     Sentence 1:", sentence0)
    senses0 = disambiguate_sentence(sentence0, remove_word, verbose)
    if verbose:
      print("     Sentence 2:", sentence1)
    senses1 = disambiguate_sentence(sentence1, remove_word, verbose)

    # Store the Jaccard similarity (1 - distance) between the two sets of senses
    dt_jac.at[i, column] = 1 - jaccard_distance(set(senses0), set(senses1))
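
One edge case is worth keeping in mind: if both sentences end up with no labels at all (for example, when every token is filtered out or has no sense), the union of the two sets is empty and a ratio-based Jaccard computation divides by zero. To our understanding `nltk.metrics.jaccard_distance` does not guard against this. It did not occur on this dataset, but a defensive variant could look like the following sketch (`safe_jaccard_similarity` and its `empty_value` parameter are hypothetical names of ours):

```python
def safe_jaccard_similarity(a, b, empty_value=0.0):
    """Jaccard similarity that returns `empty_value` when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Nothing to compare: avoid dividing by zero
        return empty_value
    return len(a & b) / len(union)

print(safe_jaccard_similarity(["bank.n.01"], ["bank.n.01"]))
print(safe_jaccard_similarity([], []))
```

Whether two empty sets should count as identical (1.0) or as incomparable (0.0) is a design choice, which is why the value is left as a parameter.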

#### Lesk + Removing Words without Senses

In [None]:
# Compute similarities by considering senses and Jaccard coefficient (REMOVE WORDS WITHOUT SENSES)
calculate_lesk_jaccard(remove_word = True)

average_jac_lesk_remove_no_sense = dt_jac['jac_lesk_remove_no_sense'].mean()

print(f"Average Jaccard similarity for sentences using Lesk (removing words with no senses): {average_jac_lesk_remove_no_sense:.6f}")

Average Jaccard similarity for sentences using Lesk (removing words with no senses): 0.423957


#### Lesk + Using lemmas / forms

In [None]:
# Compute similarities by considering senses and Jaccard coefficient (USE EITHER LEMMAS OR WORD FORMS FOR WORDS WITHOUT SENSES)
calculate_lesk_jaccard(remove_word = False)

average_jac_lesk_lemmatise_no_sense = dt_jac['jac_lesk_lemmatise_no_sense'].mean()

print(f"Average Jaccard similarity for sentences using Lesk (using lemmas with no senses): {average_jac_lesk_lemmatise_no_sense:.6f}")

Average Jaccard similarity for sentences using Lesk (using lemmas with no senses): 0.457442


#### Pearson Correlation with Lesk

In [None]:
correlations = {
    'Lesk (sense only)': compute_pearsonr('jac_lesk_remove_no_sense', 'lesk with removal of words with no sense'),
    'Lesk (lemmas for senseless words)': compute_pearsonr('jac_lesk_lemmatise_no_sense', 'lesk and using lemmas for senseless words'),
}

Pearson correlation for lesk with removal of words with no sense (jac_lesk_remove_no_sense): 0.413353
Pearson correlation for lesk and using lemmas for senseless words (jac_lesk_lemmatise_no_sense): 0.412781


### UKB in TextServer (Work in Progress due to TextServer limitations)

In [1]:
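import os

# Credentials are read from environment variables (hypothetical names) rather than hard-coded in the notebook
ts = TextServer(os.environ["TEXTSERVER_USER"], os.environ["TEXTSERVER_PASSWORD"], "senses")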

# Column names must match the ones written inside calculate_ukb_jaccard_dist
dt_jac['jac_ukb_remove_no_sense'] = 0.0
dt_jac['jac_ukb_lemmatise_no_sense'] = 0.0

cache = {}

def calculate_ukb_jaccard_dist(remove_word):
  # Work in progress: only the first 5 sentence pairs are processed for now
  for i in range(5):
    sentence0 = dt.at[i, 0]
    sentence1 = dt.at[i, 1]

    # Cache results because TextServer is slow as Christmas
    if sentence0 in cache:
        senses0 = cache[sentence0]
    else:
        senses0 = ts.senses(sentence0)
        cache[sentence0] = senses0

    if sentence1 in cache:
        senses1 = cache[sentence1]
    else:
        senses1 = ts.senses(sentence1)
        cache[sentence1] = senses1

    # If remove_word = true, remove entries where no senses were found ('N/A')
    if remove_word:
        # Filter out words with no senses (senses0 and senses1 are lists of words with their senses)
        filtered_senses0 = [word[4] for word in senses0 if word[4] != 'N/A']
        filtered_senses1 = [word[4] for word in senses1 if word[4] != 'N/A']

        # Calculate Jaccard similarity for filtered senses
        dt_jac.at[i, 'jac_ukb_remove_no_sense'] = 1 - jaccard_distance(set(filtered_senses0), set(filtered_senses1))

    # If remove_word = false, fallback to the lemma form where no sense is found
    else:
        # If the sense is 'N/A', use the lemma (word[1])
        processed_senses0 = [word[4] if word[4] != 'N/A' else word[1] for word in senses0]
        processed_senses1 = [word[4] if word[4] != 'N/A' else word[1] for word in senses1]

        # Calculate Jaccard similarity using the processed senses (sense or lemma)
        dt_jac.at[i, 'jac_ukb_lemmatise_no_sense'] = 1 - jaccard_distance(set(processed_senses0), set(processed_senses1))

NameError: name 'TextServer' is not defined

#### UKB + Removing Words without Senses

In [2]:
# Compute similarities by considering senses and Jaccard coefficient (REMOVE WORDS WITHOUT SENSES)

#calculate_ukb_jaccard_dist(remove_word = True)


row_nums = dt_jac.shape[0]

sentence0 = dt.at[0, 0]

ts.senses(sentence0)

NameError: name 'dt_jac' is not defined

#### UKB + Using lemmas / forms

In [None]:
# Compute similarities by considering senses and Jaccard coefficient (USE EITHER LEMMAS OR WORD FORMS FOR WORDS WITHOUT SENSES)

## Conclusions

The following table summarizes the results obtained so far for lowercased sentences without punctuation:

| Comparison method | Mean Jaccard Similarity | Pearson Correlation |
|---|---|---|
| Words (Lab 2) | 0.518007 | 0.458716  |
| Lemmas (Lab 3) | 0.530221  | 0.499157  |
| Senses | 0.423957 | **0.413353**  |
| Senses + Lemmas | 0.457442  | **0.412781** |

The first evident conclusion is that our hypothesis, namely that using senses would perform better than using lemmas (or words), was incorrect. The mean Jaccard similarities are lower than those obtained with words and lemmas.
The decrease in correlation is noteworthy: both sense-based variants (0.413353 and 0.412781) score below even the plain word comparison (0.458716).

From these results, we conclude that Lesk performs poorly at disambiguating the words in this dataset. The output of Lesk itself makes this visible: the cell in the "Detailed view on Lesk disambiguation" section at the end of the notebook prints each sentence together with the synsets assigned.

It is not difficult to find poor disambiguation. For instance:

------
Sentence 1: Let me remind you that our allies include fervent supporters of this tax.

tax -> **tax.v.02** : set or determine the amount of (a payment such as a fine)

Sentence 2: I want to remind you that of our allies, it is of major of tax.

major -> major.v.01 : have as one's principal field of study

tax -> **tax.v.03** : use to the limit

------
In the sentences above, "tax" should have the same meaning, but Lesk assigns a different sense in each. A comparison based on words or lemmas would count "tax" as a match, whereas the sense-based comparison does not.
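
The effect on the score can be reproduced with a small sketch. Only `tax.v.02` and `tax.v.03` come from the Lesk output above; the remaining lemmas and synset names are simplified placeholders chosen for illustration.

```python
def jaccard_similarity(a, b):
    return len(a & b) / len(a | b)

# Lemma-based view: "tax" is the same label in both sentences
lemmas0 = {"remind", "ally", "tax"}
lemmas1 = {"remind", "ally", "tax"}

# Sense-based view: Lesk picked different senses for "tax"
# (remind.v.01 and ally.n.01 are placeholder names, not actual output)
senses0 = {"remind.v.01", "ally.n.01", "tax.v.02"}
senses1 = {"remind.v.01", "ally.n.01", "tax.v.03"}

lemma_sim = jaccard_similarity(lemmas0, lemmas1)
sense_sim = jaccard_similarity(senses0, senses1)
print(f"Lemmas: {lemma_sim:.2f}")
print(f"Senses: {sense_sim:.2f}")
```

A single inconsistent sense is penalised twice: it removes one element from the intersection and adds one to the union.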

Another example:

------
Sentence 1: The leaders have now been given a new chance and let us hope they seize it.

have -> **have.v.07** : have a personal or business relationship with someone

Sentence 2: The leaders now have another chance to let them and therefore seize it.

have -> **take.v.35** : have sex with; archaic use

------

Once again, the same problem is evident, this time with the word "have" (no comment on the hilarious meaning assigned in one of the sentences...).

In this same pair of sentences we can also see an incorrect meaning assigned to "it". In this case there is no negative impact on the score, as the same wrong sense is assigned in both sentences:

it -> information_technology.n.01 : the branch of engineering that deals with the use of computers and telecommunications to retrieve and store and transmit information.

We can therefore conclude that the results have worsened due to Lesk's poor performance for the assigned task. Perhaps with other disambiguation algorithms (we regret not being able to complete the tests with UKB), the results could be improved.

### Detailed view on Lesk disambiguation

In [None]:
# Shed some light on the results: print the sense Lesk assigns to every token
calculate_lesk_jaccard(remove_word = True, verbose = True)

The last 5000 lines of the output stream have been truncated.
honourably -> honorably.r.02 : with honor
#################
     Sentence 1: Mr President, the Cashman report can be summarised in four words: citizens' power over bureaucracy.
Tokens: ['mr', 'president', 'the', 'cashman', 'report', 'can', 'be', 'summarised', 'in', 'four', 'words', 'citizens', 'power', 'over', 'bureaucracy']
mr -> mister.n.01 : a form of address for a man
president -> president_of_the_united_states.n.02 : the office of the United States head of state
report -> report.v.05 : be responsible for reporting the details of, as in journalism
can -> can.n.02 : the quantity contained in a can
be -> exist.v.01 : have an existence, be extant
summarised -> summarize.v.02 : be a summary of
in -> indium.n.01 : a rare soft silvery metallic element; occurs in small quantities in sphalerite
four -> four.n.01 : the cardinal number that is the sum of three and one
words -> words.n.01 : the words that are spoken
ci