<a href="https://colab.research.google.com/github/marco-siino/SemEval2024/blob/main/Task%201/SemEval2024_Task1_esp_subA_all_mpnet_base_v2_MSiino.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Semantic Relatedness -- English Example

### Package Imports

In [None]:
import re
import pandas as pd
import numpy as np
from scipy.stats import spearmanr, pearsonr
import matplotlib.pyplot as plt
plt.style.use('ggplot')

### Data Import

The training data has a real-valued semantic textual relatedness score (between 0 and 1) for each pair of English-language sentences.

The data is structured as a CSV file with the following fields:
- PairID: a unique identifier for the sentence pair
- Text: two sentences separated by a newline ('\n') character
- Score: the semantic textual relatedness score for the two sentences

Below we will show you how to load and re-format the provided data file.
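
As a minimal sketch of this layout, the snippet below builds a tiny in-memory CSV with the three fields and splits `Text` into its two sentences. The two rows, their `toy-` identifiers, and their scores are invented for illustration and are not taken from the task data; the column names `Sentence1` and `Sentence2` are likewise hypothetical.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the PairID/Text/Score layout described above.
# Each quoted Text field contains a real newline between the two sentences.
sample_csv = (
    'PairID,Text,Score\n'
    '"toy-0001","A dog runs on the beach.\nA dog is running on the sand.",0.9\n'
    '"toy-0002","A dog runs on the beach.\nThe stock market fell today.",0.1\n'
)

df_sample = pd.read_csv(io.StringIO(sample_csv))

# Split each Text value at the first newline into two separate columns.
df_sample[['Sentence1', 'Sentence2']] = df_sample['Text'].str.split('\n', n=1, expand=True)
print(df_sample[['PairID', 'Sentence1', 'Sentence2', 'Score']])
```

Splitting with `n=1` keeps any additional newline inside the second sentence instead of producing a third column.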

In [None]:
# Load the Test File
df_str_rel = pd.read_csv('esp_test.csv')
df_str_rel.head()

Unnamed: 0,PairID,Text
0,ESP-test-0000,Los menonitas amish con ascendencia suiza de G...
1,ESP-test-0001,El perro negro está jugando con el perro marró...
2,ESP-test-0002,"Cuando se agita un disolvente, dos líquidos in..."
3,ESP-test-0003,La exsoldado de Estados Unidos Chelsea Manning...
4,ESP-test-0004,La catedral de Módena es uno de los lugares de...


In [None]:
# Load the File
df_str_rel = pd.read_csv('Semantic_Relatedness_SemEval2024/Pilot_data/sem_text_rel_ranked.csv', usecols=[3,4,5])
df_str_rel.head()

Unnamed: 0,PairID,Text,Score
0,Formality_pp_222,"It that happens, just pull the plug.\nif that ...",1.0
1,STS_237,A black dog running through water.\nA black do...,1.0
2,ParaNMT_pp_204,I've been searchingthe entire abbey for you.\n...,1.0
3,Formality_pp_119,If he is good looking and has a good personali...,1.0
4,Formality_pp_174,"She does not hate you, she is just annoyed wit...",1.0


In [None]:
df_str_rel['Text'].values

array(['Los menonitas amish con ascendencia suiza de Galicia se establecieron en 1815 cerca de Dubno.\nLos amonios menonitas de origen suizo de Galicia se establecieron cerca de Dubno en 1815.',
       'El perro negro está jugando con el perro marrón en la arena.\nEl peludo perro marrón corre por la zona de césped',
       'Cuando se agita un disolvente, dos líquidos inmiscibles se extraen juntos.\nBrian Meehan huyó a Portugal con Traynor (quien más tarde escapó a Amsterdam).',
       'La exsoldado de Estados Unidos Chelsea Manning fue arrestada este viernes por negarse a declarar en el marco de una investigación sobre WikiLeaks.\nAgotaré todos los recursos legales disponibles.',
       'La catedral de Módena es uno de los lugares de estilo románico más importantes de Europa y a su vez Patrimonio de la Humanidad.\nBauer se casó con un patinador artístico húngaro, István Szenes.',
       '"Si la gente pobre supiera cuán rica es la gente rica habría disturbios en las calles".\nLa forma m

In [None]:
# Creating a column "Split_Text" which is a list of two sentences.
df_str_rel['Split_Text'] = df_str_rel['Text'].apply(lambda x: x.split("\n"))
df_str_rel.head()

Unnamed: 0,PairID,Text,Split_Text
0,ESP-test-0000,Los menonitas amish con ascendencia suiza de G...,[Los menonitas amish con ascendencia suiza de ...
1,ESP-test-0001,El perro negro está jugando con el perro marró...,[El perro negro está jugando con el perro marr...
2,ESP-test-0002,"Cuando se agita un disolvente, dos líquidos in...","[Cuando se agita un disolvente, dos líquidos i..."
3,ESP-test-0003,La exsoldado de Estados Unidos Chelsea Manning...,[La exsoldado de Estados Unidos Chelsea Mannin...
4,ESP-test-0004,La catedral de Módena es uno de los lugares de...,[La catedral de Módena es uno de los lugares d...


# Dice Score (Overlap Score)

A simple baseline for estimating semantic relatedness between two sentences is the proportion of words that they share.

There are many ways to change the score below. Consider:
1. Removing stop words and/or punctuation
2. Counting duplicate words (currently not counted)
3. Weighting rarer words differently
4. Splitting tokens differently

In [None]:
def dice_score(s1,s2):
  s1 = s1.lower()
  s1_split = re.findall(r"\w+|[^\w\s]", s1, re.UNICODE)

  s2 = s2.lower()
  s2_split = re.findall(r"\w+|[^\w\s]", s2, re.UNICODE)

  # Dice coefficient: 2 * |A ∩ B| / (|A| + |B|), so identical sentences score 1.0
  dice_coef = 2 * len(set(s1_split).intersection(set(s2_split))) / (len(set(s1_split)) + len(set(s2_split)))
  return round(dice_coef, 2)
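
One of the variations listed above, counting duplicate words, can be sketched with token multisets. This is an illustrative variant rather than part of the provided baseline: the name `dice_score_multiset` is hypothetical, it uses the standard Dice form 2·|A ∩ B| / (|A| + |B|), and returning 0 for two empty sentences is an assumption.

```python
import re
from collections import Counter

def dice_score_multiset(s1, s2):
    """Dice overlap on token multisets, so repeated tokens count (hypothetical helper)."""
    tokens1 = Counter(re.findall(r"\w+|[^\w\s]", s1.lower()))
    tokens2 = Counter(re.findall(r"\w+|[^\w\s]", s2.lower()))
    total = sum(tokens1.values()) + sum(tokens2.values())
    if total == 0:
        return 0.0  # assumption: two empty sentences score 0
    # Counter & Counter keeps the minimum count of each shared token
    shared = sum((tokens1 & tokens2).values())
    return round(2 * shared / total, 2)

print(dice_score_multiset("the cat sat on the mat", "the cat is on the mat"))
```

Here both sentences have six tokens and share five of them once the repeated "the" is counted twice, which is what distinguishes this variant from the set-based score.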

# Transformers Score

In [None]:
!pip install sentence_transformers

from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-mpnet-base-v2')


Collecting sentence_transformers
  Downloading sentence_transformers-2.3.0-py3-none-any.whl (132 kB)
Collecting sentencepiece (from sentence_transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece, sentence_transformers
Successfully installed sentence_transformers-2.3.0 sentencepiece-0.1.99



In [None]:
def transformer_score(s1,s2):
  # Encode each sentence into a dense embedding
  query_embedding = model.encode(s1, convert_to_tensor=True)
  passage_embedding = model.encode(s2, convert_to_tensor=True)

  # Cosine similarity, computed once; .cpu() keeps this working when the model runs on a GPU
  score = util.cos_sim(query_embedding, passage_embedding).cpu().numpy()[0][0]
  print(score)

  return round(score, 2)
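
Conceptually, `util.cos_sim` returns the cosine of the angle between the two embedding vectors. The following is a minimal NumPy sketch of that quantity, not the library's implementation; the function name `cosine_similarity` and the three-dimensional vectors are made up for illustration (real `all-mpnet-base-v2` embeddings have 768 dimensions).

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two 1-D vectors (illustrative helper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy "embeddings", invented for this example
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a
c = [3.0, 0.0, -1.0]  # orthogonal to a (dot product is 0)

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```

Because of floating-point rounding, a computed similarity can land marginally above 1, as with the value 1.0000002 in the printed scores of this notebook.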

## Calculate Score

In [None]:
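# Gold scores exist only for labeled data; the test file has no 'Score' column, so this stays commented out.
#true_scores = df_str_rel['Score'].values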
pred_scores = []

for index,row in df_str_rel.iterrows():
  s1,s2 = row["Text"].split("\n")

  # Transformer (cosine similarity) score
  pred_scores.append(transformer_score(s1,s2))

0.7862749
0.60591716
0.3618881
0.4380784
0.3465919
0.5263203
0.20976025
0.21170175
0.7816691
0.3704561
0.5051442
0.42191756
0.5349462
0.42114514
0.255666
0.93639493
0.38840967
0.992826
0.63347185
0.5484288
0.28990117
0.8941356
0.631478
0.6722802
0.45421103
0.96204257
0.37417752
0.30325055
0.9347656
0.50791466
0.44808674
0.35258505
0.451835
0.5873288
0.21252754
0.3290878
0.9475702
1.0000002
0.35067922
0.9899763
0.22791249
0.30834377
0.31718907
0.48497516
0.9918673
0.88225853
0.955107
0.4958588
0.43950862
0.37685025
0.35170916
0.5541036
0.9156812
0.8884003
0.38823304
0.48409724
0.94704086
0.91106725
0.57599604
0.5091741
0.5143227
0.62118834
0.56627095
0.79861283
0.37381056
0.4534775
0.41697642
0.40395576
0.47305942
0.39967528
0.59094375
0.544163
0.38979167
0.4104363
0.41191635
0.45869613
0.20765752
0.8114996
0.4854412
0.2864092
0.44062987
0.86153364
0.39126092
0.48523033
0.41640198
0.96139824
0.3476671
0.6953325
0.9751401
0.15712458
0.96906734
0.2063466
0.3356008
0.6529733
0.4885695
0.78

In [None]:
# How well do the predictions correlate with human judgments?
# Requires gold labels: true_scores must first be defined from a file that has a 'Score' column.
print("Pearson Correlation:", round(pearsonr(true_scores,pred_scores)[0],2))

Pearson Correlation: 0.58
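
The imports at the top of the notebook also bring in `spearmanr`, which is never called. As a small sketch of how the two correlation measures are computed side by side, the snippet below uses made-up gold and predicted scores; `toy_true` and `toy_pred` are hypothetical and unrelated to the task data.

```python
from scipy.stats import pearsonr, spearmanr

# Made-up gold and predicted scores, for illustration only
toy_true = [0.10, 0.35, 0.50, 0.80, 0.95]
toy_pred = [0.20, 0.30, 0.65, 0.70, 0.90]

# Pearson measures linear agreement; Spearman compares only the rank order
pearson_r = pearsonr(toy_true, toy_pred)[0]
spearman_rho = spearmanr(toy_true, toy_pred)[0]

print("Pearson Correlation:", round(pearson_r, 2))
print("Spearman Correlation:", round(spearman_rho, 2))
```

In this toy case both lists are in the same rank order, so the Spearman value is at its maximum while the Pearson value is slightly lower.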


# Generate submission file

### Append prediction to dataframe

In [None]:
df_str_rel['Pred_Score'] = pred_scores
df_str_rel.head()

Unnamed: 0,PairID,Text,Split_Text,Pred_Score
0,ESP-test-0000,Los menonitas amish con ascendencia suiza de G...,[Los menonitas amish con ascendencia suiza de ...,0.79
1,ESP-test-0001,El perro negro está jugando con el perro marró...,[El perro negro está jugando con el perro marr...,0.61
2,ESP-test-0002,"Cuando se agita un disolvente, dos líquidos in...","[Cuando se agita un disolvente, dos líquidos i...",0.36
3,ESP-test-0003,La exsoldado de Estados Unidos Chelsea Manning...,[La exsoldado de Estados Unidos Chelsea Mannin...,0.44
4,ESP-test-0004,La catedral de Módena es uno de los lugares de...,[La catedral de Módena es uno de los lugares d...,0.35


### Generate submission file

The submission file has two columns: '**PairID**' and '**Pred_Score**'.

In [None]:
df_str_rel[['PairID', 'Pred_Score']].to_csv('pred_esp_a.csv', index=False)
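
Before uploading, it can help to read the file back and check its shape. The sketch below runs on a toy dataframe written to an in-memory buffer; the helper `check_submission` is hypothetical, and the specific checks (exact column names, no duplicate IDs, no missing scores) are assumptions about what a well-formed file looks like, not official task requirements.

```python
import io
import pandas as pd

def check_submission(df):
    """Return a list of problems found in a submission dataframe (hypothetical helper)."""
    problems = []
    if list(df.columns) != ['PairID', 'Pred_Score']:
        problems.append("columns must be exactly ['PairID', 'Pred_Score']")
        return problems  # remaining checks need both columns
    if df['PairID'].duplicated().any():
        problems.append("duplicate PairID values")
    if df['Pred_Score'].isna().any():
        problems.append("missing Pred_Score values")
    return problems

# Toy predictions, invented for this example
toy = pd.DataFrame({'PairID': ['toy-0001', 'toy-0002'], 'Pred_Score': [0.79, 0.61]})

# Round-trip through CSV text, as the real file would be written and re-read
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
print(check_submission(pd.read_csv(buffer)))
```

For the real file, the same helper could be applied to `pd.read_csv('pred_esp_a.csv')`.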