<a href="https://colab.research.google.com/github/marco-siino/SemEval2024/blob/main/Task%201/SemEval2024_Task1_eng_subA_all_mpnet_base_v2_MSiino.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Semantic Relatedness -- English Example

### Package Imports

In [None]:
import re
import pandas as pd
import numpy as np
from scipy.stats import spearmanr, pearsonr
import matplotlib.pyplot as plt
plt.style.use('ggplot')

### Data Import

Each row of the training data holds a real-valued semantic textual relatedness score (between 0 and 1) for a pair of English sentences.

The data is structured as a CSV file with the following fields:
- PairID: a unique identifier for the sentence pair
- Text: two sentences separated by a newline ('\n') character
- Score: the semantic textual relatedness score for the two sentences (absent from the test file, as the first output below shows)
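To make the layout concrete, here is a minimal sketch that parses a tiny file in this format using only the standard library. The two rows (IDs, sentences and scores) are made up for illustration; the point is that the newline inside the quoted `Text` field survives CSV parsing and can then be used to separate the two sentences.

```python
import csv
import io

# Hypothetical two-row file in the layout described above; the IDs,
# sentences and scores are invented for illustration only.
raw = (
    "PairID,Text,Score\n"
    '"EX-0001","A dog runs through water.\nA dog is running in a river.",0.85\n'
    '"EX-0002","I love this book.\nThe train was late again.",0.10\n'
)

# Quoted fields may contain newlines, so each record is still one row.
rows = list(csv.DictReader(io.StringIO(raw)))

for row in rows:
    first, second = row["Text"].split("\n")
    print(row["PairID"], float(row["Score"]), "|", first, "|", second)
```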

Below we will show you how to load and re-format the provided data file.

In [None]:
# Load the Test File
df_str_rel = pd.read_csv('eng_test.csv')
df_str_rel.head()

Unnamed: 0,PairID,Text
0,ENG-test-0000,Egypt's Brotherhood stands ground after killin...
1,ENG-test-0001,install it for fre and get to know what all u ...
2,ENG-test-0002,"Also, it was one of the debut novels that I wa..."
3,ENG-test-0003,"Therefore, you can use the code BRAIL, BASIL, ..."
4,ENG-test-0004,Solid YA novel with a funky take on zombies an...


In [None]:
# Load the labelled pilot data (this reuses, and so overwrites, df_str_rel)
df_str_rel = pd.read_csv('Semantic_Relatedness_SemEval2024/Pilot_data/sem_text_rel_ranked.csv', usecols=[3,4,5])
df_str_rel.head()

Unnamed: 0,PairID,Text,Score
0,Formality_pp_222,"It that happens, just pull the plug.\nif that ...",1.0
1,STS_237,A black dog running through water.\nA black do...,1.0
2,ParaNMT_pp_204,I've been searchingthe entire abbey for you.\n...,1.0
3,Formality_pp_119,If he is good looking and has a good personali...,1.0
4,Formality_pp_174,"She does not hate you, she is just annoyed wit...",1.0


In [None]:
df_str_rel['Text'].values

array(["Egypt's Brotherhood stands ground after killings\nEgypt: Muslim Brotherhood Stands Behind Morsi",
       'install it for fre and get to know what all u have to download\nInstall the program, which is free to download, then get to know all of the download options.',
       'Also, it was one of the debut novels that I was most excited to read.\nPretty much the first thing people mentioned when I asked about YA genderbending novels.',
       ..., 'Find out in the book.\nbook with character that has my name',
       "And, Cassandra Clare wrapped up all the characters for us.\nI wasn't left wondering about any of them.",
       'Just go ahead and read Delirium, because this is what it is: a population living without love.\nRead only in December to enjoy what little is there.'],
      dtype=object)

In [None]:
# Creating a column "Split_Text" which is a list of two sentences.
df_str_rel['Split_Text'] = df_str_rel['Text'].apply(lambda x: x.split("\n"))
df_str_rel.head()

Unnamed: 0,PairID,Text,Split_Text
0,ENG-test-0000,Egypt's Brotherhood stands ground after killin...,[Egypt's Brotherhood stands ground after killi...
1,ENG-test-0001,install it for fre and get to know what all u ...,[install it for fre and get to know what all u...
2,ENG-test-0002,"Also, it was one of the debut novels that I wa...","[Also, it was one of the debut novels that I w..."
3,ENG-test-0003,"Therefore, you can use the code BRAIL, BASIL, ...","[Therefore, you can use the code BRAIL, BASIL,..."
4,ENG-test-0004,Solid YA novel with a funky take on zombies an...,[Solid YA novel with a funky take on zombies a...
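A plain `split("\n")` returns more than two items if a text happens to contain an extra newline. The sketch below is one possible defensive variant, not part of the original notebook: the helper name `split_pair` is hypothetical, and splitting at the first newline is an assumption about how such rows should be handled.

```python
def split_pair(text):
    """Split a 'Text' field into exactly two sentences at the first newline."""
    first, sep, second = text.partition("\n")
    if not sep:
        # No newline at all: the row does not contain a sentence pair.
        raise ValueError(f"expected a newline-separated pair, got: {text!r}")
    return [first.strip(), second.strip()]


print(split_pair("Find out in the book.\nbook with character that has my name"))
```

It could be applied in place of the lambda, e.g. `df_str_rel['Text'].apply(split_pair)`.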


# Dice Score (Overlap Score)

A simple baseline for estimating the semantic relatedness of two sentences is the proportion of distinct tokens (words and punctuation marks) that they share.

There are many ways to change the score below. Consider:
1. Removing stop words and/or punctuation
2. Counting duplicate words (currently not counted)
3. Weighting rarer words differently
4. Splitting tokens differently
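As an illustration of the first two ideas, the sketch below drops punctuation and a few stop words, and counts repeated words by comparing token multisets instead of sets. It uses the standard Dice form, 2 × overlap / (size of first + size of second). The stop-word list and the names `STOP_WORDS`, `tokenize` and `dice_multiset` are assumptions made for this example only; rarity weighting and alternative tokenizers are not covered.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; use a fuller list in practice).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and"}


def tokenize(sentence, drop_stop_words=True):
    """Lowercase and keep word tokens only, which drops punctuation."""
    tokens = re.findall(r"\w+", sentence.lower())
    if drop_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def dice_multiset(s1, s2, drop_stop_words=True):
    """Dice coefficient over token multisets, so repeated words count."""
    c1 = Counter(tokenize(s1, drop_stop_words))
    c2 = Counter(tokenize(s2, drop_stop_words))
    total = sum(c1.values()) + sum(c2.values())
    if total == 0:
        return 0.0  # nothing left to compare
    # Counter & Counter keeps the minimum count of each shared token.
    overlap = sum((c1 & c2).values())
    return round(2 * overlap / total, 2)


print(dice_multiset("A black dog running through water.",
                    "A black dog is running through the water."))  # prints 1.0
```

With stop words removed the two example sentences share all five remaining tokens, so the score is 1.0; a score like this shows how aggressive filtering can hide small differences between sentences.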

In [None]:
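def dice_score(s1,s2):
  # Tokenise into sets of lowercase words and individual punctuation marks.
  s1_split = set(re.findall(r"\w+|[^\w\s]", s1.lower(), re.UNICODE))
  s2_split = set(re.findall(r"\w+|[^\w\s]", s2.lower(), re.UNICODE))

  total = len(s1_split) + len(s2_split)
  if total == 0:
    return 0.0  # both sentences are empty

  # Dice coefficient: 2 * |A & B| / (|A| + |B|), which ranges from 0 to 1.
  dice_coef = 2 * len(s1_split & s2_split) / total
  return round(dice_coef, 2)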

# Transformers Score

In [None]:
!pip install sentence_transformers

from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-mpnet-base-v2')


Collecting sentence_transformers
  Downloading sentence_transformers-2.3.0-py3-none-any.whl (132 kB)
Collecting sentencepiece (from sentence_transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece, sentence_transformers
Successfully installed sentence_transformers-2.3.0 sentencepiece-0.1.99


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
def transformer_score(s1,s2):
  # Encode each sentence into a single embedding vector.
  query_embedding = model.encode(s1, convert_to_tensor=True)
  passage_embedding = model.encode(s2, convert_to_tensor=True)

  # cos_sim returns a 1x1 tensor; .item() extracts a plain Python float
  # and also works when the tensor lives on a GPU.
  score = util.cos_sim(query_embedding, passage_embedding).item()
  print(score)

  return round(score, 2)
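For intuition about what the cosine similarity measures, here is a minimal standard-library sketch of the same quantity on plain Python lists. The vectors are toy values rather than real sentence embeddings, and returning 0.0 for a zero vector is a convention chosen for this example. Because the cosine ranges from -1 to 1, slightly negative predictions are possible even though the gold scores lie between 0 and 1.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for this sketch: undefined cases score 0
    return dot / (norm_u * norm_v)


# Toy vectors (not real embeddings).
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # same direction, close to 1
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal, 0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction, -1
```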

## Calculate Score

In [None]:
#true_scores = df_str_rel['Score'].values
pred_scores = []

for index,row in df_str_rel.iterrows():
  s1,s2 = row["Text"].split("\n")

  # Transformer (cosine similarity) score
  pred_scores.append(transformer_score(s1,s2))

[0.67180616]
[0.5858357]
[0.47192085]
[0.04518123]
[0.43361932]
[0.09970629]
[0.17056051]
[0.19813669]
[0.64559174]
[0.2633849]
[0.7051522]
[0.18777591]
[0.10943045]
[0.14781614]
[0.5501696]
[0.7935852]
[0.2200626]
[0.4704823]
[0.6694157]
[0.29406953]
[0.7084053]
[0.33502802]
[0.20032458]
[0.01356762]
[0.6932968]
[0.77348715]
[-0.00173464]
[0.32275766]
[0.25486785]
[0.14861749]
[0.05608985]
[0.61889327]
[0.26830506]
[0.6891141]
[0.75037706]
[0.18310475]
[-0.00369433]
[0.96110344]
[0.15202755]
[0.8710099]
[0.8298831]
[0.6675395]
[0.21261176]
[0.3172735]
[0.2201983]
[0.1261324]
[0.48820645]
[0.30172703]
[0.8634758]
[0.46966392]
[0.48408803]
[0.70156556]
[0.69694304]
[0.859169]
[0.58597904]
[0.412085]
[0.78791136]
[0.4989751]
[0.2420541]
[0.9235457]
[0.6747817]
[0.23024726]
[0.1442755]
[0.3829977]
[0.2611591]
[0.6786245]
[0.30557638]
[0.33365706]
[0.05749208]
[-0.05438946]
[0.52919066]
[0.5651225]
[0.75596464]
[-0.02389339]
[0.25605965]
[0.17792645]
[0.03100724]
[0.97433966]
[0.33049667]


In [None]:
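# How well do the predictions correlate with human judgments?
# Gold scores exist only in labelled data; the test file has no 'Score' column.
if 'Score' in df_str_rel.columns:
  true_scores = df_str_rel['Score'].values
  print("Pearson Correlation:", round(pearsonr(true_scores,pred_scores)[0],2))
else:
  print("No gold scores in this file; skipping the correlation.")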

Pearson Correlation: 0.58
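The notebook imports `spearmanr` alongside `pearsonr` but only reports the Pearson value. As a rough illustration of how the two differ, the sketch below computes both by hand on made-up scores: Pearson works on the raw values, while Spearman is the Pearson correlation of their ranks. The helper names are hypothetical, and in practice the SciPy functions are the better choice.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


gold = [0.1, 0.4, 0.5, 0.9]  # made-up human scores
pred = [0.2, 0.3, 0.8, 0.7]  # made-up model scores
print("Pearson: ", round(pearson(gold, pred), 2))
print("Spearman:", round(spearman(gold, pred), 2))
```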


# Generate submission file

### Append prediction to dataframe

In [None]:
df_str_rel['Pred_Score'] = pred_scores
df_str_rel.head()

Unnamed: 0,PairID,Text,Split_Text,Pred_Score
0,ENG-test-0000,Egypt's Brotherhood stands ground after killin...,[Egypt's Brotherhood stands ground after killi...,0.67
1,ENG-test-0001,install it for fre and get to know what all u ...,[install it for fre and get to know what all u...,0.59
2,ENG-test-0002,"Also, it was one of the debut novels that I wa...","[Also, it was one of the debut novels that I w...",0.47
3,ENG-test-0003,"Therefore, you can use the code BRAIL, BASIL, ...","[Therefore, you can use the code BRAIL, BASIL,...",0.05
4,ENG-test-0004,Solid YA novel with a funky take on zombies an...,[Solid YA novel with a funky take on zombies a...,0.43


### Write the CSV file

The submission file has two columns: '**PairID**' and '**Pred_Score**'.

In [None]:
df_str_rel[['PairID', 'Pred_Score']].to_csv('pred_eng.csv', index=False)
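Before uploading, it may be worth sanity-checking the written file. The sketch below is one possible check using only the standard library; the function name `check_submission` and the specific rules (exact header, unique IDs, finite numeric scores) are assumptions for illustration, not official requirements of the task.

```python
import csv
import io
import math


def check_submission(csv_text):
    """Return a list of problems found in a submission CSV (empty if none)."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["PairID", "Pred_Score"]:
        return [f"unexpected header: {reader.fieldnames}"]
    seen = set()
    # Data rows start on line 2, after the header.
    for line_no, row in enumerate(reader, start=2):
        pair_id = row["PairID"]
        if pair_id in seen:
            problems.append(f"line {line_no}: duplicate PairID {pair_id}")
        seen.add(pair_id)
        try:
            score = float(row["Pred_Score"])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: non-numeric score")
            continue
        if not math.isfinite(score):
            problems.append(f"line {line_no}: score is not finite")
    return problems


example = "PairID,Pred_Score\nENG-test-0000,0.67\nENG-test-0001,0.59\n"
print(check_submission(example))  # prints []
```

To check the real file, its contents can be read with `open('pred_eng.csv').read()` and passed to the function.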