
Installing the Hugging Face Datasets library

In [30]:
!pip install datasets



Loading the dataset

In [31]:
from datasets import load_dataset

dataset = load_dataset("paws", "labeled_final")

# Using Sentence Transformers
Resource referenced:

1.  [sbert](https://www.sbert.net/docs/sentence_transformer/usage/semantic_textual_similarity.html)


1.  Install **sentence-transformers**.
2.  **-q** is a flag that suppresses the output during installation.

In [32]:
%pip install sentence-transformers -q

1. Import the **SentenceTransformer** class from **sentence_transformers**.
2. **util** is a module that includes helper functions such as cosine similarity.
3. The **SentenceTransformer** class is *instantiated* with **all-MiniLM-L6-v2**, which generates the *vector representations* (embeddings) of the given *sentences*.

In [33]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')



1.  Extract the first 10 rows of 'sentence1' and 'sentence2' from the test split.
2.  Generate vector representations (embeddings) of sentence1 and sentence2 with the SentenceTransformer model.
3.  Calculate the cosine similarity between the embeddings.
4.  Use a for loop to iterate over the first 10 sentence pairs, printing each pair and its similarity score.
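
To make the cosine-similarity step concrete, here is a minimal sketch in plain Python. The vectors are made-up 3-dimensional stand-ins for real embeddings, and the `cosine` helper is written for illustration only. It shows that the score for pair *i* is the diagonal entry `[i][i]` of the full similarity matrix, which is the entry the loop in the next cell reads.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings" (made up for illustration; real ones come from model.encode)
emb1 = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
emb2 = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]

# Full similarity matrix: entry [i][j] compares emb1[i] with emb2[j]
matrix = [[cosine(u, v) for v in emb2] for u in emb1]

# Scores for aligned pairs only, i.e. the diagonal of the matrix
paired = [cosine(u, v) for u, v in zip(emb1, emb2)]
print(paired)  # → [1.0, 0.0]
```

Off-diagonal entries such as `matrix[1][0]` compare sentences from different pairs and are not needed when only aligned pairs are scored.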

In [34]:
s1 = dataset['test']['sentence1'][:10]
s2 = dataset['test']['sentence2'][:10]
embeddings1 = model.encode(s1, convert_to_tensor=True)
embeddings2 = model.encode(s2, convert_to_tensor=True)
cosine_score = util.cos_sim(embeddings1,embeddings2)
for i in range(10):
  cosine_score_value = cosine_score[i][i].item()
  print(f"{i}.\nsentence1:{s1[i]}\nsentence2:{s2[i]}\ncosine similarity score:{cosine_score_value}\n")


0.
sentence1:This was a series of nested angular standards , so that measurements in azimuth and elevation could be done directly in polar coordinates relative to the ecliptic .
sentence2:This was a series of nested polar scales , so that measurements in azimuth and elevation could be performed directly in angular coordinates relative to the ecliptic .
cosine similarity score:0.9322909712791443

1.
sentence1:His father emigrated to Missouri in 1868 but returned when his wife became ill and before the rest of the family could also go to America .
sentence2:His father emigrated to America in 1868 , but returned when his wife became ill and before the rest of the family could go to Missouri .
cosine similarity score:0.9918644428253174

2.
sentence1:In January 2011 , the Deputy Secretary General of FIBA Asia , Hagop Khajirian , inspected the venue together with SBP - President Manuel V. Pangilinan .
sentence2:In January 2011 , FIBA Asia deputy secretary general Hagop Khajirian along with S

Loading the dataset

In [35]:
from datasets import load_dataset

dataset1 = load_dataset("PiC/phrase_similarity")

In [36]:
phrases1 = dataset1['test']['phrase1'][:20]
phrases2 = dataset1['test']['phrase2'][:20]

embeddings1 = model.encode(phrases1, convert_to_tensor=True)
embeddings2 = model.encode(phrases2, convert_to_tensor=True)

cosine_score = util.cos_sim(embeddings1, embeddings2)

for i in range(20):
    cosine_score_value = cosine_score[i][i].item()
    print("{} \t\t{} \t\tScore:{:.4f}".format(phrases1[i], phrases2[i], cosine_score_value))

air position 		posture while jumping 		Score:0.5256
associated track 		correlating music single 		Score:0.4104
whole parts 		extended sections 		Score:0.3237
wide set 		spacious collection 		Score:0.4795
full protection 		complete defense 		Score:0.5793
prior case 		preceding game 		Score:0.4589
another station 		a separate airport 		Score:0.4244
initial activity 		starting task 		Score:0.5561
single square 		solitary border 		Score:0.2583
independent operation 		individual enterprise 		Score:0.3053
long segment 		quite a stretch 		Score:0.2828
borg family 		Borg household 		Score:0.8589
one die 		a single counter 		Score:0.4336
material one 		physical world 		Score:0.2893
offensive element 		attack component 		Score:0.4443
fast movements 		quick motions 		Score:0.7371
unnecessary credit 		gratuitous acknowledgment 		Score:0.3027
group address 		overall location 		Score:0.1806
third incarnation 		third embodiment 		Score:0.5671
visual record 		perceivable history 		Score:0.2086


Using commercial LLM APIs (i.e., ChatGPT)

Zero Shot Learning

In [37]:
%pip install datasets scikit-learn




In [38]:
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load the dataset
dataset = load_dataset("paws", "labeled_final", split="test[:100]")  # Using a subset for efficiency

# Extract the sentences
sentences1 = dataset['sentence1']
sentences2 = dataset['sentence2']

# Combine all sentences for TF-IDF fitting
all_sentences = sentences1 + sentences2

# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the sentences
tfidf_matrix = vectorizer.fit_transform(all_sentences)

# Split the transformed matrix into two parts for each sentence pair
tfidf_sentences1 = tfidf_matrix[:len(sentences1)]
tfidf_sentences2 = tfidf_matrix[len(sentences1):]

# Calculate cosine similarity for each pair of sentences
similarity_scores = cosine_similarity(tfidf_sentences1, tfidf_sentences2)

# Create a list of similarity scores for each sentence pair
similarity_scores_diagonal = similarity_scores.diagonal()

# Output the first few similarity scores as a sample
print(similarity_scores_diagonal[:10])


[0.86965731 0.98411651 0.93038253 0.8584274  1.         0.95025992
 0.91144189 0.98411651 0.92410881 0.98314708]
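
One likely reason these scores are so high is that TF-IDF treats a sentence as a bag of words and ignores word order, while pairs such as the Missouri/America example earlier in the notebook use almost the same words in a different order. The sketch below is a simplified illustration under stated assumptions: it uses raw term counts instead of TF-IDF weights, whitespace tokenization, and shortened versions of that sentence pair. The `bow_cosine` helper is hypothetical.

```python
import math
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors (word order is ignored)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Shortened versions of the PAWS pair: same words, two place names swapped
s1 = "his father emigrated to missouri before the family could go to america"
s2 = "his father emigrated to america before the family could go to missouri"

score = bow_cosine(s1, s2)
print(round(score, 6))  # → 1.0
```

Because both sentences contain exactly the same words with the same counts, any bag-of-words weighting gives them identical vectors, even though the swap changes the meaning.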


In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the dataset
dataset = load_dataset("PiC/phrase_similarity")

# Extract the first two phrases
phrase1 = dataset['test'][0]['phrase1']
phrase2 = dataset['test'][0]['phrase2']

# Load a pre-trained model and tokenizer
model_name = "textattack/bert-base-uncased-MNLI"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenize the phrases
tokenized_input = tokenizer(phrase1, phrase2, return_tensors='pt', padding=True, truncation=True)

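# Run the model without tracking gradients
with torch.no_grad():
    output = model(**tokenized_input)

# Use the probability at index 1 as the score. This assumes index 1 is the
# "entailment" class for this checkpoint; check the model's label mapping.
similarity_score = torch.softmax(output.logits, dim=1)[0][1].item()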

print("Phrase 1:", phrase1)
print("Phrase 2:", phrase2)
print("Similarity Score:", similarity_score)


Phrase 1: air position
Phrase 2: posture while jumping
Similarity Score: 0.08406110852956772
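
The score printed above is a softmax probability for a single class of the classifier, not a cosine similarity. The following minimal sketch shows what the softmax step computes. The logits are made-up numbers rather than model output, and the `softmax` helper is written here for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 3-class head (made-up numbers, not model output)
logits = [2.0, -1.0, 0.5]
probs = softmax(logits)

# The cell above reads index 1; which label that index denotes depends on the checkpoint
class_index = 1
print(probs[class_index])
```

A low value here means the classifier assigns little probability to that class, which is a different notion from the embedding-based similarity scores used elsewhere in the notebook.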


In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download('stopwords')
nltk.download('punkt')

def preprocess(sentence):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(sentence.lower())
    filtered_sentence = [word for word in word_tokens if word not in stop_words]
    return ' '.join(filtered_sentence)

def cosine_similarity_score(sentence1, sentence2):
    preprocessed_sentence1 = preprocess(sentence1)
    preprocessed_sentence2 = preprocess(sentence2)

    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform([preprocessed_sentence1, preprocessed_sentence2])

    similarity_score = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]

    return similarity_score

# Test cases
sentences = [
    ("This was a series of nested angular standards, so that measurements in azimuth and elevation could be done directly in polar coordinates relative to the ecliptic.",
     "This was a series of nested polar scales, so that measurements in azimuth and elevation could be performed directly in angular coordinates relative to the ecliptic."),
    ("His father emigrated to Missouri in 1868 but returned when his wife became ill and before the rest of the family could also go to America.",
     "His father emigrated to America in 1868, but returned when his wife became ill and before the rest of the family could go to Missouri."),
    ("In January 2011, the Deputy Secretary General of FIBA Asia, Hagop Khajirian, inspected the venue together with SBP-President Manuel V. Pangilinan.",
     "In January 2011, FIBA Asia deputy secretary general Hagop Khajirian along with SBP president Manuel V. Pangilinan inspected the venue.")
]

for idx, (sentence1, sentence2) in enumerate(sentences):
    similarity_score = cosine_similarity_score(sentence1, sentence2)
    print(f"Similarity score between sentences {idx + 1}: {similarity_score}")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Similarity score between sentences 1: 0.7523197619890015
Similarity score between sentences 2: 0.9317157650164152
Similarity score between sentences 3: 0.8836351388995085


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def calculate_similarity(phrase1, phrase2):
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform([phrase1, phrase2])
    similarity_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)
    return similarity_matrix[0, 1]

phrase1 = "air position"
phrase2 = "posture while jumping"
similarity_score = calculate_similarity(phrase1, phrase2)
print(f"Similarity score between '{phrase1}' and '{phrase2}': {similarity_score}")


Similarity score between 'air position' and 'posture while jumping': 0.0
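
The score of 0.0 follows directly from the bag-of-words representation: the two phrases share no tokens, so their vectors have no overlapping non-zero dimensions. A minimal sketch, assuming simple whitespace tokenization rather than the vectorizer's own tokenizer:

```python
from collections import Counter

phrase1 = "air position"
phrase2 = "posture while jumping"

counts1 = Counter(phrase1.lower().split())
counts2 = Counter(phrase2.lower().split())

# Tokens that appear in both phrases
shared = set(counts1) & set(counts2)

# Dot product of the two count vectors; only shared tokens can contribute
dot = sum(counts1[t] * counts2[t] for t in counts1)

print(sorted(shared), dot)  # → [] 0
```

With a zero dot product the cosine similarity is zero regardless of how the terms are weighted, whereas the sentence-embedding model earlier in the notebook scored the same pair at 0.5256.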


Using open-source LLMs/APIs (i.e., Mistral AI)

In [2]:
!pip install sentence-transformers
!pip install datasets



Zero Shot Setting

In [17]:
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
import numpy as np

# Load the dataset
dataset = load_dataset("paws", "labeled_final")

# Load a pre-trained model
model_name = 'all-MiniLM-L6-v2'
model = SentenceTransformer(model_name)

# Calculate the similarity for the first 5 pairs of sentences
for i in range(5):
    sentence1 = dataset['test'][i]['sentence1']
    sentence2 = dataset['test'][i]['sentence2']

    # Calculate the embeddings for the sentences
    embedding1 = model.encode(sentence1)
    embedding2 = model.encode(sentence2)

    # Calculate the cosine similarity between the embeddings
    similarity_score = np.dot(embedding1, embedding2) / (np.linalg.norm(embedding1) * np.linalg.norm(embedding2))

    print(f"Sentence 1: {sentence1}")
    print(f"Sentence 2: {sentence2}")
    print(f"Similarity score: {similarity_score}\n")


Sentence 1: This was a series of nested angular standards , so that measurements in azimuth and elevation could be done directly in polar coordinates relative to the ecliptic .
Sentence 2: This was a series of nested polar scales , so that measurements in azimuth and elevation could be performed directly in angular coordinates relative to the ecliptic .
Similarity score: 0.9322909116744995

Sentence 1: His father emigrated to Missouri in 1868 but returned when his wife became ill and before the rest of the family could also go to America .
Sentence 2: His father emigrated to America in 1868 , but returned when his wife became ill and before the rest of the family could go to Missouri .
Similarity score: 0.9918646812438965

Sentence 1: In January 2011 , the Deputy Secretary General of FIBA Asia , Hagop Khajirian , inspected the venue together with SBP - President Manuel V. Pangilinan .
Sentence 2: In January 2011 , FIBA Asia deputy secretary general Hagop Khajirian along with SBP presid

Few Shot Setting

In [8]:
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
import numpy as np

# Load the dataset
dataset = load_dataset("paws", "labeled_final")

# Load a pre-trained model
model_name = 'all-MiniLM-L6-v2'
model = SentenceTransformer(model_name)

# Calculate the similarity for the first 5 pairs of sentences
for i in range(5):
    sentence1 = dataset['test'][i]['sentence1']
    sentence2 = dataset['test'][i]['sentence2']

    # Calculate the embeddings for the sentences
    embedding1 = model.encode(sentence1)
    embedding2 = model.encode(sentence2)

    # Calculate the cosine similarity between the embeddings
    similarity_score = np.dot(embedding1, embedding2) / (np.linalg.norm(embedding1) * np.linalg.norm(embedding2))

    print(f"Sentence 1: {sentence1}")
    print(f"Sentence 2: {sentence2}")
    print(f"Similarity score: {similarity_score}\n")


Sentence 1: This was a series of nested angular standards , so that measurements in azimuth and elevation could be done directly in polar coordinates relative to the ecliptic .
Sentence 2: This was a series of nested polar scales , so that measurements in azimuth and elevation could be performed directly in angular coordinates relative to the ecliptic .
Similarity score: 0.9322909116744995

Sentence 1: His father emigrated to Missouri in 1868 but returned when his wife became ill and before the rest of the family could also go to America .
Sentence 2: His father emigrated to America in 1868 , but returned when his wife became ill and before the rest of the family could go to Missouri .
Similarity score: 0.9918646812438965

Sentence 1: In January 2011 , the Deputy Secretary General of FIBA Asia , Hagop Khajirian , inspected the venue together with SBP - President Manuel V. Pangilinan .
Sentence 2: In January 2011 , FIBA Asia deputy secretary general Hagop Khajirian along with SBP presid

Zero Shot Setting

In [13]:
from datasets import load_dataset
import sentence_transformers

# Load the dataset
dataset1 = load_dataset("PiC/phrase_similarity")

# Get the two phrases you want to compare
phrase1 = dataset1['test'][0]['phrase1']
phrase2 = dataset1['test'][0]['phrase2']

# Initialize a Sentence Transformer model
model = sentence_transformers.SentenceTransformer('all-MiniLM-L6-v2')

# Convert the phrases to embeddings
embedding1 = model.encode(phrase1)
embedding2 = model.encode(phrase2)

# Calculate the cosine similarity between the embeddings
similarity_score = sentence_transformers.util.cos_sim(embedding1, embedding2)

print(f"The similarity score between '{phrase1}', and '{phrase2}' is '{similarity_score}' ")




The similarity score between 'newly formed camp', and 'recently made encampment' is 'tensor([[0.6109]])' 


Few Shot Setting

In [29]:
from datasets import load_dataset
import sentence_transformers

# Load the dataset
dataset1 = load_dataset("PiC/phrase_similarity")

# Get the two phrases you want to compare
phrase1 = dataset1['test'][0]['phrase1']
phrase2 = dataset1['test'][0]['phrase2']

# Initialize a Sentence Transformer model
model = sentence_transformers.SentenceTransformer('all-MiniLM-L6-v2')

# Convert the phrases to embeddings
embedding1 = model.encode(phrase1)
embedding2 = model.encode(phrase2)

# Calculate the cosine similarity between the embeddings
similarity_score = sentence_transformers.util.cos_sim(embedding1, embedding2)

# Optionally convert the 1x1 tensor to a Python float (left disabled here)
# similarity_score = similarity_score.item()

# Print the result and compare it with the given similarity scores
print(f"The similarity score between '{phrase1}' and '{phrase2}' is {similarity_score}")

given_scores = {
    ("air position", "posture while jumping"): 0.5256,
    ("associated track", "correlating music single"): 0.4104,
    ("whole parts", "extended sections"): 0.3237,
    ("wide set", "spacious collection"): 0.4795,
    ("full protection", "complete defense"): 0.5793,
}

if (phrase1, phrase2) in given_scores:
    given_score = given_scores[(phrase1, phrase2)]
    print(f"The given similarity score is {given_score:.4f}")
    print(f"The difference between the two scores is {abs(similarity_score - given_score)}")


The similarity score between 'air position' and 'posture while jumping' is tensor([[0.5256]])
The given similarity score is 0.5256
The difference between the two scores is tensor([[4.4346e-05]])
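
In a few-shot setting, a handful of worked examples is usually placed in the prompt ahead of the query. The sketch below shows one way the pairs in `given_scores` could be formatted into such a prompt. The template wording and the `build_few_shot_prompt` helper are hypothetical, the scores are the model-computed values from this notebook used only as placeholder labels, and no LLM API is called.

```python
# Example pairs and scores copied from the `given_scores` dictionary above
examples = {
    ("air position", "posture while jumping"): 0.5256,
    ("associated track", "correlating music single"): 0.4104,
    ("whole parts", "extended sections"): 0.3237,
}

def build_few_shot_prompt(examples, phrase1, phrase2):
    """Format labeled examples followed by the unlabeled query (hypothetical template)."""
    lines = ["Rate the semantic similarity of each phrase pair from 0 to 1.", ""]
    for (p1, p2), score in examples.items():
        lines.append(f"Phrase 1: {p1}\nPhrase 2: {p2}\nScore: {score:.4f}\n")
    # The query pair is left without a score for the model to complete
    lines.append(f"Phrase 1: {phrase1}\nPhrase 2: {phrase2}\nScore:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(examples, "wide set", "spacious collection")
print(prompt)
```

The resulting string would then be sent to whichever LLM is being evaluated; a zero-shot prompt is the same template with an empty `examples` dictionary.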


In [26]:
from datasets import load_dataset
import sentence_transformers

# Load the dataset
dataset1 = load_dataset("PiC/phrase_similarity")

# Get the two phrases you want to compare
phrase1 = dataset1['test']['phrase1'][0]
phrase2 = dataset1['test']['phrase2'][0]

# Initialize a Sentence Transformer model
model = sentence_transformers.SentenceTransformer('all-MiniLM-L6-v2')

# Convert the phrases to embeddings
embedding1 = model.encode(phrase1)
embedding2 = model.encode(phrase2)

# Calculate the cosine similarity between the embeddings
similarity_score = sentence_transformers.util.cos_sim(embedding1, embedding2)

# Optionally convert the 1x1 tensor to a Python float (left disabled here)
# similarity_score = similarity_score.item()

# Print the result and compare it with the given similarity scores
print(f"The similarity score between '{phrase1}' and '{phrase2}' is {similarity_score} ")

The similarity score between 'air position' and 'posture while jumping' is tensor([[0.5256]]) 
