<a href="https://colab.research.google.com/github/nagasatvika/semantic-similarity/blob/main/sentence_similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Installing the Hugging Face Datasets library

In [26]:
!pip install datasets



Loading the dataset

In [27]:
from datasets import load_dataset

dataset = load_dataset("paws", "labeled_final")

In [28]:
import pandas as pd
df = pd.DataFrame(dataset['test'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8000 entries, 0 to 7999
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         8000 non-null   int64 
 1   sentence1  8000 non-null   object
 2   sentence2  8000 non-null   object
 3   label      8000 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 250.1+ KB


#Jaccard Similarity
Jaccard similarity is a metric that measures how close two text documents are in terms of the words they share: the number of words common to both documents divided by the total number of distinct words across the two.
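As a small worked illustration of this definition, the sketch below computes the score for two made-up sentences (they are assumptions for the example, not rows from PAWS). The score is the size of the intersection of the two word sets divided by the size of their union.

```python
# Hypothetical example sentences, chosen only to illustrate the formula.
a = set("the cat sat on the mat".lower().split())
b = set("the cat lay on the rug".lower().split())

common = a & b        # words that appear in both sentences
all_words = a | b     # distinct words across both sentences
score = len(common) / len(all_words)

print(sorted(common))   # the shared words
print(len(all_words))   # number of distinct words overall
print(score)            # shared words divided by distinct words
```

Here 3 words are shared out of 7 distinct words, so the score is 3/7. Because sets ignore both word order and repetition, two sentences that use the same words in a different order get a score of 1.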

Resources Referred:
[Jaccard Similarity](https://studymachinelearning.com/jaccard-similarity-text-similarity-metric-in-nlp/#google_vignette)

1. Define a function `lower_split` that converts the text to lower case and splits it into a set of words
2. Define a function `Jaccard_Similarity` that returns the similarity score computed from the intersection and union of the two word sets
3. Use a for loop over the first 10 sentence pairs of the test dataset
 >Extracting the two sentences

 >calling `lower_split` on sentence1 and sentence2

 >calling `Jaccard_Similarity` to find the similarity between the 2 sentences

 >printing 'sentence1', 'sentence2' and the similarity score





In [29]:
def lower_split(text):
    text = set(text.lower().split())
    return text
def Jaccard_Similarity(doc1, doc2):
    intersection = doc1.intersection(doc2)
    union = doc1.union(doc2)
    return float(len(intersection)) / len(union)
for example in dataset['test'].select(range(10)):
    sentence1 = example['sentence1']
    sentence2 = example['sentence2']

    doc1 = lower_split(sentence1)
    doc2 = lower_split(sentence2)

    similarity = Jaccard_Similarity(doc1, doc2)
    print(f'sentence1:"{sentence1}"\nsentence2:"{sentence2}"\nscore:{similarity}\n')

sentence1:"This was a series of nested angular standards , so that measurements in azimuth and elevation could be done directly in polar coordinates relative to the ecliptic ."
sentence2:"This was a series of nested polar scales , so that measurements in azimuth and elevation could be performed directly in angular coordinates relative to the ecliptic ."
score:0.8620689655172413

sentence1:"His father emigrated to Missouri in 1868 but returned when his wife became ill and before the rest of the family could also go to America ."
sentence2:"His father emigrated to America in 1868 , but returned when his wife became ill and before the rest of the family could go to Missouri ."
score:0.92

sentence1:"In January 2011 , the Deputy Secretary General of FIBA Asia , Hagop Khajirian , inspected the venue together with SBP - President Manuel V. Pangilinan ."
sentence2:"In January 2011 , FIBA Asia deputy secretary general Hagop Khajirian along with SBP president Manuel V. Pangilinan inspected the 

In [30]:
!pip install nltk
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
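The next cell repeats the Jaccard computation after dropping English stop words. A minimal sketch of why that changes the score is shown here. It assumes a tiny hand-written stop-word set and made-up sentences, whereas the notebook itself uses NLTK's full English list. The helper names `word_set` and `jaccard` are hypothetical.

```python
# Tiny stand-in stop-word set; NLTK's English list is much longer.
toy_stop_words = {"the", "on", "a", "of"}

def word_set(text, stop_words=frozenset()):
    """Lowercase, split on whitespace, drop stop words, return a set."""
    return {w for w in text.lower().split() if w not in stop_words}

def jaccard(s1, s2):
    """Intersection size over union size; 0.0 if both sets are empty."""
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

x = "the cat sat on the mat"
y = "the cat lay on the rug"

raw = jaccard(word_set(x), word_set(y))
filtered = jaccard(word_set(x, toy_stop_words), word_set(y, toy_stop_words))
print(raw, filtered)
```

Stop words that both sentences share count toward the intersection, so removing them tends to lower the score for pairs like these.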


In [31]:
stop_words = stopwords.words('english')
def lower_split(text):
    words = text.lower().split()
    filtered_words = [word for word in words if word not in stop_words]
    return set(filtered_words)
def Jaccard_Similarity(doc1, doc2):
    intersection = doc1.intersection(doc2)
    union = doc1.union(doc2)
    return float(len(intersection)) / len(union)

for example in dataset['test'].select(range(3)):
    sentence1 = example['sentence1']
    sentence2 = example['sentence2']

    doc1 = lower_split(sentence1)
    doc2 = lower_split(sentence2)

    similarity = Jaccard_Similarity(doc1, doc2)
    print(f'sentence1:"{sentence1}"\nsentence2:"{sentence2}"\nscore:{similarity}\n')



sentence1:"This was a series of nested angular standards , so that measurements in azimuth and elevation could be done directly in polar coordinates relative to the ecliptic ."
sentence2:"This was a series of nested polar scales , so that measurements in azimuth and elevation could be performed directly in angular coordinates relative to the ecliptic ."
score:0.7777777777777778

sentence1:"His father emigrated to Missouri in 1868 but returned when his wife became ill and before the rest of the family could also go to America ."
sentence2:"His father emigrated to America in 1868 , but returned when his wife became ill and before the rest of the family could go to Missouri ."
score:0.875

sentence1:"In January 2011 , the Deputy Secretary General of FIBA Asia , Hagop Khajirian , inspected the venue together with SBP - President Manuel V. Pangilinan ."
sentence2:"In January 2011 , FIBA Asia deputy secretary general Hagop Khajirian along with SBP president Manuel V. Pangilinan inspected the

#Using Sentence Transformers
Resource Referred:

1.  [sbert](https://www.sbert.net/docs/sentence_transformer/usage/semantic_textual_similarity.html)


1.  Installing **sentence-transformers**
2.  **-q** is a flag used to suppress most of the output during the installation process


In [32]:
%pip install sentence-transformers -q

1. Importing the **SentenceTransformer** class from **sentence_transformers**
2. **util** is a module that includes helper functions such as cosine similarity
3. The **SentenceTransformer** class is *instantiated* with **all-MiniLM-L6-v2**,
 which generates the *vector* *representations* of the given *sentences*

In [33]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')





1.  Extracting the first 10 rows of 'sentence1' and 'sentence2' from the test dataset
2.  Generating vector representations/embeddings of sentence1 and sentence2 using the SentenceTransformer model
3.  Calculating the cosine similarity between the vector representations/embeddings
4.  Using a for loop to iterate over the first 10 sentence pairs, printing the sentences and their similarity scores
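`util.cos_sim` compares every row of its first argument with every row of its second, so for 10 sentence pairs it returns a 10 x 10 matrix, and the score for pair `i` sits on the diagonal at `[i][i]`. A plain-Python sketch of that layout follows, with made-up 2-dimensional vectors standing in for real embeddings (the values and the `cosine` helper are assumptions for illustration).

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings" (hypothetical values, not model output).
emb1 = [[1.0, 0.0], [0.0, 2.0]]
emb2 = [[3.0, 0.0], [1.0, 1.0]]

# Full matrix: entry [i][j] compares emb1[i] with emb2[j].
matrix = [[cosine(u, v) for v in emb2] for u in emb1]

# Only the diagonal pairs sentence1[i] with its own sentence2[i].
pair_scores = [matrix[i][i] for i in range(len(emb1))]
print(pair_scores)
```

The off-diagonal entries compare unrelated sentences and are not needed here. If your installed version of sentence-transformers provides `util.pairwise_cos_sim`, it computes only the matched pairs.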

In [34]:
s1 = dataset['test']['sentence1'][:10]
s2 = dataset['test']['sentence2'][:10]
embeddings1 = model.encode(s1, convert_to_tensor=True)
embeddings2 = model.encode(s2, convert_to_tensor=True)
cosine_score = util.cos_sim(embeddings1,embeddings2)
for i in range(10):
  cosine_score_value = cosine_score[i][i].item()
  print(f"{i}.\nsentence1:{s1[i]}\nsentence2:{s2[i]}\ncosine similarity score:{cosine_score_value}\n")


0.
sentence1:This was a series of nested angular standards , so that measurements in azimuth and elevation could be done directly in polar coordinates relative to the ecliptic .
sentence2:This was a series of nested polar scales , so that measurements in azimuth and elevation could be performed directly in angular coordinates relative to the ecliptic .
cosine similarity score:0.9322909712791443

1.
sentence1:His father emigrated to Missouri in 1868 but returned when his wife became ill and before the rest of the family could also go to America .
sentence2:His father emigrated to America in 1868 , but returned when his wife became ill and before the rest of the family could go to Missouri .
cosine similarity score:0.9918644428253174

2.
sentence1:In January 2011 , the Deputy Secretary General of FIBA Asia , Hagop Khajirian , inspected the venue together with SBP - President Manuel V. Pangilinan .
sentence2:In January 2011 , FIBA Asia deputy secretary general Hagop Khajirian along with S

In [41]:
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, mean_squared_error)

s1 = dataset['test']['sentence1'][:300]
s2 = dataset['test']['sentence2'][:300]
labels = dataset['test']['label'][:300]

embeddings1 = model.encode(s1, convert_to_tensor=True)
embeddings2 = model.encode(s2, convert_to_tensor=True)
cosine_scores = util.cos_sim(embeddings1, embeddings2)

# The diagonal holds the score of each sentence1[i] with its own sentence2[i]
similarities = cosine_scores.diagonal().tolist()

# Pairs scoring at or above the threshold are predicted as paraphrases (label 1)
threshold = 0.8
predictions = [1 if sim >= threshold else 0 for sim in similarities]

conf_matrix = confusion_matrix(labels, predictions)
accuracy = accuracy_score(labels, predictions)
precision = precision_score(labels, predictions)
recall = recall_score(labels, predictions)
mse = mean_squared_error(labels, predictions)
print("Confusion Matrix:")
print(conf_matrix)
print("accuracy:", accuracy)
print("precision:", precision)
print("recall:", recall)

Confusion Matrix:
[[  0 173]
 [  0 127]]
accuracy: 0.42333333333333334
precision: 0.42333333333333334
recall: 1.0
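The reported metrics follow directly from the confusion matrix above. In scikit-learn's layout, rows are true labels and columns are predicted labels, so every pair here was predicted as a paraphrase and there are no true or false negatives. A short check of the arithmetic:

```python
# Counts copied from the confusion matrix printed above:
# rows = true label (0, 1), columns = predicted label (0, 1).
tn, fp = 0, 173
fn, tp = 0, 127

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # correct predictions over all pairs
precision = tp / (tp + fp)     # predicted positives that are truly positive
recall = tp / (tp + fn)        # true positives that were found

print(accuracy, precision, recall)
```

Because every score cleared the 0.8 threshold, accuracy and precision both equal the share of positive labels in these 300 rows (127 of 300), and recall is 1.0.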


#Using spaCy Similarity



1.  Using pip to install the spaCy library
2.  Using 'python -m spacy' to run the spaCy module
3.  Downloading en_core_web_lg, which is the large English model



In [5]:
!pip install spacy
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 1.2 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


1.   Importing the spaCy library
2.   Loading en_core_web_lg, i.e. the large English pre-trained model


In [6]:
import spacy
nlp = spacy.load('en_core_web_lg')

1. Extracting the first 10 rows of 'sentence1' and 'sentence2' from the test dataset
2. Using a for loop to iterate over the first 10 sentence pairs, then
 > applying nlp to 'sentence1' and 'sentence2', which creates a spaCy document for each

 > using the similarity method to find the similarity score of 'sentence1' and 'sentence2'

 > printing the similarity score
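By default, spaCy's `Doc.similarity` is the cosine similarity of the two documents' vectors, and a document vector is the average of its token vectors. A rough sketch of what that implies follows, using made-up 2-dimensional word vectors (hypothetical values, not taken from `en_core_web_lg`; `doc_vector` and `cosine` are illustrative helpers, not spaCy functions).

```python
import math

# Hypothetical toy word vectors for illustration only.
toy_vectors = {
    "father": [1.0, 0.2],
    "went": [0.1, 0.9],
    "missouri": [0.8, 0.7],
    "america": [0.7, 0.8],
}

def doc_vector(words):
    """Average the word vectors; the result does not depend on word order."""
    vecs = [toy_vectors[w] for w in words]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(u, v):
    """Cosine similarity of two 2-dimensional vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

d1 = doc_vector(["father", "went", "missouri", "america"])
d2 = doc_vector(["father", "went", "america", "missouri"])  # two words swapped
print(cosine(d1, d2))
```

Since averaging discards word order, sentence pairs that differ mainly by swapped words, which PAWS contains many of, end up with scores very close to 1.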


In [38]:
p1 = dataset['test']['sentence1'][:10]
p2 = dataset['test']['sentence2'][:10]

for sentence1,sentence2 in zip(p1,p2):
  W1 = nlp(sentence1)
  W2 = nlp(sentence2)
  Similarity = W1.similarity(W2)
  print(f'sentence1:"{sentence1}"\nsentence2:"{sentence2}" \nscore:{Similarity}\n')

sentence1:"This was a series of nested angular standards , so that measurements in azimuth and elevation could be done directly in polar coordinates relative to the ecliptic ."
sentence2:"This was a series of nested polar scales , so that measurements in azimuth and elevation could be performed directly in angular coordinates relative to the ecliptic ." 
score:0.9972708891272399

sentence1:"His father emigrated to Missouri in 1868 but returned when his wife became ill and before the rest of the family could also go to America ."
sentence2:"His father emigrated to America in 1868 , but returned when his wife became ill and before the rest of the family could go to Missouri ." 
score:0.997916950603778

sentence1:"In January 2011 , the Deputy Secretary General of FIBA Asia , Hagop Khajirian , inspected the venue together with SBP - President Manuel V. Pangilinan ."
sentence2:"In January 2011 , FIBA Asia deputy secretary general Hagop Khajirian along with SBP president Manuel V. Pangilinan

In [42]:
import pandas as pd
import spacy
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, mean_squared_error
nlp = spacy.load('en_core_web_lg')

p1 = dataset['test']['sentence1'][:300]
p2 = dataset['test']['sentence2'][:300]
labels = dataset['test']['label'][:300]

similarities = []

for sentence1, sentence2 in zip(p1, p2):
    doc1 = nlp(sentence1)
    doc2 = nlp(sentence2)
    similarity = doc1.similarity(doc2)
    similarities.append(similarity)

threshold = 0.8

predictions = [1 if sim >= threshold else 0 for sim in similarities]

conf_matrix = confusion_matrix(labels, predictions)
accuracy = accuracy_score(labels, predictions)
precision = precision_score(labels, predictions)
recall = recall_score(labels, predictions)
print("Confusion Matrix:")
print(conf_matrix)
print("accuracy:", accuracy)
print("precision:", precision)
print("recall:", recall)

Confusion Matrix:
[[  0 173]
 [  0 127]]
accuracy: 0.42333333333333334
precision: 0.42333333333333334
recall: 1.0
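With a threshold of 0.8 every pair is labeled a paraphrase, so one possible follow-up is to try several thresholds and compare the resulting accuracy. A minimal sketch of that idea with made-up scores and labels (hypothetical values, not results from this notebook; `accuracy_at` is an illustrative helper):

```python
# Hypothetical similarity scores and gold labels for six pairs.
toy_scores = [0.99, 0.97, 0.95, 0.93, 0.90, 0.85]
toy_labels = [1, 1, 0, 1, 0, 0]

def accuracy_at(threshold, scores, labels):
    """Share of pairs whose thresholded prediction matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Compare a few candidate thresholds on the toy data.
for t in (0.80, 0.92, 0.96):
    print(t, accuracy_at(t, toy_scores, toy_labels))
```

On real data, the threshold would normally be chosen on a held-out split (for example PAWS `validation`) rather than on the rows used for the final evaluation.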


In [12]:
from sentence_transformers import SentenceTransformer

word = "apple"

model = SentenceTransformer('all-MiniLM-L6-v2')

embedding = model.encode(word, convert_to_tensor=True)
print(f"Word: '{word}'")
print(f"Numerical Representation (Embedding): {embedding}")
print(f"Dimensions of the embedding: {embedding.size(0)}")


Word: 'apple'
Numerical Representation (Embedding): tensor([-6.1385e-03,  3.1012e-02,  6.4794e-02,  1.0941e-02,  5.2672e-03,
        -4.7476e-02,  8.1203e-02,  2.8981e-02,  6.6762e-02,  3.0300e-02,
         5.7465e-02, -8.6235e-03,  1.3228e-03,  3.9918e-04, -1.8843e-02,
        -2.5794e-02, -1.3042e-02, -5.2625e-02, -5.8293e-02, -2.5899e-02,
        -3.3374e-02,  2.4568e-02, -5.2266e-03,  2.3006e-02,  3.2861e-02,
         7.5022e-02,  5.8018e-03, -1.4959e-02, -2.8753e-02, -1.1855e-01,
        -3.9322e-02, -5.1388e-02,  7.6618e-02,  4.8404e-02, -3.0256e-02,
        -9.1434e-02,  5.1182e-02, -9.6496e-03, -2.1511e-02, -7.1774e-02,
        -6.3224e-02, -1.7669e-02,  2.8081e-02,  9.0047e-02,  1.9418e-02,
         5.4544e-03,  4.8302e-02,  8.4909e-03,  2.7751e-02,  9.6561e-02,
         2.6303e-02, -3.0533e-02, -6.8816e-02, -6.0341e-03, -5.6750e-03,
         7.8915e-03, -1.2150e-02,  3.0511e-02,  8.9591e-02,  5.5678e-02,
         2.4089e-02, -1.0735e-01, -3.3460e-02,  4.5137e-02,  3.0472e-02,