1. **BERT embeddings**: Vectorize each text with BERT, then compare the resulting vectors using
    1. Cosine similarity
    2. Euclidean distance

2. **Semantic Textual Similarity (STS) Metrics**: STS benchmarks pair texts with human similarity ratings. Pearson correlation and Spearman rank correlation measure how closely a method's similarity scores agree with those ratings, so they evaluate a similarity method rather than compare two texts directly. They are often used in NLP benchmarks.

3. **BLEU (Bilingual Evaluation Understudy)**: Originally designed for machine translation evaluation, BLEU measures the n-gram overlap between the generated sentence and a reference (paragraph in this case). It can be adapted for similarity by treating the paragraph as a reference.

4. **Word Mover's Distance (WMD)**: WMD calculates the minimum cumulative distance that words in one document need to travel to match the words in another document. It considers semantic similarity and can be effective for measuring the distance between a sentence and a paragraph.

5. **Fine-Tuned Models**: Alternatively, you can use fine-tuned BERT models that are specifically trained for sentence similarity tasks. These models may provide more accurate similarity scores tailored to your specific application.

6. **Sentence Transformers**: There are libraries and models built on top of BERT, such as Sentence Transformers, which provide ready-to-use implementations for computing sentence embeddings and similarity scores. These models are trained to generate semantically meaningful sentence embeddings that capture similarity effectively.

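7. **Jaccard similarity**: This applies only when each sentence is represented as a bag of words, so that a binary vector (or word set) is created for each sentence. The score is the size of the intersection divided by the size of the union.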

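To make the role of the STS correlations concrete, the sketch below scores a hypothetical similarity method against human ratings. The values in `human_scores` and `model_scores` are invented for illustration, and the helper functions `pearson`, `ranks`, and `spearman` are written in plain Python here rather than taken from any library.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: human ratings (0 to 5) and a method's similarity scores
human_scores = [4.8, 3.9, 2.5, 1.0, 0.2]
model_scores = [0.93, 0.70, 0.75, 0.30, 0.10]

print(f"Pearson:  {pearson(human_scores, model_scores):.4f}")
print(f"Spearman: {spearman(human_scores, model_scores):.4f}")
```

A higher correlation means the method orders and scales sentence pairs more like the human annotators do. It says nothing about any single pair on its own.
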
In [3]:
"""Implementation of the Jaccard similarity metric.

Formula: J(A, B) = |A intersection B| / |A union B|
"""
def calculate_jaccard_similarity(sentence1, sentence2):
    words_set1 = set(sentence1.lower().split())
    words_set2 = set(sentence2.lower().split())
    
    intersection = len(words_set1.intersection(words_set2))
    union = len(words_set1.union(words_set2))
    jaccard_similarity = intersection / union if union != 0 else 0
    
    return jaccard_similarity

sentence1 = "The cat is on the mat"
sentence2 = "The mat has a cat on it"
similarity_score = calculate_jaccard_similarity(sentence1, sentence2)
print(f"Jaccard Similarity between the sentences: {similarity_score}")


Jaccard Similarity between the sentences: 0.5
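
The whitespace split above treats "mat." and "mat" as different words, which matters for punctuated sentences such as the ones used in the BERT cell further down: with a plain split that pair would share only 3 of 9 distinct tokens. A minimal sketch of one way around this, assuming a simple regex tokenizer is acceptable (`tokenize` and `jaccard_on_tokens` are illustrative names, not part of the original cell):

```python
import re


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard_on_tokens(sentence1, sentence2):
    """Jaccard similarity computed on punctuation-free word sets."""
    set1, set2 = tokenize(sentence1), tokenize(sentence2)
    union = set1 | set2
    return len(set1 & set2) / len(union) if union else 0.0


with_punct = jaccard_on_tokens("The cat is on the mat.", "The mat has a cat on it.")
without_punct = jaccard_on_tokens("The cat is on the mat", "The mat has a cat on it")
print(with_punct, without_punct)
```

With this tokenizer, the punctuated and unpunctuated versions of the pair receive the same score.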


In [7]:
"""Implementation of cosine similarity using BERT for vectorization"""

import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
import time

start_time = time.time()

# Load pre-trained BERT model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name, output_hidden_states=True)

model_fetched_time = time.time() - start_time

# Define two sentences for similarity comparison
sentence1 = "The cat is on the mat."
sentence2 = "The mat has a cat on it."

# Tokenize and encode the sentences using BERT tokenizer
inputs1 = tokenizer(sentence1, return_tensors='pt', padding=True, truncation=True)
inputs2 = tokenizer(sentence2, return_tensors='pt', padding=True, truncation=True)

# Get BERT embeddings for the sentences
with torch.no_grad():
    outputs1 = model(**inputs1)
    outputs2 = model(**inputs2)

# Extract the last layer hidden states (word embeddings) for each sentence
last_hidden_states1 = outputs1.last_hidden_state
last_hidden_states2 = outputs2.last_hidden_state

# Average the word embeddings across each sentence to get sentence embeddings
sentence_embedding1 = torch.mean(last_hidden_states1, dim=1).squeeze()
sentence_embedding2 = torch.mean(last_hidden_states2, dim=1).squeeze()

# Time spent tokenizing and embedding, measured from the end of model loading
tokenized_time = time.time() - start_time - model_fetched_time

# Convert sentence embeddings to numpy arrays and calculate cosine similarity
embedding1_np = sentence_embedding1.numpy().reshape(1, -1)
embedding2_np = sentence_embedding2.numpy().reshape(1, -1)
cos_sim = cosine_similarity(embedding1_np, embedding2_np)[0][0]

# Time spent on the cosine similarity step alone
cosine_similarity_time = time.time() - start_time - model_fetched_time - tokenized_time

print(f"Cosine Similarity between the sentences: {cos_sim:.4f}")
print(f"start_time = {start_time}\nmodel_fetched_time = {model_fetched_time}\ntokenized_time = {tokenized_time}\ncosine_similarity_time = {cosine_similarity_time}")

Cosine Similarity between the sentences: 0.8610
start_time = 1710831797.6821022
model_fetched_time = 1.2878665924072266


In [47]:
"""Use a Sentence Transformer to embed each sentence as a single fixed-size vector"""

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
sentence = ['One cat is siting here', 'Here is one cat sitting on mat']
embedding = model.encode(sentence)
a = embedding[0].reshape(1,-1)
b = embedding[1].reshape(1,-1)
print(cosine_similarity(a,b))

[[0.76337683]]


In [50]:
from scipy.spatial import distance
dist = distance.euclidean(embedding[0],embedding[1])
similarity = 1/(1+dist)
print(similarity)

0.17073156633678052


In [51]:
manhattan_distance = distance.cityblock(embedding[0],embedding[1])
similarity_score = 1 / (1 + manhattan_distance)
print(similarity_score)

0.012690662015076357
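
The `1 / (1 + distance)` mapping used in the last two cells keeps scores in (0, 1], but the result depends on the length of the vectors, and the Manhattan distance is never smaller than the Euclidean distance, so its score is never larger. The sketch below uses made-up 3-dimensional vectors rather than real embeddings to illustrate the scale dependence. It also checks the identity d² = 2(1 - cos) that links Euclidean distance and cosine similarity once both vectors are scaled to unit length. The helper names are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def normalize(u):
    """Scale a vector to unit Euclidean length."""
    norm = math.hypot(*u)
    return [a / norm for a in u]


def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Toy vectors (assumed for illustration) and the same vectors scaled by 10
u = [1.0, 2.0, 2.0]
v = [2.0, 1.0, 2.0]
u10 = [10 * a for a in u]
v10 = [10 * a for a in v]

# Cosine similarity ignores scale; 1 / (1 + distance) does not.
print("cosine:", cosine(u, v), cosine(u10, v10))
print("euclidean score:", 1 / (1 + math.dist(u, v)), 1 / (1 + math.dist(u10, v10)))
print("manhattan score:", 1 / (1 + manhattan(u, v)), 1 / (1 + manhattan(u10, v10)))

# After unit-normalization, squared Euclidean distance equals 2 * (1 - cosine).
nu, nv = normalize(u), normalize(v)
squared_distance = math.dist(nu, nv) ** 2
print(squared_distance, 2 * (1 - cosine(u, v)))
```

One consequence is that, on unit-normalized embeddings, ranking pairs by Euclidean distance and ranking them by cosine similarity give the same order.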


In [56]:
def jaccard_similarity(binary_array1, binary_array2):
    set1 = set(binary_array1.nonzero()[0])  # Convert non-zero indices to a set
    set2 = set(binary_array2.nonzero()[0])  # Convert non-zero indices to a set
    intersection = len(set1.intersection(set2))
    union = len(set1.union(set2))
    similarity_score = intersection / union if union != 0 else 0  # Handle division by zero
    return similarity_score


In [57]:
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['One cat is siting here', 'Here is one cat sitting on mat']
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(sentences)
feature_names = vectorizer.get_feature_names_out()

# Convert the sparse matrix to a dense array once and compare the two rows
bow_array = bow_matrix.toarray()
print(jaccard_similarity(bow_array[0], bow_array[1]))



0.5
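
BLEU is listed among the candidate metrics but has no cell of its own. As a rough sketch of the n-gram overlap idea behind it, the function below computes the clipped n-gram precision of a sentence against a single reference text. It is a simplification written for illustration (no brevity penalty, no smoothing, no geometric mean over n-gram orders), not a full BLEU implementation, and the names `ngrams` and `clipped_precision` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference (the "clipping" step used by BLEU).
    """
    cand_counts = ngrams(candidate.lower().split(), n)
    ref_counts = ngrams(reference.lower().split(), n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


# Same sentence pair as the cells above; the second plays the reference role
sentence = "One cat is siting here"
paragraph = "Here is one cat sitting on mat"

unigram_precision = clipped_precision(sentence, paragraph, 1)
bigram_precision = clipped_precision(sentence, paragraph, 2)
print(unigram_precision, bigram_precision)
```

Unlike Jaccard similarity, this score is asymmetric: swapping the candidate and the reference generally changes the result, and higher-order n-grams reward matching word order as well as matching words.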
