**BERT** excels in tasks such as sentiment analysis, question answering, and named entity recognition, where **word-level granularity** is crucial.
**Sentence Transformers** are the preferred choice for semantic similarity, text matching, and document retrieval, where capturing the meaning of **entire sentences or paragraphs** is essential.
**Fuzzy matching** excels when the focus is on **character-level similarities**, making it a valuable tool for tasks like record deduplication, typo correction, and matching strings with minor variations.
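
The character-level matching described above can be sketched with Python's standard-library `difflib`. This is only an illustration: the notebook does not say which fuzzy-matching library it has in mind, so `difflib` is a stand-in, and the helper names `char_similarity` and `best_match` and the sample records are hypothetical.

```python
from difflib import SequenceMatcher, get_close_matches


def char_similarity(a, b):
    """Return a ratio in [0, 1] based on matching character blocks (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(query, candidates, cutoff=0.8):
    """Return the candidate closest to `query`, or None if nothing reaches `cutoff`."""
    matches = get_close_matches(query, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical records, used only to illustrate typo-tolerant matching
records = ["John Smith", "Jane Doe", "Joan Smythe"]

print(char_similarity("John Smith", "Jon Smith"))  # high ratio: one character differs
print(best_match("Jhon Smith", records))           # transposed letters still match
print(best_match("Robert Brown", records))         # no candidate is close enough
```

The `cutoff` value is a tuning choice: a lower threshold catches more typos but also produces more false matches.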


**Sentence Similarity with BERT**

*Step 1: Pre-processing Input Sentences*

In [1]:
from transformers import BertTokenizer

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Example sentences
sentence1 = "I like coding in Python."
sentence2 = "Python is my favorite programming language."

# Tokenize the sentences
tokens1 = tokenizer.tokenize(sentence1)
tokens2 = tokenizer.tokenize(sentence2)

# Add [CLS] and [SEP] tokens
tokens = ['[CLS]'] + tokens1 + ['[SEP]'] + tokens2 + ['[SEP]']
print("Token:", tokens)

Token: ['[CLS]', 'i', 'like', 'coding', 'in', 'python', '.', '[SEP]', 'python', 'is', 'my', 'favorite', 'programming', 'language', '.', '[SEP]']


*Step 2: Encoding*

In [2]:
# Convert tokens to input IDs
input_ids = tokenizer.convert_tokens_to_ids(tokens)

# Display the tokens and input IDs
print("Input IDs:", input_ids)

Input IDs: [101, 1045, 2066, 16861, 1999, 18750, 1012, 102, 18750, 2003, 2026, 5440, 4730, 2653, 1012, 102]


*Step 3: Calculate Similarity*

In [3]:
from transformers import BertTokenizer, BertModel
import torch
from sklearn.metrics.pairwise import cosine_similarity

# Load the BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Example sentences (already preprocessed)
tokens1 = ["[CLS]", "i", "like", "coding", "in", "python", ".", "[SEP]"]
tokens2 = ["[CLS]", "python", "is", "my", "favorite", "programming", "language", ".", "[SEP]"]

# Convert tokens to input IDs
input_ids1 = torch.tensor(tokenizer.convert_tokens_to_ids(tokens1)).unsqueeze(0)  # Batch size 1
input_ids2 = torch.tensor(tokenizer.convert_tokens_to_ids(tokens2)).unsqueeze(0)  # Batch size 1

# Obtain the BERT embeddings
with torch.no_grad():
    outputs1 = model(input_ids1)
    outputs2 = model(input_ids2)
    embeddings1 = outputs1.last_hidden_state[:, 0, :]  # [CLS] token
    embeddings2 = outputs2.last_hidden_state[:, 0, :]  # [CLS] token

# Calculate similarity
similarity_score = cosine_similarity(embeddings1, embeddings2)
print("Similarity Score:", similarity_score)

Similarity Score: [[0.9558883]]


**Sentence Similarity with Sentence Transformer**

*Step 1: Pre-processing Input Sentences*

In [5]:
# Import the Sentence Transformer library
from sentence_transformers import SentenceTransformer, util

# There are several different Sentence Transformer models available on Hugging Face
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

sent1 = "I like coding in Python."
sent2 = "Python is my favorite programming language."



*Step 2: Encoding*

In [6]:
# Convert the sentences into embeddings using the Sentence Transformer
sent_embedding1 = model.encode(sent1, convert_to_tensor=True)
sent_embedding2 = model.encode(sent2, convert_to_tensor=True)


*Step 3: Calculate Similarity*

In [8]:

# Find the cosine similarity between the two embeddings
sim = util.pytorch_cos_sim(sent_embedding1, sent_embedding2)
print("Similarity Score:", sim)

Similarity Score: tensor([[0.8250]])


**Text Similarity Theory**

Text similarity is a measure of how similar two pieces of text are. It is often used in search engines, plagiarism detection tools, and other applications. There are many different ways to measure text similarity, but some of the most common methods include:

**Jaccard similarity**: This method calculates the similarity between two sets of words. The score is the number of words the two sets share (their intersection) divided by the number of unique words across both sets (their union).
**Cosine similarity**: This method calculates the similarity between two vectors. The score is the dot product of the two vectors divided by the product of their magnitudes (norms).
**Levenshtein distance**: This method calculates the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into another. It measures how dissimilar two strings are.
**Euclidean distance**: This method measures the straight-line (shortest) distance between two points in a vector space. Like Levenshtein distance, a larger value means the texts are less similar.

The best text similarity method depends on the specific task, the nature of the text, and whether similarity should be judged at the character, word, or sentence level.
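
The four measures above can be sketched in plain Python to make the definitions concrete. This is a minimal illustration, not a production implementation: the function names, the simple whitespace tokenizer, and the use of word-count vectors (rather than learned embeddings) for the cosine and Euclidean examples are all assumptions made here.

```python
import math
from collections import Counter

PUNCTUATION = ".,!?;:"


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(PUNCTUATION) for w in text.lower().split())
    return [w for w in words if w]


def jaccard_similarity(a, b):
    """Size of the intersection over size of the union of the two word sets."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def count_vectors(a, b):
    """Build word-count vectors for both texts over their shared vocabulary."""
    counts_a, counts_b = Counter(tokenize(a)), Counter(tokenize(b))
    vocab = sorted(set(counts_a) | set(counts_b))
    return [counts_a[w] for w in vocab], [counts_b[w] for w in vocab]


def cosine_similarity(u, v):
    """Dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # similarity is undefined for a zero vector; return 0 by convention
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def levenshtein_distance(s, t):
    """Minimum number of insertions, deletions, or substitutions turning s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


sent1 = "I like coding in Python."
sent2 = "Python is my favorite programming language."
vec1, vec2 = count_vectors(sent1, sent2)

print("Jaccard:", jaccard_similarity(sent1, sent2))
print("Cosine:", cosine_similarity(vec1, vec2))
print("Euclidean:", euclidean_distance(vec1, vec2))
print("Levenshtein:", levenshtein_distance("kitten", "sitting"))
```

For the two example sentences, "python" is the only shared word among ten unique words, so the Jaccard score is 0.1. The embedding-based scores earlier in the notebook are much higher because they compare meaning rather than surface word overlap.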
