**Text Similarity**

This notebook explores two techniques to measure similarity between text pairs:

1. **TF-IDF + Cosine Similarity** – based on word overlap
2. **Sentence Embeddings** – based on semantic meaning using `all-MiniLM-L6-v2`

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer, util



**Example Sentences**

We compare three sentences: `sent1` and `sent2` both concern NLP but use different words, while `sent3` is about an unrelated topic.

In [23]:
sent1 = "I love natural language processing"
sent2 = "I enjoy working on NLP tasks"
sent3 = "The weather is sunny today"

sentences = [sent1, sent2, sent3]

**TF-IDF + Cosine Similarity**

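This method represents each sentence as a sparse vector of TF-IDF token weights and compares the vectors by cosine similarity, so two sentences score above zero only if they share at least one token.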

In [24]:
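# Learn the vocabulary from all three sentences and build the TF-IDF matrix
# (one row per sentence, one column per vocabulary token).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)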


In [25]:
X.toarray()

array([[0.       , 0.       , 0.5      , 0.5      , 0.5      , 0.       ,
        0.       , 0.5      , 0.       , 0.       , 0.       , 0.       ,
        0.       , 0.       ],
       [0.4472136, 0.       , 0.       , 0.       , 0.       , 0.4472136,
        0.4472136, 0.       , 0.       , 0.4472136, 0.       , 0.       ,
        0.       , 0.4472136],
       [0.       , 0.4472136, 0.       , 0.       , 0.       , 0.       ,
        0.       , 0.       , 0.4472136, 0.       , 0.4472136, 0.4472136,
        0.4472136, 0.       ]])
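
The weights in this matrix can be reproduced by hand. The sketch below is a plain-Python approximation that assumes scikit-learn's default settings: lowercasing, a token pattern that keeps only tokens of two or more characters, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization. The helper name `tfidf_row` is hypothetical and is not part of any library.

```python
import math
import re
from collections import Counter

sentences = [
    "I love natural language processing",
    "I enjoy working on NLP tasks",
    "The weather is sunny today",
]

# Assumed to mirror TfidfVectorizer's defaults: lowercase the text and keep
# only tokens of two or more word characters, so "I" is discarded.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")
docs = [TOKEN_PATTERN.findall(s.lower()) for s in sentences]

n_docs = len(docs)
# Document frequency: number of sentences that contain each token.
doc_freq = Counter(tok for doc in docs for tok in set(doc))


def tfidf_row(doc):
    """Return {token: weight} for one tokenized sentence (hypothetical helper)."""
    counts = Counter(doc)
    # Term frequency times smoothed IDF (assumed default formula).
    weights = {
        tok: tf * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
        for tok, tf in counts.items()
    }
    # L2-normalize so that each row has unit length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {tok: w / norm for tok, w in weights.items()}


rows = [tfidf_row(doc) for doc in docs]
for name, row in zip(["sent1", "sent2", "sent3"], rows):
    print(name, {tok: round(w, 7) for tok, w in sorted(row.items())})
```

Every token occurs in exactly one sentence, so all IDF values are equal and a sentence with k tokens gets a weight of 1/sqrt(k) per token: 0.5 for the four tokens of `sent1`, and about 0.447 for the five tokens of `sent2` and `sent3`.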

In [28]:
similarity_matrix = cosine_similarity(X)


In [30]:
import pandas as pd
df_sim = pd.DataFrame(similarity_matrix, columns=["sent1", "sent2", "sent3"], index=["sent1", "sent2", "sent3"])
print(df_sim.round(3))

       sent1  sent2  sent3
sent1    1.0    0.0    0.0
sent2    0.0    1.0    0.0
sent3    0.0    0.0    1.0
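
The zero score between `sent1` and `sent2` follows directly from the tokenization. The sketch below assumes the vectorizer's default token pattern (tokens of two or more characters) and uses binary weights instead of TF-IDF weights, which is enough to show the effect; `tokenize` and `cosine` are illustrative helpers, not library functions.

```python
import math
import re

sent1 = "I love natural language processing"
sent2 = "I enjoy working on NLP tasks"


def tokenize(text):
    """Lowercase and keep tokens of 2+ word characters (assumed default)."""
    return set(re.findall(r"(?u)\b\w\w+\b", text.lower()))


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {token: weight}."""
    dot = sum(w * b[tok] for tok, w in a.items() if tok in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)


tokens1, tokens2 = tokenize(sent1), tokenize(sent2)
shared = tokens1 & tokens2

# Binary weights: with no shared token, the dot product is zero.
vec1 = {tok: 1.0 for tok in tokens1}
vec2 = {tok: 1.0 for tok in tokens2}

print("shared tokens:", shared)
print("cosine:", cosine(vec1, vec2))
```

The only word the two sentences have in common is "I", and it is dropped before weighting, so no reweighting of the remaining tokens can lift the score above zero.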


**Sentence Embedding Similarity**

We now use a pre-trained transformer (`all-MiniLM-L6-v2`) to get dense semantic embeddings.

In [35]:
model = SentenceTransformer("/kaggle/input/all-minilm-l6-v2/transformers/default/1/all-MiniLM-L6-v2")

embeddings = model.encode(sentences)


In [36]:
embeddings

array([[ 0.01395958, -0.06290712,  0.04286579, ...,  0.14781834,
         0.03571628, -0.04115   ],
       [-0.01024737, -0.04555632,  0.04750293, ...,  0.09685127,
         0.0004508 ,  0.034456  ],
       [-0.0204103 ,  0.10814118,  0.09428667, ..., -0.01203914,
        -0.11518511,  0.07021891]], dtype=float32)

In [37]:
sim_embed = util.cos_sim(embeddings, embeddings)

df_embed = pd.DataFrame(sim_embed.numpy(), columns=["sent1", "sent2", "sent3"], index=["sent1", "sent2", "sent3"])
print(df_embed.round(2))

       sent1  sent2  sent3
sent1   1.00   0.68   0.02
sent2   0.68   1.00   0.05
sent3   0.02   0.05   1.00
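
To read the matrix as a ranking, the unique pairs can be sorted by score. This is a minimal sketch that hard-codes the rounded values printed above rather than recomputing them; the variable names are illustrative.

```python
from itertools import combinations

labels = ["sent1", "sent2", "sent3"]
# Rounded scores copied from the printed embedding similarity matrix.
sim = [
    [1.00, 0.68, 0.02],
    [0.68, 1.00, 0.05],
    [0.02, 0.05, 1.00],
]

# Take each unordered pair once (upper triangle), then sort by descending score.
pairs = sorted(
    (
        (sim[i][j], labels[i], labels[j])
        for i, j in combinations(range(len(labels)), 2)
    ),
    reverse=True,
)

for score, a, b in pairs:
    print(f"{a} vs {b}: {score:.2f}")
```

The two NLP sentences come out on top at 0.68, and both pairs involving the weather sentence stay close to zero.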


**Observations**

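- TF-IDF scores a pair above zero only when the sentences share tokens. Here every off-diagonal score is 0.0: the only word `sent1` and `sent2` have in common is "I", which `TfidfVectorizer` drops by default because its token pattern keeps only tokens of two or more characters.
- Embeddings capture **semantic similarity** even if words differ.

For example:
- "I love natural language processing" and "I enjoy working on NLP tasks" → TF-IDF similarity of 0.0
- The same pair has an embedding similarity of 0.68, while each sentence scores near zero (0.02 and 0.05) against the unrelated weather sentence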