## What is TF-IDF

To use TF-IDF for sentence similarity, each sentence is represented as a vector with one element per unique word across all sentences; the value of each element is the TF-IDF score of that word in the sentence. The score is the product of the word's Term Frequency (how often it appears in that sentence) and its Inverse Document Frequency (how rare it is across all sentences). Sentences that share the same distinctive, rare words have large values in the same vector positions, so their similarity can be quantified with cosine similarity, the cosine of the angle between the two sentence vectors; a higher value indicates greater similarity. Because the comparison depends on shared words, sentences with related meanings but different vocabulary can still score low.
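
As a rough sketch of the quantities described above, the textbook definitions can be written as follows. The symbols are assumptions introduced here for illustration: $t$ is a word, $s$ a sentence, $N$ the number of sentences, $\mathrm{df}(t)$ the number of sentences containing $t$, and $\mathbf{a}$, $\mathbf{b}$ the two sentence vectors. Libraries often use variants of the IDF term; scikit-learn's `TfidfVectorizer`, with its default settings, uses the smoothed form $\ln\frac{1 + N}{1 + \mathrm{df}(t)} + 1$ and L2-normalizes each vector.

$$
\mathrm{tfidf}(t, s) = \mathrm{tf}(t, s) \cdot \mathrm{idf}(t), \qquad \mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}
$$

$$
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
$$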

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [6]:
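def tfidf_cosine_similarity(s1, s2):
    """Return the cosine similarity of the TF-IDF vectors of two sentences."""
    # Fit the vocabulary and IDF weights on just these two sentences.
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform([s1, s2])
    # Slicing keeps each row 2-D, as cosine_similarity expects.
    sim = cosine_similarity(tfidf[0:1], tfidf[1:2])[0][0]
    return sim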

In [None]:
sent1 = "Dogs are wonderful pets."
sent2 = "Cats are amazing companions."
score = tfidf_cosine_similarity(sent1, sent2)
print(f"TF-IDF Cosine Similarity: {score:.4f}")

TF-IDF Cosine Similarity: 0.1444
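
To see where a score like this comes from, the calculation can be reproduced without scikit-learn. The following is a minimal sketch, not the library's implementation: `tokenize` and `manual_tfidf_cosine` are hypothetical helper names, and the code assumes the default `TfidfVectorizer` behavior (lowercasing, tokens of two or more word characters, and the smoothed IDF `ln((1 + N) / (1 + df)) + 1`).

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    # Lowercase and keep tokens of two or more word characters,
    # an approximation of TfidfVectorizer's default tokenization.
    return re.findall(r"\b\w\w+\b", sentence.lower())


def manual_tfidf_cosine(s1, s2):
    # Raw term counts per sentence (the TF part).
    docs = [Counter(tokenize(s)) for s in (s1, s2)]
    vocab = sorted(set(docs[0]) | set(docs[1]))
    n_docs = len(docs)

    # Smoothed IDF: words found in both sentences get weight 1,
    # words found in only one get a higher weight.
    idf = {}
    for term in vocab:
        df = sum(1 for d in docs if term in d)
        idf[term] = math.log((1 + n_docs) / (1 + df)) + 1

    # TF-IDF vectors over the shared vocabulary (missing terms count as 0).
    vectors = [[d[term] * idf[term] for term in vocab] for d in docs]

    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(a * b for a, b in zip(vectors[0], vectors[1]))
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    if norms[0] == 0 or norms[1] == 0:
        return 0.0
    return dot / (norms[0] * norms[1])


score = manual_tfidf_cosine("Dogs are wonderful pets.", "Cats are amazing companions.")
print(f"Manual TF-IDF Cosine Similarity: {score:.4f}")
```

The two sentences share only the word "are", which carries the lowest IDF weight because it appears in both, so the similarity is small even though the sentences are related in meaning.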
