### Exercise 2.16: Calculating Text Similarity Using Jaccard and Cosine Similarity

In [1]:
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
lemmatizer = WordNetLemmatizer()

In [2]:
# Jaccard similarity
def extract_text_similarity_jaccard(text1, text2):
    """Return the Jaccard similarity between two texts after lemmatizing them.

    :param text1: first text
    :param text2: second text
    :return: similarity measure between 0 and 1
    """
    # Uses the module-level lemmatizer created in the first cell
    words_text1 = {lemmatizer.lemmatize(word.lower()) for word in word_tokenize(text1)}
    words_text2 = {lemmatizer.lemmatize(word.lower()) for word in word_tokenize(text2)}
    nr = len(words_text1 & words_text2)  # number of unique words common to both texts
    dr = len(words_text1 | words_text2)  # total number of unique words across both texts
    # Two empty texts have an empty union; return 0.0 instead of dividing by zero
    jaccard_sim = nr / dr if dr else 0.0
    return jaccard_sim

In [3]:
pair1 = ["What you do defines you", "Your deeds define you"]
pair2 = ["Once upon a time there lived a king.", "Who is your queen?"]
pair3 = ["He is desperate", "Is he not desperate?"]

In [4]:
# Jaccard similarity for statements pair1
extract_text_similarity_jaccard(pair1[0], pair1[1])

0.14285714285714285

In [5]:
# Jaccard similarity for statements pair2
extract_text_similarity_jaccard(pair2[0], pair2[1])

0.0

In [6]:
# Jaccard similarity for statements pair3
extract_text_similarity_jaccard(pair3[0], pair3[1])

0.6
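
The three Jaccard scores above can be checked by hand with plain Python sets. The sketch below is an illustration, not the exercise's method: `simple_tokens` is a hypothetical regex tokenizer standing in for `word_tokenize`, and lemmatization is skipped. For these three pairs it still reproduces the scores shown above.

```python
import re

def simple_tokens(text):
    # Hypothetical stand-in for word_tokenize: lowercase the text, then split it
    # into runs of word characters and single punctuation marks.
    return set(re.findall(r"\w+|[^\w\s]", text.lower()))

def jaccard(text1, text2):
    a, b = simple_tokens(text1), simple_tokens(text2)
    union = a | b
    # Shared tokens divided by all distinct tokens; 0.0 if both texts are empty.
    return len(a & b) / len(union) if union else 0.0

pairs = [
    ("What you do defines you", "Your deeds define you"),
    ("Once upon a time there lived a king.", "Who is your queen?"),
    ("He is desperate", "Is he not desperate?"),
]
scores = [jaccard(t1, t2) for t1, t2 in pairs]
for (t1, t2), score in zip(pairs, scores):
    print(f"{t1!r} vs {t2!r}: {score}")
```

For the third pair, the shared tokens are `he`, `is`, and `desperate`, and the union adds `not` and `?`, giving 3/5 = 0.6. For the first pair only `you` is shared out of seven distinct tokens, giving 1/7.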

In [7]:
# Cosine similarity using TfidfVectorizer() method
def get_tf_idf_vectors(corpus):
    tfidf_vectorizer = TfidfVectorizer()
    # fit_transform returns a SciPy sparse matrix (one that stores mostly zero values compactly).
    # It is kept sparse here: cosine_similarity accepts sparse rows, and recent scikit-learn
    # releases reject the np.matrix that .todense() produces.
    tfidf_results = tfidf_vectorizer.fit_transform(corpus)
    return tfidf_results

In [8]:
# Create the corpus as a list of texts and get the TF-IDF vector of each document
corpus = [pair1[0], pair1[1], pair2[0], pair2[1], pair3[0], pair3[1]]
tf_idf_vectors = get_tf_idf_vectors(corpus)

In [9]:
# Check the cosine similarity between the first and second texts
cosine_similarity(tf_idf_vectors[0], tf_idf_vectors[1])

array([[0.3082764]])

In [10]:
# Check the cosine similarity between the third and fourth texts
cosine_similarity(tf_idf_vectors[2], tf_idf_vectors[3])

array([[0.]])

In [11]:
# Check the cosine similarity between the fifth and sixth texts
cosine_similarity(tf_idf_vectors[4], tf_idf_vectors[5])

array([[0.80368547]])
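
The cosine scores above can also be reproduced without scikit-learn, which makes the weighting explicit. The sketch below assumes `TfidfVectorizer`'s documented default settings: lowercasing, tokens of two or more word characters, raw term counts, and smoothed IDF `ln((1 + n) / (1 + df)) + 1`. The L2 normalization step is folded into the cosine formula. The helper names `tokenize`, `tfidf_vectors`, and `cosine` are hypothetical.

```python
import math
import re
from collections import Counter

corpus = [
    "What you do defines you", "Your deeds define you",
    "Once upon a time there lived a king.", "Who is your queen?",
    "He is desperate", "Is he not desperate?",
]

def tokenize(text):
    # Assumed to mirror the vectorizer's default token pattern:
    # lowercase, keep only tokens with at least two word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf_vectors(docs):
    counts = [Counter(tokenize(doc)) for doc in docs]
    n = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for c in counts for term in c)
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n) / (1 + df[term])) + 1 for term in df}
    # Unnormalized weights (term count times IDF), stored as sparse dicts.
    return [{term: tf * idf[term] for term, tf in c.items()} for c in counts]

def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    # Dividing by both norms is equivalent to L2-normalizing each vector first.
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = tfidf_vectors(corpus)
sims = [cosine(vectors[i], vectors[i + 1]) for i in (0, 2, 4)]
print([round(s, 4) for s in sims])
```

Unlike the Jaccard function, this tokenization drops the `?` and one-letter words such as `a`, and a rare term such as `not` receives a larger weight than a common one such as `is`, which is why the two measures rank the pairs the same way but give different magnitudes.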

In this exercise, we measured the similarity between texts in two ways: Jaccard similarity over lemmatized tokens and cosine similarity over TF-IDF vectors. The texts "He is desperate" and "Is he not desperate?" scored highest under both measures (0.6 for Jaccard and about 0.80 for cosine), meaning they are highly similar. By contrast, "Once upon a time there lived a king." and "Who is your queen?" scored zero under both measures because they share no terms.