## Cosine Similarity in Natural Language Processing
By Piyush Srivastava

- Vectors have both magnitude and direction, so we can measure the angle between two vectors and use it to judge how similar they are.
- Cosine similarity is the cosine of the angle between the two vectors. It depends only on their directions, not on their magnitudes.
- Cosine similarity always lies in the range -1 to +1. A value of +1 means the vectors point in exactly the same direction (perfectly similar), a value of -1 means they point in opposite directions (perfectly dissimilar), and a value of 0 means they are orthogonal. For word-count vectors, whose entries are never negative, the score lies between 0 and +1.
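
The three reference points of the range can be checked on small hand-made vectors. The snippet below is a minimal sketch: the helper name `cosine` and the example vectors are made up for illustration and are not part of the notebook that follows.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two 1-D vectors (helper named for this sketch)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = [1, 2, 3]
same_direction = cosine(a, [2, 4, 6])    # parallel to a, but twice as long
opposite = cosine(a, [-1, -2, -3])       # points the opposite way
orthogonal = cosine([1, 0], [0, 1])      # at right angles

# Expect values of approximately +1, -1 and 0 (up to floating-point rounding)
print(same_direction, opposite, orthogonal)
```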

### Between two vectors

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
def cosine_similarity(vector1, vector2):
    vector1 = np.array(vector1)
    vector2 = np.array(vector2)
    # Dot product of the two vectors divided by the product of their
    # magnitudes (Euclidean norms).
    return np.dot(vector1, vector2) / (np.sqrt(np.sum(vector1**2)) * np.sqrt(np.sum(vector2**2)))
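
Written as a formula, the function above computes the following, where $\mathbf{a}$ and $\mathbf{b}$ are notation introduced here for `vector1` and `vector2`, and $\theta$ is the angle between them:

$$
\cos\theta \;=\; \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}
\;=\; \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\;\sqrt{\sum_i b_i^2}}
$$

The denominator is zero if either vector is all zeros, so the score is undefined in that case.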

Example: we take two arrays as vectors and compute the cosine similarity between them.

In [4]:
d1 = (6,0,4,6,3,9,8,7,5,6)
d2 = (7,0,4,4,6,9,8,5,2,1)
d1 = np.array(d1)
d2 = np.array(d2)
print(d1)
print(d2)

[6 0 4 6 3 9 8 7 5 6]
[7 0 4 4 6 9 8 5 2 1]


In [5]:
cosine_similarity(d1, d2)

0.923270487736007

The score of about 0.92 is close to 1, so the two vectors point in nearly the same direction. As described in the theory above, a cosine similarity close to 1 means the two vectors are very similar.
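
To see where the score comes from, the intermediate quantities can be printed one at a time. This is a minimal sketch that reuses the vectors `d1` and `d2` from the example; the variable names `dot`, `sq_norm_d1`, `sq_norm_d2` and `scaled_score` are introduced here for illustration only. It also checks that rescaling a vector leaves the score unchanged, since only direction matters.

```python
import numpy as np

d1 = np.array([6, 0, 4, 6, 3, 9, 8, 7, 5, 6])
d2 = np.array([7, 0, 4, 4, 6, 9, 8, 5, 2, 1])

dot = int(np.dot(d1, d2))            # sum of element-wise products: 296
sq_norm_d1 = int(np.sum(d1 ** 2))    # squared magnitude of d1: 352
sq_norm_d2 = int(np.sum(d2 ** 2))    # squared magnitude of d2: 292

score = float(dot / np.sqrt(sq_norm_d1 * sq_norm_d2))
print(dot, sq_norm_d1, sq_norm_d2, round(score, 4))   # 296 352 292 0.9233

# Rescaling one vector changes its magnitude but not its direction,
# so the score stays the same (up to floating-point rounding).
scaled_score = np.dot(3 * d1, d2) / (np.linalg.norm(3 * d1) * np.linalg.norm(d2))
print(np.isclose(score, scaled_score))                # True
```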

### Between documents in a corpus

- The corpus consists of three documents. The first and second documents are both about trigonometry, while the third is about an unrelated topic (a restaurant).

- When we calculate cosine similarity, we expect a higher score for the pair of trigonometry documents than for either pair that includes the third document.

In [6]:
text = (""" Trigonometry is a branch of mathematics that studies relationships between side lengths and angles of triangles The field emerged in the Hellenistic world during the 3rd century BC from applications""", 
        """ Driven by the demands of navigation and the growing need for accurate maps of large geographic areas trigonometry grew into a major branch of mathematics Bartholomaeus Pitiscus was the first""", 
        """ One of Los Angeles oldest continuing operating restaurants The Apple Pan is also notable as the basis for the popular Johnny Rockets restaurant chain Johnny Rockets founder Ronn Teitlebaum claimed""")

In [7]:
# Convert the corpus to a pandas Series
corpus = pd.Series(text)

# Cosine similarity (same definition as in the previous section)
def cosine_similarity(vector1, vector2):
    vector1 = np.array(vector1)
    vector2 = np.array(vector2)
    return np.dot(vector1, vector2) / (np.sqrt(np.sum(vector1**2)) * np.sqrt(np.sum(vector2**2)))

In [8]:
# Build the bag-of-words (term-count) matrix with CountVectorizer
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(corpus)

In [9]:
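# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out()
# returns a NumPy array, so convert it to a plain list for display
feature_names_count = vectorizer.get_feature_names_out().tolist()
feature_names_count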

['3rd',
 'accurate',
 'also',
 'and',
 'angeles',
 'angles',
 'apple',
 'applications',
 'areas',
 'as',
 'bartholomaeus',
 'basis',
 'bc',
 'between',
 'branch',
 'by',
 'century',
 'chain',
 'claimed',
 'continuing',
 'demands',
 'driven',
 'during',
 'emerged',
 'field',
 'first',
 'for',
 'founder',
 'from',
 'geographic',
 'grew',
 'growing',
 'hellenistic',
 'in',
 'into',
 'is',
 'johnny',
 'large',
 'lengths',
 'los',
 'major',
 'maps',
 'mathematics',
 'navigation',
 'need',
 'notable',
 'of',
 'oldest',
 'one',
 'operating',
 'pan',
 'pitiscus',
 'popular',
 'relationships',
 'restaurant',
 'restaurants',
 'rockets',
 'ronn',
 'side',
 'studies',
 'teitlebaum',
 'that',
 'the',
 'triangles',
 'trigonometry',
 'was',
 'world']

In [10]:
# Dense term-count matrix: one row per document, one column per vocabulary term
features_array_count = bow_matrix.toarray()
features_array_count

array([[1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0,
        1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,
        0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 3, 1, 1, 0,
        1],
       [0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1,
        0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1,
        1, 0, 3, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 1, 1,
        0],
       [0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
        0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 1, 0, 0, 0, 0,
        0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 2, 1, 0, 0, 1, 0, 3, 0, 0, 0,
        0]], dtype=int64)

In [11]:
bow_matrix.shape

(3, 67)

In [12]:
# Cosine similarity score for every pair of documents in the corpus
for i in range(bow_matrix.shape[0]):
    for j in range(i + 1, bow_matrix.shape[0]):
        print("The cosine similarity between the documents ", i, "and", j, "is: ",
              cosine_similarity(features_array_count[i], features_array_count[j]))

The cosine similarity between the documents  0 and 1 is:  0.48782135766494206
The cosine similarity between the documents  0 and 2 is:  0.3119251469460218
The cosine similarity between the documents  1 and 2 is:  0.32101211891111664


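The output matches our expectation: the two trigonometry documents (0 and 1) score about 0.49, clearly higher than either pair involving the restaurant document (about 0.31 and 0.32). The scores against the third document are not zero because raw counts include very common words such as "the" and "of", which appear in all three documents.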
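
Raw counts give a lot of weight to very common words such as "the" and "of", which all three documents share. One possible variation, sketched below, uses the `TfidfVectorizer` imported at the top of the notebook together with scikit-learn's built-in English stop-word list and its pairwise `cosine_similarity` function (imported here as `pairwise_cosine` to avoid clashing with the function defined above). This is an illustration under those assumptions, not part of the original analysis, and the names `tfidf`, `tfidf_matrix` and `sim` are introduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity as pairwise_cosine

# The same three documents as in the corpus above
corpus = [
    "Trigonometry is a branch of mathematics that studies relationships "
    "between side lengths and angles of triangles The field emerged in the "
    "Hellenistic world during the 3rd century BC from applications",
    "Driven by the demands of navigation and the growing need for accurate "
    "maps of large geographic areas trigonometry grew into a major branch "
    "of mathematics Bartholomaeus Pitiscus was the first",
    "One of Los Angeles oldest continuing operating restaurants The Apple "
    "Pan is also notable as the basis for the popular Johnny Rockets "
    "restaurant chain Johnny Rockets founder Ronn Teitlebaum claimed",
]

# TF-IDF weighting with common English stop words removed
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(corpus)

# 3 x 3 matrix: sim[i, j] is the cosine similarity between documents i and j
sim = pairwise_cosine(tfidf_matrix)

for i in range(sim.shape[0]):
    for j in range(i + 1, sim.shape[0]):
        print("Documents", i, "and", j, "->", round(float(sim[i, j]), 4))
```

With stop words removed, the only similarity left should come from content words the documents actually share, so the gap between the trigonometry pair and the other two pairs is expected to widen.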