# Cosine Similarity

## What is it?


Cosine Similarity is a measure of similarity between two non-zero vectors in an inner product space. It calculates the cosine of the angle between the two vectors, providing a measure that indicates how similar the two vectors are, irrespective of their magnitude.

**Formula**

The cosine similarity between two vectors A and B is given by:

Cosine Similarity = $\frac{A \cdot B}{||A|| ||B||}$

**Range**

* The value of cosine similarity ranges from -1 to 1.
* 1 indicates that the vectors point in the same direction (i.e., the angle between them is 0 degrees).
* 0 indicates that the vectors are orthogonal (i.e., the angle between them is 90 degrees).
* -1 indicates that the vectors are diametrically opposed (i.e., the angle between them is 180 degrees).
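
The formula and its range can be checked with a few lines of plain Python. This is a minimal sketch rather than a library API: the helper name `cosine_sim` is made up for illustration, and it assumes both inputs are non-zero vectors of equal length.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

same_direction = cosine_sim([1, 2], [3, 6])   # parallel vectors with different magnitudes
orthogonal = cosine_sim([1, 0], [0, 5])       # 90 degrees apart
opposite = cosine_sim([1, 2], [-1, -2])       # 180 degrees apart

print(same_direction, orthogonal, opposite)
```

Up to floating-point rounding, the three values are 1, 0 and -1. The first pair also shows that only direction matters: `[3, 6]` is three times as long as `[1, 2]`, yet the similarity is still 1.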

## What for?

**Applications in NLP**

* Document Comparison: Comparing the content of documents for plagiarism detection.
* Query Matching: Matching user queries to documents in a search engine.
* Text Summarization: Finding sentences similar to a given sentence to create summaries.
* Paraphrase Detection: Detecting if two sentences have similar meanings.


## Example

$$
\mathbf{A} = \begin{bmatrix} 1 \\ 0 \\ 2 \\ 3 \end{bmatrix},
\quad
\mathbf{B} = \begin{bmatrix} 0 \\ 1 \\ 2 \\ 3 \end{bmatrix}
$$

$$
A \cdot{B} = (1 \cdot{0}) + (0 \cdot{1}) + (2 \cdot{2}) + ( 3 \cdot{3}) = 13
$$

$$
||A|| = \sqrt{1^2 + 0^2 + 2^2 + 3^2} = \sqrt{14}
$$

$$
||B|| =  \sqrt{0^2 + 1^2 + 2^2 + 3^2} = \sqrt{14}
$$

$$
\text{Cosine Similarity} = \frac{A \cdot B}{||A|| \, ||B||} = \frac{13}{\sqrt{14} \cdot \sqrt{14}} = \frac{13}{14} \approx 0.93
$$
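
The arithmetic above can be reproduced step by step. The sketch below uses only the standard library, and the variable names are chosen here for illustration. It also converts the result into the angle between the two vectors, which the formula implies but the example does not compute.

```python
import math

A = [1, 0, 2, 3]
B = [0, 1, 2, 3]

dot = sum(a * b for a, b in zip(A, B))       # 1*0 + 0*1 + 2*2 + 3*3 = 13
norm_a = math.sqrt(sum(a * a for a in A))    # sqrt(14)
norm_b = math.sqrt(sum(b * b for b in B))    # sqrt(14)

similarity = dot / (norm_a * norm_b)         # 13 / 14
angle_deg = math.degrees(math.acos(similarity))

print(f"similarity = {similarity:.4f}, angle = {angle_deg:.1f} degrees")
```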

## How to do it?

### Packages



*   sklearn



In [28]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

### Examples

In [None]:
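import numpy as np

# cosine_similarity expects 2D arrays of shape (n_samples, n_features)
vector_a = np.array([[1, 0, 2, 3]])
vector_b = np.array([[1, 0, 2, 3]])

# The two vectors are identical, so the similarity is 1
print(cosine_similarity(vector_a, vector_b))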

[[1.]]
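
`cosine_similarity` also accepts a single 2D array, in which case it returns the pairwise similarities between all of its rows. As a sketch, the rows below are the vectors A and B from the worked example; the variable names `vectors` and `similarity_matrix` are illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Each row is one vector; scikit-learn expects 2D input of shape (n_samples, n_features).
vectors = np.array([
    [1, 0, 2, 3],   # A from the worked example
    [0, 1, 2, 3],   # B from the worked example
])

# With a single argument, every row is compared with every other row.
similarity_matrix = cosine_similarity(vectors)
print(similarity_matrix.round(2))
```

The diagonal holds the similarity of each vector with itself (1), and the off-diagonal entries equal 13/14, about 0.93, matching the hand calculation.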


## Practice

Given two documents, `doc_1` and `doc_2`:


In [29]:
doc_1 = "i love cats, but not dogs"

doc_2 = "i love dogs, but not cats"

documents = [doc_1, doc_2]


### Quiz 1

Calculate the cosine similarity between `doc_1` and `doc_2` using `CountVectorizer`

In [49]:
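count_vectorizer = CountVectorizer()
# Fit once on both documents so that x1 and x2 share one vocabulary (same columns)
count_vectorizer.fit(documents)
x1 = count_vectorizer.transform([doc_1])
x2 = count_vectorizer.transform([doc_2])
x1.toarray()
x2.toarray()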



array([[1, 1, 1, 1, 1]])

In [51]:

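count_vectorizer = CountVectorizer(stop_words='english')
# Fit once on both documents so that x1 and x2 share one vocabulary (same columns)
count_vectorizer.fit(documents)
x1 = count_vectorizer.transform([doc_1])
x2 = count_vectorizer.transform([doc_2])
x1.toarray()
x2.toarray()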

array([[1, 1, 1]])

In [52]:
count_vectorizer.get_feature_names_out()

array(['cats', 'dogs', 'love'], dtype=object)

In [None]:
cosine_similarity(x1, x2)

array([[1.]])
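
The columns of two document vectors are only comparable when both vectors come from the same fitted vocabulary. The following sketch uses two made-up documents (`doc_a` and `doc_b`, which are not part of the exercise) that share no words, and contrasts one vectorizer per document with a single vectorizer fitted on both.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical documents with no words in common.
doc_a = "cats purr"
doc_b = "dogs bark"

# Separate fits: each vector lives in its own 2-word vocabulary, so the columns do not line up.
separate_a = CountVectorizer().fit_transform([doc_a])   # columns: cats, purr
separate_b = CountVectorizer().fit_transform([doc_b])   # columns: bark, dogs
misleading = cosine_similarity(separate_a, separate_b)[0, 0]

# Shared fit: one vocabulary (bark, cats, dogs, purr) for both documents.
shared = CountVectorizer().fit_transform([doc_a, doc_b])
correct = cosine_similarity(shared)[0, 1]

print(misleading, correct)
```

With separate fits the similarity comes out as 1 even though the documents have nothing in common; with a shared vocabulary it is 0. For `doc_1` and `doc_2` both approaches happen to give 1, because the two documents contain exactly the same words.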

### Quiz 2

Calculate the cosine similarity between `doc_1` and `doc_2` using `CountVectorizer`, after removing stop words

In [55]:

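count_vectorizer = CountVectorizer(stop_words='english')
# Fit once on both documents so that x1 and x2 share one vocabulary (same columns)
count_vectorizer.fit(documents)
x1 = count_vectorizer.transform([doc_1])
x2 = count_vectorizer.transform([doc_2])
x1.toarray()
x2.toarray()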

array([[1, 1, 1]])

In [59]:
print(cosine_similarity(x1,x2))

[[1.]]


### Quiz 3

Calculate the cosine similarity between `doc_1` and `doc_2` using TF-IDF

In [60]:
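tfidf_vectorizer = TfidfVectorizer()
# Fit once on both documents so that the IDF weights and columns are shared
tfidf_vectorizer.fit(documents)
x1 = tfidf_vectorizer.transform([doc_1])
x2 = tfidf_vectorizer.transform([doc_2])
x1.toarray()
x2.toarray()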

array([[0.4472136, 0.4472136, 0.4472136, 0.4472136, 0.4472136]])

In [61]:
cosine_similarity(x1,x2)

array([[1.]])
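
To see where the value 0.4472136 comes from, the sketch below recomputes the weights by hand. It assumes scikit-learn's default `TfidfVectorizer` settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization) and a vocabulary built from both documents. The `tokenize` function is a simplified stand-in that keeps lowercase words of two or more characters, and all names are illustrative.

```python
import math
import re
from collections import Counter

docs = ["i love cats, but not dogs", "i love dogs, but not cats"]

def tokenize(text):
    # Simplified stand-in tokenizer: lowercase words with 2+ characters.
    return re.findall(r"\b\w\w+\b", text.lower())

counts = [Counter(tokenize(d)) for d in docs]
vocab = sorted(set().union(*counts))
n_docs = len(docs)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + sum(t in c for c in counts))) + 1 for t in vocab}

def tfidf_vector(count):
    raw = [count[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]   # L2-normalized row

v1, v2 = (tfidf_vector(c) for c in counts)
# Rows are unit length, so their dot product is the cosine similarity.
similarity = sum(a * b for a, b in zip(v1, v2))

print(vocab)
print([round(w, 7) for w in v1], round(similarity, 7))
```

Every term occurs once in both documents, so all IDF values equal 1 and each of the five weights becomes 1/√5 ≈ 0.4472136. The two rows are identical, which gives a similarity of 1.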

### Quiz 4

Calculate the cosine similarity between `doc_1` and `doc_2` using TF-IDF, after removing stop words

In [64]:
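tfidf_vectorizer = TfidfVectorizer(stop_words='english')
# Fit once on both documents so that the IDF weights and columns are shared
tfidf_vectorizer.fit(documents)
x1 = tfidf_vectorizer.transform([doc_1])
x2 = tfidf_vectorizer.transform([doc_2])
x1.toarray()
x2.toarray()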

array([[0.57735027, 0.57735027, 0.57735027]])

In [65]:
cosine_similarity(x1,x2)

array([[1.]])

### Quiz 5

Repeat Quizzes 1-4 using n-grams with `ngram_range=(1, 2)`

In [None]:
# write answer here

### Quiz 6

Repeat Quizzes 1-4 using n-grams with `ngram_range=(2, 3)`

Output for the four settings of Quizzes 1-4 (every feature is a bigram or a trigram):

CountVectorizer:

       but not  but not cats  but not dogs  cats but  cats but not  dogs but  \
doc_1        1             0             1         1             1         0
doc_2        1             1             0         0             0         1

       dogs but not  love cats  love cats but  love dogs  love dogs but  \
doc_1             0          1              1          0              0
doc_2             1          0              0          1              1

       not cats  not dogs
doc_1         0         1
doc_2         1         0

**** Cosine Similarity:

[[1.         0.14285714]
 [0.14285714 1.        ]]

CountVectorizer, stop words removed:

       cats dogs  dogs cats  love cats  love cats dogs  love dogs  \
doc_1          1          0          1               1          0
doc_2          0          1          0               0          1

       love dogs cats
doc_1               0
doc_2               1

**** Cosine Similarity:

[[1. 0.]
 [0. 1.]]

TF-IDF:

        but not  but not cats  but not dogs  cats but  cats but not  dogs but  \
doc_1  0.278943      0.000000      0.392044  0.392044      0.392044  0.000000
doc_2  0.278943      0.392044      0.000000  0.000000      0.000000  0.392044

       dogs but not  love cats  love cats but  love dogs  love dogs but  \
doc_1      0.000000   0.392044       0.392044   0.000000       0.000000
doc_2      0.392044   0.000000       0.000000   0.392044       0.392044

       not cats  not dogs
doc_1  0.000000  0.392044
doc_2  0.392044  0.000000

**** Cosine Similarity:

[[1.         0.07780894]
 [0.07780894 1.        ]]

TF-IDF, stop words removed:

       cats dogs  dogs cats  love cats  love cats dogs  love dogs  \
doc_1    0.57735    0.00000    0.57735         0.57735    0.00000
doc_2    0.00000    0.57735    0.00000         0.00000    0.57735

       love dogs cats
doc_1         0.00000
doc_2         0.57735

**** Cosine Similarity:

[[1. 0.]
 [0. 1.]]

In [None]:
# write answer here
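
To make the effect of word order visible, the sketch below builds word n-gram counts by hand and compares the two documents. It is a plain-Python illustration rather than scikit-learn's implementation: `word_ngrams` and `cosine_sim` are hypothetical helpers, and `tokenize` is a simplified stand-in that keeps lowercase words of two or more characters.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Simplified stand-in tokenizer: lowercase words with 2+ characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def word_ngrams(tokens, min_n, max_n):
    """Count all word n-grams with min_n <= n <= max_n (hypothetical helper)."""
    grams = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams[" ".join(tokens[i:i + n])] += 1
    return grams

def cosine_sim(c1, c2):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(c1[g] * c2[g] for g in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)

doc_1 = "i love cats, but not dogs"
doc_2 = "i love dogs, but not cats"

# Bigrams and trigrams only, i.e. the equivalent of ngram_range=(2, 3).
grams_1 = word_ngrams(tokenize(doc_1), 2, 3)
grams_2 = word_ngrams(tokenize(doc_2), 2, 3)

shared = sorted(grams_1.keys() & grams_2.keys())
similarity = cosine_sim(grams_1, grams_2)

print(shared)                 # n-grams that occur in both documents
print(round(similarity, 8))
```

With bigrams and trigrams, `but not` is the only n-gram the two documents share, so the similarity is 1/7 ≈ 0.143 even though both documents contain exactly the same words.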

### Quiz 7

What makes `doc_1` and `doc_2` similar?

### Quiz 8

What makes `doc_1` and `doc_2` different?

### Quiz 9

What is your observation about the cosine similarity between `doc_1` and `doc_2`?

### Quiz 10

How do n-grams affect the cosine similarity between `doc_1` and `doc_2`?