------------------
#### Cosine Similarity
----------------------
- In text/NLP (LLM-based applications)

In [1]:
import numpy as np
import pandas as pd

#### What is cosine similarity?

- Intuitively, suppose we have two vectors, each representing a sentence. If the vectors are close to parallel, we assume that both sentences are “similar” in theme. If the vectors are orthogonal, we assume the sentences are unrelated, i.e. NOT “similar”.

$$ \mbox{Cosine Similarity} = \frac{\sum_{i=1}^{n}{x_{i} y_{i}}}
           {\sqrt{\sum_{i=1}^{n}{x_{i}^{2}}}
            \sqrt{\sum_{i=1}^{n}{y_{i}^{2}}}}$$
            
$$ \mbox{Cosine Distance} = 1 - \mbox{Cosine Similarity} $$ 

The resulting similarity ranges from −1 to 1:

- −1 means the vectors point in exactly opposite directions,
-  1 means they point in exactly the same direction (their lengths may still differ),
-  0 indicates orthogonality or decorrelation,
- in-between values indicate intermediate similarity or dissimilarity.
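
The three boundary cases above can be reproduced with small hand-picked vectors. This is only a minimal sketch: the helper name `cosine_similarity` and the example vectors are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero 1-D vectors (illustrative helper)."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 2.0])

same_direction = cosine_similarity(x, 2 * x)                # parallel vectors
orthogonal = cosine_similarity(x, np.array([-2.0, 1.0]))    # dot product is 0
opposite = cosine_similarity(x, -x)                         # reversed direction

# Expect values close to 1, 0 and -1 (up to floating-point rounding)
print(same_direction, orthogonal, opposite)
```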

#### Applications
- In practice, cosine similarity tends to be useful when trying to determine how similar two texts/documents are. 

- It is used in sentiment analysis, translation, and plagiarism detection.

- Cosine similarity works in these use cases because it ignores magnitude and focuses solely on orientation, so it suits problems where the magnitude of the vectors does not matter.

- In NLP, this lets us detect that a much longer document has the same “theme” as a much shorter one, since we don’t take the magnitude (the “length”) of the documents into account.

- For __text matching__, the attribute vectors $x$ and $y$ are usually the term frequency vectors of the documents. Cosine similarity can be seen as a method of normalizing document length during comparison.

- In the case of __information retrieval__, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (using tf–idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90°.
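
The length-normalization and non-negativity points above can be sketched with made-up term counts. The vectors below are hypothetical counts over a four-word vocabulary, and `long_doc` stands in for the same text repeated five times. Euclidean distance is included only for contrast.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero 1-D vectors (illustrative helper)."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Hypothetical term counts over a 4-word vocabulary
short_doc = np.array([2, 1, 0, 3])
long_doc = 5 * short_doc            # same text repeated five times
other_doc = np.array([0, 4, 1, 0])  # a different mix of the same words

sim_repeated = cosine_similarity(short_doc, long_doc)    # unaffected by length
sim_other = cosine_similarity(short_doc, other_doc)      # counts are >= 0, so this is in [0, 1]
euclid_repeated = np.linalg.norm(short_doc - long_doc)   # grows with document length

print(sim_repeated, sim_other, euclid_repeated)
```

Repeating a document scales its count vector without changing its direction, so the cosine similarity stays at 1 while the Euclidean distance does not stay at 0.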

In [2]:
# vectors
a = np.array([1,  2,  3])
b = np.array([10, 10, 40])

In [3]:
from scipy.spatial import distance

In [4]:
# method - 1 - using scipy
# distance.cosine computes the cosine DISTANCE between two 1-D arrays,
# so cosine sim = 1 - cosine distance
print('Cosine distance = ', distance.cosine(a, b))

print('cosine sim = ', 1 - distance.cosine(a, b))

Cosine distance =  0.05508881747693195
cosine sim =  0.944911182523068


In [5]:
# method - 2 - using numpy
# manually compute cosine similarity
dot = np.dot(a, b)
norma = np.linalg.norm(a)
normb = np.linalg.norm(b)
cos = dot / (norma * normb)

print("Cosine similarity is {}".format(cos))
print("Cosine distance   is {}".format(1-cos))

Cosine similarity is 0.944911182523068
Cosine distance   is 0.05508881747693195
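
As a check by hand, plugging $a = (1, 2, 3)$ and $b = (10, 10, 40)$ into the formula gives the same value that both methods print:

$$ a \cdot b = 1 \cdot 10 + 2 \cdot 10 + 3 \cdot 40 = 150 $$

$$ \lVert a \rVert = \sqrt{1 + 4 + 9} = \sqrt{14}, \qquad \lVert b \rVert = \sqrt{100 + 100 + 1600} = \sqrt{1800} $$

$$ \mbox{Cosine Similarity} = \frac{150}{\sqrt{14}\,\sqrt{1800}} = \frac{150}{\sqrt{25200}} \approx 0.9449 $$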


#### Exercise - apply cosine similarity to Wikipedia data

In [6]:
!pip install wikipedia

Collecting wikipedia
  Using cached wikipedia-1.4.0-py3-none-any.whl
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


In [7]:
import wikipedia

In [8]:
q1 = wikipedia.page('Deep Learning')
q2 = wikipedia.page('Artificial Intelligence')
q3 = wikipedia.page('olympic games')
q4 = wikipedia.page('Baseball')

In [10]:
q1.title

'Deep learning'

In [11]:
q1.summary

'Deep learning is a subset of machine learning that focuses on utilizing neural networks to perform tasks such as classification, regression, and representation learning. The field takes inspiration from biological neuroscience and is centered around stacking artificial neurons into layers and "training" them to process data. The adjective "deep" refers to the use of multiple layers (ranging from three to several hundred or thousands) in the network. Methods used can be either supervised, semi-supervised or unsupervised.\nSome common deep learning network architectures include fully connected networks, deep belief networks, recurrent neural networks, convolutional neural networks, generative adversarial networks, transformers, and neural radiance fields. These architectures have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, climate science, material inspection 

In [26]:
#q1.content

In [14]:
len(q1.content.split())

8394

In [15]:
len(q2.content.split())

13055

In [16]:
len(q3.content.split())

13590

In [17]:
len(q4.content.split())

9751

In [18]:
from sklearn.feature_extraction.text import CountVectorizer

In [19]:
vect = CountVectorizer()

In [20]:
vect.fit([q1.content, q2.content, q3.content, q4.content])

In [21]:
len(vect.get_feature_names_out())

7011

#### Vectorize q1, q2, q3, q4

In [24]:
q_vect = vect.transform([q1.content, q2.content, q3.content, q4.content]).toarray()

In [25]:
q_vect

array([[5, 0, 0, ..., 0, 0, 0],
       [2, 0, 0, ..., 0, 0, 0],
       [8, 0, 0, ..., 1, 1, 1],
       [0, 1, 1, ..., 0, 0, 0]], dtype=int64)
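
To make the `fit` / `transform` steps less of a black box, here is a rough pure-Python sketch of what a count vectorizer does. It only approximates `CountVectorizer`'s defaults (lowercasing, tokens of two or more word characters, alphabetically ordered vocabulary). The helper names and the two toy sentences are assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

docs = ["The cat sat.", "The cat saw the dog."]

# "fit": collect the sorted set of distinct tokens across all documents
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

def count_vector(text):
    # "transform": one count per vocabulary term, in vocabulary order
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

vectors = [count_vector(doc) for doc in docs]

print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)     # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each row has one entry per vocabulary term, which is why every row of `q_vect` has as many columns as there are feature names.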

In [27]:
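# Method 3
# define our own cosine function
def compute_cosine(a, b):
    # numerator: dot product of the two vectors
    result_num = np.dot(a, b)
    # denominator: product of the two Euclidean (L2) norms
    result_den = np.sqrt(np.sum(a ** 2)) * np.sqrt(np.sum(b ** 2))

    return result_num / result_den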

In [28]:
compute_cosine(q_vect[0], q_vect[1])

0.8976490353165566

In [29]:
compute_cosine(q_vect[1], q_vect[0])

0.8976490353165566

In [30]:
compute_cosine(q_vect[0], q_vect[2])

0.8132531451887015

In [31]:
compute_cosine(q_vect[0], q_vect[3])

0.8163587091444843
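
Even the seemingly unrelated pairs above score around 0.8. One plausible reason (an assumption, not something verified on these pages) is that raw counts are dominated by very common words such as "the" and "of", which every long English article shares. The toy sketch below illustrates the effect with two made-up sentences and a tiny hypothetical stop-word list.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero 1-D vectors (illustrative helper)."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def count_vectors(doc_a, doc_b, stop_words=frozenset()):
    # Whitespace tokenization, optionally dropping stop words
    tokens_a = [t for t in doc_a.lower().split() if t not in stop_words]
    tokens_b = [t for t in doc_b.lower().split() if t not in stop_words]
    vocab = sorted(set(tokens_a) | set(tokens_b))
    vec_a = np.array([tokens_a.count(t) for t in vocab])
    vec_b = np.array([tokens_b.count(t) for t in vocab])
    return vec_a, vec_b

doc_a = "the model is trained on the data"
doc_b = "the game is played on the field"
stop_words = {"the", "is", "on"}  # tiny hypothetical stop-word list

raw_sim = cosine_similarity(*count_vectors(doc_a, doc_b))
filtered_sim = cosine_similarity(*count_vectors(doc_a, doc_b, stop_words))

print(raw_sim)       # about 0.667: only "the", "is", "on" are shared
print(filtered_sim)  # 0.0: no content words in common
```

In scikit-learn, similar effects can be explored with `CountVectorizer(stop_words='english')` or with `TfidfVectorizer`, which down-weights terms that appear in many documents. Whether either changes the ranking for these four articles would need to be checked by rerunning the exercise.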