Document similarity:

A common approach to matching similar documents is to count the number of words they share. But this approach has an inherent flaw: as the documents grow longer, the number of common words tends to increase even if the documents are about different topics.
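A small sketch of this flaw, using made-up sentences about two unrelated topics. The helper name `common_word_count` and the example texts are assumptions introduced here for illustration only:

```python
def common_word_count(text_a, text_b):
    """Count the distinct words that appear in both texts (case-insensitive)."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    return len(words_a & words_b)

# Two short texts on different topics (hypothetical examples)
short_a = "the cricket match was exciting"
short_b = "the election result was close"

# The same texts made longer; the topics are still unrelated
long_a = short_a + " and the crowd stayed until the end of the day"
long_b = short_b + " and the count went on until the end of the night"

print(common_word_count(short_a, short_b))  # 2
print(common_word_count(long_a, long_b))    # 6
```

The shared-word count triples, yet all the added overlap comes from filler words such as 'and', 'until' and 'of', not from any shared subject matter.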

The cosine similarity helps overcome this fundamental flaw in the 'count-the-common-words' or Euclidean distance approach.

What is Cosine Similarity and why is it advantageous?

Cosine similarity is a metric used to determine how similar two documents are, irrespective of their size.

Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space. 

This metric is a measurement of orientation and not magnitude.
The two vectors I am talking about are arrays containing the word counts of two documents.
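For reference, the standard definition can be written as follows. The symbols A and B (the two word-count vectors) and n (the vocabulary size) are names introduced here for illustration:

```latex
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \, \sqrt{\sum_{i=1}^{n} B_i^{2}}}
```

Because word counts are never negative, the value lies between 0 (no words in common) and 1 (vectors pointing in the same direction).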

The cosine similarity is advantageous because even if two similar documents are far apart by Euclidean distance because of their size (say, the word 'cricket' appears 50 times in one document and 10 times in another), they can still have a small angle between them.
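The 'cricket' case can be checked numerically. This is a minimal sketch with made-up counts: the three-word vocabulary and the vectors `doc_short` and `doc_long` are assumptions, chosen so that the longer document has the same word proportions as the shorter one.

```python
import numpy as np

# Hypothetical word counts over the vocabulary ['cricket', 'bat', 'ball'].
# doc_long has the same proportions as doc_short, with every count 5x larger.
doc_short = np.array([10, 2, 4])
doc_long = np.array([50, 10, 20])

# Euclidean distance grows with the difference in document size
euclidean = np.linalg.norm(doc_long - doc_short)

# Cosine similarity depends only on the direction of the vectors
cosine = doc_short @ doc_long / (np.linalg.norm(doc_short) * np.linalg.norm(doc_long))

print("Euclidean distance:", euclidean)
print("Cosine similarity:", cosine)
```

The Euclidean distance comes out at about 43.8, while the cosine similarity is 1 up to floating-point rounding, because the two vectors point in the same direction.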

The smaller the angle, the higher the similarity.

Two documents are maximally similar (cosine similarity of 1) when they contain the same words in the same proportions.

In [16]:
doc_trump = "Mr. Trump became president after winning the political election. \
Though he lost the support of some republican friends, Trump is friends with President Putin"

doc_election = "President Trump says Putin had no political interference in the election outcome.\
He says it was a witchhunt by political parties. \
He claimed President Putin is a friend who had nothing to do with the election"

doc_putin = "Post elections, Vladimir Putin became President of Russia.\
President Putin had served as the Prime Minister earlier in his political career"

In [17]:
documents = [ doc_trump , doc_election , doc_putin ]

In [18]:
import pandas as pd
import numpy as np

import nltk
from sklearn.feature_extraction.text import CountVectorizer

In [19]:
count_vect = CountVectorizer( stop_words = 'english' )

X = count_vect.fit_transform( documents )

In [20]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df = pd.DataFrame( X.toarray() , columns = count_vect.get_feature_names_out() , 
                 index = ['doc_trump' , 'dec_election' , 'doc_putin'])
df

,career,claimed,earlier,election,elections,friend,friends,interference,lost,minister,...,putin,republican,russia,says,served,support,trump,vladimir,winning,witchhunt
doc_trump,0,0,0,1,0,0,2,0,1,0,...,1,1,0,0,0,1,2,0,1,0
dec_election,0,1,0,2,0,1,0,1,0,0,...,2,0,0,2,0,0,1,0,0,1
doc_putin,1,0,1,0,1,0,0,0,0,1,...,2,0,1,0,1,0,0,1,0,0


In [21]:
from sklearn.metrics.pairwise import cosine_similarity

In [61]:
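doc_sim = cosine_similarity( df )

# Zero out the diagonal so that a document is never reported as most similar to itself
np.fill_diagonal( doc_sim , 0 ) 

# Equivalent alternatives:
# doc_sim[ np.diag_indices( 3 ) ] = 0
# doc_sim = doc_sim - np.diag( np.diag( doc_sim ) )
# doc_sim = doc_sim - np.eye( doc_sim.shape[0] )   # relies on the diagonal being exactly 1

# Row i holds the similarity of document i to every document (1, 2, 3, ...)

doc_sim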

array([[0.        , 0.51639778, 0.36893239],
       [0.51639778, 0.        , 0.45360921],
       [0.36893239, 0.45360921, 0.        ]])

In [65]:
row_index = ['doc_trump' , 'dec_election' , 'doc_putin'] 
col_index = ['doc_trump' , 'dec_election' , 'doc_putin']

# For each row, find the column holding the highest similarity score
max_similarity = np.argmax( doc_sim , axis = 1 ) 
for i in range( len(max_similarity) ) :
    print( row_index[i] +  ' is similar to the document ' +  col_index[max_similarity[i]] )
    print()

doc_trump is similar to the document dec_election

dec_election is similar to the document doc_trump

doc_putin is similar to the document dec_election



In [69]:
print( np.argsort( doc_sim , axis = 1 ) )
print()
print( np.argsort( doc_sim , axis = 1 )[ : , : : -1] )

[[0 2 1]
 [1 2 0]
 [2 0 1]]

[[1 2 0]
 [0 2 1]
 [1 0 2]]


In [76]:
doc_sim

array([[0.        , 0.51639778, 0.36893239],
       [0.51639778, 0.        , 0.45360921],
       [0.36893239, 0.45360921, 0.        ]])

In [82]:
# Column indices of each row, sorted from most to least similar
n_max_similarity = np.argsort( doc_sim , axis = 1 )[ : , : : -1]
n = 1   # number of similar documents to report per row

for i in range( len(row_index) ) :
    # Join the names of the top-n most similar documents
    sim_doc = ','.join( col_index[j] for j in n_max_similarity[ i , : n ] )
    print( row_index[i] +  ' is similar to the document ' +  sim_doc )
    print()

doc_trump is similar to the document dec_election

dec_election is similar to the document doc_trump

doc_putin is similar to the document dec_election



In [None]:
# Similarity between the words across the corpus,
# based on which documents they co-occur in

In [87]:
sim_mat = cosine_similarity( df.T )
sim_mat = pd.DataFrame(  sim_mat , columns = df.columns , index = df.columns )



In [88]:
sim_mat.head(1)

,career,claimed,earlier,election,elections,friend,friends,interference,lost,minister,...,putin,republican,russia,says,served,support,trump,vladimir,winning,witchhunt
career,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.666667,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0


In [112]:
# print( sim_mat.loc['career' , :].values )

# print( sim_mat.loc['career' , :].index )

# a = list( zip( sim_mat.loc['career' , :].values  , sim_mat.loc['career' , :].index ) )
# print( a )

# print( sorted( a , key  = lambda x : ( x[0] , x[1] ) )  )
print( sim_mat.loc['career' , :].values )

[1.         0.         1.         0.         1.         0.
 0.         0.         0.         1.         0.         0.
 0.         0.40824829 1.         0.57735027 1.         0.66666667
 0.         1.         0.         1.         0.         0.
 1.         0.         0.        ]


In [102]:
np.argsort( sim_mat.loc['career' , :].values )[ : : -1]

array([ 0, 24,  2,  4, 21,  9, 19, 14, 16, 17, 15, 13, 18, 25, 12, 11, 10,
       20,  8,  7,  6,  5, 22,  3, 23,  1, 26], dtype=int64)
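The sorted indices are easier to read once they are mapped back to words. The following is a minimal sketch of one way to do that. The helper name `top_similar`, the word list and the scores are all hypothetical stand-ins for a row of sim_mat, not values taken from the notebook.

```python
import numpy as np

def top_similar(labels, scores, target, n=3):
    """Return the n (label, score) pairs with the highest scores, skipping `target`."""
    order = np.argsort(scores)[::-1]   # indices from highest to lowest score
    ranked = [(labels[i], float(scores[i])) for i in order if labels[i] != target]
    return ranked[:n]

# Hypothetical labels and similarity scores, standing in for one row of sim_mat
words = ['career', 'claimed', 'earlier', 'putin', 'says']
career_row = np.array([1.0, 0.0, 0.9, 0.66, 0.1])

print(top_similar(words, career_row, target='career', n=2))
# [('earlier', 0.9), ('putin', 0.66)]
```

When several words share the same score, as many do at 1.0 in the output above, np.argsort with its default settings does not guarantee the order among the tied entries, so the ranking within a tie should not be read as meaningful.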

In [116]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer( stop_words = 'english' )

X1 = tfidf_vect.fit_transform( documents )

In [118]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df1 = pd.DataFrame( X1.toarray() , columns = tfidf_vect.get_feature_names_out() , 
                 index = ['doc_trump' , 'dec_election' , 'doc_putin'])
df1

,career,claimed,earlier,election,elections,friend,friends,interference,lost,minister,...,putin,republican,russia,says,served,support,trump,vladimir,winning,witchhunt
doc_trump,0.0,0.0,0.0,0.203368,0.0,0.0,0.53481,0.0,0.267405,0.0,...,0.157934,0.267405,0.0,0.0,0.0,0.267405,0.406737,0.0,0.267405,0.0
dec_election,0.0,0.241982,0.0,0.368067,0.0,0.241982,0.0,0.241982,0.0,0.0,...,0.285837,0.0,0.0,0.483963,0.0,0.0,0.184033,0.0,0.0,0.241982
doc_putin,0.287012,0.0,0.287012,0.0,0.287012,0.0,0.0,0.0,0.0,0.287012,...,0.339028,0.0,0.287012,0.0,0.287012,0.0,0.0,0.287012,0.0,0.0


In [134]:
df1['career']

doc_trump       0.000000
dec_election    0.000000
doc_putin       0.287012
Name: career, dtype: float64

In [131]:
sim_tf = cosine_similarity( df1.T.loc[['career'] , : ] , df1.T.drop( 'career' ) )
sim_tf

array([[0.        , 1.        , 0.        , 1.        , 0.        ,
        0.        , 0.        , 0.        , 1.        , 0.        ,
        0.        , 0.        , 0.46071001, 1.        , 0.6227101 ,
        1.        , 0.72021969, 0.        , 1.        , 0.        ,
        1.        , 0.        , 0.        , 1.        , 0.        ,
        0.        ]])

In [None]:
sim_tf

In [None]:
def get_similar_words( ):
    pass