In [13]:
# Cosine similarity is a metric that measures how similar two documents are, irrespective of their size.
# Mathematically, it is the cosine of the angle between two vectors in a multi-dimensional space.
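
# A minimal sketch of the definition, cos(theta) = (u . v) / (|u| * |v|), in plain Python.
# The helper name `cosine` and the three small count vectors are made up for illustration;
# they are not the documents used later in this notebook.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

short_doc = [1, 2, 0]  # hypothetical word counts
long_doc = [2, 4, 0]   # same proportions, twice as many words
other_doc = [0, 0, 3]  # shares no words with the other two

same_direction = cosine(short_doc, long_doc)
no_overlap = cosine(short_doc, other_doc)
print(same_direction, no_overlap)
```

# Doubling every count leaves the direction of the vector unchanged, so the first pair scores
# (up to floating-point rounding) 1. This is what "irrespective of their size" means here.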


In [14]:
# To test this, I will create three documents, each taken from the first paragraph of a celebrity's Wikipedia article:
# Kobe Bryant, LeBron James, and Barack Obama.
# Bryant and James should be clearly similar, since both are professional basketball players,
# but they also share some similarities with Obama, who, like them, is African American.

In [15]:
# creating the documents
doc_kobe = "Kobe Bean Bryant (/ˈkoʊbiː/ KOH-bee; August 23, 1978 – January 26, 2020) was an American professional basketball player. A shooting guard, he spent his entire 20-year career with the Los Angeles Lakers in the National Basketball Association (NBA). Widely regarded as one of the greatest basketball players of all time,[3][4][5][6][7] Bryant won five NBA championships, was an 18-time All-Star, a 15-time member of the All-NBA Team, a 12-time member of the All-Defensive Team, the 2008 NBA Most Valuable Player (MVP), and a two-time NBA Finals MVP. Bryant also led the NBA in scoring twice, and ranks fourth in league all-time regular season and postseason scoring. He was posthumously voted into the Naismith Memorial Basketball Hall of Fame in 2020 and named to the NBA 75th Anniversary Team in 2021."
doc_lebron = "LeBron Raymone James Sr. (/ləˈbrɒn/; born December 30, 1984) is an American professional basketball player for the Los Angeles Lakers of the National Basketball Association (NBA). Nicknamed 'King James', he is widely considered one of the greatest players of all time and is often compared to Michael Jordan in debates over the greatest basketball player ever.[a] James has won four NBA championships, four NBA MVP awards, four NBA Finals MVP awards, three All-Star MVP awards, and two Olympic gold medals. James has scored the most points in the playoffs, the second most career points, and has the seventh most career assists. He has been selected an NBA All-Star 18 times, to the All-NBA Team a record 18 times,[b] and to the NBA All-Defensive First Team five times.[3] He has competed in ten NBA Finals, the third most all time, including eight consecutively between 2011 and 2018.[4] In 2021, James was selected to the NBA 75th Anniversary Team,[5] and in 2022 became the first player in NBA history to accumulate 10,000 or more career points, rebounds, and assists.[6]"
doc_obama = "Barack Hussein Obama II (/bəˈrɑːk huːˈseɪn oʊˈbɑːmə/ (listen) bə-RAHK hoo-SAYN oh-BAH-mə;[1] born August 4, 1961) is an American politician who served as the 44th president of the United States from 2009 to 2017. A member of the Democratic Party, he was the first African-American president of the United States.[2] Obama previously served as a U.S. senator from Illinois from 2005 to 2008 and as an Illinois state senator from 1997 to 2004."

# create array/list of documents
documents = [doc_kobe, doc_lebron, doc_obama]

In [16]:
# To compute the cosine similarity, we first need the count of each word in each document.

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# We want to create a word count matrix

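# Passing stop_words='english' would drop common words like 'an' and 'the' so that they don't skew the data:
# count_vector = CountVectorizer(stop_words='english')
# The default vectorizer is used here, so stop words such as 'was' and 'with' are kept in the matrix.
count_vector = CountVectorizer()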

sparse_matrix = count_vector.fit_transform(documents)

# Convert the sparse matrix to a dense array so it can be displayed as a DataFrame.
doc_term_matrix = sparse_matrix.toarray()
df = pd.DataFrame(doc_term_matrix, 
                  columns=count_vector.get_feature_names_out(), 
                  index=['doc_kobe', 'doc_lebron', 'doc_obama'])

df



            000  10  12  15  18  1961  1978  1984  1997  20  ...  united  valuable  voted  was  who  widely  with  won  year  ˈkoʊbiː
doc_kobe      0   0   1   1   1     0     1     0     0   1  ...       0         1      1    3    0       1     1    1     1        1
doc_lebron    1   1   0   0   2     0     0     1     0   0  ...       0         0      0    1    0       1     0    1     0        0
doc_obama     0   0   0   0   0     1     0     0     1   0  ...       2         0      0    1    1       0     0    0     0        0
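
# A rough sketch of how a word count matrix like the one above is built, using only the standard library.
# It lowercases the text and keeps tokens of two or more word characters, which matches
# CountVectorizer's default token pattern; the rest of scikit-learn's behaviour is not reproduced.
# The names `word_counts`, `docs`, `vocab` and `matrix` are illustrative, and the two sentences
# are shortened snippets of the documents above.

```python
import re
from collections import Counter

# Same regular expression as CountVectorizer's default token_pattern.
TOKEN = re.compile(r"\b\w\w+\b")

def word_counts(text):
    """Lowercase the text and count each token of two or more characters."""
    return Counter(TOKEN.findall(text.lower()))

docs = {
    "a": "Bryant won five NBA championships.",
    "b": "James has won four NBA championships and four NBA MVP awards.",
}

counts = {name: word_counts(text) for name, text in docs.items()}

# The vocabulary is every distinct token across all documents, sorted alphabetically.
vocab = sorted(set().union(*counts.values()))

# One row per document, one column per vocabulary word (a missing word counts as 0).
matrix = {name: [c[word] for word in vocab] for name, c in counts.items()}

print(vocab)
for name, row in matrix.items():
    print(name, row)
```

# The same tokenization explains the '000' and '10' columns above: "10,000" is split at the comma.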


In [17]:
from sklearn.metrics.pairwise import cosine_similarity
print(cosine_similarity(df, df))


[[1.         0.73977588 0.36062951]
 [0.73977588 1.         0.35881405]
 [0.36062951 0.35881405 1.        ]]


In [None]:
# The closer the value is to 1, the more similar the documents are. The Kobe and LeBron documents have a cosine similarity of about 0.74,
# compared with about 0.36 for both Kobe and Obama and for LeBron and Obama.
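
# One way to make the matrix easier to read is to list each pair of documents with its score.
# This is only a sketch: the numbers are copied from the printed output above rather than recomputed,
# and the names `names`, `sim` and `pairs` are illustrative.

```python
names = ["doc_kobe", "doc_lebron", "doc_obama"]

# Values copied from the cosine_similarity output above.
sim = [
    [1.0, 0.73977588, 0.36062951],
    [0.73977588, 1.0, 0.35881405],
    [0.36062951, 0.35881405, 1.0],
]

# The matrix is symmetric with 1.0 on the diagonal, so only the upper triangle is needed.
pairs = sorted(
    ((sim[i][j], names[i], names[j])
     for i in range(len(names))
     for j in range(i + 1, len(names))),
    reverse=True,
)

for score, first, second in pairs:
    print(f"{first} vs {second}: {score:.2f}")
```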