**COSINE SIMILARITY**

In [2]:
##import pandas and numpy
import pandas as pd
import numpy as np

In [3]:
## function to import each news story text file and prepare it for analysis
def import_files(txt_file=None):
    """
    Open a text file saved in the current working directory, convert the text to
    lower case, remove all full stops, and split the text on whitespace so that
    each word is a list item. Returns the list of words.
    """
    with open(txt_file, mode='rt', encoding='UTF-8') as file1:
        # read the whole file so that every line contributes, not only the last one
        text = file1.read().lower()
    text = text.replace('.', '')
    return text.split()

In [4]:
## run the function above for each news story
aljazeera=import_files("aljazeera-khashoggi.txt")
bbc=import_files("bbc-khashoggi.txt")
breitbart=import_files("breitbart-khashoggi.txt")
cnn=import_files("cnn-khashoggi.txt")
fox=import_files("fox-khashoggi.txt")

In [5]:
## define function to convert list items to dictionary objects with counts for each list item
def create_dict(news_agency=None):
    '''
    This function takes in a list of string items and outputs a dictionary object, turning each item in the 
    inputted list to a key and the number of times each item appears in the list to its corresponding value.
    '''
    news_agency_story = dict()
    for word in news_agency:
        if word in news_agency_story:
            news_agency_story[word][0] += 1
        else:
            news_agency_story[word] = [1]
    return news_agency_story
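
The counting loop above can also be sketched with `collections.Counter` from the standard library. This is only an illustration on a made-up word list (`words` is hypothetical, not taken from the news stories). Each count is wrapped in a one-element list so the result has the same `{word: [count]}` shape that `create_dict` returns.

```python
from collections import Counter

# Hypothetical token list standing in for the output of import_files
words = ["saudi", "erdogan", "saudi", "murder", "saudi"]

# Counter tallies occurrences; wrap each count in a list to match create_dict
counts = {word: [n] for word, n in Counter(words).items()}
print(counts)
```

The list wrapper matters later: `pd.DataFrame` needs list-like values to build a one-row frame from a dictionary of scalars.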

In [6]:
##run function above for each list created earlier in the notebook to convert to a dictionary
aljazeera=create_dict(aljazeera)
bbc=create_dict(bbc)
breitbart=create_dict(breitbart)
cnn=create_dict(cnn)
fox=create_dict(fox)

In [7]:
## build a function to create a document term matrix from the news dictionaries created above; combine all into one DataFrame
def gen_DTM(texts=None):
    '''
    Generate a document term matrix from a list of dictionary objects. The output is a pandas
    DataFrame in which the column headers are the dictionary keys (words), each cell holds the
    corresponding value (i.e., how many times the word appears), and each row index identifies
    the dictionary the counts came from. Words that appear in one text but not in another
    are filled with 0.
    '''
    # DataFrame.append was removed in pandas 2.0, so row-bind with pd.concat instead
    frames = [pd.DataFrame(text) for text in texts]
    DTM = pd.concat(frames, ignore_index=True, sort=True)
    return DTM.fillna(0)
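
To make the row-binding step concrete, here is a small sketch on two made-up dictionaries (`doc_a` and `doc_b` are hypothetical, not the news stories). It shows how the columns become the union of all words and how a word missing from one document ends up as 0.

```python
import pandas as pd

# Hypothetical word-count dictionaries in the same {word: [count]} shape
doc_a = {"saudi": [2], "murder": [1]}
doc_b = {"saudi": [1], "consulate": [3]}

# One single-row frame per document, stacked row-wise;
# sort=True orders the columns alphabetically, fillna(0) fills missing words
toy_dtm = pd.concat(
    [pd.DataFrame(doc_a), pd.DataFrame(doc_b)],
    ignore_index=True,
    sort=True,
).fillna(0)
print(toy_dtm)
```

Each row of `toy_dtm` is then a count vector over the same vocabulary, which is what the cosine calculation requires.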

In [8]:
## run function and generate a document term matrix for each dictionary into single dataframe
DTM = gen_DTM([aljazeera,bbc, breitbart, cnn, fox])

In [9]:
#index the pandas dataframe to draw out a numpy array; i.e, turn into vector
a = DTM.iloc[0].values ##aljazeera
b = DTM.iloc[1].values ##bbc
c = DTM.iloc[2].values ##breitbart
d = DTM.iloc[3].values ##cnn
e = DTM.iloc[4].values ##fox

Calculate the cosine of the angle between two vectors using this formula:

$$ \cos{\theta} = \frac{\vec{a} \cdot \vec{b}}{\left\| \vec{a} \right\| \left\| \vec{b} \right\|} $$
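
The formula can be wrapped in a small helper so it does not have to be retyped for every pair. The sketch below is one possible way to do it; `cosine_similarity` is a hypothetical name, and the vectors `x`, `y`, and `z` are toy data rather than rows of the document term matrix.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two 1-D vectors (hypothetical helper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # dot product divided by the product of the two magnitudes
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy count vectors, not taken from the news stories
x = [1, 0, 2]
y = [2, 0, 4]   # points in the same direction as x
z = [0, 3, 0]   # shares no non-zero position with x

print(cosine_similarity(x, y))  # parallel vectors give a value very close to 1
print(cosine_similarity(x, z))  # vectors with no shared terms give 0
```

Note that the helper would divide by zero for an all-zero vector; that case does not arise for non-empty documents.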

In [10]:
##determine dot product of vector a and b
numerator=np.dot(a,b) 
##determine magnitude of vector a and b and multiply them with one another
denominator=np.sqrt(np.dot(a,a)) * np.sqrt(np.dot(b,b)) 

In [11]:
##derive the cosine of the angle between vector a and b; i.e.,similarity between aljazeera and bbc
cos_ab=numerator/denominator
print({'Aljazeera vs. BBC': cos_ab})

{'Aljazeera vs. BBC': 0.8622524641557016}


In [12]:
##determine dot product of vector a and c
numerator2=np.dot(a,c) 
##determine magnitude of vector a and c and multiply them with one another
denominator2=np.sqrt(np.dot(a,a)) * np.sqrt(np.dot(c,c)) 

In [13]:
##derive the cosine of the angle between vector a and c; i.e.,similarity between aljazeera and breitbart
cos_ac=numerator2/denominator2
print({'Aljazeera vs. Breitbart': cos_ac})

{'Aljazeera vs. Breitbart': 0.8248504911052962}


In [14]:
##determine dot product of vector a and d
numerator3=np.dot(a,d) 
##determine magnitude of vector a and d and multiply them with one another
denominator3=np.sqrt(np.dot(a,a)) * np.sqrt(np.dot(d,d)) 

In [15]:
##derive the cosine of the angle between vector a and d; i.e.,similarity between aljazeera and cnn
cos_ad=numerator3/denominator3
print({'Aljazeera vs. CNN': cos_ad})

{'Aljazeera vs. CNN': 0.7305176287247819}


In [16]:
##determine dot product of vector a and e
numerator4=np.dot(a,e) 
##determine magnitude of vector a and e and multiply them with one another
denominator4=np.sqrt(np.dot(a,a)) * np.sqrt(np.dot(e,e)) 

In [17]:
##derive the cosine of the angle between vector a and e; i.e.,similarity between aljazeera and fox
cos_ae=numerator4/denominator4
print({'Aljazeera vs. FOX': cos_ae})

{'Aljazeera vs. FOX': 0.8321887510410148}


In [18]:
##determine dot product of vector b and c
numerator5=np.dot(b,c) 
##determine magnitude of vector b and c and multiply them with one another
denominator5=np.sqrt(np.dot(b,b)) * np.sqrt(np.dot(c,c)) 

In [19]:
##derive the cosine of the angle between vector b and c; i.e.,similarity between bbc and breitbart
cos_bc=numerator5/denominator5
print({'BBC vs. Breitbart': cos_bc})

{'BBC vs. Breitbart': 0.8912414719811531}


In [20]:
##determine dot product of vector b and d
numerator6=np.dot(b,d) 
##determine magnitude of vector b and d and multiply them with one another
denominator6=np.sqrt(np.dot(b,b)) * np.sqrt(np.dot(d,d)) 

In [21]:
##derive the cosine of the angle between vector b and d; i.e.,similarity between bbc and cnn
cos_bd=numerator6/denominator6
print({'BBC vs. CNN': cos_bd})

{'BBC vs. CNN': 0.7454557179243538}


In [22]:
##determine dot product of vector b and e
numerator7=np.dot(b,e) 
##determine magnitude of vector b and e and multiply them with one another
denominator7=np.sqrt(np.dot(b,b)) * np.sqrt(np.dot(e,e))

In [23]:
##derive the cosine of the angle between vector b and e; i.e.,similarity between bbc and fox
cos_be=numerator7/denominator7
print({'BBC vs. FOX': cos_be})

{'BBC vs. FOX': 0.8868691475115703}


In [24]:
##determine dot product of vector c and d
numerator8=np.dot(c,d) 
##determine magnitude of vector c and d and multiply them with one another
denominator8=np.sqrt(np.dot(c,c)) * np.sqrt(np.dot(d,d))

In [25]:
##derive the cosine of the angle between vector c and d; i.e.,similarity between breitbart and cnn
cos_cd=numerator8/denominator8
print({'Breitbart vs. CNN': cos_cd})

{'Breitbart vs. CNN': 0.6835747867474448}


In [26]:
##determine dot product of vector c and e
numerator9=np.dot(c,e) 
##determine magnitude of vector c and e and multiply them with one another
denominator9=np.sqrt(np.dot(c,c)) * np.sqrt(np.dot(e,e))

In [27]:
##derive the cosine of the angle between vector c and e; i.e.,similarity between breitbart and fox
cos_ce=numerator9/denominator9
print({'Breitbart vs. FOX': cos_ce})

{'Breitbart vs. FOX': 0.8689899206657393}


In [28]:
##determine dot product of vector d and e
numerator10=np.dot(d,e) 
##determine magnitude of vector d and e and multiply them with one another
denominator10=np.sqrt(np.dot(d,d)) * np.sqrt(np.dot(e,e))

In [29]:
##derive the cosine of the angle between vector d and e; i.e.,similarity between cnn & fox
cos_de=numerator10/denominator10
print({'CNN vs. FOX': cos_de})

{'CNN vs. FOX': 0.7438663466782564}


Based on the analysis above, before removing any common stop words, the reporting of Turkish President Erdogan's remarks about the murder of journalist Jamal Khashoggi is most similar between BBC and Breitbart, because the cosine of the angle between their vectors (0.8912414719811531) is the closest to 1. Conversely, Breitbart and CNN are the most dissimilar in their reporting: the cosine of the angle between their vectors (0.6835747867474448) is the furthest from 1.
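
Picking out the most and least similar pair by eye can be replaced with a short loop over all pairs. The following is a minimal sketch on made-up vectors; the outlet labels `"A"`, `"B"`, `"C"`, the `vectors` dictionary, and the `cosine` helper are all assumptions for illustration.

```python
from itertools import combinations
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical count vectors keyed by a document label (toy data)
vectors = {
    "A": np.array([3, 1, 0, 2]),
    "B": np.array([2, 1, 0, 2]),
    "C": np.array([0, 0, 4, 1]),
}

# Score every unordered pair of documents exactly once
scores = {
    (m, n): cosine(vectors[m], vectors[n])
    for m, n in combinations(vectors, 2)
}

most_similar = max(scores, key=scores.get)
least_similar = min(scores, key=scores.get)
print(most_similar, least_similar)
```

With the notebook's data, `vectors` would instead map each outlet name to the corresponding row of the document term matrix, for example `DTM.iloc[0].values` for Aljazeera.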

Next, the steps above are repeated after removing a list of common stop words from each news story, along with some punctuation marks and special characters commonly used in English.

In [30]:
##import stop_words csv file; convert to list
stop_words=pd.read_csv("stop_words.csv")
stop_words=stop_words['word'].to_list()

In [31]:
## function to import each news story text file and prepare it for analysis
def import_files2(txt_file=None):
    """
    Open a text file saved in the current working directory, convert the text to lower case,
    remove full stops and other common punctuation/special characters (e.g., ?, ", $),
    and split the text on whitespace so that each word is a separate list item.
    Words found in the stop_words list are then dropped.
    Returns the list of remaining words.
    """
    with open(txt_file, mode='rt', encoding='UTF-8') as file1:
        # read the whole file so that every line contributes, not only the last one
        text = file1.read().lower()
    # strip punctuation and special characters
    for char in ['.', '-', '?', '(', ')', ',', '"', '$', '“', '!', "'"]:
        text = text.replace(char, '')
    return [word for word in text.split() if word not in stop_words]
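
The chain of character removals can also be expressed with `str.translate`, which deletes a whole set of characters in one pass. The sketch below is an alternative under stated assumptions: `sample` and `toy_stop_words` are made up for illustration, and the character set simply mirrors the one used in `import_files2`.

```python
# Characters to delete, mirroring the set removed in import_files2
PUNCTUATION = '.-?(),"$“!\''

# With two empty strings, the third argument of str.maketrans lists characters to delete
table = str.maketrans("", "", PUNCTUATION)

# Hypothetical sentence and stop-word set for illustration
sample = 'The "murder" was pre-planned, Erdogan said.'
toy_stop_words = {"the", "was"}

tokens = [
    word
    for word in sample.lower().translate(table).split()
    if word not in toy_stop_words
]
print(tokens)
```

Using a set for the stop words makes each membership test constant time, which helps when the stop-word list is long.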

In [32]:
aljazeera2=import_files2("aljazeera-khashoggi.txt")
bbc2=import_files2("bbc-khashoggi.txt")
breitbart2=import_files2("breitbart-khashoggi.txt")
cnn2=import_files2("cnn-khashoggi.txt")
fox2=import_files2("fox-khashoggi.txt")

In [34]:
##run function above for each list created earlier in the notebook to convert to a dictionary
aljazeera2=create_dict(aljazeera2)
bbc2=create_dict(bbc2)
breitbart2=create_dict(breitbart2)
cnn2=create_dict(cnn2)
fox2=create_dict(fox2)

In [36]:
## run function above; generate document term matrix
DTM2 = gen_DTM([aljazeera2,bbc2, breitbart2, cnn2, fox2]) 

In [37]:
#index the pandas dataframe to draw out a numpy array ; i.e., convert to vectors
a = DTM2.iloc[0].values ##aljazeera
b = DTM2.iloc[1].values ##bbc
c = DTM2.iloc[2].values ##breitbart
d = DTM2.iloc[3].values ##cnn
e = DTM2.iloc[4].values ##fox

In [38]:
##determine dot product of vector a and b
numerator=np.dot(a,b) 
##determine magnitude of vector a and b and multiply them with one another
denominator=np.sqrt(np.dot(a,a)) * np.sqrt(np.dot(b,b)) 


In [39]:
##derive the cosine of the angle between vector a and b; i.e.,similarity between aljazeera and bbc
cos_ab=numerator/denominator
print({'Aljazeera vs. BBC': cos_ab})


{'Aljazeera vs. BBC': 0.6756884307843023}


In [40]:
##determine dot product of vector a and c
numerator2=np.dot(a,c) 
##determine magnitude of vector a and c and multiply them with one another
denominator2=np.sqrt(np.dot(a,a)) * np.sqrt(np.dot(c,c)) 

In [41]:
##derive the cosine of the angle between vector a and c; i.e.,similarity between aljazeera and breitbart
cos_ac=numerator2/denominator2
print({'Aljazeera vs. Breitbart': cos_ac})

{'Aljazeera vs. Breitbart': 0.5653129979471417}


In [42]:
##determine dot product of vector a and d
numerator3=np.dot(a,d) 
##determine magnitude of vector a and d and multiply them with one another
denominator3=np.sqrt(np.dot(a,a)) * np.sqrt(np.dot(d,d)) 

In [43]:
##derive the cosine of the angle between vector a and d; i.e.,similarity between aljazeera and cnn
cos_ad=numerator3/denominator3
print({'Aljazeera vs. CNN': cos_ad})

{'Aljazeera vs. CNN': 0.5328558641198378}


In [44]:
##determine dot product of vector a and e
numerator4=np.dot(a,e) 
##determine magnitude of vector a and e and multiply them with one another
denominator4=np.sqrt(np.dot(a,a)) * np.sqrt(np.dot(e,e)) 

In [45]:
##derive the cosine of the angle between vector a and e; i.e.,similarity between aljazeera and fox
cos_ae=numerator4/denominator4
print({'Aljazeera vs. FOX': cos_ae})

{'Aljazeera vs. FOX': 0.6706485731741532}


In [46]:
##determine dot product of vector b and c
numerator5=np.dot(b,c) 
##determine magnitude of vector b and c and multiply them with one another
denominator5=np.sqrt(np.dot(b,b)) * np.sqrt(np.dot(c,c)) 

In [47]:
##derive the cosine of the angle between vector b and c; i.e.,similarity between bbc and breitbart
cos_bc=numerator5/denominator5
print({'BBC vs. Breitbart': cos_bc})

{'BBC vs. Breitbart': 0.5740920920891295}


In [48]:
##determine dot product of vector b and d
numerator6=np.dot(b,d) 
##determine magnitude of vector b and d and multiply them with one another
denominator6=np.sqrt(np.dot(b,b)) * np.sqrt(np.dot(d,d)) 

In [49]:
##derive the cosine of the angle between vector b and d; i.e.,similarity between bbc and cnn
cos_bd=numerator6/denominator6
print({'BBC vs. CNN': cos_bd})

{'BBC vs. CNN': 0.5039192189493414}


In [50]:
##determine dot product of vector b and e
numerator7=np.dot(b,e) 
##determine magnitude of vector b and e and multiply them with one another
denominator7=np.sqrt(np.dot(b,b)) * np.sqrt(np.dot(e,e))

In [51]:
##derive the cosine of the angle between vector b and e; i.e.,similarity between bbc and fox
cos_be=numerator7/denominator7
print({'BBC vs. FOX': cos_be})

{'BBC vs. FOX': 0.6246247352665979}


In [52]:
##determine dot product of vector c and d
numerator8=np.dot(c,d) 
##determine magnitude of vector c and d and multiply them with one another
denominator8=np.sqrt(np.dot(c,c)) * np.sqrt(np.dot(d,d))

In [53]:
##derive the cosine of the angle between vector c and d; i.e.,similarity between breitbart and cnn
cos_cd=numerator8/denominator8
print({'Breitbart vs. CNN': cos_cd})

{'Breitbart vs. CNN': 0.3579275305758634}


In [54]:
##determine dot product of vector c and e
numerator9=np.dot(c,e) 
##determine magnitude of vector c and e and multiply them with one another
denominator9=np.sqrt(np.dot(c,c)) * np.sqrt(np.dot(e,e))

In [55]:
##derive the cosine of the angle between vector c and e; i.e.,similarity between breitbart and fox
cos_ce=numerator9/denominator9
print({'Breitbart vs. FOX': cos_ce})

{'Breitbart vs. FOX': 0.533550867184022}


In [56]:
##determine dot product of vector d and e
numerator10=np.dot(d,e) 
##determine magnitude of vector d and e and multiply them with one another
denominator10=np.sqrt(np.dot(d,d)) * np.sqrt(np.dot(e,e))

In [57]:
##derive the cosine of the angle between vector d and e; i.e.,similarity between cnn & fox
cos_de=numerator10/denominator10
print({'CNN vs. FOX': cos_de})

{'CNN vs. FOX': 0.5219135267313623}


Based on the analysis above, after removing common stop words and punctuation marks from each news story, the picture of similarity changes: the reporting of Turkish President Erdogan's remarks about the murder of journalist Jamal Khashoggi is now most similar between Aljazeera and BBC, because the cosine of the angle between their vectors (0.6756884307843023) is the closest to 1. This differs from the previous result, which found the most similarity between BBC and Breitbart. Breitbart and CNN remain the most dissimilar in their reporting: the cosine of the angle between their vectors (0.3579275305758634) is the furthest from 1, and it is now much closer to 0 than before (0.6835747867474448), so their dissimilarity has increased.

Next, the word counts for Aljazeera and BBC are compared, the most frequent word in both stories is removed, and the cosine similarity is re-assessed.

In [58]:
##re-arrange dictionary in descending order
aljazeera2=dict(sorted(aljazeera2.items(), key=lambda item: item[1],reverse=True))
aljazeera2

{'saudi': [14],
 'erdogan': [12],
 'turkish': [7],
 'murder': [6],
 'consulate': [6],
 'istanbul': [5],
 'officials': [5],
 'killing': [5],
 'ankara': [5],
 'turkey': [4],
 'president': [4],
 'khashoggi': [4],
 'speech': [4],
 'arabia': [4],
 'salman': [4],
 'adding': [4],
 'body': [4],
 'dalay': [4],
 'khashoggis': [3],
 'bin': [3],
 'local': [3],
 'planned': [2],
 'days': [2],
 'party': [2],
 'savage': [2],
 'global': [2],
 'arabian': [2],
 'sort': [2],
 'added': [2],
 'crown': [2],
 'prince': [2],
 'october': [2],
 '2': [2],
 'unnamed': [2],
 'information': [2],
 'inside': [2],
 '18': [2],
 'investigators': [2],
 'investigation': [2],
 'questions': [2],
 'erdogans': [2],
 'confirmed': [2],
 'told': [2],
 'al': [2],
 'jazeera': [2],
 'happened': [2],
 'person': [2],
 'taking': [2],
 'steps': [2],
 'mbs': [2],
 'saudis': [2],
 'recep': [1],
 'tayyip': [1],
 'journalist': [1],
 'jamal': [1],
 'kingdoms': [1],
 'advance': [1],
 'addressing': [1],
 'legislators': [1],
 'justice': [1],
 ...

In [59]:
##re-arrange dictionary in descending order
bbc2=dict(sorted(bbc2.items(), key=lambda item: item[1],reverse=True))
bbc2

{'saudi': [22],
 'khashoggi': [12],
 'erdogan': [9],
 'media': [9],
 'president': [8],
 'arabia': [8],
 'killing': [7],
 'crown': [7],
 'murder': [6],
 'consulate': [6],
 'prince': [6],
 'jamal': [5],
 'turkish': [5],
 'bin': [5],
 'caption': [5],
 'told': [4],
 'istanbul': [4],
 'khashoggis': [4],
 'body': [4],
 'conference': [4],
 'salman': [4],
 'tuesday': [4],
 'left': [4],
 'saudis': [4],
 'king': [4],
 'speech': [4],
 'image': [4],
 'news': [4],
 'killed': [3],
 'demanded': [3],
 'operation': [3],
 'investment': [3],
 'mohammed': [3],
 'playback': [3],
 'unsupported': [3],
 'device': [3],
 'family': [3],
 'riyadh': [3],
 'foreign': [3],
 'agency': [3],
 'recep': [2],
 'tayyip': [2],
 'days': [2],
 'mps': [2],
 'ruling': [2],
 'party': [2],
 'turkey': [2],
 'called': [2],
 'kingdom': [2],
 'conflicting': [2],
 'accounts': [2],
 'happened': [2],
 'alive': [2],
 'rogue': [2],
 'erdogans': [2],
 'leaders': [2],
 'appeared': [2],
 'day': [2],
 'removed': [2],
 '18': [2],
 'set': [2],


The word 'saudi' appears most often in both the Aljazeera news story (14 times) and the BBC news story (22 times). Below is the cosine of the angle between the two vectors after removing the word 'saudi' from both dictionaries.

In [60]:
## remove key 'saudi' from both dictionaries
aljazeera2.pop("saudi")
bbc2.pop("saudi")

[22]

In [62]:
##run function above; generate document term matrix
DTM3 = gen_DTM([aljazeera2,bbc2]) 

In [63]:
#index the pandas dataframe to draw out a numpy array; i.e., convert into vectors
a = DTM3.iloc[0].values ##aljazeera
b = DTM3.iloc[1].values ##bbc

In [64]:
##determine dot product of vector a and b
numerator=np.dot(a,b) 
##determine magnitude of vector a and b and multiply them with one another
denominator=np.sqrt(np.dot(a,a)) * np.sqrt(np.dot(b,b)) 

In [65]:
##derive the cosine of the angle between vector a and b; i.e.,similarity between aljazeera and bbc
cos_ab=numerator/denominator
print({'Aljazeera vs. BBC': cos_ab})

{'Aljazeera vs. BBC': 0.585514809687971}


Based on the analysis above, after removing the word 'saudi' from both the Aljazeera and BBC dictionaries, the similarity between the two is now 0.585514809687971, as opposed to 0.6756884307843023 in the previous analysis. Removing 'saudi', the most frequent word in both accounts of Turkish President Erdogan's remarks about the murder of journalist Jamal Khashoggi, lowered the cosine similarity by about 0.09: the two stories are now less similar than they were before.
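
The drop is expected: a word with a high count in both documents contributes a large term to the dot product, so it pulls the two vectors toward the same direction. The toy example below illustrates the effect; `doc1`, `doc2`, and the `cosine` helper are hypothetical and only loosely modelled on the counts above.

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# First column: a word that is frequent in both documents; the rest: other words
doc1 = np.array([14, 3, 0, 2, 1])
doc2 = np.array([22, 0, 4, 1, 2])

with_term = cosine(doc1, doc2)
without_term = cosine(doc1[1:], doc2[1:])  # drop the shared high-count column

print(with_term, without_term)
```

In this sketch the similarity falls sharply once the dominant shared word is removed, because the remaining counts overlap much less.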