
# Corpus Analysis with Cosine Similarity

Author: Lucas van der Deijl, University of Amsterdam <br/>
Version: 9 December 2020 <br/>
Contact: l.a.vanderdeijl@uva.nl, www.lucasvanderdeijl.nl <br/>
Project: 'Radical Rumours' (Funded by NWO 2017-2021) <br/>

## Aim of this program

The program offers a basic method to analyse textual similarity between document pairs from either one corpus or two different corpora. Textual similarity is formalised as [cosine similarity](https://www.sciencedirect.com/topics/computer-science/cosine-similarity) based on [tf-idf values](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) for every word type in the corpus. The results are visualised as a heatmap and can be stored as a csv-file.
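
For reference, the standard definition of cosine similarity between two documents, written here as tf-idf vectors $\mathbf{a}$ and $\mathbf{b}$ with one component per word type (the symbols are chosen for this note and do not appear in the code), is:

$$
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert} = \frac{\sum_{i} a_i b_i}{\sqrt{\sum_{i} a_i^2} \, \sqrt{\sum_{i} b_i^2}}
$$

Because tf-idf values are never negative, the score ranges from 0 (the two documents share no weighted word types) to 1 (the two vectors point in the same direction).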

Use this Jupyter notebook to reuse the code or to replicate the analysis with different corpora. Run each code cell individually or use the 'Run all'-option from the Cell-tab.

## Pipeline
The pipeline designed to achieve the program's aim performs the following steps:

+ import the required modules
+ install missing libraries (if needed)
+ define functions for preprocessing and parsing
+ define the filepath to the location of your corpus
+ load and preprocess your corpus
+ create a term-document matrix with tf-idf values
+ create a document-document matrix with cosine similarity values
+ visualise the document matrix as a heatmap
+ store the output

### Import the required libraries

First, the required libraries and resources need to be imported.

In [None]:
import os

import nltk
from nltk.tokenize import word_tokenize
from string import punctuation

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer # We're going to use the Python package scikit-learn to transform texts into vectors of TF-IDF values
from scipy import spatial

import seaborn as sns
import matplotlib.pyplot as plt

### Install missing libraries

If the previous step raised an error because a required module is not installed, uncomment the relevant install command below (remove the '#') and run the cell. Once the module is installed, run the import cell above again before moving on to the next step.

In [None]:
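# !pip install nltk
# !pip install scipy
# !pip install scikit-learn seaborn
# nltk.download('punkt') # tokenizer data used by word_tokenize; newer NLTK releases may ask for 'punkt_tab' instead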

### Define functions for preprocessing and parsing

In [None]:
def preprocess(doc):
    # Load the stopword list; the with-statement closes the file afterwards
    with open("Resources/stopwoorden.txt", 'rt', encoding='utf-8') as stopword_file:
        stopwords = set(stopword_file.read().split())
    tokens = word_tokenize(doc)
    lowercase_tokens = [token.lower() for token in tokens]
    # Drop stopwords and punctuation tokens, then rejoin the rest into one string
    kept_tokens = [token for token in lowercase_tokens if token not in stopwords and token not in punctuation]
    return " ".join(kept_tokens)

def parse_corpus(corpus_location):
    corpus = []
    titles = []
    # Sort the filenames so that the document order is reproducible
    for filename in sorted(os.listdir(corpus_location)):
        # The title is the third '_'-separated field of the filename
        title = filename.split("_")[2]
        titles.append(title)
        with open(os.path.join(corpus_location, filename), 'rt', encoding='utf-8') as file:
            preprocessed_text = preprocess(file.read())
        corpus.append(preprocessed_text)
    return titles, corpus
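
`parse_corpus` takes each title from the third underscore-separated field of the filename, so files that do not follow this pattern raise an `IndexError`. The sketch below illustrates that convention with a more defensive helper. The function name `extract_title` and the example filename are hypothetical and are not taken from the sample corpus. Unlike the notebook code, the helper strips the file extension first.

```python
import os

def extract_title(filename):
    """Return the third '_'-separated field of a filename, without its extension."""
    stem, _extension = os.path.splitext(filename)
    fields = stem.split("_")
    if len(fields) < 3:
        # Fail with a readable message instead of a bare IndexError
        raise ValueError(f"Expected at least three '_'-separated fields: {filename!r}")
    return fields[2]

# Hypothetical filename, invented for illustration
print(extract_title("1677_Spinoza_Ethica_lemmatised.txt"))
```

If your own corpus uses a different naming scheme, adjust the field index or the separator accordingly.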

### Define the filepath to the location of your corpus

Sample data is included in the form of 17 seventeenth-century Dutch translations of books by Descartes and Spinoza. These texts have been prepared for analysis through partial spelling normalisation and lemmatisation.

In [None]:
# Define filepaths to corpora
path_to_corpusfolder = "Corpus/"
path_to_corpus_Descartes = path_to_corpusfolder + "Descartes/"
path_to_corpus_Spinoza = path_to_corpusfolder + "Spinoza/"
#os.listdir(path_to_corpusfolder)

### Load and preprocess your corpus

In [None]:
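# Both corpora point to the Spinoza folder here, so Spinoza is compared with itself.
# To compare two different corpora, set one of them to path_to_corpus_Descartes.
source_corpus = parse_corpus(path_to_corpus_Spinoza)
target_corpus = parse_corpus(path_to_corpus_Spinoza)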

document_titles = source_corpus[0] + target_corpus[0]
total_corpus = source_corpus[1] + target_corpus[1]

### Create a term-document matrix with tfidf values

In [None]:
vect = TfidfVectorizer(min_df=1) # set parameters for vectorization; min_df=1 (the default) keeps every term, and recent scikit-learn versions reject the integer 0
term_doc_matrix = vect.fit_transform(total_corpus)
term_doc_matrix_array = term_doc_matrix.toarray()
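
To make the weighting more tangible, here is a minimal plain-Python sketch of a tf-idf computation. It is not scikit-learn's implementation. It is intended to mirror the documented defaults of `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`), but it tokenises by simple whitespace splitting, which is a simplification. The function name `tfidf_vectors` and the toy documents are invented for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return (vocabulary, vectors) using smoothed idf and L2-normalised rows."""
    tokenised = [doc.split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenised for token in tokens})
    n_docs = len(tokenised)
    # Document frequency: the number of documents in which each term occurs
    doc_freq = Counter(token for tokens in tokenised for token in set(tokens))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary}
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = [counts[term] * idf[term] for term in vocabulary]
        # Scale each document vector to unit length (guard against empty documents)
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        vectors.append([w / norm for w in weights])
    return vocabulary, vectors

# Invented toy documents, already preprocessed into space-separated tokens
toy_corpus = ["god natuur substantie", "god ziel lichaam", "ziel lichaam lichaam"]
vocabulary, vectors = tfidf_vectors(toy_corpus)
for term, weight in zip(vocabulary, vectors[0]):
    print(f"{term}: {weight:.3f}")
```

Terms that occur in fewer documents receive a higher weight, which is why rare vocabulary contributes more to the similarity scores than common words.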

### Create a document-document matrix with cosine similarity values

In [None]:
similarity_matrix = {}

for row_counter, tfidf_array_source in enumerate(term_doc_matrix_array[:len(source_corpus[1])]):
    source_index = row_counter
    source_title = document_titles[source_index]
    similarity_matrix[source_title] = {}
    for column_counter, tfidf_array_target in enumerate(term_doc_matrix_array[len(source_corpus[1]):]): 
        target_index = len(source_corpus[1]) + column_counter
        target_title = document_titles[target_index]
        cos_similarity = 1 - spatial.distance.cosine(tfidf_array_source, tfidf_array_target) # subtracted from 1 to turn the distance into a similarity metric
        similarity_matrix[source_title][target_title] = cos_similarity

heatmapdata = []
for source_key in similarity_matrix:
    values_list = list(similarity_matrix[source_key].values())
    heatmapdata.append(values_list)
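
The source/target split used above can be sketched without SciPy as follows. This is an illustrative plain-Python version under stated assumptions, not a replacement for the notebook code. The names `cosine_similarity` and `pairwise_similarity` and the toy vectors are hypothetical, and returning 0.0 for an all-zero vector is a convention chosen for this sketch.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)

def pairwise_similarity(titles, vectors, n_source):
    """Compare the first n_source documents with all remaining documents."""
    source = list(zip(titles[:n_source], vectors[:n_source]))
    target = list(zip(titles[n_source:], vectors[n_source:]))
    return {
        source_title: {
            target_title: cosine_similarity(source_vec, target_vec)
            for target_title, target_vec in target
        }
        for source_title, source_vec in source
    }

# Invented toy data: two source documents (A, B) and one target document (C)
toy_titles = ["A", "B", "C"]
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = pairwise_similarity(toy_titles, toy_vectors, n_source=2)
print(result)
```

As in the notebook, the outer keys are source titles and the inner keys are target titles, so each inner dictionary corresponds to one row of the heatmap.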

### Visualise the document matrix as a heatmap

In [None]:
%matplotlib inline
plt.rcParams['figure.dpi'] = 300
plt.rcParams["font.family"] = "Garamond"

source_titles = source_corpus[0]
target_titles = target_corpus[0] 

data = heatmapdata

mask = np.zeros_like(data)
mask[np.triu_indices_from(mask)] = False # Switch to 'True' if source corpus = target corpus, to exclude redundant data

ax = sns.heatmap(data, 
                 mask=mask, 
                 cmap="YlGnBu", 
                 annot=True, 
                 xticklabels=target_titles, 
                 yticklabels=source_titles, 
                 cbar=True, 
                 vmin=0, 
                 vmax=1)

#plt.savefig("Output/IMAGE_NAME.png", bbox_inches='tight', dpi=300) # Uncomment to save the file

### Store the output

In [None]:
# Write the matrix as a semicolon-separated file: one row per source document, one column per target document
with open("Output/output.csv", 'w', encoding="UTF-8") as outputfile:
    heading = ";" + ";".join(target_corpus[0]) + "\n" # the columns hold the target titles
    outputfile.write(heading)
    for counter, row in enumerate(heatmapdata):
        values = ";".join(str(value) for value in row)
        outputfile.write(str(source_corpus[0][counter]) + ";" + values + "\n")
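
As an alternative to assembling the rows by hand, Python's built-in `csv` module can write the same semicolon-separated layout and quotes titles that contain a semicolon. The sketch below is a minimal example under stated assumptions: the function name `write_similarity_csv` and the toy values are hypothetical, and it writes to an in-memory buffer rather than to `Output/output.csv`.

```python
import csv
import io

def write_similarity_csv(handle, row_titles, column_titles, rows):
    """Write a labelled matrix with ';' as the field delimiter."""
    writer = csv.writer(handle, delimiter=";", lineterminator="\n")
    # Header row: an empty corner cell followed by the column titles
    writer.writerow([""] + list(column_titles))
    for title, row in zip(row_titles, rows):
        writer.writerow([title] + list(row))

# Invented toy matrix: two source documents (rows) and two target documents (columns)
buffer = io.StringIO()
write_similarity_csv(buffer, ["A", "B"], ["C", "D"], [[0.5, 0.25], [0.1, 1.0]])
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open(path, 'w', encoding='utf-8', newline='')` as the `handle` argument.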