### Text Document Similarity
Create a Python program that computes the text document similarity between different documents. Your implementation will take a list of documents as an input text corpus and compute a dictionary of words for the given corpus. Later, when a new document (i.e., a search document) is provided, your implementation should return a list of documents that are similar to the given search document, in descending order of their similarity to the search document.
To compute the similarity between any two documents in our question, you can use the following measures (optionally, you can use any other measure instead).
1. dot product between the two vectors
2. distance norm (or Euclidean distance) between two vectors, e.g., || u − v ||
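
As a minimal sketch of the two measures, assuming NumPy and two made-up binary vectors `u` and `v` (both names are hypothetical and not part of the assignment):

```python
import numpy as np

# Hypothetical binary word vectors over a five-word dictionary.
u = np.array([1, 1, 0, 1, 0])
v = np.array([1, 0, 0, 1, 1])

# Dot product: on binary vectors this counts the words present in both documents.
dot = int(np.dot(u, v))

# Euclidean distance ||u - v||: on binary vectors this is the square root of
# the number of words present in exactly one of the two documents.
dist = float(np.linalg.norm(u - v))

print(dot)   # 2 shared words
print(dist)  # sqrt(2), about 1.414
```

A larger dot product means more similar documents, whereas a smaller distance means more similar documents, so the two measures have to be sorted in opposite directions.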

As part of answering the question, you can also compare and comment on which of the two methods (or any other measure if you have used some other measure) will perform better and what are the reasons for it.
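
One property worth checking in such a comparison: on binary vectors the dot product only counts shared words, so a long document that happens to contain every search word scores as high as a short document that matches the search document exactly. The sketch below illustrates this with hypothetical vectors (`search`, `short_doc` and `long_doc` are made up for illustration) and also prints the Euclidean distance and the cosine similarity, which both separate the two cases.

```python
import numpy as np

# Hypothetical binary vectors over an eight-word dictionary.
search = np.array([1, 1, 1, 0, 0, 0, 0, 0])
short_doc = np.array([1, 1, 1, 0, 0, 0, 0, 0])  # same words as the search document
long_doc = np.array([1, 1, 1, 1, 1, 1, 1, 1])   # contains every dictionary word


def cosine(a, b):
    # Dot product divided by the product of the vector lengths.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


for name, doc in [("short_doc", short_doc), ("long_doc", long_doc)]:
    dot = int(np.dot(doc, search))
    dist = float(np.linalg.norm(doc - search))
    print(name, dot, round(dist, 3), round(cosine(doc, search), 3))
```

Both documents get a dot product of 3, while the distance and the cosine similarity rank `short_doc` as the closer match.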

Hint: A text document can be represented as a word vector against a given dictionary of words. So first, compute the dictionary of words for a given text corpus, containing the unique words from the documents of that corpus. Then transform every text document of the corpus into vector form, i.e., create a word vector where 0 indicates that the word is not in the document and 1 indicates that the word is present in the document. In our question, a text document is just represented as a string, so the text corpus is nothing but a list of strings.
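
The steps in the hint can be sketched as follows. This is a minimal illustration under assumptions, not the required solution: the two-document `corpus` and the helper names `tokenize` and `to_binary_vector` are hypothetical, and the dictionary is kept as a sorted list so that vector positions are reproducible.

```python
import re
import numpy as np

# Hypothetical corpus: each document is a plain string.
corpus = ["The cat sat.", "The dog sat down."]


def tokenize(text):
    # Lower-case the text and extract runs of word characters.
    return re.findall(r"\w+", text.lower())


# Dictionary: the unique words of the corpus, sorted for a stable order.
dictionary = sorted({word for doc in corpus for word in tokenize(doc)})


def to_binary_vector(text):
    # 1 if the dictionary word occurs in the text, 0 otherwise.
    words = set(tokenize(text))
    return np.array([1 if word in words else 0 for word in dictionary])


vectors = [to_binary_vector(doc) for doc in corpus]
print(dictionary)  # ['cat', 'dog', 'down', 'sat', 'the']
print(vectors[0])  # [1 0 0 1 1]
print(vectors[1])  # [0 1 1 1 1]
```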

In [269]:
import re
import string
import numpy as np
import pandas as pd
from pathlib import Path


class DocumentSimilarityAnalyzer:
    def __init__(self, directory):
        self.dictionary = set()
        self.corpus = [] #list of strings -> filled in the "create_corpus" function
        self.name_list = [] #list of strings -> filled in the "create_corpus" function -> used to get document names
        if Path(directory).is_dir(): self.directory = directory
        else: print("Inputted path is not valid. Please check your input.")
        
    def print_dictionary(self): 
        print(self.dictionary)
    
    def print_corpus(self):
        print(self.corpus)
    
    def print_name(self): 
        print(self.name_list)
    
    def clear_corpus(self):
        self.corpus.clear()

    def clear_name_list(self):
        self.name_list.clear()

    def load_search_doc(self, doc_name): #function to load the search document from drive
        
        #build the path with Path() so that it works on Windows and macOS
        file = Path(self.directory) / doc_name
        assert file.is_file(), "The inputted file does not exist at that directory."
        
        #open and read search file
        #throw errors in case try fails
        try:
            with open(file, "r") as f:
                search_doc = f.read()
            return search_doc
        except FileNotFoundError as e:
            print(f"Error: File not found - {file}: {e}")
        except IOError as e:
            print(f"Error opening or reading file {file}: {e}")
        except Exception as e:
            print(f"An unexpected error occurred: {e}")
        return None

    def create_corpus(self, corpus_folder):
        assert Path(self.directory, corpus_folder).is_dir(), "The input directory does not exist."
        files = Path(self.directory, corpus_folder).glob('*')
        for file in files:
            if file.suffixes == [".txt"]:
                try:
                    with open(file, "r") as f:
                        text = f.read()
                    #append name and text only after a successful read so both lists stay aligned
                    self.name_list.append(file.name)
                    self.corpus.append(text)
                except FileNotFoundError as e:
                    print(f"Error: File not found - {file}: {e}")
                except IOError as e:
                    print(f"Error opening or reading file {file}: {e}")
                except Exception as e:
                    print(f"An unexpected error occurred: {e}")
                
            else: print(f"The following file could not be uploaded as it is not in .txt format: {file.name}")
        self.dictionary = self.create_dictionary()

    def update_corpus(self,corpus_folder):
        assert Path(self.directory, corpus_folder).is_dir(), "The input directory does not exist. Please input an existing corpus folder to update the corpus."
        self.clear_corpus()
        self.clear_name_list()
        self.create_corpus(corpus_folder)
        print("The corpus was successfully updated.")

    def process_text(self, text): #function to extract words from string
        #extract words from string and return in lower case
        return re.findall(r'\w+', text.lower())

    def create_dictionary(self): #function to create a dictionary from the corpus documents
        #iterate through all documents from the corpus
        for document in self.corpus:
            #extract all words in lower case from the string
            #save words as a set and add words to the dictionary by building the union
            words = self.process_text(document)
            new_words = set(words)
            self.dictionary = self.dictionary.union(new_words)
        return self.dictionary

    def document_to_vector(self, document): #function to convert a document (string) into a binary vector
        #use process_text to split the document into words
        #a set makes the membership test below fast
        word_set = set(self.process_text(document))
        #iterate through the dictionary
        #1 when the dictionary word occurs in the document
        #0 when the dictionary word does not occur in the document
        doc_vector = np.array([1 if word in word_set else 0 for word in self.dictionary])
        return doc_vector

    def freq_vector(self, document): #Function to convert a document (list of document words) into a frequency vector
        #Convert a document into a list of words
        word_list = self.process_text(document)
        #Count occurrences of each word in the document
        word_counts = {word: word_list.count(word) for word in set(word_list)}
        #Create the frequency vector
        freq_vector = np.array([word_counts.get(word, 0) for word in self.dictionary])
        return freq_vector

    def dot_similarity(self, search_vector): #function to compute similarities by using dot product
        similarity_dic = {}
        #use "for i, doc_vector" to also get index of the iteration -> used to get the document from the corpus
        for i, doc_vector in enumerate(self.doc_vectors):
            similarity = np.dot(doc_vector, search_vector)
            doc_name = self.name_list[i]
            similarity_dic.update({doc_name: similarity})
        return similarity_dic

    def jac_similarity(self, search_doc, search_vector): #function to compute a Jaccard-style similarity
        similarity_dic = {}
        #number of unique words in the search document
        search_len = len(set(self.process_text(search_doc)))
        for i, doc_vector in enumerate(self.doc_vectors):
            #number of unique words in the corpus document
            cor_len = len(set(self.process_text(self.corpus[i])))
            #divide the dot product (shared words) by the sum of both unique-word counts
            #note: the textbook Jaccard Index divides by the size of the union instead, i.e. search_len + cor_len - dot product
            similarity = np.dot(doc_vector, search_vector) / (search_len + cor_len)
            #get name from name_list by indexing from the name_list
            doc_name = self.name_list[i]
            similarity_dic.update({doc_name: similarity})
        return similarity_dic

    def euc_similarity(self, search_vector): #function to compute similarities by using the euclidean distance
        similarity_dic = {}
        for i, doc_vector in enumerate(self.doc_vectors):
            similarity = np.linalg.norm(doc_vector - search_vector)
            doc_name = self.name_list[i]
            similarity_dic.update({doc_name: similarity})
        return similarity_dic

    def cosine_similarity(self, search_vector): #function to compute similarities by using the cosine similarity
        similarity_dic = {}
        for i, doc_vector in enumerate(self.doc_vectors):
            #dot product of the search vector and document vector is divided by the product of the lengths of the vectors
            similarity = np.dot(doc_vector, search_vector) / (np.linalg.norm(doc_vector) * np.linalg.norm(search_vector))
            doc_name = self.name_list[i]
            similarity_dic.update({doc_name: similarity})
        return similarity_dic

    def compute_similarity(self, search_doc, method): #function which is triggered by user to compute similarity
        #sends search document string to load search document function and gets the content of the file as string
        search_doc = self.load_search_doc(search_doc)
        #create binary vector of search document
        search_vector = self.document_to_vector(search_doc)
        #create array of binary vectors of corpus documents
        self.doc_vectors = np.array([self.document_to_vector(document) for document in self.corpus])

        #if statements to assess which method is chosen
        #sends vectors and corpus to computing function
        if method == "Dot Product":
            similarities = self.dot_similarity(search_vector)
            #convert dictionary into data frame for formatted output and possibility to easily order results
            similarities_df = pd.DataFrame(similarities.items(), columns=['Document', 'Similarity'])
            similarities_df.sort_values(["Similarity"], ascending=False, inplace=True)
            print("Dot Product: \n", similarities_df, "\nThe higher the dot product the higher the similarity." )
        elif method == "Jaccard Index":
            similarities = self.jac_similarity(search_doc, search_vector)
            similarities_df = pd.DataFrame(similarities.items(), columns=['Document', 'Similarity'])
            similarities_df.sort_values(["Similarity"], ascending=False, inplace=True)
            print("Jaccard Index: \n", similarities_df, "\nThe higher the Jaccard Index the higher the similarity." )
        elif method == "Euclidean Distance":
            similarities = self.euc_similarity(search_vector)
            similarities_df = pd.DataFrame(similarities.items(), columns=['Document', 'Similarity'])
            similarities_df.sort_values(["Similarity"], ascending=True, inplace=True)
            print("Euclidean Distance: \n", similarities_df, "\nThe lower the Euclidean Distance the higher the similarity." )
        elif method == "Cosine Similarity":
            similarities = self.cosine_similarity(search_vector)
            similarities_df = pd.DataFrame(similarities.items(), columns=['Document', 'Similarity'])
            similarities_df.sort_values(["Similarity"], ascending=False, inplace=True)
            print("Cosine Similarity: \n", similarities_df, "\nThe higher the Cosine Similarity the higher the similarity." )
        else:
            print("Unknown method. Please input one of the following Methods: \n", "Dot Product, ", "Cosine Similarity, ", "Jaccard Index, ", "Euclidean Distance")
            return None
        
        
        #return similarities_df






### Initialise the program

In [270]:
#requires directory path as input
directory = input("Please enter the path to the folder containing the files for which you want to compute the text similarities: ")
analyzer = DocumentSimilarityAnalyzer(directory)

#alternatively, set the directory path by pasting the string
directory = ""
analyzer = DocumentSimilarityAnalyzer(directory)


#JONATHAN
#/Users/jonathan/Library/Mobile Documents/com~apple~CloudDocs/Master/Foundations of Data Science/Assignment/ds_final_assignment/

#NICK
#/Users/nickolasreinecke/Desktop/FDS Q3/


In [271]:
#upload documents for corpus
analyzer.create_corpus("Corpus_Docs")
analyzer.print_corpus()
analyzer.print_name()
analyzer.print_dictionary()


analyzer.update_corpus("Corpus_Docs")
analyzer.print_corpus()
analyzer.print_name()
analyzer.print_dictionary()

The following file could not be uploaded as it is not in .txt format: Error_Test.docx
['The fifth document contains words that are not present in the search document.\n', 'This is the fourth document. It shares some common words with the other documents.\n', 'This is the first document. It contains some words for testing.', 'Document number three is different from the others. It has unique words.', 'The second document is here. It shares some words with the first document.']
['doc5.txt', 'doc4.txt', 'doc1.txt', 'doc3.txt', 'doc2.txt']
{'first', 'three', 'number', 'is', 'others', 'with', 'some', 'present', 'documents', 'other', 'that', 'words', 'not', 'the', 'fourth', 'has', 'it', 'search', 'for', 'second', 'from', 'shares', 'document', 'testing', 'this', 'fifth', 'common', 'here', 'contains', 'in', 'unique', 'different', 'are'}
The following file could not be uploaded as it is not in .txt format: Error_Test.docx
The corpus was successfully updated.
['The fifth document contains words tha

In [272]:
#compute similarities
analyzer.compute_similarity("search_doc.txt", "Dot Product")
analyzer.compute_similarity("search_doc.txt", "Jaccard Index")
analyzer.compute_similarity("search_doc.txt", "Euclidean Distance")
analyzer.compute_similarity("search_doc.txt", "Cosine Similarity")

Dot Product: 
    Document  Similarity
2  doc1.txt           7
1  doc4.txt           6
4  doc2.txt           6
3  doc3.txt           5
0  doc5.txt           4 
The higher the dot product the higher the similarity.
Jaccard Index: 
    Document  Similarity
2  doc1.txt    0.291667
4  doc2.txt    0.250000
1  doc4.txt    0.230769
3  doc3.txt    0.200000
0  doc5.txt    0.166667 
The higher the Jaccard Index the higher the similarity.
Euclidean Distance: 
    Document  Similarity
2  doc1.txt    2.449490
4  doc2.txt    2.828427
1  doc4.txt    3.162278
3  doc3.txt    3.316625
0  doc5.txt    3.464102 
The lower the Euclidean Distance the higher the similarity.
Cosine Similarity: 
    Document  Similarity
2  doc1.txt    0.703526
4  doc2.txt    0.603023
1  doc4.txt    0.554700
3  doc3.txt    0.481125
0  doc5.txt    0.402015 
The higher the Cosine Similarity the higher the similarity.
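
For comparison with the "Jaccard Index" column: the textbook Jaccard index divides the number of shared words by the number of words in the union, whereas `jac_similarity` divides by the sum of the two unique-word counts. For `doc1.txt` the dot product is 7 and the reported 0.291667 equals 7/24, so the textbook value would be 7/(24 − 7) ≈ 0.412. A minimal sketch of the textbook definition on raw strings, assuming hypothetical helpers `tokenize` and `jaccard_index` and two made-up sentences:

```python
import re


def tokenize(text):
    # Set of lower-cased words in the text.
    return set(re.findall(r"\w+", text.lower()))


def jaccard_index(text_a, text_b):
    # |A intersect B| / |A union B|; defined as 0.0 when both texts are empty.
    a, b = tokenize(text_a), tokenize(text_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


doc_a = "This is the first document."
doc_b = "This is the second document."
print(jaccard_index(doc_a, doc_b))  # 4 shared words / 6 distinct words
```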


### Error Triggering

In [263]:
#input wrong directory
analyzer = DocumentSimilarityAnalyzer("adf3a/asdaf")

Inputted path is not valid. Please check your input.


In [261]:
#create corpus with an invalid folder name
analyzer.create_corpus("Corpus_Dcs")

AssertionError: The input directory does not exist.

In [262]:
#create corpus with another invalid folder name (update_corpus runs the same directory check)
analyzer.create_corpus("Corus_Docs")

AssertionError: The input directory does not exist.

In [273]:
#input wrong method
analyzer.compute_similarity("search_doc.txt", "Dots Product")

Unknown method. Please input one of the following Methods: 
 Dot Product,  Cosine Similarity,  Jaccard Index,  Euclidean Distance


In [268]:
#input an invalid search document
analyzer.compute_similarity("search_docs.txt", "Dot Product")

AssertionError: The inputted file does not exist at that directory.
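
The path checks above rely on `assert`, which Python skips when it is run with the `-O` flag. A minimal sketch of the same validation with an explicit exception, assuming a hypothetical helper `resolve_file` that is not part of the class above:

```python
from pathlib import Path


def resolve_file(directory, doc_name):
    # Hypothetical helper: join the parts with pathlib and validate explicitly.
    path = Path(directory) / doc_name
    if not path.is_file():
        raise FileNotFoundError(f"The file does not exist: {path}")
    return path


try:
    resolve_file("adf3a/asdaf", "search_docs.txt")
except FileNotFoundError as e:
    print(f"Error: {e}")
```

Raising a specific exception type also lets the caller decide whether to recover or stop, instead of always aborting with an `AssertionError`.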