### Text Document Similarity
Create a Python program that computes the text document similarity between different documents. Your implementation will take a list of documents as an input text corpus and compute a dictionary of words for the given corpus. Later, when a new document (i.e., a search document) is provided, your implementation should return a list of the documents that are similar to the given search document, in descending order of their similarity to the search document.
For computing similarity between any two documents in our question, you can use the following distance measures (optionally, you can also use any other measure as well).
1. dot product between the two vectors
2. distance norm (or Euclidean distance) between the two vectors, e.g. ||u − v||
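
As a small illustration of the two measures, the sketch below evaluates them on two made-up binary word vectors. The vectors `u` and `v` and the helper names are assumptions for this example only; the notebook's own code uses `np.dot` and `np.linalg.norm` for the same calculations.

```python
import math

def dot_product(u, v):
    # Sum of element-wise products; for binary vectors this counts shared words
    return sum(a * b for a, b in zip(u, v))

def euclidean_distance(u, v):
    # Length of the difference vector ||u - v||
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical binary word vectors over a five-word dictionary
u = [1, 1, 0, 1, 0]
v = [1, 0, 0, 1, 1]

print(dot_product(u, v))         # 2 -> two words occur in both documents
print(euclidean_distance(u, v))  # sqrt(2) -> two words occur in only one document
```

A larger dot product means more similar, whereas a larger distance means less similar, so the two measures have to be ranked in opposite directions.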

As part of answering the question, you can also compare and comment on which of the two methods (or any other measure if you have used some other measure) will perform better and what are the reasons for it.

Hint: A text document can be represented as a word vector against a given dictionary of words. So first, compute the dictionary of words for a given text corpus, containing the unique words from the documents of that corpus. Then transform every text document of the corpus into vector form, i.e., create a word vector in which 0 indicates that the word is not in the document and 1 indicates that the word is present in the document. In our question, a text document is just represented as a string, so the text corpus is nothing but a list of strings.
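
The steps in the hint can be sketched in a few lines of plain Python. The sample corpus and the function names `tokenize`, `build_dictionary` and `to_binary_vector` are made up for illustration; the dictionary is sorted here so that the vector positions are reproducible.

```python
import re

def tokenize(text):
    # Lower-case the text and extract runs of word characters
    return re.findall(r"\w+", text.lower())

def build_dictionary(corpus):
    # Collect the unique words of all documents, sorted for a stable order
    return sorted({word for doc in corpus for word in tokenize(doc)})

def to_binary_vector(text, dictionary):
    # 1 if the dictionary word occurs in the text, otherwise 0
    words = set(tokenize(text))
    return [1 if word in words else 0 for word in dictionary]

corpus = ["The cat sat.", "The dog barked!"]  # hypothetical corpus
dictionary = build_dictionary(corpus)
print(dictionary)                                        # ['barked', 'cat', 'dog', 'sat', 'the']
print(to_binary_vector("A cat and a dog", dictionary))   # [0, 1, 1, 0, 0]
```

Words of the search document that are not in the dictionary (here "a" and "and") simply have no position in the vector.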

### Functions

**def_input_corpus**: 
- input text corpus consisting of multiple strings CHECK
- words from corpus to dictionary CHECK
- split input strings into words (account for .:,;!&/()="") CHECK
- append words to a library (list of distinct words) CHECK
- create vectors for text documents CHECK



**def_input_new_document**:
- create vector for text document based on corpus CHECK
- calculate difference between input text and every existing text document CHECK
- create a list of the documents and their similarity rating -> dictionary used -> probably convert it to df in the end


**things to add**:
- document names CHECK
- reading of documents from drive path (.txt file) CHECK -> test with windows 
- word frequency for documents -> not binary vector anymore -> does that make sense?
- change dictionary to set instead of list CHECK
- combine the similarity methods and call them individually by inputting "method = ..." -> similar to ChatGPT code -> CHECK
- add cosine similarity method
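
For the cosine similarity item on the list above, one possible shape is sketched below. It is not part of the class yet; the function name `cosine_similarity` and the zero-vector convention (return 0.0) are assumptions.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumed convention: an empty document is similar to nothing
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical frequency vectors: v is u scaled by two, so both point in the same direction
u = [1, 2, 0]
v = [2, 4, 0]
print(cosine_similarity(u, v))          # approximately 1.0
print(cosine_similarity(u, [0, 0, 3]))  # 0.0, no shared words
```

Because the result depends only on the direction of the vectors, a long document and a short document with the same word proportions get the same score, which is relevant once frequency vectors replace binary ones.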

Error Handling: 
- Error when document is already part of the corpus (asserted by checking if already in name list) CHECK
- Error when search document name and document name for corpus are not in the folder CHECK
- Error when file is empty (when contains no words)
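
The still-open empty-file case could be handled with a check of the following kind. The helper `require_words` and the choice of `ValueError` are assumptions and not part of the notebook's class.

```python
import re

def require_words(text, doc_name):
    # Extract words the same way as process_text and reject documents without any
    words = re.findall(r"\w+", text.lower())
    if not words:
        raise ValueError(f"Document {doc_name} contains no words.")
    return words

print(require_words("Hello, world!", "doc1.txt"))  # ['hello', 'world']
try:
    require_words("  \n", "empty.txt")
except ValueError as e:
    print(f"Error: {e}")  # Error: Document empty.txt contains no words.
```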



In [10]:
import re
import string
import numpy as np
import pandas as pd
from pathlib import Path


class DocumentSimilarityAnalyzer:
    def __init__(self, directory):
        self.directory = directory
        self.dictionary = set()
        self.corpus = [] #list of strings -> appended in the "update corpus" function
        self.name_list = [] #list of strings -> appended in the "update name" function -> used to get document names

    def clear_corpus(self):
        self.corpus.clear()

    def clear_name_list(self):
        self.name_list.clear()

    def load_search_doc(self, doc_name): #function to load the search document from drive
        
        #build the file path with Path() so that it works on Windows and macOS
        file_to_open = Path(self.directory) / doc_name
        
        #open and read search file
        #throw errors in case try fails
        try:
            with open(file_to_open, "r") as f:
                search_doc = f.read()
            return search_doc
        except FileNotFoundError as e:
            print(f"Error: File not found - {file_to_open}: {e}")
        except IOError as e:
            print(f"Error opening or reading file {file_to_open}: {e}")
        except Exception as e:
            print(f"An unexpected error occurred: {e}")
        return None

    def update_corpus(self, doc_name): #function to input a new document for the corpus
        #build the file path with Path() so that it works on Windows and macOS
        file_to_open = Path(self.directory) / doc_name
        #open and read the file to append it to the corpus
        #print an error message in case the try block fails
        try:
            #assert that the file is not already on the name list and therefore part of the corpus
            assert doc_name not in self.name_list, f"Document {doc_name} already uploaded, please choose another file."
            with open(file_to_open, "r") as f:
                text = f.read()
            #register name and text only after a successful read, so that both lists stay aligned
            self.name_list.append(doc_name)
            self.corpus.append(text)
            self.dictionary.update(self.create_dictionary())
        except FileNotFoundError as e:
            print(f"Error: File not found - {file_to_open}: {e}")
        except IOError as e:
            print(f"Error opening or reading file {file_to_open}: {e}")
        except Exception as e:
            print(f"An unexpected error occurred: {e}")
        return None #nothing to return, the corpus is updated in place

    def process_text(self, text): #function to extract words from string
        # extract words from string and return in lower case
        return re.findall(r'\w+', text.lower())

    def create_dictionary(self): #function to create a dictionary from the corpus documents
        
        #start with an empty set, otherwise the union below fails on an undefined name
        dictionary = set()
        #iterate through all documents from the corpus
        for document in self.corpus:
            #extract all words in lower case from the string
            #save words as a set and add words to the dictionary by building the union
            words = self.process_text(document)
            new_words = set(words)
            dictionary = dictionary.union(new_words)
        return dictionary

    def document_to_vector(self, document): #function to convert a document (string) into a binary vector
        #use process_text to split the document into words
        #keep them as a set so that each membership test below is fast
        words = set(self.process_text(document))
        #iterate through the dictionary
        #append 1 when the dictionary word occurs in the document
        #append 0 when the dictionary word does not occur in the document
        doc_vector = np.array([1 if word in words else 0 for word in self.dictionary])
        return doc_vector

    def freq_vector(self, document): #Function to convert a document (list of document words) into a frequency vector
        #Convert a document into a list of words
        word_list = self.process_text(document)
        #Count occurrences of each word in the document
        word_counts = {word: word_list.count(word) for word in set(word_list)}
        #Create the frequency vector
        freq_vector = np.array([word_counts.get(word, 0) for word in self.dictionary])
        return freq_vector

    def dot_similarity(self, search_vector): #function to compute similarities by using dot product
        similarity_dic = {}
        #use "for i, doc_vector" to also get index of the iteration -> used to get the document from the corpus
        for i, doc_vector in enumerate(self.doc_vectors):
            similarity = np.dot(doc_vector, search_vector)
            doc_name = self.name_list[i]
            similarity_dic.update({doc_name: similarity})
        return similarity_dic

    def jac_similarity(self, search_vector): #function to compute similarities by using Jaccard Index
        similarity_dic = {}
        for i, doc_vector in enumerate(self.doc_vectors):
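            cor_len = len(self.process_text(self.corpus[i]))
            search_len = len(self.process_text(str(search_vector)))
            #divide the dot product by the sum of both lengths
            #note: search_len counts the entries of the printed search vector, not the words of the
            #search document, and the sum of lengths is not the size of the word union, so this score
            #is a normalised dot product rather than a true Jaccard index
            similarity = np.dot(doc_vector, search_vector) / (search_len + cor_len)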
            #get name from name_list by indexing from the name_list
            doc_name = self.name_list[i]
            similarity_dic.update({doc_name: similarity})
        return similarity_dic

    def euc_similarity(self, search_vector): #function to compute similarities by using the euclidean distance
        similarity_dic = {}
        for i, doc_vector in enumerate(self.doc_vectors):
            similarity = np.linalg.norm(doc_vector - search_vector)
            doc_name = self.name_list[i]
            similarity_dic.update({doc_name: similarity})
        return similarity_dic

    def compute_similarity(self, search_doc, method): #function which is triggered by user to compute similarity
        #load the search document from disk and get the content of the file as a string
        search_doc = self.load_search_doc(search_doc)
        #stop here if the search document could not be loaded (the error was already printed)
        if search_doc is None:
            return None
        #create binary vector of search document
        search_vector = self.document_to_vector(search_doc)
        #create array of binary vectors of corpus documents
        self.doc_vectors = np.array([self.document_to_vector(document) for document in self.corpus])

        #if statements to assess which method is chosen
        #sends vectors and corpus to computing function
        if method == "Dot Product":
            similarities = self.dot_similarity(search_vector)
        elif method == "Jaccard Index":
            similarities = self.jac_similarity(search_vector)
        elif method == "Euclidean Distance":
            similarities = self.euc_similarity(search_vector)
        else:
            print("Unknown method")
            return None

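        #convert dictionary into data frame for formatted output and easy ordering of results
        #note: the sort is descending for every method; for "Euclidean Distance" a smaller value
        #means a more similar document, so the most similar document appears last in that table
        similarities_df = pd.DataFrame(similarities.items(), columns=['Document', 'Similarity'])
        similarities_df.sort_values(["Similarity"], ascending=False, inplace=True)
        print(f"{method}: \n", similarities_df)
        #return similarities_df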
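
For comparison, the textbook Jaccard index of two binary word vectors divides the number of shared words by the number of words in the union. The sketch below is a standalone illustration with hypothetical names; it is not the method that produced the notebook's results.

```python
def jaccard_index(u, v):
    # Intersection: positions where both vectors are 1
    intersection = sum(1 for a, b in zip(u, v) if a and b)
    # Union: positions where at least one vector is 1
    union = sum(1 for a, b in zip(u, v) if a or b)
    # Assumed convention: two empty documents have similarity 0.0
    return intersection / union if union else 0.0

# Hypothetical binary word vectors
u = [1, 1, 0, 1, 0]
v = [1, 0, 0, 1, 1]
print(jaccard_index(u, v))  # 0.5 -> 2 shared words out of 4 distinct words
```

The score is always between 0 and 1 and equals 1 only when both documents use exactly the same set of words.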






In [11]:
#replace directory with file path to the folder in which the documents and the code-file are located
directory= "/Users/jonathan/Library/Mobile Documents/com~apple~CloudDocs/Master/Foundations of Data Science/Assignment/Final Assignment/"
analyzer = DocumentSimilarityAnalyzer(directory)

analyzer.update_corpus("doc1.txt")
analyzer.update_corpus("doc2.txt")
analyzer.update_corpus("doc3.txt")
analyzer.update_corpus("doc4.txt")
analyzer.compute_similarity("search_doc.txt", "Dot Product")
analyzer.compute_similarity("search_doc.txt", "Jaccard Index")
analyzer.compute_similarity("search_doc.txt", "Euclidean Distance")

Dot Product: 
    Document  Similarity
3  doc4.txt           5
2  doc3.txt           3
0  doc1.txt           2
1  doc2.txt           1
Jaccard Index: 
    Document  Similarity
3  doc4.txt    0.312500
2  doc3.txt    0.187500
0  doc1.txt    0.166667
1  doc2.txt    0.083333
Euclidean Distance: 
    Document  Similarity
1  doc2.txt    2.645751
2  doc3.txt    2.449490
0  doc1.txt    2.236068
3  doc4.txt    1.414214
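
Because a smaller Euclidean distance means a more similar document, the distance table reads from least to most similar when sorted in descending order. A direction-aware ranking could look like the sketch below; the function `rank_documents` and its `higher_is_better` flag are hypothetical, and the scores are the values printed above.

```python
def rank_documents(scores, higher_is_better=True):
    # Sort (name, score) pairs so that the most similar document comes first
    return sorted(scores.items(), key=lambda item: item[1], reverse=higher_is_better)

dot_scores = {"doc1.txt": 2, "doc2.txt": 1, "doc3.txt": 3, "doc4.txt": 5}
euclidean = {"doc1.txt": 2.236068, "doc2.txt": 2.645751,
             "doc3.txt": 2.449490, "doc4.txt": 1.414214}

print([name for name, _ in rank_documents(dot_scores)])
# ['doc4.txt', 'doc3.txt', 'doc1.txt', 'doc2.txt']
print([name for name, _ in rank_documents(euclidean, higher_is_better=False)])
# ['doc4.txt', 'doc1.txt', 'doc3.txt', 'doc2.txt']
```

Both measures agree on the most and least similar documents here but swap doc1.txt and doc3.txt, since the distance also penalises words that occur in only one of the two documents.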


In [None]:
# Example usage:
directory = "/path/to/your/files/"
analyzer = DocumentSimilarityAnalyzer(directory)
analyzer.update_corpus("example_doc1.txt")
analyzer.update_corpus("example_doc2.txt")
analyzer.compute_similarity("search_doc.txt", "Dot Product")