### Text Document Similarity
Create a python program that will compute the text document similarity between different docu- ments. Your implementation will take a list of documents as an input text corpus, and it will compute a dictionary of words for the given corpus. Later, when a new document (i.e, search document) is provided, your implementation should provide a list of documents that are similar to the given search document, in descending order of their similarity with the search document.
For computing similarity between any two documents in our question, you can use the following distance measures (optionally, you can also use any other measure as well).
1. dot product between the two vectors
2. distance norm (or Euclidean distance) between two vectors .e.g. || u − v ||

As part of answering the question, you can also compare and comment on which of the two methods (or any other measure if you have used some other measure) will perform better and what are the reasons for it.

Hint A text document can be represented as a word vector against a given dictionary of words. So first, compute the dictionary of words for a given text corpus containing the unique words from the documents of the given corpus. Then transform every text document of the given corpus into vector form, i.e., creating a word vector where 0 indicates the word is not in the document, and 1 indicates that the word is present in the given document. In our question, a text document is just represented as a string, so the text corpus is nothing but a list of strings.

### Functions

**def_input_corpus**: 
- input text corpus containing of multiple strings CHECK 
- words from corpus to dictionary CHECK
- split inpt strings into words (account for .:,;!&/()="")) CHECK
- append words to a library (list of distinct words) CHECK
- create vektors fo text documents CHECK



**def_input_new_document**:
- create vektor for text document based on corpus CHECK
- calculate difference between input text and every existing text document CHECK
- create a list of the documents and their similarity rating -> dictionary used -> probably convert it to df in the end


**things to add**:
- document names CHECK
- reading of documents from drive path (.txt file) CHECK -> test with windows 
- word frequency for documents -> not binary vector anymore -> does that make sense?
- change dictionary to set instead of list CHECK
- combine the similarity methods and call them individually be inputting "method = ..." -> similar to ChatGPT Code

Error Handling: 
- Error when document is already part of the corpus (asserted by checking if already in name list) CHECK
- Error when search document name and document name for corpus are not in the folder CHECK
- Error when file is empty (when contains no words)



In [102]:
#Code Jonathan 
import re
import string
import numpy as np
import pandas as pd
from pathlib import Path
import math
import sys

#global variables used accross all functions
dictionary = set()
corpus = [] #list of strings -> appended in the "update corpus" function
name_list = [] #list of strings -> appended in the "update name" function -> used to get document names

def clear_corpus (corpus):
  corpus = corpus.clear()
  return corpus

def clear_name_list(name_list):
  name_list = name_list.clear()
  return name_list

def set_directory(input_directory): #sets the input directory as a global variable so is is usable accross all functions
  global directory 
  directory = input_directory
  return directory

def load_search_doc(doc_name): #function to load the search document from drive
  file_to_open = Path(directory+doc_name)
  try:  
    with open(file_to_open, "r") as f:
      search_doc = f.read()
    return search_doc
  except FileNotFoundError:
        print(f"Error: File not found - {file_to_open}")
  except IOError as e:
        print(f"Error opening or reading file {file_to_open}: {e}")
  except Exception as e:
        print(f"An unexpected error occurred: {e}")
  return None  # Return a sentinel value or handle the case appropriately

def update_corpus(directory,doc_name): #function to input a new document for the corpus
  global corpus
  global name_list
    
  #open file by using path() -> to make it operateable on windows and mac systems
  file_to_open = Path(directory+doc_name)
    
  #open and read file to append the content to the corpus
  
  assert doc_name not in name_list, f"Document {doc_name} already read uploaded, please choose another file."
  name_list.append(doc_name)
  
  #open and read file to append the content to the corpus
  try:
    with open(file_to_open, "r") as f:
      corpus.append(f.read())
  #append file name to name_list
    dictionary = create_dictionary(corpus)
    #return corpus, dictionary, name_list
  except FileNotFoundError:
        print(f"Error: File not found - {file_to_open}")
  except IOError as e:
        print(f"Error opening or reading file {file_to_open}: {e}")
  except Exception as e:
        print(f"An unexpected error occurred: {e}")
  return None  # Return a sentinel value or handle the case appropriately

def process_text(text): #function to extract words from string
  # extract words from string and return in lower case
  return re.findall(r'\w+', text.lower())

def create_dictionary(corpus): #function to create a dictionary from the corpus documents
  global dictionary
  #iterate through all documents from the corpus
  for document in corpus:  
    #extract all words in lower case from the string
    #save words as a set and add words to the dictionary by building the union
    words = process_text(document)
    words_doc = set(words)
    dictionary = dictionary.union(words_doc)
  
  return dictionary 

def document_to_vector(document, dictionary): #function to convert a document (list of document words) into a binary vector
  # Convert a document into a binary vector
  # uses process_text function to convert document into a list of words
  word_list = process_text(document)
  # iterates through the dictionary  
  # appends 1 when word is in the list of words from the document  
  # appends 0 when word is not in the list of words from the document  
  doc_vector = [1 if word in word_list else 0 for word in dictionary]
  return np.array(doc_vector)

def freq_vector(document, dictionary):# Function to convert a document (list of document words) into a frequency vector
    
    # Convert a document into a list of words
    word_list = process_text(document)
    
    # Count occurrences of each word in the document
    word_counts = {word: word_list.count(word) for word in set(word_list)}
    
    # Create the frequency vector
    doc_vector = [word_counts.get(word, 0) for word in dictionary]
    
    # Convert the list to a NumPy array
    return np.array(doc_vector)



def dot_similarity(doc_vectors, search_vector, corpus): #function to compute similarities by using dot product
    similarity_dic = {}
    #use "for i, doc_vector" to also get index of the iteration -> used to get the document from the corpus
    for i, doc_vector in enumerate(doc_vectors):
        similarity = np.dot(doc_vector, search_vector)
        #similarity=np.dot(doc, search_vector)/len(process_text(corpus[i]))
        doc_name = name_list[i]
        similarity_dic.update({doc_name: similarity})
    return similarity_dic


def jac_similarity(doc_vectors, search_vector, corpus): #function to compute similarities by using Jaccard Index
    similarity_dic = {}
    #use "for i, doc_vector" to also get index of the iteration -> used to get the document from the corpus
    for i, doc_vector in enumerate(doc_vectors):
        cor_len= len(process_text(corpus[i]))
        search_len = len(process_text(str(search_vector)))
        
        #divide the dot product by the number of words in the union of search_doc and doc from corpus
        similarity=np.dot(doc_vector, search_vector)/(search_len+cor_len)

        #get name from name_list by indexing from the name_list
        doc_name = name_list[i]
        similarity_dic.update({doc_name: similarity})
    
    return similarity_dic


def euc_similarity(doc_vectors, search_vector, corpus): #function to compute similarities by using the euclidean distance
  similarity_dic = {}
    
    #use "for i, doc_vector" to also get index of the iteration -> used to get the document from the corpus
  for i, doc_vector in enumerate(doc_vectors):
      similarity = np.linalg.norm(doc_vector-search_vector)
      #print(similarity)
      doc_name = name_list[i]
      similarity_dic.update({doc_name: similarity})
  return similarity_dic
  
  

def compute_similarity(search_doc, corpus, method):
    
    #global dictionary
    
    #sends search document string to load search document function and gets the content of the file as string
    search_doc = load_search_doc(search_doc)
    
    #create biary vector of search document
    search_vector = document_to_vector(search_doc, dictionary)
    
    #create array of binary vectors of corpus documents
    doc_vectors = np.array([document_to_vector(document, dictionary) for document in corpus])

  #if statements to assess which method is chosen
  #sends vectors and corpus to computing function
  #prints similirities
    if method == "Dot Product": 
      similarities = dot_similarity(doc_vectors,search_vector,corpus)
      #convert dictionary into data frame for formatted output and possibility to easily order results
      similarities = pd.DataFrame(similarities.items(), columns=['Document', 'Similarity'])
      similarities.sort_values(["Similarity"], ascending=False)
      print("Dot Product: \n",similarities)
    elif method == "Jaccard Index": 
      similarities = jac_similarity(doc_vectors,search_vector,corpus)
      #convert dictionary into data frame for formatted output and possibility to easily order results
      similarities = pd.DataFrame(similarities.items(), columns=['Document', 'Similarity'])
      similarities.sort_values(["Similarity"], ascending=False)
      print("Jaccard Index: \n",similarities)
    elif method == "Eucledian Distance": 
      similarities = euc_similarity(doc_vectors,search_vector,corpus)
      #convert dictionary into data frame for formatted output and possibility to easily order results
      similarities = pd.DataFrame(similarities.items(), columns=['Document', 'Similarity'])
      similarities.sort_values(["Similarity"], ascending=False)
      print("Eucledian Distance: \n",similarities)
    else: print("Unknown method")

    #return similarities

In [97]:
#replace directory with file path to the folder in which the documents and the code-file are located
set_directory("/Users/jonathan/Library/Mobile Documents/com~apple~CloudDocs/Master/Foundations of Data Science/Assignment/Final Assignment/")


update_corpus(directory, "doc1.txt")
update_corpus(directory,"doc2.txt")
update_corpus(directory,"doc3.txt")
update_corpus(directory,"doc4.txt")


AssertionError: Document doc1.txt already read uploaded, please choose another file.

In [98]:
print(corpus)
print(search_doc)
print(name_list)

compute_similarity("search_doc.txt", corpus, "Dot Product")
compute_similarity("search_doc.txt", corpus, "Jaccard Index")
compute_similarity("search_doc.txt", corpus, "Eucledian Distance")



['Hello, world!', 'Bye, World!', 'Das ist Mars Mars Textdokument. Cool', 'Das ist das nächste Testdokument. Mars!']
Hello, Hello, World, Bye!
['doc1.txt', 'doc2.txt', 'doc3.txt', 'doc4.txt']
Dot Product: 
    Document  Similarity
0  doc1.txt           2
1  doc2.txt           1
2  doc3.txt           3
3  doc4.txt           5
Jaccard Index: 
    Document  Similarity
0  doc1.txt    0.166667
1  doc2.txt    0.083333
2  doc3.txt    0.187500
3  doc4.txt    0.312500
Eucledian Distance: 
    Document  Similarity
0  doc1.txt    2.236068
1  doc2.txt    2.645751
2  doc3.txt    2.449490
3  doc4.txt    1.414214


In [None]:
clear_name_list(name_list)
print(name_list)

In [None]:
clear_corpus(corpus)
print(corpus)

In [103]:
update_corpus(directory, "doc1s.txt")

Error: File not found - /Users/jonathan/Library/Mobile Documents/com~apple~CloudDocs/Master/Foundations of Data Science/Assignment/Final Assignment/doc1s.txt
