### Text Document Similarity
Create a Python program that computes the text document similarity between different documents. Your implementation will take a list of documents as an input text corpus and compute a dictionary of words for the given corpus. Later, when a new document (i.e., a search document) is provided, your implementation should return a list of the documents that are similar to the given search document, in descending order of their similarity to the search document.
To compute the similarity between any two documents, you can use the following measures (optionally, you can use any other measure as well).
1. dot product between the two vectors
2. distance norm (or Euclidean distance) between the two vectors, e.g. ||u − v||

As part of answering the question, you can also compare and comment on which of the two methods (or any other measure if you have used some other measure) will perform better and what are the reasons for it.
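
To make the comparison concrete, here is a minimal sketch on hand-made binary word vectors. The vectors `query`, `short_doc`, and `long_doc` are made-up illustrations, not assignment data. It shows one property worth commenting on: the dot product only counts shared words, so a long document with many extra words can score as high as a short exact match, while the Euclidean distance penalises the extra words.

```python
import numpy as np

# Hypothetical binary word vectors over a 6-word dictionary (illustrative only)
query = np.array([1, 1, 0, 0, 0, 0])
short_doc = np.array([1, 1, 0, 0, 0, 0])  # exactly the query words
long_doc = np.array([1, 1, 1, 1, 1, 1])   # query words plus four others

# Dot product: number of words shared with the query (higher = more similar)
dot_short = int(np.dot(short_doc, query))
dot_long = int(np.dot(long_doc, query))

# Euclidean distance: square root of the number of differing words (lower = more similar)
euc_short = float(np.linalg.norm(short_doc - query))
euc_long = float(np.linalg.norm(long_doc - query))

print(dot_short, dot_long)  # both documents share 2 words with the query
print(euc_short, euc_long)  # only the long document is penalised for its extra words
```

Under these assumptions the dot product cannot tell the two documents apart, whereas the distance ranks the exact match first.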

**Hint:** A text document can be represented as a word vector against a given dictionary of words. So first, compute the dictionary of words for a given text corpus, containing the unique words from the documents of that corpus. Then transform every text document of the corpus into vector form, i.e., create a word vector where 0 indicates that the word is not in the document and 1 indicates that the word is present in the document. In our question, a text document is just represented as a string, so the text corpus is nothing but a list of strings.
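
The hint can be sketched in a few lines. This is only an illustration under assumptions: `tokenize`, `to_binary_vector`, `example_corpus`, and `vocabulary` are hypothetical names, and the dictionary is sorted here purely so that the vector positions are easy to read.

```python
import re

def tokenize(text):
    # Lower-case the text and keep runs of word characters
    return re.findall(r"\w+", text.lower())

# Hypothetical two-document corpus (illustrative only)
example_corpus = ["The cat sat.", "The dog barked!"]

# Dictionary: sorted list of the unique words across all documents
vocabulary = sorted({word for doc in example_corpus for word in tokenize(doc)})

def to_binary_vector(text, vocabulary):
    # 1 if the dictionary word occurs in the text, 0 otherwise
    words = set(tokenize(text))
    return [1 if word in words else 0 for word in vocabulary]

print(vocabulary)                                      # ['barked', 'cat', 'dog', 'sat', 'the']
print(to_binary_vector("The cat barked", vocabulary))  # [1, 1, 0, 0, 1]
```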

### Functions

**def_input_corpus**: 
- input text corpus consisting of multiple strings CHECK 
- words from corpus to dictionary CHECK
- split input strings into words (account for .:,;!&/()="") CHECK
- append words to a library (list of distinct words) CHECK
- create vectors for text documents CHECK



**def_input_new_document**:
- create vector for text document based on corpus CHECK
- calculate difference between input text and every existing text document CHECK
- create a list of the documents and their similarity rating -> dictionary used -> probably convert it to a DataFrame in the end
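
The last step, turning a dictionary of ratings into an ordered table, might look like the sketch below. The `ratings` values are made up for illustration. The point to note is that `DataFrame.sort_values` returns a new DataFrame rather than sorting in place, so the result has to be assigned.

```python
import pandas as pd

# Hypothetical similarity ratings keyed by document name (illustrative only)
ratings = {"doc1.txt": 2, "doc2.txt": 3, "doc3.txt": 1}

ranking = pd.DataFrame(list(ratings.items()), columns=["Document", "Similarity"])

# sort_values returns a new DataFrame, so keep the result
ranking = ranking.sort_values("Similarity", ascending=False).reset_index(drop=True)

print(ranking)
```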


**things to add**:
- document names
- reading of documents from drive path (.txt file) 
- word frequency for documents -> not binary vector anymore
- change dictionary to set instead of list
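
For the word-frequency item, one possible approach is to count occurrences instead of recording presence. This is a sketch under assumptions: `count_vector` and the sorted `vocabulary` list are hypothetical and not part of the code below.

```python
import re
from collections import Counter

def count_vector(text, vocabulary):
    # Count how often each dictionary word occurs in the text
    counts = Counter(re.findall(r"\w+", text.lower()))
    # Counter returns 0 for words that never occur
    return [counts[word] for word in vocabulary]

# Hypothetical sorted dictionary (illustrative only)
vocabulary = ["bye", "hello", "mars", "world"]

print(count_vector("Hello, Hello, World, Bye!", vocabulary))  # [1, 2, 0, 1]
```

With counts, a repeated word such as "hello" contributes more to the dot product than it does in the binary representation.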


In [88]:
#Code Jonathan 
import re
import string
import numpy as np
import pandas as pd
from pathlib import Path
import math

dictionary = set()
corpus = []
name_list = []

def process_text(text):
  # extract words from string and return in lower case
  return re.findall(r'\w+', text.lower())

def create_dictionary(corpus):
  global dictionary
  
  #iterate through all documents from the corpus
  for document in corpus:  
   
    #extract all words in lower case from the string
    #save words as a set and add words to the dictionary by building the union
    words = process_text(document)
    words_doc = set(words)
    dictionary = dictionary.union(words_doc)
    
    ################### old
    #iterate through the words of each document and append to dictionary when not already in dictionary
    ## instead of loop, do set(words) and then union this set with the dictionary set
    #for word in words:
    #  if word not in dictionary:
    #    dictionary.append(word)
  #dictionary = sorted(dictionary)
    ##############################
  return dictionary 

def document_to_vector(document, dictionary):
  # Convert a document into a binary word vector
  # use a set of the document's words so each membership test is O(1)
  words = set(process_text(document))
  doc_vector = [1 if word in words else 0 for word in dictionary]
  return np.array(doc_vector)



####### dot product #################################

def compute_dot_similarity(doc_vectors, search_vector, corpus):
    similarity_dic = {}
    #use "for i, doc_vector" to also get index of the iteration -> used to get the document from the corpus
    for i, doc_vector in enumerate(doc_vectors):
        similarity = np.dot(doc_vector, search_vector)
        #similarity=np.dot(doc, search_vector)/len(process_text(corpus[i]))
        doc_name = corpus[i]
        #doc_name = word_list[i]
        similarity_dic.update({doc_name: similarity})
    return similarity_dic

def similarity_dot_prod(search_doc,corpus):
  dictionary = create_dictionary(corpus)
  doc_vectors = np.array([document_to_vector(document, dictionary) for document in corpus])
  #print("doc vectors: ", doc_vectors)
  #print("dicdic",dictionary)
  search_vector = document_to_vector(search_doc, dictionary)
  #print("search vector:", search_vector)
  similarities = compute_dot_similarity(doc_vectors,search_vector,corpus)
  
  #convert dictionary into data frame for formatted output and possibility to easily order results
  similarities = pd.DataFrame(similarities.items(), columns=['Document', 'Similarity'])
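  #sort_values returns a new data frame, so assign the result (stable sort keeps ties in corpus order)
  similarities = similarities.sort_values(["Similarity"], ascending=False, kind="stable")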
  return similarities



####### jaccard index #################################

def compute_jac_similarity(doc_vectors, search_vector, corpus):
    similarity_dic = {}
    #use "for i, doc_vector" to also get index of the iteration -> used to get the document from the corpus
    for i, doc_vector in enumerate(doc_vectors):
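        cor_len= len(process_text(corpus[i]))
        #NOTE: str(search_vector) is the printed 0/1 array, so search_len counts the vector entries (dictionary size), not the words of search_doc
        search_len = len(process_text(str(search_vector)))
        
        #divide the dot product by (search_len + cor_len); this is a length-normalised dot product, not the exact Jaccard index |A ∩ B| / |A ∪ B|
        similarity=np.dot(doc_vector, search_vector)/(search_len+cor_len)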
        
        #print(len(process_text(corpus[i])))
        doc_name = corpus[i]
        similarity_dic.update({doc_name: similarity})
    
    return similarity_dic

def similarity_jac_ind(search_doc,corpus):
  dictionary = create_dictionary(corpus)
  doc_vectors = np.array([document_to_vector(document, dictionary) for document in corpus])
  #print("doc vectors: ", doc_vectors)
  #print("dicdic",dictionary)
  search_vector = document_to_vector(search_doc, dictionary)
  #print("search vector:", search_vector)
  similarities = compute_jac_similarity(doc_vectors,search_vector,corpus)
  
  #convert dictionary into data frame for formatted output and possibility to easily order results
  similarities = pd.DataFrame(similarities.items(), columns=['Document', 'Similarity'])
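  #sort_values returns a new data frame, so assign the result (stable sort keeps ties in corpus order)
  similarities = similarities.sort_values(["Similarity"], ascending=False, kind="stable")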
  return similarities



####### euclidean distance #################################

def compute_euc_similarity(doc_vectors, search_vector, corpus): 
  similarity_dic = {}
    
    #use "for i, doc_vector" to also get index of the iteration -> used to get the document from the corpus
  for i, doc_vector in enumerate(doc_vectors):
      similarity = np.linalg.norm(doc_vector-search_vector)
      #print(similarity)
      doc_name = corpus[i]
      similarity_dic.update({doc_name: similarity})
  return similarity_dic

def similarity_euc_dist (search_doc,corpus): 
  dictionary = create_dictionary(corpus)
  doc_vectors = np.array([document_to_vector(document, dictionary) for document in corpus])
  #print("doc vectors: ", doc_vectors)
  #print("dicdic",dictionary)
  search_vector = document_to_vector(search_doc, dictionary)
  #print("search vector:", search_vector)
  similarities = compute_euc_similarity(doc_vectors,search_vector,corpus)
  
  #convert dictionary into data frame for formatted output and possibility to easily order results
  similarities = pd.DataFrame(similarities.items(), columns=['Document', 'Similarity'])
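  #smaller distance means more similar, so sort ascending; assign the result because sort_values returns a new data frame
  similarities = similarities.sort_values(["Similarity"], ascending=True, kind="stable")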
  return similarities



In [90]:
corpus = ["Hello, World!", "Bye, World?!", "“¶¢[[Hello Mars$%"]

search_doc = "Hello, Hello, World, Bye!"

document_to_vector(search_doc, dictionary)

print(similarity_dot_prod(search_doc, corpus))
print(similarity_euc_dist(search_doc, corpus))
print(similarity_jac_ind(search_doc, corpus))

            Document  Similarity
0      Hello, World!           2
1       Bye, World?!           2
2  “¶¢[[Hello Mars$%           1
            Document  Similarity
0      Hello, World!    1.000000
1       Bye, World?!    1.000000
2  “¶¢[[Hello Mars$%    1.732051
            Document  Similarity
0      Hello, World!    0.333333
1       Bye, World?!    0.333333
2  “¶¢[[Hello Mars$%    0.166667
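
The `similarity_jac_ind` values above come from dividing the dot product by a sum of two lengths. For comparison, the Jaccard index is usually defined on the two word sets as |A ∩ B| / |A ∪ B|. A minimal sketch on binary vectors follows, assuming both vectors are built against the same dictionary; `jaccard_index` and the example vectors are hypothetical and not part of the code above.

```python
import numpy as np

def jaccard_index(u, v):
    # u and v are binary (0/1) vectors over the same dictionary
    intersection = int(np.sum(np.logical_and(u, v)))
    union = int(np.sum(np.logical_or(u, v)))
    # Two empty documents share no words; treat them as dissimilar
    return intersection / union if union else 0.0

# Hypothetical vectors over the dictionary order [bye, hello, mars, world]
search = np.array([1, 1, 0, 1])       # "Hello, Hello, World, Bye!"
hello_world = np.array([0, 1, 0, 1])  # "Hello, World!"
hello_mars = np.array([0, 1, 1, 0])   # "Hello Mars"

print(jaccard_index(hello_world, search))  # 2 shared words out of 3 in the union
print(jaccard_index(hello_mars, search))   # 1 shared word out of 4 in the union
```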


In [131]:
###load files from folder into the corpus

from pathlib import Path

corpus = []
word_list = []

def read_file(directory,doc_name):
    global corpus
    global word_list
    print(corpus)
    
    #join folder and file name with pathlib and read the file as UTF-8 text
    file_to_open = Path(directory) / doc_name
    with open(file_to_open, "r", encoding="utf-8") as f:
        corpus.append(f.read())
    word_list.append(doc_name)

    return corpus


directory = "/Users/jonathan/Library/Mobile Documents/com~apple~CloudDocs/Master/Foundations of Data Science/Assignment/Final Assignment/"
read_file(directory, "doc1.txt")
read_file(directory,"doc2.txt")
read_file(directory,"doc3.txt")
print(corpus)
print(word_list)



[]
['Hello, world!']
['Hello, world!', 'Bye, World!']
['Hello, world!', 'Bye, World!', '"“¶¢[[Hello Mars$%"']
['doc1.txt', 'doc2.txt', 'doc3.txt']
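
Instead of calling `read_file` once per document, all `.txt` files in a folder could be collected in one pass. The sketch below is an assumption about how that might be done: `load_corpus` is a hypothetical helper, and a temporary folder with two made-up files stands in for the real directory.

```python
import tempfile
from pathlib import Path

def load_corpus(folder):
    # Read every .txt file in the folder, sorted by file name for a stable order
    names, texts = [], []
    for path in sorted(Path(folder).glob("*.txt")):
        names.append(path.name)
        texts.append(path.read_text(encoding="utf-8"))
    return names, texts

# Demonstration on a temporary folder (illustrative files only)
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "doc1.txt").write_text("Hello, world!", encoding="utf-8")
    Path(tmp, "doc2.txt").write_text("Bye, World!", encoding="utf-8")
    names, texts = load_corpus(tmp)

print(names)  # ['doc1.txt', 'doc2.txt']
print(texts)  # ['Hello, world!', 'Bye, World!']
```

Returning the names and texts avoids the module-level `corpus` and `word_list` globals, which otherwise keep growing if the cell is run twice.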


In [133]:
dictionary = set()

def process_text(text):
  # extract words from string and return in lower case
  return re.findall(r'\w+', text.lower())

def create_dictionary_1(corpus):
  global dictionary
  
  #iterate through all documents from the corpus
  for document in corpus:  
    #extract all words in lower case from the string
    words = process_text(document)
    words_doc = set(words)
    #print(words)
    dictionary = dictionary.union(words_doc)
    
    #iterate through the words of each document and append to dictionary when not already in dictionary
    ## instead of loop, do set(words) and then union this set with the dictionary set
    #for word in words:
    #  if word not in dictionary:
    #    dictionary.append(word)
  #dictionary = sorted(dictionary)
  return dictionary 


corpus = ["Hello, World!", "Bye, World?!", "“¶¢[[Hello Mars$%", "“¶¢[[Hello Mars$%", "“¶¢[[Hello Marses$%"]

create_dictionary_1(corpus)

{'bye', 'hello', 'mars', 'marses', 'world'}
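
Because `create_dictionary_1` updates a module-level `dictionary`, words from earlier corpora stay in the set when the function is called again. A variant that returns a fresh set on every call could look like this sketch; `build_dictionary` is a hypothetical name, and the example corpus is simplified.

```python
import re

def build_dictionary(corpus):
    # Union of the lower-cased word lists of all documents; no global state involved
    return set().union(*(re.findall(r"\w+", doc.lower()) for doc in corpus))

# Illustrative corpus
words = build_dictionary(["Hello, World!", "Bye, World?!", "Hello Mars"])

print(sorted(words))  # ['bye', 'hello', 'mars', 'world']
```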