# Final Exam Project Question 3

Create a Python program that computes the similarity between text documents. Your implementation takes a list of documents as the input text corpus and computes
a dictionary of words for that corpus. Later, when a new document (i.e., a search document) is
provided, your implementation should return a list of documents that are similar to the search
document, in descending order of their similarity to the search document.
To compute the similarity between any two documents in this question, you can use the following
distance measures (optionally, you can use any other measure as well).

1. dot product between the two vectors
2. distance norm (or Euclidean distance) between the two vectors, e.g. || u − v ||

As part of answering the question, you can also compare the two methods (or any other
measure you have used) and comment on which one performs better and why.
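
To make the comparison concrete, the following sketch evaluates the dot product, the Euclidean distance, and the cosine similarity on a few hand-made count vectors. The helper name `compare_measures` and the example vectors are illustrative assumptions and not part of the assignment. Cosine similarity is included because it is the dot product divided by the two vector norms.

```python
import math

def compare_measures(u, v):
    """Return (dot product, Euclidean distance, cosine similarity) for two equal-length vectors."""
    dot_uv = sum(a * b for a, b in zip(u, v))
    euclid = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Cosine similarity is undefined for a zero vector; 0.0 is used here by convention
    cosine = dot_uv / (norm_u * norm_v) if norm_u and norm_v else 0.0
    return dot_uv, euclid, cosine

# Hypothetical word counts over a three-word vocabulary
query = [1, 1, 0]
documents = {
    "short_doc": [1, 1, 0],    # same words as the query
    "long_doc": [10, 10, 0],   # same word proportions, ten times longer
    "other_doc": [0, 0, 1],    # no word in common with the query
}

for name, doc in documents.items():
    dot_uv, euclid, cosine = compare_measures(query, doc)
    print(name, dot_uv, round(euclid, 2), round(cosine, 2))
```

In this toy setting the dot product grows with document length (20 for `long_doc` against 2 for `short_doc`), and the Euclidean distance places `long_doc` (about 12.73) farther from the query than the unrelated `other_doc` (about 1.73). Cosine similarity rates `short_doc` and `long_doc` as equally similar, which suggests that a length-normalized measure is often the more robust choice for raw count vectors.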

##### Relevant libraries

In [1]:
# import relevant libraries 

from numpy import dot #  to calculate dot-product
from numpy.linalg import norm # to calculate norm of vector
import string # for punctuation removal
import pandas as pd # to create dataframe

##### Class text_document_similarity()

In [2]:
class text_document_similarity():
    '''
    This class contains the functions document_term_matrix() and text_by_distance()
    '''

    @staticmethod
    def document_term_matrix(text_corpus: list):
        ''' 
        Function creates a document-term-matrix out of a list of documents as a pandas dataframe

        Input: 
        text_corpus: list of documents, i.e., list of strings (a list with a single document is also accepted, since only the document-term-matrix is created)

        Preprocessing of text input includes: 
        Standardization to lowercase characters, punctuation removal
        Stopword removal and stemming are not included because they would require NLP libraries
        
        Tokenization:
        Split each document into single words and count how many times each word appears in that document

        Document-term-matrix: 
        Words as columns and the different docs as rows
        The values are counted appearances of each word in that specific document
        
        Output:
        Document-term-matrix as pandas dataframe
        '''
        # check first that input is correct
        try:
            assert isinstance(text_corpus, list), "text_corpus parameter: Input must be a list."
            assert all(isinstance(i, str) for i in text_corpus), "text_corpus parameter: Not all elements in the list are strings."

            # preprocessing of text input
            # putting processed text input into new variable text_proc
            # not considered: stopword removal, stemming etc. as NLP libraries have to be used
            text_proc = text_corpus.copy()

            # lower case of characters
            for i in range(len(text_proc)):
                text_proc[i] = text_proc[i].lower()

            # punctuation removal 
            for i in range(len(text_proc)):
                text_proc[i] = text_proc[i].translate(str.maketrans('', '', string.punctuation))

            # tokenize text input, only single words are considered
            tokens = []
            for i in range(len(text_proc)):
                sub_tokens = text_proc[i].split() # split each text of text input by spaces and add it to a list
                for j in range(len(sub_tokens)): 
                    tokens.append(sub_tokens[j]) # append all identified tokens to the final token list

            # remove duplicates from the token list while keeping the order of first appearance
            tokens_unique = list(dict.fromkeys(tokens))

            # creation of document-term-matrix (DTM): 
            # tokens are our columns, we count the appearance of these tokens in each document 
            dtm = []
            for i in range(len(text_proc)):
                sub_text = text_proc[i].split() # take each individual text input as a list of their contained words
                sub_dtm = [] # temporary variable which will be added to dtm
                for j in range(len(tokens_unique)): # go through each token once
                    count = sub_text.count(tokens_unique[j]) # count the word appearances of tokens in one individual text input
                    sub_dtm.append(count) # add the appearance to the sub_dtm
                dtm.append(sub_dtm) # final dtm by adding all sub_dtm in one list

            # converting the dtm to a pandas dataframe
            dtm = pd.DataFrame(dtm)

            # adding the column names which are the tokens
            dtm.columns = tokens_unique

            # return the document-term-matrix 
            return dtm

        except AssertionError as msg:
            # invalid input: print the reason (the function then returns None)
            print(msg)

                
    @staticmethod
    def text_by_distance(new_text: list, text_corpus: list, distance_measure="cosine"):
        '''
        Function that returns original text corpus in a sequence depending on the distance to a search document

        Input:
          new_text: New text document, i.e., list with one string
          text_corpus: Text corpus (initial text input), i.e., list of strings (for practical reasons this list needs more than 1 text, otherwise this function makes no sense)
          distance_measure: Distance measure, either "cosine" (cosine similarity, the default) or "euclidean" (Euclidean distance); the value must be lowercase
        Cosine similarity: higher values are better (more similar)
        Euclidean distance: smaller values are better (more similar)

        Output:
          list_of_text: Documents of the text corpus, ordered by their similarity to the new text input
        Most similar document first, least similar document last (descending order of similarity)
        Each document is prefixed with its ranking, for example, 1 means most similar to the search document
        '''
        # check that input is correct
        try:
            assert isinstance(new_text, list), "new_text parameter: Input must be a list."
            assert len(new_text)==1, "new_text parameter: Only one string in the list is allowed."
            assert isinstance(new_text[0], str), "new_text parameter: Element in list must be a string."
            
            assert distance_measure in ("cosine","euclidean"), "Check distance measure input."
            
            assert isinstance(text_corpus, list), "text_corpus parameter: Input must be a list."
            assert len(text_corpus)>1, "text_corpus parameter: More than one string should be used as input."
            assert all(isinstance(i, str) for i in text_corpus), "text_corpus parameter: Not all elements in the list are strings."

            # same processes to create the document-term-matrix
            dtm2 = text_document_similarity.document_term_matrix(new_text) 
            
            # get original text corpus dtm
            dtm = text_document_similarity.document_term_matrix(text_corpus)

            # express the new text input as a count vector over the columns (tokens) of the original corpus dtm
            # tokens of the corpus that do not appear in the new text get a count of 0
            vector_new_text = []
            for column in dtm:
                if column in dtm2.columns:
                    vector_new_text.append(dtm2[column][0])
                else:
                    vector_new_text.append(0) 

            # create a dictionary that maps each document index to its distance to the new text input
            doc_dic = {}

            # distance measure cosine similarity
            if distance_measure=="cosine": 
                for i in range(dtm.shape[0]):
                    # for loop going through each document and calculating the distance to new text input
                    # putting doc as key and distance as value
                    a = dtm.iloc[i] # looping through all documents, one document per loop
                    b = vector_new_text # comparing distance to new search document
                    # if the norm of a or b is 0, use 0 as cosine similarity (lowest possible similarity) because you cannot divide by 0
                    # the norm can be 0, for example, when no token of the search document appears in the original text corpus
                    if (norm(a)==0) or (norm(b)==0):
                        cosine = 0
                    else:
                        cosine = dot(a,b)/(norm(a)*norm(b))
                    doc_dic[i] = cosine

            # distance measure Euclidean distance
            if distance_measure=="euclidean":
                for i in range(dtm.shape[0]): 
                    # for loop going through each document and calculating the distance to new text input
                    # putting doc as key and distance as value
                    a = dtm.iloc[i]
                    b = vector_new_text
                    dist = 0
                    for j in range(dtm.shape[1]): 
                        dist += (b[j] - a.iloc[j]) ** 2 # use .iloc for positional access because the row is labeled by tokens
                    dist = dist ** 0.5
                    doc_dic[i] = dist # add doc as key and dist as value

            # data frame with original text corpus index as index and dist to new text input as column
            docs_sorted_by_dist = pd.DataFrame.from_dict(doc_dic, orient="index", columns=["dist"])

            if distance_measure=="cosine": # cosine similarity: sort from high to low, most similar first
                docs_sorted_by_dist.sort_values("dist", ascending=False, inplace=True)

            if distance_measure=="euclidean": # Euclidean distance: sort from low to high, most similar first
                docs_sorted_by_dist.sort_values("dist", ascending=True, inplace=True)

            # get original text as output by accessing the original text corpus list in the right sequence
            docs_list = docs_sorted_by_dist.index.tolist()
            list_of_text = []
            for i in range(len(docs_sorted_by_dist.index)):
                # for loop accessing the original text corpus list in right sequence according to distance with new text input
                x = text_corpus[docs_list[i]]
                # adding also ranking, distance with used measurement rounded to 2 decimal places and the text 
                list_of_text.append(["Ranking: " + str(i+1) + ", Distance (" + str(distance_measure) + "): " + str(docs_sorted_by_dist["dist"].iloc[i].round(2)) + ", Text: " + x])

            print("This is the text which is compared to the text corpus: ", new_text, "\n")
            return list_of_text
        
        except AssertionError as msg:
            # invalid input: print the reason (the function then returns None)
            print(msg)
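
The construction of the document-term-matrix in the class above can also be sketched with the standard library only. The function name `build_dtm` is hypothetical. It follows the same preprocessing as `document_term_matrix()` (lowercasing, punctuation removal, splitting on whitespace) but returns plain lists instead of a pandas dataframe, so it is a simplified illustration rather than a replacement.

```python
import string
from collections import Counter

def build_dtm(text_corpus):
    """Return (vocabulary, rows), where rows[i][j] counts vocabulary[j] in document i."""
    table = str.maketrans('', '', string.punctuation)
    # lowercase, strip punctuation, split on whitespace
    tokenized = [doc.lower().translate(table).split() for doc in text_corpus]
    # dict.fromkeys keeps the first occurrence of each token and preserves order
    vocabulary = list(dict.fromkeys(tok for doc in tokenized for tok in doc))
    # a Counter returns 0 for tokens that do not occur in a document
    counts = [Counter(doc) for doc in tokenized]
    rows = [[c[tok] for tok in vocabulary] for c in counts]
    return vocabulary, rows

vocabulary, rows = build_dtm(["Dogs are beautiful", "Cool! You made it"])
print(vocabulary)
print(rows)
```

Counting each document once with `Counter` avoids calling `list.count()` for every token, which may matter for larger corpora.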

In [3]:
# test with text input "cosine"
text_corpus = ["Dogs are beautiful", "Hello, what a nice day", "Cool! You made it"]
new_text = ["Are you beautiful?"]

text_document_similarity.text_by_distance(new_text, text_corpus, "cosine")

This is the text which is compared to the text corpus:  ['Are you beautiful?'] 



[['Ranking: 1, Distance (cosine): 0.67, Text: Dogs are beautiful'],
 ['Ranking: 2, Distance (cosine): 0.29, Text: Cool! You made it'],
 ['Ranking: 3, Distance (cosine): 0.0, Text: Hello, what a nice day']]
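
The cosine values above can be checked by hand. After preprocessing, the search document consists of the tokens "are", "you" and "beautiful". It shares two tokens with "Dogs are beautiful", one token with "Cool! You made it" and none with "Hello, what a nice day". The sketch below repeats that arithmetic with count vectors written out manually; the helper name `cosine_similarity` is an assumption made for this illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length count vectors (0.0 if one of them is all zeros)."""
    dot_uv = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot_uv / (norm_u * norm_v) if norm_u and norm_v else 0.0

# vocabulary order: dogs, are, beautiful, hello, what, a, nice, day, cool, you, made, it
query = [0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0]   # "Are you beautiful?"
doc_0 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # "Dogs are beautiful"
doc_1 = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0]   # "Hello, what a nice day"
doc_2 = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]   # "Cool! You made it"

for doc in (doc_0, doc_1, doc_2):
    print(round(cosine_similarity(query, doc), 2))
```

For the first document this is 2 / (√3 · √3) ≈ 0.67, and for the third it is 1 / (√3 · 2) ≈ 0.29, which matches the ranking shown.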

In [4]:
# test with text input "euclidean"
text_corpus1 = ["I am great", "They're very lazy", "This is very very amazing"]
new_text1 = ["That is not bad"]

text_document_similarity.text_by_distance(new_text1, text_corpus1, "euclidean")

This is the text which is compared to the text corpus:  ['That is not bad'] 



[['Ranking: 1, Distance (euclidean): 2.0, Text: I am great'],
 ["Ranking: 2, Distance (euclidean): 2.0, Text: They're very lazy"],
 ['Ranking: 3, Distance (euclidean): 2.45, Text: This is very very amazing']]
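
These distances can also be reproduced by hand. Only the token "is" of the search document appears in the corpus vocabulary, so the tokens "that", "not" and "bad" are ignored by the implementation. The sketch below writes the count vectors out manually and uses `math.dist` from the standard library (Python 3.8 or later) as a stand-in for the loop in `text_by_distance()`.

```python
import math

# vocabulary order: i, am, great, theyre, very, lazy, this, is, amazing
query = [0, 0, 0, 0, 0, 0, 0, 1, 0]   # "That is not bad" (only "is" is in the vocabulary)
doc_0 = [1, 1, 1, 0, 0, 0, 0, 0, 0]   # "I am great"
doc_1 = [0, 0, 0, 1, 1, 1, 0, 0, 0]   # "They're very lazy"
doc_2 = [0, 0, 0, 0, 2, 0, 1, 1, 1]   # "This is very very amazing"

distances = [math.dist(query, doc) for doc in (doc_0, doc_1, doc_2)]
print([round(d, 2) for d in distances])
```

The first two documents share no token with the search document and still tie for rank 1 with a distance of √4 = 2.0, while the only document that contains "is" ranks last with √6 ≈ 2.45 because it is longer and repeats "very". This illustrates how the Euclidean distance on raw counts can reward short documents over documents that actually overlap with the query.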

In [5]:
# test with text input "cosine" (default)
text_corpus2 = ["Dogs are beautiful", "Hello, what a nice day", "Cool! You made it"]
new_text2 = ["Are you beautiful?"]

text_document_similarity.text_by_distance(new_text2, text_corpus2)

This is the text which is compared to the text corpus:  ['Are you beautiful?'] 



[['Ranking: 1, Distance (cosine): 0.67, Text: Dogs are beautiful'],
 ['Ranking: 2, Distance (cosine): 0.29, Text: Cool! You made it'],
 ['Ranking: 3, Distance (cosine): 0.0, Text: Hello, what a nice day']]