<h1 align='center'> CSC4120 Programming Assignment 1 </h1>

## Submission Requirements

   The submission <font color = #FF0000>deadline is January 28 (Sun.), 2024, 11:59 pm</font>. Solutions submitted after the deadline will be graded as 0 points. Please submit an **ipynb** file and clearly state your group members' student IDs. Otherwise, your points will be deducted.

## What you need to do

1. Understand the document distance problem.

2. Understand the Python code and how we improve the algorithm in each step.

3. Implement merge sort and the dictionary version.

## Student IDs

\###

120040025 - Yohandi

120040007 - Andrew Nathanael

\###

In [106]:
import math
import sys
import cProfile
import string

filename_1 = "file3.txt"
filename_2 = "file5.txt"
translation_table = str.maketrans(string.punctuation + string.ascii_uppercase,
                                     " "*len(string.punctuation) + string.ascii_lowercase)

## 1. Initial version of document distance

This program computes the "distance" between two text files as the angle between their word frequency vectors (in radians).

For each input file, a word-frequency vector is computed as follows:

   (1) the specified file is read in

   (2) it is converted into a list of alphanumeric "words"

       Here a "word" is a sequence of consecutive alphanumeric
       characters.  Non-alphanumeric characters are treated as blanks.
       Case is not significant.

   (3) for each word, its frequency of occurrence is determined
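Step (2) can be illustrated on a single line of text. The sketch below rebuilds the notebook's translation table; the sample sentence and the helper name `get_words` are made up for illustration. Only ASCII punctuation is mapped to blanks here, so other non-alphanumeric characters (curly quotes, for instance) would stay attached to the words.

```python
import string

# Same table as in the notebook: ASCII punctuation becomes a blank,
# ASCII upper-case letters become lower-case.
translation_table = str.maketrans(
    string.punctuation + string.ascii_uppercase,
    " " * len(string.punctuation) + string.ascii_lowercase,
)

def get_words(line):
    """Split one line into lower-case words (hypothetical helper name)."""
    return line.translate(translation_table).split()

sample = "Hello, World! It's 2024."  # made-up input line
words = get_words(sample)
print(words)
```

The apostrophe in "It's" is treated as a blank, so the contraction is split into two separate words.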

The "distance" between two vectors is the angle between them.

If $ x = (x_1, x_2, \ldots, x_n) $ is the first vector ($ x_i $ = frequency of word $i$)
and $ y = (y_1, y_2, \ldots, y_n) $ is the second vector,
then the angle between them is defined as:

   $$ d(x,y) = \arccos{\left(\frac{\operatorname*{innerProduct}(x,y)}{\operatorname*{norm}(x) * \operatorname*{norm}(y)}\right)} $$

where:
$$
\begin{cases}
\operatorname*{innerProduct}(x,y) = x_1*y_1 + x_2*y_2 + \cdots + x_n*y_n \\[1em]
\operatorname*{norm}(x) = \sqrt{\operatorname*{innerProduct}(x,x)}
\end{cases}
$$
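As a small worked example of these formulas, the sketch below evaluates the distance for two short frequency vectors written as plain lists of numbers. The vocabulary order and the counts are assumptions chosen for illustration (they match the example in the `inner_product` docstring); the notebook itself stores vectors as (word, frequency) pairs instead.

```python
import math

def inner_product(x, y):
    """Sum of element-wise products of two equal-length frequency vectors."""
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    """Euclidean length of x."""
    return math.sqrt(inner_product(x, x))

def distance(x, y):
    """Angle between x and y, in radians."""
    return math.acos(inner_product(x, y) / (norm(x) * norm(y)))

# Assumed vocabulary order for this example: and, in, of, the, this
x = [3, 0, 2, 5, 0]
y = [4, 1, 1, 0, 2]

print(inner_product(x, y))   # 3*4 + 2*1 = 14
print(distance(x, y))        # angle in radians
```

Because word frequencies are never negative, the inner product is never negative either, so the angle always lies between 0 and $\pi/2$.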

   ***


### What you need to do

Run the code and report the running time.

\###

Running Time: 5.044s

\###

In [107]:
def read_file(filename):
    """ 
    Read the text file with the given filename;
    return a list of the lines of text in the file.
    """
    try:
        with open(filename, 'r', encoding='utf-8') as f:
            return f.readlines()
    except IOError:
        print("Error opening or reading input file: ", filename)
        sys.exit()

def get_words_from_line_list(L):
    """
    Parse the given list L of text lines into words.
    Return list of all words found.
    """
    word_list = []
    for line in L:
        words_in_line = get_words_from_string(line)
        word_list = word_list + words_in_line
    return word_list

def get_words_from_string(line):
    """
    Return a list of the words in the given input string,
    converting each word to lower-case.

    Input:  line (a string)
    Output: a list of strings 
              (each string is a sequence of alphanumeric characters)
    """
    line = line.translate(translation_table)
    word_list = line.split()
    return word_list

def count_frequency(word_list):
    """
    Return a list giving pairs of form: (word,frequency)
    """
    L = []
    for new_word in word_list:
        for entry in L:
            if new_word == entry[0]:
                entry[1] = entry[1] + 1
                break
        else:
            L.append([new_word, 1])
    return L

def word_frequencies_for_file(filename):
    """
    Return a list of (word,frequency) pairs for the given file,
    in order of first occurrence (not sorted).
    """
    line_list = read_file(filename)
    word_list = get_words_from_line_list(line_list)
    freq_mapping = count_frequency(word_list)
    return freq_mapping

def inner_product(L1, L2):
    """
    Inner product between two vectors, where vectors
    are represented as lists of (word,freq) pairs.

    Example: inner_product([["and",3],["of",2],["the",5]],
                           [["and",4],["in",1],["of",1],["this",2]]) = 14.0 
    """
    sum = 0.0
    for word1, count1 in L1:
        for word2, count2 in L2:
            if word1 == word2:
                sum += count1 * count2
    return sum

def vector_angle(L1, L2):
    """
    The inputs are two vectors, each a list of (word,freq) pairs.

    Return the angle between these two vectors.
    """
    numerator = inner_product(L1, L2)
    denominator = math.sqrt(inner_product(L1, L1) * inner_product(L2, L2))
    return math.acos(numerator / denominator)

def docdist1():
    document_vector_1 = word_frequencies_for_file(filename_1)
    document_vector_2 = word_frequencies_for_file(filename_2)
    distance = vector_angle(document_vector_1, document_vector_2)
    print("The distance between the documents is: %0.6f (radians)"%distance)

cProfile.run("docdist1()")

The distance between the documents is: 0.619319 (radians)
         76653 function calls in 5.044 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000    0.005    0.003 573875801.py:1(read_file)
        2    4.324    2.162    4.399    2.199 573875801.py:13(get_words_from_line_list)
    25115    0.019    0.000    0.075    0.000 573875801.py:24(get_words_from_string)
        2    0.609    0.305    0.609    0.305 573875801.py:37(count_frequency)
        2    0.000    0.000    5.014    2.507 573875801.py:51(word_frequencies_for_file)
        3    0.028    0.009    0.028    0.009 573875801.py:61(inner_product)
        1    0.000    0.000    0.029    0.029 573875801.py:76(vector_angle)
        1    0.002    0.002    5.044    5.044 573875801.py:86(docdist1)
        1    0.000    0.000    5.044    5.044 <string>:1(<module>)
        2    0.000    0.000    0.000    0.000 codecs.py:260(__init__)
        2    0.00

## 2. Change concatenate to extend in get_words_from_line_list

Compare the running time, analyze why we get an improvement (or why not), and identify it using `cProfile`.
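The profiles in this notebook are ordered by name, which makes the most expensive function harder to find. A possible alternative, sketched below with only the standard `cProfile` and `pstats` modules, is to sort by `tottime` and print the top few entries. The functions `slow_concat` and `work` are hypothetical stand-ins, not part of the assignment code.

```python
import cProfile
import io
import pstats

def slow_concat(n):
    """Hypothetical hot spot: rebuilds the list on every iteration."""
    words = []
    for _ in range(n):
        words = words + ["w"]
    return words

def work():
    """Hypothetical entry point, standing in for docdist1()."""
    slow_concat(2000)

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Collect the report in a string, sorted by time spent inside each function.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("tottime").print_stats(5)   # show at most 5 entries
report = buffer.getvalue()
print(report)
```

For a quick check, `cProfile.run("work()", sort="tottime")` gives the same ordering in a single call.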

\###

Running Time: 0.800s (improvement made from 5.044s)

As highlighted by the `cProfile` results, the gain comes almost entirely from the two calls to `get_words_from_line_list` (`tottime` column):
- The original code spends 4.324s in them, while
- The changed code spends 0.009s.

If we compare the code side by side:
- The original code uses
```py
    word_list = []
    for line in L:
        words_in_line = get_words_from_string(line)
        word_list = word_list + words_in_line
    return word_list
```
- The changed code uses
```py
    word_list = []
    for line in L:
        words_in_line = get_words_from_string(line)
        word_list.extend(words_in_line)
    return word_list
```

The bottleneck arises because list concatenation is slower than the `extend` method. Here is why:
- Each time $L_1 + L_2$ is evaluated (both are lists), Python must build a new list and copy every element of both operands into it, which costs $\Theta(|L_1| + |L_2|)$. Because `word_list` grows with every line, the loop re-copies all words collected so far on each iteration, so the total cost grows roughly quadratically with the number of words.
- On the other hand, $L_1.\texttt{extend}(L_2)$ appends the elements of $L_2$ to the end of $L_1$ in place, which costs amortized $\Theta(|L_2|)$ because only $L_2$ is iterated over. Summed over all lines, the total cost is linear in the number of words.
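The difference between the two loops can be reproduced in isolation. The following sketch times both variants on synthetic input; the input size and the function names `build_with_concat` and `build_with_extend` are assumptions for illustration, and the measured times will vary by machine.

```python
import timeit

def build_with_concat(lines):
    words = []
    for line in lines:
        words = words + line   # allocates a new list and copies everything so far
    return words

def build_with_extend(lines):
    words = []
    for line in lines:
        words.extend(line)     # appends to the existing list in place
    return words

# Synthetic input (an assumption): 2000 "lines" of 10 words each.
lines = [["w"] * 10 for _ in range(2000)]

t_concat = timeit.timeit(lambda: build_with_concat(lines), number=1)
t_extend = timeit.timeit(lambda: build_with_extend(lines), number=1)
print(f"concat: {t_concat:.4f}s  extend: {t_extend:.4f}s")
```

Both functions return the same list; only the amount of copying differs.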

\###



In [108]:
def get_words_from_line_list(L):
    """
    Parse the given list L of text lines into words.
    Return list of all words found.
    """
    word_list = []
    for line in L:
        words_in_line = get_words_from_string(line)
        word_list.extend(words_in_line)
    return word_list

def docdist2():
    document_vector_1 = word_frequencies_for_file(filename_1)
    document_vector_2 = word_frequencies_for_file(filename_2)
    distance = vector_angle(document_vector_1, document_vector_2)
    print("The distance between the documents is: %0.6f (radians)"%distance)

cProfile.run("docdist2()")

The distance between the documents is: 0.619319 (radians)
         101768 function calls in 0.800 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.009    0.005    0.053    0.026 2488429402.py:1(get_words_from_line_list)
        1    0.002    0.002    0.800    0.800 2488429402.py:12(docdist2)
        2    0.000    0.000    0.006    0.003 573875801.py:1(read_file)
    25115    0.009    0.000    0.041    0.000 573875801.py:24(get_words_from_string)
        2    0.706    0.353    0.706    0.353 573875801.py:37(count_frequency)
        2    0.000    0.000    0.765    0.382 573875801.py:51(word_frequencies_for_file)
        3    0.033    0.011    0.033    0.011 573875801.py:61(inner_product)
        1    0.000    0.000    0.033    0.033 573875801.py:76(vector_angle)
        1    0.000    0.000    0.800    0.800 <string>:1(<module>)
        2    0.000    0.000    0.000    0.000 codecs.py:260(__init__)
        2    0.

## 3. Sort the document vector

Compare the running time, analyze why we get an improvement (or why not), and identify it using `cProfile`.

\###

Running Time: 0.724s (improvement made from 0.800s)

As highlighted by the `cProfile` results, the performance bottleneck is still the `count_frequency` function, which this step does not change:
- With the unsorted document vector, `count_frequency` takes 0.706s.
- With the sorted document vector, `count_frequency` takes 0.643s.

Since the function is identical in both runs, this difference is most likely run-to-run variation rather than an effect of sorting. The part that sorting actually speeds up is `inner_product`, whose three calls drop from 0.033s to 0.001s. This more than offsets the additional 0.025s spent in insertion sort, for the following reasons:
- With both lists sorted alphabetically, `inner_product` can walk them in a single merge-like pass, using at most $|L_1| + |L_2|$ loop steps instead of the $|L_1| \times |L_2|$ word comparisons of the nested loops.
- Ordering lets each comparison rule out many pairs at once: if $L_1[i]$ precedes $L_2[j]$ alphabetically, it also precedes every later word in $L_2$ (transitivity), so it cannot match any of them.
- The pass terminates as soon as either list is exhausted, because none of the remaining words in the other list can have a match.
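To see the reduction in comparisons on a concrete input, the sketch below instruments both versions of the inner product with a counter. The function names `count_nested` and `count_merged` are hypothetical; the two input lists are taken from the example in the `inner_product` docstring.

```python
def count_nested(L1, L2):
    """Nested-loop inner product; also returns the number of word comparisons."""
    total, comparisons = 0, 0
    for word1, count1 in L1:
        for word2, count2 in L2:
            comparisons += 1
            if word1 == word2:
                total += count1 * count2
    return total, comparisons

def count_merged(L1, L2):
    """Merge-like inner product over sorted lists; also returns the number of loop steps."""
    total, steps = 0, 0
    i = j = 0
    while i < len(L1) and j < len(L2):
        steps += 1
        if L1[i][0] == L2[j][0]:
            total += L1[i][1] * L2[j][1]
            i += 1
            j += 1
        elif L1[i][0] < L2[j][0]:
            i += 1
        else:
            j += 1
    return total, steps

L1 = [["and", 3], ["of", 2], ["the", 5]]
L2 = [["and", 4], ["in", 1], ["of", 1], ["this", 2]]
print(count_nested(L1, L2))   # (14, 12): every pair is compared
print(count_merged(L1, L2))   # (14, 4): stops once L1 is exhausted
```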

\###


In [109]:
def insertion_sort(A):
    """
    Sort list A into order, in place.

    From Cormen/Leiserson/Rivest/Stein,
    Introduction to Algorithms (second edition), page 17,
    modified to adjust for fact that Python arrays use 
    0-indexing.
    """
    for j in range(len(A)):
        key = A[j]
        # insert A[j] into sorted sequence A[0..j-1]
        i = j - 1
        while i > -1 and A[i] > key:
            A[i + 1] = A[i]
            i = i - 1
        A[i + 1] = key
    return A
    
def word_frequencies_for_file(filename):
    """
    Return alphabetically sorted list of (word,frequency) pairs 
    for the given file.
    """
    line_list = read_file(filename)
    word_list = get_words_from_line_list(line_list)
    freq_mapping = count_frequency(word_list)
    insertion_sort(freq_mapping)
    return freq_mapping

def inner_product(L1, L2):
    """
    Inner product between two vectors, where vectors
    are represented as alphabetically sorted (word,freq) pairs.

    Example: inner_product([["and",3],["of",2],["the",5]],
                           [["and",4],["in",1],["of",1],["this",2]]) = 14.0 
    """
    sum = 0.0
    i = 0
    j = 0
    while i < len(L1) and j < len(L2):
        # L1[i:] and L2[j:] yet to be processed
        if L1[i][0] == L2[j][0]:
            # both vectors have this word
            sum += L1[i][1] * L2[j][1]
            i += 1
            j += 1
        elif L1[i][0] < L2[j][0]:
            # word L1[i][0] is in L1 but not L2
            i += 1
        else:
            # word L2[j][0] is in L2 but not L1
            j += 1
    return sum

def docdist3():
    sorted_word_list_1 = word_frequencies_for_file(filename_1)
    sorted_word_list_2 = word_frequencies_for_file(filename_2)
    distance = vector_angle(sorted_word_list_1, sorted_word_list_2)
    print("The distance between the documents is: %0.6f (radians)"%distance)

cProfile.run("docdist3()")

The distance between the documents is: 0.619319 (radians)
         105521 function calls in 0.724 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.008    0.004    0.048    0.024 2488429402.py:1(get_words_from_line_list)
        2    0.025    0.012    0.025    0.012 2713057197.py:1(insertion_sort)
        2    0.000    0.000    0.721    0.361 2713057197.py:20(word_frequencies_for_file)
        3    0.001    0.000    0.001    0.000 2713057197.py:31(inner_product)
        1    0.002    0.002    0.724    0.724 2713057197.py:57(docdist3)
        2    0.000    0.000    0.005    0.003 573875801.py:1(read_file)
    25115    0.008    0.000    0.037    0.000 573875801.py:24(get_words_from_string)
        2    0.643    0.321    0.643    0.322 573875801.py:37(count_frequency)
        1    0.000    0.000    0.001    0.001 573875801.py:76(vector_angle)
        1    0.000    0.000    0.724    0.724 <string>:1(<module>)
     

## 4. Change sorting from insertion sort to merge sort

Implement merge sort.

Compare the running time, analyze why we get an improvement (or why not), and identify it using `cProfile`.

\###

Running Time: 0.699s (improvement made from 0.724s)

The previous step paid 0.025s for sorting because insertion sort runs in $\mathcal{O}(N^2)$ time in the worst case. Merge sort runs in $\mathcal{O}(N \log N)$ time and reduces the sorting cost to 0.004s (0.005s cumulative). The effect on the total running time is small, since `count_frequency` still dominates at 0.634s. For the sorting step itself, however, the drop from 0.025s to 0.004s is roughly a sixfold speedup.
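The gap between the two algorithms follows from the standard textbook analysis (not from anything measured in this notebook). Merge sort splits the list in half, sorts each half recursively, and merges the halves in linear time, which gives the recurrence

$$ T(N) = 2\,T(N/2) + \Theta(N) \quad\Longrightarrow\quad T(N) = \Theta(N \log N) $$

whereas insertion sort needs up to $N(N-1)/2$ comparisons when the input is in reverse order. For a hypothetical list of $N = 1000$ words, that is

$$ \frac{N(N-1)}{2} = 499500 \qquad \text{versus} \qquad N \log_2 N \approx 9966 $$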

\###

In [112]:
def merge_sort(A):
    """
    Sort list A of [word, freq] pairs by word, in place, and return A.
    """
    if len(A) > 1:
        mid = len(A) // 2
        left_half = A[:mid]
        right_half = A[mid:]

        merge_sort(left_half)
        merge_sort(right_half)

        i = j = k = 0

        while i < len(left_half) and j < len(right_half):
            if left_half[i][0] < right_half[j][0]:
                A[k] = left_half[i]
                i += 1
            else:
                A[k] = right_half[j]
                j += 1
            k += 1

        while i < len(left_half):
            A[k] = left_half[i]
            i += 1
            k += 1

        while j < len(right_half):
            A[k] = right_half[j]
            j += 1
            k += 1

    return A

def word_frequencies_for_file(filename):
    """
    Return alphabetically sorted list of (word,frequency) pairs 
    for the given file.
    """
    line_list = read_file(filename)
    word_list = get_words_from_line_list(line_list)
    freq_mapping = count_frequency(word_list)
    freq_mapping = merge_sort(freq_mapping)
    return freq_mapping

def docdist3():
    sorted_word_list_1 = word_frequencies_for_file(filename_1)
    sorted_word_list_2 = word_frequencies_for_file(filename_2)
    distance = vector_angle(sorted_word_list_1, sorted_word_list_2)
    print("The distance between the documents is: %0.6f (radians)"%distance)

cProfile.run("docdist3()")

The distance between the documents is: 0.619319 (radians)
         132390 function calls (130296 primitive calls) in 0.699 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.009    0.004    0.050    0.025 2488429402.py:1(get_words_from_line_list)
   2096/2    0.004    0.000    0.005    0.002 2528663488.py:1(merge_sort)
        2    0.000    0.000    0.695    0.348 2528663488.py:36(word_frequencies_for_file)
        1    0.002    0.002    0.699    0.699 2528663488.py:47(docdist3)
        3    0.001    0.000    0.001    0.000 2713057197.py:31(inner_product)
        2    0.000    0.000    0.006    0.003 573875801.py:1(read_file)
    25115    0.009    0.000    0.039    0.000 573875801.py:24(get_words_from_string)
        2    0.634    0.317    0.634    0.317 573875801.py:37(count_frequency)
        1    0.000    0.000    0.001    0.001 573875801.py:76(vector_angle)
        1    0.000    0.000    0.699    0.699 <stri

## 5. Use dictionaries instead of lists

Implement the algorithm using dictionaries instead of lists. 

Analyze why we get an improvement and identify it using `cProfile`.

\### 

Running Time: 0.093s (improvement made from 0.699s)

In Python, dictionaries are implemented as hash tables, a data structure that maps keys to values by using a hash function to compute an index. Looking up or updating an entry in a dictionary takes $\mathcal{O}(1)$ time on average. As highlighted by `cProfile`, `count_frequency`, which dominated the running time of the list-based code:
- Runs in 0.634s in the code that uses lists.
- Runs in 0.024s (0.037s cumulative) in the code that uses dictionaries.

When using lists, the function performs a linear search through the list of pairs for each word in `word_list`. In the worst case, where every word is unique, each search costs $\mathcal{O}(N)$, for a total of $N \times \mathcal{O}(N) = \mathcal{O}(N^2)$. With dictionaries, each of the $N$ words needs only an expected $\mathcal{O}(1)$ operation, for a total of $N \times \mathcal{O}(1) = \mathcal{O}(N)$. Sorting is also no longer needed, since `inner_product` looks up each word directly in the other dictionary.
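The two counting strategies can be compared side by side on a tiny input. The sketch below copies the list-based and dictionary-based loops from this notebook (the name `count_frequency_list` is introduced here only to keep them apart) and also checks them against `collections.Counter` from the standard library, which is a possible shortcut and not what the assignment asks for. The sample sentence is made up.

```python
from collections import Counter

def count_frequency_list(word_list):
    """List-based counting: one linear search through L per word."""
    L = []
    for new_word in word_list:
        for entry in L:
            if new_word == entry[0]:
                entry[1] += 1
                break
        else:
            L.append([new_word, 1])
    return L

def count_frequency(word_list):
    """Dictionary-based counting: one hash lookup per word."""
    word_frequency = {}
    for word in word_list:
        word_frequency[word] = word_frequency.get(word, 0) + 1
    return word_frequency

words = "the cat and the hat and the bat".split()  # made-up sample
by_list = count_frequency_list(words)
by_dict = count_frequency(words)
print(by_list)
print(by_dict)
print(dict(by_list) == by_dict, dict(Counter(words)) == by_dict)
```

All three approaches agree on the counts; they differ only in how much work each lookup costs.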

\###

In [117]:
def read_file(filename):
    """ 
    Read the text file with the given filename;
    return a list of the lines of text in the file.
    """
    try:
        with open(filename, 'r', encoding='utf-8') as f:
            return f.readlines()
    except IOError:
        print("Error opening or reading input file: ", filename)
        sys.exit()

def get_words_from_line_list(L):
    """
    Parse the given list L of text lines into words.
    Return list of all words found.
    """
    word_list = []
    for line in L:
        words_in_line = get_words_from_string(line)
        word_list.extend(words_in_line)
    return word_list

def get_words_from_string(line):
    """
    Return a list of the words in the given input string,
    converting each word to lower-case.

    Input:  line (a string)
    Output: a list of strings 
              (each string is a sequence of alphanumeric characters)
    """
    line = line.translate(translation_table)
    word_list = line.split()
    return word_list

def count_frequency(word_list):
    """
    Input a list of words
    Return a DICTIONARY of (word, frequency) pairs
    """
    word_frequency = {}
    for word in word_list:
        word_frequency[word] = word_frequency.get(word, 0) + 1
    return word_frequency

def word_frequencies_for_file(filename):
    """
    Return dictionary of (word,frequency) pairs for the given file.
    """

    line_list = read_file(filename)
    word_list = get_words_from_line_list(line_list)
    freq_mapping = count_frequency(word_list)
    return freq_mapping

def inner_product(D1, D2):
    """
    Inner product between two vectors, where vectors
    are represented as dictionaries of (word,freq) pairs.

    Example: inner_product({"and":3,"of":2,"the":5},
                           {"and":4,"in":1,"of":1,"this":2}) = 14.0 
    """
    sum = 0.0
    for key in D2:
        if key in D1:
            sum += D1[key] * D2[key]
    return sum

def vector_angle(D1, D2):
    """
    The inputs are two vectors represented as dictionaries of (word,freq) pairs.
    Return the angle between these two vectors.
    """
    numerator = inner_product(D1, D2)
    denominator = math.sqrt(inner_product(D1, D1) * inner_product(D2, D2))
    return math.acos(numerator / denominator)

def docdist5():
    word_dict_1 = word_frequencies_for_file(filename_1)
    word_dict_2 = word_frequencies_for_file(filename_2)
    distance = vector_angle(word_dict_1, word_dict_2)
    print("The distance between the documents is: %0.6f (radians)"%distance)

cProfile.run("docdist5()")

The distance between the documents is: 0.619319 (radians)
         251296 function calls in 0.093 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000    0.008    0.004 3965631452.py:1(read_file)
        2    0.009    0.004    0.046    0.023 3965631452.py:13(get_words_from_line_list)
    25115    0.008    0.000    0.035    0.000 3965631452.py:24(get_words_from_string)
        2    0.024    0.012    0.037    0.018 3965631452.py:37(count_frequency)
        2    0.000    0.000    0.091    0.045 3965631452.py:47(word_frequencies_for_file)
        3    0.000    0.000    0.000    0.000 3965631452.py:57(inner_product)
        1    0.000    0.000    0.000    0.000 3965631452.py:71(vector_angle)
        1    0.002    0.002    0.093    0.093 3965631452.py:80(docdist5)
        1    0.000    0.000    0.093    0.093 <string>:1(<module>)
        2    0.000    0.000    0.000    0.000 codecs.py:260(__init__)
        