Hannah Busshoff, Snorri Petersen and Sebastian Wolf

# Table of Contents {-}

1. Edit Distance \newline
    1.1 Discussion of the problem \newline
    1.2 Proposed algorithm \newline
    1.3 Proof of Correctness 
2. Huffman Codes \newline
    2.1 Discussion of the problem \newline
    2.2 Proposed algorithm \newline
    2.3 Proof of Correctness
3. References

# Edit Distance

## Discussion of the problem

The edit distance is a measure of the similarity of two strings. It counts the minimum number of operations required to transform one string into the other. The operations considered for the edit distance are:

1.	Insert a character
2.	Delete an existing character
3.	Substitute a character by another

When the edit distance is measured using only these three operations, it is also called the 'Levenshtein distance'. The edit distance has many applications, including automatic spelling correction in natural language processing and the comparison of DNA in bioinformatics.
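
To make the three operations concrete, the following minimal sketch applies them to a small example. The helper names `insert`, `delete` and `substitute` are our own illustrative choices and are not part of the assignment code.

```python
# Illustrative helpers (hypothetical names) for the three edit operations.
def insert(s, pos, char):
    """Insert char so that it ends up at index pos."""
    return s[:pos] + char + s[pos:]

def delete(s, pos):
    """Delete the character at index pos."""
    return s[:pos] + s[pos + 1:]

def substitute(s, pos, char):
    """Replace the character at index pos by char."""
    return s[:pos] + char + s[pos + 1:]

# Transform "kitten" into "sitting" with three operations.
word = "kitten"
word = substitute(word, 0, "s")  # k -> s
word = substitute(word, 4, "i")  # e -> i
word = insert(word, 6, "g")      # append g
print(word)
```

Each call counts as one operation, so this particular transformation uses three of them.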

In this assignment we are asked to find the minimum number of operations for two pairs of strings:

1.	DNA where
    a)	X = ACTACTAGATTACTTACGGATCAGGTACTTTAGAGGCTTGCAACCA
    b)	Y = TACTAGCTTACTTACCCATCAGGTTTTAGAGATGGCAACCA
2.	Proteins where
    a)	X = AASRPRSGVPAQSDSDPCQNLAATPIPSRPPSSQSCQKCRADARQGRWGP
    b)	Y = SGAPGQRGEPGPQGHAGAPGPPGPPGSDG

There are potentially many ways to transform one string into another, and trying out all possible combinations to find the lowest-cost option would be prohibitively expensive. To solve the problem more efficiently, we propose an algorithm based on the solution to the longest common subsequence problem that was covered in class. This algorithm falls under the dynamic programming paradigm.

The idea of dynamic programming is to solve a large problem, which has too many potential combinations to enumerate exhaustively, by dividing it into subproblems. In contrast to divide-and-conquer algorithms, the subproblems in dynamic programming overlap: the solution of one subproblem depends on the solutions of others.

The subproblems in this case involve comparing prefixes of the strings rather than the whole strings at once. The program creates a table in which the two strings are compared prefix by prefix, with all three operations (insert, delete and substitute) available. Unless stated otherwise, each operation has a cost of 1.
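
The table-filling rule can be summarised as a recurrence. The notation here is our own assumption for this sketch: $D(i, j)$ is the cost of transforming the first $i$ characters of $X$ into the first $j$ characters of $Y$, $x_i$ and $y_j$ are the $i$-th and $j$-th characters, $c_1$ is the cost of an insertion or deletion, and $c_2$ is the cost of a substitution.

$$
D(i, j) =
\begin{cases}
j \cdot c_1 & \text{if } i = 0, \\
i \cdot c_1 & \text{if } j = 0, \\
D(i-1, j-1) & \text{if } x_i = y_j, \\
\min\{\, D(i, j-1) + c_1,\; D(i-1, j) + c_1,\; D(i-1, j-1) + c_2 \,\} & \text{otherwise.}
\end{cases}
$$

The edit distance is then the entry $D(m, n)$, where $m$ and $n$ are the lengths of $X$ and $Y$.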


## Proposed algorithm

Define a function ***'compare'***, which returns, for a given cell D[m,n] of the matrix D (where m denotes the row index and n the column index), the optimal prior cell from which the algorithm can reach D[m,n] by deletion, insertion, substitution or no manipulation. Optimality is defined as the minimal value among the available prior cells. If there is a tie between prior cells, the algorithm always takes the diagonal.

In [1]:
# Import numpy package to create an array.
import numpy as np

def compare(D, m, n, cost1, cost2):
    # First row: only insertions can lead to this cell.
    if (m == 0) and (n > 0):
        return "I", max(m,0), max(n-1,0)
    # First column: only deletions can lead to this cell.
    elif (n == 0) and (m > 0):
        return "D", max(m-1,0), max(n,0)
    elif (D[m-1, n-1] == D[m, n]) and (
            D[m-1, n-1] <= min(D[m, n-1] + cost1, D[m-1, n] + cost1)):
        return "-", max(m-1,0), max(n-1,0) # nothing
    elif D[m-1, n-1] + cost2 <= min(D[m, n-1], D[m-1, n]) + cost1:
        return "S", max(m-1,0), max(n-1,0) # substitution
    elif D[m-1, n] < D[m, n-1]:
        return "D", max(m-1,0), max(n,0) # deletion
    else:
        # The original condition compared D[m, n-1] with itself and could
        # never be true; insertion is the only remaining case.
        return "I", max(m,0), max(n-1,0) # insertion

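Define a function ***'backtrace'***. The function recovers the optimal solution by iteratively applying the compare function. Its inputs are the matrix D and the indices m and n, where m denotes the highest row index and n the highest column index. The backtrace function stops when it reaches D[0,0], the starting point of the algorithm. Because it walks from D[m,n] back to D[0,0], the returned list of operations is in reverse order, from the end of the strings to their beginning.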

In [2]:
def backtrace(D, m, n, cost1, cost2):
    changes = []  # Initialize an empty list to store the backtrace.
    # Iteratively apply the compare function to reach D[0,0],
    # the starting point of the algorithm.
    while m > 0 or n > 0:
        # Compute the optimal prior cell.
        result = compare(D, m, n, cost1, cost2)
        # Append the optimal move to the list changes.
        changes.append(result[0])
        m = result[1]  # Update row location.
        n = result[2]  # Update column location.
    return changes
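
The compare function decides from the table values alone. As a point of comparison, the following sketch shows a variant that also looks at the characters, so that a match is only reported where the two characters are in fact equal, and that returns the operations in forward order. The names `edit_table` and `backtrace_checked` are hypothetical, and the sketch assumes integer costs so that the equality tests on the table entries are exact.

```python
import numpy as np

def edit_table(s, t, cost1=1, cost2=1):
    """Fill the table; cost1 = insertion/deletion, cost2 = substitution."""
    D = np.zeros((len(s) + 1, len(t) + 1))
    D[:, 0] = np.arange(len(s) + 1) * cost1   # first column: deletions only
    D[0, :] = np.arange(len(t) + 1) * cost1   # first row: insertions only
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                D[i, j] = D[i - 1, j - 1]
            else:
                D[i, j] = min(D[i, j - 1] + cost1,
                              D[i - 1, j] + cost1,
                              D[i - 1, j - 1] + cost2)
    return D

def backtrace_checked(D, s, t, cost1=1, cost2=1):
    """Walk from D[m, n] to D[0, 0], checking each move against the table."""
    i, j = len(s), len(t)
    ops = []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i - 1] == t[j - 1] and D[i, j] == D[i - 1, j - 1]:
            ops.append("-")          # characters match, no cost
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and D[i, j] == D[i - 1, j - 1] + cost2:
            ops.append("S")          # substitution
            i, j = i - 1, j - 1
        elif i > 0 and D[i, j] == D[i - 1, j] + cost1:
            ops.append("D")          # deletion
            i -= 1
        else:
            ops.append("I")          # insertion
            j -= 1
    ops.reverse()                    # report operations from left to right
    return ops

D = edit_table("kitten", "sitting")
print(D[-1, -1], backtrace_checked(D, "kitten", "sitting"))
```

Because every move is checked against the table entry it came from, the costs of the reported operations add up to D[m,n].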

Define a function ***'edit_distance'***, which computes the edit distance and the backtrace, with a cost of cost1 for insertion and deletion and a cost of cost2 for substitution.

In [3]:
def edit_distance(str1, str2, cost1, cost2):
    # Compute the lengths of the input strings.
    m = len(str1)
    n = len(str2)
    # Initialize the matrix.
    D = np.zeros(((m+1), (n+1)))
    # Filling the matrix from top to bottom, by looping over rows and columns.
    for i in range(m+1):
        for j in range(n+1):
            if i == 0:
                D[i, j] = j*cost1  # insertion
            elif j == 0:
                D[i,j] = i*cost1    # deletion
            elif str1[i-1] == str2[j-1]:
                D[i, j] = D[i-1, j-1]
            else:
                D[i, j] = min(D[i, j-1]   + cost1,    # insertion
                              D[i-1, j]   + cost1,    # deletion
                              D[i-1, j-1] + cost2)     # substitution
    # Save the minimal cost in the variable cost.
    cost = D[m, n]
    # Get the backtrace, starting from the bottom-right cell D[m, n].
    changes = backtrace(D, m, n, cost1, cost2)
    return cost, changes
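
When only the cost is needed and not the backtrace, the full matrix does not have to be kept, because each row depends only on the row above it. The following is a minimal sketch of that idea under the same cost convention; the function name `edit_cost_two_rows` is hypothetical and the function is not used elsewhere in this report.

```python
def edit_cost_two_rows(str1, str2, cost1=1, cost2=1):
    """Edit distance using two rows of the table instead of the full matrix."""
    # Row 0: transforming the empty prefix of str1 needs only insertions.
    previous = [j * cost1 for j in range(len(str2) + 1)]
    for i in range(1, len(str1) + 1):
        # Column 0: transforming into the empty prefix needs only deletions.
        current = [i * cost1]
        for j in range(1, len(str2) + 1):
            if str1[i - 1] == str2[j - 1]:
                current.append(previous[j - 1])
            else:
                current.append(min(current[j - 1] + cost1,    # insertion
                                   previous[j] + cost1,       # deletion
                                   previous[j - 1] + cost2))  # substitution
        previous = current
    return previous[-1]

print(edit_cost_two_rows("kitten", "sitting"))
```

This reduces the memory use from one entry per pair of prefixes to two rows of length n+1, at the price of losing the information needed for the backtrace.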

Computing the edit distance and the backtrace for the DNA strings, with costs of 1 for substitution, insertion and deletion.

In [17]:
#Initializing the strings, which shall be analyzed.
X = "ACTACTAGATTACTTACGGATCAGGTACTTTAGAGGCTTGCAACCA"
Y = "TACTAGCTTACTTACCCATCAGGTTTTAGAGATGGCAACCA"

print(edit_distance(X,Y,1,1)[0])
print(edit_distance(X,Y,1,1)[1])

10.0
['-', '-', '-', '-', '-', '-', '-', 'S', '-', 'S', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', 'S', '-', '-', 'S', '-', '-', '-', '-', '-', '-', '-', '-', 'S', '-', '-', '-', '-', '-', '-', 'D', 'D', 'D', 'D', 'D']


Computing the edit distance and the backtrace for the DNA strings, with a cost of 1 for substitution and a cost of 2 for insertion and deletion.

In [5]:
print(edit_distance(X,Y,2,1)[0])
print(edit_distance(X,Y,2,1)[1])

15.0
['-', '-', '-', '-', '-', '-', '-', 'S', '-', 'S', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', 'S', '-', '-', 'S', '-', '-', '-', '-', '-', '-', '-', '-', 'S', '-', '-', '-', '-', '-', '-', 'D', 'D', 'D', 'D', 'D']


Computing the edit distance and the backtrace for the protein strings, with costs of 1 for substitution, insertion and deletion.

In [6]:
#Initializing the strings, which shall be analyzed.
X = "AASRPRSGVPAQSDSDPCQNLAATPIPSRPPSSQSCQKCRADARQGRWGP"
Y = "SGAPGQRGEPGPQGHAGAPGPPGPPGSDG"

print(edit_distance(X,Y,1,1)[0])
print(edit_distance(X,Y,1,1)[1])

37.0
['-', '-', 'S', '-', '-', 'S', 'S', '-', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', '-', '-', 'S', '-', 'S', 'S', 'S', '-', 'S', '-', '-', '-', '-', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D']


Computing the edit distance and the backtrace for the protein strings, with a cost of 1 for substitution and a cost of 2 for insertion and deletion.

In [7]:
print(edit_distance(X,Y,2,1)[0])
print(edit_distance(X,Y,2,1)[1])

58.0
['-', '-', 'S', '-', '-', 'S', 'S', '-', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', '-', '-', 'S', '-', 'S', 'S', 'S', '-', 'S', '-', '-', '-', '-', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D']


## Proof of Correctness

In solving the given problems, our algorithm considers every available operation (insertion, deletion and substitution) at each step. The subproblems are finite in number, and the solution for a cell (i, j) can only build on the three adjacent subproblems (i-1, j), (i, j-1) and (i-1, j-1). The algorithm returns a correct solution for the problems above and for their subproblems, as the results above show, so it can apply each of the possible operations successfully. For a larger problem (longer text strings) we can expect the correct outcome as well, because the same recurrence simply runs for more steps until it reaches the final cell. In a longer chain of subproblems, neither the nature of the subproblems nor the available operations change, and neither does the algorithm's ability to apply them.

We can equally say that if a person can take an initial step in a staircase there is nothing technically lacking in order to take the remaining steps.
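
The argument above can be complemented by an empirical check. The sketch below compares the table-based computation with a direct recursive definition that always takes the minimum over all three moves, on short random strings with unit costs. The function names, the alphabet and the number of trials are our own assumptions, and agreement on random inputs is supporting evidence rather than a proof.

```python
import random
from functools import lru_cache

def dp_distance(s, t):
    """Bottom-up table with the shortcut for matching characters (unit costs)."""
    D = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        for j in range(len(t) + 1):
            if i == 0:
                D[i][j] = j
            elif j == 0:
                D[i][j] = i
            elif s[i - 1] == t[j - 1]:
                D[i][j] = D[i - 1][j - 1]
            else:
                D[i][j] = 1 + min(D[i][j - 1], D[i - 1][j], D[i - 1][j - 1])
    return D[len(s)][len(t)]

def reference_distance(s, t):
    """Recursive definition: minimum over deletion, insertion and diagonal move."""
    @lru_cache(maxsize=None)
    def go(i, j):
        if i == 0:
            return j
        if j == 0:
            return i
        mismatch = 0 if s[i - 1] == t[j - 1] else 1
        return min(go(i - 1, j) + 1, go(i, j - 1) + 1, go(i - 1, j - 1) + mismatch)
    return go(len(s), len(t))

def agree_on_random_strings(trials=200, max_len=8, alphabet="ACGT", seed=0):
    """Return True if both functions agree on all randomly drawn string pairs."""
    rng = random.Random(seed)
    for _ in range(trials):
        s = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, max_len)))
        t = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, max_len)))
        if dp_distance(s, t) != reference_distance(s, t):
            return False
    return True

print(agree_on_random_strings())
```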

# Huffman Codes

## Discussion of the problem

## Proposed algorithm

The ***'frequency_count'*** function takes a text string as input and creates a dictionary (frequency_table) with the characters found in the lowercased text as keys and their absolute frequencies as values. A for-loop runs through the string and checks whether the current character is already a key in the dictionary; if not, it adds the key with a frequency of 1, and if it is, it increments its value. The dictionary is then converted into a list of (frequency, character) tuples, which is sorted by frequency. This sorted list is returned as output.

The cost of this operation is O(n), because the loop runs over the n characters of the text and each dictionary lookup has constant cost. Sorting the list adds only a constant, given that the size of the alphabet is fixed.

In [8]:
def frequency_count(text):
    # initialise a dictionary - keys: characters, values: frequency
    frequency_table = {}
    # loop through the string
    for i in text.lower():
        # check whether character is already in dictionary
        if i not in frequency_table:
            # if not in dictionary, add it with frequency 1
            frequency_table[str(i)] = 1
        else:
            # if it is in dictionary, augment its frequency by 1
            frequency_table[str(i)] = frequency_table[str(i)] + 1
    # swap keys and values, transform into list, and sort the list
    frequency_table = sorted([(v, k) for k, v in frequency_table.items()])
    return frequency_table
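
For comparison, the same counting step can be sketched with `collections.Counter` from the Python standard library. The function name `frequency_count_counter` is hypothetical; the sketch is meant to return the same sorted list of (frequency, character) tuples as the function above.

```python
from collections import Counter

def frequency_count_counter(text):
    """Count lowercased characters and return sorted (frequency, character) tuples."""
    counts = Counter(text.lower())
    return sorted((freq, char) for char, freq in counts.items())

print(frequency_count_counter("Hello"))
# → [(1, 'e'), (1, 'h'), (1, 'o'), (2, 'l')]
```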

The ***'build_tree'*** function first calls the ***'frequency_count'*** function on the text input string it is given, and then builds a Huffman tree from the result. The function works through the list from lowest to highest frequency, at each step creating a new node from the two entries with the lowest frequency. The node is created as a nested list with two entries: the sum of the two children's frequencies, and a list with the two children. The node is then inserted back into the list that its children were taken from, at a position chosen by a for-loop that compares the node's value with each remaining entry until it finds one that is not smaller. This repeats as long as the top-level list is longer than 1. When it reaches length 1, the tree is complete and returned as a nested list.

Apart from the O(n) call to frequency_count, the cost of this operation is constant, because it depends only on the length of the list it processes, which in turn depends only on the size of the alphabet and is thus fixed.

In [9]:
def build_tree(text):
    # call frequency_count function and assign to tree
    tree = frequency_count(text)
    # loop over the list, until its length is 1
    while len(tree) > 1:
        # pop out the two lowest value nodes and merge into new node
        node = [tree[0][0] + tree[1][0], [tree.pop(0), tree.pop(0)]]
        # default position: the end of the list, used when the new node
        # is larger than every remaining entry (or the list is empty)
        index = len(tree)
        # check where to place the new node
        for i in range(len(tree)):
            # stop at the first entry that is not smaller than the node
            if node[0] <= tree[i][0]:
                index = i
                break
        # insert the node at the index we found in the for loop
        tree.insert(index, node)
    return tree
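
An alternative way to keep the list ordered is a priority queue. The following sketch uses `heapq` from the standard library. It is not the implementation used in this report, the names `build_tree_heap` and `code_lengths` are hypothetical, and it assumes a non-empty text. The tree is represented as nested (left, right) tuples with characters as leaves.

```python
import heapq
from collections import Counter
from itertools import count

def build_tree_heap(text):
    """Build a Huffman tree as nested (left, right) tuples; leaves are characters."""
    tiebreak = count()  # unique sequence numbers, so the heap never compares subtrees
    heap = [(freq, next(tiebreak), char)
            for char, freq in Counter(text.lower()).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # pop the two entries with the lowest frequency and merge them
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    return heap[0][2]

def code_lengths(tree, depth=0):
    """Map each character to its depth in the tree, which is its code length."""
    if isinstance(tree, str):
        return {tree: max(depth, 1)}  # a single-character text still needs one bit
    left, right = tree
    lengths = code_lengths(left, depth + 1)
    lengths.update(code_lengths(right, depth + 1))
    return lengths

sample = "a" * 5 + "b" * 9 + "c" * 12 + "d" * 13 + "e" * 16 + "f" * 45
lengths = code_lengths(build_tree_heap(sample))
total_bits = sum(Counter(sample)[char] * lengths[char] for char in lengths)
print(lengths, total_bits)
```

With a heap, each merge costs logarithmic time in the number of remaining nodes instead of a linear scan for the insert position.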

The ***'climb_tree'*** function first calls the ***'build_tree'*** function on the text input it is given, gets a Huffman tree in the form of a nested list, and then assigns a code to each leaf of the tree. The function climbs the tree from its root, taking left and right steps. While climbing, the function builds a dictionary with the characters from the leaves as keys and their code assignments as values. The dictionary is updated at every step that encounters a leaf of the tree, which is detected with an if condition that checks whether the node contains a string. If a node does not contain a string, this means that another list is nested inside the node, and that we are on a branch rather than a leaf of the tree. In the case of a branch, the function continues climbing with right and left steps. Right steps add a 1 to the code string, left steps add a 0 to the code string.

Apart from the call to build_tree, the cost of this operation is constant, because it depends only on the length of the nested list it receives as an input. The length and depth of the nested list in turn depend only on the size of the alphabet, which, again, is fixed.

In [10]:
def climb_tree(text):
    # call the build_tree function on the text to get a Huffman tree
    tree = build_tree(text)[0][1]
    # initialise a code variable to store codes we will assign
    code = str()
    # initialise a dictionary that will hold our codes
    dictionary = {}

    # define a function that accesses the first element of a node
    def right_step(branch, code):
        # go down to first element
        branch = branch[0][1]
        # add a 1 to the code string
        code = code + str(1)
        # check if we have reached a string rather than a list
        if type(branch) == str:
            # if yes, access the dictionary
            nonlocal dictionary
            # assign the current code to the string found
            dictionary[branch] = code
        else:
            # if not, take both a right and left step
            right_step(branch, code)
            left_step(branch, code)
            
    # define a function that accesses the second element of a node
    def left_step(branch, code):
        # go down to second element
        branch = branch[1][1]
        # add a 0 to the code string
        code = code + str(0)
        # check if we have reached a string rather than a list
        if type(branch) == str:
            # if yes, access the dictionary
            nonlocal dictionary
            # assign the current code to the string found
            dictionary[branch] = code
        else:
            # if not, take both a right and left step
            right_step(branch, code)
            left_step(branch, code)
            
    # these are the first two steps, which go both left and right
    left_step(tree, code)
    right_step(tree, code)
    # return the dictionary with key: string, value: code.
    return dictionary
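
Since every character sits at a leaf of the tree, no code can be the beginning of another code. The following sketch checks this property for a code table; the function name `is_prefix_free` is hypothetical, and the sample table is a subset of the codes obtained for T1 in this report.

```python
from itertools import permutations

def is_prefix_free(codes):
    """Return True if no code in the table is a prefix of a different entry's code."""
    return not any(longer.startswith(shorter)
                   for shorter, longer in permutations(codes.values(), 2))

# A subset of the code assignments for T1.
sample_codes = {' ': '11', 'e': '0001', 't': '1000',
                'a': '0100', 'o': '0110', 'l': '0111'}

print(is_prefix_free(sample_codes))            # True
print(is_prefix_free({'a': '0', 'b': '01'}))   # False: '0' is a prefix of '01'
```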

The ***'encode'*** function calls the ***'climb_tree'*** function to get a dictionary of the characters in the text with their code assignments. It iterates through the lowercased input text string and creates an output text string in which every character is replaced with its code assignment. It then returns the encoded output string.

The cost of this operation is O(n), because each character needs to be replaced.

In [11]:
def encode(text):
    # call the climb_tree function to get string-code assignments
    dictionary = climb_tree(text)
    # initialise output string
    encoded_text = str()
    # iterate through text
    for character in text.lower():
        # augment output string with code assignment for each character
        encoded_text += dictionary[character]
    return encoded_text

The ***'decode'*** function takes as input the dictionary that was used to encode a text string, together with the encoded text. First, it swaps the keys and values of the dictionary so that the codes are keys and the characters are values. It then iterates through the encoded text and checks whether the current bit is a code in the dictionary. If not, it appends the next bit and checks whether this longer sequence is in the dictionary. Once the sequence matches a code, the function adds the corresponding character to the output string and starts a new sequence at the next bit. Because no Huffman code is a prefix of another, the first match is always the correct one. Once the function has iterated through the entire encoded text, it has recovered the text string (in lower case, since encoding lowercases the text) and returns it.

The cost of this operation is O(n) in the length of the encoded string, because each bit is visited once and each character is recovered in turn. The length of the individual codes depends on the alphabet used: larger alphabets require longer codes. Regardless, this increases the cost of the algorithm only by a constant factor.

In [12]:
def decode(dictionary, code):
    # get dictionary with string-code assignments and swap keys and values
    dictionary = dict([(v, k) for k, v in dictionary.items()])
    # initialise output string
    text = str()
    # initialise pointer for the encoded text
    index = 0
    # iterate through encoded text,
    for i in range(len(code)):
        # consider expanding encoded text sequence until a match in dict
        if code[index:i+1] in dictionary:
            # augment output text by the matched string
            text = text + dictionary[code[index:i+1]]
            index = i + 1
    return text
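
A round trip on a small example illustrates how encoding and decoding fit together. The sketch below is self-contained and uses hypothetical helper names (`encode_with`, `decode_with`) together with a hand-picked subset of the T1 code table, rather than the functions defined above.

```python
# A subset of the code assignments for T1.
codes = {' ': '11', 'e': '0001', 't': '1000',
         'a': '0100', 'o': '0110', 'l': '0111'}

def encode_with(codes, text):
    """Replace every character by its code."""
    return "".join(codes[char] for char in text)

def decode_with(codes, bits):
    """Grow a bit sequence until it matches a code, then emit the character."""
    inverse = {code: char for char, code in codes.items()}
    characters, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            characters.append(inverse[current])
            current = ""  # start collecting the next code
    return "".join(characters)

message = "to a tale"
bits = encode_with(codes, message)
print(bits)
print(decode_with(codes, bits))
```

The decoded string equals the original message because the codes in the table are prefix-free.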

In [13]:
# Text examples

T1 = 'O all you host of heaven! O earth! What else? And shall I couple '\
'hell? Oh, fie! Hold, hold, my heart, And you, my sinews, grow not instant '\
'old, But bear me stiffly up. Remember thee! Ay, thou poor ghost, whiles '\
'memory holds a seat In this distracted globe. Remember thee! Yea, from '\
'the table of my memory I’ll wipe away all trivial fond records, All saws '\
'of books, all forms, all pressures past That youth and observation copied '\
'there, And thy commandment all alone shall live Within the book and volume '\
'of my brain, Unmixed with baser matter. Yes, by heaven! O most pernicious '\
'woman! O villain, villain, smiling, damned villain! My tables! Meet it is '\
'I set it down That one may smile, and smile, and be a villain. At least '\
'I’m sure it may be so in Denmark. So, uncle, there you are. Now to my word.'

T2 = 'Habe nun, ach! Philosophie, Juristerei und Medizin, Und leider auch '\
'Theologie Durchaus studiert, mit heissem Bemühn. Da steh ich nun, ich armer '\
'Tor! Und bin so klug als wie zuvor; Heisse Magister, heisse Doktor gar Und '\
'ziehe schon an die zehen Jahr Herauf, herab und quer und krumm Meine '\
'Schüler an der Nase herum Und sehe, dass wir nichts wissen können! Das '\
'will mir schier das Herz verbrennen. Zwar bin ich gescheiter als all die '\
'Laffen, Doktoren, Magister, Schreiber und Pfaffen; Mich plagen keine '\
'Skrupel noch Zweifel, Fürchte mich weder vor Hölle noch Teufel Dafür '\
'ist mir auch alle Freud entrissen, Bilde mir nicht ein, was Rechts zu '\
'wissen, Bilde mir nicht ein, ich könnte was lehren, Die Menschen zu '\
'bessern und zu bekehren. Auch hab ich weder Gut noch Geld, Noch Ehr und '\
'Herrlichkeit der Welt; Es möchte kein Hund so länger leben! Drum hab ich '\
'mich der Magie ergeben, Ob mir durch Geistes Kraft und Mund Nicht manch '\
'Geheimnis würde kund; Dass ich nicht mehr mit saurem Schweiss Zu sagen '\
'brauche, was ich nicht weiss; Dass ich erkenne, was die Welt Im Innersten '\
'zusammenhält, Schau alle Wirkenskraft und Samen, Und tu nicht mehr in '\
'Worten kramen.'

In [14]:
climb_tree(T1)

{'l': '0111',
 'o': '0110',
 '’': '01011111',
 'x': '010111101',
 '?': '010111100',
 'k': '01011101',
 'g': '01011100',
 'w': '010110',
 '.': '0101011',
 'c': '0101010',
 'u': '010100',
 'a': '0100',
 'h': '00111',
 'r': '00110',
 'b': '001011',
 'p': '0010101',
 '!': '0010100',
 'm': '00100',
 'e': '0001',
 'n': '00001',
 's': '00000',
 ' ': '11',
 'f': '101111',
 'v': '101110',
 'y': '10110',
 'i': '1010',
 ',': '10011',
 'd': '10010',
 't': '1000'}

In [15]:
climb_tree(T2)

{'h': '0111',
 'm': '01101',
 't': '01100',
 'i': '0101',
 'p': '01001111',
 'ü': '01001110',
 'ä': '010011011',
 'q': '0100110101',
 'j': '0100110100',
 ';': '01001100',
 'o': '010010',
 'w': '010001',
 ',': '010000',
 'n': '0011',
 'c': '00101',
 'd': '00100',
 ' ': '000',
 'u': '1111',
 'a': '1110',
 'z': '110111',
 'f': '110110',
 'v': '11010111',
 '!': '11010110',
 '.': '11010101',
 'ö': '11010100',
 'g': '110100',
 's': '1100',
 'e': '101',
 'l': '10011',
 'b': '100101',
 'k': '100100',
 'r': '1000'}

# References

Huffman, D. (1952). "A Method for the Construction of Minimum-Redundancy Codes". Proceedings of the IRE. 40 (9): 1098–1101. doi:10.1109/JRPROC.1952.273898.