# Module 3: Completing Read Alignment Problem 

## Module 3 Introduction

Dynamic programming algorithms
- are flexible
- allow us to look for approximate matches that include insertions/deletions
- find similar substrings between strings

After that, we move on to genome assembly without a reference genome
- and begin to discuss the concepts related to it.

## Solving Edit Distance Problem 

Two different ways to measure the difference between two strings:

Hamming distance = minimum # of substitutions needed to turn one string into another (both strings must be the same length).
- pretty easy to find: compare the strings position by position

Edit distance = minimum # of substitutions/insertions/deletions needed to turn one string into another (the strings do not have to be the same length).
- pretty hard to find, because insertions and deletions shift the positions that line up

We approach edit distance in a couple of steps:

If we have strings, x and y, of the same length, we can say the following about the hamming_distance and edit_distance between x and y:

1. edit_distance <= hamming_distance
- if the best way to turn x into y uses only substitutions, they'll be equal
- if there's an indel, edit_distance can be less, because you don't have to make all those substitutions; a single insertion or deletion shifts the rest of the string into place.

If we have strings, x and y, of different lengths, we can say the following about the edit_distance between x and y:

1. edit_distance >= abs(len(x)-len(y))
- we always have to make up the gap in lengths with insertions or deletions. If the strings differ only by those extra characters, edit_distance = abs(len(x)-len(y)).
- if substitutions (or other edits) are needed as well, edit_distance > abs(len(x)-len(y)).
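Both bounds can be checked on small examples. This is only a sketch: the `edit_distance` helper below is a compact cached recursion written for this check (treat it as a black box here), and the example strings are made up for illustration.

```python
from functools import lru_cache


def hamming_distance(x, y):
    """Number of mismatched positions; x and y must have equal length."""
    assert len(x) == len(y)
    return sum(1 for a, b in zip(x, y) if a != b)


def edit_distance(x, y):
    """Edit distance via a cached recursion over prefix lengths (illustrative helper)."""
    @lru_cache(maxsize=None)
    def ed(i, j):
        # ed(i, j) = edit distance between x[:i] and y[:j]
        if i == 0:
            return j
        if j == 0:
            return i
        delta = 0 if x[i - 1] == y[j - 1] else 1
        return min(ed(i - 1, j - 1) + delta, ed(i - 1, j) + 1, ed(i, j - 1) + 1)
    return ed(len(x), len(y))


# Same length: y is x shifted by one base, so every position mismatches,
# but one deletion plus one insertion is enough.
x, y = "ACGTACGT", "CGTACGTA"
print(hamming_distance(x, y), edit_distance(x, y))  # 8 2

# Different lengths: the length gap is a lower bound on the edit distance.
print(edit_distance("ACGT", "ACGTTT"))  # 2, equal to the length gap
print(edit_distance("ACGT", "AGGTTT"))  # 3, one substitution on top of the gap
```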

Algorithm to compute the edit distance:

- Take two very long strings
- Consider the prefix of each string (everything but the last base); call the prefixes alpha and beta
- The last bases (C and A in the figure) are represented more generally as x and y.

See Fig. 1.

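- Take the edit distance between the two prefixes, alpha and beta.

- Writing alpha(x) for alpha followed by x, and beta(y) for beta followed by y, the edit distance between the two full strings alpha(x) and beta(y) is the minimum of these three numbers:

delta = 0 if x = y, or 1 otherwise.
1. ed(alpha, beta) + delta (match or substitute the last bases)
2. ed(alpha(x), beta) + 1 (the 1 is for adding one base, y, to beta)
3. ed(alpha, beta(y)) + 1 (the 1 is for adding one base, x, to alpha)

See Fig. 2.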

- all of this can be represented by a recursive function (below)
- problem: the recursion is very slow, because it solves the same pairs of prefixes over and over.
- solution: use dynamic programming.

Fig. 1 
<img src="/Users/arshmeetkaur/Genomic_Data_Science/course 3 /images/prefixes.png" alt="Local Image" width="300">

Fig. 2
<img src="/Users/arshmeetkaur/Genomic_Data_Science/course 3 /images/edit_distance_formula.png" alt="Local Image" width="300">

In [None]:
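def hamming_distance(x, y):
    """ returns hamming distance (subs only); x and y must be the same length. """
    if len(x) != len(y):
        raise ValueError("Hamming distance requires strings of equal length")
    subs = 0
    for i in range(len(x)):
        # count each position where the two strings disagree
        if x[i] != y[i]:
            subs += 1
    return subs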

In [None]:
def edit_distance_recursion(a, b):
    """ returns edit distance recursively (slow). """

    # base case - if one string is empty, the edit distance is the length of the other string.
    if len(a) == 0:
        return len(b)
    if len(b) == 0:
        return len(a)

    # delt is 0 if the last characters match, 1 if a substitution is needed
    delt = 1 if a[-1] != b[-1] else 0
    return min(edit_distance_recursion(a[:-1], b[:-1]) + delt,  # both prefixes
               edit_distance_recursion(a, b[:-1]) + 1,          # all of a vs. prefix of b
               edit_distance_recursion(a[:-1], b) + 1)          # prefix of a vs. all of b
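To get a feel for how slow the plain recursion is, one can count its calls. The sketch below is an illustration, not part of the course code: `count_naive_calls` is a hypothetical helper that mirrors the three-way branching of the recursion but only counts invocations (it is cached so the count itself is quick to compute). It compares that count with the number of distinct prefix pairs, which is all a dynamic programming solution needs to solve.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_naive_calls(m, n):
    """Calls made by the plain recursion on strings of lengths m and n."""
    if m == 0 or n == 0:
        return 1  # base case: one call, no further branching
    return (1
            + count_naive_calls(m - 1, n - 1)
            + count_naive_calls(m, n - 1)
            + count_naive_calls(m - 1, n))


for n in (5, 10, 15):
    distinct = (n + 1) * (n + 1)  # distinct (prefix of a, prefix of b) pairs
    print(n, count_naive_calls(n, n), distinct)
```

The call count grows far faster than the number of distinct prefix pairs, which is why storing each result once pays off.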

## Using dynamic programming for edit distance

- need to rewrite the recursive function so that each subproblem is solved once and stored in a matrix
- characters of string x label rows, characters of string y label columns (fig 1)
- first row and first column are labeled with epsilon, for the empty string
- fill in each position in the matrix with the edit distance between the corresponding prefix of x and prefix of y
- the total edit distance is in the bottom right corner (fig 2)

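How this works:
- initialize the first column to 0 ... len(x) and the first row to 0 ... len(y). This makes sense if you think about the edit distance: turning a prefix into the empty string takes one edit per character.
- think of each of the edit distance terms as one position in the matrix (see fig 3). The cell for ed(alpha(x), beta(y)) = min of three other cells: the diagonal (upper-left) cell + delta, the cell to the left + 1, and the cell above + 1.

- the same dynamic programming approach carries over to related problems, such as approximate matching and finding similar substrings between strings.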
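The matrix-filling steps above can be sketched as follows. This is a minimal version under the conventions in these notes (x labels rows, y labels columns); the name `edit_distance_dp` and the matrix name `D` are choices made for this sketch.

```python
def edit_distance_dp(x, y):
    """Edit distance by filling a (len(x)+1) x (len(y)+1) matrix."""
    # D[i][j] = edit distance between x[:i] and y[:j]
    D = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]

    # first column: prefix of x vs. the empty string
    for i in range(len(x) + 1):
        D[i][0] = i
    # first row: the empty string vs. prefix of y
    for j in range(len(y) + 1):
        D[0][j] = j

    # fill row by row; each cell depends on its diagonal, upper, and left neighbors
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            delta = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + delta,  # diagonal: match / substitute
                          D[i - 1][j] + 1,          # above
                          D[i][j - 1] + 1)          # left

    # the bottom right corner holds the distance between the full strings
    return D[-1][-1]


print(edit_distance_dp("kitten", "sitting"))  # 3
```

Each cell is computed once, so the work is proportional to len(x) * len(y) rather than growing exponentially like the plain recursion.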

Fig 1 
<img src="/Users/arshmeetkaur/Genomic_Data_Science/course 3 /images/matrix.png" alt="Local Image" width="500">

Fig 2
<img src="/Users/arshmeetkaur/Genomic_Data_Science/course 3 /images/corner_matrix.png" alt="Local Image" width="500">

Fig 3 
<img src="/Users/arshmeetkaur/Genomic_Data_Science/course 3 /images/boxes.jpeg" alt="Local Image" width="500">
<img src="/Users/arshmeetkaur/Genomic_Data_Science/course 3 /images/fill_in.png" alt="Local Image" width="500">
<img src="/Users/arshmeetkaur/Genomic_Data_Science/course 3 /images/final_matrix.png" alt="Local Image" width="500">