## Minimum Edit Distance

The minimum number of edit steps (delete, insert, replace) needed to change string A into string B.

Note: this is similar to global alignment, but with a different initial table and different scores. It is also a dynamic programming problem.

Let $A = a_1, a_2, \dots, a_n$ and $B = b_1, b_2, \dots, b_m$ be two strings of characters. We would like to change $A$ character by character until it becomes equal to $B$. At each step we allow an insertion, a deletion or a replacement. Our goal is to minimize the number of steps.

**Induction!** Let $A(i)$ and $B(j)$ denote the prefixes $a_1, a_2, \dots, a_i$ and $b_1, b_2, \dots, b_j$, respectively. The subproblem is to change $A(i)$ into $B(j)$ with a minimum number of edit steps, denoted $C(i,j)$. By induction on $i$ and $j$, we can find $C(n, m)$.

In [1]:
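# The leading "0" is a placeholder for the empty prefix, so A[i] is a_i (1-based).
A = "0TGTTACGG" # index i
B = "0GGTTGACTA" # index j

# Both lengths include the placeholder: the table has n rows and m columns.
n = len(A)
m = len(B)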

In [2]:
# Row 0 holds 0..m-1 and column 0 holds 0..n-1 (the base cases); every other cell starts at 0.
C = [[i + j if i == 0 or j == 0 else 0 for j in range(m)] for i in range(n)]
C

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [2, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [3, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [4, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [5, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [6, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [7, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [8, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
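
The first row and first column of this table are the base cases of the induction. As a short worked statement (the notation $C(i,0)$ and $C(0,j)$ for an empty prefix is an assumption here, matching the `0` placeholder row and column): changing $A(i)$ into the empty string takes $i$ deletions, and changing the empty string into $B(j)$ takes $j$ insertions.

$C(i, 0) = i, \quad C(0, j) = j$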

There are four possibilities for the last step of a minimum change from $A$ to $B$.

* Delete: if $a_n$ is deleted in the minimum change from $A$ to $B$, the best change is from $A(n-1)$ to $B(m)$ followed by one more deletion. In other words, $C(n, m) = C(n-1, m) + 1$.

* Insert: if the minimum change from $A$ to $B$ involves insertion of a character to match $b_m$, then we have $C(n, m) = C(n, m-1) + 1$.

* Replace: if $a_n$ is replaced by $b_m$, we first need to find the minimum change from $A(n-1)$ to $B(m-1)$ and then add 1 if $a_n \neq b_m$.

* Match: if $a_n = b_m$, then $C(n, m) = C(n-1, m-1)$. 

$c(i,j) = \begin{cases} 0 & \text{if } a_i = b_j \\ 1 & \text{otherwise} \end{cases}$

$C(n,m) = \min\big( C(n-1,m)+1,\ C(n, m-1)+1,\ C(n-1, m-1) + c(n,m) \big)$
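
The recurrence can also be sketched directly as a memoized recursion. This is a minimal illustration rather than the table-filling code used in this notebook: the function name `edit_distance` and the inner helper `C` are assumptions introduced here, and the strings are passed without the leading `0` placeholder, so $a_i$ corresponds to `a[i - 1]`.

```python
from functools import lru_cache


def edit_distance(a, b):
    """Minimum number of insertions, deletions and replacements turning a into b."""

    @lru_cache(maxsize=None)
    def C(i, j):
        # Base cases: one of the two prefixes is empty.
        if i == 0:
            return j  # insert the first j characters of b
        if j == 0:
            return i  # delete the first i characters of a
        # c(i, j) from the text; Python strings are 0-indexed.
        cost = 0 if a[i - 1] == b[j - 1] else 1
        return min(
            C(i - 1, j) + 1,          # delete a_i
            C(i, j - 1) + 1,          # insert b_j
            C(i - 1, j - 1) + cost,   # match (cost 0) or replace (cost 1)
        )

    return C(len(a), len(b))


print(edit_distance("TGTTACGG", "GGTTGACTA"))  # prints 4
```

Memoization means each pair $(i, j)$ is evaluated once, so the work is proportional to $n \cdot m$, the same as filling the table. The recursion depth grows with the string lengths, so for long strings the bottom-up table is the safer choice.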


In [3]:
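def Minimum_Edit_Distance(A, B):
    """Fill the global table C so that C[i][j] is the edit distance from A(i) to B(j)."""
    global C
    
    n = len(A)
    m = len(B)
    
    # Row 0 and column 0 are already initialised; index 0 is the "0" placeholder.
    for i in range(1, n):
        for j in range(1, m):
            delete = C[i-1][j] + 1   # delete a_i
            insert = C[i][j-1] + 1   # insert b_j
            if A[i] == B[j]:
                match_or_replace = C[i-1][j-1]      # match: no extra cost
            else:
                match_or_replace = C[i-1][j-1] + 1  # replace a_i with b_j
                
            C[i][j] = min(delete, insert, match_or_replace)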

In [4]:
Minimum_Edit_Distance(A, B)

In [5]:
# Not part of the algorithm, just for printing
print("  ", end='')
for j in range(len(B)):
    print(" {0:2s}".format(B[j]), end="")
print()

for i, row in enumerate(C):
    print(A[i]+" ", end='')
    print(row)

   0  G  G  T  T  G  A  C  T  A 
0 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
T [1, 1, 2, 2, 3, 4, 5, 6, 7, 8]
G [2, 1, 1, 2, 3, 3, 4, 5, 6, 7]
T [3, 2, 2, 1, 2, 3, 4, 5, 5, 6]
T [4, 3, 3, 2, 1, 2, 3, 4, 5, 6]
A [5, 4, 4, 3, 2, 2, 2, 3, 4, 5]
C [6, 5, 5, 4, 3, 3, 3, 2, 3, 4]
G [7, 6, 5, 5, 4, 3, 4, 3, 3, 4]
G [8, 7, 6, 6, 5, 4, 4, 4, 4, 4]


In [6]:
# Q: C(n, m) = 4. What are the 4 steps that change A into B?
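
One way to approach this question is to walk back through the table from the bottom-right cell, at each cell picking a neighbour that explains its value. The sketch below is self-contained and makes its own assumptions: the function name `edit_script` is hypothetical, the strings are passed without the `0` placeholder, and ties are broken in the order diagonal, delete, insert. Other tie-breaking orders can give a different but equally short list of steps.

```python
def edit_script(a, b):
    """Return one minimum-length list of edit steps turning a into b."""
    n, m = len(a), len(b)

    # C[i][j] = minimum number of edits turning a[:i] into b[:j].
    C = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        C[i][0] = i
    for j in range(m + 1):
        C[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            C[i][j] = min(C[i - 1][j] + 1, C[i][j - 1] + 1, C[i - 1][j - 1] + cost)

    # Walk back from C[n][m], recording every move that costs 1.
    steps = []
    i, j = n, m
    while i > 0 or j > 0:
        cost = 0 if i > 0 and j > 0 and a[i - 1] == b[j - 1] else 1
        if i > 0 and j > 0 and C[i][j] == C[i - 1][j - 1] + cost:
            if cost:
                steps.append(f"replace {a[i - 1]} (position {i} of A) with {b[j - 1]}")
            i, j = i - 1, j - 1
        elif i > 0 and C[i][j] == C[i - 1][j] + 1:
            steps.append(f"delete {a[i - 1]} (position {i} of A)")
            i -= 1
        else:
            steps.append(f"insert {b[j - 1]} (position {j} of B)")
            j -= 1

    steps.reverse()  # steps were collected from the end of the strings backwards
    return steps


for step in edit_script("TGTTACGG", "GGTTGACTA"):
    print(step)
```

Every recorded step corresponds to a move that adds 1 to the cost, so the list has exactly $C(n, m)$ entries, which is four for the strings above.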