### Baum-Welch Learning algorithm

---


**Exercise Break**: Consider the following questions.

For the crooked dealer HMM, compute $Pr(π_i = k|x)$ for x = “THTHHHTHTTH” and each value of i.  
How does your answer change if x = “HHHHHHHHHHH”?  

<br>

<img src = http://bioinformaticsalgorithms.com/images/HMM/HMM_diagram_complete.png width = 400px>

<br>

##### Apply your solution for the Soft Decoding Problem to find CG-islands in the first million nucleotides from the human X chromosome. How does your answer differ from the solution given by the Viterbi algorithm?
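
A million-nucleotide string is far longer than the 11-flip example, and the unscaled forward and backward values used in the cells below shrink geometrically with the length of the string, so they will likely underflow to zero in floating point. One common workaround is to rescale each column. The sketch below is a variant under stated assumptions, not the notebook's own code: `posterior_scaled` is a hypothetical name, the initial state distribution is assumed uniform (as in the cells of this notebook), and every emitted symbol is assumed to have nonzero probability along at least one path.

```python
import numpy as np

def posterior_scaled(X, T, E):
    """Soft decoding with per-column rescaling (hypothetical helper).

    Assumes a uniform initial state distribution.
    Returns (Pi_star, log Pr(x)).
    """
    T = np.asarray(T, dtype=float)
    E = np.asarray(E, dtype=float)
    S, n = T.shape[0], len(X)

    F = np.zeros((S, n))   # scaled forward values; every column sums to 1
    c = np.zeros(n)        # scaling constants; their product equals Pr(x)
    col = E[:, X[0]] / S
    c[0] = col.sum()
    F[:, 0] = col / c[0]
    for i in range(1, n):
        col = (T.T @ F[:, i - 1]) * E[:, X[i]]
        c[i] = col.sum()
        F[:, i] = col / c[i]

    B = np.ones((S, n))    # scaled backward values; last column is 1
    for i in range(n - 2, -1, -1):
        B[:, i] = (T @ (B[:, i + 1] * E[:, X[i + 1]])) / c[i + 1]

    # The scaling constants cancel, so F * B is already Pr(pi_i = k | x).
    return F * B, np.log(c).sum()

# Usage on the crooked casino HMM (H = 0, T = 1)
T = [[0.9, 0.1], [0.1, 0.9]]
E = [[0.5, 0.5], [0.75, 0.25]]
X = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0]
Pi_star, loglik = posterior_scaled(X, T, E)
print(np.around(Pi_star, 3))
```

Because the scaling constants cancel in the product of the two tables, no separate division by $forward(sink)$ is needed, and the sum of the logged constants gives $\log Pr(x)$.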

<br>

In [111]:
def forward(X, T, E, S): # forward outcome likelihood
    F = np.zeros(shape = (S, len(X))) 
    for state in range(S):
        F[state][0] = E[state][X[0]] / S
    for i in range(1,len(X)):
        for state in range(S):
            F[state][i] = sum(T[k][state]*F[k][i-1] for k in range(S))
            F[state][i] *= E[state][X[i]]
    return F
def backward(X, T, E, S): # slightly different recurrence
    B = np.ones(shape = (S, len(X))) # last column nodes = 1
    for i in range(len(X)-2, -1, -1): 
        for state in range(S):
            B[state][i] = sum(T[state][k] * B[k][i+1] * E[k][X[i+1]] for k in range(S))
    return B
def forward_backward(X,T,E,S):
    """returns responsibility matrix for states"""    
    F = forward(X, T, E, S)
    B = backward(X, T, E, S)
    F_sink = sum(F[state][len(X)-1] for state in range(S))
    return np.multiply(F, B)/F_sink


In [112]:
import numpy as np
import pandas as pd
# crooked casino hmm
T = [[0.9, 0.1], [0.1,0.9]]
E = [[0.5, 0.5], [0.75,0.25]]
S = 2

# given string THTHHHTHTTH, encoded as H = 0, T = 1; rows are the states F (fair) and B (biased)
X = [int(i) for i in "1 0 1 0 0 0 1 0 1 1 0".split(' ')]
Pi_star = forward_backward(X,T,E,S)
Pi_star = pd.DataFrame(np.around(Pi_star,3), index = ("F", "B"))
print("\n# when string is 1 0 1 0 0 0 1 0 1 1 0")
print(Pi_star)

# when string is all heads
X =[0 for _ in range(len(X))]
Pi_star = forward_backward(X,T,E,S)
Pi_star = pd.DataFrame(np.around(Pi_star,3), index = ("F", "B"))
print("\n# when string is all heads")
print(Pi_star)

X = [int(i) for i in "1 0 1 0 0 0 1 0 1 1 0".split(' ')]
Pi_star = forward_backward(X,T,E,S)



# when string is 1 0 1 0 0 0 1 0 1 1 0
       0      1    2      3      4      5      6      7      8      9     10
F  0.636  0.593  0.6  0.533  0.515  0.544  0.627  0.633  0.692  0.686  0.609
B  0.364  0.407  0.4  0.467  0.485  0.456  0.373  0.367  0.308  0.314  0.391

# when string is all heads
       0      1      2      3      4      5      6      7      8      9     10
F  0.175  0.134  0.109  0.095  0.087  0.085  0.087  0.095  0.109  0.134  0.175
B  0.825  0.866  0.891  0.905  0.913  0.915  0.913  0.905  0.891  0.866  0.825


---

<br>

We have just seen how to compute the **conditional probability $Pr(π_i = k|x)$ that the HMM passes through node (k, i)** in the Viterbi graph given that the HMM emits x.  

But what about **the conditional probability $Pr(π_i = l, π_{i+1} = k|x)$ that the HMM passes through the edge** connecting (l, i) to (k, i + 1) given that the HMM emits x?  

As with the forward-backward algorithm, we can divide every path through the edge in question into a blue path from $source$ to this edge and a red path from this edge to $sink$ (see figure below).

<img src =http://bioinformaticsalgorithms.com/images/HMM/forward_edge.png width = 500px>

**Exercise Break:**
Prove that $Pr(π_i = l, π_{i+1} = k|x)$ is equal to $forward(l, i) \cdot Weight_i(l, k) \cdot backward(k,i+1) / forward(sink)$.



**Figure**: Each path from source to sink in the Viterbi graph passing through the (black) edge (l, i) → (k, i + 1) in the Viterbi graph can be partitioned into two subpaths, one from source to (l, i) (shown in blue) and another from (k, i + 1) to sink (shown in red).
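
For readers who want to check their argument for the exercise above, one possible sketch follows. It assumes that $Weight_i(l, k)$ denotes the weight of the edge from (l, i) to (k, i + 1), that is, the transition probability from l to k times the probability that k emits $x_{i+1}$, which is how the code cells in this notebook compute it.

$$
\begin{aligned}
Pr(π_i = l,\ π_{i+1} = k,\ x)
&= \sum_{\text{paths through the edge}} \big(\text{product of edge weights}\big) \\
&= \Big(\sum_{\text{blue subpaths}} \text{weight}\Big) \cdot Weight_i(l, k) \cdot \Big(\sum_{\text{red subpaths}} \text{weight}\Big) \\
&= forward(l, i) \cdot Weight_i(l, k) \cdot backward(k, i+1)
\end{aligned}
$$

The second line uses the fact that every pairing of a blue subpath with a red subpath gives exactly one path through the edge, and that path weights multiply. Dividing by $Pr(x) = forward(sink)$ then gives the conditional probability.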


<br>

The probabilities $Pr(π_i = k|x)$ can be put into a $|States| × n$ **responsibility matrix Π∗**, where $Π∗(k, i)$ corresponds to a node in the Viterbi graph and is equal to $Pr(π_i = k|x)$. The figure below (top) shows the “responsibility” matrix Π∗ for the crooked casino.

The probabilities $Pr(π_i = l, π_{i+1} = k|x)$ can be put into another $|States| × |States| × (n − 1)$ **responsibility matrix Π∗∗**, where $Π∗∗(l, k, i)$ corresponds to an edge in the Viterbi graph and is equal to $Pr(π_i = l, π_{i+1} = k|x)$ (figure below (bottom)). For brevity, we use **Π to collectively refer to the matrices Π∗ and Π∗∗.**

**Exercise Break**: What is the complexity of an algorithm computing the matrices Π∗ and Π∗∗?
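
A quick way to sanity-check an implementation of Π∗∗ is to confirm that summing the edge responsibilities over one endpoint recovers the node responsibilities. The sketch below is an assumption-labeled illustration rather than the notebook's code: `forward_backward_arrays` and `responsibilities` are hypothetical names, the initial state distribution is assumed uniform, and Π∗∗ is stored as a 3-dimensional array indexed `[l, k, i]` instead of the flattened layout used in the cells below. Each of the n − 1 columns touches every one of the $|States|^2$ edges once, which is a useful hint for the complexity question.

```python
import numpy as np

def forward_backward_arrays(X, T, E):
    """Unscaled forward and backward tables (uniform initial distribution)."""
    T, E = np.asarray(T, float), np.asarray(E, float)
    S, n = T.shape[0], len(X)
    F = np.zeros((S, n))
    B = np.ones((S, n))
    F[:, 0] = E[:, X[0]] / S
    for i in range(1, n):
        F[:, i] = (T.T @ F[:, i - 1]) * E[:, X[i]]
    for i in range(n - 2, -1, -1):
        B[:, i] = T @ (B[:, i + 1] * E[:, X[i + 1]])
    return F, B

def responsibilities(X, T, E):
    """Return Pi_star with shape (S, n) and Pi_2star with shape (S, S, n-1)."""
    T, E = np.asarray(T, float), np.asarray(E, float)
    X = np.asarray(X, dtype=int)
    F, B = forward_backward_arrays(X, T, E)
    F_sink = F[:, -1].sum()
    pi_star = F * B / F_sink
    # emitted[k, i] = E[k][X[i+1]] * B[k][i+1]
    emitted = E[:, X[1:]] * B[:, 1:]
    # pi_2star[l, k, i] = F[l, i] * T[l, k] * emitted[k, i] / F_sink
    pi_2star = np.einsum("li,lk,ki->lki", F[:, :-1], T, emitted) / F_sink
    return pi_star, pi_2star

# Crooked casino HMM and the string THTHHHTHTTH (H = 0, T = 1)
T = [[0.9, 0.1], [0.1, 0.9]]
E = [[0.5, 0.5], [0.75, 0.25]]
X = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0]
pi_star, pi_2star = responsibilities(X, T, E)

# Summing over the target state k recovers the node responsibility at column i,
# and summing over the source state l recovers it at column i + 1.
assert np.allclose(pi_2star.sum(axis=1), pi_star[:, :-1])
assert np.allclose(pi_2star.sum(axis=0), pi_star[:, 1:])
```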


In [113]:
# crooked casino hmm
T = [[0.9, 0.1], [0.1,0.9]]
E = [[0.5, 0.5], [0.75,0.25]]
S = 2

# emitted string
X = [int(i) for i in "1 0 1 0 0 0 1 0 1 1 0".split(' ')]

# calculate responsibility matrices Pi_star and Pi_2star
F = forward(X, T, E, S)
B = backward(X, T, E, S)
F_sink = sum(F[state][len(X)-1] for state in range(S))
Pi_star = forward_backward(X,T,E,S)


transitions = list((l,k) for l in range(S) for k in range(S))
print(transitions)

Pi_2star = np.zeros(shape = (len(transitions), len(X)-1))
for i in range(len(X)-1):
    for t in range(len(T)**2):
        (l,k) = transitions[t]
        weight_edge = E[k][X[i+1]] * T[l][k]
        Pi_2star[t][i] = F[l][i] * weight_edge * B[k][i+1] / F_sink


Pi_star = pd.DataFrame(np.around(Pi_star,3), index = ("F","B"))
Pi_2star = pd.DataFrame(np.around(Pi_2star,3), index = transitions)
print("\n\nPiStar:\n", Pi_star,"\n\nPi_2Star:\n",Pi_2star)

[(0, 0), (0, 1), (1, 0), (1, 1)]


PiStar:
        0      1    2      3      4      5      6      7      8      9     10
F  0.636  0.593  0.6  0.533  0.515  0.544  0.627  0.633  0.692  0.686  0.609
B  0.364  0.407  0.4  0.467  0.485  0.456  0.373  0.367  0.308  0.314  0.391 

Pi_2Star:
             0      1      2      3      4      5      6      7      8      9
(0, 0)  0.562  0.548  0.507  0.473  0.478  0.523  0.582  0.608  0.643  0.588
(0, 1)  0.074  0.045  0.093  0.059  0.037  0.022  0.045  0.025  0.049  0.098
(1, 0)  0.031  0.053  0.025  0.042  0.066  0.104  0.051  0.084  0.043  0.022
(1, 1)  0.333  0.354  0.374  0.426  0.418  0.351  0.322  0.282  0.265  0.293


In [114]:
# new forward-backward algorithm that returns π = {π*, π**}

def forward_backward(X,T,E,S):
    """returns responsibility matrices π* (nodes) and π** (edges)"""
    
    def forward(X, T, E, S): # forward outcome likelihood
        F = np.zeros(shape = (S, len(X))) 
        for state in range(S):
            F[state][0] = E[state][X[0]] / S
        for i in range(1,len(X)):
            for state in range(S):
                F[state][i] = sum(T[k][state]*F[k][i-1] for k in range(S))
                F[state][i] *= E[state][X[i]]
        return F
      
    def backward(X, T, E, S): # slightly different recurrence
        B = np.ones(shape = (S, len(X))) # last column nodes = 1
        for i in range(len(X)-2, -1, -1): 
            for state in range(S):
                B[state][i] = sum(T[state][k] * B[k][i+1] * E[k][X[i+1]] for k in range(S))
        return B
    
    transitions = list((l,k) for l in range(S) for k in range(S))    
    F = forward(X, T, E, S)
    B = backward(X, T, E, S)
    F_sink = sum(F[state][len(X)-1] for state in range(S))
    
    pi_star = np.multiply(F, B)/F_sink
    pi_2star = np.zeros(shape = (len(T)**2, len(X)-1))
    for i in range(len(X)-1):
        for t in range(len(T)**2):
            (l,k) = transitions[t]
            weight_edge = E[k][X[i+1]] * T[l][k]
            pi_2star[t][i] = F[l][i] * weight_edge * B[k][i+1]
    pi_2star = pi_2star/ F_sink
    return pi_star, pi_2star
    



In [115]:
X = [int(i) for i in "1 0 1 0 0 0 1 0 1 1 0".split(' ')]
Pi_s, Pi_ss = forward_backward(X,T,E,S)

Pi_star = pd.DataFrame(np.around(Pi_s,3), index = ("F","B"))
Pi_2star = pd.DataFrame(np.around(Pi_ss,3), index = transitions)
print(X,"\n\nPiStar:\n", Pi_star,"\n\nPi_2Star:\n",Pi_2star)

[1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0] 

PiStar:
        0      1    2      3      4      5      6      7      8      9     10
F  0.636  0.593  0.6  0.533  0.515  0.544  0.627  0.633  0.692  0.686  0.609
B  0.364  0.407  0.4  0.467  0.485  0.456  0.373  0.367  0.308  0.314  0.391 

Pi_2Star:
             0      1      2      3      4      5      6      7      8      9
(0, 0)  0.562  0.548  0.507  0.473  0.478  0.523  0.582  0.608  0.643  0.588
(0, 1)  0.074  0.045  0.093  0.059  0.037  0.022  0.045  0.025  0.049  0.098
(1, 0)  0.031  0.053  0.025  0.042  0.066  0.104  0.051  0.084  0.043  0.022
(1, 1)  0.333  0.354  0.374  0.426  0.418  0.351  0.322  0.282  0.265  0.293


<img src = http://bioinformaticsalgorithms.com/images/HMM/responsibility_matrices.png width = 500px>
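
The parameter re-estimation step used by the cells that follow is not spelled out in the text. Read off from the code, it appears to amount to normalizing summed responsibilities; the symbols $transition_{l,k}$ and $emission_k(b)$ are notation assumed here for the entries of `T` and `E`:

$$
transition_{l,k} = \frac{\sum_{i} \Pi^{**}(l, k, i)}{\sum_{k'} \sum_{i} \Pi^{**}(l, k', i)},
\qquad
emission_k(b) = \frac{\sum_{i \,:\, x_i = b} \Pi^{*}(k, i)}{\sum_{i} \Pi^{*}(k, i)}
$$

In words, the new transition probability from l to k is the expected share of departures from l that go to k, and the new emission probability is the expected share of visits to k that emit symbol b.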

In [135]:

def estimate_parameters(X, Pi_s, Pi_ss, alpha):
    """from π* and π**, derive E, T parameters from sums in the responsibility matrices"""
    # S (number of states) is taken from the notebook's global scope in this version
    transitions = list((l,k) for l in range(S) for k in range(S))    
    E = np.zeros(shape=(S,len(alpha)))
    T = np.zeros(shape=(S,S))
    for i in range(len(X)):
        for k in range(S):
            E[k][X[i]] += Pi_s[k][i]
    for i in range(len(X)-1):
        for t in range(len(transitions)):
            (l,k) = transitions[t]
            T[l][k] += Pi_ss[t][i]
    T_sums = T.sum(axis=1)
    T =  T/ T_sums[:, np.newaxis]
    E_sums = E.sum(axis=1)
    E =  E/ E_sums[:, np.newaxis]        
    return T,E

In [136]:
# crooked casino hmm
T = [[0.9, 0.1], [0.1,0.9]]
E = [[0.5, 0.5], [0.75,0.25]]
S = 2
#string
X = [int(i) for i in "1 0 1 0 0 0 1 0 1 1 0".split(' ')]
alpha = [0,1]

Pi_s, Pi_ss = forward_backward(X,T,E,S)
T, E = estimate_parameters(X, Pi_s, Pi_ss, alpha)


print("\nE:")
print(pd.DataFrame(np.around(E,3)))
print("\nT:")
print(pd.DataFrame(np.around(T,3)))




E:
       0      1
0  0.514  0.486
1  0.594  0.406

T:
       0      1
0  0.910  0.090
1  0.132  0.868


In [188]:

### Baum-Welch Learning
import numpy as np
import pandas as pd

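def parse_BaumWelch_file(filepath):
    """parse a BA10K-style file: iterations, string, alphabet, states, T, E, each separated by a dashed line"""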
    with open(filepath,'r') as file:
        inputs = file.readlines()
    iters = int(inputs[0].strip())
    alpha = inputs[4].strip().split()
    X = list(alpha.index(x) for x in inputs[2].strip())
    states = inputs[6].strip().split()
    S = len(states)
    # HMM Transition, emission matrices,
    T = np.array([line.split()[1:] for line in inputs[9: 9+S]], float)
    E = np.array([line.split()[1:] for line in inputs[9+S+2:]], float)
    return iters, X, alpha, T, E, S, states


# new forward-backward algorithm that returns π = {π*, π**}
def forward_backward(X,T,E,S):
    """E-step (X, θ) -> π (π*, π**)"""    
    def forward(X, T, E, S): # forward outcome likelihood
        F = np.zeros(shape = (S, len(X))) 
        for state in range(S):
            F[state][0] = E[state][X[0]] / S
        for i in range(1,len(X)):
            for state in range(S):
                F[state][i] = sum(T[k][state]*F[k][i-1] for k in range(S))
                F[state][i] *= E[state][X[i]]
        return F
    def backward(X, T, E, S): # slightly different recurrence
        B = np.ones(shape = (S, len(X))) # last column nodes = 1
        for i in range(len(X)-2, -1, -1): 
            for state in range(S):
                B[state][i] = sum(T[state][k] * B[k][i+1] * E[k][X[i+1]] for k in range(S))
        return B
    F = forward(X, T, E, S)
    B = backward(X, T, E, S)
    F_sink = sum(F[state][len(X)-1] for state in range(S))
    pi_star = np.multiply(F, B)/F_sink
    pi_2star = np.zeros(shape = (len(T)**2, len(X)-1))
    transitions = list((l,k) for l in range(S) for k in range(S))    
    for i in range(len(X)-1):
        for t in range(len(T)**2):
            (l,k) = transitions[t]
            weight_edge = E[k][X[i+1]] * T[l][k]
            pi_2star[t][i] = F[l][i] * weight_edge * B[k][i+1]
    pi_2star = pi_2star/ F_sink
    return pi_star, pi_2star

def estimate_parameters(X, Pi_s, Pi_ss, alpha, S):
    """M-step: (X, π) -> θ (T,E)"""
    transitions = list((l,k) for l in range(S) for k in range(S))    
    E = np.zeros(shape=(S,len(alpha)))
    T = np.zeros(shape=(S,S))
    for i in range(len(X)):
        for k in range(S):
            E[k][X[i]] += Pi_s[k][i]
    for i in range(len(X)-1):
        for t in range(len(transitions)):
            (l,k) = transitions[t]
            T[l][k] += Pi_ss[t][i]
    T_sums = T.sum(axis=1)
    T =  T/ T_sums[:, np.newaxis]
    E_sums = E.sum(axis=1)
    E =  E/ E_sums[:, np.newaxis]        
    return T,E

def baum_welch_learning(X,T,E,S, alpha, iters):
    """returns HMM parameters learned from emitted string X"""
    for _ in range(iters):
        # E-step
        Pi_s, Pi_ss = forward_backward(X,T,E,S)
        # M-step
        T, E = estimate_parameters(X, Pi_s, Pi_ss, alpha, S)
    return T, E

In [189]:

%cd /Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/data
iters, X, alpha, T, E, S, states = parse_BaumWelch_file("BA10K.sample.txt")
print(iters, X, alpha, T, E, S, states)
T,E = baum_welch_learning(X,T,E,S, alpha, iters)
E = pd.DataFrame(np.around(E,3), index= states, columns = alpha)
T = pd.DataFrame(np.around(T,3), index= states, columns = states)
print(E)
print(T)
%cd /Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/results/
T.to_csv("BA10k.sample.out.txt", sep ='\t')
with open("BA10k.sample.out.txt",'a') as f:
    f.write('--------\n')
E.to_csv("BA10k.sample.out.txt", mode = 'a', sep='\t')
    

/Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/data
10 [0, 2, 1, 1, 2, 1, 2, 1, 0, 1] ['x', 'y', 'z'] [[0.019 0.981]
 [0.668 0.332]] [[0.175 0.003 0.821]
 [0.196 0.512 0.293]] 2 ['A', 'B']
       x      y      z
A  0.242  0.000  0.758
B  0.172  0.828  0.000
       A      B
A  0.000  1.000
B  0.786  0.214
/Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/results


In [192]:
# stepik test
%cd /Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/data
iters, X, alpha, T, E, S, states = parse_BaumWelch_file("dataset_26262_5.txt")
print(iters, X, alpha, T, E, S, states)
T,E = baum_welch_learning(X,T,E,S, alpha, iters)
E = pd.DataFrame(np.around(E,3), index= states, columns = alpha)
T = pd.DataFrame(np.around(T,3), index= states, columns = states)
print(E)
print(T)
# %cd /Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/results/
# file = "ba10k_stepik.txt"
# T.to_csv(file, sep ='\t')
# with open(file,'a') as f:
#     f.write('--------\n')
# E.to_csv(file, mode = 'a', sep='\t')
    

/Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/data
100 [1, 2, 0, 0, 2, 0, 2, 0, 0, 1, 2, 1, 0, 2, 2, 2, 0, 1, 1, 2, 0, 2, 1, 1, 2, 0, 1, 2, 2, 1, 0, 0, 2, 2, 2, 0, 2, 1, 1, 2, 1, 1, 1, 0, 2, 0, 1, 0, 2, 0, 0, 1, 2, 1, 0, 0, 0, 0, 2, 0, 0, 1, 1, 0, 1, 2, 1, 2, 2, 2, 0, 1, 2, 1, 2, 1, 2, 0, 1, 1, 1, 0, 1, 2, 2, 2, 0, 0, 1, 1, 2, 0, 2, 2, 1, 1, 1, 2, 1, 0] ['x', 'y', 'z'] [[0.296 0.353 0.351]
 [0.569 0.303 0.128]
 [0.225 0.206 0.568]] [[0.357 0.31  0.333]
 [0.388 0.231 0.381]
 [0.316 0.279 0.404]] 3 ['A', 'B', 'C']
       x      y      z
A  0.642  0.351  0.007
B  0.023  0.140  0.838
C  0.192  0.694  0.114
       A      B      C
A  0.361  0.258  0.381
B  0.675  0.309  0.015
C  0.005  0.800  0.195
/Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/results


### nailed it!

In [193]:
# rosalind test
%cd /Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/data
iters, X, alpha, T, E, S, states = parse_BaumWelch_file("rosalind_ba10k.txt")
print(iters, X, alpha, T, E, S, states)
T,E = baum_welch_learning(X,T,E,S, alpha, iters)
E = pd.DataFrame(np.around(E,3), index= states, columns = alpha)
T = pd.DataFrame(np.around(T,3), index= states, columns = states)
print(E)
print(T)
%cd /Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/results/
file = "ba10k_rosalind.txt"
T.to_csv(file, sep ='\t')
with open(file,'a') as f:
    f.write('--------\n')
E.to_csv(file, mode = 'a', sep='\t')
    

/Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/data
100 [2, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 2, 2, 2, 2, 0, 0, 1, 1, 2, 1, 2, 1, 0, 2, 2, 1, 0, 1, 1, 2, 0, 2, 0, 1, 2, 2, 0, 2, 2, 2, 1, 0, 2, 2, 0, 0, 1, 2, 1, 0, 2, 0, 0, 2, 0, 2, 1, 1, 0, 2, 0, 1, 1, 0, 2, 0, 2, 2, 0, 0, 0, 1, 2, 1, 2, 1, 2, 1, 1, 0, 0, 0, 1, 2, 2, 1, 0, 0, 0, 0, 2, 1, 1, 1, 2, 0, 2] ['x', 'y', 'z'] [[0.313 0.639 0.048]
 [0.357 0.441 0.203]
 [0.347 0.358 0.296]] [[0.225 0.309 0.466]
 [0.387 0.137 0.476]
 [0.491 0.158 0.35 ]] 3 ['A', 'B', 'C']
       x      y      z
A  0.146  0.555  0.299
B  0.580  0.190  0.230
C  0.006  0.007  0.987
       A      B      C
A  0.151  0.671  0.178
B  0.445  0.486  0.069
C  0.957  0.042  0.001
/Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/results



---

**Exercise Break**: Use Baum-Welch learning to learn parameters for the HMM modeling CG-islands and for the HIV profile HMM. Compare these parameters with parameters derived by applying Viterbi learning.
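
When running Baum-Welch on longer strings such as the CG-island data, a fixed iteration count may stop too early or waste work. A possible alternative, sketched below under stated assumptions, tracks $\log Pr(x)$ after each E-step and stops once the gain falls below a tolerance. The names `e_step`, `m_step`, and `baum_welch_until_converged`, and the parameters `max_iters` and `tol`, are hypothetical and not part of the notebook. The sketch assumes a uniform initial state distribution, as in the cells above, and rescales each column so that long strings do not underflow. It does not guard against a state whose total responsibility drops to zero, which would cause a division by zero in the M-step; pseudocounts are one common remedy.

```python
import numpy as np

def e_step(X, T, E):
    """Scaled forward-backward; returns Pi_star, Pi_2star and log Pr(x)."""
    S, n = T.shape[0], len(X)
    F, B, c = np.zeros((S, n)), np.ones((S, n)), np.zeros(n)
    col = E[:, X[0]] / S                       # uniform initial distribution
    c[0] = col.sum()
    F[:, 0] = col / c[0]
    for i in range(1, n):
        col = (T.T @ F[:, i - 1]) * E[:, X[i]]
        c[i] = col.sum()
        F[:, i] = col / c[i]
    for i in range(n - 2, -1, -1):
        B[:, i] = T @ (B[:, i + 1] * E[:, X[i + 1]]) / c[i + 1]
    pi_star = F * B                            # scaling constants cancel
    emitted = E[:, X[1:]] * B[:, 1:] / c[1:]   # emitted[k, i] uses column i + 1
    pi_2star = np.einsum("li,lk,ki->lki", F[:, :-1], T, emitted)
    return pi_star, pi_2star, np.log(c).sum()

def m_step(X, pi_star, pi_2star, n_symbols):
    """Normalize summed responsibilities into new T and E."""
    S = pi_star.shape[0]
    T = pi_2star.sum(axis=2)
    E = np.zeros((S, n_symbols))
    for b in range(n_symbols):
        E[:, b] = pi_star[:, X == b].sum(axis=1)
    return T / T.sum(axis=1, keepdims=True), E / E.sum(axis=1, keepdims=True)

def baum_welch_until_converged(X, T, E, max_iters=100, tol=1e-6):
    """Iterate E- and M-steps until log Pr(x) improves by less than tol."""
    X = np.asarray(X, dtype=int)
    T, E = np.asarray(T, float), np.asarray(E, float)
    history = []                               # log Pr(x) before each M-step
    for _ in range(max_iters):
        pi_star, pi_2star, loglik = e_step(X, T, E)
        history.append(loglik)
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break
        T, E = m_step(X, pi_star, pi_2star, E.shape[1])
    return T, E, history

# Usage with the starting parameters from the BA10K sample run above
X = [0, 2, 1, 1, 2, 1, 2, 1, 0, 1]             # "xzyyzyzyxy" over alphabet x, y, z
T0 = [[0.019, 0.981], [0.668, 0.332]]
E0 = [[0.175, 0.003, 0.821], [0.196, 0.512, 0.293]]
T_hat, E_hat, history = baum_welch_until_converged(X, T0, E0, max_iters=10)
print(np.around(history, 3))
```

Since each EM iteration should not decrease the likelihood, a drop in `history` beyond rounding error would point to a bug in either step, which makes the trace a handy debugging aid as well as a stopping rule.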

### reached the end of the program.
