### BA10G Multiple sequence alignment with profile HMM

**STOP and Think**: Since the HMM diagram in the figure below has 25 nodes — not including the start and end states — the Viterbi graph for the string emitted in this figure has 25 rows. How many columns does this Viterbi graph have?

*The length of the emitted string, plus one column for each deletion state on the path?*


<br>

<img src='http://bioinformaticsalgorithms.com/images/HMM/aligning_text_against_profile.png' width = 500 px>

<br>

<br>

We are now ready to align a string Text to a multiple alignment by constructing the Viterbi graph for this string (see figure below) and solving the Decoding Problem to find the most likely hidden path.

<br>

<img src =http://bioinformaticsalgorithms.com/images/HMM/AEFDFDC_path_1.png width = 500 px >


**Figure**: The Viterbi graph for HMM(Alignment, θ) and a path in this graph (shown in purple) corresponding to the hidden path for the emitted string AEFDFDC from the alignment we have been working with. Edges between columns correspond to allowed transitions in the HMM diagram and have an implied rightward orientation. Edges entering nodes corresponding to deletion states are dashed. Emitted symbols are shown beneath each column.
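
For reference, the Decoding Problem on an HMM *without* silent states can be sketched in a few lines. This is a minimal log-space sketch, not the notebook's solution: the function name `viterbi`, the dict-of-dicts layout for `transition` and `emission`, and the uniform `initial` distribution are all assumptions made for illustration.

```python
import math

def viterbi(text, states, transition, emission, initial=None):
    """Most likely hidden path for an HMM with no silent states (log space)."""
    if initial is None:
        # assumption: equal probability of starting in any state
        initial = {k: 1.0 / len(states) for k in states}

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # score[i][k]: best log-probability of a path that emits text[:i+1] and ends in state k
    score = [{k: log(initial[k]) + log(emission[k][text[0]]) for k in states}]
    back = [{}]
    for i in range(1, len(text)):
        score.append({})
        back.append({})
        for k in states:
            best_prev = max(states, key=lambda l: score[i - 1][l] + log(transition[l][k]))
            score[i][k] = (score[i - 1][best_prev]
                           + log(transition[best_prev][k])
                           + log(emission[k][text[i]]))
            back[i][k] = best_prev

    # walk the back pointers from the best final state
    path = [max(states, key=lambda k: score[-1][k])]
    for i in range(len(text) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return "".join(reversed(path))

# toy two-state HMM (made-up numbers)
toy_transition = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.1, "B": 0.9}}
toy_emission = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.1, "y": 0.9}}
toy_path = viterbi("xxyy", "AB", toy_transition, toy_emission)
```

Every column of this graph corresponds to exactly one emitted symbol, which is the property the profile HMM graph has to preserve.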


<br>

**STOP and Think**: Find paths through the Viterbi graph corresponding to the bottom four hidden paths in the figure below. What happens?

*It just doesn't work? For most sequences there are no edges through the Viterbi graph that reach the end.*

<br>
<img src = http://bioinformaticsalgorithms.com/images/HMM/alignment_HMM_paths.png width = 400px>




#### The troublesome silent states

If you reached this point without any questions about the figure two steps ago, then we have successfully concealed from you that solving **the Decoding Problem for HMMs with silent states** is not as simple as it may appear: the graph we proposed is not a Viterbi graph! 


To see why not, consider the path in the figure below, which emits the same string but passes through one fewer silent deletion state, thus reducing the number of columns by one. But the *Viterbi graph is not allowed to change depending on the hidden path π, since we know nothing about the hidden path in advance*! 
  

Instead, the number of **columns in the Viterbi graph must equal the length of the emitted string, a condition that is violated in the two figures we have proposed**.

**STOP and Think**: How can we modify the notion of the Viterbi graph for HMMs with silent states?

<br>

<img src = http://bioinformaticsalgorithms.com/images/HMM/AEFDFDC_path_2.png width=450px>

**Figure**: Another path through another “Viterbi graph” emitting the same string AEFDFDC.


<br>



More generally, **the Viterbi algorithm does not tolerate silent states other than the initial and terminal states**. In other words, this algorithm assumes that node (k, i) in the Viterbi graph describes the event “the HMM emitted symbol x<sub>i</sub> when it was in state k”. However, if k is a silent state, then the role of the node (k, i) in the Viterbi graph is poorly defined, as it is unclear how to define the weight of edges entering this node.
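
The assumption can be read straight off the standard Viterbi recurrence, written here with s<sub>k,i</sub> (a symbol introduced for this note) for the score of the best path ending at node (k, i):

```latex
s_{k,i} \;=\; \max_{l \,\in\, \text{States}} \; s_{l,\,i-1} \cdot \text{transition}_{l,k} \cdot \text{emission}_{k}(x_i)
```

Every term advances from column i-1 to column i and multiplies by an emission probability; a silent state k has no emission term, so the recurrence has nothing to say about it.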


Fortunately, we can fix this issue in the case of profile HMMs by defining the Viterbi graph with |States| rows and |Text| columns (see figure below). Every time the HMM moves into a deletion state, rather than crossing over to the next column of the Viterbi graph (as in the two failed figures), we will move within the same column. When the HMM moves into a match or insertion state, we will move to the next column. As a result, every column of the Viterbi graph corresponds to a single emitted symbol, even though a path can pass through more than one state in a given column.
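
As a small illustration of this rule, the sketch below maps a hidden path to the (state, column) nodes it visits. The helper name `path_to_nodes` and the convention that column 0 holds anything visited before the first symbol is emitted are assumptions made for this example.

```python
def path_to_nodes(path):
    """Map a profile-HMM hidden path to (state, column) nodes of the Viterbi graph.

    Match and insertion states emit a symbol and advance to the next column;
    deletion states are silent and stay in the current column.
    """
    column = 0  # assumption: column 0 means "nothing emitted yet"
    nodes = []
    for state in path.split():
        if state[0] in "MI":
            column += 1
        nodes.append((state, column))
    return nodes

nodes = path_to_nodes("M1 D2 D3 M4 I4 M5")
```

The last column reached always equals the number of emitted symbols, however many deletion states the path passes through.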


**Exercise Break**: Show that the vertical edge connecting (i, l) to (i, k), where k is a deletion state, should be assigned weight equal to transition<sub>l, k</sub>.

**STOP and Think**: Take another look at the graph from the previous step, reproduced below. Are there any remaining issues with it?


<br> 
<img src = http://bioinformaticsalgorithms.com/images/HMM/profile_HMM_vertical_deletion_edges.png width = 400 px>
<br>

**Figure**: The Viterbi graph with |States| rows and |Text| columns for the profile HMM we have been considering emitting a string Text of length 7 so that edges entering deletion states are drawn downward within the same column instead of between columns as in the two previously proposed figures. The purple path corresponds to the path through the HMM emitting AEFDFDC.

 
There is still a minor flaw with the graph from the previous step. If the HMM moves from the initial state into Deletion(1), then the HMM will move through the first column without emitting a symbol. We will therefore transform the initial state into a column of silent states containing the initial state and all deletion states (figure below). This way, if the HMM enters Deletion(1) from the initial state, it can move downward through deletion states before transitioning to a match or insertion state in the first column.


<br> 
<img src =  http://bioinformaticsalgorithms.com/images/HMM/profile_HMM_Viterbi_complete.png width = 400 px>
<br>

**Figure**: The final Viterbi graph for the profile HMM emitting a string of length 7. Edges within the same column have a downward orientation; edges between columns have a rightward orientation. Once again, the purple path corresponds to the path through the HMM emitting AEFDFDC.
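
The column of silent states can be scored on its own before any symbol is read. The sketch below does this in log space; the function name `init_silent_column`, the dict-of-dicts `transition` layout and the toy probabilities are assumptions for illustration only.

```python
import math

def init_silent_column(states, transition):
    """Log-scores for the silent column: the initial state, then a chain of deletion states."""
    score = {states[0]: 0.0}  # log(1) for the initial state
    prev = states[0]
    for state in states[1:]:
        if state.startswith("D"):
            # the only way into D(j) without emitting is from the previous silent state
            score[state] = score[prev] + math.log(transition[prev][state])
            prev = state
    return score

toy_states = ["S", "I0", "M1", "D1", "I1", "M2", "D2", "I2", "E"]
toy_transition = {"S": {"D1": 0.1}, "D1": {"D2": 0.5}}  # made-up probabilities
silent = init_silent_column(toy_states, toy_transition)
```

Match and insertion states get no score in this column because they cannot be entered without emitting a symbol.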

<br>



In [87]:
import pandas as pd
import numpy as np

def parse_HMM(file): # with pseudocount sigma
    lines = file.readlines()
    text = lines[0].strip()
    # theta and sigma may be separated by a tab or by spaces; split() handles both
    (theta, sigma) = (float(i) for i in lines[2].strip().split())
    alpha = lines[4].strip().split()
    alpha = dict(zip(alpha, range(len(alpha))))
    alignment = [line.strip() for line in lines[6:]]
    print("Text: ", text)
    print("Alignment:")
    for a in alignment:
        print('\t', a)
    print("Alpha:", alpha)
    print("Theta:", theta)
    print("Sigma:", sigma)
    return text, alignment, alpha, theta, sigma

### algorithm for building profile HMM from an alignment|theta
def profileHMM(alignment, alpha, theta):

    def get_seed_alignment(alignment, theta, alphabet):
        """returns seed alignment columns for HMM"""
        k = len(alignment[0]); a = len(alphabet.keys())
        freq = np.zeros(shape=(a+1, k)) 
        for seq in alignment:
            for i in range(k):
                if seq[i] == '-':
                    freq[a][i] += 1
                else:
                    freq[alphabet[seq[i]]][i] +=1
        n = len(alignment)
        seed = [x/n < theta for x in freq[a]]
        return seed

    def state_transition(T, prev, kind, S):
        """ find next Match, Del, Ins, or End state in [states] """
        x=0
        if S[prev][0] == 'M':
            x=1
        for nxt in range(prev+1+x, len(T)):
            if S[nxt][0] == kind[0]:
                T[prev][nxt] += 1
                return T, nxt
            
    def normalize_matrices(T,E):
        """Normalize transition and emission counts so that each state's row sums to 1"""
        for state in range(len(S)):
            if sum(T[state]) > 0:
                T[state] =  T[state]/sum(T[state])
            if sum(E[state]) >0:
                E[state] =  E[state]/sum(E[state])
        return T, E
    
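    def pseudo_normalize(T, E, sigma=0.01):
        """Adds pseudocount sigma to the allowed entries of the HMM matrices, then renormalizes.

        sigma defaults to 0.01, the value used by every dataset in this notebook.
        """
        # add pseudocounts to the emissions of every non-deletion state and normalize E
        for i in range(1, len(S)-1):
            if S[i][0] != 'D':
                E[i] += sigma
                E[i] = E[i]/sum(E[i])
        # add pseudocounts to the allowed transitions only and normalize T:
        # state i may move to the next insertion, match, or deletion state (or 'E'),
        # which occupy indices start..start+2 in S
        for i in range(len(S)-1):
            start = 3*((i+1)//3)+1
            end = start+3
            for j in range(start, end):
                if j < len(S):
                    T[i][j] += sigma
            T[i] = T[i]/sum(T[i])
        return T, E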
    
    ### Walk each sequence to count transitions, emissions
    seed = get_seed_alignment(alignment, theta, alpha)
    n = len(alignment); k = len(alignment[0])
    S = ['S','I0']+list(c+str(n) for n in range(1,sum(seed)+1) for c in "MDI")+['E']
    E = np.zeros(shape=(len(S), len(alpha.keys())))
    T = np.zeros(shape=(len(S), len(S)))

    for seq in alignment:
        # 'state' is the hidden state col/row; i is pos in alignment
        state = 0; i = 0
        while i < k:
            # seed: either match or del as hidden state
            if seed[i]:
                if seq[i] in alpha:
                    T, state = state_transition(T, state, 'Match', S)
                    E[state][alpha[seq[i]]] +=1
                else:
                    T, state = state_transition(T, state, 'Deletion', S)
            # not seed: either insert or nothing
            else:
                # count any emissions before next seed column
                emits = []
                while not seed[i]:
                    if seq[i] in alpha:
                        emits.append(seq[i])
                    i += 1
                    if i == k: 
                        break  # hit end of sequence
                i-=1                
                if len(emits)>0:
                    # count the transition(state, insertion)
                    T, state = state_transition(T, state, 'Insert', S)
                    # get all emissions(state) in insertion
                    for symbol in emits:
                        E[state][alpha[symbol]] += 1
                    # count all symbols as cyclic transition t(ins_x, ins_x)
                    if len(emits) >1:
                        T[state][state] += len(emits)-1
                # else do nothing, just gaps in align
            i += 1
        # from last state to 'End'    
        T, state = state_transition(T, state, 'End', S)

    T,E = normalize_matrices(T,E)
    T, E = pseudo_normalize(T, E)
    return S, T, E, seed



def Viterbi_alignment(T,E,S,alpha,text):
    """Aligns text to HMM using Viterbi algorithm for optimal hidden path.

    alpha must be a list of symbols here (not the dict returned by parse_HMM).
    """
    
    # initialize the Viterbi graph: one row per state except 'E',
    # one column per symbol of text plus a leading column of silent states
    Viterbi = np.ones(shape=(len(S)-1, len(text)+1))*-float('inf')
    pointers = [[False]+[False for t in text] for s in S[:-1]] 
    Viterbi[0][0] = np.log(1)
    pointers[0][0] = 'root'
    # per state kind: (rows back to the first predecessor, columns back)
    offsets = {'M':(3,1),'D':(4,0), 'I':(2,1)}

    # init 1st column: chain of deletion states (no match or insertion states)
    prev = 0
    for state in range(len(S)):
        if S[state][0]=='D':
            Viterbi[state][0] = Viterbi[prev][0] + np.log(T[prev][state])
            pointers[state][0] = (prev,0)
            prev=state

    for column in range(1, len(text)+1):
        symbol = alpha.index(text[column-1])
        for state in range(1, len(S[:-1])):
            (x,y) = offsets[S[state][0]]
            best_score = -float('inf')
            if S[state][0]=='D':
                p_emit = 1
            else:
                p_emit = E[state][symbol]
#             print('\nCurrent cell', S[state],(state, column), "; symbol:", text[column-1],
#                   "p_emit =", np.around(p_emit,4))
            for prev in range(state-x, state-x+3):
                if not prev<0:
                    p_prev = Viterbi[prev][column-y]
                    p_trans = T[prev][state]
                    p_edge = p_prev + np.log(p_trans)
#                     print("\tedge:",S[prev], (prev, column-y), '–>',  S[state], (state, column),)
#                     print("\t\tViterbi: {}, Transition: {};;; Edge -> {}".format(np.around(p_prev,4),
#                                                                                  np.around(p_trans,4),
#                                                                                  np.around(p_edge,4)))
                    if p_edge > best_score:
                        best_score = p_edge
                        Viterbi[state][column] = p_edge + np.log(p_emit)
                        pointers[state][column] = (prev, column-y)
    #                     print("\t\t\t**best edge:", (prev, column-y), S[prev])

    print(pd.DataFrame(np.around(Viterbi,3), columns=['start']+list(text), index=S[:-1]))
    print(pd.DataFrame(pointers, columns=['start']+list(text), index=S[:-1]))                
    return Viterbi, pointers

def backtrack(Viterbi, pointers):
    """Walk the back pointers from the best final state to recover the hidden path.

    Note: relies on the notebook globals S, T and text.
    """
    # start from the best of the last M, D, I states in the final column,
    # including the transition from that state into the end state 'E'
    score = -float('inf')
    for state in range(len(S)-4,len(S)-1):
        final = Viterbi[state][len(text)] + np.log(T[state][len(S)-1])
        if final > score:
            last = state
            score = final
    path = ['']
    print("Best Score at end of Path:", score, "at state:", S[last])

    # backtrack to recreate max likelihood hidden_path in reverse
    x,y = last, len(text)
    while path:
        path.append(S[x])    
        (x,y) = pointers[x][y]
        if (x,y) ==(0,0):
            path = ' '.join(path[-1:0:-1])
            break

    print('\n\nPATH:\n',path)
    return path
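
The `offsets` table in `Viterbi_alignment` depends on the order of the state list `S` (`S`, `I0`, then `M`, `D`, `I` for each match column, then `E`). The sketch below rebuilds that ordering for a small profile and lists the predecessors each offset selects; `profile_states`, `OFFSETS` and `predecessors` are hypothetical helper names used only for this check.

```python
def profile_states(n_match):
    """State labels in the order used above: S, I0, then M/D/I per match column, then E."""
    return ["S", "I0"] + [c + str(j) for j in range(1, n_match + 1) for c in "MDI"] + ["E"]

# per state kind: (rows back to the first predecessor, columns back)
OFFSETS = {"M": (3, 1), "D": (4, 0), "I": (2, 1)}

def predecessors(states, index):
    """Labels of the states with an edge into states[index]."""
    back, _ = OFFSETS[states[index][0]]
    return [states[p] for p in range(index - back, index - back + 3) if p >= 0]

toy_S = profile_states(2)
for name in ("I0", "M1", "D1", "M2", "D2", "I2"):
    print(name, "<-", predecessors(toy_S, toy_S.index(name)))
```

Match and deletion states of column j are entered from the three states of column j-1, while an insertion state is entered from the match and deletion states of its own column and from itself.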

In [88]:
### Sample data:
with open('/Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/data/ba10g.sample.txt') as f:
    text, alignment, alpha, theta, sigma = parse_HMM(f)

S, T, E, seed = profileHMM(alignment, alpha, theta)
alpha = list(alpha.keys())
print(pd.DataFrame(np.around(T,3), index = S, columns=S))
print(pd.DataFrame(np.around(E,3), index = S, columns=alpha))
V, P = Viterbi_alignment(T,E,S, alpha, text)
path = backtrack(V,P)


Text:  AEFDFDC
Alignment:
	 ACDEFACADF
	 AFDA---CCF
	 A--EFD-FDC
	 ACAEF--A-C
	 ADDEFAAADF
Alpha: {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5}
Theta: 0.4
Sigma: 0.01
      S     I0     M1     D1     I1     M2     D2     I2     M3     D3  ...  \
S   0.0  0.010  0.981  0.010  0.000  0.000  0.000  0.000  0.000  0.000  ...   
I0  0.0  0.333  0.333  0.333  0.000  0.000  0.000  0.000  0.000  0.000  ...   
M1  0.0  0.000  0.000  0.000  0.010  0.786  0.204  0.000  0.000  0.000  ...   
D1  0.0  0.000  0.000  0.000  0.333  0.333  0.333  0.000  0.000  0.000  ...   
I1  0.0  0.000  0.000  0.000  0.333  0.333  0.333  0.000  0.000  0.000  ...   
M2  0.0  0.000  0.000  0.000  0.000  0.000  0.000  0.010  0.981  0.010  ...   
D2  0.0  0.000  0.000  0.000  0.000  0.000  0.000  0.010  0.010  0.981  ...   
I2  0.0  0.000  0.000  0.000  0.000  0.000  0.000  0.333  0.333  0.333  ...   
M3  0.0  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  ...   
D3  0.0  0.000  0.000  0.000  0.000  0.0

#### Getting the right answer on the sample dataset now! yay! going to call it quits for tonight

In [89]:
### extra data:
with open('/Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/data/ba10g.extra.txt') as f:
    text, alignment, alpha, theta, sigma = parse_HMM(f)


S, T, E, seed = profileHMM(alignment, alpha, theta)
alpha = list(alpha.keys())
print(pd.DataFrame(np.around(T,3), index = S, columns=S))
print(pd.DataFrame(np.around(E,3), index = S, columns=alpha))
V, P = Viterbi_alignment(T,E,S,alpha, text)
path = backtrack(V,P)


Text:  EEBEABDCEEABCCCEEBDEDCADEDACCDCBBEECDBDACABDADCBE
Alignment:
	 EEBBA--C-DBAA-AECD--BDB---CC-DDCBBCEDE-EBB-DAEE-C
	 EEEEABBCEABBCDEE-DAEBDBAEDC-BDBCB--C-B-BCA-DAEECA
	 --CEB-ACCDEACEEEEDBEED-ADBCCDAC--BC--BDBCAEDAEECC
	 A--AABDCE-A-CD-ECD-EBBA-EDC-DACCBBCCD-D-BA-DAAEBC
	 EECEAB--EDDACCE-CD-E--B-EDCCD-CCBBCCD-DBBA--AE-CA
	 E-CDA-DCECAAECB-CDCEB-B-BDCCD---B--CD-DBCDBDAEB-C
	 EBCEAEDC-DABC--A-DCEDDBAED-CD-CCBBCCEBDB--BEA-EEC
	 AC-E-BDCEDAADDEECDEEB-CAEDC-DD-CBBCCD-DBCABDAEECC
	 EECBABDCEDEAEC-DCDC-BDBDEDA-D-AD-A-EABEB--BDA-ECC
Alpha: {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4}
Theta: 0.359
Sigma: 0.01
       S     I0     M1     D1     I1     M2     D2   I2   M3   D3  ...  M42  \
S    0.0  0.010  0.873  0.118  0.000  0.000  0.000  0.0  0.0  0.0  ...  0.0   
I0   0.0  0.333  0.333  0.333  0.000  0.000  0.000  0.0  0.0  0.0  ...  0.0   
M1   0.0  0.000  0.000  0.000  0.010  0.738  0.252  0.0  0.0  0.0  ...  0.0   
D1   0.0  0.000  0.000  0.000  0.010  0.010  0.981  0.0  0.0  0.0  ...  0.0

In [None]:
### Trying to figure out why it doesn't work on this dataset

# path = path.split(' ')
# len(path)
# sol = 'M1 M2 M3 M4 M5 M6 M7 M8 M9 D10 M11 M12 I12 I12 M13 M14 M15 M16 M17 M18 D19 M20 M21 M22 M23 I23 M24 M25 M26 I26 I26 M27 D28 M29 M30 M31 I31 I31 D32 M33 M34 I34 M35 M36 M37 M38 I38 M39 M40 M41 M42 M43 M44'.split(' ')
# len(sol)
# sum(1 if s[0] in ('M','I') else 0 for s in path), sum(1 if s[0] in ('D') else 0 for s in path)
# sum(1 if s[0] in ('M','I') else 0 for s in sol), sum(1 if s[0] in ('D') else 0 for s in sol)
# v = pd.DataFrame(np.around(V,3), columns=['start']+list(text), index=S[:-1])
# p = pd.DataFrame(P, columns=['start']+list(text), index=S[:-1])     
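
When comparing a computed path against an expected one, it can help to locate the first position where they disagree instead of only comparing counts. A minimal sketch, assuming both paths are space-separated strings of state labels; `first_difference` is a hypothetical helper, not part of the notebook.

```python
def first_difference(path_a, path_b):
    """Index and states where two hidden paths first diverge, or None if identical."""
    a, b = path_a.split(), path_b.split()
    for i, (s, t) in enumerate(zip(a, b)):
        if s != t:
            return i, s, t
    if len(a) != len(b):
        # one path is a proper prefix of the other
        i = min(len(a), len(b))
        return i, (a[i] if i < len(a) else None), (b[i] if i < len(b) else None)
    return None

diff = first_difference("M1 M2 D3", "M1 I1 M2")
```

The returned index points at the Viterbi cell whose back pointer is worth inspecting first.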


In [101]:
### stepik test data:
with open('/Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/data/dataset_26259_14.txt') as f:
    text, alignment, alpha, theta, sigma = parse_HMM(f)


S, T, E, seed = profileHMM(alignment, alpha, theta)
alpha = list(alpha.keys())
print(pd.DataFrame(np.around(T,3), index = S, columns=S))
print(pd.DataFrame(np.around(E,3), index = S, columns=alpha))
V, P = Viterbi_alignment(T,E,S,alpha,text)
path = backtrack(V,P)


Text:  ADDBACBDDDBACEABEEBDBEEDAECEAEBDCEADEEBCEADBDCABD
Alignment:
	 ADD-AEBCDEBECE-BE-BA-EEEAB-ECEBDCEABEE-B--DBEEB-D
	 A--EAEB-DEBEC-DB-EBCBEAAAB-EADBDC-AD-EDA-A-B-ABB-
	 -DAD-EB-DEBECBDBD-B--EDEABBE-ABD-EADEEBC-ADEDCBBD
	 CADCAE-D-E-E--DBEEBA-EEEC-BEADBCCEDABCBC-ADBD--BD
	 ADDDA-BED-BE-E-BEEB-BEEDA--EBDB-CEA-ECDCCA-BDABBB
	 ADDDAEDDD-BD-EDBEE-DD-E-D-B-ADBD-EB--EBCCE-BDABBD
Alpha: {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4}
Theta: 0.327
Sigma: 0.01
       S     I0     M1     D1     I1     M2     D2   I2   M3   D3  ...  M32  \
S    0.0  0.010  0.819  0.172  0.000  0.000  0.000  0.0  0.0  0.0  ...  0.0   
I0   0.0  0.333  0.333  0.333  0.000  0.000  0.000  0.0  0.0  0.0  ...  0.0   
M1   0.0  0.000  0.000  0.000  0.010  0.786  0.204  0.0  0.0  0.0  ...  0.0   
D1   0.0  0.000  0.000  0.000  0.010  0.981  0.010  0.0  0.0  0.0  ...  0.0   
I1   0.0  0.000  0.000  0.000  0.333  0.333  0.333  0.0  0.0  0.0  ...  0.0   
..   ...    ...    ...    ...    ...    ...    ...  ...  ...  ...  ...  .

#### Correct on the Stepik dataset; now try Rosalind

In [102]:
### rosalind test data:
with open('/Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/data/rosalind_ba10g.txt') as f:
    text, alignment, alpha, theta, sigma = parse_HMM(f)


S, T, E, seed = profileHMM(alignment, alpha, theta)
alpha = list(alpha.keys())
print(pd.DataFrame(np.around(T,3), index = S, columns=S))
print(pd.DataFrame(np.around(E,3), index = S, columns=alpha))
V, P = Viterbi_alignment(T,E,S,alpha, text)
path = backtrack(V,P)


Text:  EACCEECBECCADDEBABEEBEABDCDEEBECEBECBBDCDCDDCDCAA
Alignment:
	 EACCEEB-A-EAABEBA--ECCDBD-DCACE--BACAB-CB-DABA-D-
	 EAC--EBCE--A--EB-DEBCCDB--CCACEBE--DABCADCCABA-AC
	 EACCEEBBECEAABEBCEECCC-C-DDC-EECEBBAABACDC-ABAC-C
	 EACC-BB-E--A--EBA-EE--D-DCDCCCDB-BD-ABECDCD-BAEA-
	 -ACC-E-BEC---B-BA---CED-DCDCBCECEBC-ABACD-D-BA--C
	 CACEEE---CEA-BE--DEECBDBD-DCA--B----DB-CDCEBBA-AC
	 EACD--BDEC-AAB-BADEE-CD-BCDCAC-DABDCABA-ECDC-AEAE
	 EACC--BBECEAABDBADEECCDB-CADACB-E--CAB-C--DAB-BAC
Alpha: {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4}
Theta: 0.297
Sigma: 0.01
       S     I0     M1     D1     I1     M2     D2   I2   M3   D3  ...  M31  \
S    0.0  0.010  0.859  0.131  0.000  0.000  0.000  0.0  0.0  0.0  ...  0.0   
I0   0.0  0.333  0.333  0.333  0.000  0.000  0.000  0.0  0.0  0.0  ...  0.0   
M1   0.0  0.000  0.000  0.000  0.010  0.981  0.010  0.0  0.0  0.0  ...  0.0   
D1   0.0  0.000  0.000  0.000  0.010  0.981  0.010  0.0  0.0  0.0  ...  0.0   
I1   0.0  0.000  0.000  0.000  0.333  0.333  0.3