
# 10: Hidden Markov Models 2: Profile HMM construction
---



Trying to align some protein of interest to an alignment of a family of proteins...

$Alignment$ contains many space symbols and likely do not represent meaningful characteristics of the family. Ignore columns for which the fraction of space symbols is ≥ a **column removal threshold θ** in the profile alignment. Column removal results in a 5 × 8 **seed alignment**:

Given seed alignment $Alignment∗$, build an HMM that models the propensities of symbols in $Alignment∗$ in the **profile matrix** $Profile(Alignment∗)$ , as shown below. 

Compute the **probability that the HMM emits Text**. If the HMM is designed well, then the more similar Text is to the strings in Alignment∗, the more likely it will be emitted by the HMM.

<br>
<br> <img src = http://bioinformaticsalgorithms.com/images/HMM/multiple_alignment_2.png width = 500px> 
<br>



Construct a simple HMM that treats the columns of $Alignment$ as $k$ sequentially linked states called **match states** (adding a panel to the figure from the previous step) denoted $Match(1), . . . , Match(k)$. 

When the HMM enters state $Match(i)$, it emits symbol $x_i$ with probability equal to the frequency of this symbol in the $i$-th column of $Profile(Alignment∗)$. The HMM then moves into state $Match(i + 1)$ with transition probability equal to 1.

The **similarity score** between $Alignment_*$ and $Text$ is the probability $Pr(Text)$ that the HMM for $Alignment∗$ emits $Text$. This score is equal to the product of frequencies in $Profile(Alignment∗)$ corresponding to each symbol of $Text$. For example, the probability that the HMM in the figure below emits ADDAFFDF is

$ 1 · (1/4) · (3/4) · (1/5) · 1 · (1/5) · (3/4) · (3/5) = 0.003375 $

STOP and Think: What are the limitations of the HMM below?


    
<br> <img src = http://bioinformaticsalgorithms.com/images/HMM/multiple_alignment_3.png width = 500px> <br>

Figure: Adding the diagram of a simple HMM that models the profile matrix introduced on the previous step. The match states Match(i) are abbreviated as Mi . The HMM only has one possible path; it is initially in state Match(1), the transition probability from state Match(i) to state Match(i+1) is equal to 1 for all i, and all other transitions are forbidden. Emission probabilities are equal to frequencies in the profile, e.g., emission probabilities for M2 are 0 for A, 2/4 for C, 1/4 for D, 0 for E, and 1/4 for F.

#### Profile HMM Problem: Construct a profile HMM from a multiple alignment.

**Input**: A multiple alignment Alignment and a threshold θ.
**Output**: HMM(Alignment, θ), in the form of transition and emission matrices.


Note: Your matrices can be either space-separated or tab-separated.


        Sample Input:

        0.289
        --------
        A B C D E
        --------
        EBA
        E-D
        EB-
        EED
        EBD
        EBE
        E-D
        E-D
        Sample Output:

            S	I0	M1	D1	I1	M2	D2	I2	E	
        S	0	0	1.0	0	0	0	0	0	0
        I0	0	0	0	0	0	0	0	0	0
        M1	0	0	0	0	0.625	0.375	0	0	0
        D1	0	0	0	0	0	0	0	0	0
        I1	0	0	0	0	0	0.8	0.2	0	0
        M2	0	0	0	0	0	0	0	0	1.0
        D2	0	0	0	0	0	0	0	0	1.0
        I2	0	0	0	0	0	0	0	0	0
        E	0	0	0	0	0	0	0	0	0
        --------
            A	B	C	D	E
        S	0	0	0	0	0
        I0	0	0	0	0	0
        M1	0	0	0	0	1.0
        D1	0	0	0	0	0
        I1	0	0.8	0	0	0.2
        M2	0.143	0	0	0.714	0.143
        D2	0	0	0	0	0
        I2	0	0	0	0	0
        E	0	0	0	0	0

In [85]:
import pandas as pd
import numpy as np

def parse_HMM(file):
    lines = f.readlines()
    theta = float(lines[0].strip())
    alpha = lines[2].strip().split()
    alpha = dict(zip(alpha, range(len(alpha))))
    alignment = [line.strip() for line in lines[4:]]
    print("Alignment:")
    for a in alignment:
        print('\t', a)
    print("Alpha:", alpha)
    print("Theta:", theta)
    return alignment, alpha, theta

### algorithm for building profile HMM from an alignment|theta
def profileHMM(alignment, alpha, theta):

    # # 'seed' indicates which columns pass threshold theta for align*
    def get_seed_alignment(alignment, theta, alphabet):
        """'seed': True if gaps < theta; included in align*
            returns seed alignment columns for HMM"""
        k = len(alignment[0])
        a = len(alphabet.keys())
        print("Length of alignment(k) = ",k)
        print("size of alphabet = ",k,'\n')
        freq = np.zeros(shape=(a+1, k)) 

        for seq in alignment:
            for i in range(k):
                if seq[i] == '-':
#                     print('\tgap', seq[i])
                    freq[a][i] += 1
                else:
                    freq[alphabet[seq[i]]][i] +=1
#                     print('\tbase', seq[i])
        n = len(alignment)
        print('freq matrix:\n', freq)
        seed = [x/n < theta for x in freq[a]]
        return seed
    
    def normalize_matrices(T,E):
        """Normalize Transmission and Emission frequencies to sum of row for state"""
        for state in range(len(S)):
            if sum(T[state]) > 0:
                T[state] =  T[state]/sum(T[state])
            if sum(E[state]) >0:
                E[state] =  E[state]/sum(E[state])
        return T, E
    
    def state_transition(T, prev, kind, S):
        """ find next Match, Del, Ins, or End state in [states] """
        x=0
        if S[prev][0] == 'M':
            x=1
        for nxt in range(prev+1+x, len(T)):
            if S[nxt][0] == kind[0]:
                T[prev][nxt] += 1
#                 print('  >Transition added:',((S[prev],prev),(S[nxt], nxt)))
                return T, nxt
    
    ### Walk each sequence to count transitions, emissions
    seed = get_seed_alignment(alignment, theta, alpha)
    print('seed:',seed)
    n = len(alignment)
    k = len(alignment[0])
    S = ['S','I0']+list(c+str(n) for n in range(1,sum(seed)+1) for c in "MDI")+['E']
    E = np.zeros(shape=(len(S), len(alpha.keys())))
    T = np.zeros(shape=(len(S), len(S)))

    for seq in alignment:
        # 'state' is the hidden state col/row; i is pos in alignment
        state = 0; i = 0
        while i < k:
            # seed: either match or del as hidden state
            if seed[i]:
                if seq[i] in alpha:
                    T, state = state_transition(T, state, 'Match', S)
                    E[state][alpha[seq[i]]] +=1
                else:
                    T, state = state_transition(T, state, 'Deletion', S)
            # not seed: either insert or nothing
            else:
                # count any emissions before next seed column
                emits = []
                while not seed[i]:
                    if seq[i] in alpha:
                        emits.append(seq[i])
                    i += 1
                    if i == k: 
                        break  # hit end of sequence
                i-=1                
                if len(emits)>0:
                    # count the transition(state, insertion)
                    T, state = state_transition(T, state, 'Insert', S)
                    # get all emissions(state) in insertion
                    for symbol in emits:
                        E[state][alpha[symbol]] += 1
                    # count all symbols as cyclic transition t(ins_x, ins_x)
                    if len(emits) >1:
                        T[state][state] += len(emits)-1
                # else do nothing, just gaps in align
            i += 1
        # from last state to 'End'    
        T, state = state_transition(T, state, 'End', S)

    T,E = normalize_matrices(T,E)
    return S, T, E

def outputHMM(T,E,S,alpha):
    lists = []
    lists.append(['']+S)
    for i in range(len(S)):
        # use {0:g} format to remove trailing zeros
        lists.append([S[i]]+['{0:g}'.format(x) for x in T[i]])
    lists.append('--------')
    lists.append([''] + alpha)
    for i in range(len(S)):
        lists.append([S[i]]+['{0:g}'.format(x) for x in E[i]])
    return lists

#### Test dataset

In [86]:
### Extra test data 1:


with open('/Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/data/10e.extra.txt') as f:
    alignment, alpha, theta = parse_HMM(f)

S, T, E = profileHMM(alignment, alpha, theta)
T = np.around(T,3)
E = np.around(E,3)
alpha = list(alpha.keys())
lists =outputHMM(T,E,S, alpha)
with open('10e.out.txt','w') as file:
    for line in lists[:-1]:
        if line[0] != '-':
            file.write('\t'.join(line))
        else:
            file.write(line)
        file.write('\n')
    file.write('\t'.join(lists[-1]))

    
with open('/Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/data/10e.extra.E.csv', 'r') as f:
    e = pd.read_csv(f, header=None)
    solE = np.array(e)
with open('/Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/data/10e.extra.T.csv', 'r') as f:
    t = pd.read_csv(f, header=None)
    solT = np.array(t)
np.all(T== solT),np.all(E== solE), pd.DataFrame(T == solT)
    

Alignment:
	 DCDABACED
	 DCCA--CA-
	 DCDAB-CA-
	 BCDA---A-
	 BC-ABE-AE
Alpha: {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4}
Theta: 0.252
Length of alignment(k) =  9
size of alphabet =  9 

freq matrix:
 [[0. 0. 0. 5. 0. 1. 0. 4. 0.]
 [2. 0. 0. 0. 3. 0. 0. 0. 0.]
 [0. 5. 1. 0. 0. 0. 3. 0. 0.]
 [3. 0. 3. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 1. 0. 1. 1.]
 [0. 0. 1. 0. 2. 3. 2. 0. 3.]]
seed: [True, True, True, True, False, False, False, True, False]


(True,
 True,
        0     1     2     3     4     5     6     7     8     9    10    11  \
 0   True  True  True  True  True  True  True  True  True  True  True  True   
 1   True  True  True  True  True  True  True  True  True  True  True  True   
 2   True  True  True  True  True  True  True  True  True  True  True  True   
 3   True  True  True  True  True  True  True  True  True  True  True  True   
 4   True  True  True  True  True  True  True  True  True  True  True  True   
 5   True  True  True  True  True  True  True  True  True  True  True  True   
 6   True  True  True  True  True  True  True  True  True  True  True  True   
 7   True  True  True  True  True  True  True  True  True  True  True  True   
 8   True  True  True  True  True  True  True  True  True  True  True  True   
 9   True  True  True  True  True  True  True  True  True  True  True  True   
 10  True  True  True  True  True  True  True  True  True  True  True  True   
 11  True  True  True  True  True  Tru

#### Challenge dataset

In [87]:
## rosalind data

with open('/Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/data/rosalind_ba10e.txt') as f:
    alignment, alpha, theta = parse_HMM(f)

S, T, E = profileHMM(alignment, alpha, theta)
T = np.around(T,3)
E = np.around(E,3)
alpha = list(alpha.keys())
lists =outputHMM(T,E,S, alpha)
with open("/Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/data/rosalind_ba10e.output.txt",'w') as file:
    for line in lists[:-1]:
        if line[0] != '-':
            file.write('\t'.join(line))
        else:
            file.write(line)
        file.write('\n')
    file.write('\t'.join(lists[-1]))

print("\nT:", pd.DataFrame(T), "\nE:", pd.DataFrame(E))

ValueError: could not convert string to float: 'S\tI0\tM1\tD1\tI1\tM2\tD2\tI2\tM3\tD3\tI3\tM4\tD4\tI4\tE'

In [77]:
## dataset_26258_15

with open('/Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/data/dataset_26258_15.txt') as f:
    alignment, alpha, theta = parse_HMM(f)

S, T, E = profileHMM(alignment, alpha, theta)
T = np.around(T,3)
E = np.around(E,3)
alpha = list(alpha.keys())
lists =outputHMM(T,E,S, alpha)
with open('10e.finalout.txt','w') as file:
    for line in lists[:-1]:
        if line[0] != '-':
            file.write('\t'.join(line))
        else:
            file.write(line)
        file.write('\n')
    file.write('\t'.join(lists[-1]))


Alignment:
	 BEA-E-AD-
	 DC-DDCAEC
	 BE-CE-ABC
	 -EAC-CADE
	 B-ACECEDE
	 EEA--CADC
	 BED--CA-C
	 BEA-E-ADC
Alpha: {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4}
Theta: 0.382
Length of alignment(k) =  9
size of alphabet =  9 

freq matrix:
 [[0. 0. 5. 0. 0. 0. 7. 0. 0.]
 [5. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 1. 0. 3. 0. 5. 0. 0. 5.]
 [1. 0. 1. 1. 1. 0. 0. 5. 0.]
 [1. 6. 0. 0. 4. 0. 1. 1. 2.]
 [1. 1. 2. 4. 3. 3. 0. 1. 1.]]
seed: [True, True, True, False, True, True, True, True, True]


seq: BEA-E-AD- 

-------------------------------


pos.i: 0 adding to ? trans from state S 0
  >findtrans S[prev]: S 0 to Match
  >Transition added: (('S', 0), ('M1', 2))
new state is M1 2

pos.i: 1 adding to ? trans from state M1 2
  >findtrans S[prev]: M1 2 to Match
   >offset =1
  >Transition added: (('M1', 2), ('M2', 5))
new state is M2 5

pos.i: 2 adding to ? trans from state M2 5
  >findtrans S[prev]: M2 5 to Match
   >offset =1
  >Transition added: (('M2', 5), ('M3', 8))
new state is M3 8

pos.i: 3 adding to 