# BA10F - Profile HMM with pseudocounts #

Aligning a protein against a profile HMM
Given a protein family, represented by Alignment, we can now return to the problem of deciding whether a newly sequenced protein, represented by Text, belongs to the family. We first form HMM(Alignment, θ) for some parameter θ. As shown in the figure below, a hidden path through HMM(Alignment, θ) corresponds to a sequence of match, insertion, and deletion states for aligning Text against Alignment.


<br>
<br> <img src = http://bioinformaticsalgorithms.com/images/HMM/aligning_text_against_profile.png width = 500px> 
<br>


Figure: (Top) A path through HMM(Alignment, 0.35) for the multiple alignment from the previous section and the emitted string Text = ACAFDEAF. (Bottom) The emitted symbols correspond to aligning Text against Alignment. Specifically, the first two symbols are emitted from two match states and belong in the first two positions of the alignment. The next two symbols are emitted from an insertion state and belong in columns of their own (shown in pink). The space symbols in the seventh and eleventh columns above correspond to deletion states; these symbols are not emitted by the HMM. The space symbols in the gray columns do not correspond to any states and are passed over. The non-shaded columns form an augmented 6 × 8 seed alignment for comparison against newly sequenced proteins.

To find the “best” alignment of Text against Alignment, we simply need to apply the Viterbi algorithm to find an optimal hidden path in HMM(Alignment, θ). If the product weight of this optimal hidden path exceeds a predetermined threshold, then we may conclude that Text belongs to the protein family, in which case we augment the existing seed alignment with an additional row corresponding to Text. In this way, we can recruit more and more distant family members to a seed alignment, adding these new proteins to the growing multiple alignment, and thus making the resulting profile HMM more and more suitable for analyzing the protein family of interest.

￼STOP and Think: If the product weight for a new protein exceeds a threshold for more than one protein family, how would you classify this protein?

Profile HMMs have finally helped us achieve our original goal of scoring different columns of a multiple alignment differently based on the frequency of symbols in each column. For example, say that the seventh column of Alignment∗ contains more occurrences of A than C, and the ninth column of Alignment∗ contains more occurrences of C than A. A hidden path passing through Match(7) would be rewarded more for emitting A than C, whereas a hidden path passing through Match(9) would be rewarded more for emitting C than A.

#### The return of pseudocounts


Returning to the figure from the previous lesson showing a table of transition probabilities, the majority of transition probabilities in the gray cells of this figure are equal to zero. (The same is true of emission probabilities.)

<br>
<img src = http://bioinformaticsalgorithms.com/images/HMM/transition_probabilities.png width = 500 px>
<br>

These zeroes may cause problems; for example, the path in the figure from aligning a string against a profile matrix, reproduced below, seems perfectly reasonable for Text = ACAFDEAF, and yet Pr(x, π) is equal to zero because the transition probability from Match(2) to Insertion(2) for this profile HMM is zero.


As when we searched for motifs, we will introduce pseudocounts by adding a small value σ to entries in the transition matrix that correspond to edges of the HMM diagram  (i.e., only the gray elements of the table from the previous step, reproduced below). Note that white cells in this table, corresponding to forbidden transitions, are not affected by pseudocounts. The resulting matrix will then need to be normalized so that the elements in each row sum to 1.

﻿Exercise Break: Compute the normalized matrix for the matrix below after adding the pseudocount σ = 0.01.




In [154]:
import pandas as pd
import numpy as np

def parse_HMM(file): # with pseudocount sigma
    lines = f.readlines()
    # if error: check if tab or space in this input line
    (theta, sigma) = (float(i) for i in lines[0].strip().split('\t'))
    alpha = lines[2].strip().split()
    alpha = dict(zip(alpha, range(len(alpha))))
    alignment = [line.strip() for line in lines[4:]]
    print("Alignment:")
    for a in alignment:
        print('\t', a)
    print("Alpha:", alpha)
    print("Theta:", theta)
    print("Sigma:", sigma)
    return alignment, alpha, theta, sigma

### algorithm for building profile HMM from an alignment|theta
def profileHMM(alignment, alpha, theta):

    # # 'seed' indicates which columns pass threshold theta for align*
    def get_seed_alignment(alignment, theta, alphabet):
        """'seed': True if gaps < theta; included in align*
            returns seed alignment columns for HMM"""
        k = len(alignment[0])
        a = len(alphabet.keys())
        print("Length of alignment(k) = ",k)
        print("size of alphabet = ",k,'\n')
        freq = np.zeros(shape=(a+1, k)) 

        for seq in alignment:
            for i in range(k):
                if seq[i] == '-':
#                     print('\tgap', seq[i])
                    freq[a][i] += 1
                else:
                    freq[alphabet[seq[i]]][i] +=1
#                     print('\tbase', seq[i])
        n = len(alignment)
        print('freq matrix:\n', freq)
        seed = [x/n < theta for x in freq[a]]
        return seed
    
    def normalize_matrices(T,E):
        """Normalize Transmission and Emission frequencies to sum of row for state"""
        for state in range(len(S)):
            if sum(T[state]) > 0:
                T[state] =  T[state]/sum(T[state])
            if sum(E[state]) >0:
                E[state] =  E[state]/sum(E[state])
        return T, E
    
    def state_transition(T, prev, kind, S):
        """ find next Match, Del, Ins, or End state in [states] """
        x=0
        if S[prev][0] == 'M':
            x=1
        for nxt in range(prev+1+x, len(T)):
            if S[nxt][0] == kind[0]:
                T[prev][nxt] += 1
#                 print('  >Transition added:',((S[prev],prev),(S[nxt], nxt)))
                return T, nxt
    
    ### Walk each sequence to count transitions, emissions
    seed = get_seed_alignment(alignment, theta, alpha)
    print('seed:',seed)
    n = len(alignment)
    k = len(alignment[0])
    S = ['S','I0']+list(c+str(n) for n in range(1,sum(seed)+1) for c in "MDI")+['E']
    E = np.zeros(shape=(len(S), len(alpha.keys())))
    T = np.zeros(shape=(len(S), len(S)))

    for seq in alignment:
        # 'state' is the hidden state col/row; i is pos in alignment
        state = 0; i = 0
        while i < k:
            # seed: either match or del as hidden state
            if seed[i]:
                if seq[i] in alpha:
                    T, state = state_transition(T, state, 'Match', S)
                    E[state][alpha[seq[i]]] +=1
                else:
                    T, state = state_transition(T, state, 'Deletion', S)
            # not seed: either insert or nothing
            else:
                # count any emissions before next seed column
                emits = []
                while not seed[i]:
                    if seq[i] in alpha:
                        emits.append(seq[i])
                    i += 1
                    if i == k: 
                        break  # hit end of sequence
                i-=1                
                if len(emits)>0:
                    # count the transition(state, insertion)
                    T, state = state_transition(T, state, 'Insert', S)
                    # get all emissions(state) in insertion
                    for symbol in emits:
                        E[state][alpha[symbol]] += 1
                    # count all symbols as cyclic transition t(ins_x, ins_x)
                    if len(emits) >1:
                        T[state][state] += len(emits)-1
                # else do nothing, just gaps in align
            i += 1
        # from last state to 'End'    
        T, state = state_transition(T, state, 'End', S)

    T,E = normalize_matrices(T,E)
    return S, T, E

def pseudo_normalize(T, E):
    """adds pseudocounts to HMM matrices"""
    # add pseudos and normalize E
    for i in range(1, len(S)-1):
        if S[i][0] != 'D':
            E[i] += 0.01
            E[i] = E[i]/ sum(E[i])
    # add pseudos and normalize T
    for i in range(len(S)-1):
        start = 3*((i+1)//3)+1
        end = start+3
        for j in range(start, end):
            if j < len(S):
                T[i][j] += 0.01
        T[i] = T[i]/sum(T[i])
    return T, E

def outputHMM(T,E,S,alpha):
    """formats the T,E matrices for Rosalind output"""
    lists = []
    lists.append(['']+S)
    for i in range(len(S)):
        # use {0:g} format to remove trailing zeros
        lists.append([S[i]]+['{0:g}'.format(x) for x in T[i]])
    lists.append('--------')
    lists.append([''] + alpha)
    for i in range(len(S)):
        lists.append([S[i]]+['{0:g}'.format(x) for x in E[i]])
    return lists

#### Test dataset

In [147]:
### Extra test data 1:


with open('/Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/data/10f.extra.txt') as f:
    alignment, alpha, theta, sigma = parse_HMM(f)

S, T, E = profileHMM(alignment, alpha, theta)
alpha = list(alpha.keys())
T, E = pseudo_normalize(T, E)
T = np.around(T,3)
E = np.around(E,3)

lists =outputHMM(T,E,S, alpha)
with open('/Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/data/10f.extra.out.txt','w') as file:
    for line in lists[:-1]:
        if line[0] != '-':
            file.write('\t'.join(line))
        else:
            file.write(line)
        file.write('\n')
    file.write('\t'.join(lists[-1]))
del(line, lists)

Alignment:
	 ED-BCBDAC
	 -D-ABBDAC
	 ED--EBD-C
	 -C-BCB-D-
	 AD-BC-CA-
	 -DDB-BA-C
Alpha: {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4}
Theta: 0.399
Sigma: 0.01
Length of alignment(k) =  9
size of alphabet =  9 

freq matrix:
 [[1. 0. 0. 1. 0. 0. 1. 3. 0.]
 [0. 0. 0. 4. 1. 5. 0. 0. 0.]
 [0. 1. 0. 0. 3. 0. 1. 0. 4.]
 [0. 5. 1. 0. 0. 0. 3. 1. 0.]
 [2. 0. 0. 0. 1. 0. 0. 0. 0.]
 [3. 0. 5. 1. 1. 1. 1. 2. 2.]]
seed: [False, True, False, True, True, True, True, True, True]


In [152]:
### Stepik challenge data:

## 'theta sigma' line was space sep (changed parser)

with open('/Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/data/dataset_26259_5.txt') as f:
    alignment, alpha, theta, sigma = parse_HMM(f)

S, T, E = profileHMM(alignment, alpha, theta)
alpha = list(alpha.keys())
T, E = pseudo_normalize(T, E)
T = np.around(T,3)
E = np.around(E,3)

lists =outputHMM(T,E,S, alpha)
with open('/Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/data/BA10F.dataset_26259_5.out.txt','w') as file:
    for line in lists[:-1]:
        if line[0] != '-':
            file.write('\t'.join(line))
        else:
            file.write(line)
        file.write('\n')
    file.write('\t'.join(lists[-1]))
del(line, lists)

Alignment:
	 EBA
	 E--
	 -BA
	 EBA
	 EBA
	 EBE
	 ---
	 ABA
Alpha: {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4}
Theta: 0.274
Sigma: 0.01
Length of alignment(k) =  3
size of alphabet =  3 

freq matrix:
 [[1. 0. 5.]
 [0. 6. 0.]
 [0. 0. 0.]
 [0. 0. 0.]
 [5. 0. 1.]
 [2. 2. 2.]]
seed: [True, True, True]


In [156]:
### Rosalind challenge data:

## 'theta sigma' line was space sep (changed parser)

with open('/Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/data/rosalind_ba10f.txt') as f:
    alignment, alpha, theta, sigma = parse_HMM(f)

S, T, E = profileHMM(alignment, alpha, theta)
alpha = list(alpha.keys())
T, E = pseudo_normalize(T, E)
T = np.around(T,3)
E = np.around(E,3)

lists =outputHMM(T,E,S, alpha)
with open('/Users/jasonmoggridge/Dropbox/Rosalind/Coursera_textbook_track/Course6/data/BA10F.rosalind_ba10f.out.txt','w') as file:
    for line in lists[:-1]:
        if line[0] != '-':
            file.write('\t'.join(line))
        else:
            file.write(line)
        file.write('\n')
    file.write('\t'.join(lists[-1]))
del(line, lists)

Alignment:
	 A-BEBCEEA
	 AA-DBEEAC
	 AAADB-EDC
	 BABDBEC-C
	 ABBDBEEAC
	 AABDBEEEC
	 AABCBEEAC
	 AABDBEE-C
Alpha: {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4}
Theta: 0.223
Sigma: 0.01
Length of alignment(k) =  9
size of alphabet =  9 

freq matrix:
 [[7. 6. 1. 0. 0. 0. 0. 3. 1.]
 [1. 1. 6. 0. 8. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 1. 1. 0. 7.]
 [0. 0. 0. 6. 0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0. 6. 7. 2. 0.]
 [0. 1. 1. 0. 0. 1. 0. 2. 0.]]
seed: [True, True, True, True, True, True, True, False, True]
