# PROBLEM 19

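Intro:
This problem was straightforward: the probability of a hidden path is the initial state probability multiplied by the transition probability of every consecutive pair of states, read from the transition matrix. Pandas' .loc accessor makes this trivial because the state labels work directly as row and column keys, so I don't need to encode the states as integer indices.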

Outro:
Dodi had the same idea as me, except that his version didn't store the initial and next states in explicit variables. Trevor took the more cumbersome route of using integer indices.
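
To make the label-versus-index comparison concrete, here is a minimal sketch using the two-state matrix from this problem. The `state_index` dict and the short path "AAB" are illustrative assumptions, not anything from Dodi's or Trevor's code.

```python
import math
import pandas as pd

states = ["A", "B"]
t_matrix = pd.DataFrame([[0.806, 0.194], [0.48, 0.52]], index=states, columns=states)

# Label-based lookup: the state characters themselves are the keys.
by_label = t_matrix.loc["A", "B"]

# Index-based lookup: each state is first mapped to a row/column number.
# state_index is a hypothetical helper for this sketch.
state_index = {state: i for i, state in enumerate(states)}
by_index = t_matrix.to_numpy()[state_index["A"], state_index["B"]]

# Path "AAB": equal initial probability, then A->A, then A->B.
path = "AAB"
prob = 1 / len(states)
for current, following in zip(path, path[1:]):
    prob *= t_matrix.loc[current, following]

print(by_label, by_index, prob)
```

Both lookups return the same entry; the label version just skips the bookkeeping of the extra mapping.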


In [73]:
import numpy as np
import pandas as pd

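# Path probability = P(first state) * product of the transition probabilities, e.g.
# A    ->    A    ->    B
# 0.5      0.194       0.806

def path_prob(path, t_matrix):
    # Every state is assumed equally likely at the start (0.5 for two states)
    prob = 1 / len(t_matrix.index)

    # Multiply in the transition probability for each consecutive pair of states
    # (the second name is not `next`, which would shadow the built-in)
    for initial, following in zip(path, path[1:]):
        prob *= t_matrix.loc[initial, following]

    return prob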
    

In [75]:
h_path = "BAAAAABBBBABABBBAABBBBABBABAAABBBBABBBAABBBABBBBBB"

states = ['A', 'B']

matrix = [[0.806, 0.194],
            [0.48, 0.52]]


t_matrix = pd.DataFrame(matrix, index=states, columns=states)


cheese = path_prob(h_path, t_matrix)
print(cheese)


4.744060601682143e-18


In [68]:
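# Spot check of a label-based lookup. This cell ran before In [75] (note the
# execution count), so the value below may come from an earlier matrix.
print(t_matrix.loc['A','B'])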

0.137


# PROBLEM 20

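Intro:
This problem was also simple. It's problem 19 again, except that each step multiplies in an emission probability (hidden state to emitted symbol) from the emission matrix instead of a transition probability.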

Outro:
Again, same as the last problem. Dodi arrived at a similar solution, and Trevor accomplished the same thing algorithmically but with indices.
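
One thing worth sketching here: the product of many small probabilities shrinks quickly, so for much longer strings it may be safer to add logarithms instead. The following is only a sketch under that assumption; `log_emi_prob` and the nested-dict `emission` layout are hypothetical names for illustration, filled with this problem's emission values.

```python
import math

# Emission probabilities for this problem, stored as nested dicts
# (an assumed layout for this sketch, not the notebook's DataFrame).
emission = {
    "A": {"x": 0.384, "y": 0.344, "z": 0.273},
    "B": {"x": 0.185, "y": 0.268, "z": 0.547},
}

def log_emi_prob(path, emi_path, emission):
    """Hypothetical helper: natural log of P(emi_path | path)."""
    if len(path) != len(emi_path):
        raise ValueError("hidden path and emitted string must have equal length")
    # Adding logs is equivalent to multiplying the probabilities themselves.
    return sum(math.log(emission[state][symbol])
               for state, symbol in zip(path, emi_path))

log_p = log_emi_prob("AB", "xz", emission)
print(log_p, math.exp(log_p))
```

Exponentiating the result recovers the ordinary probability, so for short inputs the two approaches can be compared directly.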

In [258]:
import numpy as np
import pandas as pd

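def emi_prob(path, emi_path, emi_matrix):
    # Start at 1 so the first emission probability is taken as-is
    prob = 1
    
    # Position x pairs the hidden state with the symbol it emitted
    for x in range(len(path)):
        state = path[x]
        emitted = emi_path[x]

        # Row = hidden state, column = emitted symbol
        prob *= emi_matrix.loc[state, emitted]
    
    return prob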

In [260]:
h_path2 = 'AAABBABBBBBAABBABBAABBBBAAABBBABBBBBABAAABAABBAAAB'
emi_path = 'yyzzxyxzzzxyzyxzyyxxxzzyxxyzyzzxxxyzzzxxxxzzyzzxzz'


matrix2 = [
    [0.384, 0.344, 0.273],
    [0.185, 0.268, 0.547]
]

states = ['A','B']
emi_states = ['x', 'y', 'z']

emi_matrix = pd.DataFrame(matrix2, index=states, columns=emi_states)


print(emi_prob(h_path2, emi_path, emi_matrix))

3.4693080273462047e-25


# PROBLEM 21

Intro:
This problem was really difficult to wrap my head around, but it became clearer once I looked closely at the formulas. Filling in the nodes amounts to applying the formulas from the module videos: each node's score is the highest score reachable from any node in the previous column, multiplied by the transition and emission weights. The difficult part was figuring out how to backtrack. I settled on a second backtrack dataframe that logs, for every node, which preceding state gave the highest score. At the end, a for-loop walks backwards through the backtracking matrix to recover the corresponding states.
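
The fill-then-backtrack idea described above can be sketched without pandas. This is a minimal illustration, not the notebook's implementation: `viterbi_log` is a hypothetical helper, it uses plain nested dicts, it works in log space (so it assumes every probability is greater than zero), and the toy two-state model is made up for the example.

```python
import math

def viterbi_log(observed, states, trans, emit):
    """Hypothetical log-space Viterbi sketch with equal initial probabilities."""
    # Scores for the first column: initial probability times emission.
    score = {s: math.log(1 / len(states)) + math.log(emit[s][observed[0]])
             for s in states}
    pointers = []  # one dict per later column: state -> best previous state
    for symbol in observed[1:]:
        new_score, column = {}, {}
        for s in states:
            # Previous state that maximizes the score of arriving at s.
            best_prev = max(states, key=lambda p: score[p] + math.log(trans[p][s]))
            column[s] = best_prev
            new_score[s] = (score[best_prev] + math.log(trans[best_prev][s])
                            + math.log(emit[s][symbol]))
        pointers.append(column)
        score = new_score
    # Start from the best final state and follow the pointers backwards.
    path = [max(states, key=lambda s: score[s])]
    for column in reversed(pointers):
        path.append(column[path[-1]])
    path.reverse()
    return "".join(path)

# Toy model (made up): states tend to persist, A prefers x and B prefers y.
toy_states = ["A", "B"]
toy_trans = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.1, "B": 0.9}}
toy_emit = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.1, "y": 0.9}}
toy_path = viterbi_log("xxyy", toy_states, toy_trans, toy_emit)
print(toy_path)
```

Each entry in `pointers` plays the role of one column of the backtrack dataframe, and the final loop is the same backwards walk.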

In [357]:
import numpy as np
import pandas as pd

def viterbi_path(emi_path, states, t_matrix, emi_matrix):
    n_steps = len(emi_path)
    # hold scores and states for backtracking
    # (scores are floats; the backtrack table is object dtype because it stores state labels)
    vit_table = pd.DataFrame(np.zeros((len(states), n_steps)), index=states)
    back_track = pd.DataFrame(np.full((len(states), n_steps), 0, dtype=object), index=states)
    
    for state in states: #initialize start scores, assuming equally likely initial states
        vit_table.at[state,0] = (1/len(states)) * emi_matrix.loc[state, emi_path[0]]
    
    for time in range(1, n_steps): #nested loop for calculating scores of every node
        for state in states:
            # prev_scores = [score_a/b/c_i-1 * weight], one entry per previous state
            prev_scores = [vit_table.loc[prev_state, time-1] * t_matrix.loc[prev_state, state] * emi_matrix.loc[state, emi_path[time]] for prev_state in states]
            # fill in node with max of preceding scores
            vit_table.at[state,time] = max(prev_scores)
            # keep track of the previous node-state that gave the highest score
            back_track.at[state,time] = states[prev_scores.index(max(prev_scores))]
    
    # Backtrack from the end node, appending to a list, then reverse the list for the right order
    
    #Find the max value of the last column and return its corresponding state
    vit_path = [max(states, key=lambda state: vit_table.loc[state, n_steps-1])]
    for time in range(n_steps-1, 0, -1):
        # uses the last list element to find the previous node from backtrack
        vit_path.append(back_track.loc[vit_path[-1],time])
    print(back_track)
    vit_path.reverse()
    
    return vit_path



In [358]:
with open("rosalind_ba10c2.txt") as f:
    lines = [line.strip() for line in f]
    
    
observed_path = lines[0].strip()
emi_states = lines[2].strip().split()
states = lines[4].strip().split()

t_matrix = []
emi_matrix = []
for x in range(7, 7+len(states)):
    t_matrix.append(lines[x].strip().split()[1:])

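# the emission matrix has one row per state, so use len(states) rather than a hard-coded 4
for x in range(7+len(states)+2, 7+len(states)+2+len(states)):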
    emi_matrix.append(lines[x].strip().split()[1:])

t_matrix_final = pd.DataFrame(t_matrix, index=states, columns=states).astype(float)
emi_matrix_final = pd.DataFrame(emi_matrix, index=states, columns=emi_states).astype(float)

print(''.join(viterbi_path(observed_path, states, t_matrix_final, emi_matrix_final)))

   0  1  2  3  4  5  6  7  8  9   ... 90 91 92 93 94 95 96 97 98 99
A   0  D  A  D  A  B  B  D  B  C  ...  D  B  B  B  D  B  B  B  D  A
B   0  A  C  B  C  B  B  B  B  C  ...  B  B  B  B  B  B  B  B  B  C
C   0  D  C  D  C  B  B  D  B  C  ...  D  B  B  B  D  B  B  B  D  C
D   0  D  C  D  C  B  B  D  B  C  ...  D  B  B  B  D  B  B  B  D  C

[4 rows x 100 columns]
DCDCBBBBCBBCBCDCDDCDDCBBBBCCCDCCCDDCCBCBBBBBCBBBBDCBCBBBBBBDCCBDDDBBBCCCBBDCDDCDCBDDDCDDBBBBBBBBBDCB


In [356]:
for time in range(9-1, 0, -1):
    print(time)

8
7
6
5
4
3
2
1


# PROBLEM 22

Intro:
After solving 21, problem 22 only requires tweaking the algorithm into its forward variant. Thankfully, the way I structured the code let me swap max() for sum() directly. What I'm struggling with, though, is getting my values to match the expected values exactly.

Outro:
I still can't figure out how to get my output to match Rosalind's expected value exactly, but as far as computing the values goes I think it's close enough.
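
One way to check a forward implementation independently of the grader is to compare it with a brute-force sum over every hidden path on a short string. The sketch below assumes equal initial probabilities and uses plain nested dicts. `forward_prob` and `brute_force_prob` are hypothetical helpers, and the four-symbol string is made up so the enumeration stays small.

```python
import itertools
import math

def forward_prob(observed, states, trans, emit):
    """Hypothetical forward-algorithm sketch: sum instead of max per node."""
    alpha = {s: emit[s][observed[0]] / len(states) for s in states}
    for symbol in observed[1:]:
        alpha = {s: emit[s][symbol] * sum(alpha[p] * trans[p][s] for p in states)
                 for s in states}
    return sum(alpha.values())

def brute_force_prob(observed, states, trans, emit):
    """Sum the joint probability over every possible hidden path."""
    total = 0.0
    for path in itertools.product(states, repeat=len(observed)):
        p = emit[path[0]][observed[0]] / len(states)
        for prev, cur, symbol in zip(path, path[1:], observed[1:]):
            p *= trans[prev][cur] * emit[cur][symbol]
        total += p
    return total

# Matrices from this problem's sample cell, rewritten as dicts.
check_states = ["A", "B"]
check_trans = {"A": {"A": 0.994, "B": 0.006}, "B": {"A": 0.563, "B": 0.437}}
check_emit = {"A": {"x": 0.55, "y": 0.276, "z": 0.174},
              "B": {"x": 0.311, "y": 0.368, "z": 0.321}}

fast = forward_prob("zxxy", check_states, check_trans, check_emit)
slow = brute_force_prob("zxxy", check_states, check_trans, check_emit)
print(fast, slow)
```

If the two numbers agree on short inputs, the recurrence itself is probably fine and any remaining mismatch is more likely in the inputs or the parsing.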

In [347]:
import numpy as np
import pandas as pd

def viterbi_forward(emi_path, states, t_matrix, emi_matrix):
    #hold the forward scores (float table; the forward variant needs no backtracking)
    vit_table = pd.DataFrame(np.zeros((len(states), len(emi_path))), index=states)
    
    for state in states: #initialize start scores, assuming equally likely initial states
        vit_table.at[state,0] = (1/len(states)) * emi_matrix.loc[state, emi_path[0]]
    
    for time in range(1, len(emi_path)): #nested loop for calculating scores of every node
        for state in states:
            # prev_scores = [score_a/b/c_i-1 * weight], one entry per previous state
            prev_scores = [vit_table.loc[prev_state, time-1] * t_matrix.loc[prev_state, state] * emi_matrix.loc[state, emi_path[time]] for prev_state in states]
            # fill in node with sum of preceding scores
            vit_table.at[state,time] = sum(prev_scores)
    # probability of the emitted string = sum of the last column over all states
    vit_prob = vit_table.sum(axis=0,numeric_only=True)[len(emi_path)-1]
    return vit_prob

In [348]:
obs_path = 'zxxxzyyxyzyxyyxzzxzyyxzzxyxxzyzzyzyzzyxxyzxxzyxxzxxyzzzzzzzxyzyxzzyxzzyzxyyyyyxzzzyzxxyyyzxyyxyzyyxz'
states22 = ['A', 'B']
t_matrix22 = pd.DataFrame([[0.994, 0.006],[0.563, 0.437]], index=['A', 'B'], columns=['A', 'B'])
emi_matrix22 = pd.DataFrame([[0.55, 0.276, 0.174],[0.311, 0.368, 0.321]], index=['A', 'B'], columns=['x','y','z'])

print(viterbi_forward(obs_path, states22, t_matrix22, emi_matrix22))

4.0821070838067285e-55


In [353]:
with open("rosalind_ba10d3.txt") as f:
    lines = [line.strip() for line in f]

    
observed_path = lines[0].strip()
emi_states = lines[2].strip().split()
states = lines[4].strip().split()

t_matrix = []
emi_matrix = []
for x in range(7, 7+len(states)):
    t_matrix.append(lines[x].strip().split()[1:])

for x in range(7+len(states)+2, 7+len(states)+2+len(states)):
    emi_matrix.append(lines[x].strip().split()[1:])

t_matrix22 = pd.DataFrame(t_matrix, index=states, columns=states).astype(float)
emi_matrix22 = pd.DataFrame(emi_matrix, index=states, columns=emi_states).astype(float)

print()

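# Score the sequence and states read from the file. The call originally passed
# obs_path and states22 from the sample cell, so the file's sequence was never
# used; the output below was produced by that original call.
print(viterbi_forward(observed_path, states, t_matrix22, emi_matrix22))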


4.213433658842122e-53
