# Implementing Various Parts of HMM
References:
- <https://www.cs.sjsu.edu/~stamp/RUA/HMM.pdf> - This explains the MM + HMM model with good motivation. It also describes the 3 HMM problems and their solutions but does not explain their derivations.
- <https://en.wikipedia.org/wiki/Hidden_Markov_model> - This has further references and links to the solutions used for solving the HMM problems. Some (like the entry on the `forward algorithm`) are well written and can make things clearer.
- <http://www.robots.ox.ac.uk:5000/~vgg/rg/papers/hmm.pdf> - The seminal paper on HMMs for speech recognition by Rabiner '89.

My aim here is to derive the solutions for each of the HMM problems and apply them to toy examples. I've tried to keep the math, the code and the derivations as lucid as possible.

# **Toy example 1**

- Two hidden states : [1_Hot, 2_Cold]
- Three observed states : [0_Small, 1_Medium, 2_Large]
- Init Probability : $\pi = [0.6, 0.4]$
- Transition Matrix $A$

| t-1 => t | s0 | s_hot | s_cold |
|----------|----|-------|--------|
| **s0**       | 0  | 0.6   | 0.4    |
| **s_hot**    | 0  | 0.7   | 0.3    |
| **s_cold**   | 0  | 0.4   | 0.6    |


- Emission Matrix $B$

| zt => xt | 0_small | 1_medium | 2_large |
|----------|---------|----------|---------|
| s0       | NA       | NA        | NA       |
| s_hot    | 0.1     | 0.4      | 0.5     |
| s_cold   | 0.7     | 0.2      | 0.1     |

In [1]:
## In numpy
import numpy as np

In [2]:
vocStates = 2
vocObs = 3

trA = np.array(
    [
        [0, 0.6, 0.4],
        [0, 0.7, 0.3],
        [0, 0.4, 0.6]
    ]
)

emB = np.array(
    [
        [np.nan, np.nan, np.nan],  ## s0 is a dummy start state and never emits
        [0.1, 0.4, 0.5],
        [0.7, 0.2, 0.1]
    ]
)

obX = [0, 1, 0, 2]

## Task 1
Given $A, B, \pi$ and an observed sequence $\vec{x} = (0, 1, 0, 2)$, what is the probability of observing this sequence?

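For a sequence $\vec x$ given $A, B$, marginalizing over every hidden state sequence $\vec z$ gives the following (with $z_0$ fixed to the start state $s_0$, so that $A_{z_0, z_1} = \pi_{z_1}$):
$$
\begin{align}
P(\vec x; A,B) &= \sum_{\vec z} P(\vec x, \vec z; A,B) \\
&= \sum_{\vec z} \big(\prod_{t=1}^T B_{z_t, x_t}\big) \big(\prod_{t=1}^T A_{z_{t-1}, z_t}\big)
\end{align}
$$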

This solution involves considering every possible state assignment combination for the output sequence.

In [3]:
total_prob = 0
for i in range(1, 3):
    for j in range(1, 3):
        for k in range(1, 3):
            for l in range(1, 3):
                zprob = 1.0
                zseq = (i, j, k, l)
                for t, (z, x) in enumerate(zip(zseq, obX)):
                    _b = emB[z, x]
                    
                    if t == 0:
                        zprev = 0
                    else:
                        zprev = zseq[t-1]
                    
                    _a = trA[zprev, z]
                    
                    zprob = zprob * _a * _b
                    
                total_prob += zprob
                print(zseq, round(zprob, 6), total_prob)

print(total_prob)

(1, 1, 1, 1) 0.000412 0.0004116
(1, 1, 1, 2) 3.5e-05 0.00044688
(1, 1, 2, 1) 0.000706 0.00115248
(1, 1, 2, 2) 0.000212 0.00136416
(1, 2, 1, 1) 5e-05 0.0014145599999999998
(1, 2, 1, 2) 4e-06 0.0014188799999999998
(1, 2, 2, 1) 0.000302 0.0017212799999999997
(1, 2, 2, 2) 9.1e-05 0.0018119999999999998
(2, 1, 1, 1) 0.001098 0.0029095999999999996
(2, 1, 1, 2) 9.4e-05 0.0030036799999999995
(2, 1, 2, 1) 0.001882 0.004885279999999999
(2, 1, 2, 2) 0.000564 0.005449759999999999
(2, 2, 1, 1) 0.00047 0.005920159999999999
(2, 2, 1, 2) 4e-05 0.005960479999999999
(2, 2, 2, 1) 0.002822 0.008782879999999998
(2, 2, 2, 2) 0.000847 0.009629599999999999
0.009629599999999999
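
The four nested loops are tied to a sequence of length 4. As a minimal sketch of the same brute-force sum for any sequence length, the enumeration can be written with `itertools.product`. The function name `brute_force_likelihood` and the variables `pi`, `A`, `B` are my own for this illustration; I also assume the dummy start state `s0` is folded into the initial distribution `pi`, which is equivalent to the first row of `trA`.

```python
import itertools
import numpy as np

# Toy example 1, with the dummy start state s0 replaced by an initial distribution.
pi = np.array([0.6, 0.4])                          # P(z_1) for [hot, cold]
A = np.array([[0.7, 0.3], [0.4, 0.6]])             # A[i, j] = P(z_t = j | z_{t-1} = i)
B = np.array([[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]])   # B[i, k] = P(x_t = k | z_t = i)
obs = [0, 1, 0, 2]

def brute_force_likelihood(pi, A, B, obs):
    """Sum the joint P(x, z) over every hidden state sequence z."""
    total = 0.0
    for zseq in itertools.product(range(len(pi)), repeat=len(obs)):
        prob = pi[zseq[0]] * B[zseq[0], obs[0]]
        for t in range(1, len(obs)):
            prob *= A[zseq[t - 1], zseq[t]] * B[zseq[t], obs[t]]
        total += prob
    return total

likelihood = brute_force_likelihood(pi, A, B, obs)
print(likelihood)  # should agree with total_prob from the nested loops (about 0.0096296)
```

The loop still visits all $|S|^T$ state sequences, so this only tidies the code; it does not make it any faster.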


This solution is naive. We are iterating over every combination of $\vec z$ over time. Let's see if it can be done faster.

The task is to find $P(\vec x)$ (matrices $A$, $B$ omitted for brevity). This can be expressed as
$$
P(\vec x) = \sum_{z_t} P(z_t, \vec x)  \tag0
$$

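The equation above is just marginalization, so it holds for any joint distribution and for the hidden state at any timestep. Let's now try to expand $P(z_t, x_{1:t})$ (which is exactly $P(z_t, \vec x)$ when $t = T$) in terms of our emission and transition terms ...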

$$
\begin{align}
P(z_t, x_{1:t}) &= \sum_{z_{t-1}} P(z_t, z_{t-1}, x_{1:t}) \tag1 \\
&= \sum_{z_{t-1}} P(x_t, z_t, z_{t-1}, x_{t-1}, x_{t-2}, ... , x_1) \tag2 \\
&= \sum_{z_{t-1}} P(x_t | z_t) P(z_t, z_{t-1}, x_{1:t-1}) \tag3 \\
&= \sum_{z_{t-1}} P(x_t | z_t) P(z_t | z_{t-1}) P(z_{t-1}, x_{1:t-1}) \tag4 \\
&= P(x_t | z_t) \sum_{z_{t-1}} P(z_t | z_{t-1}) P(z_{t-1}, x_{1:t-1}) \tag5
\end{align}
$$

In EQ 1, we introduce a new hidden state variable from the previous timestep. EQ 2 simply expands $x_{1:t}$ before applying the Markov assumptions. In EQ 3 we apply the emission probability assumption ($x_t$ depends only on $z_t$) and in EQ 4 we apply the transition probability assumption ($z_t$ depends only on $z_{t-1}$). Finally, in EQ 5, we can take $P(x_t | z_t)$ out of the sum since we are summing over values of $z_{t-1}$ only.

Observe that there is a recurrence between EQ 1 and EQ 5: the left-hand side $P(z_t, x_{1:t})$ reappears inside the sum, one timestep earlier. If we define $\alpha_z(t)$ to be $P(z_t = z, x_{1:t})$, then we can rewrite EQ 5 as follows:

$$
\begin{align}
P(z_t, x_{1:t}) &= P(x_t | z_t) \sum_{z_{t-1}} P(z_t | z_{t-1}) P(z_{t-1}, x_{1:t-1}) \\
\rightarrow \alpha_{z_t}(t) &= P(x_t | z_t) \sum_{z_{t-1}} P(z_t | z_{t-1}) \alpha_{z_{t-1}}(t-1) \\
&= B_{z_t, x_t} \sum_{z_{t-1}} A_{z_{t-1}, z_t} \alpha_{z_{t-1}}(t-1)
\end{align}
$$

This expression only involves summing over the possible values of $z_{t-1}$, of which there are 2 in our example. Our original expression, EQ 0, thus becomes

$$
P(\vec x) = \sum_{z_T} P(z_T, \vec x) = \sum_z \alpha_z(t = T)
$$

We are no longer considering every combination of the entire hidden state sequence, only the possible values of $z$ at each timestep (2 in our example). To calculate $\alpha(T)$, we have to iterate over all timesteps.
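
As a quick check on the recursion, here it is worked by hand on toy example 1 with $\vec x = (0, 1, 0, 2)$. I take the first step to be $\alpha_z(1) = A_{s_0, z} B_{z, x_1}$, i.e. the transition out of the dummy start state times the emission; that base case is my assumption here, since the recursion above does not spell it out.

$$
\begin{align}
\alpha_{hot}(1) &= 0.6 \times 0.1 = 0.06 & \alpha_{cold}(1) &= 0.4 \times 0.7 = 0.28 \\
\alpha_{hot}(2) &= 0.4 \times (0.7 \times 0.06 + 0.4 \times 0.28) = 0.0616 & \alpha_{cold}(2) &= 0.2 \times (0.3 \times 0.06 + 0.6 \times 0.28) = 0.0372 \\
\alpha_{hot}(3) &= 0.1 \times (0.7 \times 0.0616 + 0.4 \times 0.0372) = 0.0058 & \alpha_{cold}(3) &= 0.7 \times (0.3 \times 0.0616 + 0.6 \times 0.0372) = 0.02856 \\
\alpha_{hot}(4) &= 0.5 \times (0.7 \times 0.0058 + 0.4 \times 0.02856) = 0.007742 & \alpha_{cold}(4) &= 0.1 \times (0.3 \times 0.0058 + 0.6 \times 0.02856) = 0.0018876
\end{align}
$$

Summing the last row gives $0.007742 + 0.0018876 = 0.0096296$, the same value the brute-force enumeration printed.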

In [4]:
from copy import deepcopy
from IPython.core.debugger import set_trace

In [5]:
_toy = """
====
Init
====
_|_H_|_L_
*|0.5|0.5

============
Transmission
============
_|_H_|_L_
H|0.5|0.5
L|0.4|0.6

========
Emission
========
_|_A_|_C_|_G_|_T_
H|0.2|0.3|0.3|0.2
L|0.3|0.2|0.2|0.3

========
Observed
========
[G G C A]
"""
states = ['H', 'L']
genes = ['A', 'C', 'G', 'T']

inits = [0.5, 0.5]

trans = np.array(
    [
        [0.5, 0.5],
        [0.4, 0.6]
    ]
)
emmit = np.array(
    [
        [0.2, 0.3, 0.3, 0.2],
        [0.3, 0.2, 0.2, 0.3]
    ]
)

obsX = [2, 2, 1, 0]

In [6]:
## Note to self: the alpha notation above was imprecise.
## We define `\alpha` for every timestep AND every hidden state ==> `\alpha_i(t)`.
## First, the brute-force reference value for this second toy example.
total_prob = 0

for i in range(2):
    for j in range(2):
        for k in range(2):
            for l in range(2):
                prob = 1.0
                zstates = (i, j, k, l)
                
                for t, (z, x) in enumerate(zip(zstates, obsX)):
                    if t == 0:
                        prob = prob * inits[z] * emmit[z, x]
                    else:
                        prob = prob * trans[zstates[t-1], z] * emmit[z, x]
                
                total_prob += prob
                print(zstates, prob, total_prob)

(0, 0, 0, 0) 0.0003375 0.0003375
(0, 0, 0, 1) 0.00050625 0.00084375
(0, 0, 1, 0) 0.00018 0.00102375
(0, 0, 1, 1) 0.0004049999999999999 0.00142875
(0, 1, 0, 0) 0.00018 0.00160875
(0, 1, 0, 1) 0.00027 0.00187875
(0, 1, 1, 0) 0.000144 0.00202275
(0, 1, 1, 1) 0.000324 0.00234675
(1, 0, 0, 0) 0.00018000000000000004 0.00252675
(1, 0, 0, 1) 0.00027 0.0027967499999999998
(1, 0, 1, 0) 9.600000000000004e-05 0.00289275
(1, 0, 1, 1) 0.00021600000000000005 0.00310875
(1, 1, 0, 0) 0.000144 0.00325275
(1, 1, 0, 1) 0.00021600000000000002 0.00346875
(1, 1, 1, 0) 0.00011520000000000001 0.0035839500000000002
(1, 1, 1, 1) 0.0002592 0.00384315


In [7]:
print(_toy)


====
Init
====
_|_H_|_L_
*|0.5|0.5

============
Transmission
============
_|_H_|_L_
H|0.5|0.5
L|0.4|0.6

========
Emission
========
_|_A_|_C_|_G_|_T_
H|0.2|0.3|0.3|0.2
L|0.3|0.2|0.2|0.3

========
Observed
========
[G G C A]



In [8]:
for t,x in enumerate(obsX):
    if t == 0:
        alpha_prev = []
        for i in range(2):
            alpha_prev.append(inits[i] * emmit[i, x])
            
    else:
        alpha_new = []
        for i in range(2):
            _jsum = 0
            for j in range(2):
                _jsum += alpha_prev[j] * trans[j, i]
            
            alpha_new.append(_jsum * emmit[i, x])
        
        alpha_prev = deepcopy(alpha_new)
        
print(sum(alpha_new))
assert round(total_prob, 6) == round(sum(alpha_new), 6)
print("Matches brute force!")

0.0038431500000000005
Matches brute force!


Now that we have it working, let's try deriving the *alpha-pass* algorithm again.

Following EQ 5, we can express the joint probability $P(z_t, x_{1:t})$ as 

$$
\begin{align}
P(z_t, x_{1:t}) &= P(x_t | z_t) \sum_{z_{t-1}} P(z_t | z_{t-1}) P(z_{t-1}, x_{1:t-1}) \tag6
\end{align}
$$

Here, $z_t$ is simply a random variable whose value exists in the set of state values $S$; the joint distribution "table" is defined for all possible values that $z_t$ can have. Concretely, we can define the joint probability for a particular state value $s_i$ as:

$$
\begin{align}
P(z_t=s_i, x_{1:t}) &= P(x_t | z_t=s_i) \sum_{s_j \in S} P(z_t = s_i | z_{t-1} = s_j) \times P(z_{t-1} = s_j, x_{1:t-1}) \tag7
\end{align}
$$

With this view, we can define $\alpha_t(s_i)$ as the joint probability of the observed partial sequence up to time $t$ and the state assignment $z_t = s_i$:

$$
\alpha_t(s_i) = P(z_t = s_i, x_{1:t}) \tag8
$$

Now, we can re-write EQ7 in terms of $\alpha, A, B$:

$$
\alpha_t(s_i) = B_{s_i, x_t} \sum_{s_j \in S} A_{s_j, s_i} \times \alpha_{t-1}(s_j) \tag9
$$

The base case for this recursive definition is the first observation $x_1$. This is simply the probability of starting in state $s_i$ and observing $x_1$, for every possible state assignment to $z_1$.

$$
\alpha_1(s_i) = B_{s_i, x_1} \pi_{s_i} \tag{10}
$$

Using EQs 8, 9 and 10, we can express the probability of observing any sequence $x_{1:T}$ as 

$$
\begin{align}
P(x_{1:T}) &= \sum_{s_i \in S} P(x_{1:T}, z_T=s_i) \\
&= \sum_{s_i \in S} \alpha_T(s_i) \tag{11}
\end{align}
$$

Finally! We have now reduced the time complexity from $O(|S|^T)$ to $O(|S|^2T)$. How?
- To arrive at $\alpha_T$, we have to loop over all timesteps $t = 1, \dots, T$.
```python
for t,x in enumerate(obsX):
```

- At every iteration, we compute the intermediate value $\alpha_t(s_i)$ for all $i$, each of which requires a sum over every state value $j$ of the previously computed $\alpha_{t-1}$ (from EQ 9) -- hence the $|S|^2$.
```python
        alpha_new = []
        for i in range(2): ## For current $\alpha_t$
            _jsum = 0
            for j in range(2): ## From previous $\alpha_{t-1}$
                _jsum += alpha_prev[j] * trans[j, i]
            
            alpha_new.append(_jsum * emmit[i, x])
        
        alpha_prev = deepcopy(alpha_new)
```

Once the last `alpha_new` has been computed, the probability of the observed sequence is simply the sum over the states: `return sum(alpha_new)`
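
The inner double loop over `i` and `j` is a vector-matrix product, so the whole pass can also be sketched with NumPy. This is one possible packaging, not the only one: the function name `forward` and the `(T, |S|)` layout of the returned table are my own choices, and I assume `A[j, i]` holds $P(z_t = s_i | z_{t-1} = s_j)$ and `B[i, k]` holds $P(x_t = k | z_t = s_i)$, as in the arrays above.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Return the (T, |S|) table of alpha_t(s_i) defined by EQs 9 and 10."""
    pi, A, B = (np.asarray(m, dtype=float) for m in (pi, A, B))
    alphas = np.zeros((len(obs), len(pi)))
    alphas[0] = pi * B[:, obs[0]]                       # base case, EQ 10
    for t in range(1, len(obs)):
        # (alphas[t-1] @ A)[i] = sum_j alpha_{t-1}(s_j) * A[j, i], then scale by the emission (EQ 9)
        alphas[t] = (alphas[t - 1] @ A) * B[:, obs[t]]
    return alphas

# Second toy example: states H/L, observed sequence G G C A
alphas = forward([0.5, 0.5],
                 [[0.5, 0.5], [0.4, 0.6]],
                 [[0.2, 0.3, 0.3, 0.2], [0.3, 0.2, 0.2, 0.3]],
                 [2, 2, 1, 0])
likelihood = alphas[-1].sum()                           # EQ 11
print(likelihood)  # about 0.00384315, as in the brute-force run
```

Keeping the full table instead of only the last row costs $O(|S|T)$ memory, but the intermediate $\alpha_t$ values are needed again for the state posteriors.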

## Task 2 (a)
Given an observed sequence up to time $t$, what is the distribution over the hidden state $z_t$ at time $t$?

We need to find $P(z_t | x_{1:t})$. This can be done easily using the result from the `forward algorithm`.

$$
\begin{align}
P(z_t = s_i | x_{1:t}) &= \frac{P(z_t = s_i, x_{1:t})}{P(x_{1:t})} \tag{12} \\
&= \frac{\alpha_t(s_i)}{\sum_{s_j}\alpha_t(s_j)} \tag{13}
\end{align}
$$
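
To make EQ 13 concrete, here is a small sketch that normalizes each $\alpha_t$ as it is computed, using the second toy example (H/L states, observations G G C A). The variable names `pi`, `A`, `B`, `obs` and `filtered` are mine, chosen only for this illustration.

```python
import numpy as np

pi = np.array([0.5, 0.5])
A = np.array([[0.5, 0.5], [0.4, 0.6]])                        # A[j, i] = P(z_t = i | z_{t-1} = j)
B = np.array([[0.2, 0.3, 0.3, 0.2], [0.3, 0.2, 0.2, 0.3]])    # B[i, k] = P(x_t = k | z_t = i)
obs = [2, 2, 1, 0]                                            # G G C A

alpha = pi * B[:, obs[0]]                 # alpha_1, EQ 10
filtered = [alpha / alpha.sum()]          # EQ 13 at t = 1
for x in obs[1:]:
    alpha = (alpha @ A) * B[:, x]         # alpha_t from alpha_{t-1}, EQ 9
    filtered.append(alpha / alpha.sum())  # EQ 13 at time t
filtered = np.array(filtered)

print(filtered)  # row t is P(z_t | x_{1:t}) over [H, L]
```

For the first observation this gives $[0.6, 0.4]$: $\alpha_1 = [0.15, 0.10]$, which sums to $0.25$.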

## Task 2 (b)
Finding $P(z_t = s_i | x_{1:T})$ for any arbitrary $t \in [1, T]$ is not trivial. Deriving this to the final "optimized form" took a while because it was not apparent what terms to factorize in the joint expression.

We begin by expressing $P(z_t = s_i | x_{1:T}) \propto P(z_t = s_i, x_{1:T})$. The full expression has $P(x_{1:T})$ in the denominator, as in EQs 12 and 13; we ignore that term for now.

$$
\begin{align}
P(z_t = s_i | x_{1:T}) &\propto P(z_t = s_i, x_{1:T}) \\
&= P(z_t = s_i, x_{1:t}, x_{t+1:T}) \tag{14} \\
&= P(x_{t+1:T} | z_t=s_i, x_{1:t}) \ P(z_t = s_i, x_{1:t}) \tag{15}
\end{align}
$$

In EQs 14 and 15, we factorize the joint into two terms. The second term is $\alpha_t(s_i)$, which we have already solved (EQ 8). We can simplify the first term by exploiting the fact that the future observations $x_{t+1:T}$ are independent of the past observations $x_{1:t}$ given the current hidden state $z_t = s_i$. The following diagram makes this clear.

![](hmm-independence.svg)

With this, we can rewrite it to $P(x_{t+1 : T}|z_t = s_i) \times \alpha_t(s_i)$. Now let's estimate the first term.

$$
\begin{align}
P(x_{t+1 : T}|z_t = s_i) &= \frac{P(x_{t+1 : T}, z_t = s_i)}{P(z_t = s_i)} \\ \\
&= \sum_{s_j \in S} P(x_{t+1:T}, z_t = s_i, z_{t+1}=s_j) / P(z_t = s_i) \tag{16} \\ \\
&= \sum_{s_j \in S} P(x_{t+1}, x_{t+2:T}, z_t = s_i, z_{t+1}=s_j) / P(z_t = s_i) \tag{17} \\ \\
\end{align}
$$

EQ.16 is trivial - we introduce a new state variable $z_{t+1}$. In EQ. 17, we break down the observed states into $x_{t+1}, x_{t+2:T}$. This will allow us to factorize the expression as a recursion over FUTURE time-steps. Continuing...

$$
\begin{align}
&= \sum_{s_j \in S} P(x_{t+2:T} | x_{t+1}, z_t = s_i, z_{t+1}=s_j) \times P(x_{t+1}, z_t = s_i, z_{t+1}=s_j) / P(z_t = s_i) \tag{18} \\ \\
&= \sum_{s_j \in S} P(x_{t+2:T} | z_{t+1}=s_j) \times P(x_{t+1} | z_t = s_i, z_{t+1}=s_j) \times P(z_t = s_i, z_{t+1}=s_j) / P(z_t = s_i) \tag{19} \\ \\
&= \sum_{s_j \in S} P(x_{t+2:T} | z_{t+1}=s_j) \times P(x_{t+1} | z_{t+1}=s_j) \times P(z_{t+1} = s_j | z_t = s_i) \times P(z_t = s_i) / P(z_t = s_i) \tag{20} \\ \\
&= \sum_{s_j \in S} P(x_{t+2:T} | z_{t+1}=s_j) \times B_{s_j, x_{t+1}} \times A_{s_i, s_j} \tag{21}
\end{align}
$$

A lot is going on here. 
- In EQ 19, the first term is a result of applying the concept from our last diagram -- the future observations $x_{t+2:T}$ only depend on the current hidden state $z_{t+1} = s_j$. The simplified form carries over unchanged as the first term in EQ 20.
- The remaining terms in EQ 19 are factors of the second joint probability term in EQ 18.
- In EQ 20, for the second term, we apply the emission independence assumption. The next two terms in the equation are factors of the joint $P(z_t = s_i, z_{t+1} = s_j)$.
- The last two terms in EQ 20 cancel each other, and the final form in terms of our emission and transition matrices is in EQ 21.

And *now*, we finally have our recursive relation. At any time $t$ and a particular state assignment $s_i$ we define $\beta_t(s_i) = P(x_{t+1 : T} | z_t = s_i)$. This is the probability of all future observations, given the current hidden state. Appropriately,

$$
\begin{align}
P(x_{t+1 : T}|z_t = s_i) &= \sum_{s_j \in S} P(x_{t+2:T} | z_{t+1}=s_j) \times B_{s_j, x_{t+1}} \times A_{s_i, s_j} \\ \\
\implies \beta_t(s_i) &= \sum_{s_j \in S} \beta_{t+1}(s_j) \times B_{s_j, x_{t+1}} \times A_{s_i, s_j} \tag{22}
\end{align}
$$

With this in place, we can define the base case at the last timestep $T$ as $\beta_T(s_i) = 1$ for every $s_i \in S$, since there are no observations after time $T$. We then work backwards from $T$, using the future $\beta$ values in each step.
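
As a worked instance of one backward step, take the second toy example (states H/L, observed G G C A, so $T = 4$ and $x_4 = A$). Starting from $\beta_4(H) = \beta_4(L) = 1$, EQ 22 gives:

$$
\begin{align}
\beta_3(H) &= P(H | H) \, P(A | H) \, \beta_4(H) + P(L | H) \, P(A | L) \, \beta_4(L) = 0.5 \times 0.2 \times 1 + 0.5 \times 0.3 \times 1 = 0.25 \\
\beta_3(L) &= P(H | L) \, P(A | H) \, \beta_4(H) + P(L | L) \, P(A | L) \, \beta_4(L) = 0.4 \times 0.2 \times 1 + 0.6 \times 0.3 \times 1 = 0.26
\end{align}
$$

Here $P(s_j | s_i)$ is the transition probability out of $s_i$ and $P(A | s_j)$ is the probability of emitting A from $s_j$, both read off the toy tables.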

Using EQ 15 together with the definitions of $\alpha$ (EQ 8) and $\beta$ (EQ 22), we can now complete the expression $P(z_t = s_i | x_{1:T})$

$$
\begin{align}
P(z_t = s_i | x_{1:T}) &= \frac{P(z_t = s_i, x_{1:T})}{P(x_{1:T})} \tag{joint} \\ \\
&= \frac{P(z_t = s_i, x_{1:t}, x_{t+1:T})}{P(x_{1:T})} \tag{observed seq time split} \\ \\
&= \frac{P(x_{t+1:T} | z_t=s_i, x_{1:t}) \ P(z_t = s_i, x_{1:t})}{P(x_{1:T})} \tag{factoring for $\alpha, \beta$} \\ \\
&= \frac{P(x_{t+1:T} | z_t=s_i) \ P(z_t = s_i, x_{1:t})}{P(x_{1:T})} \tag{"independence" from the diagram} \\ \\
&= \frac{\beta_t(s_i) \ \alpha_t(s_i)}{P(x_{1:T})} \tag{23}
\end{align}
$$

Unfortunately, I have to introduce yet another toy example before we begin. This one is from the [Forward-Backward Algorithm's Wiki page](https://en.wikipedia.org/wiki/Forward%E2%80%93backward_algorithm).

In [9]:
states = ('Healthy', 'Fever')
inits = [0.6, 0.4]

trans = np.array(
    [
        [0.7, 0.3],
        [0.4, 0.6]
    ]
)

outputs = ['normal', 'cold', 'dizzy']
emmit = np.array(
    [
        [0.5, 0.4, 0.1],
        [0.1, 0.3, 0.6]
    ]
)

obsX = [0, 1, 2]

In [10]:
from copy import deepcopy

In [11]:
## forward pass
alphas = []
for t, x in enumerate(obsX):
    if t == 0: ## init prev_alpha
        prev_alpha = []
        for s in range(len(states)):
            prev_alpha.append(emmit[s, x] * inits[s])
        prev_prob = deepcopy(prev_alpha)
            
    else:
        new_alpha = []
        max_prob = []
        
        for i in range(len(states)):
            accum = 0.0
            for j in range(len(states)):
                accum += prev_alpha[j] * trans[j, i]
                
            new_alpha.append(accum * emmit[i, x])
            
            max_prob.append(np.max([prev_prob[j] * trans[j, i] * emmit[i, x] for j in range(len(states))]))
        
        prev_alpha = deepcopy(new_alpha)
        prev_prob = deepcopy(max_prob)
    
    alphas.append(prev_alpha)
            
p_obs = sum(new_alpha)
print("Probability of observing sequence", [outputs[x] for x in obsX], p_obs)
print("Max prob", np.max(max_prob))

## backward pass
betas = []
for t in range(len(obsX))[::-1]:
    if t == (len(obsX)-1):
        ahead_beta = []
        for s in range(len(states)):
            ahead_beta.append(1.0)
            
    else:
        x = obsX[t+1]
        current_beta = []
        for i in range(len(states)):
            accum = 0.0
            for j in range(len(states)):
                accum += trans[i, j] * emmit[j, x] * ahead_beta[j]
            
            current_beta.append(accum)
            
        ahead_beta = deepcopy(current_beta)
    
    print(t, "ahead_beta", ahead_beta)
    betas.insert(0, ahead_beta)

## posterior over the hidden states at every timestep (EQ 23)
for t in range(len(obsX)):
    p_states = [alphas[t][s] * betas[t][s] / p_obs for s in range(len(states))]
    print(p_states, np.argmax(p_states), np.max(p_states))

Probability of observing sequence ['normal', 'cold', 'dizzy'] 0.03628
Max prob 0.01512
2 ahead_beta [1.0, 1.0]
1 ahead_beta [0.25, 0.4]
0 ahead_beta [0.10599999999999998, 0.112]
[0.8765159867695699, 0.12348401323043003] 0 0.8765159867695699
[0.6229327453142228, 0.37706725468577734] 0 0.6229327453142228
[0.2121278941565601, 0.7878721058434399] 1 0.7878721058434399
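
A handy sanity check follows from EQ 23: since the posteriors at each timestep sum to 1, $\sum_{s_i} \alpha_t(s_i) \beta_t(s_i)$ must equal $P(x_{1:T})$ at *every* $t$, not just at $t = T$. Below is a compact NumPy sketch of both passes that verifies this on the Healthy/Fever example. The function name `forward_backward` and the array layout are my own; treat it as an illustration under those assumptions, not a reference implementation.

```python
import numpy as np

def forward_backward(pi, A, B, obs):
    """Return (alphas, betas, posteriors), each of shape (T, |S|)."""
    pi, A, B = (np.asarray(m, dtype=float) for m in (pi, A, B))
    T, S = len(obs), len(pi)
    alphas, betas = np.zeros((T, S)), np.ones((T, S))      # beta_T = 1 is the base case
    alphas[0] = pi * B[:, obs[0]]                          # EQ 10
    for t in range(1, T):
        alphas[t] = (alphas[t - 1] @ A) * B[:, obs[t]]     # EQ 9
    for t in range(T - 2, -1, -1):
        betas[t] = A @ (B[:, obs[t + 1]] * betas[t + 1])   # EQ 22
    posteriors = alphas * betas / alphas[-1].sum()         # EQ 23
    return alphas, betas, posteriors

alphas, betas, posteriors = forward_backward(
    [0.6, 0.4],
    [[0.7, 0.3], [0.4, 0.6]],
    [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]],
    [0, 1, 2],
)
per_step_likelihood = (alphas * betas).sum(axis=1)
print(per_step_likelihood)  # every entry should be about 0.03628
print(posteriors)           # rows should match the p_states printed above
```

If the entries of `per_step_likelihood` disagree with each other, the usual culprit is an index mix-up between `A[i, j]` and `A[j, i]` in one of the two passes.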


## Task 2(c) Viterbi

The task is to find the most likely state sequence, given the observed sequence.

$$
\underset{z_{1:T}}{\text{argmax }} P(z_{1:T} | x_{1:T}) = \underset{z_{1:T}}{\text{argmax }} P(z_{1:T}, x_{1:T}) \tag{joint; denominator is $P(x_{1:T})$}
$$

The probability expression (*not* the `argmax`) can be factorized very elegantly as

$$
\begin{align}
P(z_{1:T}, x_{1:T}) &= P(x_1 | z_1) P(z_1) \prod_{t=2}^T P(x_t | z_t) P(z_t | z_{t-1}) \tag{24}
\end{align}
$$

To get the most likely state assignment, we could try all possible combinations of state assignments for the entire sequence and see which combination maximizes EQ.24. This will be in the order of $|S|^T$.
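
For a sequence as short as ours, that exhaustive search is perfectly feasible and gives a reference answer to compare Viterbi against. A minimal sketch on the Healthy/Fever example follows; `joint_prob`, `best_path` and `best_prob` are names I made up for this illustration.

```python
import itertools
import numpy as np

states = ('Healthy', 'Fever')
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])              # A[i, j] = P(z_t = j | z_{t-1} = i)
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])    # B[i, k] = P(x_t = k | z_t = i)
obs = [0, 1, 2]                                     # normal, cold, dizzy

def joint_prob(zseq):
    """P(z_{1:T}, x_{1:T}) for one state sequence, factorized as in EQ 24."""
    prob = B[zseq[0], obs[0]] * pi[zseq[0]]
    for t in range(1, len(obs)):
        prob *= B[zseq[t], obs[t]] * A[zseq[t - 1], zseq[t]]
    return prob

# Enumerate all |S|^T assignments and keep the one with the largest joint probability.
best_path = max(itertools.product(range(len(states)), repeat=len(obs)), key=joint_prob)
best_prob = joint_prob(best_path)
print([states[s] for s in best_path], best_prob)
```

The winning path Healthy, Healthy, Fever has joint probability $0.3 \times 0.28 \times 0.18 = 0.01512$.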

To motivate why Viterbi works, and its link to Dynamic Programming, consider a simpler problem where you know the best state assignment up to timestep $T-1$. Now, you simply have to pick the link that maximizes the final state assignment. We can extend this logic back to the first state: Viterbi computes the state assignment probabilities at time $t$ which maximize the probability of observing all the data up to time $t$.

For the first observation $x_1$, there is only 1 random variable $z_1$, for which we can easily calculate the assignment probabilities along the lines of EQ 24. For the next observation $x_2$, we consider all possible transitions from the previous state variable $z_1$ to the current variable $z_2$; if every state variable can take $|S|$ possible values, then we consider $|S|^2$ links. For every state assignment of $z_2$, we then pick the incoming link which has the highest probability.

![](hmm-viterbi.svg)

We can use EQ 24 to compute the joint probability of the states up to any time $t$. For $t=1$, this is simply the initial state probability times the emission probability:
$$
P(z_1 = s_i, x_1) = B_{s_i, x_1}\times \pi_{s_i}
$$

For $t = 2$, we can write the joint as follows
$$
P(z_1, z_2, x_1, x_2) = B_{z_1, x_1}\pi_{z_1} \times B_{z_2, x_2} A_{z_1, z_2}
$$

The first factor, $B_{z_1, x_1}\pi_{z_1}$, is already available from the previous step, so we never have to recompute the probability of the whole sequence from scratch.

The complete algorithm infers the hidden state sequence in two stages:
1. In the first stage, we iterate over all timesteps starting from 1 and, for *every* possible current state, keep track of
    - The probability of the most probable path "reaching here and observing the data till this point".
    - The previous state on that most probable path, i.e. the most probable connection into the current state.
2. In the second stage, we backtrack from the last timestep, selecting the most probable state assignments, through the most likely paths.

I follow the notation used by Rabiner for the variables used in Viterbi (`delta` for the state probabilities, and `psi` for the most probable path).

In [27]:
## Viterbi.
delta = []
psi = []

## stage1
for t, x in enumerate(obsX):
    if t == 0:
        _d = []
        _p = []
        for s in range(len(states)):
            _d.append(emmit[s, x] * inits[s])
            _p.append(0)
            
        delta.append(_d)
        psi.append(_p)
        
    else:
        _d = []
        _p = []
        for curr in range(len(states)):
            temp = []
            for prev in range(len(states)):
                temp.append(delta[t-1][prev] * trans[prev, curr] * emmit[curr, x]) ## delta
                
            _d.append(max(temp))
            
            temp = []
            for prev in range(len(states)):
                temp.append(delta[t-1][prev] * trans[prev, curr]) ## psi
                
            _p.append(np.argmax(temp))
            
        delta.append(_d)
        psi.append(_p)
        
## stage2
decoding = []
qt = np.argmax(delta[-1])
decoding.append(qt)

for t in range(len(obsX))[::-1][:-1]:
    qt = psi[t][qt]
    decoding.append(qt)
    
## answer
print("Hidden State Sequence:", [states[s] for s in reversed(decoding)])
print("State Probabilities ['Healthy', 'Fever']:", delta)

Hidden State Sequence: ['Healthy', 'Healthy', 'Fever']
State Probabilities ['Healthy', 'Fever']: [[0.3, 0.04000000000000001], [0.084, 0.027], [0.00588, 0.01512]]


The results match with the Wiki example.
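
The two stages can also be packaged as a reusable function. The sketch below replaces the inner loops over `prev` with NumPy operations but is otherwise the same algorithm; the function name `viterbi` and its return values are my own choices for illustration.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Return (most likely state path as a list of indices, delta table of shape (T, |S|))."""
    pi, A, B = (np.asarray(m, dtype=float) for m in (pi, A, B))
    T, S = len(obs), len(pi)
    delta = np.zeros((T, S))
    psi = np.zeros((T, S), dtype=int)
    delta[0] = pi * B[:, obs[0]]
    # stage 1: best path probability into every state, plus the back-pointer
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A        # scores[prev, curr]
        psi[t] = scores.argmax(axis=0)            # best previous state for each current state
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # stage 2: backtrack from the most probable final state
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1], delta

states = ('Healthy', 'Fever')
path, delta = viterbi([0.6, 0.4],
                      [[0.7, 0.3], [0.4, 0.6]],
                      [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]],
                      [0, 1, 2])
print([states[s] for s in path])
print(delta)
```

The emission term does not depend on `prev`, so it is multiplied in after the max; this gives the same `delta` and `psi` as computing it inside the loop over `prev`.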