# Bayesian Inference in the Poisson HMM

In this notebook we will fit a Hidden Markov Model. See the slides for Lap 7 ([Part 1](https://github.com/slinderman/stats271sp2021/blob/main/slides/lap7_hmms.pdf) and [Part 2](https://github.com/slinderman/stats271sp2021/blob/main/slides/lap7_hmms_b.pdf)) for more information on this model.

Mathematically, let $\mathbf{x}_t^{(v)} \in \mathbb{R}^{20}$ denote the combined features at time step $t$ of the $v$-th event, and let $\mathbf{x}_{1:T_v}^{(v)} = (\mathbf{x}_1^{(v)}, \ldots, \mathbf{x}_{T_v}^{(v)})$ denote the full sequence of features for the $v$-th event, where $T_v$ is the number of time steps. Likewise, let $z_{1:T_v}^{(v)} = (z_1^{(v)}, \ldots, z_{T_v}^{(v)})$ denote the sequence of discrete states for the $v$-th event.

The joint distribution of the complete dataset, given the parameters $\Theta$, factorizes over events and over time steps:

\begin{align}
p(\{(z_{1:T_v}^{(v)}, \mathbf{x}_{1:T_v}^{(v)})\}_{v=1}^V \mid \Theta)
&= \prod_{v=1}^V p(z_{1:T_v}^{(v)}, \mathbf{x}_{1:T_v}^{(v)} \mid \Theta) \\
&= \prod_{v=1}^V \left[p(z_1^{(v)} \mid \Theta) \prod_{t=2}^{T_v} p(z_{t}^{(v)} \mid z_{t-1}^{(v)} , \Theta)  \prod_{t=1}^{T_v} p(\mathbf{x}_{t}^{(v)} \mid z_{t}^{(v)}, \Theta) \right]
\end{align}
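
As a rough illustration of this factorization, the sketch below evaluates the log of the joint for given state sequences. It is only a sketch: the function names and the `log_lik` array (holding $\log p(\mathbf{x}_t \mid z_t = k, \Theta)$ for every $t$ and $k$) are assumptions made for this example and are not part of the notebook's code.

```python
import numpy as np

def complete_data_log_joint(z, log_pi, log_P, log_lik):
    """log p(z_{1:T}, x_{1:T} | Theta) for a single event.

    z:       (T,) integer state sequence
    log_pi:  (K,) log initial state distribution
    log_P:   (K, K) log transition matrix, rows indexed by z_{t-1}
    log_lik: (T, K) array with log_lik[t, k] = log p(x_t | z_t = k, Theta)
    """
    z = np.asarray(z)
    T = len(z)
    lp = log_pi[z[0]]                      # initial state term
    lp += log_P[z[:-1], z[1:]].sum()       # transition terms, t = 2..T
    lp += log_lik[np.arange(T), z].sum()   # emission terms, t = 1..T
    return lp

def dataset_log_joint(zs, log_pi, log_P, log_liks):
    """Sum of per-event log joints (the product over v, in log space)."""
    return sum(
        complete_data_log_joint(z, log_pi, log_P, ll)
        for z, ll in zip(zs, log_liks)
    )
```

Working in log space turns the products over $v$ and $t$ into sums, which avoids numerical underflow on long sequences.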

The goal is to find the parameters $\Theta$ that maximize the marginal likelihood of the data, with the discrete states summed out, using expectation-maximization (EM). Then we'll use cross-validation, holding out a random subset of events, to determine the number of discrete states. Finally, we'll visualize the inferred states in terms of the distribution over features.
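
The marginal likelihood that EM maximizes can be computed for one event with the forward algorithm. The sketch below is a minimal log-space version, assuming a precomputed `log_lik` array of per-state emission log-likelihoods with finite entries. The function names are hypothetical and this is not the notebook's own implementation.

```python
import numpy as np

def _logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis (finite inputs)."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def hmm_log_marginal(log_pi, log_P, log_lik):
    """log p(x_{1:T} | Theta), summing over all state sequences.

    log_pi:  (K,) log initial state distribution
    log_P:   (K, K) log transition matrix, rows indexed by z_{t-1}
    log_lik: (T, K) array with log_lik[t, k] = log p(x_t | z_t = k, Theta)
    """
    T, K = log_lik.shape
    # log_alpha[k] = log p(x_{1:t}, z_t = k | Theta)
    log_alpha = log_pi + log_lik[0]
    for t in range(1, T):
        # Sum over the previous state (axis 0), then add the new emission term.
        log_alpha = _logsumexp(log_alpha[:, None] + log_P, axis=0) + log_lik[t]
    return _logsumexp(log_alpha, axis=0)
```

Summing this quantity over held-out events gives a score that could be compared across different numbers of states during cross-validation.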

<img src="../hmm.png" width="400">

\begin{align}
\theta_k &\sim \mathrm{Dirichlet}_k(\alpha)\\
z_t &\sim \mathrm{Discrete}(\theta_{z_{t-1}})\\
\mu_k &\sim \mathrm{Normal}(0, \sigma)\\
y_t &\sim \mathrm{Poisson}(\mu_{z_t}^\top x_t)
\end{align}
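
To make the generative process concrete, here is a small sampling sketch. It rests on several assumptions that the model statement above does not specify: $z_1$ is drawn uniformly, $\sigma$ is treated as a standard deviation, and, because a Poisson rate must be positive, the rate is taken as $\exp(\mu_{z_t}^\top x_t)$ (a log link) rather than the raw inner product. The function name and default values are hypothetical.

```python
import numpy as np

def sample_poisson_hmm(X, K=3, alpha=1.0, sigma=0.1, seed=0):
    """Draw parameters, states, and counts for covariates X of shape (T, D)."""
    rng = np.random.default_rng(seed)
    T, D = X.shape
    # One Dirichlet-distributed transition row per state: theta[k] = p(z_t | z_{t-1} = k)
    theta = rng.dirichlet(alpha * np.ones(K), size=K)
    # One weight vector per state (sigma used as a standard deviation: assumption)
    mu = rng.normal(0.0, sigma, size=(K, D))
    z = np.empty(T, dtype=int)
    y = np.empty(T, dtype=int)
    for t in range(T):
        # Assumption: uniform distribution over the first state
        probs = np.full(K, 1.0 / K) if t == 0 else theta[z[t - 1]]
        z[t] = rng.choice(K, p=probs)
        # Assumption: log link keeps the Poisson rate positive
        rate = np.exp(mu[z[t]] @ X[t])
        y[t] = rng.poisson(rate)
    return z, y, theta, mu

# Example with synthetic covariates (not the dengue data)
X_demo = np.ones((50, 4))
z_demo, y_demo, theta_demo, mu_demo = sample_poisson_hmm(X_demo)
```

Simulating from the model like this is a cheap way to check that an inference routine can recover known states and parameters before running it on real data.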

In [1]:
import sys
sys.path.append('../')
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from src.d01_data.dengue_data_api import DengueDataApi

In [4]:
dda = DengueDataApi()
x_train, x_validate, y_train, y_validate = dda.split_data()

In [None]:
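def format_data(df):
    """Split a frame indexed by (city, year, ...) into one array per city-year."""
    # Sort by the index so rows within each city-year stay in order.
    # DataFrame.sort_values requires a `by` argument, so sort_index is assumed here.
    df = df.sort_index()
    datas = []
    for city in df.index.get_level_values('city').unique():
        for year in df.loc[city].index.get_level_values('year').unique():
            datas.append(df.loc[city].loc[year].values)
    return datas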
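
The intended output appears to be a list with one array per (city, year) pair, each of which can be treated as a separate sequence for the HMM. The sketch below illustrates that shape on a toy frame. The third index level `weekofyear`, the city labels, and the column names are assumptions made for this example, and the `groupby` one-liner is an alternative way to get the same split, not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Toy frame: 2 cities x 2 years x 3 weeks, with 2 feature columns (all hypothetical)
index = pd.MultiIndex.from_product(
    [["iq", "sj"], [2000, 2001], [1, 2, 3]],
    names=["city", "year", "weekofyear"],
)
toy = pd.DataFrame(np.arange(24).reshape(12, 2), index=index, columns=["f0", "f1"])

# One (weeks, features) array per city-year; groupby sorts the group keys by default
sequences = [
    group.values
    for _, group in toy.sort_index().groupby(level=["city", "year"])
]
```

Each element of `sequences` then plays the role of one event's feature sequence $\mathbf{x}_{1:T_v}^{(v)}$.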
    