In [1]:
import numpy as np
import matplotlib.pyplot as plt

This blog post is based on results from [1], where the authors consider consistent estimation of the number of hidden states in a state-emitting HMM. They employ two techniques: *Shtarkov normalized maximum likelihood* and *Krichevsky-Trofimov mixtures*.

In the following we narrow their framework down to a specific class of hidden Markov models, namely *unifilar edge-emitting HMMs*.

Let us fix some notation:
 - $\mathcal{S}$ - a finite set of hidden states
 - $\mathcal{X}$ - an alphabet of observed symbols
 - $\lbrace \mathcal{T}^{(x)} \rbrace$ - a set of sub-stochastic matrices indexed by $x \in \mathcal{X}$ where $\mathcal{T}^{(x)}_{ij} = \Pr (S_{t+1} = j, X_t = x \mid S_t = i)$
 - $\mu$ - the initial distribution of hidden states; if a stationary distribution exists, we take $\mu$ to be stationary and assume that $X_1^n$ is observed after stationarity has been reached.

The goal is to estimate $k = |\mathcal{S}|$ given a word $x_1^n \in \mathcal{X}^n$.

The probability of observing a word $x_1^n$ is given by
$$
\Pr (x_1^n) = \mu^\intercal \prod_{i=1}^n \mathcal{T}^{(x_i)} \mathbf{1}
$$
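As a minimal sketch of the formula above, the product can be evaluated left to right as a row vector times each labeled matrix. The function name `word_probability` and the two-state example machine (a version of the "even process") are illustrative assumptions, not taken from [1].

```python
import numpy as np

def word_probability(mu, T, word):
    """Evaluate Pr(x_1^n) = mu^T (prod_i T^{(x_i)}) 1, left to right."""
    v = np.asarray(mu, dtype=float)
    for x in word:
        v = v @ T[x]          # row vector times the sub-stochastic matrix of symbol x
    return float(v.sum())     # multiplying by the all-ones vector sums the entries

# Hypothetical example: two hidden states, alphabet {0, 1}.
# State 0 emits 0 (stay) or 1 (move to state 1) with probability 1/2 each;
# state 1 always emits 1 and returns to state 0.
T = {
    0: np.array([[0.5, 0.0], [0.0, 0.0]]),
    1: np.array([[0.0, 0.5], [1.0, 0.0]]),
}
mu = np.array([2 / 3, 1 / 3])  # stationary distribution of T[0] + T[1]

print(word_probability(mu, T, [0, 1, 1, 0]))
print(word_probability(mu, T, [0, 1, 0]))  # forbidden word for this machine
```

For long words the running vector underflows, so a practical version would renormalize `v` at every step and accumulate the logarithms of the normalizing constants.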

In a unifilar setting, once we know the initial state $s_1$, the trajectory of hidden states $s_1^n$ is determined by the observed sequence $x_1^n$:
$$
\Pr \left( x_1^n \mid s_1, \lbrace p_i (\cdot) \rbrace \right) = \prod_{i=1}^n p_{s_i} (x_i), \quad s_{i+1} = \delta (s_i, x_i)
$$
for a deterministic function $\delta : \mathcal{S} \times \mathcal{X} \to \mathcal{S}$ and
$$
p_{s_i} (x) = \sum_{j} \mathcal{T}_{s_i j}^{(x)} = \mathcal{T}_{s_i, \delta(s_i, x)}^{(x)}
$$
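The unifilar factorization can be sketched in a few lines: follow $\delta$ along the word and accumulate the log-probabilities of the emitted symbols. The dictionary encodings of `delta` and `p`, and the state names `"A"` and `"B"`, are assumptions made for this illustration.

```python
import math

def unifilar_log2_likelihood(word, s1, delta, p):
    """log2 Pr(x_1^n | s_1) for a unifilar HMM.

    delta: dict mapping (state, symbol) -> next state
    p:     dict mapping state -> dict of symbol -> emission probability
    """
    s = s1
    total = 0.0
    for x in word:
        px = p[s].get(x, 0.0)
        if px == 0.0:
            return -math.inf   # the word cannot be generated from s1
        total += math.log2(px)
        s = delta[(s, x)]      # the next state is a deterministic function of (s, x)
    return total

# Hypothetical two-state machine over the alphabet {0, 1}
delta = {("A", 0): "A", ("A", 1): "B", ("B", 1): "A"}
p = {"A": {0: 0.5, 1: 0.5}, "B": {1: 1.0}}

print(unifilar_log2_likelihood([0, 1, 1, 0], "A", delta, p))  # -3.0
print(unifilar_log2_likelihood([0, 1, 0], "A", delta, p))     # -inf
```

No sum over hidden paths is needed here, which is the main computational advantage of the unifilar restriction.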

Inspired by the Kraft-McMillan theorem, they choose $k$ by minimizing the number of bits required to compress $X_1^n$ with a uniquely decodable code, plus a penalty for model complexity:
$$
\hat{k} = \arg \min_k (-\log_2 Q_k^n (X_1^n) + \lambda (n,k))
$$

Why is a well-chosen penalty $\lambda$ essential to guarantee consistency?
 - if a penalty is too small, we risk overestimating $k$
 - if a penalty is too large, we risk underestimating $k$.
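
The selection rule and the trade-off just described can be sketched as follows. The code lengths below are made-up numbers used purely for illustration, and the three penalty functions are hypothetical choices, not the penalties analyzed in [1].

```python
import math

def select_order(code_lengths, penalty, n):
    """Return the k minimizing -log2 Q_k^n(x_1^n) + penalty(n, k).

    code_lengths: dict mapping k -> code length of the observed word in bits
    penalty:      callable (n, k) -> non-negative penalty in bits
    """
    return min(code_lengths, key=lambda k: code_lengths[k] + penalty(n, k))

# Made-up code lengths (bits) for a word of length n = 1000.
# Larger models never compress worse, so the values decrease with k.
n = 1000
code_lengths = {1: 980.0, 2: 900.0, 3: 898.0}

no_penalty = lambda n, k: 0.0
moderate = lambda n, k: 0.5 * k * math.log2(n)
huge = lambda n, k: 100.0 * k

print(select_order(code_lengths, no_penalty, n))  # 3: overestimates
print(select_order(code_lengths, moderate, n))    # 2
print(select_order(code_lengths, huge, n))        # 1: underestimates
```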

To guarantee consistency, the penalty must offset
$$
\sup_{\theta \in \Theta^k} L^n (\theta, X_1^n) - L^n (\theta_{true}, X_1^n)
$$

How should we choose a coding distribution $Q_k^n (\cdot)$ for the words $x_1^n \in \mathcal{X}^n$?

### Shtarkov NML

$$
NML_k^n (x_1^n) = \frac{\sup_{\theta \in \Theta^k} \mathbb{P}_\theta (x_1^n)}{C(k,n)}
$$

where
$$
C(k,n) = \sum_{x_1^n \in \mathcal{X}^n} \sup_{\theta \in \Theta^k} \mathbb{P}_\theta (x_1^n)
$$
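
To get a feeling for the normalizer, here is a sketch for the simplest case only: $k = 1$ with a binary alphabet, where the model reduces to an i.i.d. Bernoulli source and the supremum has the closed form $(j/n)^j ((n-j)/n)^{n-j}$ for a word containing $j$ ones. The function names are hypothetical, and this is not the HMM computation of [1].

```python
from itertools import product
from math import comb

def max_likelihood_iid(word):
    """sup over theta of theta^j (1 - theta)^(n - j), attained at theta = j / n."""
    n, j = len(word), sum(word)
    return (j / n) ** j * ((n - j) / n) ** (n - j)   # Python evaluates 0.0 ** 0 as 1.0

def shtarkov_sum_brute_force(n):
    """C(1, n) by enumerating all 2^n binary words."""
    return sum(max_likelihood_iid(w) for w in product((0, 1), repeat=n))

def shtarkov_sum_by_counts(n):
    """Same quantity, grouping words by their number of ones."""
    return sum(comb(n, j) * (j / n) ** j * ((n - j) / n) ** (n - j)
               for j in range(n + 1))

print(shtarkov_sum_by_counts(2))                                  # 2.5
print(shtarkov_sum_brute_force(8), shtarkov_sum_by_counts(8))
```

Grouping by counts reduces the cost from $2^n$ terms to $n + 1$ terms. For $k > 1$ the supremum has no such simple form, which is why bounds on $C(k,n)$ are used instead.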

Computing $C(k,n)$ directly requires a sum over all $|\mathcal{X}|^n$ words, which quickly becomes infeasible as $n$ grows. In [1, Lemma 8] an upper bound is derived for $C(k,n)$. A special case, where $C(k,n) = C(k)$, is a BIC estimator.

**Open question: can we find a tighter upper bound for $C(k,n)$ if we narrow down the scope to unifilar HMM?**

### Krichevsky-Trofimov mixture

KT is a Bayesian approach in which the prior distribution on the parameters is Dirichlet and the coding distribution is the expectation of the likelihood over the parameters (more details and formulas can be found in [1]):
$$
KT_k^n (x_1^n) = \mathbb{E}_{\nu_k} \left[ \mathbb{P}_{\mu, \theta} (x_1^n) \right]
$$

where $\nu_k$ is the density of a Dirichlet distribution on $\Theta^k$.
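
For a unifilar machine with a known initial state and a fixed transition map $\delta$, the mixture can be sketched with one KT estimator per state. The sketch assumes independent Dirichlet$(1/2, \dots, 1/2)$ priors on each state's emission distribution, in which case the expectation factorizes into sequential predictive probabilities $(n_{s,x} + 1/2) / (n_s + |\mathcal{X}|/2)$. This is an adaptation to the unifilar setting, not the state-emitting construction of [1], and the name `kt_log2_prob` is hypothetical.

```python
import math
from collections import defaultdict

def kt_log2_prob(word, s1, delta, alphabet_size):
    """log2 of the per-state KT mixture probability of `word`.

    delta must be defined for every (state, symbol) pair that is visited.
    """
    counts = defaultdict(int)   # counts[(s, x)]: times x was emitted from state s
    totals = defaultdict(int)   # totals[s]: number of visits to state s
    s = s1
    log2_q = 0.0
    for x in word:
        # KT predictive probability of x given the past emissions from state s
        q = (counts[(s, x)] + 0.5) / (totals[s] + alphabet_size / 2)
        log2_q += math.log2(q)
        counts[(s, x)] += 1
        totals[s] += 1
        s = delta[(s, x)]
    return log2_q

# Single-state example (i.i.d. source) over {0, 1}: KT assigns "01" probability 1/8
delta_one_state = {(0, 0): 0, (0, 1): 0}
print(kt_log2_prob([0, 1], 0, delta_one_state, alphabet_size=2))  # -3.0
```

The quantity `-kt_log2_prob(...)` is the code length in bits and can be computed in a single pass over the word.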

We will seek not only $k$, as in [1], but the pair $(k, \delta)$. For a fixed $k$ we choose a set of candidate topologies $\mathcal{D}_k = \lbrace \delta \rbrace$.
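
One possible choice of candidate set, assumed here only for illustration, is every map $\delta : \mathcal{S} \times \mathcal{X} \to \mathcal{S}$. There are $k^{k |\mathcal{X}|}$ such maps, so exhaustive enumeration is only realistic for small $k$ and small alphabets. The generator name `all_topologies` is hypothetical.

```python
from itertools import product

def all_topologies(k, alphabet):
    """Yield every map delta: (state, symbol) -> state as a dict."""
    keys = [(s, x) for s in range(k) for x in alphabet]
    for targets in product(range(k), repeat=len(keys)):
        yield dict(zip(keys, targets))

candidates = list(all_topologies(2, [0, 1]))
print(len(candidates))  # 2 ** (2 * 2) = 16
```

Many of these maps differ only by a relabeling of states or are not strongly connected, so a practical candidate set would likely be pruned further.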

## References

1. Gassiat, E. & Boucheron, S. Optimal error exponents in hidden Markov models order estimation. IEEE Trans. Inform. Theory 49, 964–980 (2003).