# Intro

The aim of this notebook is to show how discretization affects the training of Gaussian hidden Markov models (HMMs), and to investigate a co-occurrence-based learning procedure as an alternative to EM.

## Plan

1. Define several Gaussian HMMs
2. Train my model using:
    - EM
    - co-occurrence (pay attention to SGD hyperparameters!)
3. Evaluate the results:
    - log-likelihood
    - accuracy
    - $d_{tv}$, KL on transition matrix
    - RMSE on means and covariance matrices
4. Repeat each experiment several (10) times to assess the stability of the proposed methods.
5. Present the results:
    - table summarising numerical results
    - visualize real/learned distribution
    - present data, nodes and clusters
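
The evaluation metrics from step 3 can be sketched in a few lines of NumPy. This is only an illustration: the function names below are hypothetical, and averaging the row-wise distances is an assumption that may differ from what `total_variance_dist` in `source.utils.utils` actually computes.

```python
import numpy as np

def total_variation(P, Q):
    """Mean total variation distance between corresponding rows of P and Q."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    return 0.5 * np.abs(P - Q).sum(axis=1).mean()

def kl_divergence(P, Q, eps=1e-12):
    """Mean row-wise KL(P || Q); eps guards against log(0)."""
    P = np.clip(np.asarray(P, dtype=float), eps, None)
    Q = np.clip(np.asarray(Q, dtype=float), eps, None)
    return (P * np.log(P / Q)).sum(axis=1).mean()

def rmse(A, B):
    """Root mean squared error over all entries (e.g. means or covariances)."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    return np.sqrt(np.mean((A - B) ** 2))

# Compare a row-stochastic matrix with a uniform one (toy reference point).
true_T = np.array([[0.7, 0.2, 0.1],
                   [0.3, 0.5, 0.2],
                   [0.3, 0.3, 0.4]])
uniform_T = np.full((3, 3), 1 / 3)

tv = total_variation(true_T, uniform_T)
kl = kl_divergence(true_T, uniform_T)
```

All three metrics assume that the hidden states of the learned model have already been matched to the states of the true model; otherwise rows are compared in the wrong order.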

## Setup

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import urllib.request
import itertools
from scipy.stats import multivariate_normal

from ssm.util import find_permutation
from ssm.plots import gradient_cmap, white_to_color_cmap

from hmmlearn import hmm
from source.utils.utils import total_variance_dist
from source.model.discretized_HMM import DiscreteHMM, DISCRETIZATION_TECHNIQUES
LEARNING_ALGORITHMS = ["em", "cooc"]

In [2]:
sns.set_style("white")

with urllib.request.urlopen('https://xkcd.com/color/rgb.txt') as f:
    colors = f.readlines()
# Each line is "<name>\t<hex>\t\n" as bytes; the first line is a header.
color_names = [c.decode().split('\t')[0] for c in colors[1:]]

colors = sns.xkcd_palette(color_names)
cmap = gradient_cmap(colors)

In [3]:
np.random.seed(42)

true_model = hmm.GaussianHMM(n_components=3, covariance_type="full")
true_model.startprob_ = np.array([0.6, 0.3, 0.1])
true_model.transmat_ = np.array([[0.7, 0.2, 0.1],
                                 [0.3, 0.5, 0.2],
                                 [0.3, 0.3, 0.4]])

true_model.means_ = np.array([[0.0, 0.0], [3.0, -3.0], [4.0, 3.0]])
true_model.covars_ = np.array([[[1, -.5], [-.5, 1.2]],
                               [[.6, -.5], [-.5, 1.2]],
                               [[1.5, .5], [.5, 2.2]]]) * .8

In [12]:
X_train, Z_train = true_model.sample(10000)
X_test, Z_test = true_model.sample(1000)
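
Before moving on, here is a minimal sketch of what discretizing such samples could look like and how a co-occurrence matrix can be built from the resulting symbols. It assumes that the "random" technique picks observed points as nodes and that each observation is mapped to its nearest node; whether `DiscreteHMM` works exactly this way is an assumption, and the data below is a toy stand-in for `X_train`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D observations standing in for X_train (not the notebook's data).
X = rng.normal(size=(500, 2))

# Assumed "random" discretization: choose observed points as nodes.
n_nodes = 20
nodes = X[rng.choice(len(X), size=n_nodes, replace=False)]

# Map every observation to the index of its nearest node (Euclidean distance).
dists = np.linalg.norm(X[:, None, :] - nodes[None, :, :], axis=2)
symbols = dists.argmin(axis=1)

# Co-occurrence matrix: relative frequency of consecutive symbol pairs.
cooc = np.zeros((n_nodes, n_nodes))
np.add.at(cooc, (symbols[:-1], symbols[1:]), 1)
cooc /= cooc.sum()
```

The matrix `cooc` summarises the whole sequence in `n_nodes * n_nodes` numbers, which is the kind of statistic a co-occurrence-based learner can be fitted to instead of the raw sequence.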

## Define experiments

In [15]:
results = []

for learning_alg, discretize_meth in itertools.product(LEARNING_ALGORITHMS, DISCRETIZATION_TECHNIQUES):
    print(learning_alg, discretize_meth)
    try:
        model = DiscreteHMM(
            discretize_meth, 20, n_components=3, learning_alg=learning_alg,
            verbose=True, optim_params=dict(max_epoch=100000, lr=0.1), n_iter=100,
        )
        model.fit(X_train)
        Z_hat = model.predict(X_test)
        Z_hat = find_permutation(np.concatenate([Z_test, np.array([0, 1, 2])]), np.concatenate([Z_hat, np.array([0, 1, 2])]))[Z_hat]

        results.append({
            'learning_alg': learning_alg,
            'discretize_meth': discretize_meth,
            'accuracy': (Z_hat == Z_test).mean(),
            'transmat_dtv': total_variance_dist(model.transmat_, true_model.transmat_),
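            # Sum of absolute element-wise differences. Passing the second array
            # as a positional argument to np.abs would make it the `out` buffer.
            'means_diff': np.abs(model.means_ - true_model.means_).sum(),
            'covars_diff': np.abs(model.covars_ - true_model.covars_).sum(),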
            'loglikelihood': model.score(X_test)
        })
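    except Exception as exc:
        # Report the failure instead of silently skipping the configuration.
        print(f"{learning_alg} {discretize_meth} failed: {exc!r}")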

em random


Even though the 'transmat_' attribute is set, it will be overwritten during initialization because 'init_params' contains 't'
Even though the 'means_' attribute is set, it will be overwritten during initialization because 'init_params' contains 'm'
Even though the 'covars_' attribute is set, it will be overwritten during initialization because 'init_params' contains 'c'
         1      -51939.6446             +nan
         2      -49356.7219       +2582.9228
         3      -49294.3820         +62.3398
         4      -49257.5234         +36.8586
         5      -49220.6513         +36.8721
         6      -49177.6421         +43.0092
         7      -49124.9280         +52.7140
         8      -49060.9095         +64.0185
         9      -48982.6379         +78.2716
        10      -48873.1593        +109.4786
        11      -48670.2916        +202.8677
        12      -48130.9228        +539.3688
        13      -46623.4002       +1507.5226
        14      -45881.0606        +742.33

em latin_cube_u


Even though the 'transmat_' attribute is set, it will be overwritten during initialization because 'init_params' contains 't'
Even though the 'means_' attribute is set, it will be overwritten during initialization because 'init_params' contains 'm'
Even though the 'covars_' attribute is set, it will be overwritten during initialization because 'init_params' contains 'c'
         1      -57086.0537             +nan
         2      -47894.0764       +9191.9773
         3      -44645.9967       +3248.0797
         4      -39328.9868       +5317.0099
         5      -29863.0845       +9465.9023
         6      -24958.2114       +4904.8731
         7      -18156.3229       +6801.8885
         8       11537.4399      +29693.7628
         9       11609.9990         +72.5591
        10       11610.0146          +0.0156
        11       11610.0093          -0.0054


em latin_cube_q


Even though the 'transmat_' attribute is set, it will be overwritten during initialization because 'init_params' contains 't'
Even though the 'means_' attribute is set, it will be overwritten during initialization because 'init_params' contains 'm'
Even though the 'covars_' attribute is set, it will be overwritten during initialization because 'init_params' contains 'c'
         1      -49910.6252             +nan
         2      -43622.0032       +6288.6221
         3      -43097.3583        +524.6449
         4      -42494.5420        +602.8163
         5      -41533.2106        +961.3313
         6      -40688.4228        +844.7879
         7      -40325.1825        +363.2403
         8      -40112.5903        +212.5922
         9      -39968.5462        +144.0441
        10      -39859.4484        +109.0978
        11      -39786.6796         +72.7688
        12      -39752.6701         +34.0095
        13      -39741.9467         +10.7234
        14      -39739.1556          +2.79

em uniform


Even though the 'transmat_' attribute is set, it will be overwritten during initialization because 'init_params' contains 't'
Even though the 'means_' attribute is set, it will be overwritten during initialization because 'init_params' contains 'm'
Even though the 'covars_' attribute is set, it will be overwritten during initialization because 'init_params' contains 'c'
         1      -61355.1817             +nan
         2      -57671.0877       +3684.0941
         3      -50711.1893       +6959.8984
         4      -21924.3979      +28786.7914
         5      -13415.6933       +8508.7046
         6        5477.6832      +18893.3764
         7       -7745.0923      -13222.7754


cooc random


         1      -55991.6968             +nan
         2      -55305.0224        +686.6744
         3      -54862.7305        +442.2919
         4      -54589.7087        +273.0218
         5      -54429.6407        +160.0680
         6      -54344.7495         +84.8912
         7      -54310.4242         +34.3254
         8      -54310.6345          -0.2104


cooc latin_cube_u
cooc latin_cube_q


         1      -52380.8679             +nan
         2      -52467.3521         -86.4842


cooc uniform


         1      -51230.5047             +nan
         2      -51788.9103        -558.4056
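
Accuracy is only meaningful once predicted state labels are matched to the true ones, which the loop above delegates to `find_permutation` from `ssm`. As a rough stand-in for that step (not the `ssm` implementation), the relabelling can be sketched by brute force, which is cheap for 3 states since there are only 3! = 6 permutations. The helper name `align_labels` is hypothetical.

```python
import itertools
import numpy as np

def align_labels(z_true, z_pred, n_states):
    """Relabel z_pred with the permutation that maximizes agreement with z_true."""
    z_true, z_pred = np.asarray(z_true), np.asarray(z_pred)
    best_perm, best_acc = None, -1.0
    for perm in itertools.permutations(range(n_states)):
        perm = np.array(perm)
        # perm[k] is the true label assigned to predicted label k.
        acc = (perm[z_pred] == z_true).mean()
        if acc > best_acc:
            best_perm, best_acc = perm, acc
    return best_perm[z_pred], best_acc

z_true = np.array([0, 0, 1, 1, 2, 2, 0])
z_pred = np.array([2, 2, 0, 0, 1, 1, 2])  # same segmentation, different label names
z_aligned, acc = align_labels(z_true, z_pred, n_states=3)
```

Note that the mapping here goes from predicted labels to true labels; applying a permutation in the opposite direction gives a different relabelling whenever the permutation is not its own inverse.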


In [16]:
pd.DataFrame(results)

| | learning_alg | discretize_meth | accuracy | transmat_dtv | means_diff | covars_diff | loglikelihood |
|---|---|---|---|---|---|---|---|
| 0 | em | random | 0.475 | 0.332477 | 18.930874 | 28.267222 | -7824.480002 |
| 1 | em | latin_cube_u | 0.481 | 0.222334 | 15.636164 | 76.818576 | -23229.267134 |
| 2 | em | latin_cube_q | 0.454 | 0.263428 | 18.867262 | 21.7215 | -8370.499869 |
| 3 | em | uniform | 0.497 | 0.190199 | 29.426423 | 24.772629 | -16471.052958 |
| 4 | cooc | random | 0.48 | 0.612697 | 16.37077 | 101.704704 | -5449.626876 |
| 5 | cooc | latin_cube_q | 0.423 | 0.499271 | 16.269599 | 73.931223 | -5304.368782 |
| 6 | cooc | uniform | 0.479 | 0.334252 | 16.202467 | 77.662555 | -5201.554874 |
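
For step 4 of the plan (repeating each experiment), per-run records shaped like the entries of `results` could be summarised by mean and standard deviation per configuration. The sketch below uses only the standard library; the numbers in `runs` are made-up placeholders for illustration, not outputs of this notebook.

```python
from collections import defaultdict
from statistics import mean, stdev

# Placeholder records in the same shape as `results`; values are invented.
runs = [
    {"learning_alg": "em", "discretize_meth": "random", "accuracy": 0.50},
    {"learning_alg": "em", "discretize_meth": "random", "accuracy": 0.60},
    {"learning_alg": "cooc", "discretize_meth": "random", "accuracy": 0.40},
    {"learning_alg": "cooc", "discretize_meth": "random", "accuracy": 0.50},
]

# Collect the metric of interest per (algorithm, discretization) pair.
grouped = defaultdict(list)
for run in runs:
    grouped[(run["learning_alg"], run["discretize_meth"])].append(run["accuracy"])

# Mean and sample standard deviation across repetitions.
summary = {key: (mean(vals), stdev(vals)) for key, vals in grouped.items()}
```

With pandas, an equivalent summary should be available via `pd.DataFrame(runs).groupby(["learning_alg", "discretize_meth"])["accuracy"].agg(["mean", "std"])`.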


# TODO: recheck covariance learning in HmmOptim!
# TODO: recheck Latin cube