# MSM estimation and validation

<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons Licence" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" title='This work is licensed under a Creative Commons Attribution 4.0 International License.' align="right"/></a>

In this notebook, we will cover how to estimate a Markov state model (MSM) and do model validation;
we also show how to save and restore model and estimator objects.
For this notebook, you need to know how to do data loading/visualization as well as dimension reduction.


**Remember**:
- to run the currently highlighted cell, hold <kbd>&#x21E7; Shift</kbd> and press <kbd>&#x23ce; Enter</kbd>;
- to get help for a specific function, place the cursor within the function's brackets, hold <kbd>&#x21E7; Shift</kbd>, and press <kbd>&#x21E5; Tab</kbd>;
- you can find the full documentation at [PyEMMA.org](http://www.pyemma.org).

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import mdshare
import pyemma

## Loading MD data and repeating the clustering step

Let's load the alanine dipeptide backbone torsions and discretize them with 200 $k$-means centers...

In [None]:
pdb = mdshare.fetch('alanine-dipeptide-nowater.pdb', working_directory='data')
files = mdshare.fetch('alanine-dipeptide-*-250ns-nowater.xtc', working_directory='data')

feat = pyemma.coordinates.featurizer(pdb)
feat.add_backbone_torsions(periodic=False)

data = pyemma.coordinates.load(files, features=feat)
data_concatenated = np.concatenate(data)

cluster = pyemma.coordinates.cluster_kmeans(data, k=200, max_iter=50, stride=10)

... and plot the free energy along with the cluster centers:

In [None]:
fig, ax = plt.subplots()
pyemma.plots.plot_free_energy(*data_concatenated.T, ax=ax, legacy=False)
ax.scatter(*cluster.clustercenters.T, s=15, c='k')
# raw strings keep the LaTeX backslashes from being read as Python escape sequences
ax.set_xlabel(r'$\Phi$ / rad')
ax.set_ylabel(r'$\Psi$ / rad')
fig.tight_layout()
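
The discrete trajectories used in the rest of this notebook (`cluster.dtrajs`) result from assigning every frame to its nearest cluster center. The following is a minimal pure-Python sketch of that assignment step, assuming plain Euclidean distance; the helper `assign_to_centers` and the toy centers and frames are hypothetical and only serve as illustration, they are not part of PyEMMA or of the simulation data.

```python
import math


def assign_to_centers(frames, centers):
    """Return, for every frame, the index of the closest center (Euclidean)."""
    dtraj = []
    for frame in frames:
        distances = [math.dist(frame, center) for center in centers]
        dtraj.append(distances.index(min(distances)))
    return dtraj


# Toy (phi, psi) values in radians; illustrative numbers, not simulation data.
centers = [(-2.5, 2.8), (-1.2, -0.5), (1.0, 0.6)]
frames = [(-2.4, 2.7), (-1.0, -0.4), (-1.3, -0.7), (0.9, 0.5), (-2.6, 2.9)]

dtraj = assign_to_centers(frames, centers)
print(dtraj)
```

Each entry of `dtraj` is a microstate index; sequences of such indices are the input for the MSM estimation steps below.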

## Implied time scales and lag time selection

The first step after obtaining the discretized dynamics is finding a suitable lag time.
The systematic approach is to estimate MSMs at various lag times and observe how the implied timescales (ITSs) of these models behave.
In particular, we are looking for lag time ranges in which the implied timescales are constant.
To this aim, PyEMMA provides the `its()` function which we use to track the first four (`nits=4`) implied timescales:

In [None]:
its = pyemma.msm.its(cluster.dtrajs, lags=[1, 2, 5, 10, 20, 50], nits=4)

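If the dynamics were exactly Markovian at lag time $\tau$, the Chapman-Kolmogorov property would make the implied timescales independent of the lag time:

$$\begin{eqnarray*}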
T(n \tau) & = & (T(\tau))^n\\[0.75em]
\lambda(n \tau) & = & (\lambda(\tau))^n\\[0.75em]
\mathrm{ITS}(n \tau) & = & - \frac{n \tau}{\ln \lambda(n \tau)} = - \frac{n \tau}{\ln (\lambda(\tau))^n} = - \frac{\tau}{\ln \lambda(\tau)} = \mathrm{ITS}(\tau)
\end{eqnarray*}$$
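
The lag time invariance in the equations above can be checked numerically. The sketch below is a minimal pure-Python example, assuming a made-up two-state transition matrix `T` (not estimated from the alanine dipeptide data); for a $2 \times 2$ row-stochastic matrix the non-trivial eigenvalue is simply the trace minus one, so no eigenvalue solver is needed.

```python
import math


def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def second_eigenvalue_2x2(t):
    """For a 2x2 row-stochastic matrix the eigenvalues are 1 and trace - 1."""
    return t[0][0] + t[1][1] - 1.0


tau = 1.0  # lag time in arbitrary units
T = [[0.9, 0.1],
     [0.2, 0.8]]  # toy transition matrix T(tau), made up for illustration

its_values = []
T_n = T
for n in range(1, 6):
    lam = second_eigenvalue_2x2(T_n)            # lambda(n * tau)
    its_values.append(-n * tau / math.log(lam))  # ITS(n * tau)
    T_n = matmul(T_n, T)                         # T((n + 1) * tau)

print(its_values)
```

All entries of `its_values` agree up to floating point error. For MSMs estimated from real data, the timescales typically only become approximately constant beyond some lag time, which is what the ITS plot is used to find.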

We can pass the returned `its` object to the `pyemma.plots.plot_implied_timescales()` function:

In [None]:
pyemma.plots.plot_implied_timescales(its, units='ps')

The above plot tells us that there are three resolved processes (blue, red, green) which are largely invariant to the MSM lag time.
The fourth ITS (cyan) is smaller than the lag time (black line, grey-shaded area);
it corresponds to a process which is faster than the lag time and, thus, is not resolved.
Since the implied timescales are, like the corresponding eigenvalues, sorted in decreasing order,
we know that all other remaining processes must be even faster.
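
The criterion described above (a process counts as resolved only if its implied timescale exceeds the lag time) can be written as a small check. This is a sketch with hypothetical ITS values chosen for illustration; they are not read off the plot, and `resolved_processes` is not a PyEMMA function.

```python
def resolved_processes(timescales, lag):
    """Return the indices of all timescales that exceed the lag time."""
    return [i for i, t in enumerate(timescales) if t > lag]


# Hypothetical ITS values (in ps) at a single lag time, sorted in decreasing order.
lag = 10.0
timescales = [1500.0, 70.0, 35.0, 4.0]

resolved = resolved_processes(timescales, lag)
print(resolved)
```

Because the timescales are sorted in decreasing order, the resolved processes always form a leading block of indices.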

## Error bars for the timescales

To compute error bars, pass the `errors='bayes'` parameter:

In [None]:
its = pyemma.msm.its(cluster.dtrajs, lags=[1, 2, 5, 10, 20, 50], nits=4, errors='bayes')

pyemma.plots.plot_implied_timescales(its, units='ps')

## Effect of the discretization on the implied timescales

Let's look at the discretization's influence on the ITSs:

In [None]:
cluster_20 = pyemma.coordinates.cluster_kmeans(data, k=20, max_iter=50, stride=10)
its_20 = pyemma.msm.its(cluster_20.dtrajs, lags=[1, 2, 5, 10, 20, 50], nits=4, errors='bayes')

cluster_50 = pyemma.coordinates.cluster_kmeans(data, k=50, max_iter=50, stride=10)
its_50 = pyemma.msm.its(cluster_50.dtrajs, lags=[1, 2, 5, 10, 20, 50], nits=4, errors='bayes')

cluster_100 = pyemma.coordinates.cluster_kmeans(data, k=100, max_iter=50, stride=10)
its_100 = pyemma.msm.its(cluster_100.dtrajs, lags=[1, 2, 5, 10, 20, 50], nits=4, errors='bayes')

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(12, 6))

pyemma.plots.plot_free_energy(*data_concatenated.T, ax=axes[0, 0], cbar=False, legacy=False)
axes[0, 0].scatter(*cluster_20.clustercenters.T, s=15, c='k')
pyemma.plots.plot_implied_timescales(its_20, ax=axes[1, 0], units='ps')

pyemma.plots.plot_free_energy(*data_concatenated.T, ax=axes[0, 1], cbar=False, legacy=False)
axes[0, 1].scatter(*cluster_50.clustercenters.T, s=15, c='k')
pyemma.plots.plot_implied_timescales(its_50, ax=axes[1, 1], units='ps')

pyemma.plots.plot_free_energy(*data_concatenated.T, ax=axes[0, 2], cbar=False, legacy=False)
axes[0, 2].scatter(*cluster_100.clustercenters.T, s=15, c='k')
pyemma.plots.plot_implied_timescales(its_100, ax=axes[1, 2], units='ps')

fig.tight_layout()

## Estimating the maximum likelihood Markov model

In [None]:
msm = pyemma.msm.estimate_markov_model(cluster.dtrajs, lag=10, dt_traj='1 ps')

print('fraction of states used = ', msm.active_state_fraction)
print('fraction of counts used = ', msm.active_count_fraction)

In [None]:
msm.timescales(k=4)
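
To make the estimation step more concrete, here is a minimal pure-Python sketch of how transition counts at a given lag time turn into a transition matrix. It assumes the simplest (non-reversible) maximum likelihood estimator, i.e., row-normalized counts; to our knowledge, PyEMMA by default estimates a reversible model on the largest connected set of states, which this sketch does not attempt to reproduce. The toy trajectory and the helper names are hypothetical.

```python
def count_matrix(dtraj, lag, n_states):
    """Count transitions i -> j separated by `lag` steps (sliding window)."""
    counts = [[0] * n_states for _ in range(n_states)]
    for i, j in zip(dtraj[:-lag], dtraj[lag:]):
        counts[i][j] += 1
    return counts


def row_normalize(counts):
    """Non-reversible estimate T_ij = c_ij / sum_k c_ik.

    Assumes that every state has at least one outgoing transition count.
    """
    return [[c / sum(row) for c in row] for row in counts]


dtraj = [0, 0, 1, 1, 0, 0, 0, 1, 1, 1]  # toy discrete trajectory
C = count_matrix(dtraj, lag=1, n_states=2)
T = row_normalize(C)
print(C)
print(T)
```

States without any transition counts cannot be part of such a model, which is why the fractions of states and counts actually used are worth checking after estimation.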

## Estimating the Bayesian Markov model

In [None]:
bayesian_msm = pyemma.msm.bayesian_markov_model(cluster.dtrajs, lag=10, dt_traj='1 ps') 

In [None]:
bayesian_msm.sample_conf('timescales', k=3)
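
The idea behind such confidence intervals can be sketched without PyEMMA: draw many transition matrices that are compatible with the observed counts, compute the quantity of interest for each sample, and report percentiles. The example below is a deliberately simplified illustration for two states; it assumes independent rows with a uniform prior and a made-up count matrix, whereas PyEMMA's Bayesian estimator uses its own (reversible) sampling scheme.

```python
import math
import random

rng = random.Random(0)           # fixed seed for reproducibility
tau = 1.0                        # lag time in arbitrary units
counts = [[900, 100],
          [200, 800]]            # toy count matrix, made up for illustration


def sample_timescale():
    """Draw one 2x2 transition matrix from the row posteriors (uniform prior
    assumed) and return its implied timescale."""
    p01 = rng.betavariate(counts[0][1] + 1, counts[0][0] + 1)
    p10 = rng.betavariate(counts[1][0] + 1, counts[1][1] + 1)
    lam = 1.0 - p01 - p10        # second eigenvalue of a 2x2 stochastic matrix
    return -tau / math.log(lam)


samples = sorted(sample_timescale() for _ in range(1000))
lower, upper = samples[25], samples[974]   # approximate central 95% interval
ml_estimate = -tau / math.log(1.0 - 0.1 - 0.2)
print(lower, ml_estimate, upper)
```

The interval brackets the timescale obtained from the row-normalized counts and becomes narrower as the number of counts grows.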

## The Chapman-Kolmogorov test

To see whether our model satisfies Markovianity, we perform (and visualize) a Chapman-Kolmogorov (CK) test.
Since we aim at modeling the dynamics between metastable states rather than between microstates, this will be conducted in the space of metastable states.
The latter are identified automatically using PCCA++ (which is explained later).
We usually choose the number of metastable states according to the implied timescales plot by identifying a gap between the ITS.

In [None]:
pyemma.plots.plot_cktest(msm.cktest(4), units='ps');

In [None]:
pyemma.plots.plot_cktest(bayesian_msm.cktest(4), units='ps');
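
The comparison behind a CK test can be sketched in a few lines: the model estimated at lag time $\tau$ is propagated to $k \tau$ by taking its $k$-th power, and the result is compared with a model re-estimated directly at lag time $k \tau$. The following pure-Python example is a simplified illustration that works directly on two microstates of a simulated toy chain (all numbers are made up); the PyEMMA test above operates in the space of metastable states instead.

```python
import random


def simulate(T, n_steps, rng):
    """Generate a discrete two-state trajectory from transition matrix T."""
    state, traj = 0, [0]
    for _ in range(n_steps):
        state = 0 if rng.random() < T[state][0] else 1
        traj.append(state)
    return traj


def estimate(dtraj, lag):
    """Row-normalized transition counts at the given lag (two states)."""
    counts = [[0, 0], [0, 0]]
    for i, j in zip(dtraj[:-lag], dtraj[lag:]):
        counts[i][j] += 1
    return [[c / sum(row) for c in row] for row in counts]


def matmul(a, b):
    """Multiply two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


rng = random.Random(1)                  # fixed seed for reproducibility
T_true = [[0.9, 0.1], [0.2, 0.8]]       # toy model, made up for illustration
dtraj = simulate(T_true, 100_000, rng)

T_tau = estimate(dtraj, lag=1)
prediction = T_tau
max_deviation = 0.0
for k in range(2, 6):
    prediction = matmul(prediction, T_tau)   # model prediction T(tau)^k
    reestimate = estimate(dtraj, lag=k)      # direct estimate of T(k * tau)
    deviation = max(abs(prediction[i][j] - reestimate[i][j])
                    for i in range(2) for j in range(2))
    max_deviation = max(max_deviation, deviation)

print(max_deviation)
```

Since the toy trajectory is generated by a true Markov chain, prediction and re-estimation differ only by statistical noise; for a poorly chosen lag time or discretization, the deviation would grow with $k$.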

## Persisting and restoring estimators

In [None]:
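# overwrite=True lets the cell be re-run when the model name already exists in the file
cluster_50.save('nb3.pyemma', model_name='kmeans_k50', overwrite=True)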

In [None]:
msm.save('nb3.pyemma', model_name='msm', overwrite=True)

In [None]:
cluster_restored = pyemma.load('nb3.pyemma', model_name='kmeans_k50')

In [None]:
msm_restored = pyemma.load('nb3.pyemma', model_name='msm')

In [None]:
msm_restored.timescales(k=3)

In [None]:
pyemma.list_models('nb3.pyemma').keys()

## Hands-on

#### Exercise 1

Load the heavy atom distances into memory, perform PCA and TICA (`lag=3`) with `dim=2`,
then discretize with $100$ $k$-means centers and a stride of $10$. Compare the two discretizations by generating implied timescale plots for both of them.

In [None]:
feat =  #FIXME
feat. #FIXME
data =  #FIXME

pca = pyemma.coordinates.pca(data, dim=2)
tica = #FIXME

pca_concatenated = np.concatenate(pca.get_output())
tica_concatenated = #FIXME

cls_pca = pyemma.coordinates.cluster_kmeans(pca, k=100, max_iter=50, stride=10)
cls_tica = #FIXME

its_pca = pyemma.msm.its(
    cls_pca.dtrajs, lags=[1, 2, 5, 10, 20, 50], nits=4, errors='bayes')
its_tica = #FIXME

###### Solution

In [None]:
feat = pyemma.coordinates.featurizer(pdb)
pairs = feat.pairs(feat.select_Heavy())
feat.add_distances(pairs, periodic=False)
data = pyemma.coordinates.load(files, features=feat)

pca = pyemma.coordinates.pca(data, dim=2)
tica = pyemma.coordinates.tica(data, lag=3, dim=2)

pca_concatenated = np.concatenate(pca.get_output())
tica_concatenated = np.concatenate(tica.get_output())

cls_pca = pyemma.coordinates.cluster_kmeans(pca, k=100, max_iter=50, stride=10)
cls_tica = pyemma.coordinates.cluster_kmeans(tica, k=100, max_iter=50, stride=10)

its_pca = pyemma.msm.its(
    cls_pca.dtrajs, lags=[1, 2, 5, 10, 20, 50], nits=4, errors='bayes')
its_tica = pyemma.msm.its(
    cls_tica.dtrajs, lags=[1, 2, 5, 10, 20, 50], nits=4, errors='bayes')

Let's visualize the ITS convergence for both projections:

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(12, 6))
pyemma.plots.plot_feature_histograms(pca_concatenated, ax=axes[0, 0])
pyemma.plots.plot_feature_histograms(tica_concatenated, ax=axes[1, 0])
axes[0, 0].set_title('PCA')
axes[1, 0].set_title('TICA')
pyemma.plots.plot_density(*pca_concatenated.T, ax=axes[0, 1], cbar=False, alpha=0.1)
axes[0, 1].scatter(*cls_pca.clustercenters.T, s=15, c='C1')
axes[0, 1].set_xlabel('PC 1')
axes[0, 1].set_ylabel('PC 2')
pyemma.plots.plot_density(*tica_concatenated.T, ax=axes[1, 1], cbar=False, alpha=0.1)
axes[1, 1].scatter(*cls_tica.clustercenters.T, s=15, c='C1')
axes[1, 1].set_xlabel('IC 1')
axes[1, 1].set_ylabel('IC 2')
pyemma.plots.plot_implied_timescales(its_pca, ax=axes[0, 2], units='ps')
pyemma.plots.plot_implied_timescales(its_tica, ax=axes[1, 2], units='ps')
axes[0, 2].set_ylim(1, 2000)
axes[1, 2].set_ylim(1, 2000)
fig.tight_layout()

Although PCA yields a projection with some defined basins,
the ITS plot shows that only one "slow" process is resolved, and even this process is more than one order of magnitude too fast.

TICA does find three slow processes which agree (in terms of the implied timescales) with the backbone torsions example above.

We conclude that this PCA projection is not suitable to resolve the slow dynamics of alanine dipeptide and we will continue to estimate/validate the TICA-based projection.

#### Exercise 2

Estimate a Bayesian MSM at lag time $10$ ps and perform/show a CK test for four metastable states.

In [None]:
bayesian_msm = pyemma.msm.bayesian_markov_model(cls_tica.dtrajs, lag=10, dt_traj='1 ps')
pyemma.plots. #FIXME

###### Solution

In [None]:
bayesian_msm = pyemma.msm.bayesian_markov_model(cls_tica.dtrajs, lag=10, dt_traj='1 ps')
pyemma.plots.plot_cktest(bayesian_msm.cktest(4), units='ps');

We again see a good agreement between model prediction and re-estimation.

## Wrapping up
In this notebook, we have learned how to estimate a regular or Bayesian MSM from discretized molecular simulation data with `pyemma` and how to perform basic model validation.
In detail, we have selected a suitable lag time by using
- `pyemma.msm.its()` to obtain an implied timescale object and
- `pyemma.plots.plot_implied_timescales()` to visualize the convergence of the implied timescales.

We have then used
- `pyemma.msm.estimate_markov_model()` to estimate a regular MSM,
- `pyemma.msm.bayesian_markov_model()` to estimate a Bayesian MSM,
- the `timescales()` method of an estimated MSM object to access its implied timescales,
- the `cktest()` method of an estimated MSM object to perform a Chapman-Kolmogorov test, and
- `pyemma.plots.plot_cktest()` to visualize the latter.