# 03 - MSM estimation and validation
In this notebook, we will cover how to estimate a Markov state model (MSM) and how to validate it.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import mdshare
import pyemma

## Case 1: preprocessed, two-dimensional data (toy model)
We load the two-dimensional trajectory from an archive using `numpy`, directly discretize the full space using $k$-means clustering, and visualize the marginal and joint distributions of both components as well as the cluster centers:

In [None]:
file = mdshare.fetch('hmm-doublewell-2d-100k.npz', working_directory='data')
with np.load(file) as fh:
    data = fh['trajectory']

cluster = pyemma.coordinates.cluster_kmeans(data, k=50, max_iter=50)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
pyemma.plots.plot_feature_histograms(data, feature_labels=['$x$', '$y$'], ax=axes[0])
axes[1].scatter(*data.T, s=1, alpha=0.3)
axes[1].scatter(*cluster.clustercenters.T, s=15)
axes[1].set_xlabel('$x$')
axes[1].set_ylabel('$y$')
fig.tight_layout()

The first step after obtaining the discretized dynamics is finding a suitable lag time. The systematic approach is to estimate MSMs at various lag times and observe how the implied timescales (ITSs) of these models behave. To this aim, `pyemma` provides the `its()` function which we use to track the first three implied timescales:

In [None]:
its = pyemma.msm.its(cluster.dtrajs, lags=[1, 2, 3, 5, 7, 10], nits=3, errors='bayes')
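
Each model in such a lag time scan is built from transition counts at the respective lag time. The following is a minimal sketch of that idea, assuming a plain sliding-window count followed by row normalization; PyEMMA's default estimator additionally enforces reversibility, so this is a simplification and not the library's implementation. The trajectory `toy_dtraj` and the helper names are hypothetical.

```python
def count_matrix(dtraj, lag, n_states):
    """Count transitions i -> j separated by `lag` steps (sliding window)."""
    counts = [[0] * n_states for _ in range(n_states)]
    for start, end in zip(dtraj[:-lag], dtraj[lag:]):
        counts[start][end] += 1
    return counts


def row_normalize(counts):
    """Turn transition counts into row-stochastic transition probabilities."""
    matrix = []
    for row in counts:
        total = sum(row)
        # A state without outgoing counts has no estimate; keep zeros there.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Hypothetical discrete trajectory over two states (0 and 1).
toy_dtraj = [0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0]

toy_counts = count_matrix(toy_dtraj, lag=1, n_states=2)
toy_T = row_normalize(toy_counts)
print(toy_counts)
print(toy_T)
```

Increasing `lag` pairs frames that are further apart, which is what changes from one model to the next in the scan.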

We can pass the returned `its` object to the `pyemma.plots.plot_implied_timescales()` function:

In [None]:
pyemma.plots.plot_implied_timescales(its, ylog=False);

The above plot tells us that there is one resolved process with an ITS of approximately $8.5$ steps (blue) which is largely invariant to the lag time at which the MSM has been estimated. The other two ITSs (green, red) are smaller than the lag time (black line, grey-shaded area); they correspond to processes which are faster than the lag time and, thus, are not resolved.

As MSMs tend to underestimate the true ITSs, we are looking for a converged maximum in the ITS plot. In our case, any lag time before the slow process (blue line) crosses the lag time threshold (black line) would work and, to maximize the kinetic resolution, we choose a lag time of $1$ step.
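
An implied timescale is computed from an eigenvalue $\lambda_i(\tau)$ of the transition matrix estimated at lag time $\tau$ as $t_i = -\tau / \ln|\lambda_i(\tau)|$. As a small illustration of why we look for a plateau, the sketch below uses a hypothetical two-state transition matrix `T1` and shows that, for an exactly Markovian process, the ITS does not depend on the lag time. The helper functions are illustrative and not part of `pyemma`.

```python
import math


def second_eigenvalue(T):
    """Second eigenvalue of a 2x2 row-stochastic matrix (the first is 1)."""
    # The trace equals the sum of eigenvalues, and one eigenvalue is exactly 1.
    return T[0][0] + T[1][1] - 1.0


def matmul2(A, B):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def implied_timescale(T, lag):
    """ITS in steps: t = -lag / ln|lambda_2|."""
    return -lag / math.log(abs(second_eigenvalue(T)))


# Hypothetical two-state transition matrix at lag time 1 step.
T1 = [[0.95, 0.05],
      [0.07, 0.93]]

# For an exactly Markovian process, T(k * tau) = T(tau)^k.
T_lag = T1
its_by_lag = {}
for lag in range(1, 6):
    its_by_lag[lag] = implied_timescale(T_lag, lag)
    T_lag = matmul2(T_lag, T1)

for lag, t in its_by_lag.items():
    print(lag, round(t, 3))
```

All five lag times give the same ITS (roughly $7.8$ steps for this matrix). With real, discretized data, the ITS typically rises with the lag time first and then levels off, and that plateau is what we look for.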

For a single process, we can assume that there are two metastable states between which the process occurs.

To see whether our model satisfies Markovianity, we perform (and visualize) a Chapman-Kolmogorov (CK) test for two metastable states.

In [None]:
msm = pyemma.msm.estimate_markov_model(cluster.dtrajs, lag=1)
pyemma.plots.plot_cktest(msm.cktest(2));

We can see a perfect agreement between models estimated at higher lag times and predictions of the model at lag time $1$ step.

Thus, we have estimated an MSM at lag time $1$ step and performed basic model validation.
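
The idea behind the CK test is to compare the prediction $T(\tau)^k$ of the model estimated at lag time $\tau$ with a model re-estimated at lag time $k\tau$. The sketch below illustrates this comparison on a simulated two-state chain, assuming simple row-normalized counts as the estimator. The matrix `true_T` and all helper functions are hypothetical; `pyemma` compares transition probabilities between metastable sets, whereas here the two states themselves play that role.

```python
import random


def simulate(T, n_steps, seed=0):
    """Sample a discrete trajectory from the 2x2 transition matrix T."""
    rng = random.Random(seed)
    state, traj = 0, []
    for _ in range(n_steps):
        traj.append(state)
        # Stay with probability T[state][state], otherwise switch.
        if rng.random() >= T[state][state]:
            state = 1 - state
    return traj


def estimate(dtraj, lag):
    """Row-normalized sliding-window transition counts at the given lag."""
    counts = [[0, 0], [0, 0]]
    for i, j in zip(dtraj[:-lag], dtraj[lag:]):
        counts[i][j] += 1
    return [[c / sum(row) for c in row] for row in counts]


def matmul2(A, B):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


true_T = [[0.9, 0.1], [0.2, 0.8]]  # hypothetical model
dtraj = simulate(true_T, n_steps=50000)

T_tau = estimate(dtraj, lag=1)
prediction = T_tau
ck_errors = []
for k in range(2, 6):
    prediction = matmul2(prediction, T_tau)  # model prediction T(tau)^k
    reestimate = estimate(dtraj, lag=k)      # re-estimation T(k * tau)
    ck_errors.append(max(abs(prediction[i][j] - reestimate[i][j])
                         for i in range(2) for j in range(2)))
print(ck_errors)
```

Since the simulated chain is Markovian by construction, the deviations in `ck_errors` only reflect statistical noise.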

## Case 2: low-dimensional molecular dynamics data (alanine dipeptide)
We fetch the alanine dipeptide data set, load the backbone torsions into memory, directly discretize the full space using $k$-means clustering, visualize the marginal and joint distributions of both components as well as the cluster centers, and show the ITS convergence to help select a suitable lag time:

In [None]:
pdb = mdshare.fetch('alanine-dipeptide-nowater.pdb', working_directory='data')
files = mdshare.fetch('alanine-dipeptide-*-250ns-nowater.dcd', working_directory='data')

feat = pyemma.coordinates.featurizer(pdb)
feat.add_backbone_torsions()
data = pyemma.coordinates.load(files, features=feat)

cluster = pyemma.coordinates.cluster_kmeans(data, k=200, max_iter=50, stride=10)
its = pyemma.msm.its(cluster.dtrajs, lags=[1, 2, 5, 10, 20, 50], nits=4, errors='bayes')

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
# Raw strings keep the LaTeX backslashes from being read as escape sequences.
pyemma.plots.plot_feature_histograms(np.concatenate(data), feature_labels=[r'$\Phi$', r'$\Psi$'], ax=axes[0])
axes[1].scatter(*np.concatenate(data).T, s=1, alpha=0.3)
axes[1].scatter(*cluster.clustercenters.T, s=15)
axes[1].set_xlabel(r'$\Phi$')
axes[1].set_ylabel(r'$\Psi$')
pyemma.plots.plot_implied_timescales(its, ax=axes[2], units='ps')
fig.tight_layout()

We observe three resolved processes with flat ITSs for a lag time of approximately $10$ ps.

Please note, though, that this ITS convergence analysis is based on the assumption that $200$ $k$-means centers are sufficient to discretize the dynamics. In order to study the influence of the clustering on the ITS convergence, we repeat the clustering and ITS convergence analysis for various numbers of cluster centers:

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(12, 6))
for i, k in enumerate([20, 50, 100]):
    cls = pyemma.coordinates.cluster_kmeans(data, k=k, max_iter=50, stride=10)
    axes[0, i].scatter(*np.concatenate(data).T, s=1, alpha=0.3)
    axes[0, i].scatter(*cls.clustercenters.T, s=15)
    axes[0, i].set_xlabel(r'$\Phi$')
    axes[0, i].set_ylabel(r'$\Psi$')
    axes[0, i].set_title('k = %d centers' % k)
    pyemma.plots.plot_implied_timescales(
        pyemma.msm.its(cls.dtrajs, lags=[1, 2, 5, 10, 20, 50], nits=4, errors='bayes'),
        ax=axes[1, i], units='ps')
    axes[1, i].set_ylim(1, 2000)
fig.tight_layout()

We can see from this analysis that the ITS curves indeed converge towards the $200$ centers case and we can continue with estimating/validating an MSM.

We estimate an MSM at lag time $10$ ps and, given that we have three slow processes, perform a CK test for four metastable states:

In [None]:
msm = pyemma.msm.estimate_markov_model(cluster.dtrajs, lag=10, dt_traj='1 ps')
pyemma.plots.plot_cktest(msm.cktest(4));

The model prediction and re-estimation are in quite good agreement but we do see some small deviations in the first row.

To obtain error bars for the model prediction, we estimate a Bayesian MSM under the same conditions as the regular MSM and repeat the CK test for the Bayesian model:

In [None]:
bayesian_msm = pyemma.msm.bayesian_markov_model(cluster.dtrajs, lag=10, dt_traj='1 ps', conf=0.95)
pyemma.plots.plot_cktest(bayesian_msm.cktest(4));

Thus, we observe that the deviations are within a $95\%$ confidence interval.
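
To build some intuition for where such confidence intervals come from, the following sketch samples the posterior of a single transition probability in a two-state model, assuming a uniform prior so that the posterior is a Beta distribution. This is only an analogy: `pyemma` samples full transition matrices, and the counts `c_stay` and `c_leave` as well as the helper functions are hypothetical.

```python
import random


def sample_leave_probability(c_stay, c_leave, n_samples=2000, seed=0):
    """Posterior samples of p(leave) for one state of a two-state model.

    With a uniform prior, the posterior of a binomial probability is
    Beta(c_leave + 1, c_stay + 1).
    """
    rng = random.Random(seed)
    return [rng.betavariate(c_leave + 1, c_stay + 1) for _ in range(n_samples)]


def credible_interval(samples, conf=0.95):
    """Central interval that contains a fraction `conf` of the samples."""
    ordered = sorted(samples)
    tail = (1.0 - conf) / 2.0
    lower = ordered[int(tail * len(ordered))]
    upper = ordered[int((1.0 - tail) * len(ordered)) - 1]
    return lower, upper


# Hypothetical transition counts out of one state.
c_stay, c_leave = 180, 20
samples = sample_leave_probability(c_stay, c_leave)
lower, upper = credible_interval(samples)
mle = c_leave / (c_stay + c_leave)
print(mle, lower, upper)
```

The interval narrows as the counts grow, which is why well-sampled transitions come with tighter error bars.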

**Exercise**: Load the heavy atoms' distances into memory, perform PCA and TICA (`lag=3`) with `dim=2`, discretize with $100$ $k$-means centers and a stride of $10$, and show the ITS convergence for both projections.

In [None]:
feat = pyemma.coordinates.featurizer(pdb)
feat.add_distances(feat.select_Heavy())
data = pyemma.coordinates.load(files, features=feat)

pca = pyemma.coordinates.pca(data, dim=2)
tica = pyemma.coordinates.tica(data, lag=3, dim=2)

cls_pca = pyemma.coordinates.cluster_kmeans(pca, k=100, max_iter=50, stride=10)
cls_tica = pyemma.coordinates.cluster_kmeans(tica, k=100, max_iter=50, stride=10)

its_pca = pyemma.msm.its(cls_pca.dtrajs, lags=[1, 2, 5, 10, 20, 50], nits=4, errors='bayes')
its_tica = pyemma.msm.its(cls_tica.dtrajs, lags=[1, 2, 5, 10, 20, 50], nits=4, errors='bayes')

fig, axes = plt.subplots(2, 3, figsize=(12, 6))
pyemma.plots.plot_feature_histograms(np.concatenate(pca.get_output()), ax=axes[0, 0])
pyemma.plots.plot_feature_histograms(np.concatenate(tica.get_output()), ax=axes[1, 0])
axes[0, 0].set_title('PCA')
axes[1, 0].set_title('TICA')
axes[0, 1].scatter(*np.concatenate(pca.get_output()).T, s=1, alpha=0.3)
axes[0, 1].scatter(*cls_pca.clustercenters.T, s=15)
axes[0, 1].set_xlabel('PC 1')
axes[0, 1].set_ylabel('PC 2')
axes[1, 1].scatter(*np.concatenate(tica.get_output()).T, s=1, alpha=0.3)
axes[1, 1].scatter(*cls_tica.clustercenters.T, s=15)
axes[1, 1].set_xlabel('IC 1')
axes[1, 1].set_ylabel('IC 2')
pyemma.plots.plot_implied_timescales(its_pca, ax=axes[0, 2], units='ps')
pyemma.plots.plot_implied_timescales(its_tica, ax=axes[1, 2], units='ps')
axes[0, 2].set_ylim(1, 2000)
axes[1, 2].set_ylim(1, 2000)
fig.tight_layout()

Although PCA yields a projection with some defined basins, the ITS plot shows that only one "slow" process is resolved, and its timescale is more than one order of magnitude too fast compared to the backbone torsions example above.

TICA does find three slow processes which agree (in terms of the implied timescales) with the backbone torsions example above.

We conclude that the PCA projection is not suitable to resolve the slow dynamics of alanine dipeptide and we will continue with estimating/validating an MSM based on the TICA projection.
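
One way to understand this difference: PCA ranks directions by their variance, whereas TICA ranks them by their autocorrelation at the chosen lag time. The sketch below, using two hypothetical one-dimensional signals (not the alanine dipeptide data), shows that these two criteria can disagree. All names are illustrative.

```python
import random


def variance(x):
    """Population variance of a sequence of numbers."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)


def autocorrelation(x, lag):
    """Normalized time-lagged autocorrelation of the mean-free signal."""
    mean = sum(x) / len(x)
    centered = [v - mean for v in x]
    cov = sum(a * b for a, b in zip(centered[:-lag], centered[lag:]))
    return cov / sum(v * v for v in centered)


rng = random.Random(0)
n_frames = 5000

# Hypothetical fast coordinate: large-amplitude uncorrelated noise.
fast = [rng.gauss(0.0, 3.0) for _ in range(n_frames)]

# Hypothetical slow coordinate: small-amplitude two-state switching.
slow, level = [], 0.5
for _ in range(n_frames):
    if rng.random() < 0.01:
        level = -level
    slow.append(level)

print(variance(fast), variance(slow))
print(autocorrelation(fast, lag=3), autocorrelation(slow, lag=3))
```

The noisy coordinate has by far the larger variance, but only the switching coordinate remains correlated over the lag time. A variance-based ranking would therefore favor the coordinate that carries no slow dynamics.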

**Exercise**: Estimate a Bayesian MSM at lag time $10$ ps and perform/show a CK test for four metastable states.

In [None]:
bayesian_msm = pyemma.msm.bayesian_markov_model(cls_tica.dtrajs, lag=10, dt_traj='1 ps')
pyemma.plots.plot_cktest(bayesian_msm.cktest(4));

We again see a good agreement between model prediction and re-estimation.

## Case 3: another molecular dynamics data set (pentapeptide)

**Exercise**: Fetch the pentapeptide data set, load the cossin transformations of the backbone and $\chi_1$ sidechain torsions into memory, perform TICA with `lag=20` and `var_cutoff=0.9`, discretize with $250$ $k$-means centers using a stride of $10$, visualize the marginal distributions, and show the ITS convergence:

In [None]:
pdb = mdshare.fetch('pentapeptide-impl-solv.pdb', working_directory='data')
files = mdshare.fetch('pentapeptide-*-500ns-impl-solv.xtc', working_directory='data')

feat = pyemma.coordinates.featurizer(pdb)
feat.add_backbone_torsions(cossin=True)
feat.add_sidechain_torsions(which='chi1', cossin=True)
data = pyemma.coordinates.load(files, features=feat)

tica = pyemma.coordinates.tica(data, lag=20, var_cutoff=0.9)
cluster = pyemma.coordinates.cluster_kmeans(tica, k=250, max_iter=50, stride=10)
its = pyemma.msm.its(cluster.dtrajs, lags=30, nits=10, errors='bayes')

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
pyemma.plots.plot_feature_histograms(np.concatenate(tica.get_output()), ax=axes[0])
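# dt is the trajectory time step in the given units; it matches dt_traj='0.01 ns' used for the MSM in this section.
pyemma.plots.plot_implied_timescales(its, ax=axes[1], dt=0.01, units='ns')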
fig.tight_layout()

Here, the picture is not as clear as in the above examples. We do observe ITS plateaus but the plot does not hint at the number of important processes and, thus, metastable states.

In this case, it is recommended to take a closer look at the timescale spectrum of an estimated MSM.

**Exercise**: Estimate an MSM at `lag=12` (steps) with `dt_traj='0.01 ns'`.

In [None]:
msm = pyemma.msm.estimate_markov_model(cluster.dtrajs, lag=12, dt_traj='0.01 ns')

We then plot the ITS values (in decreasing order) and look for spectral gaps, i.e., large relative differences between two neighboring ITS values:

In [None]:
timescales = msm.timescales(k=10)

plt.plot(timescales, '-o')
plt.xlabel('timescale index')
plt.ylabel('timescale / ns');

We observe two spectral gaps: one after the first process and another one after the third.

This observation hints that we should consider either two or four metastable states for the CK test.
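
The search for spectral gaps can also be sketched in code by computing the ratio of neighboring timescales. The values in `example_timescales` are hypothetical and only chosen to mimic the situation described above; the threshold `min_ratio=2.0` is an arbitrary assumption, not a recommended value.

```python
def spectral_gaps(timescales, min_ratio=2.0):
    """Return (index, ratio) for neighboring timescales separated by a gap.

    `timescales` must be sorted in decreasing order. Index i refers to the
    gap between process i and process i + 1 (zero-based).
    """
    gaps = []
    for i in range(len(timescales) - 1):
        ratio = timescales[i] / timescales[i + 1]
        if ratio >= min_ratio:
            gaps.append((i, ratio))
    return gaps


# Hypothetical timescales in ns, not taken from the pentapeptide model.
example_timescales = [12.0, 3.1, 2.6, 0.9, 0.8, 0.7]

gaps = spectral_gaps(example_timescales)
# A gap after i + 1 slow processes suggests i + 2 metastable states.
candidate_states = [i + 2 for i, _ in gaps]
print(candidate_states)
```

For these example values, the gaps follow the first and the third process, which corresponds to two or four metastable states.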

**Exercise**: Estimate a Bayesian MSM at `lag=12` (steps) and perform/show a CK test with two metastable states.

In [None]:
bayesian_msm = pyemma.msm.bayesian_markov_model(cluster.dtrajs, lag=12)
pyemma.plots.plot_cktest(bayesian_msm.cktest(2));

**Exercise**: Estimate a Bayesian MSM at `lag=12` (steps) and perform/show a CK test with four metastable states.

In [None]:
bayesian_msm = pyemma.msm.bayesian_markov_model(cluster.dtrajs, lag=12)
pyemma.plots.plot_cktest(bayesian_msm.cktest(4));

We observe that, in both cases, the model passes the CK test, i.e., it behaves Markovian at this lag time, and is suitable for further analysis.

## Wrapping up
In this notebook, we have learned how to estimate a regular or Bayesian MSM from discretized molecular simulation data with `pyemma` and how to perform basic model validation. In detail, we have selected a suitable lag time by using
- `pyemma.msm.its()` to obtain an implied timescale object and
- `pyemma.plots.plot_implied_timescales()` to visualize the convergence of the implied timescales.

We have then used
- `pyemma.msm.estimate_markov_model()` to estimate a regular MSM,
- `pyemma.msm.bayesian_markov_model()` to estimate a Bayesian MSM,
- the `timescales()` method of an estimated MSM object to access its implied timescales,
- the `cktest()` method of an estimated MSM object to perform a Chapman-Kolmogorov test, and
- `pyemma.plots.plot_cktest()` to visualize the latter.