# VAMP score based feature selection

The first step in building a Markov state model is to choose a good set of collective variables, in which the state space is later discretized into disjoint sets. We have seen in earlier tutorial steps that this choice has a huge impact on the ability to build a good model. To a certain extent, hidden Markov state models have been shown to circumvent problems arising from a poor set of coordinates, but making a good choice a priori is preferable.

In this tutorial we show how to benchmark a set of coordinates at the beginning of the pipeline, instead of going through discretization, kinetic model building, and validation only to find out that we made a bad choice in the first place.

The VAMP score helps us judge whether a set of coordinates contains more or less kinetic information. To estimate the score, PyEMMA needs to estimate three covariance matrices from your data, that is, from the input coordinates $X$:

$ C_{00}, C_{01}, C_{11}$

where the index 0 refers to the instantaneous data $X_t$ and the index 1 to the time-shifted data $X_{t+\tau}$, so that $C_{01}$ is the time-lagged cross-covariance.
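As a rough illustration of what these matrices are, the following NumPy sketch estimates them from a single trajectory and evaluates a VAMP-2 score as the squared Frobenius norm of the whitened, time-lagged covariance matrix. The helper names (`lagged_covariances`, `inv_sqrt`, `vamp2_score`) and the synthetic test signals are assumptions made for this example and are not part of PyEMMA. PyEMMA's estimator may follow different conventions (for instance in how the constant function is counted), so the numbers are not expected to match `vamp.score()` exactly.

```python
import numpy as np


def lagged_covariances(traj, lag):
    """Estimate C00, C01 and C11 from one trajectory of shape (T, d)."""
    x0 = traj[:-lag]  # instantaneous data X_t
    x1 = traj[lag:]   # time-shifted data X_{t+lag}
    x0 = x0 - x0.mean(axis=0)
    x1 = x1 - x1.mean(axis=0)
    n = len(x0)
    return x0.T @ x0 / n, x0.T @ x1 / n, x1.T @ x1 / n


def inv_sqrt(c, eps=1e-10):
    """Inverse matrix square root of a symmetric matrix, dropping tiny eigenvalues."""
    w, v = np.linalg.eigh(c)
    keep = w > eps
    return (v[:, keep] * w[keep] ** -0.5) @ v[:, keep].T


def vamp2_score(traj, lag):
    """VAMP-2 score: squared Frobenius norm of C00^(-1/2) C01 C11^(-1/2)."""
    traj = np.asarray(traj, dtype=float)
    if traj.ndim == 1:
        traj = traj[:, None]
    c00, c01, c11 = lagged_covariances(traj, lag)
    whitened = inv_sqrt(c00) @ c01 @ inv_sqrt(c11)
    return float(np.sum(whitened ** 2))


# Synthetic test signals (assumption): a slowly decorrelating AR(1) process
# and uncorrelated white noise.
rng = np.random.default_rng(0)
n_steps, lag = 20000, 10
noise = rng.normal(size=n_steps)
slow = np.zeros(n_steps)
for t in range(1, n_steps):
    slow[t] = 0.99 * slow[t - 1] + noise[t]
fast = rng.normal(size=n_steps)

score_slow = vamp2_score(slow, lag)
score_fast = vamp2_score(fast, lag)
print('slow coordinate:', score_slow)
print('white noise:    ', score_fast)
```

In this sketch the slowly decorrelating coordinate should receive a clearly higher score than the white noise, which is the behavior we exploit below to rank input features.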

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import mdshare
import pyemma

## Case 1: preprocessed, two-dimensional data (toy model)
We load the two-dimensional trajectory as well as the true discrete trajectory from an archive using `numpy`...

In [None]:
file = mdshare.fetch('hmm-doublewell-2d-100k.npz', working_directory='data')
with np.load(file) as fh:
    data = fh['trajectory']
    good_dtraj = fh['discrete_trajectory']

Let's select only the x coordinate to discretize the state space. We intuitively know that this is a bad modeling choice, and this should be reflected in the VAMP score computed on the input features...

In [None]:
x = data[:, 0]
y = data[:, 1]

pyemma.plots.plot_free_energy(x, y)

In [None]:
vamp = pyemma.coordinates.vamp(x)
s_x = vamp.score(x)
s_x

In [None]:
vamp = pyemma.coordinates.vamp(y)
s_y = vamp.score(y)
s_y

In [None]:
vamp = pyemma.coordinates.vamp(data)
s_all = vamp.score(data)
s_all

In [None]:
print('contribution of x:', s_all - s_y)

We see that there is almost no kinetic variance along the x-axis, because the metastable regions are separated along the y-axis. The VAMP-2 score reflects how much kinetic variance is contained within a coordinate. Adding x to y therefore does not increase the score much.
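The same effect can be reproduced without PyEMMA on synthetic data. The sketch below is an assumption-laden stand-in for the double-well data set: the y coordinate hops slowly between two wells, while x is pure white noise. The helper `vamp2` is hypothetical and evaluates the score through the trace formula $\mathrm{tr}(C_{00}^{-1} C_{01} C_{11}^{-1} C_{01}^\top)$ on mean-free data, so its values need not coincide with those returned by PyEMMA.

```python
import numpy as np


def vamp2(data, lag=10):
    """VAMP-2 score via trace(C00^-1 C01 C11^-1 C01^T) on mean-free data."""
    data = np.asarray(data, dtype=float).reshape(len(data), -1)
    x0 = data[:-lag] - data[:-lag].mean(axis=0)
    x1 = data[lag:] - data[lag:].mean(axis=0)
    # The common 1/n normalization cancels in the trace formula.
    c00, c01, c11 = x0.T @ x0, x0.T @ x1, x1.T @ x1
    return float(np.trace(np.linalg.solve(c00, c01) @ np.linalg.solve(c11, c01.T)))


# Synthetic stand-in data (assumption): slow two-state hopping along y,
# uncorrelated noise along x.
rng = np.random.default_rng(1)
n_steps = 50000
hops = rng.random(n_steps) < 0.01          # hop with probability 0.01 per step
state = np.cumsum(hops) % 2                # alternate between wells 0 and 1
toy_y = (2.0 * state - 1.0) + 0.3 * rng.normal(size=n_steps)
toy_x = rng.normal(size=n_steps)
toy_xy = np.column_stack([toy_x, toy_y])

toy_s_x, toy_s_y, toy_s_all = vamp2(toy_x), vamp2(toy_y), vamp2(toy_xy)
print('x only: ', toy_s_x)
print('y only: ', toy_s_y)
print('x and y:', toy_s_all)
print('contribution of x:', toy_s_all - toy_s_y)
```

Here the score of the combined data should be dominated by the slow coordinate, and the difference between the combined score and the y-only score should stay close to zero.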

## Case 2: low-dimensional molecular dynamics data (alanine dipeptide)
We fetch the alanine dipeptide data set. The features compared below (minimum RMSD and backbone torsions) are computed by PyEMMA readers:

In [None]:
pdb = mdshare.fetch('alanine-dipeptide-nowater.pdb', working_directory='data')
files = mdshare.fetch('alanine-dipeptide-*-250ns-nowater.dcd', working_directory='data')
print(pdb)
print(files)

We create a reader whose only feature is the minimum RMSD to a reference structure. This reader is passed on to the VAMP estimator, which computes the covariance matrices for this feature. We set the lag time for $C_{01}$ to 10 steps.

In [None]:
lag = 10

In [None]:
reader_rmsd = pyemma.coordinates.source(files, top=pdb)
reader_rmsd.featurizer.add_minrmsd_to_ref(pdb)

vamp_minRMSD = pyemma.coordinates.vamp(reader_rmsd, lag=lag)
vamp_minRMSD.score()

Now let us compute the score for the backbone torsion angles.

In [None]:
reader_bt = pyemma.coordinates.source(files, top=pdb)
reader_bt.featurizer.add_backbone_torsions()

vamp_bt = pyemma.coordinates.vamp(reader_bt, lag=lag)
vamp_bt.score()

Now let's check whether the higher VAMP score of the backbone torsion angles is related to the timescales we obtain when building a kinetic model. First, we look at the feature histogram to get an idea of how to discretize it.

In [None]:
reader_rmsd = pyemma.coordinates.source(files, top=pdb)
reader_rmsd.featurizer.add_minrmsd_to_ref(pdb)
data = reader_rmsd.get_output()

# plot_feature_histograms returns a (figure, axes) pair
fig, ax = pyemma.plots.plot_feature_histograms(np.concatenate(data))
ax.set_title('Minimum RMSD to reference');

Now we discretize this one-dimensional space into ten states.

In [None]:
cl_rmsd = pyemma.coordinates.cluster_kmeans(data, k=10, max_iter=50)

... and have a look at the resolved timescales.

In [None]:
its_rmsd = pyemma.msm.its(cl_rmsd.dtrajs, lags=60)
ax = pyemma.plots.plot_implied_timescales(its_rmsd, nits=9)
ax.set_title('Implied timescales for minRMSD, 10 kmeans clusters.');

By inspecting the implied timescales, we see only one process, which lies only slightly above the actual lag time and does not seem to be converged either. Now let's compute the VAMP score for a Markov state model at a lag time of interest.

In [None]:
msm_rmsd = pyemma.msm.estimate_markov_model(cl_rmsd.dtrajs, lag=lag)
msm_rmsd.score(cl_rmsd.dtrajs, score_k=3)

In [None]:
reader_backbone_torsions = pyemma.coordinates.source(files, top=pdb)
reader_backbone_torsions.featurizer.add_backbone_torsions()
cl_backbone_torsions = pyemma.coordinates.cluster_kmeans(reader_backbone_torsions, k=60)

In [None]:
its_bt = pyemma.msm.its(cl_backbone_torsions.dtrajs, lags=60)
pyemma.plots.plot_implied_timescales(its_bt, nits=9)

The implied timescales resolve three processes at lag times below 30 steps and two above. The backbone torsions are therefore much better suited to build a kinetic model than the minimum RMSD. This was already visible in the scores of the VAMP estimation, where minRMSD yielded $\sim 1.3$ and the backbone torsions $\sim 1.4$.

In [None]:
msm_bt = pyemma.msm.estimate_markov_model(cl_backbone_torsions.dtrajs, lag=10)
msm_bt.score(cl_backbone_torsions.dtrajs, score_k=3)

When we compute the VAMP score for the three slowest processes of the MSM, we again see a big gap between the two input features: $\sim 1.4$ vs. $\sim 2.7$. It is important to limit the number of slow processes in the scoring, because including everything adds noise to the score. This noise consists of very fast decaying processes with timescales below the actual lag time, about which we cannot make any statements with the aid of the MSM.
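To make the role of the cutoff more concrete, here is a minimal sketch under the assumption of a reversible MSM, for which the singular values entering the VAMP-2 score coincide with the eigenvalues $\lambda_i$ of the transition matrix, and the implied timescales are $t_i = -\tau / \ln \lambda_i$. The count matrix is made up for illustration, `vamp2_top_k` is a hypothetical helper, and counting the stationary eigenvalue within `k` is a convention chosen for this example.

```python
import numpy as np

lag = 10  # lag time in steps

# Hypothetical symmetric count matrix for four states (made up for illustration):
# states 0 and 1 are long-lived, states 2 and 3 interconvert quickly.
counts = np.array([
    [9000.0,   30.0,  20.0,  10.0],
    [  30.0, 4000.0,  40.0,  20.0],
    [  20.0,   40.0, 600.0, 300.0],
    [  10.0,   20.0, 300.0, 500.0],
])

row_sums = counts.sum(axis=1)
# D^(-1/2) C D^(-1/2) is symmetric and has the same eigenvalues as the
# row-normalized transition matrix T = D^(-1) C.
sym = counts / np.sqrt(np.outer(row_sums, row_sums))
eigvals = np.sort(np.linalg.eigvalsh(sym))[::-1]

# Implied timescales of the non-stationary processes
timescales = -lag / np.log(eigvals[1:])


def vamp2_top_k(eigvals, k):
    """Sum of the k largest squared eigenvalues (stationary process included)."""
    return float(np.sum(eigvals[:k] ** 2))


print('implied timescales:', timescales)
for k in range(1, len(eigvals) + 1):
    print('k =', k, 'score =', vamp2_top_k(eigvals, k))
```

A process whose timescale lies below the lag time has $\lambda_i < e^{-1}$ and therefore adds less than $e^{-2} \approx 0.14$ to the score. In this toy matrix the fourth process falls into that category, which is why restricting the score to the slow processes is a sensible choice.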

## Case 3: another molecular dynamics data set (pentapeptide)

**Exercise**: Fetch the pentapeptide data set and compare different features in order to find an optimal input feature set. Recall the tutorial on discretization for how to select features. Compare their scores and discuss why the best-scoring feature describes the slow processes in the system in a physically meaningful way.