# 07 - VAMP score based feature selection

The first step in building a Markov state model is to choose a good set of collective variables, which are later used to discretize the state space into disjoint sets. We have seen in earlier tutorial steps that this choice has a huge impact on the ability to build a good model. To a certain extent, hidden Markov state models have been shown to circumvent problems arising from a poor choice of coordinates, but making a good choice a priori is preferable.

In this tutorial we show how to benchmark a set of coordinates at the beginning of the pipeline, instead of going through discretization, kinetic model building, and validation only to find out that we made a bad choice in the first place.

The VAMP score helps us judge how much kinetic information a set of coordinates contains. To estimate the score, PyEMMA needs to estimate three covariance matrices from your data, i.e., from the input coordinates $X$:

$C_{00}, C_{01}, C_{11}$

where the index 0 denotes the instantaneous data and the index 1 denotes $X$ shifted by the lag time.
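To make the role of the covariance matrices more concrete, here is a minimal NumPy sketch of a VAMP-2 score computed from the instantaneous, time-lagged, and time-shifted covariances of mean-free data. The helper names `inv_sqrt` and `vamp2_score` and the synthetic signals are assumptions made for illustration; this is not PyEMMA's implementation, and its numbers need not match `pyemma.coordinates.vamp` (for example, conventions for counting the constant function may differ).

```python
import numpy as np


def inv_sqrt(C, eps=1e-10):
    """Inverse square root of a symmetric PSD matrix, dropping tiny eigenvalues."""
    w, V = np.linalg.eigh(C)
    keep = w > eps
    return (V[:, keep] / np.sqrt(w[keep])) @ V[:, keep].T


def vamp2_score(X, lag):
    """Squared Frobenius norm of C00^(-1/2) C01 C11^(-1/2) for mean-free data."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    X0 = X[:-lag] - X[:-lag].mean(axis=0)  # instantaneous data
    X1 = X[lag:] - X[lag:].mean(axis=0)    # time-shifted data
    n = len(X0)
    C00 = X0.T @ X0 / n
    C01 = X0.T @ X1 / n
    C11 = X1.T @ X1 / n
    K = inv_sqrt(C00) @ C01 @ inv_sqrt(C11)
    return np.linalg.norm(K, 'fro') ** 2


# synthetic test signals (hypothetical): a slowly decorrelating coordinate
# and a coordinate that is pure white noise
rng = np.random.default_rng(0)
n_frames = 20000
slow = np.zeros(n_frames)
for t in range(1, n_frames):
    slow[t] = 0.99 * slow[t - 1] + rng.normal()
fast = rng.normal(size=n_frames)

s_slow = vamp2_score(slow, lag=10)
s_fast = vamp2_score(fast, lag=10)
s_both = vamp2_score(np.column_stack([slow, fast]), lag=10)
print(s_slow, s_fast, s_both)
```

In this sketch the slow coordinate should score close to its squared autocorrelation at the chosen lag, the white-noise coordinate close to zero, and combining both should add very little to the score of the slow coordinate alone.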

<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons Licence" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import mdshare
import pyemma

## Case 1: preprocessed, two-dimensional data (toy model)
We load the two-dimensional trajectory as well as the true discrete trajectory from an archive using `numpy`...

In [None]:
file = mdshare.fetch('hmm-doublewell-2d-100k.npz', working_directory='data')
with np.load(file) as fh:
    data = fh['trajectory']
    good_dtraj = fh['discrete_trajectory']

Let's select only the x coordinate to discretize the state space. We intuitively know that this is a bad model choice, so this should be reflected in the VAMP score computed on the input features...

In [None]:
x = data[:, 0]
y = data[:, 1]

pyemma.plots.plot_free_energy(x, y, legacy=False);

In [None]:
vamp = pyemma.coordinates.vamp(x)
s_x = vamp.score(x)
print(s_x)

In [None]:
vamp = pyemma.coordinates.vamp(y)
s_y = vamp.score(y)
print(s_y)

In [None]:
vamp = pyemma.coordinates.vamp(data)
s_all = vamp.score(data)
print(s_all)

In [None]:
print('contribution of x-dimension: {}'.format(s_all - s_y))

We see that there is almost no kinetic variance along the x-axis, because the metastable regions are separated by a shift along the y-axis. The VAMP-2 score reflects how much kinetic variance is contained within a coordinate. Combining x and y therefore does not increase the score much compared to y alone.

## Case 2: low-dimensional molecular dynamics data (alanine dipeptide)
We fetch the topology and trajectory files of the alanine dipeptide data set:

In [None]:
pdb = mdshare.fetch('alanine-dipeptide-nowater.pdb', working_directory='data')
files = mdshare.fetch('alanine-dipeptide-*-250ns-nowater.dcd', working_directory='data')
print(pdb)
print(files)

We create a reader whose only feature is the minimum RMSD to a reference structure. This reader is passed on to the VAMP estimator to compute the covariance matrices on this feature. We set the lag time for $C_{01}$ to 10 steps.

In [None]:
lag = 10

In [None]:
reader_rmsd = pyemma.coordinates.source(files, top=pdb)
reader_rmsd.featurizer.add_minrmsd_to_ref(pdb)

vamp_minRMSD = pyemma.coordinates.vamp(reader_rmsd, lag=lag)
print(vamp_minRMSD.score())

Now let us compute the score for the backbone torsion angles.

In [None]:
reader_bt = pyemma.coordinates.source(files, top=pdb)
reader_bt.featurizer.add_backbone_torsions()

vamp_bt = pyemma.coordinates.vamp(reader_bt, lag=lag)
print(vamp_bt.score())

Now let's check whether the higher VAMP score of the backbone torsion angles is related to the timescales we obtain when we build a kinetic model. First, we have a look at the feature histogram to get an idea of how to discretize it.

In [None]:
fig, ax = pyemma.plots.plot_feature_histograms(
    np.concatenate(reader_rmsd.get_output()))
ax.set_title('Minimum RMSD to reference')

Now we discretize this one-dimensional space into ten states.

In [None]:
# cluster the minRMSD feature (not the toy-model `data` from case 1)
cl_rmsd = pyemma.coordinates.cluster_kmeans(reader_rmsd, k=10, max_iter=50)

... and have a look at the resolved timescales.

In [None]:
its_rmsd = pyemma.msm.its(cl_rmsd.dtrajs, lags=60)
ax = pyemma.plots.plot_implied_timescales(its_rmsd, nits=9)
ax.set_title('Implied timescales for minRMSD, 10 kmeans clusters.');
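As a reminder of what an implied timescale plot shows, the following sketch converts the eigenvalues $\lambda_i$ of a transition matrix estimated at lag time $\tau$ into implied timescales $t_i = -\tau / \ln|\lambda_i|$. The three-state matrix and the helper name `implied_timescales` are made up for illustration and are not taken from the alanine dipeptide data.

```python
import numpy as np


def implied_timescales(T, lag):
    """Implied timescales (in units of steps) of a row-stochastic matrix T."""
    eigvals = np.sort(np.abs(np.linalg.eigvals(T)))[::-1]
    # skip the stationary eigenvalue 1, whose timescale is infinite
    return -lag / np.log(eigvals[1:])


# hypothetical symmetric (hence reversible) transition matrix at lag 10
T = np.array([[0.97, 0.02, 0.01],
              [0.02, 0.90, 0.08],
              [0.01, 0.08, 0.91]])

its = implied_timescales(T, lag=10)
print(its)
```

A process whose implied timescale is not clearly above the lag time has an eigenvalue close to zero at that lag, which means the model cannot resolve it reliably.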

By inspection of the implied timescales, we see only one process, which is only slightly above the actual lag time and does not seem to be converged either. Now let's compute the VAMP score for a Markov state model at a lag time of interest.

In [None]:
msm_rmsd = pyemma.msm.estimate_markov_model(cl_rmsd.dtrajs, lag=lag)
print(msm_rmsd.score(cl_rmsd.dtrajs, score_k=3))

In [None]:
cl_backbone_torsions = pyemma.coordinates.cluster_kmeans(reader_bt, k=50)

In [None]:
its_bt = pyemma.msm.its(cl_backbone_torsions.dtrajs, lags=60)
pyemma.plots.plot_implied_timescales(its_bt, nits=9);

The implied timescales resolve three processes at lag times smaller than 30 steps and two above. So the backbone torsion feature is much better suited to build a kinetic model than the minimum RMSD feature. This was already visible from the score of the VAMP estimation: minRMSD yielded $\sim 1.3$, while backbone torsions yielded $\sim 1.4$.

In [None]:
msm_bt = pyemma.msm.estimate_markov_model(cl_backbone_torsions.dtrajs, lag=10)
print(msm_bt.score(cl_backbone_torsions.dtrajs, score_k=3))

When we compute the VAMP score for the three slowest processes of the MSM, we again see a big gap between the two input features: $\sim 1.4$ vs. $\sim 2.7$. It is important to limit the number of slow processes in the scoring, because if we include everything, we add noise to the score. This noise consists of very fast decaying processes below the actual lag time, about which we cannot make any statements with the aid of the MSM.
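The effect of capping the number of scored processes can be illustrated with a small toy example. For a reversible transition matrix evaluated against its own stationary statistics, the VAMP-2 score truncated to $k$ processes equals the sum of the $k$ largest squared eigenvalues. The four-state matrix below is a hypothetical example with two metastable sets; it is not derived from the tutorial data, and the score computed by `MSM.score` on real count data may differ in detail.

```python
import numpy as np

# hypothetical symmetric transition matrix: states {0, 1} and {2, 3}
# exchange quickly within each set and rarely between the sets
T = np.array([[0.600, 0.390, 0.005, 0.005],
              [0.390, 0.600, 0.005, 0.005],
              [0.005, 0.005, 0.600, 0.390],
              [0.005, 0.005, 0.390, 0.600]])

# T is symmetric, so its eigenvalues are real
eigvals = np.sort(np.abs(np.linalg.eigvalsh(T)))[::-1]

# VAMP-2 score as a function of the number of scored processes k = 1..4
vamp2_by_k = np.cumsum(eigvals ** 2)
print(eigvals)     # approximately [1.0, 0.98, 0.21, 0.21]
print(vamp2_by_k)  # approximately [1.0, 1.9604, 2.0045, 2.0486]
```

The stationary process and the single slow process account for almost the entire score. The two fast processes contribute only about 0.04 each, and with finite data such small contributions are dominated by statistical noise.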

## Case 3: another molecular dynamics data set (pentapeptide)

**Exercise 1**: Fetch the pentapeptide data set, and compare different features in order to find an optimal input feature set. Recall the tutorial about discretization on how to select features. Compare them and discuss why the best scoring feature does describe the slow processes in the system in a physical manner. Ensure that you cap the number of processes to score to a sane number.

In [None]:
pdb = mdshare.fetch('pentapeptide-impl-solv.pdb', working_directory='data')
files = mdshare.fetch('pentapeptide-*-500ns-impl-solv.xtc', working_directory='data')

feat = pyemma.coordinates.featurizer(pdb)
reader = pyemma.coordinates.source(files, features=feat)
n_processes = 10
lag = 20
#FIXME

In [None]:
pdb = mdshare.fetch('pentapeptide-impl-solv.pdb', working_directory='data')
files = mdshare.fetch('pentapeptide-*-500ns-impl-solv.xtc', working_directory='data')

feat = pyemma.coordinates.featurizer(pdb)
reader = pyemma.coordinates.source(files, features=feat)
n_processes = 10
lag = 20


def test_feature(feat_name, *args, **kwargs):
    feat.active_features = []
    getattr(feat, 'add_'+feat_name)(*args, **kwargs)
    print('input dimension:', feat.dimension())
    v = pyemma.coordinates.vamp(
        reader, lag=lag, dim=min(n_processes, feat.dimension()))
    s = v.score()
    return s


# define distance pairs, excluding atoms whose indices are 5 or fewer apart
distance_pairs = feat.pairs(np.arange(feat.topology.n_atoms), 5)

features = [('backbone_torsions', (), dict(cossin=False)),
            ('sidechain_torsions', (), {}),
            ('distances_ca', (), {}),
            ('distances', (distance_pairs, ), {}),
            ('contacts', (feat.topology.select('backbone'), ), {}),
            ]
results = {}
for feat_name, args, kw in features:
    print('scoring', feat_name)
    score = test_feature(feat_name, *args, **kw)
    results[feat_name] = score

fig, ax = plt.subplots()
for i, (f, score) in enumerate(results.items()):
    ax.bar(i, height=score, label=f)
ax.legend()
ax.set_ylabel('VAMP-2 score')
ax.set_xticks([])
ax.set_title(r'Input feature score with max dim={dim} and lag $\tau={tau}$'.format(
    dim=n_processes, tau=lag));
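The distance pairs used above are generated by `feat.pairs`. The sketch below is a plain NumPy reading of that call, assuming that its second argument excludes all pairs whose indices are that many positions or fewer apart. The helper `index_pairs` is hypothetical and only meant to show which pairs survive the exclusion.

```python
import numpy as np


def index_pairs(indices, excluded_neighbors=0):
    """All ordered pairs (i, j) with j > i + excluded_neighbors."""
    indices = np.sort(np.asarray(indices))
    return np.array([(i, j)
                     for a, i in enumerate(indices)
                     for j in indices[a + 1:]
                     if j > i + excluded_neighbors])


pairs = index_pairs(np.arange(8), excluded_neighbors=5)
print(pairs)  # [[0 6] [0 7] [1 7]]
```

Excluding close neighbors removes distances that are nearly constant because of the covalent geometry, and it keeps the feature dimension manageable.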

As we see, most kinetic variance is contained in the distance-based feature. It also has the highest dimension, so it is not surprising that it captures more slow processes.

Now we want to use it in the next exercise.

**Exercise 2**: Create different state space discretizations on the low-dimensional VAMP projection of the best feature found and compare them in terms of the cross-validated (`score_cv` method) and unvalidated (`score` method) VAMP-2 score of an MSM estimated at a fixed lag time.

* What do you observe when you choose too many clusters?
* Why is it a good idea to work with a cross-validated measure?

In [None]:
results_msm = []
n_clusters = [10, 50, 100, 150, 200,
              300, 500, 800, 1000][::-1]
feat.active_features = []  # clear the active features.
feat.add_  #FIXME
data = pyemma.coordinates.vamp(
    reader, dim=???, lag=???).get_output()[0]  # FIXME


def score_discretization(k, init_centers=None):
    """ re-estimate the kmeans object and estimate/score a MSM on the state space."""
    kw = {} if init_centers is None else {'clustercenters': init_centers}
    cl.estimate(data, n_clusters=k, stride=3, **kw)
    dtrajs = cl.assign(data)
    msm = pyemma.msm.estimate_markov_model(dtrajs, lag=lag)
    score_k = min(msm.nstates, n_processes)
    # FIXME: score the msm!
    s_cv = msm.???(score_k=score_k)
    s=msm.???(score_k=score_k)
    results_msm.append((k, s, s_cv))

    
# estimate the first kmeans discretization
cl = pyemma.coordinates.cluster_kmeans(keep_data=True, max_iter=15)
score_discretization(k=n_clusters[0])

# we sub-sample the already estimated centers
# and iterate these for a while prior to estimating and scoring the MSM.
for k in n_clusters[1:]:
    inds=np.random.randint(0, k, size=k)
    new_initial_centers=cl.cluster_centers_[inds]
    score_discretization(k, init_centers=new_initial_centers)

fig, ax=plt.subplots()
for i, (x, score, scores_cv) in enumerate(results_msm):
    y_mean=scores_cv.mean()
    ax.errorbar(x, y=y_mean, yerr=scores_cv.std(), color='green',
                marker='o', label='CV' if i == 0 else None)
    ax.scatter(x, score, marker='s', color='red',
               label='Not validated' if i == 0 else None)
ax.set_xlabel('Number of centers')
ax.set_ylabel('VAMP-2 score')
ax.set_title('MSM VAMP-2 score vs. cluster centers')
ax.legend();

In [None]:
results_msm = []
n_clusters = [10, 50, 100, 150, 200,
              300, 500, 800, 1000][::-1]
feat.active_features = []
feat.add_distances(distance_pairs)
data = pyemma.coordinates.vamp(
    reader, dim=n_processes, lag=lag).get_output()[0]


def score_discretization(k, init_centers=None):
    """ re-estimate the kmeans object and estimate/score a MSM on the state space."""
    kw = {} if init_centers is None else {'clustercenters':init_centers}
    cl.estimate(data, n_clusters=k, stride=3, **kw)
    dtrajs = cl.assign(data)
    msm = pyemma.msm.estimate_markov_model(dtrajs, lag=lag)
    score_k = min(msm.nstates, n_processes)
    s_cv = msm.score_cv(dtrajs, score_k=score_k)
    s = msm.score(dtrajs, score_k=score_k)
    results_msm.append((k, s, s_cv))


# estimate the first kmeans discretization
cl = pyemma.coordinates.cluster_kmeans(keep_data=True, max_iter=15)
score_discretization(k=n_clusters[0])

# we sub-sample the already estimated centers
# and iterate these for a while prior to estimating and scoring the MSM.
for k in n_clusters[1:]:
    inds = np.random.randint(0, k, size=k)
    new_initial_centers = cl.cluster_centers_[inds]
    score_discretization(k, init_centers=new_initial_centers)

fig, ax = plt.subplots()
for i, (x, score, scores_cv) in enumerate(results_msm):
    y_mean = scores_cv.mean()
    ax.errorbar(x, y=y_mean, yerr=scores_cv.std(), color='green',
                marker='o', label='CV' if i == 0 else None)
    ax.scatter(x, score, marker='s', color='red',
               label='Not validated' if i == 0 else None)
ax.set_xlabel('Number of centers')
ax.set_ylabel('VAMP-2 score')
ax.set_title('MSM VAMP-2 score vs. cluster centers')
ax.legend();
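The difference between a training score and a validated score can also be reproduced on synthetic data without PyEMMA. The sketch below is an assumption-laden illustration, not the algorithm behind `score_cv`: it uses a single train/test split of a noisy two-state signal, one-hot state indicators as basis functions, and hypothetical helpers (`count_covariances`, `inv_sqrt`, `vamp2_train_test`, `discretize`). The singular functions are estimated on the first half of the trajectory and then scored on both halves.

```python
import numpy as np


def count_covariances(dtraj, n_states, lag):
    """Covariances of one-hot encoded states, i.e. transition counts at the lag."""
    C01 = np.zeros((n_states, n_states))
    np.add.at(C01, (dtraj[:-lag], dtraj[lag:]), 1.0)
    return np.diag(C01.sum(axis=1)), C01, np.diag(C01.sum(axis=0))


def inv_sqrt(C, eps=1e-10):
    """Inverse square root of a symmetric PSD matrix, dropping tiny eigenvalues."""
    w, V = np.linalg.eigh(C)
    keep = w > eps
    return (V[:, keep] / np.sqrt(w[keep])) @ V[:, keep].T


def vamp2_train_test(train_cov, test_cov, k):
    """Fit the top-k singular functions on train_cov, score them on both sets."""
    C00, C01, C11 = train_cov
    L0, L1 = inv_sqrt(C00), inv_sqrt(C11)
    U_, _, Vt_ = np.linalg.svd(L0 @ C01 @ L1)
    U, V = L0 @ U_[:, :k], L1 @ Vt_[:k].T

    def score(cov):
        A, B, C = U.T @ cov[0] @ U, U.T @ cov[1] @ V, V.T @ cov[2] @ V
        return np.linalg.norm(inv_sqrt(A) @ B @ inv_sqrt(C), 'fro') ** 2

    return score(train_cov), score(test_cov)


def discretize(x, n_bins):
    """Assign each frame to one of n_bins equal-width bins covering [-3, 3]."""
    edges = np.linspace(-3, 3, n_bins + 1)[1:-1]
    return np.digitize(x, edges)


# hypothetical data: hidden two-state jump process observed with Gaussian noise
rng = np.random.default_rng(1)
n_frames, lag, k = 4000, 5, 5
hidden = np.empty(n_frames)
state = 1.0
for t in range(n_frames):
    if rng.random() < 0.02:
        state = -state
    hidden[t] = state
x = hidden + 0.5 * rng.normal(size=n_frames)

half = n_frames // 2
results = {}
for n_bins in (10, 200):
    d = discretize(x, n_bins)
    train = count_covariances(d[:half], n_bins, lag)
    test = count_covariances(d[half:], n_bins, lag)
    results[n_bins] = vamp2_train_test(train, test, k)
    print(n_bins, results[n_bins])
```

With many bins and little data per bin, the training score should rise because the extra singular functions fit statistical noise, while the score on the held-out half does not follow. This is the motivation for preferring a cross-validated score when comparing discretizations.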

## Wrapping up
In this notebook, we have learned how to evaluate the quality of input features in terms of maximizing the amount of kinetic information with `pyemma`. In detail, we have used
- `v = pyemma.coordinates.vamp()` to estimate covariances of the input features.
- `v.score(data)` to compute the VAMP score on the covariances.
- `pyemma.msm.MSM.score()` to judge the final quality of the estimated Markov state model.
- `pyemma.msm.MSM.score_cv()` to have a cross-validated measure of the quality of the MSM.