VAMP score based feature selection
==================================



In [None]:
%%javascript
Jupyter.utils.load_extensions('rubberband/main')
Jupyter.utils.load_extensions('exercise2/main')

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import mdshare
import pyemma

In [None]:
pdb = mdshare.fetch('alanine-dipeptide-nowater.pdb', working_directory='data')
files = mdshare.fetch('alanine-dipeptide-*-250ns-nowater.dcd', working_directory='data')
print(pdb)
print(files)

We split the available data into a training set and a test set. The first two trajectories are used for training, while the last trajectory is held out to evaluate the score. We create a reader whose only feature is the minimum RMSD to a reference structure and pass it to the VAMP estimator.

In [None]:
train_set = files[:2]
test_set = files[-1]

reader_rmsd = pyemma.coordinates.source(train_set, top=pdb)
reader_rmsd.featurizer.add_minrmsd_to_ref(pdb)

vamp_minRMSD = pyemma.coordinates.vamp(reader_rmsd, lag=10)

The score is then computed on the held-out test set, which is featurized in the same way.

In [None]:
reader_test_data = pyemma.coordinates.source(test_set, top=pdb)
reader_test_data.featurizer.add_minrmsd_to_ref(pdb)

vamp_minRMSD.score(reader_test_data)
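
To make the number returned above easier to interpret, here is a minimal sketch of what a VAMP-2 score measures for a single scalar feature such as the minimum RMSD. For one mean-free feature, the VAMP-2 score reduces to the squared time-lagged correlation. This is an illustration in plain Python, not PyEMMA's implementation: the helper names `vamp2_1d` and `ar1` and the toy autoregressive data are assumptions, and implementations may additionally count the constant function, which adds 1 to the score.

```python
import random

def vamp2_1d(x, lag):
    """VAMP-2 score of a scalar time series: squared time-lagged correlation."""
    x0, xt = x[:-lag], x[lag:]          # instantaneous and time-lagged samples
    m0 = sum(x0) / len(x0)
    mt = sum(xt) / len(xt)
    c00 = sum((a - m0) ** 2 for a in x0)
    ctt = sum((b - mt) ** 2 for b in xt)
    c0t = sum((a - m0) * (b - mt) for a, b in zip(x0, xt))
    return c0t ** 2 / (c00 * ctt)

def ar1(coeff, n, rng):
    """Hypothetical toy feature: x[t+1] = coeff * x[t] + Gaussian noise."""
    x = [0.0]
    for _ in range(n - 1):
        x.append(coeff * x[-1] + rng.gauss(0.0, 1.0))
    return x

rng = random.Random(0)
slow_feature = ar1(0.99, 5000, rng)     # long memory: resolves a slow process
fast_feature = ar1(0.10, 5000, rng)     # almost no memory at lag 10
score_slow = vamp2_1d(slow_feature, lag=10)
score_fast = vamp2_1d(fast_feature, lag=10)
print(score_slow, score_fast)
```

A feature that retains memory over the lag time scores close to 1, while a feature that decorrelates quickly scores close to 0. This is the sense in which a higher VAMP score indicates a feature that better captures slow dynamics.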

In [None]:
reader_bt = pyemma.coordinates.source(train_set, top=pdb)
reader_bt.featurizer.add_backbone_torsions()

vamp_bt = pyemma.coordinates.vamp(reader_bt, lag=10)

We now switch the active feature of the test reader to backbone torsion angles and score the torsion-based VAMP model on it.

In [None]:
reader_test_data.featurizer.active_features = []
reader_test_data.featurizer.add_backbone_torsions()
vamp_bt.score(reader_test_data)

Now let's check whether the higher VAMP score of the backbone torsion angles is reflected in the timescales when we build a kinetic model. First, we look at the feature histogram to get an idea of how to discretize it.

In [None]:
reader_rmsd = pyemma.coordinates.source(files, top=pdb)
reader_rmsd.featurizer.add_minrmsd_to_ref(pdb)
data = reader_rmsd.get_output()

# plot_feature_histograms returns a (figure, axes) pair
fig, ax = pyemma.plots.plot_feature_histograms(np.concatenate(data))
ax.set_title('Minimum RMSD to reference');

Now we discretize this one-dimensional space into ten states.

In [None]:
cl_rmsd = pyemma.coordinates.cluster_kmeans(data, k=10)

... and have a look at the resolved timescales.

In [None]:
its_rmsd = pyemma.msm.its(cl_rmsd.dtrajs, lags=60)
ax = pyemma.plots.plot_implied_timescales(its_rmsd)
ax.set_title('Implied timescales for minRMSD, 10 kmeans clusters.');
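
As a reminder of what such a plot shows: each implied timescale is obtained from a transition-matrix eigenvalue λ estimated at lag time τ via t = -τ / ln|λ|. The sketch below evaluates this for a hypothetical two-state transition matrix; the jump probabilities `a` and `b` and the helper name `implied_timescale` are illustrative assumptions, not values from the alanine dipeptide data.

```python
import math

def implied_timescale(eigenvalue, lag):
    """t = -lag / ln|eigenvalue| for a transition-matrix eigenvalue with 0 < |eigenvalue| < 1."""
    return -lag / math.log(abs(eigenvalue))

# Hypothetical two-state transition matrix estimated at a lag of 10 steps:
# T = [[1 - a, a], [b, 1 - b]] with jump probabilities a and b.
a, b, lag = 0.02, 0.03, 10
lambda_2 = 1.0 - a - b          # second eigenvalue of a 2x2 stochastic matrix (trace minus 1)
its = implied_timescale(lambda_2, lag)
print(its)
```

A timescale only slightly above the lag time corresponds to an eigenvalue near exp(-1), meaning the process has largely decayed within a single lag and is poorly resolved by the model.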

Inspecting the implied timescales, we see only one process; its timescale lies only slightly above the lag time and does not appear to be converged. Now let's compute the VAMP score for a Markov state model at a lag time of interest.

In [None]:
msm_rmsd = pyemma.msm.estimate_markov_model(cl_rmsd.dtrajs, lag=40)
msm_rmsd.score(cl_rmsd.dtrajs)

In [None]:
reader_backbone_torsions = pyemma.coordinates.source(files, top=pdb)
reader_backbone_torsions.featurizer.add_backbone_torsions()
cl_backbone_torsions = pyemma.coordinates.cluster_kmeans(reader_backbone_torsions, k=60)

In [None]:
its_bt = pyemma.msm.its(cl_backbone_torsions.dtrajs, lags=2000)
pyemma.plots.plot_implied_timescales(its_bt)

In [None]:
msm_bt = pyemma.msm.estimate_markov_model(cl_backbone_torsions.dtrajs, lag=10)
msm_bt.score(cl_backbone_torsions.dtrajs, score_k=2)
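
To give a feeling for the role of `score_k`, consider an idealized reversible Markov model scored against its own statistics. In that case the VAMP-2 score is the sum of the squared leading transition-matrix eigenvalues. The first eigenvalue is always 1, so `score_k=2` measures the stationary process plus one slow process. The following is a sketch for a hypothetical two-state chain under that assumption; the function name and the jump probabilities are illustrative and unrelated to the models estimated above.

```python
def vamp2_two_state(a, b, score_k):
    """Sum of squared leading eigenvalues of T = [[1 - a, a], [b, 1 - b]]."""
    # Eigenvalue 1 is always the largest in magnitude for a stochastic matrix.
    eigenvalues = [1.0, 1.0 - a - b]
    return sum(ev ** 2 for ev in eigenvalues[:score_k])

slow = vamp2_two_state(0.02, 0.03, score_k=2)   # metastable: rare jumps
fast = vamp2_two_state(0.40, 0.50, score_k=2)   # nearly memoryless
print(slow, fast)
```

Under these assumptions the score lies between 1 and `score_k`, and it approaches `score_k` as the resolved processes become slow relative to the lag time. Scores are only comparable between models evaluated at the same lag time and with the same `score_k`.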

In [None]:
Pentapeptide