In [None]:
import pyemma
import numpy as np
import mdshare
import itertools
import matplotlib.pyplot as plt
%matplotlib inline

# Feature selection with VAMP

One mission-critical step in building kinetic models such as Markov state models (MSMs) is choosing a set of molecular features to describe the system of interest.

The variational approach to Markov processes (VAMP) provides a systematic way to select the set of molecular features. VAMP provides a set of scoring functions (the so-called VAMP scores) that allow us to rank sets of features. Figuratively speaking, the VAMP scores measure the degree of "slowness" that can be captured with a given set of features.

Here, we demonstrate the application of VAMP scores using the example of the pentapeptide simulation data.
We will compare dihedral angles, minimal residue distances, and heavy-atom contacts.

We start by downloading the pentapeptide data.

In [None]:
# download the pentapeptide data
topfile = mdshare.load('pentapeptide-nowater.pdb')
traj_list = [mdshare.load('pentapeptide-%02d-500ns.xtc' % i) for i in range(25)]

## 1. Dihedrals

First, we test dihedral angles. We start by computing the sine and cosine of all dihedral angles and saving them as a list of NumPy arrays.

In [None]:
feat = pyemma.coordinates.featurizer(topfile)
feat.add_backbone_torsions(cossin=True)
feat.add_sidechain_torsions(which=['chi1'], cossin=True)
data_dih = pyemma.coordinates.load(traj_list, features=feat)
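
Dihedral angles are periodic, so two angles on either side of $\pm\pi$ describe almost the same conformation although their raw values differ by nearly $2\pi$. The following minimal sketch illustrates why the sine/cosine encoding is used. The array `phi` is made-up illustration data and is not taken from the pentapeptide trajectories.

```python
import numpy as np

# Two hypothetical dihedral angles (radians) on either side of the periodic boundary.
phi = np.array([np.pi - 0.01, -np.pi + 0.01])

# Difference of the raw angle values: close to 2*pi although the conformations are similar.
raw_gap = abs(phi[0] - phi[1])

# Encode each angle as (cos, sin); this maps the angle onto the unit circle.
encoded = np.column_stack([np.cos(phi), np.sin(phi)])

# Euclidean distance in the encoded space: small, as one would expect.
encoded_gap = np.linalg.norm(encoded[0] - encoded[1])

print(raw_gap, encoded_gap)
```

In the encoded representation, nearby conformations have nearby feature values, which matters for methods that rely on means and covariances of the features.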

Next, we compute a so-called Koopman model.

#### Scoring the features works by scoring a dynamic model


The Koopman operator $\mathcal{K}$ is an integral operator
that describes conditional future expectation values. Let
$p(\mathbf{x},\,\mathbf{y})$ be the conditional probability
density of visiting an infinitesimal phase space volume around
point $\mathbf{y}$ at time $t+\tau$ given that the phase
space point $\mathbf{x}$ was visited at the earlier time
$t$. Then the action of the Koopman operator on a function
$f$ can be written as follows:

$$ (\mathcal{K}f)(\mathbf{x})=\int p(\mathbf{x},\,\mathbf{y})f(\mathbf{y})\,\mathrm{d}\mathbf{y}=\mathbb{E}\left[f(\mathbf{x}_{t+\tau})\mid\mathbf{x}_{t}=\mathbf{x}\right] $$

If we approximate $f$ by a linear superposition of ansatz
functions $\boldsymbol{\chi}$ of the conformational
degrees of freedom (features), the operator $\mathcal{K}$
can be approximated by a (finite-dimensional) matrix $\mathbf{K}$.
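
For a system with a finite number of discrete states, the integral in the definition of $\mathcal{K}$ reduces to a matrix-vector product, which makes the conditional-expectation interpretation easy to check. The sketch below assumes a made-up three-state transition matrix `P`; it is an illustration only and is not derived from the pentapeptide data.

```python
import numpy as np

# Hypothetical row-stochastic transition matrix: P[x, y] is the probability
# of being in state y at time t + tau given state x at time t.
P = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])

# An arbitrary observable f, given by its value in each state.
f = np.array([1.0, 0.0, -1.0])

# Discrete analogue of the Koopman operator: (K f)(x) = sum_y P[x, y] * f(y).
Kf = P @ f

# A constant function is left unchanged because every row of P sums to one.
K_const = P @ np.ones(3)

print(Kf)
print(K_const)
```

Each entry of `Kf` is the expected value of `f` one lag time later, conditioned on the starting state.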

The approximation is computed as follows: From the time-dependent
input features $\boldsymbol{\chi}(t)$, $t=0,\dots,T-1$, we compute the mean
$\boldsymbol{\mu}_{0}$ ($\boldsymbol{\mu}_{1}$) from
all data excluding the last (first) $\tau$ steps of every
trajectory as follows:

$$
\begin{aligned}
\boldsymbol{\mu}_{0} &:= \frac{1}{T-\tau}\sum_{t=0}^{T-\tau-1}\boldsymbol{\chi}(t) \\
\boldsymbol{\mu}_{1} &:= \frac{1}{T-\tau}\sum_{t=\tau}^{T-1}\boldsymbol{\chi}(t)
\end{aligned}
$$

Next, we compute the instantaneous covariance matrices
$\mathbf{C}_{00}$ and $\mathbf{C}_{11}$ and the
time-lagged covariance matrix $\mathbf{C}_{01}$ as follows
($\top$ denotes the transpose):

$$
\begin{aligned}
\mathbf{C}_{00} &:= \frac{1}{T-\tau}\sum_{t=0}^{T-\tau-1}\left[\boldsymbol{\chi}(t)-\boldsymbol{\mu}_{0}\right]\left[\boldsymbol{\chi}(t)-\boldsymbol{\mu}_{0}\right]^{\top} \\
\mathbf{C}_{11} &:= \frac{1}{T-\tau}\sum_{t=\tau}^{T-1}\left[\boldsymbol{\chi}(t)-\boldsymbol{\mu}_{1}\right]\left[\boldsymbol{\chi}(t)-\boldsymbol{\mu}_{1}\right]^{\top} \\
\mathbf{C}_{01} &:= \frac{1}{T-\tau}\sum_{t=0}^{T-\tau-1}\left[\boldsymbol{\chi}(t)-\boldsymbol{\mu}_{0}\right]\left[\boldsymbol{\chi}(t+\tau)-\boldsymbol{\mu}_{1}\right]^{\top}
\end{aligned}
$$

The Koopman matrix is then computed as follows:

$$ \mathbf{K}=\mathbf{C}_{00}^{-1}\mathbf{C}_{01} $$
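
The steps above (means, covariance matrices, Koopman matrix) can be sketched in plain NumPy. This is a minimal illustration and not PyEMMA's implementation. The toy data, the matrix `A`, and all variable names are assumptions made for this example. For such a linear process, the estimated `K` should come out close to `A`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (an assumption for illustration): a 2-D linear process
# chi(t+1) = chi(t) @ A + noise, with one frame per row of `chi`.
A = np.array([[0.95, 0.0],
              [0.0, 0.5]])
T, tau = 50_000, 1
noise = rng.normal(size=(T, 2))
chi = np.zeros((T, 2))
for t in range(T - 1):
    chi[t + 1] = chi[t] @ A + noise[t]

X = chi[:-tau]  # all frames except the last tau
Y = chi[tau:]   # all frames except the first tau

# Means of the instantaneous and the time-lagged data.
mu0 = X.mean(axis=0)
mu1 = Y.mean(axis=0)

# Covariance matrices, normalised by the number of frame pairs T - tau.
n = len(X)
C00 = (X - mu0).T @ (X - mu0) / n
C11 = (Y - mu1).T @ (Y - mu1) / n
C01 = (X - mu0).T @ (Y - mu1) / n

# Koopman matrix K = C00^{-1} C01, using a linear solve instead of an explicit inverse.
K = np.linalg.solve(C00, C01)
print(K)
```

Solving the linear system instead of inverting $\mathbf{C}_{00}$ explicitly is a common numerical choice; it gives the same matrix in exact arithmetic.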

We now estimate a Koopman model using the dihedral angles 
as input features.

In [None]:
vamp_dih = pyemma.coordinates.vamp(data_dih, lag=20, dim=4)

It can be shown that the leading singular functions of the
half-weighted Koopman matrix
$$ \bar{\mathbf{K}}:=\mathbf{C}_{00}^{-\frac{1}{2}}\mathbf{C}_{01}\mathbf{C}_{11}^{-\frac{1}{2}} $$
encode the best reduced dynamical model for the time series.
The corresponding singular values $\boldsymbol{\sigma}$ measure the
amount of "slowness" captured in the reduced dynamical model.

The so-called VAMP-1 score of the model $\mathbf{K}$ is the sum of its leading singular values.
Above, in the call to `pyemma.coordinates.vamp`, we fixed the dimension of the Koopman model to 4. Therefore, the VAMP-1 score is the sum of the 4 largest singular values plus 1. The additional 1 comes from the constant singular value $\sigma_0=1$ that every Koopman model possesses and which is accompanied by a constant singular function. We do not include the constant singular function in the output (`vamp_dih.get_output()`), and it is never counted in `dim`.

In [None]:
vamp_dih.score(score_method='VAMP1')

Alternative choices are the VAMP-2 score and the VAMP-E score. For details, please refer to the docstring of `score`.
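
To make the relation between singular values and scores concrete, here is a small NumPy sketch that builds the half-weighted Koopman matrix for toy data and evaluates VAMP-1 (sum of singular values) and VAMP-2 (sum of squared singular values, as it is commonly defined). It follows the convention described above of adding 1 for the constant singular function. The toy data and the helper `inv_sqrt` are assumptions for illustration; this is not PyEMMA's code.

```python
import numpy as np

def inv_sqrt(C):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(C)
    return V @ np.diag(w ** -0.5) @ V.T

rng = np.random.default_rng(1)

# Toy data (illustration only): two independent AR(1) processes
# with lag-1 autocorrelations 0.95 (slow) and 0.5 (fast).
T, tau = 20_000, 1
a = np.array([0.95, 0.5])
noise = rng.normal(size=(T, 2))
chi = np.zeros((T, 2))
for t in range(T - 1):
    chi[t + 1] = a * chi[t] + noise[t]

# Mean-free instantaneous and time-lagged data and their covariances.
X = chi[:-tau] - chi[:-tau].mean(axis=0)
Y = chi[tau:] - chi[tau:].mean(axis=0)
C00, C11, C01 = X.T @ X / len(X), Y.T @ Y / len(X), X.T @ Y / len(X)

# Half-weighted Koopman matrix and its singular values (returned in descending order).
K_bar = inv_sqrt(C00) @ C01 @ inv_sqrt(C11)
sigma = np.linalg.svd(K_bar, compute_uv=False)

dim = 2
vamp1 = 1.0 + sigma[:dim].sum()         # sum of singular values, plus sigma_0 = 1
vamp2 = 1.0 + (sigma[:dim] ** 2).sum()  # sum of squared singular values, plus sigma_0 = 1
print(sigma, vamp1, vamp2)
```

Because all singular values lie between 0 and 1, VAMP-2 weights slow processes (singular values near 1) more strongly relative to fast ones than VAMP-1 does.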

#### Computing a generalizable score with the help of cross-validation

Like all models, the Koopman model can be subject to overfitting. Overfitting means that the learned model parameters do not
generalize to an independent data set.

To assess how the results of a statistical analysis will generalize to an independent data set, we perform cross-validation.

One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or test set). To reduce variability, most methods perform multiple rounds of cross-validation using different partitions and combine (e.g. average) the validation results over the rounds to estimate the model's predictive performance.

Next, we implement leave-one-out cross-validation. Here, we run as many rounds of estimation as we have trajectories. The training data in each round consists of all trajectories except one, which is used as the test data.

#### Cross-validation requires a VAMP-score that depends on training data and test data

The VAMP score in this case cannot simply be the sum of singular values of the model.
Here, *both* the training data *and* the test data need to be taken into account.
A sensible way of defining a VAMP score that depends on both data sets was proposed by Wu et al.
It is given by the equation
$$ \text{VAMP-}r\,\text{score} = \left\|\left(\mathbf{U}^{\text{train},\top}\mathbf{C}_{00}^{\text{test}}\mathbf{U}^{\text{train}}\right)^{-\frac{1}{2}}\left(\mathbf{U}^{\text{train},\top}\mathbf{C}_{01}^{\text{test}}\mathbf{V}^{\text{train}}\right)\left(\mathbf{V}^{\text{train},\top}\mathbf{C}_{11}^{\text{test}}\mathbf{V}^{\text{train}}\right)^{-\frac{1}{2}}\right\|_{r}^{r} $$

where $\|\cdot\|_{r}$ is the Schatten $r$-norm (the $\ell_r$ norm of the singular values), $\top$ denotes the transpose, $\mathbf{C}_{ij}$ are the covariance matrices as defined above, and $\mathbf{U}$ and $\mathbf{V}$ are the transformed singular vectors of the half-weighted Koopman matrix computed from the training data:
$$ \bar{\mathbf{K}}^{\text{train}}=\mathbf{U}^{\prime \text{train}}\mathbf{S}\mathbf{V}^{\prime \text{train},\top} $$
$$ \mathbf{U}^{\text{train}} = \left( \mathbf{C}_{00}^{\text{train}} \right)^{-\frac{1}{2}} \mathbf{U}^{\prime \text{train}} $$
$$ \mathbf{V}^{\text{train}} = \left( \mathbf{C}_{11}^{\text{train}} \right)^{-\frac{1}{2}} \mathbf{V}^{\prime \text{train}} $$
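
The equations above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: each data set is a single trajectory, the toy data are two independent AR(1) processes, the helper names (`inv_sqrt`, `covariances`, `vamp_r_test_score`, `toy_trajectory`) are hypothetical, and the constant contribution of 1 is not added. It is not PyEMMA's implementation.

```python
import numpy as np

def inv_sqrt(C):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(C)
    return V @ np.diag(w ** -0.5) @ V.T

def covariances(chi, tau):
    """Mean-free C00, C01, C11 of one trajectory `chi` (frames x features)."""
    X = chi[:-tau] - chi[:-tau].mean(axis=0)
    Y = chi[tau:] - chi[tau:].mean(axis=0)
    n = len(X)
    return X.T @ X / n, X.T @ Y / n, Y.T @ Y / n

def vamp_r_test_score(train, test, tau, dim, r=1):
    """Schatten-r norm (to the power r) from the equation above; no +1 is added."""
    C00, C01, C11 = covariances(train, tau)
    # SVD of the half-weighted Koopman matrix of the training data.
    U_p, S, Vt_p = np.linalg.svd(inv_sqrt(C00) @ C01 @ inv_sqrt(C11))
    U = inv_sqrt(C00) @ U_p[:, :dim]
    V = inv_sqrt(C11) @ Vt_p[:dim].T
    # Covariances of the test data, projected onto the training singular vectors.
    T00, T01, T11 = covariances(test, tau)
    M = inv_sqrt(U.T @ T00 @ U) @ (U.T @ T01 @ V) @ inv_sqrt(V.T @ T11 @ V)
    return np.sum(np.linalg.svd(M, compute_uv=False) ** r)

rng = np.random.default_rng(2)

def toy_trajectory(n_frames):
    """Toy data (illustration only): two independent AR(1) processes."""
    chi = np.zeros((n_frames, 2))
    noise = rng.normal(size=(n_frames, 2))
    for t in range(n_frames - 1):
        chi[t + 1] = np.array([0.95, 0.5]) * chi[t] + noise[t]
    return chi

train, test = toy_trajectory(20_000), toy_trajectory(5_000)
self_score = vamp_r_test_score(train, train, tau=1, dim=2)
test_score = vamp_r_test_score(train, test, tau=1, dim=2)
print(self_score, test_score)
```

When the test data equal the training data, the two outer factors reduce to identity matrices and the expression collapses to the sum of the training singular values (for $r=1$), which is a useful sanity check.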

In PyEMMA, this equation is implemented by the method `v.score(test_data)`, where `v` is a `VAMP` estimator that was trained with the training data and `test_data` is the test data.

In [None]:
def leave_out_one_cv_score(trajs):
    """Return mean and standard deviation of the leave-one-out VAMP-1 test scores."""
    scores = []
    for i in range(len(trajs)):
        # split the data into a training set and a test set
        training_data = [t for j, t in enumerate(trajs) if j != i]
        test_data = [trajs[i]]
        # train a Koopman model (VAMP) with the training data
        vamp = pyemma.coordinates.vamp(training_data, lag=20, dim=4)
        # test the model that we just estimated with the test data
        scores.append(vamp.score(test_data, score_method='VAMP1'))
    return np.mean(scores), np.std(scores)

In [None]:
score_dih = leave_out_one_cv_score(data_dih)
print(score_dih[0], '+-', score_dih[1])

To get a rough insight into the quality of the dynamic model, we plot a 2-D histogram in which we project all the simulation data onto the first two singular functions of the half-weighted Koopman matrix.

In [None]:
def show_ic_fel(vamp):
    ics_trajs = vamp.get_output()
    ics = np.concatenate(ics_trajs)
    pyemma.plots.plot_free_energy(ics[:, 0], ics[:, 1])

In [None]:
show_ic_fel(vamp_dih)

We see that the projection shows well-separated density blobs, which is indicative of a metastable system that spends much time in a single blob and rarely transitions to a different one.
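
As a rough illustration of what such a free energy plot encodes, the sketch below computes $-\ln p$ from a 2-D histogram of toy samples. It assumes that the free energy is shown in units of $kT$ and shifted so that its minimum is zero; the two Gaussian blobs are made-up data, and this is a generic illustration rather than PyEMMA's implementation of `plot_free_energy`.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy 2-D samples (illustration only): two Gaussian blobs of unequal population.
samples = np.vstack([
    rng.normal(loc=(-2.0, 0.0), scale=0.5, size=(8_000, 2)),
    rng.normal(loc=(2.0, 0.0), scale=0.5, size=(2_000, 2)),
])

# 2-D histogram normalised to a probability per bin.
counts, xedges, yedges = np.histogram2d(samples[:, 0], samples[:, 1], bins=40)
prob = counts / counts.sum()

# Free energy in units of kT; empty bins become +inf. Shift so the minimum is zero.
with np.errstate(divide='ignore'):
    free_energy = -np.log(prob)
free_energy -= free_energy.min()

print(free_energy.min(), np.isinf(free_energy).any())
```

Densely populated regions appear as free energy minima, and the sparsely sampled region between the blobs appears as a barrier.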

## 2. Residue mindists

We repeat all steps for another feature set consisting of the minimal distances between all pairs of residues.

In [None]:
feat = pyemma.coordinates.featurizer(topfile)
feat.add_residue_mindist(list(itertools.combinations(range(5), 2)), periodic=False)
data_rmindist = pyemma.coordinates.load(traj_list, features=feat)
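
The residue pairs passed to `add_residue_mindist` above are generated with `itertools.combinations`. The short sketch below only lists these pairs to show how many distance features result; the variable names are chosen for this example.

```python
import itertools

n_residues = 5  # the pentapeptide has five residues, indexed 0 to 4
pairs = list(itertools.combinations(range(n_residues), 2))

# Each unordered residue pair contributes one minimal-distance feature.
print(len(pairs))
print(pairs)
```

With five residues there are ten unordered pairs, so each frame is described by ten minimal distances.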

In [None]:
vamp_rmindist = pyemma.coordinates.vamp(data_rmindist, lag=20,  dim=4)

In [None]:
vamp_rmindist.score(score_method='VAMP1')

In [None]:
score_rmindist = leave_out_one_cv_score(data_rmindist)
print(score_rmindist[0], '+-', score_rmindist[1])

Compared with the Koopman model estimated from dihedral angles, the Koopman model estimated from minimal residue distances has a lower VAMP score.

The projection of the simulation data onto the singular functions of the half-weighted Koopman matrix that was estimated from minimal residue distances agrees with the low score: the projection looks fuzzier and shows less clearly separated density blobs.

In [None]:
show_ic_fel(vamp_rmindist)

## Summary

To conclude, we present the scores graphically. This shows that dihedral angles yield a significantly larger score than the minimal residue distances.

In [None]:
fig, ax = plt.subplots()
rects1 = ax.bar([1, 2], [score_dih[0], score_rmindist[0]], yerr=[score_dih[1], score_rmindist[1]])
ax.set_ylabel('VAMP score')
ax.set_xlabel('feature set')
ax.set_xticks((1, 2))
ax.set_xticklabels(('dihedrals','residue dist.'));

## A. Contacts between heavy atoms (computationally more expensive)

As a third example, we investigate heavy-atom contacts as a candidate set of features.
Running this takes a couple of minutes.

In [None]:
feat = pyemma.coordinates.featurizer(topfile)
feat.add_contacts(list(itertools.combinations(feat.select_Heavy(), 2)), threshold=0.45)
data_con = pyemma.coordinates.load(traj_list, features=feat)
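
A contact feature turns each pairwise distance into a 0/1 indicator. The sketch below illustrates this with four made-up atom positions and the same cutoff of 0.45 (assumed to be in nm) as in the cell above. The coordinates are hypothetical, and the treatment of distances exactly at the cutoff is an assumption of this example rather than a statement about PyEMMA's behavior.

```python
import itertools
import numpy as np

# Hypothetical coordinates (in nm) of four atoms in a single frame.
xyz = np.array([[0.0, 0.0, 0.0],
                [0.3, 0.0, 0.0],
                [0.8, 0.0, 0.0],
                [0.8, 0.4, 0.0]])

pairs = list(itertools.combinations(range(len(xyz)), 2))
threshold = 0.45  # same cutoff as in the cell above

# Distance for every atom pair, then a 0/1 contact indicator per pair.
dists = np.array([np.linalg.norm(xyz[i] - xyz[j]) for i, j in pairs])
contacts = (dists < threshold).astype(float)

for pair, d, c in zip(pairs, dists, contacts):
    print(pair, round(float(d), 3), c)
```

Because the number of pairs grows quadratically with the number of atoms, contact features for all heavy-atom pairs are high-dimensional, which is one reason this feature set is more expensive to process.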

In [None]:
vamp_con = pyemma.coordinates.vamp(data_con, lag=20,  dim=4)

In [None]:
vamp_con.score(score_method='VAMP1')

In [None]:
score_con = leave_out_one_cv_score(data_con)
print(score_con[0], '+-', score_con[1])

In [None]:
show_ic_fel(vamp_con)

In [None]:
fig, ax = plt.subplots()
rects1 = ax.bar([1, 2, 3], [score_dih[0], score_rmindist[0], score_con[0]], 
                yerr=[score_dih[1], score_rmindist[1], score_con[1]])
ax.set_ylabel('VAMP score')
ax.set_xlabel('feature set')
ax.set_xticks((1, 2, 3))
ax.set_xticklabels(('dihedrals', 'residue dist.', 'contacts'));