Markov state model for pentapeptide
=====

In this tutorial notebook we will give a brief overview of some of PyEMMA's capabilities by analyzing MD simulations of a pentapeptide. Only the first steps of loading the data, featurizing and clustering will be demonstrated. Please go through the notebook and complete the `#FIXME` comments that follow each TODO section.

Now we import a few general packages, including basic numerics and algebra routines (numpy) and plotting routines (matplotlib), and make sure that all plots are shown inside the notebook rather than in a separate window (nicer that way).

In [None]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
matplotlib.rcParams.update({'font.size': 12})

Now we import the PyEMMA modules required for the following steps.

In [None]:
import pyemma

Load pentapeptide coordinates and select features
------

We first have to load the PDB file and the trajectory data, in this case for WW-pentapeptide. They are stored on an FTP server and can easily be downloaded with mdshare. To install it, run `pip install mdshare`.

In [None]:
from mdshare import load

In [None]:
topfile = load('pentapeptide-impl-solv.pdb')
traj_list = [load('pentapeptide-%02d-500ns-impl-solv.xtc' % i) for i in range(25)]

##### TODO:
Here we decide which features to use in the further analysis. Since they have proven to be the best features in this case, please add the following to the featurizer:
- backbone torsions (with the *cossin* option) and
- chi1 sidechain torsions (with the *cossin* option).

As we want to do TICA on those coordinates, which requires subtracting the mean from each feature, we cannot use angles directly but have to transform them into a space where an arithmetic mean can be computed. We use the cos/sin transform to do this, specified by the *cossin* option.
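
To see why raw angles are a problem, here is a minimal NumPy sketch (independent of PyEMMA, with two made-up torsion values) that compares the arithmetic mean of two angles close to the periodic boundary with the mean taken after a cos/sin transform.

```python
import numpy as np

# Two made-up torsion angles (radians) close to the periodic boundary at +/- pi.
angles = np.array([np.pi - 0.1, -np.pi + 0.1])

# Arithmetic mean of the raw angles: ends up near 0, on the opposite side of the circle.
naive_mean = angles.mean()

# cos/sin transform: every angle becomes a point on the unit circle.
cossin = np.column_stack([np.cos(angles), np.sin(angles)])

# In this space the arithmetic mean is meaningful; map it back to an angle to check.
mean_vec = cossin.mean(axis=0)
circular_mean = np.arctan2(mean_vec[1], mean_vec[0])

print('mean of raw angles      :', naive_mean)
print('mean via cos/sin (angle):', circular_mean)
```

The mean of the raw angles points in the wrong direction, while the mean computed in cos/sin space lies at the boundary (magnitude close to pi), where both angles actually are.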

In [None]:
feat = pyemma.coordinates.featurizer(topfile)
#FIXME !

In [None]:
feat.dimension()

Now we define the source of input coordinates. We don't load them into memory at this stage, to demonstrate PyEMMA's ability to cope with huge amounts of data; they will be loaded as needed. Computing a few basic data statistics gives:
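
The idea of processing data "as needed" can be sketched without PyEMMA. The generator `chunks` below is a hypothetical stand-in for a trajectory reader (it just produces random numbers); the point is only that a statistic such as the per-feature mean can be accumulated block by block without ever holding all frames in memory.

```python
import numpy as np

def chunks(n_frames, n_features, chunk_size, seed=0):
    """Hypothetical stand-in for a trajectory reader: yields blocks of frames."""
    rng = np.random.default_rng(seed)
    for start in range(0, n_frames, chunk_size):
        stop = min(start + chunk_size, n_frames)
        yield rng.normal(loc=1.0, size=(stop - start, n_features))

# Accumulate the sum and the frame count one block at a time.
total = np.zeros(4)
count = 0
for block in chunks(n_frames=5001, n_features=4, chunk_size=1000):
    total += block.sum(axis=0)
    count += block.shape[0]

running_mean = total / count
print('frames processed:', count)
print('per-feature mean:', running_mean)
```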

In [None]:
inp = pyemma.coordinates.source(traj_list, feat)
print('number of trajectories = ',inp.number_of_trajectories())
print('trajectory length = ',inp.trajectory_length(0))
print('trajectory time step = ', 500.0 / (inp.trajectory_length(0)-1),'ns')
print('number of dimensions = ',inp.dimension())

TICA and clustering 
-----

For TICA we have to choose a *lag* time and define the output dimension. The dimension can either be set directly with the *dim* keyword or determined by specifying the fraction of the kinetic variance we want to keep. Here we choose 90%, which gives us three dimensions: most of the relevant kinetic information in the original feature space is contained in a three-dimensional subspace.
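
To illustrate what the lag time does, the following NumPy sketch runs a bare-bones TICA on synthetic two-dimensional data and compares it with PCA. It is not PyEMMA's implementation; the data, the variable names and the whitening approach are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, lag = 20000, 20

# Synthetic data: feature 0 hops slowly between -1 and +1,
# feature 1 is fast noise with a much larger amplitude.
flips = rng.random(n_frames) < 0.005
slow = np.where(np.cumsum(flips) % 2 == 0, 1.0, -1.0)
X = np.column_stack([slow + 0.1 * rng.normal(size=n_frames),
                     3.0 * rng.normal(size=n_frames)])

# Subtract the mean, then pair every frame with the frame `lag` steps later.
Xc = X - X.mean(axis=0)
X0, Xt = Xc[:-lag], Xc[lag:]

# Symmetrized instantaneous and time-lagged covariance matrices.
C0 = (X0.T @ X0 + Xt.T @ Xt) / (2 * len(X0))
Ct = (X0.T @ Xt + Xt.T @ X0) / (2 * len(X0))

# PCA direction: eigenvector of C0 with the largest variance.
s, U = np.linalg.eigh(C0)          # eigenvalues in ascending order
pca_direction = U[:, -1]

# TICA: solve Ct v = lambda C0 v by whitening with C0 first.
W = U / np.sqrt(s)                 # W.T @ C0 @ W is the identity
lam, V = np.linalg.eigh(W.T @ Ct @ W)
order = np.argsort(lam)[::-1]      # largest autocorrelation first
lam, components = lam[order], W @ V[:, order]

print('PCA direction   :', np.round(pca_direction, 2))
print('first TICA comp.:', np.round(components[:, 0], 2))
print('TICA eigenvalues:', np.round(lam, 2))
```

PCA picks the noisy high-variance feature, whereas the first TICA component is dominated by the slowly hopping feature, because TICA ranks directions by their autocorrelation at the chosen lag rather than by their variance.
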
##### TODO:
Please go ahead and define the ``tica_obj`` with a lag of 20 steps and the kinetic variance cutoff discussed above.

In [None]:
tica_obj = #FIXME !
print('TICA dimension ', tica_obj.dimension())

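We can have a look at the cumulative kinetic variance, which is analogous to the cumulative explained variance in PCA. Each entry gives the fraction of the total kinetic variance captured by the components up to and including that one, so the number of dimensions kept by the 90% cutoff is the position of the first entry that reaches 0.9.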
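
As a rough illustration of how a variance cutoff translates into a number of dimensions, the sketch below uses made-up eigenvalues (not results for the pentapeptide) and assumes that each component contributes its squared eigenvalue to the kinetic variance.

```python
import numpy as np

# Made-up TICA eigenvalues, sorted from the slowest to the fastest process.
eigenvalues = np.array([0.9, 0.8, 0.6, 0.3, 0.1, 0.05])

# Assumption: each component contributes its squared eigenvalue.
kinetic_variance = eigenvalues ** 2
cumvar = np.cumsum(kinetic_variance) / kinetic_variance.sum()

# Smallest number of components whose cumulative share reaches the cutoff.
var_cutoff = 0.9
n_dims = int(np.searchsorted(cumvar, var_cutoff) + 1)

print('cumulative kinetic variance:', np.round(cumvar, 3))
print('dimensions kept at 90%     :', n_dims)
```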

In [None]:
tica_obj.cumvar

Now we get the TICA output, i.e. the coordinates after being transformed to the three slowest components. You can think of this as a low-dimensional space of good reaction coordinates. 
Having a look at the shape of the output reveals that we still have 25 trajectories, each of length 5001, but now only three dimensions.
##### TODO:
Get the output of the ``tica_obj`` and store it in ``Y``.

In [None]:
Y = # FIXME
print('number of trajectories = ', np.shape(Y)[0])
print('number of frames = ', np.shape(Y)[1])
print('number of dimensions = ',np.shape(Y)[2])

Note that at this point we have loaded the compressed coordinates into memory. We don't have to do this, but it will significantly speed up any further analysis, and it is cheap because the data are now low-dimensional. In general, after the TICA transformation we can often keep the data in memory even when working with massive data sets of a large protein.

Now we look at the distribution along the two dominant TICA coordinates (three are hard to visualize). For that, we build a two-dimensional histogram of the first two TICA dimensions and then compute a free energy (in units of $k_B T$, up to an additive constant) by taking
$F_i = -\ln z_i$, where $z_i$ is the number of counts in bin $i$.
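
The formula can be tried out directly with NumPy. The following sketch uses synthetic two-dimensional samples as a stand-in for the first two TICA coordinates; the bin number and the shift of the minimum to zero are choices made for the illustration, not necessarily what the plotting function does internally.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 2D samples: a well-populated basin and a less populated one.
xy = np.vstack([
    rng.normal(loc=(-1.0, 0.0), scale=0.3, size=(6000, 2)),
    rng.normal(loc=(1.5, 0.5), scale=0.3, size=(2000, 2)),
])

# z holds the number of counts in each bin.
z, xedges, yedges = np.histogram2d(xy[:, 0], xy[:, 1], bins=50)

# F = -ln z on visited bins; bins that were never sampled get +inf.
F = np.full(z.shape, np.inf)
visited = z > 0
F[visited] = -np.log(z[visited])

# Only free energy differences matter, so shift the global minimum to zero.
F -= F[visited].min()

i, j = np.unravel_index(np.argmin(F), F.shape)
print('deepest bin starts at x =', xedges[i], ', y =', yedges[j])
```

The deepest minimum lies in the basin that holds more samples, which is the sense in which a histogram of counts is read as a free energy landscape.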

In [None]:
pyemma.plots.plot_free_energy(np.vstack(Y)[:,0], np.vstack(Y)[:,1]);

Let's have a look at what one of the trajectories looks like in the space of the first three TICA components. We can see that the TICA components nicely resolve the slow transitions as discrete jumps.

In [None]:
matplotlib.rcParams.update({'font.size': 14})
dt = 0.1  # time between frames in ns (500 ns / 5000 steps)
plt.figure(figsize=(8,5))
ax1=plt.subplot(311)
x = dt*np.arange(Y[0].shape[0])
plt.plot(x, Y[0][:,0]); plt.ylabel('IC 1'); plt.xticks([]); plt.yticks(np.arange(-8, 4, 2))
ax1=plt.subplot(312)
plt.plot(x, Y[0][:,1]); plt.ylabel('IC 2'); plt.xticks([]);  plt.yticks(np.arange(-6, 4, 2))
ax1=plt.subplot(313)
plt.plot(x, Y[0][:,2]); plt.xlabel('time / ns'); plt.ylabel('IC 3'); plt.yticks(np.arange(-4, 6, 2));

The TICA coordinates are now clustered into a number of discrete states using the k-means algorithm. The k-means algorithm requires the number of clusters *n_clusters* as input. The only metric available here is *euclidean*.
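
For readers who have not seen k-means before, here is a minimal NumPy sketch of the standard Lloyd iteration with a Euclidean metric on synthetic data. It is meant to show the principle only; the function `kmeans` and its arguments are made up for this example and do not reflect PyEMMA's implementation.

```python
import numpy as np

def assign(data, centers):
    """Index of the nearest center (Euclidean distance) for every point."""
    dist = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return dist.argmin(axis=1)

def kmeans(data, k, n_iter=50, seed=0):
    """Plain Lloyd iteration: assign points, then move centers to the cluster means."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(data, centers)
        new_centers = centers.copy()
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:          # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign(data, centers)

# Synthetic 3D data with three blobs, standing in for TICA coordinates.
rng = np.random.default_rng(2)
data = np.vstack([rng.normal(loc=c, scale=0.2, size=(200, 3)) for c in (-2.0, 0.0, 2.0)])

centers, labels = kmeans(data, k=3)
print('cluster centers:\n', np.round(centers, 2))
print('first ten labels:', labels[:10])
```

The array `labels` plays the role of a discrete trajectory: every frame is replaced by the index of its nearest cluster center.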

In [None]:
n_clusters = 250      # number of k-means clusters

##### TODO:
Cluster the data using k-means with ``k=n_clusters`` cluster centers.

In [None]:
clustering = # FIXME

The trajectories are now assigned to the cluster centers.

In [None]:
dtrajs = clustering.dtrajs

In [None]:
pyemma.plots.plot_free_energy(np.vstack(Y)[:,0], np.vstack(Y)[:,1])
cc_x = clustering.clustercenters[:,0]
cc_y = clustering.clustercenters[:,1]
plt.plot(cc_x,cc_y, linewidth=0, marker='o', markersize=5, color='black')

The states are well distributed over the sampled region of TICA space. Congratulations, you have finished all the steps of data processing necessary to build a Markov model.
##### TODO:
Check what results you get by
- using a different number of k-means cluster centers (going lower will speed up computations...)
- doing TICA with another lag time
- directly clustering in high dimensional space
- using PCA instead of TICA
- taking completely different input features
- etc

Please be creative at this point. If you want to keep this working example configuration as a reference, you can create a copy by clicking on File -> Make a copy.

You can find the user guide for PyEMMA's coordinates package here: http://www.emma-project.org/latest/api/index_coor.html