# Data input, featurization and coordinate transforms in PyEMMA
**Remember**:
- to run the currently highlighted cell, hold <kbd>&#x21E7; Shift</kbd> and press <kbd>&#x23ce; Enter</kbd>;
- to get help for a specific function, place the cursor within the function's brackets, hold <kbd>&#x21E7; Shift</kbd>, and press <kbd>&#x21E5; Tab</kbd>;
- you can find the full documentation at [PyEMMA.org](http://www.pyemma.org).

## Loading MD example data from our FTP server
Ingredients:
- Topology file: PDB
- Trajectory data: List of .XTC files

In [None]:
from mdshare import fetch

In [None]:
topfile = fetch('alanine-dipeptide-nowater.pdb', working_directory='data')
traj_list = [fetch('alanine-dipeptide-%d-250ns-nowater.xtc' % i, working_directory='data') for i in range(3)]

The `fetch` function fetches the data from our servers. **Do not use `mdshare` for your own data!**
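For your own simulations, the file names come from your local disk rather than from `mdshare`. A minimal sketch, assuming your trajectories are `.xtc` files stored in a single directory; the helper `collect_trajectories` and the directory layout are hypothetical and not part of PyEMMA or `mdshare`:

```python
from pathlib import Path


def collect_trajectories(directory, pattern='*.xtc'):
    """Return a sorted list of trajectory file paths (as strings) found in `directory`."""
    return sorted(str(path) for path in Path(directory).glob(pattern))
```

A list built this way can be used wherever `traj_list` appears below, together with the path to your own topology file.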

## Import PyEMMA & friends

In [None]:
import pyemma
import deeptime as dt
import pyemma.util.contexts
import numpy as np
import matplotlib.pyplot as plt
plt.matplotlib.rcParams.update({'font.size': 16})

## Several ways of processing the same data
### Backbone torsions
- The best possible description for Ala2
- Two dimensions that describe the full dynamics
- A priori known
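Each backbone torsion is a dihedral angle defined by four consecutive backbone atoms. The following sketch shows how such an angle can be computed from four positions with NumPy; the function `dihedral` and the test coordinates are illustrative only, and the featurizer computes these angles for you.

```python
import numpy as np


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (in radians) defined by four points."""
    p0, p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p0, p1, p2, p3))
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)  # unit vector along the central bond
    # project the outer bonds onto the plane perpendicular to the central bond
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return np.arctan2(y, x)


# cis arrangement: both outer points lie on the same side of the central bond
angle_cis = dihedral([1, 0, 0], [0, 0, 0], [0, 0, 1], [1, 0, 1])
# trans arrangement: the outer points lie on opposite sides
angle_trans = dihedral([1, 0, 0], [0, 0, 0], [0, 0, 1], [-1, 0, 1])
```

The cis arrangement gives an angle of zero, the trans arrangement an angle of magnitude $\pi$.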

#### Exercise: Define the featurizer and add backbone torsions.

In [None]:
bbtorsion_feat = # FIXME
# FIXME

In [None]:
bbtorsion_feat = pyemma.coordinates.featurizer(topfile)
bbtorsion_feat.add_backbone_torsions()

#### Exercise: Load the data into memory

In [None]:
bbtorsions = # FIXME

In [None]:
bbtorsions = pyemma.coordinates.load(traj_list, bbtorsion_feat)

In [None]:
pyemma.plots.plot_free_energy(np.concatenate(bbtorsions)[:, 0], np.concatenate(bbtorsions)[:, 1])
plt.xlabel(r'$\Phi$ / rad')
plt.ylabel(r'$\Psi$ / rad');
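A free energy plot of this kind rests on the relation $F = -k_B T \ln p$, where $p$ is the probability of visiting a histogram bin. The following is a minimal sketch of that idea on synthetic data, not PyEMMA's implementation; the function name `free_energy_2d` and its parameters are made up for illustration.

```python
import numpy as np


def free_energy_2d(x, y, bins=50, kT=1.0):
    """Estimate a 2D free energy surface F = -kT ln p from samples, shifted so that min(F) = 0."""
    counts, x_edges, y_edges = np.histogram2d(x, y, bins=bins)
    prob = counts / counts.sum()
    free_energy = np.full(prob.shape, np.inf)  # empty bins get infinite free energy
    visited = prob > 0
    free_energy[visited] = -kT * np.log(prob[visited])
    free_energy -= free_energy[visited].min()
    return free_energy, x_edges, y_edges


rng = np.random.default_rng(0)
samples = rng.normal(size=(10000, 2))  # synthetic stand-in for the torsion data
F, x_edges, y_edges = free_energy_2d(samples[:, 0], samples[:, 1], bins=20)
```

Frequently visited bins have low free energy, and bins that were never visited are left at infinity.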

### Heavy atom distances
- Usually a good choice without prior knowledge
- Very high-dimensional, even for this system
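The dimension grows quickly because $n$ atoms yield $n(n-1)/2$ pairwise distances. A small sketch on synthetic coordinates, ignoring periodic images; the helper `pairwise_distances`, the array shapes, and the selection of 10 atoms are assumptions made for illustration:

```python
from itertools import combinations

import numpy as np


def pairwise_distances(xyz, atom_indices):
    """Distances between all pairs of selected atoms.

    xyz has shape (n_frames, n_atoms, 3); the result has shape (n_frames, n_pairs).
    """
    pairs = np.array(list(combinations(atom_indices, 2)))
    diff = xyz[:, pairs[:, 0]] - xyz[:, pairs[:, 1]]
    return np.linalg.norm(diff, axis=-1)


rng = np.random.default_rng(0)
xyz = rng.normal(size=(5, 22, 3))  # synthetic coordinates: 5 frames, 22 atoms
selected = list(range(10))  # hypothetical selection of 10 atoms
distances = pairwise_distances(xyz, selected)
print(distances.shape)  # (5, 45), since 10 * 9 / 2 = 45
```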

#### Exercise: Define a second featurizer object and add heavy atom distances

In [None]:
heavy_atom_dist_feat =  # FIXME
heavy_atom_indices =  # FIXME
# FIXME

In [None]:
heavy_atom_dist_feat = pyemma.coordinates.featurizer(topfile)
heavy_atom_indices = heavy_atom_dist_feat.select_Heavy()

heavy_atom_dist_feat.add_distances(heavy_atom_indices, periodic=False)

In [None]:
print(heavy_atom_indices)

In [None]:
heavy_atom_dist_feat.dimension()

In [None]:
heavy_atom_distances = pyemma.coordinates.load(traj_list, heavy_atom_dist_feat)

#### Exercise: Visualize the heavy atom distances.

In [None]:
fig, ax = plt.subplots(figsize=(10, 14))
pyemma.plots.plot_feature_histograms(np.concatenate(heavy_atom_distances), feature_labels=heavy_atom_dist_feat, ax=ax)
ax.set_xlabel('heavy atom distance')
ax.set_title('distance histograms per dimension (normalized)');

## VAMP-scoring: Which features are best?
We already learned that two dimensions are a good choice for our data. Now, we want to compare different input features with the VAMP-2 score.
Please complete the next task at the following lag times:

In [None]:
dim = 2
lags = [10, 100, 1000]  # ps
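As background for the scoring exercise: for mean-free features, the VAMP-2 score can be written as the squared Frobenius norm of $C_{00}^{-1/2} C_{0\tau} C_{\tau\tau}^{-1/2}$, where $C_{00}$, $C_{0\tau}$ and $C_{\tau\tau}$ are the instantaneous and time-lagged covariance matrices. The sketch below computes this in-sample score (no cross-validation) for a single synthetic trajectory. The function names are hypothetical, and library implementations may differ in details such as regularization, dimension truncation, and the treatment of the constant function.

```python
import numpy as np


def inv_sqrt(matrix, epsilon=1e-10):
    """Inverse square root of a symmetric positive semi-definite matrix."""
    eigenvalues, eigenvectors = np.linalg.eigh(matrix)
    keep = eigenvalues > epsilon  # discard numerically zero directions
    return eigenvectors[:, keep] @ np.diag(eigenvalues[keep] ** -0.5) @ eigenvectors[:, keep].T


def vamp2_score(traj, lag):
    """In-sample VAMP-2 score for one trajectory of shape (n_frames, n_features)."""
    x = traj[:-lag]
    y = traj[lag:]
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    n = len(x)
    c00 = x.T @ x / n
    c0t = x.T @ y / n
    ctt = y.T @ y / n
    koopman = inv_sqrt(c00) @ c0t @ inv_sqrt(ctt)
    return np.linalg.norm(koopman, ord='fro') ** 2


# synthetic AR(1) process with autocorrelation 0.9 at lag 1
rng = np.random.default_rng(0)
noise = rng.normal(size=20000)
signal = np.zeros(20000)
for t in range(1, 20000):
    signal[t] = 0.9 * signal[t - 1] + noise[t]
score = vamp2_score(signal[:, None], lag=1)
```

For a single feature the score reduces to the squared autocorrelation at the chosen lag, so `score` should be close to $0.9^2 = 0.81$ here.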

#### Exercise: Perform cross-validated VAMP-scoring for backbone torsions and heavy-atom distances.

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharey=True)

labels = ['backbone\ntorsions', 'heavy atom\ndistances']

vamp_estimator = dt.decomposition.VAMP(lagtime=lags[0], dim=dim)

for ax, lag in zip(axes.flat, lags):
    vamp_estimator.lagtime = lag
    torsions_scores = dt.decomposition.vamp_score_cv(vamp_estimator, trajs=bbtorsions,
                                                     blocksplit=False, n=3)
    scores = [torsions_scores.mean()]
    errors = [torsions_scores.std()]
    distances_scores = dt.decomposition.vamp_score_cv(vamp_estimator, trajs=heavy_atom_distances,
                                                      blocksplit=False, n=3)
    scores += [distances_scores.mean()]
    errors += [distances_scores.std()]
    ax.bar(labels, scores, yerr=errors, color=['C0', 'C1', 'C2'])
    ax.set_title(r'lag time $\tau$={}ps'.format(lag))

axes[0].set_ylabel('VAMP-2 score')
fig.tight_layout()

#### Discussion:
Which feature looks best and why?

## TICA projection of heavy atom distances
#### Exercise: Do a TICA projection of the heavy atom distances

In [None]:
tica = # FIXME

In [None]:
tica_estimator = dt.decomposition.TICA(lagtime=10, var_cutoff=0.95)
tica = tica_estimator.fit(heavy_atom_distances).fetch_model()

In [None]:
tica.output_dimension
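The output dimension follows from `var_cutoff`. A sketch of how such a cutoff can be turned into a number of dimensions, assuming it is applied to the cumulative sum of squared eigenvalues (the kinetic variance); the helper `dimension_from_cutoff` and the example eigenvalues are made up for illustration:

```python
import numpy as np


def dimension_from_cutoff(eigenvalues, var_cutoff):
    """Smallest number of components whose squared eigenvalues reach the cutoff fraction."""
    squared = np.sort(np.abs(eigenvalues))[::-1] ** 2
    cumulative = np.cumsum(squared) / squared.sum()
    dim = int(np.searchsorted(cumulative, var_cutoff) + 1)
    return min(dim, len(squared))  # guard against round-off at var_cutoff = 1


print(dimension_from_cutoff(np.array([0.9, 0.5, 0.1]), var_cutoff=0.95))  # 2
```

In this toy example the first component carries about 76% of the summed squared eigenvalues and the first two about 99%, so two dimensions are kept.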

In [None]:
tics = tica.transform(heavy_atom_distances)

In [None]:
pyemma.plots.plot_free_energy(np.concatenate(tics)[:, 0], np.concatenate(tics)[:, 1])
plt.xlabel('TIC 1') 
plt.ylabel('TIC 2');

#### Exercise: Perform a PCA projection of heavy atom distances

In [None]:
pca = pyemma.coordinates.pca()  # FIXME

In [None]:
pca = pyemma.coordinates.pca(heavy_atom_distances, dim=2)

In [None]:
pcs = [pca.transform(traj) for traj in heavy_atom_distances]
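PCA looks for the directions of largest variance and, unlike TICA, does not use any time-lagged information. A minimal NumPy sketch of the projection on synthetic data; the function `pca_project` is hypothetical and not the PyEMMA implementation:

```python
import numpy as np


def pca_project(data, dim=2):
    """Project data of shape (n_frames, n_features) onto its `dim` leading principal components."""
    centered = data - data.mean(axis=0)
    # rows of vt are the principal axes, ordered by decreasing singular value
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained_variance = singular_values ** 2 / (len(data) - 1)
    return centered @ vt[:dim].T, explained_variance[:dim]


rng = np.random.default_rng(0)
# synthetic data: large variance along the first feature, small along the others
data = rng.normal(size=(1000, 3)) * np.array([10.0, 1.0, 0.1])
projection, variance = pca_project(data, dim=2)
```

The first column of `projection` follows the high-variance feature, regardless of how slowly or quickly that feature changes in time.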

In [None]:
pyemma.plots.plot_free_energy(np.concatenate(pcs)[:, 0], np.concatenate(pcs)[:, 1])
plt.xlabel('PC 1')
plt.ylabel('PC 2');

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
t = ['backbone torsions', 'TICs', 'PCs']
for n, _y in enumerate([bbtorsions, tics, pcs]):
    pyemma.plots.plot_free_energy(np.concatenate(_y)[:, 0], np.concatenate(_y)[:, 1], ax=axes[n], cbar=False)
    axes[n].set_title(t[n])

#### Discussion:
What do you think are the differences between the plots in terms of the dynamics they describe?

## Different ways of discretizing the output

In [None]:
y = bbtorsions  # if you want, you can change this later and try e.g. the TICA transformed data

#### Exercise: Perform k-means clustering and plot the cluster centers into the free energy landscape

In [None]:
clustering_kmeans =  # FIXME

In [None]:
kmeans_estimator = dt.clustering.KMeans(75, max_iter=30)
stride = 10
clustering_kmeans = kmeans_estimator.fit(np.concatenate(y)[::stride]).fetch_model()
# different k, stride, max_iter can be used!

In [None]:
fig, ax = plt.subplots()
# FIXME
pyemma.plots.plot_free_energy(*np.concatenate(y).T, ax=ax)
ax.set_xlabel(r'$\Phi$ / rad')
ax.set_ylabel(r'$\Psi$ / rad');

In [None]:
fig, ax = plt.subplots()
ax.plot(*clustering_kmeans.cluster_centers.T, 'ko')
pyemma.plots.plot_free_energy(*np.concatenate(y).T, ax=ax)
ax.set_xlabel(r'$\Phi$ / rad')
ax.set_ylabel(r'$\Psi$ / rad');
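To see what k-means does internally, here is a minimal sketch of Lloyd's algorithm, which alternates between assigning frames to their nearest center and moving each center to the mean of its members. It is not the deeptime implementation; the function `kmeans`, its `init` argument, and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def kmeans(data, k, init=None, max_iter=30, seed=0):
    """Lloyd's algorithm for data of shape (n_frames, n_dims); returns (centers, labels)."""
    if init is None:
        # default initialization: k distinct frames drawn at random
        rng = np.random.default_rng(seed)
        init = data[rng.choice(len(data), size=k, replace=False)]
    centers = np.array(init, dtype=float)

    def assign(current_centers):
        # index of the nearest center for every frame
        distances = np.linalg.norm(data[:, None, :] - current_centers[None, :, :], axis=-1)
        return distances.argmin(axis=1)

    for _ in range(max_iter):
        labels = assign(centers)
        # move each center to the mean of its members; empty clusters stay in place
        new_centers = np.array([
            data[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign(centers)


# synthetic data: two well-separated blobs, initialized with one frame from each
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
                  rng.normal(10.0, 0.5, size=(100, 2))])
centers, labels = kmeans(data, k=2, init=data[[0, -1]])
```

The number of centers `k` is fixed in advance, and the centers end up where the data density is high.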

#### Exercise: Do the same with regular space clustering

In [None]:
clustering_regspace = # FIXME
clustering_regspace.n_clusters

In [None]:
regspace_estimator = dt.clustering.RegularSpace(dmin=0.4)
clustering_regspace = regspace_estimator.fit(np.concatenate(y)).fetch_model()
clustering_regspace.n_clusters

In [None]:
fig, ax = plt.subplots()
ax.plot(*clustering_regspace.cluster_centers.T, 'ko')
pyemma.plots.plot_free_energy(*np.concatenate(y).T, ax=ax)
ax.set_xlabel(r'$\Phi$ / rad')
ax.set_ylabel(r'$\Psi$ / rad');
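Regular space clustering can be sketched as a single pass over the data: a frame becomes a new center whenever it is farther than `dmin` from every existing center. The following is an illustrative NumPy version on a toy data set, not the deeptime implementation; the function `regular_space_centers` is hypothetical.

```python
import numpy as np


def regular_space_centers(data, dmin):
    """Pick centers such that no two centers are closer than `dmin` (single pass over the data)."""
    centers = [data[0]]
    for frame in data[1:]:
        # a frame becomes a new center if it is farther than dmin from all existing centers
        if np.min(np.linalg.norm(np.asarray(centers) - frame, axis=1)) > dmin:
            centers.append(frame)
    return np.asarray(centers)


points = (np.arange(11) / 10)[:, None]  # 0.0, 0.1, ..., 1.0 as a one-dimensional data set
centers = regular_space_centers(points, dmin=0.35)  # centers at 0.0, 0.4 and 0.8
```

Here the number of clusters is not chosen beforehand but results from `dmin` and from the volume that the data covers.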

#### Discussion:
In your group, discuss the differences between the two clustering algorithms. Which one do you think is better? Which one is faster?

## Add-on: A quick MSM estimate to check our work
If you are already familiar with Markov state modeling, have a look at the following plots. They tell us which combination of features, projection, and clustering preserves the slowest process in the system. Further, we might find that in some cases the MSM implied timescales converge faster than in others.
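As a reminder, the implied timescales of an MSM estimated at lag time $\tau$ follow from the eigenvalues $\lambda_i$ of its transition matrix via $t_i = -\tau / \ln|\lambda_i|$. A small sketch with a toy two-state transition matrix; the function `timescales_from_transition_matrix` is hypothetical and unrelated to the helper imported in the next cell.

```python
import numpy as np


def timescales_from_transition_matrix(transition_matrix, lag):
    """Implied timescales t_i = -lag / ln|lambda_i| for all eigenvalues except the stationary one."""
    eigenvalues = np.linalg.eigvals(transition_matrix)
    magnitudes = np.sort(np.abs(eigenvalues))[::-1]
    return -lag / np.log(magnitudes[1:])  # magnitudes[0] = 1 is the stationary process


T = np.array([[0.9, 0.1],
              [0.1, 0.9]])  # toy two-state transition matrix
its = timescales_from_transition_matrix(T, lag=1)
```

The eigenvalues of this matrix are 1 and 0.8, so the single implied timescale is $-1 / \ln 0.8 \approx 4.5$ in units of the lag time.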

In [None]:
fig, axes = plt.subplots(3, 3, figsize=(15, 13))
t = ['backbone torsions', 'TICs', 'PCs']
from timescales import implied_timescales_msm
with pyemma.util.contexts.settings(show_progress_bars=False):
    for n, _y in enumerate([bbtorsions, tics, pcs]):
        pyemma.plots.plot_free_energy(*np.concatenate(_y).T, ax=axes[0][n], cbar=False)
        axes[0][n].set_title(t[n], fontweight='bold')

        data = np.concatenate(_y)[::100]
        clusterings = [
            dt.clustering.KMeans(75, max_iter=30).fit(data).fetch_model(),
            dt.clustering.RegularSpace(dmin=0.4 if n==0 else .4 / (2.2 * n)).fit(data).fetch_model()
        ]
        for cl_n, cl_obj in enumerate(clusterings):
            axes[0][n].plot(*cl_obj.cluster_centers.T, 'ko' if cl_n == 0 else 'rs', alpha=.8)
            dtrajs = [cl_obj.transform(traj) for traj in _y]
            its = implied_timescales_msm(dtrajs, lagtimes=[1, 2, 4, 6, 8], nits=4, bayesian=False)
            pyemma.plots.plot_implied_timescales(its, ax=axes[cl_n+1][n])
            axes[cl_n+1][n].set_ylim(1e-1, 3e3)
            axes[cl_n+1][n].set_ylabel('')
axes[1][0].set_ylabel('k-means clustering', fontweight='bold')
axes[2][0].set_ylabel('regspace clustering', fontweight='bold')

fig.tight_layout()