# 01 - Data-I/O and featurization

In this notebook, we will cover how to load (and visualize) molecular simulation data.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import mdshare
import pyemma

## Case 1: preprocessed data (toy model)
In the most convenient case, we already have preprocessed time series data available in some kind of archive which can be read using `numpy`:

In [None]:
file = mdshare.fetch('hmm-doublewell-2d-100k.npz', working_directory='data')
with np.load(file) as fh:
    data = fh['trajectory']

print(data)

Once we have the data in memory, we can use one of `pyemma`'s plotting functions to visualize what we have loaded:

In [None]:
pyemma.plots.plot_feature_histograms(data, feature_labels=['$x$', '$y$'])

The `plot_feature_histograms()` function visualizes the distributions of all degrees of freedom where we assume that the columns of `data` represent different features and the rows represent different time steps.
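
To make the row/column convention concrete, here is a minimal sketch that builds one normalized histogram per column using plain `numpy`. The synthetic `toy_data` array and the bin count are assumptions for illustration only; this is not meant to reproduce how `plot_feature_histograms()` works internally.

```python
import numpy as np

# Hypothetical toy data: 1000 time steps (rows) of two features (columns).
rng = np.random.default_rng(0)
toy_data = np.column_stack([
    rng.normal(-1.0, 0.5, size=1000),
    rng.normal(2.0, 1.0, size=1000),
])

# Iterating over the transpose visits one feature (column) at a time.
histograms = []
for column in toy_data.T:
    density, edges = np.histogram(column, bins=20, density=True)
    histograms.append((density, edges))

print(len(histograms))  # one histogram per feature
```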

While `plot_feature_histograms()` can handle arbitrary numbers of features, we have an additional plotting function for the special case of two features:

In [None]:
fig, ax = pyemma.plots.plot_free_energy(*data.T)
ax.set_xlabel('$x$')
ax.set_ylabel('$y$')
fig.tight_layout()

The `plot_free_energy()` function makes a two-dimensional histogram for the given features. It visualizes their free energy surface which is defined by the negative logarithm of the probability computed from the histogram counts.
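
As a rough illustration of that definition, the following sketch estimates a free energy surface in units of $k_\mathrm{B}T$ with plain `numpy`. The helper name `free_energy_from_samples`, the synthetic Gaussian samples, and the choice to shift the minimum to zero are assumptions made for this example; `plot_free_energy()` may differ in binning, normalization, and energy offset.

```python
import numpy as np


def free_energy_from_samples(x, y, nbins=50):
    """Estimate F = -ln(p) on a 2D grid from samples x and y (hypothetical helper)."""
    counts, xedges, yedges = np.histogram2d(x, y, bins=nbins)
    prob = counts / counts.sum()
    # Empty bins have zero probability, i.e. infinite free energy.
    free_energy = np.full(prob.shape, np.inf)
    visited = prob > 0
    free_energy[visited] = -np.log(prob[visited])
    # Shift so that the most populated bin sits at zero.
    free_energy -= free_energy.min()
    return free_energy, xedges, yedges


rng = np.random.default_rng(0)
x = rng.normal(size=10000)
y = rng.normal(size=10000)
fes, xedges, yedges = free_energy_from_samples(x, y, nbins=20)
print(fes.shape)  # (20, 20)
```

Bins that were never visited are left at infinity here, which is why sparsely sampled regions show up as empty areas in such plots.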

## Case 2: loading `*.dcd` files (alanine dipeptide)
To load molecular dynamics data from one of the standard file formats (here, `*.dcd`), we need not only the actual simulation data but also a topology file. This requirement might differ for other formats, though.

In [None]:
pdb = mdshare.fetch('alanine-dipeptide-nowater.pdb', working_directory='data')
files = mdshare.fetch('alanine-dipeptide-*-250ns-nowater.dcd', working_directory='data')
print(pdb)
print(files)

We can have a look at the structure with the aid of `nglview`. We first load the PDB file into memory with `mdtraj` and then visualize it.

In [None]:
import mdtraj
import nglview
from nglview.player import TrajectoryPlayer
import os
from threading import Timer

widget = nglview.show_mdtraj(mdtraj.load(
    os.path.join('data', 'alanine-dipeptide-nowater.pdb')))
p = TrajectoryPlayer(widget)
widget._camera_orientation = [13, 9.2, -8.2, 0, 11.95,
                              -5.7, 10.95, 0, 3.16, -13.3, -10.4, 0, -2.4, -22.3, 0, 1]
p.spin = True
def stop_spin():
    p.spin = False
Timer(30, stop_spin).start()

widget

We start with creating a featurizer object using the topology file:

In [None]:
feat = pyemma.coordinates.featurizer(pdb)

Next, we start adding features which we want to extract from the simulation data. Here, we want to load the backbone torsions:

In [None]:
feat.add_backbone_torsions()

We can always call the featurizer's `describe()` method to show which features are requested:

In [None]:
print(feat.describe())

After we have selected all desired features, we can call the `load()` function to load all features into memory or, alternatively, the `source()` function to create a streamed feature reader. For now, we will use `load()`:

In [None]:
data = pyemma.coordinates.load(files, features=feat)

print(data)

Apparently, we have loaded a list of three two-dimensional `numpy.ndarray` objects from our three trajectory files. We can visualize these features using the aforementioned plotting functions, but to do so we have to concatenate the three individual trajectories:

In [None]:
pyemma.plots.plot_feature_histograms(np.concatenate(data), feature_labels=[r'$\Phi$', r'$\Psi$'])

In [None]:
fig, ax = pyemma.plots.plot_free_energy(*np.concatenate(data).T)
ax.set_xlabel(r'$\Phi$')
ax.set_ylabel(r'$\Psi$')
fig.tight_layout()
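
The concatenation step can be sketched with plain `numpy`. The three arrays below are hypothetical stand-ins for the loaded trajectories; only their shapes matter. Note that concatenating discards the boundaries between trajectories, which is fine for histograms but should be kept in mind for any analysis that depends on time ordering.

```python
import numpy as np

# Hypothetical trajectories of different lengths, each with two features.
trajs = [np.zeros((100, 2)), np.ones((250, 2)), np.full((50, 2), 2.0)]

# np.concatenate stacks along axis 0 (time) by default.
stacked = np.concatenate(trajs)
print(stacked.shape)  # (400, 2)

# Unpacking the transpose yields one 1D array per feature, which is
# the form passed on via `*np.concatenate(data).T`.
phi, psi = stacked.T
print(phi.shape, psi.shape)  # (400,) (400,)
```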

Let us look at a different featurization example and load the positions of all heavy atoms instead. We create a new featurizer object and use its `add_selection()` method to request the positions of a given selection of atoms. For this selection, we can use the `select_Heavy()` method which returns the indices of all heavy atoms.

Again, we load the data into memory and show what we loaded using the `describe()` method:

In [None]:
feat = pyemma.coordinates.featurizer(pdb)
feat.add_selection(feat.select_Heavy())

data = pyemma.coordinates.load(files, features=feat)

feat_desc = feat.describe()
print(feat_desc)

And we visualize the distributions of all loaded features:

In [None]:
fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(np.concatenate(data), feature_labels=feat_desc, ax=ax)
fig.tight_layout()
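
Selecting atom positions turns every selected atom into three features (its $x$, $y$, and $z$ coordinates). The sketch below mimics this with plain `numpy`; the coordinate array `xyz` and the index array `selection` are hypothetical and merely illustrate the reshaping, not the featurizer's actual implementation.

```python
import numpy as np

n_frames, n_atoms = 4, 6
# Hypothetical coordinates with shape (frames, atoms, 3).
xyz = np.arange(n_frames * n_atoms * 3, dtype=float).reshape(n_frames, n_atoms, 3)

# Hypothetical indices of the selected (e.g. heavy) atoms.
selection = np.array([0, 2, 5])

# Keep only the selected atoms and flatten (atom, xyz) into one feature axis.
features = xyz[:, selection, :].reshape(n_frames, -1)
print(features.shape)  # (4, 9)
```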

### `load()` versus `source()`
Using `load()`, we read the full data set into memory, which is feasible for all examples in this tutorial:

In [None]:
print(data)

Many real-world applications, though, require more memory than your workstation might provide. For these cases, you should use the `source()` function:

In [None]:
data = pyemma.coordinates.source(files, features=feat)

print(data)

This function allows us to stream the data and work on chunks instead of the full set. Most of the functions in the `coordinates` submodule accept data in memory as well as streamed feature readers, but some plotting functions can only work with data in memory. To load a (strided) subset into memory, we can use the `get_output()` method with a stride parameter:

In [None]:
data_out = data.get_output(stride=5)

print(data_out)

We have now loaded every fifth frame into memory and we can visualize the (concatenated) features:

In [None]:
fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(np.concatenate(data_out), feature_labels=feat_desc, ax=ax)
fig.tight_layout()
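
The idea behind streaming and striding can be sketched with plain `numpy`. The generator `iter_chunks` below is a hypothetical helper, not part of `pyemma`: it walks over in-memory arrays for simplicity, whereas a real streamed reader would fetch each chunk from disk. The example accumulates a mean chunk by chunk and compares it with the in-memory result.

```python
import numpy as np


def iter_chunks(trajs, chunksize=1000, stride=1):
    """Yield strided blocks of frames, one chunk at a time (hypothetical helper)."""
    for traj in trajs:
        strided = traj[::stride]
        for start in range(0, len(strided), chunksize):
            yield strided[start:start + chunksize]


rng = np.random.default_rng(42)
trajs = [rng.normal(size=(n, 3)) for n in (5000, 3000, 4000)]

# Accumulate the feature means without ever holding all frames at once.
total = np.zeros(3)
n_frames = 0
for chunk in iter_chunks(trajs, chunksize=1000, stride=5):
    total += chunk.sum(axis=0)
    n_frames += len(chunk)
streamed_mean = total / n_frames

# Reference: the same strided frames, fully loaded into memory.
in_memory_mean = np.concatenate([t[::5] for t in trajs]).mean(axis=0)
print(n_frames)  # 2400
```

Quantities that can be accumulated as running sums work naturally on chunks, whereas plotting a histogram of all frames at once requires the (possibly strided) data in memory.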

### Exercise: heavy atom distances
Please fix the following code block such that the distances between all heavy atoms are loaded and visualized.

**Hint**: you might find the `add_distances()` method of the featurizer object helpful.

In [None]:
feat = pyemma.coordinates.featurizer(pdb)
feat. #FIXME

data = pyemma.coordinates.load(files, features=feat)

fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(np.concatenate(data), feature_labels=feat.describe(), ax=ax)
fig.tight_layout()

In [None]:
feat = pyemma.coordinates.featurizer(pdb)
feat.add_distances(feat.select_Heavy())

data = pyemma.coordinates.load(files, features=feat)

fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(np.concatenate(data), feature_labels=feat.describe(), ax=ax)
fig.tight_layout()
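
For $n$ selected atoms, all-pairs distances give $n(n-1)/2$ features per frame. The following sketch computes them with plain `numpy` for hypothetical random coordinates; it ignores periodic boundary conditions and is meant only to illustrate the feature count, not the featurizer's implementation.

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(1)
n_frames, n_atoms = 5, 4
# Hypothetical coordinates with shape (frames, atoms, 3).
xyz = rng.normal(size=(n_frames, n_atoms, 3))

# All unique atom pairs (i < j): n_atoms * (n_atoms - 1) / 2 of them.
pairs = np.array(list(combinations(range(n_atoms), 2)))

# Euclidean distance for every pair in every frame.
dist = np.linalg.norm(xyz[:, pairs[:, 0]] - xyz[:, pairs[:, 1]], axis=2)
print(dist.shape)  # (5, 6)
```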

## Case 3: loading `*.xtc` files (pentapeptide)
The handling of `*.xtc` files is identical to that of `*.dcd` files or any other standard molecular dynamics file format supported by `pyemma`'s dependency `mdtraj`. Once we have obtained the raw data files...

In [None]:
pdb = mdshare.fetch('pentapeptide-impl-solv.pdb', working_directory='data')
files = mdshare.fetch('pentapeptide-*-500ns-impl-solv.xtc', working_directory='data')

... and had a quick look at the structure again...

In [None]:
widget = nglview.show_mdtraj(mdtraj.load(
    os.path.join('data', 'pentapeptide-impl-solv.pdb')))
p = TrajectoryPlayer(widget)
widget.add_ball_and_stick()
p.spin = True
def stop_spin():
    p.spin = False
Timer(30, stop_spin).start()
widget

... we can load a selection of features into memory. Here, we want the $\cos/\sin$ transformations of the backbone and $\chi_1$ sidechain torsions.

In [None]:
feat = pyemma.coordinates.featurizer(pdb)
feat.add_backbone_torsions(cossin=True)
feat.add_sidechain_torsions(which='chi1', cossin=True)

data = pyemma.coordinates.load(files, features=feat)

feat_desc = feat.describe()
print(feat_desc)

Finally, we visualize the (concatenated) features:

In [None]:
fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(np.concatenate(data), feature_labels=feat_desc, ax=ax)
fig.tight_layout()
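
The reason for the $\cos/\sin$ transformation is the periodicity of torsion angles: values close to $-\pi$ and $\pi$ describe nearly the same conformation but are numerically far apart. A minimal `numpy` sketch with two hypothetical angles shows how mapping each angle onto the unit circle removes this artificial gap, at the cost of doubling the number of features.

```python
import numpy as np

# Two hypothetical torsion angles that are nearly identical on the circle.
angles = np.array([-np.pi + 0.01, np.pi - 0.01])

# Compared as raw numbers, they appear almost 2*pi apart.
raw_gap = abs(angles[1] - angles[0])

# Each angle becomes a (cos, sin) pair, i.e. a point on the unit circle.
embedded = np.column_stack([np.cos(angles), np.sin(angles)])
embedded_gap = np.linalg.norm(embedded[1] - embedded[0])

print(raw_gap, embedded_gap)
```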

### Exercises: feature selection and visualization

**Exercise 1**: Complete the following code block to load/visualize the distances between all $\text{C}_\alpha$ carbon atoms.

**Hint**: You might find the `add_distances_ca()` method of the featurizer object helpful.

In [None]:
feat = pyemma.coordinates.featurizer(pdb)
feat. #FIXME

data = pyemma.coordinates.load(files, features=feat)

fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(np.concatenate(data), feature_labels=feat.describe(), ax=ax)
fig.tight_layout()

In [None]:
feat = pyemma.coordinates.featurizer(pdb)
feat.add_distances_ca()

data = pyemma.coordinates.load(files, features=feat)

fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(np.concatenate(data), feature_labels=feat.describe(), ax=ax)
fig.tight_layout()

**Exercise 2**: Complete the following code block to load/visualize the minimal distances between all residues.

**Hint**: You might find the `add_residue_mindist()` method of the featurizer object helpful.

In [None]:
feat = pyemma.coordinates.featurizer(pdb)
feat. #FIXME

data = pyemma.coordinates.load(files, features=feat)

fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(np.concatenate(data), feature_labels=feat.describe(), ax=ax)
fig.tight_layout()

In [None]:
feat = pyemma.coordinates.featurizer(pdb)
feat.add_residue_mindist()

data = pyemma.coordinates.load(files, features=feat)

fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(np.concatenate(data), feature_labels=feat.describe(), ax=ax)
fig.tight_layout()

**Exercise 3**: Complete the following code block to load/visualize the positions of all backbone atoms.

**Hint**: You might find the `select_Backbone()` method of the featurizer object helpful.

In [None]:
feat = pyemma.coordinates.featurizer(pdb)
feat. #FIXME

data = pyemma.coordinates.load(files, features=feat)

fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(np.concatenate(data), feature_labels=feat.describe(), ax=ax)
fig.tight_layout()

In [None]:
feat = pyemma.coordinates.featurizer(pdb)
feat.add_selection(feat.select_Backbone())

data = pyemma.coordinates.load(files, features=feat)

fig, ax = plt.subplots(figsize=(10, 12))
pyemma.plots.plot_feature_histograms(np.concatenate(data), feature_labels=feat.describe(), ax=ax)
fig.tight_layout()

**Exercise 4**: Complete the following code block to load/visualize the positions of all $\text{C}_\alpha$ atoms.

**Hint**: You might find the `select_Ca()` method of the featurizer object helpful.

In [None]:
feat = pyemma.coordinates.featurizer(pdb)
feat. #FIXME

data = pyemma.coordinates.load(files, features=feat)

fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(np.concatenate(data), feature_labels=feat.describe(), ax=ax)
fig.tight_layout()

In [None]:
feat = pyemma.coordinates.featurizer(pdb)
feat.add_selection(feat.select_Ca())

data = pyemma.coordinates.load(files, features=feat)

fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(np.concatenate(data), feature_labels=feat.describe(), ax=ax)
fig.tight_layout()

## Wrapping up
In this notebook, we have learned how to load and visualize molecular simulation data with `pyemma`. In detail, we have used
- `pyemma.coordinates.featurizer()` to define a selection of features we want to extract,
- `pyemma.coordinates.load()` to load data into memory, and
- `pyemma.coordinates.source()` to create a streamed feature reader in case the data does not fit into memory.

After loading the data into memory, we have used

- `pyemma.plots.plot_feature_histograms()` to show the distributions of all loaded features and
- `pyemma.plots.plot_free_energy()` to visualize the free energy surface of two selected features.