# 01 - Data-I/O and featurization
In this notebook, we will cover how to load (and visualize) molecular simulation data.

<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons Licence" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" title='This work is licensed under a Creative Commons Attribution 4.0 International License.' align="right"/></a>

Maintainers: [@cwehmeyer](https://github.com/cwehmeyer), [@marscher](https://github.com/marscher), [@thempel](https://github.com/thempel), [@psolsson](https://github.com/psolsson)

Remember, to
- run the currently highlighted cell, hold <kbd>&#x21E7; Shift</kbd> and press <kbd>&#x23ce; Enter</kbd>;
- get help for a specific function, place the cursor within the function's brackets, hold <kbd>&#x21E7; Shift</kbd>, and press <kbd>&#x21E5; Tab</kbd>;
- find the full documentation at [PyEMMA.org](http://www.pyemma.org).

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import mdshare
import pyemma
# for visualization of molecular structures:
import nglview
import mdtraj
from threading import Timer
from nglview.player import TrajectoryPlayer

## Case 1: preprocessed data (toy model)
In the most convenient case, we already have preprocessed time series data available in some kind of archive which can be read using `numpy`:

In [None]:
file = mdshare.fetch('hmm-doublewell-2d-100k.npz', working_directory='data')
with np.load(file) as fh:
    data = fh['trajectory']

print(data)

Once we have the data in memory, we can use one of `pyemma`'s plotting functions to visualize what we have loaded:

In [None]:
pyemma.plots.plot_feature_histograms(data, feature_labels=['$x$', '$y$']);

The `plot_feature_histograms()` function visualizes the distributions of all degrees of freedom where we assume that the columns of `data` represent different features and the rows represent different time steps.
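To make the row/column convention concrete, here is a minimal sketch that uses only `numpy` and synthetic data. The array `toy_data` and the bin count are illustrative assumptions, not part of the tutorial data; the loop merely mimics the idea of computing one histogram per feature column.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a trajectory: 1000 time steps (rows), 2 features (columns).
toy_data = np.column_stack([
    rng.normal(0.0, 1.0, size=1000),
    rng.normal(3.0, 0.5, size=1000),
])
n_frames, n_features = toy_data.shape

# One normalized histogram per feature, i.e. per column.
histograms = []
for i in range(n_features):
    density, edges = np.histogram(toy_data[:, i], bins=50, density=True)
    histograms.append((density, edges))

print('frames:', n_frames, 'features:', n_features)
```

If your own array has the features along the rows instead, transposing it with `.T` restores the expected layout.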

While `plot_feature_histograms()` can handle arbitrary numbers of features, we have additional plotting functions for the special case of two features. First, we visualize the sample density in the $x/y$-plane...

In [None]:
fig, ax, misc = pyemma.plots.plot_density(*data.T)
ax.set_xlabel('$x$')
ax.set_ylabel('$y$')
ax.set_xlim(-4, 4)
ax.set_ylim(-4, 4)
ax.set_aspect('equal')
fig.tight_layout()

... then, we show the corresponding free energy:

In [None]:
fig, ax, misc = pyemma.plots.plot_free_energy(*data.T, legacy=False)
ax.set_xlabel('$x$')
ax.set_ylabel('$y$')
ax.set_xlim(-4, 4)
ax.set_ylim(-4, 4)
ax.set_aspect('equal')
fig.tight_layout()

Both functions make a two-dimensional histogram for the given features; the free energy surface is defined by the negative logarithm of the probability computed from the histogram counts.
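The relation between histogram counts and free energy can be sketched with plain `numpy`. This is an illustration under stated assumptions (synthetic Gaussian samples, 30 bins, energies in units of $kT$, minimum shifted to zero), not a reproduction of the internals of `plot_free_energy()`.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=50000)
y = rng.normal(size=50000)

# Two-dimensional histogram of the samples.
counts, xedges, yedges = np.histogram2d(x, y, bins=30)

# Convert counts to probabilities.
prob = counts / counts.sum()

# Free energy is the negative logarithm of the probability;
# empty bins are left at infinity.
free_energy = np.full(prob.shape, np.inf)
nonzero = prob > 0
free_energy[nonzero] = -np.log(prob[nonzero])

# Shift so that the most populated bin has zero free energy.
free_energy -= free_energy[nonzero].min()
```

Bins without any samples have zero estimated probability, which is why they are kept at infinite free energy here rather than being assigned a finite value.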

Please note that these functions visualize the density and free energy of the **sampled data**, not the equilibrium distribution of the underlying system. To account for nonequilibrium data, you can supply frame-wise weights using the `weights` parameter.
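The effect of frame-wise weights can be illustrated in one dimension with `numpy` alone. In this sketch, the samples are drawn uniformly (a deliberately non-equilibrium sampling), and the weights $\exp(-x^2/2)$ are an assumed, known reweighting factor chosen for illustration; in practice, such weights would have to come from your own analysis.

```python
import numpy as np

rng = np.random.default_rng(3)

# Uniform sampling: every region is visited equally often.
x = rng.uniform(-2.0, 2.0, size=200000)

# Hypothetical frame-wise weights proportional to a Boltzmann factor exp(-x^2 / 2).
weights = np.exp(-0.5 * x ** 2)

bins = np.linspace(-2.0, 2.0, 21)
p_unweighted, _ = np.histogram(x, bins=bins)
p_weighted, _ = np.histogram(x, bins=bins, weights=weights)
p_unweighted = p_unweighted / p_unweighted.sum()
p_weighted = p_weighted / p_weighted.sum()

# Free energies (up to an additive constant) from both histograms.
f_unweighted = -np.log(p_unweighted)
f_weighted = -np.log(p_weighted)

print('unweighted spread:', f_unweighted.max() - f_unweighted.min())
print('weighted barrier :', f_weighted[0] - f_weighted[9])
```

The unweighted free energy is essentially flat because it only reflects where we sampled, whereas the weighted one recovers the harmonic shape encoded in the weights.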

## Case 2: loading `*.xtc` files (alanine dipeptide)
To load molecular dynamics data from one of the standard file formats (here, `*.xtc`), we need not only the actual simulation data but also a topology file. Other formats may have different requirements.

In [None]:
pdb = mdshare.fetch('alanine-dipeptide-nowater.pdb', working_directory='data')
files = mdshare.fetch('alanine-dipeptide-*-250ns-nowater.xtc', working_directory='data')
print(pdb)
print(files)

We can have a look at the structure with the aid of `nglview`. We first load the PDB file into memory with `mdtraj` and then visualize it. The widget will auto-close after 30 seconds; if you want to watch it again, please re-execute the cell below.

In [None]:
widget = nglview.show_mdtraj(mdtraj.load(pdb))
p = TrajectoryPlayer(widget)
widget.add_ball_and_stick()
p.spin = True
def stop_spin():
    p.spin = False
    widget.close()
Timer(30, stop_spin).start()
widget

We start with creating a featurizer object using the topology file:

In [None]:
feat = pyemma.coordinates.featurizer(pdb)

Next, we start adding features which we want to extract from the simulation data. Here, we want to load the backbone torsions:

In [None]:
feat.add_backbone_torsions()

We can always call the featurizer's `describe()` method to show which features are requested:

In [None]:
print(feat.describe())

After we have selected all desired features, we can call the `load()` function to load all features into memory or, alternatively, the `source()` function to create a streamed feature reader. For now, we will use `load()`:

In [None]:
data = pyemma.coordinates.load(files, features=feat)

print('type of data:', type(data))
print('lengths:', len(data))
print('shape of elements:', data[0].shape)

Apparently, we have loaded a list of three two-dimensional `numpy.ndarray` objects from our three trajectory files. We can visualize these features using the aforementioned plotting functions, but to do so we have to concatenate the three individual trajectories first.
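As a small illustration of what concatenation does to such a list, consider the following sketch. The three arrays and their lengths are made up for this example; only the shape logic carries over to the real data.

```python
import numpy as np

# Hypothetical stand-ins for three trajectories with two features each.
trajs = [
    np.zeros((100, 2)),
    np.ones((150, 2)),
    np.full((50, 2), 2.0),
]

# Stack all frames along the time axis; the feature axis is untouched.
concatenated = np.concatenate(trajs)
lengths = [len(t) for t in trajs]

print('individual lengths:', lengths)
print('concatenated shape:', concatenated.shape)
```

Concatenation is fine for histograms, which ignore the order of frames. For time-lagged analyses, the list of separate trajectories should be kept, because the frames on either side of a trajectory boundary are not consecutive in time.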

We can now quantify the kinetic variance captured by the selected features by computing a VAMP-2 score. The minimum value of this score is 1, which corresponds to the invariant measure, i.e., the equilibrium.

With the `dim` parameter, we specify the number of dynamic processes we want to score. This becomes important later on, when we want to compare different input features: if we did not fix this number, the score would not have a fixed upper bound.

In [None]:
score_phi_psi = pyemma.coordinates.vamp(
    data[:-1], dim=2).score(
        test_data=data[-1:],
        score_method='VAMP2')
print('VAMP2-score: {:f}'.format(score_phi_psi))

A score of $\approx 1.5$ means that we have the constant contribution of $1$ plus a total contribution of $0.5$ from the remaining dynamic processes.
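To see where the constant $1$ and the additional contribution come from, the following `numpy` sketch computes a VAMP-2 score by hand for a toy process. It rests on several assumptions: the function `vamp2_score` is a hypothetical helper (not PyEMMA's implementation), it estimates and scores on the same data instead of using a separate test set, it does not truncate to a fixed `dim`, and the data is a synthetic one-dimensional autoregressive process with coefficient `a`.

```python
import numpy as np


def vamp2_score(traj, lag=1):
    """Hand-written VAMP-2 score of a (T, d) trajectory (no cross-validation)."""
    # Time-lagged pairs with the mean removed.
    x0 = traj[:-lag] - traj[:-lag].mean(axis=0)
    xt = traj[lag:] - traj[lag:].mean(axis=0)
    n = len(x0)

    # Instantaneous and time-lagged covariance matrices.
    c00 = x0.T @ x0 / n
    c0t = x0.T @ xt / n
    ctt = xt.T @ xt / n

    def inv_sqrt(c):
        # Inverse matrix square root via eigendecomposition (c must be full rank).
        eigvals, eigvecs = np.linalg.eigh(c)
        return eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

    # Whitened Koopman matrix; its squared singular values sum to the
    # squared Frobenius norm.
    koopman = inv_sqrt(c00) @ c0t @ inv_sqrt(ctt)

    # Add 1 for the constant singular function.
    return 1.0 + np.sum(koopman ** 2)


rng = np.random.default_rng(0)
a = 0.7  # assumed autocorrelation of the toy process
noise = rng.normal(size=100000)
x = np.zeros(len(noise))
for t in range(1, len(x)):
    x[t] = a * x[t - 1] + noise[t]

score = vamp2_score(x[:, None], lag=1)
print('VAMP2-score: {:f}'.format(score))
```

For this toy process, the single non-trivial singular value is the lag-1 autocorrelation, so the score should be close to $1 + a^2$. Slower processes (larger `a`) therefore contribute more to the score.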

In [None]:
data_concatenated = np.concatenate(data)
pyemma.plots.plot_feature_histograms(data_concatenated, feature_labels=feat);

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
# the * operator used in a function call is used to unpack
# the iterable variable into its single elements. 
pyemma.plots.plot_density(*data_concatenated.T, ax=axes[0])
pyemma.plots.plot_free_energy(*data_concatenated.T, ax=axes[1], legacy=False)
for ax in axes.flat:
    ax.set_xlabel(r'$\Phi$')
    ax.set_aspect('equal')
axes[0].set_ylabel(r'$\Psi$')
fig.tight_layout()

Let us look at a different featurization example and load the positions of all heavy atoms instead. We create a new featurizer object and use its `add_selection()` method to request the positions of a given selection of atoms. For this selection, we can use the `select_Heavy()` method which returns the indices of all heavy atoms.
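Conceptually, selecting atom positions as features amounts to picking atoms from a coordinate array and flattening their $x$, $y$, $z$ components into columns. The sketch below assumes a coordinate array of shape `(n_frames, n_atoms, 3)` (the layout used by `mdtraj`) filled with dummy numbers, and a made-up index list `selection` standing in for the heavy atom indices.

```python
import numpy as np

n_frames, n_atoms = 5, 4

# Dummy coordinates with shape (n_frames, n_atoms, 3).
xyz = np.arange(n_frames * n_atoms * 3, dtype=float).reshape(n_frames, n_atoms, 3)

# Hypothetical indices of the selected (e.g. heavy) atoms.
selection = np.array([0, 2, 3])

# Keep the selected atoms and flatten (atom, xyz) into one feature axis.
features = xyz[:, selection, :].reshape(n_frames, -1)

print('feature matrix shape:', features.shape)
```

Each selected atom contributes three columns, so the number of features is three times the number of selected atoms.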

Again, we load the data into memory and show what we loaded using the `describe()` method:

In [None]:
feat = pyemma.coordinates.featurizer(pdb)
feat.add_selection(feat.select_Heavy())

data = pyemma.coordinates.load(files, features=feat)

feat.describe()

And we visualize the distributions of all loaded features:

In [None]:
fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(np.concatenate(data), feature_labels=feat, ax=ax)
fig.tight_layout()

Again we have a look at the VAMP-2 score of the heavy atom coordinates.

In [None]:
score_heavy_atoms = pyemma.coordinates.vamp(
    data[:-1], dim=2).score(
        test_data=data[-1:],
        score_method='VAMP2')
print('VAMP2-score: {:f}'.format(score_heavy_atoms))

As we see, the score for the heavy atom positions is much higher than the one for the $\phi/\psi$ torsion angles. We will learn later what this means.

In [None]:
feat = pyemma.coordinates.featurizer(pdb)
heavy_atom_distance_pairs = feat.pairs(feat.select_Heavy())
feat.add_distances(heavy_atom_distance_pairs)
data = pyemma.coordinates.load(files, features=feat)

print(feat.describe())

Now let us compare the score of heavy atom distance pairs to the other scores.

In [None]:
# we fix dim=2 as before so that the scores remain comparable
score_heavy_atom_dists = pyemma.coordinates.vamp(
    data[:-1], dim=2).score(
        test_data=data[-1:],
        score_method='VAMP2')
print('VAMP2-score: {:f}'.format(score_heavy_atom_dists))

It seems that the heavy atom distance pairs cover a similar amount of kinetic variance as the heavy atom positions themselves, but a bit less.
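The idea behind a list of distance pairs can be pictured with a few lines of `numpy`. This is a sketch under assumptions: the coordinates are random numbers, `selection` is a made-up index list, all unique pairs are enumerated with `itertools.combinations`, and periodic boundary conditions are ignored.

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(1)

# Random stand-in coordinates with shape (n_frames, n_atoms, 3).
xyz = rng.normal(size=(10, 5, 3))

# Hypothetical indices of the selected atoms.
selection = [0, 1, 3, 4]

# All unique index pairs (i, j) with i < j.
pairs = np.array(list(combinations(selection, 2)))

# Euclidean distance for every pair in every frame.
diff = xyz[:, pairs[:, 0], :] - xyz[:, pairs[:, 1], :]
distances = np.linalg.norm(diff, axis=2)

print('number of pairs:', len(pairs))
print('feature matrix shape:', distances.shape)
```

For $n$ selected atoms there are $n(n-1)/2$ pairs, so the number of distance features grows quadratically with the size of the selection.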

### `load()` versus `source()`
Using `load()`, we put the full data into memory, which is possible for all examples in this tutorial:

In [None]:
print(data)

Many real-world applications, though, require more memory than your workstation might provide. For these cases, you should use the `source()` function:

In [None]:
data = pyemma.coordinates.source(files, features=feat)
print(data)

This function allows us to stream the data and work on chunks instead of the full set. Most of the functions in the `coordinates` sub-package accept data in memory as well as streamed feature readers, but some plotting functions can only work with data in memory. To load a (strided) subset into memory, we can use the `get_output()` method with a `stride` parameter:

In [None]:
data_output = data.get_output(stride=5)
print('number of trajectories: {}'.format(len(data_output)))
print('number of frames in first file: {}'.format(data.trajectory_length(0)))
print('number of frames after striding: {}'.format(len(data_output[0])))

We now have loaded every fifth frame into memory and we can visualize the (concatenated) features:

In [None]:
fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(
    np.concatenate(data_output), feature_labels=feat, ax=ax)
fig.tight_layout()
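
To make the chunked access pattern behind streaming and the effect of a stride more concrete, here is a minimal `numpy` sketch. The generator `iter_chunks`, the chunk size, and the synthetic trajectory are assumptions for illustration; they do not reproduce how a PyEMMA reader is implemented.

```python
import numpy as np


def iter_chunks(traj, chunksize):
    """Yield consecutive blocks of at most `chunksize` frames."""
    for start in range(0, len(traj), chunksize):
        yield traj[start:start + chunksize]


rng = np.random.default_rng(0)
traj = rng.normal(size=(1003, 2))  # synthetic trajectory, 1003 frames

# Streaming estimate of the mean: only one chunk is processed at a time.
total = np.zeros(traj.shape[1])
count = 0
for chunk in iter_chunks(traj, chunksize=100):
    total += chunk.sum(axis=0)
    count += len(chunk)
streamed_mean = total / count

# Striding keeps every fifth frame, starting with the first one.
strided = traj[::5]

print('frames seen while streaming:', count)
print('frames after striding:', len(strided))
```

Quantities that can be accumulated chunk by chunk, such as means or covariances, never require the full data set in memory, which is the main benefit of streaming.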

### Testing your progress

In the remainder of this notebook, you will find short exercises where you can put your newly learned skills to the test. The exercises are announced by the keyword **Exercise** and followed by an incomplete cell where you have to fill in missing parts, indicated by
```python
#FIXME
```
After that comes a button (**Show Solution**) to reveal the solution.

**Exercise 1**: Please fix the following code block such that the distances between all heavy atoms are loaded and visualized.

**Hint**: you might find the `add_distances()` method of the featurizer object helpful.

In [None]:
feat = pyemma.coordinates.featurizer(pdb)
pairs = feat.pairs(# FIXME)
feat. #FIXME

data = pyemma.coordinates.load(files, features=feat)

fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(np.concatenate(data), feature_labels=feat, ax=ax)
fig.tight_layout()

In [None]:
feat = pyemma.coordinates.featurizer(pdb)
pairs = feat.pairs(feat.select_Heavy())
feat.add_distances(pairs, periodic=False)

data = pyemma.coordinates.load(files, features=feat)

fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(np.concatenate(data), feature_labels=feat, ax=ax)
fig.tight_layout()

## Case 3: loading `*.xtc` files (pentapeptide)
Once we have obtained the raw data files...

In [None]:
pdb = mdshare.fetch('pentapeptide-impl-solv.pdb', working_directory='data')
files = mdshare.fetch('pentapeptide-*-500ns-impl-solv.xtc', working_directory='data')

... and had a quick look at the structure again...

In [None]:
widget = nglview.show_mdtraj(mdtraj.load(pdb))
p = TrajectoryPlayer(widget)
widget.add_ball_and_stick()
p.spin = True
def stop_spin():
    p.spin = False
    widget.close()
Timer(30, stop_spin).start()
widget

... we can load a selection of features into memory. Here, we want the $\cos/\sin$ transformations of the backbone and $\chi_1$ sidechain torsions.
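The reason for the $\cos/\sin$ transformation is the periodicity of angles: two torsions just below $\pi$ and just above $-\pi$ describe nearly the same conformation but differ by almost $2\pi$ as raw numbers. The following `numpy` sketch, with two made-up angle values, shows how the transformation removes this artificial gap.

```python
import numpy as np

# Two hypothetical torsion angles on either side of the periodic boundary.
angles = np.array([-np.pi + 0.01, np.pi - 0.01])

# Difference of the raw values: almost a full turn.
raw_gap = abs(angles[0] - angles[1])

# Map each angle to a point on the unit circle.
embedded = np.column_stack([np.cos(angles), np.sin(angles)])

# Euclidean distance between the embedded points: very small.
embedded_gap = np.linalg.norm(embedded[0] - embedded[1])

print('gap between raw angles: {:.3f}'.format(raw_gap))
print('gap after cos/sin     : {:.3f}'.format(embedded_gap))
```

The price of this transformation is that every torsion angle is represented by two features instead of one.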

In [None]:
feat = pyemma.coordinates.featurizer(pdb)
feat.add_backbone_torsions(cossin=True, periodic=False)
feat.add_sidechain_torsions(which='chi1', cossin=True, periodic=False)

data = pyemma.coordinates.load(files, features=feat)
data_concatenated = np.concatenate(data)

print(feat.describe())

Finally, we visualize the (concatenated) features:

In [None]:
fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(data_concatenated, feature_labels=feat, ax=ax)
fig.tight_layout()

### Exercises: feature selection and visualization

**Exercise 2**: Complete the following code block to load/visualize the distances between all $\text{C}_\alpha$ carbon atoms.

**Hint**: You might find the `add_distances_ca()` method of the featurizer object helpful.

In [None]:
feat = pyemma.coordinates.featurizer(pdb)
feat. #FIXME

data = pyemma.coordinates.load(files, features=feat)
data_concatenated = np.concatenate(data)

fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(data_concatenated, feature_labels=feat, ax=ax)
fig.tight_layout()

In [None]:
feat = pyemma.coordinates.featurizer(pdb)
feat.add_distances_ca(periodic=False)

data = pyemma.coordinates.load(files, features=feat)
data_concatenated = np.concatenate(data)

fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(data_concatenated, feature_labels=feat, ax=ax)
fig.tight_layout()

**Exercise 3**: Complete the following code block to load/visualize the minimal distances between all residues.

**Hint**: You might find the `add_residue_mindist()` method of the featurizer object helpful.
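For intuition about what a minimal residue distance is, the sketch below computes it by hand for one pair of residues. Everything here is assumed for illustration: the coordinates are random, `residue_a` and `residue_b` are made-up atom index lists, and options such as periodic boundaries or restricting the search to particular atom types are ignored. It is not the solution to the exercise, which only needs the featurizer method named in the hint.

```python
import numpy as np

rng = np.random.default_rng(2)

# Random stand-in coordinates with shape (n_frames, n_atoms, 3).
xyz = rng.normal(size=(8, 6, 3))

# Hypothetical atom indices belonging to two residues.
residue_a = [0, 1, 2]
residue_b = [3, 4, 5]

a = xyz[:, residue_a, :]  # shape (n_frames, 3, 3)
b = xyz[:, residue_b, :]  # shape (n_frames, 3, 3)

# Distances between every atom of residue a and every atom of residue b.
diff = a[:, :, None, :] - b[:, None, :, :]
dist = np.linalg.norm(diff, axis=-1)

# The minimal distance per frame is the feature for this residue pair.
mindist = dist.min(axis=(1, 2))

print('one value per frame:', mindist.shape)
```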

In [None]:
feat = pyemma.coordinates.featurizer(pdb)
feat. #FIXME

data = pyemma.coordinates.load(files, features=feat)
data_concatenated = np.concatenate(data)

fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(data_concatenated, feature_labels=feat, ax=ax)
fig.tight_layout()

In [None]:
feat = pyemma.coordinates.featurizer(pdb)
feat.add_residue_mindist(periodic=False)

data = pyemma.coordinates.load(files, features=feat)
data_concatenated = np.concatenate(data)

fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(data_concatenated, feature_labels=feat, ax=ax)
fig.tight_layout()

**Exercise 4**: Complete the following code block to load/visualize the positions of all backbone atoms.

**Hint**: You might find the `select_Backbone()` method of the featurizer object helpful.

In [None]:
feat = pyemma.coordinates.featurizer(pdb)
feat. #FIXME

data = pyemma.coordinates.load(files, features=feat)
data_concatenated = np.concatenate(data)

fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(data_concatenated, feature_labels=feat, ax=ax)
fig.tight_layout()

In [None]:
feat = pyemma.coordinates.featurizer(pdb)
feat.add_selection(feat.select_Backbone())

data = pyemma.coordinates.load(files, features=feat)
data_concatenated = np.concatenate(data)

fig, ax = plt.subplots(figsize=(10, 12))
pyemma.plots.plot_feature_histograms(data_concatenated, feature_labels=feat, ax=ax)
fig.tight_layout()

**Exercise 5**: Complete the following code block to load/visualize the positions of all $\text{C}_\alpha$ atoms.

**Hint**: You might find the `select_Ca()` method of the featurizer object helpful.

In [None]:
feat = pyemma.coordinates.featurizer(pdb)
feat. #FIXME

data = pyemma.coordinates.load(files, features=feat)
data_concatenated = np.concatenate(data)

fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(data_concatenated, feature_labels=feat, ax=ax)
fig.tight_layout()

In [None]:
feat = pyemma.coordinates.featurizer(pdb)
feat.add_selection(feat.select_Ca())

data = pyemma.coordinates.load(files, features=feat)
data_concatenated = np.concatenate(data)

fig, ax = plt.subplots(figsize=(10, 7))
pyemma.plots.plot_feature_histograms(data_concatenated, feature_labels=feat, ax=ax)
fig.tight_layout()

## Wrapping up
In this notebook, we have learned how to load and visualize molecular simulation data with `pyemma`. In detail, we have used
- `pyemma.coordinates.featurizer()` to define a selection of features we want to extract,
- `pyemma.coordinates.load()` to load data into memory, and
- `pyemma.coordinates.source()` to create a streamed feature reader in case the data does not fit into memory.

After loading the data into memory, we have used
- `pyemma.coordinates.vamp().score()` to score the quality of the features,
- `pyemma.plots.plot_feature_histograms()` to show the distributions of all loaded features,
- `pyemma.plots.plot_density()` to visualize the sample density, and
- `pyemma.plots.plot_free_energy()` to visualize the free energy surface of two selected features.