# 04 - MSM coarse graining and analysis
In this notebook, we will cover how to coarse grain an MSM onto the metastable states and analyze the modelled process.

In [None]:
%%javascript
Jupyter.utils.load_extensions('exercise2/main')

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import mdshare
import pyemma

## Case 1: preprocessed, two-dimensional data (toy model)
We load the two-dimensional trajectory from an archive using `numpy`, directly discretize the full space using $k$-means clustering, visualize the marginal and joint distributions of both components as well as the cluster centers, and show the implied timescale (ITS) convergence:

In [None]:
file = mdshare.fetch('hmm-doublewell-2d-100k.npz', working_directory='data')
with np.load(file) as fh:
    data = fh['trajectory']

cluster = pyemma.coordinates.cluster_kmeans(data, k=50, max_iter=50)
its = pyemma.msm.its(cluster.dtrajs, lags=[1, 2, 3, 5, 7, 10], nits=3, errors='bayes')

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
pyemma.plots.plot_feature_histograms(data, feature_labels=['$x$', '$y$'], ax=axes[0])
axes[1].scatter(*data.T, s=1, alpha=0.3)
axes[1].scatter(*cluster.clustercenters.T, s=15)
axes[1].set_xlabel('$x$')
axes[1].set_ylabel('$y$')
pyemma.plots.plot_implied_timescales(its, ylog=False, ax=axes[2])
fig.tight_layout()

We then estimate an MSM at lag time $1$ step

In [None]:
msm = pyemma.msm.estimate_markov_model(cluster.dtrajs, lag=1)

and check for disconnectivity. The MSM is constructed on the largest set of discrete states that are (reversibly) connected and the `active_state_fraction` and `active_count_fraction` show us the fraction of discrete states and transition counts from our data which are part of this largest set:

In [None]:
print('fraction of states used = {:f}'.format(msm.active_state_fraction))
print('fraction of counts used = {:f}'.format(msm.active_count_fraction))

The fraction is, in both cases, $1$ and, thus, we have no disconnected states (which we would have to exclude from our analysis).

If there were any disconnectivity in our data (fractions $<1$), we could access the indices of the **active states** (members of the largest connected set) via the `active_set` attribute:

In [None]:
print(msm.active_set)
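
To make the notion of the largest connected set concrete, here is a minimal NumPy sketch on a hypothetical count matrix `C` (not taken from the data above). It assumes that the active count fraction is the share of counts between active states; PyEMMA's internal bookkeeping may differ in detail.

```python
import numpy as np

# Hypothetical count matrix: state 3 can be entered from state 1 but never left.
C = np.array([[10, 2, 0, 0],
              [3, 8, 1, 1],
              [0, 2, 9, 0],
              [0, 0, 0, 5]])

n = C.shape[0]
# reach[i, j] is True if j can be reached from i in one step (or i == j).
reach = (C > 0) | np.eye(n, dtype=bool)
# Repeated squaring yields the transitive closure for this small example.
for _ in range(n):
    reach = reach | ((reach.astype(int) @ reach.astype(int)) > 0)

# Two states are reversibly connected if each can reach the other.
strong = reach & reach.T
sizes = strong.sum(axis=1)
active_set = np.flatnonzero(strong[np.argmax(sizes)])

active_state_fraction = active_set.size / n
active_count_fraction = C[np.ix_(active_set, active_set)].sum() / C.sum()

print(active_set)
print(active_state_fraction, active_count_fraction)
```

In this toy example, state 3 is dropped because no observed transition leads out of it, so both fractions fall below $1$.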

With this potential issue out of the way, we can extract our first (stationary/thermodynamic) property, the `stationary_distribution` or `pi`:

In [None]:
print(msm.stationary_distribution)
print('msm.stationary_distribution is msm.pi: {}'.format(
    np.all(msm.stationary_distribution == msm.pi)))
print('sum of weights = {:f}'.format(msm.pi.sum()))

The attribute `msm.pi` tells us, for each discrete state, the absolute probability of observing said state in global equilibrium. Mathematically speaking, the stationary distribution $\pi$ is the left eigenvector of the transition matrix $\mathbf{T}$ to the eigenvalue $1$:

$$\pi^\top \mathbf{T} = \pi^\top.$$
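
The eigenvector relation can be checked numerically. The sketch below uses a hypothetical three-state transition matrix (an assumption for illustration, unrelated to the MSM above) and plain NumPy rather than PyEMMA:

```python
import numpy as np

# Hypothetical 3-state row-stochastic transition matrix.
T = np.array([[0.90, 0.10, 0.00],
              [0.05, 0.90, 0.05],
              [0.00, 0.10, 0.90]])

# Left eigenvectors of T are right eigenvectors of its transpose.
eigvals, eigvecs = np.linalg.eig(T.T)
idx = np.argmax(eigvals.real)      # pick the eigenvalue 1 (the largest one)
pi = eigvecs[:, idx].real
pi = pi / pi.sum()                 # normalize to a probability vector

print(pi)
print(np.allclose(pi @ T, pi))     # pi is invariant under T
```

Dividing by the sum fixes both the arbitrary scale and the arbitrary sign of the eigenvector returned by the solver.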

We can use the stationary distribution to, e.g., visualize the weight of the discrete states and, thus, to highlight which areas of our feature space are most probable. Here, we show all data points in a two-dimensional scatter plot and color/weight them according to their discrete state membership:

In [None]:
fig, ax = plt.subplots()
im = ax.scatter(*data.T, s=1, c=msm.pi[cluster.dtrajs[0]])
ax.scatter(*cluster.clustercenters.T, s=15, c='black', marker='x')
ax.set_xlabel('$x$')
ax.set_ylabel('$y$')
cb = fig.colorbar(im, ax=ax)
cb.set_label('stationary weight')
fig.tight_layout()

The stationary distribution can also be used to reweight the free energy surface computed by `pyemma.plots.plot_free_energy()` in the case where the data points alone are not sufficient to compute the free energy surface by binning, i.e., if the data points are not sampled from global equilibrium.

In this case, we assign the weight of the corresponding discrete state to each data point and pass this information to the plotting function via its `weights` parameter:

In [None]:
fig, ax = pyemma.plots.plot_free_energy(
    *data.T,
    weights=msm.pi[cluster.dtrajs[0]])
ax.set_xlabel('$x$')
ax.set_ylabel('$y$')
fig.tight_layout()
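
The idea behind the reweighting can be sketched with a weighted histogram followed by a negative logarithm. This is only an illustration with hypothetical one-dimensional data and made-up weights; it is not meant to reproduce what `plot_free_energy()` does internally.

```python
import numpy as np

# Hypothetical data: a 1D trajectory that oversamples the left well.
x = np.array([-1.2, -1.0, -0.9, -1.1, 0.9, 1.0, 1.1, -1.0])
dtraj = (x > 0).astype(int)        # two discrete states: left (0) and right (1)
pi = np.array([0.2, 0.8])          # stand-in for msm.pi

weights = pi[dtraj]                # one weight per frame
hist, edges = np.histogram(x, bins=2, weights=weights, density=True)

free_energy = -np.log(hist)        # in units of kT
free_energy -= free_energy.min()   # shift the minimum to zero

print(free_energy)
```

Although five of the eight frames lie in the left bin, the weighted histogram assigns the lower free energy to the right bin.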

We will see further uses of the stationary distribution later. But for now, we continue the analysis of our model by visualizing its (right) eigenvectors:

In [None]:
eigvec = msm.eigenvectors_right()

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for i, ax in enumerate(axes.flat):
    im = ax.scatter(
        *cluster.clustercenters.T,
        c=np.round(eigvec[:, i], 5),
        s=70,
        cmap=mpl.cm.bwr,
        edgecolor='k',
        linewidth=0.2)
    cb = plt.colorbar(im, ax=ax)
    ax.set_xlabel('$x$')
    ax.set_ylabel('$y$')
    cb.set_label('{}. right eigenvector'.format(i + 1))
fig.tight_layout()

The right eigenvectors can be used to visualize the processes governed by the corresponding implied timescales. The first right eigenvector (always) is $(1,\dots,1)^\top$ for an MSM's transition matrix and it corresponds to the stationary process (infinite implied timescale).

The second right eigenvector corresponds to the slowest process, and shows negative entries for one group of discrete states and positive values for the other group. This tells us that the slowest process happens between these two groups and it relaxes on the slowest ITS ($\approx 8.5$ steps).

The third eigenvector shows a larger spread of values and no clear grouping. In combination with the ITS convergence plot, we can safely assume that this eigenvector contains just noise and does not indicate any resolved process.
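
The link between eigenvectors, eigenvalues, and implied timescales can be illustrated with a small, hypothetical transition matrix made of two weakly coupled groups of states. The matrix and the lag time of one step are assumptions for this sketch; the timescales follow $t_i = -\tau / \ln|\lambda_i|$.

```python
import numpy as np

# Hypothetical 4-state transition matrix: groups {0, 1} and {2, 3} exchange rarely.
T = np.array([[0.80, 0.19, 0.01, 0.00],
              [0.19, 0.80, 0.00, 0.01],
              [0.01, 0.00, 0.80, 0.19],
              [0.00, 0.01, 0.19, 0.80]])
lag = 1

eigvals, eigvecs = np.linalg.eig(T)
order = np.argsort(-eigvals.real)          # sort by decreasing eigenvalue
eigvals = eigvals.real[order]
eigvecs = eigvecs.real[:, order]

# First right eigenvector: constant, belongs to eigenvalue 1.
print(eigvecs[:, 0] / eigvecs[0, 0])
# Second right eigenvector: the sign pattern separates the two groups.
print(np.sign(eigvecs[:, 1]))

# Implied timescales of the non-stationary processes.
timescales = -lag / np.log(np.abs(eigvals[1:]))
print(timescales)
```

The slowest timescale is far larger than the remaining two, which mirrors the separation between a resolved slow process and fast processes discussed above.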

We then continue to validate our MSM with a CK test for both metastable states which are already indicated by the second right eigenvector.

In [None]:
nstates = 2
pyemma.plots.plot_cktest(msm.cktest(nstates));
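
The CK test compares predictions $\mathbf{T}(\tau)^k$ with independent estimates $\mathbf{T}(k\tau)$. The following sketch shows this comparison on a simulated, hypothetical two-state chain using simple row-normalized counts. The helper functions `count_matrix` and `transition_matrix` are made up for this illustration and are not PyEMMA functions.

```python
import numpy as np

def count_matrix(dtraj, lag, n_states):
    """Count transitions i -> j separated by `lag` steps (sliding window)."""
    C = np.zeros((n_states, n_states))
    np.add.at(C, (dtraj[:-lag], dtraj[lag:]), 1)
    return C

def transition_matrix(C):
    """Row-normalize a count matrix (non-reversible estimate)."""
    return C / C.sum(axis=1, keepdims=True)

# Simulate a hypothetical two-state Markov chain.
rng = np.random.default_rng(42)
T_true = np.array([[0.95, 0.05],
                   [0.10, 0.90]])
u = rng.random(50000)
dtraj = np.zeros(u.size, dtype=int)
for t in range(1, u.size):
    # Stay in / jump to state 0 with probability T_true[previous state, 0].
    dtraj[t] = int(u[t] >= T_true[dtraj[t - 1], 0])

k = 5
T_pred = np.linalg.matrix_power(transition_matrix(count_matrix(dtraj, 1, 2)), k)
T_est = transition_matrix(count_matrix(dtraj, k, 2))

print(np.abs(T_pred - T_est).max())
```

For Markovian data, the deviation between prediction and estimate should be on the order of the statistical error.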

We currently have an MSM with 50 discrete states which has been validated for two metastable states. Using the `coarse_grain()` method, we can obtain an actual two-state MSM by mapping all states within a metastable set into one metastable state:

In [None]:
coarse_msm = msm.coarse_grain(nstates)
print(coarse_msm.transition_matrix)

We can compute the stationary weights of both metastable states by summing the stationary weights of the original MSM:

In [None]:
print('π_0 = {:f}'.format(msm.pi[coarse_msm.metastable_sets[0]].sum()))
print('π_1 = {:f}'.format(msm.pi[coarse_msm.metastable_sets[1]].sum()))

Via the `metastable_assignments` attribute, we can create a trajectory of metastable states. We can then use this trajectory to visualize the metastable state membership for all data points in the original trajectory. Further, we can visualize the coarse-grained MSM using the `pyemma.plots.plot_markov_model()` function:

In [None]:
metastable_traj = coarse_msm.metastable_assignments[cluster.dtrajs[0]]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
im = axes[0].scatter(*data.T, s=1, c=metastable_traj)
axes[0].set_xlabel('$x$')
axes[0].set_ylabel('$y$')
cb = fig.colorbar(im, ax=axes[0])
cb.set_label('metastable state')
cb.set_ticks([0, 1])
pyemma.plots.plot_markov_model(
    coarse_msm,
    pos=np.asarray([[0, 0], [3, 2]]),
    arrow_curvature=2.0,
    figpadding=0.2,
    size=12,
    ax=axes[1])
axes[1].set_aspect('equal')
fig.tight_layout()
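
Both the summation of stationary weights and the construction of the metastable trajectory are plain array indexing operations. A small sketch with hypothetical arrays (six microstates, two metastable states) shows the mechanics:

```python
import numpy as np

# Hypothetical stand-ins for coarse_msm.metastable_assignments, msm.pi, and a dtraj.
metastable_assignments = np.array([0, 0, 0, 1, 1, 1])
pi = np.array([0.30, 0.25, 0.05, 0.05, 0.15, 0.20])
dtraj = np.array([0, 1, 1, 2, 3, 4, 5, 5, 4, 2])

# Map every frame from its microstate to its metastable state.
metastable_traj = metastable_assignments[dtraj]

# Collect the microstates of each metastable state and sum their weights.
metastable_sets = [np.flatnonzero(metastable_assignments == i) for i in range(2)]
coarse_pi = np.array([pi[s].sum() for s in metastable_sets])

print(metastable_traj)
print(coarse_pi)
```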

Another example of kinetic information is the mean first passage time (MFPT). We use the `mfpt()` method of the original MSM object to compute MFPTs between pairs of metastable sets (accessible via the `metastable_sets` attribute of the coarse-grained MSM object). Then, we compute pairwise transition rates (inverse MFPTs) and visualize the kinetic network using the `pyemma.plots.plot_network()` function:

In [None]:
mfpt = np.zeros((nstates, nstates))
for i in range(nstates):
    for j in range(nstates):
        mfpt[i, j] = msm.mfpt(
            coarse_msm.metastable_sets[i],
            coarse_msm.metastable_sets[j])

rate = np.zeros_like(mfpt)
nz = mfpt.nonzero()
rate[nz] = 1.0 / mfpt[nz]

pyemma.plots.plot_network(
    rate,
    pos=np.asarray([[0, 0], [2, 1]]),
    arrow_label_format='%.0f steps',
    arrow_labels=mfpt,
    figpadding=0.3,
    size=12);

Here, the arrow thickness is proportional to the transition rates while the arrows' labels show the MFPTs.
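
For intuition, MFPTs into a target set can be obtained from a transition matrix by solving a linear system. The sketch below uses a hypothetical two-state matrix and assumes that the MFPT from a start set is the average over its states weighted by the stationary distribution. The helper names `mfpt_to_set` and `mfpt_between_sets` are invented for this illustration.

```python
import numpy as np

def mfpt_to_set(T, B, lag=1.0):
    """MFPT from every state into target set B for a row-stochastic matrix T."""
    n = T.shape[0]
    B = np.asarray(B)
    rest = np.setdiff1d(np.arange(n), B)
    # m is zero on B; elsewhere it solves (I - T_rest) m_rest = lag * 1.
    A = np.eye(rest.size) - T[np.ix_(rest, rest)]
    m = np.zeros(n)
    m[rest] = np.linalg.solve(A, lag * np.ones(rest.size))
    return m

def mfpt_between_sets(T, pi, A, B, lag=1.0):
    """Stationary-weighted average over start set A of the MFPT into B."""
    m = mfpt_to_set(T, B, lag)
    w = pi[A] / pi[A].sum()
    return float(w @ m[A])

# Hypothetical two-state model with its stationary distribution.
T = np.array([[0.95, 0.05],
              [0.10, 0.90]])
pi = np.array([2 / 3, 1 / 3])

sets = [[0], [1]]
mfpt = np.array([[mfpt_between_sets(T, pi, a, b) for b in sets] for a in sets])

# Rates as inverse MFPTs; the zero diagonal is left untouched.
rate = np.zeros_like(mfpt)
nz = mfpt.nonzero()
rate[nz] = 1.0 / mfpt[nz]

print(mfpt)
print(rate)
```

For a two-state model, the MFPTs reduce to the inverse transition probabilities, which makes the result easy to check by hand.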

## Case 2: low-dimensional molecular dynamics data (alanine dipeptide)
We fetch the alanine dipeptide data set, load the backbone torsions into memory, directly discretize the full space using $k$-means clustering, visualize the marginal and joint distributions of both components as well as the cluster centers, and show the ITS convergence to help select a suitable lag time:

In [None]:
pdb = mdshare.fetch('alanine-dipeptide-nowater.pdb', working_directory='data')
files = mdshare.fetch('alanine-dipeptide-*-250ns-nowater.dcd', working_directory='data')

feat = pyemma.coordinates.featurizer(pdb)
feat.add_backbone_torsions()
data = pyemma.coordinates.load(files, features=feat)

cluster = pyemma.coordinates.cluster_kmeans(data, k=100, max_iter=50, stride=10)
its = pyemma.msm.its(cluster.dtrajs, lags=[1, 2, 5, 10, 20, 50], nits=4, errors='bayes')

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
pyemma.plots.plot_feature_histograms(
    np.concatenate(data), feature_labels=[r'$\Phi$', r'$\Psi$'], ax=axes[0])
axes[1].scatter(*np.concatenate(data).T, s=1, alpha=0.3)
axes[1].scatter(*cluster.clustercenters.T, s=15)
axes[1].set_xlabel(r'$\Phi$')
axes[1].set_ylabel(r'$\Psi$')
pyemma.plots.plot_implied_timescales(its, ax=axes[2], units='ps')
fig.tight_layout()

We then estimate an MSM at lag time $10$ ps and visualize the stationary distribution by coloring all data points according to their discrete state membership and the corresponding stationary weight:

In [None]:
msm = pyemma.msm.estimate_markov_model(cluster.dtrajs, lag=10, dt_traj='0.001 ns')

print('fraction of states used = {:f}'.format(msm.active_state_fraction))
print('fraction of counts used = {:f}'.format(msm.active_count_fraction))

fig, ax = plt.subplots()
im = ax.scatter(*np.concatenate(data).T, s=1, c=msm.pi[np.concatenate(cluster.dtrajs)])
ax.scatter(*cluster.clustercenters.T, s=15, c='black', marker='x')
ax.set_xlabel(r'$\Phi$')
ax.set_ylabel(r'$\Psi$')
cb = fig.colorbar(im, ax=ax)
cb.set_label('stationary weight')
fig.tight_layout()

Next, we visualize the first six right eigenvectors:

In [None]:
eigvec = msm.eigenvectors_right()

fig, axes = plt.subplots(2, 3, figsize=(12, 6))
for i, ax in enumerate(axes.flat):
    im = ax.scatter(
        *cluster.clustercenters.T,
        c=np.round(eigvec[:, i], 5),
        s=70,
        cmap=mpl.cm.bwr,
        edgecolor='k',
        linewidth=0.2)
    cb = plt.colorbar(im, ax=ax)
    ax.set_xlabel(r'$\Phi$')
    ax.set_ylabel(r'$\Psi$')
    cb.set_label('{}. right eigenvector'.format(i + 1))
fig.tight_layout()

Again, we have the $(1,\dots,1)^\top$ first right eigenvector of the stationary process.

The second to fourth right eigenvectors illustrate the three slowest processes.

Eigenvectors five and six indicate further processes which, however, relax faster than the lag time and cannot be resolved clearly.

We now proceed with our validation using a Bayesian MSM with four metastable states:

In [None]:
bayesian_msm = pyemma.msm.bayesian_markov_model(cluster.dtrajs, lag=10)

nstates = 4
pyemma.plots.plot_cktest(bayesian_msm.cktest(nstates));

Seeing that four metastable states are a reasonable choice for our MSM, we obtain a coarse-grained MSM for further analysis:

In [None]:
coarse_msm = msm.coarse_grain(nstates)
print(coarse_msm.transition_matrix)

for i, s in enumerate(coarse_msm.metastable_sets):
    print('π_{} = {:f}'.format(i, msm.pi[s].sum()))

Now we define a small function to visualize samples of metastable states with NGLView.

In [None]:
def visualize_metastable(samples, cmap, selection='backbone'):
    """ visualize metastable states
    Parameters
    ----------
    samples: list of mdtraj.Trajectory objects
        each element contains all samples for one metastable state.
    cmap: matplotlib.colors.ListedColormap
        color map used to visualize metastable states before.
    selection: str
        which part of the molecule to select for visualization. For details have a look here:
        http://mdtraj.org/latest/examples/atom-selection.html#Atom-Selection-Language
    """
    import nglview
    from matplotlib.colors import to_hex

    widget = nglview.NGLWidget()
    widget.clear_representations()
    ref = samples[0]
    for i, s in enumerate(samples):
        s = s.superpose(ref)
        s = s.atom_slice(s.top.select(selection))
        comp = widget.add_trajectory(s)
        comp.add_ball_and_stick()

    # this has to be done in a separate loop for whatever reason...
    x = np.linspace(0, 1, num=len(samples))
    for i, x_ in enumerate(x):
        c = to_hex(cmap(x_))
        widget.update_ball_and_stick(color=c, component=i, repr_index=i)
        widget.remove_cartoon(component=i)
    return widget

We concatenate all three discrete trajectories and obtain a single trajectory of metastable states which we use to visualize the metastable state memberships of all datapoints. We also visualize the coarse-grained MSM:

In [None]:
metastable_traj = coarse_msm.metastable_assignments[np.concatenate(cluster.dtrajs)]
highest_membership = coarse_msm.metastable_distributions.argmax(1)
coarse_state_centers = cluster.clustercenters[msm.active_set[highest_membership]]
cmap = mpl.cm.get_cmap('viridis', nstates)

fig, ax = plt.subplots(figsize=(6, 6))
im = ax.scatter(
    *np.concatenate(data).T,
    s=1,
    c=metastable_traj,
    zorder=-1,
    cmap=cmap,
    norm=mpl.colors.BoundaryNorm(np.arange(-0.5, nstates, 1), nstates))
ax.set_xlabel(r'$\Phi$')
ax.set_ylabel(r'$\Psi$')
cb = fig.colorbar(im, ax=ax)
cb.set_label('metastable state')
cb.set_ticks(np.arange(nstates))
pyemma.plots.plot_markov_model(
    coarse_msm,
    pos=coarse_state_centers,
    figpadding=0.1,
    size=12,
    show_frame=True,
    ax=ax)
ax.set_aspect('equal')
ax.set_xlim(-np.pi, np.pi)
ax.set_ylim(-np.pi, np.pi)
fig.tight_layout()
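
The expression `cluster.clustercenters[msm.active_set[highest_membership]]` chains two index maps: from a metastable state to its most probable active state, and from active-state indices back to the original cluster indices. The sketch below replays this with hypothetical arrays in which one cluster is not part of the active set:

```python
import numpy as np

# Hypothetical stand-ins: 5 cluster centers, of which state 2 is not active.
clustercenters = np.arange(10).reshape(5, 2)
active_set = np.array([0, 1, 3, 4])

# Two metastable distributions over the 4 active states (rows sum to 1).
metastable_distributions = np.array([[0.7, 0.3, 0.0, 0.0],
                                     [0.0, 0.0, 0.4, 0.6]])

# Index of the most probable active state per metastable state.
highest_membership = metastable_distributions.argmax(1)
# Translate active-state indices into indices of the full clustering.
full_indices = active_set[highest_membership]
coarse_state_centers = clustercenters[full_indices]

print(highest_membership, full_indices)
print(coarse_state_centers)
```

Without the detour through `active_set`, the second metastable state would be placed at the wrong cluster center whenever states have been excluded.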

We now sample some representative structures and visualize these with the aid of nglview. For the sake of clarity, we draw only the backbone atoms. Since we have obtained several samples for each metastable state, you can click the play button to iterate over all samples. For each iteration the samples of all four states will be drawn.
You can double click the molecule to show it at full screen. Press escape to go back. 

In [None]:
my_samples = [pyemma.coordinates.save_traj(files, idist, outfile=None, top=pdb)
              for idist in msm.sample_by_distributions(coarse_msm.metastable_distributions, 50)]

visualize_metastable(my_samples, cmap, selection='backbone')

Have you noticed how well the metastable state coloring agrees with the eigenvector visualization of the three slowest processes?

If we could afford a shorter lag time, we might even be able to resolve more processes and, thus, subdivide the metastable states three (fifth slowest process) and zero (sixth slowest process).

Now, we use the `mfpt()` method of the original MSM object to compute MFPTs between pairs of metastable sets and also the pairwise transition rates, and visualize the kinetic network:

In [None]:
mfpt = np.zeros((nstates, nstates))
for i in range(nstates):
    for j in range(nstates):
        mfpt[i, j] = msm.mfpt(
            coarse_msm.metastable_sets[i],
            coarse_msm.metastable_sets[j])

rate = np.zeros_like(mfpt)
nz = mfpt.nonzero()
rate[nz] = 1.0 / mfpt[nz]

pyemma.plots.plot_network(
    rate,
    pos=np.asarray([[0, 0], [4, 0], [2, 4], [6, 4]]),
    arrow_label_format='%.1f ns',
    arrow_labels=mfpt,
    arrow_scale=3.0,
    size=12);

**Exercise 1**: Load the heavy atoms' distances into memory, perform TICA (`lag=3` and `dim=2`), discretize with $100$ $k$-means centers and a stride of $10$, and show the ITS convergence.

In [None]:
feat = #FIXME
feat. #FIXME
data = #FIXME

tica = #FIXME
tica_out = #FIXME
cluster = #FIXME
its = #FIXME

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
pyemma.plots.plot_feature_histograms(tica_out, feature_labels=['IC 1', 'IC 2'], ax=axes[0])
axes[1].scatter(*tica_out.T, s=1, alpha=0.3)
axes[1].scatter(*cluster.clustercenters.T, s=15)
axes[1].set_xlabel('IC 1')
axes[1].set_ylabel('IC 2')
pyemma.plots.plot_implied_timescales(its, ax=axes[2], units='ps')
fig.tight_layout()

In [None]:
feat = pyemma.coordinates.featurizer(pdb)
feat.add_distances(feat.select_Heavy())
data = pyemma.coordinates.load(files, features=feat)

tica = pyemma.coordinates.tica(data, lag=3, dim=2)
tica_out = np.concatenate(tica.get_output())
cluster = pyemma.coordinates.cluster_kmeans(tica, k=100, max_iter=50, stride=10)
its = pyemma.msm.its(cluster.dtrajs, lags=[1, 2, 5, 10, 20, 50], nits=4, errors='bayes')

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
pyemma.plots.plot_feature_histograms(tica_out, feature_labels=['IC 1', 'IC 2'], ax=axes[0])
axes[1].scatter(*tica_out.T, s=1, alpha=0.3)
axes[1].scatter(*cluster.clustercenters.T, s=15)
axes[1].set_xlabel('IC 1')
axes[1].set_ylabel('IC 2')
pyemma.plots.plot_implied_timescales(its, ax=axes[2], units='ps')
fig.tight_layout()

**Exercise 2**: Estimate an MSM at lag time $10$ ps with `dt_traj='0.001 ns'` and visualize the stationary distribution using a two-dimensional colored scatter plot of all data points in TICA space.

In [None]:
msm = #FIXME

print('fraction of states used = {:f}'. #FIXME
print('fraction of counts used = {:f}'. #FIXME

fig, ax = plt.subplots()
im = ax.scatter(*tica_out.T, s=1, c=msm.pi[np.concatenate(cluster.dtrajs)])
ax.scatter(*cluster.clustercenters.T, s=15, c='black', marker='x')
ax.set_xlabel('IC 1')
ax.set_ylabel('IC 2')
cb = fig.colorbar(im, ax=ax)
cb.set_label('stationary weight')
fig.tight_layout()

In [None]:
msm = pyemma.msm.estimate_markov_model(cluster.dtrajs, lag=10, dt_traj='0.001 ns')

print('fraction of states used = {:f}'.format(msm.active_state_fraction))
print('fraction of counts used = {:f}'.format(msm.active_count_fraction))

fig, ax = plt.subplots()
im = ax.scatter(*tica_out.T, s=1, c=msm.pi[np.concatenate(cluster.dtrajs)])
ax.scatter(*cluster.clustercenters.T, s=15, c='black', marker='x')
ax.set_xlabel('IC 1')
ax.set_ylabel('IC 2')
cb = fig.colorbar(im, ax=ax)
cb.set_label('stationary weight')
fig.tight_layout()

**Exercise 3**: Now visualize the first six right eigenvectors.

In [None]:
eigvec = #FIXME

fig, axes = plt.subplots(2, 3, figsize=(12, 6))
for i, ax in enumerate(axes.flat):
    im = ax.scatter(
        # FIXME
        c=np.round(eigvec[:, i], 5),
        s=70, cmap=mpl.cm.bwr,
        edgecolor='k',
        linewidth=0.2)
    cb = plt.colorbar(im, ax=ax)
    ax.set_xlabel('IC 1')
    ax.set_ylabel('IC 2')
    cb.set_label('{}. right eigenvector'.format(i + 1))
fig.tight_layout()

In [None]:
eigvec = msm.eigenvectors_right()

fig, axes = plt.subplots(2, 3, figsize=(12, 6))
for i, ax in enumerate(axes.flat):
    im = ax.scatter(
        *cluster.clustercenters.T,
        c=np.round(eigvec[:, i], 5),
        s=70, cmap=mpl.cm.bwr,
        edgecolor='k',
        linewidth=0.2)
    cb = plt.colorbar(im, ax=ax)
    ax.set_xlabel('IC 1')
    ax.set_ylabel('IC 2')
    cb.set_label('{}. right eigenvector'.format(i + 1))
fig.tight_layout()

Can you already guess from eigenvectors two to four which the metastable states are?

**Exercise 4**: Estimate a Bayesian MSM at lag time $10$ ps and perform/show a CK test for four metastable states.

In [None]:
bayesian_msm = #FIXME

nstates = 4
pyemma.plots. #FIXME

In [None]:
bayesian_msm = pyemma.msm.bayesian_markov_model(cluster.dtrajs, lag=10)

nstates = 4
pyemma.plots.plot_cktest(bayesian_msm.cktest(nstates));

**Exercise 5**: Coarse grain the MSM onto the four metastable states, obtain the metastable state trajectory, and visualize the metastable state memberships and the coarse-grained MSM.

In [None]:
coarse_msm = #FIXME
print(coarse_msm.transition_matrix)

for i, s in enumerate(coarse_msm.metastable_sets):
    print('π_{} = {:f}'.format(i, msm.pi[s].sum()))

metastable_traj = #FIXME

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
cmap = mpl.cm.get_cmap('viridis', nstates)
norm = mpl.colors.BoundaryNorm(np.arange(-.5, nstates, 1), nstates)
im = axes[0].scatter(*tica_out.T, s=1, c=metastable_traj, cmap=cmap, norm=norm)
axes[0].set_xlabel('IC 1')
axes[0].set_ylabel('IC 2')
cb = fig.colorbar(im, ax=axes[0])
cb.set_label('metastable state')
cb.set_ticks([i for i in range(nstates)])
pyemma.plots.plot_markov_model(
    coarse_msm,
    pos=np.asarray([[0, 0], [4, 0], [2, 4], [6, 4]]),
    figpadding=0.1,
    size=12,
    ax=axes[1])
axes[1].set_aspect('equal')
fig.tight_layout()

In [None]:
coarse_msm = msm.coarse_grain(nstates)
print(coarse_msm.transition_matrix)

for i, s in enumerate(coarse_msm.metastable_sets):
    print('π_{} = {:f}'.format(i, msm.pi[s].sum()))

metastable_traj = coarse_msm.metastable_assignments[np.concatenate(cluster.dtrajs)]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
cmap = mpl.cm.get_cmap('viridis', nstates)
norm = mpl.colors.BoundaryNorm(np.arange(-0.5, nstates, 1), nstates)
im = axes[0].scatter(*tica_out.T, s=1, c=metastable_traj, cmap=cmap, norm=norm)
axes[0].set_xlabel('IC 1')
axes[0].set_ylabel('IC 2')
cb = fig.colorbar(im, ax=axes[0])
cb.set_label('metastable state')
cb.set_ticks([i for i in range(nstates)])
pyemma.plots.plot_markov_model(
    coarse_msm,
    pos=np.asarray([[0, 0], [4, 0], [2, 4], [6, 4]]),
    figpadding=0.1,
    size=12,
    ax=axes[1])
axes[1].set_aspect('equal')
fig.tight_layout()

Did you guess the metastable states correctly?

Note the similarities between the MSM built from the backbone torsions and the MSM built from the TICA projection of heavy atom distances. Even though we started from different features, both models found the same kinetic information in the data.

**Exercise 6**: Compute the pairwise MFPTs and transition rates, and visualize the resulting kinetic network.

In [None]:
mfpt = np.zeros((nstates, nstates))
for i in range(nstates):
    for j in range(nstates):
        mfpt[i, j] = #FIXME

rate = np.zeros_like(mfpt)
nz = mfpt.nonzero()
rate[nz] = 1.0 / mfpt[nz]

pyemma.plots.plot_network(
    rate,
    pos=np.asarray([[0, 0], [4, 0], [2, 4], [6, 4]]),
    arrow_label_format='%.1f ns',
    arrow_labels=mfpt,
    arrow_scale=3.0,
    size=12);

In [None]:
mfpt = np.zeros((nstates, nstates))
for i in range(nstates):
    for j in range(nstates):
        mfpt[i, j] = msm.mfpt(
            coarse_msm.metastable_sets[i],
            coarse_msm.metastable_sets[j])

rate = np.zeros_like(mfpt)
nz = mfpt.nonzero()
rate[nz] = 1.0 / mfpt[nz]

pyemma.plots.plot_network(
    rate,
    pos=np.asarray([[0, 0], [4, 0], [2, 4], [6, 4]]),
    arrow_label_format='%.1f ns',
    arrow_labels=mfpt,
    arrow_scale=3.0,
    size=12);

## Case 3: another molecular dynamics data set (pentapeptide)

**Exercise 7**: Fetch the pentapeptide data set, load the cossin transformations of the backbone and $\chi_1$ sidechain torsions into memory, perform TICA with `lag=20` and `var_cutoff=0.9`, discretize with $250$ $k$-means centers using a stride of $10$, visualize the marginal distributions, and show the ITS convergence:

In [None]:
pdb = mdshare.fetch('pentapeptide-impl-solv.pdb', working_directory='data')
files = mdshare.fetch('pentapeptide-*-500ns-impl-solv.xtc', working_directory='data')

feat = #FIXME
feat. #FIXME
feat. #FIXME
data = #FIXME

tica = #FIXME
tica_out = #FIXME
cluster = #FIXME
its = #FIXME

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
pyemma.plots.plot_feature_histograms #FIXME
pyemma.plots.plot_implied_timescales #FIXME
fig.tight_layout()

In [None]:
pdb = mdshare.fetch('pentapeptide-impl-solv.pdb', working_directory='data')
files = mdshare.fetch('pentapeptide-*-500ns-impl-solv.xtc', working_directory='data')

feat = pyemma.coordinates.featurizer(pdb)
feat.add_backbone_torsions(cossin=True)
feat.add_sidechain_torsions(which='chi1', cossin=True)
data = pyemma.coordinates.load(files, features=feat)

tica = pyemma.coordinates.tica(data, lag=20, var_cutoff=0.9)
tica_out = np.concatenate(tica.get_output())
cluster = pyemma.coordinates.cluster_kmeans(tica, k=250, max_iter=50, stride=10)
its = pyemma.msm.its(cluster.dtrajs, lags=30, nits=10, errors='bayes')

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
pyemma.plots.plot_feature_histograms(tica_out, ax=axes[0])
pyemma.plots.plot_implied_timescales(its, ax=axes[1], dt=0.1, units='ns')
fig.tight_layout()
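
A brief aside on the cossin transformation requested in Exercise 7: mapping each torsion angle to its cosine and sine removes the artificial jump at the periodic boundary. The following NumPy sketch with two hypothetical angles illustrates the effect; it does not use the PyEMMA featurizer.

```python
import numpy as np

# Two hypothetical torsion angles just left and right of the periodic boundary.
angles = np.array([-np.pi + 0.01, np.pi - 0.01])

# In raw angle space the two conformations look far apart ...
raw_distance = abs(angles[0] - angles[1])

# ... while in (cos, sin) space they are close neighbors.
cossin = np.column_stack([np.cos(angles), np.sin(angles)])
cossin_distance = np.linalg.norm(cossin[0] - cossin[1])

print(raw_distance, cossin_distance)
```

Distance-based methods such as $k$-means clustering benefit from this representation because nearly identical conformations are no longer separated by almost $2\pi$.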

**Exercise 8**: Estimate an MSM at lag time $12$ steps with `dt_traj='0.1 ns'` and visualize the first six right eigenvectors.

In [None]:
msm = #FIXME

print('fraction of states used = {:f}'.format(msm.active_state_fraction))
print('fraction of counts used = {:f}'.format(msm.active_count_fraction))

eigvec = msm.eigenvectors_right()

fig, axes = plt.subplots(2, 3, figsize=(12, 6))
for i, ax in enumerate(axes.flat):
    im = ax.scatter(
        # FIXME
        c=np.round(eigvec[:, i], 5),
        s=70,
        cmap=mpl.cm.bwr,
        edgecolor='k',
        linewidth=.2)
    cb = plt.colorbar(im, ax=ax)
    ax.set_xlabel('IC 1')
    ax.set_ylabel('IC 2')
    cb.set_label('{}. right eigenvector'.format(i + 1))
fig.tight_layout()

In [None]:
msm = pyemma.msm.estimate_markov_model(
    cluster.dtrajs, lag=12, dt_traj='0.1 ns')

print('fraction of states used = {:f}'.format(msm.active_state_fraction))
print('fraction of counts used = {:f}'.format(msm.active_count_fraction))

eigvec = msm.eigenvectors_right()

fig, axes = plt.subplots(2, 3, figsize=(12, 6))
for i, ax in enumerate(axes.flat):
    im = ax.scatter(
        *cluster.clustercenters[:, :2].T,
        c=np.round(eigvec[:, i], 5),
        s=70,
        cmap=mpl.cm.bwr,
        edgecolor='k',
        linewidth=0.2)
    cb = plt.colorbar(im, ax=ax)
    ax.set_xlabel('IC 1')
    ax.set_ylabel('IC 2')
    cb.set_label('{}. right eigenvector'.format(i + 1))
fig.tight_layout()

**Exercise 9**: Plot the first ten timescales of the estimated MSM and look for spectral gaps.

In [None]:
timescales = #FIXME

plt.plot(timescales, '-o')
plt.xlabel('timescale index')
plt.ylabel('timescale / ns');

In [None]:
timescales = msm.timescales(k=10)

plt.plot(timescales, '-o')
plt.xlabel('timescale index')
plt.ylabel('timescale / ns');
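
One simple way to look for a spectral gap is to inspect the ratios of successive timescales. The numbers below are hypothetical and only illustrate the bookkeeping; for the actual model, the array returned by `msm.timescales(k=10)` would take their place.

```python
import numpy as np

# Hypothetical timescales in descending order (not computed from the data).
timescales = np.array([120.0, 80.0, 60.0, 12.0, 10.0, 9.0])

# Ratio between each timescale and the next faster one.
ratios = timescales[:-1] / timescales[1:]

# The largest ratio marks the gap; everything before it counts as slow.
n_slow = np.argmax(ratios) + 1

print(ratios)
print('slow processes: {}, metastable states: {}'.format(n_slow, n_slow + 1))
```

A gap after the third timescale suggests three slow processes and, hence, four metastable states.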

**Exercise 10**: Estimate a Bayesian MSM at lag time $12$ steps and perform/show a CK test for four metastable states.

In [None]:
bayesian_msm = #FIXME

nstates = 4
pyemma.plots. #FIXME

In [None]:
bayesian_msm = pyemma.msm.bayesian_markov_model(cluster.dtrajs, lag=12)

nstates = 4
pyemma.plots.plot_cktest(bayesian_msm.cktest(nstates));

**Exercise 11**: Coarse grain the MSM onto the four metastable states, obtain the metastable state trajectory, and visualize the metastable state memberships and the coarse-grained MSM.

In [None]:
coarse_msm = #FIXME
print(coarse_msm.transition_matrix)

for i, s in enumerate(coarse_msm.metastable_sets):
    print('π_{} = {:f}'.format(i, msm.pi[s].sum()))

mtraj = #FIXME

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
cmap = mpl.cm.get_cmap('viridis', nstates)
norm = mpl.colors.BoundaryNorm(np.arange(-.5, nstates, 1), nstates)
im = axes[0].scatter(*tica_out[:, :2].T, s=1, c=mtraj, cmap=cmap, norm=norm)
axes[0].set_xlabel('IC 1')
axes[0].set_ylabel('IC 2')
cb = fig.colorbar(im, ax=axes[0])
cb.set_label('metastable state')
cb.set_ticks(np.arange(nstates))
pyemma.plots.plot_markov_model(
    coarse_msm,
    pos=np.asarray([[0, 0], [4, 0], [2, 4], [6, 4]]),
    figpadding=0.1,
    size=12,
    ax=axes[1])
axes[1].set_aspect('equal')
fig.tight_layout()

my_samples = [pyemma.coordinates.save_traj(files, idist, outfile=None, top=pdb)
              for idist in msm.sample_by_distributions(coarse_msm.metastable_distributions, 50)]

visualize_metastable(my_samples, cmap)

In [None]:
coarse_msm = msm.coarse_grain(nstates)
print(coarse_msm.transition_matrix)

for i, s in enumerate(coarse_msm.metastable_sets):
    print('π_{} = {:f}'.format(i, msm.pi[s].sum()))

mtraj = coarse_msm.metastable_assignments[np.concatenate(cluster.dtrajs)]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
cmap = mpl.cm.get_cmap('viridis', nstates)
norm = mpl.colors.BoundaryNorm(np.arange(-.5, nstates, 1), nstates)
im = axes[0].scatter(*tica_out[:, :2].T, s=1, c=mtraj, cmap=cmap, norm=norm)
axes[0].set_xlabel('IC 1')
axes[0].set_ylabel('IC 2')
cb = fig.colorbar(im, ax=axes[0])
cb.set_label('metastable state')
cb.set_ticks(np.arange(nstates))
pyemma.plots.plot_markov_model(
    coarse_msm,
    pos=np.asarray([[0, 0], [4, 0], [2, 4], [6, 4]]),
    figpadding=0.1,
    size=12,
    ax=axes[1])
axes[1].set_aspect('equal')
fig.tight_layout()

my_samples = [pyemma.coordinates.save_traj(files, idist, outfile=None, top=pdb)
              for idist in msm.sample_by_distributions(coarse_msm.metastable_distributions, 50)]

visualize_metastable(my_samples, cmap)

**Exercise 12**: Compute the pairwise MFPTs and transition rates, and visualize the resulting kinetic network.

In [None]:
mfpt = np.zeros((nstates, nstates))
for i in range(nstates):
    for j in range(nstates):
        mfpt[i, j] = #FIXME

rate = np.zeros_like(mfpt)
nz = mfpt.nonzero()
rate[nz] = 1.0 / mfpt[nz]

pyemma.plots.plot_network(
    rate,
    pos=np.asarray([[0, 0], [4, 0], [2, 4], [6, 4]]),
    arrow_label_format='%.1f ns',
    arrow_labels=mfpt,
    arrow_scale=3.0,
    size=12);

In [None]:
mfpt = np.zeros((nstates, nstates))
for i in range(nstates):
    for j in range(nstates):
        mfpt[i, j] = msm.mfpt(
            coarse_msm.metastable_sets[i],
            coarse_msm.metastable_sets[j])

rate = np.zeros_like(mfpt)
nz = mfpt.nonzero()
rate[nz] = 1.0 / mfpt[nz]

pyemma.plots.plot_network(
    rate,
    pos=np.asarray([[0, 0], [4, 0], [2, 4], [6, 4]]),
    arrow_label_format='%.1f ns',
    arrow_labels=mfpt,
    arrow_scale=3.0,
    size=12);

## Wrapping up
In this notebook, we have learned how to coarse grain an MSM and how to extract kinetic information from the model. In detail, we have used
- the `active_state_fraction`, `active_count_fraction`, and `active_set` attributes of an MSM object to see how much (and which parts) of our data form the largest connected set represented by the MSM,
- the `stationary_distribution` (or `pi`) attribute of an MSM object to access its stationary vector,
- the `eigenvectors_right()` method of an MSM object to access its (right) eigenvectors,
- the `coarse_grain()` method of an MSM object to coarse grain the model onto a selected number of metastable states,
- the `mfpt()` method of an MSM object to compute mean first passage times between metastable states which, in turn, are accessible via
- the `metastable_sets` and `metastable_assignments` attributes of a coarse-grained MSM object.

For visualizing MSMs or kinetic networks we used
- `pyemma.plots.plot_markov_model()` and
- `pyemma.plots.plot_network()`.