This notebook is an example of how to analyze a simulation trajectory using unsupervised learning techniques. Specifically, we'll analyze MD simulations of cyclohexane conformations, run with Quantum ESPRESSO.

Before running this notebook, you will need to install:
    
- [ase](https://wiki.fysik.dtu.dk/ase/index.html)
- [scikit-learn](https://scikit-learn.org/)
- [scikit-matter](https://github.com/scikit-learn-contrib/scikit-matter)
- [openTSNE](https://opentsne.readthedocs.io/en/stable/)
- [umap-learn](https://umap-learn.readthedocs.io/en/latest/how_umap_works.html)
- [chemiscope](https://chemiscope.org)

in addition to the standard packages [numpy](https://numpy.org/) and [matplotlib](https://matplotlib.org/).

## Loading Chemiscope widgets in Jupyter

Please make sure you have jupyter extensions enabled.

If at *any time* you are unable to load the chemiscope widgets in Jupyter, you can replace `chemiscope.show(` with `chemiscope.write_input('filename.json', ...` and upload the resulting file to [chemiscope.org](https://chemiscope.org).

In [None]:
import numpy as np
from ase.io import read
from matplotlib import pyplot as plt
import chemiscope
import scipy
from openTSNE import TSNE
from umap import UMAP

## Preparing the Data

### Read Data

Here we read in 5 MD trajectories and concatenate them into a single list, `traj`.

`ranges` stores the range of indices in `traj` that corresponds to each original file.
`conf_idx` stores the index in `traj` of each initial conformation (the first frame of each file).

`rgb_colors` is the set of colors used for each conformer, stored as RGB tuples.
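
The bookkeeping described above can be illustrated on a toy example. This is only a sketch: short lists of strings stand in for the real `ase.Atoms` frames, and the names `toy_files`, `toy_traj`, `toy_ranges`, and `toy_conf_idx` are hypothetical.

```python
import numpy as np

# Hypothetical stand-ins for the frames read from each file
toy_files = {
    "chair": ["c0", "c1", "c2"],
    "boat": ["b0", "b1"],
}

toy_traj = []
toy_ranges = np.zeros((len(toy_files), 2), dtype=int)
toy_conf_idx = np.zeros(len(toy_files), dtype=int)

for i, frames in enumerate(toy_files.values()):
    # half-open range [start, stop) of this file inside the concatenated list
    toy_ranges[i] = (len(toy_traj), len(toy_traj) + len(frames))
    # the first frame of each file is its starting conformation
    toy_conf_idx[i] = len(toy_traj)
    toy_traj.extend(frames)

# recover the frames of the second file from the concatenated list
start, stop = toy_ranges[1]
boat_frames = toy_traj[start:stop]
```

Because each range is half-open, `toy_traj[start:stop]` returns exactly the frames of one file, which is how the energy plot later in this notebook slices `energy` per trajectory.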

In [None]:
# read in the frames from each MD simulation
traj = []
names = ["chair", "twist-boat", "boat", "half-chair", "planar"]
rgb_colors = [
    (0.13333333333333333, 0.47058823529411764, 0.7098039215686275),
    (0.4588235294117647, 0.7568627450980392, 0.34901960784313724),
    (0.803921568627451, 0.6078431372549019, 0.16862745098039217),
    (0.803921568627451, 0.13725490196078433, 0.15294117647058825),
    (0.4392156862745098, 0.2784313725490196, 0.611764705882353),
]

ranges = np.zeros((len(names), 2), dtype=int)
conf_idx = np.zeros(len(names), dtype=int)

for i, n in enumerate(names):
    frames = read(f"../../datasets/cyclohexane/{n}.xyz", ":")

    ranges[i] = (len(traj), len(traj) + len(frames))
    conf_idx[i] = len(traj)
    traj = [*traj, *frames]

In [None]:
# energies of the simulation frames, relative to the chair conformation
energy = np.array([a.info["relative_energy_eV"] for a in traj])

# energies of the known conformers, relative to the chair conformation
c_energy = np.array([traj[c].info["relative_energy_eV"] for c in conf_idx])

# extrema for the energies
max_e = max(energy)
min_e = min(energy)

Here we can confirm what our analysis will tell us: 

- the simulation starting in the planar conformation transitions to the chair conformation
- the simulations starting in the twist-boat, boat, and half-chair conformations ultimately get stuck in the twist-boat conformation.

In [None]:
fig, ax = plt.subplots(1, figsize=(6, 4))

for n, c, r, rgb in zip(names, c_energy, ranges, rgb_colors):
    ax.plot(
        range(0, r[1] - r[0]), energy[r[0] : r[1]] - min_e, label=n, c=rgb, zorder=-1
    )

ax.legend()
ax.set_xlabel("Simulation Timestep")
ax.set_ylabel("Energy")

ax.set_xlim([0, len(energy) // 5])
ax.set_ylim([-0.1, 1.25 * (max_e - min_e)])
ax.set_yticklabels([])

plt.tight_layout()
# plt.savefig('figures/Figure5/energy.png')
plt.show()

### Load descriptors 
We will use some precomputed geometric descriptors; more on this later this afternoon!

Here's what you need to know.

`atomic_desc` is a `5000 x 6 x q` tensor, where `q` is the number of descriptors we have.

For each frame, we have one descriptor per carbon atom (hence the 6!).

We'll average this over the atoms of each molecule and store the result in the variable `X`.
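
As a quick illustration of the per-molecule averaging, here is a minimal sketch on random data. The array `toy_atomic_desc` is a hypothetical stand-in with 4 frames and 3 descriptor components instead of the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: 4 frames, 6 carbon atoms, 3 descriptor components
toy_atomic_desc = rng.normal(size=(4, 6, 3))

# Average over the atom axis (axis=1) to get one descriptor per frame
toy_mean_desc = toy_atomic_desc.mean(axis=1)

print(toy_atomic_desc.shape, toy_mean_desc.shape)
```

Averaging over `axis=1` collapses the six per-atom rows into a single row per frame, so the result has one row per molecule.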

In [None]:
atomic_desc = np.load("../../datasets/cyclohexane/cyclohexane_descriptors.npy")

X = np.mean(atomic_desc, axis=1)
atomic_desc.shape, X.shape

### Setting the colormap
Here we are going to color each of our points based on its similarity to the initial conformers (which has been pre-computed).
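
The notebook reads the pre-computed assignment from each frame's `info` dictionary and does not say how it was produced. Purely as an assumption, one plausible way to build such a labeling is to assign each frame to the nearest reference conformer in descriptor space. The sketch below shows that idea on hypothetical toy arrays (`toy_desc`, `toy_refs`, `toy_names`, `toy_rgb`); it is not necessarily the method used to generate the dataset.

```python
import numpy as np

# Hypothetical per-frame descriptors and reference conformer descriptors
toy_desc = np.array([[0.0, 0.1], [0.9, 1.0], [0.2, 0.0], [1.1, 0.9]])
toy_refs = np.array([[0.0, 0.0], [1.0, 1.0]])
toy_names = ["chair", "twist-boat"]
toy_rgb = np.array([[0.13, 0.47, 0.71], [0.46, 0.76, 0.35]])

# Euclidean distance from every frame to every reference conformer
dists = np.linalg.norm(toy_desc[:, None, :] - toy_refs[None, :, :], axis=-1)

# Index of the nearest reference for each frame, then look up name and color
nearest = dists.argmin(axis=1)
toy_closest = [toy_names[i] for i in nearest]
toy_colors = toy_rgb[nearest]
```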

In [None]:
closest_config = np.array([frame.info["closest_conformer"] for frame in traj])
colors = np.array([frame.info["color"] for frame in traj])

## Mapping time!

Before we start, we're going to reduce the dimensionality of our descriptors to keep the computation manageable.
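
To make the idea of keeping a fixed fraction of the variance concrete, here is a minimal NumPy-only sketch of how the number of retained components can be chosen from the cumulative explained variance. It uses hypothetical toy data and a threshold of 0.95; the notebook itself relies on scikit-learn's `PCA` with a much stricter threshold.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples, 5 features with very different variances
scales = np.array([10.0, 5.0, 1.0, 0.1, 0.01])
toy_data = rng.normal(size=(200, 5)) * scales

# PCA via SVD of the centered data
centered = toy_data - toy_data.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Fraction of the total variance explained by each component
explained = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained)

# Smallest number of components whose cumulative variance reaches the threshold
threshold = 0.95
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

# Project the data onto the retained components
toy_projection = centered @ components[:n_keep].T
```

With these scales, the first two directions carry almost all of the variance, so only two components are kept.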

In [None]:
from sklearn.decomposition import PCA

# a fractional n_components keeps enough components to explain 99.99% of the variance
pca = PCA(n_components=0.9999)
pca.fit(X)

pca_desc = pca.transform(X)

### t-SNE

PCA is not intended as a clustering algorithm; it just sometimes works out to give nice clusters.
Let's employ one of the most popular non-linear dimensionality reduction algorithms in machine learning, t-distributed Stochastic Neighbor Embedding (t-SNE), to obtain a 2-dimensional representation of our descriptor space.

Here we can see how increasing the perplexity (roughly, the expected number of neighbors of each point) changes the layout of the projection.
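
Perplexity can be made concrete with a small calculation. In t-SNE, each point gets a Gaussian-weighted probability distribution over the other points, and the width of that Gaussian is tuned so that the perplexity of the distribution (2 raised to its Shannon entropy) matches the requested value. The sketch below illustrates the definition only; it is not openTSNE's implementation, and the helper names `perplexity` and `conditional_probabilities` are hypothetical.

```python
import numpy as np


def perplexity(p):
    """Perplexity 2**H(p) of a discrete probability distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # 0 * log(0) is taken to be 0
    entropy = -np.sum(p * np.log2(p))
    return 2.0**entropy


def conditional_probabilities(sq_distances, sigma):
    """Gaussian similarities of one point to the others, normalized to sum to 1."""
    weights = np.exp(-sq_distances / (2.0 * sigma**2))
    return weights / weights.sum()


# A uniform distribution over k neighbors has perplexity k
uniform_perp = perplexity(np.full(8, 1 / 8))

# For a fixed set of distances, a wider Gaussian spreads the probability
# over more neighbors and so raises the perplexity
sq_distances = np.arange(1, 51, dtype=float) ** 2
narrow_perp = perplexity(conditional_probabilities(sq_distances, sigma=1.0))
wide_perp = perplexity(conditional_probabilities(sq_distances, sigma=10.0))
```

This is why a low perplexity emphasizes very local structure, while a high perplexity lets each point "see" a larger neighborhood.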

In [None]:
perplexities = np.logspace(0, 2, 6, dtype=int)
fig, ax = plt.subplots(
    1,
    len(perplexities),
    figsize=(4 * len(perplexities), 4),
)

for i, perp in enumerate(perplexities):
    tsne = TSNE(
        n_components=2,  # number of components to project across
        perplexity=perp,
        metric="l2",  # distance metric
        n_jobs=2,  # parallelization
        random_state=42,
        verbose=False,
    )
    t_tsne = tsne.fit(pca_desc)
    ax[i].scatter(*t_tsne.T, c=colors, s=2)
    ax[i].axis("off")
    ax[i].set_title("Perplexity = {}".format(perp))
plt.show()

In [None]:
# How many neighbors do you think we should use?

n_neighbors_TSNE = # ...

In [None]:
tsne = TSNE(
    n_components=2,  # number of components to project across
    # number of neighbors each point is posited to have... play around with this!
    perplexity=n_neighbors_TSNE,
    metric="l2",  # distance metric
    n_jobs=2,  # parallelization
    random_state=42,
    verbose=False,
)
T = tsne.fit(pca_desc)

In [None]:
chemiscope.show(
    traj,
    properties={
        "t": T,
        "Relative Energy [eV]": energy,
        "Closest Conformer": closest_config,
    },
    settings={
        "map": {
            "symbol": "Closest Conformer",
            "color": {"property": "Relative Energy [eV]"},
        }
    },
)

The t-SNE result also depends on the number of dimensions of the target projection, so let's see how this affects our embedding. Each panel below shows the first two components of the projection.

In [None]:
ndim = np.arange(2, 6, dtype=int)
fig, ax = plt.subplots(
    1,
    len(ndim),
    figsize=(4 * len(ndim), 4),
)

for i, dim in enumerate(ndim):
    tsne = TSNE(
        n_components=dim,  # number of components to project across
        perplexity=n_neighbors_TSNE,
        metric="l2",  # distance metric
        n_jobs=2,  # parallelization
        random_state=42,
        verbose=False,
    )
    t_tsne = tsne.fit(pca_desc)[:, :2]
    ax[i].scatter(*t_tsne.T, c=colors, s=2)
    ax[i].axis("off")
    ax[i].set_title("n_dim = {}".format(dim))
plt.show()

We can also see how t-SNE changes based upon the dimensionality of the dataset provided.

In [None]:
ndim = np.logspace(np.log10(2), np.log10(pca_desc.shape[1]), 6, dtype=int)
fig, ax = plt.subplots(
    1,
    len(ndim),
    figsize=(4 * len(ndim), 4),
)

for i, dim in enumerate(ndim):
    tsne = TSNE(
        n_components=2,  # number of components to project across
        perplexity=n_neighbors_TSNE,
        metric="l2",  # distance metric
        n_jobs=2,  # parallelization
        random_state=42,
        verbose=False,
    )
    t_tsne = tsne.fit(pca_desc[:, :dim])
    ax[i].scatter(*t_tsne.T, c=colors, s=2)
    ax[i].axis("off")
    ax[i].set_title("n_dim of descriptor = {}".format(dim))
plt.show()

t-SNE is fickle! When you've reached this point in the notebook, raise your hand and we'll discuss the appropriate uses for t-SNE. If you need something to do while you wait, this article is one of the best:

https://distill.pub/2016/misread-tsne/

### UMAP

UMAP _should_ obtain similar results to t-SNE, but with a shorter compute time. However, you will notice greater stochasticity in the projection when using a smaller number of neighbors. This is because the locally constructed manifolds become disconnected from one another.
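
The effect of the neighbor count on connectivity can be illustrated with a plain k-nearest-neighbor graph. The sketch below is a rough analogy rather than UMAP's actual fuzzy graph construction: it counts the connected components of a symmetrized kNN graph on hypothetical toy data, using a hypothetical helper `knn_components`.

```python
import numpy as np


def knn_components(points, k):
    """Count connected components of the symmetrized k-nearest-neighbor graph."""
    n = len(points)
    # pairwise Euclidean distances
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # union-find over the edges of the graph
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in neighbors[i]:
            parent[find(i)] = find(int(j))

    return len({find(i) for i in range(n)})


# Toy 1D data made of tight pairs: 0, 0.1, 1, 1.1, 2, 2.1, ...
pair_centers = np.arange(5, dtype=float)
toy_points = np.sort(np.concatenate([pair_centers, pair_centers + 0.1]))[:, None]

components_k1 = knn_components(toy_points, k=1)
components_k2 = knn_components(toy_points, k=2)
```

With one neighbor, every tight pair forms its own island; with two neighbors, the pairs link into a single chain. When the graph is fragmented, the relative placement of the pieces in the embedding is not constrained by the data, which is one way to understand the run-to-run variation at small neighbor counts.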

In [None]:
nneigh = np.maximum(2, np.logspace(0, 2.0, 5, dtype=int))
fig, ax = plt.subplots(
    1,
    len(nneigh),
    figsize=(4 * len(nneigh), 4),
)

for i, n in enumerate(nneigh):
    um = UMAP(n_components=2, n_neighbors=n, init='random')
    um.fit(pca_desc)
    t_um = um.transform(pca_desc)
    ax[i].scatter(*t_um.T, c=colors, s=2)
    ax[i].axis('off')
    ax[i].set_title("# Neighbors = {}".format(n))
plt.show()

In [None]:
# How many neighbors do you think we should use?

n_neighbors_UMAP = # ...

In [None]:
um = UMAP(n_components=2, n_neighbors=n_neighbors_UMAP, init='random')
um.fit(pca_desc)
T = um.transform(pca_desc)

chemiscope.show(
    traj,
    properties={
        "t": T,
        "Relative Energy [eV]": energy,
        "Closest Conformer": closest_config,
    },
    settings={
        "map": {
            "symbol": "Closest Conformer",
            "color": {"property": "Relative Energy [eV]"},
        }
    },
)