This notebook is an example of how to analyze a simulation trajectory using unsupervised learning techniques. Specifically, we'll analyze simulations of cyclohexane conformations, run with Quantum ESPRESSO.

Before running this notebook, you will need to install:
    
- [ase](https://wiki.fysik.dtu.dk/ase/index.html)
- [scikit-learn](https://scikit-learn.org/)
- [scikit-matter](https://github.com/scikit-learn-contrib/scikit-matter)
- [chemiscope](https://chemiscope.org)

in addition to standard packages [numpy](https://numpy.org/) and [matplotlib](https://matplotlib.org/).

## Loading Chemiscope widgets in Jupyter

Please make sure you have jupyter extensions enabled.

If at *any time* you are unable to load the chemiscope widgets in Jupyter, you can replace `chemiscope.show(` with `chemiscope.write_input('filename.json', ...` and upload the resulting file to [chemiscope.org](https://chemiscope.org).

In [None]:
import numpy as np
from ase.io import read
from matplotlib import pyplot as plt
import chemiscope
import scipy.sparse.linalg  # explicit submodule import so scipy.sparse.linalg.eigsh is available

## Preparing the Data

### Read Data

Here we read in five MD trajectories and concatenate them into a single list, `traj`.

`ranges` stores the start and end indices in `traj` of the frames from each original file.
`conf_idx` stores the index in `traj` of each trajectory's initial conformation.

`rgb_colors` is the set of colors used for each conformer, stored as RGB tuples.
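
To make the bookkeeping concrete, here is a minimal sketch with made-up trajectory lengths. The `lengths` values and the placeholder string "frames" are assumptions for illustration, not the real data. It shows how `ranges` and `conf_idx` recover one simulation's slice of the concatenated list.

```python
import numpy as np

# Hypothetical trajectory lengths, for illustration only
names = ["chair", "twist-boat", "boat"]
lengths = [4, 3, 5]

traj = []
ranges = np.zeros((len(names), 2), dtype=int)
conf_idx = np.zeros(len(names), dtype=int)

for i, (n, length) in enumerate(zip(names, lengths)):
    # stand-in for the frames read from one file
    frames = [f"{n}-{j}" for j in range(length)]
    ranges[i] = (len(traj), len(traj) + len(frames))  # half-open [start, stop)
    conf_idx[i] = len(traj)  # the first frame is the initial conformation
    traj.extend(frames)

# slice out the second simulation and look up its starting frame
start, stop = ranges[1]
second = traj[start:stop]
first_frame = traj[conf_idx[1]]
```

Because each range is half-open, `traj[start:stop]` returns exactly the frames of one simulation, and `conf_idx[i]` always equals `ranges[i, 0]`.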

In [None]:
# read in the frames from each MD simulation
traj = []
names = ["chair", "twist-boat", "boat", "half-chair", "planar"]
rgb_colors = [
    (0.13333333333333333, 0.47058823529411764, 0.7098039215686275),
    (0.4588235294117647, 0.7568627450980392, 0.34901960784313724),
    (0.803921568627451, 0.6078431372549019, 0.16862745098039217),
    (0.803921568627451, 0.13725490196078433, 0.15294117647058825),
    (0.4392156862745098, 0.2784313725490196, 0.611764705882353),
]

ranges = np.zeros((len(names), 2), dtype=int)
conf_idx = np.zeros(len(names), dtype=int)

for i, n in enumerate(names):
    frames = read(f"../../datasets/cyclohexane/{n}.xyz", ":")

    ranges[i] = (len(traj), len(traj) + len(frames))
    conf_idx[i] = len(traj)
    traj = [*traj, *frames]

In [None]:
# energies of the simulation frames, relative to the chair conformation
energy = np.array([a.info["relative_energy_eV"] for a in traj])

# energies of the known conformers, relative to the chair conformation
c_energy = np.array([traj[c].info["relative_energy_eV"] for c in conf_idx])

# extrema for the energies
max_e = max(energy)
min_e = min(energy)

Plotting the energy along each trajectory previews what our analysis will tell us:

- the simulation starting in the planar conformation transitions to the chair conformation
- the simulations starting in the twist-boat, boat, and half-chair conformations ultimately get stuck in the twist-boat conformation.

In [None]:
fig, ax = plt.subplots(1, figsize=(6, 4))

for n, c, r, rgb in zip(names, c_energy, ranges, rgb_colors):
    ax.plot(
        range(0, r[1] - r[0]), energy[r[0] : r[1]] - min_e, label=n, c=rgb, zorder=-1
    )

ax.legend()
ax.set_xlabel("Simulation Timestep")
ax.set_ylabel("Energy")

ax.set_xlim([0, len(energy) // 5])
ax.set_ylim([-0.1, 1.25 * (max_e - min_e)])
ax.set_yticklabels([])

plt.tight_layout()
# plt.savefig('figures/Figure5/energy.png')
plt.show()

### Load descriptors 
We will use some precomputed geometric descriptors -- more on these this afternoon!

Here's what you need to know.

`atomic_desc` is a `5000 x 6 x q` tensor, where `q` is the number of descriptors we have.

For each frame, we have one descriptor per carbon atom (hence the 6!).

We'll average this over the carbon atoms of each molecule and store the result in the variable `X`.

In [None]:
atomic_desc = np.load("../../datasets/cyclohexane/cyclohexane_descriptors.npy")

X = np.mean(atomic_desc, axis=1)
atomic_desc.shape, X.shape

### Setting the colormap
Here we color each point by its similarity to the initial conformers (which has been precomputed).
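
The source does not say how the `closest_conformer` labels were computed. Purely as an illustration, the sketch below assigns each frame to the nearest reference conformer by Euclidean distance in descriptor space. The synthetic `X_demo` array, the choice of metric, and the `ref_idx` positions are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 3 well-separated clusters of 10 frames, 4 features each
centers = np.array([[0.0, 0, 0, 0], [5.0, 5, 5, 5], [-5.0, 5, -5, 5]])
X_demo = np.concatenate([c + 0.1 * rng.normal(size=(10, 4)) for c in centers])

# index of the first frame of each cluster, playing the role of conf_idx
ref_idx = np.array([0, 10, 20])

# distances between every frame and every reference, shape (30, 3)
dists = np.linalg.norm(X_demo[:, None, :] - X_demo[ref_idx][None, :, :], axis=-1)

# label of the nearest reference for each frame
closest = np.argmin(dists, axis=1)
```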

In [None]:
closest_config = np.array([frame.info["closest_conformer"] for frame in traj])
colors = np.array([frame.info["color"] for frame in traj])

# Mapping time!

### Linear Principal Components Analysis

Finish the code to compute PCA from scratch.

In [None]:
# feature-space (covariance-like) matrix; note that X is not mean-centered here
C = X.T @ X

v_C, U_C = scipy.sparse.linalg.eigsh(C, k=100)

# U_C/v_C are already sorted, but in *increasing* order, so reverse them
U_C = np.flip(U_C, axis=1)
v_C = np.flip(v_C, axis=0)

# Gram matrix of the samples
Kgram = X @ X.T

v_K, U_K = scipy.sparse.linalg.eigsh(Kgram, k=100)

U_K = np.flip(U_K, axis=1)
v_K = np.flip(v_K, axis=0)

The covariance and Gram matrices share the same nonzero eigenvalues, which the plot of both spectra confirms.
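
This follows from the singular value decomposition: if $X = U S V^T$, then $X^T X = V S^2 V^T$ and $X X^T = U S^2 U^T$. As a quick check, here is a small sketch on random data (the `X_demo` array and its shape are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(8, 3))  # hypothetical data: 8 samples, 3 features

C = X_demo.T @ X_demo  # (3, 3) feature-space matrix
K = X_demo @ X_demo.T  # (8, 8) Gram matrix

# eigvalsh returns eigenvalues in ascending order; reverse to descending
v_C = np.linalg.eigvalsh(C)[::-1]
v_K = np.linalg.eigvalsh(K)[::-1]

# the 3 largest eigenvalues of K match those of C; the remaining 5 are ~0
same_spectrum = np.allclose(v_C, v_K[:3])
rest_is_zero = np.allclose(v_K[3:], 0.0, atol=1e-10)
```

The larger of the two matrices only picks up extra eigenvalues that are zero to numerical precision.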

In [None]:
plt.semilogy(v_C)
plt.semilogy(v_K)

Our projections should be identical, up to a sign flip (mirroring) of each component.
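
The sign ambiguity arises because an eigenvector multiplied by -1 is still an eigenvector. A minimal sketch on random data (`X_demo` is an assumption for illustration) aligns the signs column by column and then compares the two projections:

```python
import numpy as np

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(8, 3))  # hypothetical data: 8 samples, 3 features

v_C, U_C = np.linalg.eigh(X_demo.T @ X_demo)
v_K, U_K = np.linalg.eigh(X_demo @ X_demo.T)

# sort in descending order and keep the 3 non-trivial components
U_C = U_C[:, ::-1]
v_K, U_K = v_K[::-1][:3], U_K[:, ::-1][:, :3]

T_C = X_demo @ U_C
T_K = U_K * np.sqrt(v_K)  # scales each column by its singular value

# align the sign of each column before comparing
signs = np.sign(np.sum(T_C * T_K, axis=0))
match = np.allclose(T_C, T_K * signs)
```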

In [None]:
# scale each eigenvector by v_K^{1/2} (the singular values of X) to obtain the projection
T_K = U_K @ np.diag(np.sqrt(v_K))

# no factor needed here!
T_C = X @ U_C

fig, (axK, axC) = plt.subplots(1, 2, figsize=(10, 4))
axK.scatter(T_K[:, 0], T_K[:, 1], marker=".")
axC.scatter(T_C[:, 0], T_C[:, 1], marker=".")

In [None]:
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(X)

T = pca.transform(X)

chemiscope.show(
    traj,
    properties={
        "t": T[:, :5],
        "Relative Energy [eV]": energy,
        "Closest Conformer": closest_config,
    },
    settings={
        "map": {
            "symbol": "Closest Conformer",
            "color": {"property": "Relative Energy [eV]"},
        }
    },
)

Even when our PCA is not as easily interpretable as it is here, we can still use it for data compression by looking at how much of the variance each component captures.
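
The selection rule can be sketched on a toy spectrum. The `ratios` values and the 95% `threshold` below are made up for illustration. Note the `+ 1`, which converts a zero-based index into a number of components.

```python
import numpy as np

# Hypothetical explained-variance ratios, already sorted in decreasing order
ratios = np.array([0.70, 0.20, 0.06, 0.03, 0.01])

cumulative = np.cumsum(ratios)

# argmax returns the first index where the threshold is met;
# add 1 to turn the zero-based index into a component count
threshold = 0.95
n_keep = int(np.argmax(cumulative >= threshold)) + 1
```

Here the first two components capture about 90% of the variance, so a third is needed to pass the threshold.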

In [None]:
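# plot against a 1-based component count so the first component is visible on a log axis
n_components = len(pca.explained_variance_ratio_)
plt.loglog(np.arange(1, n_components + 1), pca.explained_variance_ratio_)
plt.gca().set_xlabel(r"$n_{PC}$")
plt.gca().set_ylabel("Explained Variance Ratio")

# first index where the cumulative ratio exceeds the threshold, plus one to get a count
n_pca = np.where(np.cumsum(pca.explained_variance_ratio_) > 0.9999)[0][0] + 1
plt.axvline(n_pca, c="k", linestyle="--")
print(
    f"This shows that we can retain most of the variance (>99.99%) in {n_pca} vectors. "
    "We'll use this as our descriptor in some other algorithms below to reduce computational cost."
)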

plt.tight_layout()

## Multidimensional Scaling (MDS)

In [None]:
from sklearn.manifold import MDS

mds = MDS(n_components=5)
mds.fit(X)

T = mds.embedding_

chemiscope.show(
    traj,
    properties={
        "t": T,
        "Relative Energy [eV]": energy,
        "Closest Conformer": closest_config,
    },
    settings={
        "map": {
            "symbol": "Closest Conformer",
            "color": {"property": "Relative Energy [eV]"},
        }
    },
)

## Kernel PCA (KPCA)

Let's try KPCA! I have precomputed the kernel, to spare everyone's computer some headaches.
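
For intuition, here is a rough NumPy sketch of the steps behind kernel PCA on a precomputed kernel: center the kernel, eigendecompose it, and scale the leading eigenvectors. The random `X_demo` data and the Gaussian kernel with an arbitrary width are assumptions for illustration; the kernel stored in `normalized_kernel.npy` may be defined differently.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))  # hypothetical data

# Gaussian kernel with an arbitrary width (an assumption for illustration)
sq_dists = np.sum((X_demo[:, None, :] - X_demo[None, :, :]) ** 2, axis=-1)
K_demo = np.exp(-0.5 * sq_dists)

# center the kernel in feature space: Kc = H K H with H = I - (1/n) 11^T
n = K_demo.shape[0]
H = np.eye(n) - np.ones((n, n)) / n
Kc = H @ K_demo @ H

# eigendecompose and keep the two leading components
v, U = np.linalg.eigh(Kc)
v, U = v[::-1][:2], U[:, ::-1][:, :2]

# projections of the training points, as in the Gram-matrix PCA above
T_demo = U * np.sqrt(v)
```

With a linear kernel `X @ X.T` on centered data, this reduces to the Gram-matrix form of PCA computed earlier.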

In [None]:
K = np.load("../../datasets/cyclohexane/normalized_kernel.npy")

In [None]:
from sklearn.decomposition import KernelPCA

kpca = KernelPCA(kernel="precomputed", n_components=2)
kpca.fit(K)

T = kpca.transform(K)

chemiscope.show(
    traj,
    properties={
        "t": T,
        "Relative Energy [eV]": energy,
        "Closest Conformer": closest_config,
    },
    settings={
        "map": {
            "symbol": "Closest Conformer",
            "color": {"property": "Relative Energy [eV]"},
        }
    },
)