This notebook shows how to analyze a simulation trajectory using unsupervised learning techniques. Specifically, we'll analyze molecular dynamics simulations of cyclohexane conformers, run with Quantum ESPRESSO.

Before running this notebook, you will need to install:
    
- [ase](https://wiki.fysik.dtu.dk/ase/index.html)
- [scikit-learn](https://scikit-learn.org/)
- [scikit-matter](https://github.com/scikit-learn-contrib/scikit-matter)
- [hdbscan](https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html)
- [chemiscope](https://chemiscope.org)
- [openTSNE](https://opentsne.readthedocs.io/)

in addition to the standard packages [numpy](https://numpy.org/), [pandas](https://pandas.pydata.org/), and [matplotlib](https://matplotlib.org/).

## Loading Chemiscope widgets in Jupyter

Please make sure you have Jupyter extensions enabled.

If at *any time* you are unable to load the chemiscope widgets in Jupyter, you can replace `chemiscope.show(...)` with `chemiscope.write_input("filename.json", ...)` and upload the resulting file to [chemiscope.org](https://chemiscope.org).

In [None]:
!pip install ase skmatter chemiscope hdbscan openTSNE
!git clone https://github.com/icomse/5th_workshop_MachineLearning.git
import numpy as np
from ase.io import read
from matplotlib import pyplot as plt
import chemiscope
import scipy
from sklearn import cluster
from sklearn import metrics
import hdbscan
import pandas as pd
from functools import partial
import os
os.chdir('5th_workshop_MachineLearning/Day_3')

## Preparing the Data

### Read Data

Here we read in five MD trajectories and concatenate them into a single list, `traj`.

`ranges` stores the start and stop indices in `traj` of the frames from each original file.
`conf_idx` stores the index in `traj` of each initial conformation (the first frame of each file).

`rgb_colors` holds the color used for each conformer, stored as RGB tuples.
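
As a minimal sketch of this bookkeeping, the toy example below concatenates three made-up lists (hypothetical stand-ins for the frames of each file, not real data), fills `ranges` and `conf_idx` in the same way, and then uses them to recover one original trajectory and every starting frame.

```python
import numpy as np

# Hypothetical stand-ins for the frames read from each file (not real data)
toy_files = {
    "a": ["a0", "a1", "a2"],
    "b": ["b0", "b1"],
    "c": ["c0", "c1", "c2", "c3"],
}

traj = []
ranges = np.zeros((len(toy_files), 2), dtype=int)
conf_idx = np.zeros(len(toy_files), dtype=int)

for i, frames in enumerate(toy_files.values()):
    # half-open interval [start, stop) of this file inside the concatenated list
    ranges[i] = (len(traj), len(traj) + len(frames))
    # the first frame of each file is its starting conformation
    conf_idx[i] = len(traj)
    traj.extend(frames)

# recover the second trajectory and the starting frame of every file
start, stop = ranges[1]
second = traj[start:stop]
initial = [traj[k] for k in conf_idx]
print(second, initial)
```

Because the intervals are half-open, `traj[start:stop]` returns exactly the frames of one file, with no overlap between neighbouring files.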

In [None]:
# read in the frames from each MD simulation
traj = []
names = ["chair", "twist-boat", "boat", "half-chair", "planar"]
rgb_colors = [
    (0.13333333333333333, 0.47058823529411764, 0.7098039215686275),
    (0.4588235294117647, 0.7568627450980392, 0.34901960784313724),
    (0.803921568627451, 0.6078431372549019, 0.16862745098039217),
    (0.803921568627451, 0.13725490196078433, 0.15294117647058825),
    (0.4392156862745098, 0.2784313725490196, 0.611764705882353),
]

ranges = np.zeros((len(names), 2), dtype=int)
conf_idx = np.zeros(len(names), dtype=int)

for i, n in enumerate(names):
    frames = read(f"./datasets/cyclohexane/{n}.xyz", ":")  # ":" reads every frame in the file

    ranges[i] = (len(traj), len(traj) + len(frames))
    conf_idx[i] = len(traj)
    traj = [*traj, *frames]

In [None]:
# energies of the simulation frames, relative to the chair conformation
energy = np.array([a.info["relative_energy_eV"] for a in traj])

# energies of the known conformers, relative to the chair conformation
c_energy = np.array([traj[c].info["relative_energy_eV"] for c in conf_idx])

# extrema for the energies
max_e = max(energy)
min_e = min(energy)

The energy traces below already show what our analysis will confirm:

- the simulation starting in the planar conformation transitions to the chair conformation
- the simulations starting in the twist-boat, boat, and half-chair conformations ultimately get stuck in the twist-boat conformation.

In [None]:
fig, ax = plt.subplots(1, figsize=(6, 4))

for n, r, rgb in zip(names, ranges, rgb_colors):
    ax.plot(
        range(0, r[1] - r[0]), energy[r[0] : r[1]] - min_e, label=n, c=rgb, zorder=-1
    )

ax.legend()
ax.set_xlabel("Simulation Timestep")
ax.set_ylabel("Energy")

ax.set_xlim([0, len(energy) // 5])
ax.set_ylim([-0.1, 1.25 * (max_e - min_e)])
ax.set_yticklabels([])

plt.tight_layout()
# plt.savefig('figures/Figure5/energy.png')
plt.show()

### Load descriptors 
We will use some precomputed geometric descriptors -- more on this this afternoon!

Here's what you need to know.

`atomic_desc` is a `5000 x 6 x q` tensor, where `q` is the number of descriptors we have.

For each frame, we have one descriptor per carbon atom (hence the 6!).

We'll average these over the atoms of each molecule and store the result in the variable `X`.
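
To see what the per-molecule average does to the shape, here is a small sketch on random numbers. The toy sizes (4 frames, 3 descriptor values) are made up for illustration; only the 6 atoms per frame mirror the real data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical shapes: 4 frames, 6 carbon atoms, 3 descriptor values per atom
toy_atomic_desc = rng.normal(size=(4, 6, 3))

# axis=1 is the atom axis, so the mean collapses it and leaves (frames, q)
toy_X = np.mean(toy_atomic_desc, axis=1)
print(toy_atomic_desc.shape, toy_X.shape)

# row 0 of the result is the average of the six per-atom vectors of frame 0
manual = sum(toy_atomic_desc[0, a] for a in range(6)) / 6
```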

In [None]:
atomic_desc = np.load("./datasets/cyclohexane/cyclohexane_descriptors.npy")

X = np.mean(atomic_desc, axis=1)
atomic_desc.shape, X.shape

### Setting the colormap
Here we color each point based on its similarity to the initial conformers (which has been pre-computed).
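
The closest-conformer labels ship with the dataset, and this notebook does not say how they were computed. One plausible approach, shown here purely as an assumption, is to assign each frame to the reference conformer that is nearest in descriptor space under the Euclidean distance. The helper name `closest_reference` and the toy data are hypothetical.

```python
import numpy as np

def closest_reference(X, reference_idx):
    """For each row of X, return the position (in reference_idx) of the nearest reference row."""
    refs = X[reference_idx]  # shape (n_refs, q)
    # pairwise Euclidean distances, shape (n_samples, n_refs)
    dists = np.linalg.norm(X[:, None, :] - refs[None, :, :], axis=-1)
    return np.argmin(dists, axis=1)

# made-up 2D "descriptors"; rows 0 and 2 play the role of the initial conformers
toy_X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.2], [0.4, 0.3]])
toy_conf_idx = np.array([0, 2])
toy_labels = closest_reference(toy_X, toy_conf_idx)
print(toy_labels)
```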

In [None]:
closest_config = np.array([frame.info["closest_conformer"] for frame in traj])
correct_labels = np.array([names.index(c) for c in closest_config])
colors = np.array([frame.info["color"] for frame in traj])

### Mapping time!

Let's recompute the t-SNE map from the previous notebook; we will perform the clustering on it.
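
Before running t-SNE, the cell below compresses the descriptors with PCA and passes a fraction as `n_components`. In that mode scikit-learn keeps the smallest number of components whose cumulative explained variance exceeds the fraction. The NumPy-only sketch below reproduces that selection rule on made-up data; the helper name `n_components_for_variance` is hypothetical and not part of scikit-learn.

```python
import numpy as np

def n_components_for_variance(X, fraction):
    """Smallest number of principal components whose cumulative
    explained-variance ratio exceeds `fraction`."""
    Xc = X - X.mean(axis=0)                  # PCA works on centered data
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, descending
    ratio = s**2 / np.sum(s**2)              # explained-variance ratio per component
    return int(np.searchsorted(np.cumsum(ratio), fraction, side="right") + 1)

rng = np.random.default_rng(0)
# made-up data: two strong directions plus eight weak noise directions
toy = rng.normal(size=(500, 10)) * np.array([10.0, 5.0] + [0.3] * 8)

k_90 = n_components_for_variance(toy, 0.90)
k_999 = n_components_for_variance(toy, 0.999)
print(k_90, k_999)
```

Raising the fraction can only keep the same number of components or more, which is why a threshold very close to 1 retains most of the weak directions as well.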

In [None]:
from sklearn.decomposition import PCA
from openTSNE import TSNE

# n_components=0.999 tells PCA to keep 99.9% of the variance
pca = PCA(n_components=0.999)
pca.fit(X)

pca_desc = pca.transform(X)

n_neighbors_TSNE = 6

tsne = TSNE(
    n_components=2,  # number of components to project across
    perplexity=n_neighbors_TSNE,  # number of neighbors each point is assumed to have... play around with this!
    n_jobs=2,  # parallelization
    random_state=42,
    verbose=False,
)
T = tsne.fit(pca_desc)
plt.figure(figsize=(8,8))
plt.scatter(T[:, 0], T[:, 1], color=colors)
plt.axis("off")

And let's make a plotting utility to help ourselves in the future.

In [None]:
def plot_tsne_with_labels(labels, cutoff=10, fig=None, ax=None):
    
    all_colors = [
        "#ebac23",
        "#b80058",
        "#008cf9",
        "#006e00",
        "#00bbad",
        "#d163e6",
        "#b24502",
        "#ff9287",
        "#5954d6",
        "#00c6f8",
        "#878500",
    ]
    
    # map each label (including -1 for noise) to its position among the sorted unique labels
    labels = np.asarray(labels)
    _, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
    colors = ["none" for _ in counts]

    # color clusters from largest to smallest; clusters smaller than `cutoff` stay uncolored
    order = np.argsort(counts)
    for i in range(1, len(counts) + 1):
        if counts[order[-i]] >= cutoff:
            if i < len(all_colors):
                colors[order[-i]] = all_colors[i]
            else:
                colors[order[-i]] = "orange"

    if fig is None or ax is None:
        fig, ax = plt.subplots(1, figsize=(8, 8))
    ax.scatter(T[:, 0], T[:, 1], c="k", alpha=0.1, marker=".")
    ax.scatter(
        T[:, 0],
        T[:, 1],
        fc="none",
        ec=[colors[j] for j in inverse],  # index by position, not by raw label value
        linewidth=0.5,
    )
    
    ax.set_xlabel(r"t-SNE$_1$")
    ax.set_ylabel(r"t-SNE$_2$")
    ax.set_yticks([])
    ax.set_xticks([])

In [None]:
# These are our correct labels!

plot_tsne_with_labels(correct_labels)

## Perform Clustering

In [None]:
# Tune the parameters of K-Means to get an appropriate clustering

km = cluster.KMeans(
    # ...
    )
km.fit(T)

plot_tsne_with_labels(km.labels_)

In [None]:
chemiscope.show(
    traj,
    properties={
        "t": T,
        "Relative Energy [eV]": energy,
        "Closest Conformer": closest_config,
        "Cluster": km.labels_,
        "Correct Cluster": correct_labels,
    },
    settings={
        "map": {
            "symbol": "Closest Conformer",
            "color": {"property": "Relative Energy [eV]"},
        }
    },
)

### Choose a few other clustering methods to try out:

[Agglomerative Clustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering): `cluster.AgglomerativeClustering(n_clusters=int, metric=str)`, where `metric` is a distance name such as `"euclidean"` (this parameter was called `affinity` in scikit-learn versions before 1.2)
- You can also use sklearn to make a dendrogram of the clustering hierarchy! [Instructions here](https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html#sphx-glr-auto-examples-cluster-plot-agglomerative-dendrogram-py)

[DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html): `cluster.DBSCAN(eps=float, min_samples=int, metric=str)`
- `eps` is the maximum distance between two points for them to count as neighbors
- `min_samples` is the number of points (the point itself included) that must lie within `eps` of a point for it to count as a core point of a cluster
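
To make the roles of `eps` and `min_samples` concrete, the NumPy-only sketch below marks core points the way DBSCAN defines them: a point is a core point when at least `min_samples` points, itself included, lie within distance `eps`. This covers only the first step of DBSCAN, not the full algorithm, and the helper name `core_point_mask` and the toy points are made up for illustration.

```python
import numpy as np

def core_point_mask(points, eps, min_samples):
    """Boolean mask of points that have at least `min_samples` neighbors within `eps`."""
    # pairwise Euclidean distances, shape (n_points, n_points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    neighbor_counts = np.sum(dists <= eps, axis=1)  # each point counts itself
    return neighbor_counts >= min_samples

# four points packed closely together and one isolated point
toy_points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])

tight = core_point_mask(toy_points, eps=0.2, min_samples=3)
loose = core_point_mask(toy_points, eps=0.2, min_samples=1)
print(tight, loose)
```

With `min_samples=3` the isolated point is not a core point, so DBSCAN would label it as noise (`-1`); with `min_samples=1` every point is a core point and nothing can be noise.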

[HDBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.HDBSCAN.html#sklearn.cluster.HDBSCAN): `cluster.HDBSCAN(min_cluster_size=int, metric=str)` (requires scikit-learn 1.3 or newer; the standalone `hdbscan` package imported above provides `hdbscan.HDBSCAN` as well)

[Mean Shift](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MeanShift.html#sklearn.cluster.MeanShift): `cluster.MeanShift()`

and place their results in a dictionary like this:

``` python
    cluster_results = {
                       "technique_name_1": ...,  # list of labels
                       "technique_name_2": ...,  # list of labels
                       "technique_name_3": ...,  # list of labels
                      }
```
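
As a hedged illustration of filling such a dictionary, the sketch below clusters made-up 2D blobs that stand in for the t-SNE coordinates `T`. The parameter values are arbitrary choices that suit this toy data, not recommendations for the cyclohexane map, and the name `toy_cluster_results` is hypothetical.

```python
import numpy as np
from sklearn import cluster
from sklearn.datasets import make_blobs

# made-up 2D data standing in for the t-SNE coordinates `T`
toy_T, _ = make_blobs(
    n_samples=150,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.5,
    random_state=0,
)

# each entry maps a technique name to one label per sample
toy_cluster_results = {
    "K-Means": cluster.KMeans(n_clusters=3, n_init=10, random_state=0).fit(toy_T).labels_,
    "Agglomerative": cluster.AgglomerativeClustering(n_clusters=3).fit(toy_T).labels_,
    "DBSCAN": cluster.DBSCAN(eps=1.0, min_samples=5).fit(toy_T).labels_,
}

for name, labels in toy_cluster_results.items():
    # DBSCAN may also report noise points with the label -1
    print(name, sorted(set(labels.tolist())))
```

Every fitted estimator exposes its assignment as `labels_`, so swapping `toy_T` for `T` and tuning the parameters gives entries in the format the evaluation cells expect.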

## Evaluating the Clusters

In [None]:
cluster_results = {
    
}

In [None]:
scores_to_try = {
    "Rand": lambda labels: metrics.rand_score(
        labels_true=correct_labels, labels_pred=labels
    ),
    "Jaccard": lambda labels: metrics.jaccard_score(
        y_true=correct_labels, y_pred=labels, average='macro',
    ),
    "Fowlkes-Mallows": lambda labels: metrics.fowlkes_mallows_score(
        labels_true=correct_labels, labels_pred=labels
    ),
    "F indicator": lambda labels: metrics.f1_score(
        y_true=correct_labels, y_pred=labels, average='macro',
    ),
    "Silhouette": partial(metrics.silhouette_score, X=T),
    "Davies-Bouldin": partial(metrics.davies_bouldin_score, X=T),
}

fig, axes = plt.subplots(
    1, len(cluster_results.keys())+1, figsize=(4 * len(cluster_results.keys())+4, 4)
)
plot_tsne_with_labels(correct_labels, ax=axes[0], fig=fig)
for (key, value), ax in zip(cluster_results.items(), axes[1:]):
    plot_tsne_with_labels(value, ax=ax, fig=fig)
    ax.set_title(key)
plt.show()

scores = np.array(
    [
        [v(labels=labels) for v in scores_to_try.values()]
        for labels in cluster_results.values()
    ]
)

pd.DataFrame(scores, columns=scores_to_try.keys(), index=cluster_results.keys())

# Discuss with the students next to you the pros and cons of each clustering technique!