# 1: Molecules, Structural Representations, and Training Sets

* Time to run the cells: ~ 1 minute

First things first: set the absolute path of the ``rho_learn`` directory on
your local machine, for instance:

``RHOLEARN_DIR = "/Users/joe.abbott/Documents/phd/code/qml/rho_learn/"``

We also need to import a ``rholearn`` module here for use later.

In [None]:
from rholearn.features import lambda_soap_vector

RHOLEARN_DIR = "/Users/joe.abbott/Documents/phd/code/qml/rho_learn/"
# RHOLEARN_DIR = "/path/to/rho_learn/"

## Reference Data

### Electron Densities

The data used here is a 10-molecule subset of a larger dataset of azoswitch
molecules used in the electron density learning of excited state properties. You
can read the paper at __"Learning the Exciton Properties of Azo-dyes"__, J.
Phys. Chem. Lett. 2021, 12, 25, 5957–5962. DOI:
[10.1021/acs.jpclett.1c01425](https://doi.org/10.1021/acs.jpclett.1c01425). 

For the purposes of this workflow we are focussing on predicting only the
ground-state electron density, but the workflow can easily be extended to
first- and second-excited state hole and particle densities, for which there is
reference QM data at the above source.

All the data needed to run this proof-of-concept workflow is shipped in the
GitHub repo, stored in the ``rho_learn/docs/example/azoswitch/data/`` directory.
Inspect this directory. There is a file called ``mollist.dat`` containing the
filenames of 10 structures, a subfolder ``xyz/`` containing these ``.xyz``
files, a folder containing some QM-calculated Coulomb repulsion matrices, and
the QM-calculated (i.e. reference) ground state electron density coefficients of
the molecules included in the training set.

Both the Coulomb matrices and electron density are stored in equistore TensorMap
format. Let's load and inspect the structure of the electron density.

In [None]:
import os
import equistore.io

data_dir = os.path.join(RHOLEARN_DIR, "docs/example/azoswitch/data")
e_density = equistore.io.load(os.path.join(data_dir, "gs_edensity.npz"))

TensorMaps are the main objects users interact with when using equistore. In
principle they can store any kind of data useful in atomistic simulations,
along with its associated metadata.

A TensorMap is a collection of TensorBlocks, each of which is indexed by a key
and contains atomistic data on a subset of a system of interest. In our case,
the electron density TensorMap has blocks for each combination of spherical
harmonic channel, $l$, and chemical species. 

Run the cell below. Notice how the $l$ values run from 0 -> 5 (inclusive) and
the chemical species (or 'species_center') span values 1, 6, 7, 8, 16, for
elements H, C, N, O, S respectively.

In [None]:
e_density.keys

Now let's look at a specific block. A TensorBlock has three kinds of axes: the
first dimension is the samples, the last dimension is the properties, and all
intermediate dimensions are the components. In general, samples are used to
describe what we are representing, i.e. atomic environments in a given
structure, and properties are used to describe how we are representing it.

In this example, a set of coefficients for the expansion of the electron density
on a set of basis functions are given as the learning targets and therefore the
data that appears in the TensorMaps. For a given structure, $A$

$ \rho_A (x) = \sum_{inlm} c^i_{nlm} \phi_{nlm}(x - r_i)$

where $c^i_{nlm}$ are the expansion coefficients, $\phi$ the basis functions.
$i$ is an atomic index for the atoms in a molecule, $n$ the radial index, and
$l$ and $m$ the spherical harmonics indices.
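To make the expansion concrete, here is a minimal sketch that evaluates $\rho_A(x)$ from a set of coefficients. It is a toy and not the basis used for the reference data: it assumes only $l = m = 0$ channels, and the normalised s-type Gaussian basis functions, their widths, the atomic positions, and the coefficient values are all hypothetical.

```python
import math


def s_gaussian(r2, width):
    """Normalised s-type Gaussian evaluated at squared distance r2 (assumed basis)."""
    norm = (1.0 / (2.0 * math.pi * width**2)) ** 1.5
    return norm * math.exp(-r2 / (2.0 * width**2))


def density(x, centers, coeffs, widths):
    """rho(x) = sum_i sum_n c^i_n * phi_n(x - r_i), restricted to l = m = 0."""
    total = 0.0
    for r_i, c_i in zip(centers, coeffs):
        # Squared distance between the evaluation point and atom i
        r2 = sum((xa - ra) ** 2 for xa, ra in zip(x, r_i))
        for c_in, width in zip(c_i, widths):
            total += c_in * s_gaussian(r2, width)
    return total


# Hypothetical two-atom "molecule" with two radial channels per atom
centers = [(0.0, 0.0, 0.0), (1.2, 0.0, 0.0)]
coeffs = [[0.8, 0.2], [0.5, 0.1]]
widths = [0.4, 0.9]

rho_mid = density((0.6, 0.0, 0.0), centers, coeffs, widths)   # between the atoms
rho_far = density((10.0, 0.0, 0.0), centers, coeffs, widths)  # far from both atoms
```

Because the density is linear in the coefficients, learning the coefficients is equivalent to learning the density for a fixed basis.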

In [None]:
e_density.block(0)

The samples contain 'structure' (i.e. $A$ in the equation above) and 'center'
($i$) indices. The components contain 'spherical_harmonics_m' ($m$) indices, and
the properties contain 'n' (i.e. radial channel $n$) indices. Remember from
above that the keys of the TensorMap store the sparse indices for
'spherical_harmonics_l' (i.e. $l$) as well as 'species_center', the latter
because different basis functions are often used for different chemical species.
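The layout described above can be mimicked with plain Python containers. This is only an illustrative stand-in, not the equistore API: the helper names `make_block` and `block_shape`, the sample values, and the number of radial channels are hypothetical.

```python
def make_block(l, samples, n_radial):
    """Toy block holding values[sample][m][n], initialised to zero."""
    components = list(range(-l, l + 1))  # 'spherical_harmonics_m' values
    properties = list(range(n_radial))   # radial channel 'n' values
    values = [[[0.0] * n_radial for _ in components] for _ in samples]
    return {
        "samples": samples,
        "components": components,
        "properties": properties,
        "values": values,
    }


def block_shape(block):
    """(number of samples, number of components, number of properties)."""
    return (len(block["samples"]), len(block["components"]), len(block["properties"]))


# Hypothetical samples: ('structure', 'center') pairs for the atoms of one species
hydrogen_samples = [(0, 3), (0, 4), (1, 2)]

# Keys are ('spherical_harmonics_l', 'species_center') tuples
toy_map = {
    (0, 1): make_block(0, hydrogen_samples, n_radial=4),
    (2, 1): make_block(2, hydrogen_samples, n_radial=4),
}

# Set the coefficient c^i_{nlm} for structure 0, center 4, l = 2, m = -1, n = 3
block = toy_map[(2, 1)]
i = block["samples"].index((0, 4))
m = block["components"].index(-1)
block["values"][i][m][3] = 0.25
```

Each block has $2l + 1$ components, so the $l = 2$ block above has shape `(3, 5, 4)` while the $l = 0$ block has shape `(3, 1, 4)`.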

### Coulomb Matrices



For each structure in the training set, a Coulomb repulsion metric can be
calculated between pairs of basis functions indexed by ${n_1l_1m_1}$ and
${n_2l_2m_2}$. The provided Coulomb matrices contain these repulsions, measured
in Hartree units of energy.

These metrics will be used to define a physically-inspired loss
function used in model training (in the second example notebook).
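The actual loss is defined in the second notebook and is not reproduced here. As a rough illustration of how a Coulomb metric can weight errors in the expansion coefficients, the sketch below assumes a quadratic form $\Delta c^T J \Delta c$, where $\Delta c$ is the difference between predicted and reference coefficients and $J$ is the metric. The function name `coulomb_loss` and all numbers are hypothetical.

```python
def coulomb_loss(c_pred, c_ref, coulomb):
    """Quadratic form delta^T J delta over a flat list of coefficients (assumed form)."""
    delta = [p - r for p, r in zip(c_pred, c_ref)]
    n = len(delta)
    return sum(delta[a] * coulomb[a][b] * delta[b] for a in range(n) for b in range(n))


# Hypothetical symmetric 2 x 2 metric between two basis functions
J = [[1.0, 0.3],
     [0.3, 0.5]]
c_ref = [0.5, -0.2]
c_pred = [0.6, -0.1]

loss = coulomb_loss(c_pred, c_ref, J)
```

Unlike a plain sum of squared errors, the off-diagonal elements of the metric couple errors on different basis functions.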

Because these matrices are quite large, they had to be split up in order to be
stored on GitHub. Run the cell below to recombine them, and observe the keys.
Notice how each block is indexed by a pair of $l$ values and chemical species now.

In [None]:
from azoswitch_utils import recombine_coulomb_matrices

coulomb_matrices = recombine_coulomb_matrices(data_dir, 6)

In [None]:
# Just view the first 10 keys (as there are > 600 of them)
coulomb_matrices.keys[:10]

Now inspect a single block. Samples monitor a single structure index, and the 2
atomic center indices the basis functions belong to. Note only a single
structure index is present here because it doesn't make sense to calculate
repulsion between atoms in different structures. The components index the $m$
value for the 2 basis functions, and properties indexes the $n$ values.

In [None]:
coulomb_matrices.block(0)

## Structural Descriptors

Now we can build a $\lambda$-SOAP structural representation of the input data,
using only the ``.xyz`` files. First, we load the filenames from
``mollist.dat``. The order in which the filenames are listed dictates their
structure indices, which run from 0 -> 9.

In [None]:
# Read the filenames from mollist.dat 
with open(os.path.join(data_dir, "molecule_list.dat"), "r") as molecule_list:
    xyz_files = molecule_list.read().splitlines()
xyz_files

Each of these ``.xyz`` files can be read into an ASE object, or 'frame', and
these frames can be visualized with chemiscope. Use the slider to have a look at
each molecule in turn.

In [None]:
import ase.io
import chemiscope

# Read into ASE frames
frames = [ase.io.read(os.path.join(data_dir, "xyz", f)) for f in xyz_files]

# Display molecules with chemiscope
cs = chemiscope.show(frames, mode="structure")
display(cs)

In [None]:
# Print the unique chemical species present in the dataset
unique_species = list(set([specie for f in frames for specie in f.get_atomic_numbers()]))
unique_species

In [None]:
# Rascaline hypers
rascal_hypers = {
    "cutoff": 5.0,  # Angstrom
    "max_radial": 6,  # Exclusive
    "max_angular": 5,  # Inclusive
    "atomic_gaussian_width": 0.2,
    "radial_basis": {"Gto": {}},
    "cutoff_function": {"ShiftedCosine": {"width": 0.5}},
    "center_atom_weight": 1.0,
}

# Compute lambda-SOAP: uses rascaline to compute a SphericalExpansion
# runtime approx 15 seconds
input = lambda_soap_vector(
    frames, rascal_hypers, save_dir=data_dir, neighbor_species=unique_species
)

## Clean the $\lambda$-SOAP descriptor

Now that we've generated our descriptor, let's load it (our 'input') and the
ground-state electron density ('output') TensorMaps from file using the
``equistore.io.load`` function.

In [None]:
from equistore import io

input = io.load(os.path.join(data_dir, "lambda_soap.npz"))
output = io.load(os.path.join(data_dir, "gs_edensity.npz"))
print("Lambda SOAP key names: ", input.keys.names, "\nNumber of blocks: ", len(input.keys))
print("GS e-density key names: ", output.keys.names, "\nNumber of blocks: ", len(output.keys))

In general, higher-order correlations are described by combining descriptors of
lower order. The `SphericalExpansion` calculator in `rascaline` computes an
atom-centered density correlation for each structure in the input molecular
dataset. These correspond to $\nu = 1$ order features. SOAP-based
representations are by definition $\nu = 2$ order features as they measure
pairwise atom density correlations. As such, using Clebsch-Gordan iterations,
they are generated by combining $\nu=1$ features with themselves. $\lambda$-SOAP
features, by extension, are given by projecting (and rotationally averaging)
$\nu = 2$ SOAP features on a hierarchy of spherical harmonics which behave
equivariantly under rotations.
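As a minimal sketch of combining $\nu = 1$ features with themselves, the code below builds the invariant ($\lambda = 0$) part of a $\nu = 2$ descriptor by contracting two sets of expansion coefficients over $m$. It is not the rascaline or rholearn implementation: it assumes real spherical harmonics, omits normalisation factors and the $\lambda > 0$ channels, and uses hypothetical coefficient values. The helper `rotate_z_l1` rotates only the $l = 1$ coefficients about the z axis to check that the result does not change.

```python
import math


def power_spectrum(c, n_radial, l_max):
    """p[(n1, n2, l)] = sum_m c[(n1, l)][m] * c[(n2, l)][m]."""
    p = {}
    for l in range(l_max + 1):
        for n1 in range(n_radial):
            for n2 in range(n_radial):
                p[(n1, n2, l)] = sum(a * b for a, b in zip(c[(n1, l)], c[(n2, l)]))
    return p


def rotate_z_l1(c, n_radial, angle):
    """Rotate the l = 1 coefficients about z; l = 0 coefficients are unchanged.

    Real l = 1 harmonics ordered m = -1, 0, 1 transform like (y, z, x).
    """
    cos, sin = math.cos(angle), math.sin(angle)
    rotated = dict(c)
    for n in range(n_radial):
        y, z, x = c[(n, 1)]
        rotated[(n, 1)] = [sin * x + cos * y, z, cos * x - sin * y]
    return rotated


# Hypothetical nu = 1 coefficients for one atomic environment, keyed by (n, l)
n_radial, l_max = 2, 1
c = {
    (0, 0): [0.7],
    (1, 0): [0.2],
    (0, 1): [0.1, -0.3, 0.5],
    (1, 1): [0.4, 0.2, -0.1],
}

p = power_spectrum(c, n_radial, l_max)
p_rot = power_spectrum(rotate_z_l1(c, n_radial, angle=0.8), n_radial, l_max)
max_change = max(abs(p[k] - p_rot[k]) for k in p)  # numerically zero
```

The $\nu = 1$ coefficients change under the rotation, but every entry of the contracted $\nu = 2$ feature stays the same to within floating-point error.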

In principle, one could build a descriptor that comprises features of different
orders $\nu$. This is why, in the ``input`` TensorMap, there exists a key
monitoring ``'order_nu'``. In our case, using $\lambda$-SOAP, the order of all
blocks in the TensorMap is by definition $\nu = 2$. We can therefore drop this
key, as it is redundant.

Furthermore, when the $\nu = 2$ descriptor is generated by combining $\nu = 1$
features, inversion symmetry is accounted for such that the resulting
descriptor transforms covariantly under both proper and improper rotations,
belonging to the SO(3) and O(3) symmetry groups, respectively. As we are
interested in modelling the electron density, which transforms rigidly and
covariantly with the molecule to which it belongs, we only need to consider
descriptors that are equivariant under the action of the SO(3) group of proper
rotations. We can therefore drop all blocks in the ``input`` $\lambda$-SOAP
TensorMap that have odd parity, i.e. $\sigma = -1$, indicated by the key
``'inversion_sigma'``.

A final bit of cleaning needs to be performed. When generating the reference
data, the electron density coefficients (now stored in the ``output`` TensorMap)
for the Hydrogen (``'species_center' = 1``) basis functions were only calculated
for angular momentum channels $0 \leq \lambda \leq 4$, whereas those for the
other atoms present in the molecular dataset (C=6, N=7, O=8, S=16) were
calculated for $0 \leq \lambda \leq 5$. Therefore, in order to map inputs to
outputs, we need to drop the block in the ``input`` TensorMap corresponding to
the key ``('spherical_harmonics_l', 'species_center') = (5, 1)``.
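The three cleaning steps can be sketched on the keys alone, using plain tuples rather than equistore ``Labels``. This is not the ``clean_azoswitch_lambda_soap`` helper: the function name `clean_keys`, the assumed key layout `(order_nu, inversion_sigma, spherical_harmonics_l, species_center)`, and the assumption that the raw descriptor holds both parities for every channel and species are all hypothetical.

```python
def clean_keys(keys, l_max_by_species):
    """Drop 'order_nu', keep sigma = +1 blocks, and drop channels missing in the target."""
    cleaned = []
    for order_nu, sigma, l, species in keys:
        if sigma != 1:
            continue  # odd parity block, not needed for the electron density
        if l > l_max_by_species[species]:
            continue  # no reference coefficients for this channel
        cleaned.append((l, species))  # 'order_nu' and 'inversion_sigma' are now redundant
    return cleaned


species = [1, 6, 7, 8, 16]  # H, C, N, O, S
raw_keys = [(2, sigma, l, a) for sigma in (1, -1) for l in range(6) for a in species]

# Hydrogen coefficients only go up to l = 4; the other species go up to l = 5
l_max_by_species = {1: 4, 6: 5, 7: 5, 8: 5, 16: 5}

cleaned = clean_keys(raw_keys, l_max_by_species)
```

Under these assumptions 60 raw keys reduce to 29: half are removed by the parity filter, and one more, `(5, 1)`, by the hydrogen channel limit.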

In [None]:
from azoswitch_utils import clean_azoswitch_lambda_soap

input_cleaned = clean_azoswitch_lambda_soap(input)
input_cleaned

In [None]:
# Save the cleaned lambda-SOAP descriptor
io.save(os.path.join(data_dir, "lambda_soap_cleaned.npz"), input_cleaned)

Let's perform some checks on the input and output TensorMaps. In order to do
supervised ML in PyTorch, it is required that all dimensions of input and output
tensors, except the last, are exactly equivalent, both in terms of size and the
ordering of the data they correspond to. The final dimension, i.e. the
properties/features (in equistore/torch terminology) need not match as learning
the mapping of input properties onto output properties is the goal of supervised
ML.

As we are using ``equistore`` and storing our atomistic ML data in ``TensorMap``
objects, we can use the ``Labels`` metadata to check our data before training.
PyTorch doesn't track metadata, so we need to be certain that:

* a) input/output keys are equivalent, but order doesn't matter as each
  ``TensorBlock`` indexed by these keys will be a separate input to its own
  model.
* b) the input/output samples of each block indexed by a given key are *exactly*
  equivalent, in size, values, and order.
* c) the input/output components of each block indexed by a given key are *exactly*
  equivalent, in size, values, and order.

We can perform these checks by first checking the keys ``Labels`` objects of the
input/output ``TensorMaps``, then iterating over these keys, extracting the
input/output ``TensorBlocks`` and comparing their samples and components
``Labels``. The code cell below does this - if everything is ok no error should
be raised.
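The logic of these checks can be sketched with dictionaries standing in for TensorMaps. This is not the implementation of ``utils.equal_metadata``; the function name `check_equal_metadata` and the toy labels are hypothetical.

```python
def check_equal_metadata(inp, out):
    """Raise ValueError unless keys match (any order) and samples/components match exactly."""
    # a) same set of keys, order irrelevant
    if set(inp) != set(out):
        raise ValueError("input and output keys differ")
    for key in inp:
        # b) and c): same size, values, and order; properties are allowed to differ
        for axis in ("samples", "components"):
            if inp[key][axis] != out[key][axis]:
                raise ValueError(f"{axis} differ for key {key}")
    return True


# Toy maps: key -> labels along each axis
toy_in = {
    (0, 1): {"samples": [(0, 3), (0, 4)], "components": [0], "properties": list(range(12))},
    (1, 1): {"samples": [(0, 3), (0, 4)], "components": [-1, 0, 1], "properties": list(range(20))},
}
toy_out = {
    (1, 1): {"samples": [(0, 3), (0, 4)], "components": [-1, 0, 1], "properties": [0, 1, 2]},
    (0, 1): {"samples": [(0, 3), (0, 4)], "components": [0], "properties": [0, 1, 2, 3]},
}

ok = check_equal_metadata(toy_in, toy_out)
```

Swapping the order of two samples in one block of `toy_out` would make the check raise, even though the same atoms are present.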

In [None]:
from rholearn import utils

utils.equal_metadata(input_cleaned, output)
print("input and output TensorMaps checked - consistent metadata, checks passed")

## Perform data partitioning

The input and output data has been defined, cleaned, and checked for metadata
consistency. Now we need to perform a train-test-validation split and, in order
to perform a learning exercise, create some subsets of the training data.

We will define a dict of settings to pass to the function that executes this
partitioning: ``partition_data``.

``settings`` is a nested dict, with one nested dict for each of the following
keys:

* ``"io"``: stores the paths of the input (i.e. lambda-SOAP) and output (i.e.
  ground-state electron density) TensorMaps and the directory where the
  partitioned data will be stored.

* ``"numpy"``: stores the random seed used to control reproducible shuffling of
  the structure indices when the data is partitioned.

* ``"train_test_split"``: stores the settings for how to perform the train-test
  split. In this case, we want to split our TensorMaps along the samples axis,
  splitting by structure. As we want a train-test-validation split, we specify
  ``"n_groups": 3`` and use ``"group_sizes_abs": [7, 2, 1]`` to place 7, 2, and
  1 of the 10 structures in the train, test, and validation TensorMaps,
  respectively. We could also pass ``"group_sizes_rel": [0.7, 0.2, 0.1]`` here
  with the same outcome. If we just wanted a train-test split, with no
  validation set, we would pass ``"n_groups": 2`` and
  ``"group_sizes_rel": [x, y]``, where ``x + y <= 1``.

* ``"data_partitions"``: stores the number of learning exercises and the number
  of training subsets to prepare for each; here, 2 exercises with 3 subsets
  each.
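The splitting logic described for ``"train_test_split"`` can be sketched as follows. This is not the ``partition_data`` implementation: the function name `split_structures` is hypothetical, and it uses the standard-library ``random`` module, so the shuffled order will not match what rholearn produces with its NumPy seed.

```python
import random


def split_structures(n_structures, seed, group_sizes_abs=None, group_sizes_rel=None):
    """Shuffle structure indices reproducibly and cut them into consecutive groups."""
    if (group_sizes_abs is None) == (group_sizes_rel is None):
        raise ValueError("pass exactly one of group_sizes_abs or group_sizes_rel")
    if group_sizes_abs is None:
        # Convert relative sizes to absolute numbers of structures
        group_sizes_abs = [round(f * n_structures) for f in group_sizes_rel]
    if sum(group_sizes_abs) > n_structures:
        raise ValueError("group sizes exceed the number of structures")

    idxs = list(range(n_structures))
    random.Random(seed).shuffle(idxs)  # same seed gives the same order

    groups, start = [], 0
    for size in group_sizes_abs:
        groups.append(idxs[start:start + size])
        start += size
    return groups


train, test, val = split_structures(10, seed=10, group_sizes_abs=[7, 2, 1])
train_rel, test_rel, val_rel = split_structures(10, seed=10, group_sizes_rel=[0.7, 0.2, 0.1])
```

With 10 structures, absolute sizes `[7, 2, 1]` and relative sizes `[0.7, 0.2, 0.1]` give identical groups for the same seed.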

In [None]:
settings = {
    "io": {
        "input": os.path.join(data_dir, "lambda_soap_cleaned.npz"),
        "output": os.path.join(data_dir, "gs_edensity.npz"),
        "data_dir": os.path.join(data_dir, "partitions"),
    },
    "numpy": {
        "random_seed": 10,
    },
    # Perform a train-test-validation split, with 7, 2, 1 molecules in each
    "train_test_split": {
        "axis": "samples",
        "names": ["structure"],
        "n_groups": 3,
        "group_sizes_abs": [7, 2, 1],
        # "group_sizes_rel": [0.7, 0.2, 0.1],  # we could also use this argument
    },
    # Prepare training data partitions for 2 exercises, each with 3 subsets
    "data_partitions": {
        "n_exercises": 2,
        "n_subsets": 3,
    },
}

In [None]:
from rholearn.pretraining import partition_data

# Runtime approx 20 seconds
partition_data(settings)

Let's inspect how the data was partitioned. In the "partitions" folder, a numpy
array called "subset_sizes_train.npy" was saved. This stores the sizes (i.e.
number of training structures) of each of the training subsets.

You can see that, of the 7 structures that we designated as the total training
set, 2, 4, and 6 structures were assigned to the three training subsets,
respectively, to be used in a learning exercise. While these seem evenly spaced
in linear space, in practice the ``partition_data`` function ensures that the
sizes of training subsets are evenly spaced along a *log* (base ``e``) scale, to
the nearest integer.
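To illustrate what "evenly spaced along a log scale, to the nearest integer" means, here is a small sketch. The end points and rounding conventions used inside ``partition_data`` are not given here, so the function name `log_spaced_sizes` and its arguments are hypothetical, and the sketch is not expected to reproduce the subset sizes of this notebook. Larger numbers are used so that the spacing is easier to see.

```python
import math


def log_spaced_sizes(smallest, largest, n_subsets):
    """Integer sizes evenly spaced between log(smallest) and log(largest)."""
    if n_subsets == 1:
        return [largest]
    log_min, log_max = math.log(smallest), math.log(largest)
    step = (log_max - log_min) / (n_subsets - 1)
    return [round(math.exp(log_min + i * step)) for i in range(n_subsets)]


sizes = log_spaced_sizes(10, 1000, 5)
print(sizes)  # [10, 32, 100, 316, 1000]
```

Each size is roughly a constant multiple of the previous one, which gives evenly spaced points on the horizontal axis of a log-log learning curve.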

In [None]:
import numpy as np

np.load(os.path.join(data_dir, "partitions", "subset_sizes_train.npy"))

For each of the 2 learning exercises, the training structure indices were
shuffled before subsets were created. Let's check this by printing the ordered
structure indices from which the training set was partitioned.

In [None]:
print("train structure idxs:")
print("exercise 0: ", np.load(os.path.join(data_dir, "partitions", "exercise_0", "structure_idxs_train.npy")))
print("exercise 1: ", np.load(os.path.join(data_dir, "partitions", "exercise_1", "structure_idxs_train.npy")))

While the same 7 structure indices appear in both lists, their order is
different. That means, for instance, when the first subset of size 2 is created,
structures 2 and 8 will be present in the training set for exercise 0, and
structures 0 and 6 for exercise 1.

Just as a sanity check, let's print the test and validation structure indices.

In [None]:
print("test structure idxs: ", np.load(os.path.join(data_dir, "partitions", "structure_idxs_test.npy")))
print("val structure idxs: ", np.load(os.path.join(data_dir, "partitions", "structure_idxs_val.npy")))

Next, let's address the warnings outputted when calling the ``partition_data``
function above. 

The samples dimension of each TensorBlock in the cleaned $\lambda$-SOAP
representation contains indices for multiple structures. Let's look at the block
indexed by key ``(0, 8)`` as an example, as this is one of the blocks we were
warned about above when the ``partition_data`` function was called, for exercise
0.

In [None]:
from rholearn.utils import key_tuple_to_npvoid

key = key_tuple_to_npvoid((0, 8), names=["spherical_harmonics_l", "species_center"])

print(
    f"samples of block indexed by key {key} in the cleaned lsoap TensorMap:\n",
    input_cleaned[key].samples.names, ": ", input_cleaned[key].samples, 
)

The block indexed by key ``(0, 8)`` contains only structures 3, 5, 6, and 7.

When the data is partitioned (i.e. in the train-test-validation split) based on
structure index, some blocks may end up being empty. It is important that we
still hold on to this empty block, indexed by its appropriate key, so that we
can learn the relationship between input and output. In the case of an empty
block, there will be nothing to learn and the weights matrix will be of size 0.
However, using a larger training set or a different way of performing the
train-test-validation split, the block may not be empty.
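The sketch below shows how slicing by structure index can leave a block empty while its key is kept. It uses plain lists and dicts rather than TensorMaps. The structure indices come from this notebook (block ``(0, 8)`` holds structures 3, 5, 6, and 7, and the first training subset of exercise 0 holds structures 2 and 8), while the center indices and the helper name `slice_samples_by_structure` are hypothetical.

```python
def slice_samples_by_structure(samples, structure_idxs):
    """Keep only the (structure, center) samples whose structure index was selected."""
    keep = set(structure_idxs)
    return [s for s in samples if s[0] in keep]


# Samples of the block with key (0, 8); center indices are made up
full_map = {(0, 8): [(3, 10), (5, 4), (6, 7), (7, 2)]}

train_subset = [2, 8]        # first training subset of exercise 0
larger_subset = [2, 8, 0, 3] # a larger subset drawn from the same ordering

# The key is retained even when no samples survive the slice
small_train = {k: slice_samples_by_structure(v, train_subset) for k, v in full_map.items()}
large_train = {k: slice_samples_by_structure(v, larger_subset) for k, v in full_map.items()}
```

Here `small_train[(0, 8)]` is an empty list, whereas `large_train[(0, 8)]` contains the sample from structure 3.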

Remember from above that the ordered training structure indices for exercise 0
are ``[(2,) (8,) (0,) (3,) (5,) (6,) (1,)]``. The cells below print the samples
of the block indexed by key ``(0, 8)`` in the exercise 0, subset 0 training
TensorMap, and in the test and validation TensorMaps.

In [None]:
# Get the input train TensorMap from exercise 0, subset 0
in_train = io.load(
    os.path.join(data_dir, "partitions", "exercise_0", "subset_0", "in_train.npz")
)
in_train[key].samples

In [None]:
# Get the input test TensorMap
in_test = io.load(
    os.path.join(data_dir, "partitions", "in_test.npz")
)
in_test[key].samples

In [None]:
# Get the input validation TensorMap
in_val = io.load(
    os.path.join(data_dir, "partitions", "in_val.npz")
)
in_val[key].samples

We will go over the meaning and use of the Coulomb matrices, recombined from
their 6 parts earlier in this notebook, in the next notebook.

Now that the data has been partitioned, we are ready to move on to setting up
the training simulations.