# 1: Molecules, Structural Representations, and Training Sets

* Time to run the cells: ~ 1 minute

First things first: in `settings.py`, set `RHOLEARN_DIR` to the absolute
path of `.../rholearn/` on your local machine. Inspect the other options set in
this file. The settings relevant to this notebook are the `RASCAL_HYPERS` and
`DATA_SETTINGS`.

`RASCAL_HYPERS` sets the hyperparameters used to generate the $\lambda$-SOAP
structural representation. `DATA_SETTINGS` contains settings for performing a
train-test(-validation) split, partitioning the data ready for learning
exercises, and where this data should be written.

Provided you have set the correct `RHOLEARN_DIR`, these settings should allow
these notebook tutorials to run out of the box.

First, import all the necessary packages:

In [None]:
# Useful standard and scientific ML libraries
import os
import ase.io
import numpy as np

# M-Stack packages
import chemiscope
import equistore
from equistore import Labels

from rholearn import io, features, pretraining, utils
from settings import RASCAL_HYPERS, DATA_SETTINGS

## Reference Data

### Electron Densities

The data used here is a 10-molecule subset of a larger dataset of azoswitch
molecules used in the electron density learning of excited state properties. You
can read the paper at __"Learning the Exciton Properties of Azo-dyes"__, J.
Phys. Chem. Lett. 2021, 12, 25, 5957–5962. DOI:
[10.1021/acs.jpclett.1c01425](https://doi.org/10.1021/acs.jpclett.1c01425). 

For the purposes of this workflow we focus on predicting only the
ground-state electron density, but the workflow can easily be extended to
first- and second-excited-state hole and particle densities, for which
reference QM data is available at the above source.

All the data needed to run this proof-of-concept workflow is shipped in the
GitHub repo, stored in the ``rho_learn/docs/example/azoswitch/data/`` directory.
Inspect this directory. There is a file called ``molecule_list.dat`` containing the
filenames of 10 structures, a subfolder ``xyz/`` containing these ``.xyz``
files, a folder containing some QM-calculated Coulomb repulsion matrices, and
the QM-calculated (i.e. reference) ground state electron density coefficients of
the molecules included in the training set.

Both the Coulomb matrices and electron density are stored in equistore TensorMap
format. Let's load and inspect the structure of the electron density.

In [None]:
# Load the reference electron density coefficients as a TensorMap
output = equistore.load(os.path.join(DATA_SETTINGS["data_dir"], "e_densities.npz"))

TensorMaps are the main objects users interact with when using equistore, storing in
principle any kind of data useful in atomistic simulations and their associated
metadata. 

A TensorMap is a collection of TensorBlocks, each of which is indexed by a key
and contains atomistic data on a subset of a system of interest. In our case,
the electron density TensorMap has blocks for each combination of spherical
harmonic channel, $l$, and chemical species. 

Run the cell below. Notice how the $l$ values run from 0 -> 5 (inclusive) and
the chemical species (or 'species_center') span values 1, 6, 7, 8, 16, for
elements H, C, N, O, S respectively.

In [None]:
# Display the keys (l, species) that index the blocks of the TensorMap
output.keys

Now let's look at a specific block. A TensorBlock contains three kinds of axes:
the first axis holds the samples, the last axis holds the properties, and all
intermediate axes hold the components. In general, samples describe what we are
representing, i.e. atomic environments in a given structure, and properties
describe how we are representing it.

In this example, a set of coefficients for the expansion of the electron density
on a set of basis functions are given as the learning targets and therefore the
data that appears in the TensorMaps. For a given structure $A$,

$$ \rho_A (x) = \sum_{inlm} c^i_{nlm} \phi_{nlm}(x - r_i) $$

where $c^i_{nlm}$ are the expansion coefficients, $\phi$ the basis functions.
$i$ is an atomic index for the atoms in a molecule, $n$ the radial index, and
$l$ and $m$ the spherical harmonics indices.
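
The expansion above can be made concrete with a toy model. The sketch below is
an illustration only, not how rholearn or the reference QM code evaluates
densities: it assumes a one-dimensional system, keeps only $l = 0$ terms so the
sums over $l$ and $m$ drop out, and uses hypothetical Gaussian basis functions
`phi` with made-up widths and coefficients.

```python
import math

# Hypothetical Gaussian widths, one per radial index n (illustrative values only)
WIDTHS = [0.5, 1.0]


def phi(n, x):
    """Toy basis function for radial index n, evaluated at displacement x."""
    return math.exp(-((x / WIDTHS[n]) ** 2))


def density(x, centers, coeffs):
    """Evaluate rho(x) = sum_i sum_n c[i][n] * phi_n(x - r_i) for a toy 1D system."""
    return sum(
        coeffs[i][n] * phi(n, x - r_i)
        for i, r_i in enumerate(centers)
        for n in range(len(WIDTHS))
    )


# Two "atoms" at positions r_0 = 0.0 and r_1 = 1.5, each with two coefficients
centers = [0.0, 1.5]
coeffs = [[1.0, 0.2], [0.8, 0.1]]

rho_at_first_atom = density(0.0, centers, coeffs)
```

Because the density is linear in the coefficients, learning the coefficients
$c^i_{nlm}$ is enough to reconstruct $\rho_A$ once the basis is fixed.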

In [None]:
# Inspect the first block of the electron density TensorMap
output.block(0)

The samples contain 'structure' (i.e. $A$ in the equation above) and 'center'
($i$) indices. The components contain 'spherical_harmonics_m' ($m$) indices,
and the properties contain 'n' (i.e. radial channel $n$) indices. Remember from
above that the keys of the TensorMap store the sparse indices for
'spherical_harmonics_l' (i.e. $l$) as well as 'species_center'; the latter
because different basis functions are often used for different chemical species.
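
To make this layout concrete, the sketch below mimics it with plain Python
containers. It does not use the equistore API; the dictionary structure, the
helper `make_block`, and the numbers of atoms and radial channels are all
hypothetical. It also assumes one block per ($l$, species) pair except for the
($l = 5$, H) pair, which this notebook later notes is not included in the
electron density.

```python
from itertools import product

L_VALUES = range(6)          # spherical harmonic channels l = 0, ..., 5
SPECIES = [1, 6, 7, 8, 16]   # atomic numbers of H, C, N, O, S


def make_block(l, samples, n_radial):
    """Build a toy block: metadata for each axis plus a nested list of zeros."""
    components = list(range(-l, l + 1))  # m = -l, ..., +l
    properties = list(range(n_radial))   # radial channels n
    values = [[[0.0 for _ in properties] for _ in components] for _ in samples]
    return {
        "samples": samples,        # (structure, center) pairs
        "components": components,  # spherical_harmonics_m
        "properties": properties,  # n
        "values": values,
    }


# Hypothetical samples: two carbon atoms in structure 0, one in structure 1
carbon_samples = [(0, 2), (0, 5), (1, 0)]

# Keys are (spherical_harmonics_l, species_center) pairs; (5, 1) is assumed absent
keys = [(l, s) for l, s in product(L_VALUES, SPECIES) if (l, s) != (5, 1)]

# Hypothetical: 4 radial channels for the (l=2, carbon) block
block = make_block(l=2, samples=carbon_samples, n_radial=4)
shape = (len(block["values"]), len(block["values"][0]), len(block["values"][0][0]))
```

Here `shape` is (samples, components, properties), so a block with $l = 2$
always has $2l + 1 = 5$ entries along its components axis.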

### Coulomb Metrics

For each structure in the training set, a Coulomb repulsion metric can be
calculated between pairs of basis functions indexed by ${n_1l_1m_1}$ and
${n_2l_2m_2}$. The provided Coulomb matrices contain these repulsions, measured
in Hartree units of energy.

These metrics will be used to define a physically-inspired loss
function used in model training (in the second example notebook).
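
As a rough illustration of how such a metric can enter a loss, the sketch below
assumes the loss is the quadratic form $\Delta c^T J \Delta c$, where
$\Delta c$ is the difference between predicted and reference coefficients and
$J$ is the Coulomb matrix. This form is an assumption made here for
illustration; the loss actually used is defined in the second example notebook.
The function name `coulomb_loss` and all numbers are hypothetical, and the
basis functions are flattened onto a single index.

```python
def coulomb_loss(c_pred, c_ref, metric):
    """Return sum_ab dc_a * J_ab * dc_b for a flattened basis index (assumed form)."""
    delta = [p - r for p, r in zip(c_pred, c_ref)]
    return sum(
        delta[a] * metric[a][b] * delta[b]
        for a in range(len(delta))
        for b in range(len(delta))
    )


# Hypothetical symmetric 2x2 Coulomb metric (Hartree) and coefficient vectors
metric = [[2.0, 0.5],
          [0.5, 1.0]]
c_ref = [1.0, -0.5]
c_pred = [1.1, -0.3]

loss = coulomb_loss(c_pred, c_ref, metric)
```

With an identity metric this reduces to the ordinary squared error; the
off-diagonal entries couple errors on different basis functions.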

Because these matrices are quite large, they had to be split up in order to be
stored on GitHub. Run the cell below to recombine them, and observe the keys.
Notice how each block is indexed by a pair of $l$ values and chemical species now.

In [None]:
from azoswitch_utils import recombine_coulomb_metrics

coulomb_metrics = recombine_coulomb_metrics(DATA_SETTINGS["data_dir"])

In [None]:
# Just view the first 10 keys (as there are > 600 of them)
coulomb_metrics.keys[:10]

Now inspect a single block. The samples hold a single structure index and the 2
atomic center indices that the basis functions belong to. Only a single
structure index is present because it doesn't make sense to calculate
repulsion between atoms in different structures. The components index the $m$
values of the 2 basis functions, and the properties index the $n$ values.

In [None]:
coulomb_metrics.block(0)

## Generate Structural Descriptors

Now we can build a $\lambda$-SOAP structural representation of the input data,
using only the ``.xyz`` files. First, we load the filenames from
``molecule_list.dat``. The order in which the filenames are listed dictates
their structure index, which runs from 0 to 9.
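
The mapping from list order to structure index can be sketched as below. The
filenames are hypothetical placeholders, not the contents of
``molecule_list.dat``.

```python
# Hypothetical filenames standing in for the lines of molecule_list.dat
example_files = ["mol_a.xyz", "mol_b.xyz", "mol_c.xyz"]

# The position of each filename in the list is its structure index
structure_index = {name: idx for idx, name in enumerate(example_files)}
```

Reordering the lines of the list file would therefore change which structure
each index refers to.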

In [None]:
# Read the filenames from molecule_list.dat 
with open(os.path.join(DATA_SETTINGS["data_dir"], "molecule_list.dat"), "r") as molecule_list:
    xyz_files = molecule_list.read().splitlines()
xyz_files

Each of these ``.xyz`` files can be read into an ASE object, or 'frame', and
these frames can be visualized with chemiscope. Use the slider to have a look at
each molecule in turn.

In [None]:
# Read xyz structures into ASE frames
frames = [ase.io.read(os.path.join(DATA_SETTINGS["data_dir"], "xyz", f)) for f in xyz_files]

# Display molecules with chemiscope
chemiscope.show(
    frames,
    properties={
        "Number of atoms": [f.get_global_number_of_atoms() for f in frames],
        "Molecular mass / u": [np.sum(f.get_masses()) for f in frames],
    },
)

In [None]:
# Print the unique chemical species present in the dataset
unique_species = list(set([specie for f in frames for specie in f.get_atomic_numbers()]))
unique_species

In [None]:
# Compute lambda-SOAP: uses rascaline to compute a SphericalExpansion (~ 15 secs)
input = features.lambda_soap_vector(
    frames, RASCAL_HYPERS, neighbor_species=unique_species, even_parity_only=True
)
# Drop the block for l=5, Hydrogen as this isn't included in the output electron density
input = equistore.drop_blocks(input, keys=Labels(input.keys.names, np.array([[5, 1]])))

# Load the output data (i.e. electron density)
output = equistore.load(os.path.join(DATA_SETTINGS["data_dir"], "e_densities.npz"))

# Check that the metadata of input and output match along the samples and components axes
assert equistore.equal_metadata(input, output, check=["samples", "components"])

# Save lambda-SOAP descriptor to file
equistore.save(os.path.join(DATA_SETTINGS["data_dir"], "lambda_soap.npz"), input)

## Perform data partitioning

The input and output data have been defined, cleaned, and checked for metadata
consistency. Now we need to perform a train-test-validation split and, in order
to perform a learning exercise, create some subsets of the training data.

In the `DATA_SETTINGS` dict of `settings.py` are the options used to perform
this data partitioning, that we will provide to the function ``partition_data``.

* `axis` controls the TensorMap axis the train-test split should be performed
  along. As we want to split our data by structure, we specify `axis="samples"`.
* `names` dictates the names of the samples we want to split according to.
  Again, we want to split by structure here, so set `names="structures"`.
* `n_groups` is how many groups to split the data into. We want to perform a
  train-test-validation split, so specify `n_groups=3`.
* `group_sizes` controls how many of the named splitting indices (in this case
  the structures along the samples axis) go into each group. We have a dataset
  of size 10, and want 7, 2, and 1 structure(s) to be in the train, test, and
  validation sets respectively, so set `group_sizes=[7, 2, 1]`. We could also
  pass relative sizes as floats, for instance `group_sizes=[0.7, 0.2, 0.1]`.
* `seed` defines the numpy random seed used for shuffling the structure indices
  before splitting. Passing `None` disables shuffling.
* `n_exercises` specifies how many learning exercises should be performed, and
  thus how many top-level directories with partitioned data should be created.
  For each exercise, the data is shuffled differently, leading to different
  train, test, and validation data.
* `n_subsets` controls how many subsets of the training data should be created
  for each exercise. The sizes of the subsets, relative to the total number of
  training structures, are equally spaced along a log scale.
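
The splitting options above can be sketched as follows. This is not the
implementation of ``partition_data``: the helper `split_indices` is
hypothetical, it uses Python's `random` module rather than numpy (so the
shuffled order for a given seed will differ from the notebook's), and the way
relative sizes are rounded is an assumption.

```python
import random


def split_indices(n_items, group_sizes, seed=None):
    """Shuffle indices 0..n_items-1 (if seed is not None) and cut them into groups.

    group_sizes may hold absolute counts (ints) or relative sizes (floats).
    """
    idxs = list(range(n_items))
    if seed is not None:
        random.Random(seed).shuffle(idxs)

    # Convert relative sizes to absolute counts (rounding rule is an assumption)
    if all(isinstance(s, float) for s in group_sizes):
        group_sizes = [round(s * n_items) for s in group_sizes]
    if sum(group_sizes) > n_items:
        raise ValueError("group sizes exceed the number of items")

    groups, start = [], 0
    for size in group_sizes:
        groups.append(idxs[start:start + size])
        start += size
    return groups


# 10 structures split into train, test, and validation groups of 7, 2, and 1
train, test, val = split_indices(10, [7, 2, 1], seed=10)
```

Each structure index lands in exactly one group, and calling the function again
with the same seed reproduces the same split.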

In [None]:
# Runtime approx 15 seconds
pretraining.partition_data(
    input_path=os.path.join(DATA_SETTINGS["data_dir"], "lambda_soap.npz"),
    output_path=os.path.join(DATA_SETTINGS["data_dir"], "e_densities.npz"),
    data_settings=DATA_SETTINGS,
)

Let's inspect how the data was partitioned. In the data directory, a numpy
array called "subset_sizes_train.npy" was saved. This stores the sizes (i.e.
number of training structures) of each of the training subsets.

You can see that, of the 7 structures that we designated as the total
training set, 2, 4, and 6 structures were assigned to the respective training
subsets to be used in a learning exercise (given the random `seed=10` in
`DATA_SETTINGS`). While these seem evenly spaced in linear space, in practice
the ``partition_data`` function ensures that the sizes of training subsets are
evenly spaced along a *log* scale, to the nearest integer.
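
Log-even spacing can be sketched as below. The helper `log_spaced_sizes` and
its `smallest` and `largest` arguments are hypothetical, and the sketch is not
meant to reproduce the exact sizes chosen by ``partition_data``. For fixed
endpoints, the result does not depend on the base of the logarithm, which the
example checks.

```python
import math


def log_spaced_sizes(smallest, largest, n_subsets, base=10.0):
    """Return n_subsets integer sizes evenly spaced on a log scale (n_subsets >= 2)."""
    lo = math.log(smallest, base)
    hi = math.log(largest, base)
    step = (hi - lo) / (n_subsets - 1)
    # Step evenly in log space, map back, and round to the nearest integer
    return [round(base ** (lo + k * step)) for k in range(n_subsets)]


sizes_base_10 = log_spaced_sizes(10, 1000, 3)
sizes_base_e = log_spaced_sizes(10, 1000, 3, base=math.e)
```

On a log scale it is the ratio between consecutive sizes that is constant (a
factor of 10 in this example), not their difference.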

In [None]:
np.load(os.path.join(DATA_SETTINGS["data_dir"], "subset_sizes_train.npy"))

For each of the 2 learning exercises, the training structure indices were
shuffled before subsets were created. Let's check this by printing the ordered
structure indices from which the training set was partitioned.

In [None]:
print("train structure idxs:")
print("exercise 0: ", np.load(os.path.join(DATA_SETTINGS["data_dir"], "exercise_0", "structure_idxs_train.npy")))
print("exercise 1: ", np.load(os.path.join(DATA_SETTINGS["data_dir"], "exercise_1", "structure_idxs_train.npy")))

While both lists contain the same 7 structure indices, their order is
different. That means, for instance, when the first subset of size 2 is created,
structures 2 and 5 will be present in the training set for exercise 0,
and structures 1 and 9 for exercise 1.
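
The effect of the shuffling can be sketched as below, under the assumption
(suggested by the example just given) that each training subset is taken as the
first entries of the shuffled index list. Only the first two indices of each
list are taken from the text above; the remaining orderings are hypothetical,
as is the helper `training_subsets`.

```python
def training_subsets(ordered_idxs, subset_sizes):
    """Take the first `size` shuffled indices for each subset size (assumed scheme)."""
    return [ordered_idxs[:size] for size in subset_sizes]


# First two entries per exercise follow the text; the rest are made up
exercise_0 = [2, 5, 1, 9, 3, 7, 4]
exercise_1 = [1, 9, 4, 2, 7, 5, 3]
subset_sizes = [2, 4, 6]

subsets_0 = training_subsets(exercise_0, subset_sizes)
subsets_1 = training_subsets(exercise_1, subset_sizes)
```

Under this scheme the subsets within an exercise are nested, so each larger
subset contains all the structures of the smaller ones.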

Just as a sanity check, let's print the test and validation structure indices.
We see indices 6, 0, and 8 returned, none of which are present in the training
indices above.

In [None]:
print("test structure idxs: ", np.load(os.path.join(DATA_SETTINGS["data_dir"], "structure_idxs_test.npy")))
print("val structure idxs: ", np.load(os.path.join(DATA_SETTINGS["data_dir"], "structure_idxs_val.npy")))

Now that the data has been partitioned, we are ready to move on to building and
training models.