# Looking inside metatensor data

In these tutorial notebooks, we will learn how to work with data in the metatensor format, and how to put our own data into that format.

## 1 - The Dataset

For all these tutorials, we will use a dataset containing a collection of distorted 2-Propen-1-ol
conformations, taken from the ANI-1 dataset (see
https://github.com/isayev/ANI1_dataset) with their energies and forces
re-computed using DFTB+ (see https://dftbplus.org).

In [None]:
import numpy as np
import torch

import ase.io  # read the dataset

import chemiscope  # display the structures and associated properties in jupyter


In [None]:
# Read the data and extract energies and forces from ASE
frames = ase.io.read("propenol_conformers_dftb.xyz", ":100")

energies = np.array([[f.info["dftb_energy_eV"]] for f in frames])
forces = np.vstack([f.arrays["dftb_forces_eV_per_Ang"] for f in frames])


We can use chemiscope (https://chemiscope.org/) to visualize the structures and
corresponding energies in this dataset.

In [None]:
chemiscope.show(frames, properties={
    "frame index": np.arange(len(frames)),
    "energy": energies,
})


## 2 - SOAP representation

Our model will be built using a basic neural network applied on top of a SOAP
power spectrum, as computed by featomic (https://github.com/metatensor/featomic/).

[SOAP](https://journals.aps.org/prb/abstract/10.1103/PhysRevB.87.184115) (Smooth Overlap of Atomic Positions) is a family of atomistic
representations, which encode information about a collection of atoms in a manner
well suited for use with machine learning models. In particular, the resulting
per-atom SOAP descriptor is invariant to global translations, invariant to
permutations of neighboring atoms, and *equivariant* to rotations. The SOAP power
spectrum is the three-body representation, and is *invariant* to rotations.

The SOAP construction starts by representing atoms with a Gaussian density (instead of point
particles), and then expanding the neighbor density around an atom on a set of
radial and angular basis functions. This initial two-body expansion is called the SOAP
spherical expansion $\langle \alpha n l m | \rho_i \rangle$, and there is one
such spherical expansion per neighbor species $\alpha$.

$$
\langle \alpha n l m | \rho_i \rangle = \sum_j \int R_{nl}(r) \, Y^l_m(r) \, \rho_{ij}^\alpha(r) \, dr
$$

<center>
    <img src="img/SOAP.png">
</center>

From here, the SOAP power spectrum $\langle \alpha_1 \alpha_2 n_1 n_2 l | \rho_i^2 \rangle$ is taken as the correlation of the spherical expansion with itself, building a three-body representation of each atom's environment:

$$
\langle \alpha_1 \alpha_2 n_1 n_2 l | \rho_i^2 \rangle = \sum_m \langle \alpha_1 n_1 l m | \rho_i \rangle \otimes \langle \alpha_2 n_2 l m | \rho_i \rangle
$$
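
The rotational invariance of this contraction over $m$ can be checked numerically for a single angular channel. The sketch below is not featomic code: it uses made-up $l = 1$ coefficients, and assumes that a rotation acts on the $m$ components of a fixed-$l$ set of real coefficients through an orthogonal matrix (a 3×3 matrix for $l = 1$).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical l = 1 expansion coefficients for one atom and one neighbor type:
# one row per radial index n, one column per m in (-1, 0, 1).
n_radial = 4
coefficients = rng.normal(size=(n_radial, 3))

# Random orthogonal 3x3 matrix, standing in for the action of a rotation
# on the m components (assumption, see the text above).
rotation, _ = np.linalg.qr(rng.normal(size=(3, 3)))


def power_spectrum(c):
    # p[n1, n2] = sum_m c[n1, m] * c[n2, m]
    return c @ c.T


# Equivariance: the coefficients themselves change under the rotation ...
rotated = coefficients @ rotation.T

# ... but the contraction over m does not.
invariant = np.allclose(power_spectrum(coefficients), power_spectrum(rotated))
print("power spectrum unchanged by the rotation:", invariant)
```

The individual coefficients differ before and after the rotation, while their products summed over $m$ stay the same, which is the distinction between *equivariant* and *invariant* made above.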

In [None]:
from featomic.torch import SoapPowerSpectrum
import metatomic.torch as mta
import metatensor.torch as mts

The SOAP power spectrum has a handful of hyper-parameters, defined below. A full description of what they do is out of scope for this tutorial, but the three main sections are explained.

In [None]:
SOAP_PARAMETERS = {
    # description of which atoms should be included in a neighborhood
    "cutoff": {
        "radius": 3.5,
        "smoothing": {
            "type": "ShiftedCosine",
            "width": 0.5
        }
    },
    # description of each atom's density to be expanded
    "density": {
        "type": "Gaussian",
        "width": 0.3
    },
    # description of the basis to use when expanding the density
    "basis": {
        "type": "TensorProduct",
        "max_angular": 6,
        "radial": {
            "type": "Gto",
            "max_radial": 5
        }
    }
}

calculator = SoapPowerSpectrum(**SOAP_PARAMETERS)
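
To give some intuition for the `cutoff` section, here is a small sketch of what a shifted-cosine smoothing function could look like with the `radius` and `width` used above. The exact functional form is an assumption made for illustration (a cosine switch over the last `width` of the cutoff sphere), and `shifted_cosine` is a hypothetical helper, not a featomic function.

```python
import math


def shifted_cosine(r, radius=3.5, width=0.5):
    """Assumed form of a shifted-cosine switching function (illustration only)."""
    if r <= radius - width:
        # well inside the cutoff: neighbors contribute fully
        return 1.0
    if r >= radius:
        # outside the cutoff: neighbors are ignored
        return 0.0
    # smooth transition from 1 to 0 over the last `width` of the cutoff
    return 0.5 * (1.0 + math.cos(math.pi * (r - radius + width) / width))


for r in (2.0, 3.0, 3.25, 3.5, 4.0):
    print(f"r = {r:4.2f}  weight = {shifted_cosine(r):.3f}")
```

The point of such a smoothing is that the contribution of a neighbor goes continuously to zero as it leaves the cutoff sphere, instead of disappearing abruptly.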

Now that we have a calculator, we can use it to compute the SOAP power spectrum for all structures in our dataset. Because we are using the TorchScript version of `featomic`, the first step is to convert the ASE-formatted structures into the `metatomic.torch.System` type, using `systems_to_torch`.

In [None]:
systems = mta.systems_to_torch(frames, dtype=torch.float64)

print(systems[0])

Finally, we can run our SOAP power spectrum descriptor calculation (shown here for the first 10 structures):

In [None]:
descriptor = calculator.compute(systems[:10], gradients=["positions"])

descriptor

## 3 - What's in a TensorMap

As we see above, the descriptor is stored in a TensorMap, which associates each key with a TensorBlock.

<center>
    <img width=300 src="img/TensorMap.png">
</center>


Each key corresponds to one block, which can be accessed with the `block` function:

In [None]:
block = descriptor.block({"center_type": 8, "neighbor_1_type": 1, "neighbor_2_type": 6})
block

As you see, each block contains metadata (samples, components, properties), the data itself in the values and the gradients of the values, here with respect to positions.

<center>
    <img width=300 src="img/TensorBlock-Components.png">
</center>

First, the actual values are accessible in `block.values`, and here they are stored in a `torch.Tensor`:

In [None]:
block.values

The samples (like all metadata) are stored in `Labels`, and here they contain the atom with ID 0 for all 10 structures. Only 10 entries, all corresponding to atom 0, are included because we are looking at the block for `center_type=8`, and there is only a single oxygen atom, at the beginning of each structure.

In [None]:
block.samples

The properties are quite a bit larger, covering all values of `l` (the index of the angular basis function) and `n_1`/`n_2` (the indices of the two correlated basis functions).

In [None]:
block.properties

The metadata in the samples, components and properties can be used to find the position of some specific data in the `values` array. For example, if we want to see the position of the (l, n1, n2) = (1, 1, 1) coefficients, we can use

In [None]:
position = block.properties.position([1, 1, 1])

print(
    "the coefficients for (l, n1, n2) = (1, 1, 1) are in the column",
    f"{position} of the block.values array (with shape {list(block.values.shape)})"
)
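
Conceptually, this kind of lookup finds the row of the labels that matches the requested entry. The following NumPy sketch mimics this on a made-up set of (`l`, `n_1`, `n_2`) labels; the ordering of the entries and the `position` helper are assumptions for illustration, and the real `Labels.position` should be used on actual metatensor data.

```python
import itertools

import numpy as np

# Hypothetical property labels: every (l, n_1, n_2) combination for
# max_angular = 6 and max_radial = 5, in an assumed l-major order.
values = np.array(list(itertools.product(range(7), range(6), range(6))))


def position(labels, entry):
    """Index of the first row of `labels` equal to `entry`, or None if absent."""
    matches = np.flatnonzero((labels == np.asarray(entry)).all(axis=1))
    return int(matches[0]) if matches.size else None


column = position(values, [1, 1, 1])
print(f"(l, n_1, n_2) = (1, 1, 1) is entry {column} out of {len(values)}")
```

The returned index can then be used to slice the corresponding column of a 2D values array, in the same way as the position returned by metatensor.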

| ![TASK](img/clipboard.png) | Find the coefficient for atom 0 in system 3, corresponding to (l, n1, n2) = (1, 2, 3) |
|----------------------------|---------------------------------------------------------------------------------------|


In [None]:
coefficient = ...

In [None]:
if abs(coefficient.item() - 0.0263) > 1e-3:
    raise Exception("wrong coefficient, check your code!")

## 4 - Condensing the data

The data returned by featomic is maximally sparse, making full use of metatensor's block-sparse format to only store non-zero coefficients. This results in a memory-efficient format, but it can be a bit harder to integrate with other libraries in the machine learning ecosystem (scikit-learn, PyTorch, …), which expect dense matrices.

Thankfully, metatensor also provides functions to make the data dense by merging blocks together, which we will explore now!

Our starting TensorMap has 18 blocks:

In [None]:
descriptor
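
The number of blocks can be reproduced by counting the possible keys. This sketch assumes that the dataset only contains the atomic types 1, 6 and 8 (H, C and O), and that only pairs with `neighbor_1_type <= neighbor_2_type` are stored.

```python
from itertools import combinations_with_replacement, product

atomic_types = [1, 6, 8]  # H, C, O

# unordered pairs of neighbor types, with neighbor_1_type <= neighbor_2_type
neighbor_pairs = list(combinations_with_replacement(atomic_types, 2))

# one key per (center_type, neighbor_1_type, neighbor_2_type) combination
keys = [
    (center, neighbor_1, neighbor_2)
    for center, (neighbor_1, neighbor_2) in product(atomic_types, neighbor_pairs)
]

print(f"{len(neighbor_pairs)} neighbor pairs, {len(keys)} keys")
```

With 3 center types and 6 neighbor pairs, this gives the 18 keys of the starting TensorMap; removing the `center_type` dimension leaves the 6 neighbor pairs.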

The first function to merge blocks and make the data dense is [keys_to_samples](https://docs.metatensor.org/latest/torch/reference/tensor.html#metatensor.torch.TensorMap.keys_to_samples). This function will take out one or more key dimensions, and merge the remaining blocks together according to the remaining dimensions.

Here we will move the `center_type` (atomic type of the central atom for the SOAP power spectrum) to the samples field, creating a new TensorMap with 6 blocks:

In [None]:
descriptor_1 = descriptor.keys_to_samples("center_type")
descriptor_1

The new TensorMap now contains the `center_type` information in the samples, and the corresponding blocks have been merged "vertically", across samples. This means all blocks now contain 100 samples (since we have 100 atoms overall in the first 10 structures).

In [None]:
descriptor_1.block_by_id(0)
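
As a rough mental model, the "vertical" merge can be compared to stacking arrays along their rows. The NumPy sketch below uses placeholder data, and assumes 6 H, 3 C and 1 O atom per structure (C3H6O) and 252 properties per block. The actual `keys_to_samples` also takes care of the metadata and of the ordering of the samples, which this sketch ignores.

```python
import numpy as np

n_structures = 10
atoms_per_structure = {1: 6, 6: 3, 8: 1}  # H, C, O atoms in one C3H6O molecule
n_properties = 252  # assumed number of properties in each block

# one placeholder block per center type, filled with the type for readability
blocks = {
    center_type: np.full((n_structures * count, n_properties), float(center_type))
    for center_type, count in atoms_per_structure.items()
}

# merging "vertically": rows (samples) are stacked, columns are unchanged
merged = np.vstack([blocks[center_type] for center_type in sorted(blocks)])

# the key dimension becomes an extra piece of sample metadata
center_type_column = np.concatenate(
    [np.full(len(blocks[center_type]), center_type) for center_type in sorted(blocks)]
)

print(merged.shape, center_type_column.shape)
```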

The other function that can merge blocks together is [keys_to_properties](https://docs.metatensor.org/latest/torch/reference/tensor.html#metatensor.torch.TensorMap.keys_to_properties), which merges blocks "horizontally", along properties.

In [None]:
descriptor_2 = descriptor_1.keys_to_properties(["neighbor_1_type", "neighbor_2_type"])
descriptor_2

In [None]:
descriptor_2.block()

As you see, `neighbor_1_type` and `neighbor_2_type` are now part of the properties, and we have a TensorMap with a single block. The `values` of this block can be used directly with other ML tools.

In [None]:
descriptor_2.block().values.shape
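
In the same spirit, the "horizontal" merge can be pictured as stacking arrays along their columns. This is only a sketch with random placeholder data, assuming 6 blocks that share the same 100 samples and hold 252 properties each; it does not handle the metadata the way `keys_to_properties` does.

```python
from itertools import combinations_with_replacement

import numpy as np

rng = np.random.default_rng(0)

# one placeholder block per (neighbor_1_type, neighbor_2_type) key
neighbor_pairs = list(combinations_with_replacement([1, 6, 8], 2))
blocks = {pair: rng.normal(size=(100, 252)) for pair in neighbor_pairs}

# merging "horizontally": rows (samples) are shared, columns are concatenated
dense = np.hstack([blocks[pair] for pair in neighbor_pairs])

print(dense.shape)
```

The result is a single dense matrix with one row per atom, which is the layout most machine learning libraries expect.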

-----------------------------------


Metatensor also provides multiple "operations" to work with data in TensorMap, in the `metatensor-operations` package. You can find the corresponding documentation here: https://docs.metatensor.org/latest/operations/reference/index.html

For example, one of these operations is [sum_over_samples](https://docs.metatensor.org/latest/operations/reference/manipulation/samples-reduction.html#metatensor.sum_over_samples), which can be used to reduce samples, for example to create structure representations by summing over the atom representations:

In [None]:
summed = mts.sum_over_samples(descriptor_2, ["atom", "center_type"])

summed.block().samples

After the summation, each sample describes a different system: each row in the block's values is the SOAP representation of one system.
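
What such a reduction does to the values can be sketched with plain NumPy: all the rows that share the same system index are added together. The sizes and data below are made up for illustration, and this is not how metatensor implements the operation.

```python
import numpy as np

n_systems, n_atoms, n_properties = 3, 10, 4

# hypothetical per-atom samples: the system index of each row
system_index = np.repeat(np.arange(n_systems), n_atoms)

# hypothetical per-atom values, all ones so the sums are easy to check
values = np.ones((n_systems * n_atoms, n_properties))

# accumulate every atom's row into the row of its system
summed = np.zeros((n_systems, n_properties))
np.add.at(summed, system_index, values)

print(summed.shape)
```

Each row of `summed` now contains the sum over the 10 atoms of one system, so there is one row per system instead of one row per atom.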

----------------

The `keys_to_properties` and `keys_to_samples` functions also have an advanced interface, where instead of giving them the names of the key dimensions to move, one can provide a `Labels` object with the desired names and values.

In [None]:
all_types = [1, 6, 7, 8]

neighbors_types = mts.Labels(
    ["neighbor_1_type", "neighbor_2_type"],
    torch.tensor([[i, j] for i in all_types for j in all_types if i <= j])
)

descriptor_3 = descriptor_1.keys_to_properties(neighbors_types)

The code snippet shown above is useful when computing descriptors for a large dataset with inconsistencies in the atom types across entries: that is, some of the input systems might not have all the atom types, and the corresponding keys would then be missing. Giving the expected values explicitly as above allows metatensor to compute descriptors while accounting for all atom types, even if they are absent in a specific system.

Here, `descriptor_2` contains 1512 properties, while `descriptor_3` contains 2520 properties. This is because `descriptor_2` only contains 1, 6, and 8 as a potential `neighbor_1_type`, but `descriptor_3` also contains 7.
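
These two numbers can be reproduced with a small NumPy sketch, in which the requested pairs that are missing from the data are filled with zeros. The data is a placeholder, and the 252 properties per block are an assumption taken from the SOAP hyper-parameters used in this tutorial.

```python
from itertools import combinations_with_replacement

import numpy as np

n_samples, n_per_block = 100, 252

# blocks actually present in the data: pairs of the types 1, 6 and 8
present = {
    pair: np.ones((n_samples, n_per_block))
    for pair in combinations_with_replacement([1, 6, 8], 2)
}

# pairs explicitly requested, including the absent type 7
requested = list(combinations_with_replacement([1, 6, 7, 8], 2))

# requested pairs that are not in the data become blocks of zeros
dense = np.hstack(
    [present.get(pair, np.zeros((n_samples, n_per_block))) for pair in requested]
)

n_zero_columns = int((dense == 0).all(axis=0).sum())
print(len(present) * n_per_block, dense.shape[1], n_zero_columns)
```

The 6 pairs present in the data give 6 × 252 = 1512 properties, while the 10 requested pairs give 10 × 252 = 2520 properties, out of which the 4 pairs involving type 7 are entirely zero.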

In [None]:
descriptor_2.block()

In [None]:
descriptor_2.block().properties.column("neighbor_1_type").unique()

In [None]:
descriptor_3.block()

In [None]:
descriptor_3.block().properties.column("neighbor_1_type").unique()

Here, although 7 is included as a possible `neighbor_1_type` (and `neighbor_2_type`), all the corresponding values are zero, reasonably so since they were missing from the initial TensorMap:

In [None]:
block = descriptor_3.block()

mask = block.properties.column("neighbor_1_type") == 7
block.values[:, mask]

## 5 - Doing some simple machine learning

Let's use what we've learned and run some simple machine learning models. We'll re-compute the descriptor for all structures in the dataset, and then transform it using [principal component analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis). PCA is a dimensionality reduction algorithm that will allow us to visualize the (reduced) SOAP representation.
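
For intuition, a 2D PCA can be sketched in a few lines of NumPy: center the features, then project them on the two directions of largest variance, obtained from a singular value decomposition. This is a minimal illustration on random data with a hypothetical `pca_2d` helper, not the scikit-learn implementation, and the sign of each component may differ from what scikit-learn returns.

```python
import numpy as np


def pca_2d(features):
    """Project `features` (n_samples x n_features) on its first two principal axes."""
    centered = features - features.mean(axis=0)
    # the rows of `vt` are the principal axes, sorted by decreasing variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


rng = np.random.default_rng(0)
features = rng.normal(size=(100, 20))  # placeholder for a per-structure descriptor

projected = pca_2d(features)
print(projected.shape)
```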

In [None]:
from sklearn.decomposition import PCA

In [None]:
systems = mta.systems_to_torch(frames, dtype=torch.float64)

| ![TASK](img/clipboard.png) | Transform the descriptor for all the systems into one with a single block that can be used with PCA |
|----------------------------|-----------------------------------------------------------------------------------------------------|


In [None]:
descriptor = calculator.compute(systems)

descriptor = ...

if not (hasattr(descriptor, "blocks") and len(descriptor) == 1):
    raise Exception("the descriptor still contains too many blocks")

# now sum the descriptor to get a per-structure representation instead of a per-atom one
descriptor = ...

if not (hasattr(descriptor, "blocks") and len(descriptor.block().values) == 100):
    raise Exception("the descriptor should be summed over the atoms before running PCA")

# finally, use sklearn to compute a 2D PCA
descriptor_pca = PCA(n_components=2).fit_transform(descriptor.block().values)

And now let's have a look at the resulting representation in 2D space:

In [None]:
chemiscope.show(frames, properties={
    "PCA": descriptor_pca,
    "energy": energies,
})

We see a pretty good correlation between the SOAP representation and the energy. In the next tutorial, we will try to use this to create a machine learning model that predicts energies from SOAP representations.