# Using Rascal to Calculate SOAP Vectors

This notebook is intended as an introduction to calculating and understanding SOAP vectors. For more information on the variable conventions, derivation, and utility of SOAP vectors, please refer to (among others): 
- [On representing chemical environments (Bartók 2013)](https://journals.aps.org/prb/abstract/10.1103/PhysRevB.87.184115)
- [Gaussian approximation potentials: A brief tutorial introduction (Bartók 2015)](https://onlinelibrary.wiley.com/doi/full/10.1002/qua.24927)
- [Comparing molecules and solids across structural and alchemical space (De 2016)](https://pubs.rsc.org/en/content/articlepdf/2016/cp/c6cp00415f)
- [Machine Learning of Atomic-Scale Properties Based on Physical Principles (Ceriotti 2018)](https://link.springer.com/content/pdf/10.1007%2F978-3-319-42913-7_68-1.pdf)

Beyond libRascal, the packages used in this tutorial are:  [json](https://docs.python.org/2/library/json.html), [numpy](https://numpy.org/), [ipywidgets](https://ipywidgets.readthedocs.io/en/latest/), [matplotlib](https://matplotlib.org/), and [ase](https://wiki.fysik.dtu.dk/ase/index.html).

In [None]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2
from IPython.display import Image
import sys
sys.path.append('./utilities')
from tutorial_utils import *
try:
    from rascal.representations import SphericalInvariants as SOAP
except ImportError:
    from rascal.representations import SOAP
readme_button()

# Disclaimer

This notebook uses Dirac notation, which may be unfamiliar to many. For a refresher on Dirac notation, check out any of these resources: [Wikipedia](https://en.wikipedia.org/wiki/Bra%E2%80%93ket_notation), [GA Tech](http://vergil.chemistry.gatech.edu/notes/intro_estruc/node5.html).


# This is going to look rill rough for a bit
In the search for novel materials through computational discovery, it is often necessary to simulate at the limits of computational space and time complexity. Atomic descriptors can make room for greater algorithmic complexity by discarding unnecessary information or compressing atomic information efficiently. Furthermore, the descriptors ascribed to an atomic environment may correlate more strongly with the properties of the material than the raw data do. For example, a common descriptor XX correlates with property XX more than the raw positional and elemental data.

When developing atomic environment descriptors, the ideal descriptor is the most compact one that loses the least information. For example, consider a descriptor that simply contains the number of atoms in the molecule. While compact (a single scalar), it loses too much information: there would be no way to distinguish between graphene ($C_{24}$) and glucose ($C_6H_{12}O_6$), which both contain 24 atoms. Ideally, a descriptor should contain enough information to both 1) correctly match an environment to any isometric transformation of itself and 2) distinguish it from all other environments. Therefore, it should be translationally, rotationally, and permutationally invariant. 
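To make the invariance requirement concrete, here is a minimal sketch of a toy descriptor (the sorted list of interatomic distances) that is unchanged by rotation, translation, and atom reordering. This is an illustration only, not SOAP; the function name `sorted_pair_distances` and the random test structure are assumptions made for this example, and chemical species are ignored.

```python
import numpy as np

def sorted_pair_distances(positions):
    """Toy descriptor: sorted list of all interatomic distances (species ignored)."""
    positions = np.asarray(positions, dtype=float)
    diffs = positions[:, None, :] - positions[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    upper = np.triu_indices(len(positions), k=1)  # each pair counted once
    return np.sort(dists[upper])

rng = np.random.default_rng(0)
pos = rng.normal(size=(5, 3))  # five atoms at random positions

# Build an isometric copy: orthogonal transform (from QR), translation, reordering
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
moved = (pos @ q.T + np.array([10.0, -3.0, 2.5]))[rng.permutation(5)]

same = np.allclose(sorted_pair_distances(pos), sorted_pair_distances(moved))
print(same)
```

Such a descriptor satisfies requirement 1), but it is not guaranteed to satisfy requirement 2), which is part of the motivation for richer representations.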

SOAP, or Smooth Overlap of Atomic Positions, is an efficient approach for representing chemical environments with minimal information loss. 

# A First Pass at Atomic Representations: Density Functions

Suppose you have some structure $\mathcal{A}$.

<!-- <center>
{{BCN_struct}}
</center>
-->

A first pass at representing the structure is through normalized density functions centered on each atom $i$. 

\begin{equation}
g(\mathbf{r}-\mathbf{r_i}) = \sum_{j\in N(i)} e^{-\frac{|\mathbf{r_{j}}-\mathbf{r}|^2}{\sigma^2}}
\end{equation}

where $N(i)$ is the set of neighbors of $i$ (defined for some cutoff) and $e^{-|\mathbf{r_{j}}-\mathbf{r}|^2/\sigma^2}$ is a Gaussian of width $\sigma$ centered at atom $j$. From here on we omit the $\in N(i)$ from our sums; it should be assumed that $j$ runs over the neighbors of $i$.
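As a rough illustration of this definition, the sketch below evaluates the neighbor density in one dimension, on a grid of offsets measured from the central atom. The function name `neighbor_density`, the toy atom positions, and the choice to exclude the central atom from its own neighbor list are assumptions for this example; the values of `sigma` and `cutoff` simply mirror the hyperparameters used later in the notebook.

```python
import numpy as np

def neighbor_density(x, center, atoms, sigma=0.3, cutoff=3.5):
    """Sum of Gaussians over the neighbors of `center`, evaluated at offsets `x`
    measured from the center (1D toy version)."""
    offsets = np.asarray(atoms, dtype=float) - center  # r_j - r_i
    # Assumption: neighbors lie within the cutoff and exclude the central atom itself
    mask = (np.abs(offsets) <= cutoff) & (np.abs(offsets) > 1e-12)
    offsets = offsets[mask]
    x = np.asarray(x, dtype=float)
    return np.exp(-(x[:, None] - offsets[None, :]) ** 2 / sigma**2).sum(axis=1)

atoms = np.array([0.0, 1.1, 2.3, 6.0])  # the atom at 6.0 lies beyond the cutoff
x = np.linspace(-4.0, 4.0, 801)
rho = neighbor_density(x, center=atoms[0], atoms=atoms)

# Shifting every atom by the same amount leaves the centered density unchanged
rho_shifted = neighbor_density(x, center=atoms[0] + 1000.0, atoms=atoms + 1000.0)
print(np.allclose(rho, rho_shifted))
```

Because the density is expressed relative to the central atom, a rigid shift of the whole structure does not change it.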

In this case $\mathcal{A}$ is represented as a sum over these Gaussians:


\begin{equation}
\left\langle \mathbf{r} \big| \mathcal{A}\right\rangle = \sum_{i\in \mathcal{A}} g(\mathbf{r}-\mathbf{r_i}) \left| \alpha_i\right\rangle
\end{equation}

where $\alpha_i$ is the species identifier of the atom.


<!---

<center>{{BCN_braket}}</center>

Therefore, we can say 
We will denote the environment centered at $i$ and containing all neighbors of $i$ as $X_i$.

Here, centering $g$ with respect to atom $i$ satisfies the translational invariance. If $i$ is found at x=0 or x=1000, the relative distances of its neighbors remain the same. 

#### Let's look at the 1D case first:


To compare the density functions of two atoms, one could simply take:

\begin{equation}
g^{(i)}(\mathbf{r}-\mathbf{r_i}) \cdot g^{(j)}(\mathbf{r}-\mathbf{r_j})
\end{equation}
                                                                                                                                                                                                                                                                    


From here on we will use Dirac notation, where the descriptor of an environment $X$ centered at atom $i$ ($X_i$) is given by:

\begin{equation}
\big\langle \mathbf{r} \big| X_i\big\rangle = g^{(i)}(\mathbf{r}-\mathbf{r_i})
\end{equation}


We can sum over all density functions to give a descriptor of a structure:

\begin{equation*}
\big\langle \mathbf{r} \big| \mathcal{A}\big\rangle = \sum_i g^{(i)}\left(\mathbf{r}-\mathbf{r_i}\right) \big| \alpha_i\big\rangle
\end{equation*}

where $\alpha_i$ denotes the chemical composition of the atom.


This is where I'll explain why \nu=1 does not suffice
--->


### Ex. how much information does this first pass contain? 

Suppose $\mathcal{A}$ is [1-propanol](https://en.wikipedia.org/wiki/1-Propanol).

Our molecular density function becomes:

\begin{split}\begin{equation}
\left\langle \mathbf{r} \big| \mathcal{A}\right\rangle = 
\left(g_{C_1} + g_{C_2} + g_{C_3}\right)\left|C\right\rangle + 
\left(g_{O}\right)\left|O\right\rangle + 
\left(g_{H_1} + g_{H_2} + g_{H_3} + g_{H_4} + g_{H_5} + g_{H_6} + g_{H_7} + g_{H_8}\right)\left|H\right\rangle
\end{equation}\end{split}

When we integrate our molecular density function over all translations to satisfy translational invariance, we have:

\begin{split}\begin{equation}
\big\langle \mathbf{r} \big| \mathcal{A}\big\rangle_{\hat{t}}
= \int d\hat{t} \langle\mathbf{r} \left|\hat{t} \right|\mathcal{A}\rangle\\
= \sum_i \left|\alpha_i\right\rangle \int d\mathbf{t} \hspace{0.25cm} g\left(\mathbf{t}+\mathbf{r}-\mathbf{r}_i\right)\\
= 3 \left| C \right\rangle + 1 \left| O \right\rangle +8 \left| H \right\rangle
\end{equation}\end{split}

From here we can see that using a simple Gaussian density to describe a structure is not sufficient: the same descriptor for 1-propanol also applies to its two isomers, [2-propanol](https://en.wikipedia.org/wiki/Isopropyl_alcohol) and [methoxyethane](https://en.wikipedia.org/wiki/Methoxyethane), the latter of which has drastically different properties.

<center> 1-Propanol </center>| <center> 2-Propanol </center> | <center> Methoxyethane </center>
- | - | - 
![1-Propanol](./images/1-propanol-stick.png) | ![2-Propanol](./images/2-propanol-stick.png) | ![Methoxyethane](./images/methoxyethane-stick.png)
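The degeneracy can be checked with a few lines of plain Python. This sketch assumes only the molecular formula $C_3H_8O$ shared by the three isomers; the dictionary `molecules` and the other variable names are hypothetical, and connectivity is deliberately ignored because the translation-integrated descriptor retains only the species counts.

```python
from collections import Counter

# Element lists for the three C3H8O isomers (connectivity deliberately ignored)
molecules = {
    "1-propanol": ["C"] * 3 + ["O"] + ["H"] * 8,
    "2-propanol": ["C"] * 3 + ["O"] + ["H"] * 8,
    "methoxyethane": ["C"] * 3 + ["O"] + ["H"] * 8,
}

# The translation-integrated nu=1 descriptor reduces to one count per species
descriptors = {name: Counter(symbols) for name, symbols in molecules.items()}
for name, counts in descriptors.items():
    print(name, dict(counts))

# All three isomers map onto the same descriptor
all_equal = len({tuple(sorted(c.items())) for c in descriptors.values()}) == 1
print(all_equal)
```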


In [None]:
hps = dict(
          body_order=1,
          interaction_cutoff=3.5,
          max_radial=6,
          max_angular=6,
          gaussian_sigma_constant=0.3,
          )
soap1=SOAP_tutorial(input_file='./data/molecules/1-Propanol.xyz', hyperparameters=hps)

In [None]:
soap1.show_molecule()

In [None]:
soap1.show_all()

In [None]:
soap1.format_positions()

In [None]:
hps = dict(
          body_order=1,
          interaction_cutoff=3.5,
          max_radial=6,
          max_angular=6,
          gaussian_sigma_constant=0.3,
          )
# soap1=SOAP_tutorial(input_file='./data/molecules/1-Propanol.xyz', hyperparameters=hps)
soap2=SOAP_tutorial(input_file='./data/molecules/2-Propanol.xyz', hyperparameters=hps)
soap3=SOAP_tutorial(input_file='./data/molecules/Methoxyethane.xyz', hyperparameters=hps)
# x=soap1.get_soap_vectors()
y=soap2.get_soap_vectors()
z=soap3.get_soap_vectors()
# print(np.dot(x/np.linalg.norm(x),y/np.linalg.norm(y)))
# print(np.dot(x/np.linalg.norm(x),z/np.linalg.norm(z)))
# print(np.dot(y/np.linalg.norm(y),z/np.linalg.norm(z)))

In [None]:
hps = dict(
          body_order=3,
          interaction_cutoff=3.5,
          max_radial=6,
          max_angular=6,
          gaussian_sigma_constant=0.3,
          )
soap1=SOAP_tutorial(input_file='./data/molecules/1-Propanol.xyz', hyperparameters=hps)
soap2=SOAP_tutorial(input_file='./data/molecules/2-Propanol.xyz', hyperparameters=hps)
soap3=SOAP_tutorial(input_file='./data/molecules/Methoxyethane.xyz', hyperparameters=hps)
x=soap1.get_soap_vectors()
y=soap2.get_soap_vectors()
z=soap3.get_soap_vectors()
print(np.dot(x/np.linalg.norm(x),y/np.linalg.norm(y)))
print(np.dot(x/np.linalg.norm(x),z/np.linalg.norm(z)))
print(np.dot(y/np.linalg.norm(y),z/np.linalg.norm(z)))

# Increasing the Information Density with the Power Spectrum
We can take the [Haar Integral](http://www.gbv.de/dms/hebis-darmstadt/toc/34627456.pdf) over tensor products of our $\big| \mathcal{A}\big\rangle$, giving:

\begin{equation}
\left|\mathcal{A}^{(\nu)}\right\rangle_{\hat{t}} = \int d\hat{t} \, \hat{t} \left| \mathcal{A}\right\rangle \otimes \hat{t} \left| \mathcal{A}\right\rangle \otimes \cdots \otimes \hat{t} \left| \mathcal{A}\right\rangle
\end{equation}

When the descriptor is raised to the $\nu^{th}$ tensor power, it contains information on correlations of order $(\nu +1)$. For $\nu=2$, this evaluates as a power spectrum:

\begin{split}
\begin{equation}
\left\langle \mathbf{r}\mathbf{r'} | \mathcal{A}^{(2)}\right\rangle_{\hat{t}} = \int d\hat{t} \sum_{ij} \hspace{0.25cm} g\left(\hat{t}\mathbf{r}-\mathbf{r}_i\right) g\left(\hat{t}\mathbf{r'}-\mathbf{r}_j\right)\left|\alpha_i\alpha_j\right\rangle\\
= \sum_{ij} \left|\alpha_i\alpha_j\right\rangle\int d\mathbf{t} \hspace{0.25cm} g\left(\mathbf{t}+\mathbf{r}-\mathbf{r}_i\right) g\left(\mathbf{t}+\mathbf{r'}-\mathbf{r}_j\right)\\
= \sum_{ij} \left|\alpha_i\alpha_j\right\rangle \left(g *g\right)\left(\mathbf{r'}-\mathbf{r} - \mathbf{r}_{ij}\right)\\
\end{equation}
\end{split}

where $(g*g)$ is the standard [convolution](https://en.wikipedia.org/wiki/Convolution) of $g$ with itself and $\mathbf{r}_{ij} = \mathbf{r}_j - \mathbf{r}_i$.
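For the Gaussian $g(x) = e^{-x^2/\sigma^2}$ in one dimension, the self-convolution has the closed form $(g*g)(x) = \sqrt{\pi\sigma^2/2}\; e^{-x^2/(2\sigma^2)}$, that is, another Gaussian that is broader by a factor of $\sqrt{2}$. The sketch below checks this numerically; the grid and the value of `sigma` are arbitrary choices for this example.

```python
import numpy as np

sigma = 0.3
x = np.linspace(-4.0, 4.0, 1601)  # symmetric grid with an odd number of points
dx = x[1] - x[0]
g = np.exp(-x**2 / sigma**2)

# Riemann-sum approximation of (g*g)(x) = integral of g(u) g(x-u) du
h_numeric = np.convolve(g, g, mode="same") * dx

# Closed form: a Gaussian broadened by sqrt(2)
h_closed = np.sqrt(np.pi * sigma**2 / 2) * np.exp(-x**2 / (2 * sigma**2))

max_err = np.max(np.abs(h_numeric - h_closed))
print(max_err)
```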

We will shorten this notation by writing $\Delta\mathbf{r} = \mathbf{r'}-\mathbf{r}$ and $h\left(\Delta\mathbf{r}-\mathbf{r}_{ij}\right) = \left(g *g\right)\left(\mathbf{r'}-\mathbf{r} - \mathbf{r}_{ij}\right)$, which gives:

\begin{split}
\begin{equation}
\left\langle \Delta\mathbf{r} | \mathcal{A}^{(2)}\right\rangle_{\hat{t}} =  \sum_{j} \left|\alpha_j\right\rangle \sum_i h\left(\Delta\mathbf{r}-\mathbf{r}_{ij}\right) \left| \alpha_i \right\rangle\\
\end{equation}
\end{split}
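A minimal one-dimensional sketch of this $\nu=2$ descriptor is given below, under simplifying assumptions: a single species (so the kets are dropped), the constant prefactor of $h$ omitted, and hypothetical names (`pair_descriptor`, `grid`, and the toy structures `a` and `c`). It illustrates that the descriptor depends only on the relative positions $\mathbf{r}_{ij}$, so it is unchanged by a rigid translation yet still separates structures with different interatomic spacings.

```python
import numpy as np

def pair_descriptor(positions, grid, sigma=0.3):
    """Sum over all ordered pairs (i, j) of h(dr - r_ij), where h is the
    self-convolution of the Gaussian g (constant prefactor dropped)."""
    positions = np.asarray(positions, dtype=float)
    r_ij = (positions[None, :] - positions[:, None]).ravel()  # r_j - r_i
    return np.exp(-(grid[:, None] - r_ij[None, :]) ** 2 / (2 * sigma**2)).sum(axis=1)

grid = np.linspace(-5.0, 5.0, 501)
a = np.array([0.0, 1.0, 2.5])
c = np.array([0.0, 1.0, 3.0])  # same number of atoms, different spacing

d_a = pair_descriptor(a, grid)
d_a_shifted = pair_descriptor(a + 7.3, grid)  # rigid translation of a
d_c = pair_descriptor(c, grid)

print(np.allclose(d_a, d_a_shifted))  # translation leaves the descriptor unchanged
print(np.allclose(d_a, d_c))          # different structures give different descriptors
```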

<!---
Here, we have two atomic environments centered at atoms $i$ and $j$. Their density functions are:

\begin{equation}
\big\langle r \big| X_i\big\rangle = \sum_{i'\in X_i} g(\mathbf{r}-\mathbf{r_{ii'}}) \hspace{0.5in}
\big\langle r \big| X_j\big\rangle = \sum_{j'\in X_j} g(\mathbf{r}-\mathbf{r_{jj'}})
\end{equation}

which, in this case, simplifies to:

\begin{equation}
g^{(i)}(\mathbf{r}-\mathbf{r_i}) =  e^{-\frac{|\mathbf{r_{i'}}-\mathbf{r}|^2}{\sigma^2}} 
\end{equation}
--->

# Paradigm Shift: Atomic _Environments_
Given all of these definitions, it is best to define our descriptors for the atomic $\textit{environment}$ $\mathcal{X}$. This has a key benefit: all distances included in the descriptor are relative, so the descriptors are inherently translationally invariant.

For the environment centered at atom $j$:

\begin{equation}
\left\langle\mathbf{r} | \mathcal{X}_j\right\rangle = \sum_i f_c(r_{ij})h\left(\mathbf{r}-\mathbf{r}_{ij}\right) \left | \alpha_i\right\rangle
\end{equation}

where $f_c(r_{ij})$ is some smooth cut-off function. This can also be written as a sum over the contributions of each species:

\begin{split}
\begin{equation}
\left\langle\mathbf{r} | \mathcal{X}_j\right\rangle 
= \sum_\alpha \left(\sum_{i \in \alpha} f_c(r_{ij})h\left(\mathbf{r}-\mathbf{r}_{ij}\right) \right) \left | \alpha\right\rangle
= \sum_\alpha \psi^\alpha_{\mathcal{X}_j}(\mathbf{r}) \left| \alpha \right\rangle
\end{equation}
\end{split}

where $\psi^\alpha_{\mathcal{X}_j}(\mathbf{r}) \equiv \sum_{i \in \alpha} f_c(r_{ij})h\left(\mathbf{r}-\mathbf{r}_{ij}\right)$. With translational invariance already built in, we must integrate over rotational symmetries:

\begin{equation}
\left\langle\alpha r | \mathcal{X}^{(1)}\right\rangle_{\hat{R}} \propto r \int d\hat{R} \, \psi_{\mathcal{X}}^\alpha(r\hat{R}\hat{e}_z)
\end{equation}

and, just as with the structure-based descriptors, we can compute higher-order versions:

\begin{equation}
\left\langle\alpha r \alpha' r' \omega| \mathcal{X}^{(2)}\right\rangle_{\hat{R}}
\propto rr' \int d\hat{R} \, \psi_{\mathcal{X}}^{\alpha'}(r'\hat{R}\hat{e}_z)
\times \psi_{\mathcal{X}}^\alpha \left(r\hat{R}\left(\omega \hat{e}_z + \sqrt{1-\omega^2}\hat{e}_x\right)\right)
\end{equation}
 
where $\omega$ is the cosine of the fixed angle between $\mathbf{r}$ and $\mathbf{r'}$.
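The smooth cut-off $f_c(r_{ij})$ that appears in the environment density is left unspecified here. One common choice is a shifted cosine that equals 1 at short range and decays smoothly to 0 at the cut-off radius. The sketch below is an assumption made for illustration (the function name `cosine_cutoff` and the `smooth_width` parameter are hypothetical), not necessarily the form used by the library.

```python
import numpy as np

def cosine_cutoff(r, r_cut=3.5, smooth_width=0.5):
    """Shifted-cosine cut-off: 1 below r_cut - smooth_width, 0 beyond r_cut,
    and a half-cosine ramp in between."""
    r = np.asarray(r, dtype=float)
    start = r_cut - smooth_width
    # Ramp goes from 1 at `start` to 0 at `r_cut` with zero slope at both ends
    ramp = 0.5 * (1.0 + np.cos(np.pi * (r - start) / smooth_width))
    return np.where(r < start, 1.0, np.where(r < r_cut, ramp, 0.0))

distances = np.array([0.0, 2.0, 3.0, 3.25, 3.5, 10.0])
weights = cosine_cutoff(distances)
print(weights)
```

Neighbors that approach the cut-off radius are switched off gradually, so the descriptor changes continuously as atoms enter or leave the environment.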

# Finally, we get to SOAP vectors!
So far, we have seen how to represent the neighborhood of an atom as a set of Gaussian density functions. We have also seen how a structure can be represented as a sum of these atomic representations. While defining our descriptors over environments handles translational invariance, we still have to integrate over rotations to create a descriptor that is invariant under all the required symmetries while retaining the necessary information.

To make the representation more compact, we can redefine our descriptors over [radial basis functions](https://en.wikipedia.org/wiki/Radial_basis_function) $R_n(r)$ and [spherical harmonics](http://mathworld.wolfram.com/SphericalHarmonic.html) $Y^l_m(\hat{\mathbf{r}})$:

\begin{equation}
\langle \alpha n l m | \mathcal{X}_j \rangle = \int d\mathbf{r} \, R_n(r) Y^l_m(\hat{\mathbf{r}}) \langle\alpha \mathbf{r} | \mathcal{X}_j \rangle
\end{equation}

There are many benefits to this definition (\red{CITE SOME THINGS HERE}); one of the most important is that the integration over the rotation group can be carried out explicitly. 

Again, we have multiple orders $\nu$ of these representations. (Citations) contain up to $\nu=3$; however, we will limit ourselves to $\nu=2$ in this discussion.

\begin{split}
\begin{equation}
\nu = 1: \quad \langle\alpha n | \mathcal{X}^{(1)}_j\rangle_{\hat{R}} \propto \langle \alpha n 0 0 | \mathcal{X}_j\rangle\\
\nu = 2: \quad \langle\alpha n \alpha' n' l| \mathcal{X}^{(2)}_j\rangle_{\hat{R}} \propto 
\frac{1}{\sqrt{2l+1}}\sum_{m=-l}^{l} \langle \alpha n l m | \mathcal{X}_j\rangle^*\langle \alpha' n' l m | \mathcal{X}_j\rangle
\end{equation}
\end{split}
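The $\nu=2$ contraction over $m$ can be sketched with NumPy as follows, for a single species and with random complex numbers standing in for the expansion coefficients $\langle n l m | \mathcal{X}_j\rangle$. The array layout `c[n, l, m + l_max]` and the function name `power_spectrum` are assumptions made for this example. As a partial invariance check, a rotation by $\varphi$ about the $z$ axis multiplies each coefficient by a phase $e^{-im\varphi}$, which cancels in the product. A general rotation mixes the $m$ components through a unitary Wigner D-matrix, which likewise preserves the sum over $m$, but that case is not tested here.

```python
import numpy as np

def power_spectrum(c):
    """Contract coefficients c[n, l, m + l_max] over m:
    p[n, n', l] = sum_m conj(c[n, l, m]) * c[n', l, m] / sqrt(2l + 1)."""
    n_max, n_l, _ = c.shape
    l_max = n_l - 1
    p = np.zeros((n_max, n_max, n_l), dtype=complex)
    for l in range(n_l):
        ms = slice(l_max - l, l_max + l + 1)  # keep only entries with |m| <= l
        p[:, :, l] = np.conj(c[:, l, ms]) @ c[:, l, ms].T / np.sqrt(2 * l + 1)
    return p

rng = np.random.default_rng(1)
n_max, l_max = 3, 2
shape = (n_max, l_max + 1, 2 * l_max + 1)
c = rng.normal(size=shape) + 1j * rng.normal(size=shape)  # stand-in coefficients

# Rotation by phi about z: each coefficient picks up a phase exp(-i m phi)
phi = 0.7
m = np.arange(-l_max, l_max + 1)
c_rot = c * np.exp(-1j * m * phi)[None, None, :]

p = power_spectrum(c)
p_rot = power_spectrum(c_rot)
print(np.allclose(p, p_rot))
```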

### Here is what the 1st order SOAP vectors would look like for methane and water:

In [None]:
import ase.io

# Read every frame from the file (":" selects all frames)
frame = ase.io.read("data/Propanol.xyz",":")
print(frame)

In [None]:
# have a few well-known molecules available 
# on left, molecule as rendered
# on right molecule as SOAP vectors

From this we can also get the overlap kernel:

\begin{equation*}
K^{(\nu)}\left(\mathcal{X}_j, \mathcal{X}_k\right) = \int d\hat{R} \left|\left\langle \mathcal{X}_j \big| \hat{R} \big| \mathcal{X}_k\right\rangle\right|^\nu = \left\langle \mathcal{X}_j^{(\nu)} \big| \mathcal{X}_k^{(\nu)}\right\rangle
\end{equation*}
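Since the right-hand side is an inner product of two invariant feature vectors, the kernel can be sketched as a normalized dot product, much like the comparisons printed earlier in this notebook. The function name `normalized_kernel` and the toy vectors below are hypothetical; in practice the inputs would be the SOAP vectors of two environments or structures.

```python
import numpy as np

def normalized_kernel(p_j, p_k):
    """Inner product of two real feature vectors, normalized so that K(X, X) = 1."""
    p_j = np.ravel(np.asarray(p_j, dtype=float))
    p_k = np.ravel(np.asarray(p_k, dtype=float))
    return float(np.dot(p_j, p_k) / (np.linalg.norm(p_j) * np.linalg.norm(p_k)))

# Toy feature vectors standing in for SOAP vectors
u = np.array([1.0, 2.0, 0.5])
v = np.array([2.0, -1.0, 0.0])  # orthogonal to u

k_uu = normalized_kernel(u, u)
k_scaled = normalized_kernel(u, 3.0 * u)  # overall scale does not matter
k_uv = normalized_kernel(u, v)
print(k_uu, k_scaled, k_uv)
```

Normalizing in this way bounds the kernel between -1 and 1 and makes it insensitive to the overall scale of the feature vectors.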

In [None]:
# have a few well-known molecule pairings available 
# compare SOAP vectors

In [None]:
# have a few well-known molecule pairings available 
# compare SOAP kernels

In [None]:
#Code breakdown for earlier examples

In [None]:
# SOAP is a part of a more general family  of density-based descriptors, 
# please feel free to check out these publications for the full range
# of weird and wonderful.

In [None]:
# Link to next notebook
# THE SAGA CONTINUES
# SOAP 2: Electronic Boogaloo

In [None]:
import numpy as np
from matplotlib import pyplot

# Two 1D toy environments, each given by the positions of its two neighbors
rjs = [[-1.0, 0.3], [-0.35, 0.95]]
sig = 0.2  # Gaussian width
srange = np.linspace(-2, 2, 100)

# Gaussian density of each environment on the grid, normalized to unit sum
ps = np.array([[sum(np.exp(-(np.abs(rj - r) ** 2) / sig**2) for rj in rset) for r in srange] for rset in rjs])
ps = np.array([p / p.sum() for p in ps])

for p in ps:
    pyplot.plot(srange, p)

# Mirror image (x -> -x) of the second environment
pyplot.plot(srange, ps[-1][::-1])
pyplot.show()

# Overlap of environment 0 with the mirrored and with the original environment 1
print(np.dot(ps[0], ps[1][::-1]), np.dot(ps[0], ps[1]))