# Using Rascal to Calculate SOAP Vectors

This notebook is intended as an introduction to calculating and understanding SOAP vectors. For more information on the variable conventions, derivation, and utility of SOAP vectors, please refer to (among others):
- [On representing chemical environments (Bartók 2013)](https://journals.aps.org/prb/abstract/10.1103/PhysRevB.87.184115)
- [Gaussian approximation potentials: A brief tutorial introduction (Bartók 2015)](https://onlinelibrary.wiley.com/doi/full/10.1002/qua.24927)
- [Comparing molecules and solids across structural and alchemical space (De 2016)](https://pubs.rsc.org/en/content/articlepdf/2016/cp/c6cp00415f)
- [Machine Learning of Atomic-Scale Properties Based on Physical Principles (Ceriotti 2018)](https://link.springer.com/content/pdf/10.1007%2F978-3-319-42913-7_68-1.pdf)

Beyond libRascal, the packages used in this tutorial are:  [json](https://docs.python.org/2/library/json.html), [numpy](https://numpy.org/), [ipywidgets](https://ipywidgets.readthedocs.io/en/latest/), [matplotlib](https://matplotlib.org/), and [ase](https://wiki.fysik.dtu.dk/ase/index.html).

In [None]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2
from IPython.display import Image
import sys
sys.path.append('./utilities')
from tutorial_utils import *
try:
    from rascal.representations import SphericalInvariants as SOAP
except ImportError:
    from rascal.representations import SOAP
readme_button()

# Disclaimer

This notebook uses Dirac notation, which may be unfamiliar to many. For a refresher on Dirac notation, check out any of these resources: [Wikipedia](https://en.wikipedia.org/wiki/Bra%E2%80%93ket_notation), [GA Tech](http://vergil.chemistry.gatech.edu/notes/intro_estruc/node5.html).


# This is going to look really rough for a bit
In the search for novel materials through computational discovery, it is often necessary to simulate at the limits of computational space and time complexity. Atomic descriptors can make room for greater algorithmic complexity by eliminating unnecessary information or compressing atomic information efficiently. Furthermore, the descriptors ascribed to an atomic environment may correlate more strongly with the properties of a material than the raw data does. For example, a common descriptor XX correlates with property XX more than the raw positional and elemental data.

When developing atomic environment descriptors, the ideal descriptor is the one that is most compact while losing the least information. For example, consider a descriptor that simply contains the number of atoms in the molecule. While compact (a scalar), it loses too much information: there would be no way to distinguish between graphene ($C_{24}$) and glucose ($C_6H_{12}O_6$), which both contain 24 atoms. Ideally, a descriptor should contain enough information to both 1) correctly match an environment to any isometric transformation of itself and 2) distinguish it from all other environments. Therefore, it should be invariant to translations, rotations, and permutations of like atoms.
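
The atom-count example can be made concrete with a few lines of Python. This is only an illustrative sketch: `atom_count` and the composition dictionaries are hypothetical names introduced here, not part of libRascal.

```python
# Hypothetical scalar descriptor: the total number of atoms in a structure.
def atom_count(composition):
    """Return the total atom count for an {element: count} mapping."""
    return sum(composition.values())

graphene_c24 = {"C": 24}              # the C24 example from the text
glucose = {"C": 6, "H": 12, "O": 6}   # C6H12O6

# Both structures map to the same scalar, so this descriptor cannot separate them.
print(atom_count(graphene_c24), atom_count(glucose))
```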

## An Introduction to SOAP Vectors

SOAP, or Smooth Overlap of Atomic Positions, is an efficient approach for representing chemical environments with minimal information loss. 

Suppose you have some structure $A$.

<!-- <center>
{{BCN_struct}}
</center>
-->

A first pass at representing the structure is through normalized density functions centered on each atom $i$. 

\begin{equation}
g(\mathbf{r}-\mathbf{r_i}) = \sum_{j\in N(i)} e^{-\frac{|\mathbf{r_{j}}-\mathbf{r}|^2}{\sigma^2}}
\end{equation}

where $N(i)$ is the set of neighbors of $i$ (defined for some cutoff) and $e^{-|\mathbf{r_{j}}-\mathbf{r}|^2/\sigma^2}$ is a Gaussian of width $\sigma$ centered at atom $j$. We will omit the $\in N(i)$ from our sums from here on; it should be assumed that $j$ represents a neighbor of $i$.
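
A minimal NumPy sketch of this density follows. It implements the sum as written above, without a normalization prefactor. The function name `neighbor_density`, the default `sigma` and `cutoff`, and the toy coordinates are assumptions made for illustration. The final check shows that shifting the whole structure together with the evaluation point leaves the value unchanged.

```python
import numpy as np

def neighbor_density(r, center, positions, sigma=0.5, cutoff=3.0):
    """Sum of Gaussians on the neighbors of `center`, evaluated at point `r`."""
    positions = np.asarray(positions, dtype=float)
    center = np.asarray(center, dtype=float)
    r = np.asarray(r, dtype=float)
    dist_to_center = np.linalg.norm(positions - center, axis=1)
    # Neighbors: atoms within the cutoff, excluding the central atom itself
    mask = (dist_to_center > 1e-12) & (dist_to_center <= cutoff)
    sq_dist = np.sum((positions[mask] - r) ** 2, axis=1)
    return np.sum(np.exp(-sq_dist / sigma ** 2))

# Toy structure (illustrative coordinates); the last atom lies outside the cutoff
atoms = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.2, 0.0], [5.0, 5.0, 5.0]])
point = np.array([0.5, 0.5, 0.0])
shift = np.array([10.0, -3.0, 2.0])

g0 = neighbor_density(point, atoms[0], atoms)
g1 = neighbor_density(point + shift, atoms[0] + shift, atoms + shift)
print(np.isclose(g0, g1))
```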

$A$ is represented as a sum over these gaussians:


\begin{equation}
\left\langle \mathbf{r} \big| A\right\rangle = \sum_{i\in A} g(\mathbf{r}-\mathbf{r_i}) \big| \alpha_i\big\rangle
\end{equation}

where $\alpha_i$ is the species identifier of the atom.
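
One way to picture the species label $\big|\alpha_i\big\rangle$ is as a key in a dictionary: each atom's density is added to the entry for its species. The sketch below assumes this dictionary representation. `structure_density`, its defaults, and the rough water-like geometry are all hypothetical choices for illustration. Because the contributions are summed per species, reordering the atoms does not change the result.

```python
import numpy as np

def structure_density(r, positions, species, sigma=0.5, cutoff=3.0):
    """Evaluate <r|A> at point `r` as a {species: value} dictionary."""
    positions = np.asarray(positions, dtype=float)
    r = np.asarray(r, dtype=float)
    result = {}
    for i, alpha in enumerate(species):
        dist = np.linalg.norm(positions - positions[i], axis=1)
        # Neighbors of atom i: within the cutoff, excluding atom i itself
        neighbors = positions[(dist > 1e-12) & (dist <= cutoff)]
        g_i = np.sum(np.exp(-np.sum((neighbors - r) ** 2, axis=1) / sigma ** 2))
        # Accumulate the density of atom i under its species label
        result[alpha] = result.get(alpha, 0.0) + g_i
    return result

# Rough water-like geometry, for illustration only
positions = np.array([[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]])
species = ["O", "H", "H"]
point = np.array([0.3, 0.2, 0.1])

original = structure_density(point, positions, species)
order = [2, 0, 1]  # relabel the atoms
permuted = structure_density(point, positions[order], [species[k] for k in order])
print(all(np.isclose(original[s], permuted[s]) for s in original))
```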


<!---

<center>{{BCN_braket}}</center>

Therefore, we can say 
We will denote the environment centered at $i$ and containing all neighbors of $i$ as $X_i$.

Here, centering $g$ with respect to atom $i$ satisfies the translational invariance. If $i$ is found at x=0 or x=1000, the relative distances of its neighbors remain the same. 

#### Let's look at the 1D case first:


To compare the density functions of two atoms, one could simply take:

\begin{equation}
g^{(i)}(\mathbf{r}-\mathbf{r_i}) \cdot g^{(j)}(\mathbf{r}-\mathbf{r_j})
\end{equation}



From here on we will use Dirac notation, where the descriptor of an environment $X$ centered at atom $i$ ($X_i$) is given by:

\begin{equation}
\big\langle \mathbf{r} \big| X_i\big\rangle = g^{(i)}(\mathbf{r}-\mathbf{r_i})
\end{equation}


We can sum over all density functions to give a descriptor of a structure:

\begin{equation*}
\big\langle \mathbf{r} \big| A\big\rangle = \sum_i g^{(i)}\left(\mathbf{r}-\mathbf{r_i}\right) \big| \alpha_i\big\rangle
\end{equation*}

where $\alpha_i$ denotes the chemical composition of the atom.


This is where I'll explain why \nu=1 does not suffice
--->


### Example: how does this fare with our requirements of invariance?

Suppose $A$ is 1-propanol.

Our molecular density function becomes:

\begin{equation}\begin{split}
\big\langle \mathbf{r} \big| A\big\rangle = 
\left(g_{C_1} + g_{C_2} + g_{C_3}\right)\big|C\big\rangle + 
\left(g_{O}\right)\big|O\big\rangle + 
\left(g_{H_1} + g_{H_2} + g_{H_3} + g_{H_4} + g_{H_5} + g_{H_6} + g_{H_7} + g_{H_8}\right)\big|H\big\rangle
\end{split}\end{equation}

When we integrate our molecular density function over all translations, each normalized density integrates to one, leaving only the number of atoms of each species:

\begin{equation}\begin{split}
\int d\hat{t}\, \big\langle \mathbf{r} \big| \hat{t} \big| A\big\rangle = 
3 \big| C \big\rangle + 
1 \big| O \big\rangle +
8 \big| H \big\rangle
\end{split}\end{equation}

From here we can see that using a simple Gaussian density to describe atomic environments is not sufficient: the descriptor for 1-propanol is identical to that of two of its isomers, [2-propanol](https://en.wikipedia.org/wiki/Isopropyl_alcohol) and [methoxyethane](https://en.wikipedia.org/wiki/Methoxyethane), the latter of which has drastically different properties.

<center> 1-Propanol </center>| <center> 2-Propanol </center> | <center> Methoxyethane </center>
- | - | - 
![1-Propanol](./images/1-propanol-stick.png) | ![2-Propanol](./images/2-propanol-stick.png) | ![Methoxyethane](./images/methoxyethane-stick.png)
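
The collision between the three isomers can be checked directly by counting species. In this sketch each molecule is written as a hand-typed list of element symbols following its skeleton. The `molecules` dictionary is an illustrative assumption, not data loaded from the tutorial files.

```python
from collections import Counter

# Element symbols listed along each skeleton (hand-written for illustration)
molecules = {
    "1-propanol":    "C H H H  C H H  C H H  O H".split(),  # CH3-CH2-CH2-OH
    "2-propanol":    "C H H H  C H  O H  C H H H".split(),  # CH3-CH(OH)-CH3
    "methoxyethane": "C H H H  O  C H H  C H H H".split(),  # CH3-O-CH2-CH3
}

# Integrating over translations keeps only the count of each species
descriptors = {name: Counter(symbols) for name, symbols in molecules.items()}
for name, counts in descriptors.items():
    print(name, dict(sorted(counts.items())))
```

All three entries reduce to the same counts, which is why this first-order descriptor cannot tell the isomers apart.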



In order to make our representation more sensitive, we can take the [Haar Integral](https://link.springer.com/content/pdf/10.1007%2F11499145_85.pdf) over tensor products of our $\big| A\big\rangle$, giving:

\begin{equation}
\left|A^{(\nu)}\right\rangle_{\hat{t}} = \int d\hat{t}\, \underbrace{\hat{t} \left|A\right\rangle \otimes \hat{t} \left|A\right\rangle \otimes \cdots \otimes \hat{t} \left|A\right\rangle}_{\nu \text{ factors}}
\end{equation}
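
To get a feel for what the tensor product adds, here is a one-dimensional toy version on a periodic grid. For $\nu = 1$ the translation integral collapses the density to a single number, while for $\nu = 2$ it yields a two-point correlation that depends on the separation between atoms. The periodic cell, the grid, and the helper names `density`, `nu1`, and `nu2` are assumptions of this sketch, not part of libRascal.

```python
import numpy as np

# Assumed toy setup: a periodic 1D cell of length `box` sampled on `n` grid points
box, n, sigma = 10.0, 200, 0.3
x = np.linspace(0.0, box, n, endpoint=False)
dx = box / n

def density(positions):
    """Periodic sum of Gaussians of width sigma centered on `positions`."""
    rho = np.zeros_like(x)
    for p in positions:
        d = (x - p + box / 2) % box - box / 2  # minimum-image separation
        rho += np.exp(-d ** 2 / sigma ** 2)
    return rho

def nu1(rho):
    """nu = 1: integrate the density over all translations (a single number)."""
    return rho.sum() * dx

def nu2(rho):
    """nu = 2: integrate rho(x) * rho(x + s) over translations, for each separation s."""
    return np.array([np.dot(rho, np.roll(rho, -k)) for k in range(n)]) * dx

a = density([2.0, 3.0])          # two atoms 1.0 apart
a_shifted = density([6.5, 7.5])  # the same structure, translated
b = density([2.0, 4.5])          # two atoms 2.5 apart

print(np.isclose(nu1(a), nu1(b)))           # nu = 1 cannot tell a and b apart
print(np.allclose(nu2(a), nu2(a_shifted)))  # nu = 2 is unchanged by translation
print(np.allclose(nu2(a), nu2(b)))          # nu = 2 separates a from b
```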

<!---
Here, we have two atomic environments centered at atoms $i$ and $j$. Their density functions are:

\begin{equation}
\big\langle r \big| X_i\big\rangle = \sum_{i'\in X_i} g(\mathbf{r}-\mathbf{r_{ii'}}) \hspace{0.5in}
\big\langle r \big| X_j\big\rangle = \sum_{j'\in X_j} g(\mathbf{r}-\mathbf{r_{jj'}})
\end{equation}

which, in this case, simplifies to:

\begin{equation}
g^{(i)}(\mathbf{r}-\mathbf{r_i}) =  e^{-\frac{|\mathbf{r_{i'}}-\mathbf{r}|^2}{\sigma^2}} 
\end{equation}
--->

To compare two atomic environments, one can define the kernel comparing the density functions, integrating over all rotations:

\begin{equation}
k(i,j) = \int d\hat{R} \int d\mathbf{r}\, g^{(i)}(\mathbf{r}-\mathbf{r_i})\, g^{(j)}(\mathbf{r}-\mathbf{r_j}) = g_{n n'l}^{(i)}\cdot g_{n n'l}^{(j)}
\end{equation}

where $g_{n n'l}^{(i)}$ is a representation of the density function expanded in terms of [spherical harmonics](https://en.wikipedia.org/wiki/Spherical_harmonics) using radial degree $n$ and angular degree $l$.

As with previously proposed atomic descriptors, translational invariance is achieved by centering the descriptors on atoms. Rotational invariance is achieved with a symmetrized version of the overlap kernel:

\begin{equation*}
K^{(\nu)}\left(X_j, X_k\right) = \int d\hat{R}\, \Big|\big\langle X_j \big| \hat{R} \big| X_k\big\rangle\Big|^\nu = \big\langle X_j^{(\nu)} \big| X_k^{(\nu)}\big\rangle
\end{equation*}

where $\int d\hat{R}$ denotes integration over all rotations. When the overlap is raised to the $\nu^{th}$ power, the descriptor contains information on correlations of order $\nu + 1$. $\big| X_j^{(\nu)}\big\rangle$ and $\big| X_k^{(\nu)}\big\rangle$ represent the $\nu^{th}$ order SOAP descriptor vectors of the $j^{th}$ and $k^{th}$ environments.
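
The rotational average can be illustrated numerically in two dimensions, where a rotation is described by a single angle. This is a simplified sketch rather than the three-dimensional integral used by SOAP. It relies on the fact that the overlap of two Gaussians of width $\sigma$ is, up to a constant prefactor, $e^{-|\mathbf{a}-\mathbf{b}|^2/(2\sigma^2)}$. The names `rotate`, `overlap`, and `rotational_kernel` and the toy environments are hypothetical.

```python
import numpy as np

def rotate(points, theta):
    """Rotate 2D points (one per row) counterclockwise by angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    return points @ np.array([[c, -s], [s, c]]).T

def overlap(x_j, x_k, sigma=0.5):
    """Overlap of two sums of Gaussians, dropping the constant prefactor."""
    sq_dist = np.sum((x_j[:, None, :] - x_k[None, :, :]) ** 2, axis=-1)
    return np.sum(np.exp(-sq_dist / (2 * sigma ** 2)))

def rotational_kernel(x_j, x_k, nu=2, n_angles=360, sigma=0.5):
    """Average overlap**nu over a uniform grid of rotation angles."""
    thetas = np.linspace(0.0, 2 * np.pi, n_angles, endpoint=False)
    return np.mean([overlap(x_j, rotate(x_k, t), sigma) ** nu for t in thetas])

# Neighbor positions relative to the central atom (illustrative values)
env_a = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
env_b = np.array([[2.0, 0.0], [0.0, 2.5], [-2.0, -1.0]])
env_a_rotated = rotate(env_a, 37 * 2 * np.pi / 360)  # a multiple of the grid step

k_aa = rotational_kernel(env_a, env_a)
k_aa_rotated = rotational_kernel(env_a, env_a_rotated)
k_ab = rotational_kernel(env_a, env_b)
print(np.isclose(k_aa, k_aa_rotated), k_aa > k_ab)
```

Rotating one environment only reorders the terms of the angular average, so the kernel is unchanged, while a genuinely different environment gives a smaller value.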

****

For $\nu = 1$, the symmetrized SOAP descriptor (in terms of spherical harmonic coefficients) simplifies to:

\begin{equation*}
\big\langle \alpha n \big| X_k^{(1)}\big\rangle = \sqrt{8\pi^2}\,\big\langle\alpha n 0 0 \big| X_k\big\rangle
\end{equation*}


### Here is what the 1st order SOAP vectors would look like for methane and water:

In [None]:
# Concept of Smooth Overlap of Atomic Positions
# Why do we need descriptors? What defines a *good* descriptor?
# Translations, rotations, and permutations of like atoms
# Compact while retaining maximum information
# Systematically improvable

# Illustrate with a molecule the roles of:
# - atom-centering
# - because atoms of the same type are summed together, there is no way to distinguish their positions
# - rotational invariance 

# TADA A WILD SOAP VECTOR APPEARS
# Caveat: Soap *vectors* vs. Soap *kernels*

In [None]:
# have a few well-known molecules available 
# on left, molecule as rendered
# on right molecule as SOAP vectors

In [None]:
# have a few well-known molecule pairings available 
# compare SOAP vectors

In [None]:
# have a few well-known molecule pairings available 
# compare SOAP kernels

In [None]:
#Code breakdown for earlier examples

In [None]:
# SOAP is a part of a more general family  of density-based descriptors, 
# please feel free to check out these publications for the full range
# of weird and wonderful.

In [None]:
# Link to next notebook
# THE SAGA CONTINUES
# SOAP 2: Electronic Boogaloo

In [None]:
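import numpy as np
from matplotlib import pyplot, cm
from matplotlib import patches as mpatches
from matplotlib.collections import PatchCollection
style = {"horizontalalignment": "center", "verticalalignment": "center", "fontsize": 20, "color": "k"}
# Neighbor positions for two 1D environments
rjs = [[-1.0, 0.3], [-0.35, 0.95]]
sig = 0.2  # Gaussian width
srange = np.linspace(-2, 2, 100)  # grid on which the densities are evaluated
# Density of each environment: a sum of Gaussians centered on its neighbors
ps = np.array([[sum(np.exp(-(np.abs(rj - r) ** 2) / sig ** 2) for rj in rset) for r in srange] for rset in rjs])
# Normalize each density so that its grid values sum to one
ps = np.array([p / p.sum() for p in ps])
for p in ps:
    pyplot.plot(srange, p)

# Overlay the mirror image of the second density
pyplot.plot(srange, ps[-1][::-1])
pyplot.show()
# Overlap of the first density with the mirrored and the original second density
print(np.dot(ps[0], ps[1][::-1]), np.dot(ps[0], ps[1]))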