# Using Rascal to Calculate SOAP Vectors of Small Molecules

This notebook is an introductory how-to on calculating the SOAP vectors of small molecules and training a model for their atomization energies on these vectors. For more information on the derivation, utility, and calculation of SOAP vectors, please refer to (among others): 
- [On representing chemical environments (Bartók 2013)](https://journals.aps.org/prb/abstract/10.1103/PhysRevB.87.184115)
- [Comparing molecules and solids across structural and alchemical space (De 2016)](https://pubs.rsc.org/en/content/articlepdf/2016/cp/c6cp00415f)

Beyond librascal, the packages used in this tutorial are: [time](https://docs.python.org/3/library/time.html), [json](https://docs.python.org/3/library/json.html), [tqdm](https://tqdm.github.io/), [numpy](https://numpy.org/), [matplotlib](https://matplotlib.org/), and [ase](https://wiki.fysik.dtu.dk/ase/index.html).

In [71]:
%matplotlib notebook
%reload_ext autoreload
%autoreload 2
from utils import *
readme_button()

librascal
=========

.. start-intro

librascal is a versatile and scalable fingerprint and machine learning
code. It focuses on the efficient construction of representations of
atomic structures, that can then be fed to any supervised or
unsupervised learning algorithm. Simple regression code will be included
for testing purposes, but the long-term goal is to develop a separate
collection of tools to this end.

librascal is currently considered a standalone code. However, we aim to
provide enough flexibility to interface it with other codes such as
LAMMPS and PLUMED-2.0.  It can be used as a C++ library as well as a
python module.  To be able to call it from python, we have used the
pybind11 library.

Although it is currently a serial-only code, we aim to parallelize it with MPI
to speed up the calculations significantly.  Parallelization is possible especially
over atoms in a structure (for large structures), over structures in a
collection (for large collections of small structures), or over components of a
representation (for representations with a large number of independent functions
or components).

It is released under the GNU Lesser General Public License version 3, which
means that it can be modified and freely distributed, although we take
no responsibility for its misuse.

Development
-----------

The code is currently in the alpha development phase; it is not yet
suitable for public use. Nevertheless, there is a significant amount of
functionality (including two tutorials) currently working and available
to test if you’re feeling adventurous. Feedback and bug reports are
welcome, as long as you keep the above in mind.

.. end-intro

Installation
------------

.. start-install

Dependencies
~~~~~~~~~~~~

Before installing librascal, please make sure you have at least the
following packages installed:

+-------------+--------------------+
| Package     | Required version   |
+=============+====================+
| gcc (g++)   | 4.9 or higher      |
+-------------+--------------------+
| clang       | 4.0 or higher      |
+-------------+--------------------+
| cmake       | 2.8 or higher      |
+-------------+--------------------+
| python      | 3.6 or higher      |
+-------------+--------------------+
| numpy       | 1.13 or higher     |
+-------------+--------------------+
| ASE         | 3.18 or higher     |
+-------------+--------------------+

Other necessary packages (such as Eigen and PyBind11) are downloaded
automatically when compiling Rascal.

The following packages are required for building some optional features:

+------------------+-------------+--------------------+
| Feature          | Package     | Required version   |
+==================+=============+====================+
| Documentation    | pandoc      | (latest)           |
+------------------+-------------+--------------------+
|                  | sphinx      | 2.1.2              |
+------------------+-------------+--------------------+
|                  | breathe     | 4.13.1             |
+------------------+-------------+--------------------+
|                  | nbsphinx    | (latest)           |
+------------------+-------------+--------------------+

Compiling
~~~~~~~~~

To compile the code it is necessary to have CMake 3.0 and a C++ compiler
supporting C++14. During the configuration, it will automatically try to
download the external libraries on which it depends:

-  Eigen
-  Pybind11
-  Boost (only the unit test framework library)
-  Python3

And the following libraries to build the documentation:

-  Doxygen
-  Sphinx
-  Breathe

Beware, Python3 is mandatory. The code won’t work with a Python version
older than 3.

With the package manager of your choice, the following YAML environment file
should install all the Python packages required for Rascal.

.. code:: yaml

    name: librascal-env
    dependencies:
      - python=3.6 
      - pip:
        - numpy
        - matplotlib
        - scipy
        - mpmath
        - ase
        - ubjson
        - cpplint
        - sphinx==2.1.2
        - sphinx_rtd_theme
        - breathe==4.13.1
        - pandoc
        - nbsphinx
        - jupyter
        - qml
        - autopep8
        - pytest

To configure and compile the code with the default options, on \*nix
systems (Windows is not supported):

.. code:: shell

   mkdir build
   cd build
   cmake ..
   make

Customizing the build
~~~~~~~~~~~~~~~~~~~~~

The library supports several alternative builds that have additional
dependencies. Note that the ``ncurses`` GUI for cmake (ccmake) is quite
helpful to customize the build options.

1. Tests

   Librascal source code is extensively tested (both C++ and Python).
   The BOOST unit_test_framework is required to build the tests (see
   BOOST.md for further details on how to install the boost library). To
   build and run the tests:

   .. code:: shell

      cd build
      cmake -DBUILD_TESTS=ON ..
      make
      ctest -V

   In addition to testing the behaviour of the code, the test suite also checks
   for formatting compliance with the clang-format and autopep8 packages (these
   dependencies are optional). To install these dependencies on ubuntu:

   .. code:: shell

      sudo apt-get install clang-format
      pip3 install autopep8

2. Build Type

   Several build types are available: Release (default), Debug and
   RelWithDebInfo. To build in an alternative mode:

   .. code:: shell

      cd build
      cmake -DCMAKE_BUILD_TYPE=Debug ..
      make

   Or

   .. code:: shell

      cd build
      cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo \
         -DCMAKE_C_FLAGS_RELWITHDEBINFO="-O3 -g -DNDEBUG" ..
      make

3. Documentation

   The documentation relies on the sphinx (with nbsphinx and breathe
   extensions), doxygen, pandoc, and graphviz
   packages. To install them on ubuntu:

   .. code:: shell

     pip3 install sphinx sphinx_rtd_theme breathe nbsphinx
     sudo apt-get install pandoc doxygen graphviz

   Then to build the documentation run:

   .. code:: shell

     cd build
     cmake -DENABLE_DOC=ON ..
     make doc

   The index HTML file is located at ``./docs/dox_html/index.html``

4. Helpers for Developers

   -  To remove all the cmake files/folders except for the external
      library (enable glob and remove):

   .. code:: shell

      shopt -s extglob
      rm -fr -- !(external|third-party)

   -  To help developers conform their contribution to the coding
      convention, the formatting of new functionalities can be automated
      using clang-format (for the c++ files) and autopep8 (for the
      python files). The .clang-format and .pycodestyle files define
      common settings to be used.

      To enable these functionalities (optional) you can install these
      tools with:

      .. code:: shell

         sudo apt-get install clang-format
         pip install autopep8

      The automatic formatting of the C++ and Python files can be
      triggered by:

      .. code:: shell

         cd build
         cmake ..
         make pretty-cpp
         make pretty-python

      Please use these tools with caution as they can potentially
      introduce unwanted changes to the code. If code needs to be
      specifically excluded from auto formatting, e.g. a matrix which
      should be human-readable, code comments tell the formatters to
      ignore lines:

      C++

      .. code:: C++

         // clang-format off
         SOME CODE TO IGNORE
         // clang-format on

      python

      .. code:: python

         SOME LINE TO IGNORE # noqa

      where ``noqa`` stands for ``no`` ``q``\ uality ``a``\ ssurance.

5. Bindings

   Librascal relies on the pybind11 library to automate the generation
   of the python bindings which are built by default. Nevertheless, to
   build only the c++ library:

   .. code:: shell

      cd build
      cmake -DBUILD_BINDINGS=OFF ..
      make

Miscellaneous Information
-------------------------

-  Common cmake flags:

   -  -DCMAKE_C_COMPILER
   -  -DBUILD_BINDINGS
   -  -DUSER
   -  -DINSTALL_PATH
   -  -DCMAKE_BUILD_TYPE
   -  -DENABLE_DOC
   -  -DBUILD_TESTS

-  Special flags:

   -  -DBUILD_BINDINGS:

      -  ON (default) -> build python binding
      -  OFF -> does not build python binding

   -  -DINSTALL_PATH:

      -  empty (default) -> does not install in a custom folder
      -  custom string -> root path for the installation

   -  -DUSER:

      -  OFF (default) -> changes nothing
      -  ON -> install root is in the user’s home directory, i.e.
         ``~/.local/``

To build librascal as a docker environment:

.. code:: shell

   sudo docker build -t test -f ./docker/install_env.dockerfile  .
   sudo docker run -it -v /path/to/repo/:/home/user/  test



In [63]:
import os, sys

import time
import json
from tqdm import tqdm
import numpy as np
from matplotlib import pylab as plt

import ase
from ase.io import read, write
from ase.build import make_supercell
from ase.visualize import view

import rascal
from rascal.representations import SphericalInvariants as SOAP

# SOAP: Power spectrum

In [64]:
frames = read("./data/small_molecules-1000.xyz",":1")

In [65]:
hypers = dict(soap_type="PowerSpectrum",
              interaction_cutoff=3.5, 
              max_radial=2, 
              max_angular=1, 
              gaussian_sigma_constant=0.5,
              gaussian_sigma_type="Constant",
              cutoff_smooth_width=0.,
              normalize=False,
              )
soap = SOAP(**hypers)

In [66]:
representation = soap.transform(frames)

In [67]:
X = representation.get_feature_matrix().T
X.shape

(17, 80)
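
As a sanity check on the shape above, the number of power-spectrum features can be counted by hand. The sketch below assumes (this is not stated in the notebook) that one block of `max_radial**2 * (max_angular + 1)` components is stored per unordered pair of species, and that the frame contains four distinct elements. `n_power_spectrum_features` is a hypothetical helper, not part of librascal.

```python
def n_power_spectrum_features(n_species, max_radial, max_angular):
    """Count power-spectrum components, assuming one block per unordered species pair."""
    n_species_pairs = n_species * (n_species + 1) // 2
    return n_species_pairs * max_radial**2 * (max_angular + 1)


# Hypers used above: max_radial=2, max_angular=1. Four species is an assumption.
n_features = n_power_spectrum_features(n_species=4, max_radial=2, max_angular=1)
print(n_features)
```

Under these assumptions the count is 80, which matches the second dimension of `X`; the first dimension is the number of atomic centers in the frame.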

In [68]:
frames = read("./data/small_molecules-1000.xyz",":1")

In [69]:
# select a subset of the atoms in the structure
for frame in frames:
    is_a_center_atom = np.ones_like(frame.get_atomic_numbers())
    is_a_center_atom[3] = False
    frame.set_array("is_a_center_atom", is_a_center_atom)
representation = soap.transform(frames)

In [None]:
X_sub = representation.get_feature_matrix().T
X_sub.shape

# Learning the formation energies of small molecules
Use the buttons below to see the source code of the utility functions.

In [70]:
make_buttons()


```python
def train_krr_model(zeta,Lambda,representation,frames,y,jitter=1e-8):
    features = compute_representation(representation,frames)
    kernel = compute_kernel(zeta,features)    
    # adjust the kernel so that it is properly scaled
    delta = np.std(y) / np.mean(kernel.diagonal())
    kernel[np.diag_indices_from(kernel)] += Lambda**2 / delta **2 + jitter
    # train the krr model
    weights = np.linalg.solve(kernel,y)
    model = KRR(zeta, weights, representation, features)
    return model,kernel
```
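
To make the regularization in `train_krr_model` concrete, here is a minimal NumPy-only sketch of the same steps on synthetic data. The random features and targets are stand-ins (assumptions) for the SOAP features and energies, and the kernel is a plain normalized dot product raised to the power `zeta`, not librascal's own kernel.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the features and target energies (assumed data).
features = rng.normal(size=(50, 8))
features /= np.linalg.norm(features, axis=1, keepdims=True)
y = rng.normal(size=50)

zeta, Lambda, jitter = 2, 5e-3, 1e-8

# Polynomial kernel between all pairs of unit-normalized feature vectors.
kernel = (features @ features.T) ** zeta

# Same scaling as train_krr_model: regularizer relative to the target spread.
delta = np.std(y) / np.mean(kernel.diagonal())
kernel_reg = kernel.copy()
kernel_reg[np.diag_indices_from(kernel_reg)] += Lambda**2 / delta**2 + jitter

# Solve (K + regularizer * I) w = y for the regression weights.
weights = np.linalg.solve(kernel_reg, y)

# Predictions on the training set use the unregularized kernel.
y_fit = kernel @ weights
```

Unlike this sketch, `train_krr_model` adds the regularizer to `kernel` in place, so the kernel it returns already carries the modified diagonal.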

In [None]:
# Load the small molecules 
frames = read('./data/small_molecules-1000.xyz',':600')

## With the full power spectrum

In [None]:
hypers = dict(soap_type="PowerSpectrum",
              interaction_cutoff=3.5, 
              max_radial=6, 
              max_angular=6, 
              gaussian_sigma_constant=0.4,
              gaussian_sigma_type="Constant",
              cutoff_smooth_width=0.5,
              )
soap = SOAP(**hypers)

In [None]:
frames_train, y_train, frames_test, y_test = split_dataset(frames,0.8)
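
The source of `split_dataset` lives in `utils` and is not reproduced here. As a rough idea of what such a split involves, the sketch below shuffles indices and cuts them at the requested fraction; `split_indices` and its `seed` argument are hypothetical and may differ from the actual helper.

```python
import numpy as np


def split_indices(n_items, train_fraction, seed=10):
    """Return shuffled, disjoint train/test index arrays (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_items)
    n_train = int(round(train_fraction * n_items))
    return indices[:n_train], indices[n_train:]


# 600 frames with an 80/20 split, as in the cell above.
train_idx, test_idx = split_indices(600, 0.8)
print(len(train_idx), len(test_idx))
```

With 600 frames and a fraction of 0.8 this gives 480 training and 120 test indices, which can then be used to select both frames and energies consistently.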

In [None]:
zeta = 2
Lambda = 5e-3
krr,k = train_krr_model(zeta, Lambda, soap, frames_train, y_train)

In [None]:
y_pred = krr.predict(frames_test)
get_score(y_pred, y_test)
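
The exact metrics reported by `get_score` are defined in `utils`. Assuming it reports standard regression errors, a minimal stand-in could look like the following; `regression_scores` is a hypothetical name and the choice of MAE, RMSE and R² is an assumption.

```python
import numpy as np


def regression_scores(y_pred, y_true):
    """Return common regression error metrics (hypothetical stand-in for get_score)."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    error = y_pred - y_true
    ss_res = np.sum(error**2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return dict(
        MAE=np.mean(np.abs(error)),
        RMSE=np.sqrt(np.mean(error**2)),
        R2=1.0 - ss_res / ss_tot,
    )


# Toy values: only the last prediction is off, by one unit.
scores = regression_scores([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(scores)
```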

In [None]:
plt.scatter(y_test, y_pred, s=3)  # reference (DFT) values on x, predictions on y
plt.axis('scaled')
plt.xlabel('DFT energy / (eV/atom)')
plt.ylabel('Predicted energy / (eV/atom)')

## With just the radial spectrum

In [None]:
hypers = dict(soap_type="RadialSpectrum",
              interaction_cutoff=3.5, 
              max_radial=6, 
              max_angular=0, 
              gaussian_sigma_constant=0.4,
              gaussian_sigma_type="Constant",
              cutoff_smooth_width=0.5,
              )
soap = SOAP(**hypers)

In [None]:
frames_train, y_train, frames_test, y_test = split_dataset(frames,0.8)

In [None]:
zeta = 2
Lambda = 5e-4
krr,k = train_krr_model(zeta, Lambda, soap, frames_train, y_train)

In [None]:
y_pred = krr.predict(frames_test)
get_score(y_pred, y_test)

In [None]:
plt.scatter(y_test, y_pred, s=3)  # reference (DFT) values on x, predictions on y
plt.axis('scaled')
plt.xlabel('DFT energy / (eV/atom)')
plt.ylabel('Predicted energy / (eV/atom)')

# Make a map of the dataset

## Utils

In [None]:
def compute_representation(representation,frames):
    # use the representation passed in rather than the global `soap`
    expansions = representation.transform(frames)
    return expansions

def compute_kernel(zeta, rep1, rep2=None):
    if rep2 is None:
        kernel = rep1.cosine_kernel_global(zeta)
    else:
        kernel = rep1.cosine_kernel_global(rep2,zeta)
    return kernel
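
For intuition about what a global kernel with exponent `zeta` measures, here is a NumPy sketch. It assumes (this is not confirmed by the notebook) that the similarity between two structures is the average, over all pairs of atomic environments, of the normalized dot product raised to the power `zeta`. `global_cosine_kernel` is a hypothetical function and the random arrays stand in for per-atom features.

```python
import numpy as np


def global_cosine_kernel(env_a, env_b, zeta):
    """Average (x_i . x_j)**zeta over all environment pairs of two structures."""
    # Normalize each per-atom feature vector to unit length.
    a = env_a / np.linalg.norm(env_a, axis=1, keepdims=True)
    b = env_b / np.linalg.norm(env_b, axis=1, keepdims=True)
    return np.mean((a @ b.T) ** zeta)


rng = np.random.default_rng(0)
# Random stand-ins for per-atom features: 5 and 7 atoms, 12 features each.
mol_a = rng.random((5, 12))
mol_b = rng.random((7, 12))

k_ab = global_cosine_kernel(mol_a, mol_b, zeta=2)
k_ba = global_cosine_kernel(mol_b, mol_a, zeta=2)
k_aa = global_cosine_kernel(mol_a, mol_a, zeta=2)
```

Averaging over environment pairs makes the kernel symmetric and independent of the number of atoms, which is why molecules of different sizes can be compared directly.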

In [None]:
def link_ngl_wdgt_to_ax_pos(ax, pos, ngl_widget):
    r"""
    Initial idea for this function comes from @arose, the rest is @gph82 and @clonker
    """
    from matplotlib.widgets import AxesWidget
    from scipy.spatial import cKDTree
    
    kdtree = cKDTree(pos)        
    #assert ngl_widget.trajectory_0.n_frames == pos.shape[0]
    x, y = pos.T
    
    lineh = ax.axhline(ax.get_ybound()[0], c="black", ls='--')
    linev = ax.axvline(ax.get_xbound()[0], c="black", ls='--')
    dot, = ax.plot(pos[0,0],pos[0,1], 'o', c='red', ms=7)

    ngl_widget.isClick = False
    
    def onclick(event):
        linev.set_xdata((event.xdata, event.xdata))
        lineh.set_ydata((event.ydata, event.ydata))
        data = [event.xdata, event.ydata]
        _, index = kdtree.query(x=data, k=1)
        # set_xdata/set_ydata expect sequences, so wrap the scalar in a list
        dot.set_xdata([x[index]])
        dot.set_ydata([y[index]])
        ngl_widget.isClick = True
        ngl_widget.frame = index
    
    def my_observer(change):
        r"""Here comes the code that you want to execute
        """
        ngl_widget.isClick = False
        _idx = change["new"]
        try:
            dot.set_xdata([x[_idx]])
            dot.set_ydata([y[_idx]])
        except IndexError:
            # fall back to the first point if the frame index is out of range
            dot.set_xdata([x[0]])
            dot.set_ydata([y[0]])
            print("caught index error with index %s (new=%s, old=%s)" % (_idx, change["new"], change["old"]))
    
    # Connect axes to widget
    axes_widget = AxesWidget(ax)
    axes_widget.connect_event('button_release_event', onclick)
    
    # Connect widget to axes
    ngl_widget.observe(my_observer, "frame", "change")

## Make a map with a kernel PCA projection

In [None]:
# Load the small molecules 
frames = read('./data/small_molecules-1000.xyz',':600')

In [None]:
hypers = dict(soap_type="PowerSpectrum",
              interaction_cutoff=3.5, 
              max_radial=6, 
              max_angular=6, 
              gaussian_sigma_constant=0.4,
              gaussian_sigma_type="Constant",
              cutoff_smooth_width=0.5,
              )
soap = SOAP(**hypers)

In [None]:
zeta = 2

features = compute_representation(soap, frames)

kernel = compute_kernel(zeta,features)

In [None]:
from sklearn.decomposition import KernelPCA

In [None]:
kpca = KernelPCA(n_components=2,kernel='precomputed')
kpca.fit(kernel)

In [None]:
X = kpca.transform(kernel)

In [None]:
plt.scatter(X[:,0],X[:,1],s=3)
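
To show what the projection of a precomputed kernel amounts to, the following NumPy sketch centers a kernel matrix and projects the training points onto its leading eigenvectors. It is an illustration on a toy linear kernel, not scikit-learn's implementation; `kernel_pca_projection` is a hypothetical helper, and the result should agree with `KernelPCA` only up to the sign of each component.

```python
import numpy as np


def kernel_pca_projection(kernel, n_components=2):
    """Project the training points of a precomputed kernel onto its top components."""
    n = kernel.shape[0]
    # Center the kernel in feature space: K_c = H K H with H = I - (1/n) * ones.
    centering = np.eye(n) - np.full((n, n), 1.0 / n)
    kernel_centered = centering @ kernel @ centering
    # eigh returns eigenvalues in ascending order, so reverse to get the largest first.
    eigvals, eigvecs = np.linalg.eigh(kernel_centered)
    eigvals = eigvals[::-1][:n_components]
    eigvecs = eigvecs[:, ::-1][:, :n_components]
    # Coordinates of the training points: eigenvectors scaled by sqrt(eigenvalue).
    return eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))


rng = np.random.default_rng(0)
points = rng.normal(size=(30, 5))
toy_kernel = points @ points.T  # linear kernel as a stand-in for the SOAP kernel

projection = kernel_pca_projection(toy_kernel)
print(projection.shape)
```

Each row of `projection` plays the role of a row of `X` above: one two-dimensional point per structure, with the first component carrying the largest variance.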

## Make an interactive map

In [None]:
# package to visualize the structures in the notebook
# https://github.com/arose/nglview#released-version
import nglview

In [None]:
iwdg = nglview.show_asetraj(frames)
# set up the visualization
iwdg.add_unitcell()
iwdg.add_spacefill()
iwdg.remove_ball_and_stick()
iwdg.camera = 'orthographic'
iwdg.parameters = { "clipDist": 0 }
iwdg.center()
iwdg.update_spacefill(radiusType='covalent',
                                   scale=0.6,
                                   color_scheme='element')
iwdg._remote_call('setSize', target='Widget',
                               args=['%dpx' % (600,), '%dpx' % (400,)])
iwdg.player.delay = 200.0

In [None]:
link_ngl_wdgt_to_ax_pos(plt.gca(), X, iwdg)
plt.scatter(X[:,0],X[:,1],s=3)
iwdg