# Using Rascal and SOAP to Predict Properties

This notebook is intended as an introductory how-to on training a model on materials properties based upon SOAP vectors. For more information on the variable conventions, derivation, utility, and calculation of SOAP vectors, please refer to (among others): 
- [On representing chemical environments (Bartók 2013)](https://journals.aps.org/prb/abstract/10.1103/PhysRevB.87.184115)
- [Gaussian approximation potentials: A brief tutorial introduction (Bartók 2015)](https://onlinelibrary.wiley.com/doi/full/10.1002/qua.24927)
- [Comparing molecules and solids across structural and alchemical space (De 2016)](https://pubs.rsc.org/en/content/articlepdf/2016/cp/c6cp00415f)
- [Machine Learning of Atomic-Scale Properties Based on Physical Principles (Ceriotti 2018)](https://link.springer.com/content/pdf/10.1007%2F978-3-319-42913-7_68-1.pdf)

Beyond libRascal, the packages used in this tutorial are:  [json](https://docs.python.org/2/library/json.html), [numpy](https://numpy.org/), [ipywidgets](https://ipywidgets.readthedocs.io/en/latest/), [matplotlib](https://matplotlib.org/), and [ase](https://wiki.fysik.dtu.dk/ase/index.html).

In [None]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2
import sys
sys.path.append('./utilities')
from learning_utils import *
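try:
    # Use SphericalInvariants when it is available...
    from rascal.representations import SphericalInvariants as SOAP
except ImportError:
    # ...otherwise fall back to the SOAP class name.
    from rascal.representations import SOAP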
readme_button()

# Code-Light Overview of Property Prediction

## The Impact of the Hyperparameters on Training a Kernel Ridge Regression (KRR)
This time, when we open the tutorial, you can change the input file, the hyperparameters, and the property used for the kernel ridge regression; each change is saved to `mySOAP`. We've also included some suggested hyperparameters. Why not try the Power Spectrum first?

In [None]:
mySOAP=learning_tutorial(interactive=True)

## Training the KRR

In [None]:
mySOAP.train_krr_model()

## Predicting the Properties of our Test Set

In [None]:
mySOAP.predict_test_set()

## Calculating the Properties of a New Dataset

In [None]:
mySOAP.predict_new_set(filename='./data/learning/small_molecules-1000.xyz')

# Coding Prediction Explicitly
Now that we've explained the workflow, let's strip away the learning_tutorial wrapper and run the computation again:

## Imports and Helper Functions

In [None]:
from ase.io import read
import numpy as np
from rascal.models import Kernel
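try:
    # Use SphericalInvariants when it is available...
    from rascal.representations import SphericalInvariants as SOAP
except ImportError:
    # ...otherwise fall back to the SOAP class name.
    from rascal.representations import SOAP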

def split_dataset(N, training_percentage, seed=20):
    np.random.seed(seed)
    ids = list(range(N))
    np.random.shuffle(ids)
    return ids[:int(training_percentage*N)], ids[int(training_percentage*N):]


def compute_kernel(calculator, features1, features2=None, kernel_type="Atomic", **kwargs):
    my_kernel = Kernel(representation=calculator, name='Cosine', 
            target_type=kernel_type, zeta=2, **kwargs)
    return my_kernel(X=features1, Y=features2)

class KRR(object):
    
    def __init__(self, weights, features, kernel_type, **kwargs):
        self.weights = weights
        self.hypers = dict(**kwargs)
        self.calculator = SOAP(**kwargs)
        self.X = features
        self.kernel_type = kernel_type

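    def predict(self, frames):
        # Compute the SOAP features of the frames to predict on.
        features = self.calculator.transform(frames)
        
        # Kernel between the training features and the new features,
        # built with the same kernel type that was used for training.
        kernel = compute_kernel(calculator=self.calculator,
                                features1=self.X, features2=features,
                                kernel_type=self.kernel_type,
                                **self.hypers)
        return np.dot(self.weights, kernel)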
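
To see what `split_dataset` returns, the helper can be exercised on its own. The snippet below is a minimal sketch that repeats the definition from the cell above so that it runs standalone; the values `N=10` and `training_percentage=0.8` are illustrative only.

```python
import numpy as np

def split_dataset(N, training_percentage, seed=20):
    # Same logic as the helper above: shuffle 0..N-1 with a fixed seed,
    # then cut the shuffled list at the training fraction.
    np.random.seed(seed)
    ids = list(range(N))
    np.random.shuffle(ids)
    n_train = int(training_percentage * N)
    return ids[:n_train], ids[n_train:]

train_ids, test_ids = split_dataset(10, 0.8)
print(len(train_ids), len(test_ids))  # 8 2

# The two index lists are disjoint and together cover every index once.
assert sorted(train_ids + test_ids) == list(range(10))

# A fixed seed makes the split reproducible between calls.
assert split_dataset(10, 0.8) == (train_ids, test_ids)
```

Because the seed is fixed, rerunning the notebook selects the same training and test frames each time, which keeps the reported errors comparable between runs.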

## Setting the Inputs and Hyperparameters
(Everything else in the workflow is a function of these parameters)

In [None]:
input_file = 'data/learning/small_molecules-1000.xyz'
hyperparameters = dict(soap_type='PowerSpectrum',
                       interaction_cutoff=3.5,
                       max_radial=2,
                       max_angular=1,
                       gaussian_sigma_constant=0.5,
                       gaussian_sigma_type='Constant',
                       cutoff_smooth_width=0.0
                      )
property_to_ml = "dft_formation_energy_per_atom_in_eV"
kernel_type = "Structure"

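training_percentage = 0.8  # fraction of the selected frames used for training
zeta = 2                   # kernel exponent; note that compute_kernel hard-codes zeta=2
Lambda = 5e-3              # regularization strength
jitter = 1e-8              # small term added to the kernel diagonal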

## Computing the representation and feature set

In [None]:
frames = np.array(read(input_file,":"))
# Only the first 10% of the frames are used in this workflow.
number_of_frames = int(len(frames)*0.1)
print(number_of_frames)

representation = SOAP(**hyperparameters)

property_values = np.array([cc.info[property_to_ml] for cc in frames])

train_idx, test_idx = split_dataset(number_of_frames, training_percentage)

features = representation.transform(frames[train_idx])

## Constructing the kernel for ML and KRR

In [None]:
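kernel = compute_kernel(representation,
                        features,
                        kernel_type=kernel_type,
                        **hyperparameters)

# Scale the regularization by the spread of the targets relative to the kernel diagonal.
delta = np.std(property_values[train_idx]) / np.mean(kernel.diagonal())
# Add the regularization and the jitter to the kernel diagonal.
kernel[np.diag_indices_from(kernel)] += Lambda**2 / delta**2 + jitter

# Solve the regularized linear system for the regression weights.
weights = np.linalg.solve(kernel, property_values[train_idx])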

model = KRR(weights, features, kernel_type, **hyperparameters)
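
The regression step itself does not depend on librascal. The following is a minimal NumPy sketch of the same regularize, solve, and predict sequence on random toy data. The `cosine_kernel` function is an assumed stand-in (a normalized dot product raised to the power `zeta`), not librascal's implementation, and the feature vectors and property values are hypothetical.

```python
import numpy as np

def cosine_kernel(A, B, zeta=2):
    # Toy stand-in for the kernel (an assumption for illustration):
    # normalize each row, take dot products, raise to the power zeta.
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return (A @ B.T) ** zeta

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 5))  # hypothetical training features
y_train = rng.normal(size=20)       # hypothetical property values
X_test = rng.normal(size=(4, 5))    # hypothetical test features

Lambda, jitter = 5e-3, 1e-8

# Training kernel with the regularization added to its diagonal.
K = cosine_kernel(X_train, X_train)
delta = np.std(y_train) / np.mean(K.diagonal())
K_reg = K.copy()
K_reg[np.diag_indices_from(K_reg)] += Lambda**2 / delta**2 + jitter

# Regression weights from the regularized linear system.
weights = np.linalg.solve(K_reg, y_train)

# Prediction: weights times the (train x test) kernel.
K_test = cosine_kernel(X_train, X_test)
y_pred = np.dot(weights, K_test)
print(y_pred.shape)  # (4,)
```

The kernel passed to `np.dot` has one row per training sample and one column per test sample, which is why the training features come first in the kernel call.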

## Computing and Plotting the Prediction

In [None]:
from matplotlib import pyplot as plt
y_pred = model.predict(frames[test_idx])
y_test = property_values[test_idx]
print(dict(
        mean_absolute_error=[np.mean(np.abs(y_pred - y_test))],
        root_mean_squared_error=[np.sqrt(np.mean((y_pred - y_test)**2))],
        R2=[1 - ((y_test - y_pred)**2).sum(dtype=np.float64) / ((y_test - np.average(y_test))**2).sum(dtype=np.float64)]
        ))
# Reference values on the x-axis, predictions on the y-axis, matching the axis labels.
plt.scatter(y_test, y_pred, s=3)
plt.axis('scaled')
plt.xlabel(property_to_ml)
plt.ylabel('Predicted '+property_to_ml)
plt.gca().set_aspect('equal')
plt.show()
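
The three error measures printed above can be collected in a small helper. This is a sketch, with `regression_metrics` a hypothetical name and the input values made up for illustration. It uses the usual definition $R^2 = 1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2$, where the mean is subtracted before squaring.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    # Hypothetical helper: mean absolute error, root mean squared error,
    # and coefficient of determination for two 1D arrays.
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    residual = y_true - y_pred
    ss_res = np.sum(residual**2)                    # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true))**2)  # total sum of squares
    return dict(
        mean_absolute_error=np.mean(np.abs(residual)),
        root_mean_squared_error=np.sqrt(np.mean(residual**2)),
        R2=1 - ss_res / ss_tot,
    )

# Made-up values: MAE is about 0.15, RMSE about 0.158, R2 about 0.98.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(metrics)
```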

## Predicting from Another Data Set

In [None]:
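# NOTE: this is the same file that was used for training above, so these
# frames can include structures the model was trained on.
filename='./data/learning/small_molecules-1000.xyz'
new_frames = read(filename,":400")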
new_property_values = np.array([cc.info[property_to_ml] for cc in new_frames])
y_pred = model.predict(new_frames)

print(dict(
        mean_absolute_error=[np.mean(np.abs(y_pred - new_property_values))],
        root_mean_squared_error=[np.sqrt(np.mean((y_pred - new_property_values)**2))],
        R2=[1 - ((new_property_values - y_pred)**2).sum(dtype=np.float64) / ((new_property_values - np.average(new_property_values))**2).sum(dtype=np.float64)]
        ))
# Reference values on the x-axis, predictions on the y-axis, matching the axis labels.
plt.scatter(new_property_values, y_pred, s=3)
plt.axis('scaled')
plt.xlabel(property_to_ml)
plt.ylabel('Predicted '+property_to_ml)
plt.gca().set_aspect('equal')
plt.show()
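
Since the file read in this cell is the same one the training frames were drawn from, the frames the model has already seen can be excluded before scoring it. The snippet below is a sketch of that bookkeeping with NumPy only; the training indices shown are hypothetical placeholders for the `train_idx` list computed earlier.

```python
import numpy as np

n_new = 400                  # number of frames read with ":400"
train_idx = [3, 17, 42, 99]  # hypothetical training indices

# Indices among the first n_new frames that were not used for training.
unseen_idx = np.setdiff1d(np.arange(n_new), train_idx)
print(len(unseen_idx))  # 396
```

With the real `train_idx`, the held-out frames could then be selected with `[new_frames[i] for i in unseen_idx]` before calling `model.predict`.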