# Building Local Gaussian Process Surrogate Models from Scratch

In this notebook, we outline how to implement local Gaussian process (LGP) surrogate models for complex experimental observables from scratch. The code can be adapted to any existing experimental data or training set. An LGP surrogate model is a fast and accurate machine learning model that approximates the outputs of an expensive function (e.g. a molecular simulation) with uncertainty quantification (UQ). UQ alleviates the 'black box' nature of machine learning and allows us to directly quantify parameter, model, and predictive uncertainties. These uncertainty estimates in turn enable on-the-fly learning, model validation, and parameter sensitivity analysis.

The code provided below is designed to be a gentle introduction to LGP surrogate models, with detailed instructions on how to implement LGP surrogates for any application. We will examine an LGP surrogate model of radial distribution functions in liquid Ne (described in the paper), but constructing your own surrogate model is as simple as uploading an experimental observable and a molecular simulation training set and running them through our prebuilt functions.

In [4]:
# Required Math Packages
import numpy as np
import torch
from scipy import interpolate
from scipy.optimize import minimize
import time
import pandas as pd

# Plotting Packages
import matplotlib.pyplot as plt  
from matplotlib.pyplot import figure

## Constructing a Training Dataset

The first step in building any machine learning surrogate model is generating a training set. The training set represents "observations" from your model at many combinations of parameters. Ideally, the training set will span the parameter space of interest and be dense enough that the machine learning model can learn the patterns and relationships in the data to a high level of accuracy.  

For a single observable, $S(r_k)$, taken from data with many independent variables (spectra, scattering patterns, etc.), the training set matrix for a local Gaussian process, $\mathbf{\hat{X'}}$, is an ($N$ $\times$ dim($\boldsymbol{\theta}$)) matrix,

\begin{equation}\label{eq:subsurrogate_training}
    \mathbf{\hat{X'}} = 
        \begin{bmatrix}
        \theta_{1,1} & \theta_{2,1} & \cdots\\
        \theta_{1,2} & \theta_{2,2} & \cdots\\
        \vdots & \vdots & \ddots\\
        \theta_{1,N} & \theta_{2,N} & \cdots\\
        \end{bmatrix}
\end{equation}

where $\theta_{i,j}$ is the value of the $i^{th}$ model parameter, $i = 1, ..., N_{params}$, for training sample $j = 1, ..., N$. This training set matrix is just row after row of training parameters appended together.

The training set observation vector, $\mathbf{\hat{Y'}}_k$, is an ($N$ $\times$ 1) column vector of the quantity of interest (QOI) evaluated at $r_k$ for each training sample,

\begin{equation}
    \mathbf{\hat{Y}'}_k = [S(\boldsymbol{\theta}_1,r_k), ..., S(\boldsymbol{\theta}_N,r_k)]^T
\end{equation}

where $k$ indexes the independent variables. In practice, we take the value of the observable at $r_k$ from each training simulation and stack these values into a vector in the same order as the rows of the training set matrix.

Below, we load the results of 480 molecular simulations of a ($\lambda$-6) Mie fluid, which has three parameters: $\lambda$, the repulsive exponent that describes the deformability of a particle in a collision; $\sigma$, the effective collision diameter of the particle; and $\epsilon$, the dispersive attraction or "well-depth" of the potential. We used molecular dynamics (MD) to calculate a radial distribution function (RDF) at 73 values of the independent variable $r$. For now, we model a single point in the RDF, which gives a training set matrix of size $(480 \times 3)$ and a training set observation vector of size $(480 \times 1)$. Later, we will build a surrogate model at each of the 73 points independently.

In [16]:
# Import the training data generated by 1_sample_gen.ipynb
rmin = 0.0118331810091873
rmax = 15.512161254882812
rnum = 73

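from pickle import load

# Parameter samples used to run the training simulations
with open('training_data/samples.p', 'rb') as f:
    input_dict = load(f)
xd = input_dict['xs']

# Raw RDFs computed from each training simulation
with open('training_data/training_rdf.p', 'rb') as f:
    input_dict = load(f)
r_raw = input_dict['r']
model_rdf_raw = input_dict['model_rdf']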

model_rdf = torch.zeros(len(model_rdf_raw),rnum)
r = np.linspace(rmin,rmax,rnum)

# We can interpolate this data to have the same resolution as the experiment.
for i in range(len(model_rdf_raw)):
    rdf_i = interpolate.splrep(r_raw, model_rdf_raw[i], s=0)
    model_rdf[i] = torch.from_numpy(interpolate.splev(r, rdf_i, der=0))
    
print('Training Set Matrix Shape:       '      , np.shape(xd))                 #total training set
print('Training Set Observations Shape: ', np.shape(np.array(model_rdf[:,0]))) #single observation 

Training Set Matrix Shape:        (480, 3)
Training Set Observations Shape:  (480,)
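
To make the shapes above concrete without the training files, here is a minimal sketch that assembles a training set matrix and a single observation vector from synthetic data. The arrays `X_train` and `rdf_train`, the random values, and the index `k = 10` are assumptions made for illustration only; they are not the notebook's data.

```python
import numpy as np

# Hypothetical stand-in data: 480 samples of 3 parameters and 73 RDF points.
rng = np.random.default_rng(0)
n_samples, n_params, n_r = 480, 3, 73
X_train = rng.uniform(size=(n_samples, n_params))  # row j holds the parameter set theta_j
rdf_train = rng.uniform(size=(n_samples, n_r))     # row j holds S(theta_j, r_1), ..., S(theta_j, r_73)

k = 10                                # index of the chosen independent variable r_k (arbitrary)
Y_k = rdf_train[:, k].reshape(-1, 1)  # (N x 1) observation vector, same row order as X_train

print(X_train.shape, Y_k.shape)
```

The only requirement is that row $j$ of the matrix and entry $j$ of the vector come from the same simulation.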


## Kernels

A kernel, or covariance function, completely specifies a Gaussian process. The kernel is highly specific to a given problem and therefore requires careful consideration before computing Gaussian process expectations. In general, the kernel can enforce function properties like continuity and differentiability, general behavior (periodicity, positivity), and symmetry. 

In this code, we use the squared-exponential kernel because it enforces continuity and infinite differentiability with only two hyperparameters. The squared-exponential kernel has the general form,

$$
    K(x, x') = \sigma^2 \exp \bigg(-\frac{(x - x')^2}{2\ell^2}\bigg)
$$

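where $\sigma^2$ is a scaling factor that represents the variance of the function and $\ell$ determines the length scale of fluctuations in the function. In the code below, `width` plays the role of $\sigma$ and `l` holds one length scale per input dimension.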

There are many resources available to help choose the correct kernel for a given application. We recommend the Kernel Cookbook https://www.cs.toronto.edu/~duvenaud/cookbook/ or the textbook Gaussian Processes for Machine Learning by Rasmussen and Williams (2006).

In [None]:
def se_kernel(x1, x2, l, width):
    """
    Computes the squared exponential kernel between the tensors x1 and x2 with hyper-parameters l and width.
    N and M correspond to the number of samples and D corresponds to the number of dimensions of the input function.
    
    Parameters
    ----------
    x1: Tensor [N,D]
        Feature vector for N samples with D dimensions each
    
    x2: Tensor [M,D]
        Feature vector for M samples with D dimensions each
        
    l: Tensor [D]
        Lengthscale hyper parameter (one value per input dimension)
        
    width: Float
        Width hyper parameter
        
    Returns
    ----------
    
    K: Tensor [N, M]
        Kernel matrix
        
    """
    K = width**2 * torch.exp(-(torch.cdist(x1/l,x2/l,p=2)**2)/2)
    return K
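
As a quick sanity check on the kernel formula, the sketch below evaluates a NumPy analog of `se_kernel` on three hand-picked points. The function name `se_kernel_np`, the points, and the hyperparameter values are assumptions chosen so the result can be verified by hand; this is not the notebook's torch implementation.

```python
import numpy as np

def se_kernel_np(x1, x2, l, width):
    """NumPy analog of se_kernel: x1 is [N, D], x2 is [M, D], l is [D]."""
    a = x1 / l
    b = x2 / l
    # Squared Euclidean distance between every scaled row of x1 and every scaled row of x2.
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return width**2 * np.exp(-0.5 * sq_dist)

x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])  # three 2-D points (illustrative)
l = np.array([1.0, 2.0])                            # one length scale per dimension
K = se_kernel_np(x, x, l, width=1.5)

print(np.allclose(K, K.T))  # kernel matrix of a point set with itself is symmetric
print(np.diag(K))           # diagonal equals width**2
```

After scaling by `l`, the second and third points each sit at unit distance from the first, so `K[0, 1]` and `K[0, 2]` both equal `width**2 * exp(-1/2)`.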

## The Local Gaussian Process Surrogate Model

The time complexity of the training-kernel matrix inversion and the matrix product can be substantially reduced by splitting a standard Gaussian process into a set of Gaussian processes, one for each independent variable of the target quantity of interest (QOI). Under this construction, an individual $GP_k$ is trained to map a set of model parameters to the QOI at a single independent variable,

\begin{equation}
    \mathbb{E}[GP_k] : \boldsymbol{\theta} \mapsto S(r_k)
\end{equation}

where $\mathbf{r}$ is no longer an input parameter. The LGP surrogate model prediction for the observable at $r_k$, $S_{loc}^*(r_k)$, at a new set of parameters, $\boldsymbol{\theta}^*$, is just the expectation of the $k^{th}$ Gaussian process given the training set data,

\begin{equation}\label{eq:subsurrogate}
    S_{loc}^*(r_k) = \mathbb{E}[\textit{GP}_k(\boldsymbol{\theta}^*)] = \mathbf{K}_{\boldsymbol{\theta}^*,\mathbf{\hat{X'}}} [\mathbf{K}_{\mathbf{\hat{X'}}, \mathbf{\hat{X'}}} + \sigma_{noise}^2 \mathbf{I}]^{-1} \mathbf{\hat{Y'}}_k
\end{equation}

In the following function, we take in variables created in the training set and the hyperparameters of the kernel to create a function that calculates this expectation for a single $k$. Later, we will need to loop this over the entire QOI.

In [17]:
def subset_surrogate(Xi, Xd, l, width, y, KddInv):
    """
    Computes the local Gaussian process mean estimate of the observable S(r_k) given a set of model parameters. 
    
    Parameters
    ----------
    
    Xi: Tensor [M,D]
        Feature vector for M samples with D dimensions each. This corresponds to the points you wish to infer.
    
    Xd: Tensor [N,D]
        Feature vector for N samples with D dimensions each. This corresponds to the points you trained on.
        
    l: Tensor [D]
        Lengthscale hyper parameter.
        
    width: Float
        Width hyper parameter
        
    y: Tensor [N,η]
        Training set observations at η independent variables (η = 1 for a single r_k).
        
    KddInv: Tensor [N,N]
        This is the inverted kernel matrix of the training set Xd
        
    Returns
    -------
    
    μ: Tensor [η,M]
        The mean estimate for each S(r_k) at each parameter set. The S(r_k)'s are organized in vertical lines where
        the column dimension indexes the parameter sets. 
        
    """
    Kid = se_kernel(Xi, Xd, l, width)
    # The offset of 1 acts as a constant prior mean: it is subtracted from the
    # observations before applying the GP weights and added back afterwards.
    return 1 + ((Kid @ KddInv) @ (y-1)).T
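
The sketch below illustrates how one inverted training kernel can serve every $GP_k$, and that looping the expectation over $k$ gives the same result as a single matrix product over all independent variables. It is a NumPy analog run on synthetic data. The helper `se_kernel_np`, the sizes, the hyperparameter values, and `sigma_noise` are all assumptions for illustration, not tuned values from the notebook.

```python
import numpy as np

def se_kernel_np(x1, x2, l, width):
    """NumPy analog of se_kernel: x1 is [N, D], x2 is [M, D], l is [D]."""
    a, b = x1 / l, x2 / l
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return width**2 * np.exp(-0.5 * sq_dist)

rng = np.random.default_rng(1)
N, D, n_r = 40, 3, 5                       # hypothetical sizes
Xd = rng.uniform(size=(N, D))              # training parameters
# Synthetic observations [N, n_r] that fluctuate around 1, standing in for an RDF.
Y = 1.0 + np.sin(Xd.sum(axis=1, keepdims=True) + np.arange(n_r))
l, width, sigma_noise = np.full(D, 0.5), 1.0, 1e-2  # assumed hyperparameters

# Training kernel with the noise term, inverted once and shared by every GP_k.
Kdd = se_kernel_np(Xd, Xd, l, width) + sigma_noise**2 * np.eye(N)
KddInv = np.linalg.inv(Kdd)

Xi = rng.uniform(size=(4, D))              # new parameter sets theta*
Kid = se_kernel_np(Xi, Xd, l, width)

# Loop form: one expectation per r_k, using the same offset of 1 as subset_surrogate.
mu_loop = np.stack([1.0 + Kid @ KddInv @ (Y[:, k] - 1.0) for k in range(n_r)])
# Vectorized form: all r_k in one matrix product.
mu_vec = 1.0 + (Kid @ KddInv @ (Y - 1.0)).T

print(mu_vec.shape)
print(np.allclose(mu_loop, mu_vec, atol=1e-6))
```

Because the kernel depends only on the parameters and not on $r_k$, the inversion cost is paid once rather than once per independent variable. In practice, a linear solve such as `np.linalg.solve` is often preferred over an explicit inverse for numerical stability.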

## Hyperparameter Optimization

## Validation with a Test Set

## Concluding Remarks