### In this tutorial, we will:

1. Load a small dataset of **DFT-calculated C structures**.
2. Compute **SOAP descriptors** to represent atomic environments.
3. Write a simple **GAP** model ourselves.
4. Train a **GAP** model to predict energies.
5. Train a **GAP** model with the `gap_fit` code.
6. Wrap the model in an **ASE calculator**.


### Requirements:
- `ase`
- `quippy`
- `matplotlib`
- `numpy`
- `gap_fit`
- `scikit-learn` (used below for the metrics, PCA and randomized SVD)


# Local GPR with SOAP Descriptors

In this notebook, we build a Gaussian Process Regression (GPR) model to predict atomic energies from SOAP (Smooth Overlap of Atomic Positions) descriptors.
We assume a local energy model:


$$E = \sum_{i=1}^{N} E_i(\mathbf{p}_i)$$

where:

- $E_i$ is the local energy of atom $i$,
- $\mathbf{p}_i$ is the SOAP vector of atom $i$.

The total energy of a test structure is then predicted as

$$E_{\text{total}}^{\text{test}} = \sum_{i \in \text{test}} \sum_{j \in \text{train}} \alpha_j \cdot k(\mathbf{p}_i, \mathbf{p}_j)$$

where:

- $k(\mathbf{p}_i, \mathbf{p}_j)$ is the SOAP kernel,
- $\alpha_j$ is the coefficient from GPR training.
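
To make the double sum concrete, here is a minimal sketch on toy data. The arrays `p_train`, `p_test` and `alpha` are random stand-ins (assumptions for illustration only, not real SOAP vectors or fitted coefficients); the point is that the explicit double sum and the matrix form give the same number.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumed shapes and random values, not real SOAP data)
p_train = rng.normal(size=(5, 3))   # 5 training environments, 3 features each
p_test = rng.normal(size=(2, 3))    # a test structure with 2 atoms
alpha = rng.normal(size=5)          # pretend these came from GPR training

def k(p_i, p_j, zeta=4):
    """Dot-product kernel raised to the power zeta."""
    return np.dot(p_i, p_j) ** zeta

# Explicit double sum: E = sum_i sum_j alpha_j * k(p_i, p_j)
E_loop = sum(alpha[j] * k(p_test[i], p_train[j])
             for i in range(len(p_test))
             for j in range(len(p_train)))

# The same quantity in matrix form: local energies first, then their sum
K_test = (p_test @ p_train.T) ** 4
E_matrix = np.sum(K_test @ alpha)

print(E_loop, E_matrix)
```

The matrix form is the one we will use later, since it avoids Python loops over atoms.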


### We use QUIP to compute SOAP vectors and NumPy for GPR training

#### Imports & Parameters

In [None]:
from quippy.descriptors import Descriptor 
from ase.io import read, write
import numpy as np
import matplotlib.pyplot as plt
import os
import sys
import random
#import nglview as nv

In [None]:
# Descriptor string 
soap_str = "soap cutoff=3 l_max=4 n_max=4 atom_sigma=0.5 n_Z=1 Z1=6"



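*cutoff=3* - radius (in Å) of the sphere that defines each atomic environment

*l_max=4* - the spherical harmonics expansion runs up to angular momentum 4

*n_max=4* - the radial expansion uses 4 radial basis functions

*atom_sigma=0.5* - width (in Å) of the Gaussian function placed on each atom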
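
As a rough guide to how `n_max` and `l_max` control the descriptor size, here is a minimal sketch that counts power-spectrum components. `power_spectrum_length` is a hypothetical helper written for this note, not a quippy function, and it assumes the usual counting of unique radial pairs times angular channels. The length actually returned by QUIP may differ (an implementation can append extra components), so the dimension printed later in the notebook is the one to trust.

```python
def power_spectrum_length(n_max, l_max, n_species=1):
    """Count unique (n, n') radial pairs times the (l_max + 1) angular channels.

    Hypothetical helper for illustration only; it is not part of quippy.
    """
    n_radial = n_max * n_species               # radial channels over all species
    n_pairs = n_radial * (n_radial + 1) // 2   # unordered pairs, including n == n'
    return n_pairs * (l_max + 1)

n_features = power_spectrum_length(n_max=4, l_max=4)
print("Power-spectrum components for n_max=4, l_max=4:", n_features)
```

Raising `n_max` increases the size quadratically, while `l_max` increases it linearly.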

In [None]:
# Initialize SOAP descriptor
soap_descriptor = Descriptor(soap_str)

#### Load the structures using ASE

In [None]:
db=read('bulk_cryst.xyz', ':')

In [None]:
print("The database has:", len(db), " carbon structures")

<img src="figures/bulk_cryst_2.png" width="300"/> <img src="figures/bulk_cryst_0.png" width="300"/> <img src="figures/bulk_cryst_1.png" width="300"/> <img src="figures/bulk_cryst_4.png" width="300"/>

In [None]:
def split_xyz(input_file, test_size=0.2, seed=42):
    
    # Make sure the input file exists before trying to read it
    if not os.path.exists(input_file):
        raise FileNotFoundError(f"Input file '{input_file}' not found in the current directory.")

    # Load the structures
    structures = read(input_file, index=":")
    print(f"Loaded {len(structures)} structures from {input_file}")

    # Randomly shuffle the structures
    random.seed(seed)
    random.shuffle(structures)

    # Split into train and test sets
    split_idx = int(len(structures) * (1 - test_size))
    train_structures = structures[:split_idx]
    test_structures = structures[split_idx:]

    # Derive output file names
    input_name = os.path.basename(input_file).split('.')[0]  # Remove file extension
    train_file = os.path.join(f"{input_name}_train.xyz")
    test_file = os.path.join(f"{input_name}_test.xyz")

    # Write to output files
    write(train_file, train_structures)
    write(test_file, test_structures)

    print(f"Split complete: {len(train_structures)} training structures, {len(test_structures)} test structures.")
    print(f"Saved to:\n  - Training set: {train_file}\n  - Test set: {test_file}")



In [None]:
split_xyz('bulk_cryst.xyz', 0.65)

In [None]:
structures=read('bulk_cryst_train.xyz', ':')
print(f"Loaded {len(structures)} C structures.")

#### Compute SOAP descriptors and Energy per atom

#### We only know the total energy of each structure. However, to learn the fitting coefficients we need to decompose it into atomic contributions. What if we simply assume that the energy can be divided by the number of atoms in the structure?
We assign per-atom energies by dividing the total energy of the structure equally among its atoms (a simple assumption). Each atom then gets its own SOAP descriptor.
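
A minimal sketch of this equal partition, using a made-up total energy and atom count (both are assumptions for illustration): every atom receives the same target, whatever its environment, and the targets sum back to the total.

```python
import numpy as np

E_total_toy = -36.4   # hypothetical total energy in eV
n_atoms_toy = 4       # hypothetical number of atoms in the structure

# Every atom gets the same share of the total energy
e_atoms_toy = np.full(n_atoms_toy, E_total_toy / n_atoms_toy)

print(e_atoms_toy)        # four identical values
print(e_atoms_toy.sum())  # recovers the total energy
```

This is exact only when all atoms in a structure are equivalent, which is a reasonable approximation for simple crystals but not for disordered structures.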



In [None]:
X_all=[] # List of arrays
e_all=[] # List of atomic energies

for atoms in structures:
    E_total = atoms.get_potential_energy()
    
    # Get per-atom SOAP (unnormalized)
    desc = soap_descriptor.calc(atoms, descriptor_only=True)
    X_atoms = desc['data']
    # Assign per-atom energies equally (can be improved!)
    E_per_atom = E_total / len(atoms)
    e_atoms = np.full(len(atoms), E_per_atom)

    X_all.append(X_atoms)
    e_all.append(e_atoms)


In [None]:
print( "Number of SOAP vectors:", sum([x.shape[0] for x in X_all]))
print( "SOAP vector dimensions:", X_all[0].shape[1])

In [None]:
# Stack all training data
X_train = np.vstack(X_all)   # shape (N_atoms_total, D)
e_train = np.concatenate(e_all) 
print("X_train shape:", X_train.shape)
print("e_train shape:", e_train.shape)

### Define SOAP Kernel and Train GAP

### We need to compute the SOAP kernel between each pair of training descriptors 

#### The resulting matrix K is a kernel matrix, where each entry K[i, j] measures similarity between atomic environments i and j.
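
A small sketch of what such a matrix looks like, using random unit vectors as stand-ins for SOAP descriptors (an assumption for illustration). For unit-length descriptors and a dot-product kernel raised to `zeta`, the matrix is symmetric, each environment has similarity 1 with itself, and no entry exceeds 1.

```python
import numpy as np

rng = np.random.default_rng(1)

# Random non-negative vectors, normalized to unit length (toy stand-ins)
X_toy = rng.random(size=(4, 6))
X_toy /= np.linalg.norm(X_toy, axis=1, keepdims=True)

zeta_toy = 4
K_toy = (X_toy @ X_toy.T) ** zeta_toy

print(np.round(K_toy, 3))
```

Larger values of `zeta` push the off-diagonal entries towards zero, so the kernel becomes more sensitive to small differences between environments.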

In [None]:
# SOAP kernel exponent and regularization (try different regularization, for instance, 1e-5)
zeta = 4
sigma_n = 1e-1

In [None]:
def soap_kernel_matrix(X1, X2, zeta=4):
    K = np.dot(X1, X2.T)
    return K ** zeta


In [None]:
# Compute SOAP kernel
K = soap_kernel_matrix(X_train, X_train, zeta=zeta)

# Add regularization
K += sigma_n**2 * np.eye(len(X_train))



#### Visualize the Kernel Matrix

In [None]:
plt.figure(figsize=(6, 5))
plt.imshow(K[:20, :20], cmap='viridis')
plt.title("SOAP Kernel Matrix (first 20 atoms)")
plt.colorbar()
plt.tight_layout()
plt.show()


The fitting coefficients can then be obtained by matrix inversion:
$$
\boldsymbol{\alpha} = (K + \sigma_n^2 I)^{-1} \mathbf{y}
$$
- $\boldsymbol{\alpha}$ is the weight vector (length = number of training data points)
- $K$ is the kernel matrix, with entries $K_{ij} = k(\mathbf{p}_i, \mathbf{p}_j)$
- $\sigma_n^2$ is the regularization parameter
- $I$ is the identity matrix (same shape as $K$)
- $\mathbf{y}$ is the vector of training targets (here, atomic energies)

#### Instead of directly computing the inverse, we use Cholesky decomposition, which expresses the matrix as a product of a lower-triangular matrix $L$ and its transpose $L^\top$:



**Cholesky decomposition**:

$$
K + \sigma_n^2 I = L L^\top
$$

We then solve the system in two steps:

$$
L \mathbf{z} = \mathbf{y}
$$

Then:

$$
L^\top \boldsymbol{\alpha} = \mathbf{z}
$$
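
The two-step solve can be checked on a small synthetic system. This is a sketch with random data (the matrix `K_small` and targets `y_small` are made up for illustration); it confirms that the Cholesky route gives the same coefficients as solving the regularized system directly.

```python
import numpy as np

rng = np.random.default_rng(2)

# Build a small symmetric positive semi-definite "kernel" matrix
A = rng.normal(size=(6, 8))
K_small = A @ A.T
y_small = rng.normal(size=6)
sigma_small = 0.1

K_reg = K_small + sigma_small**2 * np.eye(6)

# Step 0: factorize K + sigma_n^2 I = L L^T
L_small = np.linalg.cholesky(K_reg)
# Step 1: solve L z = y
z_small = np.linalg.solve(L_small, y_small)
# Step 2: solve L^T alpha = z
alpha_chol = np.linalg.solve(L_small.T, z_small)

# Reference: solve the full system in one go
alpha_direct = np.linalg.solve(K_reg, y_small)
print(np.max(np.abs(alpha_chol - alpha_direct)))
```

Note that `np.linalg.solve` does not exploit the triangular structure of `L`; a dedicated triangular solver would be faster for large matrices, but the result is the same.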


In [None]:
# K already contains the sigma_n**2 term added above
L = np.linalg.cholesky(K)
z = np.linalg.solve(L, e_train)  # first step: solve L z = y

alpha = np.linalg.solve(L.T, z)  # second step: solve L^T alpha = z; this is our model!


In [None]:
model1=alpha.copy() #store the model
print("Alpha is a vector with the same length as our energy vector:", alpha.shape)

**Now that we have our $\alpha$, we can predict the energy of a system with our freshly fitted model**

In [None]:
def predict_structure_energy(atoms, descriptor, X_train_all, alpha, zeta=4):
    """Predict total energy for a new structure using our model."""
    desc = descriptor.calc(atoms,properties=['energy'], descriptor_only=True)
  
    X_test = desc['data']  
    
    # Kernel between test atoms and training atoms
    K_test = soap_kernel_matrix(X_test, X_train_all, zeta=zeta) 
    # Predict local energies and sum
    local_energies = np.dot(K_test, alpha) 
    total_energy = np.sum(local_energies)
    return total_energy

#### Predict the energy of a new structure

In [None]:
test_structures = read("bulk_cryst_test.xyz", index=":100")

In [None]:
E_preds = [predict_structure_energy(a, soap_descriptor, X_train, alpha) for a in test_structures]
E_pred_atom = [E_preds[i]/len(test_structures[i]) for i in range(0, len(test_structures))]
print(f"We predicted {len(E_preds)} energies of the test set using our GAP model")

In [None]:
E_dft = [a.get_potential_energy()/len(a) for a in test_structures]

In [None]:
from sklearn.metrics import r2_score, mean_squared_error
def parity_plot(y_true, y_pred, title="Parity Plot", xlabel="DFT Energy (eV)", ylabel="Predicted Energy (eV)"):
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)
    # Compute metrics
    r2 = r2_score(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    # Plot
    plt.figure(figsize=(6, 6))
    plt.scatter(y_true, y_pred, s=10, alpha=0.7, label='Data')
    plt.plot([y_true.min(), y_true.max()], [y_true.min(), y_true.max()], 'k--', lw=2, label='Ideal')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.grid(True, linestyle='--', alpha=0.3)
    plt.axis('equal')
    plt.legend()
    plt.text(0.05, 0.95, f"$R^2$ = {r2:.3f}\nRMSE = {rmse:.3f} eV",
             transform=plt.gca().transAxes,
             verticalalignment='top',
             bbox=dict(boxstyle="round", facecolor='white', alpha=0.8))
    plt.tight_layout()
    plt.show()

In [None]:
parity_plot(E_dft, E_pred_atom)

### Not bad! We got a pretty good fit considering the amount of data and our assumptions.

## Try to change the regularization $\sigma$ and refit the model. Will the RMSE change?  
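
To build some intuition first, here is a minimal sketch on synthetic data (random unit vectors and random targets, not the carbon dataset). It scans three values of the regularization and records the RMSE on the training points themselves. A larger $\sigma_n$ lets the model deviate more from the training targets, so the training RMSE grows; the test RMSE need not follow the same trend.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic descriptors (unit vectors) and targets, for illustration only
X_syn = rng.normal(size=(30, 5))
X_syn /= np.linalg.norm(X_syn, axis=1, keepdims=True)
y_syn = rng.normal(size=30)
K_syn = (X_syn @ X_syn.T) ** 4

sigmas = [1e-3, 1e-1, 1.0]
train_rmse = []
for s in sigmas:
    # Fit with this regularization, then predict the training points
    a = np.linalg.solve(K_syn + s**2 * np.eye(30), y_syn)
    residual = K_syn @ a - y_syn
    train_rmse.append(np.sqrt(np.mean(residual**2)))

for s, r in zip(sigmas, train_rmse):
    print(f"sigma_n = {s:g}: training RMSE = {r:.2e}")
```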

Let's wrap the process into a function

In [None]:
def train_gap(X_train, e_train, zeta, sigma_n):
    """
    Parameters:
    ----------
    X_train : array-like, shape (n_samples, ...)
        Training data (e.g., list of SOAP descriptors or atomic environments).
    
    e_train : array-like, shape (n_samples,)
        Target values (e.g., training energies or forces).
    
    zeta : float
        SOAP kernel hyperparameter (e.g., power of the dot product).
    
    sigma_n : float
        Noise level (regularization term); sigma_n**2 is added to the diagonal of K.
    
    Returns:
    -------
    alpha : ndarray, shape (n_samples,)
        Coefficients for prediction: used in K(X_test, X_train) @ alpha.
    
    """
    
    K = soap_kernel_matrix(X_train, X_train, zeta=zeta)
    
    K += sigma_n**2 * np.eye(len(X_train))
    
    L = np.linalg.cholesky(K)
    
    z = np.linalg.solve(L, e_train)
    alpha = np.linalg.solve(L.T, z)
    
    return alpha

In [None]:
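# Train a new model using the train_gap function
sigma_n = 1e-5  # change the regularization (1e-5 is the value suggested earlier; try others too)
new_model = train_gap(X_train, e_train, zeta, sigma_n)  # obtain new fitting coefficients alpha



In [None]:
# Use predict_structure_energy(atoms, soap_descriptor, X_train, alpha) and divide by the
# number of atoms, so that the predictions are comparable with the per-atom E_dft
E_preds = [predict_structure_energy(a, soap_descriptor, X_train, new_model) / len(a) for a in test_structures]
print(f"We predicted {len(E_preds)} energies of the test set using our GAP model")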

In [None]:
parity_plot(E_dft, E_preds)

## Test the model on amorphous carbon structures 
Until now, we have used a test set composed of structures similar to the training set — often with nearly identical atomic environments. This makes it easier for the model to interpolate and achieve low RMSE. Can the model accurately predict energies for systems that are structurally different, such as amorphous carbon or surfaces?

In [None]:
test_amorphous=read('a_C_db.xyz', ':100')

In [None]:
E_preds = [predict_structure_energy(a, soap_descriptor, X_train, model1)/len(a) for a in test_amorphous]
print(f"We predicted {len(E_preds)} energies of the test set using our GPR model")

In [None]:
E_dft = [a.get_potential_energy()/len(a) for a in test_amorphous]

In [None]:
parity_plot(E_dft, E_preds)

Let's have a look at what the structures in this database look like. They are clearly very different from what we trained our model on!

<img src="figures/bulk_amo_2.png" width="300"/> <img src="figures/bulk_amo_4.png" width="300"/> <img src="figures/bulk_cryst_1.png" width="300"/> <img src="figures/bulk_cryst_4.png" width="300"/>

## Fit GAP with the amorphous carbon structures
Let's quickly refit the model on this data.

In [None]:
split_xyz('a_C_db.xyz', 0.75)

In [None]:
# split_xyz('a_C_db.xyz', 0.25)
structures_amo = read('a_C_db_train.xyz', ':')
structures_test = read('a_C_db_test.xyz', ':')
X_all=[] # List of arrays
e_all=[] # List of atomic energies
zeta=4
sigma_n=1e-1
for atoms in structures_amo:
    E_total = atoms.get_potential_energy()
    
    # Get per-atom SOAP (unnormalized)
    desc = soap_descriptor.calc(atoms, descriptor_only=True)
    X_atoms=desc['data']
    # Assign per-atom energies equally (can be improved!)
    E_per_atom = E_total / len(atoms)
    e_atoms = np.full(len(atoms), E_per_atom)

    X_all.append(X_atoms)
    e_all.append(e_atoms)
X_train = np.vstack(X_all)   # shape (N_atoms_total, D)
e_train = np.concatenate(e_all) 
model_amo = train_gap(X_train, e_train, zeta, sigma_n)
E_preds = [predict_structure_energy(a, soap_descriptor, X_train, model_amo)/len(a) for a in structures_test]
print(f"We predicted {len(E_preds)} energies of the test set using our GPR model")
E_dft = [a.get_potential_energy()/len(a) for a in structures_test]
parity_plot(E_dft, E_preds)

### It seems that our assumption on decomposition of total energies was too simplified.  
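
One way to drop the equal-partition assumption is to fit directly to total energies, treating each total as a sum of unknown local energies. The sketch below illustrates the idea on synthetic data; it is not the `gap_fit` implementation, and all names (`S`, `K_struct`, `sigma_E`, the random totals) are assumptions made for this example. The matrix `S` sums atoms into structures, so the covariance between two total energies is `S K S^T`.

```python
import numpy as np

rng = np.random.default_rng(4)
n_struct, n_per, n_feat, zeta_toy = 6, 3, 6, 4

# Synthetic unit-length descriptors for all atoms of all structures
X_atoms = rng.normal(size=(n_struct * n_per, n_feat))
X_atoms /= np.linalg.norm(X_atoms, axis=1, keepdims=True)
K_atoms = (X_atoms @ X_atoms.T) ** zeta_toy

# S[a, i] = 1 if atom i belongs to structure a, else 0
S = np.kron(np.eye(n_struct), np.ones((1, n_per)))

E_tot = rng.normal(size=n_struct)   # hypothetical total energies
sigma_E = 1e-3                      # assumed noise on total energies

# Kernel between structures: sum of atomic kernels over both atom sets
K_struct = S @ K_atoms @ S.T
c = np.linalg.solve(K_struct + sigma_E**2 * np.eye(n_struct), E_tot)

# One coefficient per training atom, as in the local model
alpha_atoms = S.T @ c

# Local energies are now free to differ within a structure
e_local = K_atoms @ alpha_atoms
E_fit = S @ e_local
print(np.round(e_local[:n_per], 3), "sum to", round(float(E_fit[0]), 3))
```

Only the sums of local energies are constrained by the data, so the model decides how to distribute each total among the atoms based on the similarity of their environments.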

# Gaussian Approximation Potential (GAP)

## Sparse GPR with SOAP Descriptors using gap_fit and quippy

In GAP, we don't use all atomic environments to define the regression basis. Instead, we select a subset of atomic environments — **the sparse set** — to serve as representative points. But why Sparse GPR?

- **Computational Efficiency**

Full GPR scales like $O(N^3)$, with $N$ the number of data points.

By selecting $M \ll N$ sparse points, GAP reduces the scaling to $O(NM^2)$, where $M$ is the number of sparse environments.

This makes training and prediction feasible for thousands of atoms.

- **Avoiding Redundancy**

Many atomic environments are very similar — using all of them would just increase the model size without adding new information.

The sparse set acts like a dictionary of distinct environments.

- **Improving Generalization**

A sparse set made of diverse, informative environments allows the model to generalize better to unseen structures.

If chosen well (e.g., via CUR decomposition or clustering), the model:

- avoids overfitting,

- focuses on physically meaningful variations
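
The idea can be sketched with a simple sparse GPR on synthetic data. This follows a standard subset-of-regressors form, $\boldsymbol{\alpha}_M = (K_{MM} + \sigma_n^{-2} K_{MN} K_{NM})^{-1} \sigma_n^{-2} K_{MN} \mathbf{y}$, and is an illustration under stated assumptions rather than what `gap_fit` does internally. The function `sparse_gpr_fit`, the `jitter` value and the random data are all hypothetical.

```python
import numpy as np

def kernel(A, B, zeta=4):
    """Dot-product kernel raised to the power zeta."""
    return (A @ B.T) ** zeta

def sparse_gpr_fit(X, y, sparse_idx, sigma_n, jitter=1e-10):
    """Fit M coefficients using only the sparse points as basis (hypothetical helper)."""
    X_M = X[sparse_idx]
    K_MM = kernel(X_M, X_M)   # (M, M): sparse vs sparse
    K_MN = kernel(X_M, X)     # (M, N): sparse vs all data
    # Only an M x M system has to be solved; building it costs O(N M^2)
    A = K_MM + K_MN @ K_MN.T / sigma_n**2 + jitter * np.eye(len(sparse_idx))
    return np.linalg.solve(A, K_MN @ y / sigma_n**2)

rng = np.random.default_rng(5)
X_syn = rng.normal(size=(40, 6))
X_syn /= np.linalg.norm(X_syn, axis=1, keepdims=True)
y_syn = rng.normal(size=40)

# Pick M = 8 sparse points out of N = 40 at random
idx_M = rng.choice(40, size=8, replace=False)
alpha_M = sparse_gpr_fit(X_syn, y_syn, idx_M, sigma_n=0.5)

# Prediction needs only the kernel against the sparse points
y_pred = kernel(X_syn, X_syn[idx_M]) @ alpha_M
print(alpha_M.shape, y_pred.shape)
```

When every data point is used as a sparse point, this expression reduces to the full GPR solution $(K + \sigma_n^2 I)^{-1} \mathbf{y}$, which is a useful sanity check.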


In [None]:
from ase.io import read
from quippy.descriptors import Descriptor
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt


all_structures = read("a_C_db_test.xyz", ":")

# Extract all atoms and calculate SOAP descriptors
soap_str = "soap cutoff=3.0 l_max=4 n_max=4 atom_sigma=0.5 n_species=1 species_Z=6"  
soap = Descriptor(soap_str)

soap_vectors = []
for at in all_structures:
    desc = soap.calc(at)
    soap_vectors.append(desc['data'])

soap_all = np.vstack(soap_vectors) 


In [None]:
# Simple random selection
n_sparse = 500
np.random.seed(42)
sparse_indices = np.random.choice(len(soap_all), size=n_sparse, replace=False)


In [None]:
pca = PCA(n_components=2)
soap_pca = pca.fit_transform(soap_all)

# Mark sparse points
sparse_pca = soap_pca[sparse_indices]


In [None]:
plt.figure(figsize=(8, 6))
plt.scatter(soap_pca[:, 0], soap_pca[:, 1], s=10, alpha=0.3, label='All Environments')
plt.scatter(sparse_pca[:, 0], sparse_pca[:, 1], s=30, c='red', label='Sparse Set', edgecolors='k')
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.title("SOAP Descriptor Space with random selection of atomic environments")
plt.legend()
plt.tight_layout()
plt.show()


### Leverage Score Selection from SOAP Vectors

We need to find the main patterns in the data that capture most of the variation.
- **X matrix**: rows = atomic environments, columns = SOAP features.

- **SVD decomposes** the matrix $X$ into three parts:
  $$X = U \Sigma V^\top$$
  - $U$: how each row relates to the main directions.
  - $\Sigma$: importance of each direction (singular values).
  - $V^\top$: main directions expressed as combinations of features.

- We keep only the top $k$ directions (20 in the code below) to reduce the data size while preserving the main structure.

- **Leverage scores** measure how strongly each row contributes to these main directions; high scores indicate important or unique environments.

- This helps select a smaller, representative subset of environments (the sparse points) for efficient modeling and analysis.
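
A toy example shows how leverage scores single out unusual rows. The data here are synthetic (small random noise plus one hand-made outlier, both assumptions for illustration), and a plain `np.linalg.svd` is used because the matrix is tiny.

```python
import numpy as np

rng = np.random.default_rng(6)

# 50 nearly identical "environments" plus one very different row
X_toy = rng.normal(scale=0.1, size=(50, 4))
X_toy[7] = [5.0, -5.0, 5.0, -5.0]

# Center the columns, then decompose
X_centered = X_toy - X_toy.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Leverage score of each row: squared norm of its first k entries in U
k = 2
leverage = np.sum(U[:, :k] ** 2, axis=1)

print("Row with the highest leverage:", np.argmax(leverage))
print("Sum of leverage scores:", leverage.sum())
```

Because the columns of $U$ are orthonormal, the scores always sum to $k$; a single row taking a large share of that budget is a sign of a distinct environment.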


In [None]:
import numpy as np
from sklearn.utils.extmath import randomized_svd

# Center the SOAP matrix by subtracting the mean of each feature (column-wise).
X = soap_all - np.mean(soap_all, axis=0)

# Apply randomized SVD
U, S, VT = randomized_svd(X, n_components=20, random_state=42)

# Compute leverage scores, rows importance
leverage_scores = np.sum(U**2, axis=1)  # shape: (N_atoms,)

# Select top-k rows with highest leverage scores
n_sparse = 500
sparse_indices = np.argsort(leverage_scores)[-n_sparse:]

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Project to 2D
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
sparse_pca = X_pca[sparse_indices]

# Plot
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], s=10, alpha=0.2, label='All environments')
plt.scatter(sparse_pca[:, 0], sparse_pca[:, 1], c='red', s=30, edgecolors='k', label='CUR-selected sparse set')
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.title("SVD Selection of Sparse Environments")
plt.legend()
plt.tight_layout()
plt.show()


## Use gap_fit code to fit GAP 

In [None]:
import os
new_path = "/home/jovyan/shared/installations/bin"

# Append it to the PATH environment variable if it's not already there
if new_path not in os.environ["PATH"]:
    os.environ["PATH"] += os.pathsep + new_path

# Confirm the path was added
print(os.environ["PATH"])


In [13]:
! gap_fit \
  energy_parameter_name=free_energy \
  force_parameter_name=DUMMY \
  virial_parameter_name=DUMMY \
  at_file=a_C_db_train.xyz \
  gap="SOAP cutoff=3.0 l_max=4 n_max=4 atom_sigma=0.5 zeta=4 delta=1.0  n_species=1 species_Z=6\
  n_sparse=500 sparse_method=cur_points covariance_type=dot_product" \
  default_sigma={0.001 0.01 0.0 0.0} \
  e0=0.0 \
  do_copy_at_file=F \
  gp_file=C_GAP.xml

libAtoms::Hello World: 2025-08-07 21:05:37
libAtoms::Hello World: git version  https://github.com/libAtoms/QUIP,v0.9.14-dirty
libAtoms::Hello World: QUIP_ARCH    linux_x86_64_gfortran_openmp
libAtoms::Hello World: compiled on  Jun 15 2023 at 19:28:03
libAtoms::Hello World: OpenMP parallelisation with 1 threads
libAtoms::Hello World: Random Seed = 75937840
libAtoms::Hello World: global verbosity = 0

Calls to system_timer will do nothing by default



config_file =
atoms_filename = //MANDATORY//
at_file = a_C_db_train.xyz
gap = "SOAP cutoff=3.0 l_max=4 n_max=4 atom_sigma=0.5 zeta=4 delta=1.0  n_species=1 species_Z=6   n_sparse=500 sparse_method=cur_points covariance_type=dot_product"
e0 = 0.0
local_property0 = 0.0
e0_offset = 0.0
e0_method = isolated
default_kernel_regularisation = //MANDATORY//
default_sigma = "0.001 0.01 0.0 0.0"
default_kernel_regularisation_local_property = 0.001
default_local_property_sigma = 0.001
sparse_jitter = 1.0e-10
hessian_displacement = 1.0e-2
hessian_delta

In [None]:
import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"

In [None]:
import ase.io
from ase.optimize.bfgs import BFGS
from quippy.potential import Potential
import numpy as np
import matplotlib.pyplot as plt


In [None]:
atoms = ase.io.read("a_C_db_test.xyz", ':')
E_dft = []
E_dft = [a.get_potential_energy()/len(a) for a in atoms]

In [14]:
E_pred = []
calc = Potential(param_filename="C_GAP.xml")
atoms = ase.io.read("a_C_db_test.xyz", ':')
for atom in atoms:
    atom.calc = calc
    e_bulk = atom.get_potential_energy()/len(atom)
    E_pred.append(e_bulk)
    # print("Final energy:", e_bulk)


In [None]:
parity_plot(E_dft, E_pred)