# Fit G4MP2 Models
Fit models that use the FCHL representation to predict the G4MP2 energy of molecules. We try two approaches: a model that predicts the G4MP2 atomization energy directly, and one that predicts the difference between the B3LYP and G4MP2 energies (i.e., $\Delta$-learning).

In [1]:
%matplotlib inline
from matplotlib import pyplot as plt
from sklearn.metrics import mean_absolute_error
from ase.units import Hartree, eV
from time import perf_counter
from tqdm import tqdm
import pickle as pkl
import pandas as pd
import numpy as np
import gzip
import json

## Load in the Training and Test Data
Load the training and test sets, each complete with the representations (the `rep` column).

In [2]:
train_data = pd.read_pickle('train_data.pkl.gz')

In [3]:
test_data = pd.read_pickle('test_data.pkl.gz')

## Load in the Model
Load the model produced by the previous calculation, stored as a gzipped pickle in `fchl-model.pkl.gz`.

In [4]:
with gzip.open('fchl-model.pkl.gz', 'rb') as fp:
    model = pkl.load(fp)

## Train a Model on G4MP2 Atomization Energies
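Train only on the G4MP2 atomization energy, without using the B3LYP results. The model is refit on random subsets of increasing size, and each fit is scored on the full test set.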

In [5]:
results = []
for train_size in tqdm([10, 100, 1000, 10000]):
    # Get some training data
    train_subset = train_data.sample(train_size)
    
    # Train the model
    train_time = perf_counter()
    model.fit(train_subset['rep'].tolist(), train_subset['g4mp2_atomization'])
    train_time = perf_counter() - train_time

    # Predict the G4MP2 atomization energy for the test set
    test_time = perf_counter()
    pred_y = model.predict(test_data['rep'].tolist())
    test_time = perf_counter() - test_time
    
    results.append({
        'train_size': train_size,
        'mae': mean_absolute_error(test_data['g4mp2_atomization'], pred_y),
        'train_time': train_time, 
        'test_time': test_time,
    })

100%|██████████| 4/4 [36:44:03<00:00, 38866.20s/it]
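The loop above reuses one variable for both the start time and the elapsed time. A minimal sketch of an alternative is shown here: a small context manager that writes the elapsed wall-clock time straight into the result row. The helper name `timed` and the stand-in workload are hypothetical and are not part of the original notebook.

```python
from contextlib import contextmanager
from time import perf_counter


@contextmanager
def timed(record, key):
    """Store the wall-clock duration of the enclosed block in record[key]."""
    start = perf_counter()
    try:
        yield
    finally:
        record[key] = perf_counter() - start


# Illustrative use with a stand-in workload in place of model.fit / model.predict
row = {'train_size': 10}
with timed(row, 'train_time'):
    total = sum(i * i for i in range(10_000))
```

In the benchmark, the body of the `with` block would be the `model.fit(...)` or `model.predict(...)` call, and `row` would be the dictionary appended to `results`.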


In [6]:
results = pd.DataFrame(results)
results

| | mae | test_time | train_size | train_time |
|---|---|---|---|---|
| 0 | 0.041503 | 87.115176 | 10 | 0.149087 |
| 1 | 0.011979 | 668.050757 | 100 | 2.524154 |
| 2 | 0.002458 | 7586.430827 | 1000 | 279.609858 |
| 3 | 0.00082 | 89427.240106 | 10000 | 34191.970684 |
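To put the timings above in perspective, the sketch below estimates how the cost scales with training set size by computing the exponent $p$ in $t \propto n^p$ between consecutive rows. The numbers are copied from the table; the helper name `local_exponent` is introduced here only for illustration.

```python
from math import log

# Timings in seconds, copied from the table above
train_size = [10, 100, 1000, 10000]
train_time = [0.149087, 2.524154, 279.609858, 34191.970684]
test_time = [87.115176, 668.050757, 7586.430827, 89427.240106]


def local_exponent(sizes, times, i):
    """Exponent p in time ~ size**p between entries i and i + 1."""
    return log(times[i + 1] / times[i]) / log(sizes[i + 1] / sizes[i])


train_p = [local_exponent(train_size, train_time, i) for i in range(3)]
test_p = [local_exponent(train_size, test_time, i) for i in range(3)]
print('train exponents:', [round(p, 2) for p in train_p])
print('test exponents: ', [round(p, 2) for p in test_p])
```

Between 1,000 and 10,000 training entries the exponent is close to 2 for training and close to 1 for prediction, so in this range each tenfold increase in training data costs roughly 100 times more fitting time and roughly 10 times more prediction time.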


In [7]:
with open('fchl.json', 'w') as fp:
    json.dump({
        'name': 'FCHL',
        'description': 'Model built using the FCHL representation, as implemented in QML, and KRR',
        'g4mp2_benchmark': results.to_dict('records')
    }, fp, indent=2)

In [8]:
with gzip.open('fchl_g4mp2.pkl.gz', 'wb') as fp:
    pkl.dump(model, fp)

## Train a $\Delta$-Learning Model on G4MP2 Atomization Energies
Train on the difference between the B3LYP and G4MP2 energies (the `delta` column) rather than on the G4MP2 energy itself.
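As a rough illustration of the idea, the sketch below shows how $\Delta$-learning targets are formed and how a predicted correction is turned back into a high-level energy. It assumes the convention $\Delta = E_{G4MP2} - E_{B3LYP}$; this notebook does not show how the `delta` column was computed, so the sign is an assumption, as are the function names and the toy numbers.

```python
def delta_targets(high_level, low_level):
    """Training targets for Delta-learning: high-level minus low-level energy."""
    return [h - l for h, l in zip(high_level, low_level)]


def recombine(low_level, predicted_delta):
    """Recover a high-level estimate by adding the predicted correction."""
    return [l + d for l, d in zip(low_level, predicted_delta)]


# Made-up numbers purely for illustration, not values from the dataset
b3lyp = [-1.20, -2.35, -0.80]
g4mp2 = [-1.25, -2.31, -0.83]

delta = delta_targets(g4mp2, b3lyp)
# A perfect Delta model would return `delta` itself; recombining then
# reproduces the high-level energies up to floating-point rounding.
estimate = recombine(b3lyp, delta)
```

Under this convention, the MAE reported for the $\Delta$ model is also the MAE of the recombined G4MP2 estimate, because the B3LYP term cancels in the error.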

In [9]:
delta_results = []
for train_size in tqdm([10, 100, 1000, 10000]):
    # Get some training data
    train_subset = train_data.sample(train_size)
    
    # Train the model
    train_time = perf_counter()
    model.fit(train_subset['rep'].tolist(), train_subset['delta'])
    train_time = perf_counter() - train_time

    # Predict the B3LYP-to-G4MP2 difference for the test set
    test_time = perf_counter()
    pred_y = model.predict(test_data['rep'].tolist())
    test_time = perf_counter() - test_time
    
    delta_results.append({
        'train_size': train_size,
        'mae': mean_absolute_error(test_data['delta'], pred_y),
        'train_time': train_time, 
        'test_time': test_time,
    })

100%|██████████| 4/4 [36:42:04<00:00, 38814.84s/it]


In [10]:
delta_results = pd.DataFrame(delta_results)
delta_results

| | mae | test_time | train_size | train_time |
|---|---|---|---|---|
| 0 | 0.004968 | 85.386432 | 10 | 0.130719 |
| 1 | 0.00146 | 643.299935 | 100 | 2.410578 |
| 2 | 0.000426 | 7783.987396 | 1000 | 296.710287 |
| 3 | 0.00019 | 89210.823406 | 10000 | 34101.14137 |


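*Finding*: The FCHL representation performs well even for small training sets. The MAE of the direct model falls from 0.0415 with 10 training entries to 0.00082 with 10,000, and the $\Delta$-learning model is roughly 4 to 8 times more accurate at every training set size, reaching 0.00019 with 10,000 entries. The cost is significant: with 10,000 training entries, fitting takes about 9.5 hours and evaluating the test set about 25 hours.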
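One way to quantify this finding is to fit the slope of the learning curve on a log-log scale and to compare the two models at each training set size. The sketch below does so with the MAE values copied from the two result tables in this notebook; the helper `loglog_slope` is a hypothetical name, and a four-point fit is only a rough summary.

```python
from math import log10

# MAE values copied from the two result tables in this notebook
train_size = [10, 100, 1000, 10000]
mae_direct = [0.041503, 0.011979, 0.002458, 0.00082]
mae_delta = [0.004968, 0.00146, 0.000426, 0.00019]


def loglog_slope(sizes, errors):
    """Least-squares slope of log10(error) against log10(size)."""
    x = [log10(s) for s in sizes]
    y = [log10(e) for e in errors]
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    num = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    den = sum((xi - x_mean) ** 2 for xi in x)
    return num / den


slope_direct = loglog_slope(train_size, mae_direct)
slope_delta = loglog_slope(train_size, mae_delta)
# Ratio of direct to Delta-learning MAE at each training set size
ratios = [d / dl for d, dl in zip(mae_direct, mae_delta)]
```

Both slopes are negative, meaning the error keeps falling as a power law of the training set size. The ratios shrink as the training set grows, which suggests the advantage of $\Delta$-learning is largest when data is scarce.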

In [11]:
with open('fchl-delta.json', 'w') as fp:
    json.dump({
        'name': r'FCHL $\Delta$-Learning',
        'description': r'$\Delta$-Learning model built using the FCHL representation, as implemented in QML, and KRR',
        'g4mp2_with_b3lyp_results': delta_results.to_dict('records')
    }, fp, indent=2)

In [12]:
with gzip.open('fchl_g4mp2-delta.pkl.gz', 'wb') as fp:
    pkl.dump(model, fp)