# Evaluate Outliers
In this notebook, we assess the performance of the best model for predicting solvation energy in water and examine the molecules with the largest errors.

In [1]:
%matplotlib inline
from matplotlib import pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from jcesr_ml.mpnn import set_custom_objects, run_model, GraphModel
from jcesr_ml.benchmark import load_benchmark_data
from keras.models import load_model
from tqdm import tqdm
import pickle as pkl
import pandas as pd
import numpy as np
import json
import os

Using TensorFlow backend.

## Load in Benchmark Data
We use the standard benchmark train and test sets.

In [2]:
train_data, test_data = load_benchmark_data()

## Get MPNN Results
Get the best MPNN for water that did not use DFT charges

In [3]:
mpnn_data = pd.read_json(os.path.join('..', 'mpnn', 'mpnn-results.json'))
mpnn_data = mpnn_data[~ mpnn_data.network.str.contains('dielectric-constant-charges')]

In [4]:
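# Exclude networks whose name contains "constant-charges" (substring match), then pick the lowest water MAE
candidates = mpnn_data[~mpnn_data.network.str.contains('constant-charges')].query('nodes==128 and batch_size==16384')
chosen_model = candidates.sort_values('mae_water', ascending=True).iloc[0]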

In [5]:
print(f'Our best-performing network is: {chosen_model["network"]}')

Our best-performing network is: single-task
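
The selection in cells 3 and 4 can be sketched on a toy table. The rows and MAE values below are made up for illustration; only the column names and the `single-task` network name come from the notebook. The sketch shows that `str.contains` does substring matching on the network name, whereas `in` and `not in` inside `DataFrame.query` are documented as shorthand for `isin` membership tests.

```python
import pandas as pd

# Toy stand-in for mpnn-results.json; all values are hypothetical
results = pd.DataFrame({
    'network': ['single-task', 'multi-task', 'dielectric-constant-charges'],
    'nodes': [128, 128, 128],
    'batch_size': [16384, 16384, 16384],
    'mae_water': [0.5, 0.7, 0.3],
})

# Substring match on the name: drops every network that mentions the charges
no_charges = results[~results['network'].str.contains('constant-charges')]

# Restrict to one architecture and batch size, then take the lowest water MAE
best = (
    no_charges
    .query('nodes == 128 and batch_size == 16384')
    .sort_values('mae_water', ascending=True)
    .iloc[0]
)
print(best['network'])
```

In this toy table the charge-based network has the lowest MAE, so the result depends on the name filter being applied before sorting.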


In [6]:
set_custom_objects()

In [7]:
model = load_model(os.path.join('..', 'mpnn', chosen_model['path'], 'best_model.h5'))

In [8]:
with open(os.path.join('..', 'mpnn', 'networks', chosen_model['network'], 'converter.pkl'), 'rb') as fp:
    conv = pkl.load(fp)

In [9]:
%%time
test_data['mpnn_pred'] = run_model(model, conv, test_data['smiles_0'], n_jobs=1)

CPU times: user 55.7 s, sys: 51.5 s, total: 1min 47s
Wall time: 50.3 s


## Compute Distance to Training Set
We want to compute the distance of each point in the test set to the nearest entries in the training set.

In [10]:
rep_model = GraphModel(inputs=model.inputs, outputs=model.get_layer('reduce_atom_to_mol_1').output)

Compute the representations for the test set and for a random sample of 32,768 training molecules

In [11]:
%%time
train_mols = run_model(rep_model, conv, train_data['smiles_0'].sample(32768), n_jobs=4)

CPU times: user 2min 16s, sys: 2min 5s, total: 4min 21s
Wall time: 1min 48s


In [12]:
%%time
test_mols = run_model(rep_model, conv, test_data['smiles_0'], n_jobs=4)

CPU times: user 53 s, sys: 49.9 s, total: 1min 42s
Wall time: 43.1 s


Build the nearest-neighbor search in a reduced space: min-max scaling followed by PCA to 128 components

In [13]:
dim_reduction = Pipeline([
    ('scale', MinMaxScaler()),
    ('pca', PCA(128)),
])
nn_computer = NearestNeighbors(n_jobs=-1).fit(dim_reduction.fit_transform(train_mols))
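
As a rough illustration of what the scaling and PCA steps compute, here is a minimal NumPy sketch on synthetic data. It is not scikit-learn's implementation; `fit_reducer` and `apply_reducer` are hypothetical helper names, and the array sizes are arbitrary. The point it shows is that the scaling ranges and principal directions are learned from the training set only and then reused for the test set.

```python
import numpy as np

def fit_reducer(train, n_components):
    """Learn min-max scaling and a PCA projection from the training matrix."""
    lo = train.min(axis=0)
    span = train.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid dividing by zero for constant features
    scaled = (train - lo) / span
    mean = scaled.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(scaled - mean, full_matrices=False)
    return lo, span, mean, vt[:n_components]

def apply_reducer(params, x):
    """Scale with the training ranges, center, and project onto the components."""
    lo, span, mean, components = params
    return ((x - lo) / span - mean) @ components.T

rng = np.random.default_rng(0)
train = rng.normal(size=(200, 16))   # synthetic "molecule representations"
test = rng.normal(size=(20, 16))

params = fit_reducer(train, n_components=4)
train_red = apply_reducer(params, train)
test_red = apply_reducer(params, test)
print(train_red.shape, test_red.shape)
```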

In [14]:
# Mean distance to the nearest training-set neighbors (NearestNeighbors defaults to n_neighbors=5)
test_data['train_dist'] = nn_computer.kneighbors(dim_reduction.transform(test_mols))[0].mean(axis=1)
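
The `train_dist` quantity is the mean distance from each test point to its nearest training points; with the `NearestNeighbors` defaults that means 5 neighbors and the Euclidean metric. A brute-force NumPy sketch of the same quantity on a tiny hand-made example follows. `mean_knn_distance` is a hypothetical helper, not part of scikit-learn or `jcesr_ml`.

```python
import numpy as np

def mean_knn_distance(train, queries, k=5):
    """Mean Euclidean distance from each query row to its k nearest training rows."""
    # Pairwise distances, shape (n_queries, n_train)
    diffs = queries[:, None, :] - train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Keep the k smallest distances per query and average them
    nearest = np.sort(dists, axis=1)[:, :k]
    return nearest.mean(axis=1)

train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [10.0]])
queries = np.array([[0.0], [20.0]])
result = mean_knn_distance(train, queries)
print(result)
```

A query that sits inside the training data gets a small value, while one far from every training point gets a large value, which is why this number works as a rough measure of how far a test molecule lies from the training set.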

## Sort Molecules by Error
Print out the molecules with the largest errors

In [15]:
test_data['error'] = (test_data['sol_water'] - test_data['mpnn_pred']).abs()

In [16]:
test_data.sort_values('error', ascending=False)[['smiles_0', 'sol_water', 'mpnn_pred', 'error', 'train_dist']].head(15)

| | smiles_0 | sol_water | mpnn_pred | error | train_dist |
|---|---|---|---|---|---|
| 97387 | C[NH2+]C(C(N)=O)C([O-])=O | -18.4847 | -38.409863 | 19.925163 | 3.635105 |
| 118917 | C[NH2+]CCC(=O)C([O-])=O | -60.636 | -45.644169 | 14.991831 | 3.281838 |
| 120055 | C[NH2+]CC[N-]C(=O)C#N | -49.4573 | -35.464993 | 13.992307 | 3.222445 |
| 74071 | CC1([NH3+])C(N)C1C([O-])=O | -50.455 | -42.094906 | 8.360094 | 2.765052 |
| 53724 | [O-]C(=O)C1C[NH2+]CCO1 | -52.8067 | -44.575325 | 8.231375 | 2.758512 |
| 6122 | CNC(=O)NC(C)=O | -8.0464 | -15.339846 | 7.293446 | 2.524287 |
| 101320 | CC1(C[NH3+])CC1C([O-])=O | -59.2984 | -52.320507 | 6.977893 | 2.70395 |
| 120404 | [NH3+]CCCCCC([O-])=O | -84.3461 | -77.878914 | 6.467186 | 3.95539 |
| 4856 | OC1=CNC=CC1=O | -18.6437 | -12.199085 | 6.444615 | 2.56301 |
| 120405 | [NH3+]CCOCCC([O-])=O | -54.0893 | -60.092457 | 6.003157 | 3.421061 |


Note that many of the top errors are for molecules with very large (strongly negative) solvation energies

In [17]:
# q=1 selects the 1st percentile: 99% of the test molecules have a higher sol_water
per99 = np.percentile(test_data["sol_water"], 1)
print(f'1st percentile of sol_water: {per99: .2f} kcal/mol')

1st percentile of sol_water: -17.80 kcal/mol


In [18]:
top_errors_are_outliers = (test_data.sort_values('error', ascending=False)['sol_water'].head(25) < per99).mean()
print(f'Fraction of top errors in 1%: {top_errors_are_outliers * 100: .1f}%')

Fraction of top errors in 1%:  60.0%
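
The two cells above combine a low-tail cutoff with a count of how many of the largest errors fall below it. A minimal sketch on synthetic arrays (not the notebook's data) shows the pattern; note that `np.percentile(x, 1)` returns the value below which roughly 1% of the data fall, i.e. the low tail of the distribution.

```python
import numpy as np

# Synthetic target values and absolute errors (not the notebook's data)
values = np.arange(1000, dtype=float)
errors = np.zeros(1000)
errors[:3] = [9.0, 8.0, 7.0]    # three large errors in the low tail
errors[500] = 6.0               # one large error in the bulk

# q=1 marks the value below which roughly 1% of the data fall
cutoff = np.percentile(values, 1)

top = np.argsort(errors)[::-1][:4]          # indices of the 4 largest errors
frac_in_tail = (values[top] < cutoff).mean()
print(cutoff, frac_in_tail)
```

If large errors were spread evenly over the data, only about 1% of them would be expected below the cutoff, so a much larger fraction points to the tail as a source of errors.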


Many of the molecules with large errors also contain charged groups

In [19]:
# Raw strings keep the regex escapes (\( and \+) from being read as Python string escapes
test_data[test_data['smiles_0'].str.contains(r'C\(=O\)N') & test_data['smiles_0'].str.contains(r'\+')][['smiles_0', 'sol_water', 'error']]

| | smiles_0 | sol_water | error |
|---|---|---|---|
| 49876 | [NH2+]=CNC1=CC(=O)N[CH-]1 | -32.914 | 0.806379 |
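
The filter in cell 19 is a plain text match on the SMILES string. `Series.str.contains` treats its pattern as a regular expression by default, so the parentheses and plus sign must be escaped. A small sketch with Python's `re` module, using three SMILES strings taken from the tables in this notebook, shows how the two patterns behave. This is string matching only, not a chemistry-aware substructure search.

```python
import re

smiles = [
    '[NH2+]=CNC1=CC(=O)N[CH-]1',   # contains C(=O)N and a '+'
    'CNC(=O)NC(C)=O',              # contains C(=O)N, no '+'
    '[NH3+]CCCCCC([O-])=O',        # contains '+', no C(=O)N substring
]

amide = re.compile(r'C\(=O\)N')   # escaped parentheses match literally
cation = re.compile(r'\+')        # escaped plus sign matches a literal '+'

hits = [s for s in smiles if amide.search(s) and cation.search(s)]
print(hits)
```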
