# Example 3

### xTB descriptors to predict solubility using GNN

This workflow includes:

i) RDKit conformer sampling \
ii) xTB property calculations to determine molecular and atomic properties \
iii) Generate a GNN model to predict solubility

#### Steps involved in this example

- Step 1: Import AQME and other Python modules, and the required CSV
- Step 2: Run CSEARCH (RDKit) on the CSV
- Step 3: Run xTB calculations using QDESCP
- Step 4: Create the CSV file with descriptors for the GNN model
- Step 5: Run gnn.py to get the results, or follow Steps 5a-5c in the notebook
- Step 5a: Load the solubility CSV file and split the data into training, validation and test sets
- Step 5b: Set up the GNN dataset and model
- Step 5c: Predict solubilities of the external test set using the GNN model

### Step 1: Import AQME and other Python modules, and the required CSV

In [6]:
import glob
from aqme.csearch import csearch
from aqme.qdescp import qdescp
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# gnn_functions is the local helper module that provides gnn_data and gnn_model
from gnn_functions import gnn_data, gnn_model
from sklearn.metrics import r2_score
import sklearn.metrics as metrics
import tensorflow as tf

file = 'solubility.csv'



### Step 2: Run CSEARCH (RDKit) on the CSV

In [6]:
csearch(program='rdkit',input=file,ewin_csearch=1)

AQME v 1.2 2022/07/21 10:56:00 
Citation: AQME v 1.2, Alegre-Requena, J. V.; Sowndarya, S.; Perez-Soto, R.; Alturaifi, T. M.; Paton, R. S., 2022. https://github.com/jvalegre/aqme



Starting CSEARCH with 1128 job(s) (SDF, XYZ, CSV, etc. files might contain multiple jobs/structures inside)



   ----- mol_1 -----


o  Applying filters to initial conformers


   ----- mol_2 -----


o  Applying filters to initial conformers


   ----- mol_3 -----


o  Applying filters to initial conformers


   ----- mol_4 -----


o  Applying filters to initial conformers


   ----- mol_5 -----


o  Applying filters to initial conformers


   ----- mol_6 -----


o  Applying filters to initial conformers


   ----- mol_7 -----


o  Applying filters to initial conformers


   ----- mol_8 -----


o  Applying filters to initial conformers


   ----- mol_9 -----


o  Applying filters to initial conformers


   ----- mol_10 -----


o  Applying filters to initial conformers


   ----- mol_11 -----


o  Applying 

<aqme.csearch.csearch at 0x2b9f9db714f0>
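
The `ewin_csearch=1` argument in the call above limits which conformers are kept. As a rough illustration, and assuming the window is given in kcal/mol relative to the most stable conformer, the filter can be sketched in plain Python. The function `filter_by_energy_window` and the example energies are hypothetical and are not part of AQME.

```python
def filter_by_energy_window(energies, ewin=1.0):
    """Return the indices of conformers within `ewin` of the lowest energy.

    Assumption: `energies` and `ewin` share the same unit (kcal/mol here).
    Illustrative sketch only, not AQME's implementation.
    """
    if not energies:
        return []
    e_min = min(energies)
    return [i for i, e in enumerate(energies) if e - e_min <= ewin]


# Hypothetical energies (kcal/mol) for four conformers of one molecule
example_energies = [0.0, 0.4, 1.3, 0.9]
kept = filter_by_energy_window(example_energies, ewin=1.0)
print(kept)  # [0, 1, 3]
```

A narrow window such as this one keeps the number of conformers passed to the xTB step small, at the cost of discarding higher-energy geometries.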

### Step 3: Run xTB calculations using QDESCP

In [3]:
sdf_rdkit_files = glob.glob('CSEARCH/rdkit/*.sdf')
qdescp(files=sdf_rdkit_files, boltz=True, program='xtb')

# Alternatively, if the number of molecules is large, run the Python script from a terminal:
# python run_qdescp.py
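
The `boltz=True` option, together with the `QDESCP/boltz` folder read in Step 4, suggests that conformer properties are combined using Boltzmann weights. Assuming that is the case, the weighting can be sketched as follows. The helper names, the temperature of 298.15 K and the example values are assumptions made for illustration and do not come from AQME.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)


def boltzmann_weights(energies_kcal, temperature=298.15):
    """Boltzmann populations from conformer energies given in kcal/mol."""
    e_min = min(energies_kcal)  # shift by the minimum for numerical stability
    factors = [math.exp(-(e - e_min) / (R_KCAL * temperature))
               for e in energies_kcal]
    total = sum(factors)
    return [f / total for f in factors]


def boltzmann_average(values, energies_kcal, temperature=298.15):
    """Population-weighted average of one property over all conformers."""
    weights = boltzmann_weights(energies_kcal, temperature)
    return sum(w * v for w, v in zip(weights, values))


# Hypothetical example: two conformers 0.5 kcal/mol apart
weights = boltzmann_weights([0.0, 0.5])
average = boltzmann_average([1.0, 3.0], [0.0, 0.5])
print([round(w, 3) for w in weights], round(average, 3))
```

The lower-energy conformer receives the larger weight, so the averaged property sits closer to its value.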

### Step 4: Create the CSV file with descriptors for the GNN model

In [2]:
data = pd.read_csv(file)
# Point each molecule to the JSON file that QDESCP wrote in the boltz folder
data['xtbjson'] = data['code_name'].apply(lambda x: 'QDESCP/boltz/{}_rdkit_boltz.json'.format(x))
data.to_csv('solubility_xtb.csv',index=False)
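
Before training, it can be useful to confirm that every path written to the `xtbjson` column points to an existing file, since a failed xTB job would leave a gap. A minimal sketch using only the standard library follows. `descriptor_path` and `missing_descriptor_files` are hypothetical helpers, and the path pattern is copied from the cell above.

```python
import os


def descriptor_path(code_name, folder='QDESCP/boltz'):
    """Build the JSON path with the same pattern used for the 'xtbjson' column."""
    return '{}/{}_rdkit_boltz.json'.format(folder, code_name)


def missing_descriptor_files(code_names, folder='QDESCP/boltz'):
    """Return the code names whose descriptor JSON file does not exist."""
    return [name for name in code_names
            if not os.path.isfile(descriptor_path(name, folder))]


# Hypothetical usage with the DataFrame from the cell above:
# missing = missing_descriptor_files(data['code_name'])
# print(f'{len(missing)} molecules have no xTB descriptor file')
```

Molecules reported as missing could then be dropped from the CSV or rerun through QDESCP before moving on to the GNN steps.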

### The following steps can be run with a script, as shown in Step 5, or directly in the notebook (Steps 5a-5c)

### Step 5: Run gnn.py to get the results (~3-4 h)

##### We used the Python script to obtain our result of R2 = 0.8

In [4]:
# python gnn.py

### Step 5a: Load the solubility CSV file and split the data into training, validation and test sets

In [10]:
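sol = pd.read_csv('solubility_xtb.csv')
# Shuffle the rows, then use the first 50 for validation, the next 50 for testing and the rest for training
valid, test, train = np.split(sol[['smiles','xtbjson']].sample(frac=1., random_state=41), [50, 100])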
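
With split points `[50, 100]`, `np.split` cuts the shuffled table into rows 0-49, rows 50-99 and everything from row 100 onwards, which is why the unpacking order is `valid, test, train`. The same slicing can be sketched in plain Python. `split_at` is a hypothetical helper that mirrors this behavior on a list, and the row count assumes one CSV row per CSEARCH job reported in Step 2.

```python
def split_at(rows, split_points):
    """Cut `rows` at the given indices, like np.split with a list of indices."""
    bounds = [0] + list(split_points) + [len(rows)]
    return [rows[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]


# Assumption: 1128 rows, matching the 1128 jobs reported by CSEARCH
rows = list(range(1128))
valid_rows, test_rows, train_rows = split_at(rows, [50, 100])
print(len(valid_rows), len(test_rows), len(train_rows))  # 50 50 1028
```

Under that assumption, fewer than 5% of the molecules are held out for validation and another 50 for testing, so the reported test metrics rest on a small sample.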

### Step 5b: Set up the GNN dataset and model

In [9]:
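# Build the datasets from the three splits (gnn_data comes from gnn_functions)
train_dataset, valid_dataset, test_dataset = gnn_data(valid, test, train, sol)
# Pull one batch to inspect the structure of the inputs and outputs
inputs, outputs = next(train_dataset.as_numpy_iterator())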

In [8]:
model = gnn_model()
# Train with the mean absolute error as the loss and Adam with a learning rate of 1e-3
model.compile(loss='mae', optimizer=tf.keras.optimizers.Adam(1E-3))
model.fit(train_dataset, validation_data=valid_dataset, epochs=500)

### Step 5c: Predict solubilities of the external test set using the GNN model

In [7]:
# Predict solubility of the external test set
test_predictions = model.predict(test_dataset)
test_db_values = sol.set_index('smiles').reindex(test.smiles)['measured log solubility in mols per litre'].values

# Plot the results
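# plt.subplots returns a (figure, axes) pair
fig, ax1 = plt.subplots(figsize=(3,3))

# Pass x and y as keywords; recent seaborn versions do not accept them positionally
sns.scatterplot(x=test_db_values,y=test_predictions.flatten(),s=30,marker='o',color='b',alpha=0.5,ax=ax1)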
ax1.set_xlabel(r'Measured',fontsize=10)
ax1.set_ylabel(r'Predicted',fontsize=10)
ax1.grid(linestyle='--', linewidth=1)

mae = metrics.mean_absolute_error(test_db_values,test_predictions.flatten())
r2 = metrics.r2_score(test_db_values,test_predictions.flatten())

plt.annotate(f"$R^2$ = {round(r2,1)} \nMAE = {round(mae,1)} ", xy=(-1.5, -5.9), fontsize=10)
plt.savefig('solubility-gnn.jpg',dpi=400)
plt.show()
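
For reference, the two metrics annotated on the plot follow the standard definitions: MAE is the mean of the absolute errors, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. The plain-Python sketch below uses made-up numbers, not results from this workflow, and the names `mae_manual` and `r2_manual` are hypothetical. For real data, the scikit-learn functions used above remain the more convenient choice.

```python
def mae_manual(y_true, y_pred):
    """Mean absolute error: average of |y_true - y_pred|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up log-solubility values for three molecules
measured = [-1.0, -2.0, -3.0]
predicted = [-1.0, -2.0, -4.0]
print(mae_manual(measured, predicted), r2_manual(measured, predicted))
```

Note that the plotting cell rounds both metrics to one decimal place, so small differences between models will not be visible in the annotation.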