# Example 3

### xTB descriptors to predict solubility using GNN

This workflow includes:

i) RDKit conformer sampling \
ii) xTB property calculations to determine molecular and atomic properties \
iii) Generate a GNN model to predict solubility

#### Steps involved in this example

- Step 1: Import AQME and other python modules, and the required CSV
- Step 2: Run CSEARCH (RDKit) on the CSV
- Step 3: Run xTB calculations using QDESCP
- Step 4: Create the CSV file with descriptors for the GNN model 
- Step 5: Run gnn.py to get results
  - Step 5a: Load the solubility CSV file and split the data into training, validation and test sets
  - Step 5b: Set up the GNN dataset and model
  - Step 5c: Predict solubilities of the external test set using the GNN model

###  Step 1: Import AQME and other python modules, and the required CSV

In [None]:
import glob
from aqme.csearch import csearch
from aqme.qdescp import qdescp
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from gnn_functions import *
from sklearn.metrics import r2_score
import sklearn.metrics as metrics
import tensorflow as tf

file = 'solubility.csv'

###  Step 2: Run CSEARCH (RDKit) on the CSV

In [None]:
csearch(program='rdkit',input=file,ewin_csearch=1)

### Step 3: Run xTB calculations using QDESCP

In [None]:
# Collect the conformer files generated by CSEARCH
sdf_rdkit_files = glob.glob('CSEARCH/*.sdf')
qdescp(files=sdf_rdkit_files, boltz=True, program='xtb')

# Alternatively, if the number of molecules is large, run the Python script from a terminal:
# python run_qdescp.py

### Step 4: Create the CSV file with descriptors for the GNN model

In [None]:
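data = pd.read_csv(file)
# Exclude the two molecules listed below from the dataset
drop = ['mol_556','mol_641']
data = data[~data.code_name.isin(drop)]
# Add the path to the Boltzmann-weighted xTB descriptor file (JSON) of each molecule
data['xtbjson'] = data['code_name'].apply(lambda x: 'QDESCP/boltz/{}_rdkit_boltz.json'.format(x))
data.to_csv('solubility_xtb.csv',index=False)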
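
Before training, it can help to confirm that every path written to the `xtbjson` column points to a file that QDESCP actually produced. The sketch below is one way to do that using only the standard library. The helper name `missing_descriptor_files` is hypothetical (it is not part of AQME), and the folder layout is assumed to match the `QDESCP/boltz/{code_name}_rdkit_boltz.json` pattern used above.

```python
from pathlib import Path


def missing_descriptor_files(code_names, folder="QDESCP/boltz"):
    """Return the code names whose expected JSON descriptor file is absent.

    Hypothetical helper (not part of AQME). It assumes the file naming
    pattern '{code_name}_rdkit_boltz.json' used in this example.
    """
    folder = Path(folder)
    return [
        name
        for name in code_names
        if not (folder / f"{name}_rdkit_boltz.json").is_file()
    ]
```

Calling `missing_descriptor_files(data['code_name'])` should return an empty list when every expected file is present; any names it returns identify molecules without a descriptor file.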

### The following steps can be run with a script, as shown, or directly in the notebook

###  Step 5: Run gnn.py to get results

##### We obtained our reported result (R2 = 0.8) by running the gnn.py script

In [None]:
#python gnn.py

###  Step 5a: Load the solubility CSV file and split the data into training, validation and test sets

In [None]:
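sol = pd.read_csv('solubility_xtb.csv')
# Shuffle the rows, then use the first 50 for validation, the next 50 for testing and the rest for training
valid, test, train = np.split(sol[['smiles','xtbjson']].sample(frac=1., random_state=41), [50, 100])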
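
To make the split sizes explicit: `np.split` with the indices `[50, 100]` cuts the shuffled rows at positions 50 and 100, giving 50 validation rows, 50 test rows and the remainder for training. The sketch below illustrates this on a stand-in array of row indices. The total of 1000 rows is an arbitrary assumption for illustration, not the size of the solubility dataset.

```python
import numpy as np

n_rows = 1000  # assumed size, for illustration only
rows = np.random.default_rng(41).permutation(n_rows)  # shuffled row indices

# Cut at positions 50 and 100: rows[0:50], rows[50:100], rows[100:]
valid_idx, test_idx, train_idx = np.split(rows, [50, 100])

print(len(valid_idx), len(test_idx), len(train_idx))  # prints: 50 50 900
```

Because the three pieces come from a single shuffled sequence, no row can end up in more than one set.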

###  Step 5b: Set up the GNN dataset and model

In [None]:
train_dataset, valid_dataset, test_dataset = gnn_data(valid, test, train, sol)
inputs, outputs = next(train_dataset.as_numpy_iterator())

In [None]:
model = gnn_model()
model.compile(loss='mae', optimizer=tf.keras.optimizers.Adam(1E-3))
model.fit(train_dataset, validation_data=valid_dataset, epochs=200)

### Step 5c: Predict solubilities of the external test set using the GNN model

In [None]:
# Predict solubility of the external test set
test_predictions = model.predict(test_dataset)
test_db_values = sol.set_index('smiles').reindex(test.smiles)['measured log solubility in mols per litre'].values

# Plot the results
fig, ax = plt.subplots(figsize=(4,4))

ax1 = sns.scatterplot(x=test_db_values,y=test_predictions.flatten(),s=30,marker='o',color='b',alpha=0.5)
ax1.set_xlabel(r'Measured',fontsize=10)
ax1.set_ylabel(r'Predicted',fontsize=10)

mae = metrics.mean_absolute_error(test_db_values,test_predictions.flatten())
r2 = metrics.r2_score(test_db_values,test_predictions.flatten())

plt.annotate(f"$R^2$ = {round(r2,1)} \nMAE = {round(mae,1)} ", xy=(-2.3, -5.9), fontsize=10)
plt.savefig('solubility-gnn.jpg',dpi=400)
plt.show()
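
For reference, the two metrics annotated on the plot can be written out directly. This minimal sketch computes the mean absolute error and the coefficient of determination with NumPy, using the standard definitions MAE = mean(|y - y_pred|) and R2 = 1 - SS_res / SS_tot. The function names `mae_manual` and `r2_manual` and the toy values are made up for illustration; they are not results from this workflow.

```python
import numpy as np


def mae_manual(y_true, y_pred):
    """Mean absolute error: average of |y_true - y_pred|."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs(y_true - y_pred))


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot


# Toy values, for illustration only
y_true = [-1.0, -2.0, -3.0, -4.0]
y_pred = [-1.5, -2.0, -2.5, -4.0]
toy_mae = mae_manual(y_true, y_pred)
toy_r2 = r2_manual(y_true, y_pred)
```

For these toy values the absolute errors are 0.5, 0, 0.5 and 0, so the MAE is 0.25, while SS_res = 0.5 and SS_tot = 5 give R2 = 0.9.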