# DeepChem

- [Installation](https://github.com/deepchem/deepchem#installation)
- [Tutorial](https://deepchem.readthedocs.io/en/latest/get_started/tutorials.html)
- [Sample Notebooks](https://github.com/deepchem/deepchem/tree/master/examples/tutorials) 

## Environment

- Python 3.7.11

- deepchem == 2.5.0

- TensorFlow == 2.6.0

- PyTorch == 1.9.0

- scikit-learn == 0.23.2

- DGL == 0.7.0

- RDKit == 2021.03.4

- numpy == 1.20.3

- pandas == 1.3.2

When training neural networks and other deep models, the dataset is usually divided into training, validation, and test sets. Cross-validation is often skipped, especially when the dataset is large, because it requires training the model several times.

In this notebook, however, we apply cross-validation: the dataset is divided into several (train, test) subsets. This way, we obtain a predicted value for every observation, each made by a model that did not see that observation during training.
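The idea of getting a prediction for every observation can be sketched with plain NumPy. This is a minimal illustration, not the notebook's `CV_graph_models` helper: the function names, the synthetic data, and the least-squares stand-in model are all assumptions made for the example.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def out_of_fold_predictions(X, y, k=5, seed=0):
    """Fit on k-1 folds, predict the held-out fold, and repeat for every fold."""
    y_pred = np.full(len(y), np.nan)
    folds = kfold_indices(len(y), k, seed)
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Stand-in model (assumption): ordinary least squares with an intercept.
        A_train = np.column_stack([X[train_idx], np.ones(len(train_idx))])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        A_test = np.column_stack([X[test_idx], np.ones(len(test_idx))])
        y_pred[test_idx] = A_test @ coef
    return y_pred

# Synthetic data for illustration only (not the ESOL dataset).
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

y_oof = out_of_fold_predictions(X, y, k=5)
print(np.isnan(y_oof).sum())  # 0: every observation received a prediction
```

Because the folds partition the shuffled indices, each observation is held out exactly once, so the out-of-fold vector has no gaps.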


**Hyperparameter tuning**

Method: grid search as implemented in DeepChem.

Hyperparameters are tuned once, before cross-validation, using the whole dataset. We don't repeat hyperparameter tuning in every fold of cross-validation.

20% of the data is held out as a validation set to evaluate model performance (R2 score). The best hyperparameters are those that give the highest R2 score on the validation set.
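The selection rule described above (train on 80%, score each candidate on the held-out 20%, keep the candidate with the highest R2) can be sketched as follows. This example does not use DeepChem's grid search: the toy polynomial ridge model, the search space, and every name in it are assumptions chosen only to keep the sketch self-contained.

```python
import itertools
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def poly_features(x, degree):
    """Columns x^0, x^1, ..., x^degree."""
    return np.column_stack([x ** d for d in range(degree + 1)])

def fit_ridge(x, y, degree, alpha):
    """Closed-form ridge regression on polynomial features (toy model)."""
    A = poly_features(x, degree)
    return np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]), A.T @ y)

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = x ** 2 - x + 0.1 * rng.normal(size=200)

# Hold out 20% of the data as the validation set.
perm = rng.permutation(len(x))
n_valid = int(0.2 * len(x))
valid_idx, train_idx = perm[:n_valid], perm[n_valid:]

# Hypothetical search space: every combination is trained and scored once.
search_space = {"degree": [1, 2, 3], "alpha": [1e-3, 1e-1, 10.0]}
results = {}
for degree, alpha in itertools.product(search_space["degree"], search_space["alpha"]):
    w = fit_ridge(x[train_idx], y[train_idx], degree, alpha)
    y_hat = poly_features(x[valid_idx], degree) @ w
    results[(degree, alpha)] = r2(y[valid_idx], y_hat)

# Best hyperparameters: highest R2 on the validation set.
best = max(results, key=results.get)
print(best, round(results[best], 3))
```

The number of models trained is the product of the grid sizes (9 here), which is why grid search becomes expensive when each model is a neural network trained for many epochs.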

In [1]:
import numpy as np
import pandas as pd

import deepchem as dc

from sklearn.metrics import r2_score

from tools.models_graph import CV_graph, CV_graph_models
from tools.models_graph import hyperparams_tuning_models
from tools.models_graph import standard_scaling

from tools.get_params import get_search_space, get_best_hyperparams_models

# Parameters, Settings

In [2]:
data_filepath = './data/ESOL_modified.csv'   # observation 934 ('C') was removed to avoid an error in the "MolGraphConvFeaturizer" featurizer

save_hyperparams_folder = './result/Hyperparameter'   # folder to save models during hyperparameter tuning
metric_hyperparams_filepath = './result/metric_hyperparams.json'   # filepath to save hyperparameter tuning results (metrics)
fig_save_folder = './result/figures'   # folder to save figures
metrics_filepath = './result/metrics.json'   # filepath to save CV results (metrics)

n_tasks = 1   # No. of tasks (No. of dependent variables)
nb_epoch = 100

CV_method = 'k-fold'   
k = 5   # value of k for k-fold cross validation

# Data

In [4]:
data = pd.read_csv(data_filepath)  
 
smiles = data['smiles'].values   # should be 1D

y = data['measured log solubility in mols per litre'].values.reshape(-1,1)   # can be 1D or 2D

# It is recommended to standardize y
y_ss = standard_scaling(y)   # 2D numpy array
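The implementation of `standard_scaling` in `tools.models_graph` is not shown here. A minimal sketch of the usual technique, assuming it is a plain z-score transform, is given below together with its inverse; `standard_scale`, `inverse_scale`, and the sample values are hypothetical and exist only for this example.

```python
import numpy as np

def standard_scale(y):
    """Return (scaled y, mean, std), with mean and std computed per column."""
    mean = y.mean(axis=0)
    std = y.std(axis=0)
    return (y - mean) / std, mean, std

def inverse_scale(y_scaled, mean, std):
    """Map scaled values back to the original units."""
    return y_scaled * std + mean

# Made-up target values for illustration, shaped (n_samples, 1) like y above.
y = np.array([-0.77, -3.30, -2.06, -7.87, -1.33]).reshape(-1, 1)

y_ss, mean, std = standard_scale(y)
y_back = inverse_scale(y_ss, mean, std)
print(y_ss.shape, np.allclose(y_back, y))  # (5, 1) True
```

Keeping the mean and standard deviation of the original targets is what makes it possible to report predictions in the original units after training on the scaled values.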

# Hyperparameter Tuning (Grid Search)

In [None]:
# Hyperparameter tuning is applied once, on the whole dataset (after dividing it into train and validation sets).
# It is not repeated in every fold of CV.
search_space_models = get_search_space(n_tasks)

metric_hyperparams, best_hyperparams_all = hyperparams_tuning_models(search_space_models, smiles, y_ss, 
                  metric_hyperparams_filepath, save_hyperparams_folder, nb_epoch=nb_epoch)

Hyperparameter tuning is time-consuming, so the best hyperparameters are saved in a JSON file. In later runs, they can be loaded from that file instead of repeating the tuning.
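Saving and reloading hyperparameters as JSON can be sketched with the standard library alone. The structure of the notebook's JSON file is not shown, so the dictionary layout, the model keys, and the hyperparameter names below are assumptions, as are the two helper functions.

```python
import json
import os
import tempfile

# Hypothetical layout: one entry per model, each mapping names to values.
best_hyperparams = {
    "model_a": {"learning_rate": 0.001, "batch_size": 32},
    "model_b": {"learning_rate": 0.0005, "dropout": 0.2},
}

def save_hyperparams(hyperparams, filepath):
    """Write the hyperparameter dictionary to a JSON file."""
    with open(filepath, "w") as f:
        json.dump(hyperparams, f, indent=2)

def load_hyperparams(filepath):
    """Read the hyperparameter dictionary back from a JSON file."""
    with open(filepath) as f:
        return json.load(f)

# A temporary directory keeps the example from touching real result files.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "best_hyperparams.json")
    save_hyperparams(best_hyperparams, path)
    loaded = load_hyperparams(path)

print(loaded == best_hyperparams)  # True
```

JSON only stores strings, numbers, booleans, lists, and dictionaries with string keys, so values such as tuples or NumPy scalars would need converting before saving.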

# CV

In [5]:
# Load hyperparameters
best_hyperparams_filepath = './result/best_hyperparams_ESOL_norm.json'   # filepath to load best hyperparameters
hyperparams_models = get_best_hyperparams_models(best_hyperparams_filepath)

### CV for all models

In [None]:
fig_name = 'LogS'
plot_title = 'LogS'
results_metrics, results_y = CV_graph_models(
    hyperparams_models, smiles, y_ss, nb_epoch, metrics_filepath,
    fig_save_folder, fig_name, plot_title, CV_method,
    show_plot=True, k=k, apply_inverse_scaling=True,
    y_train_original=y, get_uncertainty=True)