# TRAINING, OPTIMIZING, AND SELECTING THE DEEP LEARNING ALGORITHMS FOR FLAVOR PREDICTION WITH MOLECULAR GRAPHS

IMPORTANT NOTE! Enable a GPU in Google Colab to speed up this code: go to Edit > Notebook settings > T4 GPU.
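A quick way to confirm that the runtime actually sees a GPU is sketched below. It is only a heuristic and assumes an NVIDIA runtime where the `nvidia-smi` tool is on the PATH; the helper name `gpu_visible` is hypothetical and not part of this notebook.

```python
import shutil
import subprocess

def gpu_visible() -> bool:
    """Return True if `nvidia-smi -L` runs successfully and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # Tool not installed: most likely a CPU-only runtime
        return False
    try:
        result = subprocess.run([exe, "-L"], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout

print("GPU runtime detected" if gpu_visible() else "No GPU detected; training will run on the CPU")
```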

This script covers the training, hyperparameter optimization, and testing of the deep learning algorithms for flavor prediction. The data is split into training, validation, and test sets in a 70:10:20 ratio.
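To make the 70:10:20 proportions concrete, the sketch below cuts a shuffled list of sample indices into three parts. It is only an illustration: the real split is produced by a separate script, and the random shuffle, the seed, and the function name `split_indices` are assumptions.

```python
import random

def split_indices(n_samples, fractions=(0.7, 0.1, 0.2), seed=0):
    """Shuffle range(n_samples) and cut it into train/validation/test index lists."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(round(fractions[0] * n_samples))
    n_valid = int(round(fractions[1] * n_samples))
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # whatever remains goes to the test set
    return train, valid, test

train_idx, valid_idx, test_idx = split_indices(1000)
print(len(train_idx), len(valid_idx), len(test_idx))
```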

The selected algorithm was a Convolutional Graph Neural Network (GCNModel) from the Python library DeepChem.

Check out the DeepChem documentation for further information.

[https://deepchem.io/](https://deepchem.io/)



DeepChem is not a default library in Google Colab, so it must be installed. DeepChem also requires RDKit in order to work. Similarly, hyperopt, the library selected for optimizing the hyperparameters of the neural network, must be installed. It was selected because there is plenty of documentation on using it with DeepChem, which simplifies configuration and execution.

In [None]:
! pip install rdkit-pypi deepchem hyperopt

Collecting rdkit-pypi
  Downloading rdkit_pypi-2022.9.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29.4 MB)
Collecting deepchem
  Downloading deepchem-2.7.1-py3-none-any.whl (693 kB)
Collecting pytorch_lightning
  Downloading pytorch_lightning-2.0.7-py3-none-any.whl (724 kB)
Collecting lightning
  Downloading lightning-2.0.7-py3-none-any.whl (1.9 MB)
Collecting scipy<1.9 (from deepchem)
  Downloading scipy-1.8.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (42.2 MB)

In [None]:
import numpy as np
import pandas as pd
import deepchem as dc
from deepchem.models import GCNModel
from hyperopt import hp, fmin, tpe, Trials
from sklearn.metrics import confusion_matrix
import tempfile

DGL backend not selected or invalid.  Assuming PyTorch for now.


Setting the default backend to "pytorch". You can change it in the ~/.dgl/config.json file or export the DGLBACKEND environment variable.  Valid options are: pytorch, mxnet, tensorflow (all lowercase)


# 2. Training the Convolutional Graph Neural Network with the original data



This function takes a trained model, a dataset, a label, and a dictionary as input, and adds the metrics for that label to the dictionary. It must be defined before training so that performance can also be recorded during training.

In [None]:
def performance_metrics(model, dataset, label, metrics_dictionary):

  # Use the trained model to predict the class probabilities
  predictions = model.predict(dataset)
  true_labels = dataset.y

  # Calculate the recall
  recall = model.evaluate(dataset, [dc.metrics.Metric(dc.metrics.recall_score, mode='classification')])

  # Calculate the specificity from the confusion matrix at a 0.5 probability threshold
  threshold = 0.5
  binary_predictions = (predictions[:, 1] >= threshold).astype(int)
  tn, fp, fn, tp = confusion_matrix(true_labels, binary_predictions).ravel()
  specificity = tn / (tn + fp)

  # Calculate the ROC-AUC score
  roc_auc = model.evaluate(dataset, [dc.metrics.Metric(dc.metrics.roc_auc_score, mode='classification')])

  # Store the metrics in the metrics dictionary under the key ('CGNN', label)
  metrics_dictionary[('CGNN', label)] = {'Recall': recall['recall_score'], 'Specificity': specificity,
                                         'ROC Score': roc_auc['roc_auc_score']}
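
The three metrics stored by the function can be reproduced by hand on a small example. The sketch below is a plain-Python illustration, not the DeepChem or scikit-learn implementation; the function names and the toy labels and probabilities are made up for this purpose.

```python
def recall_specificity(y_true, y_prob, threshold=0.5):
    """Recall and specificity after thresholding the positive-class probabilities."""
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if truth == 1 and predicted == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return recall, specificity

def roc_auc(y_true, y_prob):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [p for t, p in zip(y_true, y_prob) if t == 1]
    negatives = [p for t, p in zip(y_true, y_prob) if t == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy data, made up for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.45, 0.40, 0.60, 0.10, 0.20, 0.30, 0.42, 0.05]

print(recall_specificity(y_true, y_prob))
print(roc_auc(y_true, y_prob))
```

Recall and specificity depend on the 0.5 threshold, whereas ROC-AUC depends only on how the samples are ranked, so the two kinds of metric can diverge on the same predictions.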

The selected model was a preconfigured Graph Convolutional Neural Network (GCNModel) from DeepChem. This model was selected because it is the one recommended for the featurizer used to produce the molecular graphs (MolGraphConvFeaturizer). Hyperparameter optimization during training was performed with the hyperopt library.

The data used to train the neural networks was partitioned 70:10:20 into train, validation, and test sets. These datasets were previously split using the script "FeaturizationSplitingResampling" and stored as joblib files, a convenient format for serializing graph data.
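
Because training loads one file per tag and label, it can help to verify that every expected file exists before starting a long run. The sketch below assumes the `<tag>_<label>.joblib` naming used in the training cell and that the files sit in the working directory; the helper names `expected_files` and `missing_files` are hypothetical.

```python
from pathlib import Path

# Labels and tags as they appear in the training cells of this notebook
LABELS = ['Bitter', 'Floral', 'Fruity', 'Off_flavor', 'Nutty', 'Sour', 'Sweet']
TAGS = ['train_data', 'valid_dataset', 'test_data', 'train_and_valid_data']

def expected_files(tags, labels=LABELS, directory='.'):
    """Build the '<tag>_<label>.joblib' paths that training will try to load."""
    return [Path(directory) / f'{tag}_{label}.joblib' for tag in tags for label in labels]

def missing_files(tags, labels=LABELS, directory='.'):
    """Return the expected dataset files that are not present on disk."""
    return [path for path in expected_files(tags, labels, directory) if not path.exists()]

missing = missing_files(TAGS)
print(f'{len(missing)} of {len(expected_files(TAGS))} dataset files are missing')
```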

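The function below takes the file tags of the different datasets (train, validation, test, and combined train-and-validation) as input. For each label, it tunes the model against the validation set, retrains it with the best hyperparameters on the combined train-and-validation set, and evaluates it on the test set.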

In [None]:
from hyperopt import space_eval

def training_GCNModel(tag_train, tag_val, tag_test, tag_train_and_val):

  labels = ['Bitter', 'Floral', 'Fruity', 'Off_flavor', 'Nutty', 'Sour', 'Sweet']
  models_CGNN = []

  evaluation_metrics_test = {}
  evaluation_metrics_valid = {}

  for label in labels:

    print(f'Training the Neural Network for {label}')

    # Load the train, validation, test, and combined train+validation datasets

    train_data = dc.utils.load_from_disk(f'{tag_train}_{label}.joblib')
    valid_dataset = dc.utils.load_from_disk(f'{tag_val}_{label}.joblib')
    test_data = dc.utils.load_from_disk(f'{tag_test}_{label}.joblib')
    train_and_valid_data = dc.utils.load_from_disk(f'{tag_train_and_val}_{label}.joblib')

    search_space = {'layer_sizes': hp.choice('layer_sizes', [[500], [1000], [2000]]),
                    'learning_rate': hp.uniform('learning_rate', low=0.0001, high=0.001)}

    metric = dc.metrics.Metric(dc.metrics.roc_auc_score)

    def fm(args):

      save_dir = tempfile.mkdtemp()

      # Build the model with the hyperparameters sampled for this trial
      model = GCNModel(mode='classification', n_tasks=1, dropout=0.4, **args)

      # Validation callback that saves the best checkpoint, i.e. the one with the maximum score.

      validation = dc.models.ValidationCallback(valid_dataset, 1000, [metric], save_dir=save_dir,
                                                save_on_minimum=False)
      model.fit(train_data, nb_epoch=25, callbacks=validation)

      # Restore the best checkpoint and return the negative of its validation score to be minimized.

      model.restore(model_dir=save_dir)
      valid_score = model.evaluate(valid_dataset, [metric])

      # This entry is overwritten on every trial, so it reflects the last trial evaluated
      performance_metrics(model, valid_dataset, label, evaluation_metrics_valid)

      return -1*valid_score['roc_auc_score']

    trials = Trials()

    best = fmin(fm, space=search_space, algo=tpe.suggest, max_evals=5, trials=trials)

    # fmin returns the index chosen for hp.choice; space_eval maps it back to the actual values
    best_hyperparameters = space_eval(search_space, best)

    best_model = GCNModel(mode='classification', n_tasks=1, dropout=0.4, **best_hyperparameters)

    print('Training the best estimator with the best hyperparameters\n')

    best_model.fit(train_and_valid_data, nb_epoch=25)

    models_CGNN.append(('CGNN', label, best_model))

    # Test the models

    performance_metrics(best_model, test_data, label, evaluation_metrics_test)

  return models_CGNN, evaluation_metrics_test, evaluation_metrics_valid
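
The objective returns the negative validation ROC-AUC because the optimizer minimizes its objective. The sketch below shows that convention in isolation. It swaps in a plain random search for hyperopt's TPE algorithm and a made-up scoring function for the real model training, so every name in it (`random_search`, `sample_params`, `toy_validation_score`) is hypothetical.

```python
import random

def random_search(objective, sample_params, max_evals=5, seed=0):
    """Minimize `objective` over randomly sampled parameter sets."""
    rng = random.Random(seed)
    best_params, best_loss = None, float('inf')
    for _ in range(max_evals):
        params = sample_params(rng)
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

def sample_params(rng):
    # Same learning-rate range as the search space used in this notebook
    return {'learning_rate': rng.uniform(0.0001, 0.001)}

def toy_validation_score(params):
    """Stand-in for a validation ROC-AUC; the shape of this curve is made up."""
    return 0.75 - 1e5 * (params['learning_rate'] - 0.0005) ** 2

def objective(params):
    # Negate the score so that minimizing the loss maximizes the score
    return -1 * toy_validation_score(params)

best_params, best_loss = random_search(objective, sample_params)
print(best_params, -best_loss)
```

The lowest loss corresponds to the highest validation score, which is why the best trial in the progress bar is reported as a negative number.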

In [None]:
model_original, metrics_test_original, metrics_valid_original = training_GCNModel('train_data', 'valid_dataset',
                                                                                  'test_data', 'train_and_valid_data')

Training the Neural Network for Bitter
  0%|          | 0/5 [00:00<?, ?trial/s, best loss=?]Step 1000 validation: roc_auc_score=0.701496
Step 2000 validation: roc_auc_score=0.714194
 20%|██        | 1/5 [03:53<15:32, 233.07s/trial, best loss: -0.7133275548951625]Step 1000 validation: roc_auc_score=0.645409
Step 2000 validation: roc_auc_score=0.719069
 40%|████      | 2/5 [07:49<11:44, 234.81s/trial, best loss: -0.7391286159859711]Step 1000 validation: roc_auc_score=0.721794
Step 2000 validation: roc_auc_score=0.682861
 60%|██████    | 3/5 [11:39<07:45, 232.62s/trial, best loss: -0.7391286159859711]Step 1000 validation: roc_auc_score=0.759539
Step 2000 validation: roc_auc_score=0.7567
 80%|████████  | 4/5 [15:26<03:50, 230.65s/trial, best loss: -0.7391286159859711]Step 1000 validation: roc_auc_score=0.708648
Step 2000 validation: roc_auc_score=0.666173
100%|██████████| 5/5 [19:13<00:00, 230.67s/trial, best loss: -0.7391286159859711]
Training the best estimator with the best hyperparameters

The metrics collected during training (on the validation set) and during testing were converted to DataFrames and then stored in Excel files.

In [None]:
metrics_test_df = pd.DataFrame.from_dict(metrics_test_original, orient='index')

print(metrics_test_df)

                   Recall  Specificity  ROC Score
CGNN Bitter      0.129676     0.994291   0.594583
     Floral      0.003540     0.991008   0.700110
     Fruity      0.004739     0.980543   0.707994
     Off_flavor  0.936765     0.216216   0.670555
     Nutty       0.000000     0.999580   0.621557
     Sour        0.000000     0.991535   0.575059
     Sweet       0.350000     0.948037   0.759007


In [None]:
metrics_val_df = pd.DataFrame.from_dict(metrics_valid_original, orient='index')

print(metrics_val_df)

                   Recall  Specificity  ROC Score
CGNN Bitter      0.375375     0.915305   0.735133
     Floral      0.000000     0.951788   0.680507
     Fruity      0.021978     0.920673   0.676449
     Off_flavor  0.037037     0.860700   0.607220
     Nutty       0.035714     0.969489   0.678599
     Sour        0.000000     1.000000   0.547278
     Sweet       0.503297     0.866516   0.805504


In [None]:
metrics_val_df.to_excel('Validation_metrics_original.xlsx')
metrics_test_df.to_excel('Test_metrics_original.xlsx')

# 3. Training the Convolutional Graph Neural Network with the data balanced with the transformer

The process described above was repeated with the balanced molecular graphs.

In [None]:
model_balanced, metrics_test_balanced, metrics_valid_balanced = training_GCNModel('balanced_train_data', 'balanced_valid_dataset',
                                                                                  'balanced_test_data', 'balanced_train_and_valid_data')

Training the Neural Network for Bitter
  0%|          | 0/5 [00:00<?, ?trial/s, best loss=?]Step 1000 validation: roc_auc_score=0.620289
Step 2000 validation: roc_auc_score=0.767146
 20%|██        | 1/5 [02:27<09:50, 147.67s/trial, best loss: -0.760091071755262]Step 1000 validation: roc_auc_score=0.697691
Step 2000 validation: roc_auc_score=0.719533
 40%|████      | 2/5 [04:52<07:18, 146.25s/trial, best loss: -0.760091071755262]Step 1000 validation: roc_auc_score=0.667387
Step 2000 validation: roc_auc_score=0.71601
 60%|██████    | 3/5 [07:18<04:51, 145.97s/trial, best loss: -0.760091071755262]Step 1000 validation: roc_auc_score=0.752097
Step 2000 validation: roc_auc_score=0.715556
 80%|████████  | 4/5 [09:43<02:25, 145.54s/trial, best loss: -0.761765480190443]Step 1000 validation: roc_auc_score=0.76262
Step 2000 validation: roc_auc_score=0.757026
100%|██████████| 5/5 [12:11<00:00, 146.38s/trial, best loss: -0.761765480190443]
Training the best estimator with the best hyperparameters



In [None]:
metrics_val_df_bal = pd.DataFrame.from_dict(metrics_valid_balanced, orient='index')

print(metrics_val_df_bal)

                   Recall  Specificity  ROC Score
CGNN Bitter      0.830330     0.453195   0.753726
     Floral      0.622642     0.583981   0.613002
     Fruity      0.703297     0.608173   0.692361
     Off_flavor  0.648148     0.618677   0.655887
     Nutty       0.500000     0.708619   0.622071
     Sour        0.357143     0.789434   0.670512
     Sweet       0.573626     0.837104   0.786867


In [None]:
metrics_test_df_bal = pd.DataFrame.from_dict(metrics_test_balanced, orient='index')

print(metrics_test_df_bal)

                   Recall  Specificity  ROC Score
CGNN Bitter      0.231920     0.931489   0.533864
     Floral      0.736283     0.593469   0.728220
     Fruity      1.000000     0.049858   0.659562
     Off_flavor  0.998529     0.046046   0.682708
     Nutty       0.993243     0.038203   0.629486
     Sour        0.962025     0.036168   0.563654
     Sweet       0.387500     0.938799   0.756332


In [None]:
metrics_val_df_bal.to_excel('Validation_metrics_balanced.xlsx')
metrics_test_df_bal.to_excel('Test_metrics_balanced.xlsx')