# TRAINING, OPTIMIZING, AND SELECTING THE MACHINE LEARNING ALGORITHMS FOR FLAVOR PREDICTION WITH RDKit DESCRIPTORS

This notebook covers training, hyperparameter optimization, and testing of the machine learning algorithms for flavor prediction. The data was previously split into training and test sets using an 80:20 partition.

This notebook contains the steps for training the Random Forest and K-Nearest Neighbors (KNN) classifiers. The same steps apply to training with both molecular representations (descriptors and fingerprints).

In [1]:

# Import the training data for molecular descriptors

import pandas as pd

RDKit_train_data = pd.read_excel('https://github.com/FabioHerrera97/FlavorMiner/raw/main/Data/RDKit_train.xlsx')

X_RDKit_train = RDKit_train_data.drop(['Bitter', 'Floral', 'Fruity', 'Off_flavor', 'Nutty', 'Sour', 'Sweet'], axis=1)

y_RDKit_train = RDKit_train_data [['Bitter', 'Floral', 'Fruity', 'Off_flavor', 'Nutty', 'Sour', 'Sweet']]



# 1. Training and optimizing the Random Forest and KNN algorithms

The algorithms trained in this notebook with molecular descriptors and fingerprints are Random Forest and K-Nearest Neighbors. These algorithms were chosen because they have been used previously for flavor prediction and offer several tools to interpret and explain the results, providing information beyond the predictions themselves. Additionally, hyperparameter optimization for these models is relatively fast. The library used for training and optimization is sklearn. The support vector machine must be trained separately, as it requires more computing power and a different hyperparameter optimization procedure.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

All models undergo hyperparameter optimization using the GridSearchCV function. 5-fold cross-validation is used to validate the model during the hyperparameter optimization. The metric used to select the best estimator is recall, because it measures how well the model identifies true positives, which belong to the minority class in this case.

In [None]:

# Function to train a binary classifier and return the trained model

def train_classifier(X, y, classifier, param_grid):

    # Perform hyperparameter optimization using GridSearchCV

    grid_search = GridSearchCV(classifier, param_grid, cv=5, scoring='recall')
    grid_search.fit(X, y)

    return grid_search.best_estimator_


trained_classifiers = []

# Hyperparameter grids for each classifier

rf_param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}

svm_param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf']
}

knn_param_grid = {
    'n_neighbors': [3, 5, 7],
    'weights': ['uniform', 'distance']
}
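
To give a sense of the search cost implied by the grids above, the following sketch counts the candidate combinations and the cross-validation fits that a grid search with `cv=5` would run. The helper `count_fits` is hypothetical and only illustrates the arithmetic; it assumes the seven flavor labels used in this notebook and does not count the final refit of the best estimator on the full training set.

```python
from itertools import product

# Grids copied from the cell above
rf_param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}
knn_param_grid = {
    'n_neighbors': [3, 5, 7],
    'weights': ['uniform', 'distance']
}

def count_fits(param_grid, cv=5):
    """Return (number of candidates, number of cross-validation fits)."""
    n_candidates = len(list(product(*param_grid.values())))
    return n_candidates, n_candidates * cv

rf_candidates, rf_fits = count_fits(rf_param_grid)
knn_candidates, knn_fits = count_fits(knn_param_grid)

# One grid search per classifier and per flavor label (assumed: 7 labels)
n_labels = 7
total_fits = n_labels * (rf_fits + knn_fits)

print(f"Random Forest: {rf_candidates} candidates, {rf_fits} fits per label")
print(f"KNN: {knn_candidates} candidates, {knn_fits} fits per label")
print(f"Total cross-validation fits: {total_fits}")
```

Most of the cost comes from the Random Forest grid, so reducing `n_estimators` or `max_depth` options is the first place to look if the search is too slow.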



In [None]:

for label in y_RDKit_train:
    print(f"Training classifiers for '{label}'")

    # Random Forest

    rf_classifier = RandomForestClassifier(random_state=42)
    rf_classifier = train_classifier(X_RDKit_train, y_RDKit_train[label], rf_classifier, rf_param_grid)
    trained_classifiers.append(("Random Forest", label, rf_classifier))

    # KNN

    knn_classifier = KNeighborsClassifier()
    knn_classifier = train_classifier(X_RDKit_train, y_RDKit_train[label], knn_classifier, knn_param_grid)
    trained_classifiers.append(("KNN", label, knn_classifier))


Training classifiers for 'Bitter'
Training classifiers for 'Floral'
Training classifiers for 'Fruity'
Training classifiers for 'Off_flavor'
Training classifiers for 'Nutty'
Training classifiers for 'Sour'
Training classifiers for 'Sweet'


In [None]:
import joblib
from google.colab import files

# save model with joblib
for classifier_type, label, classifier in trained_classifiers:
  filename = f'{label}_{classifier_type}.sav'
  joblib.dump(classifier, filename)
  files.download(filename)


# 2. Testing the algorithms trained with the RDKit molecular descriptors

The models trained previously were saved as .sav files. This allows the models to be reused without retraining. Similarly, multiple tests can be performed on these saved models.

In [None]:
import joblib

labels = ['Bitter', 'Floral', 'Fruity', 'Off_flavor', 'Nutty', 'Sour', 'Sweet']

models_RDKit_data = []

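# Classifier names must match the ones used in the file names when saving
classifier_types = ["Random Forest", "KNN"]

# Load every label/classifier combination (7 labels x 2 classifiers)
for label in labels:
  for classifier_type in classifier_types:

    loaded_model = joblib.load(f"{label}_{classifier_type}.sav")

    models_RDKit_data.append(loaded_model)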


After training, the best estimators are evaluated on the test set. The metrics used are recall, specificity, and the ROC AUC (roc_score). These metrics were selected because they are well suited to classifiers trained on imbalanced data. Recall measures the fraction of actual positives that are correctly identified (true positive rate), specificity measures the fraction of actual negatives that are correctly identified (true negative rate), and the ROC AUC summarizes the trade-off between the two across all decision thresholds.

These metrics are calculated on both the training data and the test data to identify additional pathologies in the models, such as overfitting.
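
As a small illustration of the three metrics, the sketch below computes them by hand on a toy imbalanced example (2 positives, 8 negatives). The data and the helper names `recall_specificity` and `roc_auc` are made up for illustration; the notebook itself relies on the sklearn implementations. The ROC AUC is computed here as the fraction of positive/negative pairs in which the positive receives the higher score.

```python
def recall_specificity(y_true, y_pred):
    """Recall = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: positives are the minority class
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05, 0.05]

# Hard predictions with the usual 0.5 threshold
y_pred = [1 if s >= 0.5 else 0 for s in scores]

recall, specificity = recall_specificity(y_true, y_pred)
auc = roc_auc(y_true, scores)

print(f"Recall: {recall}, Specificity: {specificity}, ROC AUC: {auc}")
```

In this toy case the ranking of the scores is good (high ROC AUC) while the recall at the default threshold is only 0.5, which is the same pattern to watch for in imbalanced flavor classes.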

In [2]:
import pandas as pd

''' Import the testing data for molecular descriptors'''

RDKit_test_data = pd.read_excel('https://github.com/FabioHerrera97/FlavorMiner/raw/main/Data/RDKit_test.xlsx')

X_RDKit_test = RDKit_test_data.drop(['Bitter', 'Floral', 'Fruity', 'Off_flavor', 'Nutty', 'Sour', 'Sweet'], axis=1)

y_RDKit_test = RDKit_test_data [['Bitter', 'Floral', 'Fruity', 'Off_flavor', 'Nutty', 'Sour', 'Sweet']]

In [None]:
from sklearn.metrics import recall_score, roc_auc_score, confusion_matrix

def evaluate_classifiers(trained_classifiers, X_train, y_train, X_test, y_test):
    evaluation_metrics_train = {}
    evaluation_metrics_test = {}

    for classifier_type, label, classifier in trained_classifiers:
        # Training evaluation: predict on the training set used to fit the model

        y_pred_train = classifier.predict(X_train)
        recall_train = recall_score(y_train[label], y_pred_train)
        tn_train, fp_train, fn_train, tp_train = confusion_matrix(y_train[label], y_pred_train).ravel()
        specificity_train = tn_train / (tn_train + fp_train)
        roc_score_train = roc_auc_score(y_train[label], classifier.predict_proba(X_train)[:, 1])

        evaluation_metrics_train[(classifier_type, label)] = {'Recall': recall_train,
                                                              'Specificity': specificity_train,
                                                              'ROC Score': roc_score_train}

        # Testing evaluation

        y_pred_test = classifier.predict(X_test)
        recall_test = recall_score(y_test[label], y_pred_test)
        tn_test, fp_test, fn_test, tp_test = confusion_matrix(y_test[label], y_pred_test).ravel()
        specificity_test = tn_test / (tn_test + fp_test)
        roc_score_test = roc_auc_score(y_test[label], classifier.predict_proba(X_test)[:, 1])

        evaluation_metrics_test[(classifier_type, label)] = {'Recall': recall_test,
                                                             'Specificity': specificity_test,
                                                             'ROC Score': roc_score_test}

    return evaluation_metrics_train, evaluation_metrics_test

In [None]:
train_metrics, test_metrics = evaluate_classifiers(trained_classifiers, X_RDKit_train, y_RDKit_train, X_RDKit_test, y_RDKit_test)

The evaluation shows generally good specificity and ROC scores. However, recall remains considerably low (except for sweet compounds). For Bitter, Floral, Fruity, and Off_flavor the results are promising, although recall is still low. Nutty and Sour show the worst performance. Additionally, except for the bitter flavor, KNN achieves slightly higher recall than Random Forest.

In [None]:
metrics_test_df = pd.DataFrame.from_dict(test_metrics, orient='index')

print(metrics_test_df)

                            Recall  Specificity  ROC Score
Random Forest Bitter      0.591146     0.986486   0.938389
KNN           Bitter      0.575521     0.957280   0.839378
Random Forest Floral      0.416667     0.967671   0.896347
KNN           Floral      0.440476     0.937998   0.786926
Random Forest Fruity      0.353723     0.962641   0.884246
KNN           Fruity      0.390957     0.928323   0.762269
Random Forest Off_flavor  0.417778     0.952873   0.893795
KNN           Off_flavor  0.482222     0.922352   0.793901
Random Forest Nutty       0.271084     0.978687   0.875503
KNN           Nutty       0.340361     0.953964   0.753136
Random Forest Sour        0.040000     0.997311   0.849422
KNN           Sour        0.066667     0.991548   0.620871
Random Forest Sweet       0.844307     0.951694   0.947829
KNN           Sweet       0.862122     0.885364   0.916645


When the same metrics are calculated on the training data, the recall is above 86% for all but one model (KNN for Nutty, at 56%). This reflects overfitting in the trained models, because recall drops by more than 50% on average when moving to the test set. The most likely cause is the considerable imbalance between positive and negative examples. This hypothesis is based on the fact that the differences between training and testing performance are large for recall but small for specificity. Further evidence is that the sweet category, where the class imbalance is low, shows no severe overfitting. Thus, implementing a class imbalance strategy has the potential to help solve this problem.

In [None]:
metrics_train_df = pd.DataFrame.from_dict(train_metrics, orient='index')

print(metrics_train_df)

                            Recall  Specificity  ROC Score
Random Forest Bitter      0.977124     0.998257   0.999437
KNN           Bitter      0.973856     0.998911   0.997185
Random Forest Floral      0.953033     0.994551   0.999156
KNN           Floral      0.942596     0.996622   0.999410
Random Forest Fruity      0.948681     0.995379   0.999105
KNN           Fruity      0.941554     0.996884   0.998759
Random Forest Off_flavor  0.938616     0.995066   0.998798
KNN           Off_flavor  0.934710     0.996075   0.997768
Random Forest Nutty       0.918281     0.997585   0.998864
KNN           Nutty       0.560236     0.978156   0.960667
Random Forest Sour        0.869565     0.999519   0.999639
KNN           Sour        0.897516     0.999422   0.998184
Random Forest Sweet       0.981734     0.994916   0.999441
KNN           Sweet       0.983272     0.994916   0.999064
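
The gap between training and test recall can be quantified directly from the two tables above. The following sketch copies the recall columns by hand and computes the relative drop per model; the helper `relative_drop` is a hypothetical name, and defining the drop relative to the training recall is an assumption about how the average is measured.

```python
# Recall values (train, test) copied from the two tables in this section
recall = {
    ("Random Forest", "Bitter"): (0.977124, 0.591146),
    ("KNN", "Bitter"): (0.973856, 0.575521),
    ("Random Forest", "Floral"): (0.953033, 0.416667),
    ("KNN", "Floral"): (0.942596, 0.440476),
    ("Random Forest", "Fruity"): (0.948681, 0.353723),
    ("KNN", "Fruity"): (0.941554, 0.390957),
    ("Random Forest", "Off_flavor"): (0.938616, 0.417778),
    ("KNN", "Off_flavor"): (0.934710, 0.482222),
    ("Random Forest", "Nutty"): (0.918281, 0.271084),
    ("KNN", "Nutty"): (0.560236, 0.340361),
    ("Random Forest", "Sour"): (0.869565, 0.040000),
    ("KNN", "Sour"): (0.897516, 0.066667),
    ("Random Forest", "Sweet"): (0.981734, 0.844307),
    ("KNN", "Sweet"): (0.983272, 0.862122),
}

def relative_drop(train, test):
    """Fraction of the training recall that is lost on the test set."""
    return (train - test) / train

drops = {key: relative_drop(*pair) for key, pair in recall.items()}
mean_drop = sum(drops.values()) / len(drops)

print(f"Mean relative recall drop: {mean_drop:.1%}")
# List models from the smallest to the largest drop
for (model, label), drop in sorted(drops.items(), key=lambda kv: kv[1]):
    print(f"{model:<14}{label:<11}{drop:.1%}")
```

Sorting by the drop makes it easy to see that the Sweet models lose the least recall and the Sour models the most.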


# 3. Training with the oversampled data



After SMOTE oversampling, the next step is to train the algorithms on the resampled data. The hyperparameter optimization was repeated in this step to help improve performance.
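
For readers unfamiliar with SMOTE, the sketch below illustrates only its core idea: a synthetic minority example is created by interpolating between a minority sample and one of its minority-class neighbors. This is not the code used to produce the oversampled spreadsheets (that step is not shown in this notebook); the function `smote_like_sample` and the toy points are hypothetical, and the nearest-neighbor search that real SMOTE performs is skipped by passing the neighbor explicitly.

```python
import random

def smote_like_sample(x, neighbor, rng):
    """Return a point on the segment between x and neighbor."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(42)

# Two toy minority-class samples in a 2-D descriptor space
minority_a = [1.0, 2.0]
minority_b = [3.0, 6.0]

# Generate a few synthetic minority samples between them
synthetic_points = [smote_like_sample(minority_a, minority_b, rng)
                    for _ in range(3)]
synthetic = synthetic_points[0]

for point in synthetic_points:
    print(point)
```

Because every synthetic point lies between existing minority samples, oversampling of this kind should be applied to the training data only, as done here, so that the test set keeps its original class distribution.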

WARNING: The training can take up to 2 hours

In [None]:
import pandas as pd

labels = ['Bitter', 'Floral', 'Fruity', 'Off_flavor', 'Nutty', 'Sour', 'Sweet']
trained_classifiers_SMOTE = []

for lab in labels:
    print(f"Training classifiers for '{lab}'")
    oversampled_data = pd.read_excel(f"{lab}_oversampled.xlsx")

    X_train = oversampled_data.drop([f"{lab}"], axis=1)
    y_train = oversampled_data [f"{lab}"]
    # Random Forest

    rf_classifier = RandomForestClassifier(random_state=42)
    rf_classifier = train_classifier(X_train, y_train, rf_classifier, rf_param_grid)
    trained_classifiers_SMOTE.append(("Random_Forest__SMOTE", lab, rf_classifier))

    # KNN

    knn_classifier = KNeighborsClassifier()
    knn_classifier = train_classifier(X_train, y_train, knn_classifier, knn_param_grid)
    trained_classifiers_SMOTE.append(("KNN_SMOTE", lab, knn_classifier))


Training classifiers for 'Bitter'
Training classifiers for 'Floral'
Training classifiers for 'Fruity'
Training classifiers for 'Off_flavor'
Training classifiers for 'Nutty'
Training classifiers for 'Sour'
Training classifiers for 'Sweet'


In [None]:
import joblib
from google.colab import files

# save model with joblib
for classifier_type, label, classifier in trained_classifiers_SMOTE:
  filename = f'{label}_{classifier_type}.sav'
  joblib.dump(classifier, filename)
  files.download(filename)


In [None]:
import joblib

metrics_train_SMOTE = {}
metrics_test_SMOTE = {}

labels = ['Bitter', 'Floral', 'Fruity', 'Off_flavor', 'Nutty', 'Sour', 'Sweet']

for lab in labels:
    print(f"Testing classifiers for '{lab}'")
    oversampled_data = pd.read_excel(f"{lab}_oversampled.xlsx")
    X_train = oversampled_data.drop([f'{lab}'], axis=1)
    y_train = oversampled_data[f'{lab}']


    # Map display names to the saved models trained on the SMOTE data
    classifier_trained = {
        'Random Forest': joblib.load(f'{lab}_Random_Forest__SMOTE.sav'),
        'KNN': joblib.load(f'{lab}_KNN_SMOTE.sav'),
    }

    for name, classifier in classifier_trained.items():

        # Training evaluation on the resampled training set
        y_pred_train = classifier.predict(X_train)
        recall_train = recall_score(y_train, y_pred_train)
        tn_train, fp_train, fn_train, tp_train = confusion_matrix(y_train, y_pred_train).ravel()
        specificity_train = tn_train / (tn_train + fp_train)
        roc_score_train = roc_auc_score(y_train, classifier.predict_proba(X_train)[:, 1])

        metrics_train_SMOTE[(name, lab)] = {'Recall': recall_train,
                                            'Specificity': specificity_train,
                                            'ROC Score': roc_score_train}

        # Testing evaluation on the original (non-resampled) test set
        y_pred_test = classifier.predict(X_RDKit_test)
        recall_test = recall_score(y_RDKit_test[lab], y_pred_test)
        tn_test, fp_test, fn_test, tp_test = confusion_matrix(y_RDKit_test[lab], y_pred_test).ravel()
        specificity_test = tn_test / (tn_test + fp_test)
        roc_score_test = roc_auc_score(y_RDKit_test[lab], classifier.predict_proba(X_RDKit_test)[:, 1])

        metrics_test_SMOTE[(name, lab)] = {'Recall': recall_test,
                                           'Specificity': specificity_test,
                                           'ROC Score': roc_score_test}

Testing classifiers for 'Bitter'
Testing classifiers for 'Floral'
Testing classifiers for 'Fruity'
Testing classifiers for 'Off_flavor'
Testing classifiers for 'Nutty'
Testing classifiers for 'Sour'
Testing classifiers for 'Sweet'



In [None]:
metrics_test_df_SMOTE = pd.DataFrame.from_dict(metrics_test_SMOTE, orient='index')

print(metrics_test_df_SMOTE)

                            Recall  Specificity  ROC Score
Random Forest Bitter      0.731771     0.949433   0.939729
KNN           Bitter      0.734375     0.858326   0.840433
Random Forest Floral      0.764286     0.827724   0.896409
KNN           Floral      0.759524     0.784322   0.835069
Random Forest Fruity      0.696809     0.861859   0.893241
KNN           Fruity      0.760638     0.774109   0.829311
Random Forest Off_flavor  0.795556     0.799372   0.891115
KNN           Off_flavor  0.715556     0.797576   0.812669
Random Forest Nutty       0.674699     0.845695   0.874960
KNN           Nutty       0.686747     0.782182   0.795347
Random Forest Sour        0.226667     0.981944   0.846697
KNN           Sour        0.520000     0.821744   0.711382
Random Forest Sweet       0.845856     0.943764   0.947353
KNN           Sweet       0.865221     0.878875   0.915209


In [None]:
metrics_train_df_SMOTE = pd.DataFrame.from_dict(metrics_train_SMOTE, orient='index')
print(metrics_train_df_SMOTE)

                            Recall  Specificity  ROC Score
Random Forest Bitter      0.996296     0.994771   0.999833
KNN           Bitter      0.994989     0.919163   0.997227
Random Forest Floral      0.996731     0.856364   0.977944
KNN           Floral      0.996513     0.829119   0.992856
Random Forest Fruity      0.989254     0.892542   0.977905
KNN           Fruity      0.996669     0.829787   0.993562
Random Forest Off_flavor  0.991141     0.840978   0.976941
KNN           Off_flavor  0.991253     0.880677   0.994313
Random Forest Nutty       0.991913     0.885738   0.987405
KNN           Nutty       0.994644     0.844360   0.994253
Random Forest Sour        0.998556     0.989891   0.999856
KNN           Sour        0.999519     0.866179   0.998054
Random Forest Sweet       0.982934     0.994372   0.999446
KNN           Sweet       0.984205     0.994916   0.999095


# 4. Training with the undersampled data

In [None]:
import pandas as pd

labels = ['Bitter', 'Floral', 'Fruity', 'Off_flavor', 'Nutty', 'Sour', 'Sweet']
trained_classifiers_CC = []

for lab in labels:
    print(f"Training classifiers for '{lab}'")
    undersampled_data = pd.read_excel(f"{lab}_undersampled.xlsx")

    X_train = undersampled_data.drop([f"{lab}"], axis=1)
    y_train = undersampled_data[f"{lab}"]
    # Random Forest

    rf_classifier = RandomForestClassifier(random_state=42)
    rf_classifier = train_classifier(X_train, y_train, rf_classifier, rf_param_grid)
    trained_classifiers_CC.append(("Random_Forest_CC", lab, rf_classifier))

    # KNN

    knn_classifier = KNeighborsClassifier()
    knn_classifier = train_classifier(X_train, y_train, knn_classifier, knn_param_grid)
    trained_classifiers_CC.append(("KNN_CC", lab, knn_classifier))


Training classifiers for 'Bitter'
Training classifiers for 'Floral'
Training classifiers for 'Fruity'
Training classifiers for 'Off_flavor'
Training classifiers for 'Nutty'
Training classifiers for 'Sour'
Training classifiers for 'Sweet'


In [None]:
import joblib

# save model with joblib
for classifier_type, label, classifier in trained_classifiers_CC:
  filename = f'{label}_{classifier_type}.sav'
  joblib.dump(classifier, filename)

In [None]:
metrics_train_CC = {}
metrics_test_CC = {}

labels = ['Bitter', 'Floral', 'Fruity', 'Off_flavor', 'Nutty', 'Sour', 'Sweet']

for lab in labels:
    print(f"Testing classifiers for '{lab}'")
    undersampled_data = pd.read_excel(f"{lab}_undersampled.xlsx")
    X_train = undersampled_data.drop([f'{lab}'], axis=1)
    y_train = undersampled_data[f'{lab}']

    # Map display names to the saved models trained on the undersampled data
    classifier_trained = {
        'Random Forest': joblib.load(f'{lab}_Random_Forest_CC.sav'),
        'KNN': joblib.load(f'{lab}_KNN_CC.sav'),
    }

    for name, classifier in classifier_trained.items():

        # Training evaluation on the resampled training set
        y_pred_train = classifier.predict(X_train)
        recall_train = recall_score(y_train, y_pred_train)
        tn_train, fp_train, fn_train, tp_train = confusion_matrix(y_train, y_pred_train).ravel()
        specificity_train = tn_train / (tn_train + fp_train)
        roc_score_train = roc_auc_score(y_train, classifier.predict_proba(X_train)[:, 1])

        metrics_train_CC[(name, lab)] = {'Recall': recall_train,
                                         'Specificity': specificity_train,
                                         'ROC Score': roc_score_train}

        # Testing evaluation on the original (non-resampled) test set
        y_pred_test = classifier.predict(X_RDKit_test)
        recall_test = recall_score(y_RDKit_test[lab], y_pred_test)
        tn_test, fp_test, fn_test, tp_test = confusion_matrix(y_RDKit_test[lab], y_pred_test).ravel()
        specificity_test = tn_test / (tn_test + fp_test)
        roc_score_test = roc_auc_score(y_RDKit_test[lab], classifier.predict_proba(X_RDKit_test)[:, 1])

        metrics_test_CC[(name, lab)] = {'Recall': recall_test,
                                        'Specificity': specificity_test,
                                        'ROC Score': roc_score_test}

Testing classifiers for 'Bitter'
Testing classifiers for 'Floral'
Testing classifiers for 'Fruity'
Testing classifiers for 'Off_flavor'
Testing classifiers for 'Nutty'
Testing classifiers for 'Sour'
Testing classifiers for 'Sweet'


In [None]:
metrics_test_df_CC = pd.DataFrame.from_dict(metrics_test_CC, orient='index')

print(metrics_test_df_CC)

                            Recall  Specificity  ROC Score
Random Forest Bitter      0.906250     0.553182   0.811942
KNN           Bitter      0.765625     0.817350   0.879238
Random Forest Floral      0.883333     0.710363   0.876294
KNN           Floral      0.819048     0.746236   0.857347
Random Forest Fruity      0.840426     0.678975   0.857553
KNN           Fruity      0.784574     0.721546   0.834234
Random Forest Off_flavor  0.877778     0.692998   0.854423
KNN           Off_flavor  0.791111     0.746409   0.851823
Random Forest Nutty       0.882530     0.639386   0.822737
KNN           Nutty       0.810241     0.718670   0.838434
Random Forest Sour        0.960000     0.420668   0.795620
KNN           Sour        0.706667     0.701882   0.792737
Random Forest Sweet       0.835786     0.943764   0.946771
KNN           Sweet       0.857475     0.886806   0.915954


In [None]:
metrics_train_df_CC = pd.DataFrame.from_dict(metrics_train_CC, orient='index')

print(metrics_train_df_CC)

                            Recall  Specificity  ROC Score
Random Forest Bitter      0.994771     0.966667   0.998925
KNN           Bitter      0.998693     1.000000   0.999999
Random Forest Floral      1.000000     1.000000   1.000000
KNN           Floral      0.872798     0.851924   0.931239
Random Forest Fruity      1.000000     1.000000   1.000000
KNN           Fruity      0.837491     0.826087   0.907878
Random Forest Off_flavor  0.993862     0.998884   0.999929
KNN           Off_flavor  0.998326     1.000000   0.999999
Random Forest Nutty       0.999158     0.999158   0.999992
KNN           Nutty       1.000000     1.000000   1.000000
Random Forest Sour        1.000000     1.000000   1.000000
KNN           Sour        1.000000     1.000000   1.000000
Random Forest Sweet       0.970006     0.991348   0.998967
KNN           Sweet       0.978081     0.994809   0.999436
