# Model training

This notebook trains a model for the one-class classification of metabolites. It replicates the training process of the associated paper, using a subset of the data: link_to_paper

In [9]:
import os
import pickle
from shutil import copytree
from pyod.models import ocsvm, iforest
from sklearn.metrics import roc_auc_score

from tests.auxiliary import get_normal_non_normal_subsets
from deepmet.auxiliary import get_fingerprints_from_meta, select_features, Config
from deepmet.datasets import load_training_dataset
from deepmet.workflows import train_single_model, train_likeness_scorer

Compounds were extracted from the HMDB and ZINC12 databases, subject to the following constraints:
- Exact mass filter: 100 Da < exact mass < 800 Da
- Other things
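
As a minimal sketch of the exact mass constraint, the filter can be written as a simple range check. The compound identifiers and masses below are hypothetical and purely illustrative, `passes_mass_filter` is not part of DeepMet, and the bounds are assumed to be exclusive.

```python
# Hypothetical records: (identifier, exact mass in Da). Values are illustrative only.
compounds = [
    ("cpd_a", 58.04),
    ("cpd_b", 180.06),
    ("cpd_c", 799.99),
    ("cpd_d", 800.00),
    ("cpd_e", 1203.50),
]

MIN_MASS_DA = 100.0
MAX_MASS_DA = 800.0


def passes_mass_filter(exact_mass, lower=MIN_MASS_DA, upper=MAX_MASS_DA):
    """Return True if lower < exact_mass < upper (both bounds assumed exclusive)."""
    return lower < exact_mass < upper


# Keep only the compounds whose exact mass lies inside the window
retained = [name for name, mass in compounds if passes_mass_filter(mass)]
print(retained)
```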

The entire set of compounds passing these filters in HMDB was retained, while a random sample of 20,000 compounds was taken from ZINC12. The SMILES for these compounds are available in the `data/test_set` folder.
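
A seeded random sample of this kind could be drawn along the following lines. This is a sketch only: the identifiers are made up, `sample_compounds` is a hypothetical helper, and the sampling procedure actually used for the paper may differ.

```python
import random


def sample_compounds(identifiers, n_samples, seed):
    """Draw n_samples identifiers without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return rng.sample(identifiers, n_samples)


# Hypothetical stand-in for the list of filtered ZINC12 identifiers
zinc_ids = [f"ZINC_{i:05d}" for i in range(100)]

sample_a = sample_compounds(zinc_ids, 10, seed=1)
sample_b = sample_compounds(zinc_ids, 10, seed=1)

# The same seed gives the same sample, so the subset can be regenerated later
print(sample_a == sample_b)
```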

In [10]:
# Path to write results
results_path = os.path.join(os.path.dirname(os.path.abspath("")), "notebook_results")

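# exist_ok avoids an error if the folder is already present
os.makedirs(results_path, exist_ok=True)

# Copy input data to the results folder
# (dirs_exist_ok requires Python 3.8+ and lets this cell be re-run without error)
copytree(
    os.path.join(os.path.dirname(os.path.abspath("")), "deepmet", "data"),
    os.path.join(results_path, "data"),
    dirs_exist_ok=True
)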

# Seed to be used for loading the dataset and training models
seed = 1

# Location of the "normal" and "non-normal" SMILES
normal_meta_path, non_normal_meta_path = get_normal_non_normal_subsets(results_path, seed=seed)

The function `train_likeness_scorer` implements the workflow for training the DeepMet model.
For the purposes of this vignette, we will first carry out the individual steps manually.

While SMILES are provided in the data files, they are not used directly as input to the model.
The model takes molecular fingerprints instead. If fingerprints are not given to the
`train_likeness_scorer` function, they are calculated from the SMILES given as input.
Here, we calculate them in the following chunk.

This is a particularly time-consuming step, so avoid regenerating the fingerprints unnecessarily.

In [11]:
processed_normal_fingerprints_path = os.path.join(results_path, "normal_fingerprints.csv")
processed_non_normal_fingerprints_path = os.path.join(results_path, "non_normal_fingerprints.csv")

normal_fingerprints_path = get_fingerprints_from_meta(normal_meta_path, processed_normal_fingerprints_path)
non_normal_fingerprints_path = get_fingerprints_from_meta(non_normal_meta_path, processed_non_normal_fingerprints_path)
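
Because fingerprint generation is slow, one way to avoid repeating it is to skip the step when the output file already exists. The sketch below illustrates the idea with a stand-in function. `cached_output` and `slow_fingerprint_step` are hypothetical names and are not part of DeepMet.

```python
import os
import tempfile


def cached_output(out_path, compute):
    """Call compute(out_path) only if out_path does not already exist."""
    if not os.path.exists(out_path):
        compute(out_path)
    return out_path


calls = []


def slow_fingerprint_step(out_path):
    # Stand-in for an expensive fingerprint calculation (hypothetical)
    calls.append(out_path)
    with open(out_path, "w") as handle:
        handle.write("placeholder\n")


with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "normal_fingerprints.csv")
    cached_output(target, slow_fingerprint_step)  # file missing: computed and written
    cached_output(target, slow_fingerprint_step)  # file present: skipped

print(len(calls))
```

In this notebook, the same pattern could presumably wrap the call to `get_fingerprints_from_meta`, for example by passing `lambda path: get_fingerprints_from_meta(normal_meta_path, path)` as `compute`.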

Here, we set the seed and the options for training DeepMet. The learning rate chosen
was the one associated with the minimum loss on the validation set.

In [16]:
# Settings required by the DeepMet model
cfg = Config({
    "net_name": "cocrystal_transformer",
    "objective": "one-class",
    "nu": 0.1,
    "rep_dim": 200,
    "seed": seed,
    "optimizer_name": "amsgrad",
    "lr": 0.000155986,
    "n_epochs": 20,
    "lr_milestones": tuple(),
    "batch_size": 2000,
    "weight_decay": 1e-5,
    "pretrain": False,
    "in_features": 2746,
    "device": "cpu"
})
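
The learning rate selection described above amounts to picking the candidate with the lowest validation loss. A minimal sketch follows. The candidate rates and loss values are invented for illustration and are not results from DeepMet.

```python
# Hypothetical validation losses per candidate learning rate (illustrative values only)
validation_loss_by_lr = {
    1e-5: 0.412,
    1e-4: 0.287,
    1e-3: 0.334,
    1e-2: 0.951,
}

# Choose the learning rate whose validation loss is smallest
best_lr = min(validation_loss_by_lr, key=validation_loss_by_lr.get)
print(best_lr)
```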

While we have now generated the molecular fingerprints, these include many poorly balanced and
redundant features. We therefore use `select_features` to remove redundant and unbalanced features
prior to model training.
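
To give a feel for what removing unbalanced and redundant features can mean for binary fingerprints, here is a simplified sketch. It is not the `select_features` implementation: the function name, the balance threshold and the exact-duplicate test are all assumptions made for illustration.

```python
def select_balanced_unique_features(rows, min_fraction=0.1):
    """Return indices of binary columns that are balanced and not duplicated.

    A column is kept if the fraction of 1s lies in [min_fraction, 1 - min_fraction]
    and no previously kept column has identical values. Both criteria and the
    threshold are illustrative assumptions.
    """
    n_rows = len(rows)
    kept, seen = [], set()
    for index, column in enumerate(zip(*rows)):
        fraction_on = sum(column) / n_rows
        if not (min_fraction <= fraction_on <= 1 - min_fraction):
            continue  # unbalanced: the bit is almost always 0 or almost always 1
        if column in seen:
            continue  # redundant: exact duplicate of a column already kept
        seen.add(column)
        kept.append(index)
    return kept


# Toy fingerprint matrix: 4 compounds x 5 bits
fingerprints = [
    [1, 0, 1, 0, 1],
    [1, 0, 0, 1, 0],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 1, 0],
]

selected = select_balanced_unique_features(fingerprints)
print(selected)
```

In the toy matrix, the first two columns are constant and the last column duplicates the third, so only the third and fourth columns survive.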

The data is then loaded into a torch-compatible format using `load_training_dataset`.

In [17]:
normal_fingerprints_path, non_normal_fingerprints_paths, selected_features = select_features(
    normal_fingerprints_path=normal_fingerprints_path,
    normal_fingerprints_out_path=os.path.join(results_path, "selected_normal_fingerprints.csv"),
    non_normal_fingerprints_paths=non_normal_fingerprints_path,
    non_normal_fingerprints_out_paths=os.path.join(results_path, "selected_non_normal_fingerprints.csv")
)

cfg.settings["selected_features"] = selected_features

# select_features allows for the simultaneous selection of multiple non-normal datasets
# we only have a single non-normal ZINC12 set here, which we will use to evaluate the final model
non_normal_fingerprints_path = non_normal_fingerprints_paths[0]

dataset, dataset_labels, validation_dataset = load_training_dataset(
    normal_dataset_path=normal_fingerprints_path,
    normal_meta_path=normal_meta_path,
    non_normal_dataset_path=non_normal_fingerprints_path,
    non_normal_dataset_meta_path=non_normal_meta_path,
    seed=seed,
    validation_split=0.8,
    test_split=0.9
)
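
The exact meaning of `validation_split=0.8` and `test_split=0.9` is not spelled out here. One plausible reading, which is an assumption rather than confirmed DeepMet behaviour, is that they are cumulative cut points: the first 80% of the shuffled data is used for training, the next 10% for validation and the final 10% for testing. The hypothetical `split_indices` helper below sketches that interpretation.

```python
import random


def split_indices(n_items, validation_split, test_split, seed):
    """Shuffle indices and cut them at two cumulative fractions (assumed semantics)."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    val_start = int(validation_split * n_items)
    test_start = int(test_split * n_items)
    train = indices[:val_start]
    validation = indices[val_start:test_start]
    test = indices[test_start:]
    return train, validation, test


train_idx, val_idx, test_idx = split_indices(100, 0.8, 0.9, seed=1)
print(len(train_idx), len(val_idx), len(test_idx))
```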

With the dataset loaded, we can now train the model. The core training workflow
is carried out using `train_single_model`. We can use AUC to evaluate the model's
discriminative capacity for metabolites vs ZINC12 compounds - but, importantly,
AUC is not used for hyperparameter optimisation as the validation set does not
contain any "non-normal" compounds.

In [18]:
# Train the model (loss is calculated on the 'normal' validation set for parameter tuning)
deep_met_model = train_single_model(cfg, validation_dataset)

# Test using separate test dataset (includes the ZINC12 set of 'non-normal' compounds)
deep_met_model.test(dataset, device=cfg.settings["device"])

initial_auc = round(deep_met_model.results["test_auc"], 4)
print("AUC on test set: " + str(initial_auc))



Only one class present in y_true. ROC AUC score is not defined in that case.
AUC on test set: 0.9404
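
The "Only one class present" message is consistent with AUC being requested on a set that contains only "normal" compounds, such as the validation set. AUC is the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, so it needs both classes. The pure-Python sketch below makes this explicit. `roc_auc` is a hypothetical helper written for illustration, not the function used by DeepMet or scikit-learn, and it assumes label 1 marks the class expected to receive higher scores.

```python
def roc_auc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC is undefined when only one class is present.")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly

try:
    roc_auc([0, 0, 0], [0.2, 0.5, 0.9])
except ValueError as error:
    single_class_message = str(error)
    print(single_class_message)
```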


Instead of going through each of these steps, the `train_likeness_scorer` function
can train a model from scratch, starting from the original SMILES. Here, we re-train
the model using the same settings as above and the pre-calculated fingerprints.

In [19]:
deep_met_model = train_likeness_scorer(
    normal_meta_path=normal_meta_path,
    non_normal_meta_path=non_normal_meta_path,
    normal_fingerprints_path=normal_fingerprints_path,
    non_normal_fingerprints_path=non_normal_fingerprints_path,
    results_path=results_path,
    net_name=cfg.settings["net_name"],
    objective=cfg.settings["objective"],
    nu=cfg.settings["nu"],
    rep_dim=cfg.settings["rep_dim"],
    device=cfg.settings["device"],
    seed=seed,
    optimizer_name=cfg.settings["optimizer_name"],
    lr=cfg.settings["lr"],
    n_epochs=cfg.settings["n_epochs"],
    lr_milestones=cfg.settings["lr_milestones"],
    batch_size=cfg.settings["batch_size"],
    weight_decay=cfg.settings["weight_decay"],
    validation_split=0.8,
    test_split=0.9
)

workflow_auc = round(deep_met_model.results["test_auc"], 4)
assert initial_auc == workflow_auc

print("AUC on test set: " + str(workflow_auc))


INFO:root:Log file is C:\Users\jackg\OneDrive\Documents\Work\Imperial_PhD\Side_projects\DeepMet\scripting\DeepMet\notebook_results/log.txt.
INFO:root:Export path is C:\Users\jackg\OneDrive\Documents\Work\Imperial_PhD\Side_projects\DeepMet\scripting\DeepMet\notebook_results.
INFO:root:Network: cocrystal_transformer
INFO:root:The filtered normal fingerprint matrix path is C:\Users\jackg\OneDrive\Documents\Work\Imperial_PhD\Side_projects\DeepMet\scripting\DeepMet\notebook_results\normal_fingerprints_processed.csv.
INFO:root:The filtered normal meta is C:\Users\jackg\OneDrive\Documents\Work\Imperial_PhD\Side_projects\DeepMet\scripting\DeepMet\notebook_results\normal_meta.csv.
INFO:root:The filtered non-normal fingerprint matrix path is C:\Users\jackg\OneDrive\Documents\Work\Imperial_PhD\Side_projects\DeepMet\scripting\DeepMet\notebook_results\non_normal_fingerprints_processed.csv.
INFO:root:The filtered non-normal meta is C:\Users\jackg\OneDrive\Documents\Work\Imperial_PhD\Side_projects\De

Only one class present in y_true. ROC AUC score is not defined in that case.
AUC on test set: 0.9404


With DeepMet trained, we can train isolation forest and one-class SVM (OC-SVM) models for comparison. As with the DeepMet
model, non-normal compounds are not used for parameter selection. The contamination and nu parameters were set to 0.1
for consistency with DeepMet. The remaining isolation forest parameters and the OC-SVM kernel are the same
as those used in the Co-crystal paper. The gamma parameter was selected using the validation set, with the scaled distance
of the outliers to the hyperplane as the loss function.

In [20]:
iforest_model = iforest.IForest(
    contamination=0.1,
    n_estimators=400,
    behaviour="new",
    random_state=seed,
    max_samples=1000
)

ocsvm_model = ocsvm.OCSVM(
    contamination=0.1,
    kernel="rbf",
    nu=0.1,
    gamma=0.00386
)

x_train = validation_dataset.train_set.dataset.data[validation_dataset.train_set.indices]

iforest_model.fit(x_train)

ocsvm_model.fit(x_train)

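# Use context managers so the files are closed once the models are written
with open(os.path.join(results_path, "iforest_model.pkl"), "wb") as iforest_file:
    pickle.dump(iforest_model, iforest_file)

with open(os.path.join(results_path, "ocsvm_model.pkl"), "wb") as ocsvm_file:
    pickle.dump(ocsvm_model, ocsvm_file)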

  warn("max_samples (%s) is greater than the "
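
The pickled models can be loaded again in a later session with `pickle.load`. The sketch below shows the round trip using a plain dictionary as a stand-in for a fitted model, so that it runs without PyOD installed. The stand-in object and the temporary folder are assumptions made for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model (hypothetical); a real model object pickles the same way
model_stand_in = {"name": "iforest", "contamination": 0.1}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "iforest_model.pkl")

    # Write the object to disk
    with open(model_path, "wb") as handle:
        pickle.dump(model_stand_in, handle)

    # Read it back, as would be done in a later session
    with open(model_path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == model_stand_in)
```

Pickle files should only be loaded from trusted sources, and loading a real model generally requires the same library versions that were used to save it.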


We can calculate AUC for these models as was done for DeepMet. The isolation forest and OC-SVM
models have similar discriminative performance to one another. In the output shown below, both reach
an AUC of approximately 0.95 on this subset of the data, compared with 0.9404 for DeepMet.

In [23]:
x_test = dataset.test_set.dataset.data[dataset.test_set.indices]
labels_test = dataset.test_set.dataset.labels[dataset.test_set.indices]

print("Isolation forest AUC: " + str(round(roc_auc_score(labels_test, iforest_model.decision_function(x_test)), 4)))
print("OC-SVM AUC: " + str(round(roc_auc_score(labels_test, ocsvm_model.decision_function(x_test)), 4)))

0.9492
0.9531999999999999
Isolation forest AUC: 0.95
OC-SVM AUC: 0.95
