# Model training

This notebook trains a model for the one-class classification of metabolites. It replicates the training process of the DeepMet model, using a subset of the data.

In [None]:
import os
import pickle
from shutil import copytree
from pyod.models import ocsvm, iforest
from sklearn.metrics import roc_auc_score

from tests.utils import get_normal_non_normal_subsets
from deepmet.auxiliary import get_fingerprints_from_meta, select_features, Config
from deepmet.datasets import load_training_dataset
from deepmet.workflows import train_single_model, train_likeness_scorer, get_likeness_scores

We extracted compounds from the HMDB and ZINC12 databases, subject to the following constraints:
 - Exact mass filter: 100 Da < exact mass < 800 Da
 - Heavy atom filter: heavy atoms >= 4
 - RDKit molecular sanitization
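The constraints above can be sketched as a simple predicate. This is an illustrative sketch only: `passes_filters` is a hypothetical helper, not part of the DeepMet API, and it assumes the exact mass, heavy atom count and sanitization result have already been computed elsewhere (for example with RDKit). It also reads the mass filter as the open interval between 100 Da and 800 Da.

```python
# Bounds taken from the filter list; the helper itself is hypothetical.
MIN_EXACT_MASS_DA = 100.0
MAX_EXACT_MASS_DA = 800.0
MIN_HEAVY_ATOMS = 4


def passes_filters(exact_mass, n_heavy_atoms, sanitized=True):
    """Return True if a compound satisfies all three constraints."""
    mass_ok = MIN_EXACT_MASS_DA < exact_mass < MAX_EXACT_MASS_DA
    atoms_ok = n_heavy_atoms >= MIN_HEAVY_ATOMS
    return sanitized and mass_ok and atoms_ok


# Glucose: exact mass of about 180.063 Da, 12 heavy atoms
print(passes_filters(180.063, 12))  # True
# Ethanol: exact mass of about 46.042 Da, 3 heavy atoms
print(passes_filters(46.042, 3))    # False
```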

The entire set of HMDB compounds passing these filters was retained, while a random sample of 20,000 compounds was taken from ZINC12. The SMILES for these compounds are available in the `deepmet/data/test_set` folder along with their respective compound IDs.

Below, we create the folder `notebook_results` and make a copy of the compound lists. The SMILES must be converted to molecular fingerprints, which are used as input to the models; however, parsing the SMILES and converting them to fingerprints is particularly time-consuming. For the purposes of this notebook, we therefore take subsets of 500 HMDB ("normal") compounds and 50 ZINC12 ("non-normal") compounds. These are randomly selected using `get_normal_non_normal_subsets`.
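To illustrate what seeded random subsetting looks like, here is a minimal sketch using only the standard library. It is not the implementation of `get_normal_non_normal_subsets`, whose internals are not shown in this notebook: `sample_subset` is a hypothetical helper, the compound IDs are placeholders, and the pool of 5,000 "normal" IDs is an arbitrary size chosen for the example.

```python
import random


def sample_subset(compound_ids, n, seed):
    """Draw n compounds without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator, leaves the global state untouched
    return rng.sample(list(compound_ids), n)


# Placeholder IDs standing in for the real compound lists
normal_pool = [f"NORMAL_{i:05d}" for i in range(5000)]
non_normal_pool = [f"NON_NORMAL_{i:05d}" for i in range(20000)]

normal_subset = sample_subset(normal_pool, 500, seed=1)
non_normal_subset = sample_subset(non_normal_pool, 50, seed=1)

print(len(normal_subset), len(non_normal_subset))  # 500 50
```

Using a dedicated `random.Random(seed)` instance means that calling the helper again with the same seed returns the same subset, which is what makes the later AUC comparisons repeatable.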

In [10]:
# Path to write results
results_path = os.path.join(os.path.dirname(os.path.abspath("")), "notebook_results")

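# Create the results folder (no error if it already exists)
os.makedirs(results_path, exist_ok=True)

# Copy input data to the results folder
# dirs_exist_ok (Python 3.8+) allows this cell to be re-run without raising FileExistsError
copytree(
    os.path.join(os.path.dirname(os.path.abspath("")), "deepmet", "data"),
    os.path.join(results_path, "data"),
    dirs_exist_ok=True
)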

# Seed to be used for loading the dataset and training models
seed = 1

# Location of the "normal" and "non-normal" SMILES
normal_meta_path, non_normal_meta_path = get_normal_non_normal_subsets(results_path, seed=seed)

The DeepMet package is based on the DeepSVDD model developed by Ruff et al., 2018:
 - GitHub: https://github.com/lukasruff/Deep-SVDD-PyTorch
 - ICML paper: http://proceedings.mlr.press/v80/ruff18a.html

The authors demonstrate the model for the detection of anomalous images. DeepMet uses the DeepSVDD approach to identify anomalous compounds: specifically, we trained the model on metabolites to generate "metabolite-likeness" scores. DeepMet does, however, allow users to re-train the model for any class of compounds, so likeness scores can be generated for any type of structure.
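As a rough illustration of the one-class Deep SVDD idea from Ruff et al., the sketch below scores a point by its squared distance from a fixed centre in the network's output space, and averages that distance over the training set to give the loss. The embeddings and centre are made-up numbers standing in for network outputs, the weight decay term of the objective is omitted, and the function names are hypothetical. How DeepMet converts this distance into a reported likeness score is not shown here.

```python
def squared_distance(embedding, center):
    """Squared Euclidean distance between an embedding and the centre."""
    return sum((e - c) ** 2 for e, c in zip(embedding, center))


def one_class_loss(embeddings, center):
    """Mean squared distance to the centre (regularisation term omitted)."""
    return sum(squared_distance(z, center) for z in embeddings) / len(embeddings)


# Toy 2-D embeddings; a real model would produce these from fingerprints
center = [0.0, 0.0]
train_embeddings = [[0.1, -0.1], [0.0, 0.2], [-0.2, 0.0]]

print(one_class_loss(train_embeddings, center))

# Points far from the centre receive a larger anomaly score
near_score = squared_distance([0.1, 0.1], center)
far_score = squared_distance([3.0, 4.0], center)
print(near_score < far_score)  # True
```

Training pulls the embeddings of "normal" compounds towards the centre, so compounds unlike the training class tend to land further away and score as more anomalous.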

There are two key workflows in the DeepMet package:
 - `train_likeness_scorer`: implements the workflow for training a DeepMet model
 - `get_likeness_scores`: uses a pre-trained DeepMet model to generate metabolite-likeness scores for new compounds

For the purposes of this vignette, we will first carry out the individual steps of the `train_likeness_scorer` workflow manually. Then, we will generate an identical model using `train_likeness_scorer` and compare it to two other one-class classification algorithms. Finally, we will take the stored model weights and those of the full model generated in the DeepMet paper and re-score the subset of compounds using `get_likeness_scores`.

In [11]:
normal_fingerprints_path = os.path.join(results_path, "normal_fingerprints.csv")
non_normal_fingerprints_path = os.path.join(results_path, "non_normal_fingerprints.csv")

# Take the SMILES and convert them to molecular fingerprints for each compound class
processed_normal_fingerprints_path = get_fingerprints_from_meta(normal_meta_path, normal_fingerprints_path)
processed_non_normal_fingerprints_path = get_fingerprints_from_meta(non_normal_meta_path, non_normal_fingerprints_path)

Below, we set the training options for the DeepMet model. The learning rate is the one selected in the DeepMet paper, where it gave the minimum loss on the validation dataset. Note that these parameters were tuned on the full training dataset, so they may not give comparable performance on the subset used here.

We use a set transformer architecture developed by Lee et al., 2019 (http://proceedings.mlr.press/v97/lee19d.html), which was also used by Vriza et al., 2020 (https://pubs.rsc.org/en/content/articlelanding/2021/sc/d0sc04263c) to predict co-crystal pairs.

In [16]:
# Settings required by the DeepMet model
cfg = Config({
    "net_name": "cocrystal_transformer",
    "objective": "one-class",
    "nu": 0.1,
    "rep_dim": 200,
    "seed": seed,
    "optimizer_name": "amsgrad",
    "lr": 0.000155986,
    "n_epochs": 20,
    "lr_milestones": tuple(),
    "batch_size": 25,
    "weight_decay": 1e-5,
    "pretrain": False,
    "in_features": 2746,
    "device": "cpu"
})

While we have generated the raw molecular fingerprints, these include many poorly balanced and redundant features. We therefore use `select_features` to remove these prior to model training.
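The idea of dropping poorly balanced and redundant fingerprint bits can be sketched as follows. This is not the logic of `select_features`, whose criteria and thresholds are not described in this notebook: the helper name, the 10% minority threshold and the "exact duplicate column" definition of redundancy are all assumptions made for illustration.

```python
def select_columns(matrix, min_minority_fraction=0.1):
    """Return indices of binary columns that are balanced and not duplicated."""
    n_rows = len(matrix)
    kept, seen = [], set()
    for j in range(len(matrix[0])):
        column = tuple(row[j] for row in matrix)
        on_fraction = sum(column) / n_rows
        # Poorly balanced: the rarer of the two bit values is too infrequent
        balanced = min(on_fraction, 1 - on_fraction) >= min_minority_fraction
        # Redundant: an identical column has already been kept
        if balanced and column not in seen:
            kept.append(j)
            seen.add(column)
    return kept


# Toy fingerprint matrix: rows are compounds, columns are bits
fingerprints = [
    [1, 0, 1, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
]

# Column 0 is constant and column 3 duplicates column 1
print(select_columns(fingerprints))  # [1, 2]
```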

The data is then loaded into a torch-compatible format using `load_training_dataset`.

In [17]:
normal_fingerprints_path, non_normal_fingerprints_paths, selected_features = select_features(
    normal_fingerprints_path=normal_fingerprints_path,
    normal_fingerprints_out_path=os.path.join(results_path, "selected_normal_fingerprints.csv"),
    non_normal_fingerprints_paths=non_normal_fingerprints_path,
    non_normal_fingerprints_out_paths=os.path.join(results_path, "selected_non_normal_fingerprints.csv")
)

cfg.settings["selected_features"] = selected_features

# select_features allows for the simultaneous selection of multiple non-normal datasets
# we only have a single non-normal ZINC12 set here, which we will use to evaluate the final model
non_normal_fingerprints_path = non_normal_fingerprints_paths[0]

dataset, dataset_labels, validation_dataset = load_training_dataset(
    normal_dataset_path=normal_fingerprints_path,
    normal_meta_path=normal_meta_path,
    non_normal_dataset_path=non_normal_fingerprints_path,
    non_normal_dataset_meta_path=non_normal_meta_path,
    seed=seed,
    validation_split=0.8,
    test_split=0.9
)
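The meaning of `validation_split=0.8` and `test_split=0.9` is not spelled out in this notebook. One plausible reading, which has not been checked against the DeepMet source, is that they are cumulative cut points on the shuffled data: the first 80% for training, the next 10% for validation and the final 10% for testing. The sketch below shows that interpretation only; `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n, validation_split, test_split, seed):
    """Shuffle indices and cut them at two cumulative fractions (assumed semantics)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    train_end = round(n * validation_split)
    validation_end = round(n * test_split)
    return (
        indices[:train_end],
        indices[train_end:validation_end],
        indices[validation_end:],
    )


train_idx, validation_idx, test_idx = split_indices(500, 0.8, 0.9, seed=1)
print(len(train_idx), len(validation_idx), len(test_idx))  # 400 50 50
```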

With the dataset loaded, we can now train the model. We can use AUC to evaluate how well the model discriminates metabolites from ZINC12 compounds. Importantly, however, AUC was not used for hyperparameter optimisation, as the validation set does not contain any "non-normal" compounds.

The AUC is relatively poor compared to that reported in the DeepMet paper as we are using a subset of the training data and we did not re-optimise model hyperparameters.
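For readers unfamiliar with the metric, ROC AUC can be read as the probability that a randomly chosen "non-normal" compound receives a higher anomaly score than a randomly chosen "normal" one, with ties counting as half. The pure-Python sketch below computes it by comparing every pair; `roc_auc` is an illustrative helper rather than the function used in this notebook, and the labels and scores are toy values. It also shows why AUC cannot be computed on a validation set that contains only "normal" compounds.

```python
def roc_auc(labels, scores):
    """Pairwise ROC AUC: fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC is undefined when only one class is present")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# 0 = "normal", 1 = "non-normal"; higher score = more anomalous
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

The pairwise form is quadratic in the number of compounds, so it is only suitable for small examples; library implementations such as `roc_auc_score` are the practical choice for real data.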

In [18]:
# Train the model (loss is calculated on the 'normal' validation set for parameter tuning)
deep_met_model = train_single_model(cfg, validation_dataset)

# Test using separate test dataset (includes the ZINC12 set of 'non-normal' compounds)
deep_met_model.test(dataset, device=cfg.settings["device"])

initial_auc = round(deep_met_model.results["test_auc"], 4)
print("AUC on test set: " + str(initial_auc))



Only one class present in y_true. ROC AUC score is not defined in that case.
AUC on test set: 0.9404


Instead of going through each of these steps, the `train_likeness_scorer` function can train a model from scratch, starting from the original SMILES. Here, we re-train the model using the same settings as above and the pre-calculated fingerprints.

In [19]:
deep_met_model = train_likeness_scorer(
    normal_meta_path=normal_meta_path,
    non_normal_meta_path=non_normal_meta_path,
    normal_fingerprints_path=normal_fingerprints_path,
    non_normal_fingerprints_path=non_normal_fingerprints_path,
    results_path=results_path,
    net_name=cfg.settings["net_name"],
    objective=cfg.settings["objective"],
    nu=cfg.settings["nu"],
    rep_dim=cfg.settings["rep_dim"],
    device=cfg.settings["device"],
    seed=seed,
    optimizer_name=cfg.settings["optimizer_name"],
    lr=cfg.settings["lr"],
    n_epochs=cfg.settings["n_epochs"],
    lr_milestones=cfg.settings["lr_milestones"],
    batch_size=cfg.settings["batch_size"],
    weight_decay=cfg.settings["weight_decay"],
    validation_split=0.8,
    test_split=0.9
)

workflow_auc = round(deep_met_model.results["test_auc"], 4)
assert initial_auc == workflow_auc

print("AUC on test set: " + str(workflow_auc))

INFO:root:Log file is C:\Users\jackg\OneDrive\Documents\Work\Imperial_PhD\Side_projects\DeepMet\scripting\DeepMet\notebook_results/log.txt.
INFO:root:Export path is C:\Users\jackg\OneDrive\Documents\Work\Imperial_PhD\Side_projects\DeepMet\scripting\DeepMet\notebook_results.
INFO:root:Network: cocrystal_transformer
INFO:root:The filtered normal fingerprint matrix path is C:\Users\jackg\OneDrive\Documents\Work\Imperial_PhD\Side_projects\DeepMet\scripting\DeepMet\notebook_results\normal_fingerprints_processed.csv.
INFO:root:The filtered normal meta is C:\Users\jackg\OneDrive\Documents\Work\Imperial_PhD\Side_projects\DeepMet\scripting\DeepMet\notebook_results\normal_meta.csv.
INFO:root:The filtered non-normal fingerprint matrix path is C:\Users\jackg\OneDrive\Documents\Work\Imperial_PhD\Side_projects\DeepMet\scripting\DeepMet\notebook_results\non_normal_fingerprints_processed.csv.
INFO:root:The filtered non-normal meta is C:\Users\jackg\OneDrive\Documents\Work\Imperial_PhD\Side_projects\De

Only one class present in y_true. ROC AUC score is not defined in that case.
AUC on test set: 0.9404


In [20]:
iforest_model = iforest.IForest(
    contamination=0.1,
    n_estimators=400,
    behaviour="new",
    random_state=seed,
    max_samples=1000
)

ocsvm_model = ocsvm.OCSVM(
    contamination=0.1,
    kernel="rbf",
    nu=0.1,
    gamma=0.00386
)

x_train = validation_dataset.train_set.dataset.data[validation_dataset.train_set.indices]

iforest_model.fit(x_train)

ocsvm_model.fit(x_train)

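# Save the fitted models, using context managers so the files are closed promptly
with open(os.path.join(results_path, "iforest_model.pkl"), "wb") as iforest_file:
    pickle.dump(iforest_model, iforest_file)

with open(os.path.join(results_path, "ocsvm_model.pkl"), "wb") as ocsvm_file:
    pickle.dump(ocsvm_model, ocsvm_file)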

  warn("max_samples (%s) is greater than the "


In [23]:
x_test = dataset.test_set.dataset.data[dataset.test_set.indices]
labels_test = dataset.test_set.dataset.labels[dataset.test_set.indices]

print("Isolation forest AUC: " + str(round(roc_auc_score(labels_test, iforest_model.decision_function(x_test)), 4)))
print("OC-SVM AUC: " + str(round(roc_auc_score(labels_test, ocsvm_model.decision_function(x_test)), 4)))

0.9492
0.9531999999999999
Isolation forest AUC: 0.95
OC-SVM AUC: 0.95


Having trained a DeepMet model, we may want to re-use it in the future to score new compounds. Alternatively, we can use the model that was trained in the DeepMet paper (based on the full set of endogenous metabolites in HMDB) which is likely to generalise better to new compounds.

In the code above, we split the data into training, validation and test sets. Here, however, we re-use the entire subset of compounds to demonstrate the `get_likeness_scores` function, so these AUC scores are likely to be optimistic: they include compounds that the notebook model was trained on.

In [None]:
scores = {
    "notebook": {},  # scores for the model generated in this notebook using a limited dataset
    "paper": {}      # scores for the model generated in the DeepMet paper on the full dataset
}

for model_name in ("notebook", "paper"):

    if model_name == "paper":
        # No paths given, so the model from the DeepMet paper is used
        load_model, load_config = None, None
    else:
        # Weights and config stored in results_path during training
        load_model = os.path.join(results_path, "model.tar")
        load_config = os.path.join(results_path, "config.json")

    # Score each compound class using its own meta file
    for dataset_name, meta_path in zip(("normal", "non-normal"), (normal_meta_path, non_normal_meta_path)):

        scores[model_name][dataset_name] = get_likeness_scores(
            meta_path,
            results_path,
            load_model=load_model,
            load_config=load_config,
            device=cfg.settings["device"]
        )

    all_model_scores = scores[model_name]["normal"] + scores[model_name]["non-normal"]

    # Labels: 0 for "normal" compounds, 1 for "non-normal" compounds
    all_model_labels = [0] * len(scores[model_name]["normal"]) + [1] * len(scores[model_name]["non-normal"])

    print("AUC for " + model_name + " model: " + str(roc_auc_score(all_model_labels, all_model_scores)))
