![Retip](../../images/retip_logo.png)
# Retip: Retention Time Prediction for Metabolomics and Lipidomics

Retip is a Python tool for predicting retention times (RTs) of small molecules for high pressure liquid chromatography (HPLC) mass spectrometry. Retention time prediction can help identify unknowns and remove false positive annotations. The machine learning algorithms included in the tool are **XGBoost**, **AutoGluon**, **AutoML** from **H2O** and **Random Forest**. This tutorial explains how to train a model with **H2O AutoML**.

## Training a Model with H2O AutoML

[H2O AutoML](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html) is an automatic machine learning tool from [H2O](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/index.html) designed to automate the machine learning workflow, which includes automatic training and tuning of many models within a user-specified time limit.

### Loading Data

Begin by importing the `retip` library, which provides access to the training, prediction and visualization functions.

In [1]:
%reload_ext autoreload
%autoreload 2
try:
    import retip
except ImportError:
    # add the parent directory to the path to load the Retip library locally in case it isn't installed
    import os, sys
    directory = os.getcwd().split("pyRetip")[0] + 'pyRetip'
    sys.path.insert(1, directory)
    
    import retip

The input data should be a compound retention time table in CSV or MS Excel format, containing the compound name, retention time and chemical identifier. Retip currently supports SMILES and PubChem CID as chemical identifiers.

Retip will use this input file to build the model and predict retention times for other biochemical databases or an input query list of compounds. It is suggested that the file has at least 300 compounds to build a good retention time prediction model.
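As a rough illustration of the input requirements described above, the following sketch checks a CSV retention time table before it is handed to Retip. It is not part of the Retip API: the function name `check_retention_table`, the exact header spellings (`Name`, `RT`, `SMILES`, `PubChem CID`) and the two example rows with made-up retention times are assumptions chosen for illustration.

```python
import csv
import io

# Header names are assumptions based on the example output in this tutorial.
REQUIRED_COLUMNS = {"Name", "RT"}
IDENTIFIER_COLUMNS = {"SMILES", "PubChem CID"}
MIN_RECOMMENDED_ROWS = 300  # suggested minimum number of compounds


def check_retention_table(csv_text):
    """Return a list of human-readable problems found in a CSV retention table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = set(reader.fieldnames or [])
    problems = []

    missing = REQUIRED_COLUMNS - header
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if not (IDENTIFIER_COLUMNS & header):
        problems.append("no chemical identifier column found")

    n_rows = sum(1 for _ in reader)
    if n_rows < MIN_RECOMMENDED_ROWS:
        problems.append(
            f"only {n_rows} compounds; at least {MIN_RECOMMENDED_ROWS} are suggested"
        )
    return problems


# Two illustrative rows with made-up retention times.
example = "Name,SMILES,RT\nEthanol,CCO,1.2\nBenzene,c1ccccc1,5.4\n"
problems = check_retention_table(example)
print(problems)
```

An empty list means the table has the expected columns and enough compounds; anything else is worth fixing before descriptor calculation, which is the slow step.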

Use the `retip.Dataset` class to load the data and create a new dataset.

In [5]:
dataset = retip.Dataset(target_column='RT').load_retip_dataset(
    training='Plasma_positive.xlsx', training_sheet_name='lib_2',
    validation='Plasma_positive.xlsx', validation_sheet_name='ext')

In [6]:
dataset.head(2)

Training
             Name                     InChIKey  \
0       Withanone  FAZIYUIDUNHZRG-UHFFFAOYNA-N   
1  Corosolic acid  HFGSQOYIOKBQOW-UHFFFAOYNA-N   

                                              SMILES    RT  
0  CC(C1CC(C)=C(C)C(=O)O1)C1(O)CCC2C3C4OC4C4(O)CC...  6.82  
1  CC1CCC2(CCC3(C)C(=CCC4C5(C)CC(O)C(O)C(C)(C)C5C...  9.89  

Validation
                                     Name                     InChIKey  \
0       Soyasapogenol E base + O-Hex-HexA  YDNHBSRZSMNZPB-NJAHCQCINA-N   
1  Soyasapogenol E base + O-HexA-Hex-dHex  CROUPKILZUPLQA-UHFFFAOYNA-N   

                                              SMILES    RT  
0  O=C(O)C7OC(OC2CCC3(C)(C4CC=C1C5CC(C)(C)CC(=O)C...  7.84  
1  O=C(O)C8OC(OC2CCC3(C)(C4CC=C1C5CC(C)(C)CC(=O)C...  7.62  



Next, compute the molecular descriptors with the [Mordred Molecular Descriptor Calculator](https://github.com/mordred-descriptor/mordred) by calling the `calculate_descriptors` function. Note that molecules that cannot be parsed are retained in the dataset, but cannot be used for model training or validation.

In [7]:
dataset.calculate_descriptors()

Calculating descriptors for training dataset


  0%|          | 0/494 [00:00<?, ?it/s]

100%|██████████| 494/494 [02:38<00:00,  3.12it/s]
  descs = descs.replace({False: 0, True: 1})


Calculating descriptors for validation dataset


100%|██████████| 358/358 [02:19<00:00,  2.57it/s]
  descs = descs.replace({False: 0, True: 1})


The `describe` function shows the shape of the datasets, indicating the number of rows and columns in each dataframe.

In [8]:
dataset.describe()

Training (494, 1617)
Validation (358, 1617)


The `preprocess_features` function performs feature reduction: it removes features with missing values and restricts the feature set to descriptors that produce non-null values for large sets of molecules. It is important to perform this step before training.

In [9]:
dataset.preprocess_features('metabolomics')

Reduced feature set from 1613 to 817
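The idea behind removing features with missing values can be sketched in a few lines of plain Python. This is only an illustration of the principle, not Retip's implementation: the helper `drop_features_with_missing_values` is hypothetical, the descriptor values are made up, and the curated descriptor list selected by the `'metabolomics'` argument is not reproduced here.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing descriptor values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_features_with_missing_values(rows, feature_names):
    """Return the feature names that have a usable value in every row."""
    return [
        name
        for name in feature_names
        if not any(is_missing(row.get(name)) for row in rows)
    ]


# Made-up descriptor values; the column names are taken from the tutorial output.
rows = [
    {"ABC": 44.8, "nAcid": 1, "MW": 794.4},
    {"ABC": 52.8, "nAcid": None, "MW": 940.5},
    {"ABC": float("nan"), "nAcid": 1, "MW": 956.5},
]
kept = drop_features_with_missing_values(rows, ["ABC", "nAcid", "MW"])
print(kept)  # ['MW']
```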


Finally, it is possible to split the data into training and testing sets if it has not been loaded in separate files before. The `split_dataset` function makes that possible.

- The `test_split` parameter defines the fraction of the dataset held out for testing the model's accuracy (this example uses `0.2`, i.e. 20%).
- The `seed` parameter fixes the training/test split of the dataset, enabling reproducible model training.
- The `validation_split` parameter constructs an additional dataset for validation if desired.

In [10]:
dataset.split_dataset(test_split=0.2, seed=101)

In [11]:
dataset.describe()

Training (395, 821)
Validation (358, 821)
Testing (99, 821)
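A seeded split of this kind can be sketched with the standard library alone. The function `split_indices` below is hypothetical, and rounding the test size up is an assumption; it happens to reproduce the 395/99 row counts shown above for 494 training rows, but the actual rows Retip selects for a given seed will differ.

```python
import math
import random


def split_indices(n_rows, test_split, seed):
    """Shuffle row indices with a fixed seed and cut off the test fraction."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_split)  # assumption: round the test size up
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(494, 0.2, seed=101)
print(len(train_idx), len(test_idx))  # 395 99
```

Calling the function again with the same seed returns the same partition, which is what makes a later training run comparable to an earlier one.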


#### Save the new dataset

Because molecular descriptor calculation is a time-consuming process, it is possible to save the current state of the dataset. The next time this retention time library is needed, simply load the exported files instead. Note that there is no need to include a file extension, as Retip will automatically append the dataset type to the filename provided.

In [12]:
dataset.save_retip_dataset('Plasma_positive_retip_processed')

Saved training dataset to Plasma_positive_retip_processed_training.csv
Saved validation dataset to Plasma_positive_retip_processed_validation.csv
Saved testing dataset to Plasma_positive_retip_processed_testing.csv
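The naming pattern visible in the output above can be written down as a small helper, which is handy when the saved files need to be located later. The function `dataset_filenames` is a hypothetical convenience, inferred only from the three filenames printed by `save_retip_dataset`.

```python
def dataset_filenames(prefix, parts=("training", "validation", "testing")):
    """Map each dataset part to the filename pattern seen in the save output."""
    return {part: f"{prefix}_{part}.csv" for part in parts}


files = dataset_filenames("Plasma_positive_retip_processed")
print(files["training"])  # Plasma_positive_retip_processed_training.csv
```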


This dataset can be loaded by running the `load_retip_dataset` function.

In [13]:
# dataset = retip.Dataset(target_column='RT').load_retip_dataset(
#     'Plasma_positive_retip_processed_training.csv',
#     'Plasma_positive_retip_processed_testing.csv',
#     'Plasma_positive_retip_processed_validation.csv')

### Training the RT Prediction Model

Here, the RT prediction model will be trained. First, initialize the `H2OautoMLTrainer` with the dataset containing the computed descriptors, then set the following parameters:
- The `nfolds` parameter sets the number of folds for k-fold cross-validation in AutoML (`>=2`). Specify `-1` to let AutoML choose between k-fold cross-validation and blending mode. Use `0` to disable cross-validation, which also disables Stacked Ensembles. This value defaults to `-1`.
- The `training_duration` parameter sets the maximum training time in minutes. This value defaults to `None`.
- The `max_models` parameter sets the maximum number of models to build in an AutoML run, excluding the Stacked Ensemble models. This value defaults to `20`.

If both `training_duration` and `max_models` are set, training stops as soon as either limit is reached.
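The interaction between a time limit and a model limit can be pictured as a loop with two stopping conditions. The sketch below is a simplified model of that behaviour, not how H2O AutoML is implemented: `run_with_budget` and the placeholder `train_one_model` callable are hypothetical, and the time limit is only checked between models.

```python
import time


def run_with_budget(train_one_model, training_duration=None, max_models=20):
    """Call train_one_model() until the time limit (minutes) or model limit is hit."""
    if training_duration is None and max_models is None:
        raise ValueError("set at least one of training_duration or max_models")

    deadline = None
    if training_duration is not None:
        deadline = time.monotonic() + training_duration * 60

    models = []
    while True:
        if max_models is not None and len(models) >= max_models:
            break  # model limit reached first
        if deadline is not None and time.monotonic() >= deadline:
            break  # time limit reached first
        models.append(train_one_model())
    return models


# The placeholder "training" is instantaneous, so the model limit wins here.
models = run_with_budget(lambda: "placeholder model", training_duration=1, max_models=3)
print(len(models))  # 3
```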

In [14]:
trainer = retip.H2OautoMLTrainer(dataset, training_duration=None, max_models=20)
trainer.train()

Checking whether there is an H2O instance running at http://localhost:54321.

.... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.22" 2024-01-16; OpenJDK Runtime Environment (build 11.0.22+7-post-Ubuntu-0ubuntu220.04.1); OpenJDK 64-Bit Server VM (build 11.0.22+7-post-Ubuntu-0ubuntu220.04.1, mixed mode, sharing)
  Starting server from /home/neuspouamengual/Descargas/anaconda3/envs/pyretip/lib/python3.10/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmp4sinx2hz
  JVM stdout: /tmp/tmp4sinx2hz/h2o_neuspouamengual_started_from_python.out
  JVM stderr: /tmp/tmp4sinx2hz/h2o_neuspouamengual_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O_cluster_uptime: 02 secs
H2O_cluster_timezone: Europe/Madrid
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.46.0.2
H2O_cluster_version_age: 17 days
H2O_cluster_name: H2O_from_python_neuspouamengual_s77wrt
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 3.842 Gb
H2O_cluster_total_cores: 8
H2O_cluster_allowed_cores: 8


Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
AutoML progress: |
13:11:31.655: _train param, Dropping bad and constant columns: [NsF, NaaS, NsI, NsssGeH, NsCl, NssSnH2, C2SP1, SssSiH2, n7aRing, n11aRing, n4FHRing, n7FHRing, n12FaRing, n4HRing, n6FRing, NssssBe, n8aRing, SsssAs, n11AHRing, n12aRing, n8aHRing, SsssSnH, n4FARing, SssssB, SssssSn, SssssSi, NsssssP, NssPH, SaaS, n8AHRing, n9aRing, SsAsH2, SsCl, NsBr, n5FRing, n7FARing, NsNH3, SsSnH3, SsLi, n11Ring, SsssGeH, SssSnH2, nG12aRing, n5FARing, n4aRing, NssssPb, n9AHRing, NssGeH2, SssAsH, NssssB, SssBe, SsSeH, n11aHRing, n5FAHRing, n4FAHRing, SssssBe, NsssB, n7FaRing, NsGeH3, nBr, NssssGe, SssSe, SddssSe, n7FRing, NsLi, NsssP, n10aHRing, n6FaHRing, n6FARing, NssNH2, n4AHRing, n6FaRing, NssBH, nCl, NsPbH3, NssPbH2, n6FHRing, NdssSe, n4FRing, n7aHRing, NsssdAs, SsPH2, SsssssP, n10aRing, SsSiH3, NsssSiH, StCH, SssGeH2, SdSe, SssssPb, n9aHRing, n9ARing, NssBe, NsSeH, n8FaHRing, nB, n7FaH

The models built by the `H2OautoMLTrainer` object can be inspected with `trainer.leaderboard`. To view the best model, use `trainer.leader` or, equivalently, `trainer.get_model(num_model=0)`. The same function gives access to the other models in the object: pass the model's position in the `trainer.leaderboard` output.

In [15]:
# trainer.leaderboard
# trainer.leader
# trainer.get_model(3)

### Testing the RT Prediction Model

The model can be scored using the internal testing data of the `Dataset` object, or by passing a dataframe with precomputed descriptors. In the latter case, the `target_column` needs to be specified. Set the `plot` parameter to `True` to visualize how well the model works. The plot can also be saved by specifying the `plot_filename`.

#### Internal testing data

In [16]:
trainer.score(plot=True)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
glm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
Export File progress: |██████████████████████████████████████████████████████████| (done) 100%




{'root_mean_squared_error': 0.8383103451225817,
 'mean_squared_error': 0.7027642347395421,
 'mean_absolute_error': 0.6056488434633444,
 'median_absolute_error': 0.43325381126212115,
 'explained_variance_score': 0.8330964876579156,
 'mean_absolute_percentage_error': 0.1526994816775119,
 'absolute_median_relative_error': 0.10145290874614638,
 'r2_score': 0.831878055096281,
 'pearson_correlation': 0.912842216537504,
 '90_percent_confidence_interval': 1.1418041549338283,
 '95_percent_confidence_interval': 1.4452584040040661}

#### External Validation

The model can be tested using the external dataset that we loaded initially. The target column must be specified.

In [17]:
trainer.score(dataset.get_validation_data(), target_column='RT', plot=True)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
glm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
Export File progress: |██████████████████████████████████████████████████████████| (done) 100%




{'root_mean_squared_error': 1.0253784459688355,
 'mean_squared_error': 1.051400957457464,
 'mean_absolute_error': 0.7474267858289604,
 'median_absolute_error': 0.5382494428149855,
 'explained_variance_score': 0.7701430577296087,
 'mean_absolute_percentage_error': 0.1778018406606097,
 'absolute_median_relative_error': 0.12522764849604523,
 'r2_score': 0.7694901383372577,
 'pearson_correlation': 0.8786984734078465,
 '90_percent_confidence_interval': 1.2576410557827484,
 '95_percent_confidence_interval': 1.6296352002017835}
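Several entries in the score dictionaries above are standard regression metrics, and the following sketch shows how they are conventionally computed. It is an illustration under assumptions, not Retip's scoring code: `regression_metrics` is a hypothetical helper, the keys are copied from the output above, and the confidence-interval and relative-error entries are left out because their exact definitions are not given in this tutorial. The example inputs are the first five `RT` and `RTP` values of the validation set.

```python
import math
import statistics


def regression_metrics(y_true, y_pred):
    """Compute a few standard regression metrics from paired values."""
    errors = [pred - true for true, pred in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_true = sum(y_true) / len(y_true)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / len(errors)
    return {
        "root_mean_squared_error": math.sqrt(mse),
        "mean_squared_error": mse,
        "mean_absolute_error": sum(abs_errors) / len(abs_errors),
        "median_absolute_error": statistics.median(abs_errors),
        "r2_score": 1 - ss_res / ss_tot,
    }


# First five measured (RT) and predicted (RTP) values from the validation set.
rt = [7.84, 7.62, 7.49, 7.79, 5.56]
rtp = [7.416626, 7.235337, 6.951158, 7.731201, 6.61427]
metrics = regression_metrics(rt, rtp)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

One quick sanity check on any reported result is that the root mean squared error equals the square root of the mean squared error, which holds for both dictionaries shown above.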

### RT Prediction

The trained model can be used to predict retention times for a new dataset.

In [18]:
y_pred = trainer.predict(dataset.get_validation_data())
y_pred[:25]

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
glm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
Export File progress: |██████████████████████████████████████████████████████████| (done) 100%


array([7.41662624, 7.23533736, 6.95115764, 7.73120128, 6.61427038,
       7.76611784, 7.30714197, 7.10120347, 7.01125917, 7.09985739,
       7.07681318, 7.33740361, 6.68836564, 6.51735782, 7.62803207,
       6.70914454, 6.4036448 , 7.05589377, 7.20367985, 6.62138009,
       6.5469407 , 6.68028502, 8.84915658, 6.99760343, 6.95940001])

These predicted values can be annotated to the dataset.

In [19]:
annotated = trainer.annotate(dataset.get_validation_data(include_metadata=True), prediction_column='RTP')

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
glm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
Export File progress: |██████████████████████████████████████████████████████████| (done) 100%


In [20]:
annotated.head()

| | Name | InChIKey | SMILES | RTP | RT | ABC | ABCGG | nAcid | nBase | nAromAtom | ... | SRW09 | SRW10 | TSRW10 | MW | AMW | WPath | WPol | Zagreb1 | Zagreb2 | mZagreb2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Soyasapogenol E base + O-Hex-HexA | YDNHBSRZSMNZPB-NJAHCQCINA-N | O=C(O)C7OC(OC2CCC3(C)(C4CC=C1C5CC(C)(C)CC(=O)C... | 7.416626 | 7.84 | 44.836157 | 32.982369 | 1 | 0 | 0 | ... | 0.0 | 11.553443 | 96.348334 | 794.445257 | 6.511846 | 12554 | 131 | 330.0 | 420.0 | 11.631944 |
| 1 | Soyasapogenol E base + O-HexA-Hex-dHex | CROUPKILZUPLQA-UHFFFAOYNA-N | O=C(O)C8OC(OC2CCC3(C)(C4CC=C1C5CC(C)(C)CC(=O)C... | 7.235337 | 7.62 | 52.780741 | 38.46448 | 1 | 0 | 0 | ... | 0.0 | 11.673657 | 107.062212 | 940.503166 | 6.623262 | 19659 | 152 | 386.0 | 489.0 | 13.743056 |
| 2 | Soyasapogenol E base + O-HexA-Hex-Hex | JTXVTHCLTOUSSL-UHFFFAOYNA-N | O=C(O)C8OC(OC2CCC3(C)(C4CC=C1C5CC(C)(C)CC(=O)C... | 6.951158 | 7.49 | 53.378458 | 38.971182 | 1 | 0 | 0 | ... | 0.0 | 11.680125 | 108.106572 | 956.49808 | 6.688798 | 20502 | 154 | 390.0 | 494.0 | 14.076389 |
| 3 | Soyasapogenol B base + O-HexA-Pen-dHex, O-C6H7... | VWKBHQGCNGULAZ-UHFFFAOYNA-N | O=C(O)C9OC(OC3CCC4(C)(C5CC=C2C6CC(C)(C)CC(OC1O... | 7.731201 | 7.79 | 58.615935 | 41.117105 | 1 | 0 | 0 | ... | 0.0 | 11.738011 | 114.48911 | 1038.539945 | 6.700258 | 27052 | 164 | 426.0 | 536.0 | 15.104167 |
| 4 | Soyasapogenol B base + O-HexA-Hex-Pen, O-dHex | REIWEXDMDVAAEI-UHFFFAOYNA-N | CC1OC(OC2CC(C)(C)CC3C4=CCC5C6(C)CCC(OC7OC(C(O)... | 6.61427 | 5.56 | 59.949268 | 42.489658 | 1 | 0 | 0 | ... | 0.0 | 11.764827 | 116.612958 | 1074.561074 | 6.674292 | 28816 | 170 | 436.0 | 550.0 | 15.659722 |


The dataset now includes a new column, `RTP`, containing the predicted retention time. The `RTP` value is empty or null for molecules that could not be loaded or whose descriptors could not be calculated.
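The annotation behaviour described above, where rows without a prediction keep an empty value, can be sketched with plain dictionaries. This is a conceptual illustration only: `annotate_rows` is a hypothetical helper, and the third row is an invented placeholder for a molecule whose descriptors could not be calculated.

```python
def annotate_rows(rows, predictions, prediction_column="RTP"):
    """Return copies of rows with a prediction column; missing predictions become None."""
    annotated = []
    for index, row in enumerate(rows):
        new_row = dict(row)  # copy so the input rows are left untouched
        new_row[prediction_column] = predictions.get(index)
        annotated.append(new_row)
    return annotated


rows = [
    {"Name": "Soyasapogenol E base + O-Hex-HexA", "RT": 7.84},
    {"Name": "Soyasapogenol E base + O-HexA-Hex-dHex", "RT": 7.62},
    {"Name": "hypothetical unparseable molecule", "RT": 3.10},  # invented row
]
# Predictions keyed by row index; row 2 has none because it could not be processed.
predictions = {0: 7.416626, 1: 7.235337}

annotated_rows = annotate_rows(rows, predictions)
print(annotated_rows[2]["RTP"])  # None
```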

### Feature importance

The feature importance of the model can be visualized using the `plot_feature_importance` function, which takes the trained model as input. The plot can be saved by specifying the `plot_filename`.

In [35]:
retip.visualization.plot_feature_importance(trainer)

It is also possible to get all feature importance values as a dataframe using the `feature_importance` function.

In [37]:
# trainer.feature_importance().head()

### Saving/Loading Models

The whole `H2OautoMLTrainer` object cannot be saved, but any one of its models can be. Use the `save_model` function and specify the `model_num` parameter (the model's position in the `trainer.leaderboard` output) and the `filename`.

In [21]:
trainer.save_model(model_num=0, filename="H2O_autoML_model0")

This exported model can then be reloaded and used to score datasets and predict new retention times. However, unless a dataset is first passed to the trainer, it cannot be retrained. 

In [22]:
trainer = retip.H2OautoMLTrainer(dataset)
trainer.load_model("H2O_autoML_model0")

Checking whether there is an H2O instance running at http://localhost:54321. connected.


H2O_cluster_uptime: 1 min 09 secs
H2O_cluster_timezone: Europe/Madrid
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.46.0.2
H2O_cluster_version_age: 17 days
H2O_cluster_name: H2O_from_python_neuspouamengual_s77wrt
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 3.835 Gb
H2O_cluster_total_cores: 8
H2O_cluster_allowed_cores: 8


Loaded H2O_autoML_model0
