![Retip](../../images/retip_logo.png)
# Retip: Retention Time Prediction for Metabolomics and Lipidomics

Retip is a Python tool for predicting retention times (RTs) of small molecules for high-pressure liquid chromatography (HPLC) mass spectrometry. Retention time prediction can help identify unknowns and remove false positive annotations. The machine learning algorithms included in the tool are **XGBoost**, **AutoGluon**, **AutoML** from **H2O** and **Random Forest**. This tutorial explains how to train a model with **AutoGluon**.

## Training a Model with AutoGluon

[AutoGluon](https://auto.gluon.ai) is an AutoML library designed to automate the full machine learning pipeline, including feature preprocessing, training multiple model types, and constructing ensembles of models to improve overall accuracy.

Because AutoGluon performs so many tasks, the final model accuracy usually improves the longer it is allowed to train. When no time limit is specified, the training should take between 10 and 30 minutes.

### Loading Data

Begin by importing the `retip` library, which provides access to the training, prediction and visualization functions.

In [1]:
try:
    import retip
except ImportError:
    # Retip is not installed: add a local copy of the library to the import path.
    # Adjust this path to the location of pyRetip on your machine.
    import sys
    sys.path.insert(1, '/home/npa/Documentos/pyRetip')

    import retip

The input data should be a compound retention time table in CSV or MS Excel format, containing the compound name, retention time and chemical identifier. Retip currently supports SMILES and PubChem CID as chemical identifiers.

Retip will use this input file to build the model and predict retention times for other biochemical databases or an input query list of compounds. It is suggested that the file has at least 300 compounds to build a good retention time prediction model.
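Before loading a table, it can help to check that it has the expected columns and enough compounds. The following is a minimal sketch using only the standard library; the column names `Name`, `SMILES` and `RT` are assumptions based on the example dataset used in this tutorial, and `check_retention_table` is a hypothetical helper, not part of Retip.

```python
import csv
import io

# Assumed column names, based on the example dataset used in this tutorial.
REQUIRED_COLUMNS = {"Name", "SMILES", "RT"}
MIN_COMPOUNDS = 300  # suggested minimum number of compounds from the text above


def check_retention_table(csv_text):
    """Return (row_count, missing_columns, has_enough_rows) for a CSV table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    rows = list(reader)
    return len(rows), sorted(missing), len(rows) >= MIN_COMPOUNDS


example = """Name,SMILES,RT
CAR 10:1 [M]+,C[N+](C)(C)CC(CC(=O)[O-])OC(=O)CCCCCCCC=C,0.575773
CAR 16:0 [M]+,CCCCCCCCCCCCCCCC(=O)OC(CC(=O)[O-])C[N+](C)(C)C,1.612047
"""

n_rows, missing, enough = check_retention_table(example)
print(n_rows, missing, enough)
```

For a file on disk, the same check can be applied to the text returned by reading the file.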

Use the `retip.Dataset` class to load the data and create a new dataset.

In this tutorial, the input data is already split into training and testing sets, provided as separate CSV files. For this reason, the `split_dataset` function will not be used in this tutorial (go to the metabolomics tutorials to see how it works).

In [2]:
dataset = retip.Dataset(target_column='RT').load_retip_dataset(
    training='lipidomics_c18_retip_training.csv',
    testing='lipidomics_c18_retip_testing.csv')

In [3]:
dataset.head(2)

Training
       ID                         Name CompoundClass  \
0  860906  1_TG 14:0-13:0-14:0-d5_ISTD            TG   
1  860907  1_TG 14:0-15:1-14:0-d5_ISTD            TG   

                      InChIKey  \
0  AQNYNFGTMNCFBJ-GZKVWLTASA-N   
1  DODQZOMAJARNKZ-AHFZQTAFSA-N   

                                              SMILES     RT  
0  CCCCCCCCCCCCCC(=O)OCC(COC(=O)CCCCCCCCCCCCC)OC(...  9.211  
1  CCCC/C=C\CCCCCCCCC(=O)OC(COC(=O)CCCCCCCCCCCCC)...  9.235  

Testing
         ID           Name CompoundClass                     InChIKey  \
0  53481651  CAR 10:1 [M]+           CAR  GOOOCIIXFLVRAG-UHFFFAOYSA-N   
1  11953816  CAR 16:0 [M]+           CAR  XOMRRQXKHMYMOC-OAQYLSRUSA-N   

                                           SMILES        RT  
0       C[N+](C)(C)CC(CC(=O)[O-])OC(=O)CCCCCCCC=C  0.575773  
1  CCCCCCCCCCCCCCCC(=O)OC(CC(=O)[O-])C[N+](C)(C)C  1.612047  



Next, the molecular descriptors can be computed with the [Mordred Molecular Descriptor Calculator](https://github.com/mordred-descriptor/mordred) by calling the `calculate_descriptors` function. Note that molecules that cannot be parsed will be retained in the dataset, but cannot be used for model training or validation.

In [4]:
dataset.calculate_descriptors()

Calculating descriptors for training dataset


100%|██████████| 194/194 [00:31<00:00,  6.24it/s]


Calculating descriptors for testing dataset


100%|██████████| 65/65 [00:10<00:00,  6.12it/s]


The `describe` function shows the shape of the datasets, indicating the number of rows and columns in each dataframe.

In [5]:
dataset.describe()

Training (194, 1619)
Testing (65, 1619)


The `preprocess_features` function performs feature reduction by removing features with missing values and restricting the feature set to descriptors that produce non-null values across large sets of molecules. It is important to perform this step before training.

In [6]:
dataset.preprocess_features('lipidomics')

Reduced feature set from 1613 to 1432
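The idea behind this kind of feature reduction can be illustrated with a small, self-contained sketch. This is not Retip's implementation: `drop_incomplete_features` is a hypothetical helper that keeps only the descriptors with a valid numeric value for every molecule, and `hypothetical_desc` is an invented descriptor name.

```python
import math


def drop_incomplete_features(rows, feature_names):
    """Keep only features that have a non-missing numeric value in every row."""
    def is_valid(value):
        return isinstance(value, (int, float)) and not math.isnan(value)

    return [name for name in feature_names
            if all(is_valid(row.get(name)) for row in rows)]


# Two molecules with three descriptors; one descriptor is missing (NaN) for the first.
rows = [
    {"MW": 313.225308, "nAcid": 1, "hypothetical_desc": float("nan")},
    {"MW": 399.334859, "nAcid": 1, "hypothetical_desc": 2.5},
]

kept = drop_incomplete_features(rows, ["MW", "nAcid", "hypothetical_desc"])
print(kept)
```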


#### Save the new dataset

Because molecular descriptor calculation is time-consuming, the current state of the dataset can be saved. The next time this retention time library is needed, load the saved export instead of recalculating the descriptors. Note that there is no need to include a file extension, as Retip will automatically append the dataset type to the filename provided.

In [7]:
dataset.save_retip_dataset('lipidomics_c18_retip_preprocessed')

Saved training dataset to lipidomics_c18_retip_preprocessed_training.csv
Saved testing dataset to lipidomics_c18_retip_preprocessed_testing.csv


The saved dataset can be loaded later by passing the exported files to the `load_retip_dataset` function.

In [8]:
# dataset = retip.Dataset(target_column='RT').load_retip_dataset(
#     training='lipidomics_c18_retip_preprocessed_training.csv',
#     testing='lipidomics_c18_retip_preprocessed_testing.csv')

### Training RT Prediction Model

Here, the RT prediction model will be trained. First, initialize the `AutoGluonTrainer` with the dataset containing the computed descriptors. The following parameters can be set:
- The `training_duration` parameter sets the maximum training time in minutes. This value defaults to `None`.
- The `preset` parameter controls the balance between training speed and prediction quality. The options are `medium_quality`, `high_quality` and `best_quality`. This value defaults to `high_quality`.

The cross-validation parameter does not need to be specified because AutoGluon takes care of this.

In [10]:
trainer = retip.AutoGluonTrainer(dataset, training_duration=10)
trainer.train()

No path specified. Models will be saved in: "AutogluonModels/ag-20240530_113657"
Presets specified: ['high_quality']
Setting dynamic_stacking from 'auto' to True. Reason: Enable dynamic_stacking when use_bag_holdout is disabled. (use_bag_holdout=False)
Stack configuration (auto_stack=True): num_stack_levels=1, num_bag_folds=8, num_bag_sets=1
Note: `save_bag_folds=False`! This will greatly reduce peak disk usage during fit (by ~8x), but runs the risk of an out-of-memory error during model refit if memory is small relative to the data size.
	You can avoid this risk by setting `save_bag_folds=True`.
Dynamic stacking is enabled (dynamic_stacking=True). AutoGluon will try to determine whether the input data is affected by stacked overfitting and enable or disable stacking as a consequence.
Detecting stacked overfitting by sub-fitting AutoGluon on the input data. That is, copies of AutoGluon will be sub-fit on subset(s) of the data. Then, the holdout validation data is used to detect stacked

Running the sub-fit in a ray process to avoid memory leakage.
Spend 205 seconds for the sub-fit(s) during dynamic stacking.
Time left for full fit of AutoGluon: 395 seconds.
Starting full fit now with num_stack_levels 1.
Beginning AutoGluon training ... Time limit = 395s
AutoGluon will save models to "AutogluonModels/ag-20240530_113657"
AutoGluon Version:  1.1.0
Python Version:     3.10.14
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue May  7 09:00:52 UTC 2
CPU Count:          16
Memory Avail:       46.48 GB / 62.71 GB (74.1%)
Disk Space Avail:   1521.67 GB / 1831.76 GB (83.1%)
Train Data Rows:    194
Train Data Columns: 1432
Label Column:       RT
Problem Type:       regression
Preprocessing data ...
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    47590.76 MB
	Train Data (Original)  Memory Usage: 2.13 MB (0.0% of available memory)
	In

Training completed in 0:10:48.632222 with best RMSE 0.199
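The log above suggests that the 10 minute `training_duration` acts as an overall budget: 205 seconds were spent on the dynamic stacking sub-fit, leaving 395 seconds for the full fit. The short sketch below reproduces that arithmetic; `remaining_fit_seconds` is a hypothetical helper, and this reading of the budget is inferred from the log rather than from the AutoGluon or Retip documentation.

```python
def remaining_fit_seconds(training_duration_minutes, subfit_seconds):
    """Seconds left for the full fit after the dynamic stacking sub-fit."""
    total_seconds = training_duration_minutes * 60
    return max(total_seconds - subfit_seconds, 0)


# 10 minutes = 600 s; 600 s - 205 s = 395 s, matching "Time left for full fit" in the log.
print(remaining_fit_seconds(10, 205))
```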


### Testing the RT Prediction Model

The model can be scored using the internal testing data of the `Dataset` object, or alternatively by passing a dataframe or a different `Dataset` object with precomputed descriptors. When a dataframe is passed, the `target_column` needs to be specified. Set the `plot` parameter to `True` to visualize how well the model performs. The plot can also be saved by specifying `plot_filename`.

#### Internal testing data

In [12]:
trainer.score(plot=True)



{'root_mean_squared_error': 0.1794124258296555,
 'mean_squared_error': 0.03218881854208163,
 'mean_absolute_error': 0.13143539799879345,
 'median_absolute_error': 0.10017209273193295,
 'explained_variance_score': 0.9970045146022016,
 'mean_absolute_percentage_error': 0.027884256673622616,
 'absolute_median_relative_error': 0.013533038516304547,
 'r2_score': 0.996985904864986,
 'pearson_correlation': 0.9985812263633,
 '90_percent_confidence_interval': 0.24331293331573275,
 '95_percent_confidence_interval': 0.30829230473254426}
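To make some of these metrics concrete, the sketch below computes a few of them from scratch using their standard definitions. It is an illustration, not the code Retip uses internally; the five observed and predicted values are taken from the first rows of this tutorial's testing set, so the results will differ from the full-dataset scores above.

```python
import math
from statistics import mean, median


def regression_metrics(y_true, y_pred):
    """Compute a subset of the scores above from observed and predicted RTs."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    squared_errors = [e * e for e in errors]
    mse = mean(squared_errors)
    y_mean = mean(y_true)
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return {
        "root_mean_squared_error": math.sqrt(mse),
        "mean_squared_error": mse,
        "mean_absolute_error": mean(abs_errors),
        "median_absolute_error": median(abs_errors),
        "r2_score": 1 - sum(squared_errors) / ss_tot,
    }


observed = [0.575773, 1.612047, 2.393059, 1.35052, 10.181433]
predicted = [0.638016, 1.806198, 2.531794, 1.350287, 9.995681]

metrics = regression_metrics(observed, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that `mean_squared_error` is the square of `root_mean_squared_error`, which also holds for the values reported above.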

### RT Prediction

The trained model can be used to predict retention times for a new dataset.

In [15]:
y_pred = trainer.predict(dataset.get_testing_data())
y_pred[:25]

0     0.638016
1     1.806198
2     2.531794
3     1.350287
4     9.995681
5     9.908512
6     8.344966
7     6.195411
8     6.306470
9     5.702343
10    6.151345
11    7.702169
12    7.525928
13    1.541055
14    1.543117
15    1.138519
16    2.026705
17    1.616988
18    1.222169
19    5.262007
20    4.657514
21    4.858946
22    5.581341
23    5.904990
24    6.196415
Name: RT, dtype: float32

These predicted values can be annotated to the dataset.

In [17]:
annotated = trainer.annotate(dataset.get_testing_data(include_metadata=True), prediction_column='RTP')

In [18]:
annotated.head()

| | ID | Name | CompoundClass | InChIKey | SMILES | RTP | RT | ABC | ABCGG | nAcid | ... | SRW10 | TSRW10 | MW | AMW | WPath | WPol | Zagreb1 | Zagreb2 | mZagreb1 | mZagreb2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 53481651 | CAR 10:1 [M]+ | CAR | GOOOCIIXFLVRAG-UHFFFAOYSA-N | `C[N+](C)(C)CC(CC(=O)[O-])OC(=O)CCCCCCCC=C` | 0.638016 | 0.575773 | 15.654168 | 13.81809 | 1 | ... | 9.247925 | 54.121383 | 313.225308 | 5.909911 | 1358 | 22 | 94.0 | 95.0 | 10.145833 | 5.125 |
| 1 | 11953816 | CAR 16:0 [M]+ | CAR | XOMRRQXKHMYMOC-OAQYLSRUSA-N | `CCCCCCCCCCCCCCCC(=O)OC(CC(=O)[O-])C[N+](C)(C)C` | 1.806198 | 1.612047 | 19.896808 | 15.891835 | 1 | ... | 9.383873 | 61.07585 | 399.334859 | 5.470341 | 3031 | 28 | 118.0 | 119.0 | 11.645833 | 6.625 |
| 2 | 6426855 | CAR 18:0 [M]+ | CAR | FNPHNLNTJNMAEE-UHFFFAOYSA-N | `CCCCCCCCCCCCCCCCCC(=O)OC(CC(=O)[O-])C[N+](C)(C)C` | 2.531794 | 2.393059 | 21.311022 | 16.518372 | 1 | ... | 9.425371 | 63.356146 | 427.366159 | 5.409698 | 3802 | 30 | 126.0 | 127.0 | 12.145833 | 7.125 |
| 3 | 6450015 | CAR 18:2 [M]+ | CAR | MJLXQSQYKZWZCB-DQFWFXSYSA-N | `CCCCCC=CCC=CCCCCCCCC(=O)OC(CC(=O)[O-])C[N+](C)...` | 1.350287 | 1.35052 | 21.311022 | 16.518372 | 1 | ... | 9.425371 | 63.356146 | 423.334859 | 5.644465 | 3802 | 30 | 126.0 | 127.0 | 12.145833 | 7.125 |
| 4 | 6436907 | CE 18:3 [2M+Na]+ | CE | FYMCIBHUFSIWCE-WVXFKAQASA-N | `CCC=CCC=CCC=CCCCCCCCC(=O)OC1CCC2(C3CCC4(C(C3CC...` | 9.995681 | 10.181433 | 35.742977 | 23.096223 | 0 | ... | 10.791605 | 99.139349 | 646.568882 | 5.343544 | 11882 | 75 | 236.0 | 275.0 | 15.375 | 10.666667 |


Now the dataset includes a new column, `RTP`, containing the predicted retention time. The `RTP` value will be empty or null for molecules that could not be loaded or whose descriptors could not be calculated.
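One way the predicted retention times might be used to remove false positive annotations is to compare each candidate's observed `RT` with its `RTP` and flag large deviations. The sketch below is a hypothetical illustration, not a Retip function: `split_by_rt_window` is an invented helper, the last two records are made up, and using the reported `95_percent_confidence_interval` (about 0.308) as the tolerance window is an assumption.

```python
import math


def split_by_rt_window(records, window, observed_key="RT", predicted_key="RTP"):
    """Split records into (consistent, suspicious, unpredicted) by |RT - RTP|."""
    consistent, suspicious, unpredicted = [], [], []
    for record in records:
        predicted = record.get(predicted_key)
        # Empty or null predictions cannot be compared with the observed RT.
        if predicted is None or math.isnan(predicted):
            unpredicted.append(record)
        elif abs(record[observed_key] - predicted) <= window:
            consistent.append(record)
        else:
            suspicious.append(record)
    return consistent, suspicious, unpredicted


candidates = [
    {"Name": "CAR 10:1 [M]+", "RT": 0.575773, "RTP": 0.638016},
    {"Name": "CAR 16:0 [M]+", "RT": 1.612047, "RTP": 1.806198},
    {"Name": "hypothetical mis-annotation", "RT": 6.2, "RTP": 1.350287},
    {"Name": "hypothetical unparsed molecule", "RT": 3.1, "RTP": None},
]

# Assumed tolerance: the 95 percent interval reported when scoring the model.
consistent, suspicious, unpredicted = split_by_rt_window(candidates, window=0.308)
print([r["Name"] for r in suspicious])
```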

### Saving/Loading Models

AutoGluon automatically saves its models into a directory called `AutogluonModels`, where each model is saved into a subdirectory named according to when the model started training. The `save_model` and `load_model` methods can be used to move these model directories and reload them.

In [19]:
trainer.save_model('lipidomics_c18_autogluon-model')

Moved AutoGluon model to lipidomics_c18_autogluon-model


This exported model can then be reloaded and used to score datasets and predict new retention times. However, unless a dataset is first passed to the trainer, it cannot be retrained. 

In [20]:
trainer = retip.AutoGluonTrainer()
trainer.load_model('lipidomics_c18_autogluon-model')

Loaded lipidomics_c18_autogluon-model


If you use AutoGluon a lot, remember to clear out old models from the `AutogluonModels` directory!
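A small standard-library sketch for finding stale model directories is shown below. `find_old_model_dirs` is a hypothetical helper and the 30 day threshold is an arbitrary assumption; it only lists candidates, so that each directory can be reviewed before it is deleted.

```python
import time
from pathlib import Path


def find_old_model_dirs(root="AutogluonModels", max_age_days=30, now=None):
    """List subdirectories of root last modified more than max_age_days ago."""
    root = Path(root)
    if not root.is_dir():
        return []
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    return sorted(p for p in root.iterdir()
                  if p.is_dir() and p.stat().st_mtime < cutoff)


for path in find_old_model_dirs():
    # Review the list first; a directory can then be removed with shutil.rmtree(path).
    print("old model directory:", path)
```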