![Retip](../../images/retip_logo.png)
# Retip: Retention Time Prediction for Metabolomics and Lipidomics

Retip is a Python tool for predicting retention times (RTs) of small molecules for high-pressure liquid chromatography (HPLC) mass spectrometry. Retention time prediction can be useful for identifying unknowns and removing false positive annotations. The machine learning algorithms included in the tool are **XGBoost**, **AutoGluon**, **AutoML** from **H2O**, and **Random Forest**. This tutorial explains how to train a model with **AutoGluon**.

## Training a Model with AutoGluon

[AutoGluon](https://auto.gluon.ai) is an AutoML library designed to automate the full machine learning pipeline, including feature preprocessing, training multiple model types, and constructing ensembles of models to improve overall accuracy.

Because AutoGluon performs so many tasks, the final model accuracy usually improves the longer it is allowed to train. When no time limit is specified, training should take between 10 and 30 minutes.

### Loading Data

Begin by importing the `retip` module, which provides access to the training, prediction, and visualization functions.

In [1]:
%reload_ext autoreload
%autoreload 2
try:
    import retip
except ImportError:
    # Retip is not installed: add a local checkout of pyRetip to the import path instead
    import sys
    sys.path.insert(1, '/home/npa/Documentos/pyRetip')

    import retip

The input data should be a compound retention time table in CSV or MS Excel format, containing the compound name, retention time and chemical identifier. Retip currently supports SMILES and PubChem CID as chemical identifiers.

Retip will use this input file to build the model and predict retention times for other biochemical databases or an input query list of compounds. It is suggested that the file has at least 300 compounds to build a good retention time prediction model.
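
Before loading a file, it can help to sanity-check that the table has the expected shape. The sketch below is not part of Retip: the helper name `check_input_table` is hypothetical, the column names `Name`, `RT`, and `SMILES` are assumptions taken from the example dataset used later in this tutorial, and only CSV input is handled.

```python
import csv
import io

REQUIRED_COLUMNS = ("Name", "RT", "SMILES")  # assumed names, matching the example data
SUGGESTED_MIN_COMPOUNDS = 300                # suggestion from the text above


def check_input_table(handle, required=REQUIRED_COLUMNS):
    """Return (n_rows, problems) for a CSV retention time table.

    Hypothetical helper for illustration; it is not a Retip function.
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    problems = [f"missing column: {c}" for c in required if c not in header]
    has_rt = "RT" in header

    n_rows = 0
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        n_rows += 1
        if has_rt:
            try:
                float(row["RT"] or "")
            except ValueError:
                problems.append(f"line {line_no}: RT is not numeric")

    if n_rows < SUGGESTED_MIN_COMPOUNDS:
        problems.append(
            f"only {n_rows} compounds; at least {SUGGESTED_MIN_COMPOUNDS} are suggested"
        )
    return n_rows, problems


# Tiny in-memory example; a real table would be opened with open(path, newline="")
example = io.StringIO("Name,RT,SMILES\nEthanol,1.2,CCO\nBenzene,n/a,c1ccccc1\n")
n_rows, problems = check_input_table(example)
print(n_rows, problems)
```

A check like this only catches structural problems; whether a SMILES string can actually be parsed is determined later, during descriptor calculation.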

Use the `retip.Dataset` class to load the data and create a new dataset.

In [2]:
dataset = retip.Dataset(target_column='RT').load_retip_dataset(
    training='Plasma_positive.xlsx', training_sheet_name='lib_2',
    validation='Plasma_positive.xlsx', validation_sheet_name='ext')

In [3]:
dataset.head(2)

Training
             Name                     InChIKey  \
0       Withanone  FAZIYUIDUNHZRG-UHFFFAOYNA-N   
1  Corosolic acid  HFGSQOYIOKBQOW-UHFFFAOYNA-N   

                                              SMILES    RT  
0  CC(C1CC(C)=C(C)C(=O)O1)C1(O)CCC2C3C4OC4C4(O)CC...  6.82  
1  CC1CCC2(CCC3(C)C(=CCC4C5(C)CC(O)C(O)C(C)(C)C5C...  9.89  

Validation
                                     Name                     InChIKey  \
0       Soyasapogenol E base + O-Hex-HexA  YDNHBSRZSMNZPB-NJAHCQCINA-N   
1  Soyasapogenol E base + O-HexA-Hex-dHex  CROUPKILZUPLQA-UHFFFAOYNA-N   

                                              SMILES    RT  
0  O=C(O)C7OC(OC2CCC3(C)(C4CC=C1C5CC(C)(C)CC(=O)C...  7.84  
1  O=C(O)C8OC(OC2CCC3(C)(C4CC=C1C5CC(C)(C)CC(=O)C...  7.62  



Next, the molecular descriptors can be computed with the [Mordred Molecular Descriptor Calculator](https://github.com/mordred-descriptor/mordred) by calling the `calculate_descriptors` function. Note that molecules that cannot be parsed will be retained in the dataset, but cannot be used for model training or validation.

In [4]:
dataset.calculate_descriptors()

Calculating descriptors for training dataset


100%|██████████| 494/494 [01:33<00:00,  5.30it/s]


Calculating descriptors for validation dataset


100%|██████████| 358/358 [01:22<00:00,  4.32it/s]


The `describe` function shows the shape of the datasets, indicating the number of rows and columns in each dataframe.

In [5]:
dataset.describe()

Training (494, 1617)
Validation (358, 1617)


The `preprocess_features` function performs feature reduction by removing features with missing values and by restricting the feature set to descriptors that yield non-null values for large sets of molecules. It is important to perform this step before training.

In [6]:
dataset.preprocess_features('metabolomics')

Reduced feature set from 1613 to 817
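
As a rough illustration of the first half of that step, the sketch below drops every feature that has a missing value in any row. It is a plain-Python stand-in and not Retip's implementation; the function name `drop_incomplete_features` and the toy values are made up for the example, and the predefined feature sets selected by an argument such as `'metabolomics'` are not modeled here.

```python
import math


def drop_incomplete_features(rows, feature_names):
    """Keep only features that have a usable value in every row.

    `rows` is a list of dicts mapping feature name -> value.
    Illustrative stand-in only; not a Retip function.
    """
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    kept = [
        name for name in feature_names
        if not any(is_missing(row.get(name)) for row in rows)
    ]
    reduced = [{name: row[name] for name in kept} for row in rows]
    return kept, reduced


# Toy descriptor table: SRW09 is missing for the first molecule
rows = [
    {"MW": 46.07, "nAcid": 0, "SRW09": float("nan")},
    {"MW": 78.11, "nAcid": 0, "SRW09": 0.0},
]
kept, reduced = drop_incomplete_features(rows, ["MW", "nAcid", "SRW09"])
print(f"Reduced feature set from 3 to {len(kept)}")
```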


Finally, if the data was not loaded from separate files, the `split_dataset` function can split it into training and testing sets.

- The `test_split` parameter defines the fraction of the dataset that should be used for testing the model's accuracy (this example uses 0.2, i.e. 20%).
- The `seed` parameter fixes the training/test split of the dataset, enabling reproducible model training.
- The `validation_split` parameter constructs an additional dataset for validation if desired.

In [7]:
dataset.split_dataset(test_split=0.2, seed=101)

In [8]:
dataset.describe()

Training (395, 821)
Validation (358, 821)
Testing (99, 821)
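
To illustrate why a fixed seed makes the split reproducible, here is a minimal sketch of a seeded split over row indices. It is not Retip's code: the helper name `split_indices` is hypothetical, and rounding the test size up is an assumption chosen because it is consistent with the 395/99 row counts shown above.

```python
import math
import random


def split_indices(n_rows, test_split, seed):
    """Shuffle row indices with a fixed seed and hold out a test fraction."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # same seed -> same shuffle
    n_test = math.ceil(n_rows * test_split)   # rounding rule is an assumption
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(494, test_split=0.2, seed=101)
print(len(train_idx), len(test_idx))  # 395 99
```

Calling the function again with the same seed returns the same two index lists, while a different seed assigns different rows to the test set.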


#### Save the new dataset

Given that molecular descriptor calculation is a time-consuming process, it is possible to save the current state of the dataset. The next time this retention time library is needed, simply load this export instead of the original file. Note that there is no need to include a file extension, as Retip will automatically append the dataset type to the filename provided.

In [9]:
dataset.save_retip_dataset('Plasma_positive_retip_processed')

Saved training dataset to Plasma_positive_retip_processed_training.csv
Saved validation dataset to Plasma_positive_retip_processed_validation.csv
Saved testing dataset to Plasma_positive_retip_processed_testing.csv


This dataset can be loaded by running the `load_retip_dataset` function.

In [2]:
# dataset = retip.Dataset(target_column='RT').load_retip_dataset(
#     'Plasma_positive_retip_processed_training.csv',
#     'Plasma_positive_retip_processed_testing.csv',
#     'Plasma_positive_retip_processed_validation.csv')

### Training RT Prediction Model

Next, train the RT prediction model. First, initialize the `AutoGluonTrainer` with the dataset containing the computed descriptors, and set the following parameters:
- The `training_duration` parameter indicates the maximum training time in minutes. This value defaults to `None`, meaning no time limit.
- The `preset` parameter controls the balance between training speed and prediction quality. The options are `medium_quality`, `high_quality` and `best_quality`. This value defaults to `high_quality`.

A cross-validation parameter does not need to be specified because AutoGluon handles this itself.

In [8]:
trainer = retip.AutoGluonTrainer(dataset, training_duration=3)
trainer.train()

No path specified. Models will be saved in: "AutogluonModels/ag-20240530_143714"
Presets specified: ['high_quality']
Setting dynamic_stacking from 'auto' to True. Reason: Enable dynamic_stacking when use_bag_holdout is disabled. (use_bag_holdout=False)
Stack configuration (auto_stack=True): num_stack_levels=1, num_bag_folds=8, num_bag_sets=1
Note: `save_bag_folds=False`! This will greatly reduce peak disk usage during fit (by ~8x), but runs the risk of an out-of-memory error during model refit if memory is small relative to the data size.
	You can avoid this risk by setting `save_bag_folds=True`.
Dynamic stacking is enabled (dynamic_stacking=True). AutoGluon will try to determine whether the input data is affected by stacked overfitting and enable or disable stacking as a consequence.
Detecting stacked overfitting by sub-fitting AutoGluon on the input data. That is, copies of AutoGluon will be sub-fit on subset(s) of the data. Then, the holdout validation data is used to detect stacked

Training completed in 0:03:14.499944 with best RMSE 0.749
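
For orientation, AutoGluon's own `TabularPredictor.fit` takes its time budget in seconds through `time_limit` and its quality setting through `presets`. How Retip forwards `training_duration` and `preset` internally is not shown in this tutorial, so the helper below (`autogluon_fit_kwargs`, a hypothetical name) is only a sketch of the unit conversion one would expect.

```python
VALID_PRESETS = ("medium_quality", "high_quality", "best_quality")  # options listed above


def autogluon_fit_kwargs(training_duration=None, preset="high_quality"):
    """Translate minutes-based settings into keyword arguments for AutoGluon.

    Hypothetical helper; Retip's actual wiring may differ.
    """
    if preset not in VALID_PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")
    kwargs = {"presets": preset}
    if training_duration is not None:
        kwargs["time_limit"] = int(training_duration * 60)  # minutes -> seconds
    return kwargs


print(autogluon_fit_kwargs(training_duration=3))
# With AutoGluon installed, these could be passed along as, for example:
#   TabularPredictor(label="RT").fit(train_df, **autogluon_fit_kwargs(3))
```

With `training_duration=3` this gives a 180-second budget, in line with the run above finishing in a little over three minutes.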


### Testing the RT Prediction Model

The model can be scored using the internal testing data of the `Dataset` object. Alternatively, pass a dataframe with precomputed descriptors; in that case, the `target_column` must be specified. Set the `plot` parameter to `True` to visualize how well the model performs, and provide a `plot_filename` to save the plot.

#### Internal testing data

In [15]:
trainer.score(plot=True)



{'root_mean_squared_error': 0.746632430017492,
 'mean_squared_error': 0.5574599855538251,
 'mean_absolute_error': 0.4870879038415774,
 'median_absolute_error': 0.31489881515502915,
 'explained_variance_score': 0.8681125221161674,
 'mean_absolute_percentage_error': 0.10790122000228491,
 'absolute_median_relative_error': 0.07857020781001936,
 'r2_score': 0.8666391197155288,
 'pearson_correlation': 0.9325613958745743,
 '90_percent_confidence_interval': 1.0300265107753241,
 '95_percent_confidence_interval': 1.2997770158705104}
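
Several of these scores follow standard textbook definitions. The sketch below computes a few of them from observed and predicted retention times using only the standard library; that Retip computes its same-named values in exactly this way is an assumption, and the remaining entries (such as the confidence intervals) are left out because their definitions are not given here. The function name `regression_metrics` and the toy values are illustrative.

```python
import math
import statistics


def regression_metrics(y_true, y_pred):
    """Compute a few common regression scores from paired values."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    squared = [e * e for e in errors]
    absolute = [abs(e) for e in errors]

    mse = statistics.fmean(squared)
    mean_true = statistics.fmean(y_true)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variance of observations

    return {
        "root_mean_squared_error": math.sqrt(mse),
        "mean_squared_error": mse,
        "mean_absolute_error": statistics.fmean(absolute),
        "median_absolute_error": statistics.median(absolute),
        "r2_score": 1.0 - sum(squared) / ss_tot,
    }


# Toy retention times in minutes (made-up values)
observed = [6.0, 7.0, 8.0, 9.0]
predicted = [6.5, 7.0, 7.5, 10.0]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

The relationship between the first two entries can be checked directly on the output above: 0.7466 squared is approximately 0.5575.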

#### External Validation

The model can be tested using the external dataset that we loaded initially. The target column must be specified.

In [16]:
trainer.score(dataset.get_validation_data(), target_column="RT", plot=True)



{'root_mean_squared_error': 0.9197020046111225,
 'mean_squared_error': 0.8458517772857173,
 'mean_absolute_error': 0.6596918262716112,
 'median_absolute_error': 0.4973020458221433,
 'explained_variance_score': 0.8173272958301936,
 'mean_absolute_percentage_error': 0.14872719150575137,
 'absolute_median_relative_error': 0.10611512910693996,
 'r2_score': 0.8145548805273906,
 'pearson_correlation': 0.9040662311055035,
 '90_percent_confidence_interval': 1.057349597085559,
 '95_percent_confidence_interval': 1.3889721971646514}

While we still observe the same issue, where the training data is not sufficiently representative of the chemical space, the accuracy of the AutoGluon model is notably better than that of XGBoost, even after only about 3 minutes of training.

### RT Prediction

The trained model can be used to predict retention times for a new dataset.

In [13]:
y_pred = trainer.predict(dataset.get_validation_data())
y_pred[:25]

0     6.950171
1     6.627048
2     6.557055
3     6.645472
4     6.296976
5     6.583277
6     7.230801
7     6.826734
8     6.754679
9     6.996120
10    6.551068
11    7.065253
12    6.421793
13    6.412821
14    6.784946
15    6.682776
16    6.283014
17    6.718084
18    6.683395
19    6.899658
20    6.203787
21    6.781049
22    8.971633
23    6.878408
24    6.881134
Name: RT, dtype: float32

These predicted values can be added to the dataset as an annotation column.

In [18]:
annotated = trainer.annotate(dataset.get_validation_data(include_metadata=True), prediction_column='RTP')

In [19]:
annotated.head()

| | Name | InChIKey | SMILES | RTP | RT | ABC | ABCGG | nAcid | nBase | nAromAtom | ... | SRW09 | SRW10 | TSRW10 | MW | AMW | WPath | WPol | Zagreb1 | Zagreb2 | mZagreb2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Soyasapogenol E base + O-Hex-HexA | YDNHBSRZSMNZPB-NJAHCQCINA-N | `O=C(O)C7OC(OC2CCC3(C)(C4CC=C1C5CC(C)(C)CC(=O)C5(C)(CCC1(C)C4(C)(CCC3(C2(C)(CO))))))C(OC6OC(CO)C(O)C(O)C6(O))C(O)C7(O)` | 6.950171 | 7.84 | 44.836157 | 32.982369 | 1 | 0 | 0 | ... | 0.0 | 11.553443 | 96.348334 | 794.445257 | 6.511846 | 12554 | 131 | 330.0 | 420.0 | 11.631944 |
| 1 | Soyasapogenol E base + O-HexA-Hex-dHex | CROUPKILZUPLQA-UHFFFAOYNA-N | `O=C(O)C8OC(OC2CCC3(C)(C4CC=C1C5CC(C)(C)CC(=O)C5(C)(CCC1(C)C4(C)(CCC3(C2(C)(CO))))))C(OC7OC(CO)C(O)C(O)C7(OC6OC(C)C(O)C(O)C6(O)))C(O)C8(O)` | 6.625747 | 7.62 | 52.780741 | 38.46448 | 1 | 0 | 0 | ... | 0.0 | 11.673657 | 107.062212 | 940.503166 | 6.623262 | 19659 | 152 | 386.0 | 489.0 | 13.743056 |
| 2 | Soyasapogenol E base + O-HexA-Hex-Hex | JTXVTHCLTOUSSL-UHFFFAOYNA-N | `O=C(O)C8OC(OC2CCC3(C)(C4CC=C1C5CC(C)(C)CC(=O)C5(C)(CCC1(C)C4(C)(CCC3(C2(C)(CO))))))C(OC7OC(CO)C(O)C(O)C7(OC6OC(CO)C(O)C(O)C6(O)))C(O)C8(O)` | 6.556981 | 7.49 | 53.378458 | 38.971182 | 1 | 0 | 0 | ... | 0.0 | 11.680125 | 108.106572 | 956.49808 | 6.688798 | 20502 | 154 | 390.0 | 494.0 | 14.076389 |
| 3 | Soyasapogenol B base + O-HexA-Pen-dHex, O-C6H7O3(DDMP) | VWKBHQGCNGULAZ-UHFFFAOYNA-N | `O=C(O)C9OC(OC3CCC4(C)(C5CC=C2C6CC(C)(C)CC(OC1OC(=C(O)C(=O)C1)C)C6(C)(CCC2(C)C5(C)(CCC4(C3(C)(CO))))))C(OC8OCC(O)C(O)C8(OC7OC(C)C(O)C(O)C7(O)))C(O)C9(O)` | 6.64542 | 7.79 | 58.615935 | 41.117105 | 1 | 0 | 0 | ... | 0.0 | 11.738011 | 114.48911 | 1038.539945 | 6.700258 | 27052 | 164 | 426.0 | 536.0 | 15.104167 |
| 4 | Soyasapogenol B base + O-HexA-Hex-Pen, O-dHex | REIWEXDMDVAAEI-UHFFFAOYNA-N | `CC1OC(OC2CC(C)(C)CC3C4=CCC5C6(C)CCC(OC7OC(C(O)C(O)C7OC7OC(CO)C(O)C(O)C7OC7OCC(O)C(O)C7O)C(O)=O)C(C)(CO)C6CCC5(C)C4(C)CCC23C)C(O)C(O)C1O` | 6.297136 | 5.56 | 59.949268 | 42.489658 | 1 | 0 | 0 | ... | 0.0 | 11.764827 | 116.612958 | 1074.561074 | 6.674292 | 28816 | 170 | 436.0 | 550.0 | 15.659722 |


The dataset now includes a new column, `RTP`, containing the predicted retention time. For molecules that could not be loaded, or whose descriptors could not be calculated, the `RTP` value will be empty or null.

### Saving/Loading Models

AutoGluon automatically saves its models into a directory called `AutogluonModels`, where each model is saved into a subdirectory named after the time at which training started. The trainer's `save_model` and `load_model` methods can be used to move these directories and reload them.

In [9]:
trainer.save_model('Plasma_positve_autogluon-model')

Moved AutoGluon model to Plasma_positve_autogluon-model


This exported model can then be reloaded and used to score datasets and predict new retention times. However, unless a dataset is first passed to the trainer, it cannot be retrained. 

In [11]:
trainer = retip.AutoGluonTrainer()
trainer.load_model('Plasma_positve_autogluon-model')

Loaded Plasma_positve_autogluon-model


If you use AutoGluon a lot, remember to clear out old models from the `AutogluonModels` directory!
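
One way to keep that directory under control is a small cleanup script. The sketch below is not part of Retip or AutoGluon; the function names are hypothetical, and it assumes run directories follow the `ag-YYYYMMDD_HHMMSS` naming seen in the training log above. It defaults to a dry run so that nothing is deleted by accident.

```python
from datetime import datetime, timedelta
from pathlib import Path
import shutil


def find_old_models(root="AutogluonModels", max_age_days=30, now=None):
    """List `ag-YYYYMMDD_HHMMSS` subdirectories older than `max_age_days`."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=max_age_days)
    root = Path(root)
    old = []
    if not root.is_dir():
        return old
    for path in sorted(root.iterdir()):
        if not (path.is_dir() and path.name.startswith("ag-")):
            continue
        try:
            started = datetime.strptime(path.name[3:], "%Y%m%d_%H%M%S")
        except ValueError:
            continue  # skip directories that do not follow the assumed pattern
        if started < cutoff:
            old.append(path)
    return old


def remove_models(paths, dry_run=True):
    """Delete the given directories, or only report them when dry_run is True."""
    for path in paths:
        print(("would remove " if dry_run else "removing ") + str(path))
        if not dry_run:
            shutil.rmtree(path)


# Report (but do not delete) runs older than 30 days in the current directory
remove_models(find_old_models(max_age_days=30), dry_run=True)
```

Models that were exported with `save_model` live outside `AutogluonModels` and are not touched by this script.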