![Retip](../../images/retip_logo.png)
# Retip: Retention Time Prediction for Lipidomics

### Retention Time Prediction Overview

Retip is a tool for predicting retention times (RTs) of small molecules in high-pressure liquid chromatography (HPLC) mass spectrometry.

### Training a Model with AutoGluon

[AutoGluon](https://auto.gluon.ai) is an AutoML library designed to automate the full machine learning pipeline, including feature preprocessing, training multiple model types, and constructing ensembles of models to improve overall accuracy.

We begin by importing the retip library, which gives us access to the training, prediction and visualization functions.

In [2]:
try:
    import retip
except ImportError:
    # add the parent directory to the path to load the Retip library locally in case it isn't installed
    import os, sys
    sys.path.insert(1, os.path.join(sys.path[0], '../..'))

    import retip

### Loading Data

Now we can import our retention time dataset. The user needs to prepare a compound retention time table in CSV or MS Excel format containing the compound name, retention time, and chemical identifier. Retip currently supports SMILES and PubChem CID as chemical identifiers. Retip uses this input file to build the model, which can then predict retention times for other biochemical databases or for an input query list of compounds.
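As a rough illustration of the input table described above, the following sketch writes a two-row CSV and runs a basic sanity check on it. The two compounds and their retention times are taken from the dataset preview in this tutorial. The helper names `write_rt_table` and `check_rt_table` are hypothetical and not part of Retip, and the assumption that the columns are headed `Name`, `SMILES` and `RT` is based only on that preview.

```python
import csv
import io

REQUIRED_COLUMNS = ["Name", "SMILES", "RT"]  # assumed headers, mirroring the preview

# Two entries copied from the testing-set preview shown in this tutorial.
rows = [
    {"Name": "CAR 10:1 [M]+",
     "SMILES": "C[N+](C)(C)CC(CC(=O)[O-])OC(=O)CCCCCCCC=C",
     "RT": 0.575773},
    {"Name": "CAR 16:0 [M]+",
     "SMILES": "CCCCCCCCCCCCCCCC(=O)OC(CC(=O)[O-])C[N+](C)(C)C",
     "RT": 1.612047},
]


def write_rt_table(rows, handle):
    """Write the rows as CSV with a header line (hypothetical helper)."""
    writer = csv.DictWriter(handle, fieldnames=REQUIRED_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


def check_rt_table(handle):
    """Return a list of problems found in the table; empty if none (hypothetical helper)."""
    reader = csv.DictReader(handle)
    missing = set(REQUIRED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    problems = []
    for number, row in enumerate(reader, start=1):
        if not row["SMILES"].strip():
            problems.append(f"row {number}: empty SMILES")
        try:
            float(row["RT"])
        except ValueError:
            problems.append(f"row {number}: RT is not numeric")
    return problems


buffer = io.StringIO()
write_rt_table(rows, buffer)
buffer.seek(0)
problems = check_rt_table(buffer)
print(problems or "table looks well-formed")
```

Checking the table before loading it is optional, but it can catch blank identifiers or non-numeric retention times before the slower descriptor calculation starts.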

Use the `retip.Dataset` class to create a new dataset.

* The `test_size` parameter defines what percentage of your dataset should be used for testing/validation of the model (this example uses 20%)
* The `seed` parameter sets a specific training/test split for the database, enabling reproducible model training
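To make the role of these two parameters concrete, here is a minimal sketch of a seeded train/test split using only the standard library. It is not Retip's implementation, which this tutorial does not show. The function `split_indices`, the row count of 100 and the seed value 101 are all illustrative assumptions.

```python
import random


def split_indices(n_rows, test_size=0.2, seed=101):
    """Shuffle row indices with a seeded RNG and hold out a fraction for testing.

    Hypothetical helper: illustrates the idea behind `test_size` and `seed`,
    not Retip's internal splitting logic.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # same seed -> same shuffle order
    n_test = round(n_rows * test_size)     # number of held-out rows
    test = sorted(indices[:n_test])
    train = sorted(indices[n_test:])
    return train, test


# Calling twice with the same seed yields the identical split.
train_a, test_a = split_indices(100, test_size=0.2, seed=101)
train_b, test_b = split_indices(100, test_size=0.2, seed=101)
print(len(train_a), len(test_a), train_a == train_b and test_a == test_b)
```

Fixing the seed means the same rows are held out on every run, so changes in model scores reflect changes in the model rather than in the split.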

In this tutorial, we use a pre-split dataset (the training and test sets are provided as separate CSV files) that contains SMILES structures for each entry, so no splitting parameters are required.

In [3]:
dataset = retip.Dataset(target_column='RT').load_retip_dataset(
    training='lipidomics_c18_retip_training.csv',
    testing='lipidomics_c18_retip_testing.csv')

In [4]:
dataset.head(2)

Training
       ID                         Name CompoundClass  \
0  860906  1_TG 14:0-13:0-14:0-d5_ISTD            TG   
1  860907  1_TG 14:0-15:1-14:0-d5_ISTD            TG   

                      InChIKey  \
0  AQNYNFGTMNCFBJ-GZKVWLTASA-N   
1  DODQZOMAJARNKZ-AHFZQTAFSA-N   

                                              SMILES     RT  
0  CCCCCCCCCCCCCC(=O)OCC(COC(=O)CCCCCCCCCCCCC)OC(...  9.211  
1  CCCC/C=C\CCCCCCCCC(=O)OC(COC(=O)CCCCCCCCCCCCC)...  9.235  

Testing
         ID           Name CompoundClass                     InChIKey  \
0  53481651  CAR 10:1 [M]+           CAR  GOOOCIIXFLVRAG-UHFFFAOYSA-N   
1  11953816  CAR 16:0 [M]+           CAR  XOMRRQXKHMYMOC-OAQYLSRUSA-N   

                                           SMILES        RT  
0       C[N+](C)(C)CC(CC(=O)[O-])OC(=O)CCCCCCCC=C  0.575773  
1  CCCCCCCCCCCCCCCC(=O)OC(CC(=O)[O-])C[N+](C)(C)C  1.612047  



In [5]:
dataset.calculate_descriptors()
dataset.preprocess_features('lipidomics')

Calculating descriptors for training dataset


100%|██████████| 194/194 [00:55<00:00,  3.50it/s]


Calculating descriptors for testing dataset


100%|██████████| 65/65 [00:19<00:00,  3.37it/s]

Reduced feature set from 1613 to 1432





In [6]:
dataset.save_retip_dataset('lipidomics_c18_retip_preprocessed')

Saved training dataset to lipidomics_c18_retip_preprocessed_training.csv
Saved testing dataset to lipidomics_c18_retip_preprocessed_testing.csv


### Training RT Prediction Model

As with XGBoost, we create a model trainer. The main difference is that we no longer need a cross-validation parameter (AutoGluon takes care of this); instead, we specify the training time in minutes. Here, we select `training_duration=10` to train for 10 minutes.

In [8]:
trainer = retip.AutoGluonTrainer(dataset, training_duration=10)
trainer.train()

No path specified. Models will be saved in: "AutogluonModels\ag-20230825_211522\"
Presets specified: ['high_quality']
Stack configuration (auto_stack=True): num_stack_levels=0, num_bag_folds=5, num_bag_sets=20
Beginning AutoGluon training ... Time limit = 600s
AutoGluon will save models to "AutogluonModels\ag-20230825_211522\"
AutoGluon Version:  0.8.2
Python Version:     3.10.12
Operating System:   Windows
Platform Machine:   AMD64
Platform Version:   10.0.22621
Disk Space Avail:   134.30 GB / 1023.06 GB (13.1%)
Train Data Rows:    194
Train Data Columns: 1432
Label Column: RT
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
	Label info (max, min, mean, stddev): (12.04906733, 0.507319618, 6.38335, 3.21749)
	If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binar

Training completed in 0:12:09.866399 with best RMSE 0.181


You can score this model using the internal testing data, or alternatively pass in a different `Dataset` object with precomputed descriptors.

In [10]:
trainer.score(plot=True)

{'root_mean_squared_error': 0.16612211299458612,
 'mean_squared_error': 0.027596556425786036,
 'mean_absolute_error': 0.13147805841428833,
 'median_absolute_error': 0.11400159832421863,
 'explained_variance_score': 0.9977386126501109,
 'mean_absolute_percentage_error': 0.024425951678414756,
 'absolute_median_relative_error': 0.015458331086393343,
 'r2_score': 0.9974159148973623,
 'pearson_correlation': 0.9990412114168319,
 '90_percent_confidence_interval': 0.2578625140767455,
 '95_percent_confidence_interval': 0.3143210200666909}
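For readers unfamiliar with these scores, the sketch below computes a few of them from their standard definitions using only the standard library. It is not Retip's scoring code, and the function name `regression_metrics` is hypothetical. The observed and predicted values are made-up illustrative numbers, not results from this model. The two confidence-interval entries are omitted because this tutorial does not say how they are derived.

```python
import math
from statistics import mean, median


def regression_metrics(observed, predicted):
    """Compute common regression scores from paired observed/predicted values.

    Hypothetical helper using textbook definitions; not Retip's scoring code.
    """
    errors = [p - o for o, p in zip(observed, predicted)]
    abs_errors = [abs(e) for e in errors]
    squared_errors = [e * e for e in errors]
    mse = mean(squared_errors)
    observed_mean = mean(observed)
    ss_res = sum(squared_errors)                              # residual sum of squares
    ss_tot = sum((o - observed_mean) ** 2 for o in observed)  # total sum of squares
    return {
        'root_mean_squared_error': math.sqrt(mse),
        'mean_squared_error': mse,
        'mean_absolute_error': mean(abs_errors),
        'median_absolute_error': median(abs_errors),
        'r2_score': 1 - ss_res / ss_tot,
    }


# Illustrative retention times in minutes (made up for this example).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
metrics = regression_metrics(observed, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Since RMSE and the absolute-error scores share the units of the target column, they can be read directly as typical retention time errors, whereas `r2_score` is unitless.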

If you use AutoGluon a lot, remember to clear out old models from the `AutogluonModels` directory!
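One possible way to do this housekeeping is sketched below with the standard library. The helper `remove_old_models` is hypothetical and not part of Retip or AutoGluon. It assumes run directories are named like `ag-20230825_211522`, as in the log above, so that sorting by name is chronological. It defaults to a dry run that only lists what would be deleted.

```python
import shutil
from pathlib import Path


def remove_old_models(root="AutogluonModels", keep=1, dry_run=True):
    """Delete all but the `keep` newest `ag-*` run directories under `root`.

    Hypothetical helper. Assumes timestamped names (ag-YYYYMMDD_HHMMSS),
    so that sorting by name orders the runs from oldest to newest.
    """
    root = Path(root)
    if not root.is_dir():
        return []
    runs = sorted(p for p in root.glob("ag-*") if p.is_dir())
    stale = runs[:-keep] if keep > 0 else runs
    for path in stale:
        if dry_run:
            print(f"would remove {path}")
        else:
            shutil.rmtree(path)  # permanently deletes the saved models
    return stale


# Dry run: report stale runs without deleting anything.
remove_old_models(keep=1, dry_run=True)
```

Pass `dry_run=False` only after checking the listed directories, since any model you still intend to load with `load_model` must be kept.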

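To reuse a trained model in a later session, reload the preprocessed dataset and load the saved model directory into a new trainer.

In [3]: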
dataset = retip.Dataset(target_column='RT').load_retip_dataset(
    training='lipidomics_c18_retip_preprocessed_training.csv',
    testing='lipidomics_c18_retip_preprocessed_testing.csv')

In [7]:
trainer = retip.AutoGluonTrainer(dataset)
trainer.load_model('AutogluonModels/ag-20230825_211522')

Loaded AutogluonModels/ag-20230825_211522
