![Retip](../../images/retip_logo.png)
# Retip: Retention Time Prediction for Metabolomics and Lipidomics

Retip is a Python tool for predicting retention times (RTs) of small molecules in high-pressure liquid chromatography (HPLC) mass spectrometry. Retention time prediction can help identify unknowns and remove false positive annotations. The machine learning algorithms included in the tool are: **XGBoost**, **AutoGluon**, **AutoML** from **H2O** and **Random Forest**. This tutorial explains how to identify false annotations in a metabolomics dataset using a model trained with **XGBoost**.

## Identifying False Annotations

Given an annotated metabolomics dataset, the predicted retention time can be used to identify likely misannotated features.

### Loading data

Begin by importing the `retip` library, which provides access to the training, prediction and visualization functions.

In [1]:
%reload_ext autoreload
%autoreload 2
try:
    import retip
except ImportError:
    # add the parent directory to the path to load the Retip library locally in case it isn't installed
    import os, sys
    directory = os.getcwd().split("pyRetip")[0] + 'pyRetip'
    sys.path.insert(1, directory)
    
    import retip

The input data should be a compound retention time table in CSV or MS Excel format, containing the compound name, retention time and chemical identifier. Retip currently supports SMILES and PubChem CID as chemical identifiers.

Retip will use this input file to build the model and predict retention times.
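Before loading a file, it can help to confirm that it has the expected layout. The following is a minimal sketch using only the Python standard library. The column names `Name`, `SMILES` and `RT` follow the dataset preview shown in this tutorial; treating them as required headers is an assumption, and `check_retention_table` is a hypothetical helper, not part of Retip.

```python
import csv
import io

# Assumed headers, taken from the dataset preview in this tutorial.
REQUIRED_COLUMNS = {"Name", "SMILES", "RT"}

def check_retention_table(handle):
    """Parse a CSV retention time table and return its rows.

    Raises ValueError if a required column is missing or an RT is not numeric.
    """
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for row in reader:
        try:
            row["RT"] = float(row["RT"])
        except (TypeError, ValueError) as exc:
            raise ValueError(
                f"line {reader.line_num}: RT is not numeric: {row['RT']!r}"
            ) from exc
        rows.append(row)
    return rows

# In-memory stand-in for a real file such as the one used in this tutorial.
sample = io.StringIO(
    "Name,InChIKey,SMILES,RT\n"
    "Gramine,OCDGBSUVYYVKQZ-UHFFFAOYSA-N,CN(C)CC1=CNC2=CC=CC=C21,0.925\n"
)
rows = check_retention_table(sample)
print(rows[0]["Name"], rows[0]["RT"])
```

For a file on disk, the same function could be called with `open(path, newline="")` in place of the `StringIO` object.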

Use the `retip.Dataset` class to load the data and create a new dataset.

In this tutorial, the model will not be used to predict RTs for other datasets, so it is not necessary to split the data into training and testing sets. For this reason, the `split_dataset` function is not used here (see the metabolomics tutorials for how it works).

In [2]:
dataset = retip.Dataset().load_retip_dataset('tomato_annotations.csv')

In [3]:
dataset.head(2)

Training

| | Name | InChIKey | SMILES | RT |
|---|---|---|---|---|
| 0 | 1-thiazol-2-ylethanone (PhytoBank:PHY0136536) | MOMFXATYAINJML-UHFFFAOYSA-N | `CC(=O)C1=NC=CS1` | 0.774 |
| 1 | Gramine | OCDGBSUVYYVKQZ-UHFFFAOYSA-N | `CN(C)CC1=CNC2=CC=CC=C21` | 0.925 |



Next, compute the molecular descriptors with the [Mordred Molecular Descriptor Calculator](https://github.com/mordred-descriptor/mordred) by calling the `calculate_descriptors` function. Note that molecules that cannot be parsed will be retained in the dataset, but cannot be used for model training or validation.

In [4]:
dataset.calculate_descriptors()

Calculating descriptors for training dataset

100%|██████████| 1164/1164 [04:01<00:00,  4.81it/s]


The `describe` function shows the shape of the datasets, indicating the number of rows and columns in each dataframe.

In [5]:
dataset.describe()

Training (1164, 1617)


The `preprocess_features` function performs feature reduction by removing features with missing values and restricting the feature set to descriptors that yield non-null values for large sets of molecules. It is important to perform this step before training.

In [6]:
dataset.preprocess_features('metabolomics')

Reduced feature set from 1613 to 817
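To illustrate the missing-value part of this step, here is a minimal sketch that drops any descriptor column containing a missing value. It is not Retip's implementation: `drop_incomplete_features` and the descriptor names `desc_a`, `desc_b` and `desc_c` are hypothetical, the descriptor values are toy numbers, and the restriction to a curated descriptor list is not reproduced.

```python
import math

def is_missing(value):
    """True for None or a float NaN, the two 'no value' markers used here."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def drop_incomplete_features(rows, keep=("Name", "SMILES", "RT")):
    """Remove every column outside `keep` that has a missing value in any row."""
    complete = [
        column for column in rows[0]
        if column in keep or not any(is_missing(row[column]) for row in rows)
    ]
    return [{column: row[column] for column in complete} for row in rows]

# Toy rows; desc_a, desc_b and desc_c are placeholder descriptor names.
rows = [
    {"Name": "Gramine", "SMILES": "CN(C)CC1=CNC2=CC=CC=C21", "RT": 0.925,
     "desc_a": 1.0, "desc_b": float("nan"), "desc_c": 3.0},
    {"Name": "3-Acetoxypyridine", "SMILES": "CC(=O)OC1=CN=CC=C1", "RT": 1.007,
     "desc_a": 2.0, "desc_b": 0.5, "desc_c": None},
]
reduced = drop_incomplete_features(rows)
print(list(reduced[0]))  # ['Name', 'SMILES', 'RT', 'desc_a']
```

Only `desc_a` survives because it is the only descriptor with a value in every row.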


#### Save the new dataset

Because molecular descriptor calculation is time-consuming, it is worth saving the current state of the dataset. The next time this retention time library is needed, load the exported file instead of recomputing the descriptors. There is no need to include a file extension, as Retip automatically appends the dataset type to the filename provided.

In [7]:
dataset.save_retip_dataset('tomato_annotations_processed')

Saved training dataset to tomato_annotations_processed_training.csv


The saved dataset can later be reloaded with the `load_retip_dataset` function.

In [8]:
# dataset.load_retip_dataset('tomato_annotations_processed.csv')
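The save-then-reload workflow can be wrapped in a small caching helper so that the expensive step runs only when no saved file exists. The sketch below is generic and uses plain-text stand-ins: `load_or_build` and its `build`, `load` and `save` callables are hypothetical names. With Retip, they would wrap `calculate_descriptors`, `load_retip_dataset` and `save_retip_dataset`, keeping in mind that the saved filename carries the `_training.csv` suffix shown in the output above.

```python
import os
import tempfile

def load_or_build(path, build, load, save):
    """Return load(path) if the file exists; otherwise build, save and return."""
    if os.path.exists(path):
        return load(path)
    result = build()
    save(result, path)
    return result

# Plain-text stand-ins for the expensive descriptor calculation.
calls = {"build": 0}

def build():
    calls["build"] += 1  # count how often the expensive step actually runs
    return "descriptor table"

def save(obj, path):
    with open(path, "w") as handle:
        handle.write(obj)

def load(path):
    with open(path) as handle:
        return handle.read()

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "cache.txt")
    first = load_or_build(cache_path, build, load, save)   # built and saved
    second = load_or_build(cache_path, build, load, save)  # read from disk

print(calls["build"])  # 1
```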

### Training RT Prediction Model

Next, train the RT prediction model. Initialize the `XGBoostTrainer` with the dataset containing the computed descriptors, and set the following parameters as needed:

- The `cv` parameter indicates the number of cross-validation splits. This value defaults to `10` for a 10-fold cross validation; this tutorial uses `cv=5`.
- The `n_cpu` parameter is the number of CPU cores to use for training. This value defaults to `None`, in which case all available cores are used.

Depending on your system, this can take ~20 minutes as the trainer performs a grid search over a large parameter space.

In [9]:
trainer = retip.XGBoostTrainer(dataset, cv=5)
trainer.train()

Fitting 5 folds for each of 56 candidates, totalling 280 fits
Training completed in 0:14:27.392221 with best RMSE -11.381
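The fit count reported above is the number of parameter candidates multiplied by the number of folds. The sketch below reproduces that arithmetic with a plain k-fold split. The parameter grid is hypothetical, chosen only so that it contains 56 combinations (it is not Retip's actual search space), and the split assumes all 1164 rows are usable and unshuffled.

```python
from itertools import product

def k_fold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)  # spread the remainder over early folds
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# Hypothetical grid: 7 x 8 = 56 candidates, matching only the count in the log.
param_grid = {
    "max_depth": [2, 3, 4, 5, 6, 7, 8],
    "learning_rate": [0.01, 0.02, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3],
}
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
folds = list(k_fold_indices(n_samples=1164, n_splits=5))

total_fits = len(candidates) * len(folds)
print(total_fits)  # 56 candidates x 5 folds = 280
```

Each candidate is fitted once per fold on the training indices and scored on the held-out indices, which is why lowering `cv` from 10 to 5 halves the number of fits.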


### Outlier identification

False annotations are identified by running the `outlier_identification` function. The input parameters are `trainer`, `dataset` and `prediction_column`. Optionally, set the `confidence_interval` (defaults to 95) and the `output_filename`.

Running the `outlier_identification` function will provide two results:

1. A plot showing the distribution of real vs. predicted retention times, overlaid with a simple linear fit and its confidence interval (95% by default). Annotations outside this CI window are highlighted in red.
2. A table listing the outliers with their name, retention time and predicted retention time.
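The idea behind this kind of check can be sketched with the standard library alone. The example below fits a least-squares line to predicted vs. observed RT and flags points whose residual exceeds a normal-quantile multiple of the residual standard error. This is a simplified band under stated assumptions, not Retip's actual computation; `flag_outliers` is a hypothetical helper and the data are synthetic.

```python
from statistics import NormalDist, mean

def flag_outliers(observed, predicted, confidence_interval=95):
    """Return indices of points lying outside a band around a least-squares line."""
    n = len(observed)
    x_bar, y_bar = mean(observed), mean(predicted)
    sxx = sum((x - x_bar) ** 2 for x in observed)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(observed, predicted))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [y - (intercept + slope * x) for x, y in zip(observed, predicted)]
    # Residual standard error with n - 2 degrees of freedom (two fitted parameters).
    s = (sum(r ** 2 for r in residuals) / (n - 2)) ** 0.5
    # Two-sided normal quantile, about 1.96 for a 95% level.
    z = NormalDist().inv_cdf(0.5 + confidence_interval / 200)
    return [i for i, r in enumerate(residuals) if abs(r) > z * s]

# Synthetic data: predictions match observations except for one large error.
observed = [float(i) for i in range(1, 21)]
predicted = list(observed)
predicted[9] += 8.0

print(flag_outliers(observed, predicted, confidence_interval=95))  # [9]
```

Because the fit and residuals do not depend on the confidence level, lowering `confidence_interval` only narrows the band, so every point flagged at 95% is also flagged at 90%.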

#### 95% CI

In [44]:
outliers = retip.visualization.outlier_identification(trainer, dataset, 'RTP', confidence_interval=95)

##### Annotated features

In [11]:
outliers[0]

| | Name | InChIKey | SMILES | RT | RTP |
|---|---|---|---|---|---|
| 0 | 1-thiazol-2-ylethanone (PhytoBank:PHY0136536) | MOMFXATYAINJML-UHFFFAOYSA-N | `CC(=O)C1=NC=CS1` | 0.774 | 1.257465 |
| 1 | Gramine | OCDGBSUVYYVKQZ-UHFFFAOYSA-N | `CN(C)CC1=CNC2=CC=CC=C21` | 0.925 | 1.888878 |
| 2 | 1-hydroxy-6-(1-hydroxy-2-methyl-propyl)-3-isob... | MEFPAPWNDHCGJX-UHFFFAOYSA-N | `CC(C)CC1=NC=C(C(O)C(C)C)[N+]([O-])=C1O` | 0.999 | 2.113504 |
| 3 | FEAU (PhytoBank:PHY0170129) | SWFJAJRDLUUIOA-UHFFFAOYSA-N | `CCC1=CN(C2OC(CO)C(O)C2F)C(=O)NC1=O` | 1.004 | 1.859538 |
| 4 | 3-Acetoxypyridine | QZDWODWEESGPLC-UHFFFAOYSA-N | `CC(=O)OC1=CN=CC=C1` | 1.007 | 1.712436 |
| ... | ... | ... | ... | ... | ... |
| 1159 | .gamma.-Undecalactone | PHXATPHONSXBIL-UHFFFAOYSA-N | `CCCCCCCC1CCC(=O)O1` | 11.662 | 8.944397 |
| 1160 | C13878 (KEGG:C13878) | ITQDKQBSRNKQCX-ZCXUNETKSA-N | `CCCCCCCCCCCCCCCC(=O)OCC(COP(=O)(CCN)O)OC(=O)CC...` | 11.943 | 11.568171 |
| 1161 | C14227 (KEGG:C14227) | MQIUGAXCHLFZKX-UHFFFAOYSA-N | `CCCCCCCCOC(=O)C1=CC=CC=C1C(=O)OCCCCCCCC` | 12.031 | 11.358490 |
| 1162 | N,N-Dimethyldodecylamine | YWFWDNVOPHGWMX-UHFFFAOYSA-N | `CCCCCCCCCCCCN(C)C` | 12.036 | 9.136919 |


##### Outliers

In [12]:
outliers[1]

| | Name | RT | RTP |
|---|---|---|---|
| 104 | 3'-Hydroxyrepaglinide | 2.404 | 5.269562 |
| 207 | Nummularine T (KNApSAcK:C00028745) | 3.149 | 5.127854 |
| 287 | Dibutyl phthalate | 3.675 | 5.794065 |
| 315 | 1beta,3beta-dihydroxypregna-5,16-dien-20-one 1... | 3.886 | 5.665067 |
| 322 | theopederin D (PhytoBank:PHY0026141) | 3.966 | 5.688123 |
| ... | ... | ... | ... |
| 1129 | Ustilipid A (KNApSAcK:C00014891) | 11.006 | 11.590704 |
| 1134 | 2-Ethylbutan-1-amine | 11.025 | 7.540581 |
| 1159 | .gamma.-Undecalactone | 11.662 | 8.944397 |
| 1162 | N,N-Dimethyldodecylamine | 12.036 | 9.136919 |


#### 90% CI

In [45]:
outliers = retip.visualization.outlier_identification(trainer, dataset, 'RTP', confidence_interval=90)

##### Annotated features

In [14]:
outliers[0]

| | Name | InChIKey | SMILES | RT | RTP |
|---|---|---|---|---|---|
| 0 | 1-thiazol-2-ylethanone (PhytoBank:PHY0136536) | MOMFXATYAINJML-UHFFFAOYSA-N | `CC(=O)C1=NC=CS1` | 0.774 | 1.257465 |
| 1 | Gramine | OCDGBSUVYYVKQZ-UHFFFAOYSA-N | `CN(C)CC1=CNC2=CC=CC=C21` | 0.925 | 1.888878 |
| 2 | 1-hydroxy-6-(1-hydroxy-2-methyl-propyl)-3-isob... | MEFPAPWNDHCGJX-UHFFFAOYSA-N | `CC(C)CC1=NC=C(C(O)C(C)C)[N+]([O-])=C1O` | 0.999 | 2.113504 |
| 3 | FEAU (PhytoBank:PHY0170129) | SWFJAJRDLUUIOA-UHFFFAOYSA-N | `CCC1=CN(C2OC(CO)C(O)C2F)C(=O)NC1=O` | 1.004 | 1.859538 |
| 4 | 3-Acetoxypyridine | QZDWODWEESGPLC-UHFFFAOYSA-N | `CC(=O)OC1=CN=CC=C1` | 1.007 | 1.712436 |
| ... | ... | ... | ... | ... | ... |
| 1159 | .gamma.-Undecalactone | PHXATPHONSXBIL-UHFFFAOYSA-N | `CCCCCCCC1CCC(=O)O1` | 11.662 | 8.944397 |
| 1160 | C13878 (KEGG:C13878) | ITQDKQBSRNKQCX-ZCXUNETKSA-N | `CCCCCCCCCCCCCCCC(=O)OCC(COP(=O)(CCN)O)OC(=O)CC...` | 11.943 | 11.568171 |
| 1161 | C14227 (KEGG:C14227) | MQIUGAXCHLFZKX-UHFFFAOYSA-N | `CCCCCCCCOC(=O)C1=CC=CC=C1C(=O)OCCCCCCCC` | 12.031 | 11.358490 |
| 1162 | N,N-Dimethyldodecylamine | YWFWDNVOPHGWMX-UHFFFAOYSA-N | `CCCCCCCCCCCCN(C)C` | 12.036 | 9.136919 |


##### Outliers

In [15]:
outliers[1]

| | Name | RT | RTP |
|---|---|---|---|
| 30 | Methoxetamine | 1.553 | 3.499939 |
| 86 | Pracinostat | 2.237 | 3.905103 |
| 104 | 3'-Hydroxyrepaglinide | 2.404 | 5.269562 |
| 164 | C00051;C02471 (KEGG:C00051;C02471) | 2.866 | 2.090412 |
| 207 | Nummularine T (KNApSAcK:C00028745) | 3.149 | 5.127854 |
| ... | ... | ... | ... |
| 1145 | C13878 (KEGG:C13878) | 11.284 | 11.568171 |
| 1151 | 1-O-acetyl-2-O-[(3R)-3-(acetyloxy)eicosanoyl]-... | 11.417 | 11.818275 |
| 1159 | .gamma.-Undecalactone | 11.662 | 8.944397 |
| 1162 | N,N-Dimethyldodecylamine | 12.036 | 9.136919 |
