![Retip](../images/retip_logo.png)
# Retip: Retention Time Prediction for Metabolomics

### Retention Time Prediction Overview

Retip is a tool for predicting retention times (RTs) of small molecules for high pressure liquid chromatography (HPLC) mass spectrometry.


### Loading Data

We begin by importing the retip library, which gives us access to the training, prediction and visualization functions.

In [1]:
import os
import sys

try:
    import retip
except ImportError:
    # Retip isn't installed: add the parent directory to the path so the local copy can be loaded
    sys.path.insert(1, os.path.join(sys.path[0], '..'))
    import retip

Now we can import our retention time dataset.  The user needs to prepare a compound retention time table in CSV or MS Excel format containing the compound name, retention time and chemical identifier.  Retip currently supports SMILES and PubChem CID as chemical identifiers.
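As a minimal sketch of such an input table, the snippet below writes a three-row CSV using only Python's standard library. The column headers `Name`, `RT` and `SMILES` mirror the columns visible in the annotated table later in this notebook; whether Retip requires exactly these headers is an assumption, and the file name `my_library.csv` and the retention time values are placeholders made up for illustration.

```python
import csv
import os
import tempfile

# Hypothetical rows: (compound name, retention time, SMILES).
# The retention times are placeholders, not measured values.
rows = [
    ("Ethanol", 1.2, "CCO"),
    ("Acetic acid", 0.9, "CC(=O)O"),
    ("Benzene", 5.4, "c1ccccc1"),
]

def write_rt_table(path, rows):
    """Write a compound name / retention time / identifier table as CSV."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Name", "RT", "SMILES"])  # assumed header names
        writer.writerows(rows)

# Written to a temporary directory here; in practice use any path you like.
csv_path = os.path.join(tempfile.mkdtemp(), "my_library.csv")
write_rt_table(csv_path, rows)
```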

Retip uses this input file to build the model, which can then predict retention times for other biochemical databases or for a query list of compounds. We suggest including at least 300 compounds in the file to build a good retention time prediction model.

Use the `retip.Dataset` class to create a new dataset.

* The `test_size` parameter defines the fraction of your dataset held out for testing/validation of the model (this example uses `0.2`, i.e. 20%)
* The `seed` parameter fixes the training/test split for the dataset, enabling reproducible model training
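To illustrate why a fixed seed makes the split reproducible, here is a small standard-library sketch of a seeded hold-out split. It is a conceptual illustration only, not Retip's internal implementation; the function name `split_indices` is hypothetical.

```python
import random

def split_indices(n_rows, test_size, seed):
    """Return (train, test) index lists for a reproducible random split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle order
    n_test = round(n_rows * test_size)    # number of held-out rows
    return sorted(indices[n_test:]), sorted(indices[:n_test])

train_a, test_a = split_indices(300, test_size=0.2, seed=101)
train_b, test_b = split_indices(300, test_size=0.2, seed=101)

print(len(train_a), len(test_a))  # 240 60
print(test_a == test_b)           # True: identical seed, identical split
```

Changing the seed gives a different but equally sized split, which is why reported scores are only comparable between runs that share the same seed.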

In [2]:
dataset = retip.Dataset('../example_data/Plasma_positive.xlsx', test_size=0.2, seed=101, sheet_name='lib_2')

Next, if your dataset does not already contain precalculated molecular descriptors, you can compute them with the [Mordred Molecular Descriptor Calculator](https://github.com/mordred-descriptor/mordred) in a single method call.  Note that molecules that cannot be parsed are retained in the dataset, but cannot be used for model training or validation.

In [3]:
dataset.calculate_descriptors()

Skipping molecular descriptor calculation, descriptors have already been calculated


Since molecular descriptor calculation is a time-consuming process, you can save the current state of your dataset. The next time you want to use this retention time library, load the exported file instead of the original table.

In [4]:
dataset.save_dataset('Plasma_positive_retip-processed.csv')

Saved dataset to Plasma_positive_retip-processed.csv
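One way to take advantage of the saved export is to check for it before falling back to the raw table. The helper below is a hedged sketch: `pick_dataset_file` is a hypothetical name, not part of Retip, and it only chooses a path. The chosen path would then be passed to `retip.Dataset` as in the loading cell above (the raw Excel file additionally needs its `sheet_name`).

```python
from pathlib import Path

def pick_dataset_file(raw_path, processed_path):
    """Prefer the processed export if it exists, else fall back to the raw table."""
    processed = Path(processed_path)
    return str(processed) if processed.exists() else str(raw_path)

source = pick_dataset_file(
    "../example_data/Plasma_positive.xlsx",
    "Plasma_positive_retip-processed.csv",
)
print(source)  # whichever of the two files should be loaded
```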


### Training RT Prediction Model

Here you can select the trainer used to build your RT prediction model.  This example uses XGBoost, but you can also use `AutoGluonTrainer`.  To initialize your trainer, pass in your dataset with computed descriptors along with any of the optional parameters:

* `cv` indicates the number of cross-validation splits (we recommend `cv=10` for a 10-fold cross validation)
* `n_cpu` is the number of CPU cores to use for training (if not specified, it will use all available cores)

Depending on your system, this can take ~20 minutes as the trainer performs a grid search over a large parameter space.
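The cost of a cross-validated grid search grows as the number of parameter combinations multiplied by the number of folds. The sketch below counts the fits for a hypothetical grid; the parameter names and values are illustrative assumptions and are not the grid Retip actually searches.

```python
from itertools import product

# Hypothetical search grid, for illustration only.
param_grid = {
    "max_depth": [2, 3, 4, 5, 6, 7, 8],
    "n_estimators": [100, 200, 300, 400, 500, 600, 700, 800],
}

def count_fits(param_grid, n_folds):
    """Return (number of candidates, total number of model fits)."""
    candidates = list(product(*param_grid.values()))
    return len(candidates), len(candidates) * n_folds

print(count_fits(param_grid, n_folds=10))  # (56, 560)
```

Every extra fold or grid value multiplies the training time, which is why the run time depends so strongly on `cv` and on the number of cores given by `n_cpu`.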

In [5]:
trainer = retip.XGBoostTrainer(dataset, cv=10, n_cpu=4)
trainer.train()

Fitting 2 folds for each of 56 candidates, totalling 112 fits
Training completed in 0:02:06.166422 with best RMSE 0.789


You can score this model using the internal testing data, or alternatively pass in a different `Dataset` object with precomputed descriptors.

In [6]:
trainer.score()

{'root_mean_squared_error': 0.5877174338636932,
 'mean_absolute_error': 0.5444404803863679,
 'explained_variance_score': 0.8595191501106432,
 'r2_score': 0.8594006451230303,
 'pearson_correlation': 0.9291983987303947,
 'mean_squared_error': 0.5877174338636932,
 'median_absolute_error': 0.37269119262695316,
 '95_percent_confidence_interval': 1.2827148930884547}
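For readers who want to see what several of these scores measure, the sketch below computes them from their textbook definitions using only the standard library. Whether Retip computes each score in exactly this way is an assumption, and the confidence interval is omitted because its definition is not stated here. The input values are toy numbers, not Retip output.

```python
import math
import statistics

def regression_metrics(y_true, y_pred):
    """Compute common regression scores from measured and predicted values."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    squared = [e * e for e in errors]
    mean_true = statistics.fmean(y_true)
    ss_res = sum(squared)                               # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "root_mean_squared_error": math.sqrt(statistics.fmean(squared)),
        "mean_absolute_error": statistics.fmean(abs_errors),
        "median_absolute_error": statistics.median(abs_errors),
        "r2_score": 1 - ss_res / ss_tot,
    }

# Toy measured vs. predicted retention times.
scores = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
print(scores)
```

The median absolute error is less sensitive to a few badly predicted compounds than the RMSE, so comparing the two gives a quick sense of whether errors are driven by outliers.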

### External Validation

You can also test the model using an external dataset.  We begin by loading the data in the same fashion. Since we aren't training on these data, we don't need to provide test/training split parameters.

Even though molecular descriptors need to be calculated, you don't need to explicitly call the function.  If the trainer finds that descriptors are missing, it will calculate them for you.

In [7]:
val_data = retip.Dataset('../example_data/Plasma_positive.xlsx', sheet_name='ext')

In [8]:
trainer.score(val_data)

  0%|          | 0/320 [00:00<?, ?it/s]

Calculating descriptors for 320 structures

100%|██████████| 320/320 [01:49<00:00,  2.92it/s]


{'root_mean_squared_error': 0.9758720648620701,
 'mean_absolute_error': 0.7128168755403443,
 'explained_variance_score': 0.7861313154536403,
 'r2_score': 0.7860491441667808,
 'pearson_correlation': 0.8883198442717539,
 'mean_squared_error': 0.9758720648620701,
 'median_absolute_error': 0.5237401771545414,
 '95_percent_confidence_interval': 1.605217204241618}

The RMSE and other scores on the external validation set are noticeably worse than on the internal test split (RMSE of about 0.98 versus 0.59), suggesting that our training set isn't sufficiently representative of the chemical space covered by the external compounds.

### RT Prediction

You can now use the trained model to predict retention times for a new dataset.  

In [9]:
y_pred = trainer.predict(val_data)
y_pred[:25]

array([6.7116394, 6.352397 , 6.2965903, 6.745089 , 6.4961596, 6.592478 ,
       6.754655 , 6.40973  , 6.322197 , 6.352319 , 6.384519 , 6.3995295,
       6.440836 , 6.3601766, 6.6003065, 6.445873 , 6.138191 , 6.348131 ,
       6.306133 , 6.3378386, 6.244544 , 6.2213554, 8.327709 , 6.6690598,
       6.3473577], dtype=float32)

This is great, but a list of numbers isn't very useful.  Instead, we can annotate our dataset:

In [10]:
trainer.annotate(val_data)

In [11]:
val_data.head()

| | Name | InChIKey | SMILES | RT | RTP | ABC | ABCGG | nAcid | nBase | SpAbs_A | ... | SRW10 | TSRW10 | MW | AMW | WPath | WPol | Zagreb1 | Zagreb2 | mZagreb1 | mZagreb2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Soyasapogenol E base + O-Hex-HexA | YDNHBSRZSMNZPB-NJAHCQCINA-N | O=C(O)C7OC(OC2CCC3(C)(C4CC=C1C5CC(C)(C)CC(=O)C... | 7.84 | 6.711639 | 44.836157 | 32.982369 | 1 | 0 | 70.320901 | ... | 11.553443 | 96.348334 | 794.445257 | 6.511846 | 12554 | 131 | 330.0 | 420.0 | 23.263889 | 11.631944 |
| 1 | Soyasapogenol E base + O-HexA-Hex-dHex | CROUPKILZUPLQA-UHFFFAOYNA-N | O=C(O)C8OC(OC2CCC3(C)(C4CC=C1C5CC(C)(C)CC(=O)C... | 7.62 | 6.352397 | 52.780741 | 38.46448 | 1 | 0 | 83.127288 | ... | 11.673657 | 107.062212 | 940.503166 | 6.623262 | 19659 | 152 | 386.0 | 489.0 | 27.319444 | 13.743056 |
| 2 | Soyasapogenol E base + O-HexA-Hex-Hex | JTXVTHCLTOUSSL-UHFFFAOYNA-N | O=C(O)C8OC(OC2CCC3(C)(C4CC=C1C5CC(C)(C)CC(=O)C... | 7.49 | 6.29659 | 53.378458 | 38.971182 | 1 | 0 | 84.423353 | ... | 11.680125 | 108.106572 | 956.49808 | 6.688798 | 20502 | 154 | 390.0 | 494.0 | 27.569444 | 14.076389 |
| 3 | Soyasapogenol B base + O-HexA-Pen-dHex, O-C6H7... | VWKBHQGCNGULAZ-UHFFFAOYNA-N | O=C(O)C9OC(OC3CCC4(C)(C5CC=C2C6CC(C)(C)CC(OC1O... | 7.79 | 6.745089 | 58.615935 | 41.117105 | 1 | 0 | 92.117673 | ... | 11.738011 | 114.48911 | 1038.539945 | 6.700258 | 27052 | 164 | 426.0 | 536.0 | 29.402778 | 15.104167 |
| 4 | Soyasapogenol B base + O-HexA-Hex-Pen, O-dHex | REIWEXDMDVAAEI-UHFFFAOYNA-N | CC1OC(OC2CC(C)(C)CC3C4=CCC5C6(C)CCC(OC7OC(C(O)... | 5.56 | 6.49616 | 59.949268 | 42.489658 | 1 | 0 | 94.85473 | ... | 11.764827 | 116.612958 | 1074.561074 | 6.674292 | 28816 | 170 | 436.0 | 550.0 | 30.513889 | 15.659722 |


Now our dataset has a new `RTP` column with the predicted retention time! If some molecules could not be loaded, or their descriptors could not be calculated, you will see an empty/null value in the `RTP` column for those rows. We can export this annotated dataset and exclude the descriptor columns by:

In [12]:
val_data.save_dataset('Plasma_positive_retip-ext-annotated.csv', include_descriptors=False)

Saved dataset to Plasma_positive_retip-ext-annotated.csv
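Once the annotated file is exported, it can be inspected without Retip. The sketch below reads the CSV with the standard library and ranks compounds by the absolute difference between measured (`RT`) and predicted (`RTP`) retention time. It assumes the export contains `Name`, `RT` and `RTP` columns and that missing predictions are written as empty fields; `largest_errors` is a hypothetical helper, not a Retip function.

```python
import csv

def largest_errors(path, top_n=5):
    """Return (name, rt, rtp, abs_error) tuples sorted by descending error."""
    results = []
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            if not row.get("RTP"):  # skip molecules without a prediction
                continue
            rt, rtp = float(row["RT"]), float(row["RTP"])
            results.append((row["Name"], rt, rtp, abs(rt - rtp)))
    results.sort(key=lambda item: item[3], reverse=True)
    return results[:top_n]

# Example usage (assumes the export from the cell above exists):
# for name, rt, rtp, err in largest_errors("Plasma_positive_retip-ext-annotated.csv"):
#     print(f"{name}: measured {rt:.2f}, predicted {rtp:.2f}, error {err:.2f}")
```

Compounds at the top of this list are natural candidates for checking structure assignments or for adding similar chemistry to the training set.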
