![Retip](../../images/retip_logo.png)
# Retip: Retention Time Prediction for Metabolomics and Lipidomics

### Retention Time Prediction Overview

Retip is a tool for predicting the retention times (RTs) of small molecules in high-pressure liquid chromatography (HPLC) mass spectrometry.

### Training a Model with AutoGluon

[AutoGluon](https://auto.gluon.ai) is an AutoML library designed to automate the full machine learning pipeline, including feature preprocessing, training multiple model types, and constructing ensembles of models to improve overall accuracy.

We begin by importing the retip library, which gives us access to the training, prediction and visualization functions.

In [None]:
try:
    import retip
except ImportError:
    # add the parent directory to the path to load the Retip library locally in case it isn't installed
    import os, sys
    sys.path.insert(1, os.path.join(sys.path[0], '../..'))

    import retip

### Loading Data

Now we can import our retention time dataset. The user needs to prepare a compound retention time table in CSV or MS Excel format containing the compound name, retention time and chemical identifier. Retip currently supports SMILES and PubChem CID as chemical identifiers. Retip uses this input file to build the model, which can then predict retention times for other biochemical databases or for an input query list of compounds.

Use the `retip.Dataset` class to create a new dataset.

* The `test_size` parameter defines what percentage of your dataset should be held out for testing/validation of the model (for example, 20%)
* The `seed` parameter fixes the training/test split of the dataset, enabling reproducible model training

In this tutorial, we are using a pre-split dataset (the training and test sets are provided as separate CSV files) that contains a SMILES structure for each entry, so no splitting parameters are required.
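
To make the roles of `test_size` and `seed` concrete, here is a minimal sketch of a seeded hold-out split written with the Python standard library only. It is not Retip's implementation: the helper name `split_rows`, its default values and the integer stand-in rows are assumptions made for illustration.

```python
import random


def split_rows(rows, test_size=0.2, seed=101):
    """Shuffle rows with a fixed seed and hold out a `test_size` fraction.

    Hypothetical helper for illustration; not part of the Retip API.
    """
    rng = random.Random(seed)  # private generator, leaves global random state alone
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    # First n_test shuffled rows become the test set, the rest the training set
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in "dataset": ten row indices
rows = list(range(10))

train_a, test_a = split_rows(rows, test_size=0.2, seed=101)
train_b, test_b = split_rows(rows, test_size=0.2, seed=101)

print(len(train_a), len(test_a))
print((train_a, test_a) == (train_b, test_b))
```

Calling the helper twice with the same seed yields the same partition, which is what makes a later training run comparable to an earlier one.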

In [None]:
dataset = retip.Dataset(target_column='RT').load_retip_dataset(
    training='lipidomics_c18_retip_training.csv',
    testing='lipidomics_c18_retip_testing.csv')

In [None]:
dataset.head(2)

In [None]:
dataset.calculate_descriptors()
dataset.preprocess_features('lipidomics')

In [None]:
dataset.save_retip_dataset('lipidomics_c18_retip_preprocessed')

### Training RT Prediction Model

As with XGBoost, we create a model trainer. The main difference is that we no longer need a cross-validation parameter (AutoGluon takes care of this); instead, we specify the training time in minutes. Here, we select `training_duration=10` to train for 10 minutes.

In [None]:
trainer = retip.AutoGluonTrainer(dataset, training_duration=10)
trainer.train()

You can score this model using the internal testing data, or alternatively pass in a different `Dataset` object with precomputed descriptors.

In [None]:
trainer.score(plot=True)
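
When judging a retention time model, it can help to know how common regression metrics are computed. The sketch below is a plain-Python illustration and makes no claim about which metrics `trainer.score` actually reports; the helper `regression_metrics` is hypothetical and the observed/predicted values are made-up numbers, not results from this dataset.

```python
import math


def regression_metrics(observed, predicted):
    """Return RMSE, MAE and R^2 for paired observed/predicted values.

    Hypothetical helper for illustration; not part of the Retip API.
    """
    if not observed or len(observed) != len(predicted):
        raise ValueError("observed and predicted must be non-empty and equal in length")

    errors = [p - o for o, p in zip(observed, predicted)]
    n = len(errors)

    ss_res = sum(e * e for e in errors)  # residual sum of squares
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)  # total sum of squares

    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        # R^2 is undefined when all observed values are identical
        "r2": 1 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
    }


# Made-up retention times in minutes, for illustration only
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

metrics = regression_metrics(observed, predicted)
print(metrics)
```

RMSE penalizes large individual errors more heavily than MAE, so comparing the two gives a rough sense of whether a few outliers dominate the error.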

If you use AutoGluon a lot, remember to clear out old models from the `AutogluonModels` directory!
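
One way to do that clean-up is sketched below, assuming each training run is stored in its own subdirectory of `AutogluonModels`. The helper `remove_old_models`, the seven-day threshold and the `dry_run` flag are assumptions for illustration and are not part of Retip or AutoGluon.

```python
import shutil
import time
from pathlib import Path


def remove_old_models(root="AutogluonModels", max_age_days=7, dry_run=True):
    """Find (and optionally delete) run directories older than `max_age_days`.

    Hypothetical helper for illustration. With dry_run=True nothing is
    deleted; the directories that would be removed are only returned.
    """
    root = Path(root)
    if not root.is_dir():
        return []

    cutoff = time.time() - max_age_days * 86400  # 86400 seconds per day
    selected = []
    for run_dir in sorted(root.iterdir()):
        # Age is judged by the directory's last modification time
        if run_dir.is_dir() and run_dir.stat().st_mtime < cutoff:
            if not dry_run:
                shutil.rmtree(run_dir)
            selected.append(run_dir)
    return selected


# List what would be deleted first; rerun with dry_run=False to actually delete
candidates = remove_old_models(dry_run=True)
print(candidates)
```

Running with `dry_run=True` first is a deliberate safety step, since `shutil.rmtree` deletes directories permanently.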