![Retip](../../images/retip_logo.png)
# Retip: Retention Time Prediction for Metabolomics and Lipidomics

### Retention Time Prediction Overview

Retip is a tool for predicting retention times (RTs) of small molecules in high-pressure liquid chromatography (HPLC) mass spectrometry.

### Loading Data

We begin by importing the retip library, which gives us access to the training, prediction and visualization functions.

In [None]:
try:
    import retip
except ImportError:
    # add the parent directory to the path to load the Retip library locally in case it isn't installed
    import os, sys
    sys.path.insert(1, os.path.join(sys.path[0], '../..'))
    
    import retip

Now we can import our retention time dataset.  The user needs to prepare a compound retention time table in CSV or MS Excel format containing the compound name, retention time and chemical identifier.  Retip currently supports SMILES and PubChem CID as chemical identifiers.

Retip uses this input file to build the model, which can then predict retention times for other biochemical databases or for an input query list of compounds. We suggest that the file contain at least 300 compounds in order to build a good retention time prediction model.
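Before loading, it can help to sanity-check the input table. The sketch below is not part of Retip and uses only the Python standard library. The helper `check_rt_table` and the column names `Name` and `SMILES` are assumptions made for illustration (`RT` matches the target column used in this notebook), and the example rows are placeholders rather than real measurements.

```python
import csv
import io

REQUIRED_COLUMNS = ("Name", "RT", "SMILES")  # assumed column names
MIN_COMPOUNDS = 300  # suggested minimum from the text above


def check_rt_table(handle, required=REQUIRED_COLUMNS, min_rows=MIN_COMPOUNDS):
    """Return a list of human-readable problems found in a CSV retention time table."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    problems = [f"missing column: {c}" for c in required if c not in header]
    has_rt = "RT" in header

    n_rows = 0
    for row in reader:
        n_rows += 1
        if not has_rt:
            continue  # already reported as a missing column
        rt = (row.get("RT") or "").strip()
        try:
            float(rt)
        except ValueError:
            problems.append(f"row {n_rows}: RT is not numeric ({rt!r})")

    if n_rows < min_rows:
        problems.append(f"only {n_rows} compounds; at least {min_rows} suggested")
    return problems


# Tiny in-memory table with placeholder values, for illustration only
example = io.StringIO("Name,RT,SMILES\ncompound A,1.2,CCO\ncompound B,n/a,CCN\n")
issues = check_rt_table(example)
for issue in issues:
    print(issue)
```

For a real file, pass an open file handle instead of the `io.StringIO` object, for example `open('my_table.csv', newline='')`.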

Use the `retip.Dataset` class to create a new dataset.

* The `test_size` parameter defines what percentage of your dataset should be used for testing/validation of the model (this example uses 20%)
* The `seed` parameter sets a specific training/test split for the database, enabling reproducible model training

In [None]:
dataset = retip.Dataset(target_column='RT').load_retip_dataset(
    training='Plasma_positive.xlsx', training_sheet_name='lib_2',
    validation='Plasma_positive.xlsx', validation_sheet_name='ext')

In [None]:
dataset.head(2)

Above you can see the first few rows of our starting dataset.  It contains the three requirements described before: name, retention time and chemical identifier (SMILES).

Next, if your dataset does not already contain precalculated molecular descriptors, you can compute them with the [Mordred Molecular Descriptor Calculator](https://github.com/mordred-descriptor/mordred) by calling a simple function.  Note that molecules that cannot be parsed will be retained in the dataset, but cannot be used for model training or validation.

In [None]:
dataset.calculate_descriptors()

We can view a summary of the data sets using the `describe` function, which reports the number of rows and columns in each data frame:

In [None]:
dataset.describe()

It is important to perform feature reduction before training. Retip provides a basic tool to remove features with missing values and to restrict feature sets to descriptors which calculate non-null values for large sets of molecules.

In [None]:
dataset.preprocess_features('metabolomics')
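To illustrate the idea behind this step, here is a minimal plain-Python sketch of dropping descriptors that have missing values for any molecule. It is not Retip's implementation: the helper names `is_missing` and `drop_incomplete_features`, the descriptor names, and the values are all hypothetical.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing descriptor values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_incomplete_features(rows, feature_names):
    """Keep only the features that have a value for every molecule in `rows`."""
    return [
        name for name in feature_names
        if not any(is_missing(row.get(name)) for row in rows)
    ]


# Hypothetical descriptor table: three molecules, three descriptors
rows = [
    {"desc_a": 180.2, "desc_b": 1.3, "desc_c": 63.6},
    {"desc_a": 194.2, "desc_b": float("nan"), "desc_c": 58.4},
    {"desc_a": 46.1, "desc_b": -0.3, "desc_c": None},
]
kept = drop_incomplete_features(rows, ["desc_a", "desc_b", "desc_c"])
print(kept)
```

Only `desc_a` survives here, because the other two descriptors are missing for at least one molecule.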

Finally, it is possible to load separate files for training, testing and validation sets, but here we loaded only a single file which we then need to split using the `split_dataset` function.
* `test_split` defines what percentage of your dataset should be used for testing of the model's accuracy (this example uses 20%)
* `validation_split` constructs an additional dataset for validation if desired
* `seed` sets a specific training/test split for the database, enabling reproducible model training

In [None]:
dataset.split_dataset(test_split=0.2, seed=101)
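Why does fixing the seed make the split reproducible? The following standalone sketch shows the general mechanism with Python's `random` module. How Retip performs the split internally is not shown in this notebook, so treat `split_indices` as a hypothetical helper that only mirrors the `test_split` and `seed` parameter names.

```python
import random


def split_indices(n_rows, test_split=0.2, seed=None):
    """Shuffle row indices with a seeded RNG and split them into train/test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # a private RNG leaves global state untouched
    n_test = round(n_rows * test_split)
    return indices[n_test:], indices[:n_test]


train_a, test_a = split_indices(10, test_split=0.2, seed=101)
train_b, test_b = split_indices(10, test_split=0.2, seed=101)
print(test_a == test_b)  # the same seed reproduces the same partition
```

Every row lands in exactly one of the two lists, and rerunning with the same seed always selects the same test rows.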

If we look at the dataset summary again, we can see that the data sets have been updated.

In [None]:
dataset.describe()

Since molecular descriptor calculation is a time-consuming process, you can save the current state of your dataset. The next time you want to use this retention time library, just load this export instead.  Note that we do not need to include a file extension, since Retip will append the dataset type to the filename we provide.

In [None]:
dataset.save_retip_dataset('Plasma_positive_retip_processed')

These files can be loaded by running
```
dataset.load_retip_dataset('Plasma_positive_retip_processed_training.csv',
                           'Plasma_positive_retip_processed_testing.csv',
                           'Plasma_positive_retip_processed_validation.csv')
```

### Training RT Prediction Model

Here you can select the trainer used to build your RT prediction model.  For this example we use XGBoost, but you can also use `AutoGluonTrainer`.  To initialize your trainer, pass in your dataset with computed descriptors along with any of the optional parameters:

* `cv` indicates the number of cross-validation splits (we recommend `cv=10` for 10-fold cross-validation; the example below uses `cv=5`)
* `n_cpu` is the number of CPU cores to use for training (if not specified, it will use all available cores)

Depending on your system, this can take ~20 minutes as the trainer performs a grid search over a large parameter space.

In [None]:
trainer = retip.XGBoostTrainer(dataset, cv=5)
trainer.train()
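For readers unfamiliar with the `cv` parameter, the sketch below shows how k-fold cross-validation partitions rows: each row is used for validation exactly once and for training in all other folds. This is a generic illustration with a hypothetical helper `k_fold_indices`; the trainer's actual fold assignment (for example, whether it shuffles first) is not described in this notebook.

```python
def k_fold_indices(n_rows, cv=5):
    """Yield (train, validation) index lists for `cv` contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_rows // cv + (1 if i < n_rows % cv else 0) for i in range(cv)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_rows))
        yield train, validation
        start = stop


folds = list(k_fold_indices(12, cv=5))
for train, validation in folds:
    print(len(train), validation)
```

A grid search repeats this procedure for every parameter combination, which is why the number of folds has a direct effect on training time.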

You can score this model using the internal testing data, or alternatively pass in a different `Dataset` object with precomputed descriptors.  The plot parameter is optional but allows you to visualize how well the model works.

In [None]:
trainer.score(plot=True)

### External Validation

You can also test the model using the external validation dataset that we loaded initially. Since we are providing a dataset explicitly, we must also specify the target column.

In [None]:
trainer.score(dataset.get_validation_data(), target_column='RT', plot=True)

The RMSE and other scores on the external validation set are significantly worse than on our training and test sets, suggesting that our training set isn't sufficiently representative of our chemical space.
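As a reminder of what the RMSE measures, here is a small standalone sketch that computes it by hand. The mean absolute error (MAE) is included as a common companion metric, not necessarily one that Retip reports, and the retention times are made-up numbers for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large deviations more strongly."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error: the average size of the deviation."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical retention times in minutes
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.0, 4.5]

print(f"RMSE: {rmse(observed, predicted):.3f}")
print(f"MAE:  {mae(observed, predicted):.3f}")
```

Because the RMSE squares each error, a single badly predicted compound raises it more than it raises the MAE, so comparing the two can hint at whether a poor score comes from a few outliers.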

### RT Prediction

You can now use the trained model to predict retention times for a new dataset.  

In [None]:
y_pred = trainer.predict(dataset.get_validation_data())
y_pred[:25]

This is great, but a list of numbers isn't very useful.  Instead, we can annotate our dataset:

In [None]:
annotated = trainer.annotate(dataset.get_validation_data(include_metadata=True), prediction_column='RTP')

In [None]:
annotated.head()

Now our dataset has a new `RTP` column with the predicted retention time! If some molecules could not be loaded, or their descriptors could not be calculated, you will see an empty/null value in the `RTP` column.
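The null-handling behavior described above can be pictured with a short plain-Python sketch. It is not how `trainer.annotate` is implemented: `annotate_rows`, `toy_model`, and the row layout are hypothetical stand-ins chosen only to show rows without descriptors receiving an empty prediction.

```python
def annotate_rows(rows, predict, prediction_column="RTP"):
    """Return copies of `rows` with a prediction column; None where no descriptors exist."""
    annotated = []
    for row in rows:
        descriptors = row.get("descriptors")
        value = predict(descriptors) if descriptors is not None else None
        annotated.append({**row, prediction_column: value})
    return annotated


def toy_model(descriptors):
    # Stand-in for a trained model: a fixed linear function of one descriptor
    return 0.5 * descriptors[0] + 1.0


rows = [
    {"Name": "compound A", "descriptors": [2.0]},
    {"Name": "compound B", "descriptors": None},  # e.g. an unparseable structure
]
result = annotate_rows(rows, toy_model)
for row in result:
    print(row["Name"], row["RTP"])
```

Keeping such rows in the output, rather than dropping them, preserves the one-to-one correspondence with the input table.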

### Saving/Loading Models

Once you produce a model you're happy with, you can save it to avoid needing to retrain in the future.

In [None]:
trainer.save_model('Plasma_positve_xgboost-model.sav')

This exported model can then be reloaded and used to score datasets and predict new retention times.  However, unless a dataset is first passed to the trainer, the model cannot be retrained.

In [None]:
trainer = retip.XGBoostTrainer()
trainer.load_model('Plasma_positve_xgboost-model.sav')

In [None]:
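# score the reloaded model on the external validation data loaded earlier
trainer.score(dataset.get_validation_data(), target_column='RT')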