![Retip](../../images/retip_logo.png)
# Retip: Retention Time Prediction for Metabolomics and Lipidomics

### Identifying False Annotations

Given an annotated metabolomics dataset, we can use retention time prediction to identify likely misannotated features.  To begin, we import Retip and load our data as usual. Outlier analysis requires the same type of input as model building, namely a table of compound Name, SMILES (or other supported chemical identifier) and retention time.
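Before loading, it can help to confirm that the annotation table has the expected shape. The following is a minimal sketch using only the standard library, not part of Retip. The column names `Name`, `SMILES` and `RT` and the helper `check_annotation_table` are assumptions made for illustration; adjust them to the headers your Retip version expects.

```python
import csv
import io

# Hypothetical column names (an assumption, not taken from the Retip docs).
REQUIRED_COLUMNS = ("Name", "SMILES", "RT")


def check_annotation_table(text):
    """Return a list of problems found in CSV text; an empty list means none were found."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [f"missing column: {c}" for c in missing]

    problems = []
    for row in reader:
        line_no = reader.line_num  # source line the current row ended on
        if not (row["Name"] or "").strip() or not (row["SMILES"] or "").strip():
            problems.append(f"line {line_no}: empty Name or SMILES")
        try:
            float(row["RT"])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: RT is not numeric")
    return problems


# Placeholder row for illustration only; the retention time is made up.
sample = "Name,SMILES,RT\nCaffeine,CN1C=NC2=C1C(=O)N(C(=O)N2C)C,3.2\n"
print(check_annotation_table(sample))
```

To check a file on disk, read it first, for example `check_annotation_table(open('tomato_annotations.csv').read())`.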

Since we are not building this model to apply to other datasets, we do not need to split the data into separate training and test sets.

In [None]:
try:
    import retip
except ImportError:
    # Retip isn't installed: add the repository root to the path so the library loads locally
    import os, sys
    sys.path.insert(1, os.path.join(sys.path[0], '../..'))
    
    import retip

In [None]:
dataset = retip.Dataset().load_retip_dataset('tomato_annotations.csv')
dataset.calculate_descriptors()
dataset.preprocess_features('metabolomics')

In [None]:
dataset.describe()

In [None]:
trainer = retip.XGBoostTrainer(dataset, cv=5)
trainer.train()

Running the `outlier_identification` function will provide two results:

1. A plot showing the distribution of real vs. predicted retention times, overlaid with a simple linear fit and its 95% confidence interval; any annotations that fall outside this CI window are highlighted in red
2. A table listing the outliers with their name, retention time and predicted retention time
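To make the idea behind the plot concrete, here is a minimal standard-library sketch of flagging points that fall outside a band around a linear fit. It is not Retip's implementation: the function name `flag_outliers` is hypothetical, and the band is approximated as a normal quantile times the residual standard deviation, which may differ from how `outlier_identification` computes its interval.

```python
import math
from statistics import NormalDist, mean


def flag_outliers(observed, predicted, confidence=95):
    """Return indices of points whose residual from a linear fit lies outside the band."""
    if len(observed) != len(predicted) or len(observed) < 3:
        raise ValueError("need two equal-length sequences with at least 3 points")

    # Ordinary least-squares fit: predicted ~ slope * observed + intercept
    mx, my = mean(observed), mean(predicted)
    sxx = sum((x - mx) ** 2 for x in observed)
    if sxx == 0:
        raise ValueError("observed retention times must not all be identical")
    sxy = sum((x - mx) * (y - my) for x, y in zip(observed, predicted))
    slope = sxy / sxx
    intercept = my - slope * mx

    # Residual standard deviation (two fitted parameters, so n - 2 degrees of freedom)
    residuals = [y - (slope * x + intercept) for x, y in zip(observed, predicted)]
    sigma = math.sqrt(sum(r * r for r in residuals) / (len(residuals) - 2))

    # Two-sided normal quantile: about 1.96 for 95%, about 1.64 for 90%
    z = NormalDist().inv_cdf(0.5 + confidence / 200)
    return [i for i, r in enumerate(residuals) if abs(r) > z * sigma]
```

Calling `flag_outliers(rt_observed, rt_predicted, confidence=90)` narrows the band relative to the default 95%, so it can only flag the same points or more.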

In [None]:
outliers = retip.visualization.outlier_identification(trainer, dataset, 'RTP', confidence_interval=95)
outliers

You can also select a different confidence interval threshold, for example 90% CI:

In [None]:
outliers = retip.outlier_identification(trainer, dataset, confidence_interval=90)
outliers