![Retip](../images/retip_logo.png)
# Retip: Retention Time Prediction for Metabolomics

### Training a Model with AutoGluon

[AutoGluon](https://auto.gluon.ai) is an AutoML library designed to automate the full machine learning pipeline, including feature preprocessing, training multiple model types, and constructing ensembles of models to improve overall accuracy.

Because AutoGluon trains and ensembles many models, the final model accuracy usually improves the longer it is allowed to train. We recommend 1-2 hours.

We begin by importing the Retip library, loading our datasets, and calculating descriptors as before.

In [2]:
try:
    import retip
except ImportError:
    # Retip isn't installed: add the parent directory to the path to load it locally
    import os, sys
    sys.path.insert(1, os.path.join(sys.path[0], '..'))

    import retip

In [3]:
dataset = retip.Dataset(target_column='RT').load_retip_dataset(
    training='../example_data/Plasma_positive.xlsx', training_sheet_name='lib_2',
    validation='../example_data/Plasma_positive.xlsx', validation_sheet_name='ext')

In [4]:
dataset.calculate_descriptors()
dataset.preprocess_features('metabolomics')
dataset.split_dataset(test_split=0.2, seed=101)

Calculating descriptors for training dataset


100%|██████████| 494/494 [02:02<00:00,  4.03it/s]


Calculating descriptors for validation dataset


100%|██████████| 358/358 [02:01<00:00,  2.94it/s]


Reduced feature set from 1613 to 817
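
The call `split_dataset(test_split=0.2, seed=101)` sets aside a fraction of the training data for internal testing, and the fixed seed makes that split reproducible. The sketch below illustrates the idea using only the standard library. `split_indices` is a hypothetical helper, and Retip's own implementation may shuffle or round differently.

```python
import random

def split_indices(n_rows, test_split=0.2, seed=101):
    """Return (train_indices, test_indices) for a reproducible random split."""
    rng = random.Random(seed)           # a fixed seed gives the same shuffle every run
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = round(n_rows * test_split) # number of rows held out for testing
    return indices[n_test:], indices[:n_test]

# 494 is the number of training compounds shown in the progress bar above
train_idx, test_idx = split_indices(494)
print(len(train_idx), len(test_idx))
```

Calling `split_indices` again with the same seed returns the same partition, which is what makes scores comparable between runs.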


### Training RT Prediction Model

As with XGBoost, we create a model trainer. The main difference is that we no longer need a cross-validation parameter (AutoGluon handles this internally); instead, we specify the training time in minutes. Here, we set `training_duration=60` to train for 60 minutes.
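
For reference, AutoGluon's own `TabularPredictor.fit` takes its budget as `time_limit` in seconds, so a duration given in minutes has to be converted at some point. The sketch below shows how this could look when calling AutoGluon directly on a pandas DataFrame. `minutes_to_seconds` and `fit_autogluon` are hypothetical helpers, and it is an assumption that `AutoGluonTrainer` does something similar internally.

```python
def minutes_to_seconds(minutes):
    """Convert a training budget in minutes to whole seconds."""
    if minutes <= 0:
        raise ValueError("training duration must be positive")
    return int(minutes * 60)

def fit_autogluon(train_df, label="RT", training_minutes=60):
    """Fit an AutoGluon regressor on train_df within the given time budget."""
    # Imported lazily so minutes_to_seconds works without AutoGluon installed.
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(label=label, problem_type="regression")
    # time_limit is expressed in seconds
    predictor.fit(train_df, time_limit=minutes_to_seconds(training_minutes))
    return predictor

print(minutes_to_seconds(60))
```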

In [5]:
trainer = retip.AutoGluonTrainer(dataset, training_duration=60)
trainer.train()

No path specified. Models will be saved in: "AutogluonModels/ag-20220907_132926/"


ValueError: Preset 'high_quality' was not found. Valid presets: ['best_quality', 'high_quality_fast_inference_only_refit', 'good_quality_faster_inference_only_refit', 'medium_quality_faster_train', 'optimize_for_deployment', 'ignore_text']
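
The `ValueError` above lists the presets accepted by the installed AutoGluon version, and `high_quality` is not among them. One way to guard against this kind of mismatch is to check the requested preset against the accepted names before training. The sketch below is plain Python: `choose_preset` is a hypothetical helper, not part of Retip or AutoGluon, and the list is copied from the error message rather than queried from the library.

```python
# Preset names copied from the error message above
VALID_PRESETS = [
    "best_quality",
    "high_quality_fast_inference_only_refit",
    "good_quality_faster_inference_only_refit",
    "medium_quality_faster_train",
    "optimize_for_deployment",
    "ignore_text",
]

def choose_preset(requested, valid=VALID_PRESETS, fallback="best_quality"):
    """Return `requested` if it is accepted, otherwise fall back to a known preset."""
    if requested in valid:
        return requested
    if fallback not in valid:
        raise ValueError(f"Neither {requested!r} nor {fallback!r} is a valid preset")
    return fallback

print(choose_preset("high_quality"))
```

Falling back silently changes the quality and runtime trade-off, so in practice it may be preferable to log the substitution or to pin an AutoGluon version that provides the preset you want.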

Once training completes successfully, you can score the model using the internal testing data, or alternatively pass in a different `Dataset` object with precomputed descriptors.

In [None]:
trainer.score()
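
The output of `trainer.score()` is not shown in this notebook. As a rough illustration of the metrics commonly used to judge retention time predictions, the sketch below computes RMSE, MAE, and R² in plain Python. `regression_metrics` is a hypothetical helper, the retention times are made-up values for demonstration only, and which metrics Retip actually reports is not confirmed here.

```python
import math

def regression_metrics(observed, predicted):
    """Compute RMSE, MAE, and R^2 for paired observed/predicted values."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
    }

# Illustrative retention times in minutes (not real data)
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
print(regression_metrics(observed, predicted))
```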

### External Validation

In [None]:
val_data = retip.Dataset('../example_data/Plasma_positive.xlsx', sheet_name='ext')

In [None]:
trainer.score(val_data)

While we still observe the same issue where our training data is not sufficiently representative of our chemical space, the accuracy of the AutoGluon model is notably better than that of the XGBoost model, even after a relatively short training run.

### Saving/Loading Models

AutoGluon automatically saves its models to a directory called `AutogluonModels`, with each model in a subdirectory named after the time training started (for example, `ag-20220907_132926`). You can use the same `save_model` and `load_model` methods as before to move these directories and reload them.

In [None]:
trainer.save_model('Plasma_positive_autogluon-model')

In [None]:
trainer = retip.AutoGluonTrainer()
trainer.load_model('AutogluonModels/ag-20210318_094225')

In [None]:
trainer.score(val_data, plot=True)

If you use AutoGluon a lot, remember to clear out old models from the `AutogluonModels` directory!
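
A minimal sketch of such a cleanup is shown below. It assumes run directories follow the `ag-YYYYMMDD_HHMMSS` pattern seen in this notebook and skips anything that does not match. `find_old_models` and `remove_old_models` are hypothetical helpers, and the 30-day threshold is an arbitrary choice.

```python
import shutil
from datetime import datetime, timedelta
from pathlib import Path

def find_old_models(models_dir="AutogluonModels", max_age_days=30, now=None):
    """Return run directories whose name timestamp is older than max_age_days."""
    cutoff = (now or datetime.now()) - timedelta(days=max_age_days)
    root = Path(models_dir)
    if not root.is_dir():
        return []
    old = []
    for run_dir in sorted(root.iterdir()):
        if not run_dir.is_dir() or not run_dir.name.startswith("ag-"):
            continue
        try:
            started = datetime.strptime(run_dir.name[3:], "%Y%m%d_%H%M%S")
        except ValueError:
            continue  # name does not follow the expected pattern; leave it alone
        if started < cutoff:
            old.append(run_dir)
    return old

def remove_old_models(models_dir="AutogluonModels", max_age_days=30, dry_run=True):
    """Delete old run directories; with dry_run=True only report what would go."""
    for run_dir in find_old_models(models_dir, max_age_days):
        print(("Would remove " if dry_run else "Removing ") + str(run_dir))
        if not dry_run:
            shutil.rmtree(run_dir)

# Dry run by default: nothing is deleted until dry_run=False is passed
remove_old_models(max_age_days=30, dry_run=True)
```

Running with `dry_run=True` first makes it easy to confirm that no model you still need is on the list before anything is deleted.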