![AutoGluon Logo](https://auto.gluon.ai/stable/_static/AutogluonLogo.png)

# <a name="0">AutoGluon Tutorial - TABULAR DATA</a>
   
This notebook demonstrates the simplest way to use AutoGluon for Tabular data. 

AutoGluon automates several tasks related to ML model development and builds highly accurate models. In this tutorial, you will test AutoGluon on a dataset of demographic data about people. The goal is to identify whether or not a person’s yearly income exceeds $50,000.
    
> This is a __binary classification__ task. The label column indicates whether a person earns more than 50K a year or not.


Let's solve the binary classification problem using __AutoGluon__.

### Basic AutoGluon Features

- <a href="#Importing-AutoGluon">Importing AutoGluon </a>
- <a href="#Getting-the-Data">Getting the Data</a>
- <a href="#Model-Training-with-AutoGluon">Model Training with AutoGluon</a>
- <a href="#AutoGluon-Training-Results">AutoGluon Training Results</a>
- <a href="#Model-Prediction">Model Prediction with AutoGluon</a>

### Advanced AutoGluon Features

- <a href="#Specifying-performance-metric-and-Hyperparameter-Options">Specifying performance metric and Hyperparameter Options </a>
- <a href="#Model-ensembling-with-stacking-bagging">Model ensembling with stacking/bagging</a>
- <a href="#Prediction-options-inference">Prediction options (inference)</a>
- <a href="#Selecting-individual-models-predictions">Selecting individual models for predictions</a>
- <a href="#Interpretability-feature-importance">Interpretability: Feature importance</a>
- <a href="#Inference-speed-model-distillation">Inference Speed: Model distillation</a>


(<a href="#0">Go to top</a>)

Let's start by loading some libraries and packages.

In [None]:
!pip3 install -U -q pip
!pip3 install -U -q setuptools wheel

# CPU version of pytorch has smaller footprint - see installation instructions in
# pytorch documentation - https://pytorch.org/get-started/locally/
!pip3 install -q torch==1.12+cpu torchvision==0.13.0+cpu torchtext==0.13.0 -f https://download.pytorch.org/whl/cpu/torch_stable.html

# Install the proper version of PyTorch following https://pytorch.org/get-started/locally/
# UNCOMMENT THIS IF RUNNING WITH GPU
#!pip3 install -q torch==1.12.0+cu113 torchvision==0.13.0+cu113 torchtext==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu113

!pip3 install -q autogluon

### <a id="Importing-AutoGluon">Importing AutoGluon</a>

We load the objects needed to work with our Tabular dataset, as well as the pandas library.

(<a href="#0">Go to top</a>)

In [None]:
import pandas as pd

# Importing objects from the newly installed AutoGluon code library
from autogluon.tabular import TabularPredictor, TabularDataset

### <a id="Getting-the-Data">Getting the Data</a>

Let's get the data for our binary classification problem and inspect it.

Note that we load data from a CSV file stored in the cloud (AWS S3 bucket), but you can also specify a local file path.

(<a href="#0">Go to top</a>)

In [None]:
# Load the training dataset
# Note that the TabularDataset can read files from local paths or from external URLs.
df_train = TabularDataset("https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv")

# Load the test dataset
df_test = TabularDataset("https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv")

In [None]:
print(f"Number of data points: {len(df_train)}")
print(f"Number of columns per data point: {len(df_train.columns)}")

df_train.head()

We can see that the data file contains a series of demographic features for a number of people. Let’s use them to predict whether each person’s income exceeds $50,000, which is recorded in the `class` column of this table.

In [None]:
print("Summary of class variable: \n", df_train["class"].describe())
print()
df_train["class"].value_counts()

### <a id="Model-Training-with-AutoGluon">Model Training with AutoGluon</a>

We can train a model using AutoGluon with only a single line of code.  All we need to do is to tell it which column from the dataset we are trying to predict, and what the dataset is.

__Optional:__ You may set a __time limit__ for AutoGluon to perform all the tasks related to ML model development. More time allows AutoGluon to try out more techniques to improve performance.

(<a href="#0">Go to top</a>)

In [None]:
# Train a model with AutoGluon on the train dataset

# Set the path to save models
save_path = "AutogluonModels/Tabular_Basic/"

# Subsample the data for a faster demo; try setting this to a much larger value
subsample_size = 1000
df_train_small = df_train.sample(n=subsample_size, random_state=42)

# Set the training time limit to 2 minutes here, to achieve a quick result
predictor = TabularPredictor(label="class", path=save_path).fit(
    train_data=df_train_small, time_limit=2 * 60
)

### <a id="AutoGluon-Training-Results">AutoGluon Training Results</a>
Now let's take a look at all the information AutoGluon provides via its __leaderboard function__, which provides a summary of all models that AutoGluon trained. <br/> 

(<a href="#0">Go to top</a>)

In [None]:
predictor.leaderboard(silent=True)

### <a id="Model-Prediction">Model Prediction with AutoGluon</a>

#### Now that your model is trained, let's use it to make predictions!

We should always run a final model performance assessment using data that was unseen by the model (the test data). Test data is not used during training and can therefore give an unbiased performance assessment. 

(<a href="#0">Go to top</a>)

In [None]:
# Inspect the test data
df_test.head()

In [None]:
# Get predictions
predictions = predictor.predict(df_test)
predictions.head()

In [None]:
evaluation = predictor.evaluate_predictions(
    y_true=df_test["class"], y_pred=predictions, auxiliary_metrics=True
)

Use `fit_summary()` to output a summary of information about models produced during `fit()`.

In [None]:
print(predictor.fit_summary())

<a id='ex'></a>
### <mark>Your Turn: Improve the performance of your binary classifier using only basic AutoGluon options</mark>

- Try training with more data points
- Try training for a longer time
    
You can code your answer in the cell below. 

In [None]:
# Write your code here:


---
# <a id="Advanced AutoGluon Options">Advanced AutoGluon Options</a>

Now that you know how to use `TabularPredictor` with just a few lines of code, let us look at some of the processes and configuration options AutoGluon offers.

(<a href="#0">Go to top</a>)

## <a id="Specifying-performance-metric-and-Hyperparameter-Options">Specifying performance metric and Hyperparameter Options</a>

### Specifying performance metric
AutoGluon automatically infers the performance metric to optimize given the type of problem. However, it is possible to explicitly specify the evaluation metric as well. 
The full list of AutoGluon classification metrics can be found here:

`'accuracy', 'balanced_accuracy', 'f1', 'f1_macro', 'f1_micro', 'f1_weighted', 'roc_auc', 'average_precision', 'precision', 'precision_macro', 'precision_micro', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_weighted', 'log_loss', 'pac_score'`
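
The choice of metric matters most when one class is much rarer than the other. The following plain-Python sketch (it does not call AutoGluon) compares accuracy with F1 on a small, made-up label set. The labels `"low"` and `"high"` and the helper names are hypothetical and chosen only for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive):
    """F1 for the given positive class: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical imbalanced labels: 8 negatives, 2 positives
y_true = ["low"] * 8 + ["high"] * 2

# A "model" that always predicts the majority class
y_pred_majority = ["low"] * 10

acc_majority = accuracy(y_true, y_pred_majority)
f1_majority = f1_score(y_true, y_pred_majority, positive="high")

print(f"accuracy: {acc_majority:.2f}, f1: {f1_majority:.2f}")
```

The majority-class predictor looks reasonable under accuracy but never finds a single positive example, which F1 exposes. This is one reason you might override the inferred metric.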

(<a href="#0">Go to top</a>)

In [None]:
# We can specify a different metric for optimization
metric = "f1"

# Train various models for ~30 min
time_limit = 30 * 60

### Hyperparameter Tuning

Hyperparameter optimization improves model performance by finding the best combination of hyperparameter values. The choice of models and hyperparameters can be specified while calling the `fit()` method.

You can specify various hyperparameter values for each type of model. For each hyperparameter, you can either specify a single fixed value, or a search space of values to consider during hyperparameter optimization. Hyperparameters which you do not specify are left at default settings chosen automatically by AutoGluon, which may be fixed values or search spaces.
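
To make the difference between a fixed value and a search space concrete, here is a minimal random-search sketch in plain Python. It is not how AutoGluon implements its searchers; the tuple-based `space` format and the helper names `sample_log_uniform` and `sample_config` are assumptions made for this illustration only.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_config(space, rng):
    """Draw one hyperparameter configuration from a (hypothetical) space spec."""
    config = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "fixed":
            config[name] = spec[1]  # single fixed value, never searched
        elif kind == "log_real":
            config[name] = sample_log_uniform(spec[1], spec[2], rng)
        elif kind == "real":
            config[name] = rng.uniform(spec[1], spec[2])
        elif kind == "choice":
            config[name] = rng.choice(spec[1])
        else:
            raise ValueError(f"unknown spec kind: {kind}")
    return config


# Hypothetical space mixing fixed values and ranges to search
space = {
    "num_epochs": ("fixed", 10),
    "learning_rate": ("log_real", 1e-4, 1e-2),
    "activation": ("choice", ["relu", "sigmoid"]),
    "dropout_prob": ("real", 0.0, 0.4),
}

rng = random.Random(0)
trials = [sample_config(space, rng) for _ in range(5)]
for trial in trials:
    print(trial)
```

Sampling the learning rate on a log scale spreads trials evenly across orders of magnitude, so values near `1e-4` are as likely to be tried as values near `1e-2`.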



In [None]:
import autogluon.core as ag

save_path = "AutogluonModels/Tabular_Tuning/"

# Specifies non-default hyperparameter values for neural network models
nn_options = {
    "num_epochs": 10,  # number of training epochs (controls training time of NN models)
    "learning_rate": ag.space.Real(
        1e-4, 1e-2, default=5e-4, log=True
    ),  # learning rate used in training (real-valued hyperparameter searched on log-scale)
    "activation": ag.space.Categorical(
        "relu", "sigmoid"
    ),  # activation function used in NN (categorical hyperparameter, default = first entry)
    "dropout_prob": ag.space.Real(
        0.0, 0.4, default=0.1
    ),  # dropout probability (real-valued hyperparameter)
}

# Specifies non-default hyperparameter values for lightGBM gradient boosted trees
gbm_options = {
    "num_boost_round": 100,  # number of boosting rounds (controls training time of GBM models)
    "num_leaves": ag.space.Int(
        lower=26, upper=66, default=36
    ),  # number of leaves in trees (integer hyperparameter)
}

# Hyperparameters of each model type
# When these keys are missing from hyperparameters dict, no models of that type are trained
hyperparameters = {
    "GBM": gbm_options,
    "NN_TORCH": nn_options,  # NOTE: comment this line out if you get errors on Mac OSX
}

num_trials = (
    5  # try at most 5 different hyperparameter configurations for each type of model
)
search_strategy = (
    "auto"  # to tune hyperparameters using random search routine with a local scheduler
)

# HPO is not performed unless hyperparameter_tune_kwargs is specified
hyperparameter_tune_kwargs = {
    "num_trials": num_trials,
    "scheduler": "local",
    "searcher": search_strategy,
}

predictor_hpo = TabularPredictor(label="class", path=save_path, eval_metric=metric).fit(
    df_train,
    time_limit=time_limit,
    hyperparameters=hyperparameters,
    hyperparameter_tune_kwargs=hyperparameter_tune_kwargs,
)

<a id='ex'></a>
### <mark>Your Turn: Compute predictions for the HPO predictor and evaluate its performance.</mark>

- Use `predict` on the test data set
- Use `evaluate_predictions` and compare the performance with your previous results


In [None]:
# Write your code here


## <a name="Model-ensembling-with-stacking-bagging">Model ensembling with stacking/bagging</a>
(<a href="#0">Go to top</a>)

Beyond hyperparameter-tuning with a correctly-specified evaluation metric, there are two other methods to boost predictive performance:
- bagging and 
- stack-ensembling

You’ll often see performance improve if you specify `num_bag_folds = 5-10`, `num_stack_levels = 1-3` in the call to `fit()`. Beware that doing this will increase training times and memory/disk usage.

You should not provide `tuning_data` when stacking/bagging; instead, provide all your available data as `train_data` (which AutoGluon will split in more intelligent ways). The `num_bag_sets` parameter controls how many times the K-fold bagging process is repeated to further reduce variance (increasing it may further boost accuracy but will substantially increase training times, inference latency, and memory/disk usage). Rather than searching for good bagging/stacking values manually, you can let AutoGluon select them automatically by specifying `auto_stack` instead:


In [None]:
save_path = "AutogluonModels/Tabular_Stack/"
predictor_stack = TabularPredictor(label="class", path=save_path, eval_metric=metric).fit(
    train_data=df_train, auto_stack=True, time_limit=2 * 60
)
predictions_stack = predictor_stack.predict(df_test)
evaluation_stack = predictor_stack.evaluate_predictions(
    y_true=df_test["class"], y_pred=predictions_stack, auxiliary_metrics=True
)

Stacking/bagging will often produce better accuracy than hyperparameter tuning, but you may try combining both techniques (note: specifying `presets='best_quality'` in `fit()` simply sets `auto_stack=True`).
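
For intuition about what K-fold bagging does, here is a small sketch in plain Python. It is a toy under stated assumptions, not AutoGluon's implementation: the 1-D threshold "model", the data, and all function names are hypothetical. Each fold model is trained on the rows outside its fold, predicts the held-out rows (out-of-fold predictions), and the final prediction is a majority vote over all fold models.

```python
import random
from statistics import mean


def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices and deal them into k disjoint folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_threshold_model(xs, ys):
    """Toy 1-D classifier: threshold halfway between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (mean(pos) + mean(neg)) / 2


def bagged_fit(xs, ys, k=5):
    """Train one model per fold and collect out-of-fold predictions."""
    thresholds = []
    oof_pred = [None] * len(xs)
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        threshold = fit_threshold_model(train_x, train_y)
        thresholds.append(threshold)
        for i in held_out:
            # Each row is predicted only by the model that never saw it
            oof_pred[i] = int(xs[i] > threshold)
    return thresholds, oof_pred


def bagged_predict(thresholds, x):
    """Majority vote over all fold models."""
    votes = [int(x > t) for t in thresholds]
    return int(mean(votes) >= 0.5)


# Hypothetical, well-separated data: 10 negatives and 10 positives
xs = [i * 0.1 for i in range(10)] + [2.0 + i * 0.1 for i in range(10)]
ys = [0] * 10 + [1] * 10

thresholds, oof_pred = bagged_fit(xs, ys, k=5)
print("fold thresholds:", thresholds)
print("out-of-fold accuracy:", mean(int(p == y) for p, y in zip(oof_pred, ys)))
```

Because every row receives a prediction from a model that did not train on it, out-of-fold predictions give an honest validation signal without a separate `tuning_data` split, and they are the kind of input a stacking layer can be trained on.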

---
## <a name="Prediction-options-inference">Prediction options (inference)</a>
(<a href="#0">Go to top</a>)

Even if you’ve started a new Python session since last calling `fit()`, you can still load a previously trained predictor from disk:

In [None]:
save_path = "AutogluonModels/Tabular_Tuning/"

predictor = TabularPredictor.load(save_path)

Above, `save_path` is the same folder previously passed to `TabularPredictor`, in which the trained models have been saved. You can easily train models on one machine and deploy them on another. Simply copy the `save_path` folder to the new machine and specify its new path in `TabularPredictor.load()`.

We can make a prediction on an individual example rather than on a full dataset:

In [None]:
# Select one datapoint to make a prediction
datapoint = df_test.iloc[[0]] # Note: .iloc[0] won't work because it returns pandas Series instead of DataFrame

predictor.predict(datapoint)

To output predicted class probabilities instead of predicted classes, you can use `predict_proba`:

In [None]:
# Returns a DataFrame that shows which probability corresponds to which class
predictor.predict_proba(datapoint)
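
As a rough illustration of how class probabilities relate to predicted classes, the sketch below turns probability rows into labels with an adjustable threshold. It uses plain dictionaries rather than the DataFrame that `predict_proba` returns, and the class names, probabilities, and the `labels_from_probabilities` helper are hypothetical.

```python
def labels_from_probabilities(proba_rows, positive, negative, threshold=0.5):
    """Predict `positive` when its probability reaches the threshold."""
    return [
        positive if row[positive] >= threshold else negative
        for row in proba_rows
    ]


# Hypothetical class probabilities for three rows
proba_rows = [
    {"low": 0.9, "high": 0.1},
    {"low": 0.6, "high": 0.4},
    {"low": 0.2, "high": 0.8},
]

default_labels = labels_from_probabilities(proba_rows, "high", "low")
lenient_labels = labels_from_probabilities(proba_rows, "high", "low", threshold=0.3)

print(default_labels)
print(lenient_labels)
```

With a threshold of 0.5 this matches picking the more probable of the two classes. Lowering the threshold flags more rows as positive, trading precision for recall.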

By default, `predict()` and `predict_proba()` use the model that AutoGluon considers most accurate, which is usually an ensemble of many individual models. To see which model this is, call `predictor.get_model_best()`.

---
## <a name="Selecting-individual-models-predictions">Selecting individual models for predictions</a>
(<a href="#0">Go to top</a>)

We can specify a particular model to use for predictions (e.g. to reduce inference latency). Note that a ‘model’ in AutoGluon may refer, for instance, to a single Neural Network, a bagged ensemble of many Neural Network copies trained on different training/validation splits, a weighted ensemble that aggregates the predictions of many other models, or a stacked model that operates on predictions output by other models. This is akin to viewing a RandomForest as one ‘model’ when it is in fact an ensemble of many decision trees.

Here’s how to list all models trained by AutoGluon for a given predictor:

In [None]:
predictor.get_model_names()

To use a particular model for prediction instead of AutoGluon’s default model choice, pass the model name as an argument to the `predict` method:

`predictor.predict(datapoint, model=<name_of_model_to_use>)`

<a id='ex'></a>
### <mark>Your Turn: Compute predictions for your loaded predictor using a more performant trained model</mark>

- Use `leaderboard` and `quantile` to find the model with the fastest inference speed among those with prediction time faster than the 80th percentile 
- Call `predict` using that model on the test data
- Use `evaluate` to compute the performance of said model. Compare with AutoGluon's "best" model
- Use `%timeit` to compare the speed of the performant model with AutoGluon's default

In [None]:
# Write your code here:


---
## <a name="Interpretability-feature-importance">Interpretability: Feature importance</a>
(<a href="#0">Go to top</a>)

To better understand our trained predictor, we can estimate the overall importance of each feature:

In [None]:
predictor.feature_importance(df_test)

Computed via permutation-shuffling, these feature importance scores quantify the drop in predictive performance (of the already trained predictor) when one column's values are randomly shuffled across rows. The top features in this list contribute most to AutoGluon’s accuracy. Features with a non-positive importance score hardly contribute to the predictor's accuracy, or may even be actively harmful to include in the data (consider removing these features from your data and calling `fit` again). These scores facilitate interpretability of the predictor's global behavior (which features it relies on for all predictions) rather than local explanations that only rationalize one particular prediction.
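
The permutation idea can be sketched in a few lines of plain Python. This is a simplified illustration, not AutoGluon's `feature_importance` implementation: the fixed toy `model`, the random data, and the helper names are assumptions. The model reads only feature 0, so shuffling feature 1 should leave accuracy unchanged.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows the model classifies correctly."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_features, n_repeats=5, seed=0):
    """Average drop in accuracy when one feature column is shuffled across rows."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row with only feature j replaced by a shuffled value
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def model(row):
    """Already-trained toy model that only looks at feature 0."""
    return int(row[0] > 0.5)


# Hypothetical data: two random features, label depends on feature 0 only
data_rng = random.Random(1)
rows = [(data_rng.random(), data_rng.random()) for _ in range(200)]
labels = [int(r[0] > 0.5) for r in rows]

importances = permutation_importance(model, rows, labels, n_features=2)
print(importances)
```

Note that the model is never retrained: the score measures how much the existing predictor depends on each column, which is why it can be computed on held-out data such as the test set.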


---
## <a name="Inference-speed-model-distillation">Inference Speed: Model distillation</a>
(<a href="#0">Go to top</a>)

Although computationally favorable, single individual models usually have lower accuracy than weighted/stacked/bagged ensembles. Model distillation offers one way to retain the computational benefits of a single model while enjoying some of the accuracy boost that comes with ensembling. The idea is to train the individual model (the student) to mimic the predictions of the full stack ensemble (the teacher). Like `refit_full()`, the `distill()` function produces additional models we can opt to use for prediction.
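
The teacher/student idea can be illustrated with a toy example in plain Python. This sketch makes strong simplifying assumptions and does not reflect how `distill()` works internally: the teacher is a hypothetical ensemble of five threshold rules, the student is a single logistic model, and all names and data are invented for illustration. The student is fit to the teacher's predicted probabilities (soft labels) rather than to hard class labels.

```python
import math


def teacher_proba(x, thresholds):
    """Ensemble teacher: fraction of threshold rules voting for the positive class."""
    return sum(x > t for t in thresholds) / len(thresholds)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_student(xs, soft_targets, lr=0.5, epochs=2000):
    """Fit one logistic model to the teacher's probabilities by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, p in zip(xs, soft_targets):
            # Gradient of the cross-entropy loss with respect to the logit
            err = sigmoid(w * x + b) - p
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Hypothetical teacher made of five threshold rules
thresholds = [0.8, 0.9, 1.0, 1.1, 1.2]

# Inputs from 0.0 to 2.0; in practice these could be unlabeled rows
xs = [i / 10 for i in range(21)]
soft_targets = [teacher_proba(x, thresholds) for x in xs]

w, b = fit_student(xs, soft_targets)

teacher_labels = [int(p >= 0.5) for p in soft_targets]
student_labels = [int(sigmoid(w * x + b) >= 0.5) for x in xs]
agreement = sum(s == t for s, t in zip(student_labels, teacher_labels)) / len(xs)
print(f"student/teacher agreement: {agreement:.2f}")
```

The student needs only two parameters at prediction time, while the teacher must evaluate every member of the ensemble. That is the trade-off distillation aims for: most of the ensemble's behavior at the cost of a single model.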

### Training student models

In [None]:
# Specify much longer time limit in real applications
student_models = predictor.distill(time_limit=2*60)
student_models

In [None]:
predictor.leaderboard(silent=True)