# AutoGluon Tabular - Quick Start

- https://github.com/autogluon/autogluon
- https://auto.gluon.ai/stable/tutorials/tabular/tabular-quick-start.html

In this tutorial, we will see how to use AutoGluon’s TabularPredictor to predict the values of a target column based on the other columns in a tabular dataset.

In [None]:
from autogluon.tabular import TabularDataset, TabularPredictor

## Data

For this tutorial we will use a dataset from the cover story of [Nature issue 7887](https://www.nature.com/nature/volumes/600/issues/7887): [AI-guided intuition for math theorems](https://www.nature.com/articles/s41586-021-04086-x.pdf). The goal is to predict a knot’s signature based on its properties. We sampled 10K training and 5K test examples from the [original data](https://github.com/deepmind/mathematics_conjectures/blob/main/knot_theory.ipynb). The sampled dataset makes this tutorial run quickly, but AutoGluon can handle the full dataset if desired.

We load this dataset directly from a URL. AutoGluon’s TabularDataset is a subclass of pandas DataFrame, so any DataFrame methods can be used on TabularDataset as well.

In [None]:
data_url = 'https://raw.githubusercontent.com/mli/ag-docs/main/knot_theory/'
train_data = TabularDataset(f'{data_url}train.csv')
train_data.head()

| | Unnamed: 0 | chern_simons | cusp_volume | hyperbolic_adjoint_torsion_degree | hyperbolic_torsion_degree | injectivity_radius | longitudinal_translation | meridinal_translation_imag | meridinal_translation_real | short_geodesic_imag_part | short_geodesic_real_part | Symmetry_0 | Symmetry_D3 | Symmetry_D4 | Symmetry_D6 | Symmetry_D8 | Symmetry_Z/2 + Z/2 | volume | signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 70746 | 0.09053 | 12.226322 | 0 | 10 | 0.507756 | 10.685555 | 1.144192 | -0.519157 | -2.760601 | 1.015512 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 11.393225 | -2 |
| 1 | 240827 | 0.232453 | 13.800773 | 0 | 14 | 0.413645 | 10.453156 | 1.320249 | -0.158522 | -3.013258 | 0.827289 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 12.742782 | 0 |
| 2 | 155659 | -0.144099 | 14.76103 | 0 | 14 | 0.436928 | 13.405199 | 1.101142 | 0.768894 | 2.233106 | 0.873856 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 15.236505 | 2 |
| 3 | 239963 | -0.171668 | 13.738019 | 0 | 22 | 0.249481 | 27.819496 | 0.493827 | -1.188718 | -2.042771 | 0.498961 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 17.27989 | -8 |
| 4 | 90504 | 0.235188 | 15.896359 | 0 | 10 | 0.389329 | 15.330971 | 1.036879 | 0.722828 | -3.056138 | 0.778658 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 16.749298 | 4 |


Our targets are stored in the "signature" column, which has 18 unique integers. pandas reads this column as a plain numeric type rather than as categorical, but AutoGluon will infer that the values are class labels.

In [None]:
label = 'signature'
train_data[label].describe()

count    10000.000000
mean        -0.022000
std          3.025166
min        -12.000000
25%         -2.000000
50%          0.000000
75%          2.000000
max         12.000000
Name: signature, dtype: float64
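
To make the idea of treating a small set of integer labels as classes more concrete, here is a minimal sketch of a label-type heuristic. It is not AutoGluon's actual inference logic: the function name `infer_problem_type` and the `max_classes` threshold are hypothetical and chosen only for illustration.

```python
def infer_problem_type(labels, max_classes=50):
    """Guess a problem type from label values (illustrative heuristic only).

    `max_classes` is a made-up threshold, not an AutoGluon setting.
    """
    unique = set(labels)
    if len(unique) == 2:
        return "binary"
    # Few distinct, integer-valued labels look like class labels.
    all_integral = all(float(v).is_integer() for v in unique)
    if all_integral and len(unique) <= max_classes:
        return "multiclass"
    return "regression"


# Values like those in the "signature" column of the sample rows above.
signature_like = [-2, 0, 2, -8, 4, 0, 2, -2]
print(infer_problem_type(signature_like))      # multiclass
print(infer_problem_type([0.09, 12.2, 10.7]))  # regression
print(infer_problem_type([0, 1, 1, 0]))        # binary
```

If the automatic choice is ever not what you want, `TabularPredictor` also accepts an explicit `problem_type` argument (for example `problem_type='multiclass'`).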

## Training

We now construct a TabularPredictor by specifying the label column name and then train on the dataset with TabularPredictor.fit(). We don’t need to specify any other parameters. AutoGluon will recognize this is a multi-class classification task, perform automatic feature engineering, train multiple models, and then ensemble the models to create the final predictor.

In [None]:
predictor = TabularPredictor(label=label).fit(train_data)

No path specified. Models will be saved in: "AutogluonModels/ag-20241201_093654"
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.2
Python Version:     3.10.15
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Tue Nov 5 00:21:55 UTC 2024
CPU Count:          12
Memory Avail:       26.42 GB / 31.35 GB (84.3%)
Disk Space Avail:   810.02 GB / 1006.85 GB (80.5%)
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='experimental' : New in v1.2: Pre-trained foundation model + parallel fits. The absolute best accuracy without consideration for inference speed. Does not support GPU.
	presets='best'         : Maximize accuracy. Recommended for most users. Use in competitions and benchmarks.
	presets='high'         : Strong accuracy with fas

## Prediction

Once we have a predictor that is fit on the training dataset, we can load a separate set of data to use for prediction and evaluation.

In [None]:
test_data = TabularDataset(f'{data_url}test.csv')

y_pred = predictor.predict(test_data.drop(columns=[label]))
y_pred.head()

Loaded data from: https://raw.githubusercontent.com/mli/ag-docs/main/knot_theory/test.csv | Columns = 19 / 19 | Rows = 5000 -> 5000


0   -4
1   -2
2    0
3    4
4    2
Name: signature, dtype: int64

## Evaluation

We can evaluate the predictor on the test dataset using the evaluate() function, which measures how well our predictor performs on data that was not used for fitting the models.

In [None]:
predictor.evaluate(test_data, silent=True)

{'accuracy': 0.947,
 'balanced_accuracy': 0.7428205291256212,
 'mcc': 0.9350581537178532}
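
The gap between `accuracy` and `balanced_accuracy` above is worth a closer look. Balanced accuracy is the mean of per-class recall, so mistakes on rare classes count as much as mistakes on common ones. The sketch below uses toy labels (not the knot data) and plain Python to show how the two metrics can diverge; whether rare signature values explain the gap in this tutorial is a plausible reading, not something verified here.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of examples predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Toy labels: class 0 is common, class 12 is rare.
y_true = [0] * 18 + [12] * 2
y_pred = [0] * 18 + [0, 12]  # one of the two rare examples is misclassified

print(accuracy(y_true, y_pred))           # 0.95
print(balanced_accuracy(y_true, y_pred))  # 0.75
```

A single error on the rare class barely moves accuracy but halves that class's recall, which pulls balanced accuracy down sharply.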

AutoGluon’s TabularPredictor also provides the leaderboard() function, which allows us to evaluate the performance of each individual trained model on the test data.

In [None]:
predictor.leaderboard(test_data)

| | model | score_test | score_val | eval_metric | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WeightedEnsemble_L2 | 0.947 | 0.965966 | accuracy | 0.474535 | 0.190845 | 10.179884 | 0.004257 | 0.000485 | 0.067302 | 2 | True | 14 |
| 1 | LightGBM | 0.9456 | 0.955956 | accuracy | 0.06543 | 0.012309 | 1.629 | 0.06543 | 0.012309 | 1.629 | 1 | True | 5 |
| 2 | XGBoost | 0.9448 | 0.956957 | accuracy | 0.165439 | 0.031721 | 2.623643 | 0.165439 | 0.031721 | 2.623643 | 1 | True | 11 |
| 3 | LightGBMLarge | 0.9444 | 0.94995 | accuracy | 0.181453 | 0.026507 | 4.56257 | 0.181453 | 0.026507 | 4.56257 | 1 | True | 13 |
| 4 | CatBoost | 0.9432 | 0.955956 | accuracy | 0.011856 | 0.002524 | 16.422212 | 0.011856 | 0.002524 | 16.422212 | 1 | True | 8 |
| 5 | RandomForestEntr | 0.9384 | 0.94995 | accuracy | 0.115191 | 0.06597 | 0.813525 | 0.115191 | 0.06597 | 0.813525 | 1 | True | 7 |
| 6 | ExtraTreesGini | 0.936 | 0.946947 | accuracy | 0.130709 | 0.066246 | 0.723556 | 0.130709 | 0.066246 | 0.723556 | 1 | True | 9 |
| 7 | ExtraTreesEntr | 0.9358 | 0.942943 | accuracy | 0.144481 | 0.075901 | 0.680049 | 0.144481 | 0.075901 | 0.680049 | 1 | True | 10 |
| 8 | NeuralNetFastAI | 0.9356 | 0.944945 | accuracy | 0.041444 | 0.010138 | 5.900502 | 0.041444 | 0.010138 | 5.900502 | 1 | True | 3 |
| 9 | RandomForestGini | 0.9352 | 0.944945 | accuracy | 0.109535 | 0.075591 | 0.839011 | 0.109535 | 0.075591 | 0.839011 | 1 | True | 6 |
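
The top row of the leaderboard is a weighted ensemble that sits one stack level above the individual models. As a rough illustration of what weighted ensembling means, the sketch below blends class probabilities from two models for a single row. The probabilities, the weights, and the helper name `weighted_ensemble` are all hypothetical; this does not reproduce how AutoGluon selects models or weights.

```python
def weighted_ensemble(prob_by_model, weights):
    """Blend per-model class probabilities using normalized weights."""
    total = sum(weights.values())
    classes = next(iter(prob_by_model.values())).keys()
    return {
        c: sum(weights[m] * prob_by_model[m][c] for m in weights) / total
        for c in classes
    }


# Hypothetical class probabilities for one test row (classes are signatures).
probs = {
    "LightGBM": {-2: 0.6, 0: 0.3, 2: 0.1},
    "XGBoost": {-2: 0.4, 0: 0.5, 2: 0.1},
}
# Hypothetical weights; a real ensemble would tune these on validation data.
weights = {"LightGBM": 0.75, "XGBoost": 0.25}

blended = weighted_ensemble(probs, weights)
prediction = max(blended, key=blended.get)  # class with the highest blended probability
print(prediction)  # -2
```

Here the two models disagree on the most likely class, and the more heavily weighted model decides the outcome.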
