# Tabular Prediction

In a tabular prediction task, we predict the values in one column from the values in the other columns. This tutorial demonstrates how to use AutoGluon for this task.

To start, import the {class}`autogluon.tabular.TabularDataset` and 
{class}`autogluon.tabular.TabularPredictor` classes. We will use the former to load data and the latter to train models and predict. 



In [2]:
#@title Install autogluon
!pip install autogluon

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting autogluon
  Downloading autogluon-0.5.0-py3-none-any.whl (9.5 kB)
Collecting autogluon.features==0.5.0
  Downloading autogluon.features-0.5.0-py3-none-any.whl (59 kB)
Collecting autogluon.timeseries[all]==0.5.0
  Downloading autogluon.timeseries-0.5.0-py3-none-any.whl (63 kB)
Collecting autogluon.core[all]==0.5.0
  Downloading autogluon.core-0.5.0-py3-none-any.whl (203 kB)
Collecting autogluon.text==0.5.0
  Downloading autogluon.text-0.5.0-py3-none-any.whl (61 kB)
Collecting autogluon.vision==0.5.0
  Downloading autogluon.vision-0.5.0-py3-none-any.whl (48 kB)

In [3]:
from autogluon.tabular import TabularDataset, TabularPredictor

The dataset we will use contains information about individuals, such as occupation, together with whether each person's income exceeds $50,000, which is the prediction target. We load this dataset directly from a URL. Because the `TabularDataset` class is a subclass of [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), any pandas method can be applied to it.

In [4]:
url = 'https://autogluon.s3.amazonaws.com/datasets/Inc/'
train_data = TabularDataset(url+'train.csv')
train_data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,178478,Bachelors,13,Never-married,Tech-support,Own-child,White,Female,0,0,40,United-States,<=50K
1,23,State-gov,61743,5th-6th,3,Never-married,Transport-moving,Not-in-family,White,Male,0,0,35,United-States,<=50K
2,46,Private,376789,HS-grad,9,Never-married,Other-service,Not-in-family,White,Male,0,0,15,United-States,<=50K
3,55,?,200235,HS-grad,9,Married-civ-spouse,?,Husband,White,Male,0,0,50,United-States,>50K
4,36,Private,224541,7th-8th,4,Married-civ-spouse,Handlers-cleaners,Husband,White,Male,0,0,40,El-Salvador,<=50K


Our targets are stored in the `class` column, which has two unique values. 



In [5]:
label = 'class'
train_data[label].describe()

count      39073
unique         2
top        <=50K
freq       29704
Name: class, dtype: object
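The `describe()` output also shows how imbalanced the label is. The short sketch below only redoes the arithmetic on the `count` and `freq` values printed above; the variable names are made up for illustration. The resulting share is a useful reference point, because a model that always predicted the most frequent value would already be right about that often on the training rows (the test split may differ slightly).

```python
# Counts copied from the describe() output above.
total_rows = 39073      # "count": number of training rows
majority_rows = 29704   # "freq": rows holding the most common value, '<=50K'

# Share of rows in each class; a constant "always predict <=50K" model
# would reach majority_share accuracy on the training data.
majority_share = majority_rows / total_rows
minority_share = 1 - majority_share

print(f"majority class share: {majority_share:.3f}")
print(f"minority class share: {minority_share:.3f}")
```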

Now construct a `TabularPredictor` instance by specifying the label column name, then train on the dataset with the {func}`autogluon.tabular.TabularPredictor.fit` method. We don't need to specify any other hyperparameters. This method will perform automatic feature engineering, train multiple models, and then ensemble them to form the final predictions. You can find detailed information in the output log.



In [6]:
predictor = TabularPredictor(label=label).fit(train_data)

No path specified. Models will be saved in: "AutogluonModels/ag-20220707_055423/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20220707_055423/"
AutoGluon Version:  0.5.0
Python Version:     3.7.13
Operating System:   Linux
Train Data Rows:    39073
Train Data Columns: 14
Label Column: class
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [' <=50K', ' >50K']
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 =  >50K, class 0 =  <=50K
	Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive ( >50K) vs negative ( <=50K) class.
	To explicitly set the positive_class, either rename classes to 1 and 0, or specify 

Training is usually fast because, by default, AutoGluon does not try very large models. For example, training on the above dataset with ~40K rows should finish within a few minutes on an ordinary CPU. If you want AutoGluon to stop earlier, specify the `time_limit` argument (in seconds) of the `fit` method. For example, `fit(..., time_limit=60, ...)` trains for at most 1 minute.
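As a minimal sketch of the `time_limit` option, the helper below wraps the same constructor and `fit` call used earlier and passes a training budget in seconds. The name `fit_with_time_limit` is hypothetical and exists only for this illustration; the import is placed inside the function so that defining the helper does not itself require AutoGluon to be installed.

```python
def fit_with_time_limit(train_data, label, time_limit=60):
    """Train a TabularPredictor with a training budget of `time_limit` seconds.

    Hypothetical helper for illustration; it is not part of AutoGluon.
    """
    # Lazy import: AutoGluon is only needed when the helper is actually called.
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(label=label)
    # fit() returns the predictor itself, so the result can be used directly.
    return predictor.fit(train_data, time_limit=time_limit)
```

With the `train_data` and `label` objects defined above, `predictor = fit_with_time_limit(train_data, label, time_limit=60)` should behave like the earlier `fit` call with a one-minute budget. A shorter budget generally leaves less room for model search, so accuracy may be lower than in the unrestricted run.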

Once training is done, load a separate test set and predict on it.



In [7]:
test_data = TabularDataset(url+'test.csv')
# Optional: drop the label column to confirm that prediction does not rely on it.
y_pred = predictor.predict(test_data.drop(columns=[label]))
y_pred.head()

Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv | Columns = 15 / 15 | Rows = 9769 -> 9769


0     <=50K
1     <=50K
2      >50K
3     <=50K
4     <=50K
Name: class, dtype: object

If you just want to evaluate model performance on labeled test data, you can call the {func}`autogluon.tabular.TabularPredictor.evaluate` method.

In [8]:
predictor.evaluate(test_data, silent=True)

{'accuracy': 0.8763435356740711,
 'balanced_accuracy': 0.7950062351568354,
 'f1': 0.710727969348659,
 'mcc': 0.6395678748952276,
 'precision': 0.798708288482239,
 'recall': 0.640207075064711,
 'roc_auc': 0.9313343583022541}
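To make the reported numbers easier to read, here is a small sketch of how most of these metrics follow from the counts of true and false positives and negatives. It uses the standard textbook definitions; whether AutoGluon computes them in exactly this way internally is an assumption. The function name `binary_metrics` and the toy labels are made up for illustration, and `roc_auc` is left out because it needs predicted probabilities rather than hard labels.

```python
from math import sqrt


def binary_metrics(y_true, y_pred, positive):
    """Compute common binary-classification metrics from two label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is zero.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "balanced_accuracy": (recall + specificity) / 2,
        "f1": ratio(2 * precision * recall, precision + recall),
        "mcc": ratio(tp * tn - fp * fn,
                     sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))),
        "precision": precision,
        "recall": recall,
    }


# Toy labels, made up for illustration (not taken from the dataset).
y_true = [">50K", "<=50K", "<=50K", ">50K", "<=50K"]
y_pred = [">50K", "<=50K", ">50K", "<=50K", "<=50K"]
toy = binary_metrics(y_true, y_pred, positive=">50K")
print(toy)

# Consistency check on the reported values: F1 is the harmonic mean
# of precision and recall.
reported_precision = 0.798708288482239
reported_recall = 0.640207075064711
f1_from_reported = (2 * reported_precision * reported_recall
                    / (reported_precision + reported_recall))
print(f1_from_reported)
```

The last value agrees with the `f1` entry returned by `evaluate`. Recall is noticeably lower than precision here, which means the model misses a fair share of the `>50K` rows even though overall accuracy looks high.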

This concludes a quick walkthrough of using AutoGluon for tabular prediction. We used two classes: {class}`autogluon.tabular.TabularDataset` (essentially a pandas DataFrame) to load data, and {class}`autogluon.tabular.TabularPredictor` to train (via the `fit` method) and predict (via the `predict` method). You will see similar APIs for other tasks, namely a `Dataset` class to load data and a `Predictor` class to train and predict.


In addition, AutoGluon simplifies model training by not requiring you to engineer features or specify model hyperparameters; it performs these jobs automatically when running `fit`. You may worry that this results in longer training time, but AutoGluon balances computational cost against model quality. You can benchmark AutoGluon's performance on the whole dataset loaded above against your favorite machine learning model. To be fair, you also need to count the time you spend on preprocessing data and tuning your models.
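For such a comparison, one simple way to record training time is a small wall-clock wrapper. The sketch below uses only the standard library; `timed` is a hypothetical helper name, and the `sum` call is a stand-in workload so that the snippet runs without any model installed.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload; replace with a real training call when benchmarking.
total, seconds = timed(sum, range(1_000_000))
print(f"stand-in workload finished in {seconds:.4f} s")
```

With the objects from this tutorial, the same wrapper could be applied as `predictor, fit_seconds = timed(TabularPredictor(label=label).fit, train_data)`, and likewise around the training call of the model you compare against. Time spent on manual preprocessing and tuning is not captured by the wrapper and has to be added separately.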

To learn more about AutoGluon, next you can read

- the cheatsheet for a quick overview of the APIs
- tutorials on customizing training and inference
- how AutoGluon performs feature engineering and model ensembling