# Tabular Prediction

In a tabular prediction task, we predict the values in one column from the values in the remaining columns. This tutorial demonstrates how to use AutoGluon for this task via a simple `fit()` call.

To start, install AutoGluon, then import its `TabularPredictor` and `TabularDataset` classes from the `tabular` module.



In [1]:
!pip install autogluon

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting autogluon
  Downloading autogluon-0.5.0-py3-none-any.whl (9.5 kB)
Collecting autogluon.core[all]==0.5.0
  Downloading autogluon.core-0.5.0-py3-none-any.whl (203 kB)
Collecting autogluon.vision==0.5.0
  Downloading autogluon.vision-0.5.0-py3-none-any.whl (48 kB)
Collecting autogluon.multimodal==0.5.0
  Downloading autogluon.multimodal-0.5.0-py3-none-any.whl (141 kB)
Collecting autogluon.text==0.5.0
  Downloading autogluon.text-0.5.0-py3-none-any.whl (61 kB)
Collecting autogluon.timeseries[all]==0.5.0
  Downloading autogluon.timeseries-0.5.0-py3-none-any.whl (63 kB)

In [15]:
from autogluon.tabular import TabularDataset, TabularPredictor

The dataset contains information about individuals, such as occupation, along with whether each person's income exceeds $50,000, which is the prediction target. We load it directly from a URL with `TabularDataset`. Because this class is a subclass of [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), any pandas method can be applied to it.

In [16]:
url = 'https://autogluon.s3.amazonaws.com/datasets/Inc/'
train_data = TabularDataset(url+'train.csv')
# Subsample for faster demo. Comment out in real scenarios.
train_data = train_data.sample(n=500, random_state=0)
train_data.head()

Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv | Columns = 15 / 15 | Rows = 39073 -> 39073


| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6118 | 51 | Private | 39264 | Some-college | 10 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | >50K |
| 23204 | 58 | Private | 51662 | 10th | 6 | Married-civ-spouse | Other-service | Wife | White | Female | 0 | 0 | 8 | United-States | <=50K |
| 29590 | 40 | Private | 326310 | Some-college | 10 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 44 | United-States | <=50K |
| 18116 | 37 | Private | 222450 | HS-grad | 9 | Never-married | Sales | Not-in-family | White | Male | 0 | 2339 | 40 | El-Salvador | <=50K |
| 33964 | 62 | Private | 109190 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 15024 | 0 | 40 | United-States | >50K |


Our targets are stored in the `class` column, which has two unique values. 



In [17]:
label = 'class'
train_data[label].describe()

count        500
unique         2
top        <=50K
freq         365
Name: class, dtype: object
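
The counts above also give a useful reference point: the accuracy of a model that always predicts the most frequent class. The following is a minimal sketch in plain Python; rather than loading the data, it rebuilds the label column from the reported counts (500 rows, 365 of them `<=50K`), and the variable names are illustrative only.

```python
from collections import Counter

# Rebuild the label column from the counts reported by describe():
# 500 rows in total, 365 of which are "<=50K" (so 135 are ">50K").
labels = ["<=50K"] * 365 + [">50K"] * 135

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]

# Accuracy obtained by always predicting the majority class.
baseline_accuracy = majority_count / len(labels)

print(counts)
print(f"Always predicting {majority_label!r} gives accuracy {baseline_accuracy:.2f}")
```

Any trained model should be compared against this majority-class baseline on the subsample, since plain accuracy can look high on imbalanced labels.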

Now construct a `TabularPredictor` instance by specifying the label column name, and train with `fit`. It will perform automatic feature engineering, train multiple models, and then ensemble them to form the final predictions. You can find detailed information in the output log.



In [18]:
predictor = TabularPredictor(label=label).fit(train_data)

No path specified. Models will be saved in: "AutogluonModels/ag-20220705_183345/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20220705_183345/"
AutoGluon Version:  0.5.0
Python Version:     3.7.13
Operating System:   Linux
Train Data Rows:    500
Train Data Columns: 14
Label Column: class
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [' >50K', ' <=50K']
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 =  >50K, class 0 =  <=50K
	Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive ( >50K) vs negative ( <=50K) class.
	To explicitly set the positive_class, either rename classes to 1 and 0, or specify po
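
The ensembling step mentioned above can be pictured as a weighted average of the probabilities predicted by the individual models. The sketch below is a toy illustration in plain Python and is not AutoGluon's implementation; the model names, probabilities, weights, and the 0.5 threshold are all made up for illustration.

```python
def weighted_ensemble(probs_by_model, weights):
    """Return the weighted average of per-model positive-class probabilities."""
    total_weight = sum(weights.values())
    n_rows = len(next(iter(probs_by_model.values())))
    return [
        sum(weights[name] * probs[i] for name, probs in probs_by_model.items())
        / total_weight
        for i in range(n_rows)
    ]


# Hypothetical probabilities of the positive class (">50K") from three
# hypothetical models, for four rows of data.
probs_by_model = {
    "model_a": [0.9, 0.2, 0.6, 0.1],
    "model_b": [0.8, 0.4, 0.4, 0.2],
    "model_c": [0.7, 0.1, 0.7, 0.3],
}
# Hypothetical ensemble weights; a larger weight means more influence.
weights = {"model_a": 0.5, "model_b": 0.3, "model_c": 0.2}

ensemble_probs = weighted_ensemble(probs_by_model, weights)
# Turn averaged probabilities into labels with an assumed 0.5 threshold.
ensemble_labels = [">50K" if p >= 0.5 else "<=50K" for p in ensemble_probs]

print(ensemble_probs)
print(ensemble_labels)
```

In the third row the models disagree (0.6, 0.4, 0.7), and the weighted average settles the final label.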

Once training is done, load a separate test set and make predictions on it.



In [19]:
test_data = TabularDataset(url+'test.csv')
# Optional: drop the label column to confirm that prediction does not rely on it.
y_pred = predictor.predict(test_data.drop(columns=[label]))
y_pred.head()

Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv | Columns = 15 / 15 | Rows = 9769 -> 9769


0     <=50K
1     <=50K
2     <=50K
3     <=50K
4     <=50K
Name: class, dtype: object

If you just want to evaluate the model performance, you can call the `evaluate` method.

In [20]:
predictor.evaluate(test_data, silent=True)

{'accuracy': 0.8374449790152523,
 'balanced_accuracy': 0.7430558394221018,
 'f1': 0.621904761904762,
 'mcc': 0.5243657567117436,
 'precision': 0.69394261424017,
 'recall': 0.5634167385677308,
 'roc_auc': 0.880746792185795}
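
To see how these scores relate to one another, the sketch below recomputes them from a confusion matrix in plain Python. The four counts are an assumption: the notebook does not print them, and they were back-calculated from the reported scores and the 9769 test rows, with `>50K` as the positive class. `roc_auc` is left out because it depends on the predicted probabilities, not only on the predicted labels.

```python
import math

# Assumed confusion-matrix counts (not printed by the notebook), inferred
# from the reported scores; the positive class is ">50K".
tp, fn, fp, tn = 1306, 1012, 576, 6875

recall = tp / (tp + fn)        # share of actual positives that were found
specificity = tn / (tn + fp)   # share of actual negatives that were found
precision = tp / (tp + fp)     # share of predicted positives that are correct

metrics = {
    "accuracy": (tp + tn) / (tp + fn + fp + tn),
    "balanced_accuracy": (recall + specificity) / 2,
    "f1": 2 * precision * recall / (precision + recall),
    "mcc": (tp * tn - fp * fn)
    / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    "precision": precision,
    "recall": recall,
}

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under these assumed counts, the gap between accuracy and balanced accuracy comes from the class imbalance: most negatives are classified correctly, while a little over half of the positives are found.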

We have now taken a quick tour of loading data, training, and inference. Next, you can read:

- the cheatsheet for a quick overview of the APIs
- tutorials on customizing training and inference
- an explanation of how AutoGluon performs feature engineering and model ensembling