# AutoGluon Tabular - Quick Start

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/autogluon/autogluon/blob/stable/docs/tutorials/tabular/tabular-quick-start.ipynb)
[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/autogluon/autogluon/blob/stable/docs/tutorials/tabular/tabular-quick-start.ipynb)

In this tutorial, we will see how to use AutoGluon's `TabularPredictor` to predict the values of a target column based on the other columns in a tabular dataset.

Begin by making sure AutoGluon is installed, and then import AutoGluon's `TabularDataset` and `TabularPredictor`. We will use the former to load data and the latter to train models and make predictions.

In [1]:
!python -m pip install --upgrade pip
!python -m pip install autogluon

Collecting pip
  Downloading pip-25.0.1-py3-none-any.whl.metadata (3.7 kB)
Downloading pip-25.0.1-py3-none-any.whl (1.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 19.7 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-25.0.1
Collecting autogluon
  Downloading autogluon-1.2-py3-none-any.whl.metadata (11 kB)
Collecting autogluon.core==1.2 (from autogluon.core[all]==1.2->autogluon)
  Downloading autogluon.core-1.2-py3-none-any.whl.metadata (12 kB)
Collecting autogluon.features==1.2 (from autogluon)
  Downloading autogluon.features-1.2-py3-none-any.whl.metadata (11 kB)
Collecting autogluon.tabular==1.2 (from autogluon.tabular[all]==1.2->autogluon)
  Downloading autogluon.tabular-1.2-py3-none-any.whl.metadata (14 kB)
Collecting autogluon.multimodal==1.2 (from a

In [1]:
from autogluon.tabular import TabularDataset, TabularPredictor

## Example Data

For this tutorial we will use a dataset from the cover story of [Nature issue 7887](https://www.nature.com/nature/volumes/600/issues/7887): [AI-guided intuition for math theorems](https://www.nature.com/articles/s41586-021-04086-x.pdf). The goal is to predict a knot's signature based on its properties. We sampled 10K training and 5K test examples from the [original data](https://github.com/deepmind/mathematics_conjectures/blob/main/knot_theory.ipynb). The sampled dataset makes this tutorial run quickly, but AutoGluon can handle the full dataset if desired.

In the cells below, the training features and labels are loaded from two local CSV files. AutoGluon's `TabularDataset` is a subclass of pandas [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), so any `DataFrame` method can be used on a `TabularDataset` as well.

In [2]:
train_label = TabularDataset("./train_label.csv")
train_data = TabularDataset('./train_input.csv')
train_data.head()

Unnamed: 0,PM2.5 (µg/m³),PM10 (µg/m³),NO2 (µg/m³),SO2 (µg/m³),CO (mg/m³),O3 (µg/m³),Temperature (°C),Humidity (%),Wind Speed (m/s),Wind Direction (°),...,AQI,Season,Latitude,Longitude,Day of Week,Hour,Month,Year,Weather Condition,Station ID
0,204.626541,177.172912,56.181879,40.973914,0.85869,48.073378,23.491706,58.757424,8.168284,359.881532,...,38,Winter,44.134919,106.844439,Tuesday,16,1,2016,Clear,32
1,178.47723,114.525464,48.634363,8.791909,1.921633,163.915372,5.175494,33.352674,2.95737,359.792024,...,489,Autumn,48.591492,118.513972,Saturday,5,7,2020,Clear,80
2,116.317364,209.525485,61.934167,27.043954,1.870824,176.562196,27.285353,11.964247,7.152106,359.124842,...,399,Spring,29.407165,100.469703,Saturday,6,6,2023,Clear,11
3,214.559012,29.915876,71.323678,7.326995,3.01314,94.270021,-6.936606,40.184682,6.90348,359.114918,...,162,Summer,43.45901,125.615611,Tuesday,0,6,2018,Snow,83
4,34.392308,158.081571,54.892799,25.356895,4.640395,73.04732,6.969151,81.48433,3.098179,358.962514,...,423,Spring,40.273913,128.614599,Wednesday,22,6,2021,Clear,46


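Our targets are stored in the "City" column of `train_label`, which holds five distinct city names. Below we encode them as integers with `LabelEncoder`. Even though pandas then treats the encoded column as numeric rather than categorical, AutoGluon will recognize it as a classification target.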


In [3]:
from sklearn.preprocessing import LabelEncoder

# Create the encoder
label_encoder = LabelEncoder()

# Encode the string column (here the label column is 'City')
train_label['City_encoded'] = label_encoder.fit_transform(train_label['City'])

# Show the mapping from class name to integer code
print(dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_))))

{'Beijing': 0, 'Chengdu': 1, 'Guangzhou': 2, 'Shanghai': 3, 'Shenzhen': 4}
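
`LabelEncoder` assigns codes in sorted order of the distinct values, which is why Beijing maps to 0 and Shenzhen to 4 above. The plain-Python sketch below reproduces that mapping and its inverse without scikit-learn. The helper names `build_encoding` and `decode` are illustrative and not part of any library.

```python
def build_encoding(values):
    """Map each distinct value to its position in sorted order."""
    classes = sorted(set(values))
    return {name: code for code, name in enumerate(classes)}


def decode(codes, encoding):
    """Turn integer codes back into the original class names."""
    inverse = {code: name for name, code in encoding.items()}
    return [inverse[c] for c in codes]


# A few city names taken from the label column shown in this notebook
cities = ["Shenzhen", "Shanghai", "Beijing", "Chengdu", "Guangzhou", "Beijing"]
encoding = build_encoding(cities)
encoded = [encoding[c] for c in cities]

print(encoding)
print(encoded)
print(decode(encoded, encoding))
```

For a fitted scikit-learn encoder, `LabelEncoder.inverse_transform` plays the role of `decode`, provided the encoder has not been refitted on another column in the meantime.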


In [7]:
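# Note: fit_transform refits label_encoder, so its classes_ no longer describe 'City' after this line
train_data['weather_encode'] = label_encoder.fit_transform(train_data['Weather Condition'])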

In [17]:
train_data

Unnamed: 0,PM2.5 (µg/m³),PM10 (µg/m³),NO2 (µg/m³),SO2 (µg/m³),CO (mg/m³),O3 (µg/m³),Temperature (°C),Humidity (%),Wind Speed (m/s),Wind Direction (°),...,Season,Latitude,Longitude,Day of Week,Hour,Month,Year,Weather Condition,Station ID,weather_encode
0,204.626541,177.172912,56.181879,40.973914,0.858690,48.073378,23.491706,58.757424,8.168284,359.881532,...,Winter,44.134919,106.844439,Tuesday,16,1,2016,Clear,32,0
1,178.477230,114.525464,48.634363,8.791909,1.921633,163.915372,5.175494,33.352674,2.957370,359.792024,...,Autumn,48.591492,118.513972,Saturday,5,7,2020,Clear,80,0
2,116.317364,209.525485,61.934167,27.043954,1.870824,176.562196,27.285353,11.964247,7.152106,359.124842,...,Spring,29.407165,100.469703,Saturday,6,6,2023,Clear,11,0
3,214.559012,29.915876,71.323678,7.326995,3.013140,94.270021,-6.936606,40.184682,6.903480,359.114918,...,Summer,43.459010,125.615611,Tuesday,0,6,2018,Snow,83,5
4,34.392308,158.081571,54.892799,25.356895,4.640395,73.047320,6.969151,81.484330,3.098179,358.962514,...,Spring,40.273913,128.614599,Wednesday,22,6,2021,Clear,46,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2494,146.445941,232.691254,85.944790,44.068548,3.418110,13.914695,10.342230,27.130427,1.588975,0.264560,...,Summer,39.747101,99.083629,Wednesday,21,11,2015,Fog,49,2
2495,248.865589,60.469210,64.775036,24.676307,3.824908,138.289059,22.516228,55.225753,7.329591,0.214645,...,Summer,23.628014,92.791756,Monday,1,8,2020,Clear,9,0
2496,205.705537,24.804909,79.195931,10.788163,4.791531,153.579056,15.157692,12.125589,6.985294,0.149666,...,Autumn,35.794039,115.345610,Tuesday,12,3,2023,Rain,33,4
2497,19.504890,35.869950,55.534393,30.365828,4.919669,50.514686,18.626287,48.348076,6.138192,0.080350,...,Winter,44.825846,115.987397,Tuesday,17,3,2019,Snow,19,5


In [15]:
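import pandas as pd

# Build the left and right frames (reset the index so rows align by position)
left = train_data.iloc[:, 0:17].reset_index(drop=True)
right = train_data.iloc[:, -2:].reset_index(drop=True)

# Merge on the row index (assumes both frames have the same row order)
train_data_filter = pd.merge(
    left,
    right,
    left_index=True,   # use the left frame's index as the key
    right_index=True,  # use the right frame's index as the key
    how='inner'        # inner join, so unmatched rows are dropped instead of filled with NaN
)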
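
An index merge pairs rows by index label, not by row position, which is why the cell above resets both indexes first. The sketch below uses two made-up toy frames (not the air-quality data) to show how the pairing changes when one frame carries a shuffled index.

```python
import pandas as pd

# Toy frames with made-up values; `right` has a shuffled index, as it might after sorting or filtering
left = pd.DataFrame({"pm25": [10.0, 20.0, 30.0]}, index=[0, 1, 2])
right = pd.DataFrame({"code": [7, 8, 9]}, index=[2, 0, 1])

# Index merge without resetting: rows are paired by index label
by_label = pd.merge(left, right, left_index=True, right_index=True, how="inner")

# Index merge after resetting: rows are paired by position
by_position = pd.merge(
    left.reset_index(drop=True),
    right.reset_index(drop=True),
    left_index=True,
    right_index=True,
    how="inner",
)

# Compare which `code` ends up next to each `pm25` value
pairs_by_label = dict(zip(by_label["pm25"].tolist(), by_label["code"].tolist()))
pairs_by_position = dict(zip(by_position["pm25"].tolist(), by_position["code"].tolist()))
print(pairs_by_label)
print(pairs_by_position)
```

Resetting the index is only safe when both frames really are in the same row order, as the comment in the cell above assumes.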

In [20]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Work on the training features loaded above.
# `df` is an alias of train_data (not a copy), so new columns are added to train_data too.

df = train_data

# Custom mapping for the Season column (adjust to your own categories)
SEASON_MAPPING = {
    'Spring': 10,
    'Summer': 100,
    'Autumn': 1000,
    'Winter': 10000
}

# Apply the mapping to the Season column
df['season_encoded'] = df['Season'].map(SEASON_MAPPING)

# Encode the Weather Condition column as integers
# Option 1: let LabelEncoder assign the codes automatically
le = LabelEncoder()
df['weather_encoded'] = le.fit_transform(df['Weather Condition'])

# Option 2: define a custom weather mapping (if a specific order is needed)
# WEATHER_MAPPING = {'sunny':0, 'cloudy':1, 'rainy':2, 'fog':3}
# df['weather_encoded'] = df['weather'].map(WEATHER_MAPPING)

# Build the product (interaction) column
df['season_weather_interaction'] = df['season_encoded'] * df['weather_encoded']

# Show the result
print("Processed DataFrame:")
print(df)

# Show the weather encoding
print("\nWeather encoding:")
print(dict(zip(le.classes_, le.transform(le.classes_))))

Processed DataFrame:
      PM2.5 (µg/m³)  PM10 (µg/m³)  NO2 (µg/m³)  SO2 (µg/m³)  CO (mg/m³)  \
0        204.626541    177.172912    56.181879    40.973914    0.858690   
1        178.477230    114.525464    48.634363     8.791909    1.921633   
2        116.317364    209.525485    61.934167    27.043954    1.870824   
3        214.559012     29.915876    71.323678     7.326995    3.013140   
4         34.392308    158.081571    54.892799    25.356895    4.640395   
...             ...           ...          ...          ...         ...   
2494     146.445941    232.691254    85.944790    44.068548    3.418110   
2495     248.865589     60.469210    64.775036    24.676307    3.824908   
2496     205.705537     24.804909    79.195931    10.788163    4.791531   
2497      19.504890     35.869950    55.534393    30.365828    4.919669   
2498     129.334502    147.139323    85.320211    24.083668    0.390219   

      O3 (µg/m³)  Temperature (°C)  Humidity (%)  Wind Speed (m/s)  \
0      48.073378    

In [27]:
df

Unnamed: 0,PM2.5 (µg/m³),PM10 (µg/m³),NO2 (µg/m³),SO2 (µg/m³),CO (mg/m³),O3 (µg/m³),Temperature (°C),Humidity (%),Wind Speed (m/s),Wind Direction (°),...,Season,Latitude,Longitude,Day of Week,Hour,Month,Year,Weather Condition,Station ID,season_weather_interaction
0,204.626541,177.172912,56.181879,40.973914,0.858690,48.073378,23.491706,58.757424,8.168284,359.881532,...,Winter,44.134919,106.844439,Tuesday,16,1,2016,Clear,32,0
1,178.477230,114.525464,48.634363,8.791909,1.921633,163.915372,5.175494,33.352674,2.957370,359.792024,...,Autumn,48.591492,118.513972,Saturday,5,7,2020,Clear,80,0
2,116.317364,209.525485,61.934167,27.043954,1.870824,176.562196,27.285353,11.964247,7.152106,359.124842,...,Spring,29.407165,100.469703,Saturday,6,6,2023,Clear,11,0
3,214.559012,29.915876,71.323678,7.326995,3.013140,94.270021,-6.936606,40.184682,6.903480,359.114918,...,Summer,43.459010,125.615611,Tuesday,0,6,2018,Snow,83,500
4,34.392308,158.081571,54.892799,25.356895,4.640395,73.047320,6.969151,81.484330,3.098179,358.962514,...,Spring,40.273913,128.614599,Wednesday,22,6,2021,Clear,46,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2494,146.445941,232.691254,85.944790,44.068548,3.418110,13.914695,10.342230,27.130427,1.588975,0.264560,...,Summer,39.747101,99.083629,Wednesday,21,11,2015,Fog,49,200
2495,248.865589,60.469210,64.775036,24.676307,3.824908,138.289059,22.516228,55.225753,7.329591,0.214645,...,Summer,23.628014,92.791756,Monday,1,8,2020,Clear,9,0
2496,205.705537,24.804909,79.195931,10.788163,4.791531,153.579056,15.157692,12.125589,6.985294,0.149666,...,Autumn,35.794039,115.345610,Tuesday,12,3,2023,Rain,33,4000
2497,19.504890,35.869950,55.534393,30.365828,4.919669,50.514686,18.626287,48.348076,6.138192,0.080350,...,Winter,44.825846,115.987397,Tuesday,17,3,2019,Snow,19,50000
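
The `season_weather_interaction` values above are the product of the season code and the weather code. The sketch below reproduces that arithmetic in plain Python. The weather codes are inferred from the displayed rows only (Clear, Fog, Rain, Snow), so the mapping is partial and should be read as an assumption.

```python
SEASON_MAPPING = {"Spring": 10, "Summer": 100, "Autumn": 1000, "Winter": 10000}

# Weather codes inferred from the rows displayed above; codes 1 and 3 are not visible there
WEATHER_CODES = {"Clear": 0, "Fog": 2, "Rain": 4, "Snow": 5}


def interaction(season, weather):
    """Product feature, mirroring season_encoded * weather_encoded."""
    return SEASON_MAPPING[season] * WEATHER_CODES[weather]


rows = [("Winter", "Clear"), ("Summer", "Snow"), ("Summer", "Fog"),
        ("Autumn", "Rain"), ("Winter", "Snow")]
for season, weather in rows:
    print(season, weather, interaction(season, weather))

# Every Clear row collapses to 0, whatever the season
clear_values = {interaction(s, "Clear") for s in SEASON_MAPPING}
print(clear_values)
```

Because Clear is encoded as 0, the product cannot distinguish seasons on Clear rows. If that distinction matters, one option (not used in this notebook) is to keep `season_encoded` as a feature alongside the product.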


In [29]:
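# Drop the raw time/categorical columns and the intermediate encodings,
# so that df matches the columns displayed in the output below
cols_to_drop = ['Month', 'Year', 'Hour', 'Season', 'Day of Week', 'Weather Condition',
                'weather_encode', 'season_encoded', 'weather_encoded']
df = df.drop(columns=cols_to_drop, errors='ignore')
df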

Unnamed: 0,PM2.5 (µg/m³),PM10 (µg/m³),NO2 (µg/m³),SO2 (µg/m³),CO (mg/m³),O3 (µg/m³),Temperature (°C),Humidity (%),Wind Speed (m/s),Wind Direction (°),Pressure (hPa),Precipitation (mm),Visibility (km),AQI,Latitude,Longitude,Station ID,season_weather_interaction
0,204.626541,177.172912,56.181879,40.973914,0.858690,48.073378,23.491706,58.757424,8.168284,359.881532,1028.454590,28.751510,3.929002,38,44.134919,106.844439,32,0
1,178.477230,114.525464,48.634363,8.791909,1.921633,163.915372,5.175494,33.352674,2.957370,359.792024,1010.721254,23.710453,9.583764,489,48.591492,118.513972,80,0
2,116.317364,209.525485,61.934167,27.043954,1.870824,176.562196,27.285353,11.964247,7.152106,359.124842,1041.280438,37.595955,17.182630,399,29.407165,100.469703,11,0
3,214.559012,29.915876,71.323678,7.326995,3.013140,94.270021,-6.936606,40.184682,6.903480,359.114918,1042.317963,16.569783,14.601751,162,43.459010,125.615611,83,500
4,34.392308,158.081571,54.892799,25.356895,4.640395,73.047320,6.969151,81.484330,3.098179,358.962514,1045.434773,34.266248,15.174932,423,40.273913,128.614599,46,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2494,146.445941,232.691254,85.944790,44.068548,3.418110,13.914695,10.342230,27.130427,1.588975,0.264560,1009.290000,36.405804,0.174305,191,39.747101,99.083629,49,200
2495,248.865589,60.469210,64.775036,24.676307,3.824908,138.289059,22.516228,55.225753,7.329591,0.214645,981.171254,0.029129,11.415098,261,23.628014,92.791756,9,0
2496,205.705537,24.804909,79.195931,10.788163,4.791531,153.579056,15.157692,12.125589,6.985294,0.149666,1009.447831,35.052453,8.655214,371,35.794039,115.345610,33,4000
2497,19.504890,35.869950,55.534393,30.365828,4.919669,50.514686,18.626287,48.348076,6.138192,0.080350,999.635641,24.476348,5.581354,465,44.825846,115.987397,19,50000


In [32]:
train_label

Unnamed: 0,City,City_encoded
0,Shenzhen,4
1,Shanghai,3
2,Beijing,0
3,Shanghai,3
4,Beijing,0
...,...,...
2494,Beijing,0
2495,Shenzhen,4
2496,Beijing,0
2497,Guangzhou,2


In [34]:
train = pd.merge(
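    df.reset_index(drop=True),          # reset the index of df
    train_label.iloc[:,1:].reset_index(drop=True),  # reset the index of train_label
    left_index=True,                     # use the left frame's index as the key
    right_index=True,                    # use the right frame's index as the key
    how='inner'                          # inner join, so unmatched rows are dropped instead of filled with NaN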
)
train

Unnamed: 0,PM2.5 (µg/m³),PM10 (µg/m³),NO2 (µg/m³),SO2 (µg/m³),CO (mg/m³),O3 (µg/m³),Temperature (°C),Humidity (%),Wind Speed (m/s),Wind Direction (°),Pressure (hPa),Precipitation (mm),Visibility (km),AQI,Latitude,Longitude,Station ID,season_weather_interaction,City_encoded
0,204.626541,177.172912,56.181879,40.973914,0.858690,48.073378,23.491706,58.757424,8.168284,359.881532,1028.454590,28.751510,3.929002,38,44.134919,106.844439,32,0,4
1,178.477230,114.525464,48.634363,8.791909,1.921633,163.915372,5.175494,33.352674,2.957370,359.792024,1010.721254,23.710453,9.583764,489,48.591492,118.513972,80,0,3
2,116.317364,209.525485,61.934167,27.043954,1.870824,176.562196,27.285353,11.964247,7.152106,359.124842,1041.280438,37.595955,17.182630,399,29.407165,100.469703,11,0,0
3,214.559012,29.915876,71.323678,7.326995,3.013140,94.270021,-6.936606,40.184682,6.903480,359.114918,1042.317963,16.569783,14.601751,162,43.459010,125.615611,83,500,3
4,34.392308,158.081571,54.892799,25.356895,4.640395,73.047320,6.969151,81.484330,3.098179,358.962514,1045.434773,34.266248,15.174932,423,40.273913,128.614599,46,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2494,146.445941,232.691254,85.944790,44.068548,3.418110,13.914695,10.342230,27.130427,1.588975,0.264560,1009.290000,36.405804,0.174305,191,39.747101,99.083629,49,200,0
2495,248.865589,60.469210,64.775036,24.676307,3.824908,138.289059,22.516228,55.225753,7.329591,0.214645,981.171254,0.029129,11.415098,261,23.628014,92.791756,9,0,4
2496,205.705537,24.804909,79.195931,10.788163,4.791531,153.579056,15.157692,12.125589,6.985294,0.149666,1009.447831,35.052453,8.655214,371,35.794039,115.345610,33,4000,0
2497,19.504890,35.869950,55.534393,30.365828,4.919669,50.514686,18.626287,48.348076,6.138192,0.080350,999.635641,24.476348,5.581354,465,44.825846,115.987397,19,50000,2


## Training

We now construct a `TabularPredictor` by specifying the label column name and then train on the dataset with `TabularPredictor.fit()`. We don't need to specify any other parameters. AutoGluon will recognize this is a multi-class classification task, perform automatic feature engineering, train multiple models, and then ensemble the models to create the final predictor.

In [35]:
predictor = TabularPredictor(label='City_encoded').fit(train)

No path specified. Models will be saved in: "AutogluonModels/ag-20250415_141440"
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.2
Python Version:     3.11.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP PREEMPT_DYNAMIC Thu Jun 27 21:05:47 UTC 2024
CPU Count:          2
Memory Avail:       11.10 GB / 12.67 GB (87.6%)
Disk Space Avail:   62.15 GB / 107.72 GB (57.7%)
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='experimental' : New in v1.2: Pre-trained foundation model + parallel fits. The absolute best accuracy without consideration for inference speed. Does not support GPU.
	presets='best'         : Maximize accuracy. Recommended for most users. Use in competitions and benchmarks.
	presets='high'         : Strong acc

Model fitting should take a few minutes or less depending on your CPU. You can make training faster by specifying the `time_limit` argument. For example, `fit(..., time_limit=60)` will stop training after 60 seconds. Higher time limits will generally result in better prediction performance, and excessively low time limits will prevent AutoGluon from training and ensembling a reasonable set of models.
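
As a minimal sketch of the `time_limit` argument described above, the helper below wraps the same `TabularPredictor(...).fit(...)` call used in this notebook. The function name `fit_with_time_limit` and its defaults are illustrative assumptions, not part of AutoGluon.

```python
def fit_with_time_limit(train_df, label="City_encoded", time_limit=60):
    """Train a TabularPredictor, asking fit() to stop after about `time_limit` seconds."""
    # Imported lazily so this sketch can be defined without AutoGluon installed
    from autogluon.tabular import TabularPredictor

    return TabularPredictor(label=label).fit(train_df, time_limit=time_limit)
```

With the merged `train` frame from above, this would be called as `predictor = fit_with_time_limit(train, time_limit=60)`.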



## Prediction

Once we have a predictor that is fit on the training dataset, we can load a separate set of data to use for prediction and evaluation.

In [None]:
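test_data = TabularDataset('/content/test_input.csv')
# Recreate the engineered training feature (assumes the same columns as train_input.csv;
# reuses SEASON_MAPPING and the fitted `le` from above)
test_data['season_weather_interaction'] = (
    test_data['Season'].map(SEASON_MAPPING) * le.transform(test_data['Weather Condition']))
label = 'City_encoded'  # previously undefined; dropped only if the test file contains it
y_pred = predictor.predict(test_data.drop(columns=[label], errors='ignore'))
y_pred.head()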

## Evaluation

We can evaluate the predictor on the test dataset using the `evaluate()` function, which measures how well our predictor performs on data that was not used for fitting the models. The data passed to `evaluate()` must include the label column (`City_encoded` here).

In [None]:
predictor.evaluate(test_data, silent=True)
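
Because the labels in this notebook live in a separate file from the features, they need to be attached to the test features before evaluation. The sketch below shows one way to do that by row position, using made-up toy data. The helper name `attach_labels` is hypothetical, and where the test labels come from is an assumption, since the notebook only shows `test_input.csv`.

```python
import pandas as pd


def attach_labels(features, labels, label_col="City_encoded"):
    """Return a copy of `features` with `labels` added as `label_col`, paired by row position."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same number of rows")
    out = features.reset_index(drop=True).copy()
    out[label_col] = pd.Series(labels).reset_index(drop=True)
    return out


# Toy data (made-up values) standing in for test features and their encoded labels
toy_features = pd.DataFrame({"AQI": [38, 489, 399]}, index=[5, 6, 7])
toy_labels = [4, 3, 0]
toy_test = attach_labels(toy_features, toy_labels)
print(toy_test)
```

A frame built this way could then be passed to `predictor.evaluate(...)` or `predictor.leaderboard(...)`.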

AutoGluon's `TabularPredictor` also provides the `leaderboard()` function, which allows us to evaluate the performance of each individual trained model on the test data.

In [None]:
predictor.leaderboard(test_data)

## Conclusion

In this quickstart tutorial we saw AutoGluon's basic fit and predict functionality using `TabularDataset` and `TabularPredictor`. AutoGluon simplifies the model training process by not requiring feature engineering or model hyperparameter tuning. Check out the in-depth tutorials to learn more about AutoGluon's other features like customizing the training and prediction steps or extending AutoGluon with custom feature generators, models, or metrics.