<a href="https://colab.research.google.com/github/kat-le/autogluon-kaggle/blob/main/ieee_fraud_autogluon.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to use AutoGluon for Kaggle competitions

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/autogluon/autogluon/blob/stable/docs/tutorials/tabular/advanced/tabular-kaggle.ipynb)
[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/autogluon/autogluon/blob/stable/docs/tutorials/tabular/advanced/tabular-kaggle.ipynb)



This tutorial will teach you how to use AutoGluon to become a serious Kaggle competitor without writing lots of code.
We first outline the general steps to use AutoGluon in Kaggle contests. Here, we assume the competition involves tabular data which are stored in one (or more) CSV files.

1) Run this Bash command: `pip install kaggle`

2) Navigate to: https://www.kaggle.com/account and create an account (if necessary).
Then, click on "Create New API Token" and move the downloaded file to this location on your machine: `~/.kaggle/kaggle.json`. For troubleshooting, see the [Kaggle API instructions](https://www.kaggle.com/docs/api).

3) To download data programmatically, execute this Bash command in your terminal:

`kaggle competitions download -c [COMPETITION]`

Here, [COMPETITION] should be replaced by the name of the competition you wish to enter.
Alternatively, you can download the data manually: navigate to the website of the Kaggle competition you wish to enter, click "Download All", and accept the competition's terms.

4) If the competition's training data consists of multiple CSV files, use [pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html) to merge/join them into a single data table where rows = training examples, columns = features.

5) Run autogluon `fit()` on the resulting data table.

6) Load the test dataset from the competition (again making the necessary merges/joins to ensure it is in the exact same format as the training data table), and then call AutoGluon's `predict()`. Next, use [pandas.read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) to load the competition's `sample_submission.csv` file into a DataFrame, put the AutoGluon predictions in the right column of this DataFrame, and finally save it as a CSV file via [pandas.to_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html). If the competition does not offer a sample submission file, you will need to create the submission file yourself by appropriately reformatting AutoGluon's test predictions.

7) Submit your predictions via Bash command:

`kaggle competitions submit -c [COMPETITION] -f [FILE] -m ["MESSAGE"]`

Here, [COMPETITION] is again the competition's name, [FILE] is the name of the CSV file you created with your predictions, and ["MESSAGE"] is a string message you want to record with this submitted entry. Alternatively, you can manually upload your predictions file on the competition website.

8) Finally, navigate to the competition's leaderboard page to see how well your submission performed!
It may take time for your submission to appear.
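
Step 2 can be checked from Python before any download is attempted. The following is a minimal sketch, assuming the default token location given above; the helper name `check_kaggle_token` is hypothetical and not part of the Kaggle API, and the expected `username` and `key` fields are an assumption about the token file's contents.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_token(path=None):
    """Return a list of problems found with the Kaggle API token file (hypothetical helper)."""
    token_path = Path(path) if path is not None else Path.home() / ".kaggle" / "kaggle.json"
    if not token_path.is_file():
        return [f"token file not found: {token_path}"]
    try:
        token = json.loads(token_path.read_text())
    except json.JSONDecodeError:
        return [f"token file is not valid JSON: {token_path}"]
    if not isinstance(token, dict):
        return ["token file does not contain a JSON object"]

    problems = []
    # Assumption: the token stores "username" and "key" fields.
    for field in ("username", "key"):
        if field not in token:
            problems.append(f"missing field: {field}")
    # On POSIX systems, flag files that are accessible by group or others.
    if os.name == "posix":
        mode = stat.S_IMODE(token_path.stat().st_mode)
        if mode & 0o077:
            problems.append(f"permissions too open: {oct(mode)} (expected 0o600)")
    return problems


if __name__ == "__main__":
    for problem in check_kaggle_token():
        print("Problem:", problem)
```

An empty list means the file was found, parsed, and (on POSIX systems) is restricted to the current user.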



Below, we demonstrate how to do steps (4)-(6) in Python for a specific Kaggle competition: [ieee-fraud-detection](https://www.kaggle.com/c/ieee-fraud-detection/).
This means you'll need to run the above steps with `[COMPETITION]` replaced by `ieee-fraud-detection` in each command.  Here, we assume you've already completed steps (1)-(3) and the data CSV files are available on your computer. To begin step (4), we first load the competition's training data into Python:

In [None]:
!pip -q install -U pip
!pip -q install -U autogluon kaggle

import autogluon, sys, platform, os, pandas as pd, numpy as np
print("Python:", sys.version.split()[0])
print("Platform:", platform.platform())
!nvidia-smi || true

DIR = "/content/IEEEfraud/"
os.makedirs(DIR, exist_ok=True)
print("Working dir:", DIR)

Python: 3.12.12
Platform: Linux-6.6.105+-x86_64-with-glibc2.35
Tue Oct 14 00:04:11 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:00:05.0 Off |                    0 |
| N/A   36C    P0             52W /  400W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+---------

In [None]:
!pip uninstall -y torchaudio

Found existing installation: torchaudio 2.8.0+cu126
Uninstalling torchaudio-2.8.0+cu126:
  Successfully uninstalled torchaudio-2.8.0+cu126


In [None]:
from google.colab import files
print("Upload your kaggle.json (Kaggle > Account > Create New API Token)")
files.upload()

!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json
!ls -l ~/.kaggle

Upload your kaggle.json (Kaggle > Account > Create New API Token)


Saving kaggle.json to kaggle.json
total 4
-rw------- 1 root root 65 Oct 14 00:04 kaggle.json


In [None]:
%cd /content
!kaggle competitions download -c ieee-fraud-detection -p "{DIR}"
!unzip -o -q "{DIR}/ieee-fraud-detection.zip" -d "{DIR}"
!ls -lh "{DIR}" | head -n 20

/content
Downloading ieee-fraud-detection.zip to /content/IEEEfraud
  0% 0.00/118M [00:00<?, ?B/s]
100% 118M/118M [00:00<00:00, 1.69GB/s]
total 1.4G
-rw-r--r-- 1 root root 119M Dec 11  2019 ieee-fraud-detection.zip
-rw-r--r-- 1 root root 5.8M Dec 11  2019 sample_submission.csv
-rw-r--r-- 1 root root  25M Dec 11  2019 test_identity.csv
-rw-r--r-- 1 root root 585M Dec 11  2019 test_transaction.csv
-rw-r--r-- 1 root root  26M Dec 11  2019 train_identity.csv
-rw-r--r-- 1 root root 652M Dec 11  2019 train_transaction.csv
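
Outside a notebook, where the `!unzip` shell escape is not available, the extraction step can be done with Python's standard `zipfile` module. This is a minimal sketch; `extract_competition_zip` is a hypothetical helper, and the paths in the usage note are the ones used in this notebook.

```python
import zipfile
from pathlib import Path


def extract_competition_zip(zip_path, target_dir):
    """Extract every member of zip_path into target_dir and return the sorted CSV file names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    # List only the CSV files at the top level of the target directory.
    return sorted(p.name for p in target.glob("*.csv"))
```

For this competition the call would look like `extract_competition_zip("/content/IEEEfraud/ieee-fraud-detection.zip", "/content/IEEEfraud")`.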


In [None]:
import pandas as pd
import numpy as np
from autogluon.tabular import TabularPredictor

directory = '/content/IEEEfraud/'
label = 'isFraud'
eval_metric = 'roc_auc'
save_path = directory + 'AutoGluonModels/'

train_identity = pd.read_csv(directory+'train_identity.csv')
train_transaction = pd.read_csv(directory+'train_transaction.csv')

Since the training data for this competition consists of multiple CSV files, we first join them into a single large table (with rows = examples, columns = features) before applying AutoGluon:

In [None]:
train_data = pd.merge(train_transaction, train_identity, on='TransactionID', how='left')
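
Before moving on, it can be worth confirming that a join like this one behaves as intended. The sketch below uses small made-up tables in place of the real files and relies on two optional `pd.merge` arguments, `validate` and `indicator`; it assumes the key is unique in both tables, which you would need to confirm for the real data.

```python
import pandas as pd

# Toy stand-ins for train_transaction and train_identity (values are made up).
transactions = pd.DataFrame({"TransactionID": [1, 2, 3], "TransactionAmt": [10.0, 25.5, 7.0]})
identity = pd.DataFrame({"TransactionID": [1, 3], "DeviceType": ["mobile", "desktop"]})

# validate="one_to_one" raises pandas.errors.MergeError if TransactionID repeats on either side;
# indicator=True adds a "_merge" column recording where each row's key was found.
merged = pd.merge(
    transactions, identity, on="TransactionID", how="left",
    validate="one_to_one", indicator=True,
)

# A left join on a unique key should keep exactly one row per transaction.
assert len(merged) == len(transactions)

# Fraction of transactions that found a matching identity record.
matched_frac = (merged["_merge"] == "both").mean()
print(f"rows with identity info: {matched_frac:.1%}")

# Drop the helper column so it is not treated as a feature.
merged = merged.drop(columns="_merge")
```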

Note that a left join on the `TransactionID` key is the appropriate choice for this Kaggle competition, but other competitions with multiple training data files may call for a different join strategy, so always consider this carefully. Now that all our training data resides within a single table, we can apply AutoGluon. Below, we specify the `presets` argument to maximize AutoGluon's predictive accuracy, which usually requires running `fit()` with longer time limits (the 3600s used below should likely be increased in your run):

```
predictor = TabularPredictor(label=label, eval_metric=eval_metric, path=save_path, verbosity=3).fit(
    train_data, presets='best_quality', time_limit=3600
)

results = predictor.fit_summary()
```

The cell actually run in this notebook uses the `medium_quality` preset with a 900-second time limit:


In [None]:
predictor = TabularPredictor(label=label, eval_metric=eval_metric, path=save_path, verbosity=3).fit(
    train_data, presets='medium_quality', time_limit=900
)

results = predictor.fit_summary()

Verbosity: 3 (Detailed Logging)
AutoGluon Version:  1.4.0
Python Version:     3.12.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Thu Oct  2 10:42:05 UTC 2025
CPU Count:          12
GPU Count:          1
Memory Avail:       159.03 GB / 167.05 GB (95.2%)
Disk Space Avail:   188.69 GB / 235.68 GB (80.1%)
Presets specified: ['medium_quality']
User Specified kwargs:
{'auto_stack': False}
Full kwargs:
{'_experimental_dynamic_hyperparameters': False,
 '_feature_generator_kwargs': None,
 '_save_bag_folds': None,
 'ag_args': None,
 'ag_args_ensemble': None,
 'ag_args_fit': None,
 'auto_stack': False,
 'calibrate': 'auto',
 'delay_bag_sets': False,
 'ds_args': {'clean_up_fits': True,
             'detection_time_frac': 0.25,
             'enable_callbacks': False,
             'enable_ray_logging': True,
             'holdout_data': None,
             'holdout_frac': 0.1111111111111111,
             'memory_safe_fits': True,
             'n_folds': 2,
       

[50]	valid_set's binary_logloss: 0.100743
[100]	valid_set's binary_logloss: 0.0924349
[150]	valid_set's binary_logloss: 0.0884887
[200]	valid_set's binary_logloss: 0.0856469
[250]	valid_set's binary_logloss: 0.0834745
[300]	valid_set's binary_logloss: 0.0818649
[350]	valid_set's binary_logloss: 0.0805544
[400]	valid_set's binary_logloss: 0.079232
[450]	valid_set's binary_logloss: 0.0784206
[500]	valid_set's binary_logloss: 0.0768095
[550]	valid_set's binary_logloss: 0.0759847
[600]	valid_set's binary_logloss: 0.075066
[650]	valid_set's binary_logloss: 0.0743114
[700]	valid_set's binary_logloss: 0.0736169
[750]	valid_set's binary_logloss: 0.0727746
[800]	valid_set's binary_logloss: 0.072003
[850]	valid_set's binary_logloss: 0.0714918
[900]	valid_set's binary_logloss: 0.0708359
[950]	valid_set's binary_logloss: 0.0705871
[1000]	valid_set's binary_logloss: 0.0700511
[1050]	valid_set's binary_logloss: 0.0695724
[1100]	valid_set's binary_logloss: 0.0685792
[1150]	valid_set's binary_logloss:

Saving /content/IEEEfraud/AutoGluonModels/models/LightGBMXT/model.pkl
Saving /content/IEEEfraud/AutoGluonModels/utils/attr/LightGBMXT/y_pred_proba_val.pkl
	0.9707	 = Validation score   (roc_auc)
	440.06s	 = Training   runtime
	0.99s	 = Validation runtime
	5939.9	 = Inference  throughput (rows/s | 5906 batch size)
Saving /content/IEEEfraud/AutoGluonModels/models/trainer.pkl
Fitting model: LightGBM ... Training model for up to 429.18s of the 429.17s of remaining time.
	Fitting LightGBM with 'num_gpus': 0, 'num_cpus': 6
	Fitting with cpus=6, gpus=0, mem=9.7/155.0 GB
	Fitting 10000 rounds... Hyperparameters: {'learning_rate': 0.05}


[50]	valid_set's binary_logloss: 0.0958781
[100]	valid_set's binary_logloss: 0.0878627
[150]	valid_set's binary_logloss: 0.0833765
[200]	valid_set's binary_logloss: 0.0809382
[250]	valid_set's binary_logloss: 0.0788046
[300]	valid_set's binary_logloss: 0.0760906
[350]	valid_set's binary_logloss: 0.0747061
[400]	valid_set's binary_logloss: 0.0730843
[450]	valid_set's binary_logloss: 0.071836
[500]	valid_set's binary_logloss: 0.0708809
[550]	valid_set's binary_logloss: 0.0699904
[600]	valid_set's binary_logloss: 0.0689386
[650]	valid_set's binary_logloss: 0.067958
[700]	valid_set's binary_logloss: 0.0671754
[750]	valid_set's binary_logloss: 0.0665508
[800]	valid_set's binary_logloss: 0.0650938
[850]	valid_set's binary_logloss: 0.0641821
[900]	valid_set's binary_logloss: 0.0632762
[950]	valid_set's binary_logloss: 0.0627643
[1000]	valid_set's binary_logloss: 0.0623129
[1050]	valid_set's binary_logloss: 0.061703
[1100]	valid_set's binary_logloss: 0.0612304
[1150]	valid_set's binary_logloss

Saving /content/IEEEfraud/AutoGluonModels/models/LightGBM/model.pkl
Saving /content/IEEEfraud/AutoGluonModels/utils/attr/LightGBM/y_pred_proba_val.pkl
	0.9742	 = Validation score   (roc_auc)
	385.07s	 = Training   runtime
	0.82s	 = Validation runtime
	7217.5	 = Inference  throughput (rows/s | 5906 batch size)
Saving /content/IEEEfraud/AutoGluonModels/models/trainer.pkl
Fitting model: RandomForestGini ... Training model for up to 43.03s of the 43.02s of remaining time.
	Fitting RandomForestGini with 'num_gpus': 0, 'num_cpus': 12
	Fitting with cpus=12, gpus=0, mem=0.4/154.7 GB
Saving /content/IEEEfraud/AutoGluonModels/models/RandomForestGini/model.pkl
Saving /content/IEEEfraud/AutoGluonModels/utils/attr/RandomForestGini/y_pred_proba_val.pkl
	0.9319	 = Validation score   (roc_auc)
	238.54s	 = Training   runtime
	0.15s	 = Validation runtime
	39753.7	 = Inference  throughput (rows/s | 5906 batch size)
Saving /content/IEEEfraud/AutoGluonModels/models/trainer.pkl
Skipping RandomForestEntr due

*** Summary of fit() ***
Estimated performance of each model:
                 model  score_val eval_metric  pred_time_val    fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0  WeightedEnsemble_L2   0.974700     roc_auc       1.813882  825.210078                0.001297           0.080716            2       True          4
1             LightGBM   0.974212     roc_auc       0.818289  385.065197                0.818289         385.065197            1       True          2
2           LightGBMXT   0.970663     roc_auc       0.994296  440.064165                0.994296         440.064165            1       True          1
3     RandomForestGini   0.931882     roc_auc       0.148565  238.538071                0.148565         238.538071            1       True          3
Number of models trained: 4
Types of models trained:
{'LGBModel', 'WeightedEnsembleModel', 'RFModel'}
Bagging used: False 
Multi-layer stack-ensembling used: False 
Feature Metadata (

Now, we use the trained AutoGluon Predictor to make predictions on the competition's test data. It is imperative that multiple test data files are joined together in the exact same manner as the training data. Because this competition is evaluated based on the AUC (Area under the ROC curve) metric, we ask AutoGluon for predicted class-probabilities rather than class predictions. In general, when to use `predict` vs `predict_proba` will depend on the particular competition.

In [10]:
test_identity = pd.read_csv(directory+'test_identity.csv')
test_transaction = pd.read_csv(directory+'test_transaction.csv')

# Ensure test_identity has all columns present in train_identity, filling missing ones with NaN
train_identity_cols = train_identity.columns
test_identity_cols = test_identity.columns
missing_cols_in_test_identity = set(train_identity_cols) - set(test_identity_cols)
for col in missing_cols_in_test_identity:
    test_identity[col] = np.nan

test_data = pd.merge(test_transaction, test_identity, on='TransactionID', how='left')  # same join applied to training files

y_predproba = predictor.predict_proba(test_data)
y_predproba.head(5)  # some example predicted fraud-probabilities

Loading: /content/IEEEfraud/AutoGluonModels/models/LightGBM/model.pkl
Loading: /content/IEEEfraud/AutoGluonModels/models/LightGBMXT/model.pkl
Loading: /content/IEEEfraud/AutoGluonModels/models/WeightedEnsemble_L2/model.pkl


          0         1
0  0.999839  0.000161
1  0.999992  0.000008
2  0.999980  0.000020
3  0.999893  0.000107
4  0.999973  0.000027
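
The alignment loop in the cell above adds any column that `test_identity` lacks and fills it with NaN. If the mismatch comes only from a naming difference between the two files (for example, hyphens in one file where the other uses underscores, which is an assumption you should check by printing both column lists), renaming keeps the values instead of discarding them. The following is a minimal sketch on toy data; `align_column_names` is a hypothetical helper, not part of AutoGluon or pandas.

```python
import pandas as pd


def align_column_names(test_df, train_columns):
    """Rename test columns to their train spelling when they differ only by '-' vs '_'."""
    train_set = set(train_columns)
    renames = {}
    for col in test_df.columns:
        candidate = col.replace("-", "_")
        # Rename only when the new name is expected by train and not already present in test.
        if col not in train_set and candidate in train_set and candidate not in test_df.columns:
            renames[col] = candidate
    return test_df.rename(columns=renames)


# Toy example: train uses underscores, test uses hyphens (made-up values).
train_cols = ["TransactionID", "id_01", "id_02"]
toy_test = pd.DataFrame({"TransactionID": [10, 11], "id-01": [0.0, -5.0], "id-02": [1.5, 2.5]})

aligned = align_column_names(toy_test, train_cols)
still_missing = set(train_cols) - set(aligned.columns)
print(list(aligned.columns), still_missing)
```

Any names left in `still_missing` after renaming are genuinely absent from the test file and would still need the NaN fill.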


When submitting predicted probabilities for classification competitions, it is imperative these correspond to the same class expected by Kaggle. For binary classification tasks, you can see which class AutoGluon's predicted probabilities correspond to via:

In [11]:
predictor.positive_class

1

For multiclass classification tasks, you can see which classes AutoGluon's predicted probabilities correspond to via:

In [12]:
predictor.class_labels  # classes in this list correspond to columns of predict_proba() output

[0, 1]

Now, let's get predicted probabilities for the entire test data, keeping only the positive-class probabilities by specifying `as_multiclass=False`:

In [13]:
y_predproba = predictor.predict_proba(test_data, as_multiclass=False)

Loading: /content/IEEEfraud/AutoGluonModels/models/LightGBM/model.pkl
Loading: /content/IEEEfraud/AutoGluonModels/models/LightGBMXT/model.pkl
Loading: /content/IEEEfraud/AutoGluonModels/models/WeightedEnsemble_L2/model.pkl
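
With `as_multiclass=False`, the result is a single column of positive-class probabilities. The equivalent selection from the two-column output can be sketched on toy data; the frame below is a hand-built stand-in for the `predict_proba()` output, using two rows from the example shown earlier.

```python
import pandas as pd

# Toy two-column output shaped like predict_proba(); columns are the class labels [0, 1].
proba = pd.DataFrame({0: [0.999839, 0.999992], 1: [0.000161, 0.000008]})
positive_class = 1  # the value reported by predictor.positive_class above

# Select by label rather than by position, so the result does not depend on column order.
positive_proba = proba[positive_class]

# Each row of the two-column output sums to 1 (up to rounding).
assert ((proba.sum(axis=1) - 1.0).abs() < 1e-6).all()
```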


Now that we have made a prediction for each row in the test dataset, we can submit these predictions to Kaggle. Most Kaggle competitions provide a sample submission file, in which you can simply overwrite the sample predictions with your own as we do below:

In [14]:
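submission = pd.read_csv(directory+'sample_submission.csv')
submission['isFraud'] = y_predproba  # positive-class probabilities, assigned by row index
submission.to_csv(directory+'my_submission.csv', index=False)
submission.head()  # last expression in the cell, so the notebook displays it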
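
Because a malformed file costs a submission attempt, a few checks before uploading can help. The sketch below is one possible set of checks, not a Kaggle requirement list; `check_submission` is a hypothetical helper, and the `TransactionID` ID column is an assumption about the sample file's layout.

```python
import pandas as pd


def check_submission(submission, sample, id_col="TransactionID", target_col="isFraud"):
    """Return a list of problems found when comparing a submission to the sample file."""
    problems = []
    if list(submission.columns) != list(sample.columns):
        problems.append("columns differ from the sample submission")
    if len(submission) != len(sample):
        problems.append("row count differs from the sample submission")
    elif id_col in submission and id_col in sample:
        if not (submission[id_col].values == sample[id_col].values).all():
            problems.append("IDs are not in the same order as the sample submission")
    if target_col in submission:
        values = submission[target_col]
        if values.isna().any():
            problems.append("predictions contain missing values")
        if not values.dropna().between(0.0, 1.0).all():
            problems.append("predictions fall outside [0, 1]")
    return problems


# Toy data (made-up values): one clean submission and one with two problems.
sample = pd.DataFrame({"TransactionID": [1, 2, 3], "isFraud": [0.5, 0.5, 0.5]})
good = sample.assign(isFraud=[0.01, 0.90, 0.20])
bad = sample.assign(isFraud=[0.01, float("nan"), 1.70])

print(check_submission(good, sample))
print(check_submission(bad, sample))
```

In this notebook the call would be `check_submission(submission, pd.read_csv(directory+'sample_submission.csv'))`.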



We have now completed steps (4)-(6) from the top of this tutorial. To submit your predictions to Kaggle, you can run the following command in your terminal (from the appropriate directory):

`kaggle competitions submit -c ieee-fraud-detection -f my_submission.csv -m "my first submission"`
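
The same command can be issued from Python, which is convenient at the end of a training script. This is a minimal sketch using only the flags shown in this tutorial; `build_submit_command` and `submit` are hypothetical helpers, and the command is executed only when `run=True`.

```python
import shlex
import subprocess


def build_submit_command(competition, file_path, message):
    """Assemble the argument list for `kaggle competitions submit`."""
    return ["kaggle", "competitions", "submit", "-c", competition, "-f", file_path, "-m", message]


def submit(competition, file_path, message, run=False):
    """Print the command; execute it only when run=True (needs the kaggle CLI and a valid token)."""
    command = build_submit_command(competition, file_path, message)
    print(shlex.join(command))
    if run:
        # check=True raises CalledProcessError if the CLI exits with a non-zero status.
        subprocess.run(command, check=True)
    return command


# Dry run: prints the command without contacting Kaggle.
command = submit("ieee-fraud-detection", "my_submission.csv", "my first submission")
```

Passing the arguments as a list avoids shell quoting problems when the message contains spaces.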

You can now play with different `fit()` arguments and feature-engineering techniques to try and maximize the rank of your submissions in the Kaggle Leaderboard!


**Tips to maximize predictive performance:**

   - Be sure to specify the appropriate evaluation metric if one is specified on the competition website! If you are unsure which metric is best, then simply do not specify the `eval_metric` argument when constructing the `TabularPredictor`; AutoGluon should still produce high-quality models by automatically inferring which metric to use.

   - If the training examples are time-based and the competition test examples come from future data, we recommend you reserve the most recently collected training examples as a separate validation dataset passed to `fit()` (via its `tuning_data` argument). Otherwise, you do not need to specify a validation set yourself and AutoGluon will automatically partition the competition training data into its own training/validation sets.

   - Beyond simply specifying `presets = 'best_quality'`, you may experiment with more advanced `fit()` arguments such as: `num_bag_folds`, `num_stack_levels`, `num_bag_sets`, `hyperparameter_tune_kwargs`, `hyperparameters`, `refit_full`. However, we recommend spending most of your time on feature engineering and simply specifying `presets = 'best_quality'` inside the call to `fit()`.
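
The time-based validation tip above can be sketched as follows. This is an illustration on toy data, not the tutorial's own procedure: `split_by_time` is a hypothetical helper, and treating `TransactionDT` as a time-ordered column is an assumption. When bagging is enabled (as with `best_quality`), AutoGluon's documentation describes additional settings for using held-out validation data, so check the `fit()` reference before combining the two.

```python
import pandas as pd


def split_by_time(data, time_col, holdout_frac=0.1):
    """Return (train, validation) where validation holds the most recent rows by time_col."""
    ordered = data.sort_values(time_col, kind="stable")
    n_holdout = max(1, int(round(len(ordered) * holdout_frac)))
    return ordered.iloc[:-n_holdout], ordered.iloc[-n_holdout:]


# Toy data (made-up values); "TransactionDT" is assumed to be a time-ordered column.
toy = pd.DataFrame({
    "TransactionDT": [500, 100, 400, 200, 300, 900, 700, 800, 600, 1000],
    "isFraud": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
})
train_part, val_part = split_by_time(toy, "TransactionDT", holdout_frac=0.2)

# Every validation row is later than every training row.
assert val_part["TransactionDT"].min() > train_part["TransactionDT"].max()

# The two frames could then be used as: predictor.fit(train_part, tuning_data=val_part)
```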


**Troubleshooting:**

- Check that you have the right user-permissions on your computer to access the data files downloaded from Kaggle.

- For issues downloading Kaggle data or submitting predictions, check your Kaggle account setup and the [Kaggle FAQ](https://www.kaggle.com/general/14438).
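
The first troubleshooting item can be checked programmatically. Below is a minimal sketch; `find_unreadable` is a hypothetical helper, and the file names and directory are the ones that appear in this notebook's directory listing.

```python
import os
from pathlib import Path


def find_unreadable(directory, filenames):
    """Return the names that are missing or not readable by the current user."""
    base = Path(directory)
    return [name for name in filenames if not os.access(base / name, os.R_OK)]


# File names taken from the directory listing earlier in this tutorial.
expected = [
    "train_identity.csv", "train_transaction.csv",
    "test_identity.csv", "test_transaction.csv", "sample_submission.csv",
]
print(find_unreadable("/content/IEEEfraud/", expected))
```

An empty list means every expected file exists and can be read.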