### 📦 Setup: Kaggle + AutoGluon Environment
This first cell authenticates Kaggle using my `kaggle.json`, installs the slim `autogluon.tabular` package, and downloads the **IEEE-CIS Fraud Detection** dataset.  
It extracts the data into `/content/data/ieee-fraud-detection` and prepares a save path for AutoGluon models.

In [1]:
# --- Colab setup: Kaggle competition + AutoGluon Tabular ---
import os

# ---- CONFIG ----
KAGGLE_COMPETITION = "ieee-fraud-detection"
DATA_DIR = "/content/data"
DATASET = os.path.join(DATA_DIR, KAGGLE_COMPETITION)
AUTOGLUON_SAVE_PATH = os.path.join(DATA_DIR, "AutoGluonModels")

print("Competition:", KAGGLE_COMPETITION)
print("DATA_DIR:", DATA_DIR)
print("AUTOGLUON_SAVE_PATH:", AUTOGLUON_SAVE_PATH)

# ---- Install slim AutoGluon + Kaggle CLI ----
!pip install -q kaggle autogluon.tabular

# ---- Kaggle auth (assumes kaggle.json exists in Google Drive) ----
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

!mkdir -p ~/.kaggle
!cp /content/drive/MyDrive/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

# ---- Download + extract competition files ----
!mkdir -p "{DATA_DIR}" "{AUTOGLUON_SAVE_PATH}"
!kaggle competitions download -c "{KAGGLE_COMPETITION}" -p "{DATA_DIR}" --force
!unzip -o -q "{DATA_DIR}/{KAGGLE_COMPETITION}.zip" -d "{DATA_DIR}/{KAGGLE_COMPETITION}"
!rm -f "{DATA_DIR}/{KAGGLE_COMPETITION}.zip"
!ls -lh "{DATA_DIR}/{KAGGLE_COMPETITION}"

Competition: ieee-fraud-detection
DATA_DIR: /content/data
AUTOGLUON_SAVE_PATH: /content/data/AutoGluonModels
Mounted at /content/drive
Downloading ieee-fraud-detection.zip to /content/data
  0% 0.00/118M [00:00<?, ?B/s]
100% 118M/118M [00:00<00:00, 1.34GB/s]
total 1.3G
-rw-r--r-- 1 root root 5.8M Dec 11  2019 sample_submission.csv
-rw-r--r-- 1 root root  25M Dec 11  2019 test_identity.csv
-rw-r--r-- 1 root root 585M Dec 11  2019 test_transaction.csv
-rw-r--r-- 1 root root  26M Dec 11  2019 train_identity.csv
-rw-r--r-- 1 root root 652M Dec 11  2019 train_transaction.csv


### ⚡ Fast Training on 20 K Sample with Feature Alignment
Here I merge the `train_transaction` and `train_identity` files, sample **20 000 rows** for quick runtime, and apply minimal preprocessing: drop `TransactionID`, fill missing objects with `"NA"` and numbers with `-999`, and cast `isFraud` to int.  
AutoGluon trains a **single LightGBM model** with a short time limit.  
Before prediction, I align the test dataset to the model’s exact feature set—adding any missing columns like `id_01`–`id_38`—so inference runs smoothly.  
The notebook then writes predictions to `submission.csv`.

In [2]:
# === FAST MODE (Academic demo): IEEE-CIS Fraud (with test feature alignment + robust proba) ===
import os, time, numpy as np, pandas as pd
from autogluon.tabular import TabularPredictor

COMP_DIR = DATASET
N_TRAIN = 20_000    # 20k rows trained for speed
TL = 300            # Limit training for demonstration

# Load & merge
train_id = pd.read_csv(os.path.join(COMP_DIR, "train_identity.csv"))
train_tx = pd.read_csv(os.path.join(COMP_DIR, "train_transaction.csv"))
train_full = pd.merge(train_tx, train_id, on="TransactionID", how="left")
del train_id, train_tx
print("Full train shape:", train_full.shape)

# Sample rows for speed
train_df = train_full.sample(n=min(N_TRAIN, len(train_full)), random_state=42).reset_index(drop=True)
del train_full

# Minimal preprocess
if "TransactionID" in train_df.columns:
    train_df.drop(columns=["TransactionID"], inplace=True)
obj_cols = train_df.select_dtypes(include=["object"]).columns
train_df[obj_cols] = train_df[obj_cols].fillna("NA")          # strings
num_cols = [c for c in train_df.select_dtypes(include=["number"]).columns if c != "isFraud"]
train_df[num_cols] = train_df[num_cols].fillna(-999)          # numerics
train_df["isFraud"] = train_df["isFraud"].astype("int8")

# Fastest hyperparameters: LightGBM only
hyperparameters = {
    "GBM": [{"num_boost_round": 300, "learning_rate": 0.1, "num_leaves": 31, "min_data_in_leaf": 200}],
    "CAT": [], "XGB": [], "RF": [], "XT": [], "NN_TORCH": []
}

# --- Use a fresh, unique run folder to avoid "path already exists" warning ---
BASE_PATH = AUTOGLUON_SAVE_PATH
os.makedirs(BASE_PATH, exist_ok=True)
RUN_NAME = f"ieee_fraud_{time.strftime('%Y%m%d_%H%M%S')}"
MODEL_PATH = os.path.join(BASE_PATH, RUN_NAME)

predictor = TabularPredictor(
    label="isFraud",
    eval_metric="roc_auc",
    path=MODEL_PATH,
    verbosity=2
).fit(
    train_data=train_df,
    hyperparameters=hyperparameters,
    presets=None,
    time_limit=TL,
    num_bag_folds=0,
    num_stack_levels=0,
    keep_only_best=True,
)

# Predict on FULL test & write submission
test_id = pd.read_csv(os.path.join(COMP_DIR, "test_identity.csv"))
test_tx = pd.read_csv(os.path.join(COMP_DIR, "test_transaction.csv"))
test_df = pd.merge(test_tx, test_id, on="TransactionID", how="left")
del test_id, test_tx

if "TransactionID" in test_df.columns:
    test_df.drop(columns=["TransactionID"], inplace=True)

# --- Align test columns to trained features ---
required_feats = predictor.features()
missing = [c for c in required_feats if c not in test_df.columns]
if missing:
    print(f"Adding {len(missing)} missing columns to test (showing up to 10): {missing[:10]}{' ...' if len(missing)>10 else ''}")
    for c in missing:
        test_df[c] = np.nan

# Keep ONLY required features (drops extras, fixes order)
test_df = test_df[required_feats]

# Fast fills (match train policy)
obj_cols_t = test_df.select_dtypes(include=["object"]).columns
num_cols_t = test_df.select_dtypes(include=["number"]).columns
if len(obj_cols_t):
    test_df[obj_cols_t] = test_df[obj_cols_t].fillna("NA")
if len(num_cols_t):
    test_df[num_cols_t] = test_df[num_cols_t].fillna(-999)

# --- Robust predict_proba handling (Series vs DataFrame) ---
proba = predictor.predict_proba(test_df)
if isinstance(proba, pd.Series):
    preds = proba.values
else:
    if 1 in proba.columns:
        preds = proba[1].values
    elif "1" in proba.columns:
        preds = proba["1"].values
    else:
        preds = proba.iloc[:, -1].values

# Save submission next to the model for traceability
sub = pd.read_csv(os.path.join(COMP_DIR, "sample_submission.csv"))
sub["isFraud"] = preds
out_path = os.path.join(MODEL_PATH, "submission.csv")
sub.to_csv(out_path, index=False)
print("✅ Saved submission:", out_path)

Full train shape: (590540, 434)


Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.4.0
Python Version:     3.12.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Thu Oct  2 10:42:05 UTC 2025
CPU Count:          8
Memory Avail:       47.36 GB / 50.99 GB (92.9%)
Disk Space Avail:   183.02 GB / 225.83 GB (81.0%)
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='extreme' : New in v1.4: Massively better than 'best' on datasets <30000 samples by using new models meta-learned on https://tabarena.ai: TabPFNv2, TabICL, Mitra, and TabM. Absolute best accuracy. Requires a GPU. Recommended 64 GB CPU memory and 32+ GB GPU memory.
	presets='best'    : Maximize accuracy. Recommended for most users. Use in competitions and benchmarks.
	presets='high'    : Strong accuracy w

Adding 38 missing columns to test (showing up to 10): ['id_01', 'id_02', 'id_03', 'id_04', 'id_05', 'id_06', 'id_07', 'id_08', 'id_09', 'id_10'] ...
✅ Saved submission: /content/data/AutoGluonModels/ieee_fraud_20251014_181753/submission.csv


### 🚀 Submit to Kaggle
This cell submits the saved `submission.csv` to the **ieee-fraud-detection** competition using the Kaggle CLI.  
It lists the file, sends the submission, and prints recent submissions for confirmation.

In [3]:
# === Submit to Kaggle competition ===
import os

COMPETITION = "ieee-fraud-detection"
SUBMISSION_FILE = os.path.join(AUTOGLUON_SAVE_PATH, "submission.csv")
MESSAGE = "AutoGluon fast academic demo submission"

# Verify the submission file exists
!ls -lh "$SUBMISSION_FILE"

# Submit to Kaggle
!kaggle competitions submit -c "$COMPETITION" -f "$SUBMISSION_FILE" -m "$MESSAGE"

# Show recent submissions for confirmation
!kaggle competitions submissions -c "$COMPETITION" | head -n 20

-rw-r--r-- 1 root root 15M Oct 14 18:05 /content/data/AutoGluonModels/submission.csv
100% 14.2M/14.2M [00:00<00:00, 21.4MB/s]
Successfully submitted to IEEE-CIS Fraud DetectionfileName        date                        description                              status                     publicScore  privateScore  
--------------  --------------------------  ---------------------------------------  -------------------------  -----------  ------------  
submission.csv  2025-10-14 18:18:47.153000  AutoGluon fast academic demo submission  SubmissionStatus.PENDING                              
submission.csv  2025-10-14 18:13:15.017000  AutoGluon fast academic demo submission  SubmissionStatus.COMPLETE  0.892597     0.877589      
submission.csv  2025-10-14 18:05:36.010000  AutoGluon fast academic demo submission  SubmissionStatus.COMPLETE  0.892597     0.877589      
submission.csv  2025-10-14 17:55:41.007000  AutoGluon fast academic demo submission  SubmissionStatus.COMPLETE  0.892597    