### 📦 Setup: Kaggle + AutoGluon Environment
This cell installs the minimal dependencies (`kaggle`, `autogluon.tabular`), authenticates using my `kaggle.json` from Google Drive, and downloads the **California Housing Prices** dataset from Kaggle.  
It also creates folders to store models and outputs so the workflow can run start-to-finish in Colab.

In [1]:
# --- Colab setup: Kaggle competition + AutoGluon Tabular ---
import os

# ---- CONFIG ----
KAGGLE_COMPETITION = "california-house-prices"
DATA_DIR = "/content/data"
DATASET = os.path.join(DATA_DIR, KAGGLE_COMPETITION)
AUTOGLUON_SAVE_PATH = os.path.join(DATA_DIR, "AutoGluonModels")

print("Competition:", KAGGLE_COMPETITION)
print("DATA_DIR:", DATA_DIR)
print("AUTOGLUON_SAVE_PATH:", AUTOGLUON_SAVE_PATH)

# ---- Install slim AutoGluon + Kaggle CLI ----
!pip install -q kaggle autogluon.tabular scikit-learn

# ---- Kaggle auth (assumes kaggle.json exists in Google Drive) ----
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

!mkdir -p ~/.kaggle
!cp /content/drive/MyDrive/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

# ---- Download + extract competition files ----
!mkdir -p "{DATA_DIR}" "{AUTOGLUON_SAVE_PATH}"
!kaggle competitions download -c "{KAGGLE_COMPETITION}" -p "{DATA_DIR}" --force
!unzip -o -q "{DATA_DIR}/{KAGGLE_COMPETITION}.zip" -d "{DATA_DIR}/{KAGGLE_COMPETITION}"
!rm -f "{DATA_DIR}/{KAGGLE_COMPETITION}.zip"
!ls -lh "{DATA_DIR}/{KAGGLE_COMPETITION}"

Competition: california-house-prices
DATA_DIR: /content/data
AUTOGLUON_SAVE_PATH: /content/data/AutoGluonModels
Mounted at /content/drive
Downloading california-house-prices.zip to /content/data
  0% 0.00/29.5M [00:00<?, ?B/s]
100% 29.5M/29.5M [00:00<00:00, 1.08GB/s]
total 86M
-rw-r--r-- 1 root root 248K Mar 19  2021 sample_submission.csv
-rw-r--r-- 1 root root  35M Mar 19  2021 test.csv
-rw-r--r-- 1 root root  51M Mar 19  2021 train.csv


### 🤖 Fast Training on 20 K Sample
Here I load the housing dataset, take a **20 000-row sample** for a fast academic demonstration, and apply lightweight preprocessing: drop `Id`, log-scale numeric features, and use **LightGBM only** for quick training.  
After training, AutoGluon predicts home prices on the full test set, reverses the log transform, and saves the predictions as `submission.csv`.

In [2]:
# === FAST MODE (Academic demo): California Housing ===
import os, time, numpy as np, pandas as pd, random
from autogluon.tabular import TabularPredictor
from sklearn.model_selection import train_test_split

# ---- Reproducible seed ----
SEED = 42
np.random.seed(SEED)
random.seed(SEED)

COMP_DIR = DATASET
TRAIN_PATH = os.path.join(COMP_DIR, "train.csv")
TEST_PATH  = os.path.join(COMP_DIR, "test.csv")
SUB_PATH   = os.path.join(COMP_DIR, "sample_submission.csv")

N_TRAIN = 20_000    # 20k rows trained for speed
TL = 300            # Limit training for demonstration

train_full = pd.read_csv(TRAIN_PATH)
test_df    = pd.read_csv(TEST_PATH)
sub_df     = pd.read_csv(SUB_PATH)

# ✅ use SEED here for deterministic sampling
train_df = train_full.sample(n=min(N_TRAIN, len(train_full)), random_state=SEED).reset_index(drop=True)
del train_full

# ---- quick preprocess ----
target = "Sold Price"
if "Id" in train_df.columns: train_df.drop(columns=["Id"], inplace=True)
if "Id" in test_df.columns:  test_df.drop(columns=["Id"], inplace=True)

# log1p numerics (stable scale)
for c in train_df.select_dtypes(include=["number"]).columns:
    if c != target:
        train_df[c] = np.log1p(train_df[c].clip(lower=0))
train_df[target] = np.log1p(train_df[target].clip(lower=0))
for c in test_df.select_dtypes(include=["number"]).columns:
    test_df[c] = np.log1p(test_df[c].clip(lower=0))

# ---- Explicit splits: dev (for tuning) and holdout (never seen in training) ----
# 10% holdout, then 10% dev from remaining 90%
train_full_split, holdout = train_test_split(train_df, test_size=0.10, random_state=SEED)
train_split, dev_split    = train_test_split(train_full_split, test_size=0.10, random_state=SEED)
print(f"Train: {train_split.shape}, Dev: {dev_split.shape}, Holdout: {holdout.shape}")

# ---- single fast model (LightGBM) ----
hyperparameters = {
    "GBM": [{
        "num_boost_round": 200,
        "learning_rate": 0.1,
        "num_leaves": 31,
        "random_state": SEED,          # ensures deterministic splits
        "bagging_seed": SEED,          # ensures bagging reproducibility
        "feature_fraction_seed": SEED, # ensures consistent feature sampling
        "data_random_seed": SEED,
        "num_threads": 1,              # stricter reproducibility
    }],
    "CAT": [], "XGB": [], "RF": [], "XT": [], "NN_TORCH": []
}

# Propagate a seed into AutoGluon’s fitting stage
ag_args_fit = {"random_seed": SEED}

# --- Use a fresh, unique run folder to avoid "path already exists" warning ---
BASE_PATH = AUTOGLUON_SAVE_PATH
os.makedirs(BASE_PATH, exist_ok=True)
RUN_NAME = f"california_house_{time.strftime('%Y%m%d_%H%M%S')}"
MODEL_PATH = os.path.join(BASE_PATH, RUN_NAME)

predictor = TabularPredictor(
    label=target,
    eval_metric="rmse",
    path=MODEL_PATH,
    verbosity=2
)

trained = False
try:
    predictor.fit(
        train_data=train_split,
        tuning_data=dev_split,
        hyperparameters=hyperparameters,
        presets='medium_quality',
        time_limit=TL,
        num_bag_folds=0,
        num_stack_levels=0,
        keep_only_best=True,
        ag_args_fit=ag_args_fit,
    )
    trained = True
except AssertionError as e:
    print("✅ Fit exited early (low time_limit):", e)

# ---- evaluate on truly unseen holdout ----
if trained and getattr(predictor, "_trainer", None):
    holdout_metrics = predictor.evaluate(holdout)
    print("Holdout metrics (log-scale RMSE):", holdout_metrics)

    try:
        from sklearn.metrics import mean_squared_error
        y_true = np.expm1(holdout[target].values)
        y_pred = np.expm1(predictor.predict(holdout))
        rmse_dollars = mean_squared_error(y_true, y_pred, squared=False)
        print(f"Holdout RMSE (original $): {rmse_dollars:,.2f}")
    except Exception as e:
        print("Skipping dollar-scale RMSE:", e)

    # ---- predict + submission ----
    pred_log = predictor.predict(test_df)
    pred = np.expm1(pred_log)
    sub = sub_df.copy()
    sub["Sold Price"] = pred
    out_path = os.path.join(MODEL_PATH, "submission.csv")
    sub.to_csv(out_path, index=False)
    print("✅ Saved submission:", out_path)
else:
    print("⏸️ Training didn’t complete — increase TL (≥300 s).")

Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.4.0
Python Version:     3.12.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Thu Oct  2 10:42:05 UTC 2025
CPU Count:          2
Memory Avail:       11.28 GB / 12.67 GB (89.1%)
Disk Space Avail:   186.24 GB / 225.83 GB (82.5%)
Presets specified: ['medium_quality']


Train: (16200, 40), Dev: (1800, 40), Holdout: (2000, 40)


Beginning AutoGluon training ... Time limit = 300s
AutoGluon will save models to "/content/data/AutoGluonModels/california_house_20251014_235349"
Train Data Rows:    16200
Train Data Columns: 39
Tuning Data Rows:    1800
Tuning Data Columns: 39
Label Column:       Sold Price
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
	Label info (max, min, mean, stddev): (17.909855136853043, 11.51792295668052, 13.73669, 0.79778)
	If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])
Problem Type:       regression
Preprocessing data ...
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    11623.01 MB
	Train Data (Original)  Memory Usage: 37.07 MB (0.3% of available memory)
	

Holdout metrics (log-scale RMSE): {'root_mean_squared_error': np.float64(-0.1884125617346258), 'mean_squared_error': -0.03549929341940418, 'mean_absolute_error': -0.09152310644919366, 'r2': 0.9436067064340019, 'pearsonr': 0.9715009371785128, 'median_absolute_error': np.float64(-0.04761857970012873)}
Skipping dollar-scale RMSE: got an unexpected keyword argument 'squared'
✅ Saved submission: /content/data/AutoGluonModels/california_house_20251014_235349/submission.csv


### 🚀 Submit to Kaggle
This cell uses the Kaggle CLI to submit the generated `submission.csv` to the **california-house-prices** competition.  
It confirms the file exists, uploads it, and shows my recent submissions to verify a successful run.

In [3]:
# === Submit to Kaggle competition ===
import os

COMPETITION = "california-house-prices"
SUBMISSION_FILE = os.path.join(AUTOGLUON_SAVE_PATH, "submission.csv")
MESSAGE = "AutoGluon fast academic demo submission"

# Verify the submission file exists
!ls -lh "$SUBMISSION_FILE"

# Submit to Kaggle
!kaggle competitions submit -c "$COMPETITION" -f "$SUBMISSION_FILE" -m "$MESSAGE"

# Show recent submissions for confirmation
!kaggle competitions submissions -c "$COMPETITION" | head -n 20

ls: cannot access '/content/data/AutoGluonModels/submission.csv': No such file or directory
[Errno 2] No such file or directory: '/content/data/AutoGluonModels/submission.csv'
fileName        date                        description                              status                     publicScore  privateScore  
--------------  --------------------------  ---------------------------------------  -------------------------  -----------  ------------  
submission.csv  2025-10-14 18:58:22.137000  AutoGluon fast academic demo submission  SubmissionStatus.COMPLETE  0.14620      0.13130       
submission.csv  2025-10-14 18:52:46.223000  AutoGluon fast academic demo submission  SubmissionStatus.COMPLETE  0.14620      0.13130       
submission.csv  2025-10-14 18:45:58.970000  AutoGluon fast academic demo submission  SubmissionStatus.COMPLETE  0.14620      0.13130       
submission.csv  2025-10-14 18:38:27.617000  AutoGluon fast academic demo submission  SubmissionStatus.COMPLETE  0.14620     