# Tier 2 — Tabular ML Playground
## Starter Notebook: Baselines on California Housing (Regression)


This notebook is the **starter baseline** for the Tier 2 Tabular ML Playground.

**Goal:** Load a clean tabular dataset, do quick EDA, train a baseline model, and compare a few classical ML models using consistent metrics.

**Dataset:** California Housing (built into scikit-learn)

**Models (linear baseline first, then more flexible models):**
- Linear Regression (baseline)
- Random Forest Regressor
- Gradient Boosting Regressor
- Support Vector Regressor (with scaling)

**Metrics:** RMSE, MAE, R²

In [None]:
# If you get import errors, install requirements from the repo root:
# pip install -r requirements.txt

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

RANDOM_STATE = 42

## 1) Load dataset

In [None]:
# The first call downloads the dataset and caches it locally.
housing = fetch_california_housing(as_frame=True)
df = housing.frame.copy()

df.head()

### Quick checks

In [None]:
# (rows, columns) and the total number of missing values across all columns
df.shape, df.isna().sum().sum()

In [None]:
df.describe().T

### Target and features

In [None]:
target_col = "MedHouseVal"
X = df.drop(columns=[target_col])
y = df[target_col]

X.shape, y.shape

## 2) Train/test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE
)

X_train.shape, X_test.shape
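
As a quick sanity check on the split sizes: California Housing has 20,640 rows, so `test_size=0.2` should leave 16,512 rows for training and 4,128 for testing. The sketch below reproduces those shapes on a placeholder array (the `X_demo`/`y_demo` names and the placeholder values are illustrative assumptions, not the real dataset). It also confirms that reusing the same `random_state` gives the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as California Housing (20,640).
X_demo = np.arange(20640 * 2).reshape(20640, 2)
y_demo = np.arange(20640)

split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)

# Same seed, same rows in the test set.
same_split = np.array_equal(split_a[3], split_b[3])
print(same_split)
```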

## 3) Metrics helper

In [None]:
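def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R² for a set of predictions."""
    # np.sqrt(MSE) works across scikit-learn versions; the `squared=False`
    # argument was deprecated in 1.4 and later removed.
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    return {"rmse": rmse, "mae": mae, "r2": r2}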
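
To make the three metrics concrete, here is a small worked example that computes them by hand with NumPy. The four-value arrays and the `_demo` names are made up for illustration; the formulas are the standard definitions of RMSE, MAE, and R².

```python
import numpy as np

# Tiny made-up example: four true values and four predictions.
y_true_demo = np.array([3.0, 5.0, 2.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 4.0, 6.0])

errors = y_true_demo - y_pred_demo  # [1, 0, -2, 1]

# RMSE: square root of the mean squared error, sqrt(6 / 4).
rmse_demo = np.sqrt(np.mean(errors ** 2))

# MAE: mean of the absolute errors, 4 / 4.
mae_demo = np.mean(np.abs(errors))

# R²: 1 - (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(errors ** 2)                               # 6
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)   # 14.75
r2_demo = 1.0 - ss_res / ss_tot

print(f"{rmse_demo:.4f} {mae_demo:.4f} {r2_demo:.4f}")
# → 1.2247 1.0000 0.5932
```

RMSE penalizes the single error of 2 more heavily than MAE does, which is why it comes out larger here.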

## 4) Baseline model — Linear Regression

In [None]:
lin = LinearRegression()
lin.fit(X_train, y_train)

y_pred = lin.predict(X_test)
baseline = regression_metrics(y_test, y_pred)

baseline
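
To put the baseline numbers in context, it can help to compare against a predictor that ignores the features entirely. The sketch below uses scikit-learn's `DummyRegressor` on synthetic data (the data and the `_demo` names are assumptions for illustration). A mean predictor scores an R² of 0 on the data it was fit on, so any useful model should clear that bar on the test set.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic linear data with a little noise (illustration only).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Always predicts the training-set mean of y, regardless of X.
dummy = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
linear = LinearRegression().fit(X_tr, y_tr)

r2_dummy_train = r2_score(y_tr, dummy.predict(X_tr))
r2_dummy_test = r2_score(y_te, dummy.predict(X_te))
r2_linear_test = r2_score(y_te, linear.predict(X_te))

print(f"dummy R² (train): {r2_dummy_train:.3f}")
print(f"dummy R² (test):  {r2_dummy_test:.3f}")
print(f"linear R² (test): {r2_linear_test:.3f}")
```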

## 5) Compare a few classical models

In [None]:
models = {
    "LinearRegression": lin,
    # Reasonable defaults for a fast baseline comparison (not tuned)
    "RandomForestRegressor": RandomForestRegressor(
        n_estimators=300, random_state=RANDOM_STATE, n_jobs=-1
    ),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=RANDOM_STATE),
    "SVR (scaled)": Pipeline([
        ("scaler", StandardScaler()),
        ("svr", SVR(C=10.0, gamma="scale"))
    ]),
}

results = []
for name, model in models.items():
    # `lin` was already fit in section 4, so only the other models need fitting.
    if name != "LinearRegression":
        model.fit(X_train, y_train)
    y_hat = model.predict(X_test)
    m = regression_metrics(y_test, y_hat)
    results.append({"model": name, **m})

results_df = pd.DataFrame(results).sort_values(by="rmse")
results_df
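
The SVR entry is wrapped in a `Pipeline` so that the scaler is fit only on the data passed to `fit`, never on the test rows. A minimal sketch on synthetic data (the arrays and `_demo` names are illustrative assumptions) checks this: after fitting, the scaler's stored means match the training-set means rather than the means of the full dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Two features on very different scales, as is common in tabular data.
X_demo = np.column_stack([rng.normal(0, 1, 400), rng.normal(1000, 250, 400)])
y_demo = X_demo[:, 0] + 0.01 * X_demo[:, 1] + rng.normal(scale=0.1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

svr_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svr", SVR(C=10.0, gamma="scale")),
])
svr_pipe.fit(X_tr, y_tr)  # the scaler sees only the training rows

fitted_means = svr_pipe.named_steps["scaler"].mean_
matches_train = np.allclose(fitted_means, X_tr.mean(axis=0))
matches_full = np.allclose(fitted_means, X_demo.mean(axis=0))
print(matches_train, matches_full)

# predict() reuses the training-set statistics to scale the test rows.
y_hat_demo = svr_pipe.predict(X_te)
```

Scaling the full dataset before splitting would let test-set statistics leak into training; keeping the scaler inside the pipeline avoids that without extra bookkeeping.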

## 6) Visualize performance (lower RMSE is better)

In [None]:
plt.figure(figsize=(8, 4))
plt.bar(results_df["model"], results_df["rmse"])
plt.xticks(rotation=25, ha="right")
plt.ylabel("RMSE")
plt.title("Model Comparison (Test RMSE)")
plt.tight_layout()
plt.show()

## 7) Quick reflection

Fill this in after you run the notebook:

- Which model had the best RMSE?
- Did any model show signs of overfitting (e.g., test performance that looks suspiciously strong, or a large gap between training and test error)?
- If you had to pick one model for a first production pass, which would it be and why?
- What one change would you try next (feature engineering, cross-validation, tuning, etc.)?
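
Two of the questions above can be checked directly in code. The sketch below runs on synthetic data from `make_regression` (the data, the `_demo` names, the 50-tree forest, and the choice of 5 folds are all assumptions for illustration). It compares training and test RMSE to gauge overfitting, then uses `cross_val_score` for an estimate that depends less on one particular split.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

X_demo, y_demo = make_regression(
    n_samples=400, n_features=8, noise=20.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

forest = RandomForestRegressor(n_estimators=50, random_state=42)
forest.fit(X_tr, y_tr)


def rmse(y_true, y_pred):
    """Root mean squared error as a plain float."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# Overfitting check: compare error on seen vs unseen rows.
train_rmse = rmse(y_tr, forest.predict(X_tr))
test_rmse = rmse(y_te, forest.predict(X_te))

# scikit-learn scorers follow "higher is better", so RMSE comes back negated.
cv_scores = cross_val_score(
    RandomForestRegressor(n_estimators=50, random_state=42),
    X_tr, y_tr, cv=5, scoring="neg_root_mean_squared_error",
)
cv_rmse = -cv_scores

print(f"train RMSE: {train_rmse:.2f}  test RMSE: {test_rmse:.2f}")
print(f"5-fold CV RMSE: {cv_rmse.mean():.2f} +/- {cv_rmse.std():.2f}")
```

Some gap between training and test error is normal for a random forest with fully grown trees. What matters is how large the gap is and whether the cross-validated estimate roughly agrees with the held-out test score.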