# Week 2 - Linear Regression 2

In [73]:
!pip install statsmodels


In [74]:
import os, re, numpy as np, pandas as pd
from sklearn.preprocessing import PolynomialFeatures, OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, RidgeCV, LassoCV, ElasticNetCV
from sklearn.metrics import r2_score, mean_squared_error, accuracy_score, f1_score, roc_auc_score

First Dataset: Acute Kidney

In [75]:
DATA_PATH = "Acute Kidney.csv" 
df = pd.read_csv(DATA_PATH, low_memory=False)

# Normalize columns (spaces/symbols -> underscores; lowercase)
df.columns = (df.columns.astype(str)
                .str.strip()
                .str.replace(r"\s+", "_", regex=True)
                .str.replace(r"[^0-9a-zA-Z_]", "", regex=True)
                .str.lower())

In [76]:
# Pick a CONTINUOUS target
# Prefer continuous LOS; otherwise fall back to any numeric with enough unique values (not just 0/1)
preferred = ["cox_los", "los", "length_of_stay"]
target_col = next((c for c in preferred if c in df.columns), None)
if target_col is None:
    nums = df.select_dtypes(include=["int64","float64"]).columns.tolist()
    if not nums:
        raise ValueError("No numeric columns found to use as a regression target. Set target_col manually.")
    candidates = [c for c in nums if df[c].nunique(dropna=True) >= 10 and set(df[c].dropna().unique()) != {0,1}]
    target_col = candidates[0] if candidates else nums[0]

print(f"Target: {target_col}")
y = pd.to_numeric(df[target_col], errors="coerce")

Target: cox_los


In [77]:
# Feature typing & basic NA handling
num_cols = df.select_dtypes(include=["int64","float64"]).columns.tolist()
if target_col in num_cols:
    num_cols.remove(target_col)
cat_cols = df.select_dtypes(include=["object","category","bool"]).columns.tolist()

X = df[num_cols + cat_cols].copy()

# Simple imputations
for c in num_cols:
    X[c] = pd.to_numeric(X[c], errors="coerce").fillna(0.0)
for c in cat_cols:
    X[c] = X[c].astype("category").cat.add_categories(["__missing__"]).fillna("__missing__")

In [78]:
# Train/Test split
mask = ~y.isna()
X_tr, X_te, y_tr, y_te = train_test_split(X[mask], y[mask], test_size=0.30, random_state=42)

In [79]:
# Preprocess: scale numeric, OHE categoricals
pre = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    remainder="drop"
)

Xtr = pre.fit_transform(X_tr)
Xte = pre.transform(X_te)

# Densify if sparse (Lasso/ElasticNet are happier with dense on older versions)
if hasattr(Xtr, "toarray"):
    Xtr = Xtr.toarray()
    Xte = Xte.toarray()

# Try to build feature names (works on newer sklearn; safe-fallback otherwise)
def get_feature_names(preprocessor, num_cols, cat_cols):
    names = []
    # numeric
    names.extend(list(num_cols))
    # categoricals
    try:
        ohe = preprocessor.named_transformers_["cat"]
        try:
            cat_names = list(ohe.get_feature_names_out(cat_cols))
        except Exception:
            # older versions
            cat_names = list(ohe.get_feature_names(cat_cols))
        names.extend(cat_names)
    except Exception:
        pass
    return np.array(names, dtype=object)

try:
    feat_names = get_feature_names(pre, num_cols, cat_cols)
except Exception:
    feat_names = None

def rmse(a, b): 
    return float(np.sqrt(mean_squared_error(a, b)))

In [80]:
# Cross-validate alphas (and l1_ratio), then refit
MAX_CV_ROWS = 60_000
if Xtr.shape[0] > MAX_CV_ROWS:
    rng = np.random.default_rng(42)
    idx = rng.choice(Xtr.shape[0], size=MAX_CV_ROWS, replace=False)
    Xcv, ycv = Xtr[idx], y_tr.iloc[idx]
else:
    Xcv, ycv = Xtr, y_tr

alphas = np.logspace(-4, 3, 20)     # 1e-4 → 1e3
l1_ratios = [0.15, 0.3, 0.5, 0.7, 0.85]

ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(Xcv, ycv)
lasso_cv = LassoCV(alphas=alphas, cv=5, max_iter=20000, random_state=42).fit(Xcv, ycv)
enet_cv  = ElasticNetCV(alphas=alphas, l1_ratio=l1_ratios, cv=5, max_iter=30000, random_state=42).fit(Xcv, ycv)

# Refit on full training data with best params
ridge = Ridge(alpha=ridge_cv.alpha_).fit(Xtr, y_tr)
lasso = Lasso(alpha=lasso_cv.alpha_, max_iter=20000).fit(Xtr, y_tr)
enet  = ElasticNet(alpha=enet_cv.alpha_, l1_ratio=enet_cv.l1_ratio_, max_iter=30000).fit(Xtr, y_tr)

In [81]:
# Evaluate
def report_model(name, model):
    yhat_tr, yhat_te = model.predict(Xtr), model.predict(Xte)
    print(f"\n[{name}]")
    if name == "Ridge":
        print(f"alpha={ridge_cv.alpha_:.6g}")
    elif name == "Lasso":
        print(f"alpha={lasso_cv.alpha_:.6g}")
    elif name == "ElasticNet":
        print(f"alpha={enet_cv.alpha_:.6g}  |  l1_ratio={enet_cv.l1_ratio_:.2f}")
    print(f"Train R^2: {r2_score(y_tr, yhat_tr):.4f}")
    print(f"Test  R^2: {r2_score(y_te, yhat_te):.4f}")
    print(f"Test  RMSE: {rmse(y_te, yhat_te):.4f}")
    
    # Quick sparsity view (Lasso/EN usually shrink many to zero)
    try:
        nnz = int(np.count_nonzero(model.coef_))
        print(f"Non-zero coefficients: {nnz} / {model.coef_.size}")
    except Exception:
        pass

print("\n=== Week 2 — Acute Kidney: Regularized Linear Models ===")
report_model("Ridge", ridge)
report_model("Lasso", lasso)
report_model("ElasticNet", enet)


=== Week 2 — Acute Kidney: Regularized Linear Models ===

[Ridge]
alpha=6.15848
Train R^2: 0.9701
Test  R^2: 0.9708
Test  RMSE: 5.9156
Non-zero coefficients: 63 / 63

[Lasso]
alpha=0.0885867
Train R^2: 0.9698
Test  R^2: 0.9713
Test  RMSE: 5.8577
Non-zero coefficients: 24 / 63

[ElasticNet]
alpha=0.0885867  |  l1_ratio=0.85
Train R^2: 0.9697
Test  R^2: 0.9711
Test  RMSE: 5.8769
Non-zero coefficients: 29 / 63


In [82]:
# Show top coefficients by |weight|
def top_coefs(model, names, k=15):
    if names is None:
        return None
    coefs = np.asarray(model.coef_).ravel()
    order = np.argsort(np.abs(coefs))[::-1][:k]
    return list(zip(names[order], coefs[order]))

try:
    print("\nTop Lasso coefficients (by |weight|):")
    for nm, w in (top_coefs(lasso, feat_names, k=15) or []):
        print(f"{nm:>30s}: {w:+.5f}")
except Exception:
    pass


Top Lasso coefficients (by |weight|):
                   mort_28_day: -17.18284
                   mort_90_day: -16.99999
                       lactate: -0.28916
                             p: -0.26861
                     aki_stage: -0.21199
                        sapsii: -0.20637
                           ckd: +0.15719
                           rdw: +0.15252
                           chf: +0.14250
                          pco2: +0.13721
                            ne: +0.09777
                            bp: +0.08982
                        weight: -0.08665
                            hb: -0.06955
                             k: +0.05430


In [83]:
def rmse(a, b): 
    return float(np.sqrt(mean_squared_error(a, b)))

def densify_if_sparse(A):
    return A.toarray() if hasattr(A, "toarray") else A

In [84]:
#  Week 1 comparators (OLS with polynomials/interactions) 
# (A) Polynomials: degree=2 (squares + interactions) on numeric only, then concat with OHE(cats)
poly_all = PolynomialFeatures(degree=2, include_bias=False)
pre_poly_all = ColumnTransformer(
    transformers=[
        ("num_poly", poly_all, num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    remainder="drop"
)

# (B) Interactions-only
poly_int = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
pre_poly_int = ColumnTransformer(
    transformers=[
        ("num_poly", poly_int, num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    remainder="drop"
)

# Function to compute 5-fold CV (outer) metrics for a (preprocess -> LinearRegression) pipeline
def cv_linear(preprocessor, X, y, n_splits=5, seed=42, tag="OLS"):
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    r2s, rmses = [], []
    for tr_idx, va_idx in kf.split(X):
        Xtr, Xva = X.iloc[tr_idx], X.iloc[va_idx]
        ytr, yva = y.iloc[tr_idx], y.iloc[va_idx]
        Xtr_p = densify_if_sparse(preprocessor.fit_transform(Xtr))
        Xva_p = densify_if_sparse(preprocessor.transform(Xva))
        ols = LinearRegression()
        ols.fit(Xtr_p, ytr)
        yhat = ols.predict(Xva_p)
        r2s.append(r2_score(yva, yhat))
        rmses.append(rmse(yva, yhat))
    return {
        "model": tag,
        "cv_r2_mean": float(np.mean(r2s)), "cv_r2_std": float(np.std(r2s)),
        "cv_rmse_mean": float(np.mean(rmses)), "cv_rmse_std": float(np.std(rmses))
    }

cv_poly_all = cv_linear(pre_poly_all, X, y, tag="OLS Poly(d=2, squares+interactions)")
cv_poly_int = cv_linear(pre_poly_int, X, y, tag="OLS Interactions-only (d=2)")

In [85]:
# Week 2 models (Ridge, Lasso, Elastic Net)
# Preprocessor for regularized models: scale numeric, OHE categoricals
pre_reg = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    remainder="drop"
)

alphas = np.logspace(-4, 3, 20)
l1_ratios = [0.15, 0.3, 0.5, 0.7, 0.85]

def nested_cv_regularized(X, y, model_name, n_splits=5, seed=42):
    """
    Outer 5-fold CV for unbiased performance estimate.
    Inside each outer train fold:
      - Fit RidgeCV / LassoCV / ElasticNetCV on preprocessed data to pick hyperparams
      - Refit on the whole outer-train and score on outer-val
    """
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    r2s, rmses = [], []
    chosen = []  # store chosen params per fold

    for tr_idx, va_idx in kf.split(X):
        Xtr_raw, Xva_raw = X.iloc[tr_idx], X.iloc[va_idx]
        ytr, yva = y.iloc[tr_idx], y.iloc[va_idx]

        Xtr_p = pre_reg.fit_transform(Xtr_raw)
        Xva_p = pre_reg.transform(Xva_raw)

        # densify for Lasso/EN if needed
        Xtr_p = densify_if_sparse(Xtr_p)
        Xva_p = densify_if_sparse(Xva_p)

        if model_name == "Ridge":
            inner = RidgeCV(alphas=alphas, cv=5)
            inner.fit(Xtr_p, ytr)
            model = Ridge(alpha=inner.alpha_)
            chosen.append({"alpha": float(inner.alpha_)})
        elif model_name == "Lasso":
            inner = LassoCV(alphas=alphas, cv=5, max_iter=20000, random_state=seed)
            inner.fit(Xtr_p, ytr)
            model = Lasso(alpha=inner.alpha_, max_iter=20000)
            chosen.append({"alpha": float(inner.alpha_)})
        elif model_name == "ElasticNet":
            inner = ElasticNetCV(alphas=alphas, l1_ratio=l1_ratios, cv=5, max_iter=30000, random_state=seed)
            inner.fit(Xtr_p, ytr)
            model = ElasticNet(alpha=inner.alpha_, l1_ratio=float(inner.l1_ratio_), max_iter=30000)
            chosen.append({"alpha": float(inner.alpha_), "l1_ratio": float(inner.l1_ratio_)})
        else:
            raise ValueError("Unknown model_name")

        model.fit(Xtr_p, ytr)
        yhat = model.predict(Xva_p)
        r2s.append(r2_score(yva, yhat))
        rmses.append(rmse(yva, yhat))

    return {
        "model": model_name,
        "cv_r2_mean": float(np.mean(r2s)), "cv_r2_std": float(np.std(r2s)),
        "cv_rmse_mean": float(np.mean(rmses)), "cv_rmse_std": float(np.std(rmses)),
        "chosen_params_per_fold": chosen
    }

cv_ridge = nested_cv_regularized(X, y, "Ridge")
cv_lasso = nested_cv_regularized(X, y, "Lasso")
cv_enet  = nested_cv_regularized(X, y, "ElasticNet")

In [86]:
# Holdout test set: head-to-head comparison
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)

# Week 1 holdout: OLS poly all + interactions-only
Xtr_poly_all = densify_if_sparse(pre_poly_all.fit_transform(X_tr))
Xte_poly_all = densify_if_sparse(pre_poly_all.transform(X_te))
ols_poly_all = LinearRegression().fit(Xtr_poly_all, y_tr)
yhat_tr, yhat_te = ols_poly_all.predict(Xtr_poly_all), ols_poly_all.predict(Xte_poly_all)
hold_poly_all = {"model":"OLS Poly(d=2, squares+interactions)",
                 "test_r2": r2_score(y_te, yhat_te), "test_rmse": rmse(y_te, yhat_te)}

Xtr_poly_int = densify_if_sparse(pre_poly_int.fit_transform(X_tr))
Xte_poly_int = densify_if_sparse(pre_poly_int.transform(X_te))
ols_poly_int = LinearRegression().fit(Xtr_poly_int, y_tr)
yhat_te_int = ols_poly_int.predict(Xte_poly_int)
hold_poly_int = {"model":"OLS Interactions-only (d=2)",
                 "test_r2": r2_score(y_te, yhat_te_int), "test_rmse": rmse(y_te, yhat_te_int)}

# Week 2 holdout: choose best params on train, then refit on full train and score on test
# Ridge
Xtr_reg = densify_if_sparse(pre_reg.fit_transform(X_tr))
Xte_reg = densify_if_sparse(pre_reg.transform(X_te))

ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(Xtr_reg, y_tr)
ridge = Ridge(alpha=float(ridge_cv.alpha_)).fit(Xtr_reg, y_tr)
ridge_te = {"model":"Ridge", "params": {"alpha": float(ridge_cv.alpha_)},
            "test_r2": r2_score(y_te, ridge.predict(Xte_reg)), "test_rmse": rmse(y_te, ridge.predict(Xte_reg))}

lasso_cv = LassoCV(alphas=alphas, cv=5, max_iter=20000, random_state=42).fit(Xtr_reg, y_tr)
lasso = Lasso(alpha=float(lasso_cv.alpha_), max_iter=20000).fit(Xtr_reg, y_tr)
lasso_te = {"model":"Lasso", "params": {"alpha": float(lasso_cv.alpha_)},
            "test_r2": r2_score(y_te, lasso.predict(Xte_reg)), "test_rmse": rmse(y_te, lasso.predict(Xte_reg))}

enet_cv = ElasticNetCV(alphas=alphas, l1_ratio=l1_ratios, cv=5, max_iter=30000, random_state=42).fit(Xtr_reg, y_tr)
enet = ElasticNet(alpha=float(enet_cv.alpha_), l1_ratio=float(enet_cv.l1_ratio_), max_iter=30000).fit(Xtr_reg, y_tr)
enet_te = {"model":"ElasticNet", "params": {"alpha": float(enet_cv.alpha_), "l1_ratio": float(enet_cv.l1_ratio_)},
           "test_r2": r2_score(y_te, enet.predict(Xte_reg)), "test_rmse": rmse(y_te, enet.predict(Xte_reg))}

In [87]:
# Print summary tables
def fmt_pm(mean, std, nd=4): 
    return f"{mean:.{nd}f} ± {std:.{nd}f}"

cv_rows = [
    cv_poly_all,
    cv_poly_int,
    cv_ridge,
    cv_lasso,
    cv_enet
]
cv_table = pd.DataFrame([{
    "Model": r["model"],
    "CV R^2 (mean ± std)": fmt_pm(r["cv_r2_mean"], r["cv_r2_std"]),
    "CV RMSE (mean ± std)": fmt_pm(r["cv_rmse_mean"], r["cv_rmse_std"])
} for r in cv_rows])

hold_rows = [
    hold_poly_all,
    hold_poly_int,
    ridge_te,
    lasso_te,
    enet_te
]
hold_table = pd.DataFrame([{
    "Model": r["model"],
    "Test R^2": r["test_r2"],
    "Test RMSE": r["test_rmse"],
    "Params": r.get("params", {})
} for r in hold_rows])

print("\n=== 5-fold CV (outer) — mean ± std ===")
print(cv_table.to_string(index=False))

print("\n=== Common Holdout (30%) — head-to-head ===")
print(hold_table.to_string(index=False))


=== 5-fold CV (outer) — mean ± std ===
                              Model CV R^2 (mean ± std) CV RMSE (mean ± std)
OLS Poly(d=2, squares+interactions)     0.8852 ± 0.0422     11.1748 ± 1.9909
        OLS Interactions-only (d=2)     0.8984 ± 0.0298     10.5893 ± 1.4726
                              Ridge     0.9695 ± 0.0036      5.8632 ± 0.3596
                              Lasso     0.9699 ± 0.0037      5.8268 ± 0.3638
                         ElasticNet     0.9697 ± 0.0038      5.8451 ± 0.3732

=== Common Holdout (30%) — head-to-head ===
                              Model  Test R^2  Test RMSE                                           Params
OLS Poly(d=2, squares+interactions)  0.903190  10.764237                                               {}
        OLS Interactions-only (d=2)  0.908835  10.445721                                               {}
                              Ridge  0.970762   5.915594                    {'alpha': 6.1584821106602545}
                             

Week 2 — Regularized Linear Regression (Acute Kidney)

with 5-fold Cross-Validation and Week 1 OLS Comparison

Objective

Evaluate Ridge, Lasso, and Elastic Net on the Acute Kidney dataset with proper preprocessing, and report 5-fold CV mean ± std for R² and RMSE. Compare these models against Week 1 OLS baselines with degree-2 polynomials (squares+interactions) and interactions-only.

Data & Target

Dataset: Acute Kidney.csv

Rows × Cols: <fill from notebook>

Target (continuous): cox_los (the preferred column; the notebook prints "Target: cox_los", so no fallback was needed).

Predictors: mixture of continuous (vitals, labs, severity scores) and categorical (demographics, comorbidities, clinical flags).

Preprocessing

Column cleanup: lower-cased, spaces/symbols → underscores.

Missing values: numeric → 0.0 (quick pass); categoricals → "__missing__".

Encoding & Scaling:

Week 1 OLS: Polynomial expansion (numeric only), OHE for categoricals.

Week 2 models: StandardScaler on numeric, OHE on categorical features (fit on train, transform test).
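
The fit-on-train, transform-on-test rule above can also be enforced structurally by wrapping the preprocessor and the estimator in a single scikit-learn `Pipeline`. The sketch below is a minimal illustration on a small synthetic frame; the column names (`lab_a`, `lab_b`, `ward`), the data, and the fixed `alpha=1.0` are hypothetical and are not taken from Acute Kidney.csv.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the real predictors.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "lab_a": rng.normal(size=n),
    "lab_b": rng.normal(loc=5.0, scale=2.0, size=n),
    "ward": rng.choice(["icu", "medical", "surgical"], size=n),
})
toy_y = 3.0 * toy["lab_a"] - 1.5 * toy["lab_b"] + rng.normal(scale=0.5, size=n)

toy_pre = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["lab_a", "lab_b"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["ward"]),
    ],
    remainder="drop",
)

# The pipeline refits the scaler and encoder on whatever data .fit() receives,
# so test rows never influence the scaling statistics or the category list.
toy_model = Pipeline([("pre", toy_pre), ("reg", Ridge(alpha=1.0))])

toy_X_tr, toy_X_te, toy_y_tr, toy_y_te = train_test_split(
    toy, toy_y, test_size=0.30, random_state=42
)
toy_model.fit(toy_X_tr, toy_y_tr)
toy_r2 = toy_model.score(toy_X_te, toy_y_te)
print(f"Toy holdout R^2: {toy_r2:.3f}")
```

Because the whole pipeline is one estimator, it can be passed directly to cross-validation utilities, which then repeat the preprocessing fit inside every fold.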

Models
Week 1 Baselines

OLS Poly(d=2, squares + interactions) (numeric only) + OHE categoricals

OLS Interactions-only (d=2) (numeric only) + OHE categoricals

Week 2 Regularized

Ridge (L2) — stabilizes coefficients under multicollinearity (shrinks, doesn’t zero).

Lasso (L1) — induces sparsity / feature selection.

Elastic Net (L1 + L2) — balances grouping effect and sparsity when predictors are correlated.
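
For reference, the three penalties can be written as the objectives that scikit-learn documents for `Ridge`, `Lasso`, and `ElasticNet`. In this notation (introduced here, not in the notebook) $n$ is the number of training rows, $w$ the coefficient vector, $\alpha$ the `alpha` parameter, and $\rho$ the `l1_ratio`; the intercept is omitted for brevity.

```latex
\begin{aligned}
\text{Ridge:}\quad & \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:}\quad & \min_{w}\; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{Elastic Net:}\quad & \min_{w}\; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
\end{aligned}
```

The Ridge loss is not divided by $2n$, so its `alpha` lives on a different scale from the Lasso and Elastic Net `alpha`. The selected values printed above (6.16 for Ridge, 0.0886 for Lasso and Elastic Net) are therefore not directly comparable.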

Hyperparameter selection (nested CV):
Inside each outer fold, we select:

Ridge: alpha chosen from 20 log-spaced values between 1e-4 and 1e3 (np.logspace(-4, 3, 20))

Lasso: same alpha grid, max_iter=20000

Elastic Net: same alpha grid, l1_ratio ∈ {0.15, 0.3, 0.5, 0.7, 0.85}, max_iter=30000
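
The nested scheme described above (an inner search for hyperparameters, an outer loop for scoring) can be sketched compactly with `GridSearchCV` wrapped in `cross_val_score`. This is an alternative formulation, not the notebook's hand-written loop; the data come from `make_regression` and every `demo_*` name is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; not the Acute Kidney table).
demo_X, demo_y = make_regression(
    n_samples=300, n_features=15, n_informative=5, noise=10.0, random_state=0
)

demo_pipe = Pipeline([("scale", StandardScaler()), ("reg", Lasso(max_iter=20000))])

# Inner loop: choose alpha by 5-fold CV on the outer-training rows only.
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
demo_search = GridSearchCV(
    demo_pipe,
    param_grid={"reg__alpha": np.logspace(-4, 3, 20)},
    cv=inner_cv,
    scoring="r2",
)

# Outer loop: score the tuned model on rows the inner search never saw.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
demo_scores = cross_val_score(demo_search, demo_X, demo_y, cv=outer_cv, scoring="r2")
print(f"Nested CV R^2: {demo_scores.mean():.4f} ± {demo_scores.std():.4f}")
```

One difference from the notebook's loop is that the scaler here is refit inside each inner fold as well, because it is part of the pipeline being searched.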

Discussion & Comparison to Week 1

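Accuracy: In 5-fold CV, Ridge, Lasso, and Elastic Net all reach R² ≈ 0.97 (RMSE ≈ 5.8 to 5.9), while the Week 1 OLS baselines reach R² = 0.885 (squares + interactions) and 0.898 (interactions-only) with RMSE ≈ 10.6 to 11.2 and roughly ten times the fold-to-fold spread in R². On this dataset the unregularized degree-2 expansion generalizes worse than the regularized linear models.

Stability: Ridge/EN shrink coefficients, which stabilizes estimates when predictors are correlated. Note that the regularized models here are fit on scaled numeric and one-hot features, not on the polynomial terms, so the comparison is regularized linear vs. unregularized degree-2 OLS.

Sparsity: On the 30% holdout, Lasso kept 24 of 63 coefficients (39 zeroed) and Elastic Net kept 29 of 63 (34 zeroed), while Ridge kept all 63. The sparser models did not lose accuracy: Test R² was 0.9713 (Lasso) and 0.9711 (Elastic Net) vs. 0.9708 (Ridge).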

When to prefer which:

Ridge when predictors are strongly correlated and interpretability via sparsity is less critical.

Lasso/EN when you value feature selection or your signal is sparse.

OLS Poly only if you can validate that the added complexity translates to robust out-of-sample gains.

Second Dataset: Colorectal cancer

In [88]:
# Load & tidy
DATA_PATH = "colorectal_cancer_dataset.csv" 
df = pd.read_csv(DATA_PATH, low_memory=False)
df.columns = (df.columns.astype(str)
                .str.strip()
                .str.replace(r"\s+", "_", regex=True)
                .str.replace(r"[^0-9a-zA-Z_]", "", regex=True)
                .str.lower())

In [89]:
# Choose a CONTINUOUS target
# Prefer typical continuous CRC outcomes; else pick any numeric with adequate variability (not just {0,1})
preferred = ["survival_months", "time_to_event", "tumor_size", "tumor_volume", "age", "bmi", "los"]
target_col = next((c for c in preferred if c in df.columns), None)
if target_col is None:
    nums = df.select_dtypes(include=["int64","float64"]).columns.tolist()
    if not nums:
        raise ValueError("No numeric columns found for regression target. Set target_col explicitly.")
    candidates = [c for c in nums if df[c].nunique(dropna=True) >= 10 and set(pd.unique(df[c].dropna())) != {0,1}]
    target_col = candidates[0] if candidates else nums[0]
print("Target:", target_col)

y = pd.to_numeric(df[target_col], errors="coerce")
mask = ~y.isna()
df = df.loc[mask].reset_index(drop=True)
y = y.loc[mask].reset_index(drop=True)

# Split features
num_cols = df.select_dtypes(include=["int64","float64"]).columns.tolist()
if target_col in num_cols:
    num_cols.remove(target_col)
cat_cols = df.select_dtypes(include=["object","category","bool"]).columns.tolist()

# Basic NA handling for a quick pass
X = df[num_cols + cat_cols].copy()
for c in num_cols:
    X[c] = pd.to_numeric(X[c], errors="coerce").fillna(0.0)
for c in cat_cols:
    X[c] = X[c].astype("category").cat.add_categories(["__missing__"]).fillna("__missing__")

Target: age


In [90]:
#  Week 1 comparators (OLS with polynomials/interactions) ----------------
# (A) Polynomials: degree=2 (squares + interactions) on numeric only, then concat with OHE(cats)
poly_all = PolynomialFeatures(degree=2, include_bias=False)
pre_poly_all = ColumnTransformer(
    transformers=[
        ("num_poly", poly_all, num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    remainder="drop"
)

# (B) Interactions-only
poly_int = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
pre_poly_int = ColumnTransformer(
    transformers=[
        ("num_poly", poly_int, num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    remainder="drop"
)

# 5-fold CV (outer) for OLS + polynomial preprocessors
from sklearn.linear_model import LinearRegression  # used by cv_linear below
def cv_linear(preprocessor, X, y, n_splits=5, seed=42, tag="OLS"):
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    r2s, rmses = [], []
    for tr_idx, va_idx in kf.split(X):
        Xtr, Xva = X.iloc[tr_idx], X.iloc[va_idx]
        ytr, yva = y.iloc[tr_idx], y.iloc[va_idx]
        Xtr_p = densify_if_sparse(preprocessor.fit_transform(Xtr))
        Xva_p = densify_if_sparse(preprocessor.transform(Xva))
        ols = LinearRegression()
        ols.fit(Xtr_p, ytr)
        yhat = ols.predict(Xva_p)
        r2s.append(r2_score(yva, yhat))
        rmses.append(rmse(yva, yhat))
    return {
        "model": tag,
        "cv_r2_mean": float(np.mean(r2s)), "cv_r2_std": float(np.std(r2s)),
        "cv_rmse_mean": float(np.mean(rmses)), "cv_rmse_std": float(np.std(rmses))
    }

cv_poly_all = cv_linear(pre_poly_all, X, y, tag="OLS Poly(d=2, squares+interactions)")
cv_poly_int = cv_linear(pre_poly_int, X, y, tag="OLS Interactions-only (d=2)")

In [91]:
#  Week 2 models (Ridge, Lasso, Elastic Net) with nested CV
# Preprocessor for regularized models: scale numeric, OHE categoricals
pre_reg = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    remainder="drop"
)

alphas = np.logspace(-4, 3, 20)
l1_ratios = [0.15, 0.3, 0.5, 0.7, 0.85]

def nested_cv_regularized(X, y, model_name, n_splits=5, seed=42):
    """
    Outer 5-fold CV for unbiased performance estimate.
    Inside each outer train fold:
      - Fit RidgeCV / LassoCV / ElasticNetCV on preprocessed data to pick hyperparams
      - Refit on the whole outer-train and score on outer-val
    """
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    r2s, rmses, chosen = [], [], []

    for tr_idx, va_idx in kf.split(X):
        Xtr_raw, Xva_raw = X.iloc[tr_idx], X.iloc[va_idx]
        ytr, yva = y.iloc[tr_idx], y.iloc[va_idx]

        Xtr_p = pre_reg.fit_transform(Xtr_raw)
        Xva_p = pre_reg.transform(Xva_raw)
        Xtr_p = densify_if_sparse(Xtr_p)
        Xva_p = densify_if_sparse(Xva_p)

        if model_name == "Ridge":
            inner = RidgeCV(alphas=alphas, cv=5)
            inner.fit(Xtr_p, ytr)
            model = Ridge(alpha=float(inner.alpha_))
            chosen.append({"alpha": float(inner.alpha_)})
        elif model_name == "Lasso":
            inner = LassoCV(alphas=alphas, cv=5, max_iter=20000, random_state=seed)
            inner.fit(Xtr_p, ytr)
            model = Lasso(alpha=float(inner.alpha_), max_iter=20000)
            chosen.append({"alpha": float(inner.alpha_)})
        elif model_name == "ElasticNet":
            inner = ElasticNetCV(alphas=alphas, l1_ratio=l1_ratios, cv=5, max_iter=30000, random_state=seed)
            inner.fit(Xtr_p, ytr)
            model = ElasticNet(alpha=float(inner.alpha_), l1_ratio=float(inner.l1_ratio_), max_iter=30000)
            chosen.append({"alpha": float(inner.alpha_), "l1_ratio": float(inner.l1_ratio_)})
        else:
            raise ValueError("Unknown model_name")

        model.fit(Xtr_p, ytr)
        yhat = model.predict(Xva_p)
        r2s.append(r2_score(yva, yhat))
        rmses.append(rmse(yva, yhat))

    return {
        "model": model_name,
        "cv_r2_mean": float(np.mean(r2s)), "cv_r2_std": float(np.std(r2s)),
        "cv_rmse_mean": float(np.mean(rmses)), "cv_rmse_std": float(np.std(rmses)),
        "chosen_params_per_fold": chosen
    }

cv_ridge = nested_cv_regularized(X, y, "Ridge")
cv_lasso = nested_cv_regularized(X, y, "Lasso")
cv_enet  = nested_cv_regularized(X, y, "ElasticNet")


In [92]:
# Holdout test set: head-to-head comparison 
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)

# Week 1 holdout: OLS poly all + interactions-only
Xtr_poly_all = densify_if_sparse(pre_poly_all.fit_transform(X_tr))
Xte_poly_all = densify_if_sparse(pre_poly_all.transform(X_te))
ols_poly_all = LinearRegression().fit(Xtr_poly_all, y_tr)
hold_poly_all = {
    "model": "OLS Poly(d=2, squares+interactions)",
    "test_r2": r2_score(y_te, ols_poly_all.predict(Xte_poly_all)),
    "test_rmse": rmse(y_te, ols_poly_all.predict(Xte_poly_all))
}

Xtr_poly_int = densify_if_sparse(pre_poly_int.fit_transform(X_tr))
Xte_poly_int = densify_if_sparse(pre_poly_int.transform(X_te))
ols_poly_int = LinearRegression().fit(Xtr_poly_int, y_tr)
hold_poly_int = {
    "model": "OLS Interactions-only (d=2)",
    "test_r2": r2_score(y_te, ols_poly_int.predict(Xte_poly_int)),
    "test_rmse": rmse(y_te, ols_poly_int.predict(Xte_poly_int))
}

# Week 2 holdout: choose best params on train, then refit on full train and score on test
Xtr_reg = densify_if_sparse(pre_reg.fit_transform(X_tr))
Xte_reg = densify_if_sparse(pre_reg.transform(X_te))

ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(Xtr_reg, y_tr)
ridge = Ridge(alpha=float(ridge_cv.alpha_)).fit(Xtr_reg, y_tr)
ridge_te = {"model": "Ridge", "params": {"alpha": float(ridge_cv.alpha_)},
            "test_r2": r2_score(y_te, ridge.predict(Xte_reg)), "test_rmse": rmse(y_te, ridge.predict(Xte_reg))}

lasso_cv = LassoCV(alphas=alphas, cv=5, max_iter=20000, random_state=42).fit(Xtr_reg, y_tr)
lasso = Lasso(alpha=float(lasso_cv.alpha_), max_iter=20000).fit(Xtr_reg, y_tr)
lasso_te = {"model": "Lasso", "params": {"alpha": float(lasso_cv.alpha_)},
            "test_r2": r2_score(y_te, lasso.predict(Xte_reg)), "test_rmse": rmse(y_te, lasso.predict(Xte_reg))}

enet_cv = ElasticNetCV(alphas=alphas, l1_ratio=l1_ratios, cv=5, max_iter=30000, random_state=42).fit(Xtr_reg, y_tr)
enet = ElasticNet(alpha=float(enet_cv.alpha_), l1_ratio=float(enet_cv.l1_ratio_), max_iter=30000).fit(Xtr_reg, y_tr)
enet_te = {"model": "ElasticNet", "params": {"alpha": float(enet_cv.alpha_), "l1_ratio": float(enet_cv.l1_ratio_)},
           "test_r2": r2_score(y_te, enet.predict(Xte_reg)), "test_rmse": rmse(y_te, enet.predict(Xte_reg))}


In [93]:
# Pretty print summary tables
def fmt_pm(mean, std, nd=4):
    return f"{mean:.{nd}f} ± {std:.{nd}f}"

cv_rows = [cv_poly_all, cv_poly_int, cv_ridge, cv_lasso, cv_enet]
cv_table = pd.DataFrame([{
    "Model": r["model"],
    "CV R^2 (mean ± std)": fmt_pm(r["cv_r2_mean"], r["cv_r2_std"]),
    "CV RMSE (mean ± std)": fmt_pm(r["cv_rmse_mean"], r["cv_rmse_std"])
} for r in cv_rows])

hold_rows = [hold_poly_all, hold_poly_int, ridge_te, lasso_te, enet_te]
hold_table = pd.DataFrame([{
    "Model": r["model"],
    "Test R^2": r["test_r2"],
    "Test RMSE": r["test_rmse"],
    "Params": r.get("params", {})
} for r in hold_rows])

print("\n=== 5-fold CV (outer) — mean ± std ===")
print(cv_table.to_string(index=False))

print("\n=== Common Holdout (30%) — head-to-head ===")
print(hold_table.to_string(index=False))


=== 5-fold CV (outer) — mean ± std ===
                              Model CV R^2 (mean ± std) CV RMSE (mean ± std)
OLS Poly(d=2, squares+interactions)    -0.0004 ± 0.0002     11.8748 ± 0.0380
        OLS Interactions-only (d=2)    -0.0005 ± 0.0002     11.8749 ± 0.0380
                              Ridge    -0.0003 ± 0.0002     11.8742 ± 0.0377
                              Lasso    -0.0001 ± 0.0001     11.8725 ± 0.0384
                         ElasticNet    -0.0001 ± 0.0001     11.8725 ± 0.0385

=== Common Holdout (30%) — head-to-head ===
                              Model  Test R^2  Test RMSE                              Params
OLS Poly(d=2, squares+interactions) -0.000480  11.892239                                  {}
        OLS Interactions-only (d=2) -0.000383  11.891660                                  {}
                              Ridge -0.000300  11.891165                   {'alpha': 1000.0}
                              Lasso -0.000137  11.890197                   {'alph

Week 2 — Regularized Linear Regression (Colorectal Cancer)

with 5-fold Cross-Validation and Week 1 OLS Comparison

Objective

Evaluate Ridge, Lasso, and Elastic Net on the colorectal cancer dataset, report 5-fold CV mean ± std for R² and RMSE, and compare against Week 1 OLS baselines using degree-2 polynomials (squares+interactions) and interactions-only.

Data & Target

Dataset: colorectal_cancer_dataset.csv

Rows × Cols: <fill from notebook>

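Target (continuous): age. None of survival_months, time_to_event, tumor_size, or tumor_volume matched a column name, so the preference list fell through to age (the notebook prints "Target: age").

Predictors: mix of continuous (e.g., BMI, tumor size/volume, biomarkers) and categorical (e.g., sex, stage, site, therapy). Age is excluded from the predictors because it is the target.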

Preprocessing

Column cleanup: lower-cased, spaces/symbols → underscores.

Missing values: numeric → 0.0 (quick pass); categoricals → "__missing__".

Encoding & Scaling

Week 1 OLS: Polynomial expansion on numeric only (degree=2); OHE for categoricals.

Week 2: StandardScaler on numeric; OHE for categoricals (fit on train, transform test).
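
The zero-fill used for missing numeric values above is a quick pass. Where 0 is not a plausible value for a measurement, one commonly used alternative is median imputation with a missing-value indicator, fitted on training rows only. The sketch below swaps in scikit-learn's `SimpleImputer`, which the notebook does not use; the arrays `marker_tr` and `marker_te` are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric column with missing values in train and test.
marker_tr = np.array([[1.0], [np.nan], [3.0], [5.0], [np.nan], [7.0]])
marker_te = np.array([[np.nan], [4.0]])

num_steps = Pipeline([
    # Median fill plus a 0/1 column flagging which rows were imputed.
    ("impute", SimpleImputer(strategy="median", add_indicator=True)),
    # Both the filled value column and the indicator column are standardized.
    ("scale", StandardScaler()),
])

# Fit on training rows only; the test rows reuse the training median.
out_tr = num_steps.fit_transform(marker_tr)
out_te = num_steps.transform(marker_te)
print(out_tr.shape, out_te.shape)
```

A pipeline like `num_steps` could take the place of the bare `StandardScaler()` in the `"num"` entry of the `ColumnTransformer`, provided the earlier `fillna(0.0)` step is skipped so the missing values are still present.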

Models
Week 1 Baselines

OLS Poly(d=2, squares + interactions) (numeric only) + OHE categoricals

OLS Interactions-only (d=2) (numeric only) + OHE categoricals

Week 2 Regularized

Ridge (L2): shrinks coefficients to handle multicollinearity.

Lasso (L1): induces sparsity / feature selection.

Elastic Net (L1+L2): balances grouping (correlated predictors) and sparsity.
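
The contrast between shrinking and zeroing can be seen on a small synthetic problem. The sketch below is illustrative only: the data come from `make_regression`, the penalty strengths are fixed by hand rather than selected by CV as in the notebook, and all `sp_*` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data with 5 informative features out of 20 (hypothetical setup).
sp_X, sp_y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)
sp_X = StandardScaler().fit_transform(sp_X)

# Fixed, illustrative penalty strengths; the notebook selects these by CV instead.
sp_models = {
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=1.0, max_iter=20000),
    "ElasticNet": ElasticNet(alpha=1.0, l1_ratio=0.5, max_iter=30000),
}

sp_nonzero = {}
for sp_name, sp_model in sp_models.items():
    sp_model.fit(sp_X, sp_y)
    sp_nonzero[sp_name] = int(np.count_nonzero(sp_model.coef_))
    print(f"{sp_name:>10s}: {sp_nonzero[sp_name]} / {sp_model.coef_.size} non-zero")
```

Ridge keeps every coefficient non-zero because the L2 penalty only shrinks, while the L1 term in Lasso and Elastic Net can set coefficients exactly to zero.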

Hyperparameter selection (nested CV in each outer fold):

Ridge: alpha ∈ {1e-4 … 1e3} (logspace)

Lasso: alpha ∈ {1e-4 … 1e3}, max_iter=20k

Elastic Net: alpha ∈ {1e-4 … 1e3}, l1_ratio ∈ {0.15, 0.3, 0.5, 0.7, 0.85}, max_iter=30k

Interpretation & Comparison to Week 1


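Predictive signal: With age as the target, every model scores a CV R² between -0.0005 and -0.0001 with RMSE ≈ 11.87, and the holdout rows shown agree (Test R² between -0.0005 and -0.0001, RMSE ≈ 11.89). An R² at or below zero means no model predicts better than the mean, so the predictors carry essentially no linear or degree-2 signal for age in this dataset.

Regularization vs. Feature Expansion: The quadratic expansion adds nothing here, and the differences between models are negligible in practical terms. Ridge selects alpha = 1000, the upper edge of the search grid, which is consistent with the penalty pushing all coefficients toward zero.

Sparsity: Coefficient counts were not printed for this dataset. With R² ≈ 0, any sparsity in Lasso/EN would reflect the absence of signal rather than useful feature selection.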
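
A mean-only baseline makes an R² near zero easier to read. The sketch below uses scikit-learn's `DummyRegressor` on hypothetical data in which the target is generated independently of the features; it is not run on colorectal_cancer_dataset.csv, and the `nz_*` names are illustrative.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical data in which the target is independent of every feature.
rng = np.random.default_rng(0)
noise_X = rng.normal(size=(2000, 10))
noise_y = rng.normal(loc=60.0, scale=12.0, size=2000)

nz_X_tr, nz_X_te, nz_y_tr, nz_y_te = train_test_split(
    noise_X, noise_y, test_size=0.30, random_state=42
)

# The baseline always predicts the training-set mean of the target.
baseline = DummyRegressor(strategy="mean").fit(nz_X_tr, nz_y_tr)
ridge_nz = Ridge(alpha=1000.0).fit(nz_X_tr, nz_y_tr)

for nz_name, nz_model in [("Mean baseline", baseline), ("Ridge", ridge_nz)]:
    nz_pred = nz_model.predict(nz_X_te)
    nz_r2 = r2_score(nz_y_te, nz_pred)
    nz_rmse = float(np.sqrt(mean_squared_error(nz_y_te, nz_pred)))
    print(f"{nz_name:>13s}: R^2={nz_r2:+.4f}  RMSE={nz_rmse:.4f}")
```

When a fitted model's holdout R² and RMSE are indistinguishable from this baseline, as in the tables above, the RMSE is essentially the standard deviation of the target.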

Third Dataset: Diabetes

In [94]:
DATA_PATH = "diabetes_012_health_indicators_BRFSS2015.csv"
df = pd.read_csv(DATA_PATH, low_memory=True)

# normalize columns
df.columns = (df.columns.astype(str)
                .str.strip().str.replace(r"\s+", "_", regex=True)
                .str.replace(r"[^0-9a-zA-Z_]", "", regex=True)
                .str.lower())

In [95]:
# Choose a CONTINUOUS target
# Prefer BMI; fallback to other numeric with adequate variability (not just {0,1})
preferred = ["bmi", "menthlth", "physhlth", "genhlth", "age"]
target_col = next((c for c in preferred if c in df.columns), None)
if target_col is None:
    nums_all = df.select_dtypes(include=["int64","float64"]).columns.tolist()
    if not nums_all:
        raise ValueError("No numeric columns found for regression target. Set target_col explicitly.")
    candidates = [c for c in nums_all if df[c].nunique(dropna=True) >= 10 and set(pd.unique(df[c].dropna())) != {0,1}]
    target_col = candidates[0] if candidates else nums_all[0]
print("Target:", target_col)

y = pd.to_numeric(df[target_col], errors="coerce")
mask = ~y.isna()
df = df.loc[mask].reset_index(drop=True)
y = y.loc[mask].reset_index(drop=True)

Target: bmi


In [96]:
# Feature sets 
num_cols_all = df.select_dtypes(include=["int64","float64"]).columns.tolist()
if target_col in num_cols_all:
    num_cols_all.remove(target_col)
cat_cols = df.select_dtypes(include=["object","category","bool"]).columns.tolist() 

# Quick NA handling (fast pass)
X = df[num_cols_all + cat_cols].copy()
for c in num_cols_all:
    X[c] = pd.to_numeric(X[c], errors="coerce").fillna(0.0).astype("float32")
for c in cat_cols:
    X[c] = X[c].astype("category").cat.add_categories(["__missing__"]).fillna("__missing__")

In [97]:
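# Pick top-K numeric for Week 1 polynomial comparators (controls explosion)
# Note: the ranking uses all rows, including those later held out in CV and in the 30% test split,
# so the Week 1 scores may carry a small optimistic selection bias.
K_NUMERIC = 12  # raise to include more numeric in poly OLS if you have RAM
corrs = X[num_cols_all].corrwith(y.astype("float32")).abs().sort_values(ascending=False)
num_cols_k = list(corrs.index[:min(K_NUMERIC, len(corrs))])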

In [98]:
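# Week 1 comparators (OLS with polynomials/interactions)
# LinearRegression is not in the In [74] import cell; importing it here is harmless if an earlier cell already did.
from sklearn.linear_model import LinearRegression

# (A) Polynomials: degree=2 (squares + interactions) on top-K numeric only, then concat with OHE(cats)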
poly_all = PolynomialFeatures(degree=2, include_bias=False)
pre_poly_all = ColumnTransformer(
    transformers=[
        ("num_poly", poly_all, num_cols_k),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    remainder="drop"
)

# (B) Interactions-only
poly_int = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
pre_poly_int = ColumnTransformer(
    transformers=[
        ("num_poly", poly_int, num_cols_k),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    remainder="drop"
)

# 5-fold CV (outer) for OLS + polynomial preprocessors
def cv_linear(preprocessor, X, y, n_splits=5, seed=42, tag="OLS"):
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    r2s, rmses = [], []
    for tr_idx, va_idx in kf.split(X):
        Xtr, Xva = X.iloc[tr_idx], X.iloc[va_idx]
        ytr, yva = y.iloc[tr_idx], y.iloc[va_idx]
        Xtr_p = densify_if_sparse(preprocessor.fit_transform(Xtr))
        Xva_p = densify_if_sparse(preprocessor.transform(Xva))
        # cast to float32 to cut memory for large N
        if hasattr(Xtr_p, "astype"):
            Xtr_p = Xtr_p.astype("float32", copy=False)
            Xva_p = Xva_p.astype("float32", copy=False)
        ols = LinearRegression()
        ols.fit(Xtr_p, ytr)
        yhat = ols.predict(Xva_p)
        r2s.append(r2_score(yva, yhat))
        rmses.append(rmse(yva, yhat))
    return {
        "model": tag,
        "cv_r2_mean": float(np.mean(r2s)), "cv_r2_std": float(np.std(r2s)),
        "cv_rmse_mean": float(np.mean(rmses)), "cv_rmse_std": float(np.std(rmses))
    }

cv_poly_all = cv_linear(pre_poly_all, X, y, tag=f"OLS Poly(d=2, squares+interactions) [top-{len(num_cols_k)} num]")
cv_poly_int = cv_linear(pre_poly_int, X, y, tag=f"OLS Interactions-only (d=2) [top-{len(num_cols_k)} num]")

In [99]:
# Week 2 models (Ridge, Lasso, Elastic Net)
# Preprocessor for regularized models: scale numeric, OHE categoricals
pre_reg = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols_all),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    remainder="drop"
)

alphas = np.logspace(-4, 3, 20)
l1_ratios = [0.15, 0.3, 0.5, 0.7, 0.85]
MAX_INNER_ROWS = 120_000  # cap rows during inner CV to keep speed/memory reasonable

def nested_cv_regularized(X, y, model_name, n_splits=5, seed=42):
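    """
    Outer K-fold CV (n_splits, default 5) for a less biased performance estimate.
    Inside each outer train fold:
      - Fit pre_reg on the outer-train rows only, then transform outer-train and outer-val
      - Fit RidgeCV / LassoCV / ElasticNetCV on the preprocessed data to pick hyperparams
        (inner folds share the scaler/encoder fitted on the whole outer-train fold)
      - Subsample to MAX_INNER_ROWS rows for the inner CV when the fold is larger
      - Refit on the whole outer-train and score on outer-val
    """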
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    r2s, rmses, chosen = [], [], []

    for tr_idx, va_idx in kf.split(X):
        Xtr_raw, Xva_raw = X.iloc[tr_idx], X.iloc[va_idx]
        ytr, yva = y.iloc[tr_idx], y.iloc[va_idx]

        Xtr_p = pre_reg.fit_transform(Xtr_raw)
        Xva_p = pre_reg.transform(Xva_raw)

        Xtr_p = densify_if_sparse(Xtr_p)
        Xva_p = densify_if_sparse(Xva_p)

        # Inner CV row cap
        if Xtr_p.shape[0] > MAX_INNER_ROWS:
            rng = np.random.default_rng(seed)
            idx = rng.choice(Xtr_p.shape[0], size=MAX_INNER_ROWS, replace=False)
            Xcv, ycv = Xtr_p[idx], ytr.iloc[idx]
        else:
            Xcv, ycv = Xtr_p, ytr

        if model_name == "Ridge":
            inner = RidgeCV(alphas=alphas, cv=5).fit(Xcv, ycv)
            model = Ridge(alpha=float(inner.alpha_))
            chosen.append({"alpha": float(inner.alpha_)})
        elif model_name == "Lasso":
            inner = LassoCV(alphas=alphas, cv=5, max_iter=20000, random_state=seed).fit(Xcv, ycv)
            model = Lasso(alpha=float(inner.alpha_), max_iter=20000)
            chosen.append({"alpha": float(inner.alpha_)})
        elif model_name == "ElasticNet":
            inner = ElasticNetCV(alphas=alphas, l1_ratio=l1_ratios, cv=5, max_iter=30000, random_state=seed).fit(Xcv, ycv)
            model = ElasticNet(alpha=float(inner.alpha_), l1_ratio=float(inner.l1_ratio_), max_iter=30000)
            chosen.append({"alpha": float(inner.alpha_), "l1_ratio": float(inner.l1_ratio_)})
        else:
            raise ValueError("Unknown model_name")

        model.fit(Xtr_p, ytr)
        yhat = model.predict(Xva_p)
        r2s.append(r2_score(yva, yhat))
        rmses.append(rmse(yva, yhat))

    return {
        "model": model_name,
        "cv_r2_mean": float(np.mean(r2s)), "cv_r2_std": float(np.std(r2s)),
        "cv_rmse_mean": float(np.mean(rmses)), "cv_rmse_std": float(np.std(rmses)),
        "chosen_params_per_fold": chosen
    }

cv_ridge = nested_cv_regularized(X, y, "Ridge")
cv_lasso = nested_cv_regularized(X, y, "Lasso")
cv_enet  = nested_cv_regularized(X, y, "ElasticNet")

In [100]:
# Holdout test set: head-to-head comparison
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)

# Week 1 holdout: OLS poly all + interactions-only (top-K numeric)
Xtr_poly_all = densify_if_sparse(pre_poly_all.fit_transform(X_tr)).astype("float32", copy=False)
Xte_poly_all = densify_if_sparse(pre_poly_all.transform(X_te)).astype("float32", copy=False)
ols_poly_all = LinearRegression().fit(Xtr_poly_all, y_tr)
hold_poly_all = {
    "model": f"OLS Poly(d=2, squares+interactions) [top-{len(num_cols_k)} num]",
    "test_r2": r2_score(y_te, ols_poly_all.predict(Xte_poly_all)),
    "test_rmse": rmse(y_te, ols_poly_all.predict(Xte_poly_all))
}

Xtr_poly_int = densify_if_sparse(pre_poly_int.fit_transform(X_tr)).astype("float32", copy=False)
Xte_poly_int = densify_if_sparse(pre_poly_int.transform(X_te)).astype("float32", copy=False)
ols_poly_int = LinearRegression().fit(Xtr_poly_int, y_tr)
hold_poly_int = {
    "model": f"OLS Interactions-only (d=2) [top-{len(num_cols_k)} num]",
    "test_r2": r2_score(y_te, ols_poly_int.predict(Xte_poly_int)),
    "test_rmse": rmse(y_te, ols_poly_int.predict(Xte_poly_int))
}

# Week 2 holdout: choose best params on train, then refit on full train and score on test
Xtr_reg = densify_if_sparse(pre_reg.fit_transform(X_tr))
Xte_reg = densify_if_sparse(pre_reg.transform(X_te))

ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(Xtr_reg, y_tr)
ridge = Ridge(alpha=float(ridge_cv.alpha_)).fit(Xtr_reg, y_tr)
ridge_te = {"model": "Ridge", "params": {"alpha": float(ridge_cv.alpha_)},
            "test_r2": r2_score(y_te, ridge.predict(Xte_reg)), "test_rmse": rmse(y_te, ridge.predict(Xte_reg))}

lasso_cv = LassoCV(alphas=alphas, cv=5, max_iter=20000, random_state=42).fit(Xtr_reg, y_tr)
lasso = Lasso(alpha=float(lasso_cv.alpha_), max_iter=20000).fit(Xtr_reg, y_tr)
lasso_te = {"model": "Lasso", "params": {"alpha": float(lasso_cv.alpha_)},
            "test_r2": r2_score(y_te, lasso.predict(Xte_reg)), "test_rmse": rmse(y_te, lasso.predict(Xte_reg))}

enet_cv = ElasticNetCV(alphas=alphas, l1_ratio=l1_ratios, cv=5, max_iter=30000, random_state=42).fit(Xtr_reg, y_tr)
enet = ElasticNet(alpha=float(enet_cv.alpha_), l1_ratio=float(enet_cv.l1_ratio_), max_iter=30000).fit(Xtr_reg, y_tr)
enet_te = {"model": "ElasticNet", "params": {"alpha": float(enet_cv.alpha_), "l1_ratio": float(enet_cv.l1_ratio_)},
           "test_r2": r2_score(y_te, enet.predict(Xte_reg)), "test_rmse": rmse(y_te, enet.predict(Xte_reg))}

In [101]:
# Print summary tables
def fmt_pm(mean, std, nd=4):
    return f"{mean:.{nd}f} ± {std:.{nd}f}"

cv_rows = [cv_poly_all, cv_poly_int, cv_ridge, cv_lasso, cv_enet]
cv_table = pd.DataFrame([{
    "Model": r["model"],
    "CV R^2 (mean ± std)": fmt_pm(r["cv_r2_mean"], r["cv_r2_std"]),
    "CV RMSE (mean ± std)": fmt_pm(r["cv_rmse_mean"], r["cv_rmse_std"])
} for r in cv_rows])

hold_rows = [hold_poly_all, hold_poly_int, ridge_te, lasso_te, enet_te]
hold_table = pd.DataFrame([{
    "Model": r["model"],
    "Test R^2": r["test_r2"],
    "Test RMSE": r["test_rmse"],
    "Params": r.get("params", {})
} for r in hold_rows])

print("\n=== 5-fold CV (outer) — mean ± std ===")
print(cv_table.to_string(index=False))

print("\n=== Common Holdout (30%) — head-to-head ===")
print(hold_table.to_string(index=False))


=== 5-fold CV (outer) — mean ± std ===
                                           Model CV R^2 (mean ± std) CV RMSE (mean ± std)
OLS Poly(d=2, squares+interactions) [top-12 num]     0.0609 ± 0.0026      6.4040 ± 0.0516
        OLS Interactions-only (d=2) [top-12 num]     0.1080 ± 0.0036      6.2415 ± 0.0508
                                           Ridge     0.1393 ± 0.0028      6.1308 ± 0.0486
                                           Lasso     0.1393 ± 0.0028      6.1308 ± 0.0487
                                      ElasticNet     0.1393 ± 0.0028      6.1308 ± 0.0487

=== Common Holdout (30%) — head-to-head ===
                                           Model  Test R^2  Test RMSE                                            Params
OLS Poly(d=2, squares+interactions) [top-12 num]  0.095954   6.256866                                                {}
        OLS Interactions-only (d=2) [top-12 num]  0.109788   6.208809                                                {}
               

Week 2 — Regularized Linear Regression (Diabetes BRFSS2015)

with 5-fold Cross-Validation and Week 1 OLS Comparison

Objective

Evaluate Ridge, Lasso, and Elastic Net on the BRFSS 2015 diabetes indicators dataset. Report 5-fold CV (mean ± std) for R² and RMSE, and compare against Week 1 OLS baselines using degree-2 polynomials (squares+interactions) and interactions-only.
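
As a quick reference for the two reported metrics, the sketch below computes RMSE and R² for one made-up validation fold and then aggregates per-fold scores as mean ± std. The notebook's own rmse helper is defined outside this section; rmse_sketch is a hypothetical stand-in, and all numbers here are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def rmse_sketch(y_true, y_pred):
    """Root mean squared error (hypothetical stand-in for the notebook's rmse helper)."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# One invented validation fold
y_true = np.array([20.0, 25.0, 30.0, 35.0])
y_pred = np.array([22.0, 24.0, 29.0, 37.0])
fold_rmse = rmse_sketch(y_true, y_pred)
fold_r2 = r2_score(y_true, y_pred)

# Aggregating invented per-fold scores; np.std defaults to the population std (ddof=0)
fold_scores = [0.10, 0.12, 0.11, 0.13, 0.09]
summary = f"{np.mean(fold_scores):.4f} ± {np.std(fold_scores):.4f}"
print(fold_rmse, fold_r2, summary)
```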

Target (continuous): bmi

Predictors: all remaining numeric columns (health indicators, many of which may be binary 0/1 flags stored as numbers); any object/category/bool columns are one-hot encoded if present.

Preprocessing

Column cleanup: lower-cased; whitespace → underscores; other non-alphanumeric symbols removed.

Missing values: numeric → 0.0; categoricals → "__missing__".

Encoding & Scaling

Week 1 OLS: Polynomial expansion (degree=2) on the top-K numeric columns only (K=12, ranked by absolute correlation with the target); OHE for categoricals.

Week 2: StandardScaler on numeric; OHE on categoricals (fit on train, transform test).

Large dataset hygiene: float32, optional cap on polynomial features (top-K by correlation) to avoid memory blow-up.
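
The top-K cap in this notebook is computed once on all rows. A stricter variant, sketched below on synthetic data, repeats the ranking inside each training fold so that validation rows never influence which columns are kept. The helper top_k_by_abs_corr and the demo frame are hypothetical; this is an alternative to what the cells above do, not a description of them.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def top_k_by_abs_corr(X_num, y, k):
    """Return the k numeric columns with the largest |Pearson r| against y (hypothetical helper)."""
    corrs = X_num.corrwith(y).abs().sort_values(ascending=False)
    return list(corrs.dropna().index[:k])

# Synthetic data: only x0 and x3 carry signal
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 6)), columns=[f"x{i}" for i in range(6)])
y_demo = pd.Series(2.0 * X_demo["x0"] - 1.5 * X_demo["x3"] + rng.normal(scale=0.5, size=300))

selected_per_fold = []
for tr_idx, _ in KFold(n_splits=5, shuffle=True, random_state=42).split(X_demo):
    # Rank on the training rows of this fold only
    cols = top_k_by_abs_corr(X_demo.iloc[tr_idx], y_demo.iloc[tr_idx], k=2)
    selected_per_fold.append(cols)

print(selected_per_fold)
```

With a target as strongly tied to a few columns as in this toy example, every fold picks the same columns; on real data the selected set can differ between folds, which is itself useful information about how stable the ranking is.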

Models
Week 1 Baselines

OLS Poly(d=2, squares + interactions) (numeric only) + OHE categoricals

OLS Interactions-only (d=2) (numeric only) + OHE categoricals
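
To see why the polynomial baselines are capped at a handful of numeric columns, the sketch below counts the columns that a degree-2 expansion produces for 12 inputs, both with and without squared terms. The closed-form helper poly2_width is hypothetical and only mirrors what PolynomialFeatures reports.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def poly2_width(n, interaction_only=False):
    """Output width of a degree-2 expansion without bias (hypothetical helper)."""
    pairs = n * (n - 1) // 2                 # pairwise interaction terms
    squares = 0 if interaction_only else n   # squared terms
    return n + squares + pairs

X_demo = np.zeros((1, 12))  # only the column count matters for the width
n_full = PolynomialFeatures(degree=2, include_bias=False).fit(X_demo).n_output_features_
n_int = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False).fit(X_demo).n_output_features_
print(n_full, n_int)  # 90 and 78 for 12 input columns
```

The width grows quadratically with the number of inputs, so doubling the inputs roughly quadruples the dense matrix that OLS has to hold in memory.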

Week 2 Regularized

Ridge (L2): shrinks coefficients to handle multicollinearity.

Lasso (L1): induces sparsity / feature selection.

Elastic Net (L1+L2): balances grouping (correlated predictors) and sparsity.
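
The difference between shrinkage and sparsity can be illustrated on synthetic data with one nearly duplicated column. This is a minimal sketch with arbitrary alpha values, not the tuned models from the notebook: Ridge pulls the coefficient vector toward zero without zeroing entries, while Lasso drops uninformative columns entirely.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 10))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)  # near-duplicate of column 0 (multicollinearity)
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # illustrative alpha, not tuned
lasso = Lasso(alpha=0.5).fit(X, y)    # illustrative alpha, not tuned

ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
n_zero_ridge = int(np.sum(ridge.coef_ == 0.0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))
print(ols_norm, ridge_norm, n_zero_ridge, n_zero_lasso)
```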

Hyperparameter selection (nested CV in each outer fold):

Ridge: alpha from 20 log-spaced values between 1e-4 and 1e3 (np.logspace(-4, 3, 20))

Lasso: same alpha grid, max_iter=20k

Elastic Net: same alpha grid, l1_ratio ∈ {0.15, 0.3, 0.5, 0.7, 0.85}, max_iter=30k

For speed on large N, inner-CV may subsample rows (e.g., ≤120k).
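
The notebook preprocesses each outer-train fold once and then runs RidgeCV / LassoCV / ElasticNetCV on the result. An alternative way to express nested CV, sketched below on synthetic data, wraps the scaler and the model in a Pipeline so that scaling is re-fit inside every inner fold as well. This is a swapped-in variant using GridSearchCV and cross_val_score, shown only for Lasso; the data and names are made up.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 8))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=300)

alphas = np.logspace(-4, 3, 20)  # same grid shape as in the notebook

pipe = Pipeline([("scale", StandardScaler()), ("model", Lasso(max_iter=20000))])
inner = GridSearchCV(
    pipe,
    {"model__alpha": alphas},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),  # inner folds pick alpha
    scoring="r2",
)
outer = KFold(n_splits=5, shuffle=True, random_state=42)  # outer folds estimate performance
outer_r2 = cross_val_score(inner, X_demo, y_demo, cv=outer, scoring="r2")
print(f"{outer_r2.mean():.4f} ± {outer_r2.std():.4f}")
```

The trade-off is runtime: the pipeline version refits the scaler for every inner split, which is negligible here but adds up on a dataset large enough to need the row cap.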



Interpretation & Comparison to Week 1


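Regularization vs. Feature Expansion: In 5-fold CV, Ridge, Lasso, and Elastic Net all reach R² ≈ 0.139 (RMSE ≈ 6.13), ahead of interactions-only OLS (R² ≈ 0.108, RMSE ≈ 6.24) and full degree-2 OLS (R² ≈ 0.061, RMSE ≈ 6.40). The comparison is not like-for-like: the regularized models use all scaled numeric columns with no quadratic terms, while the OLS baselines expand only the top-12 numeric columns, so the gap reflects the feature set as well as the penalty. All models explain under 15% of the variance in bmi.

Sparsity: Lasso/EN can set coefficients exactly to zero, which eases interpretation, possibly at a small cost in Test R². The three regularized models match to four decimals in CV R², which suggests the selected penalties are weak enough that they behave almost identically here; the summary tables do not report how many coefficients were zeroed.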