<a href="https://colab.research.google.com/github/juliabui/csc408-411/blob/main/CSC411Module3OverfittingCorrection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Group Activity: How to correct for overfitting

Make a copy of this notebook.

In this activity you will get into groups of 4-5 people (try to mix it up with the CS and PINC students for different perspectives).

Follow the instructions to correct the code in the second code cell to help with overfitting.

Write down 5 modifications you made or things you learned and share them with the class.

# How to use these scripts

* Run the first script to see what a model without much overfitting looks like in the output.

* Run the second as-is to see intentionally overfitted models (degree-7 polynomial regression and 1-NN) next to the mean baseline.

* Inspect CV RMSE vs Test RMSE/R² and decide which model generalizes best.

* Use the drop-in code below the two scripts to tune the various parts and reduce overfitting. Make as many scripts with these modifications as you want, but save each one so you can compare it to previous modifications. Compare results.

In [None]:
# =========================
# Overfitting fixes demo (version-agnostic RMSE + scorer)
# =========================
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score, make_scorer, get_scorer
from sklearn.dummy import DummyRegressor

# Models
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Try to use HistGradientBoosting (with early stopping); otherwise fall back to GradientBoosting
try:
    from sklearn.ensemble import HistGradientBoostingRegressor
    HAVE_HGB = True
except Exception:
    from sklearn.ensemble import GradientBoostingRegressor
    HAVE_HGB = False

RNG = 42

# ----- Version-agnostic RMSE function -----
def rmse(y_true, y_pred):
    """Return RMSE, compatible with old sklearn (no squared=False)."""
    try:
        return mean_squared_error(y_true, y_pred, squared=False)
    except TypeError:
        return np.sqrt(mean_squared_error(y_true, y_pred))

# ----- Version-agnostic CV scorer for RMSE -----
try:
    # If the named scorer exists, use it (sklearn >= 0.22)
    get_scorer("neg_root_mean_squared_error")
    RMSE_SCORER = "neg_root_mean_squared_error"
except Exception:
    # Otherwise make our own; greater_is_better=False => values will be NEGATED by sklearn
    RMSE_SCORER = make_scorer(
        lambda yt, yp: np.sqrt(mean_squared_error(yt, yp)),
        greater_is_better=False
    )

# =========================
# 0) Data
# =========================
n, d = 800, 6
X, y = make_regression(n_samples=n, n_features=d, noise=12.0, random_state=RNG)
# Inject a touch of nonlinearity so non-linear models have something to learn
y = y + 0.004 * (X[:, 0] ** 3)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=RNG
)

# =========================
# Helpers
# =========================
cv = KFold(n_splits=5, shuffle=True, random_state=RNG)

def eval_and_report(name, model):
    """
    Do CV on train (safer generalization estimate), then refit on full train and
    evaluate once on the held-out test set.
    """
    cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring=RMSE_SCORER)
    cv_rmse = -cv_scores.mean()   # scores are negative RMSE
    cv_std  =  cv_scores.std()
    model.fit(X_train, y_train)
    y_hat = model.predict(X_test)
    print(f"{name:>30s} | CV RMSE: {cv_rmse:.3f} ± {cv_std:.3f} | "
          f"Test RMSE: {rmse(y_test, y_hat):.3f} | Test R²: {r2_score(y_test, y_hat):.3f}")

print("\n=== Baselines & Regularized/Simpler Models ===")

# =========================
# 1) Mean baseline (yardstick)
# =========================
baseline = DummyRegressor(strategy="mean")
eval_and_report("Mean baseline", baseline)

# =========================
# 2) Linear + L2 (Ridge) in a leakage-safe Pipeline
# =========================
ridge = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Ridge())
])
ridge_grid = GridSearchCV(
    ridge,
    param_grid={"model__alpha": np.logspace(-3, 3, 9)},
    scoring=RMSE_SCORER, cv=cv, n_jobs=-1
)
eval_and_report("Ridge (alpha tuned)", ridge_grid)

# =========================
# 3) Polynomial capacity control + L2
# =========================
poly_ridge = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("scaler", StandardScaler()),
    ("model", Ridge())
])
poly_grid = GridSearchCV(
    poly_ridge,
    param_grid={
        "poly__degree": [1, 2, 3, 4],
        "model__alpha": np.logspace(-3, 3, 7),
    },
    scoring=RMSE_SCORER, cv=cv, n_jobs=-1
)
eval_and_report("Poly + Ridge (deg, alpha tuned)", poly_grid)

# =========================
# 4) k-NN — smoothness via k
# =========================
knn = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KNeighborsRegressor())
])
knn_grid = GridSearchCV(
    knn,
    param_grid={"model__n_neighbors": [3, 5, 8, 15, 25]},
    scoring=RMSE_SCORER, cv=cv, n_jobs=-1
)
eval_and_report("k-NN (k tuned)", knn_grid)




print("\nTip: pick the **simplest** model that consistently beats the mean baseline and linear Ridge on CV and test.")



=== Baselines & Regularized/Simpler Models ===
                 Mean baseline | CV RMSE: 123.876 ± 8.633 | Test RMSE: 123.963 | Test R²: -0.025
           Ridge (alpha tuned) | CV RMSE: 12.048 ± 0.545 | Test RMSE: 11.485 | Test R²: 0.991
Poly + Ridge (deg, alpha tuned) | CV RMSE: 12.048 ± 0.545 | Test RMSE: 11.483 | Test R²: 0.991
                k-NN (k tuned) | CV RMSE: 50.742 ± 3.329 | Test RMSE: 46.432 | Test R²: 0.856

Tip: pick the **simplest** model that consistently beats the mean baseline and linear Ridge on CV and test.


# How to interpret results:

**Baseline is terrible (yardstick).**

* Mean baseline Test RMSE ≈ 124, Test R² -0.025 → predicting the average for everyone is useless. Your target's spread is roughly this size, so anything meaningful must beat ~124 RMSE.
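
Why the baseline RMSE tracks the spread of the target can be checked directly: for a constant prediction equal to the mean, RMSE is the standard deviation of the target and R² is 0. A minimal sketch on made-up numbers (`y_demo` is hypothetical, not the notebook's data):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical target values, for illustration only
rng = np.random.default_rng(0)
y_demo = rng.normal(loc=10.0, scale=5.0, size=1000)

# Predict the mean of the target for every row
y_pred = np.full_like(y_demo, y_demo.mean())

rmse_mean = np.sqrt(mean_squared_error(y_demo, y_pred))
r2_mean = r2_score(y_demo, y_pred)

print(f"std of target:      {y_demo.std():.3f}")
print(f"RMSE of mean model: {rmse_mean:.3f}")  # equals the std above
print(f"R2 of mean model:   {r2_mean:.3f}")    # zero on the data the mean came from
```

On a held-out test set the training mean differs slightly from the test mean, which is why the notebook reports a slightly negative Test R² (-0.025) rather than exactly 0.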

**Linear Ridge wins (and generalizes).**
* Ridge (alpha tuned) CV RMSE 12.05 ± 0.55, Test RMSE 11.49, Test R² 0.991.

**Small CV±std → stable across folds.**

* Test ≈ CV (slightly better) → no overfitting; the test split is just a bit easier.

**“Poly + Ridge” added no value.**

* Nearly the same numbers as Ridge → the grid likely chose degree = 1 (i.e., it collapsed to linear). So curvature wasn't needed.
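
One way to confirm which degree the search picked is to read `best_params_` from the fitted `GridSearchCV`. The sketch below rebuilds the same data and pipeline as the first script so it runs on its own; in the notebook, `poly_grid.best_params_` is already available once `eval_and_report` has fit the grid. It assumes a scikit-learn version that provides the `"neg_root_mean_squared_error"` scorer. The selected values are not listed here because they depend on the run.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge

RNG = 42

# Same synthetic data and split as the first script
X, y = make_regression(n_samples=800, n_features=6, noise=12.0, random_state=RNG)
y = y + 0.004 * (X[:, 0] ** 3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=RNG
)

poly_ridge = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("scaler", StandardScaler()),
    ("model", Ridge()),
])
poly_grid = GridSearchCV(
    poly_ridge,
    param_grid={
        "poly__degree": [1, 2, 3, 4],
        "model__alpha": np.logspace(-3, 3, 7),
    },
    scoring="neg_root_mean_squared_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=RNG),
)
poly_grid.fit(X_train, y_train)

# The winning hyperparameters and their mean CV RMSE
print("Best params:", poly_grid.best_params_)
print(f"Best CV RMSE: {-poly_grid.best_score_:.3f}")
```

If the printed degree is 1, the polynomial pipeline is effectively the plain Ridge model.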

**k-NN underperforms here.**
* k-NN Test RMSE 46.4 (R² 0.856). It learns something, but far worse than Ridge—likely higher variance on this problem.

# An Overfitted Model (Intentionally)

In [None]:
# =========================
# OVERFITTING VERSION
# =========================
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score, get_scorer, make_scorer
from sklearn.dummy import DummyRegressor
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor, GradientBoostingRegressor

RNG = 42
HAVE_HGB = True  # set False if your sklearn lacks HistGradientBoostingRegressor

# ----- Version-agnostic RMSE function -----
def rmse(y_true, y_pred):
    """Return RMSE, compatible with old sklearn (no squared=False)."""
    try:
        return mean_squared_error(y_true, y_pred, squared=False)
    except TypeError:
        return np.sqrt(mean_squared_error(y_true, y_pred))

# ----- Version-agnostic CV scorer for RMSE -----
try:
    get_scorer("neg_root_mean_squared_error")
    RMSE_SCORER = "neg_root_mean_squared_error"
except Exception:
    RMSE_SCORER = make_scorer(
        lambda yt, yp: np.sqrt(mean_squared_error(yt, yp)),
        greater_is_better=False
    )

# =========================
# 0) Data (keep the same)
# =========================
n, d = 800, 6
X, y = make_regression(n_samples=n, n_features=d, noise=12.0, random_state=RNG)
y = y + 0.004 * (X[:, 0] ** 3)

# (Leave normal split; overfitting will still show up clearly)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=RNG
)

# =========================
# Helpers
# =========================
# Use a light CV (few folds) to make it easier for high-capacity models to slip through.
cv = KFold(n_splits=3, shuffle=False)  # intentionally weaker CV

def eval_and_report(name, model):
    """
    Do CV on train (for reference), then fit the full train and report Train vs Test.
    Overfitting will show as Train RMSE << Test RMSE.
    """
    cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring=RMSE_SCORER)
    cv_rmse = -cv_scores.mean()
    cv_std  =  cv_scores.std()

    model.fit(X_train, y_train)
    y_hat_test  = model.predict(X_test)
    y_hat_train = model.predict(X_train)

    print(f"{name:>35s} | CV RMSE: {cv_rmse:.3f} ± {cv_std:.3f} | "
          f"Train RMSE: {rmse(y_train, y_hat_train):.3f} | "
          f"Test RMSE: {rmse(y_test,  y_hat_test):.3f} | "
          f"Test R²: {r2_score(y_test, y_hat_test):.3f}")

print("\n=== Baseline ===")
baseline = DummyRegressor(strategy="mean")
eval_and_report("Mean baseline", baseline)

print("\n=== INTENTIONALLY OVERFITTED MODELS ===")

# ---------------------------------------------------------
# 1) High-degree Polynomial Regression with ~no regularization
#    (explodes feature space; LinearRegression memorizes patterns)
# ---------------------------------------------------------
poly_ols = Pipeline([
    ("poly",   PolynomialFeatures(degree=7, include_bias=False)),  # very high capacity
    ("scaler", StandardScaler()),
    ("model",  LinearRegression())                                  # no L2 penalty
])
eval_and_report("Poly degree=7 + OLS (no reg)", poly_ols)

# (Optional: a barely-regularized Ridge is also very high-capacity)
weak_ridge = Pipeline([
    ("poly",   PolynomialFeatures(degree=7, include_bias=False)),
    ("scaler", StandardScaler()),
    ("model",  Ridge(alpha=1e-9))   # virtually no shrinkage
])
eval_and_report("Poly degree=7 + Ridge(alpha≈0)", weak_ridge)

# ---------------------------------------------------------
# 2) k-NN with k=1 (each point predicts itself on train ⇒ ~0 train error)
# ---------------------------------------------------------
knn_1 = Pipeline([
    ("scaler", StandardScaler()),
    ("model",  KNeighborsRegressor(n_neighbors=1))
])
eval_and_report("k-NN (k=1)", knn_1)




print("\nNote: You should see much lower Train RMSE than Test RMSE for these models — a hallmark of overfitting.")



=== Baseline ===
                      Mean baseline | CV RMSE: 124.189 ± 1.186 | Train RMSE: 124.117 | Test RMSE: 123.963 | Test R²: -0.025

=== INTENTIONALLY OVERFITTED MODELS ===
       Poly degree=7 + OLS (no reg) | CV RMSE: 327.297 ± 91.116 | Train RMSE: 0.000 | Test RMSE: 350.770 | Test R²: -7.208
     Poly degree=7 + Ridge(alpha≈0) | CV RMSE: 327.297 ± 91.116 | Train RMSE: 0.000 | Test RMSE: 350.770 | Test R²: -7.208
                         k-NN (k=1) | CV RMSE: 60.826 ± 3.938 | Train RMSE: 0.000 | Test RMSE: 57.603 | Test R²: 0.779

Note: You should see much lower Train RMSE than Test RMSE for these models — a hallmark of overfitting.


# What went wrong in the overfitting code

## Yardstick

* Mean baseline — CV ≈ 124, Test ≈ 124, R² ≈ –0.03 (by definition this is the “do nothing but predict the mean” level).

## Intentionally overfit models

* Poly degree=7 + OLS (no reg) and Poly degree=7 + Ridge(α≈0)

  * Train RMSE = 0.000 (memorized the training set).

  * CV RMSE ≈ 327 ± 91 and Test RMSE ≈ 351, R² = –7.21 → catastrophic overfit (far worse than baseline).

  * The huge CV std (±91) shows the model is unstable across folds. Discard these settings.

* k-NN (k=1)

  * Train RMSE = 0.000 (each point predicts itself).

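  * CV ≈ 60.8 ± 3.9, Test ≈ 57.6, R² ≈ 0.78 → far better than the baseline but well behind Ridge (Test RMSE ≈ 11.5), and a high-variance model (classic 1-NN).

  * Safer fix: raise k (e.g., 15–40) and re-run CV; expect Train RMSE to rise above zero and the train/test gap to shrink. In the first script, tuning k over 3–25 gave Test RMSE ≈ 46.4, so a moderate k can also lower test error here.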
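
The zero training error of the degree-7 polynomial models follows from a simple count: the expanded feature matrix has more columns than there are training rows, so least squares can typically fit every training target exactly. A small sketch of that count, assuming 600 training rows (75% of 800) and 6 input features as in the scripts above:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_train, n_features = 600, 6          # 75% of 800 rows, 6 input features
X_dummy = np.zeros((1, n_features))   # only the shape matters for counting columns

counts = {}
for degree in [1, 2, 3, 7]:
    poly = PolynomialFeatures(degree=degree, include_bias=False).fit(X_dummy)
    counts[degree] = poly.n_output_features_
    relation = "more" if counts[degree] > n_train else "fewer"
    print(f"degree={degree}: {counts[degree]:5d} columns "
          f"({relation} than the {n_train} training rows)")
```

Degree 7 expands 6 features into 1715 columns, while degree 3 yields only 83, which is why capping the degree is such an effective fix.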


## Takeaways

* The polynomial models are clear, textbook overfitting (zero train error, terrible CV/Test, negative R²).

* 1-NN overfits the training set but still generalizes reasonably on this split; prefer larger k to reduce variance.


## What to change (quick fixes)

* Poly: lower degree (≤3) and add Ridge with larger α.

* k-NN: increase k (15–40).


* Use stronger CV (e.g., RepeatedKFold) and pick the simplest model whose CV mean ± std beats Ridge and the baseline.

# What to change to correct for overfitting

If CV/test error is much higher than train error, you need to **reduce model capacity** or **increase regularization**. Below are drop-in edits you can make to the script above. Experiment and pick the ones that apply to the model that is overfitting.

**Note**: make sure that you leave the original code in your notebook. Copy the cell instead of tuning it in place, so that you have multiple versions of your model to compare. Take as much room as you need below.

## 0) Stronger, more stable CV (replace current cv)

This stabilizes the CV estimate and pushes the search toward simpler settings.

In [None]:
# Stronger CV so tuning prefers simpler, more general models
from sklearn.model_selection import RepeatedKFold
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=RNG)
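
As a rough illustration of what "stronger" means here, the sketch below compares how many train/validation splits each scheme produces. The tiny array `X_demo` is hypothetical and only serves to drive the splitters.

```python
import numpy as np
from sklearn.model_selection import KFold, RepeatedKFold

X_demo = np.arange(20).reshape(10, 2)  # hypothetical 10-row dataset

weak_cv = KFold(n_splits=3, shuffle=False)                            # as in the overfitting script
strong_cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)   # as in the drop-in fix

n_weak = weak_cv.get_n_splits(X_demo)
n_strong = strong_cv.get_n_splits(X_demo)
print(f"KFold(3):             {n_weak} scores averaged")
print(f"RepeatedKFold(5 x 3): {n_strong} scores averaged")

# Each repeat reshuffles the rows, so the validation folds differ between repeats
for i, (train_idx, val_idx) in enumerate(strong_cv.split(X_demo)):
    if i % 5 == 0:  # first fold of each repeat
        print(f"repeat {i // 5}: first validation fold = {val_idx}")
```

Averaging over 15 scores instead of 3 gives a less noisy CV mean, at the cost of five times as many model fits.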



## 1) Polynomial regression → cap degree + add L2 (Ridge)

In [None]:
# Replace the poly degree=7 OLS and near-zero Ridge with this tuned, regularized version
from sklearn.model_selection import GridSearchCV  # not imported by the overfitting script
poly_ridge_reg = Pipeline([
    ("poly",   PolynomialFeatures(include_bias=False)),
    ("scaler", StandardScaler()),
    ("model",  Ridge())
])

poly_ridge_grid = GridSearchCV(
    poly_ridge_reg,
    param_grid={
        "poly__degree": [1, 2, 3],               # cap model capacity
        "model__alpha": np.logspace(-2, 3, 8)    # meaningful L2 shrinkage
    },
    scoring=RMSE_SCORER, cv=cv, n_jobs=-1
)

eval_and_report("Poly + Ridge (regularized, tuned)", poly_ridge_grid)



## 2) k-NN (smooth neighborhoods, bigger k)

**Why**: larger k = smoother, less overfit

In [None]:
# Replace k=1 with a tuned, smoother k
from sklearn.model_selection import GridSearchCV  # not imported by the overfitting script
knn_reg = Pipeline([
    ("scaler", StandardScaler()),
    ("model",  KNeighborsRegressor())
])

knn_grid = GridSearchCV(
    knn_reg,
    param_grid={
        "model__n_neighbors": [8, 15, 25, 40, 60],   # larger k reduces variance
        # "distance" can help CV, but Train RMSE drops back to ~0 (each training point is its own zero-distance neighbor)
        "model__weights": ["uniform", "distance"]
    },
    scoring=RMSE_SCORER, cv=cv, n_jobs=-1
)

eval_and_report("k-NN (smoothed, tuned)", knn_grid)
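
To see the effect of k directly, the sketch below sweeps a few values and prints Train RMSE next to CV RMSE. It rebuilds the same data as the scripts above so it runs on its own; the list of k values is an arbitrary choice for illustration, and no numbers are quoted here because they depend on the run.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

RNG = 42
X, y = make_regression(n_samples=800, n_features=6, noise=12.0, random_state=RNG)
y = y + 0.004 * (X[:, 0] ** 3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=RNG
)
cv = KFold(n_splits=5, shuffle=True, random_state=RNG)

results = {}
for k in [1, 5, 15, 40]:
    knn = Pipeline([
        ("scaler", StandardScaler()),
        ("model", KNeighborsRegressor(n_neighbors=k)),
    ])
    # cross_val_score returns negative MSE; convert each fold to RMSE, then average
    fold_mse = -cross_val_score(knn, X_train, y_train, cv=cv,
                                scoring="neg_mean_squared_error")
    cv_rmse = np.sqrt(fold_mse).mean()

    knn.fit(X_train, y_train)
    train_rmse = np.sqrt(mean_squared_error(y_train, knn.predict(X_train)))

    results[k] = (train_rmse, cv_rmse)
    print(f"k={k:2d} | Train RMSE: {train_rmse:7.3f} | CV RMSE: {cv_rmse:7.3f}")
```

With uniform weights, k=1 reproduces the training targets exactly, while larger k averages over neighbors and moves Train RMSE away from zero and closer to the CV estimate.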



## 3) Sanity Checks (simple but effective)

* Prefer the smallest model whose CV mean ± std beats simpler baselines.

* If CV improves but test doesn’t, tighten the settings above (shallower/smoother/more regularized) or increase data.

* Plot your validation curve (error vs degree/depth/etc.) and stop at the U-shape minimum.
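
The validation-curve check in the last bullet can be sketched with scikit-learn's `validation_curve`. This example sweeps polynomial degree with unregularized `LinearRegression` (a choice made here so that degree is the only capacity knob) on the same synthetic data as the scripts above, and prints the curve as a table instead of plotting it.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, KFold, validation_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression

RNG = 42
X, y = make_regression(n_samples=800, n_features=6, noise=12.0, random_state=RNG)
y = y + 0.004 * (X[:, 0] ** 3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=RNG
)

model = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])
degrees = [1, 2, 3, 4, 5]

# Scores come back with shape (n_degrees, n_folds), as negative RMSE
train_scores, val_scores = validation_curve(
    model, X_train, y_train,
    param_name="poly__degree", param_range=degrees,
    cv=KFold(n_splits=5, shuffle=True, random_state=RNG),
    scoring="neg_root_mean_squared_error",
)
train_rmse = -train_scores.mean(axis=1)
val_rmse = -val_scores.mean(axis=1)

for d, tr, va in zip(degrees, train_rmse, val_rmse):
    print(f"degree={d} | Train RMSE: {tr:8.3f} | Validation RMSE: {va:8.3f}")

best_degree = degrees[int(np.argmin(val_rmse))]
print("Degree with lowest validation RMSE:", best_degree)
```

Passing `degrees`, `train_rmse`, and `val_rmse` to a plotting call such as matplotlib's `plt.plot` draws the curve; the point where validation error stops falling while training error keeps dropping marks where extra capacity starts to overfit.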



# What all of these do

* RepeatedKFold stabilizes CV and nudges selection toward simpler models.

* Poly+Ridge limits polynomial degree and adds meaningful L2 shrinkage.

* k-NN uses a larger k (and optionally distance weights) to reduce variance.


Run with these changes and you should see Train RMSE ≈ CV/Test RMSE and improved stability (smaller CV std).

# Start your experimentation here

Below, make as many code cells as you need to. Copy and paste the overfitting code and then make the drop-in fixes one by one, replacing the parts systematically so that you can compare the results. Make one final version with all the fixes you decide to use, and check your final results to see if there is still overfitting.

Then make a text cell to describe what you changed, why, and how it affected the fit of the model.