<a href="https://colab.research.google.com/github/juliabui/csc408-411/blob/main/CSC411Mod3SpotTheLeakageActivity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Spot the Leakage

There are two lines in this script where we call `.fit(...)` on the entire dataset *before* the train/test split. Those are the "leaky" spots.

###Why they're leaky

* `StandardScaler().fit(X)` learns means/SDs from **train + test** → the test set influences preprocessing.

* `SelectKBest(...).fit(X_scaled_leaky, y)` is worse: it uses **labels** `y` from the entire dataset to choose features, so the test labels directly shape the features.
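
To see why label-driven selection is the bigger risk, here is a minimal sketch on synthetic data (an assumption for illustration, not the breast cancer dataset used in this activity). Every feature is pure noise, so an honest estimate should sit near chance. Selecting features on all rows before cross-validation lets held-out labels shape the feature set, which tends to inflate the score; the same selection inside a pipeline does not.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Assumed toy setup: 100 rows, 5000 noise features, balanced labels.
# The features are generated independently of y, so there is no real signal.
X_noise = rng.normal(size=(100, 5000))
y_noise = np.repeat([0, 1], 50)

# LEAKY: pick the 20 "best" features using ALL rows and ALL labels, then cross-validate.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y_noise, cv=5)

# CLEAN: selection lives inside the pipeline, so it is re-fit on each training fold only.
clean_pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
clean_scores = cross_val_score(clean_pipe, X_noise, y_noise, cv=5)

print(f"leaky CV accuracy: {leaky_scores.mean():.2f}")
print(f"clean CV accuracy: {clean_scores.mean():.2f}")
```

The gap between the two numbers is entirely an artifact of where `SelectKBest` was fit, since the data and the classifier are identical in both runs.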

###How to "fix"

* **Split first**, then put preprocessing and selection **inside a `Pipeline`** so `.fit(...)` happens only on the training fold.
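
One way to check that a pipeline really fit its preprocessing on the training rows only is to compare the scaler's learned means against the training means and the full-dataset means. The sketch below is a simplified two-step pipeline (scaler plus logistic regression) built for this check; it is not the full pipeline from the activity code.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Split first, so the test rows are never seen by any .fit(...) call.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

# The fitted scaler stores the per-feature means it learned during fit.
fitted_means = pipe.named_steps["scaler"].mean_
uses_train_only = np.allclose(fitted_means, X_train.mean(axis=0))
uses_full_data = np.allclose(fitted_means, X.mean(axis=0))

print("scaler means match training rows:", uses_train_only)
print("scaler means match full dataset: ", uses_full_data)
```

If the first check is true and the second is false, the scaler never saw the test rows.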

In [None]:
# Activity 1: Spot the Leakage  ----------------------------------------------
# Goal: Identify data leakage (doing preprocessing/feature selection on the FULL dataset
# before the split/CV) and fix it with a Pipeline.

import numpy as np
import pandas as pd

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

RNG = 42

# ----------------------------------------------------------------------
# 0) Data
# ----------------------------------------------------------------------
data = load_breast_cancer(as_frame=True)
X = data.data
y = data.target  # 0/1

# Helper for evaluation
def evaluate(model, X_train, X_test, y_train, y_test, name="model"):
    model.fit(X_train, y_train)
    if hasattr(model, "predict_proba"):
        y_proba = model.predict_proba(X_test)[:, 1]
    else:
        # Fallback for models without predict_proba: map decision scores to (0, 1)
        # with the logistic sigmoid, so the 0.50 threshold below corresponds to a
        # decision score of 0 (min-max rescaling would shift that cutoff arbitrarily).
        y_proba = 1.0 / (1.0 + np.exp(-model.decision_function(X_test)))
    y_pred = (y_proba >= 0.50).astype(int)

    print(f"\n=== {name} ===")
    print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred, digits=3))
    print("ROC AUC:", round(roc_auc_score(y_test, y_proba), 3))


# ----------------------------------------------------------------------
# 1) LEAKY VERSION (WRONG ON PURPOSE)
#    - Scaling and feature selection are fit on the FULL dataset
#    - THEN we split → information from the test set has leaked into preprocessing
# ----------------------------------------------------------------------
scaler_leaky   = StandardScaler().fit(X)                     # fit on ALL data
X_scaled_leaky = scaler_leaky.transform(X)

selector_leaky = SelectKBest(score_func=f_classif, k=10).fit(X_scaled_leaky, y)  # fit on ALL data (uses y! omg)
X_sel_leaky    = selector_leaky.transform(X_scaled_leaky)

X_train_L, X_test_L, y_train_L, y_test_L = train_test_split(
    X_sel_leaky, y, test_size=0.2, stratify=y, random_state=RNG
)

lr_leaky = LogisticRegression(max_iter=1000, solver="lbfgs", class_weight="balanced", random_state=RNG)
evaluate(lr_leaky, X_train_L, X_test_L, y_train_L, y_test_L, name="LEAKY workflow")


# ----------------------------------------------------------------------
# 2) FIXED VERSION (CORRECT)
#    - Split first
#    - Use a Pipeline so that scaling & feature selection are fit ONLY on train
#      and re-fit inside each CV fold (no peeking)
# ----------------------------------------------------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=RNG
)

pipe_clean = Pipeline(steps=[
    ("scaler",   StandardScaler()),
    ("select",   SelectKBest(score_func=f_classif, k=10)),
    ("model",    LogisticRegression(max_iter=1000, solver="lbfgs", class_weight="balanced", random_state=RNG))
])

evaluate(pipe_clean, X_train, X_test, y_train, y_test, name="CLEAN pipeline (k=10)")


# ----------------------------------------------------------------------
# 3) (Optional) Model selection the RIGHT way (inside CV)
#    - Tune k with GridSearchCV + StratifiedKFold
#    - All preprocessing is inside the pipeline (no leakage)
# ----------------------------------------------------------------------
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RNG)

param_grid = {
    "select__k": [5, 10, 15, 20],
    "model__C":  [0.1, 1.0, 10.0],  # regularization strength
    "model__penalty": ["l2"]        # can add "l1" with solver="saga"
}

pipe_search = Pipeline(steps=[
    ("scaler",   StandardScaler()),
    ("select",   SelectKBest(score_func=f_classif)),
    ("model",    LogisticRegression(max_iter=1000, solver="lbfgs",
                                    class_weight="balanced", random_state=RNG))
])

grid = GridSearchCV(pipe_search, param_grid=param_grid, scoring="roc_auc",
                    cv=cv, n_jobs=-1, refit=True)
grid.fit(X_train, y_train)

print("\nBest CV params:", grid.best_params_)
print("Best CV ROC AUC:", round(grid.best_score_, 3))

# Evaluate best pipeline on held-out test
best_model = grid.best_estimator_
evaluate(best_model, X_train, X_test, y_train, y_test, name="BEST pipeline (CV-tuned)")

# ----------------------------------------------------------------------
# Student prompts:
# - Where exactly is the leakage in the LEAKY version?
# - Why is SelectKBest before the split a bigger problem than scaling?
# - Compare AUC and precision/recall between LEAKY vs CLEAN vs BEST.
# - If we changed the threshold to maximize F1, how would your confusion matrix change?
# ----------------------------------------------------------------------



=== LEAKY workflow ===
Confusion matrix:
 [[40  2]
 [ 8 64]]
              precision    recall  f1-score   support

           0      0.833     0.952     0.889        42
           1      0.970     0.889     0.928        72

    accuracy                          0.912       114
   macro avg      0.902     0.921     0.908       114
weighted avg      0.919     0.912     0.913       114

ROC AUC: 0.991

=== CLEAN pipeline (k=10) ===
Confusion matrix:
 [[40  2]
 [ 8 64]]
              precision    recall  f1-score   support

           0      0.833     0.952     0.889        42
           1      0.970     0.889     0.928        72

    accuracy                          0.912       114
   macro avg      0.902     0.921     0.908       114
weighted avg      0.919     0.912     0.913       114

ROC AUC: 0.991

Best CV params: {'model__C': 10.0, 'model__penalty': 'l2', 'select__k': 20}
Best CV ROC AUC: 0.997

=== BEST pipeline (CV-tuned) ===
Confusion matrix:
 [[40  2]
 [ 4 68]]
             

#Compare & Learn

1. Where's the leakage?


2. But my LEAKY and CLEAN results are identical - does that mean leakage is harmless?


3. What changed from CLEAN to BEST (CV-tuned)? Be precise about FNs (read the confusion matrix), FPs, TNs, and TPs. Look at Recall, F1, Accuracy, and AUC for Class-1. Did they improve or not?


4. Which error type improved most, and what does that mean in a clinical context?


5. Why did AUC barely move (0.991→0.994) even though recall/F1 improved a lot?



6. If we (incorrectly) do feature selection BEFORE CV, what happens to CV scores vs test scores?


7. How does the Pipeline stop leakage during CV?


8. Why is leaky feature selection more dangerous than leaky scaling?


9. Given the three reports, which model would you deploy and why?
* A: BEST (CV-tuned). It improves recall/F1 without increasing FPs (still 2), and it was selected with leakage-safe CV.

10. If your goal is screening (prioritize recall), what's your next step after BEST?


11. What single line proves your evaluation is leakage-safe?


12. Name two other common leakage traps to watch for.
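
For questions 3 and 9, the class-1 metrics can be recomputed directly from the confusion matrices printed in the output above. This is a small sketch; `class1_metrics` is a helper name introduced here for illustration, and it assumes scikit-learn's layout of `[[TN, FP], [FN, TP]]`.

```python
import numpy as np

def class1_metrics(cm):
    """Return class-1 precision, recall, F1, and overall accuracy from a 2x2 confusion matrix."""
    tn, fp, fn, tp = np.asarray(cm).ravel()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

# Confusion matrices copied from the printed output.
cm_clean = [[40, 2], [8, 64]]
cm_best = [[40, 2], [4, 68]]

for name, cm in [("CLEAN", cm_clean), ("BEST", cm_best)]:
    p, r, f1, acc = class1_metrics(cm)
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f} accuracy={acc:.3f}")
```

False negatives drop from 8 to 4 while false positives stay at 2, which is why recall and F1 move while precision barely changes.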
