# # Notebook 4: Preventing Data Leakage with Pipelines
#
# ## Goals
# * Understand how preprocessing steps (e.g., scaling, imputation) can cause **Data Leakage** if applied incorrectly during cross-validation.
# * Demonstrate the wrong way (applying preprocessing *before* CV) and the right way (using `sklearn.pipeline.Pipeline`).
# * Emphasize why Pipelines are essential for valid model evaluation.

# ## 1. The Problem: Data Leakage from Preprocessing
#
# Many machine learning algorithms perform better when input features are preprocessed. Common steps include:
# *   **Scaling/Normalization:** Bringing features to a similar range (e.g., using `StandardScaler`, `MinMaxScaler`).
# *   **Imputation:** Filling in missing values (e.g., using `SimpleImputer`).
# *   **Feature Selection:** Choosing a subset of relevant features (e.g., based on statistical tests).
# *   **Dimensionality Reduction:** Reducing the number of features (e.g., using `PCA`).
#
# **The Danger:** These preprocessing steps often learn parameters *from the data itself*.
# *   `StandardScaler` learns the mean and standard deviation.
# *   `SimpleImputer` learns the mean, median, or most frequent value.
# *   Feature selection methods might calculate correlations or p-values across the data.
# *   `PCA` finds principal components based on the data's variance structure.
#
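# As a minimal sketch of what "learning parameters from the data" means, the snippet below fits each preprocessor on a tiny toy array and prints the fitted attributes that scikit-learn stores with a trailing underscore. The array `X_demo` and its values are illustrative assumptions, not this notebook's data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 6 samples, 2 features, one missing value.
X_demo = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [4.0, 40.0],
    [5.0, 50.0],
    [6.0, 60.0],
])

# SimpleImputer stores one fill value per column in `statistics_`.
imputer = SimpleImputer(strategy="mean").fit(X_demo)
X_filled = imputer.transform(X_demo)

# StandardScaler stores the per-column mean and standard deviation.
scaler = StandardScaler().fit(X_filled)

# PCA stores the principal axes it found in `components_`.
pca = PCA(n_components=1).fit(scaler.transform(X_filled))

print("Imputer fill values:", imputer.statistics_)
print("Scaler means:       ", scaler.mean_)
print("Scaler std devs:    ", scaler.scale_)
print("PCA first component:", pca.components_[0])
```

# Every printed value depends on which rows were passed to `fit`, which is why the choice of rows matters during cross-validation.
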
# **Leakage Scenario:** If you apply a preprocessor (like `StandardScaler`) to your *entire* development dataset *before* performing cross-validation:
# 1.  The scaler calculates the mean/std using **all** the data (including samples that will later be in the validation folds).
# 2.  You then split the *already scaled* data into train/validation folds during CV.
# 3.  The model is trained on a fold's training set and evaluated on its validation set.
#
# **Why is this wrong?** The scaling applied to each validation fold was computed in part from that fold's own samples (via the initial `fit` on the whole dataset). The model effectively gets a "peek" at information from the validation set through the shared scaling parameters. This violates the principle that evaluation data must not influence any fitted part of the workflow, preprocessing included.
#
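# **Result:** Optimistically biased performance scores during CV. For simple scaling the bias is often small; for steps that use the labels, such as feature selection, it can be large.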
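#
# The leak is easiest to see with feature selection, one of the steps listed above. The sketch below is an illustration under stated assumptions: the features are pure noise and the labels are unrelated to them, so no honest procedure should beat chance by much. The names `X_noise` and `y_noise`, the sample and feature counts, and `k=20` are arbitrary choices made for this example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)

# Hypothetical data: noise features that carry no information about the labels.
X_noise = rng.normal(size=(60, 5000))
y_noise = np.repeat([0, 1], 30)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Leaky: choose the 20 "best" features using ALL samples, then cross-validate.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_acc = cross_val_score(clf, X_selected, y_noise, cv=cv).mean()

# Honest: the selector is refit on the training portion of every fold.
honest = Pipeline([("select", SelectKBest(f_classif, k=20)), ("model", clf)])
honest_acc = cross_val_score(honest, X_noise, y_noise, cv=cv).mean()

print(f"Leaky accuracy:  {leaky_acc:.2f}")
print(f"Honest accuracy: {honest_acc:.2f}")
```

# Because the selector in the leaky variant has already seen the validation labels, its accuracy should come out well above chance even though there is nothing to learn.
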

# ## 2. The Solution: Scikit-learn Pipelines
#
# A `Pipeline` object chains multiple processing steps (transformers) and a final estimator (model) together. When a `Pipeline` is used within a cross-validation function (`cross_val_score`, `cross_validate`, `GridSearchCV`):
#
# 1.  For each CV fold:
#     *   The preprocessing steps are fitted (`fit`) on the **training portion** of that fold **only**, and that portion is then transformed.
#     *   The *fitted* preprocessors are then used to `transform` the **validation portion** of that fold.
#     *   The final estimator is trained on the preprocessed training data and evaluated on the preprocessed validation data.
#
# This correctly simulates the real-world scenario where preprocessing is learned *only* from the training data and then applied to new, unseen data.
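#
# The per-fold steps above can be sketched by hand. This is a rough illustration, not scikit-learn's internal code: it uses hypothetical synthetic data, a plain `KFold` splitter, and compares a manual loop against `cross_val_score` on an equivalent pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data for the illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

manual_scores = []
for train_idx, val_idx in cv.split(X, y):
    # Fit the scaler on the training portion only.
    scaler = StandardScaler().fit(X[train_idx])
    X_train = scaler.transform(X[train_idx])
    # Reuse the training statistics for the validation portion.
    X_val = scaler.transform(X[val_idx])
    clf = LogisticRegression(max_iter=1000).fit(X_train, y[train_idx])
    manual_scores.append(roc_auc_score(y[val_idx], clf.decision_function(X_val)))

# The same procedure, delegated to a Pipeline inside cross_val_score.
pipe = Pipeline([("scaler", StandardScaler()),
                 ("model", LogisticRegression(max_iter=1000))])
pipeline_scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")

print("Manual loop matches pipeline:", np.allclose(manual_scores, pipeline_scores))
```

# The pipeline saves the bookkeeping and removes the chance of accidentally fitting a step on the wrong rows.
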

# ## 3. Setup and Data Preparation
#
# Let's load our development data and rescale two features so that scaling becomes relevant. An optional, commented-out block also introduces missing values.

# +
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, GroupKFold, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer # Optional: for missing data demo
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import time

# Assume X_dev, y_dev, groups_dev are loaded from Notebook 0
RANDOM_STATE = 42
try:
    X_dev.shape
    print("Using data loaded from previous notebook.")
    groups_dev = groups_dev.astype(int)
except NameError:
    print("Generating synthetic data for standalone execution...")
    from sklearn.datasets import make_classification
    N_SAMPLES_DEV = 400
    N_FEATURES = 20
    N_PATIENTS_DEV = 80
    IMBALANCE = 0.8
    X_dev, y_dev = make_classification(
        n_samples=N_SAMPLES_DEV, n_features=N_FEATURES, n_informative=10, n_redundant=5, n_repeated=0,
        n_classes=2, n_clusters_per_class=2, weights=[IMBALANCE, 1.0 - IMBALANCE],
        flip_y=0.05, class_sep=0.8, random_state=RANDOM_STATE
    )
    np.random.seed(RANDOM_STATE)  # Seed before any random draw so group assignment is reproducible
    samples_per_patient = N_SAMPLES_DEV // N_PATIENTS_DEV
    groups_dev = np.repeat(np.arange(N_PATIENTS_DEV), samples_per_patient)
    remaining_samples = N_SAMPLES_DEV % N_PATIENTS_DEV
    if remaining_samples > 0:
        groups_dev = np.concatenate([groups_dev, np.random.choice(N_PATIENTS_DEV, remaining_samples)])
    np.random.shuffle(groups_dev)
    groups_dev = groups_dev.astype(int)
    print(f"Generated X_dev shape: {X_dev.shape}, y_dev shape: {y_dev.shape}, groups_dev shape: {groups_dev.shape}")


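# Modify features to have different scales for illustration
np.random.seed(RANDOM_STATE)  # Only needed for the optional missing-value block below
X_dev = np.array(X_dev, dtype=float)  # Work on a float ndarray copy so upstream data is not modified in place
X_dev[:, 0] = X_dev[:, 0] * 1000  # Make first feature have large scale
X_dev[:, 1] = X_dev[:, 1] * 0.001  # Make second feature have small scale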

# Optional: Introduce some missing values
# X_dev_nan = X_dev.copy().astype(float)
# mask = np.random.choice([True, False], size=X_dev_nan.shape, p=[0.05, 0.95]) # 5% missing
# X_dev_nan[mask] = np.nan
# print(f"Introduced {np.isnan(X_dev_nan).sum()} missing values.")
# X_dev = X_dev_nan # Use data with NaNs if imputing

# Define model and CV strategy
model = LogisticRegression(solver='liblinear', random_state=RANDOM_STATE, max_iter=1000)

# Choose appropriate CV strategy - GroupKFold if groups are relevant
n_unique_groups = len(np.unique(groups_dev))
N_SPLITS = min(5, n_unique_groups)  # GroupKFold cannot use more splits than there are groups
cv_strategy = GroupKFold(n_splits=N_SPLITS)
# If not using groups: cv_strategy = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=RANDOM_STATE)
print(f"Using CV Strategy: {type(cv_strategy).__name__} with {N_SPLITS} splits.")

# -
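
# When groups matter, it can help to confirm that the splitter keeps each group on one side of every split. The check below is a minimal sketch on hypothetical group labels (20 patients with 5 samples each), not on `groups_dev`.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical groups: 20 patients with 5 samples each.
groups = np.repeat(np.arange(20), 5)
X = np.zeros((groups.size, 3))  # Feature values do not matter for splitting

overlaps = []
for train_idx, val_idx in GroupKFold(n_splits=5).split(X, groups=groups):
    # Group ids that appear on both sides of this split.
    shared = np.intersect1d(groups[train_idx], groups[val_idx])
    overlaps.append(int(shared.size))

print("Groups shared between train and validation, per fold:", overlaps)
```

# A non-zero entry would mean samples from the same patient sit on both sides of a split, which is a separate source of leakage from the preprocessing issue covered here.
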

# ## 4. The WRONG Way: Preprocessing Before CV
#
# Here, we apply `StandardScaler` to the entire development set first, and then run cross-validation on the *already scaled* data. **This leaks information.**

# +
print("--- Method 1: Incorrect Scaling (Before CV - Data Leakage!) ---")

# 1. Instantiate Scaler
scaler_wrong = StandardScaler()

# 2. Fit scaler on ALL development data
scaler_wrong.fit(X_dev) # Learns mean/std from all dev data (including future validation folds)

# 3. Transform ALL development data
X_dev_scaled_wrong = scaler_wrong.transform(X_dev)

# 4. Perform Cross-Validation on the pre-scaled data
start_time = time.time()
wrong_scores = cross_val_score(
    model,
    X_dev_scaled_wrong, # Use the incorrectly scaled data
    y_dev,
    groups=groups_dev,  # Pass groups if GroupKFold is used
    cv=cv_strategy,
    scoring='roc_auc',
    n_jobs=-1
)
wrong_time = time.time() - start_time

print("\nResults (Scaling Before CV):")
print(f"  Fold AUCs: {wrong_scores}")
print(f"  Mean AUC:  {wrong_scores.mean():.4f} (+/- {wrong_scores.std():.4f})")
print(f"  Time taken: {wrong_time:.2f} seconds")
print("  (This score may be optimistically biased due to data leakage!)")
# -
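
# To make the leaked quantity concrete, the sketch below compares the mean a scaler learns from all rows with the mean it would learn from each fold's training rows alone. The data are hypothetical: one noise feature with a single extreme value placed in the last row, so that it lands in the last validation fold of an unshuffled `KFold`.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: 50 samples, one feature, last sample is an extreme value.
X = rng.normal(size=(50, 1))
X[-1, 0] = 25.0

# What the "wrong way" does: statistics computed from every row.
full_scaler = StandardScaler().fit(X)

gaps = []
for fold, (train_idx, val_idx) in enumerate(KFold(n_splits=5).split(X)):
    # What a pipeline does: statistics computed from the training rows only.
    fold_scaler = StandardScaler().fit(X[train_idx])
    gap = abs(fold_scaler.mean_[0] - full_scaler.mean_[0])
    gaps.append(gap)
    print(f"fold {fold}: train-only mean={fold_scaler.mean_[0]:+.3f}, "
          f"full-data mean={full_scaler.mean_[0]:+.3f}, gap={gap:.3f}")
```

# In the last fold the training rows exclude the extreme value, yet the full-data scaler has already absorbed it. That difference is information from the validation rows reaching the model's inputs.
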

# ## 5. The RIGHT Way: Using a Pipeline
#
# Here, we create a `Pipeline` that includes the `StandardScaler` and the `LogisticRegression` model. We then pass this `pipeline` object directly to `cross_val_score`. The CV function ensures the scaler is fitted correctly *within* each fold.

# +
print("\n--- Method 2: Correct Scaling (Using Pipeline inside CV) ---")

# 1. Define the steps for the pipeline
steps = [
    # Optional: ('imputer', SimpleImputer(strategy='mean')),  # Add imputation (before scaling) if needed
    ('scaler', StandardScaler()),  # Step 1: Scale data
    ('model', model)               # Step 2: Logistic Regression model
]

# 2. Create the Pipeline
pipeline = Pipeline(steps)
print(f"Pipeline steps: {pipeline.steps}")

# 3. Perform Cross-Validation using the pipeline as the estimator
start_time = time.time()
correct_scores = cross_val_score(
    pipeline,           # Pass the entire pipeline object
    X_dev,              # Use the ORIGINAL, unscaled data
    y_dev,
    groups=groups_dev,  # Pass groups if GroupKFold is used
    cv=cv_strategy,
    scoring='roc_auc',
    n_jobs=-1
)
correct_time = time.time() - start_time

print("\nResults (Scaling within Pipeline):")
print(f"  Fold AUCs: {correct_scores}")
print(f"  Mean AUC:  {correct_scores.mean():.4f} (+/- {correct_scores.std():.4f})")
print(f"  Time taken: {correct_time:.2f} seconds")
print("  (This score is the more realistic estimate.)")

# Compare the results
print("\n--- Comparison ---")
print(f"Mean AUC (Scaling BEFORE CV - WRONG): {wrong_scores.mean():.4f}")
print(f"Mean AUC (Scaling inside CV - RIGHT): {correct_scores.mean():.4f}")

diff = wrong_scores.mean() - correct_scores.mean()
if diff > 0.001:  # Treat differences below 0.001 AUC as negligible
    print(f"\nThe incorrect method produced an inflated score by ~{diff:.4f}.")
elif diff < -0.001:
    print("\nInterestingly, the correct method produced a higher score here. This can sometimes happen,")
    print(" but the pipeline approach is still the methodologically sound way to avoid bias.")
else:
    print("\nThe scores are very similar in this run, but using the pipeline is still crucial for methodological correctness.")
# -
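
# The optional imputation step fits into the same pattern. The sketch below is one possible arrangement, assuming hypothetical synthetic data with roughly 5% of entries set to NaN, a median imputer placed before the scaler, and a `StratifiedKFold` splitter instead of the grouped one used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data with roughly 5% of entries missing.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # Fill values learned per training fold
    ("scaler", StandardScaler()),                   # Mean/std learned per training fold
    ("model", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print(f"Mean AUC with imputation inside the pipeline: {scores.mean():.3f}")
```

# Each added step is refit on the training portion of every fold, so extending the pipeline does not reintroduce leakage.
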

# ## 6. Conclusion
#
# Data leakage from preprocessing steps applied before cross-validation is a common and serious error that leads to overly optimistic performance estimates. Using `sklearn.pipeline.Pipeline` is the standard and correct way to integrate preprocessing steps into your cross-validation workflow. It ensures that information from the validation set does not improperly influence the model training process via shared preprocessing parameters, giving you a much more reliable estimate of your model's true generalization ability. **Always use pipelines for preprocessing within CV.**