# # Notebook 5: CV for Hyperparameter Tuning vs. Model Evaluation
#
# ## Goals
# * Distinguish between the two primary uses of cross-validation:
#     1.  **Hyperparameter Tuning:** Selecting the best model settings.
#     2.  **Model Evaluation:** Estimating the performance of the final chosen model.
# * Implement hyperparameter tuning using `GridSearchCV`, which uses CV internally.
# * Understand why the score reported by `GridSearchCV` (`best_score_`) is **optimistically biased** and should **not** be reported as the final model performance estimate.

# ## 1. Two Distinct Goals for Cross-Validation
#
# Cross-validation is a powerful tool, but it's vital to recognize its different applications:
#
# **Use Case 1: Hyperparameter Tuning (Model Selection)**
# *   **Purpose:** To find the optimal settings (hyperparameters) for a model or to choose between different model types. Examples of hyperparameters include the `C` parameter in Logistic Regression/SVM, the number of trees in a Random Forest, or the learning rate in neural networks.
# *   **Method:** Tools like `GridSearchCV` or `RandomizedSearchCV` automate this. They systematically try different combinations of hyperparameters. For *each combination*, they perform an internal cross-validation on the training data to estimate how well that specific combination performs. The combination yielding the best average CV score is selected as the "best".
# *   **Output:** The optimal set of hyperparameters (`best_params_`).
#
# **Use Case 2: Model Evaluation (Performance Estimation)**
# *   **Purpose:** To estimate the generalization performance (e.g., accuracy, AUC) of a *single, finalized* model (often one with hyperparameters already chosen via tuning) on unseen data.
# *   **Method:** Apply a CV strategy (like `StratifiedKFold` or `GroupKFold`) to the model using functions like `cross_val_score` or `cross_validate`.
# *   **Output:** A performance estimate, usually reported as mean +/- standard deviation across the folds (e.g., `0.85 +/- 0.03` AUC).
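#
# As a rough illustration of the two use cases above, the sketch below runs both on a small toy dataset. The names `X_toy`, `y_toy`, `cv_toy`, and the fixed value `C=1.0` are assumptions made for this example only; they are not the development data used later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Toy data (hypothetical; stands in for a real development set)
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)
cv_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Use case 1: tuning. The output of interest is the chosen hyperparameters.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=cv_toy,
    scoring="roc_auc",
)
search.fit(X_toy, y_toy)
chosen_params = search.best_params_
print("Chosen hyperparameters:", chosen_params)

# Use case 2: evaluation. The output of interest is a performance estimate
# for one fixed model (C=1.0 is fixed in advance here, not tuned).
fixed_model = LogisticRegression(C=1.0, max_iter=1000)
fold_scores = cross_val_score(fixed_model, X_toy, y_toy, cv=cv_toy, scoring="roc_auc")
print(f"AUC: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

# The first call answers "which settings?", the second answers "how well does this one model generalize?". Mixing up the two answers is the pitfall discussed in the next section.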
#
# ## 2. The Pitfall: Reporting Tuning Scores as Final Performance
#
# A common mistake is to use `GridSearchCV` to find the best parameters and then report its `.best_score_` attribute as the final estimate of how well the model will perform on new data.
#
# **Why is this wrong?**
# The `.best_score_` from `GridSearchCV` is the score achieved on the *internal* validation folds *used during the search*. The hyperparameters were *chosen specifically because* they maximized performance on these folds. The process has selected the parameters that best fit the idiosyncrasies of those particular data splits.
#
# Therefore, `.best_score_` is the maximum of several noisy CV estimates, and the maximum of noisy estimates tends to overshoot the true value. It does not tell you how a model trained with the chosen parameters will perform on *completely new, unseen data* that played no part in the tuning process. It is an **optimistically biased** estimate.
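
# The selection effect can be seen in a small simulation that involves no model at all. This is a sketch under strong assumptions: every candidate has the same true score (`TRUE_SCORE`), and each CV estimate is that score plus independent Gaussian noise (`NOISE_SD`). Both values, and all names below, are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_SCORE = 0.80    # assumed: every candidate has the same true performance
NOISE_SD = 0.03      # assumed: spread of a CV estimate around the true score
N_CANDIDATES = 6     # e.g. six values of C in a grid
N_REPEATS = 10_000   # number of simulated tuning runs

# Each row is one simulated tuning run; each column is one candidate's CV estimate.
cv_estimates = TRUE_SCORE + NOISE_SD * rng.standard_normal((N_REPEATS, N_CANDIDATES))

# Selecting the best candidate means reporting the row-wise maximum.
simulated_best_scores = cv_estimates.max(axis=1)

mean_single = cv_estimates[:, 0].mean()
mean_best = simulated_best_scores.mean()
print(f"Mean estimate for one fixed candidate: {mean_single:.4f}")
print(f"Mean of the selected (max) estimate:   {mean_best:.4f}")
```

# Any single candidate is estimated without bias, yet the reported maximum sits above the true score on average. In a real grid search the candidates share folds, so their estimates are correlated and the effect would likely be smaller than in this independent-noise sketch.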

# ## 3. Setup and Data
#
# We'll use our development data and a model that has hyperparameters to tune (Logistic Regression with `C`). We also need a pipeline for proper preprocessing.

# +
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, StratifiedKFold, GroupKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import time

# Assume X_dev, y_dev, groups_dev are loaded
RANDOM_STATE = 42
try:
    X_dev.shape
    print("Using data loaded from previous notebook.")
    groups_dev = groups_dev.astype(int)
    X_dev[:, 0] = X_dev[:, 0] * 1000 # Ensure scaling is relevant
    X_dev[:, 1] = X_dev[:, 1] * 0.001
except NameError:
    print("Generating synthetic data for standalone execution...")
    from sklearn.datasets import make_classification
    N_SAMPLES_DEV = 400
    N_FEATURES = 20
    N_PATIENTS_DEV = 80
    IMBALANCE = 0.8
    X_dev, y_dev = make_classification(
        n_samples=N_SAMPLES_DEV, n_features=N_FEATURES, n_informative=10, n_redundant=5, n_repeated=0,
        n_classes=2, n_clusters_per_class=2, weights=[IMBALANCE, 1.0 - IMBALANCE],
        flip_y=0.05, class_sep=0.8, random_state=RANDOM_STATE
    )
    samples_per_patient = N_SAMPLES_DEV // N_PATIENTS_DEV
    groups_dev = np.repeat(np.arange(N_PATIENTS_DEV), samples_per_patient)
    remaining_samples = N_SAMPLES_DEV % N_PATIENTS_DEV
    np.random.seed(RANDOM_STATE)  # Seed before any random call so the group assignment is reproducible
    if remaining_samples > 0:
        groups_dev = np.concatenate([groups_dev, np.random.choice(N_PATIENTS_DEV, remaining_samples)])
    np.random.shuffle(groups_dev)
    groups_dev = groups_dev.astype(int)
    X_dev[:, 0] = X_dev[:, 0] * 1000
    X_dev[:, 1] = X_dev[:, 1] * 0.001
    print(f"Generated X_dev shape: {X_dev.shape}, y_dev shape: {y_dev.shape}, groups_dev shape: {groups_dev.shape}")

# Create a pipeline with preprocessing and the model
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(solver='liblinear', random_state=RANDOM_STATE, max_iter=1000))
])

# Define the hyperparameter grid to search
# Note the 'model__' prefix to specify parameters for the 'model' step in the pipeline
param_grid = {
    'model__C': [0.001, 0.01, 0.1, 1, 10, 100] # Inverse regularization strength (smaller C = stronger regularization)
}

# Define the INNER CV strategy used by GridSearchCV for tuning
# Use StratifiedKFold for classification tuning if groups aren't the primary concern *during tuning*
# If group structure is very strong and impacts parameter choice, could use GroupKFold here too.
N_SPLITS_INNER = 3
inner_cv = StratifiedKFold(n_splits=N_SPLITS_INNER, shuffle=True, random_state=RANDOM_STATE)
print(f"GridSearchCV will use {type(inner_cv).__name__} with {N_SPLITS_INNER} splits internally for tuning.")
# -
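
# The comments in the cell above mention that `GroupKFold` could also serve as the inner CV. A minimal sketch of what that might look like follows, on hypothetical toy data with 40 made-up patients (`X_g`, `y_g`, `groups_g`, and `pipe_g` are names introduced only for this example). The key difference is that `groups` must be passed to `fit`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: 40 hypothetical patients with 5 samples each
X_g, y_g = make_classification(n_samples=200, n_features=10, random_state=0)
groups_g = np.repeat(np.arange(40), 5)

pipe_g = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(solver="liblinear", max_iter=1000)),
])

group_search = GridSearchCV(
    pipe_g,
    param_grid={"model__C": [0.01, 0.1, 1, 10]},
    cv=GroupKFold(n_splits=3),   # group-aware inner CV
    scoring="roc_auc",
)
# groups is forwarded to the splitter, so each patient stays on one side of a split
group_search.fit(X_g, y_g, groups=groups_g)
print("Best C with group-aware tuning:", group_search.best_params_["model__C"])

# Sanity check: no patient appears in both the training and validation part of a split.
for train_idx, val_idx in GroupKFold(n_splits=3).split(X_g, y_g, groups_g):
    assert set(groups_g[train_idx]).isdisjoint(groups_g[val_idx])
```

# Whether this changes the selected `C` depends on how strongly samples from the same patient resemble each other; on randomly assigned groups like these, it probably will not.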

# ## 4. Hyperparameter Tuning with GridSearchCV
#
# We run `GridSearchCV` on the development data (`X_dev`, `y_dev`). It uses the `inner_cv` strategy internally.

# +
print("\n--- Running GridSearchCV for Hyperparameter Tuning ---")

# Instantiate GridSearchCV
# We pass the pipeline, parameter grid, inner CV strategy, and scoring metric
grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=inner_cv,            # The CV strategy FOR TUNING
    scoring='roc_auc',      # Metric to optimize during tuning
    n_jobs=-1,
    refit=True              # Refit the best configuration on all of X_dev, y_dev (this is the default)
)

start_time = time.time()
# Fit GridSearchCV on the development data
# Note: If inner_cv was GroupKFold, we would pass groups=groups_dev here too.
# Since inner_cv is StratifiedKFold, we don't need groups for the *tuning* part in this case.
grid_search.fit(X_dev, y_dev)
tuning_time = time.time() - start_time

print(f"\nGridSearchCV completed in {tuning_time:.2f} seconds.")
print(f"Best parameters found: {grid_search.best_params_}")

# THIS IS THE BIASED SCORE!
best_tuning_score = grid_search.best_score_
print(f"Best **INTERNAL CV** score (AUC) during tuning: {best_tuning_score:.4f}")
print(" ** WARNING: This score is optimistically biased! Do NOT report as final performance! **")

# The grid_search object now contains the best estimator found, refitted on all X_dev, y_dev
best_model_from_tuning = grid_search.best_estimator_
print(f"\nBest estimator refitted on full development set: {best_model_from_tuning}")
# -
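
# To see where `best_score_` comes from, it can help to look at the full results table rather than only the winner. The sketch below uses hypothetical toy data (`X_r`, `y_r`) and a plain `LogisticRegression` instead of this notebook's pipeline; `cv_results_`, `best_index_`, and `best_score_` are standard `GridSearchCV` attributes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X_r, y_r = make_classification(n_samples=200, n_features=10, random_state=0)
search_r = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.001, 0.01, 0.1, 1, 10, 100]},
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring="roc_auc",
).fit(X_r, y_r)

results = search_r.cv_results_
# One row per candidate: mean and spread of the inner-CV AUC
for params, mean_auc, std_auc in zip(
    results["params"], results["mean_test_score"], results["std_test_score"]
):
    print(f"C={params['C']:<7} mean AUC={mean_auc:.4f}  std={std_auc:.4f}")

# best_score_ is the mean inner-CV score of the top-ranked row
best_from_table = results["mean_test_score"][search_r.best_index_]
print(f"best_score_ = {search_r.best_score_:.4f} (row {search_r.best_index_} of the table)")
```

# If the standard deviations are of similar size to the gaps between candidates, the winner was picked partly by chance, which is the source of the optimism in `best_score_`.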

# ## 5. Estimating Performance (Less Biased Approach - Simple CV on Best Model)
#
# The `.best_score_` is biased. A *slightly* better (but still not ideal) approach is to take the `best_estimator_` found by `GridSearchCV` and evaluate it with a *separate* cross-validation loop (e.g., `cross_val_score` with an appropriate outer CV strategy). `cross_val_score` clones the estimator and refits it on each outer training fold, so what gets evaluated is the chosen configuration, not the already-fitted object.
#
# This is better because the evaluation splits differ from the inner splits used for tuning. However, the hyperparameters were chosen using all of the development data, including the samples that later serve as outer test folds, so some optimistic bias remains. The more robust method is Nested CV (Notebook 6).

# +
print("\n--- Evaluating the Best Model Found by GridSearchCV using a Separate CV ---")

# Define the OUTER CV strategy for EVALUATION
# Use GroupKFold here if patient-wise evaluation is needed!
n_unique_groups = len(np.unique(groups_dev))
N_SPLITS_OUTER = min(5, n_unique_groups)  # GroupKFold requires n_splits <= number of unique groups
outer_cv = GroupKFold(n_splits=N_SPLITS_OUTER)
# If not using groups: outer_cv = StratifiedKFold(n_splits=N_SPLITS_OUTER, shuffle=True, random_state=RANDOM_STATE+1) # Use diff random state
print(f"Evaluation will use {type(outer_cv).__name__} with {N_SPLITS_OUTER} splits.")


start_time = time.time()
evaluation_scores = cross_val_score(
    best_model_from_tuning, # Use the single best model found by GridSearchCV
    X_dev,
    y_dev,
    groups=groups_dev,     # Pass groups if outer_cv is GroupKFold
    cv=outer_cv,           # Use the separate OUTER CV for evaluation
    scoring='roc_auc',
    n_jobs=-1
)
evaluation_time = time.time() - start_time

print("\nResults (Evaluating Best Tuned Model with Separate CV):")
print(f"  Fold AUCs: {evaluation_scores}")
print(f"  Mean AUC:  {evaluation_scores.mean():.4f} (+/- {evaluation_scores.std():.4f})")
print(f"  Time taken: {evaluation_time:.2f} seconds")
print("  (This score is less biased than GridSearchCV's .best_score_ but Nested CV is preferred.)")

# Compare the scores
print("\n--- Comparison ---")
print(f"GridSearchCV Internal Best Score (Biased):   {best_tuning_score:.4f}")
print(f"Separate CV Evaluation Score (Less Biased): {evaluation_scores.mean():.4f}")

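# Caveat: the inner CV is stratified and the outer CV is group-wise, so the gap
# mixes selection bias with the effect of the different splitting strategy.
diff = best_tuning_score - evaluation_scores.mean()
if diff > 0.001:
    print(f"\nNote that the biased tuning score was higher than the evaluation score by ~{diff:.4f}.")
else:
    print(f"\nOn this run the tuning score was not noticeably higher (difference: {diff:.4f}); the bias holds on average, not on every dataset.")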
# -

# ## 6. Conclusion
#
# It is crucial to distinguish between using cross-validation for hyperparameter tuning and for final model evaluation.
# *   `GridSearchCV` (and similar tools) use **internal CV** to find the best parameters.
# *   The score reported directly from this tuning process (`.best_score_`) is **optimistically biased** because the parameters were chosen to maximize performance on those specific internal splits.
# *   Reporting this tuning score as the final model performance is misleading.
# *   A better approach is to evaluate the tuned configuration with a *separate* CV loop. The gold standard for estimating the performance of the entire modeling *process* (including tuning) with minimal bias is **Nested Cross-Validation**, which we will explore next.
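#
# As a brief preview, a common way to set up nested CV in scikit-learn is to pass the `GridSearchCV` object itself to `cross_val_score`. The sketch below is an assumption about how that could look on hypothetical toy data (`X_n`, `y_n`); it uses `StratifiedKFold` for both loops and does not cover the group-aware case, which needs extra handling.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_n, y_n = make_classification(n_samples=200, n_features=10, random_state=0)

inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # for tuning
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # for evaluation

tuned_pipeline = GridSearchCV(
    Pipeline([
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(solver="liblinear", max_iter=1000)),
    ]),
    param_grid={"model__C": [0.01, 0.1, 1, 10]},
    cv=inner,
    scoring="roc_auc",
)

# The outer loop re-runs the whole search on each outer training fold,
# so the held-out outer fold never influences the choice of C.
nested_scores = cross_val_score(tuned_pipeline, X_n, y_n, cv=outer, scoring="roc_auc")
print(f"Nested CV AUC: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")
```

# The resulting score describes the procedure "tune, then fit" rather than one specific value of `C`; different outer folds may select different values.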