# # Notebook 6: Nested Cross-Validation
#
# ## Goals
# * Understand the concept and rationale behind Nested Cross-Validation.
# * Implement Nested CV to get a less biased estimate of model performance *after* hyperparameter tuning.
# * Compare the Nested CV score with the biased score from simple `GridSearchCV`.

# ## 1. The Need for Nested Cross-Validation
#
# As discussed in Notebook 5, using the same cross-validation process for both hyperparameter tuning and final model evaluation leads to an optimistically biased performance estimate. Why? Because the hyperparameters are chosen specifically to perform well on the validation folds used during tuning.
#
# **Nested Cross-Validation** provides a solution by separating the CV processes for tuning and evaluation into two loops:
#
# 1.  **Outer Loop (Evaluation Loop):**
#     *   Splits the data into K folds (e.g., using `StratifiedKFold` or `GroupKFold`, whichever matches the final evaluation goal).
#     *   Its primary purpose is **evaluation**. The final performance estimate comes from averaging scores across the validation sets of this outer loop.
#
# 2.  **Inner Loop (Tuning Loop):**
#     *   Runs *independently* within **each training split** of the outer loop.
#     *   Takes the training data from the current outer fold and performs *another* CV (e.g., using `GridSearchCV` with its own K' splits) **solely on that subset** to find the best hyperparameters *for that specific outer fold*.
#     *   Its primary purpose is **hyperparameter tuning** for the model being trained in the current outer fold.
#
# **How it works (per Outer Fold):**
# 1.  Outer loop splits data into `Outer Train` and `Outer Validation`.
# 2.  An **Inner CV** (e.g., `GridSearchCV`) is run *only* on the `Outer Train` data to find the locally best hyperparameters (`best_params_inner`).
# 3.  A *new model* is trained on the *entire* `Outer Train` data using `best_params_inner`.
# 4.  This model is then evaluated on the held-out `Outer Validation` set. This score is recorded.
#
# The final Nested CV performance estimate is the average of the scores recorded in step 4 across all folds of the Outer Loop.
#
# **Benefit:** The evaluation in the outer loop uses data that was **never seen** during the hyperparameter tuning process (the inner loop for that fold), yielding a much less biased estimate of the true generalization performance of the *entire model selection pipeline* (including tuning).
#
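# **Drawback:** Increased computational cost. With the default `refit=True`, each outer fold runs K_inner fits per candidate setting plus one refit, so the total is roughly K_outer × (K_inner × n_candidates + 1) model trainings.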
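#
# The four per-fold steps can also be written out as an explicit loop. The following is a minimal sketch on hypothetical toy data (`X_toy`, `y_toy`, `toy_pipeline`, and `toy_grid` are illustrative names, not objects from this notebook). It assumes the default `refit=True` of `GridSearchCV`, so that step 3 happens inside `search.fit`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data, used only to keep this sketch self-contained
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(solver="liblinear")),
])
toy_grid = {"model__C": [0.01, 1, 100]}

outer = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

outer_scores = []
for train_idx, val_idx in outer.split(X_toy, y_toy):
    # Step 1: split into Outer Train and Outer Validation
    X_tr, y_tr = X_toy[train_idx], y_toy[train_idx]
    X_val, y_val = X_toy[val_idx], y_toy[val_idx]

    # Step 2: inner CV runs on Outer Train only
    search = GridSearchCV(toy_pipeline, toy_grid, cv=inner, scoring="roc_auc")
    # Step 3: with refit=True (the default), fit() also retrains the best
    # configuration on all of Outer Train
    search.fit(X_tr, y_tr)

    # Step 4: score the refitted model on the held-out Outer Validation set
    val_proba = search.predict_proba(X_val)[:, 1]
    outer_scores.append(roc_auc_score(y_val, val_proba))

# Total number of model fits: per outer fold, K_inner fits per candidate plus one refit
n_fits = outer.get_n_splits() * (len(toy_grid["model__C"]) * inner.get_n_splits() + 1)

print(f"Outer fold AUCs: {np.round(outer_scores, 4)}")
print(f"Nested CV mean AUC: {np.mean(outer_scores):.4f}")
print(f"Model fits performed: {n_fits}")
```

# Under these assumptions the sketch performs 3 × (3 × 3 + 1) = 30 model fits, which illustrates how the cost grows with both loops.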

# ## 2. Visualization of Nested CV
#
# ```
# Development Data
# |---------------------------------------------------------|
#
# Outer Fold 1:
# |---------- Outer Train (Inner CV happens here) ----------|--- Outer Val ---|
#                 |                                         |
#                 V (GridSearchCV on Outer Train)           |
#              Best Inner Params Found                      |  Evaluate Model
#                 |                                         |  with Best Inner
#                 V                                         |  Params Here --> Score 1
#              Train Model on Outer Train w/ Best Params -->+
#
# Outer Fold 2:
# |--- Outer Val ---|---------- Outer Train (Inner CV happens here) ----------|
#                       |                                         |
#                       V (GridSearchCV on Outer Train)           |
#                    Best Inner Params Found                      |  Evaluate Model
#                       |                                         |  with Best Inner
#                       V                                         |  Params Here --> Score 2
#                    Train Model on Outer Train w/ Best Params -->+
# ... (repeat for K_outer folds)
#
# Final Score = Average(Score 1, Score 2, ...)
# ```
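#
# The diagram can be checked on row indices alone. The sketch below uses a hypothetical 12-row dataset and plain `KFold` (an assumption made for readability; the notebook itself uses `StratifiedKFold` and `GroupKFold`) to print the splits and to count how many inner-loop rows overlap the current outer validation set.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 12  # hypothetical tiny dataset
indices = np.arange(n_samples)

outer = KFold(n_splits=3)
inner = KFold(n_splits=2)

leak_count = 0        # rows used by an inner split that belong to the outer validation set
n_inner_splits = 0    # total number of inner splits visited

for fold, (outer_train, outer_val) in enumerate(outer.split(indices), start=1):
    print(f"Outer Fold {fold}: train={outer_train.tolist()} val={outer_val.tolist()}")
    for inner_train_pos, inner_val_pos in inner.split(outer_train):
        # inner.split returns positions within outer_train; map them back to row indices
        inner_train = outer_train[inner_train_pos]
        inner_val = outer_train[inner_val_pos]
        leak_count += len(set(inner_train) & set(outer_val))
        leak_count += len(set(inner_val) & set(outer_val))
        n_inner_splits += 1
        print(f"    inner train={inner_train.tolist()} inner val={inner_val.tolist()}")

print(f"Rows leaked from outer validation into tuning: {leak_count}")
```

# Because every inner split is drawn from `outer_train`, the overlap count stays at zero by construction.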

# ## 3. Setup and Data
#
# We need the development data, the pipeline definition, and the parameter grid from the previous notebook.

# +
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, StratifiedKFold, GroupKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import time

# Assume X_dev, y_dev, groups_dev are loaded
RANDOM_STATE = 42
try:
    X_dev.shape
    print("Using data loaded from previous notebook.")
    groups_dev = groups_dev.astype(int)
    # Assume features already have varied scales
except NameError:
    print("Generating synthetic data for standalone execution...")
    from sklearn.datasets import make_classification
    N_SAMPLES_DEV = 400
    N_FEATURES = 20
    N_PATIENTS_DEV = 80
    IMBALANCE = 0.8
    X_dev, y_dev = make_classification(
        n_samples=N_SAMPLES_DEV, n_features=N_FEATURES, n_informative=10, n_redundant=5, n_repeated=0,
        n_classes=2, n_clusters_per_class=2, weights=[IMBALANCE, 1.0 - IMBALANCE],
        flip_y=0.05, class_sep=0.8, random_state=RANDOM_STATE
    )
    samples_per_patient = N_SAMPLES_DEV // N_PATIENTS_DEV
    groups_dev = np.repeat(np.arange(N_PATIENTS_DEV), samples_per_patient)
    remaining_samples = N_SAMPLES_DEV % N_PATIENTS_DEV
    # Seed before any random draw so the group assignment is reproducible
    np.random.seed(RANDOM_STATE)
    if remaining_samples > 0:
        groups_dev = np.concatenate([groups_dev, np.random.choice(N_PATIENTS_DEV, remaining_samples)])
    np.random.shuffle(groups_dev)
    groups_dev = groups_dev.astype(int)
    # Give two features very different scales so that scaling matters
    X_dev[:, 0] *= 1000
    X_dev[:, 1] *= 0.001
    print(f"Generated X_dev shape: {X_dev.shape}, y_dev shape: {y_dev.shape}, groups_dev shape: {groups_dev.shape}")


# --- Define Pipeline and Parameter Grid ---
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(solver='liblinear', random_state=RANDOM_STATE, max_iter=1000))
])

param_grid = {
    'model__C': [0.001, 0.01, 0.1, 1, 10, 100]
}

# --- Define CV Strategies ---

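# Inner CV: Used by GridSearchCV for tuning WITHIN each outer fold training set.
# Note: StratifiedKFold ignores patient groups, so samples from one patient can land on both sides of an inner split.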
N_SPLITS_INNER = 3
inner_cv = StratifiedKFold(n_splits=N_SPLITS_INNER, shuffle=True, random_state=RANDOM_STATE)
print(f"Inner CV for Tuning: {type(inner_cv).__name__} with {N_SPLITS_INNER} splits.")

# Outer CV: Used for the final evaluation loop. Should reflect the true generalization goal.
# Use GroupKFold if patient-wise evaluation is needed.
n_unique_groups = len(np.unique(groups_dev))
N_SPLITS_OUTER = min(5, n_unique_groups)  # GroupKFold requires n_splits <= number of distinct groups
outer_cv = GroupKFold(n_splits=N_SPLITS_OUTER)
# If not using groups: outer_cv = StratifiedKFold(n_splits=N_SPLITS_OUTER, shuffle=True, random_state=RANDOM_STATE+1)
print(f"Outer CV for Evaluation: {type(outer_cv).__name__} with {N_SPLITS_OUTER} splits.")
# -

# ## 4. Implementing Nested CV with Scikit-learn
#
# Scikit-learn makes implementing Nested CV straightforward. We pass the `GridSearchCV` object (which handles the inner loop) as the `estimator` to `cross_val_score`, which handles the outer loop.

# +
print("\n--- Running Nested Cross-Validation ---")

# 1. Set up GridSearchCV (This defines the inner loop + tuning)
# cross_val_score clones this object and fits a fresh copy on each outer training split
# Note: We don't call .fit() on this grid_search object directly here.
grid_search_inner = GridSearchCV(
    estimator=pipeline,      # Base pipeline (scaler + model)
    param_grid=param_grid,   # Hyperparameters to search
    cv=inner_cv,             # CV strategy for the INNER loop (tuning)
    scoring='roc_auc',       # Metric for tuning
    n_jobs=1                 # Keep the inner search serial: the outer loop is parallelized in cross_val_score,
                             # and nesting two levels of parallelism can oversubscribe CPU cores
)

# 2. Perform the Nested CV using cross_val_score (This defines the outer loop)
start_time = time.time()
nested_scores = cross_val_score(
    estimator=grid_search_inner, # Pass the GridSearchCV object as the estimator!
    X=X_dev,
    y=y_dev,
    groups=groups_dev,          # Pass groups if outer_cv requires it
    cv=outer_cv,                # CV strategy for the OUTER loop (evaluation)
    scoring='roc_auc',          # Final evaluation metric
    n_jobs=-1                   # Parallelize the outer folds if possible
)
nested_time = time.time() - start_time


print(f"\nNested CV completed in {nested_time:.2f} seconds.")
print("\nResults (Nested CV):")
print(f"  Individual Outer Fold AUCs: {nested_scores}")
print(f"  Mean AUC (Nested CV):       {nested_scores.mean():.4f}")
print(f"  Std Deviation (Nested CV):  {nested_scores.std():.4f}")
print("  (This provides a less biased estimate of generalization performance including tuning)")

# For comparison, let's refit the simple GridSearchCV on all data again to get the biased score
print("\n--- For Comparison: Simple GridSearchCV Best Score (Biased) ---")
grid_search_simple = GridSearchCV(
    estimator=pipeline, param_grid=param_grid, cv=inner_cv, scoring='roc_auc', n_jobs=-1, refit=True
)
grid_search_simple.fit(X_dev, y_dev) # Pass groups=groups_dev if inner_cv=GroupKFold
biased_score = grid_search_simple.best_score_
print(f"  Simple GridSearchCV Best Internal Score: {biased_score:.4f} (Biased)")


# Compare
print("\n--- Comparison Summary ---")
print(f"Nested CV Mean Score (Less Biased): {nested_scores.mean():.4f} (+/- {nested_scores.std():.4f})")
print(f"Simple GridSearchCV Score (Biased):   {biased_score:.4f}")

diff = biased_score - nested_scores.mean()
if diff > 0.001:
    print(f"\nNote that the biased score was higher than the Nested CV score by ~{diff:.4f}.")
# -

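# **Observation:** Typically, the mean Nested CV score is slightly lower than the optimistic `.best_score_` reported by a simple `GridSearchCV` fitted once on the development data, although with a small grid the gap can be small or even reversed on a given dataset. The standard deviation across outer folds indicates how stable the estimate is. Note that here the two numbers also come from different splitters (patient-grouped outer folds versus group-unaware inner folds), so the gap is not purely tuning bias. The Nested CV score is considered a more realistic estimate of how the entire modeling *process* (including the hyperparameter search strategy) is likely to perform on truly unseen data.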

# ## 5. What about the Final Model?
#
# Nested CV provides an estimate of the *performance* of your modeling pipeline. It doesn't directly output a single "best" model for deployment.
#
# To get your final deployment model after running Nested CV:
# 1.  You still need to determine the optimal hyperparameters. You could potentially:
#     *   Run `GridSearchCV` one last time on the *entire development set* (`X_dev`, `y_dev`) to find the best parameters using all available development data.
#     *   Alternatively, collect the best parameters found in each *outer fold* of the Nested CV and choose the most frequent/robust ones.
# 2.  Train your final model (using the pipeline and the chosen best hyperparameters) on the **entire development set** (`X_dev`, `y_dev`).
# 3.  The performance estimate you report for this final model is the one obtained from the **Nested CV run** (e.g., `nested_scores.mean() +/- nested_scores.std()`).
# 4.  Perform one **final evaluation** on the **hold-out Test Set** (`X_test`, `y_test`) that was set aside in Notebook 0. This provides the ultimate check on performance before considering deployment.
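#
# Steps 1 to 3 can be sketched as follows. This is an illustration on hypothetical toy data (`X_toy`, `y_toy`, `toy_pipeline`, and `toy_grid` are assumed names, not objects from this notebook). It uses `cross_validate` with `return_estimator=True` to keep the fitted search from each outer fold, then takes the first option from step 1 and re-tunes on the full development set.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the development set
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(solver="liblinear")),
])
toy_grid = {"model__C": [0.01, 1, 100]}

inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
outer = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(toy_pipeline, toy_grid, cv=inner, scoring="roc_auc")

# Nested CV; return_estimator=True keeps the fitted GridSearchCV of each outer fold
cv_out = cross_validate(
    search, X_toy, y_toy, cv=outer, scoring="roc_auc", return_estimator=True
)

# Inspect which C each outer fold selected (the "most frequent" alternative in step 1)
per_fold_C = [est.best_params_["model__C"] for est in cv_out["estimator"]]
most_frequent_C = Counter(per_fold_C).most_common(1)[0][0]
print(f"Best C per outer fold: {per_fold_C} (most frequent: {most_frequent_C})")

# Steps 1 and 2: tune once more on the entire development set; refit=True
# retrains the winning configuration on all of it
final_search = GridSearchCV(toy_pipeline, toy_grid, cv=inner, scoring="roc_auc", refit=True)
final_search.fit(X_toy, y_toy)
final_model = final_search.best_estimator_

# Step 3: the reported estimate comes from the nested run, not from final_search.best_score_
reported_mean = cv_out["test_score"].mean()
reported_std = cv_out["test_score"].std()
print(f"Final params: {final_search.best_params_}")
print(f"Reported AUC (Nested CV): {reported_mean:.4f} +/- {reported_std:.4f}")
```

# If the per-fold choices of `C` disagree strongly, that is a hint that the tuning result is unstable on this amount of data.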

# ## 6. Conclusion
#
# Nested Cross-Validation is widely regarded as the gold standard when you need to both tune hyperparameters and obtain a nearly unbiased estimate of the final model pipeline's generalization performance. By nesting the tuning process within an outer evaluation loop, it avoids the optimistic bias that arises when the same data splits are used for both selection and evaluation. While computationally more intensive, the reliability gained, especially in high-stakes domains like medicine, often justifies the cost.