# # Notebook 0: Introduction, Overfitting, and Data Setup
#
# ## Goals
# * Understand the fundamental problem of overfitting in machine learning.
# * Understand the concept of generalization error.
# * Learn the importance of splitting data into Training, Validation, and Test sets.
# * Generate synthetic data for demonstrating cross-validation concepts, including features, a target variable, and crucial group identifiers (simulating patients).
# * Perform the initial Train/Test split to isolate the final hold-out test set.

# ## 1. The Problem: Overfitting and Generalization
#
# In supervised machine learning, our goal is to build a model that learns patterns from labeled data (training data) to make predictions on new, unseen data.
#
# **Overfitting** occurs when a model learns the training data *too well*. It memorizes the specific examples, including noise and random fluctuations, rather than learning the underlying general patterns.
#
# **Analogy:** Imagine studying for an exam by memorizing the answers to only last year's questions. You might ace those exact questions if they reappear, but you'll likely fail on new questions because you didn't learn the *concepts*. An overfitted model is like this student.
#
# **Consequences:** An overfitted model performs excellently on the data it was trained on, but poorly on new data it hasn't seen before.
#
# **Generalization Error:** The error a model is expected to make on new, unseen data drawn from the same distribution as the training data. Our **true goal** in machine learning is to minimize generalization error, not just the error on the training set.
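#
# The gap between training and held-out performance can be seen in a few lines. The sketch below is illustrative and not part of this notebook's pipeline: it assumes scikit-learn's `DecisionTreeClassifier` as a stand-in model, and the names `X_demo`, `y_demo`, `deep_tree`, and `shallow_tree` are hypothetical. It fits an unconstrained tree and a depth-limited tree on noisy synthetic data, then compares accuracy on the training half with accuracy on a held-out half.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic binary data: flip_y=0.2 reassigns the labels of 20% of samples at random
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=5,
    flip_y=0.2, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0, stratify=y_demo
)

# An unconstrained tree keeps splitting until it fits every training sample
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# A depth-limited tree has less capacity to memorize label noise
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, model in [("deep", deep_tree), ("shallow", shallow_tree)]:
    train_acc = model.score(X_tr, y_tr)
    heldout_acc = model.score(X_te, y_te)
    print(f"{name:8s} train acc = {train_acc:.2f}, "
          f"held-out acc = {heldout_acc:.2f}, gap = {train_acc - heldout_acc:.2f}")
```

# The unconstrained tree fits its training half perfectly but scores lower on the held-out half. That gap is the signature of overfitting, and the held-out score is the more honest estimate of generalization.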

# ## 2. The Solution: Train/Validation/Test Split
#
# To estimate generalization error and develop models that perform well on new data, we split our dataset:
#
# 1.  **Training Set:** Used to train the model (i.e., learn the parameters).
# 2.  **Validation Set (or Development Set):** Used *during* model development to:
#     *   Tune hyperparameters (e.g., complexity settings of the model).
#     *   Select features.
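#     *   Get an *intermediate* estimate of model performance to guide development choices. Because those choices are tuned against it, this estimate tends to become optimistic, which is why a separate test set is still needed. Cross-validation techniques systematically create validation sets.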
# 3.  **Test Set (Hold-out Set):** Used *only once* at the very end of the development process, after the model is finalized (including hyperparameter choices). It provides the *final, unbiased estimate* of the chosen model's generalization performance. **This data must not influence training or tuning decisions.**
#
# ```
# +------------------------------------------+
# |           Original Dataset               |
# +------------------------------------------+
#      |
#      | Split (e.g., 80/20)
#      V
# +------------------------+   +-------------+
# |  Development Data      |   |  Test Set   |  <- Hold-out! Use only ONCE at the end.
# |  (Train + Validation)  |   |  (e.g. 20%) |
# |  (e.g. 80%)            |   +-------------+
# +------------------------+
#      |
#      | (Cross-Validation happens HERE)
#      V
# +-----------------+   +-----------------+
# |  Training Fold  |   | Validation Fold |  <- Repeated K times in K-Fold CV
# +-----------------+   +-----------------+
# ```
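#
# The two stages in the diagram can be sketched with scikit-learn as follows. This is a minimal illustration on toy data, not the split used later in this notebook: the `*_toy` names and the sizes are assumptions chosen for readability, and plain `KFold` stands in for whichever cross-validation scheme is eventually used.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Toy data: 100 samples, 3 features, 20% positives (illustrative values only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 80 + [1] * 20)

# Stage 1: carve off the hold-out test set once (80/20)
X_dev_toy, X_test_toy, y_dev_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# Stage 2: cross-validate inside the development data only
cv = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(cv.split(X_dev_toy)):
    # train_idx / val_idx index into the development data, never the test set
    fold_sizes.append((len(train_idx), len(val_idx)))
    print(f"fold {fold}: training fold = {len(train_idx)}, validation fold = {len(val_idx)}")

print(f"hold-out test set: {len(X_test_toy)} samples (untouched by the loop above)")
```

# Each development sample serves as validation data in exactly one fold, while the test set never enters the loop.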

# ## 3. Setup and Data Generation
#
# Let's import necessary libraries and generate some synthetic data that mimics characteristics often found in medical datasets, such as classification tasks and multiple samples per patient (group).

# +
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Settings for data generation
N_SAMPLES = 500       # Total number of data points (e.g., images, tests)
N_FEATURES = 20       # Number of input features
N_CLASSES = 2         # Binary classification task (e.g., disease vs. healthy)
N_PATIENTS = 100      # Number of unique patients/groups
IMBALANCE = 0.8       # Fraction of the majority class (e.g., 80% healthy)

RANDOM_STATE = 42     # For reproducibility

# Generate features and target variable
# Using make_classification for a slightly more complex dataset
X, y = make_classification(
    n_samples=N_SAMPLES,
    n_features=N_FEATURES,
    n_informative=10,    # Number of features that actually contain signal
    n_redundant=5,
    n_repeated=0,
    n_classes=N_CLASSES,
    n_clusters_per_class=2,
    weights=[IMBALANCE, 1.0 - IMBALANCE], # Class imbalance
    flip_y=0.05,          # Reassign ~5% of labels at random (label noise; also nudges the class ratio)
    class_sep=0.8,        # How separable the classes are
    random_state=RANDOM_STATE
)

# Generate Patient/Group IDs
# Seed NumPy's global RNG before any random call so the IDs are reproducible
np.random.seed(RANDOM_STATE)

# Give each patient the same base number of samples
samples_per_patient = N_SAMPLES // N_PATIENTS
groups = np.repeat(np.arange(N_PATIENTS), samples_per_patient)

# If N_SAMPLES is not perfectly divisible by N_PATIENTS, assign the leftover samples to random patients
remaining_samples = N_SAMPLES % N_PATIENTS
if remaining_samples > 0:
    groups = np.concatenate([groups, np.random.choice(N_PATIENTS, remaining_samples)])

# Shuffle the patient IDs so each patient's samples are scattered across rows.
# X and y are left untouched; the features are synthetic, so any row-to-patient assignment is valid.
np.random.shuffle(groups)

# Create a DataFrame (optional, but good practice)
feature_names = [f'feature_{i+1}' for i in range(N_FEATURES)]
df = pd.DataFrame(X, columns=feature_names)
df['target'] = y
df['patient_id'] = groups

print(f"Generated DataFrame shape: {df.shape}")
print(f"Value counts for target variable:\n{df['target'].value_counts()}")
print(f"Number of unique patients: {df['patient_id'].nunique()}")
print(f"Samples per patient (distribution):\n{df['patient_id'].value_counts().describe()}")
df.head()
# -

# ## 4. Perform the Initial Train/Test Split
#
# We will now split off the **final hold-out test set**. This set (`X_test`, `y_test`, `groups_test`) will **not be touched** during any cross-validation or hyperparameter tuning steps demonstrated in the subsequent notebooks. We must also keep track of the patient IDs corresponding to the test set.
#
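# We use `stratify=y` so that the class proportions are approximately preserved in both the development and test sets, which matters for imbalanced datasets. Note that this split is made per sample, not per patient, so one patient's samples can land on both sides; the check at the end of the cell reports how often that happens.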
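#
# To see what stratification buys, the sketch below repeats a 20% split over several seeds on a toy label vector, with and without `stratify`. The label vector `y_imb` and the seed range are illustrative assumptions, not values from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 80 negatives, 20 positives (illustrative values)
y_imb = np.array([0] * 80 + [1] * 20)
idx = np.arange(len(y_imb))

plain_fracs, strat_fracs = [], []
for seed in range(20):
    # Plain random split: the positive fraction in the test set depends on the seed
    _, test_plain = train_test_split(idx, test_size=0.2, random_state=seed)
    # Stratified split: the positive fraction is pinned to the overall rate
    _, test_strat = train_test_split(
        idx, test_size=0.2, random_state=seed, stratify=y_imb
    )
    plain_fracs.append(y_imb[test_plain].mean())
    strat_fracs.append(y_imb[test_strat].mean())

print(f"plain:      min = {min(plain_fracs):.2f}, max = {max(plain_fracs):.2f}")
print(f"stratified: min = {min(strat_fracs):.2f}, max = {max(strat_fracs):.2f}")
```

# With stratification every test set holds the same share of positives; without it the share drifts from seed to seed, and the drift grows as the minority class gets smaller.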

# +
# Define features (X), target (y), and groups
X = df.drop(['target', 'patient_id'], axis=1).values
y = df['target'].values
groups = df['patient_id'].values

# Split data into Development (Train+Validation) and Test sets
# Using 80% for development, 20% for the final test set
test_set_size = 0.20

X_dev, X_test, y_dev, y_test, groups_dev, groups_test = train_test_split(
    X, y, groups,
    test_size=test_set_size,
    random_state=RANDOM_STATE,
    stratify=y # Stratify based on the target variable 'y'
)

print("--- Dataset Shapes ---")
print(f"Development Features (X_dev):   {X_dev.shape}")
print(f"Development Target (y_dev):     {y_dev.shape}")
print(f"Development Groups (groups_dev):{groups_dev.shape}")
print("-" * 20)
print(f"Test Features (X_test):         {X_test.shape}")
print(f"Test Target (y_test):           {y_test.shape}")
print(f"Test Groups (groups_test):      {groups_test.shape}")
print("-" * 20)

# Verify stratification (optional)
dev_prop = np.bincount(y_dev) / len(y_dev)
test_prop = np.bincount(y_test) / len(y_test)
print(f"Development set class proportions: {dev_prop}")
print(f"Test set class proportions:        {test_prop}")

# Verify patient separation (Important check!)
dev_patients = set(groups_dev)
test_patients = set(groups_test)
common_patients = dev_patients.intersection(test_patients)
print(f"\nNumber of patients in Dev set: {len(dev_patients)}")
print(f"Number of patients in Test set: {len(test_patients)}")
print(f"Number of patients common to both sets: {len(common_patients)}")
if len(common_patients) > 0:
    print("WARNING: Patients ended up in both Dev and Test sets! Review split logic if patient-level split is needed here.")
    # NOTE: Standard train_test_split does NOT guarantee patient-level separation.
    # We will handle proper patient-level splitting *within* the Development set using GroupKFold later.
    # The separation check here is mainly illustrative for now. For a strict patient-level *initial* split,
    # you'd need GroupShuffleSplit or similar logic applied once.
else:
    print("OK: No patients shared between Dev and Test sets (based on this split).")

# Save the split data for use in other notebooks (optional; the variables can also be passed along when running sequentially)
# The later notebooks assume X_dev, y_dev, groups_dev, X_test, y_test, and groups_test are available
# %store X_dev y_dev groups_dev X_test y_test groups_test
# -
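
# The note in the cell above mentions `GroupShuffleSplit` as the tool for a strict patient-level initial split. A minimal sketch of that alternative follows. It is not what this notebook does: the `*_toy` arrays are hypothetical stand-ins for `X`, `y`, and `groups`, and the 20% figure is an assumption mirroring `test_set_size`.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-ins: 100 patients with 5 samples each (illustrative values)
rng = np.random.default_rng(0)
groups_toy = np.repeat(np.arange(100), 5)
X_toy = rng.normal(size=(len(groups_toy), 4))
y_toy = rng.integers(0, 2, size=len(groups_toy))

# test_size is the fraction of *groups* (patients) held out, not of samples
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
dev_idx, test_idx = next(gss.split(X_toy, y_toy, groups=groups_toy))

dev_patients_toy = set(groups_toy[dev_idx])
test_patients_toy = set(groups_toy[test_idx])
shared = dev_patients_toy & test_patients_toy

print(f"dev patients:    {len(dev_patients_toy)}")
print(f"test patients:   {len(test_patients_toy)}")
print(f"shared patients: {len(shared)}")
```

# Every patient now lands entirely on one side of the split. The trade-off is that `GroupShuffleSplit` does not stratify on the target, so class proportions are no longer guaranteed to match; scikit-learn's `StratifiedGroupKFold` is one option when both constraints matter.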

# ## 5. Next Steps
#
# We now have our development dataset (`X_dev`, `y_dev`, `groups_dev`), which we will use in the subsequent notebooks to demonstrate various cross-validation techniques. The test data (`X_test`, `y_test`, `groups_test`) is set aside and should only be used for a final evaluation after all model development and selection is complete.

Generated DataFrame shape: (500, 22)
Value counts for target variable:
target
0    387
1    113
Name: count, dtype: int64
Number of unique patients: 100
Samples per patient (distribution):
count    100.0
mean       5.0
std        0.0
min        5.0
25%        5.0
50%        5.0
75%        5.0
max        5.0
Name: count, dtype: float64
--- Dataset Shapes ---
Development Features (X_dev):   (400, 20)
Development Target (y_dev):     (400,)
Development Groups (groups_dev):(400,)
--------------------
Test Features (X_test):         (100, 20)
Test Target (y_test):           (100,)
Test Groups (groups_test):      (100,)
--------------------
Development set class proportions: [0.775 0.225]
Test set class proportions:        [0.77 0.23]

Number of patients in Dev set: 100
Number of patients in Test set: 71
Number of patients common to both sets: 71
