# 📚 Supervised Learning Evaluation - Complete Hands-on Guide
## Based on Lecture 6: Comprehensive Evaluation Techniques

### 📋 Table of Contents
1. **Setup and Introduction**
2. **Part 1: Data Splitting and Validation Fundamentals**
3. **Part 2: Regression Evaluation Metrics** 
4. **Part 3: Classification Evaluation Metrics**
5. **Part 4: Cross-Validation and Model Selection**
6. **Part 5: Summary and Final Exercise**

### 🎯 Learning Objectives
- Master train/validation/test splitting strategies
- Implement and interpret regression metrics (MSE, RMSE, MAE, R²)
- Apply classification metrics (Precision, Recall, F1, ROC-AUC)
- Use cross-validation techniques effectively
- Perform hyperparameter tuning and model selection

---

## 🚀 Setup and Imports

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# Machine Learning Libraries
from sklearn.model_selection import (
    train_test_split, KFold, StratifiedKFold, 
    cross_val_score, GridSearchCV, RandomizedSearchCV,
    learning_curve, validation_curve, LeaveOneOut
)
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score,
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_curve, auc,
    roc_auc_score, precision_recall_curve, mean_absolute_percentage_error
)
# load_boston was removed in scikit-learn 1.2 and is not used in this guide
from sklearn.datasets import load_iris, load_breast_cancer, make_classification, make_regression
from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.svm import SVC, SVR
from sklearn.preprocessing import StandardScaler

# Set random seed for reproducibility
np.random.seed(42)

# Configure visualization settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

import sklearn  # needed only to report the installed version below

print("✅ All libraries imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"Scikit-learn version: {sklearn.__version__}")

---
## Part 1: Data Splitting and Validation Fundamentals

### Exercise 1: Understanding Train/Validation/Test Split
#### 💡 Concept
The foundation of model evaluation is proper data splitting. We typically use:
- **Training set (60-70%)**: To train the model
- **Validation set (15-20%)**: To tune hyperparameters
- **Test set (15-20%)**: For final unbiased evaluation
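
Because `train_test_split` only produces two pieces at a time, a three-way split needs two calls, and the second `test_size` has to be expressed relative to what is left after the first call. The sketch below works out that arithmetic; `split_fractions` is a hypothetical helper introduced here for illustration, not part of scikit-learn.

```python
def split_fractions(train=0.6, val=0.2, test=0.2):
    """Return the two test_size values for a two-stage train_test_split.

    Hypothetical helper: the first value carves off the test set, the
    second is the validation share of the remaining (train + val) data.
    """
    if abs(train + val + test - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    first_test_size = test
    second_test_size = val / (train + val)
    return first_test_size, second_test_size


first_size, second_size = split_fractions(0.6, 0.2, 0.2)
print(f"first split test_size:  {first_size:.2f}")
print(f"second split test_size: {second_size:.2f}")
```

For a 60/20/20 split the second value is 0.2 / 0.8, i.e. one quarter of the remaining data goes to validation.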

In [None]:
# Exercise 1: Train/Validation/Test Split Implementation
# Generate synthetic regression data
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=10, n_informative=8, 
                       noise=10, random_state=42)

# Convert to DataFrame for better visualization
feature_names = [f'feature_{i+1}' for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=feature_names)
df['target'] = y

print(f"Dataset shape: {df.shape}")
print(f"\nFirst 5 rows:")
df.head()

In [None]:
# Implement proper train/val/test split
from sklearn.model_selection import train_test_split

# First split: separate test set (20%)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second split: separate train and validation (75% train, 25% val of the remaining 80%)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42  # 0.25 * 0.8 = 0.2 of total
)

# Verify split proportions
print("Dataset Split Proportions:")
print(f"Training set: {len(X_train)} samples ({len(X_train)/len(X)*100:.1f}%)")
print(f"Validation set: {len(X_val)} samples ({len(X_val)/len(X)*100:.1f}%)")
print(f"Test set: {len(X_test)} samples ({len(X_test)/len(X)*100:.1f}%)")

# Visualize the split
fig = go.Figure(data=[
    go.Bar(name='Split Distribution', 
           x=['Train', 'Validation', 'Test'],
           y=[len(X_train), len(X_val), len(X_test)],
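           # Percentages are computed from the actual split sizes rather than hardcoded
           text=[f'{len(X_train)} ({len(X_train)/len(X):.0%})', 
                 f'{len(X_val)} ({len(X_val)/len(X):.0%})', 
                 f'{len(X_test)} ({len(X_test)/len(X):.0%})'],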
           textposition='auto',
           marker_color=['#1E64C8', '#4A90E2', '#7BB3F0'])
])
fig.update_layout(title='Train/Validation/Test Split Distribution',
                  yaxis_title='Number of Samples',
                  height=400)
fig.show()

### Exercise 2: Preventing Data Leakage
#### 💡 Concept
Data leakage occurs when information from the test set influences training. Common causes:
- Normalizing before splitting
- Using test set statistics
- Feature selection on entire dataset
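
One common way to avoid these leaks during cross-validation is to put the preprocessing step inside a scikit-learn pipeline, so the scaler is re-fit on the training portion of every fold. The following is a minimal sketch under that assumption; the dataset and the names `X_demo`, `y_demo`, and `leak_free_model` are chosen here for illustration only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# The pipeline is cloned and fit once per fold, so the scaler's mean and
# standard deviation come from that fold's training rows only.
leak_free_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
fold_scores = cross_val_score(leak_free_model, X_demo, y_demo, cv=5)

print(f"Accuracy per fold: {fold_scores.round(3)}")
print(f"Mean accuracy: {fold_scores.mean():.3f}")
```

The same pattern extends to feature selection or imputation: any step that learns from data belongs inside the pipeline.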

In [None]:
# Exercise 2: Demonstrating Data Leakage Prevention

# WRONG WAY - Data leakage (normalizing before split)
print("❌ WRONG: Normalizing before split (causes data leakage)")
scaler_wrong = StandardScaler()
X_normalized_wrong = scaler_wrong.fit_transform(X)  # Uses ALL data statistics
X_train_wrong = X_normalized_wrong[:600]
X_test_wrong = X_normalized_wrong[800:]
print(f"Mean of test set (wrong): {X_test_wrong.mean():.4f}")
print(f"Std of test set (wrong): {X_test_wrong.std():.4f}")

print("\n" + "="*50 + "\n")

# CORRECT WAY - No data leakage
print("✅ CORRECT: Normalizing after split")
# Split first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit scaler only on training data
scaler_correct = StandardScaler()
X_train_correct = scaler_correct.fit_transform(X_train)  # Fit on train
X_test_correct = scaler_correct.transform(X_test)  # Only transform test

print(f"Mean of test set (correct): {X_test_correct.mean():.4f}")
print(f"Std of test set (correct): {X_test_correct.std():.4f}")

# Visualize the difference
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Wrong way
axes[0].hist(X_test_wrong[:, 0], bins=30, alpha=0.7, color='red', edgecolor='black')
axes[0].set_title('❌ Test Data (Normalized Before Split)', fontsize=12)
axes[0].set_xlabel('Feature Value')
axes[0].set_ylabel('Frequency')
axes[0].axvline(0, color='black', linestyle='--', alpha=0.5)

# Correct way  
axes[1].hist(X_test_correct[:, 0], bins=30, alpha=0.7, color='green', edgecolor='black')
axes[1].set_title('✅ Test Data (Normalized After Split)', fontsize=12)
axes[1].set_xlabel('Feature Value')
axes[1].set_ylabel('Frequency')
axes[1].axvline(0, color='black', linestyle='--', alpha=0.5)

plt.tight_layout()
plt.show()

print("\n💡 Key Insight: With the correct method the test set is scaled using training statistics only, so its mean and std need not be exactly 0 and 1.")

---
## Part 2: Regression Evaluation Metrics

### Exercise 3: MSE, RMSE, MAE Implementation
#### 💡 Concept
- **MSE (Mean Squared Error)**: Squares differences, heavily penalizes large errors
- **RMSE (Root Mean Squared Error)**: Square root of MSE, same units as target
- **MAE (Mean Absolute Error)**: Average absolute differences, robust to outliers
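
To make the difference concrete, here is a small worked example with made-up numbers (the arrays and the helper `regression_errors` are illustrative assumptions). Both sets of predictions have the same total absolute error, but in one it is spread evenly and in the other it is concentrated in a single large miss.

```python
import numpy as np

y_true_toy = np.array([3.0, 5.0, 7.0, 9.0])
y_hat_even = np.array([4.0, 6.0, 8.0, 10.0])    # every prediction is off by 1
y_hat_spike = np.array([3.0, 5.0, 7.0, 13.0])   # one prediction is off by 4


def regression_errors(y_true, y_pred):
    """Compute MSE, RMSE and MAE directly from their definitions."""
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    return {"MSE": mse, "RMSE": np.sqrt(mse), "MAE": np.mean(np.abs(err))}


even = regression_errors(y_true_toy, y_hat_even)
spike = regression_errors(y_true_toy, y_hat_spike)
print(even)   # MSE 1.0, RMSE 1.0, MAE 1.0
print(spike)  # MSE 4.0, RMSE 2.0, MAE 1.0
```

MAE is identical in both cases, while MSE quadruples and RMSE doubles for the single large error. This is why MAE is less sensitive to outliers than the squared-error metrics.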

In [None]:
# Exercise 3: Implementing Regression Metrics

# Train multiple models for comparison
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor

# Generate regression data with outliers
np.random.seed(42)
X_reg = np.random.randn(200, 5)
y_reg = 2 * X_reg[:, 0] + 3 * X_reg[:, 1] - X_reg[:, 2] + np.random.randn(200) * 0.5

# Add some outliers
outlier_indices = np.random.choice(200, 10, replace=False)
y_reg[outlier_indices] += np.random.randn(10) * 10

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_reg, y_reg, test_size=0.3, random_state=42)

# Train models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'Lasso Regression': Lasso(alpha=0.1),
    'Decision Tree': DecisionTreeRegressor(max_depth=5, random_state=42)
}

results = []
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    # Calculate metrics
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    results.append({
        'Model': name,
        'MSE': mse,
        'RMSE': rmse,
        'MAE': mae,
        'R²': r2
    })

# Display results
results_df = pd.DataFrame(results)
results_df = results_df.round(4)

# Style the dataframe
styled_df = results_df.style.background_gradient(subset=['MSE', 'RMSE', 'MAE'], cmap='Reds_r')\
                            .background_gradient(subset=['R²'], cmap='Greens')
print("📊 Regression Metrics Comparison:")
styled_df

In [None]:
# Visualize metrics comparison
fig = make_subplots(rows=2, cols=2,
                    subplot_titles=('MSE Comparison', 'RMSE Comparison', 
                                  'MAE Comparison', 'R² Score Comparison'))

# MSE
fig.add_trace(go.Bar(x=results_df['Model'], y=results_df['MSE'], 
                     marker_color='#e74c3c', name='MSE'), row=1, col=1)

# RMSE  
fig.add_trace(go.Bar(x=results_df['Model'], y=results_df['RMSE'],
                     marker_color='#f39c12', name='RMSE'), row=1, col=2)

# MAE
fig.add_trace(go.Bar(x=results_df['Model'], y=results_df['MAE'],
                     marker_color='#3498db', name='MAE'), row=2, col=1)

# R²
fig.add_trace(go.Bar(x=results_df['Model'], y=results_df['R²'],
                     marker_color='#27ae60', name='R²'), row=2, col=2)

fig.update_layout(height=600, showlegend=False, 
                  title_text="Regression Metrics Across Different Models")
fig.update_xaxes(tickangle=45)
fig.show()

print("\n💡 Key Insights:")
print("- Lower MSE, RMSE, MAE = Better performance")
print("- Higher R² = Better performance (max = 1.0)")
print("- RMSE penalizes large errors more than MAE")

### Exercise 4: Residual Analysis
#### 💡 Concept
Residual analysis helps identify patterns in prediction errors:
- Random scatter = Good model
- Patterns = Model missing relationships
- Heteroscedasticity = Non-constant variance
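
A rough numeric companion to the visual checks is to rank-correlate the absolute residuals with the fitted values: a clearly positive correlation suggests the error spread grows with the prediction. The sketch below uses synthetic data and a hypothetical helper `abs_residual_trend`; it is an informal screen under those assumptions, not a formal heteroscedasticity test.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 500)


def abs_residual_trend(noise_scale):
    """Fit a straight line, then rank-correlate |residuals| with fitted values."""
    y = 2 * x + rng.normal(0, noise_scale)           # per-sample noise scale
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = slope * x + intercept
    rho, p_value = spearmanr(fitted, np.abs(y - fitted))
    return rho, p_value


rho_const, p_const = abs_residual_trend(np.full_like(x, 2.0))  # constant variance
rho_grow, p_grow = abs_residual_trend(0.5 * x)                 # variance grows with x

print(f"Constant noise: rho = {rho_const:.3f}, p = {p_const:.3g}")
print(f"Growing noise:  rho = {rho_grow:.3f}, p = {p_grow:.3g}")
```

With constant noise the correlation should sit near zero, while the growing-noise case should show a clearly positive value.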

In [None]:
# Exercise 4: Residual Analysis and Diagnostics

# Refit Linear Regression for the diagnostics (swap in whichever model scored best above)
best_model = LinearRegression()
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)

# Calculate residuals
residuals = y_test - y_pred

# Create comprehensive residual plots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# 1. Residual Plot
axes[0, 0].scatter(y_pred, residuals, alpha=0.6, edgecolor='black')
axes[0, 0].axhline(y=0, color='red', linestyle='--')
axes[0, 0].set_xlabel('Predicted Values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residual Plot')
axes[0, 0].grid(True, alpha=0.3)

# 2. Q-Q Plot
from scipy import stats
stats.probplot(residuals, dist="norm", plot=axes[0, 1])
axes[0, 1].set_title('Q-Q Plot (Normal Distribution Check)')
axes[0, 1].grid(True, alpha=0.3)

# 3. Histogram of Residuals
axes[1, 0].hist(residuals, bins=20, edgecolor='black', alpha=0.7, color='skyblue')
axes[1, 0].axvline(x=0, color='red', linestyle='--')
axes[1, 0].set_xlabel('Residuals')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Distribution of Residuals')
axes[1, 0].grid(True, alpha=0.3)

# 4. Scale-Location Plot
standardized_residuals = residuals / np.std(residuals)
axes[1, 1].scatter(y_pred, np.sqrt(np.abs(standardized_residuals)), alpha=0.6, edgecolor='black')
axes[1, 1].set_xlabel('Predicted Values')
axes[1, 1].set_ylabel('√|Standardized Residuals|')
axes[1, 1].set_title('Scale-Location Plot')
axes[1, 1].grid(True, alpha=0.3)

plt.suptitle('Residual Analysis Dashboard', fontsize=16, y=1.02)
plt.tight_layout()
plt.show()

# Statistical tests
from scipy.stats import shapiro, normaltest

shapiro_stat, shapiro_p = shapiro(residuals)
normal_stat, normal_p = normaltest(residuals)

print("📊 Residual Analysis Results:")
print(f"Mean of residuals: {np.mean(residuals):.4f} (should be close to 0)")
print(f"Std of residuals: {np.std(residuals):.4f}")
print(f"\nNormality Tests:")
print(f"Shapiro-Wilk test: p-value = {shapiro_p:.4f} {'(Normal)' if shapiro_p > 0.05 else '(Not Normal)'}")
print(f"D'Agostino test: p-value = {normal_p:.4f} {'(Normal)' if normal_p > 0.05 else '(Not Normal)'}")

---
## Part 3: Classification Evaluation Metrics

### Exercise 5: Confusion Matrix and Basic Metrics
#### 💡 Concept
The confusion matrix is the foundation of classification metrics:
- **TP (True Positive)**: Correctly predicted positive
- **FP (False Positive)**: Incorrectly predicted as positive (Type I Error)
- **FN (False Negative)**: Incorrectly predicted as negative (Type II Error)
- **TN (True Negative)**: Correctly predicted negative
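
The four cells can be counted directly from label arrays with boolean masks. The toy labels below are made up for illustration; the result is cross-checked against scikit-learn's `confusion_matrix`, whose flattened order for binary labels 0/1 is TN, FP, FN, TP.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true_toy = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_toy = np.array([1, 0, 0, 1, 1, 0, 1, 0])

# Count each cell by combining "what was true" with "what was predicted"
tp_toy = int(np.sum((y_true_toy == 1) & (y_pred_toy == 1)))
fp_toy = int(np.sum((y_true_toy == 0) & (y_pred_toy == 1)))
fn_toy = int(np.sum((y_true_toy == 1) & (y_pred_toy == 0)))
tn_toy = int(np.sum((y_true_toy == 0) & (y_pred_toy == 0)))

print(f"TP={tp_toy}, FP={fp_toy}, FN={fn_toy}, TN={tn_toy}")  # TP=3, FP=1, FN=1, TN=3

# Rows are actual classes, columns are predicted classes
sk_tn, sk_fp, sk_fn, sk_tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
```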

In [None]:
# Exercise 5: Confusion Matrix Implementation

# Load and prepare classification data
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X_clf = data.data
y_clf = data.target

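# Subsample the data by removing 150 benign samples (class 1, which scikit-learn treats as positive).
# Note: the original data has 357 benign and 212 malignant cases, so this leaves the classes roughly balanced.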
mask = np.ones(len(y_clf), dtype=bool)
positive_indices = np.where(y_clf == 1)[0]
remove_indices = np.random.choice(positive_indices, size=150, replace=False)
mask[remove_indices] = False
X_clf = X_clf[mask]
y_clf = y_clf[mask]

print(f"Class distribution:")
print(f"Class 0 (Malignant): {sum(y_clf == 0)} samples")
print(f"Class 1 (Benign): {sum(y_clf == 1)} samples")
print(f"Imbalance ratio: {sum(y_clf == 1) / sum(y_clf == 0):.2f}:1")

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_clf, y_clf, test_size=0.3, 
                                                    random_state=42, stratify=y_clf)

# Train classifier
clf = LogisticRegression(max_iter=1000, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Calculate confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Create interactive confusion matrix
fig = go.Figure(data=go.Heatmap(
    z=cm,
    x=['Predicted Negative', 'Predicted Positive'],
    y=['Actual Negative', 'Actual Positive'],
    text=cm,
    texttemplate="%{text}",
    textfont={"size": 20},
    colorscale='Blues',
    showscale=True
))

fig.update_layout(
    title='Confusion Matrix',
    xaxis_title='Predicted Label',
    yaxis_title='Actual Label',
    width=500,
    height=400
)

fig.show()

# Extract metrics from confusion matrix
tn, fp, fn, tp = cm.ravel()

print(f"\n📊 Confusion Matrix Components:")
print(f"True Negatives (TN): {tn}")
print(f"False Positives (FP): {fp} (Type I Error)")
print(f"False Negatives (FN): {fn} (Type II Error)")
print(f"True Positives (TP): {tp}")

In [None]:
# Calculate all classification metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
specificity = tn / (tn + fp)

# Create metrics summary
metrics_data = {
    'Metric': ['Accuracy', 'Precision', 'Recall (Sensitivity)', 'Specificity', 'F1-Score'],
    'Formula': [
        '(TP + TN) / Total',
        'TP / (TP + FP)',
        'TP / (TP + FN)', 
        'TN / (TN + FP)',
        '2 × (Precision × Recall) / (Precision + Recall)'
    ],
    'Value': [accuracy, precision, recall, specificity, f1],
    'Interpretation': [
        'Overall correctness',
        'When we predict positive, how often are we right?',
        'Of all actual positives, how many did we find?',
        'Of all actual negatives, how many did we correctly identify?',
        'Harmonic mean of Precision and Recall'
    ]
}

metrics_df = pd.DataFrame(metrics_data)
metrics_df['Value'] = metrics_df['Value'].round(4)

# Display with styling
styled = metrics_df.style.bar(subset=['Value'], color='lightgreen', vmin=0, vmax=1)
print("\n📊 Classification Metrics Summary:")
styled

### Exercise 6: ROC Curve and AUC
#### 💡 Concept
ROC (Receiver Operating Characteristic) curve plots:
- **True Positive Rate (Sensitivity)** vs **False Positive Rate (1-Specificity)**
- AUC (Area Under Curve) summarizes performance: 0.5 = random, 1.0 = perfect
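
AUC also has a ranking interpretation: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. The sketch below checks this on a handful of made-up scores by comparing every positive-negative pair against `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true_toy = np.array([0, 0, 1, 1, 0, 1])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8, 0.5, 0.7])

pos_scores = scores_toy[y_true_toy == 1]
neg_scores = scores_toy[y_true_toy == 0]

# Compare every positive score with every negative score
diff = pos_scores[:, None] - neg_scores[None, :]
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

sklearn_auc = roc_auc_score(y_true_toy, scores_toy)
print(f"Pairwise ranking AUC: {pairwise_auc:.4f}")
print(f"roc_auc_score:        {sklearn_auc:.4f}")
```

Here 7 of the 9 positive-negative pairs are ordered correctly, so both values come out to 7/9.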

In [None]:
# Exercise 6: ROC Curve and AUC Implementation

# Train multiple classifiers for comparison
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

classifiers = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(probability=True, random_state=42),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42)
}

# Calculate ROC curves
plt.figure(figsize=(10, 8))

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    
    # Get probability predictions
    if hasattr(clf, 'predict_proba'):
        y_proba = clf.predict_proba(X_test)[:, 1]
    else:
        y_proba = clf.decision_function(X_test)
    
    # Calculate ROC curve
    fpr, tpr, thresholds = roc_curve(y_test, y_proba)
    auc_score = auc(fpr, tpr)
    
    # Plot ROC curve
    plt.plot(fpr, tpr, linewidth=2, label=f'{name} (AUC = {auc_score:.3f})')

# Plot random classifier
plt.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random Classifier (AUC = 0.500)')

# Formatting
plt.xlabel('False Positive Rate (1 - Specificity)', fontsize=12)
plt.ylabel('True Positive Rate (Sensitivity)', fontsize=12)
plt.title('ROC Curves Comparison', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=10)
plt.grid(True, alpha=0.3)
plt.xlim([0, 1])
plt.ylim([0, 1.05])

# Add shaded area for best model
best_clf = LogisticRegression(max_iter=1000, random_state=42)
best_clf.fit(X_train, y_train)
y_proba_best = best_clf.predict_proba(X_test)[:, 1]
fpr_best, tpr_best, _ = roc_curve(y_test, y_proba_best)
plt.fill_between(fpr_best, 0, tpr_best, alpha=0.1, color='blue')

plt.tight_layout()
plt.show()

print("\n📊 AUC Interpretation Guide:")
print("• 0.90 - 1.00 = Excellent")
print("• 0.80 - 0.90 = Good")
print("• 0.70 - 0.80 = Fair")
print("• 0.60 - 0.70 = Poor")
print("• 0.50 - 0.60 = Fail")

### Exercise 7: Precision-Recall Curve
#### 💡 Concept
Precision-Recall curve is especially useful for imbalanced datasets:
- Shows trade-off between Precision and Recall
- More informative than ROC for imbalanced classes
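
A single-number summary of the precision-recall curve is average precision, and its no-skill baseline is the positive-class prevalence rather than 0.5. The sketch below illustrates this on a synthetic 95/5 dataset; the data, the model choice, and all variable names are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 5% positives
X_rare, y_rare = make_classification(n_samples=2000, n_features=10, n_informative=5,
                                     weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_rare, y_rare, test_size=0.3,
                                          stratify=y_rare, random_state=0)

rare_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
rare_proba = rare_model.predict_proba(X_te)[:, 1]

prevalence = y_te.mean()                          # baseline for average precision
avg_precision = average_precision_score(y_te, rare_proba)
roc_auc_rare = roc_auc_score(y_te, rare_proba)    # baseline for ROC-AUC is 0.5

print(f"Positive prevalence (AP baseline): {prevalence:.3f}")
print(f"Average precision:                 {avg_precision:.3f}")
print(f"ROC-AUC:                           {roc_auc_rare:.3f}")
```

Comparing each score with its own baseline gives a fairer picture than comparing the two scores with each other.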

In [None]:
# Exercise 7: Precision-Recall Curve Analysis

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot 1: Precision-Recall Curves
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_proba = clf.predict_proba(X_test)[:, 1]
    
    precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
    pr_auc = auc(recall, precision)
    
    axes[0].plot(recall, precision, linewidth=2, label=f'{name} (AUC = {pr_auc:.3f})')

# Baseline (random classifier)
baseline = sum(y_test) / len(y_test)
axes[0].axhline(y=baseline, color='k', linestyle='--', linewidth=1, 
                label=f'Baseline (y = {baseline:.3f})')

axes[0].set_xlabel('Recall', fontsize=12)
axes[0].set_ylabel('Precision', fontsize=12)
axes[0].set_title('Precision-Recall Curves', fontsize=14, fontweight='bold')
axes[0].legend(loc='lower left', fontsize=10)
axes[0].grid(True, alpha=0.3)
axes[0].set_xlim([0, 1])
axes[0].set_ylim([0, 1.05])

# Plot 2: Threshold Analysis
clf = LogisticRegression(max_iter=1000, random_state=42)
clf.fit(X_train, y_train)
y_proba = clf.predict_proba(X_test)[:, 1]

thresholds = np.linspace(0, 1, 100)
metrics_by_threshold = []

for thresh in thresholds:
    y_pred_thresh = (y_proba >= thresh).astype(int)
    if len(np.unique(y_pred_thresh)) > 1 and len(np.unique(y_test)) > 1:
        prec = precision_score(y_test, y_pred_thresh, zero_division=0)
        rec = recall_score(y_test, y_pred_thresh, zero_division=0)
        f1_thresh = f1_score(y_test, y_pred_thresh, zero_division=0)
        metrics_by_threshold.append({'threshold': thresh, 'precision': prec, 
                                    'recall': rec, 'f1': f1_thresh})

metrics_thresh_df = pd.DataFrame(metrics_by_threshold)

axes[1].plot(metrics_thresh_df['threshold'], metrics_thresh_df['precision'], 
            label='Precision', linewidth=2)
axes[1].plot(metrics_thresh_df['threshold'], metrics_thresh_df['recall'], 
            label='Recall', linewidth=2)
axes[1].plot(metrics_thresh_df['threshold'], metrics_thresh_df['f1'], 
            label='F1-Score', linewidth=2, linestyle='--')

axes[1].set_xlabel('Decision Threshold', fontsize=12)
axes[1].set_ylabel('Metric Value', fontsize=12)
axes[1].set_title('Metrics vs Decision Threshold', fontsize=14, fontweight='bold')
axes[1].legend(loc='best', fontsize=10)
axes[1].grid(True, alpha=0.3)
axes[1].set_xlim([0, 1])
axes[1].set_ylim([0, 1.05])

plt.tight_layout()
plt.show()

# Find optimal threshold based on F1-score
optimal_idx = metrics_thresh_df['f1'].idxmax()
optimal_threshold = metrics_thresh_df.loc[optimal_idx, 'threshold']

print(f"\n🎯 Optimal Decision Threshold (based on F1-Score): {optimal_threshold:.3f}")
print(f"   • Precision at optimal: {metrics_thresh_df.loc[optimal_idx, 'precision']:.3f}")
print(f"   • Recall at optimal: {metrics_thresh_df.loc[optimal_idx, 'recall']:.3f}")
print(f"   • F1-Score at optimal: {metrics_thresh_df.loc[optimal_idx, 'f1']:.3f}")

---
## Part 4: Cross-Validation and Model Selection

### Exercise 8: K-Fold Cross-Validation
#### 💡 Concept
K-Fold CV divides data into K equal folds:
- Each fold serves as validation once
- Provides K performance estimates
- More reliable than single train-test split
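
For everyday use, scikit-learn's `cross_val_score` wraps the split, fit, and score cycle in a single call. The following is a minimal sketch with a decision tree on the breast cancer data; the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)

cv_demo = KFold(n_splits=5, shuffle=True, random_state=42)
tree_demo = DecisionTreeClassifier(max_depth=5, random_state=42)

# Returns one score per fold (accuracy by default for classifiers)
scores_demo = cross_val_score(tree_demo, X_demo, y_demo, cv=cv_demo)

# Every sample lands in exactly one validation fold
val_sizes = [len(val_idx) for _, val_idx in cv_demo.split(X_demo)]

print(f"Validation fold sizes: {val_sizes}")
print(f"Fold accuracies: {scores_demo.round(4)}")
print(f"Mean: {scores_demo.mean():.4f} (± {scores_demo.std():.4f})")
```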

In [None]:
# Exercise 8: K-Fold Cross-Validation Implementation

# Prepare data
X, y = load_breast_cancer(return_X_y=True)

# Implement K-Fold CV manually to show the process
k_folds = 5
kf = KFold(n_splits=k_folds, shuffle=True, random_state=42)

# Initialize models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(random_state=42),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42)
}

# Store results
cv_results = {name: {'scores': [], 'mean': 0, 'std': 0} for name in models.keys()}

# Perform K-Fold CV for each model
print("🔄 Performing K-Fold Cross-Validation...")
print("=" * 60)

for name, model in models.items():
    print(f"\nEvaluating: {name}")
    fold_scores = []
    
    for fold, (train_idx, val_idx) in enumerate(kf.split(X), 1):
        # Split data
        X_train_fold, X_val_fold = X[train_idx], X[val_idx]
        y_train_fold, y_val_fold = y[train_idx], y[val_idx]
        
        # Train and evaluate
        model.fit(X_train_fold, y_train_fold)
        score = model.score(X_val_fold, y_val_fold)
        fold_scores.append(score)
        print(f"  Fold {fold}: {score:.4f}")
    
    # Calculate statistics
    cv_results[name]['scores'] = fold_scores
    cv_results[name]['mean'] = np.mean(fold_scores)
    cv_results[name]['std'] = np.std(fold_scores)
    
    print(f"  Mean: {cv_results[name]['mean']:.4f} (±{cv_results[name]['std']:.4f})")

# Create visualization of results
fig = go.Figure()

for name, results in cv_results.items():
    # Add box plot for each model
    fig.add_trace(go.Box(
        y=results['scores'],
        name=name,
        boxmean='sd'  # show mean and standard deviation; Plotly assigns default colors
    ))

fig.update_layout(
    title=f'{k_folds}-Fold Cross-Validation Results',
    yaxis_title='Accuracy Score',
    xaxis_title='Model',
    height=500,
    showlegend=False,
    yaxis=dict(range=[0.85, 1.0])
)

fig.show()

# Summary table
summary_df = pd.DataFrame({
    'Model': list(cv_results.keys()),
    'Mean Accuracy': [cv_results[name]['mean'] for name in cv_results.keys()],
    'Std Dev': [cv_results[name]['std'] for name in cv_results.keys()],
    'CV Score': [f"{cv_results[name]['mean']:.4f} (±{cv_results[name]['std']:.4f})" 
                 for name in cv_results.keys()]
})
summary_df = summary_df.sort_values('Mean Accuracy', ascending=False)
summary_df.reset_index(drop=True, inplace=True)

print("\n📊 Cross-Validation Summary:")
display(summary_df.style.background_gradient(subset=['Mean Accuracy'], cmap='Greens'))

### Exercise 9: Stratified K-Fold for Imbalanced Data
#### 💡 Concept
Stratified K-Fold maintains class distribution in each fold:
- Essential for imbalanced datasets
- Ensures representative validation sets
- Reduces variance in performance estimates
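
When an integer is passed as `cv` for a classifier with binary or multiclass labels, scikit-learn already resolves it to a (non-shuffled) `StratifiedKFold`. The sketch below uses `check_cv` to show that resolution on a toy 90/10 label vector; the toy arrays are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, check_cv

y_toy = np.array([0] * 90 + [1] * 10)   # 10% positives
X_toy = np.zeros((100, 1))              # features are irrelevant for splitting

# Resolve cv=5 the way scikit-learn does for a classifier
resolved_cv = check_cv(5, y_toy, classifier=True)
print(type(resolved_cv).__name__)       # StratifiedKFold

# Each validation fold keeps the 10% positive share: 2 positives out of 20
positives_per_fold = [int(y_toy[val_idx].sum())
                      for _, val_idx in resolved_cv.split(X_toy, y_toy)]
print(positives_per_fold)               # [2, 2, 2, 2, 2]
```

Passing an explicit `StratifiedKFold(shuffle=True, random_state=...)` is still useful when the rows are ordered, since the default resolution does not shuffle.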

In [None]:
# Exercise 9: Comparing Regular vs Stratified K-Fold

# Create highly imbalanced dataset
from sklearn.datasets import make_classification

X_imb, y_imb = make_classification(n_samples=1000, n_features=20, n_informative=15,
                                   n_redundant=5, n_classes=2, weights=[0.9, 0.1],
                                   flip_y=0.01, random_state=42)

print(f"Dataset class distribution:")
print(f"Class 0: {sum(y_imb == 0)} samples ({sum(y_imb == 0)/len(y_imb)*100:.1f}%)")
print(f"Class 1: {sum(y_imb == 1)} samples ({sum(y_imb == 1)/len(y_imb)*100:.1f}%)")
print(f"Imbalance ratio: {sum(y_imb == 0)/sum(y_imb == 1):.1f}:1")

# Compare regular K-Fold vs Stratified K-Fold
n_splits = 5
kf_regular = KFold(n_splits=n_splits, shuffle=True, random_state=42)
kf_stratified = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

# Function to analyze fold distributions
def analyze_folds(cv_method, X, y, cv_name):
    fold_distributions = []
    
    for fold, (train_idx, val_idx) in enumerate(cv_method.split(X, y), 1):
        y_val_fold = y[val_idx]
        class_1_ratio = sum(y_val_fold == 1) / len(y_val_fold)
        fold_distributions.append({
            'Fold': fold,
            'Class 0': sum(y_val_fold == 0),
            'Class 1': sum(y_val_fold == 1),
            'Class 1 Ratio': class_1_ratio * 100
        })
    
    df = pd.DataFrame(fold_distributions)
    df['Std Dev (%)'] = df['Class 1 Ratio'].std()  # spread of the class-1 share across folds
    return df

# Analyze both methods
regular_folds = analyze_folds(kf_regular, X_imb, y_imb, "Regular K-Fold")
stratified_folds = analyze_folds(kf_stratified, X_imb, y_imb, "Stratified K-Fold")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Regular K-Fold
axes[0].bar(regular_folds['Fold'], regular_folds['Class 1 Ratio'], 
           color='coral', edgecolor='black', alpha=0.7)
axes[0].axhline(y=10, color='green', linestyle='--', label='True Ratio (10%)')
axes[0].set_xlabel('Fold Number')
axes[0].set_ylabel('Class 1 Percentage (%)')
axes[0].set_title(f'Regular K-Fold\n(Std Dev: {regular_folds["Class 1 Ratio"].std():.2f}%)')
axes[0].legend()
axes[0].set_ylim([0, 20])
axes[0].grid(True, alpha=0.3)

# Stratified K-Fold
axes[1].bar(stratified_folds['Fold'], stratified_folds['Class 1 Ratio'],
           color='lightgreen', edgecolor='black', alpha=0.7)
axes[1].axhline(y=10, color='green', linestyle='--', label='True Ratio (10%)')
axes[1].set_xlabel('Fold Number')
axes[1].set_ylabel('Class 1 Percentage (%)')
axes[1].set_title(f'Stratified K-Fold\n(Std Dev: {stratified_folds["Class 1 Ratio"].std():.2f}%)')
axes[1].legend()
axes[1].set_ylim([0, 20])
axes[1].grid(True, alpha=0.3)

plt.suptitle('Class Distribution Across Folds: Regular vs Stratified K-Fold', 
             fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n📊 Fold Distribution Analysis:")
print("\nRegular K-Fold:")
display(regular_folds)
print("\nStratified K-Fold:")
display(stratified_folds)

print("\n💡 Key Insight: Stratified K-Fold maintains consistent class distribution across all folds!")

### Exercise 10: Hyperparameter Tuning with Grid Search
#### 💡 Concept
Grid Search systematically explores hyperparameter combinations:
- Exhaustive search through parameter grid
- Uses cross-validation for each combination
- Returns best parameters based on validation score
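
The cost of an exhaustive search is the number of grid points times the number of folds, and `ParameterGrid` makes that count explicit. The sketch below also places the scaler inside a `Pipeline` so it is re-fit within each fold; the step names `scale` and `svc`, the small grid, and the variable names are choices made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Parameters of a pipeline step are addressed as <step name>__<parameter>
pipe_demo = Pipeline([("scale", StandardScaler()), ("svc", SVC())])
grid_demo = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}

n_candidates = len(ParameterGrid(grid_demo))   # 3 values of C x 2 kernels = 6
n_folds = 5
print(f"{n_candidates} candidates x {n_folds} folds = {n_candidates * n_folds} fits")

search_demo = GridSearchCV(pipe_demo, grid_demo, cv=n_folds, scoring="f1")
search_demo.fit(X_demo, y_demo)

print(f"Best parameters: {search_demo.best_params_}")
print(f"Best CV F1: {search_demo.best_score_:.4f}")
```

By the same arithmetic, a grid of 3 x 4 x 3 x 3 values has 108 candidates, which means 540 model fits with 5-fold cross-validation.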

In [None]:
# Exercise 10: Hyperparameter Tuning Implementation

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Prepare data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define parameter grids for different models
param_grids = {
    'Logistic Regression': {
        'model': LogisticRegression(max_iter=1000, random_state=42),
        'params': {
            'C': [0.001, 0.01, 0.1, 1, 10, 100],
            'penalty': ['l1', 'l2'],
            'solver': ['liblinear']
        }
    },
    'Random Forest': {
        'model': RandomForestClassifier(random_state=42),
        'params': {
            'n_estimators': [50, 100, 200],
            'max_depth': [None, 10, 20, 30],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4]
        }
    },
    'SVM': {
        'model': SVC(random_state=42),
        'params': {
            'C': [0.1, 1, 10],
            'kernel': ['linear', 'rbf', 'poly'],
            'gamma': ['scale', 'auto', 0.001, 0.01]
        }
    }
}

# Perform Grid Search for each model
print("🔍 Performing Grid Search for Hyperparameter Tuning...")
print("=" * 60)

best_models = {}
results_summary = []

for name, config in param_grids.items():
    print(f"\nTuning {name}...")
    
    # Create GridSearchCV object
    grid_search = GridSearchCV(
        estimator=config['model'],
        param_grid=config['params'],
        cv=5,
        scoring='f1',
        n_jobs=-1,
        verbose=0
    )
    
    # Fit grid search
    grid_search.fit(X_train_scaled, y_train)
    
    # Store best model
    best_models[name] = grid_search.best_estimator_
    
    # Evaluate on test set
    y_pred = grid_search.predict(X_test_scaled)
    test_f1 = f1_score(y_test, y_pred)
    
    # Store results
    results_summary.append({
        'Model': name,
        'Best Params': str(grid_search.best_params_),
        'CV F1 Score': grid_search.best_score_,
        'Test F1 Score': test_f1,
        'Total Combinations': len(grid_search.cv_results_['params'])
    })
    
    print(f"  Best parameters: {grid_search.best_params_}")
    print(f"  Best CV F1 score: {grid_search.best_score_:.4f}")
    print(f"  Test F1 score: {test_f1:.4f}")
    print(f"  Total combinations tested: {len(grid_search.cv_results_['params'])}")

# Display results summary
results_df = pd.DataFrame(results_summary)
# Rank by cross-validated score; the test score is a final check, not a selection criterion
results_df = results_df.sort_values('CV F1 Score', ascending=False)

print("\n📊 Grid Search Results Summary:")
display(results_df.style.background_gradient(subset=['CV F1 Score', 'Test F1 Score'], cmap='Greens'))

# Visualize hyperparameter importance (example with Random Forest)
rf_model = param_grids['Random Forest']['model']
rf_grid = GridSearchCV(
    estimator=rf_model,
    param_grid={'n_estimators': [50, 100, 150, 200],
                'max_depth': [5, 10, 20, 30, None]},
    cv=5,
    scoring='f1',
    n_jobs=-1
)
rf_grid.fit(X_train_scaled, y_train)

# Create heatmap of results
scores = rf_grid.cv_results_['mean_test_score']
scores_array = scores.reshape(5, 4)

fig = go.Figure(data=go.Heatmap(
    z=scores_array,
    x=[50, 100, 150, 200],
    y=[5, 10, 20, 30, 'None'],
    text=scores_array.round(3),
    texttemplate='%{text}',
    colorscale='Viridis',
    colorbar_title='F1 Score'
))

fig.update_layout(
    title='Random Forest: Hyperparameter Grid Search Results',
    xaxis_title='Number of Estimators',
    yaxis_title='Max Depth',
    width=600,
    height=500
)

fig.show()

print("\n💡 Key Insights:")
print("• Grid Search tests all combinations exhaustively")
print("• Can be computationally expensive for large parameter spaces")
print("• Consider RandomizedSearchCV for faster exploration")
print("• Always validate final model on held-out test set")

---
## Part 5: Summary and Final Exercise

### 🎯 Complete Model Evaluation Pipeline
Let's put everything together in a comprehensive evaluation pipeline.

In [None]:
# Final Exercise: Complete Model Evaluation Pipeline

class ModelEvaluationPipeline:
    """Complete pipeline for model evaluation"""
    
    def __init__(self, models, scoring='accuracy'):
        self.models = models
        self.scoring = scoring
        self.results = {}
        
    def evaluate_models(self, X, y, test_size=0.3, cv_folds=5):
        """Evaluate multiple models with comprehensive metrics"""
        
        # Split data
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=42, stratify=y
        )
        
        # Scale features
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        
        print("🔬 Model Evaluation Pipeline Started")
        print("=" * 60)
        
        for name, model in self.models.items():
            print(f"\nEvaluating: {name}")
            
            # Cross-validation
            cv_scores = cross_val_score(model, X_train_scaled, y_train, 
                                       cv=cv_folds, scoring=self.scoring)
            
            # Train on full training set
            model.fit(X_train_scaled, y_train)
            
            # Predictions
            y_pred = model.predict(X_test_scaled)
            y_proba = model.predict_proba(X_test_scaled)[:, 1] if hasattr(model, 'predict_proba') else None
            
            # Calculate metrics
            metrics = {
                'CV Score Mean': cv_scores.mean(),
                'CV Score Std': cv_scores.std(),
                'Test Accuracy': accuracy_score(y_test, y_pred),
                'Test Precision': precision_score(y_test, y_pred),
                'Test Recall': recall_score(y_test, y_pred),
                'Test F1': f1_score(y_test, y_pred),
                'Test AUC': roc_auc_score(y_test, y_proba) if y_proba is not None else None
            }
            
            self.results[name] = {
                'model': model,
                'metrics': metrics,
                'predictions': y_pred,
                'probabilities': y_proba,
                'confusion_matrix': confusion_matrix(y_test, y_pred)
            }
            
            print(f"  CV Score: {metrics['CV Score Mean']:.4f} (±{metrics['CV Score Std']:.4f})")
            print(f"  Test F1: {metrics['Test F1']:.4f}")
            print(f"  Test AUC: {metrics['Test AUC']:.4f}" if metrics['Test AUC'] is not None else "  Test AUC: N/A")
        
        return self.results
    
    def plot_comparison(self):
        """Create comprehensive comparison visualizations"""
        
        # Prepare data for plotting
        models_list = []
        metrics_dict = {
            'Accuracy': [], 'Precision': [], 'Recall': [], 'F1': [], 'AUC': []
        }
        
        for name, result in self.results.items():
            models_list.append(name)
            metrics_dict['Accuracy'].append(result['metrics']['Test Accuracy'])
            metrics_dict['Precision'].append(result['metrics']['Test Precision'])
            metrics_dict['Recall'].append(result['metrics']['Test Recall'])
            metrics_dict['F1'].append(result['metrics']['Test F1'])
            metrics_dict['AUC'].append(result['metrics']['Test AUC'] or 0)
        
        # Create subplots
        fig = make_subplots(
            rows=2, cols=3,
            subplot_titles=['Accuracy', 'Precision', 'Recall', 'F1 Score', 'AUC Score', 'Overall Comparison'],
            specs=[[{'type': 'bar'}, {'type': 'bar'}, {'type': 'bar'}],
                   [{'type': 'bar'}, {'type': 'bar'}, {'type': 'scatter'}]]
        )
        
        # Individual metrics
        colors = ['#1E64C8', '#4A90E2', '#7BB3F0', '#9FC5E8']
        
        for idx, (metric, values) in enumerate(metrics_dict.items()):
            if idx < 5:
                row = idx // 3 + 1
                col = idx % 3 + 1
                fig.add_trace(
                    go.Bar(x=models_list, y=values, marker_color=colors, showlegend=False),
                    row=row, col=col
                )
        
        # Overall comparison: one line per model across all metrics,
        # so every y value is paired with the metric name it belongs to
        metric_names = list(metrics_dict.keys())
        for i, name in enumerate(models_list):
            fig.add_trace(
                go.Scatter(x=metric_names,
                          y=[metrics_dict[m][i] for m in metric_names],
                          mode='markers+lines',
                          marker=dict(size=10, color=colors[i % len(colors)]),
                          name=name,
                          showlegend=True),
                row=2, col=3
            )
        
        fig.update_layout(height=600, title_text="Model Performance Comparison Dashboard")
        fig.show()
        
        return fig
    
    def generate_report(self):
        """Generate comprehensive evaluation report"""
        
        report = []
        report.append("\n" + "=" * 60)
        report.append("📊 MODEL EVALUATION REPORT")
        report.append("=" * 60)
        
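        # Find best model by cross-validation score; choosing on the test set
        # would make the reported test metrics optimistically biased
        best_model = max(self.results.items(), 
                        key=lambda x: x[1]['metrics']['CV Score Mean'])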
        
        report.append(f"\n🏆 Best Model: {best_model[0]}")
        report.append(f"   Test F1 Score: {best_model[1]['metrics']['Test F1']:.4f}")
        
        # Detailed metrics table (values stay numeric so highlight_max
        # compares numbers rather than formatted strings)
        metrics_data = []
        for name, result in self.results.items():
            m = result['metrics']
            metrics_data.append({
                'Model': name,
                'CV Score': f"{m['CV Score Mean']:.4f} (±{m['CV Score Std']:.4f})",
                'Accuracy': m['Test Accuracy'],
                'Precision': m['Test Precision'],
                'Recall': m['Test Recall'],
                'F1': m['Test F1'],
                'AUC': m['Test AUC']  # None when the model has no predict_proba
            })
        
        report_df = pd.DataFrame(metrics_data)
        metric_cols = ['Accuracy', 'Precision', 'Recall', 'F1', 'AUC']
        report_df[metric_cols] = report_df[metric_cols].astype(float)  # None -> NaN
        
        print("\n".join(report))
        print("\n📈 Detailed Metrics:")
        display(report_df.style
                .format({col: '{:.4f}' for col in metric_cols}, na_rep='N/A')
                .highlight_max(subset=metric_cols))
        
        return report_df

# Initialize and run pipeline
models_to_evaluate = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM (RBF)': SVC(kernel='rbf', probability=True, random_state=42),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42)
}

# Load data
X, y = load_breast_cancer(return_X_y=True)

# Run pipeline
pipeline = ModelEvaluationPipeline(models_to_evaluate, scoring='f1')
results = pipeline.evaluate_models(X, y)

# Generate visualizations
pipeline.plot_comparison()

# Generate report
report = pipeline.generate_report()

print("\n✅ Pipeline Complete!")

---
## 📝 Practice Exercises

### Your Turn: Challenge Problems

1. **Imbalanced Classification**: Create a highly imbalanced dataset (95:5 ratio) and compare different evaluation metrics
2. **Time Series Splitting**: Implement time-based cross-validation for temporal data
3. **Multi-class Classification**: Extend the pipeline to handle 3+ classes with macro/micro/weighted averaging
4. **Custom Metrics**: Create a custom scoring function that penalizes false negatives 3x more than false positives
5. **Ensemble Evaluation**: Combine predictions from multiple models and evaluate the ensemble
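
As a possible starting point for exercise 4, the sketch below wraps a cost function with `make_scorer`. The function name `weighted_error_cost`, its weights, and the choice of class 1 as the positive class are assumptions made for illustration; adapt them to your own problem.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def weighted_error_cost(y_true, y_pred, fn_weight=3.0, fp_weight=1.0):
    """Average misclassification cost per sample (hypothetical helper); lower is better."""
    # labels=[0, 1] guarantees a 2x2 matrix even if a class is missing from a fold
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return (fn_weight * fn + fp_weight * fp) / len(y_true)


# greater_is_better=False tells scikit-learn to negate the cost,
# so that "higher score is better" still holds during model selection
cost_scorer = make_scorer(weighted_error_cost, greater_is_better=False)

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
neg_costs = cross_val_score(model, X, y, cv=5, scoring=cost_scorer)

print(f"Mean cost per sample: {-neg_costs.mean():.4f}")
```

The same `cost_scorer` object can be passed as `scoring` to `GridSearchCV`, which then prefers hyperparameters that reduce false negatives more aggressively than false positives.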

### 🎯 Key Takeaways
- Always use proper train/validation/test splits
- Choose metrics appropriate for your problem and data
- Prevent data leakage by splitting before preprocessing and fitting transformers on the training data only
- Use cross-validation for robust performance estimates
- Consider class imbalance when selecting metrics
- Tune hyperparameters on a validation set (or with cross-validation), never on the test set
- Multiple metrics provide different perspectives on performance
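
The leakage and tuning takeaways above can be sketched in a few lines. This is one way to do it, assuming a scikit-learn `Pipeline` so that the scaler is refit inside every cross-validation fold; the step names (`scaler`, `clf`) and the small grid over `C` are illustrative choices, not values from the lecture.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# The scaler lives inside the pipeline, so each CV fold fits it
# on that fold's training portion only (no leakage into validation folds)
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000, random_state=42)),
])

search = GridSearchCV(
    pipe,
    param_grid={'clf__C': [0.01, 0.1, 1, 10]},  # assumed illustrative grid
    cv=5,
    scoring='f1',
)
search.fit(X_train, y_train)             # tuning sees training data only

test_f1 = search.score(X_test, y_test)   # test set is used once, at the end
print(f"Best C: {search.best_params_['clf__C']}")
print(f"CV F1: {search.best_score_:.4f} | Test F1: {test_f1:.4f}")
```

Compared with scaling the whole training set before calling `cross_val_score`, this keeps the validation folds unseen by the preprocessing step, so the cross-validation estimate is slightly more honest.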