# 2.4 - Model Evaluation & Metrics: Measuring What Matters

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/madeforai/madeforai/blob/main/docs/understanding-ai/module-2/2.4-model-evaluation.ipynb)

---

**Master the art of measuring model performance, because a model is only as good as your ability to evaluate it.**

## 📚 What You'll Learn

- **Confusion matrices**: Understanding true/false positives and negatives
- **Core metrics**: Accuracy, precision, recall, F1-score, and when to use each
- **ROC curves & AUC**: Visualizing classifier performance across thresholds
- **Cross-validation**: Properly assessing model generalization
- **Bias-variance tradeoff**: Overfitting vs underfitting explained
- **Business metrics**: Choosing the right metric for your problem

## ⏱️ Estimated Time
40-45 minutes

## 📋 Prerequisites
- Completed Chapter 2.3 (Unsupervised Learning)
- Understanding of classification and regression
- Basic probability concepts

## 🎯 The Evaluation Paradox

**Scenario**: You've built two models for detecting spam emails.

**Model A**: 95% accuracy  
**Model B**: 88% accuracy

**Question**: Which is better?

**Your answer**: "Obviously Model A!"

**Reality**: **Maybe not.** Here's why:

Imagine your dataset:
- 95% legitimate emails
- 5% spam emails

**Model A (the "lazy" model)**:
- Predicts EVERYTHING as "legitimate"
- Accuracy: 95% ✅
- Spam caught: 0% ❌
- **Completely useless!**

**Model B (the "smart" model)**:
- Actually tries to detect spam
- Accuracy: 88% ✅
- Spam caught: 85% ✅
- **Actually useful!**

**The Lesson**: **On imbalanced data, accuracy alone can be badly misleading!**
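
To make the paradox concrete, here is a minimal sketch that scores the "lazy" model by hand. The inbox of 1,000 emails (950 legitimate, 50 spam) is an assumption chosen to match the 95%/5% split above, not data from a real system.

```python
# Hypothetical inbox: 950 legitimate emails (label 0) and 50 spam emails (label 1).
y_true = [0] * 950 + [1] * 50

# The "lazy" model labels every email as legitimate.
y_pred = [0] * len(y_true)

# Accuracy: share of emails whose predicted label matches the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Spam caught: share of actual spam that the model flagged as spam.
spam_total = sum(y_true)
spam_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
spam_caught_rate = spam_caught / spam_total

print(f"Accuracy:    {accuracy:.0%}")
print(f"Spam caught: {spam_caught_rate:.0%}")
```

Because the model never flags anything, its accuracy simply mirrors the share of legitimate emails in the inbox.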

This chapter teaches you to:
1. ✅ Choose the RIGHT metric for YOUR problem
2. ✅ Understand the tradeoffs between different metrics
3. ✅ Properly validate your models
4. ✅ Communicate model performance to stakeholders

Let's dive into the metrics that matter! 🚀

In [None]:
# Setup: Install and import libraries
# Uncomment if running in Google Colab
# !pip install numpy pandas matplotlib seaborn scikit-learn plotly -q

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from itertools import cycle

from sklearn.datasets import make_classification
from sklearn.model_selection import (
    train_test_split, cross_val_score, cross_validate,
    learning_curve, validation_curve, KFold, StratifiedKFold
)
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    confusion_matrix, classification_report,
    accuracy_score, precision_score, recall_score, f1_score,
    roc_curve, roc_auc_score, precision_recall_curve,
    average_precision_score, matthews_corrcoef,
    mean_squared_error, mean_absolute_error, r2_score
)

# Visualization settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
warnings.filterwarnings('ignore')
np.random.seed(42)

# Better defaults
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11

print("✅ Libraries loaded successfully!")
print("📘 Module 2.4: Model Evaluation & Metrics")
print("📊 Ready to master performance measurement!")

## 📋 Part 1: The Confusion Matrix - Foundation of Classification Metrics

### Understanding the 2×2 Grid

Every classification metric starts from the **confusion matrix**:

```
                    Predicted
                 Negative  Positive
Actual Negative     TN        FP     (False Positive = Type I Error)
Actual Positive     FN        TP     (False Negative = Type II Error)
```

**The Four Outcomes**:

1. **True Positive (TP)**: Predicted positive, actually positive ✅
   - Example: Detected spam that IS spam

2. **True Negative (TN)**: Predicted negative, actually negative ✅
   - Example: Legitimate email marked as legitimate

3. **False Positive (FP)**: Predicted positive, actually negative ❌ (Type I Error)
   - Example: Legitimate email marked as spam
   - **Business cost**: User misses important email!

4. **False Negative (FN)**: Predicted negative, actually positive ❌ (Type II Error)
   - Example: Spam email reaches inbox
   - **Business cost**: User sees unwanted spam!

**Critical Insight**: Different problems care about different errors!
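
The four outcomes can be counted directly from two label lists. The sketch below is a hand-rolled illustration on made-up labels (1 = spam, 0 = legitimate); the helper name `confusion_counts` is hypothetical, and the return order is chosen to match what `confusion_matrix(...).ravel()` yields for binary labels in scikit-learn.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tn, fp, fn, tp) for binary labels."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1   # flagged as spam, really spam
            else:
                fp += 1   # flagged as spam, actually legitimate (Type I error)
        else:
            if actual == positive:
                fn += 1   # let through, actually spam (Type II error)
            else:
                tn += 1   # let through, really legitimate
    return tn, fp, fn, tp

# Toy data, invented for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")
```

The four counts always sum to the number of samples, which is a quick sanity check when building a matrix by hand.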

Let's build a confusion matrix from scratch!

In [None]:
# Generate imbalanced classification data (like spam detection)
# 95% class 0 (legitimate), 5% class 1 (spam)
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    weights=[0.95, 0.05],  # Imbalanced!
    flip_y=0.01,
    random_state=42
)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("📊 Dataset Statistics:")
print("="*60)
print(f"Total samples: {len(y)}")
print(f"Training samples: {len(y_train)}")
print(f"Test samples: {len(y_test)}")
print(f"\nClass distribution (test set):")
print(f"  Class 0 (Legitimate): {np.sum(y_test == 0)} ({np.sum(y_test == 0)/len(y_test)*100:.1f}%)")
print(f"  Class 1 (Spam): {np.sum(y_test == 1)} ({np.sum(y_test == 1)/len(y_test)*100:.1f}%)")
print(f"\n⚠️ This is HIGHLY imbalanced - accuracy will be misleading!")

In [None]:
# Train two models for comparison

# Model 1: Dummy "always predict majority class"
from sklearn.dummy import DummyClassifier
dummy_model = DummyClassifier(strategy='most_frequent')
dummy_model.fit(X_train_scaled, y_train)
y_pred_dummy = dummy_model.predict(X_test_scaled)

# Model 2: Actual logistic regression
lr_model = LogisticRegression(random_state=42, max_iter=1000)
lr_model.fit(X_train_scaled, y_train)
y_pred_lr = lr_model.predict(X_test_scaled)
y_proba_lr = lr_model.predict_proba(X_test_scaled)[:, 1]

# Visualize confusion matrices side by side
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Dummy model confusion matrix
cm_dummy = confusion_matrix(y_test, y_pred_dummy)
sns.heatmap(cm_dummy, annot=True, fmt='d', cmap='Reds', ax=axes[0],
           cbar_kws={'label': 'Count'})
axes[0].set_title('Dummy Model (Always Predicts Legitimate)\n' + 
                 f'Accuracy: {accuracy_score(y_test, y_pred_dummy):.1%}',
                 fontsize=14, fontweight='bold', color='red')
axes[0].set_ylabel('True Label', fontsize=12)
axes[0].set_xlabel('Predicted Label', fontsize=12)
axes[0].set_xticklabels(['Legitimate', 'Spam'])
axes[0].set_yticklabels(['Legitimate', 'Spam'])

# Logistic regression confusion matrix
cm_lr = confusion_matrix(y_test, y_pred_lr)
sns.heatmap(cm_lr, annot=True, fmt='d', cmap='Greens', ax=axes[1],
           cbar_kws={'label': 'Count'})
axes[1].set_title('Logistic Regression\n' + 
                 f'Accuracy: {accuracy_score(y_test, y_pred_lr):.1%}',
                 fontsize=14, fontweight='bold', color='green')
axes[1].set_ylabel('True Label', fontsize=12)
axes[1].set_xlabel('Predicted Label', fontsize=12)
axes[1].set_xticklabels(['Legitimate', 'Spam'])
axes[1].set_yticklabels(['Legitimate', 'Spam'])

plt.tight_layout()
plt.show()

# Detailed breakdown
print("\n📊 Confusion Matrix Breakdown (Logistic Regression):")
print("="*60)
tn, fp, fn, tp = cm_lr.ravel()
print(f"True Negatives (TN):  {tn:4d} - Correctly identified legitimate emails")
print(f"False Positives (FP): {fp:4d} - Legitimate emails marked as spam (BAD!)")
print(f"False Negatives (FN): {fn:4d} - Spam emails that got through (BAD!)")
print(f"True Positives (TP):  {tp:4d} - Correctly caught spam emails")
print(f"\n💡 Notice: High accuracy doesn't mean good spam detection!")

## 📋 Part 2: Core Classification Metrics

### The Metric Trinity: Precision, Recall, F1

From the confusion matrix, we derive four core metrics: accuracy, plus the precision/recall/F1 trio:

#### 1. Accuracy
**Formula**: `(TP + TN) / (TP + TN + FP + FN)`

**Meaning**: Proportion of correct predictions

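**When to use**: Classes are roughly balanced and both kinds of error cost about the same

**When NOT to use**: Imbalanced data (like our spam example), where always predicting the majority class already scores high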

---

#### 2. Precision (Positive Predictive Value)
**Formula**: `TP / (TP + FP)`

**Question answered**: "Of all emails I marked as spam, how many were actually spam?"

**Focus**: Minimizing false positives

**High precision means**: When I say it's spam, I'm probably right

**Use when**: False positives are costly (e.g., a spam filter that must not block legitimate email)

---

#### 3. Recall (Sensitivity, True Positive Rate)
**Formula**: `TP / (TP + FN)`

**Question answered**: "Of all actual spam emails, how many did I catch?"

**Focus**: Minimizing false negatives

**High recall means**: I catch most of the spam

**Use when**: False negatives are costly (e.g., cancer screening)

---

#### 4. F1-Score (Harmonic Mean of Precision and Recall)
**Formula**: `2 × (Precision × Recall) / (Precision + Recall)`

**Meaning**: Balanced measure of precision and recall

**Use when**: You care about both false positives AND false negatives

**Range**: 0 to 1 (higher is better)
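
As a worked example of the four formulas, the sketch below plugs hypothetical confusion-matrix counts into them. The counts are invented for illustration, and `precision_recall_f1` is a helper name introduced here, not a library function.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical counts for 1,000 emails, 100 of which are spam.
tp, fp, fn, tn = 40, 10, 60, 890

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
arithmetic_mean = (precision + recall) / 2

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-Score:  {f1:.3f}  (arithmetic mean would be {arithmetic_mean:.3f})")
```

Because F1 is a harmonic mean, it sits closer to the weaker of the two scores than a plain average would, so a model cannot hide poor recall behind high precision.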

---

### The Precision-Recall Tradeoff

**Critical insight**: For a given model, you usually can't maximize both at once. Moving the decision threshold trades one for the other.

**Increase recall** (catch more spam):
- → More false positives (legitimate emails marked as spam)
- → Lower precision

**Increase precision** (be more sure about spam):
- → Miss more spam (more false negatives)
- → Lower recall

**F1 score** summarizes this tradeoff in a single number!
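
The tradeoff can be seen by sweeping a threshold over a handful of scores. This is a minimal sketch on made-up spam probabilities; the data and the helper `precision_recall_at` are assumptions for illustration, and undefined precision is reported as 0.0, mirroring the `zero_division=0` convention.

```python
# Made-up model scores (predicted probability of spam) and true labels (1 = spam).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

def precision_recall_at(threshold, scores, labels):
    """Predict spam when score >= threshold, then compute precision and recall."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

results = {}
for threshold in (0.85, 0.50, 0.25):
    results[threshold] = precision_recall_at(threshold, scores, labels)
    precision, recall = results[threshold]
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

On this toy data, each step down in threshold catches more spam but also flags more legitimate emails.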

In [None]:
# Calculate all metrics for our models

def print_detailed_metrics(y_true, y_pred, model_name):
    """Print comprehensive classification metrics"""
    print(f"\n{'='*60}")
    print(f"📊 {model_name} - Detailed Metrics")
    print(f"{'='*60}")
    
    # Basic metrics
    acc = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred, zero_division=0)
    rec = recall_score(y_true, y_pred, zero_division=0)
    f1 = f1_score(y_true, y_pred, zero_division=0)
    
    print(f"\n🎯 Overall Performance:")
    print(f"  Accuracy:  {acc:.1%} ← Proportion of correct predictions")
    print(f"  Precision: {prec:.1%} ← Of predicted spam, how many were actually spam?")
    print(f"  Recall:    {rec:.1%} ← Of actual spam, how many did we catch?")
    print(f"  F1-Score:  {f1:.3f} ← Harmonic mean of precision & recall")
    
    # Confusion matrix breakdown
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()
    
    print(f"\n📋 Confusion Matrix Breakdown:")
    print(f"  True Negatives:  {tn:4d}")
    print(f"  False Positives: {fp:4d} ← Legitimate emails marked as spam")
    print(f"  False Negatives: {fn:4d} ← Spam emails that got through")
    print(f"  True Positives:  {tp:4d}")
    
    # Derived metrics
    specificity = tn / (tn + fp) if (tn + fp) > 0 else 0
    print(f"\n🔍 Additional Metrics:")
    print(f"  Specificity (True Negative Rate): {specificity:.1%}")
    print(f"    → Of legitimate emails, how many were correctly identified?")
    print(f"  False Positive Rate: {fp/(fp+tn):.1%} ← Should be LOW")
    print(f"  False Negative Rate: {fn/(fn+tp):.1%} ← Should be LOW")

# Compare both models
print_detailed_metrics(y_test, y_pred_dummy, "Dummy Model (Baseline)")
print_detailed_metrics(y_test, y_pred_lr, "Logistic Regression")

print(f"\n\n💡 Key Insight:")
print(f"{'='*60}")
print(f"The dummy model has HIGH accuracy ({accuracy_score(y_test, y_pred_dummy):.1%})")
print(f"but ZERO recall (catches no spam)!")
print(f"\nLogistic Regression has accuracy {accuracy_score(y_test, y_pred_lr):.1%}")
print(f"with recall {recall_score(y_test, y_pred_lr):.1%} - compare that with the dummy model's 0%!")
print(f"\n⭐ This is why accuracy alone is misleading on imbalanced data!")

### Visualizing the Tradeoff: Precision-Recall Curve

In [None]:
# Plot precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_test, y_proba_lr)
avg_precision = average_precision_score(y_test, y_proba_lr)

fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Precision-Recall curve
axes[0].plot(recall, precision, linewidth=2.5, color='blue', label=f'AP = {avg_precision:.3f}')
axes[0].fill_between(recall, precision, alpha=0.2, color='blue')
axes[0].set_xlabel('Recall (Sensitivity)', fontsize=13, fontweight='bold')
axes[0].set_ylabel('Precision (Positive Predictive Value)', fontsize=13, fontweight='bold')
axes[0].set_title('Precision-Recall Curve', fontsize=15, fontweight='bold', pad=15)
axes[0].legend(loc='best', fontsize=12)
axes[0].grid(alpha=0.3)
axes[0].set_xlim([0, 1])
axes[0].set_ylim([0, 1])

# Add annotations
axes[0].annotate('High Precision\nLow Recall', xy=(0.2, 0.9), fontsize=11,
                bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
axes[0].annotate('Low Precision\nHigh Recall', xy=(0.8, 0.3), fontsize=11,
                bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.5))

# Threshold vs Metrics
f1_scores = 2 * (precision[:-1] * recall[:-1]) / (precision[:-1] + recall[:-1] + 1e-8)
axes[1].plot(thresholds, precision[:-1], linewidth=2.5, label='Precision', color='blue')
axes[1].plot(thresholds, recall[:-1], linewidth=2.5, label='Recall', color='orange')
axes[1].plot(thresholds, f1_scores, linewidth=2.5, label='F1-Score', color='green', linestyle='--')

# Find optimal threshold (max F1)
optimal_idx = np.argmax(f1_scores)
optimal_threshold = thresholds[optimal_idx]
axes[1].axvline(x=optimal_threshold, color='red', linestyle=':', linewidth=2,
               label=f'Optimal Threshold = {optimal_threshold:.3f}')

axes[1].set_xlabel('Classification Threshold', fontsize=13, fontweight='bold')
axes[1].set_ylabel('Score', fontsize=13, fontweight='bold')
axes[1].set_title('Metrics vs Decision Threshold', fontsize=15, fontweight='bold', pad=15)
axes[1].legend(loc='best', fontsize=11)
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\n🎯 Optimal Operating Point:")
print("="*60)
print(f"Best threshold: {optimal_threshold:.3f}")
print(f"At this threshold:")
print(f"  • Precision: {precision[optimal_idx]:.1%}")
print(f"  • Recall: {recall[optimal_idx]:.1%}")
print(f"  • F1-Score: {f1_scores[optimal_idx]:.3f}")
print(f"\n💡 Adjust threshold based on business needs!")
print(f"   Lower threshold → Higher recall (catch more spam)")
print(f"   Higher threshold → Usually higher precision (fewer false alarms)")

## 📋 Part 3: ROC Curves & AUC - The Gold Standard

### Understanding ROC (Receiver Operating Characteristic)

ROC curves plot:
- **X-axis**: False Positive Rate (FPR) = FP / (FP + TN)
- **Y-axis**: True Positive Rate (TPR) = TP / (TP + FN) = Recall

**ROC curve shows**: Model performance across ALL possible thresholds
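
Each threshold contributes one (FPR, TPR) point to the curve. The sketch below computes a few such points by hand on made-up scores; the data and the helper `roc_point` are illustrative assumptions, whereas in practice `roc_curve` from scikit-learn does this for every distinct score.

```python
# Made-up scores (predicted probability of spam) and true labels (1 = spam).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

def roc_point(threshold, scores, labels):
    """Return (FPR, TPR) when every score >= threshold is predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return fp / n_neg, tp / n_pos

# From "flag nothing" (threshold above every score) to "flag everything".
thresholds = [1.01, 0.85, 0.50, 0.25, 0.0]
roc_points = [roc_point(t, scores, labels) for t in thresholds]

for t, (fpr, tpr) in zip(thresholds, roc_points):
    print(f"threshold={t:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The curve always runs from (0, 0), where nothing is flagged, to (1, 1), where everything is flagged. What distinguishes models is how quickly TPR rises before FPR does.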

**AUC (Area Under Curve)**:
- **Range**: 0 to 1
- **1.0**: Perfect classifier
- **0.5**: Random guessing (diagonal line)
- **< 0.5**: Worse than random (you're doing something backwards!)

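**Interpretation** (the score bands below are a rough rule of thumb; what counts as "good" depends on the domain):
- AUC = probability that the model ranks a randomly chosen positive higher than a randomly chosen negative
- **0.9-1.0**: Excellent
- **0.8-0.9**: Good
- **0.7-0.8**: Fair
- **0.6-0.7**: Poor
- **0.5-0.6**: Fail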
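
The ranking interpretation of AUC can be checked by brute force: compare every (positive, negative) pair and count how often the positive gets the higher score. This is a sketch on made-up scores, with a hypothetical helper name, and is not how scikit-learn computes `roc_auc_score` internally.

```python
def auc_by_ranking(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up scores and labels (1 = spam), invented for illustration.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

auc_value = auc_by_ranking(scores, labels)
auc_inverted = auc_by_ranking([1 - s for s in scores], labels)

print(f"AUC:          {auc_value:.3f}")
print(f"AUC inverted: {auc_inverted:.3f}")
```

Flipping the scores gives `1 - AUC`, which is why an AUC well below 0.5 usually means the labels or scores are reversed somewhere.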

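**Why ROC/AUC?**
- ✅ Threshold-independent
- ✅ Not affected by the class ratio itself (but on heavily imbalanced data it can look optimistic, so check the precision-recall curve as well)
- ✅ Single number for model comparison
- ✅ Widely understood in industry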

In [None]:
# Train multiple models for comparison
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42, max_depth=5),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, max_depth=5)
}

plt.figure(figsize=(14, 7))

# Plot ROC curves for all models
colors = ['blue', 'green', 'orange']
for (name, model), color in zip(models.items(), colors):
    # Train model
    model.fit(X_train_scaled, y_train)
    
    # Get probabilities
    if hasattr(model, 'predict_proba'):
        y_proba = model.predict_proba(X_test_scaled)[:, 1]
    else:
        y_proba = model.decision_function(X_test_scaled)
    
    # Calculate ROC curve
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    auc = roc_auc_score(y_test, y_proba)
    
    # Plot
    plt.plot(fpr, tpr, linewidth=2.5, color=color, 
            label=f'{name} (AUC = {auc:.3f})')

# Plot diagonal (random classifier)
plt.plot([0, 1], [0, 1], 'k--', linewidth=2, label='Random Classifier (AUC = 0.5)')

# Formatting
plt.xlabel('False Positive Rate (1 - Specificity)', fontsize=13, fontweight='bold')
plt.ylabel('True Positive Rate (Recall/Sensitivity)', fontsize=13, fontweight='bold')
plt.title('ROC Curves - Model Comparison', fontsize=15, fontweight='bold', pad=15)
plt.legend(loc='lower right', fontsize=12)
plt.grid(alpha=0.3)
plt.xlim([0, 1])
plt.ylim([0, 1])

# Add annotations
plt.annotate('Perfect Classifier\n(AUC = 1.0)', 
            xy=(0, 1), xytext=(0.3, 0.9),
            arrowprops=dict(arrowstyle='->', color='red', lw=2),
            fontsize=11, fontweight='bold',
            bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.7))

plt.tight_layout()
plt.show()

print("\n📊 Model Comparison via AUC:")
print("="*60)
for name, model in models.items():
    y_proba = model.predict_proba(X_test_scaled)[:, 1]
    auc = roc_auc_score(y_test, y_proba)
    print(f"{name:20s}: AUC = {auc:.4f}")

print("\n💡 Interpretation:")
print("  • Closer to top-left corner = better model")
print("  • AUC is a threshold-independent metric")
print("  • Use AUC to compare models, then tune threshold for deployment")

## 📋 Part 4: Cross-Validation - Proper Model Assessment

### The Problem with Single Train/Test Split

**Scenario**: You split your data once, train a model, and get 90% accuracy.

**Questions**:
- Was this a lucky split?
- Would performance hold on different data?
- Did you overfit to this particular test set?

**Answer**: You don't know! One split is not enough.

### K-Fold Cross-Validation

**The solution**: Test on multiple different splits!

**Process** (5-fold example):
1. Split data into 5 equal folds
2. Train on folds 1-4, test on fold 5
3. Train on folds 1-3 & 5, test on fold 4
4. Train on folds 1-2 & 4-5, test on fold 3
5. Train on folds 1 & 3-5, test on fold 2
6. Train on folds 2-5, test on fold 1

**Result**: 5 different performance scores → average & standard deviation
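
The splitting logic can be sketched without any library. The generator below is a simplified illustration (contiguous folds, no shuffling) rather than scikit-learn's `KFold`, and the per-fold scores are invented purely to show the mean and standard deviation step.

```python
import statistics

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds, no shuffling."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop

splits = list(k_fold_indices(n_samples=10, k=5))
for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"Fold {fold}: train={train_idx} test={test_idx}")

# Hypothetical per-fold scores, only to illustrate the aggregation step.
fold_scores = [0.90, 0.85, 0.88, 0.92, 0.87]
mean_score = statistics.mean(fold_scores)
std_score = statistics.pstdev(fold_scores)  # population std, like NumPy's default
print(f"Mean: {mean_score:.3f}, Std: {std_score:.3f}")
```

The order in which folds serve as the test set does not matter. What matters is that every sample lands in a test fold exactly once.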

**Benefits**:
- ✅ More robust performance estimate
- ✅ Every sample is used for both training and testing (in different folds)
- ✅ Reduces variance in performance estimate
- ✅ Helps reveal overfitting (compare training scores with validation scores)

**Stratified K-Fold**: Maintains class distribution in each fold (important for imbalanced data!)
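
To see why stratification matters, the sketch below compares a naive contiguous split with a simple round-robin assignment within each class. The 95/5 label list and the helper `stratified_folds` are assumptions for illustration, and the round-robin scheme is a simplification, not scikit-learn's actual `StratifiedKFold` algorithm.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign sample indices to k folds, dealing each class out round-robin."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Hypothetical sorted dataset: 95 legitimate (0) followed by 5 spam (1).
labels = [0] * 95 + [1] * 5
k = 5

# Naive contiguous folds: all the spam ends up in the last fold.
fold_size = len(labels) // k
naive_folds = [list(range(i * fold_size, (i + 1) * fold_size)) for i in range(k)]
naive_spam = [sum(labels[i] for i in fold) for fold in naive_folds]

# Stratified folds: every fold receives its share of spam.
strat_folds = stratified_folds(labels, k)
strat_spam = [sum(labels[i] for i in fold) for fold in strat_folds]

print("Spam per fold (naive):     ", naive_spam)
print("Spam per fold (stratified):", strat_spam)
```

With the naive split, four of the five test folds contain no spam at all, so recall cannot even be computed on them.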

In [None]:
# Demonstrate cross-validation
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Use Random Forest for this demo
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=5)

# Regular K-Fold
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores_regular = cross_val_score(rf_model, X_train_scaled, y_train, 
                                   cv=kfold, scoring='accuracy')

# Stratified K-Fold (better for imbalanced data)
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores_stratified = cross_val_score(rf_model, X_train_scaled, y_train, 
                                      cv=skfold, scoring='accuracy')

# Multiple metrics
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
cv_results = cross_validate(rf_model, X_train_scaled, y_train, 
                           cv=skfold, scoring=scoring, return_train_score=True)

print("\n📊 Cross-Validation Results (5-Fold):")
print("="*70)
print(f"\nRegular K-Fold Accuracy:")
print(f"  Scores: {[f'{s:.3f}' for s in cv_scores_regular]}")
print(f"  Mean:   {cv_scores_regular.mean():.3f} ± {cv_scores_regular.std():.3f}")

print(f"\nStratified K-Fold Accuracy:")
print(f"  Scores: {[f'{s:.3f}' for s in cv_scores_stratified]}")
print(f"  Mean:   {cv_scores_stratified.mean():.3f} ± {cv_scores_stratified.std():.3f}")

print(f"\n📈 Multiple Metrics (Stratified K-Fold):")
print(f"="*70)
for metric in scoring:
    test_scores = cv_results[f'test_{metric}']
    train_scores = cv_results[f'train_{metric}']
    print(f"\n{metric.upper():12s}:")
    print(f"  Train: {train_scores.mean():.3f} ± {train_scores.std():.3f}")
    print(f"  Test:  {test_scores.mean():.3f} ± {test_scores.std():.3f}")
    
    # Check for overfitting
    gap = train_scores.mean() - test_scores.mean()
    if gap > 0.1:
        print(f"  ⚠️  Large train-test gap ({gap:.3f}) → Possible overfitting!")

# Visualize CV scores
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Box plot of scores
metrics_data = [cv_results[f'test_{m}'] for m in scoring]
axes[0].boxplot(metrics_data)  # tick labels are set via set_xticklabels below
axes[0].set_ylabel('Score', fontsize=12, fontweight='bold')
axes[0].set_title('Cross-Validation Score Distribution', fontsize=14, fontweight='bold', pad=15)
axes[0].grid(alpha=0.3, axis='y')
axes[0].set_xticklabels([m.upper().replace('_', ' ') for m in scoring], rotation=15, ha='right')

# Train vs Test comparison
x_pos = np.arange(len(scoring))
train_means = [cv_results[f'train_{m}'].mean() for m in scoring]
test_means = [cv_results[f'test_{m}'].mean() for m in scoring]

axes[1].bar(x_pos - 0.2, train_means, 0.4, label='Train', alpha=0.8, color='skyblue')
axes[1].bar(x_pos + 0.2, test_means, 0.4, label='Test', alpha=0.8, color='lightcoral')
axes[1].set_xticks(x_pos)
axes[1].set_xticklabels([m.upper().replace('_', ' ') for m in scoring], rotation=15, ha='right')
axes[1].set_ylabel('Mean Score', fontsize=12, fontweight='bold')
axes[1].set_title('Train vs Test Performance', fontsize=14, fontweight='bold', pad=15)
axes[1].legend(fontsize=11)
axes[1].grid(alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\n💡 Key Insights:")
print("  • Standard deviation tells us score stability")
print("  • Large train-test gap indicates overfitting")
print("  • Always use stratified CV for classification!")

## 📋 Part 5: Bias-Variance Tradeoff - Understanding Model Complexity

### The Fundamental Tradeoff

**Every model** faces this dilemma:

#### High Bias (Underfitting)
- Model too simple
- Doesn't capture patterns
- Poor on training AND test data
- Example: Linear model for non-linear data

#### High Variance (Overfitting)
- Model too complex
- Memorizes training data (including noise)
- Great on training, poor on test data
- Example: Deep decision tree on small dataset

#### The Sweet Spot
- Model complexity just right
- Captures true patterns, ignores noise
- Good on both training and test data

**Mathematically**:
```
Total Error = Bias² + Variance + Irreducible Error
```

**The Tradeoff**:
- Decrease bias → Typically increases variance
- Decrease variance → Typically increases bias
- Goal: Minimize total error
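
The decomposition can be checked numerically on a deliberately simple problem: estimating a fixed mean from noisy samples, once with the plain sample mean and once with a shrunken (biased) version. Everything in this sketch is an assumed toy setup, namely the constants, the shrink factor, and the helper `simulate`. Because the target is a fixed number rather than a noisy label, the irreducible-error term is zero here.

```python
import random
import statistics

TRUE_MEAN = 1.0    # the quantity being estimated (assumed)
NOISE_STD = 2.0    # spread of individual observations (assumed)
SAMPLE_SIZE = 10   # observations per simulated dataset
N_TRIALS = 5000    # number of simulated datasets

def simulate(shrink, seed=0):
    """Return (bias^2, variance, mse) of the estimator shrink * sample_mean."""
    rng = random.Random(seed)  # same seed, so both estimators see the same datasets
    estimates = []
    for _ in range(N_TRIALS):
        sample = [rng.gauss(TRUE_MEAN, NOISE_STD) for _ in range(SAMPLE_SIZE)]
        estimates.append(shrink * statistics.fmean(sample))
    bias_sq = (statistics.fmean(estimates) - TRUE_MEAN) ** 2
    variance = statistics.pvariance(estimates)
    mse = statistics.fmean([(e - TRUE_MEAN) ** 2 for e in estimates])
    return bias_sq, variance, mse

plain = simulate(shrink=1.0)   # unbiased sample mean
shrunk = simulate(shrink=0.8)  # biased toward zero, but less variable

for name, (bias_sq, variance, mse) in [("plain", plain), ("shrunk", shrunk)]:
    print(f"{name:6s} bias^2={bias_sq:.3f}  variance={variance:.3f}  "
          f"bias^2+variance={bias_sq + variance:.3f}  mse={mse:.3f}")
```

In theory the plain estimator has bias² = 0 and variance = 2²/10 = 0.4, while the shrunken one has bias² = 0.04 and variance = 0.256. The simulated values should land near these, illustrating that accepting a little bias can lower total error.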

In [None]:
# Demonstrate bias-variance tradeoff with learning curves
from sklearn.model_selection import learning_curve

# Function to plot learning curves
def plot_learning_curve(estimator, X, y, title, ylim=None):
    """
    Generate learning curve plot showing training and validation scores
    vs training set size.
    """
    train_sizes, train_scores, val_scores = learning_curve(
        estimator, X, y, cv=5, n_jobs=-1,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='accuracy',
        random_state=42
    )
    
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    val_mean = np.mean(val_scores, axis=1)
    val_std = np.std(val_scores, axis=1)
    
    plt.figure(figsize=(10, 6))
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std,
                    alpha=0.2, color='blue')
    plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std,
                    alpha=0.2, color='orange')
    plt.plot(train_sizes, train_mean, 'o-', color='blue', linewidth=2.5,
            label='Training score')
    plt.plot(train_sizes, val_mean, 'o-', color='orange', linewidth=2.5,
            label='Cross-validation score')
    
    plt.xlabel('Training Set Size', fontsize=13, fontweight='bold')
    plt.ylabel('Accuracy Score', fontsize=13, fontweight='bold')
    plt.title(title, fontsize=15, fontweight='bold', pad=15)
    plt.legend(loc='best', fontsize=12)
    plt.grid(alpha=0.3)
    
    if ylim:
        plt.ylim(ylim)
    
    # Analyze the gap
    final_gap = train_mean[-1] - val_mean[-1]
    if final_gap > 0.1:
        plt.text(train_sizes[-1]*0.5, train_mean[-1]*0.95,
                f'⚠️ High Variance\n(Gap: {final_gap:.2%})',
                bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.7),
                fontsize=11, fontweight='bold')
    elif val_mean[-1] < 0.7:
        plt.text(train_sizes[-1]*0.5, val_mean[-1]*1.1,
                f'⚠️ High Bias\n(Low score: {val_mean[-1]:.2%})',
                bbox=dict(boxstyle='round', facecolor='lightcoral', alpha=0.7),
                fontsize=11, fontweight='bold')
    else:
        plt.text(train_sizes[-1]*0.5, val_mean[-1]*1.05,
                f'✅ Good Balance\n(Gap: {final_gap:.2%})',
                bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.7),
                fontsize=11, fontweight='bold')
    
    plt.tight_layout()
    return train_mean, val_mean, final_gap

# Compare three models with different complexity
print("\n🔬 Analyzing Bias-Variance Tradeoff...\n")

# High bias model (too simple)
simple_model = DecisionTreeClassifier(max_depth=2, random_state=42)
train_m, val_m, gap = plot_learning_curve(
    simple_model, X_train_scaled, y_train,
    'High Bias: Shallow Decision Tree (max_depth=2)'
)
plt.show()
print(f"Simple Model - Train: {train_m[-1]:.3f}, Val: {val_m[-1]:.3f}, Gap: {gap:.3f}")
print("→ If both scores plateau at a low level, that indicates UNDERFITTING (high bias)\n")

# Balanced model
balanced_model = DecisionTreeClassifier(max_depth=5, random_state=42)
train_m, val_m, gap = plot_learning_curve(
    balanced_model, X_train_scaled, y_train,
    'Balanced: Medium Decision Tree (max_depth=5)'
)
plt.show()
print(f"Balanced Model - Train: {train_m[-1]:.3f}, Val: {val_m[-1]:.3f}, Gap: {gap:.3f}")
print("→ A small gap between train and validation scores indicates a GOOD FIT\n")

# High variance model (too complex)
complex_model = DecisionTreeClassifier(max_depth=20, random_state=42)
train_m, val_m, gap = plot_learning_curve(
    complex_model, X_train_scaled, y_train,
    'High Variance: Deep Decision Tree (max_depth=20)'
)
plt.show()
print(f"Complex Model - Train: {train_m[-1]:.3f}, Val: {val_m[-1]:.3f}, Gap: {gap:.3f}")
print("→ A large gap between train and validation scores indicates OVERFITTING (high variance)\n")

print("\n💡 Learning Curve Interpretation:")
print("="*60)
print("📈 Both curves converge at high level → Good model")
print("📉 Both curves converge at low level → Underfitting (add complexity)")
print("📊 Large gap between curves → Overfitting (reduce complexity/add data)")
print("🔄 Curves haven't converged → Need more data")

## 📋 Part 6: Business Metric Selection - Choosing What Matters

### The Business Context Matters!

**Same problem, different metrics based on business needs:**

#### Case 1: Medical Diagnosis (Cancer Screening)
**Priority**: Don't miss actual cancers (minimize False Negatives)

**Best metric**: **Recall (Sensitivity)**
- Better to have false alarms than miss cancer
- False positives → extra tests (acceptable)
- False negatives → missed treatment (catastrophic)
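When recall is the priority, a common lever is the decision threshold: lowering it flags more cases as positive, which catches more true cancers at the price of more false alarms. The sketch below illustrates this with made-up labels and scores; the arrays and the helper `metrics_at` are hypothetical and not part of the chapter's dataset.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels (1 = cancer) and predicted probabilities, for illustration only
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.90, 0.70, 0.45, 0.30, 0.60, 0.40, 0.35, 0.20, 0.10, 0.05])

def metrics_at(threshold):
    """Return (recall, precision) when scores >= threshold are labelled positive."""
    y_pred = (y_score >= threshold).astype(int)
    recall = recall_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, zero_division=0)
    return recall, precision

# Compare the default threshold with a lower, recall-friendly one
for t in (0.5, 0.25):
    rec, prec = metrics_at(t)
    print(f"threshold={t:.2f}  recall={rec:.2f}  precision={prec:.2f}")
```

With these made-up numbers, moving the threshold from 0.5 to 0.25 lifts recall from 0.50 to 1.00 while precision drops from 0.67 to 0.57, which is the trade a screening test is usually willing to make.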

---

#### Case 2: Spam Detection
**Priority**: Don't mark legitimate emails as spam (minimize False Positives)

**Best metric**: **Precision**
- Missing some spam is annoying
- Blocking important email is unacceptable
- Better to let spam through than block legitimate email
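One way to act on this priority is to raise the decision threshold until precision reaches a target, then accept whatever recall is left. The following is a minimal sketch under stated assumptions: the labels and scores are invented, the 95% precision target is arbitrary, and `precision_recall_at` and `threshold_for_precision` are hypothetical helper names.

```python
import numpy as np

# Hypothetical labels (1 = spam) and spam probabilities, for illustration only
y_true = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0, 0])
y_score = np.array([0.95, 0.90, 0.80, 0.70, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10])

def precision_recall_at(threshold):
    """Compute precision and recall when scores >= threshold are flagged as spam."""
    y_pred = y_score >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn)
    return precision, recall

def threshold_for_precision(target):
    """Lowest observed score that meets the precision target.

    Recall can only fall as the threshold rises, so the lowest qualifying
    threshold keeps as much recall as possible.
    """
    for t in np.sort(np.unique(y_score)):  # ascending: lowest threshold first
        precision, recall = precision_recall_at(t)
        if precision >= target:
            return t, precision, recall
    return None  # no threshold reaches the target

threshold, precision, recall = threshold_for_precision(0.95)
print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

On this toy data the search settles on a threshold of 0.70, where no legitimate email is blocked but one in five spam messages slips through.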

---

#### Case 3: Fraud Detection
**Priority**: Balance both (catch fraud without annoying customers)

**Best metric**: **F1-Score** or **Precision-Recall AUC**
- Need to catch fraud (recall)
- Can't flag legitimate transactions (precision)
- Both matter equally
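To make the two suggested metrics concrete, the sketch below computes F1 at a single threshold and average precision, a common summary of the precision-recall curve, across all thresholds. The fraud labels and scores are invented for illustration, and treating average precision as the "Precision-Recall AUC" is an assumption about which summary is meant.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, f1_score,
                             precision_score, recall_score)

# Hypothetical fraud labels (1 = fraud) and model scores, for illustration only
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_score = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.55, 0.60, 0.40, 0.70, 0.90])
y_pred = (y_score >= 0.5).astype(int)

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# F1 is the harmonic mean of precision and recall at one fixed threshold
f1 = f1_score(y_true, y_pred)
f1_by_hand = 2 * precision * recall / (precision + recall)

# Average precision summarises the precision-recall curve over all thresholds
pr_auc = average_precision_score(y_true, y_score)

print(f"precision={precision:.3f}  recall={recall:.3f}")
print(f"F1={f1:.3f} (by hand: {f1_by_hand:.3f})  average precision={pr_auc:.3f}")
```

F1 depends on the threshold you pick, while average precision describes the ranking quality of the scores themselves, so the two can disagree about which model is better.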

---

#### Case 4: Recommendation Systems
**Priority**: User engagement and satisfaction

**Best metric**: **Business KPIs**
- Click-through rate (CTR)
- Time on platform
- Revenue per user
- Traditional ML metrics are secondary!

### The Decision Framework

| Scenario | False Positive Cost | False Negative Cost | Best Metric |
|----------|-------------------|-------------------|-------------|
| Cancer screening | Low (extra tests) | **Very High** (death) | **Recall** |
| Spam filter | **High** (missed email) | Low (see spam) | **Precision** |
| Fraud detection | Medium (customer friction) | Medium (lost money) | **F1-Score** |
| Loan approval (positive = approve) | **High** (default loss) | Low (missed profit) | **Precision** |
| Marketing campaign | Low (wasted ad spend) | Low (missed sale) | **Balanced** |

**Key Questions to Ask**:
1. What's the cost of a false positive?
2. What's the cost of a false negative?
3. Which error is more acceptable?
4. What does the business actually care about?
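If you can put rough numbers on the first two questions, the threshold choice becomes a small optimisation: pick the threshold with the lowest total error cost. This is only a sketch; the labels, scores, cost values, candidate thresholds, and the helper names `total_cost` and `best_threshold` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and scores, for illustration only
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.90, 0.70, 0.45, 0.30, 0.60, 0.40, 0.35, 0.20, 0.10, 0.05])

def total_cost(threshold, cost_fp, cost_fn):
    """Total cost of errors at a threshold, given assumed per-error costs."""
    y_pred = (y_score >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fp * cost_fp + fn * cost_fn

def best_threshold(cost_fp, cost_fn, candidates=(0.25, 0.5, 0.75)):
    """Return the candidate threshold with the lowest total cost."""
    costs = [total_cost(t, cost_fp, cost_fn) for t in candidates]
    return candidates[int(np.argmin(costs))]

# Screening-style costs: a missed case is 50x worse than a false alarm
print("FN-heavy costs ->", best_threshold(cost_fp=1, cost_fn=50))
# Spam-filter-style costs: a blocked legitimate email is 50x worse than missed spam
print("FP-heavy costs ->", best_threshold(cost_fp=50, cost_fn=1))
```

The same scores lead to a low threshold when false negatives dominate the cost and a high one when false positives do, which is the decision framework above expressed in code.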

In [None]:
# Demonstrate metric selection for different scenarios

def evaluate_for_scenario(y_true, y_pred, y_proba, scenario_name, primary_metric):
    """
    Evaluate model focusing on the metric that matters for the scenario
    """
    print(f"\n{'='*70}")
    print(f"📋 Scenario: {scenario_name}")
    print(f"🎯 Primary Metric: {primary_metric}")
    print(f"{'='*70}")
    
    # Calculate all metrics
    acc = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred, zero_division=0)
    rec = recall_score(y_true, y_pred, zero_division=0)
    f1 = f1_score(y_true, y_pred, zero_division=0)
    auc = roc_auc_score(y_true, y_proba)
    
    # Confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()
    
    # Print with emphasis on primary metric
    metrics = {
        'Accuracy': acc,
        'Precision': prec,
        'Recall': rec,
        'F1-Score': f1,
        'ROC-AUC': auc
    }
    
    for metric_name, value in metrics.items():
        marker = "⭐" if metric_name == primary_metric else "  "
        print(f"{marker} {metric_name:12s}: {value:.3f}")
    
    print(f"\n📊 Error Analysis:")
    print(f"   False Positives: {fp:4d}")
    print(f"   False Negatives: {fn:4d}")
    
    return metrics

# Train model once
model = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=5)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
y_proba = model.predict_proba(X_test_scaled)[:, 1]

# Evaluate for different scenarios
scenarios = [
    ("Cancer Screening (Minimize Missed Cases)", "Recall"),
    ("Spam Filter (Minimize False Alarms)", "Precision"),
    ("Fraud Detection (Balance Both Errors)", "F1-Score"),
    ("General Classification", "ROC-AUC")
]

results = []
for scenario, metric in scenarios:
    result = evaluate_for_scenario(y_test, y_pred, y_proba, scenario, metric)
    results.append(result)

print("\n\n💡 Key Takeaway:")
print("=" * 70)  # separator matches the 70-character rules used above
print("The SAME model can be 'good' or 'bad' depending on:")
print("  1. What metric you prioritize")
print("  2. The business context and costs")
print("  3. Your tolerance for different error types")
print("\n⭐ Always align metrics with business objectives!")

## 🎯 Exercise 1: Real-World Metric Selection

**Objective**: Practice choosing the right metric for different business problems

**Scenarios**: For each scenario below, identify:
1. The primary metric to optimize
2. Why that metric matters most
3. What threshold adjustment might be needed

**Scenario A: Email Phishing Detection**
- Detect phishing emails in corporate inbox
- False positive: Legitimate email blocked (might miss important business communication)
- False negative: Phishing email delivered (employee might get compromised)

**Scenario B: Credit Card Fraud Real-Time Detection**
- Block suspicious transactions instantly
- False positive: Legitimate purchase declined (customer frustrated)
- False negative: Fraudulent charge goes through (bank loses money)

**Scenario C: Automated Resume Screening**
- Filter candidates for interviews
- False positive: Unqualified candidate interviewed (wasted time)
- False negative: Qualified candidate rejected (lose talent)

<details>
<summary>💡 Hint: Think about business impact</summary>

Consider:
- Which error is more costly?
- Can you recover from the error?
- What's the user experience impact?
</details>

**Your Analysis**:

In [None]:
# Fill in your analysis
my_analysis = {
    'Scenario A - Phishing Detection': {
        'Primary Metric': '',  # Precision, Recall, F1, etc.
        'Reasoning': '',
        'Threshold Adjustment': ''  # Higher, lower, or balanced?
    },
    'Scenario B - Fraud Detection': {
        'Primary Metric': '',
        'Reasoning': '',
        'Threshold Adjustment': ''
    },
    'Scenario C - Resume Screening': {
        'Primary Metric': '',
        'Reasoning': '',
        'Threshold Adjustment': ''
    }
}

# Print your analysis
for scenario, analysis in my_analysis.items():
    print(f"\n{scenario}")
    print("="*60)
    for key, value in analysis.items():
        print(f"{key}: {value}")

## 🎯 Exercise 2: Cross-Validation Deep Dive

**Objective**: Master cross-validation and detect overfitting

**Task**:
1. Create a dataset with 500 samples
2. Train three models with different complexities:
   - Simple: DecisionTreeClassifier(max_depth=3)
   - Medium: DecisionTreeClassifier(max_depth=8)
   - Complex: DecisionTreeClassifier(max_depth=None)
3. Use 5-fold cross-validation to evaluate each model
4. Plot training vs validation scores for each model
5. Identify which model is:
   - Underfitting
   - Well-fitted
   - Overfitting

<details>
<summary>💡 Hint: Detecting overfitting</summary>

Overfitting indicators:
- High training score, low validation score
- Large gap between training and validation
- Validation score drops as model complexity increases
</details>

**Bonus Challenge**: 
- Create learning curves for each model
- Determine if more data would help each model

In [None]:
# Your code here!
# Implement cross-validation comparison






## 🎓 Key Takeaways

You've mastered model evaluation and metrics!

- ✅ **Confusion Matrix Fundamentals**:
  - TP, TN, FP, FN form the foundation of all classification metrics
  - Understanding error types is crucial for metric selection
  - Visualize confusion matrices to understand model behavior

- ✅ **Classification Metrics**:
  - **Accuracy**: Reliable only when classes are roughly balanced; misleading on imbalanced data
  - **Precision**: Minimize false positives ("When I say yes, I'm right")
  - **Recall**: Minimize false negatives ("I catch most of the positives")
  - **F1-Score**: Harmonic mean balancing precision and recall
  - Each metric serves a different business need!

- ✅ **ROC Curves & AUC**:
  - Threshold-independent evaluation
  - AUC summarizes performance across all thresholds
  - Useful for comparing models without committing to a single threshold
  - Use precision-recall curves for imbalanced data

- ✅ **Cross-Validation**:
  - A single train/test split gives a noisy performance estimate
  - K-fold CV provides more robust performance estimates
  - Stratified CV maintains class distribution in every fold
  - Always report mean ± standard deviation

- ✅ **Bias-Variance Tradeoff**:
  - Underfitting (high bias): Model too simple
  - Overfitting (high variance): Model too complex
  - Learning curves reveal which problem you have
  - Balance comes from proper model complexity

- ✅ **Business Metric Selection**:
  - **Context matters more than the metric itself**
  - Align metrics with business costs and objectives
  - Different scenarios need different metrics
  - Always ask: "What does the business actually care about?"
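As a compact recap of how the classification metrics follow from the four confusion-matrix counts, here is a small sketch applied to the spam example from the start of the chapter. The helper `classification_metrics` and the concrete counts are assumptions: 2,000 emails were chosen so that the counts reproduce the stated 95% and 88% accuracies and 85% spam caught, while Model B's false-positive count is simply whatever makes those figures add up.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive the core metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for 2,000 emails, 5% of them spam (positive class = spam)
lazy = classification_metrics(tp=0, fp=0, fn=100, tn=1900)      # predicts "legitimate" always
smart = classification_metrics(tp=85, fp=225, fn=15, tn=1675)   # actually tries to detect spam

for name, m in [("Model A (lazy)", lazy), ("Model B (smart)", smart)]:
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Model A wins on accuracy yet scores zero on recall and F1, while Model B's lower accuracy hides the fact that it catches 85% of spam.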

### 🤔 The Big Picture:

**Model Evaluation Workflow**:
1. ✅ Understand the business problem and costs
2. ✅ Choose appropriate metrics (not just accuracy!)
3. ✅ Use cross-validation for robust estimates
4. ✅ Check for overfitting/underfitting with learning curves
5. ✅ Tune decision threshold based on business needs
6. ✅ Monitor performance on production data
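Steps 2 to 4 of this workflow can be sketched in a few lines with scikit-learn's `cross_validate`, which scores several metrics per fold and can also return training scores. The synthetic dataset, the logistic regression model, and the chosen metric list are assumptions for illustration, not the chapter's own setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced data (roughly 10% positives); an assumption for illustration
X, y = make_classification(n_samples=500, n_features=10, weights=[0.9, 0.1],
                           random_state=42)

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metric_names = ["accuracy", "precision", "recall", "f1"]
scores = cross_validate(
    LogisticRegression(max_iter=1000), X, y, cv=cv,
    scoring=metric_names,
    return_train_score=True,  # lets us compare train vs validation per metric
)

# Report validation mean ± std, with the training mean alongside to expose any gap
for name in metric_names:
    train, val = scores[f"train_{name}"], scores[f"test_{name}"]
    print(f"{name:9s} train={train.mean():.3f}  "
          f"val={val.mean():.3f} ± {val.std():.3f}")
```

A large difference between the train and validation columns points to overfitting, and a wide standard deviation is a reminder that a single split could have told a misleading story.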

**Remember**: 
> "A model optimizing the wrong metric is worse than no model at all!"

**Always**:
- Question if accuracy is meaningful for your data
- Understand the cost of different errors
- Use multiple metrics to get the complete picture
- Validate properly before deployment

## 📖 Further Learning

**Recommended Reading**:
- [Scikit-learn Metrics Guide](https://scikit-learn.org/stable/modules/model_evaluation.html) - Comprehensive documentation
- [ROC Curves Explained](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc) - Google's ML Crash Course
- [Cross-Validation Guide](https://scikit-learn.org/stable/modules/cross_validation.html) - Official scikit-learn tutorial

**Video Tutorials**:
- [StatQuest: ROC and AUC](https://www.youtube.com/watch?v=4jRBRDbJemM) - Crystal clear explanation
- [Precision and Recall](https://www.youtube.com/watch?v=jJ7ff7Gcq34) - Visual walkthrough
- [Cross-Validation](https://www.youtube.com/watch?v=fSytzGwwBVw) - Practical demonstration

**Deep Dives**:
- [Imbalanced Learning](https://imbalanced-learn.org/stable/) - Handling imbalanced datasets
- [Matthews Correlation Coefficient](https://en.wikipedia.org/wiki/Matthews_correlation_coefficient) - Alternative to F1
- [Calibration](https://scikit-learn.org/stable/modules/calibration.html) - Probability calibration

**Interactive Tools**:
- [Visualizing DBSCAN Clustering](https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/) - Interactive clustering demo
- [Confusion Matrix Calculator](https://www.machinelearningplus.com/statistics/confusion-matrix-explained/) - Hands-on tool

**Research Papers**:
- [The Relationship Between Precision-Recall and ROC Curves](https://www.biostat.wisc.edu/~page/rocpr.pdf) - When to use which
- [A survey of cross-validation procedures](https://arxiv.org/abs/0907.4728) - Advanced techniques

**Case Studies**:
- [Netflix Prize](https://www.netflixprize.com/) - Real-world evaluation challenges
- [Kaggle Evaluation Metrics](https://www.kaggle.com/learn/intro-to-machine-learning) - Competition metrics

**Tools & Libraries**:
- [Yellowbrick](https://www.scikit-yb.org/) - Visual diagnostic tools
- [PyCaret](https://pycaret.org/) - Automated model evaluation
- [MLflow](https://mlflow.org/) - Experiment tracking

## ➡️ What's Next?

🎉 **Congratulations!** You've completed **Module 2: Machine Learning Fundamentals**!

You've mastered:
- ✅ Supervised learning (classification & regression)
- ✅ Unsupervised learning (clustering)
- ✅ Model evaluation & selection
- ✅ Core ML concepts and best practices

**In Module 3: Neural Networks Demystified**, you'll discover:

**Chapter 3.1 - Biological Neurons**:
- How the brain inspired artificial neural networks
- From biological neurons to mathematical models
- The perceptron: First artificial neuron

**Chapter 3.2 - Building Your First Neural Network**:
- Implementing neural networks from scratch
- Understanding forward propagation
- PyTorch basics and tensor operations

**Chapter 3.3 - Activation Functions & Backpropagation**:
- Why networks need non-linearity
- The mathematics of learning (chain rule!)
- Gradient descent and optimization

**Chapter 3.4 - Deep Learning Architectures**:
- CNNs for computer vision
- RNNs for sequences
- Modern architectures and design patterns

From classical ML to deep learning: the journey continues! 🧠

Ready to dive into neural networks? Open **[Chapter 3.1 - Biological Neurons](../module-3/3.1-biological-neurons.ipynb)**!

---

### 💬 Feedback & Community

**Questions?** Join our [Discord community](https://discord.gg/madeforai)

**Found a bug?** [Open an issue on GitHub](https://github.com/madeforai/madeforai/issues)

**Share your evaluation insights!** Tweet with #MadeForAI

**Keep learning!** 🚀