# Probability Calibration

## Overview

**Probability calibration** ensures that predicted probabilities match the observed frequency of the outcome. If a well-calibrated classifier assigns a probability of 0.8 to a set of cases, the event occurs in roughly 80% of them.

### Core Concept

*"Among all cases where a model predicts 70% probability, the event should occur about 70% of the time"*

### Key Ideas

1. **Well-calibrated**: Predicted probabilities match true frequencies
2. **Poorly-calibrated**: Probabilities are too confident or not confident enough
3. **Calibration methods**: Apply a monotonic transform to the probabilities, so rankings are preserved (up to ties)
4. **When it matters**: Decision-making, risk assessment, probability-based systems
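
The third idea can be checked numerically. The sketch below assumes a hypothetical strictly increasing map, `sharpen`, standing in for a fitted calibrator, and computes AUC by direct pair counting before and after the transformation. Because the map preserves order, the AUC stays the same even though the probabilities themselves change.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly (ties count 0.5)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

def sharpen(p, k=3.0):
    """Hypothetical strictly increasing map on (0, 1): scales the log-odds by k."""
    logit = np.log(p / (1 - p))
    return 1 / (1 + np.exp(-k * logit))

rng = np.random.default_rng(0)
p_raw = rng.uniform(0.05, 0.95, size=500)   # synthetic predicted probabilities
y = rng.binomial(1, p_raw)                  # synthetic labels

p_new = sharpen(p_raw)
auc_before = pairwise_auc(y, p_raw)
auc_after = pairwise_auc(y, p_new)
print(f"AUC before: {auc_before:.4f}, after: {auc_after:.4f}")
print(f"Max change in probability: {np.max(np.abs(p_new - p_raw)):.3f}")
```

Ranking metrics such as AUC are therefore blind to this kind of change, which is why calibration needs its own diagnostics.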

## Mathematical Foundation

### Perfect Calibration

For a perfectly calibrated classifier:

\[
P(y=1 | \hat{p} = p) = p
\]

**Example**: Among all predictions where the model outputs 0.6, 60% should belong to the positive class (approximately, on any finite sample).
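
As a rough numerical check of this definition, the sketch below simulates a classifier that is calibrated by construction (each outcome is drawn with exactly the predicted probability) and measures the positive rate among predictions equal to 0.6. The set of prediction values, the sample size, and the seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Predictions take one of a few fixed values; outcomes are drawn with that
# probability, so the simulated model is calibrated by construction.
p_hat = rng.choice([0.2, 0.6, 0.9], size=30_000)
y = rng.binomial(1, p_hat)

mask = p_hat == 0.6
freq_at_06 = y[mask].mean()
print(f"Predictions equal to 0.6: {mask.sum()}")
print(f"Observed positive rate:   {freq_at_06:.3f}")
```

The observed rate lands close to 0.6 but not exactly on it, which is the finite-sample noise any calibration check has to tolerate.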

### Calibration Curve (Reliability Diagram)

Plot predicted probabilities against empirical frequencies:
1. Bin predictions by probability (e.g., [0, 0.1), [0.1, 0.2), ...)
2. Calculate the fraction of positives in each bin
3. Plot: x-axis = mean predicted probability, y-axis = fraction of positives

**Perfect calibration**: Points lie on diagonal (y = x)
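
The three steps above can be sketched by hand with NumPy. The helper `reliability_bins` is a hypothetical name introduced here for illustration; it uses equal-width bins and skips empty ones, in the spirit of (but not claiming to reproduce exactly) what a library routine such as `calibration_curve` does. The data are synthetic and calibrated by construction.

```python
import numpy as np

def reliability_bins(y_true, y_prob, n_bins=10):
    """Return (mean predicted probability, fraction of positives, count) per non-empty bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitizing against the interior edges gives a bin index in 0..n_bins-1
    bin_ids = np.digitize(y_prob, edges[1:-1])
    mean_pred, frac_pos, counts = [], [], []
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            mean_pred.append(y_prob[in_bin].mean())
            frac_pos.append(y_true[in_bin].mean())
            counts.append(int(in_bin.sum()))
    return np.array(mean_pred), np.array(frac_pos), np.array(counts)

rng = np.random.default_rng(0)
y_prob = rng.uniform(0, 1, size=20_000)
y_true = rng.binomial(1, y_prob)  # calibrated by construction

mean_pred, frac_pos, counts = reliability_bins(y_true, y_prob, n_bins=10)
for x, frac, n in zip(mean_pred, frac_pos, counts):
    print(f"mean predicted = {x:.3f}   fraction of positives = {frac:.3f}   n = {n}")
```

Plotting `mean_pred` against `frac_pos` gives the reliability diagram; the per-bin counts are worth keeping, since sparsely populated bins produce noisy points.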

### Brier Score

**Measures accuracy and calibration** of probabilistic predictions:

\[
\text{Brier Score} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i)^2
\]

- Lower is better (0 = perfect)
- Range: [0, 1] for binary outcomes
- Decomposes into reliability (calibration), resolution, and uncertainty terms
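
A small worked example may make the formula concrete. The five labels and the three sets of predictions below are made up for illustration: one confident and mostly right, one that always says 0.5, and one confident and mostly wrong.

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])
p_sharp = np.array([0.9, 0.1, 0.8, 0.7, 0.2])   # confident and mostly right
p_vague = np.full(5, 0.5)                        # always predicts 0.5
p_wrong = 1.0 - p_sharp                          # confident and mostly wrong

def brier(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return float(np.mean((y_prob - y_true) ** 2))

for name, p in [("sharp", p_sharp), ("vague", p_vague), ("wrong", p_wrong)]:
    print(f"{name:>5}: {brier(y, p):.3f}")
```

The scores come out as 0.038, 0.250, and 0.678: always predicting 0.5 yields 0.25, and being confidently wrong is penalized far more than being vague.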

### Calibration Methods

#### 1. Platt Scaling (Sigmoid)

Fit logistic regression on classifier outputs:

\[
P_{\text{calibrated}} = \frac{1}{1 + \exp(A \cdot f(x) + B)}
\]

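where \(f(x)\) is the classifier's raw output and \(A\), \(B\) are scalar parameters fit by maximum likelihood on held-out calibration data. With this sign convention \(A\) is typically negative, so higher scores map to higher probabilities.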

**Properties**:
- Parametric (assumes sigmoid relationship)
- Works well for small datasets
- Good for SVM, boosted trees
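
To illustrate the fitting step, here is a minimal sketch that estimates \(A\) and \(B\) by plain gradient descent on the log loss. This is a simplification: Platt's original procedure also smooths the targets and uses a more robust optimizer. The function name `fit_platt`, the learning rate, and the synthetic scores are assumptions made for this example.

```python
import numpy as np

def fit_platt(scores, y, lr=0.5, n_iter=3000):
    """Fit A, B in p = 1 / (1 + exp(A * f + B)) by gradient descent on the log loss."""
    A, B = 0.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(A * scores + B))
        # d(log loss)/dA = mean((y - p) * f), d(log loss)/dB = mean(y - p)
        A -= lr * np.mean((y - p) * scores)
        B -= lr * np.mean(y - p)
    return A, B

rng = np.random.default_rng(0)
scores = rng.normal(0.0, 2.0, size=5000)       # hypothetical raw classifier outputs f(x)
true_p = 1.0 / (1.0 + np.exp(-scores))         # ground truth: sigmoid of the score
y = rng.binomial(1, true_p)

A, B = fit_platt(scores, y)
p_cal = 1.0 / (1.0 + np.exp(A * scores + B))   # calibrated probabilities
print(f"A = {A:.3f}, B = {B:.3f}")
```

Since the labels were generated from a sigmoid of the score, the fit should recover values near \(A = -1\) and \(B = 0\); the negative \(A\) reflects the sign convention in the formula above.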

#### 2. Isotonic Regression

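Fits a non-parametric, piecewise-constant, monotonic function. With samples sorted by classifier score, \(f(x_1) \leq f(x_2) \leq \cdots \leq f(x_n)\), the calibrated values solve:

\[
\hat{z} = \arg\min_z \sum_i (z_i - y_i)^2 \quad \text{subject to} \quad z_1 \leq z_2 \leq \cdots \leq z_n
\]

where \(\hat{z}_i\) is the calibrated probability assigned to sample \(i\).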

**Properties**:
- Non-parametric (more flexible)
- Needs more data (prone to overfitting)
- Better for tree-based models
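
The constrained least-squares problem above is commonly solved with the Pool Adjacent Violators (PAV) algorithm. The sketch below is a bare-bones version written for illustration (the function name `pav` and the toy labels are assumptions); it takes labels already sorted by increasing classifier score and merges neighbouring blocks whenever their means violate the ordering.

```python
import numpy as np

def pav(y_sorted):
    """Pool Adjacent Violators: least-squares non-decreasing fit to y_sorted."""
    # Each block holds [sum of values, number of values]
    blocks = []
    for value in y_sorted:
        blocks.append([float(value), 1])
        # Merge backwards while the previous block's mean exceeds the last block's mean
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    return np.concatenate([np.full(count, total / count) for total, count in blocks])

# Labels listed in order of increasing classifier score (hypothetical data)
y_sorted = np.array([0, 1, 0, 1, 1])
z = pav(y_sorted)
print(z)
```

For these labels the fitted values are 0, 0.5, 0.5, 1, 1: the out-of-order pair (1, 0) in the middle is pooled into a single step at 0.5, which is where the piecewise-constant shape comes from.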

## Topics Covered

1. Understanding calibration and why it matters
2. Calibration curves (reliability diagrams)
3. Brier score and calibration metrics
4. Which models need calibration
5. Platt scaling (sigmoid method)
6. Isotonic regression
7. CalibratedClassifierCV in sklearn
8. Before/after calibration comparison
9. Best practices and guidelines

## Setup and Import

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from time import time
import warnings
warnings.filterwarnings('ignore')

# Calibration tools
from sklearn.calibration import (
    CalibratedClassifierCV, calibration_curve
)

# Models
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Utilities
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score, brier_score_loss, log_loss,
    roc_auc_score, classification_report
)
from sklearn.datasets import make_classification, load_breast_cancer

np.random.seed(42)
sns.set_style('whitegrid')
print("✓ Libraries imported successfully")

## 1. Understanding Calibration

### 1.1 What is a Well-Calibrated Model?

In [None]:
print("What is Calibration?")
print("="*70)
print("\nScenario: Medical diagnosis model\n")

# Simulate predictions
np.random.seed(42)
n_samples = 1000

# Well-calibrated model
print("WELL-CALIBRATED MODEL:")
print("-" * 70)
predicted_probs_good = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
empirical_probs_good = []  # observed frequency for each predicted value
for prob in predicted_probs_good:
    # Generate outcomes based on true probability
    n_pred = 200
    outcomes = np.random.binomial(1, prob, n_pred)
    empirical_prob = outcomes.mean()
    empirical_probs_good.append(empirical_prob)
    
    print(f"  Predicted: {prob:.1f} → Actual: {empirical_prob:.2f} ✓")

print("\n" + "="*70)

# Poorly-calibrated model (overconfident)
print("\nPOORLY-CALIBRATED MODEL (Overconfident):")
print("-" * 70)
predicted_probs_bad = np.array([0.0, 0.2, 0.5, 0.8, 1.0])
true_probs_bad = np.array([0.1, 0.3, 0.5, 0.7, 0.9])  # Actual probabilities
empirical_probs_bad = []  # observed frequency for each predicted value
for pred_prob, true_prob in zip(predicted_probs_bad, true_probs_bad):
    n_pred = 200
    outcomes = np.random.binomial(1, true_prob, n_pred)
    empirical_prob = outcomes.mean()
    empirical_probs_bad.append(empirical_prob)
    
    print(f"  Predicted: {pred_prob:.1f} → Actual: {empirical_prob:.2f} ✗")

print("\n" + "="*70)
print("\n💡 Key Insight:")
print("   Well-calibrated: Predicted probabilities match empirical frequencies")
print("   Overconfident: Predicts extreme probabilities (0 or 1) too often")
print("   Underconfident: Predictions too close to 0.5")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Well-calibrated: plot the simulated empirical frequencies, not the predictions themselves
axes[0].plot([0, 1], [0, 1], 'k--', label='Perfect calibration', linewidth=2)
axes[0].scatter(predicted_probs_good, empirical_probs_good, s=100, alpha=0.7, label='Model predictions')
axes[0].set_xlabel('Predicted Probability')
axes[0].set_ylabel('Empirical Probability')
axes[0].set_title('Well-Calibrated Model')
axes[0].legend()
axes[0].grid(alpha=0.3)
axes[0].set_xlim([0, 1])
axes[0].set_ylim([0, 1])

# Overconfident
axes[1].plot([0, 1], [0, 1], 'k--', label='Perfect calibration', linewidth=2)
axes[1].scatter(predicted_probs_bad, empirical_probs_bad, s=100, alpha=0.7,
               color='red', label='Model predictions')
axes[1].set_xlabel('Predicted Probability')
axes[1].set_ylabel('Empirical Probability')
axes[1].set_title('Poorly-Calibrated Model (Overconfident)')
axes[1].legend()
axes[1].grid(alpha=0.3)
axes[1].set_xlim([0, 1])
axes[1].set_ylim([0, 1])

plt.tight_layout()
plt.show()

## 2. Calibration Curves (Reliability Diagrams)

### 2.1 Comparing Different Models

In [None]:
# Load dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

print("Calibration Curves - Breast Cancer Dataset")
print("="*70)
print(f"Samples: {X.shape[0]}")
print(f"Features: {X.shape[1]}\n")

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train multiple models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Naive Bayes': GaussianNB(),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(probability=True, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42)
}

print("Training models...\n")
predictions = {}

for name, model in models.items():
    print(f"Training {name}...")
    model.fit(X_train_scaled, y_train)
    
    # Get probability predictions
    y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
    y_pred = (y_pred_proba >= 0.5).astype(int)
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    brier = brier_score_loss(y_test, y_pred_proba)
    
    predictions[name] = {
        'proba': y_pred_proba,
        'accuracy': accuracy,
        'brier': brier
    }
    
    print(f"  Accuracy: {accuracy:.4f}, Brier Score: {brier:.4f}")

print("\n" + "="*70)

In [None]:
# Plot calibration curves
print("\nGenerating Calibration Curves...")

fig, ax = plt.subplots(figsize=(10, 8))

# Perfect calibration line
ax.plot([0, 1], [0, 1], 'k--', label='Perfect calibration', linewidth=2)

# Plot each model
for name, pred_dict in predictions.items():
    y_pred_proba = pred_dict['proba']
    
    # Calculate calibration curve
    fraction_of_positives, mean_predicted_value = calibration_curve(
        y_test, y_pred_proba, n_bins=10, strategy='uniform'
    )
    
    # Plot
    ax.plot(mean_predicted_value, fraction_of_positives, 'o-', 
           label=f"{name} (Brier: {pred_dict['brier']:.3f})", linewidth=2, markersize=8)

ax.set_xlabel('Mean Predicted Probability', fontsize=12)
ax.set_ylabel('Fraction of Positives', fontsize=12)
ax.set_title('Calibration Curves (Reliability Diagrams)', fontsize=14)
ax.legend(loc='upper left')
ax.grid(alpha=0.3)
ax.set_xlim([0, 1])
ax.set_ylim([0, 1])

plt.tight_layout()
plt.show()

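print("\n💡 Interpreting Calibration Curves:")
print("   On diagonal (y=x): Perfectly calibrated")
print("   Above diagonal: Model under-predicts (predicted probability lower than observed frequency)")
print("   Below diagonal: Model over-predicts (predicted probability higher than observed frequency)")
print("   Overconfident models sit below the diagonal at high probabilities and above it at low ones;")
print("   underconfident models show the reverse (sigmoid-shaped) pattern")
print("   \n   Typical patterns (check against the plot above):")
print("   - Logistic Regression: Usually well-calibrated")
print("   - Naive Bayes: Often pushes probabilities to extremes")
print("   - Random Forest: Tends to be underconfident")
print("   - SVM: Raw margins are not probabilities; SVC(probability=True) already applies Platt scaling internally")
print("   - Gradient Boosting: Can be overconfident")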
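
To see which side of the diagonal each kind of error falls on, the sketch below distorts a set of true probabilities by scaling their log-odds. This is a hypothetical distortion chosen for illustration: a factor above 1 makes predictions more extreme (overconfident), a factor below 1 pulls them toward 0.5 (underconfident). It then reports the observed positive rate minus the mean prediction in a few probability ranges; a positive gap means the curve lies above the diagonal there.

```python
import numpy as np

rng = np.random.default_rng(1)
true_p = rng.uniform(0.02, 0.98, size=50_000)
y = rng.binomial(1, true_p)
logit = np.log(true_p / (1 - true_p))

def scale_confidence(k):
    """Hypothetical distortion: multiply the true log-odds by k (k > 1 sharpens, k < 1 flattens)."""
    return 1 / (1 + np.exp(-k * logit))

def gap(pred, lo, hi):
    """Observed positive rate minus mean prediction for predictions in [lo, hi)."""
    m = (pred >= lo) & (pred < hi)
    return y[m].mean() - pred[m].mean()

over = scale_confidence(2.0)    # too extreme: overconfident
under = scale_confidence(0.5)   # too close to 0.5: underconfident

print(f"overconfident,  predictions in [0.8, 1.0): gap = {gap(over, 0.8, 1.0):+.3f}")
print(f"overconfident,  predictions in [0.0, 0.2): gap = {gap(over, 0.0, 0.2):+.3f}")
print(f"underconfident, predictions in [0.6, 0.8): gap = {gap(under, 0.6, 0.8):+.3f}")
print(f"underconfident, predictions in [0.2, 0.4): gap = {gap(under, 0.2, 0.4):+.3f}")
```

The overconfident predictions fall below the diagonal at the high end and above it at the low end, while the underconfident ones do the opposite, so the side of the diagonal has to be read together with the probability range.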

## 3. Brier Score

### 3.1 Understanding and Decomposition

In [None]:
print("Brier Score Breakdown")
print("="*70)
print("Brier Score = mean((predicted_prob - actual_outcome)^2)")
print("  - Range: [0, 1]")
print("  - 0 = Perfect prediction")
print("  - Lower is better\n")

# Create comparison table
metrics_data = []
for name, pred_dict in predictions.items():
    metrics_data.append({
        'Model': name,
        'Accuracy': pred_dict['accuracy'],
        'Brier Score': pred_dict['brier'],
        'Log Loss': log_loss(y_test, pred_dict['proba'])
    })

metrics_df = pd.DataFrame(metrics_data)
metrics_df = metrics_df.sort_values('Brier Score')

print("Model Performance Comparison:")
print("="*70)
print(metrics_df.to_string(index=False))

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Brier Score
axes[0].barh(metrics_df['Model'], metrics_df['Brier Score'], alpha=0.7)
axes[0].set_xlabel('Brier Score (lower is better)')
axes[0].set_title('Brier Score Comparison')
axes[0].invert_yaxis()
axes[0].grid(alpha=0.3, axis='x')

# Accuracy
axes[1].barh(metrics_df['Model'], metrics_df['Accuracy'], alpha=0.7, color='green')
axes[1].set_xlabel('Accuracy (higher is better)')
axes[1].set_title('Accuracy Comparison')
axes[1].invert_yaxis()
axes[1].set_xlim([0.9, 1.0])
axes[1].grid(alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

print("\n💡 Note: High accuracy doesn't guarantee good calibration!")
print("   A model can be very accurate but have poorly calibrated probabilities")
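
The decomposition mentioned in this section's title can be sketched as follows. For forecasts that take a finite set of values, the Brier score equals reliability minus resolution plus uncertainty (Murphy's decomposition). The synthetic data below are an assumption for illustration and are deliberately miscalibrated so that the reliability term is visibly non-zero.

```python
import numpy as np

rng = np.random.default_rng(0)
# Forecasts restricted to a few distinct values so the decomposition is exact
p_hat = rng.choice([0.1, 0.3, 0.5, 0.7, 0.9], size=5000)
# Deliberately miscalibrated outcomes: the true rate is pulled toward 0.5
y = rng.binomial(1, 0.5 + 0.6 * (p_hat - 0.5))

base_rate = y.mean()
reliability = 0.0   # weighted squared gap between forecast and observed rate (lower is better)
resolution = 0.0    # weighted squared gap between observed rate and base rate (higher is better)
for value in np.unique(p_hat):
    group = p_hat == value
    weight = group.mean()
    observed = y[group].mean()
    reliability += weight * (value - observed) ** 2
    resolution += weight * (observed - base_rate) ** 2
uncertainty = base_rate * (1 - base_rate)

brier = np.mean((p_hat - y) ** 2)
print(f"Brier score:                            {brier:.4f}")
print(f"reliability - resolution + uncertainty: {reliability - resolution + uncertainty:.4f}")
print(f"reliability = {reliability:.4f}, resolution = {resolution:.4f}, uncertainty = {uncertainty:.4f}")
```

Calibration targets only the reliability term, which is one reason a model can post a respectable Brier score through high resolution while still being miscalibrated.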

## 4. Calibration Methods

### 4.1 Platt Scaling (Sigmoid)

In [None]:
print("Platt Scaling (Sigmoid Calibration)")
print("="*70)
print("Method: Fit sigmoid function to map uncalibrated scores to probabilities")
print("  P_calibrated = 1 / (1 + exp(A*f(x) + B))")
print("  \nBest for: SVM, boosted trees (sigmoid-shaped miscalibration)")
print("  Requirements: Works with smaller datasets\n")

# Calibrate Random Forest (tends to be underconfident)
print("Calibrating Random Forest with Platt Scaling...\n")

# cv='prefit' needs a calibration set the base model has never seen,
# so split the training data further
X_train_sub, X_calib, y_train_sub, y_calib = train_test_split(
    X_train_scaled, y_train, test_size=0.3, random_state=42
)

# Train the base model on the reduced training set
rf_original = RandomForestClassifier(n_estimators=100, random_state=42)
rf_original.fit(X_train_sub, y_train_sub)

# Calibrate using Platt scaling (sigmoid) on the held-out calibration set
# Note: newer scikit-learn releases deprecate cv='prefit' in favor of
# wrapping the fitted model in sklearn.frozen.FrozenEstimator
rf_calibrated_platt = CalibratedClassifierCV(
    rf_original,
    method='sigmoid',  # Platt scaling
    cv='prefit'  # Use already fitted model
)
rf_calibrated_platt.fit(X_calib, y_calib)

# Get predictions
y_pred_proba_original = rf_original.predict_proba(X_test_scaled)[:, 1]
y_pred_proba_calibrated = rf_calibrated_platt.predict_proba(X_test_scaled)[:, 1]

# Calculate Brier scores
brier_original = brier_score_loss(y_test, y_pred_proba_original)
brier_calibrated = brier_score_loss(y_test, y_pred_proba_calibrated)

print(f"Random Forest (Original):")
print(f"  Brier Score: {brier_original:.4f}")
print(f"\nRandom Forest (Platt Scaling):")
print(f"  Brier Score: {brier_calibrated:.4f}")
print(f"\nImprovement: {(brier_original - brier_calibrated):.4f} ({((brier_original - brier_calibrated)/brier_original*100):+.1f}%)")

# Plot calibration curves
fig, ax = plt.subplots(figsize=(10, 8))

ax.plot([0, 1], [0, 1], 'k--', label='Perfect calibration', linewidth=2)

# Original
fraction_pos_orig, mean_pred_orig = calibration_curve(y_test, y_pred_proba_original, n_bins=10)
ax.plot(mean_pred_orig, fraction_pos_orig, 'o-', label=f'Original RF (Brier: {brier_original:.3f})',
       linewidth=2, markersize=8)

# Calibrated
fraction_pos_cal, mean_pred_cal = calibration_curve(y_test, y_pred_proba_calibrated, n_bins=10)
ax.plot(mean_pred_cal, fraction_pos_cal, 's-', label=f'Platt Scaling (Brier: {brier_calibrated:.3f})',
       linewidth=2, markersize=8)

ax.set_xlabel('Mean Predicted Probability')
ax.set_ylabel('Fraction of Positives')
ax.set_title('Calibration: Before vs After Platt Scaling')
ax.legend()
ax.grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Compare the two curves: a curve closer to the diagonal and a lower Brier score indicate better calibration")

### 4.2 Isotonic Regression

In [None]:
print("Isotonic Regression Calibration")
print("="*70)
print("Method: Fit non-parametric, monotonic, piecewise-constant function")
print("  More flexible than sigmoid")
print("  \nBest for: Tree-based models, non-linear calibration errors")
print("  Requirements: Needs larger datasets (prone to overfitting)\n")

# Calibrate with isotonic regression
rf_calibrated_isotonic = CalibratedClassifierCV(
    rf_original,
    method='isotonic',  # Isotonic regression
    cv='prefit'
)
rf_calibrated_isotonic.fit(X_calib, y_calib)

# Get predictions
y_pred_proba_isotonic = rf_calibrated_isotonic.predict_proba(X_test_scaled)[:, 1]

# Calculate Brier score
brier_isotonic = brier_score_loss(y_test, y_pred_proba_isotonic)

print(f"Random Forest (Isotonic Regression):")
print(f"  Brier Score: {brier_isotonic:.4f}")
print(f"\nComparison:")
print(f"  Original:          {brier_original:.4f}")
print(f"  Platt Scaling:     {brier_calibrated:.4f}")
print(f"  Isotonic:          {brier_isotonic:.4f}")

# Plot all three
fig, ax = plt.subplots(figsize=(10, 8))

ax.plot([0, 1], [0, 1], 'k--', label='Perfect calibration', linewidth=2)

# Original
ax.plot(mean_pred_orig, fraction_pos_orig, 'o-', 
       label=f'Original RF (Brier: {brier_original:.3f})', linewidth=2, markersize=8)

# Platt
ax.plot(mean_pred_cal, fraction_pos_cal, 's-', 
       label=f'Platt Scaling (Brier: {brier_calibrated:.3f})', linewidth=2, markersize=8)

# Isotonic
fraction_pos_iso, mean_pred_iso = calibration_curve(y_test, y_pred_proba_isotonic, n_bins=10)
ax.plot(mean_pred_iso, fraction_pos_iso, '^-', 
       label=f'Isotonic Regression (Brier: {brier_isotonic:.3f})', linewidth=2, markersize=8)

ax.set_xlabel('Mean Predicted Probability')
ax.set_ylabel('Fraction of Positives')
ax.set_title('Calibration Methods Comparison')
ax.legend()
ax.grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Isotonic Regression:")
print("   - More flexible (can capture non-linear miscalibration)")
print("   - Needs more data")
print("   - Often suits tree-based models when enough calibration data is available")

## 5. CalibratedClassifierCV - Cross-Validation Approach

### 5.1 Training with Built-in Cross-Validation

In [None]:
print("CalibratedClassifierCV with Cross-Validation")
print("="*70)
print("When cv='prefit': Uses already fitted model + separate calibration set")
print("When cv=k: Trains k models using k-fold CV (better use of data)\n")

# Method 1: Using cv=5 (recommended)
print("Method 1: Using cv=5 (k-fold cross-validation)")
print("-" * 70)

# Train base model
rf_base = RandomForestClassifier(n_estimators=100, random_state=42)

# Calibrate with CV
rf_calibrated_cv = CalibratedClassifierCV(
    rf_base,
    method='sigmoid',
    cv=5  # 5-fold cross-validation
)

print("Training with 5-fold CV calibration...")
start = time()
rf_calibrated_cv.fit(X_train_scaled, y_train)
train_time = time() - start

print(f"Training time: {train_time:.2f}s")
print(f"Number of calibrated classifiers (one per fold): {len(rf_calibrated_cv.calibrated_classifiers_)}")

# Predictions
y_pred_proba_cv = rf_calibrated_cv.predict_proba(X_test_scaled)[:, 1]
y_pred_cv = rf_calibrated_cv.predict(X_test_scaled)

# Evaluate
accuracy_cv = accuracy_score(y_test, y_pred_cv)
brier_cv = brier_score_loss(y_test, y_pred_proba_cv)

print(f"\nResults:")
print(f"  Accuracy: {accuracy_cv:.4f}")
print(f"  Brier Score: {brier_cv:.4f}")

print("\n💡 cv=5 approach:")
print("   - Uses all training data efficiently")
print("   - Trains 5 base models + 5 calibrators")
print("   - Averages predictions from all 5")
print("   - More robust than single calibration")

## 6. Which Models Need Calibration?

### 6.1 Testing Different Algorithms

In [None]:
print("Which Models Need Calibration?")
print("="*70)

# Test various models
test_models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Naive Bayes': GaussianNB(),
    'Decision Tree': DecisionTreeClassifier(max_depth=10, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'SVM (RBF)': SVC(probability=True, random_state=42)
}

calibration_results = []

for name, model in test_models.items():
    print(f"\nTesting {name}...")
    
    # Train original
    model.fit(X_train_scaled, y_train)
    y_pred_orig = model.predict_proba(X_test_scaled)[:, 1]
    brier_orig = brier_score_loss(y_test, y_pred_orig)
    
    # Calibrate with sigmoid
    model_calibrated = CalibratedClassifierCV(model, method='sigmoid', cv=5)
    model_calibrated.fit(X_train_scaled, y_train)
    y_pred_cal = model_calibrated.predict_proba(X_test_scaled)[:, 1]
    brier_cal = brier_score_loss(y_test, y_pred_cal)
    
    improvement = ((brier_orig - brier_cal) / brier_orig) * 100
    
    calibration_results.append({
        'Model': name,
        'Brier (Original)': brier_orig,
        'Brier (Calibrated)': brier_cal,
        'Improvement (%)': improvement
    })
    
    print(f"  Original Brier:    {brier_orig:.4f}")
    print(f"  Calibrated Brier:  {brier_cal:.4f}")
    print(f"  Improvement:       {improvement:+.1f}%")

cal_results_df = pd.DataFrame(calibration_results)
cal_results_df = cal_results_df.sort_values('Improvement (%)', ascending=False)

print("\n" + "="*70)
print("\nCalibration Impact Summary:")
print(cal_results_df.to_string(index=False))

# Visualize
fig, ax = plt.subplots(figsize=(12, 6))

x = np.arange(len(cal_results_df))
width = 0.35

ax.barh(x - width/2, cal_results_df['Brier (Original)'], width, 
       label='Original', alpha=0.8)
ax.barh(x + width/2, cal_results_df['Brier (Calibrated)'], width, 
       label='Calibrated', alpha=0.8)

ax.set_yticks(x)
ax.set_yticklabels(cal_results_df['Model'])
ax.set_xlabel('Brier Score (lower is better)')
ax.set_title('Calibration Impact on Different Models')
ax.legend()
ax.grid(alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

print("\n💡 Models that typically benefit from calibration (compare with the table above):")
print("   ✓ Naive Bayes (pushes to extremes)")
print("   ✓ SVM (decision function is not a probability)")
print("   ✓ Boosted Trees (can be overconfident)")
print("   ✓ Random Forest (tends to be underconfident)")
print("   \n   Models that are often reasonably calibrated already:")
print("   ✓ Logistic Regression (often well-calibrated)")
print("   ✓ Decision Trees (decent, but can improve)")

## 7. Best Practices and Guidelines

### 7.1 When to Use Calibration

In [None]:
print("Calibration Best Practices")
print("="*70)

print("\n✓ WHEN CALIBRATION IS IMPORTANT:")
use_cases = [
    "Medical diagnosis (probability represents risk)",
    "Credit scoring (probability determines action)",
    "Weather forecasting (probabilities must be accurate)",
    "Fraud detection (threshold decisions based on probability)",
    "Ranking/recommendation systems that combine scores from several models",
    "Multi-class classification with probability-based decisions",
    "Cost-sensitive learning",
    "Ensemble methods combining probability outputs"
]
for i, case in enumerate(use_cases, 1):
    print(f"  {i}. {case}")

print("\n✗ WHEN CALIBRATION LESS IMPORTANT:")
not_important = [
    "Only care about final class predictions (not probabilities)",
    "Using fixed threshold (e.g., 0.5) for all decisions",
    "Ranking tasks where relative order matters, not absolute values",
    "Already using well-calibrated model (Logistic Regression)"
]
for i, case in enumerate(not_important, 1):
    print(f"  {i}. {case}")

print("\n\n⚙️ METHOD SELECTION GUIDE:")
print("-" * 70)

method_guide = [
    {
        'Scenario': 'Small dataset (<1000 samples)',
        'Method': 'Platt Scaling (sigmoid)',
        'Reason': 'More stable with less data'
    },
    {
        'Scenario': 'Large dataset (>10k samples)',
        'Method': 'Isotonic Regression',
        'Reason': 'More flexible, needs more data'
    },
    {
        'Scenario': 'SVM or boosted trees',
        'Method': 'Platt Scaling',
        'Reason': 'Suits sigmoid-shaped miscalibration'
    },
    {
        'Scenario': 'Tree-based models',
        'Method': 'Isotonic Regression',
        'Reason': 'Better for non-linear miscalibration'
    },
    {
        'Scenario': 'Uncertain which to use',
        'Method': 'Try both, compare on validation',
        'Reason': 'Empirical comparison'
    },
]

method_df = pd.DataFrame(method_guide)
print(method_df.to_string(index=False))

print("\n\n📋 CALIBRATION WORKFLOW:")
print("-" * 70)
workflow = [
    "1. Split data: Train / Calibration / Test (or use CV)",
    "2. Train model on training set",
    "3. Check calibration curve on validation set",
    "4. If poorly calibrated: Apply CalibratedClassifierCV",
    "5. Choose method: sigmoid (small data) or isotonic (large data)",
    "6. Use cv=5 for better data efficiency",
    "7. Evaluate on test set using Brier score",
    "8. Generate final calibration curve"
]
for step in workflow:
    print(f"  {step}")

print("\n\n⚠️ COMMON PITFALLS:")
print("-" * 70)
pitfalls = [
    ("Calibrating on training data", "Use separate calibration set or CV"),
    ("Using isotonic with small data", "Use sigmoid for <1000 samples"),
    ("Ignoring calibration for medical/financial", "Always check calibration"),
    ("Not checking calibration curve", "Visualize before and after"),
    ("Over-calibrating", "Don't calibrate multiple times"),
    ("Using wrong metric", "Use Brier score, not just accuracy"),
]
for pitfall, solution in pitfalls:
    print(f"  ❌ {pitfall}")
    print(f"     ✓ {solution}\n")

## Summary and Quick Reference

### Quick Reference Code

```python
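import matplotlib.pyplot as plt
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Assumes X_train_full, y_train_full, X_test, y_test are already defined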

# ===== METHOD 1: Separate Calibration Set =====

# Split data
X_train, X_calib, y_train, y_calib = train_test_split(
    X_train_full, y_train_full, test_size=0.3, random_state=42
)

# Train base model
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Calibrate
calibrated_model = CalibratedClassifierCV(
    model,
    method='sigmoid',  # or 'isotonic'
    cv='prefit'  # Use already fitted model
)
calibrated_model.fit(X_calib, y_calib)

# ===== METHOD 2: Cross-Validation (Recommended) =====

# Train and calibrate in one step
calibrated_model_cv = CalibratedClassifierCV(
    RandomForestClassifier(),
    method='sigmoid',
    cv=5  # 5-fold cross-validation
)
calibrated_model_cv.fit(X_train_full, y_train_full)

# ===== EVALUATION =====

# Predictions
y_pred_proba = calibrated_model.predict_proba(X_test)[:, 1]

# Brier score (lower is better)
brier = brier_score_loss(y_test, y_pred_proba)

# Calibration curve
fraction_of_positives, mean_predicted_value = calibration_curve(
    y_test, y_pred_proba, n_bins=10
)

# Plot
plt.plot(mean_predicted_value, fraction_of_positives, 'o-')
plt.plot([0, 1], [0, 1], 'k--')  # Perfect calibration
```

### Key Metrics

**Brier Score**:
```python
# predicted_prob, actual: NumPy arrays of equal length (numpy imported as np)
brier = np.mean((predicted_prob - actual) ** 2)
```
- Range: [0, 1]
- 0 = Perfect
- Lower is better

**Log Loss (Cross-Entropy)**:
```python
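# y: 0/1 labels, p: predicted probabilities strictly between 0 and 1 (NumPy arrays)
log_loss_value = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))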
```
- Penalizes confident wrong predictions heavily
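
The difference in how the two metrics treat confident mistakes shows up in the per-sample penalties. In the sketch below the true label is assumed to be 1 in every case, and the four predicted probabilities are arbitrary illustrative values.

```python
import numpy as np

# True label is 1 in every case; p is the predicted probability of class 1
p = np.array([0.99, 0.60, 0.40, 0.01])
log_penalty = -np.log(p)          # per-sample log loss when y = 1
brier_penalty = (p - 1.0) ** 2    # per-sample Brier contribution when y = 1

for pi, ll, bs in zip(p, log_penalty, brier_penalty):
    print(f"p = {pi:.2f}   log loss = {ll:.3f}   Brier = {bs:.4f}")
```

The Brier penalty is capped at 1, while the log loss grows without bound as the probability assigned to the true class approaches 0, so a handful of confident errors can dominate the average log loss.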

### Method Comparison

| Method | Type | Best For | Dataset Size | Flexibility |
|--------|------|----------|--------------|-------------|
| Platt Scaling | Parametric | SVM, Boosting (sigmoid-shaped distortion) | Small-Medium | Low (sigmoid) |
| Isotonic | Non-parametric | Trees, large data | Large | High (arbitrary monotonic) |

### Calibration Needs by Model

| Model | Natural Calibration | Needs Calibration? |
|-------|---------------------|--------------------|
| Logistic Regression | Usually good | Usually no |
| Naive Bayes | Poor (extreme probabilities) | Yes |
| SVM | Poor (margins are not probabilities) | Yes |
| Decision Tree | Moderate | Sometimes |
| Random Forest | Tends to be underconfident | Often yes |
| Gradient Boosting | Can be overconfident | Often yes |
| Neural Networks | Variable | Often yes |

### Calibration Curves Interpretation

**On diagonal (y=x)**: Perfectly calibrated
- Predicted 0.7 → Actually 0.7

**Above diagonal**: Model under-predicts
- Predicted 0.6 → Actually 0.8
- Typical of an underconfident model in the high-probability range

**Below diagonal**: Model over-predicts
- Predicted 0.8 → Actually 0.6
- Typical of an overconfident model in the high-probability range

### Best Practices

1. **Always use separate data**: Never calibrate on training data
2. **Prefer cv approach**: Use cv=5 instead of single calibration set
3. **Check visually**: Plot calibration curves before and after
4. **Use Brier score**: Don't rely only on accuracy
5. **Method selection**: Sigmoid for small data, isotonic for large
6. **Don't over-calibrate**: Calibrate once, not multiple times
7. **Consider cost**: Calibration adds computational overhead
8. **Validate properly**: Use test set never seen during calibration

### When Calibration Matters Most

✓ **Critical Applications**:
- Medical diagnosis (risk assessment)
- Financial modeling (default probability)
- Weather forecasting
- Fraud detection with thresholds
- Cost-sensitive decisions

✗ **Less Critical**:
- Binary classification with fixed 0.5 threshold
- Ranking tasks (only order matters)
- Already using Logistic Regression

### Common Mistakes

| Mistake | Consequence | Solution |
|---------|-------------|----------|
| Calibrate on training data | Overfitting | Use holdout or CV |
| Use isotonic with <1k samples | Overfitting | Use sigmoid instead |
| Ignore calibration curves | Miss miscalibration | Always visualize |
| Multiple calibrations | Can hurt performance | Calibrate once |
| Trust accuracy only | Miss probability errors | Use Brier score |

### Computational Considerations

**Training Time**:
- Sigmoid: Fast (fits simple logistic)
- Isotonic: Moderate (sorts and fits)
- cv=5: Roughly 5× slower than prefit (fits 5 base models)

**Prediction Time**:
- Sigmoid: Minimal overhead
- Isotonic: Minimal overhead
- cv=5: Averages 5 predictions

**Memory**:
- Sigmoid: Stores 2 parameters (A, B)
- Isotonic: Stores calibration mapping
- cv=5: Stores 5 base models and 5 calibrators

### Further Reading

- **Paper**: "Predicting Good Probabilities with Supervised Learning" - Niculescu-Mizil & Caruana (2005)
- **Paper**: "Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers" - Zadrozny & Elkan (2001)
- **Paper**: "Probabilistic Outputs for Support Vector Machines" - Platt (1999)
- **sklearn Docs**: https://scikit-learn.org/stable/modules/calibration.html

### Next Steps

- Temperature scaling for neural networks
- Multi-class calibration
- Calibration in online learning
- Beta calibration
- Venn-ABERS predictors