# X2: Model Evaluation & Selection - Measuring What Matters

Guide to evaluating and comparing machine learning models.


## Introduction

Accuracy alone is misleading. A model that predicts "no cancer" for everyone achieves 99% accuracy if only 1% of patients have cancer, yet it is useless because it never detects a single case.

This guide covers the main evaluation metrics, when to use each, and how to compare models properly.

## Table of Contents

1. Classification Metrics
2. Regression Metrics
3. Cross-Validation
4. ROC Curves & AUC
5. Precision-Recall Curves
6. Confusion Matrix Deep Dive
7. Statistical Significance Testing
8. Model Comparison Framework

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, roc_curve, precision_recall_curve,
    confusion_matrix, classification_report,
)
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import cross_val_score, cross_validate, StratifiedKFold
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import seaborn as sns

np.random.seed(42)
print('✅ Libraries loaded')

## 1. Classification Metrics

**Accuracy:** Overall correctness
- $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
- ⚠️ Misleading for imbalanced data

**Precision:** When you predict positive, how often are you right?
- $\text{Precision} = \frac{TP}{TP + FP}$
- Use when: False positives are costly (spam detection)

**Recall (Sensitivity):** Of all actual positives, how many did you find?
- $\text{Recall} = \frac{TP}{TP + FN}$
- Use when: False negatives are costly (cancer detection)

**F1-Score:** Harmonic mean of precision and recall
- $F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
- Use when: Need balance

**ROC-AUC:** Area under ROC curve (0.5 = random, 1.0 = perfect)
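The formulas above can be checked by hand on a small example. The sketch below is illustrative only: the helper `classification_metrics` and both sets of counts are hypothetical, and reporting a ratio with a zero denominator as 0.0 is a convention chosen here, not a rule from the guide.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    Ratios with a zero denominator are reported as 0.0 (a convention
    chosen for this sketch).
    """
    def safe_div(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Hypothetical screening set: 1,000 patients, 10 of whom have cancer.
# Model A predicts "no cancer" for everyone.
always_negative = classification_metrics(tp=0, tn=990, fp=0, fn=10)

# Model B (also hypothetical) finds 8 of the 10 cases at the price of 30 false alarms.
screening_model = classification_metrics(tp=8, tn=960, fp=30, fn=2)

for name, metrics in [("always negative", always_negative),
                      ("screening model", screening_model)]:
    summary = ", ".join(f"{key}={value:.3f}" for key, value in metrics.items())
    print(f"{name}: {summary}")
```

With these counts the always-negative model has the higher accuracy (0.990 versus 0.968) but a recall of 0, while the screening model reaches a recall of 0.8. This is the imbalance problem flagged in the accuracy bullet.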

In [None]:
# Load data and train model
data = load_breast_cancer()
X, y = data.data, data.target  # labels: 0 = malignant, 1 = benign
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]  # probability of class 1 (benign)

# Precision, recall, and F1 default to pos_label=1, so "positive" here means benign
print('Classification Metrics:')
print(f'Accuracy:  {accuracy_score(y_test, y_pred):.3f}')
print(f'Precision: {precision_score(y_test, y_pred):.3f}')
print(f'Recall:    {recall_score(y_test, y_pred):.3f}')
print(f'F1-Score:  {f1_score(y_test, y_pred):.3f}')
print(f'ROC-AUC:   {roc_auc_score(y_test, y_proba):.3f}')

print('\nFull Classification Report:')
print(classification_report(y_test, y_pred, target_names=data.target_names))

## 2. Cross-Validation

**Why?** A single train/test split is noisy: results vary depending on which samples land in each set.

**K-Fold Cross-Validation:**
1. Split data into K folds
2. Train on K-1 folds, test on the remaining fold
3. Repeat K times so that each fold serves as the test set once
4. Average results

**Stratified K-Fold:** Maintains class proportions in every fold (use for classification)

**Time Series Split:** Respects temporal order (never train on data from after the test period!)
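The four K-fold steps, and the temporal-order constraint, can be sketched without any library. Everything below is an illustrative assumption: `k_fold_splits` and `expanding_window_splits` are hypothetical helpers (the second is a simplified expanding window, not scikit-learn's exact `TimeSeriesSplit` logic), and the "model" is a trivial majority-class predictor on made-up labels.

```python
from statistics import mean


def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def expanding_window_splits(n_samples, k):
    """Yield splits in which every training index precedes every test index."""
    fold_size = n_samples // (k + 1)
    for i in range(1, k + 1):
        train_end = i * fold_size
        yield list(range(train_end)), list(range(train_end, train_end + fold_size))


# Made-up labels; the "model" predicts the majority class of its training folds.
labels = [0, 0, 1, 0, 0, 1, 0, 1, 0, 0]
fold_scores = []
for train_idx, test_idx in k_fold_splits(len(labels), k=5):
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)   # "train"
    hits = sum(labels[i] == majority for i in test_idx)          # "test"
    fold_scores.append(hits / len(test_idx))

print(f"Fold accuracies: {fold_scores}")
print(f"Mean accuracy:   {mean(fold_scores):.3f}")

for train_idx, test_idx in expanding_window_splits(10, k=4):
    print(f"train={train_idx} test={test_idx}")
```

The spread of the per-fold accuracies is the noise that a single split would hide. In the expanding-window splits the training set grows over time and never includes an index later than the test window.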

In [None]:
# K-Fold Cross-Validation
from sklearn.model_selection import cross_val_score, cross_validate, StratifiedKFold

# Standard cross-validation
# (for a classifier, an integer cv uses stratified folds by default)
cv_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f'5-Fold CV Accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})')
print(f'Individual fold scores: {cv_scores}')

# Cross-validate multiple metrics
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
cv_results = cross_validate(model, X, y, cv=5, scoring=scoring)

print('\nCross-Validation Results:')
for metric in scoring:
    scores = cv_results[f'test_{metric}']
    print(f'{metric:12s}: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})')

## 3. ROC Curves & AUC

**ROC Curve:** Plot of True Positive Rate vs False Positive Rate at different thresholds

**Interpretation:**
- Diagonal line = random classifier (AUC = 0.5)
- Perfect classifier touches top-left corner (AUC = 1.0)
- Higher AUC = better overall performance

**When to use:**
- ✅ Balanced classes
- ✅ Care about ranking
- ✅ Will tune threshold later
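To make "at different thresholds" and "care about ranking" concrete, here is a from-scratch sketch on four made-up scores. The helpers `roc_points`, `auc_trapezoid`, and `auc_pairwise` are hypothetical names for this illustration; in practice `roc_curve` and `roc_auc_score` do this work.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs from sweeping the threshold over each distinct score."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def auc_pairwise(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores for illustration.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
print(points)
print(f"AUC (trapezoid): {auc_trapezoid(points):.2f}")
print(f"AUC (pairwise):  {auc_pairwise(y_true, scores):.2f}")
```

Both calculations give 0.75 on this toy data: three of the four positive/negative pairs are ordered correctly. That agreement is why AUC is read as a ranking measure that does not depend on any single threshold.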

In [None]:
# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc = roc_auc_score(y_test, y_proba)

plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, linewidth=2, label=f'ROC Curve (AUC = {auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier (AUC = 0.5)')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('ROC Curve', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=11)
plt.grid(alpha=0.3)
plt.show()

## 4. Precision-Recall Curves

**Better than ROC for imbalanced data!**

**Why?** ROC can be optimistic when negatives >> positives: the false positive rate divides by the large number of negatives, so many false positives barely move the curve, while precision drops sharply.

**Interpretation:**
- High precision + high recall = excellent
- High precision, low recall = conservative (few predictions)
- Low precision, high recall = aggressive (many predictions)

**Use when:** Imbalanced classes (fraud, disease detection)
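A short piece of arithmetic shows why ROC can look optimistic under imbalance. The sketch below holds one ROC operating point fixed and changes only the number of negatives. The helper `precision_at` and all of the rates and counts are assumptions made for illustration.

```python
def precision_at(tpr, fpr, n_pos, n_neg):
    """Precision implied by a fixed ROC operating point and given class counts."""
    true_positives = tpr * n_pos
    false_positives = fpr * n_neg
    return true_positives / (true_positives + false_positives)


# The same hypothetical operating point on the ROC curve in both scenarios.
tpr, fpr = 0.90, 0.05

balanced = precision_at(tpr, fpr, n_pos=100, n_neg=100)
imbalanced = precision_at(tpr, fpr, n_pos=100, n_neg=10_000)

print(f"Balanced   (100 pos / 100 neg):    precision = {balanced:.3f}")
print(f"Imbalanced (100 pos / 10,000 neg): precision = {imbalanced:.3f}")
```

The ROC point (FPR 0.05, TPR 0.90) is identical in both cases, yet precision falls from about 0.95 to about 0.15 once negatives outnumber positives 100 to 1. A precision-recall curve exposes that drop, while the ROC curve does not change.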

In [None]:
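# Precision-Recall Curve
from sklearn.metrics import average_precision_score

precision, recall, _ = precision_recall_curve(y_test, y_proba)

plt.figure(figsize=(10, 6))
plt.plot(recall, precision, linewidth=2)
plt.xlabel('Recall', fontsize=12)
plt.ylabel('Precision', fontsize=12)
plt.title('Precision-Recall Curve', fontsize=14, fontweight='bold')
plt.grid(alpha=0.3)
plt.show()

# Average precision summarizes the PR curve; the plain mean of the
# precision array is not the same quantity
print(f'Average Precision Score: {average_precision_score(y_test, y_proba):.3f}')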

## 5. Confusion Matrix Deep Dive

**2×2 Matrix for Binary Classification:**

|                 | Predicted Negative | Predicted Positive |
|-----------------|--------------------|--------------------|
| Actual Negative | TN                 | FP (Type I Error)  |
| Actual Positive | FN (Type II Error) | TP                 |

**Cost-Sensitive Learning:**
- Medical: FN (missed cancer) >> FP (false alarm)
- Spam: FP (blocking real email) > FN (spam in inbox)
- Set threshold based on costs!
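One way to "set threshold based on costs" is to sweep candidate thresholds and keep the one with the lowest total misclassification cost. The sketch below assumes made-up labels and scores and hypothetical unit costs (10:1 in each direction). `total_cost` and `best_threshold` are illustrative helpers, not library functions.

```python
def total_cost(y_true, scores, threshold, cost_fp, cost_fn):
    """Total cost of predicting positive whenever score >= threshold."""
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    return fp * cost_fp + fn * cost_fn


def best_threshold(y_true, scores, cost_fp, cost_fn, candidates):
    """Return the first candidate threshold with the lowest total cost."""
    return min(candidates,
               key=lambda t: total_cost(y_true, scores, t, cost_fp, cost_fn))


# Made-up labels (1 = positive) and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.05, 0.12, 0.22, 0.31, 0.35, 0.45, 0.62, 0.71, 0.75, 0.93]
candidates = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9

# Medical-style costs: a missed positive is assumed 10x worse than a false alarm.
medical = best_threshold(y_true, scores, cost_fp=1, cost_fn=10, candidates=candidates)

# Spam-style costs: a false positive is assumed 10x worse than a miss.
spam = best_threshold(y_true, scores, cost_fp=10, cost_fn=1, candidates=candidates)

print(f"Medical-style costs -> threshold {medical}")
print(f"Spam-style costs    -> threshold {spam}")
```

On this toy data the FN-heavy costs select a low threshold (0.3), which flags more cases, and the FP-heavy costs select a high one (0.8), which flags fewer. The same model serves both settings with a different cut-off.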

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=data.target_names, yticklabels=data.target_names)
plt.ylabel('True Label', fontsize=12)
plt.xlabel('Predicted Label', fontsize=12)
plt.title('Confusion Matrix', fontsize=14, fontweight='bold')
plt.show()

# ravel() flattens [[TN, FP], [FN, TP]], with class 1 (benign) as the positive class
tn, fp, fn, tp = cm.ravel()
print(f'True Negatives:  {tn}')
print(f'False Positives: {fp} (Type I Error)')
print(f'False Negatives: {fn} (Type II Error)')
print(f'True Positives:  {tp}')

## Conclusion

**Metric Selection Guide:**

**Classification:**
- Balanced data, general use: **Accuracy, F1-Score**
- Imbalanced data: **Precision-Recall AUC, F1-Score**
- Ranking important: **ROC-AUC**
- False negatives costly (medical): **Recall**
- False positives costly (spam): **Precision**

**Always:**
- ✅ Use cross-validation
- ✅ Look at confusion matrix
- ✅ Plot ROC/PR curves
- ✅ Report confidence intervals
- ✅ Test on held-out data

**Never:**
- ❌ Tune on test set
- ❌ Use accuracy alone for imbalanced data
- ❌ Compare without statistical testing
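As one possible way to act on "report confidence intervals", the sketch below computes a percentile bootstrap interval for test-set accuracy. It is an illustration under assumptions: `bootstrap_accuracy_ci` is a hypothetical helper, the predictions are made up, and the percentile bootstrap is only one of several interval methods.

```python
import random


def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy.

    Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)
    n = len(y_true)
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    point_estimate = sum(correct) / n

    resampled = []
    for _ in range(n_resamples):
        # Resample test examples with replacement and recompute accuracy.
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        resampled.append(sum(sample) / n)

    resampled.sort()
    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[int((1 - alpha / 2) * n_resamples) - 1]
    return point_estimate, lower, upper


# Made-up test set: 100 predictions, 90 of them correct.
y_true = [1] * 50 + [0] * 50
y_pred = [1] * 45 + [0] * 5 + [0] * 45 + [1] * 5

accuracy, ci_low, ci_high = bootstrap_accuracy_ci(y_true, y_pred)
print(f"Accuracy: {accuracy:.3f} (95% CI: {ci_low:.3f} to {ci_high:.3f})")
```

With only 100 test examples the interval is several percentage points wide, so a small accuracy gap between two models may not be meaningful without a statistical test.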