# ML Metrics Guide – Choosing the Right Evaluation

This notebook is a **learning + reference** guide for ML evaluation metrics.

It focuses on:

- **Regression** metrics: MSE, RMSE, MAE, R², MAPE
- **Classification** metrics: accuracy, precision, recall, F1, ROC-AUC, PR-AUC
- **Confusion matrix** and how to interpret it
- How to choose a **metric** based on the problem (imbalanced data, ranking vs calibration, etc.)


## 1. Regression Metrics

When your target is **continuous** (price, points, temperature), you typically use:

- **MSE (Mean Squared Error)**:
  - Penalizes large errors more (squared).
  - Often used in training objectives.
- **RMSE (Root Mean Squared Error)**:
  - Square root of MSE.
  - Same units as the target (more interpretable).
- **MAE (Mean Absolute Error)**:
  - Measures average absolute error.
  - More robust to outliers than MSE.
- **R² (Coefficient of Determination)**:
  - Proportion of variance explained by the model.
  - 1.0 is perfect, 0.0 means no better than always predicting the mean, and negative values mean worse than that baseline.
- **MAPE (Mean Absolute Percentage Error)** (optional):
  - Expresses error as percentage.
  - Undefined when a true value is zero and unstable when true values are near zero.
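
For reference, the definitions above can be written out as follows. The notation is an assumption introduced here: $y_i$ is the true value, $\hat{y}_i$ the prediction, $\bar{y}$ the mean of the true values, and $n$ the number of samples.

$$
\begin{aligned}
\mathrm{MSE} &= \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \\
\mathrm{RMSE} &= \sqrt{\mathrm{MSE}} \\
\mathrm{MAE} &= \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \\
R^2 &= 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \\
\mathrm{MAPE} &= \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|
\end{aligned}
$$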


In [None]:
# ========== 1.1 Regression metrics on a toy example ==========

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([10, 12, 15, 20, 18])
y_pred_good = np.array([11, 11, 14, 19, 17])
y_pred_bad = np.array([5, 25, 5, 30, 10])

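def regression_metrics(y_true, y_pred):
    """Return (MSE, RMSE, MAE, R2, MAPE in percent) for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)  # back in the units of the target
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    # MAPE divides by y_true, so it assumes no true value is zero.
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    return mse, rmse, mae, r2, mape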

for name, y_hat in [('Good', y_pred_good), ('Bad', y_pred_bad)]:
    mse, rmse, mae, r2, mape = regression_metrics(y_true, y_hat)
    print(f'=== {name} model ===')
    print(f'MSE:  {mse:.3f}')
    print(f'RMSE: {rmse:.3f}')
    print(f'MAE:  {mae:.3f}')
    print(f'R2:   {r2:.3f}')
    print(f'MAPE: {mape:.2f}%')
    print()


### When to use which regression metric?

- **MSE / RMSE**:
  - When large errors are especially bad.
  - Common in competitions and the default loss in many libraries.
- **MAE**:
  - When you want robustness to outliers. The constant prediction that minimizes MAE is the **median**, whereas MSE is minimized by the mean.
- **R²**:
  - When you want a normalized sense of how much variance is explained.
  - Good for model comparison on the same dataset.
- **MAPE**:
  - When percentage error is more meaningful, and target values are not near zero.
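
A small sketch of the outlier point, using made-up targets with one extreme value (the array `y` and the helper names are illustrative assumptions, not part of the notebook). It compares two constant predictions, the mean and the median, under both metrics.

```python
import numpy as np

# Hypothetical targets: four typical values and one extreme outlier.
y = np.array([10.0, 11.0, 12.0, 13.0, 100.0])


def mse_of_constant(c):
    """MSE when every prediction is the constant c."""
    return float(np.mean((y - c) ** 2))


def mae_of_constant(c):
    """MAE when every prediction is the constant c."""
    return float(np.mean(np.abs(y - c)))


mean_pred = float(np.mean(y))      # pulled toward the outlier
median_pred = float(np.median(y))  # stays with the bulk of the data

for name, c in [("mean", mean_pred), ("median", median_pred)]:
    print(f"{name:>6} = {c:6.2f} | MSE = {mse_of_constant(c):9.2f} | MAE = {mae_of_constant(c):6.2f}")
```

The mean wins on MSE and the median wins on MAE, which is why a model tuned for MSE tends to chase outliers while one tuned for MAE tends to ignore them.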


## 2. Classification Metrics & Confusion Matrix

When your target is a **class label** (binary or multiclass), evaluation revolves around
the **confusion matrix**.

For binary classification (positive vs negative):

- **True Positive (TP)**: predicted positive and actually positive
- **True Negative (TN)**: predicted negative and actually negative
- **False Positive (FP)**: predicted positive but actually negative
- **False Negative (FN)**: predicted negative but actually positive

From these, we derive metrics:

- **Accuracy** = (TP + TN) / (TP + TN + FP + FN)
- **Precision** = TP / (TP + FP) – "When I predict positive, how often am I right?"
- **Recall (Sensitivity)** = TP / (TP + FN) – "Of all actual positives, how many did I catch?"
- **F1 Score** = 2 × Precision × Recall / (Precision + Recall), the harmonic mean of precision and recall.


In [None]:
# ========== 2.1 Confusion matrix example ==========

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 1, 0, 0])

cm = confusion_matrix(y_true, y_pred)
print('Confusion matrix (rows=true, cols=pred):')
print(cm)

disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.title('Confusion matrix')
plt.show()

# For binary labels {0, 1}, the 2x2 matrix flattens in the order TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0

print(f'Accuracy:  {accuracy:.3f}')
print(f'Precision: {precision:.3f}')
print(f'Recall:    {recall:.3f}')
print(f'F1 score:  {f1:.3f}')


### When to use which classification metric?

- **Accuracy**:
  - Works when classes are balanced and all errors are equally bad.
  - Can be misleading for imbalanced datasets (e.g., 99% negatives).

- **Precision & Recall**:
  - Precision: "If I flag something as positive, how often am I correct?"
  - Recall: "Of all actual positives, how many did I catch?"
  - Useful when FP vs FN costs are asymmetric (fraud, disease detection).

- **F1 Score**:
  - Balanced tradeoff between precision and recall.
  - Common for imbalanced classification when you want a single summary.
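
To make the accuracy caveat concrete, here is a minimal sketch with made-up data (990 negatives, 10 positives) and a degenerate "model" that always predicts the negative class. The variable names are illustrative. `zero_division=0` is passed so that the undefined precision is reported as 0 rather than raising a warning.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical imbalanced labels: 99% negatives, 1% positives.
y_true = np.array([0] * 990 + [1] * 10)

# A useless classifier that never flags anything as positive.
y_pred_all_negative = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred_all_negative)
prec = precision_score(y_true, y_pred_all_negative, zero_division=0)
rec = recall_score(y_true, y_pred_all_negative, zero_division=0)
f1 = f1_score(y_true, y_pred_all_negative, zero_division=0)

print(f"Accuracy:  {acc:.3f}")   # high, despite catching no positives
print(f"Precision: {prec:.3f}")
print(f"Recall:    {rec:.3f}")
print(f"F1 score:  {f1:.3f}")
```

Accuracy rewards the majority class here, while recall and F1 expose that no positive was ever found.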


## 3. ROC-AUC and PR-AUC

Many classifiers output **scores or probabilities**, not just hard labels.
By sweeping a threshold across these scores, you get different (TPR, FPR) tradeoffs.

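- **ROC curve**: plots True Positive Rate (TPR = TP / (TP + FN), i.e. recall) against False Positive Rate (FPR = FP / (FP + TN)).
- **ROC-AUC**: area under the ROC curve; 0.5 is random ranking, 1.0 is perfect.
- **PR curve**: plots Precision against Recall.
- **PR-AUC**: area under the PR curve; a random classifier scores roughly the fraction of positives in the data, not 0.5.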

Guidance:
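- **ROC-AUC** is a reasonable default when classes are reasonably balanced; with many negatives, FPR stays small even when false positives outnumber true positives.
- **PR-AUC** is often more informative for **heavily imbalanced** problems, because precision and recall both focus on the positive class and ignore true negatives.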
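
The threshold sweep can be sketched by hand on six made-up scores (the data and the helper `rates_at` are illustrative assumptions). Each threshold yields one point on the ROC curve (FPR, TPR) and one point on the PR curve (recall, precision).

```python
import numpy as np

# Hypothetical labels and scores, sorted by score for readability.
y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.9])


def rates_at(threshold):
    """Return (TPR, FPR, precision) when predicting positive for score >= threshold."""
    y_pred = scores >= threshold
    tp = int(np.sum(y_pred & (y_true == 1)))
    fp = int(np.sum(y_pred & (y_true == 0)))
    fn = int(np.sum(~y_pred & (y_true == 1)))
    tn = int(np.sum(~y_pred & (y_true == 0)))
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    # Precision is undefined when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    return tpr, fpr, precision


for t in [0.2, 0.5, 0.8]:
    tpr, fpr, precision = rates_at(t)
    print(f"threshold={t:.1f} | TPR={tpr:.3f} | FPR={fpr:.3f} | precision={precision:.3f}")
```

Raising the threshold lowers both TPR and FPR; the curves returned by `roc_curve` and `precision_recall_curve` trace these points over all distinct thresholds.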


In [None]:
# ========== 3.1 ROC-AUC and PR-AUC on a toy example ==========

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve, precision_recall_curve, auc

rng = np.random.default_rng(0)
n = 500
y_true = rng.integers(0, 2, size=n)
scores = rng.normal(loc=y_true, scale=0.8, size=n)

roc_auc = roc_auc_score(y_true, scores)
fpr, tpr, roc_thresh = roc_curve(y_true, scores)

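prec, rec, pr_thresh = precision_recall_curve(y_true, scores)
# Trapezoidal area under the PR curve. sklearn's average_precision_score is a
# step-wise alternative that avoids linear interpolation between points.
pr_auc = auc(rec, prec)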

print(f'ROC-AUC: {roc_auc:.3f}')
print(f'PR-AUC:  {pr_auc:.3f}')

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve')
plt.show()

plt.plot(rec, prec)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision–Recall curve')
plt.show()


## 4. Metric Selection Cheat Sheet

### 4.1 Regression

- Use **RMSE** when:
  - Larger errors are especially harmful.
  - The competition or stakeholder optimizes squared error.

- Use **MAE** when:
  - You want a metric that is robust to outliers.
  - You care about "typical" absolute error.

- Use **R²** when:
  - You want to explain how much variance is captured.
  - You are comparing models on the same dataset.

### 4.2 Classification

- Use **accuracy** when:
  - Classes are roughly balanced.
  - All errors have similar cost.

- Use **precision / recall / F1** when:
  - Data is imbalanced.
  - You care about a specific class (e.g., positive class).
  - You can state the relative cost of false positives and false negatives.

- Use **ROC-AUC** when:
  - You care about ranking quality across thresholds.
  - Class balance is not extremely skewed.

- Use **PR-AUC** when:
  - Positive class is rare and critical (fraud, disease).
  - You care about performance at the top of the score distribution.
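
A rough simulation of the ROC-AUC vs PR-AUC guidance, assuming the same score model as the toy example in section 3 (positives centred at 1, negatives at 0, standard deviation 0.8). The helper `simulate` and the sample sizes are illustrative choices. PR-AUC is estimated here with `average_precision_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)


def simulate(n_pos, n_neg):
    """Draw labels with fixed class counts and scores ~ N(label, 0.8)."""
    y = np.concatenate([np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)])
    s = rng.normal(loc=y, scale=0.8)
    return y, s


# Same score separation, different class balance.
y_bal, s_bal = simulate(10_000, 10_000)   # 50% positives
y_rare, s_rare = simulate(200, 19_800)    # 1% positives

roc_bal = roc_auc_score(y_bal, s_bal)
roc_rare = roc_auc_score(y_rare, s_rare)
ap_bal = average_precision_score(y_bal, s_bal)
ap_rare = average_precision_score(y_rare, s_rare)

print(f"Balanced: ROC-AUC={roc_bal:.3f} | PR-AUC (AP)={ap_bal:.3f}")
print(f"Rare:     ROC-AUC={roc_rare:.3f} | PR-AUC (AP)={ap_rare:.3f}")
```

ROC-AUC stays roughly the same in both settings because it depends only on how well positives are ranked above negatives, while PR-AUC drops sharply once positives become rare.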


## 5. Multi-Class & Other Settings (Brief)

For **multi-class** problems:

- Accuracy is still straightforward.
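- Precision/recall/F1 are computed per class and then averaged:
  - **Macro**: unweighted average of per-class metrics (treats each class equally, so rare classes count as much as common ones).
  - **Weighted**: per-class metrics weighted by class frequency.
- ROC-AUC and PR-AUC extend to multi-class by scoring each class one-vs-rest and averaging the results.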
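
The difference between the averaging modes shows up on a small made-up 3-class example in which the rare class 1 is never predicted (labels and predictions are illustrative assumptions). `zero_division=0` reports the undefined precision for class 1 as 0.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical labels: class 0 is common, classes 1 and 2 are rare.
y_true = np.array([0] * 8 + [1] + [2])
# The model gets class 0 and class 2 right but labels the class-1 sample as 2.
y_pred = np.array([0] * 8 + [2] + [2])

per_class = f1_score(y_true, y_pred, average=None, zero_division=0)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print("Per-class F1:", np.round(per_class, 3))
print(f"Macro F1:     {macro:.3f}")
print(f"Weighted F1:  {weighted:.3f}")
```

Macro F1 is pulled down by the missed rare class, whereas weighted F1 stays high because class 0 dominates the weights.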

For **ranking / recommendation**:

- You may use metrics like:
  - Top-k accuracy
  - Mean Average Precision at k (MAP@k)
  - NDCG
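
As a rough illustration of the ranking metrics, the sketch below implements AP@k and NDCG@k for a single query in NumPy. The function names and the example relevance list are assumptions made for this sketch. AP@k has several variants in the literature; this one normalizes by the number of relevant items found in the top k. MAP@k would be the mean of AP@k over queries.

```python
import numpy as np


def average_precision_at_k(relevances, k):
    """AP@k for one ranked list; normalized by relevant items within the top k."""
    rel = np.asarray(relevances)[:k] > 0
    if not rel.any():
        return 0.0
    hits = np.cumsum(rel)                             # relevant items seen so far
    precisions = hits / np.arange(1, rel.size + 1)    # precision at each rank
    return float(np.sum(precisions * rel) / rel.sum())


def dcg_at_k(relevances, k):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))
    return float(np.sum(rel / discounts))


def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical relevance of items in the order the model ranked them (1 = relevant).
ranked_relevance = [1, 0, 1, 0, 0]

ap3 = average_precision_at_k(ranked_relevance, k=3)
ndcg3 = ndcg_at_k(ranked_relevance, k=3)
print(f"AP@3:   {ap3:.3f}")
print(f"NDCG@3: {ndcg3:.3f}")
```

Both metrics reward placing relevant items near the top; NDCG additionally supports graded relevance values rather than only 0/1.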

Key idea: **match the metric to the real-world objective**.
