# 03 — Deep Evaluation

In the training notebook we got **75.8% test accuracy**. But accuracy alone doesn't tell the full story.

In this notebook we dig deeper:
1. **ROC Curves** — how well does the model separate each class from the rest?
2. **Per-class analysis** — which conditions are hardest to classify?
3. **Confidence analysis** — how confident is the model when it's right vs wrong?
4. **Misclassified examples** — let's actually look at the images the model got wrong

> This is the kind of analysis that separates a good data science project from a great one.

---

In [None]:
import sys
sys.path.append('..')

import torch
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from PIL import Image

from src.config import (
    DATA_DIR, MODELS_DIR, RESULTS_DIR, SEED,
    CLASS_NAMES, CLASS_LABELS, NUM_CLASSES,
    IMAGE_SIZE, BATCH_SIZE, MODEL_NAME,
)

sns.set_style('whitegrid')
plt.rcParams['figure.dpi'] = 120

device = torch.device('cpu')
print(f'Device: {device}')
print('Setup complete!')

---
## 1. Load the Trained Model & Test Data

We load the best model saved during training and recreate the test set with the same random seed and the same split arguments. As long as the metadata file is unchanged, this reproduces the exact test images that were held out during training.
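To see why a fixed seed reproduces the split, here is a minimal sketch on synthetic labels (not the HAM10000 metadata). The `ids`, `labels` and `split_test_ids` names are made up for this illustration. It assumes the input rows, `test_size` and `stratify` arguments are identical between calls.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the metadata: 200 row ids with imbalanced labels.
ids = np.arange(200)
labels = np.array([0] * 140 + [1] * 40 + [2] * 20)

def split_test_ids(seed):
    """Return the sorted test-set ids for a given seed (hypothetical helper)."""
    _, test_ids = train_test_split(
        ids, test_size=0.15, stratify=labels, random_state=seed
    )
    return np.sort(test_ids)

first = split_test_ids(42)
second = split_test_ids(42)

# Same data, same arguments, same seed: the held-out ids match exactly.
print('Same seed, same test set:', np.array_equal(first, second))
print('Test set size:', len(first))
```

If the metadata rows were reordered or filtered between notebooks, the same seed would no longer select the same images.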

In [None]:
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader
from src.dataset import HAM10000Dataset, get_transforms
from src.model import create_model

# Load metadata and recreate the same splits
df = pd.read_csv(DATA_DIR / 'HAM10000_metadata.csv')
image_dirs = [
    DATA_DIR / 'HAM10000_images_part_1',
    DATA_DIR / 'HAM10000_images_part_2',
]

# Same split as training (same seed = same split)
# 15% test, then 0.176 of the remaining 85% (about 15% of the full data) for validation
train_val_df, test_df = train_test_split(df, test_size=0.15, stratify=df['dx'], random_state=SEED)
train_df, val_df = train_test_split(train_val_df, test_size=0.176, stratify=train_val_df['dx'], random_state=SEED)

# Create test dataset and loader
test_dataset = HAM10000Dataset(test_df, image_dirs=image_dirs, transform=get_transforms('test'))
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=0)

# Load the best model
model = create_model(MODEL_NAME, NUM_CLASSES, pretrained=False)
model.load_state_dict(torch.load(MODELS_DIR / 'best_model.pth', map_location=device, weights_only=True))
model = model.to(device)
model.eval()

print(f'Test samples: {len(test_df)}')
print(f'Model loaded from: {MODELS_DIR / "best_model.pth"}')

In [None]:
from src.evaluate import get_predictions

# Get all predictions on test set
y_true, y_pred, y_probs = get_predictions(model, test_loader, device)

accuracy = (y_true == y_pred).mean()
print(f'Test accuracy: {accuracy:.4f} ({accuracy*100:.1f}%)')
print(f'Total predictions: {len(y_true)}')

---
## 2. ROC Curves (One-vs-Rest)

**What is a ROC curve?**

For each class, we ask: *"How well can the model distinguish THIS class from ALL others?"*

- **X-axis (False Positive Rate)** — how often does it wrongly say "yes" when the answer is "no"?
- **Y-axis (True Positive Rate / Recall)** — how often does it correctly say "yes" when the answer is "yes"?

**AUC (Area Under Curve):**
- **AUC = 1.0** — perfect classifier
- **AUC = 0.5** — random guessing (the dashed diagonal line)
- **AUC > 0.8** — good
- **AUC > 0.9** — excellent
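
One way to build intuition for AUC: it equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below checks this on a toy one-vs-rest problem. The labels and scores are invented for illustration and are not model outputs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy one-vs-rest problem: 1 = "this class", 0 = "any other class".
toy_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
toy_scores = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90])

auc_sklearn = roc_auc_score(toy_true, toy_scores)

# Rank interpretation: fraction of (positive, negative) pairs in which
# the positive example has the higher score (ties count as half).
pos = toy_scores[toy_true == 1]
neg = toy_scores[toy_true == 0]
diff = pos[:, None] - neg[None, :]
auc_pairs = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(f'AUC from sklearn:        {auc_sklearn:.4f}')
print(f'AUC from pair counting:  {auc_pairs:.4f}')
```

Here 14 of the 15 positive/negative pairs are ordered correctly, so both numbers come out at about 0.933.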

In [None]:
from src.evaluate import plot_roc_curves

plot_roc_curves(y_true, y_probs, save=True)

In [None]:
# Print AUC scores in a clean table
from sklearn.metrics import roc_auc_score

print('AUC Scores per class:')
print('-' * 50)
aucs = []
for i, cls in enumerate(CLASS_NAMES):
    binary_true = (y_true == i).astype(int)
    auc_score = roc_auc_score(binary_true, y_probs[:, i])
    aucs.append(auc_score)
    label = CLASS_LABELS[cls]
    bar = '█' * int(auc_score * 30)
    print(f'  {label:30s}  AUC: {auc_score:.3f}  {bar}')

print('-' * 50)
print(f'  {"Mean AUC":30s}  AUC: {np.mean(aucs):.3f}')

### ROC Curves — Interpretation

**What the AUC scores tell us:**
- Classes with **AUC > 0.9** — the model can reliably distinguish these from other conditions
- Classes with **AUC < 0.8** — the model struggles to separate these, likely because they look similar to other classes

**Why ROC is better than accuracy for imbalanced data:**
- Accuracy can be misleading: about 67% of HAM10000 images are moles (nv), so predicting "mole" for everything already scores about 67%
- ROC/AUC evaluates performance at **all decision thresholds**, not just at the single operating point used to produce the hard predictions
- In medical settings, you might want to lower the threshold for a class such as melanoma to catch more true positives (higher recall) at the cost of more false positives
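
The threshold trade-off can be sketched as follows: given one-vs-rest scores for a single class, find the highest threshold that still reaches a target recall, and read off the false positive rate you pay for it. `threshold_for_recall` is a hypothetical helper and the data is a toy example, not output from this model.

```python
import numpy as np
from sklearn.metrics import roc_curve

def threshold_for_recall(binary_true, scores, target_recall):
    """Highest score threshold whose recall (TPR) reaches target_recall."""
    # drop_intermediate=False keeps every candidate threshold.
    fpr, tpr, thresholds = roc_curve(binary_true, scores, drop_intermediate=False)
    # Thresholds are sorted from high to low and tpr never decreases,
    # so the first index that reaches the target has the highest threshold.
    idx = int(np.argmax(tpr >= target_recall))
    return thresholds[idx], tpr[idx], fpr[idx]

# Toy scores for one class (1 = the class of interest, 0 = everything else).
toy_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
toy_scores = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90])

thr_full, tpr_full, fpr_full = threshold_for_recall(toy_true, toy_scores, 1.0)
thr_part, tpr_part, fpr_part = threshold_for_recall(toy_true, toy_scores, 0.6)

print(f'Recall {tpr_full:.2f} needs threshold {thr_full:.2f} (FPR {fpr_full:.2f})')
print(f'Recall {tpr_part:.2f} needs threshold {thr_part:.2f} (FPR {fpr_part:.2f})')
```

In this toy case, catching every positive means lowering the threshold to 0.40 and accepting one false positive out of five negatives. In practice the threshold would be chosen on validation data, not on the test set.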

---
## 3. Confidence Analysis

Is the model **confident** when it's correct and **uncertain** when it's wrong?

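Strictly speaking, a *calibrated* model is one whose confidence matches its accuracy (predictions made at about 80% confidence are right about 80% of the time). As a first check, we look for a weaker property. The model should: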
- Have **high confidence** for correct predictions
- Have **lower confidence** for incorrect predictions

If the model is equally confident when right and wrong, it can't be trusted — it doesn't "know what it doesn't know."
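
For a single-number summary of how far confidence is from accuracy, one common choice is the expected calibration error (ECE). The sketch below is a simple equal-width-bin version. The function name, the bin count and the toy inputs are assumptions made for illustration, not part of `src.evaluate`.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Weighted mean |accuracy - confidence| gap over equal-width bins."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each confidence to a bin index in 0..n_bins-1
    # (a confidence of exactly 1.0 lands in the last bin).
    bin_ids = np.digitize(conf, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the share of samples in the bin
    return ece

# Toy example: four predictions at 95% confidence (three correct)
# and two predictions at 55% confidence (one correct).
toy_conf = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
toy_correct = [1, 1, 1, 0, 1, 0]
toy_ece = expected_calibration_error(toy_conf, toy_correct)
print(f'ECE: {toy_ece:.3f}')
```

With the notebook's arrays, the inputs would be `y_probs.max(axis=1)` and `y_true == y_pred`. A value near 0 means confidence tracks accuracy closely. Larger values mean the model is over- or under-confident.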

In [None]:
# Get the confidence (max probability) for each prediction
confidences = y_probs.max(axis=1)
correct_mask = y_true == y_pred

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Distribution of confidence for correct vs incorrect
axes[0].hist(confidences[correct_mask], bins=30, alpha=0.7, label=f'Correct ({correct_mask.sum()})', color='green')
axes[0].hist(confidences[~correct_mask], bins=30, alpha=0.7, label=f'Incorrect ({(~correct_mask).sum()})', color='red')
axes[0].set_xlabel('Prediction Confidence')
axes[0].set_ylabel('Count')
axes[0].set_title('Confidence Distribution: Correct vs Incorrect')
axes[0].legend()

# Average confidence per class
class_conf_correct = []
class_conf_incorrect = []
for i, cls in enumerate(CLASS_NAMES):
    mask = y_true == i
    correct = mask & correct_mask
    incorrect = mask & ~correct_mask
    class_conf_correct.append(confidences[correct].mean() if correct.sum() > 0 else 0)
    class_conf_incorrect.append(confidences[incorrect].mean() if incorrect.sum() > 0 else 0)

x = np.arange(len(CLASS_NAMES))
width = 0.35
labels = [CLASS_LABELS[c] for c in CLASS_NAMES]
axes[1].barh(x - width/2, class_conf_correct, width, label='Correct', color='green', alpha=0.7)
axes[1].barh(x + width/2, class_conf_incorrect, width, label='Incorrect', color='red', alpha=0.7)
axes[1].set_yticks(x)
axes[1].set_yticklabels(labels)
axes[1].set_xlabel('Mean Confidence')
axes[1].set_title('Average Confidence by Class')
axes[1].legend()
axes[1].set_xlim(0, 1)

plt.tight_layout()
fig.savefig(RESULTS_DIR / 'confidence_analysis.png', dpi=150, bbox_inches='tight')
plt.show()

print(f'\nMean confidence (correct):   {confidences[correct_mask].mean():.3f}')
print(f'Mean confidence (incorrect): {confidences[~correct_mask].mean():.3f}')

### Confidence Analysis — Interpretation

**What we want to see:**
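- In the histogram, correct predictions (green) should pile up toward the right (high confidence), while incorrect ones (red) should sit further left; in the bar chart, each class's green bar should be longer than its red bar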
- If the model is confident AND wrong, those are the most dangerous predictions

**For medical AI, this matters a lot:**
- A model that says "I'm 95% sure this is benign" when it's actually melanoma is dangerous
- A model that says "I'm 55% sure, maybe check with a doctor" is much safer
- This is why we might set a **confidence threshold** — only trust predictions above a certain confidence level
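
A confidence threshold trades coverage for accuracy: the model answers fewer cases, but should be right more often on the ones it does answer. Here is a minimal sketch of that trade-off. `selective_accuracy` and the toy arrays are hypothetical and only illustrate the idea.

```python
import numpy as np

def selective_accuracy(conf, correct, threshold):
    """Coverage and accuracy when only predictions with conf >= threshold are trusted."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    accepted = conf >= threshold
    coverage = accepted.mean()  # share of cases the model answers
    # Accuracy is undefined if nothing clears the threshold.
    accuracy = correct[accepted].mean() if accepted.any() else float('nan')
    return coverage, accuracy

toy_conf = [0.99, 0.95, 0.90, 0.80, 0.60, 0.55]
toy_correct = [True, True, False, True, False, True]

cov_all, acc_all = selective_accuracy(toy_conf, toy_correct, 0.0)
cov_high, acc_high = selective_accuracy(toy_conf, toy_correct, 0.85)

print(f'No threshold:   coverage {cov_all:.2f}, accuracy {acc_all:.2f}')
print(f'Threshold 0.85: coverage {cov_high:.2f}, accuracy {acc_high:.2f}')
```

The cases below the threshold are the ones that would be referred to a clinician. In this toy data one high-confidence prediction is still wrong, so a threshold reduces risk but does not remove it.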

---
## 4. Per-Class Performance Deep Dive

Let's visualize precision, recall, and F1 side by side for each class to see exactly where the model excels and struggles.
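
As a reminder of what these three metrics measure, the sketch below derives them from a small made-up confusion matrix (rows are true classes, columns are predicted classes, the layout used by `sklearn.metrics.confusion_matrix`). The numbers are illustrative only.

```python
import numpy as np

# Toy 3-class confusion matrix: rows = true class, columns = predicted class.
toy_cm = np.array([
    [8, 2, 0],
    [1, 6, 3],
    [0, 1, 9],
])

tp = np.diag(toy_cm)                 # correct predictions per class
toy_precision = tp / toy_cm.sum(axis=0)  # of everything predicted as class k, how much was right
toy_recall = tp / toy_cm.sum(axis=1)     # of everything that truly is class k, how much was found
toy_f1 = 2 * toy_precision * toy_recall / (toy_precision + toy_recall)

for k in range(3):
    print(f'class {k}: precision {toy_precision[k]:.3f}  '
          f'recall {toy_recall[k]:.3f}  F1 {toy_f1[k]:.3f}')
```

This toy matrix has no empty rows or columns. A class that is never predicted would make the precision division undefined, which is the case `zero_division=0` handles in the scikit-learn calls.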

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

# Compute per-class metrics
precisions = precision_score(y_true, y_pred, average=None, zero_division=0)
recalls = recall_score(y_true, y_pred, average=None, zero_division=0)
f1s = f1_score(y_true, y_pred, average=None, zero_division=0)

# Create a DataFrame for easy viewing
metrics_df = pd.DataFrame({
    'Class': [CLASS_LABELS[c] for c in CLASS_NAMES],
    'Short': CLASS_NAMES,
    'Precision': precisions,
    'Recall': recalls,
    'F1-Score': f1s,
    'Support': [(y_true == i).sum() for i in range(NUM_CLASSES)],
})

# Plot
fig, ax = plt.subplots(figsize=(12, 6))

x = np.arange(len(CLASS_NAMES))
width = 0.25

bars1 = ax.bar(x - width, precisions, width, label='Precision', color='#4C72B0')
bars2 = ax.bar(x, recalls, width, label='Recall', color='#DD8452')
bars3 = ax.bar(x + width, f1s, width, label='F1-Score', color='#55A868')

ax.set_ylabel('Score')
ax.set_title('Per-Class Performance Metrics')
ax.set_xticks(x)
ax.set_xticklabels([CLASS_LABELS[c] for c in CLASS_NAMES], rotation=45, ha='right')
ax.legend()
ax.set_ylim(0, 1.05)
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
fig.savefig(RESULTS_DIR / 'per_class_metrics.png', dpi=150, bbox_inches='tight')
plt.show()

print(metrics_df.to_string(index=False))

---
## 5. Misclassified Examples

Let's look at the actual images the model got wrong. This helps us understand:
- Are the mistakes reasonable (ambiguous images that even doctors struggle with)?
- Or is the model making obvious errors?

We'll show the misclassifications the model was most confident about, along with that confidence.

In [None]:
# Find misclassified samples
test_df_reset = test_df.reset_index(drop=True)
misclassified_idx = np.where(y_true != y_pred)[0]

print(f'Total misclassified: {len(misclassified_idx)} / {len(y_true)} ({len(misclassified_idx)/len(y_true)*100:.1f}%)')
print()

# Build image path lookup
image_path_map = {}
for d in image_dirs:
    if d.exists():
        for f in d.iterdir():
            if f.suffix == '.jpg':
                image_path_map[f.stem] = f

# Sort by confidence (highest confidence mistakes are most interesting)
misclassified_conf = confidences[misclassified_idx]
sorted_idx = misclassified_idx[np.argsort(-misclassified_conf)]  # Highest confidence first

# Show top 12 most confident mistakes
n_show = min(12, len(sorted_idx))
fig, axes = plt.subplots(3, 4, figsize=(16, 12))

for i, idx in enumerate(sorted_idx[:n_show]):
    row = test_df_reset.iloc[idx]
    img_path = image_path_map.get(row['image_id'])

    ax = axes[i // 4, i % 4]

    if img_path and img_path.exists():
        # Close the file handle once the pixels have been drawn
        with Image.open(img_path) as img:
            ax.imshow(img)

    true_label = CLASS_LABELS[CLASS_NAMES[y_true[idx]]]
    pred_label = CLASS_LABELS[CLASS_NAMES[y_pred[idx]]]
    conf = confidences[idx]

    ax.set_title(f'True: {true_label}\nPred: {pred_label}\nConf: {conf:.1%}',
                 fontsize=8, color='red')
    ax.axis('off')

# Blank out unused panels when there are fewer than 12 mistakes
for ax in axes.ravel()[n_show:]:
    ax.axis('off')

plt.suptitle('Most Confident Misclassifications (worst mistakes)', fontsize=14, fontweight='bold')
plt.tight_layout()
fig.savefig(RESULTS_DIR / 'misclassified_examples.png', dpi=150, bbox_inches='tight')
plt.show()

### Misclassified Examples — What we learn

**Looking at the images the model got wrong tells us a lot:**
- Many misclassified images are genuinely ambiguous — they could reasonably be either class
- Some errors are between visually similar classes (mel vs nv, akiec vs bkl)
- The most dangerous mistakes are **high-confidence wrong predictions** — these are cases where the model is confidently incorrect

**This kind of error analysis is what separates a portfolio project from a Kaggle submission.** Real-world ML engineers always examine their model's failures to understand limitations.

---
## 6. Confusion Breakdown — Where exactly does the model confuse?

Let's look at the most common error pairs to understand the model's systematic biases.

In [None]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)

# Find the biggest off-diagonal values (most common mistakes)
errors = []
for i in range(NUM_CLASSES):
    for j in range(NUM_CLASSES):
        if i != j and cm[i, j] > 0:
            errors.append({
                'true': CLASS_LABELS[CLASS_NAMES[i]],
                'predicted': CLASS_LABELS[CLASS_NAMES[j]],
                'count': cm[i, j],
                'pct_of_true': cm[i, j] / cm[i].sum() * 100
            })

errors_df = pd.DataFrame(errors).sort_values('count', ascending=False).head(10)
print('Top 10 Most Common Misclassifications:')
print('=' * 70)
for _, row in errors_df.iterrows():
    print(f'  {row["true"]:30s} -> {row["predicted"]:30s}  ({int(row["count"]):3d} cases, {row["pct_of_true"]:.1f}% of actual)')

print(f'\nTotal errors: {(y_true != y_pred).sum()}')

### Error Patterns — Interpretation

**Common patterns in skin lesion classification errors:**
- **mel → nv** (Melanoma predicted as Mole) — The most clinically dangerous error. Melanomas can look like regular moles, especially early-stage ones
- **bkl → nv** (Benign Keratosis predicted as Mole) — Both are brownish, bumpy lesions
- **nv → mel** (Mole predicted as Melanoma) — False alarm, but better safe than sorry in medicine!

**Key insight:** Most errors are between visually similar classes. The model isn't making random mistakes — it's struggling with the same ambiguities that challenge dermatologists.

**How to improve:**
- Higher resolution images (224x224 or 384x384) would help catch subtle differences
- More training data for rare classes
- Ensemble of multiple models
- Include patient metadata (age, sex, localization) as additional features

---
## Summary

### What we learned from deep evaluation:

| Analysis | Key Finding |
|----------|-------------|
| **ROC/AUC** | Model separates most classes well from the rest (one-vs-rest) |
| **Confidence** | Model is more confident when correct — it "knows what it knows" |
| **Per-class metrics** | Rare classes benefit from the weighted loss; nv has the highest precision |
| **Misclassifications** | Most errors are between visually similar classes (mel↔nv, akiec↔bkl) |
| **Error patterns** | Errors mirror real clinical challenges in dermatology |

### Why this analysis matters:
- **For the portfolio**: shows you don't just train models — you understand their behavior
- **For real-world ML**: error analysis is essential before deploying any model
- **For medical AI**: understanding failure modes is critical for patient safety

### Next Steps

-> **04_gradcam.ipynb** — visualize WHERE the model looks when making predictions (explainability)