# ROC Curve Analysis for CIFAR-10 CNN Experiments

This notebook computes ROC (Receiver Operating Characteristic) curves and AUC (Area Under Curve) scores for CIFAR-10 CNN classification experiments. It compares model performance across different training configurations.

## Workflow

1. Load prediction probability files from the DerivaML catalog as assets
2. Retrieve ground truth labels from the Image_Classification feature table
3. Compute per-class and micro/macro-averaged ROC curves
4. Generate comparison visualizations

## Requirements

- Prediction probability CSV files with columns: `Image_RID`, `Predicted_Class`, `prob_<classname>` for each class
- Ground truth labels stored in the Image_Classification feature (from a labeling execution with no confidence scores)
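
As a rough illustration of the column layout listed above, the following sketch checks a CSV header against the expected columns using only the standard library. The helper `check_prediction_columns`, the two-class example, and the sample rows are hypothetical and not part of the notebook or of DerivaML.

```python
import csv
import io


def check_prediction_columns(header, class_names):
    """Return the required columns that are missing from a CSV header."""
    required = ["Image_RID", "Predicted_Class"] + [f"prob_{c}" for c in class_names]
    return [col for col in required if col not in header]


# Tiny in-memory stand-in for a prediction file (RIDs and values are made up)
sample_csv = io.StringIO(
    "Image_RID,Predicted_Class,prob_cat,prob_dog\n"
    "1-ABC,cat,0.9,0.1\n"
    "1-ABD,dog,0.3,0.7\n"
)
header = next(csv.reader(sample_csv))

missing_ok = check_prediction_columns(header, ["cat", "dog"])
missing_bad = check_prediction_columns(header, ["cat", "dog", "frog"])
print(missing_ok)   # []
print(missing_bad)  # ['prob_frog']
```

A check like this could run before the ROC computation so that a missing `prob_<classname>` column is reported by name rather than surfacing later as a `KeyError`.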

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

from deriva_ml.execution import run_notebook

## Initialize Notebook

Initialize the notebook with a DerivaML execution context. This single call:
1. Loads all configuration modules
2. Resolves the hydra-zen configuration
3. Creates the DerivaML connection
4. Creates a workflow and execution context
5. Downloads any specified assets

Override configuration at runtime:
```python
ml, execution, config = run_notebook(
    "roc_analysis",
    overrides=["assets=roc_quick_probabilities"],  # Analyze single experiment
)
```

Available asset configurations (see `src/configs/assets.py`):
- `roc_quick_probabilities` - cifar10_quick experiment only
- `roc_extended_probabilities` - cifar10_extended experiment only  
- `roc_comparison_probabilities` - both experiments (default)

In [None]:
# Initialize notebook - this single call handles all setup
ml, execution, config = run_notebook("roc_analysis", workflow_type="ROC Analysis Notebook")

from IPython.display import display, Markdown

display(Markdown(f"""**Connection:** `{ml.host_name}`, catalog `{ml.catalog_id}`

**Execution:** [{execution.execution_rid}]({ml.cite(execution.execution_rid)})

**Configuration:**
- Assets: `{config.assets}`
- Show per-class curves: `{config.show_per_class}`
- Confidence threshold: `{config.confidence_threshold}`

**Downloaded:** {list(execution.asset_paths.keys())}
"""))

# Display asset configuration description
if hasattr(config.assets, 'description') and config.assets.description:
    display(Markdown(f"**Asset Configuration Description:**\n\n{config.assets.description}"))

## Load Experiments from Assets

Load the prediction probability CSV files from downloaded assets and create `Experiment` objects to access configuration metadata.

In [None]:
from IPython.display import display, Markdown, HTML

# Build Experiment objects for each prediction asset
experiments = []
loaded_info = []

for asset_path in execution.asset_paths.get('Execution_Asset', []):
    if asset_path.file_name.name == "prediction_probabilities.csv":
        # Find the source execution that produced this asset
        asset = ml.lookup_asset(asset_path.asset_rid)
        asset_executions = asset.list_executions(asset_role='Output')
        
        if asset_executions:
            exec_rid = asset_executions[0]['Execution']
            exp = ml.lookup_experiment(exec_rid)
            
            # Load prediction data
            df = pd.read_csv(asset_path.file_name)
            
            experiments.append({
                'experiment': exp,
                'asset_rid': asset_path.asset_rid,
                'data': df,
                'name': exp.name,
                'config_choices': exp.config_choices,
                'model_config': exp.model_config,
            })
            loaded_info.append(
                f"- **{exp.name}**: {len(df)} predictions "
                f"(execution [{exec_rid}]({ml.cite(exec_rid)}))"
            )

display(Markdown("**Loaded experiments:**\n\n" + "\n".join(loaded_info)))
display(Markdown(f"\n*Successfully loaded {len(experiments)} experiments*"))

## Experiment Summary

Display detailed configuration information for each experiment using the `Experiment` class. This shows:
- **Configuration Choices**: The Hydra config names used (e.g., `model_config=cifar10_quick`)
- **Model Configuration**: Hyperparameters like epochs, learning rate, batch size
- **Input Datasets**: Training and test datasets used
- **Input Assets**: Any pre-trained weights or other assets used as inputs

In [None]:
# Display experiment configurations using Experiment.display_markdown()
for exp_data in experiments:
    exp_data['experiment'].display_markdown()

display(Markdown("---"))

## Get Ground Truth Labels

Retrieve ground truth labels from the `Image_Classification` feature table. This feature stores classification labels for images, potentially from multiple sources (executions).

**Identifying ground truth:**
- Ground truth labels are manually assigned and have **no confidence score** (NULL)
- Model predictions have confidence scores from softmax probabilities
- We identify the ground truth execution as the one whose labels all have a NULL confidence; if no such execution exists, the code falls back to the execution with the most labels
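
The rule above can be tried on a toy table first. This is a minimal sketch, assuming the feature table has `Execution`, `Image`, `Image_Class`, and `Confidence` columns as in the cell that follows; the execution names and values are made up.

```python
import pandas as pd

# Toy feature table: "GT" holds manual labels (no confidence),
# "M1" holds model output (a confidence on every row).
toy_features = pd.DataFrame({
    "Execution": ["GT", "GT", "M1", "M1"],
    "Image": ["img1", "img2", "img1", "img2"],
    "Image_Class": ["cat", "dog", "cat", "cat"],
    "Confidence": [None, None, 0.91, 0.55],
})

# Count rows and non-null confidence values per execution
toy_summary = toy_features.groupby("Execution").agg(
    num_images=("Image", "count"),
    with_confidence=("Confidence", lambda s: s.notna().sum()),
)

# Executions whose labels never carry a confidence score
gt_candidates = toy_summary.index[toy_summary["with_confidence"] == 0].tolist()
print(gt_candidates)  # ['GT']
```

Note that the test is for a missing value, not a numeric zero: a label stored with a confidence of `0.0` still counts as having a confidence.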

In [None]:
# Get ground truth labels from the feature table
all_feature_values = list(ml.list_feature_values("Image", "Image_Classification"))
feature_df = pd.DataFrame(all_feature_values)

# Ground truth labels have no confidence score (manually labeled)
# Group by execution to identify which has ground truth
exec_summary = feature_df.groupby('Execution').agg({
    'Image': 'count',
    'Confidence': lambda x: x.notna().sum()
}).rename(columns={'Image': 'num_images', 'Confidence': 'with_confidence'})

# Find execution with no confidence scores (ground truth)
gt_mask = exec_summary['with_confidence'] == 0
if gt_mask.any():
    gt_execution = exec_summary[gt_mask].index[0]
else:
    # Fallback heuristic: no execution is confidence-free, so use the one with the most labels
    gt_execution = exec_summary['num_images'].idxmax()

# Extract ground truth as lookup dictionary
ground_truth = feature_df[feature_df['Execution'] == gt_execution][['Image', 'Image_Class']]
gt_lookup = dict(zip(ground_truth['Image'], ground_truth['Image_Class']))

# Get class names
class_names = sorted(ground_truth['Image_Class'].unique())
n_classes = len(class_names)

display(Markdown(f"""**Ground Truth:**
- Execution: [{gt_execution}]({ml.cite(gt_execution)})
- Total labels: {exec_summary.loc[gt_execution, 'num_images']}
- Classes ({n_classes}): {', '.join(f'`{c}`' for c in class_names)}
"""))

## Merge Predictions with Ground Truth

Join prediction data with ground truth labels using `Image_RID` as the key. Only images that have both predictions and ground truth labels will be included in the ROC analysis.
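
A small sketch of this join on made-up data, assuming a lookup dictionary from image RID to true class like the `gt_lookup` built earlier. The `toy_` names are illustrative only.

```python
import pandas as pd

toy_gt_lookup = {"img1": "cat", "img2": "dog"}  # image RID -> true class

toy_preds = pd.DataFrame({
    "Image_RID": ["img1", "img2", "img3"],      # img3 has no ground truth
    "Predicted_Class": ["cat", "cat", "dog"],
})

# Series.map with a dict yields NaN for keys that are missing from the dict
toy_preds["True_Class"] = toy_preds["Image_RID"].map(toy_gt_lookup)

# Drop predictions that have no ground truth label
toy_matched = toy_preds.dropna(subset=["True_Class"])

toy_accuracy = (toy_matched["Predicted_Class"] == toy_matched["True_Class"]).mean() * 100
print(len(toy_matched), toy_accuracy)  # 2 50.0
```

Because unmatched rows are dropped silently, comparing the matched count with the total, as the next cell does, is what reveals a mismatch between the prediction file and the labeled image set.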

In [None]:
# Add ground truth to each experiment's predictions
merge_results = []

for exp in experiments:
    df = exp['data'].copy()
    
    # Map Image_RID to ground truth class
    df['True_Class'] = df['Image_RID'].map(gt_lookup)
    
    # Keep only images with ground truth
    matched = df['True_Class'].notna().sum()
    total = len(df)
    
    df = df.dropna(subset=['True_Class'])
    exp['data'] = df
    exp['n_samples'] = len(df)
    
    if len(df) > 0:
        exp['accuracy'] = (df['Predicted_Class'] == df['True_Class']).mean() * 100
        merge_results.append(f"- **{exp['name']}**: {matched}/{total} matched, accuracy {exp['accuracy']:.1f}%")
    else:
        exp['accuracy'] = float('nan')
        merge_results.append(f"- **{exp['name']}**: ⚠️ No matching samples found!")

display(Markdown("**Prediction/Ground Truth Merge:**\n\n" + "\n".join(merge_results)))

## Compute ROC Curves

For multi-class classification, we use the **one-vs-rest (OvR)** approach:
- Each class gets its own ROC curve treating it as positive vs. all others
- **Micro-average**: Pool every (sample, class) decision into a single binary problem and compute one ROC curve
- **Macro-average**: Simple mean of per-class AUC scores (equal weight to each class)

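AUC (Area Under ROC Curve) ranges from 0 to 1: 0.5 corresponds to random ranking and 1.0 to perfect discrimination, while values below 0.5 indicate a ranking that is worse than random.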
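
To make the two averages concrete, here is a dependency-free sketch on made-up data. It does not use scikit-learn. Instead it relies on the fact that AUC equals the fraction of (positive, negative) pairs that the scores rank correctly, with ties counted as half. The helper `pairwise_auc` and all `toy_` names are hypothetical; the notebook itself uses `roc_curve` and `auc`.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair, which matches the trapezoidal
    area under the ROC curve.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


toy_classes = ["cat", "dog", "frog"]
toy_true = ["cat", "dog", "frog", "cat"]
# Rows are samples, columns follow toy_classes (made-up probabilities)
toy_score = [
    [0.7, 0.2, 0.1],
    [0.3, 0.5, 0.2],
    [0.2, 0.2, 0.6],
    [0.4, 0.5, 0.1],
]

# One-vs-rest: one binary indicator column per class
toy_bin = [[int(t == c) for c in toy_classes] for t in toy_true]

# Per-class AUC, then macro-average as their unweighted mean
per_class = {
    c: pairwise_auc([row[i] for row in toy_bin], [row[i] for row in toy_score])
    for i, c in enumerate(toy_classes)
}
macro_auc = sum(per_class.values()) / len(per_class)

# Micro-average: pool every (sample, class) cell into one binary problem
micro_auc = pairwise_auc(
    [v for row in toy_bin for v in row],
    [s for row in toy_score for s in row],
)

print(per_class)   # cat and frog reach 1.0, dog is 2.5/3
print(macro_auc, micro_auc)
```

In this toy case the two averages differ (macro is about 0.944, micro is 0.953125) because the micro-average compares scores across classes, while the macro-average only compares scores within each class.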

In [None]:
def compute_roc_metrics(df: pd.DataFrame, class_names: list[str]) -> dict:
    """Compute ROC curves and AUC scores for multi-class predictions.
    
    Args:
        df: DataFrame with True_Class and prob_* columns
        class_names: Ordered list of class names
        
    Returns:
        Dict with fpr, tpr, roc_auc for each class and micro/macro averages
    """
    n_classes = len(class_names)
    class_to_idx = {name: i for i, name in enumerate(class_names)}
    
    # Convert labels to indices
    y_true_idx = df['True_Class'].map(class_to_idx).values
    y_true_bin = label_binarize(y_true_idx, classes=range(n_classes))
    
    # Get probability matrix
    prob_cols = [f"prob_{c}" for c in class_names]
    y_score = df[prob_cols].values
    
    # Compute per-class ROC
    fpr, tpr, roc_auc = {}, {}, {}
    for i, name in enumerate(class_names):
        fpr[name], tpr[name], _ = roc_curve(y_true_bin[:, i], y_score[:, i])
        roc_auc[name] = auc(fpr[name], tpr[name])
    
    # Micro-average
    fpr['micro'], tpr['micro'], _ = roc_curve(y_true_bin.ravel(), y_score.ravel())
    roc_auc['micro'] = auc(fpr['micro'], tpr['micro'])
    
    # Macro-average
    roc_auc['macro'] = np.mean([roc_auc[c] for c in class_names])
    
    return {'fpr': fpr, 'tpr': tpr, 'roc_auc': roc_auc}

In [None]:
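# Compute ROC metrics for each experiment
roc_results = []

# Skip experiments with no matched samples; compute_roc_metrics cannot run on an empty frame
experiments = [exp for exp in experiments if exp['n_samples'] > 0]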

for exp in experiments:
    metrics = compute_roc_metrics(exp['data'], class_names)
    exp.update(metrics)
    
    roc_results.append(
        f"- **{exp['name']}** (asset [{exp['asset_rid']}]({ml.cite(exp['asset_rid'])})): "
        f"Accuracy {exp['accuracy']:.2f}%, Micro-AUC {exp['roc_auc']['micro']:.4f}, "
        f"Macro-AUC {exp['roc_auc']['macro']:.4f}"
    )

display(Markdown("**ROC Metrics:**\n\n" + "\n".join(roc_results)))

In [None]:
# Display AUC comparison table
if experiments:
    auc_data = []
    for exp in experiments:
        exp_name = exp.get('name', exp['asset_rid'])
        row = {'Experiment': exp_name}
        for c in class_names:
            row[c] = exp['roc_auc'][c]
        row['Micro'] = exp['roc_auc']['micro']
        row['Macro'] = exp['roc_auc']['macro']
        auc_data.append(row)

    auc_df = pd.DataFrame(auc_data).set_index('Experiment')
    display(Markdown("**Per-class AUC scores:**"))
    display(auc_df.round(4))
else:
    display(Markdown("*No experiments loaded*"))

## Plot ROC Curves

In [None]:
def plot_roc_curves(exp: dict, class_names: list[str], show_per_class: bool = True):
    """Plot ROC curves for an experiment.
    
    Args:
        exp: Experiment dict with fpr, tpr, roc_auc data
        class_names: List of class names
        show_per_class: If True, plot individual class curves. If False, only micro-average.
    """
    fig, ax = plt.subplots(figsize=(10, 8))
    
    fpr, tpr, roc_auc = exp['fpr'], exp['tpr'], exp['roc_auc']
    
    # Micro-average (always shown)
    ax.plot(fpr['micro'], tpr['micro'], 
            label=f"Micro-avg (AUC={roc_auc['micro']:.3f})",
            color='deeppink', linestyle=':', linewidth=3)
    
    # Per-class curves (optional based on config)
    if show_per_class:
        colors = plt.cm.tab10(np.linspace(0, 1, len(class_names)))
        for i, name in enumerate(class_names):
            ax.plot(fpr[name], tpr[name], color=colors[i],
                    label=f"{name} (AUC={roc_auc[name]:.3f})")
    
    ax.plot([0, 1], [0, 1], 'k--', alpha=0.5)
    ax.set_xlim([0, 1])
    ax.set_ylim([0, 1.05])
    ax.set_xlabel('False Positive Rate')
    ax.set_ylabel('True Positive Rate')
    
    # Use experiment name in title
    exp_name = exp.get('name', exp['asset_rid'])
    ax.set_title(f"ROC Curves: {exp_name} (Acc: {exp['accuracy']:.1f}%)")
    
    ax.legend(loc='lower right', fontsize=9)
    ax.grid(True, alpha=0.3)
    
    return fig

In [None]:
# Plot ROC curves for each experiment (controlled by config.show_per_class)
for exp in experiments:
    fig = plot_roc_curves(exp, class_names, show_per_class=config.show_per_class)
    plt.tight_layout()
    plt.show()

## Experiment Comparison

Compare micro-averaged ROC curves across all experiments. This visualization shows how different model configurations perform relative to each other:
- Curves closer to the top-left corner indicate better performance
- The diagonal dashed line represents random classification (AUC = 0.5)

In [None]:
# Compare micro-average ROC curves across experiments
if len(experiments) > 1:
    fig, ax = plt.subplots(figsize=(10, 8))
    colors = plt.cm.Set1(np.linspace(0, 1, len(experiments)))
    
    for i, exp in enumerate(experiments):
        exp_name = exp.get('name', exp['asset_rid'])
        label = f"{exp_name} (AUC={exp['roc_auc']['micro']:.3f}, Acc={exp['accuracy']:.1f}%)"
        ax.plot(exp['fpr']['micro'], exp['tpr']['micro'], 
                color=colors[i], linewidth=2, label=label)
    
    ax.plot([0, 1], [0, 1], 'k--', alpha=0.5, label='Random')
    ax.set_xlim([0, 1])
    ax.set_ylim([0, 1.05])
    ax.set_xlabel('False Positive Rate')
    ax.set_ylabel('True Positive Rate')
    ax.set_title('ROC Curve Comparison (Micro-Average)')
    ax.legend(loc='lower right')
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
else:
    display(Markdown("*Single experiment - no comparison plot*"))

## Summary

In [None]:
# Final summary
summary_lines = [
    "# CIFAR-10 ROC Analysis Summary",
    "",
    f"**Catalog:** `{ml.host_name}:{ml.catalog_id}`",
    f"**Execution:** [{execution.execution_rid}]({ml.cite(execution.execution_rid)})",
    f"**Ground truth:** Execution [{gt_execution}]({ml.cite(gt_execution)}) ({len(gt_lookup)} labels)",
    f"**Classes:** {n_classes}",
    f"**Experiments analyzed:** {len(experiments)}",
]

display(Markdown("\n".join(summary_lines)))

# Display asset configuration description if available
if hasattr(config.assets, 'description') and config.assets.description:
    display(Markdown(f"\n**Analysis Context:**\n\n{config.assets.description}"))

# Display per-experiment summary
results_lines = ["## Experiment Results", ""]
for exp in experiments:
    exp_name = exp.get('name', exp['asset_rid'])
    exp_obj = exp.get('experiment')
    
    # Get experiment description if available
    exp_desc = ""
    if exp_obj and exp_obj.description:
        exp_desc = f" - {exp_obj.description}"
    
    # Link to source execution
    exec_link = f"[{exp_obj.execution_rid}]({ml.cite(exp_obj.execution_rid)})" if exp_obj else exp['asset_rid']
    
    results_lines.append(f"### {exp_name}{exp_desc}")
    results_lines.append(f"- **Execution:** {exec_link}")
    results_lines.append(f"- **Samples:** {exp['n_samples']}")
    results_lines.append(f"- **Accuracy:** {exp['accuracy']:.2f}%")
    results_lines.append(f"- **Micro-AUC:** {exp['roc_auc']['micro']:.4f}")
    results_lines.append(f"- **Macro-AUC:** {exp['roc_auc']['macro']:.4f}")
    results_lines.append("")

display(Markdown("\n".join(results_lines)))

In [None]:
# Complete execution and upload outputs
execution.upload_execution_outputs()
display(Markdown(f"**Execution completed:** [{execution.execution_rid}]({ml.cite(execution.execution_rid)})"))