## OOD Evaluation Utilities — Documentation

- This notebook provides utilities for evaluating out-of-distribution (OOD) detection
  performance using common metrics such as AUROC and AUPR.
- The evaluation is based on confidence scores assigned by a model to in-distribution (ID)
  and out-of-distribution (OOD) samples.
- Lower confidence scores indicate a higher likelihood of being OOD.


## Imports

In [34]:
from __future__ import annotations
from dataclasses import dataclass

import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve, roc_curve

### OodEvaluationResult

A dataclass that stores all metrics and diagnostic arrays generated during OOD evaluation.

**Fields**

- **auroc (float)**  
  Area Under the ROC Curve. Measures how well the anomaly scores separate ID from OOD samples
  (1.0 = perfect separation, 0.5 = chance level).

- **fpr (np.ndarray)**  
  False Positive Rates (ID samples flagged as OOD) at each decision threshold.

- **tpr (np.ndarray)**  
  True Positive Rates (OOD samples correctly flagged) at the same thresholds.

- **labels (np.ndarray)**  
  Ground-truth labels used for evaluation:  
  0 = ID sample, 1 = OOD sample.

- **preds (np.ndarray)**  
  Anomaly scores computed as 1 - confidence.  
  Higher values correspond to higher OOD likelihood.

- **aupr (float)**  
  Area Under the Precision–Recall Curve, computed as average precision
  (`average_precision_score`) with OOD as the positive class.

- **precision (np.ndarray)**  
  Precision values for different thresholds.

- **recall (np.ndarray)**  
  Recall values for the same thresholds.
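
The field list above maps naturally onto a dataclass. The notebook does not show the actual definition, so the following is only a minimal sketch reconstructed from the documented fields; the field order and the absence of default values are assumptions.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class OodEvaluationResult:
    """Container for OOD metrics (sketch; field order is an assumption)."""

    auroc: float           # area under the ROC curve
    fpr: np.ndarray        # false positive rates, one per threshold
    tpr: np.ndarray        # true positive rates, one per threshold
    labels: np.ndarray     # ground truth: 0 = ID, 1 = OOD
    preds: np.ndarray      # anomaly scores, i.e. 1 - confidence
    aupr: float            # area under the precision-recall curve
    precision: np.ndarray  # precision values, one per threshold
    recall: np.ndarray     # recall values, one per threshold
```

Because every field is required, a result can only be constructed once all metrics and diagnostic arrays have been computed.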



### evaluate_ood_performance(id_scores, ood_scores)

Computes OOD detection metrics given confidence scores for ID and OOD samples.

#### Parameters
- **id_scores (np.ndarray)**  
  Confidence scores for in-distribution samples. Expected in [0, 1].

- **ood_scores (np.ndarray)**  
  Confidence scores for out-of-distribution samples. Expected in [0, 1].

#### Core Logic
The function assumes the convention:

**Low confidence → more likely OOD**

To transform scores into anomaly predictions, it computes  
`preds = 1.0 - score`.

This ensures that **higher values indicate higher OOD likelihood**, which `roc_curve` and
`precision_recall_curve` expect when OOD is the positive class (label 1).

#### Steps Performed

1. **Clip scores** to [0, 1] so that the anomaly scores `1 - score` also stay within [0, 1]:
   - id_scores = np.clip(id_scores, 0.0, 1.0) (and likewise for ood_scores)
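
A small illustration of the clipping step. The raw values here are made up for the example (they are not from the notebook) and deliberately fall slightly outside [0, 1]:

```python
import numpy as np

# Hypothetical raw confidences; the first and last are out of range.
raw_scores = np.array([1.02, 0.80, -0.01])

clipped = np.clip(raw_scores, 0.0, 1.0)  # values become 1.0, 0.8, 0.0
anomaly = 1.0 - clipped                  # anomaly scores stay within [0, 1]
```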

2. **Create labels**
   - 0 for ID samples
   - 1 for OOD samples

In [35]:
import numpy as np

id_scores = np.array([0.9, 0.8, 0.85])
ood_scores = np.array([0.1, 0.2, 0.05])

labels = np.concatenate([np.zeros(len(id_scores)), np.ones(len(ood_scores))])
labels

array([0., 0., 0., 1., 1., 1.])

3. **Combine Scores**

In [36]:
all_scores = np.concatenate([id_scores, ood_scores])

4. **Convert scores → anomaly predictions:**

In [37]:
preds = 1.0 - all_scores

5. **Compute ROC curve:**

In [38]:
fpr, tpr, _ = roc_curve(labels, preds)
roc_auc = auc(fpr, tpr)
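
AUROC also has a direct probabilistic reading: it is the probability that a randomly chosen OOD sample receives a higher anomaly score than a randomly chosen ID sample, with ties counted as one half. The NumPy-only sketch below checks this on the toy scores from the cells above. The helper name `pairwise_auroc` is hypothetical and not part of the notebook.

```python
import numpy as np


def pairwise_auroc(id_scores: np.ndarray, ood_scores: np.ndarray) -> float:
    """AUROC as P(anomaly_ood > anomaly_id), with ties counted as 0.5."""
    id_anomaly = 1.0 - np.asarray(id_scores, dtype=float)
    ood_anomaly = 1.0 - np.asarray(ood_scores, dtype=float)
    # Compare every OOD sample against every ID sample (n_ood x n_id matrix).
    diff = ood_anomaly[:, None] - id_anomaly[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / diff.size)


id_scores = np.array([0.9, 0.8, 0.85])
ood_scores = np.array([0.1, 0.2, 0.05])
print(pairwise_auroc(id_scores, ood_scores))  # 1.0: each OOD sample outranks each ID sample
```

The pairwise comparison needs memory proportional to the product of the two sample counts, so it is meant as a sanity check on small inputs rather than a replacement for `roc_curve`.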

6. **Compute Precision–Recall curve & AUPR:**

In [39]:
precision, recall, _ = precision_recall_curve(labels, preds)
aupr = average_precision_score(labels, preds)
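
Note that `average_precision_score` summarizes the PR curve as a step-wise sum over recall increments rather than a trapezoidal integral, so it can differ from `auc(recall, precision)`. The sketch below illustrates this with made-up, partially overlapping anomaly scores (not taken from the notebook):

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Hypothetical data: one ID sample (0.6) scores above one OOD sample (0.5).
labels = np.array([0, 0, 0, 1, 1, 1])
preds = np.array([0.1, 0.6, 0.3, 0.9, 0.5, 0.7])

precision, recall, _ = precision_recall_curve(labels, preds)
ap = average_precision_score(labels, preds)

# Step-wise sum: recall is returned in decreasing order, hence the minus sign.
ap_manual = -np.sum(np.diff(recall) * precision[:-1])

# Trapezoidal area under the same curve, for comparison.
trapezoid = auc(recall, precision)

print(ap, ap_manual, trapezoid)
```

The first two printed values agree, while the trapezoidal area interpolates linearly between curve points and is generally a different number.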

7. **Return all metrics inside an OodEvaluationResult.**


After computing AUROC, AUPR, the ROC curve, and the precision–recall curve, the function
packages all results into an `OodEvaluationResult` dataclass, which stores:

- `auroc`: Area under the ROC curve  
- `fpr`: False positive rates  
- `tpr`: True positive rates  
- `labels`: Ground-truth labels (0 = ID, 1 = OOD)  
- `preds`: Anomaly scores (`1 - confidence`)  
- `precision`: Precision values  
- `recall`: Recall values  
- `aupr`: Area under the precision–recall curve  

This structured return object makes it easy to access or visualize each metric.
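
Putting the steps above together, the function could look roughly like the sketch below. The notebook does not show the actual function body, so this is an assumed reconstruction from the documented steps; details such as the `np.asarray` conversion and the compact dataclass definition are illustrative choices.

```python
from dataclasses import dataclass

import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve, roc_curve


@dataclass
class OodEvaluationResult:
    auroc: float
    fpr: np.ndarray
    tpr: np.ndarray
    labels: np.ndarray
    preds: np.ndarray
    aupr: float
    precision: np.ndarray
    recall: np.ndarray


def evaluate_ood_performance(id_scores, ood_scores) -> OodEvaluationResult:
    """Sketch of the documented steps; not the notebook's verbatim code."""
    # 1. Clip confidences to the expected [0, 1] range.
    id_scores = np.clip(np.asarray(id_scores, dtype=float), 0.0, 1.0)
    ood_scores = np.clip(np.asarray(ood_scores, dtype=float), 0.0, 1.0)

    # 2. Labels: 0 for ID samples, 1 for OOD samples.
    labels = np.concatenate([np.zeros(len(id_scores)), np.ones(len(ood_scores))])

    # 3 + 4. Combine the scores and convert confidence to anomaly score.
    preds = 1.0 - np.concatenate([id_scores, ood_scores])

    # 5. ROC curve and its area.
    fpr, tpr, _ = roc_curve(labels, preds)
    roc_auc = float(auc(fpr, tpr))

    # 6. Precision-recall curve and average precision.
    precision, recall, _ = precision_recall_curve(labels, preds)
    aupr = float(average_precision_score(labels, preds))

    # 7. Package everything into the result object.
    return OodEvaluationResult(
        auroc=roc_auc, fpr=fpr, tpr=tpr, labels=labels, preds=preds,
        aupr=aupr, precision=precision, recall=recall,
    )
```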

## Example: Evaluating OOD Performance

The following example shows how to use **evaluate_ood_performance** with
a small set of in-distribution (ID) and out-of-distribution (OOD) confidence scores.

1. **Define example scores**

In [42]:
id_scores = np.array([0.95, 0.88, 0.91, 0.85, 0.93])   # Higher scores → ID
ood_scores = np.array([0.12, 0.08, 0.22, 0.15, 0.05])  # Lower scores → OOD

2. **Run OOD evaluation**

In [None]:
result = evaluate_ood_performance(id_scores, ood_scores)

print("AUROC:", result.auroc)
print("AUPR :", result.aupr)
print("Labels:", result.labels)
print("Predictions (anomaly scores):", result.preds)

3. **Plot ROC curve**

In [None]:
plot_roc(result)

4. **Plot precision–recall curve**

In [None]:
plot_pr(result)
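
The helpers `plot_roc` and `plot_pr` are called above but not defined in this part of the notebook. A minimal sketch of what they might look like with Matplotlib follows; the signatures, the optional `ax` argument, and the styling are all assumptions, and the functions only rely on the `fpr`, `tpr`, `auroc`, `precision`, `recall`, and `aupr` attributes of the result object.

```python
import matplotlib.pyplot as plt


def plot_roc(result, ax=None):
    """Plot the ROC curve stored in an OodEvaluationResult-like object."""
    if ax is None:
        _, ax = plt.subplots()
    ax.plot(result.fpr, result.tpr, label=f"AUROC = {result.auroc:.3f}")
    # Diagonal reference line: performance of a random detector.
    ax.plot([0, 1], [0, 1], linestyle="--", color="gray", label="chance")
    ax.set_xlabel("False positive rate")
    ax.set_ylabel("True positive rate")
    ax.set_title("ROC curve")
    ax.legend(loc="lower right")
    return ax


def plot_pr(result, ax=None):
    """Plot the precision-recall curve stored in the result object."""
    if ax is None:
        _, ax = plt.subplots()
    ax.plot(result.recall, result.precision, label=f"AUPR = {result.aupr:.3f}")
    ax.set_xlabel("Recall")
    ax.set_ylabel("Precision")
    ax.set_title("Precision-recall curve")
    ax.legend(loc="lower left")
    return ax
```

Returning the axes lets the caller place both curves side by side, for example by passing the two axes from `plt.subplots(1, 2)`.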