# Evaluator Analysis

This notebook shows how to measure the performance of baseline evaluators (such as Llama Guard and Prompt Guard) against ground-truth labels. The worked example uses the Llama Guard label (`lg_label`).

## Overview

The ingestion pipeline can run evaluators on prompts to generate additional labels:
- **LlamaGuardEvaluator**: Safety classification using Llama Guard
- **PromptGuardEvaluator**: Meta's Prompt Guard 2 for prompt injection detection
- **InjecAgentEvaluator**: Detects successful prompt injection attacks

This notebook shows how to load these labels and compute metrics.

In [None]:
from prompt_mining.classifiers import ClassificationDataset
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

## 1. Load Labels from Ingestion Output

Use `load_labels()` to get evaluator outputs without loading activations.

In [None]:
# Load dataset
# Update this path to point to your ingestion output
dataset = ClassificationDataset.from_path("/path/to/activations")

# Load only labels (no activations needed)
labels_data = dataset.load_labels(filters={'status': 'completed'})
print(f"Loaded {len(labels_data)} completed samples")

In [None]:
# Extract into DataFrame
records = []
for item in labels_data:
    labels = item['prompt_labels']
    records.append({
        'run_id': item['run_id'],
        'dataset': item['dataset_id'],
        'actual_malicious': labels.get('malicious', False),  # ground truth; a missing label is treated as benign
        'lg_label': labels.get('lg_label'),  # Llama Guard output
    })

df = pd.DataFrame(records)
print(f"Total samples: {len(df)}")
print(f"\nLlama Guard coverage: {df['lg_label'].notna().sum()} / {len(df)}")
print("\nLabel distribution:")
print(df['lg_label'].value_counts(dropna=False))
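
The extraction loop above reads three fields from each item: `run_id`, `dataset_id`, and `prompt_labels`. As a minimal sketch of that flattening, the snippet below uses hand-written mock records and shows that a sample without an `lg_label` is kept in the table but excluded from coverage. The `benign_chat` dataset name and the record contents are hypothetical, not real pipeline output.

```python
# Hypothetical records that mirror the fields read by the extraction loop.
# Real output from load_labels() may carry additional keys.
mock_labels_data = [
    {"run_id": "r1", "dataset_id": "advbench",
     "prompt_labels": {"malicious": True, "lg_label": "UNSAFE"}},
    {"run_id": "r2", "dataset_id": "advbench",
     "prompt_labels": {"malicious": True, "lg_label": "SAFE"}},
    # Evaluator did not run on this sample, so there is no lg_label key.
    {"run_id": "r3", "dataset_id": "benign_chat",
     "prompt_labels": {"malicious": False}},
]


def flatten(item):
    """Turn one labels record into a flat row, as the loop above does."""
    labels = item["prompt_labels"]
    return {
        "run_id": item["run_id"],
        "dataset": item["dataset_id"],
        "actual_malicious": labels.get("malicious", False),
        "lg_label": labels.get("lg_label"),  # None when the evaluator did not run
    }


rows = [flatten(item) for item in mock_labels_data]
covered = [row for row in rows if row["lg_label"] is not None]
print(f"Llama Guard coverage: {len(covered)} / {len(rows)}")
```

Because uncovered rows carry `None`, they become `NaN` in the DataFrame and are dropped by the `notna()` filter in the next section.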

## 2. Compute Per-Dataset Metrics

In [None]:
# Filter to samples with Llama Guard labels
df_labeled = df[df['lg_label'].notna()].copy()

# Convert to binary: UNSAFE=1 (malicious), SAFE=0 (benign)
df_labeled['lg_pred'] = (df_labeled['lg_label'] == 'UNSAFE').astype(int)
df_labeled['actual_malicious'] = df_labeled['actual_malicious'].astype(int)

# Compute metrics per dataset
metrics_list = []
for ds in sorted(df_labeled['dataset'].unique()):
    subset = df_labeled[df_labeled['dataset'] == ds]
    y_true = subset['actual_malicious'].values
    y_pred = subset['lg_pred'].values
    
    n_samples = len(subset)
    n_positive = y_true.sum()
    
    acc = accuracy_score(y_true, y_pred)
    
    # Report precision, recall and F1 as NaN for single-class datasets (all benign or all malicious)
    if n_positive == 0 or n_positive == n_samples:
        prec, rec, f1 = np.nan, np.nan, np.nan
    else:
        prec = precision_score(y_true, y_pred, zero_division=0)
        rec = recall_score(y_true, y_pred, zero_division=0)
        f1 = f1_score(y_true, y_pred, zero_division=0)
    
    # Confusion matrix
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    
    metrics_list.append({
        'Dataset': ds,
        'N': n_samples,
        'Pos%': n_positive / n_samples * 100,
        'Accuracy': acc * 100,
        # NaN * 100 is still NaN, so no explicit NaN check is needed
        'Precision': prec * 100,
        'Recall': rec * 100,
        'F1': f1 * 100,
        'TP': tp, 'FP': fp, 'TN': tn, 'FN': fn,
    })

metrics_df = pd.DataFrame(metrics_list)
print("Per-Dataset Metrics")
print("=" * 80)
display(metrics_df)
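
Each metric in the table can be derived from the four confusion-matrix counts. The sketch below spells out those formulas in plain Python for one set of hypothetical counts. Returning NaN for a zero denominator is a choice made for this illustration; the cell above instead passes `zero_division=0` to scikit-learn and sets NaN only for single-class datasets.

```python
import math


def metrics_from_counts(tp, fp, tn, fn):
    """Derive accuracy, precision, recall and F1 from confusion-matrix counts.

    Ratios with a zero denominator are returned as NaN.
    """
    def ratio(num, den):
        return num / den if den > 0 else math.nan

    total = tp + fp + tn + fn
    return {
        "accuracy": ratio(tp + tn, total),      # correct predictions / all samples
        "precision": ratio(tp, tp + fp),        # flagged prompts that were malicious
        "recall": ratio(tp, tp + fn),           # malicious prompts that were flagged
        "f1": ratio(2 * tp, 2 * tp + fp + fn),  # harmonic mean of precision and recall
    }


# Hypothetical counts for one dataset of 100 samples.
m = metrics_from_counts(tp=8, fp=2, tn=85, fn=5)
print({name: round(value, 3) for name, value in m.items()})
```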

## 3. Overall Metrics

In [None]:
y_true_all = df_labeled['actual_malicious'].values
y_pred_all = df_labeled['lg_pred'].values

print("Overall Metrics")
print("=" * 60)
print(f"Total samples: {len(df_labeled):,}")
print(f"Actual positives: {y_true_all.sum():,} ({y_true_all.mean()*100:.1f}%)")
print(f"Predicted positives: {y_pred_all.sum():,} ({y_pred_all.mean()*100:.1f}%)")
print()
print(f"Accuracy:  {accuracy_score(y_true_all, y_pred_all)*100:.1f}%")
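# zero_division=0 matches the per-dataset cell and avoids a warning when no positives are predicted
print(f"Precision: {precision_score(y_true_all, y_pred_all, zero_division=0)*100:.1f}%")
print(f"Recall:    {recall_score(y_true_all, y_pred_all, zero_division=0)*100:.1f}%")
print(f"F1 Score:  {f1_score(y_true_all, y_pred_all, zero_division=0)*100:.1f}%")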

cm = confusion_matrix(y_true_all, y_pred_all, labels=[0, 1])
print("\nConfusion Matrix:")
print("                 Predicted")
print("              SAFE    UNSAFE")
print(f"Actual SAFE   {cm[0,0]:>6,}    {cm[0,1]:>6,}")
print(f"     UNSAFE   {cm[1,0]:>6,}    {cm[1,1]:>6,}")

## 4. Detection Rate by Dataset

In [None]:
import matplotlib.pyplot as plt

# Calculate detection rate for datasets with positives
plot_df = metrics_df[metrics_df['Pos%'] > 0].copy()
plot_df['Detection_Rate'] = plot_df['TP'] / (plot_df['TP'] + plot_df['FN']) * 100
plot_df = plot_df.sort_values('Detection_Rate', ascending=True)

fig, ax = plt.subplots(figsize=(10, 8))
colors = ['#e74c3c' if d < 50 else '#f39c12' if d < 80 else '#27ae60' 
          for d in plot_df['Detection_Rate']]
bars = ax.barh(plot_df['Dataset'], plot_df['Detection_Rate'], color=colors)

ax.set_xlabel('Detection Rate (Recall) %')
ax.set_title('Llama Guard Detection Rate by Dataset')
ax.axvline(x=50, color='gray', linestyle='--', alpha=0.5)
ax.set_xlim(0, 105)

for bar, val in zip(bars, plot_df['Detection_Rate']):
    ax.text(val + 1, bar.get_y() + bar.get_height()/2, f'{val:.1f}%', va='center', fontsize=9)

plt.tight_layout()
plt.show()
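
The bar colors in the chart follow two thresholds, 50% and 80%. As a small sketch, the same logic can be written as two helper functions; `detection_rate` and `rate_color` are illustrative names introduced here, and the counts are hypothetical.

```python
def detection_rate(tp, fn):
    """Detection rate (recall) as a percentage; None when there are no positives."""
    positives = tp + fn
    return 100 * tp / positives if positives > 0 else None


def rate_color(rate):
    """Map a detection rate to the bar color used in the chart above."""
    if rate < 50:
        return "#e74c3c"  # red: under 50%
    if rate < 80:
        return "#f39c12"  # orange: 50% up to, but not including, 80%
    return "#27ae60"      # green: 80% or higher


# Hypothetical counts for three datasets.
for tp, fn in [(2, 8), (6, 4), (8, 2)]:
    rate = detection_rate(tp, fn)
    print(f"TP={tp}, FN={fn}: {rate:.1f}% -> {rate_color(rate)}")
```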

## 5. False Positive Rate on Benign Datasets

In [None]:
# FPR on all-benign datasets
benign_datasets = metrics_df[metrics_df['Pos%'] == 0].copy()
benign_datasets['FPR'] = benign_datasets['FP'] / benign_datasets['N'] * 100
benign_datasets = benign_datasets.sort_values('FPR', ascending=False)

print("False Positive Rate on Benign Datasets")
print("=" * 60)
for _, row in benign_datasets.iterrows():
    print(f"{row['Dataset']:25s}: {row['FP']:4.0f} / {row['N']:5.0f} = {row['FPR']:.2f}% FPR")

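total_fp = benign_datasets['FP'].sum()
total_n = benign_datasets['N'].sum()
# Guard against division by zero when no all-benign datasets are present
if total_n > 0:
    print(f"\nOverall FPR on benign data: {total_fp:.0f} / {total_n:.0f} = {total_fp/total_n*100:.2f}%")
else:
    print("\nNo all-benign datasets found; overall FPR is undefined.")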
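
The cell above restricts FPR to all-benign datasets, where dividing by `N` is correct because every sample is a negative. For datasets that mix benign and malicious prompts, the denominator should count only the negatives. The sketch below shows that general form with hypothetical counts; `false_positive_rate` is a helper name introduced here for illustration.

```python
def false_positive_rate(fp, tn):
    """FPR as a percentage: FP / (FP + TN); None when there are no negatives."""
    negatives = fp + tn
    return 100 * fp / negatives if negatives > 0 else None


# All-benign dataset (hypothetical counts): every sample is a negative,
# so FP + TN equals N and this matches the calculation in the cell above.
print(false_positive_rate(fp=3, tn=97))

# Mixed dataset (hypothetical counts): only the 50 benign samples enter the
# denominator, however many malicious samples the dataset also contains.
print(false_positive_rate(fp=5, tn=45))
```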

## 6. Summary Table with Styling

In [None]:
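# Add detection rate column (equal to Recall on mixed datasets, but also defined
# for all-malicious datasets, where Recall was set to NaN in section 2)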
display_df = metrics_df.copy()
display_df['Det%'] = display_df.apply(
    lambda r: r['TP']/(r['TP']+r['FN'])*100 if (r['TP']+r['FN']) > 0 else np.nan, 
    axis=1
)
display_df = display_df.set_index('Dataset')

styled = display_df[['N', 'Pos%', 'Accuracy', 'Det%', 'Precision', 'Recall', 'F1']].style\
    .format({
        'N': '{:,}',
        'Pos%': '{:.1f}%',
        'Accuracy': '{:.1f}%',
        'Det%': '{:.1f}%',
        'Precision': '{:.1f}%',
        'Recall': '{:.1f}%',
        'F1': '{:.1f}%',
    }, na_rep='-')\
    .background_gradient(subset=['Accuracy', 'Det%', 'F1'], cmap='RdYlGn', vmin=0, vmax=100)

display(styled)

## Summary

This notebook demonstrated:

1. **Loading evaluator labels** from ingestion output
2. **Computing per-dataset metrics** (accuracy, precision, recall, F1)
3. **Analyzing detection rates** across different attack types
4. **Measuring false positive rates** on benign data

### Key Findings (typical)

- **Direct jailbreaks** (advbench, harmbench): High detection rate (typically around 95% or higher)
- **Indirect injection** (bipia, injecagent): Lower detection rate
- **Password extraction** (mosscap): Often missed by safety classifiers

This analysis helps identify gaps in baseline evaluators that activation-based classifiers may fill.