# Training Classifiers on Stored Activations

This notebook demonstrates how to train malicious prompt classifiers using pre-extracted model activations.

## Overview

The `prompt_mining.classifiers` module provides:
- **ClassificationDataset**: Load pre-extracted activations from the ingestion pipeline
- **LinearClassifier**: Sklearn-based logistic regression with configurable normalization
- **DANNClassifier**: Domain-Adversarial Neural Network for cross-dataset generalization
- **lodo_evaluate**: Leave-One-Dataset-Out evaluation for testing generalization
- **Threshold Strategies**: Pluggable threshold selection (constant, target precision, max F1)

In [4]:
from prompt_mining.classifiers import (
    # Data loading
    ClassificationDataset,
    # Classifiers
    LinearClassifier,
    LinearConfig,
    DANNClassifier,
    DANNConfig,
    # Evaluation
    lodo_evaluate,
    # Threshold strategies
    ConstantThreshold,
    TargetPrecisionThreshold,
    MaxF1Threshold,
    CVConfig,
)

import numpy as np
import pandas as pd

## 1. Loading Pre-Extracted Activations

The `ClassificationDataset` class loads activations from directories created by the ingestion pipeline.

**Key parameters for `.load()`:**
- `layer`: Which model layer to load (e.g., 27, 31)
- `space`: `'raw'` for model activations, `'sae'` for SAE-encoded features
- `position`: Token position (`'last'`, `'-5'`, `0`, etc.)
- `return_sparse`: Whether to return sparse matrix (for SAE features)

In [None]:
# Load dataset from ingestion output directory
# Update this path to point to your ingestion output
dataset = ClassificationDataset.from_path("/path/to/activations")

# Load raw (dense) activations at layer 31
data_raw = dataset.load(layer=31, space='raw', position=[-5])
print(dataset.summary(data_raw))

Loaded from cache: raw_layer31_pos-5.npz
ClassificationDataset Summary
Samples: 105034
Features: 4096
Datasets: 18
Class balance: 46.7% positive

Per-dataset breakdown:
  bipia_email_code_table: 15000 samples (95.3% positive)
  dolly_15k: 10000 samples (0.0% positive)
  enron: 10000 samples (0.0% positive)
  jayavibhav: 10000 samples (49.7% positive)
  mosscap: 10000 samples (100.0% positive)
  llmail: 9998 samples (100.0% positive)
  openorca: 9997 samples (0.0% positive)
  10k_prompts_ranked: 9924 samples (0.0% positive)
  safeguard: 8236 samples (30.3% positive)
  qualifire: 5000 samples (40.0% positive)
  wildjailbreak: 2210 samples (90.5% positive)
  injecagent_dh_ds_base: 1054 samples (100.0% positive)
  yanismiraoui: 1034 samples (100.0% positive)
  softAge: 1001 samples (0.0% positive)
  deepset: 546 samples (37.2% positive)
  advbench: 520 samples (100.0% positive)
  harmbench: 400 samples (100.0% positive)
  gandalf_summarization: 114 samples (100.0% positive)


In [None]:
# Load SAE-encoded features (sparse matrix for memory efficiency)
data_sae = dataset.load(layer=27, space='sae', position=[-5], return_sparse=True)
print(f"SAE data loaded: {data_sae.X.shape}")
print(f"Sparsity: {1 - data_sae.X.nnz / np.prod(data_sae.X.shape):.2%}")

Loaded from cache: sae_layer27_pos-5_sparse.npz
SAE data loaded: (105034, 131072)
Sparsity: 99.17%
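
The sparsity figure above is simply the fraction of zero entries. The sketch below reproduces the same calculation on a tiny synthetic matrix. It assumes that `data_sae.X` is a SciPy sparse matrix exposing `nnz`, as the cell above suggests; the matrix contents and shape here are made up for illustration.

```python
import numpy as np
from scipy import sparse

# Synthetic stand-in for SAE features: 4 samples x 10 features, 5 non-zeros.
dense = np.zeros((4, 10))
dense[0, 1] = 0.7
dense[1, 3] = 1.2
dense[1, 8] = 0.4
dense[2, 0] = 2.1
dense[3, 9] = 0.3
X = sparse.csr_matrix(dense)

# nnz counts the stored non-zero entries; sparsity is the share of zeros.
n_cells = X.shape[0] * X.shape[1]
sparsity = 1 - X.nnz / n_cells
print(f"Sparsity: {sparsity:.2%}")

# CSR memory is driven by the non-zeros: values, column indices, row pointers.
csr_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print(f"CSR bytes: {csr_bytes}, dense bytes: {dense.nbytes}")
```

At the scale shown above (105034 x 131072), a dense float array would need tens of gigabytes, which is why the sparse path matters for SAE features.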


## 2. Training a Linear Classifier

The `LinearClassifier` wraps sklearn's logistic regression with configurable normalization.

**LinearConfig options:**
- `model`: `'logistic'` (L-BFGS) or `'sgd'` (SGD optimizer)
- `normalize`: `'standard'` (z-score), `'l2'` (unit norm), `'none'`
- `C`: Inverse regularization strength (higher = less regularization)

In [7]:
from sklearn.model_selection import train_test_split

# Train a linear classifier on raw activations
clf_raw = LinearClassifier(LinearConfig(
    model='logistic',
    normalize='standard',
    C=0.1
))

# Simple train/test split for demonstration
X_train, X_test, y_train, y_test = train_test_split(
    data_raw.X, data_raw.y, test_size=0.2, random_state=42, stratify=data_raw.y
)

clf_raw.fit(X_train, y_train)
print(f"Trained: {clf_raw}")

# Evaluate
test_scores = clf_raw.predict_scores(X_test)
test_acc = ((test_scores >= 0.5).astype(int) == y_test).mean()
print(f"Test accuracy: {test_acc:.1%}")

Trained: LinearClassifier(model=logistic, normalize=standard, fitted)
Test accuracy: 97.8%
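
As a point of reference, the `normalize` options can be approximated with plain scikit-learn. This is a sketch under the assumption that `'standard'` corresponds to `StandardScaler` and `'l2'` to a row-wise `Normalizer`; the internals of `LinearClassifier` are not shown in this notebook. The synthetic data and the `make_pipeline_for` helper are hypothetical and exist only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer, StandardScaler


def make_pipeline_for(normalize: str, C: float):
    """Hypothetical helper: map a normalize option to an sklearn pipeline."""
    steps = []
    if normalize == "standard":
        steps.append(StandardScaler())       # per-feature z-score
    elif normalize == "l2":
        steps.append(Normalizer(norm="l2"))  # per-sample unit norm
    elif normalize != "none":
        raise ValueError(f"unknown normalize option: {normalize}")
    # LogisticRegression uses the L-BFGS solver by default.
    steps.append(LogisticRegression(C=C, max_iter=1000))
    return make_pipeline(*steps)


# Synthetic, linearly separable stand-in for activations.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 16))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = make_pipeline_for("standard", C=0.1).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # probability of class 1
test_acc = ((scores >= 0.5).astype(int) == y_test).mean()
print(f"Test accuracy: {test_acc:.1%}")
```

Wrapping the scaler and the classifier in one pipeline ensures the normalization statistics are fitted on the training split only.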


## 3. LODO Evaluation (Leave-One-Dataset-Out)

Random train/test splits can overestimate performance when samples from the same dataset appear in both sets. **LODO evaluation** tests true cross-dataset generalization:

1. For each dataset D: train on ALL other datasets, test on D
2. Report per-dataset metrics + weighted average

This reveals which datasets are "out-of-distribution" relative to others.
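
The two steps above can be sketched with scikit-learn's `LeaveOneGroupOut`, using the dataset name as the group label. This is not the implementation of `lodo_evaluate`, only a minimal illustration of the protocol on synthetic data; the dataset names `ds_a`, `ds_b`, `ds_c` are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)

# Three synthetic "datasets" sharing one label signal (feature 0),
# each with its own shift in feature 1 to mimic dataset-specific style.
n_per = 200
names = np.repeat(["ds_a", "ds_b", "ds_c"], n_per)
X = rng.normal(size=(3 * n_per, 8))
y = (X[:, 0] > 0).astype(int)
X[:, 1] += np.repeat([0.0, 1.0, -1.0], n_per)

per_dataset = {}
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=names):
    held_out = str(names[test_idx][0])
    # Step 1: train on all other datasets, test on the held-out one.
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    acc = (clf.predict(X[test_idx]) == y[test_idx]).mean()
    per_dataset[held_out] = (len(test_idx), acc)

# Step 2: per-dataset metrics plus a sample-weighted average.
total = sum(n for n, _ in per_dataset.values())
weighted_acc = sum(n * acc for n, acc in per_dataset.values()) / total
for name, (n, acc) in per_dataset.items():
    print(f"{name}: {n} samples, acc={acc:.1%}")
print(f"Weighted average: acc={weighted_acc:.1%}")
```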

In [8]:
# LODO evaluation with constant threshold
clf = LinearClassifier(LinearConfig(model='logistic', normalize='standard', C=0.1))

results_constant = lodo_evaluate(
    clf,
    data_raw.X,
    data_raw.y,
    data_raw.datasets,
    threshold_strategy=ConstantThreshold(0.5),
    # Merge similar datasets for fair estimates
    merge_datasets={'gandalf_summarization': 'mosscap'},
)

Threshold strategy: constant(0.5), CV: False
Evaluating on 10k_prompts_ranked (9924 samples)... acc=92.4%, F1=0.0%, thr=0.500
Evaluating on advbench (520 samples)... acc=90.8%, F1=95.2%, thr=0.500
Evaluating on bipia_email_code_table (15000 samples)... acc=63.1%, F1=76.7%, thr=0.500
Evaluating on deepset (546 samples)... acc=77.7%, F1=63.5%, thr=0.500
Evaluating on dolly_15k (10000 samples)... acc=99.6%, F1=0.0%, thr=0.500
Evaluating on enron (10000 samples)... acc=82.6%, F1=0.0%, thr=0.500
Evaluating on harmbench (400 samples)... acc=42.8%, F1=59.9%, thr=0.500
Evaluating on injecagent_dh_ds_base (1054 samples)... acc=98.9%, F1=99.4%, thr=0.500
Evaluating on jayavibhav (10000 samples)... acc=69.1%, F1=72.3%, thr=0.500
Evaluating on llmail (9998 samples)... acc=71.4%, F1=83.3%, thr=0.500
Evaluating on mosscap (10114 samples)... acc=79.4%, F1=88.5%, thr=0.500
Evaluating on openorca (9997 samples)... acc=98.0%, F1=0.0%, thr=0.500
Evaluating on qualifire (5000 samples)... acc=77.8%, F1=74.5%, thr=0.500
Evaluating on safeguard (8236 samples)... acc=96.7%, F1=94.5%, thr=0.500
Evaluating on softAge (1001 samples)... acc=95.1%, F1=0.0%, thr=0.500
Evaluating on wildjailbreak (2210 samples)... acc=78.6%, F1=87.0%, thr=0.500
Evaluating on yanismiraoui (1034 samples)... acc=55.8%, F1=71.6%, thr=0.500
--------------------------------------------------
Weighted average: acc=81.8%, F1=49.8%
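
The weighted average in this log is consistent with weighting each held-out dataset by its sample count. The check below recomputes both numbers from the per-dataset lines printed above; it does not call `lodo_evaluate`, and the weighting rule is inferred from the log rather than taken from the library's source.

```python
# (samples, accuracy %, F1 %) per held-out dataset, copied from the log above.
lodo_log = {
    "10k_prompts_ranked": (9924, 92.4, 0.0),
    "advbench": (520, 90.8, 95.2),
    "bipia_email_code_table": (15000, 63.1, 76.7),
    "deepset": (546, 77.7, 63.5),
    "dolly_15k": (10000, 99.6, 0.0),
    "enron": (10000, 82.6, 0.0),
    "harmbench": (400, 42.8, 59.9),
    "injecagent_dh_ds_base": (1054, 98.9, 99.4),
    "jayavibhav": (10000, 69.1, 72.3),
    "llmail": (9998, 71.4, 83.3),
    "mosscap": (10114, 79.4, 88.5),
    "openorca": (9997, 98.0, 0.0),
    "qualifire": (5000, 77.8, 74.5),
    "safeguard": (8236, 96.7, 94.5),
    "softAge": (1001, 95.1, 0.0),
    "wildjailbreak": (2210, 78.6, 87.0),
    "yanismiraoui": (1034, 55.8, 71.6),
}

total = sum(n for n, _, _ in lodo_log.values())
weighted_acc = sum(n * acc for n, acc, _ in lodo_log.values()) / total
weighted_f1 = sum(n * f1 for n, _, f1 in lodo_log.values()) / total
print(f"Weighted average: acc={weighted_acc:.1f}%, F1={weighted_f1:.1f}%")
```

Datasets with 0.0% positive samples (for example `dolly_15k` and `enron`) cannot produce a true positive, so their malicious-class F1 is reported as 0.0% regardless of accuracy. Those zeros enter the weighted F1, which is one reason it sits far below the weighted accuracy.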




## 4. Threshold Strategies

Different threshold selection strategies affect precision/recall tradeoffs:

- **ConstantThreshold(0.5)**: Fixed threshold at 0.5
- **TargetPrecisionThreshold(0.95)**: Select the threshold that achieves 95% precision on the benign class
- **MaxF1Threshold()**: Select threshold that maximizes F1 score

Use **CVConfig** to generate calibrated scores via cross-validation before threshold selection.
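
To make the last two strategies concrete, here is a minimal sketch built on scikit-learn's `precision_recall_curve`. It is not the library's implementation: the function names are hypothetical, and precision is computed for the class labelled 1. Targeting precision on the benign class, as described above, would presumably mean swapping the roles of the two classes (for instance passing `1 - y` and `1 - scores`), which is an assumption about the library's behavior.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def max_f1_threshold(y_true, scores):
    """Return the score threshold that maximizes F1 for class 1."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # The final precision/recall pair has no matching threshold; drop it.
    p, r = precision[:-1], recall[:-1]
    f1 = 2 * p * r / np.maximum(p + r, 1e-12)
    return float(thresholds[np.argmax(f1)])


def target_precision_threshold(y_true, scores, target):
    """Return the lowest threshold whose precision for class 1 meets target."""
    precision, _, thresholds = precision_recall_curve(y_true, scores)
    meets_target = precision[:-1] >= target
    if not meets_target.any():
        return float(thresholds[-1])  # fall back to the strictest threshold
    # Thresholds are sorted ascending, so the first hit keeps recall highest.
    return float(thresholds[np.argmax(meets_target)])


# Synthetic scores: class 1 tends to score higher than class 0.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
logits = 2.0 * (2 * y_true - 1) + rng.normal(size=1000)
scores = 1.0 / (1.0 + np.exp(-logits))

thr_f1 = max_f1_threshold(y_true, scores)
thr_p = target_precision_threshold(y_true, scores, target=0.95)
print(f"max-F1 threshold: {thr_f1:.3f}")
print(f"precision>=0.95 threshold: {thr_p:.3f}")
```

In both cases the threshold should be chosen on scores the classifier has not been trained on, which is the role that cross-validated scores play in the text above.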

In [9]:
# Target precision strategy with cross-validation
results_precision = lodo_evaluate(
    clf,
    data_raw.X,
    data_raw.y,
    data_raw.datasets,
    threshold_strategy=TargetPrecisionThreshold(0.95),
    cv=CVConfig(enabled=True, folds=3, n_jobs=8),
    merge_datasets={'gandalf_summarization': 'mosscap'},
)

Threshold strategy: precision(0.95), CV: True
Evaluating on 10k_prompts_ranked (9924 samples)... acc=85.6%, F1=0.0%, thr=0.075
Evaluating on advbench (520 samples)... acc=93.7%, F1=96.7%, thr=0.081
Evaluating on bipia_email_code_table (15000 samples)... acc=76.6%, F1=86.6%, thr=0.077
Evaluating on deepset (546 samples)... acc=78.0%, F1=67.4%, thr=0.072
Evaluating on dolly_15k (10000 samples)... acc=99.2%, F1=0.0%, thr=0.112
Evaluating on enron (10000 samples)... acc=71.9%, F1=0.0%, thr=0.102
Evaluating on harmbench (400 samples)... acc=51.2%, F1=67.8%, thr=0.064
Evaluating on injecagent_dh_ds_base (1054 samples)... acc=100.0%, F1=100.0%, thr=0.079
Evaluating on jayavibhav (10000 samples)... acc=62.1%, F1=70.3%, thr=0.032
Evaluating on llmail (9998 samples)... acc=86.9%, F1=93.0%, thr=0.075
Evaluating on mosscap (10114 samples)... acc=87.3%, F1=93.2%, thr=0.088
Evaluating on openorca (9997 samples)... acc=95.0%, F1=0.0%, thr=0.129
Evaluating on qualifire (5000 samples)... acc=75.1%, F1=73.9%, thr=0.028
Evaluating on safeguard (8236 samples)... acc=96.7%, F1=94.7%, thr=0.082
Evaluating on softAge (1001 samples)... acc=89.2%, F1=0.0%, thr=0.080
Evaluating on wildjailbreak (2210 samples)... acc=86.0%, F1=92.1%, thr=0.054
Evaluating on yanismiraoui (1034 samples)... acc=74.4%, F1=85.3%, thr=0.069
--------------------------------------------------
Weighted average: acc=83.6%, F1=52.7%




## 5. DANN Classifier (Domain Adversarial)

The `DANNClassifier` learns domain-invariant representations by adversarially training against dataset prediction. This can improve cross-dataset generalization.

**DANNConfig options:**
- `hidden_layers`: MLP architecture (default: [512, 256, 64])
- `domain_weight`: Weight of adversarial loss (0 = no adversarial training)
- `lr`: Learning rate
- `max_epochs`: Maximum training epochs
- `early_stopping_patience`: Epochs to wait before early stopping
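
For orientation, the standard domain-adversarial objective from the domain-adaptation literature can be written as below. This is the textbook formulation, not necessarily what `DANNClassifier` implements; in particular, reading `domain_weight` as the coefficient $\lambda$ is an assumption, and the symbols $F$, $C_y$, $C_d$ are introduced here for illustration.

```latex
\min_{\theta_f,\,\theta_y}\;\max_{\theta_d}\;
\mathcal{L}_y\bigl(C_y(F(x;\theta_f);\theta_y),\,y\bigr)
\;-\;\lambda\,\mathcal{L}_d\bigl(C_d(F(x;\theta_f);\theta_d),\,d\bigr)
```

Here $F$ is the shared feature extractor (the MLP layers), $C_y$ is the head predicting the malicious/benign label $y$, and $C_d$ is the head predicting the source dataset $d$. The domain head minimizes $\mathcal{L}_d$ while the feature extractor is pushed to increase it, typically through a gradient reversal layer that scales the domain gradient by $-\lambda$. With $\lambda = 0$ the objective reduces to an ordinary supervised MLP, which matches the note above that a weight of 0 means no adversarial training.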

In [10]:
# DANN classifier (neural network with optional domain adversarial training)
dann = DANNClassifier(DANNConfig())

results_dann = lodo_evaluate(
    dann,
    data_raw.X,
    data_raw.y,
    data_raw.datasets,
    threshold_strategy=ConstantThreshold(0.5),
    merge_datasets={'gandalf_summarization': 'mosscap'},
)

Threshold strategy: constant(0.5), CV: False
Evaluating on 10k_prompts_ranked (9924 samples)...
Training DANN:  63%|██████▎   | 63/100 [01:28<00:51,  1.40s/it]
Early stopping at epoch 63
acc=92.4%, F1=0.0%, thr=0.500
Evaluating on advbench (520 samples)...
Training DANN:  42%|████▏     | 42/100 [01:06<01:31,  1.57s/it]
Early stopping at epoch 42
acc=97.1%, F1=98.5%, thr=0.500
Evaluating on bipia_email_code_table (15000 samples)...
Training DANN:  63%|██████▎   | 63/100 [01:24<00:49,  1.34s/it]
Early stopping at epoch 63
acc=60.1%, F1=74.6%, thr=0.500
Evaluating on deepset (546 samples)...
Training DANN:  45%|████▌     | 45/100 [01:10<01:25,  1.56s/it]
Early stopping at epoch 45
acc=78.4%, F1=62.4%, thr=0.500
Evaluating on dolly_15k (10000 samples)...
Training DANN:  40%|████      | 40/100 [00:55<01:23,  1.39s/it]
Early stopping at epoch 40
acc=99.8%, F1=0.0%, thr=0.500
Evaluating on enron (10000 samples)...
Training DANN:  60%|██████    | 60/100 [01:23<00:55,  1.40s/it]
Early stopping at epoch 60
acc=81.1%, F1=0.0%, thr=0.500
Evaluating on harmbench (400 samples)...
Training DANN:  39%|███▉      | 39/100 [01:01<01:36,  1.57s/it]
Early stopping at epoch 39
acc=44.8%, F1=61.8%, thr=0.500
Evaluating on injecagent_dh_ds_base (1054 samples)...
Training DANN:  38%|███▊      | 38/100 [00:59<01:36,  1.56s/it]
Early stopping at epoch 38
acc=100.0%, F1=100.0%, thr=0.500
Evaluating on jayavibhav (10000 samples)...
Training DANN:  41%|████      | 41/100 [00:55<01:20,  1.36s/it]
Early stopping at epoch 41
acc=75.6%, F1=77.4%, thr=0.500
Evaluating on llmail (9998 samples)...
Training DANN:  70%|███████   | 70/100 [01:37<00:41,  1.39s/it]
Early stopping at epoch 70
acc=45.8%, F1=62.8%, thr=0.500
Evaluating on mosscap (10114 samples)...
Training DANN:  43%|████▎     | 43/100 [01:01<01:21,  1.43s/it]
Early stopping at epoch 43
acc=84.4%, F1=91.5%, thr=0.500
Evaluating on openorca (9997 samples)...
Training DANN:  77%|███████▋  | 77/100 [01:48<00:32,  1.41s/it]
Early stopping at epoch 77
acc=98.9%, F1=0.0%, thr=0.500
Evaluating on qualifire (5000 samples)...
Training DANN:  48%|████▊     | 48/100 [01:10<01:16,  1.47s/it]
Early stopping at epoch 48
acc=77.8%, F1=74.9%, thr=0.500
Evaluating on safeguard (8236 samples)...
Training DANN:  29%|██▉       | 29/100 [00:41<01:42,  1.45s/it]
Early stopping at epoch 29
acc=97.4%, F1=95.6%, thr=0.500
Evaluating on softAge (1001 samples)...
Training DANN:  48%|████▊     | 48/100 [01:14<01:21,  1.56s/it]
Early stopping at epoch 48
acc=96.4%, F1=0.0%, thr=0.500
Evaluating on wildjailbreak (2210 samples)...
Training DANN:  58%|█████▊    | 58/100 [01:29<01:04,  1.55s/it]
Early stopping at epoch 58
acc=79.6%, F1=87.7%, thr=0.500
Evaluating on yanismiraoui (1034 samples)...
Training DANN:  51%|█████     | 51/100 [01:20<01:16,  1.57s/it]
Early stopping at epoch 51
acc=45.6%, F1=62.6%, thr=0.500
--------------------------------------------------
Weighted average: acc=80.1%, F1=48.4%




## 6. SAE Features Classifier

SAE (Sparse Autoencoder) features provide interpretable dimensions. Use L2 normalization for sparse features.
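
One plausible reason for this recommendation (the notebook does not state it) is that row-wise L2 normalization keeps zeros as zeros, whereas z-scoring subtracts a per-feature mean and would turn most zeros into non-zeros. The sketch below illustrates this with scikit-learn preprocessors on a synthetic sparse matrix; how `LinearClassifier` handles sparse input internally is not shown here.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import Normalizer, StandardScaler

# Synthetic sparse features: 50 samples x 200 features, about 2% non-zero.
X = sparse.random(50, 200, density=0.02, format="csr", random_state=0)

# L2 normalization rescales each row and leaves the sparsity pattern intact.
X_l2 = Normalizer(norm="l2").fit_transform(X)
print(f"sparse after L2: {sparse.issparse(X_l2)}, nnz {X.nnz} -> {X_l2.nnz}")

# Every non-empty row now has unit Euclidean norm.
row_norms = np.sqrt(np.asarray(X_l2.multiply(X_l2).sum(axis=1)).ravel())
nonempty = np.diff(X.indptr) > 0

# Mean-centering would destroy sparsity, so StandardScaler refuses
# sparse input unless centering is disabled with with_mean=False.
try:
    StandardScaler().fit(X)
    centering_allowed = True
except ValueError:
    centering_allowed = False
print(f"centering allowed on sparse input: {centering_allowed}")
```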

In [11]:
# SAE classifier with L2 normalization
clf_sae = LinearClassifier(LinearConfig(model='logistic', normalize='l2', C=1.0))

results_sae = lodo_evaluate(
    clf_sae,
    data_sae.X,
    data_sae.y,
    data_sae.datasets,
    threshold_strategy=ConstantThreshold(0.5),
    merge_datasets={'gandalf_summarization': 'mosscap'},
)

Threshold strategy: constant(0.5), CV: False
Evaluating on 10k_prompts_ranked (9924 samples)... acc=89.1%, F1=0.0%, thr=0.500
Evaluating on advbench (520 samples)... acc=92.9%, F1=96.3%, thr=0.500
Evaluating on bipia_email_code_table (15000 samples)... acc=26.1%, F1=36.8%, thr=0.500
Evaluating on deepset (546 samples)... acc=80.6%, F1=66.9%, thr=0.500
Evaluating on dolly_15k (10000 samples)... acc=99.8%, F1=0.0%, thr=0.500
Evaluating on enron (10000 samples)... acc=85.7%, F1=0.0%, thr=0.500
Evaluating on harmbench (400 samples)... acc=36.2%, F1=53.2%, thr=0.500
Evaluating on injecagent_dh_ds_base (1054 samples)... acc=100.0%, F1=100.0%, thr=0.500
Evaluating on jayavibhav (10000 samples)... acc=76.6%, F1=73.1%, thr=0.500
Evaluating on llmail (9998 samples)... acc=58.4%, F1=73.7%, thr=0.500
Evaluating on mosscap (10114 samples)... acc=65.2%, F1=78.9%, thr=0.500
Evaluating on openorca (9997 samples)... acc=98.3%, F1=0.0%, thr=0.500
Evaluating on qualifire (5000 samples)... acc=76.5%, F1=74.5%, thr=0.500
Evaluating on safeguard (8236 samples)... acc=95.7%, F1=92.7%, thr=0.500
Evaluating on softAge (1001 samples)... acc=95.6%, F1=0.0%, thr=0.500
Evaluating on wildjailbreak (2210 samples)... acc=80.7%, F1=88.5%, thr=0.500
Evaluating on yanismiraoui (1034 samples)... acc=41.9%, F1=59.0%, thr=0.500
--------------------------------------------------
Weighted average: acc=74.5%, F1=42.1%




## 7. Results Comparison

Compare all methods across datasets. Each row is a dataset that was held out during training, and each column is a classifier configuration.

In [15]:
# Collect all results
all_results = {
    'Linear (t=0.5)': results_constant,
    'Linear (p=0.95)': results_precision,
    'DANN': results_dann,
    'SAE Linear': results_sae,
}

# Build per-dataset comparison tables
acc_data = {}
f1_data = {}

for method_name, results in all_results.items():
    for ds_name, metrics in results.per_dataset.items():
        if ds_name not in acc_data:
            acc_data[ds_name] = {}
            f1_data[ds_name] = {}
        acc_data[ds_name][method_name] = metrics['acc'] * 100
        f1_data[ds_name][method_name] = metrics['malicious_f1']

# Create DataFrames
df_acc = pd.DataFrame(acc_data).T.sort_index()
df_f1 = pd.DataFrame(f1_data).T.sort_index()

# Add weighted average row
avg_row_acc = {name: res.weighted_average['acc'] * 100 for name, res in all_results.items()}
avg_row_f1 = {name: res.weighted_average['malicious_f1'] for name, res in all_results.items()}
df_acc.loc['WEIGHTED AVG'] = avg_row_acc
df_f1.loc['WEIGHTED AVG'] = avg_row_f1

# Display Accuracy table with colors
print("Accuracy by Dataset (%)")
print("=" * 80)
display(df_acc.style
    .format('{:.1f}')
    .background_gradient(cmap='RdYlGn', vmin=0, vmax=100)
    .set_properties(**{'font-size': '15px'})
)

Accuracy by Dataset (%)


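| Dataset | Linear (t=0.5) | Linear (p=0.95) | DANN | SAE Linear |
|---|---|---|---|---|
| 10k_prompts_ranked | 92.4 | 85.6 | 92.4 | 89.1 |
| advbench | 90.8 | 93.7 | 97.1 | 92.9 |
| bipia_email_code_table | 63.1 | 76.6 | 60.1 | 26.1 |
| deepset | 77.7 | 78.0 | 78.4 | 80.6 |
| dolly_15k | 99.6 | 99.2 | 99.8 | 99.8 |
| enron | 82.6 | 71.9 | 81.1 | 85.7 |
| harmbench | 42.8 | 51.2 | 44.8 | 36.2 |
| injecagent_dh_ds_base | 98.9 | 100.0 | 100.0 | 100.0 |
| jayavibhav | 69.1 | 62.1 | 75.6 | 76.6 |
| llmail | 71.4 | 86.9 | 45.8 | 58.4 |
| mosscap | 79.4 | 87.3 | 84.4 | 65.2 |
| openorca | 98.0 | 95.0 | 98.9 | 98.3 |
| qualifire | 77.8 | 75.1 | 77.8 | 76.5 |
| safeguard | 96.7 | 96.7 | 97.4 | 95.7 |
| softAge | 95.1 | 89.2 | 96.4 | 95.6 |
| wildjailbreak | 78.6 | 86.0 | 79.6 | 80.7 |
| yanismiraoui | 55.8 | 74.4 | 45.6 | 41.9 |
| WEIGHTED AVG | 81.8 | 83.6 | 80.1 | 74.5 |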


In [16]:
# Display F1 table with colors
print("F1 Score by Dataset (%)")
print("=" * 80)
display(df_f1.style
    .format('{:.1f}')
    .background_gradient(cmap='RdYlGn', vmin=0, vmax=100)
    .set_properties(**{'font-size': '15px'})
)

F1 Score by Dataset (%)


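| Dataset | Linear (t=0.5) | Linear (p=0.95) | DANN | SAE Linear |
|---|---|---|---|---|
| 10k_prompts_ranked | 0.0 | 0.0 | 0.0 | 0.0 |
| advbench | 95.2 | 96.7 | 98.5 | 96.3 |
| bipia_email_code_table | 76.7 | 86.6 | 74.6 | 36.8 |
| deepset | 63.5 | 67.4 | 62.4 | 66.9 |
| dolly_15k | 0.0 | 0.0 | 0.0 | 0.0 |
| enron | 0.0 | 0.0 | 0.0 | 0.0 |
| harmbench | 59.9 | 67.8 | 61.8 | 53.2 |
| injecagent_dh_ds_base | 99.4 | 100.0 | 100.0 | 100.0 |
| jayavibhav | 72.3 | 70.3 | 77.4 | 73.1 |
| llmail | 83.3 | 93.0 | 62.8 | 73.7 |
| mosscap | 88.5 | 93.2 | 91.5 | 78.9 |
| openorca | 0.0 | 0.0 | 0.0 | 0.0 |
| qualifire | 74.5 | 73.9 | 74.9 | 74.5 |
| safeguard | 94.5 | 94.7 | 95.6 | 92.7 |
| softAge | 0.0 | 0.0 | 0.0 | 0.0 |
| wildjailbreak | 87.0 | 92.1 | 87.7 | 88.5 |
| yanismiraoui | 71.6 | 85.3 | 62.6 | 59.0 |
| WEIGHTED AVG | 49.8 | 52.7 | 48.4 | 42.1 |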


## Summary

This notebook demonstrated:

1. **Loading activations** from the ingestion pipeline (raw and SAE)
2. **Training classifiers** (LinearClassifier, DANNClassifier)
3. **LODO evaluation** for cross-dataset generalization
4. **Threshold strategies** for precision/recall tradeoffs

### Next Steps

- See `02_on_the_fly_classification.ipynb` for real-time classification with SAE interpretation
- See `03_evaluator_analysis.ipynb` for comparing baseline evaluators (Llama Guard, Prompt Guard)