diff --git a/docs/benchmarking/Jailbreak_roc_curves.png b/docs/benchmarking/Jailbreak_roc_curves.png
new file mode 100644
index 0000000..98f15f3
Binary files /dev/null and b/docs/benchmarking/Jailbreak_roc_curves.png differ
diff --git a/docs/benchmarking/jailbreak_roc_curve.png b/docs/benchmarking/jailbreak_roc_curve.png
deleted file mode 100644
index 82bafd9..0000000
Binary files a/docs/benchmarking/jailbreak_roc_curve.png and /dev/null differ
diff --git a/docs/ref/checks/jailbreak.md b/docs/ref/checks/jailbreak.md
index 90f3804..2e70299 100644
--- a/docs/ref/checks/jailbreak.md
+++ b/docs/ref/checks/jailbreak.md
@@ -89,37 +89,40 @@ When conversation history is available, the guardrail automatically:
 
 ### Dataset Description
 
-This benchmark evaluates model performance on a diverse set of prompts:
+This benchmark combines multiple public datasets and synthetic benign conversations:
 
-- **Subset of the open source jailbreak dataset [JailbreakV-28k](https://huggingface.co/datasets/JailbreakV-28K/JailBreakV-28k)** (n=2,000)
-- **Synthetic prompts** covering a diverse range of benign topics (n=1,000)
-- **Open source [Toxicity](https://github.com/surge-ai/toxicity/blob/main/toxicity_en.csv) dataset** containing harmful content that does not involve jailbreak attempts (n=1,000)
+- **Red Queen jailbreak corpus ([GitHub](https://github.com/kriti-hippo/red_queen/blob/main/Data/Red_Queen_Attack.zip))**: 14,000 positive samples collected with gpt-4o attacks.
+- **Tom Gibbs multi-turn jailbreak attacks ([Hugging Face](https://huggingface.co/datasets/tom-gibbs/multi-turn_jailbreak_attack_datasets/tree/main))**: 4,136 positive samples.
+- **Scale MHJ dataset ([Hugging Face](https://huggingface.co/datasets/ScaleAI/mhj))**: 537 positive samples.
+- **Synthetic benign conversations**: 12,433 negative samples generated by seeding prompts from [WildGuardMix](https://huggingface.co/datasets/allenai/wildguardmix) where `adversarial=false` and `prompt_harm_label=false`, then expanding each single-turn input into five-turn dialogues using gpt-4.1.
 
-**Total n = 4,000; positive class prevalence = 2,000 (50.0%)**
+**Total n = 31,106; positives = 18,673; negatives = 12,433**
+
+For benchmarking, we randomly sampled 4,000 conversations from this pool using a 50/50 split between positive and negative samples.
 
 ### Results
 
 #### ROC Curve
 
-![ROC Curve](../../benchmarking/jailbreak_roc_curve.png)
+![ROC Curve](../../benchmarking/Jailbreak_roc_curves.png)
 
 #### Metrics Table
 
 | Model | ROC AUC | Prec@R=0.80 | Prec@R=0.90 | Prec@R=0.95 | Recall@FPR=0.01 |
 |--------------|---------|-------------|-------------|-------------|-----------------|
-| gpt-5 | 0.979 | 0.973 | 0.970 | 0.970 | 0.733 |
-| gpt-5-mini | 0.954 | 0.990 | 0.900 | 0.900 | 0.768 |
-| gpt-4.1 | 0.990 | 1.000 | 1.000 | 0.984 | 0.946 |
-| gpt-4.1-mini (default) | 0.982 | 0.992 | 0.992 | 0.954 | 0.444 |
+| gpt-5 | 0.994 | 0.993 | 0.993 | 0.993 | 0.997 |
+| gpt-5-mini | 0.813 | 0.832 | 0.832 | 0.832 | 0.000 |
+| gpt-4.1 | 0.999 | 0.999 | 0.999 | 0.999 | 1.000 |
+| gpt-4.1-mini (default) | 0.928 | 0.968 | 0.968 | 0.500 | 0.000 |
 
 #### Latency Performance
 
 | Model | TTC P50 (ms) | TTC P95 (ms) |
 |--------------|--------------|--------------|
-| gpt-5 | 4,569 | 7,256 |
-| gpt-5-mini | 5,019 | 9,212 |
-| gpt-4.1 | 841 | 1,861 |
-| gpt-4.1-mini | 749 | 1,291 |
+| gpt-5 | 7,370 | 12,218 |
+| gpt-5-mini | 7,055 | 11,579 |
+| gpt-4.1 | 2,998 | 4,204 |
+| gpt-4.1-mini | 1,538 | 2,089 |
 
 **Notes:**
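The 50/50 subsampling the patch describes (4,000 conversations drawn from a pool of 18,673 positives and 12,433 negatives) could be sketched as follows. This is a minimal illustration, not the repository's actual benchmarking code; the `is_jailbreak` field and `sample_balanced` helper are hypothetical names:

```python
import random

def sample_balanced(records, n_total=4000, seed=0):
    """Draw an n_total-sized subset with a 50/50 positive/negative split."""
    rng = random.Random(seed)  # fixed seed for a reproducible benchmark subset
    positives = [r for r in records if r["is_jailbreak"]]
    negatives = [r for r in records if not r["is_jailbreak"]]
    half = n_total // 2
    picked = rng.sample(positives, half) + rng.sample(negatives, half)
    rng.shuffle(picked)  # interleave classes so evaluation order is unbiased
    return picked

# Toy pool mirroring the documented totals: 18,673 positives, 12,433 negatives.
pool = [{"id": i, "is_jailbreak": i < 18673} for i in range(31106)]
subset = sample_balanced(pool)
```

`random.Random.sample` draws without replacement, so each conversation appears at most once in the benchmark subset.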