FEAT: Add StrongREJECT seed dataset loader #1800

Merged
romanlutz merged 3 commits into microsoft:main from romanlutz:romanlutz/plan-strongreject-benchmark
May 30, 2026

@romanlutz
Contributor

Description

Adds the StrongREJECT (Souly et al., NeurIPS 2024) 313-prompt refusal-robustness dataset as a new remote seed loader. StrongREJECT is a widely-cited jailbreak-success benchmark, and adding it closes a gap surfaced in a broader AI red-team toolkit audit comparing PyRIT against competing frameworks.

The loader follows the _HarmBenchDataset template: a single concrete _RemoteDatasetLoader subclass that reads from a pinned-commit raw GitHub URL and emits SeedObjective rows. The per-row category is preserved verbatim in harm_categories, and the upstream source column (AdvBench / DAN / HarmfulQ / MaliciousInstruct / MasterKey / "Jailbreaking via Prompt Engineering" / OpenAI System Card / custom) is preserved in metadata["strong_reject_source"] so users can filter by provenance later.
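For reviewers who want the row mapping spelled out, here is a minimal standalone sketch of the idea. It is not the code in this PR and does not use PyRIT classes: `SeedRecord` and `rows_to_seeds` are hypothetical names, and the prompt column name `forbidden_prompt` is an assumption about the upstream CSV. It only illustrates carrying `category` into `harm_categories` and `source` into `metadata["strong_reject_source"]`, plus the missing-key and empty-dataset checks.

```python
import csv
import io
from dataclasses import dataclass, field


# Hypothetical stand-in for a seed row; this is NOT PyRIT's SeedObjective.
@dataclass
class SeedRecord:
    value: str
    harm_categories: list[str]
    metadata: dict[str, str] = field(default_factory=dict)


# Assumed column names; "forbidden_prompt" in particular is an assumption.
REQUIRED_COLUMNS = ("category", "source", "forbidden_prompt")


def rows_to_seeds(csv_text: str) -> list[SeedRecord]:
    """Map CSV rows to seed records, keeping category and source verbatim."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seeds: list[SeedRecord] = []
    for row in reader:
        # A column absent from the header is missing from the dict; a short
        # row is padded with None by DictReader. Treat both as missing keys.
        missing = [c for c in REQUIRED_COLUMNS if row.get(c) is None]
        if missing:
            raise ValueError(f"Row is missing required keys: {missing}")
        seeds.append(
            SeedRecord(
                value=row["forbidden_prompt"],
                harm_categories=[row["category"]],
                metadata={"strong_reject_source": row["source"]},
            )
        )
    if not seeds:
        raise ValueError("Dataset is empty")
    return seeds
```

With records shaped like this, filtering by provenance is a one-line comprehension, for example `[s for s in seeds if s.metadata["strong_reject_source"] == "DAN"]`.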

A few non-obvious design choices worth a reviewer's attention:

  • Tags are {"safety", "jailbreak"}, explicitly NOT "default". StrongREJECT is a jailbreak-success benchmark, not a harm-category coverage dataset, so opting it into the default set would change every scenario's default-dataset surface area without those scenarios opting in.
  • The companion 60-prompt strongreject_small_dataset.csv is intentionally not shipped as a sibling loader. It is a strict prompt-subset of the full set, but its metadata is hand-edited (three rows have their source rewritten to "custom" even though the same prompts are attributed to AdvBench/DAN in the full CSV). Shipping it would surface conflicting provenance for identical prompts. Users who want a smaller balanced sample can post-filter the full loader at runtime.
  • No scenario PR. Users compose Jailbreak --dataset-names strong_reject themselves; the StrongREJECT rubric scorer is owned by a parallel planning session.
  • groups=["UC Berkeley"]. The lead authors are at UC Berkeley's Center for Human-Compatible AI (not the Center for AI Safety, which authored HarmBench).
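The "post-filter the full loader at runtime" suggestion above could look roughly like the sketch below. It works on plain `(category, prompt)` tuples rather than PyRIT objects, and `balanced_sample` is a hypothetical helper, not something this PR ships. It keeps the first `per_category` rows of each category in their original order, so the result is deterministic.

```python
from collections import defaultdict


def balanced_sample(
    rows: list[tuple[str, str]], per_category: int
) -> list[tuple[str, str]]:
    """Keep at most `per_category` rows per category, preserving input order."""
    kept: list[tuple[str, str]] = []
    seen: dict[str, int] = defaultdict(int)
    for category, prompt in rows:
        if seen[category] < per_category:
            seen[category] += 1
            kept.append((category, prompt))
    return kept
```

For instance, taking 10 rows from each of 6 categories would yield a 60-row sample while keeping the provenance recorded in the full CSV.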

Tests and Documentation

  • New tests/unit/datasets/test_strong_reject_dataset.py with 6 unit tests covering happy path, per-row metadata preservation, missing-key validation, empty-dataset validation, and class-level metadata. All pass.
  • Full tests/unit/datasets/ suite (500 tests) still green.
  • Live sanity check against the pinned CSV verified the loader produces 313 seeds across 6 categories (50/50/50/50/54/59) and 8 distinct source values (custom=221, DAN=35, AdvBench=25, MaliciousInstruct=12, HarmfulQ=11, MasterKey=3, OpenAI System Card=3, "Jailbreaking via Prompt Engineering"=3).
  • SeedDatasetProvider.get_all_dataset_names_async() discovers strong_reject end-to-end.
  • New BibTeX entry @souly2024strongreject added to doc/references.bib and doc/bibliography.md.
  • doc/code/datasets/1_loading_datasets.py prose updated to include StrongREJECT in the alphabetical paper list. Paired .ipynb regenerated with jupytext --to ipynb --update (markdown-only change, no execution needed).
  • ruff format, ruff check, and ty check all pass on changed files (verified via the pre-commit run during commit).
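As a quick arithmetic cross-check of the live sanity-check numbers above, both breakdowns should sum to the 313 seeds reported. The snippet below only restates the counts from this description; it does not download or parse the CSV.

```python
# Per-category counts as reported in the live sanity check (6 categories).
category_counts = [50, 50, 50, 50, 54, 59]

# Per-source counts as reported in the live sanity check (8 source values).
source_counts = {
    "custom": 221,
    "DAN": 35,
    "AdvBench": 25,
    "MaliciousInstruct": 12,
    "HarmfulQ": 11,
    "MasterKey": 3,
    "OpenAI System Card": 3,
    "Jailbreaking via Prompt Engineering": 3,
}

category_total = sum(category_counts)
source_total = sum(source_counts.values())

# Both breakdowns must account for every seed exactly once.
assert category_total == 313
assert source_total == 313
assert len(category_counts) == 6
assert len(source_counts) == 8
```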

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread pyrit/datasets/seed_datasets/remote/strong_reject_dataset.py
romanlutz and others added 2 commits May 30, 2026 06:44
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@romanlutz romanlutz enabled auto-merge May 30, 2026 20:45
@romanlutz romanlutz added this pull request to the merge queue May 30, 2026
Merged via the queue into microsoft:main with commit eabb501 May 30, 2026
48 checks passed
@romanlutz romanlutz deleted the romanlutz/plan-strongreject-benchmark branch May 30, 2026 21:12