FEAT: Add StrongREJECT seed dataset loader #1800
Merged
romanlutz merged 3 commits on May 30, 2026
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
hannahwestra25 approved these changes on May 27, 2026
Description
Adds the StrongREJECT (Souly et al., NeurIPS 2024) 313-prompt refusal-robustness dataset as a new remote seed loader. StrongREJECT is a widely-cited jailbreak-success benchmark, and adding it closes a gap surfaced in a broader AI red-team toolkit audit comparing PyRIT against competing frameworks.
The loader follows the `_HarmBenchDataset` template: a single concrete `_RemoteDatasetLoader` subclass, a pinned-commit raw GitHub URL, and `SeedObjective` rows as output. The per-row `category` is preserved verbatim in `harm_categories`, and the upstream `source` column (AdvBench / DAN / HarmfulQ / MaliciousInstruct / MasterKey / "Jailbreaking via Prompt Engineering" / OpenAI System Card / custom) is preserved in `metadata["strong_reject_source"]` so users can filter by provenance later.

A few non-obvious design choices worth a reviewer's attention:
- `{"safety", "jailbreak"}`, explicitly NOT `"default"`. StrongREJECT is a jailbreak-success benchmark, not a harm-category coverage dataset, so opting it into the default set would change every scenario's default-dataset surface area without those scenarios opting in.
- `strongreject_small_dataset.csv` is intentionally not shipped as a sibling loader. It is a strict prompt-subset of the full set, but its metadata is hand-edited (three rows have their `source` rewritten to `"custom"` even though the same prompts are attributed to AdvBench/DAN in the full CSV). Shipping it would surface conflicting provenance for identical prompts. Users who want a smaller balanced sample can post-filter the full loader at runtime.
- `Jailbreak --dataset-names strong_reject` themselves; the StrongREJECT rubric scorer is owned by a parallel planning session.
- `groups=["UC Berkeley"]`. The lead authors are at UC Berkeley's Center for Human-Compatible AI (not the Center for AI Safety, which authors HarmBench).

Tests and Documentation
- `tests/unit/datasets/test_strong_reject_dataset.py` with 6 unit tests covering happy path, per-row metadata preservation, missing-key validation, empty-dataset validation, and class-level metadata. All pass.
- `tests/unit/datasets/` suite (500 tests) still green.
- `SeedDatasetProvider.get_all_dataset_names_async()` discovers `strong_reject` end-to-end.
- `@souly2024strongreject` added to `doc/references.bib` and `doc/bibliography.md`.
- `doc/code/datasets/1_loading_datasets.py` prose updated to include StrongREJECT in the alphabetical paper list. Paired `.ipynb` regenerated with `jupytext --to ipynb --update` (markdown-only change, no execution needed).
- `ruff format`, `ruff check`, and `ty check` all pass on changed files (verified via the pre-commit run during commit).
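
For reviewers who want a concrete picture of the row mapping and the provenance filter described above, here is a minimal standalone sketch. It does not use PyRIT: `ExampleSeed`, `rows_to_seeds`, `filter_by_source`, and the sample rows are hypothetical names and placeholder data invented for illustration, and the `forbidden_prompt` column name is an assumption about the upstream CSV header. Only the `category` to `harm_categories` mapping, the `metadata["strong_reject_source"]` key, and the missing-key and empty-dataset checks come from this description.

```python
import csv
import io
from dataclasses import dataclass, field

# Assumed column names; "forbidden_prompt" in particular is a guess at the
# upstream header, not something this PR description states.
REQUIRED_KEYS = {"category", "source", "forbidden_prompt"}


@dataclass
class ExampleSeed:
    """Hypothetical stand-in for a seed objective row (not PyRIT's class)."""

    value: str
    harm_categories: list[str]
    metadata: dict[str, str] = field(default_factory=dict)


def rows_to_seeds(csv_text: str) -> list[ExampleSeed]:
    """Map CSV rows to seeds, rejecting empty input and rows with missing keys."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("dataset is empty")
    seeds = []
    for row in rows:
        missing = REQUIRED_KEYS - row.keys()
        if missing:
            raise ValueError(f"row is missing keys: {sorted(missing)}")
        seeds.append(
            ExampleSeed(
                value=row["forbidden_prompt"],
                # Category is kept verbatim, as a one-element list.
                harm_categories=[row["category"]],
                # Upstream provenance is kept so it can be filtered on later.
                metadata={"strong_reject_source": row["source"]},
            )
        )
    return seeds


def filter_by_source(seeds: list[ExampleSeed], source: str) -> list[ExampleSeed]:
    """Keep only the seeds attributed to the given upstream source."""
    return [s for s in seeds if s.metadata.get("strong_reject_source") == source]


# Placeholder rows; these are not real StrongREJECT prompts.
SAMPLE_CSV = (
    "category,source,forbidden_prompt\n"
    "Example category A,AdvBench,placeholder prompt 1\n"
    "Example category B,DAN,placeholder prompt 2\n"
    "Example category A,custom,placeholder prompt 3\n"
)

seeds = rows_to_seeds(SAMPLE_CSV)
dan_only = filter_by_source(seeds, "DAN")
print([s.value for s in dan_only])
```

The same post-filtering approach is one way a user could build a smaller sample from the full set at runtime instead of relying on a separate small-dataset loader.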