FEAT: Add DecodingTrust Toxicity dataset loader by v0ropaev · Pull Request #1821 · microsoft/PyRIT

v0ropaev · 2026-05-27T21:50:13Z

Description

Adds _DecodingTrustToxicityDataset — a remote dataset loader for the Toxicity perspective of the DecodingTrust benchmark, fetching prompts at runtime from a pinned commit of AI-secure/DecodingTrust.

Closes #1798 (subtask of #291). Thanks @romanlutz for the scoping feedback on #1798.

Design decisions

Subset selector: subset is a Literal["toxic", "nontoxic", "all"] defaulting to "toxic", per maintainer feedback on #1798 (nontoxic prompts are less interesting from a red-teaming perspective). All three options remain selectable.
harm_categories are derived per-prompt from the 8 Perspective-API scores shipped in each record (toxicity, severe_toxicity, profanity, sexually_explicit, identity_attack, threat, insult, flirtation); any score >= toxicity_threshold (default 0.5) contributes its key. This avoids guessing on records the source leaves unlabelled and lets the threshold be tuned per use case.
challenging_only filter restricts to the adversarial subset emphasised by the DecodingTrust authors.
Pinned commit SHA — URLs reference 161ae8321ced62f45fcd9ceb412e05b47c603cd4 (the current main HEAD, 2024-09-16) so the prompt set cannot drift silently.
License & attribution: DecodingTrust is CC BY-SA 4.0. PyRIT only fetches the data at runtime (no redistribution); the class docstring records the license, and every SeedPrompt carries the full author and institution lists for both DecodingTrust and RealToxicityPrompts (from which DecodingTrust subsamples). This matches the approach agreed on #1798.
Class-level metadata (modalities, size, tags) follows the _DangerousQADataset pattern; size="large" since the default subset="toxic" is ~1196 prompts.
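
For readers skimming the thread, the harm_categories rule above can be sketched roughly as follows. This is an illustrative stand-alone function, not the code in the PR: the helper name derive_harm_categories and the assumption that the scores sit directly on the record's prompt object are mine; the eight keys and the default threshold of 0.5 come from the description.

```python
from typing import Any, Mapping

# The eight Perspective-API score keys listed in the PR description.
PERSPECTIVE_SCORE_KEYS = (
    "toxicity",
    "severe_toxicity",
    "profanity",
    "sexually_explicit",
    "identity_attack",
    "threat",
    "insult",
    "flirtation",
)


def derive_harm_categories(
    prompt_obj: Mapping[str, Any], toxicity_threshold: float = 0.5
) -> list[str]:
    """Return every score key whose value is >= toxicity_threshold.

    Hypothetical helper: missing or non-numeric scores (e.g. None) are
    skipped rather than guessed at.
    """
    return [
        key
        for key in PERSPECTIVE_SCORE_KEYS
        if isinstance(prompt_obj.get(key), (int, float))
        and prompt_obj[key] >= toxicity_threshold
    ]


example = {"text": "...", "toxicity": 0.9, "insult": 0.5, "threat": 0.1, "profanity": None}
print(derive_harm_categories(example))
print(derive_harm_categories(example, toxicity_threshold=0.8))
```

Raising the threshold narrows the categories attached to each prompt, which is the per-use-case tuning the description refers to.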

Files

New: pyrit/datasets/seed_datasets/remote/decoding_trust_toxicity_dataset.py
New: tests/unit/datasets/test_decoding_trust_toxicity_dataset.py
Modified: pyrit/datasets/seed_datasets/remote/__init__.py (auto-discovery import + __all__ entry)

Tests and Documentation

uv run pre-commit run --files <three changed files> — clean (ruff format/check, ty type check, copyright header CPY001).
uv run pytest tests/unit/datasets/test_decoding_trust_toxicity_dataset.py -v — 12 / 12 passed. Tests cover: default subset locks to "toxic", each subset selector, harm-category mapping under different thresholds, challenging_only filter, skipping records with empty prompt.text, hard error on non-dict records, per-SeedPrompt metadata (dataset name / source / authors / groups), pinned commit SHA, class-level metadata.
uv run pytest tests/unit/datasets/ — 506 / 506 passed, no regressions.
No notebook / JupyText changes — dataset loaders are auto-discovered by SeedDatasetProvider, matching every other entry in pyrit/datasets/seed_datasets/remote/.

v0ropaev · 2026-05-27T21:55:05Z

@microsoft-github-policy-service agree

adrian-gavrila · 2026-05-29T17:13:34Z

+    def __init__(
+        self,
+        *,
+        subset: Literal["toxic", "nontoxic", "all"] = "toxic",


datasets.instructions.md requires filter axes to be module-level Enum rather than Literal[...]. See VLGuardSubset or PromptIntelSeverity for the pattern. Could you define a DecodingTrustToxicitySubset(Enum), accept it here, validate it with self._validate_enum(...), and re-export from remote/__init__.py?
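
A minimal sketch of the Enum shape requested above, for illustration only. DecodingTrustToxicitySubset is the name proposed in the comment; the member names and the validate_subset function are assumptions, and validate_subset is a hypothetical stand-in for self._validate_enum(...), whose real signature is not shown in this thread.

```python
from enum import Enum


class DecodingTrustToxicitySubset(Enum):
    """Module-level filter axis replacing Literal["toxic", "nontoxic", "all"]."""

    TOXIC = "toxic"
    NONTOXIC = "nontoxic"
    ALL = "all"


def validate_subset(value: object) -> DecodingTrustToxicitySubset:
    """Hypothetical stand-in for the loader's enum validation.

    Rejects raw strings so callers must pass the Enum member explicitly.
    """
    if not isinstance(value, DecodingTrustToxicitySubset):
        valid = ", ".join(member.name for member in DecodingTrustToxicitySubset)
        raise ValueError(f"subset must be a DecodingTrustToxicitySubset ({valid}), got {value!r}")
    return value


subset = validate_subset(DecodingTrustToxicitySubset.TOXIC)
print(subset.value)
```

Whether raw strings should be rejected or coerced is a design choice for the maintainers; the sketch takes the stricter option.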

adrian-gavrila · 2026-05-29T17:13:34Z

+
+        seed_prompts = self._records_to_seed_prompts(records=records)
+        logger.info(f"Loaded {len(seed_prompts)} prompts from DecodingTrust Toxicity")
+        return SeedDataset(seeds=seed_prompts, dataset_name=self.dataset_name)


datasets.instructions.md requires loaders to raise when filters leave zero seeds, with the standard message ValueError("SeedDataset cannot be empty. Check your filter criteria."). Today challenging_only=True against a subset that has no challenging records returns an empty SeedDataset silently, which is hard to debug downstream. Could you add the check after _records_to_seed_prompts and a paired test covering the case?
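
The requested guard could look something like the sketch below. The error message is the one quoted in the comment; the function name require_non_empty is hypothetical, and in the loader the check would presumably sit inline after _records_to_seed_prompts rather than in a separate helper.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def require_non_empty(seeds: Sequence[T]) -> Sequence[T]:
    """Raise instead of silently returning an empty dataset.

    Hypothetical helper illustrating the check; uses the standard
    message quoted in the review comment.
    """
    if len(seeds) == 0:
        raise ValueError("SeedDataset cannot be empty. Check your filter criteria.")
    return seeds


# A filter combination that matches nothing now fails loudly.
try:
    require_non_empty([])
except ValueError as exc:
    print(exc)
```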

adrian-gavrila · 2026-05-29T17:13:34Z

+                    source=source_url,
+                    authors=list(self._AUTHORS),
+                    groups=list(self._GROUPS),
+                )


The challenging flag and the 8 Perspective scores get read here but not stored on the SeedPrompt. datasets.instructions.md asks for per-row source fields to land in metadata={...} so they're persisted to memory, queryable via _get_seed_metadata_conditions, and flow into MessagePiece.prompt_metadata when the seed reaches a target. For DT specifically those annotations are what distinguishes this dataset from raw RealToxicityPrompts, so it's worth carrying them through.

The schema is dict[str, Union[str, int]], so floats need stringifying (see _ToxicChatDataset for the precedent). One shape that fits:

    metadata={
        "challenging": bool(item.get("challenging", False)),
        **{
            key: str(prompt_obj[key])
            for key in _PERSPECTIVE_SCORE_KEYS
            if isinstance(prompt_obj.get(key), (int, float))
        },
    },

A paired test asserting one score and the flag round-trip would round it out.
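
To make the suggested shape concrete, here is a stand-alone sketch with the kind of round-trip check mentioned above. It assumes each record holds a top-level challenging flag and a prompt object carrying the scores; build_metadata is a hypothetical name, and the real test would assert against the SeedPrompt produced by the loader rather than a bare dict.

```python
from typing import Any, Mapping, Union

_PERSPECTIVE_SCORE_KEYS = (
    "toxicity",
    "severe_toxicity",
    "profanity",
    "sexually_explicit",
    "identity_attack",
    "threat",
    "insult",
    "flirtation",
)


def build_metadata(item: Mapping[str, Any]) -> dict[str, Union[str, int]]:
    """Collect per-row source fields into a metadata dict.

    Hypothetical helper: floats are stringified because the metadata
    schema only accepts str or int values (bool is a subclass of int).
    """
    prompt_obj = item.get("prompt") or {}
    return {
        "challenging": bool(item.get("challenging", False)),
        **{
            key: str(prompt_obj[key])
            for key in _PERSPECTIVE_SCORE_KEYS
            if isinstance(prompt_obj.get(key), (int, float))
        },
    }


record = {"challenging": True, "prompt": {"text": "...", "toxicity": 0.75, "threat": None}}
metadata = build_metadata(record)
print(metadata)
```

Scores the source leaves as null are dropped rather than stored as the string "None", so downstream queries only see real values.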

adrian-gavrila · 2026-05-29T17:13:34Z

+    ``>= toxicity_threshold`` adds the corresponding category. This avoids
+    guessing where the source provides no label.
+
+    References:


Two doc gaps from datasets.instructions.md:

Docstring cite-key. @wang2023decodingtrust already exists in doc/references.bib and doc/bibliography.md, so this block should use the project form Reference: [@wang2023decodingtrust] instead of raw arxiv URLs.

1_loading_datasets. New loaders need to be added to the prose paragraph at the top of doc/code/datasets/1_loading_datasets.py (alphabetically, between DarkBench and Do Anything Now) and mirrored in 1_loading_datasets.ipynb. Inline edits to both files are fine for this trivial change.

FEAT: Add DecodingTrust Toxicity dataset loader

04bf041

This was referenced May 28, 2026

FEAT add DecodingTrust Machine Ethics (Jiminy CSV) dataset loader (subtask of #291) #1828

Open

FEAT: Add DecodingTrust Machine Ethics dataset loader #1829

Open

v0ropaev wants to merge 1 commit into microsoft:main from v0ropaev:feat/decoding-trust-toxicity-dataset
