FEAT: Add DecodingTrust Toxicity dataset loader #1821
@microsoft-github-policy-service agree
def __init__(
    self,
    *,
    subset: Literal["toxic", "nontoxic", "all"] = "toxic",
datasets.instructions.md requires filter axes to be module-level Enum rather than Literal[...]. See VLGuardSubset or PromptIntelSeverity for the pattern. Could you define a DecodingTrustToxicitySubset(Enum), accept it here, validate it with self._validate_enum(...), and re-export from remote/__init__.py?
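A minimal sketch of the requested shape, for illustration only. The member names mirror the three Literal values in this PR. The standalone validate_subset helper is a hypothetical stand-in for self._validate_enum(...), whose real signature is not shown in this thread.

```python
from enum import Enum


class DecodingTrustToxicitySubset(Enum):
    """Filter axis for the loader; values mirror the current Literal options."""

    TOXIC = "toxic"
    NONTOXIC = "nontoxic"
    ALL = "all"


def validate_subset(value: object) -> DecodingTrustToxicitySubset:
    """Hypothetical stand-in for the base-class enum validation."""
    if not isinstance(value, DecodingTrustToxicitySubset):
        raise ValueError(
            f"subset must be a DecodingTrustToxicitySubset, got {type(value).__name__}"
        )
    return value
```

With this shape, a caller passing the bare string "toxic" fails at construction time instead of silently selecting a subset.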
seed_prompts = self._records_to_seed_prompts(records=records)
logger.info(f"Loaded {len(seed_prompts)} prompts from DecodingTrust Toxicity")
return SeedDataset(seeds=seed_prompts, dataset_name=self.dataset_name)
datasets.instructions.md requires loaders to raise when filters leave zero seeds, with the standard message ValueError("SeedDataset cannot be empty. Check your filter criteria."). Today challenging_only=True against a subset that has no challenging records returns an empty SeedDataset silently, which is hard to debug downstream. Could you add the check after _records_to_seed_prompts and a paired test covering the case?
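The check could look roughly like the sketch below. FakeSeedDataset and build_dataset are hypothetical stand-ins used only to keep the example self-contained; in the loader the guard would sit between _records_to_seed_prompts and the SeedDataset(...) return.

```python
from dataclasses import dataclass


@dataclass
class FakeSeedDataset:
    """Hypothetical stand-in for SeedDataset, used only to keep the sketch runnable."""

    seeds: list
    dataset_name: str


def build_dataset(seed_prompts: list, dataset_name: str) -> FakeSeedDataset:
    # Fail loudly when filters (e.g. challenging_only=True) leave nothing to return.
    if not seed_prompts:
        raise ValueError("SeedDataset cannot be empty. Check your filter criteria.")
    return FakeSeedDataset(seeds=seed_prompts, dataset_name=dataset_name)
```

The paired test would then call the loader with a filter combination that matches no records and assert that this ValueError is raised.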
    source=source_url,
    authors=list(self._AUTHORS),
    groups=list(self._GROUPS),
)
The challenging flag and the 8 Perspective scores get read here but not stored on the SeedPrompt. datasets.instructions.md asks for per-row source fields to land in metadata={...} so they're persisted to memory, queryable via _get_seed_metadata_conditions, and flow into MessagePiece.prompt_metadata when the seed reaches a target. For DT specifically those annotations are what distinguishes this dataset from raw RealToxicityPrompts, so it's worth carrying them through.
The schema is dict[str, Union[str, int]], so floats need stringifying (see _ToxicChatDataset for the precedent). One shape that fits:
metadata={
    "challenging": bool(item.get("challenging", False)),
    **{
        key: str(prompt_obj[key])
        for key in _PERSPECTIVE_SCORE_KEYS
        if isinstance(prompt_obj.get(key), (int, float))
    },
}
A paired test asserting one score and the flag round-trip would round it out.
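One way the paired test might look, as a self-contained sketch. build_metadata is a hypothetical helper that wraps the suggested dict, the score keys are the eight listed in the PR description, and the record layout (a top-level challenging flag plus a prompt sub-dict) is assumed from the snippet above. A real test would assert on the SeedPrompt returned by the loader rather than on a helper.

```python
# Score keys as listed in the PR description; the constant name follows the
# suggestion above.
_PERSPECTIVE_SCORE_KEYS = (
    "toxicity",
    "severe_toxicity",
    "profanity",
    "sexually_explicit",
    "identity_attack",
    "threat",
    "insult",
    "flirtation",
)


def build_metadata(item: dict) -> dict:
    """Hypothetical helper: collect per-row annotations for SeedPrompt.metadata."""
    prompt_obj = item.get("prompt", {})
    return {
        "challenging": bool(item.get("challenging", False)),
        # Floats are stringified to fit a dict[str, Union[str, int]] schema.
        **{
            key: str(prompt_obj[key])
            for key in _PERSPECTIVE_SCORE_KEYS
            if isinstance(prompt_obj.get(key), (int, float))
        },
    }


def test_metadata_round_trip() -> None:
    record = {
        "challenging": True,
        "prompt": {"text": "example", "toxicity": 0.75, "threat": None},
    }
    metadata = build_metadata(record)
    assert metadata["challenging"] is True
    assert metadata["toxicity"] == "0.75"
    # Missing or non-numeric scores are skipped rather than stored as "None".
    assert "threat" not in metadata
```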
``>= toxicity_threshold`` adds the corresponding category. This avoids
guessing where the source provides no label.

References:
Two doc gaps from datasets.instructions.md:
Docstring cite-key. @wang2023decodingtrust already exists in doc/references.bib and doc/bibliography.md, so this block should use the project form Reference: [@wang2023decodingtrust] instead of raw arxiv URLs.
1_loading_datasets. New loaders need to be added to the prose paragraph at the top of doc/code/datasets/1_loading_datasets.py (alphabetically, between DarkBench and Do Anything Now) and mirrored in 1_loading_datasets.ipynb. Inline edits to both files are fine for this trivial change.
Description
Adds _DecodingTrustToxicityDataset, a remote dataset loader for the Toxicity perspective of the DecodingTrust benchmark, fetching prompts at runtime from a pinned commit of AI-secure/DecodingTrust.
Closes #1798 (subtask of #291). Thanks @romanlutz for the scoping feedback on #1798.
Design decisions
subset: Literal["toxic", "nontoxic", "all"], default "toxic" per maintainer feedback on #1798 (nontoxic prompts are less interesting from a red-teaming perspective). All three options remain selectable.
harm_categories are derived per-prompt from the 8 Perspective-API scores shipped in each record (toxicity, severe_toxicity, profanity, sexually_explicit, identity_attack, threat, insult, flirtation); any score >= toxicity_threshold (default 0.5) contributes its key. This avoids guessing on records the source leaves unlabelled and lets the threshold be tuned per use case.
challenging_only filter restricts to the adversarial subset emphasised by the DecodingTrust authors.
Pinned to commit 161ae8321ced62f45fcd9ceb412e05b47c603cd4 (the current main HEAD, 2024-09-16) so the prompt set cannot drift silently.
Each SeedPrompt carries the full author and institution lists for both DecodingTrust and RealToxicityPrompts (which DT subsamples from). Matches the approach agreed on #1798.
Class-level metadata (modalities, size, tags) follows the _DangerousQADataset pattern; size="large" since the default subset="toxic" is ~1196 prompts.
Files
pyrit/datasets/seed_datasets/remote/decoding_trust_toxicity_dataset.py
tests/unit/datasets/test_decoding_trust_toxicity_dataset.py
pyrit/datasets/seed_datasets/remote/__init__.py (auto-discovery import + __all__ entry)
Tests and Documentation
uv run pre-commit run --files <three changed files>: clean (ruff format/check, ty type check, copyright header CPY001).
uv run pytest tests/unit/datasets/test_decoding_trust_toxicity_dataset.py -v: 12 / 12 passed. Tests cover: default subset locks to "toxic", each subset selector, harm-category mapping under different thresholds, challenging_only filter, skipping records with empty prompt.text, hard error on non-dict records, per-SeedPrompt metadata (dataset name / source / authors / groups), pinned commit SHA, class-level metadata.
uv run pytest tests/unit/datasets/: 506 / 506 passed, no regressions.
SeedDatasetProvider, matching every other entry in pyrit/datasets/seed_datasets/remote/.
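For reviewers, the harm_categories rule from the design decisions (any Perspective score >= toxicity_threshold contributes its key) can be sketched as below. This is an illustration under assumptions, not the code in this PR: derive_harm_categories and the constant name are hypothetical, and the scores are taken as a flat dict keyed by the eight score names.

```python
PERSPECTIVE_SCORE_KEYS = (
    "toxicity",
    "severe_toxicity",
    "profanity",
    "sexually_explicit",
    "identity_attack",
    "threat",
    "insult",
    "flirtation",
)


def derive_harm_categories(scores: dict, toxicity_threshold: float = 0.5) -> list[str]:
    """Return every score key whose value meets or exceeds the threshold."""
    return [
        key
        for key in PERSPECTIVE_SCORE_KEYS
        # Skip missing or non-numeric scores instead of guessing a label.
        if isinstance(scores.get(key), (int, float)) and scores[key] >= toxicity_threshold
    ]


scores = {"toxicity": 0.9, "insult": 0.5, "threat": 0.1, "profanity": None}
print(derive_harm_categories(scores))  # ['toxicity', 'insult']
print(derive_harm_categories(scores, toxicity_threshold=0.95))  # []
```

Because the comparison is inclusive, a score exactly at the threshold contributes its key, and a record with no score at or above the threshold ends up with no harm categories.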