FEAT: Add DecodingTrust Toxicity dataset loader #1821

Open
v0ropaev wants to merge 1 commit into microsoft:main from v0ropaev:feat/decoding-trust-toxicity-dataset

@v0ropaev

Description

Adds _DecodingTrustToxicityDataset — a remote dataset loader for the Toxicity perspective of the DecodingTrust benchmark, fetching prompts at runtime from a pinned commit of AI-secure/DecodingTrust.

Closes #1798 (subtask of #291). Thanks @romanlutz for the scoping feedback on #1798.

Design decisions

  • Subset selector: subset: Literal["toxic", "nontoxic", "all"], default "toxic" per maintainer feedback on #1798 (nontoxic prompts are less interesting from a red-teaming perspective). All three options remain selectable.
  • harm_categories are derived per-prompt from the 8 Perspective-API scores shipped in each record (toxicity, severe_toxicity, profanity, sexually_explicit, identity_attack, threat, insult, flirtation); any score >= toxicity_threshold (default 0.5) contributes its key. This avoids guessing on records the source leaves unlabelled and lets the threshold be tuned per use case.
  • challenging_only filter restricts to the adversarial subset emphasised by the DecodingTrust authors.
  • Pinned commit SHA — URLs reference 161ae8321ced62f45fcd9ceb412e05b47c603cd4 (the current main HEAD, 2024-09-16) so the prompt set cannot drift silently.
  • License & attribution: DecodingTrust is CC BY-SA 4.0. PyRIT only fetches the data at runtime (no redistribution); the class docstring records the licence, and every SeedPrompt carries the full author and institution lists for both DecodingTrust and RealToxicityPrompts (which DT subsamples from). Matches the approach agreed on #1798.
  • Class-level metadata (modalities, size, tags) follows the _DangerousQADataset pattern; size="large" since the default subset="toxic" is ~1196 prompts.
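
For readers skimming the thread, the threshold rule for harm_categories can be sketched as below. This is only an illustration of the behaviour described above, not the code in the PR: the function name derive_harm_categories, the constant PERSPECTIVE_SCORE_KEYS, and the flat score mapping are assumptions made for the example.

```python
from typing import Mapping

# The eight Perspective API score keys listed in the PR description.
PERSPECTIVE_SCORE_KEYS = (
    "toxicity",
    "severe_toxicity",
    "profanity",
    "sexually_explicit",
    "identity_attack",
    "threat",
    "insult",
    "flirtation",
)


def derive_harm_categories(
    scores: Mapping[str, object],
    toxicity_threshold: float = 0.5,
) -> list[str]:
    """Return every score key whose value is >= toxicity_threshold.

    Hypothetical helper: the name and signature are illustrative only.
    """
    categories: list[str] = []
    for key in PERSPECTIVE_SCORE_KEYS:
        value = scores.get(key)
        # Skip missing or non-numeric scores instead of guessing a label.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        if value >= toxicity_threshold:
            categories.append(key)
    return categories


# Made-up scores for illustration; not taken from the dataset.
example_scores = {"toxicity": 0.91, "insult": 0.5, "threat": 0.12, "flirtation": None}
print(derive_harm_categories(example_scores))
```

Raising toxicity_threshold narrows the categories attached to each prompt, which is the per-use-case tuning the description refers to.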

Files

  • New: pyrit/datasets/seed_datasets/remote/decoding_trust_toxicity_dataset.py
  • New: tests/unit/datasets/test_decoding_trust_toxicity_dataset.py
  • Modified: pyrit/datasets/seed_datasets/remote/__init__.py (auto-discovery import + __all__ entry)

Tests and Documentation

  • uv run pre-commit run --files <three changed files> — clean (ruff format/check, ty type check, copyright header CPY001).
  • uv run pytest tests/unit/datasets/test_decoding_trust_toxicity_dataset.py -v: 12 / 12 passed. Tests cover: default subset locks to "toxic", each subset selector, harm-category mapping under different thresholds, challenging_only filter, skipping records with empty prompt.text, hard error on non-dict records, per-SeedPrompt metadata (dataset name / source / authors / groups), pinned commit SHA, class-level metadata.
  • uv run pytest tests/unit/datasets/: 506 / 506 passed, no regressions.
  • No notebook / JupyText changes — dataset loaders are auto-discovered by SeedDatasetProvider, matching every other entry in pyrit/datasets/seed_datasets/remote/.

@v0ropaev
Author

@microsoft-github-policy-service agree

def __init__(
    self,
    *,
    subset: Literal["toxic", "nontoxic", "all"] = "toxic",
Contributor

datasets.instructions.md requires filter axes to be module-level Enum rather than Literal[...]. See VLGuardSubset or PromptIntelSeverity for the pattern. Could you define a DecodingTrustToxicitySubset(Enum), accept it here, validate it with self._validate_enum(...), and re-export from remote/__init__.py?
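
A minimal sketch of what the requested enum could look like. The class name comes from the comment above and the member values from the existing Literal; the validate_subset helper is a hypothetical stand-in for self._validate_enum(...), whose actual signature is not shown in this thread.

```python
from enum import Enum


class DecodingTrustToxicitySubset(Enum):
    """Module-level filter axis replacing Literal["toxic", "nontoxic", "all"]."""

    TOXIC = "toxic"
    NONTOXIC = "nontoxic"
    ALL = "all"


def validate_subset(value: object) -> DecodingTrustToxicitySubset:
    """Hypothetical stand-in for the loader's enum validation.

    Rejects anything that is not a DecodingTrustToxicitySubset member,
    including the bare strings the Literal-based signature accepted.
    """
    if not isinstance(value, DecodingTrustToxicitySubset):
        allowed = ", ".join(member.name for member in DecodingTrustToxicitySubset)
        raise ValueError(
            f"subset must be a DecodingTrustToxicitySubset member ({allowed}), got {value!r}"
        )
    return value


# The default would then be a member rather than a string.
default_subset = validate_subset(DecodingTrustToxicitySubset.TOXIC)
```

Callers would pass DecodingTrustToxicitySubset.TOXIC instead of "toxic"; whether bare strings should still be coerced is a design choice this sketch leaves out.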


seed_prompts = self._records_to_seed_prompts(records=records)
logger.info(f"Loaded {len(seed_prompts)} prompts from DecodingTrust Toxicity")
return SeedDataset(seeds=seed_prompts, dataset_name=self.dataset_name)
Contributor

datasets.instructions.md requires loaders to raise when filters leave zero seeds, with the standard message ValueError("SeedDataset cannot be empty. Check your filter criteria."). Today challenging_only=True against a subset that has no challenging records returns an empty SeedDataset silently, which is hard to debug downstream. Could you add the check after _records_to_seed_prompts and a paired test covering the case?
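
The check being asked for could be sketched roughly as follows. The error message is the one quoted above; the function name filter_challenging and the record shape (a dict with a top-level "challenging" flag) are assumptions for illustration, and in the PR the check would sit after _records_to_seed_prompts instead.

```python
from typing import Any


def filter_challenging(
    records: list[dict[str, Any]],
    *,
    challenging_only: bool,
) -> list[dict[str, Any]]:
    """Apply the challenging_only filter and fail loudly if nothing is left.

    Hypothetical helper: illustrates the empty-result guard only.
    """
    if challenging_only:
        kept = [record for record in records if record.get("challenging")]
    else:
        kept = list(records)
    if not kept:
        # Standard message quoted in the review comment.
        raise ValueError("SeedDataset cannot be empty. Check your filter criteria.")
    return kept


# Made-up records: none is flagged as challenging.
sample_records = [{"challenging": False, "prompt": {"text": "example"}}]

try:
    filter_challenging(sample_records, challenging_only=True)
except ValueError as exc:
    error_message = str(exc)
```

A paired unit test would assert the same ValueError, for example with pytest.raises(ValueError, match="cannot be empty").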

source=source_url,
authors=list(self._AUTHORS),
groups=list(self._GROUPS),
)
Contributor

The challenging flag and the 8 Perspective scores get read here but not stored on the SeedPrompt. datasets.instructions.md asks for per-row source fields to land in metadata={...} so they're persisted to memory, queryable via _get_seed_metadata_conditions, and flow into MessagePiece.prompt_metadata when the seed reaches a target. For DT specifically those annotations are what distinguishes this dataset from raw RealToxicityPrompts, so it's worth carrying them through.

The schema is dict[str, Union[str, int]], so floats need stringifying (see _ToxicChatDataset for the precedent). One shape that fits:

metadata={
    "challenging": bool(item.get("challenging", False)),
    **{
        key: str(prompt_obj[key])
        for key in _PERSPECTIVE_SCORE_KEYS
        if isinstance(prompt_obj.get(key), (int, float))
    },
}

A paired test asserting one score and the flag round-trip would round it out.
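
One possible shape for that test, as a standalone sketch. It assumes each record keeps its Perspective scores inside the "prompt" object next to "text" and the challenging flag at the top level; build_metadata and the sample record are hypothetical, and a real test would go through the loader and SeedPrompt.metadata instead.

```python
from typing import Any, Union

_PERSPECTIVE_SCORE_KEYS = (
    "toxicity",
    "severe_toxicity",
    "profanity",
    "sexually_explicit",
    "identity_attack",
    "threat",
    "insult",
    "flirtation",
)


def build_metadata(item: dict[str, Any]) -> dict[str, Union[str, int]]:
    """Hypothetical helper mirroring the metadata shape suggested above."""
    prompt_obj = item.get("prompt", {})
    return {
        # bool is an int subclass, so it fits the dict[str, Union[str, int]] schema.
        "challenging": bool(item.get("challenging", False)),
        # Floats are stringified because the schema has no float type.
        **{
            key: str(prompt_obj[key])
            for key in _PERSPECTIVE_SCORE_KEYS
            if isinstance(prompt_obj.get(key), (int, float))
        },
    }


def test_metadata_round_trips_flag_and_score() -> None:
    # Made-up record; values are not taken from the dataset.
    record = {
        "challenging": True,
        "prompt": {"text": "example prompt", "toxicity": 0.87, "threat": None},
    }
    metadata = build_metadata(record)
    assert metadata["challenging"] is True
    assert metadata["toxicity"] == "0.87"
    # Missing or null scores are omitted, not stored as "None".
    assert "threat" not in metadata


test_metadata_round_trips_flag_and_score()
```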

``>= toxicity_threshold`` adds the corresponding category. This avoids
guessing where the source provides no label.

References:
Contributor

Two doc gaps from datasets.instructions.md:

Docstring cite-key. @wang2023decodingtrust already exists in doc/references.bib and doc/bibliography.md, so this block should use the project form Reference: [@wang2023decodingtrust] instead of raw arxiv URLs.

1_loading_datasets. New loaders need to be added to the prose paragraph at the top of doc/code/datasets/1_loading_datasets.py (alphabetically, between DarkBench and Do Anything Now) and mirrored in 1_loading_datasets.ipynb. Inline edits to both files are fine for this trivial change.
