Subtask of #291
DecodingTrust [1] publishes adversarial toxicity test prompts at /data/toxicity/user_prompts/
(sampled from RealToxicityPrompts [2]). The Stereotypes perspective was added in PR #385;
the Toxicity perspective is still open.
Data
- toxic.jsonl: 1196 records (high Perspective toxicity)
- nontoxic.jsonl: 1200 records
- Each record has a plain-text prompt (prompt.text) plus 8 Perspective-API scores
  (toxicity, severe_toxicity, profanity, sexually_explicit, identity_attack,
  threat, insult, flirtation) and a challenging: bool flag.
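To make the record shape concrete, here is a minimal parsing sketch. It assumes the scores sit next to text inside the prompt object (as prompt.profanity suggests) and that challenging is a top-level key; both are assumptions about the layout, and every value in the sample line is a placeholder rather than real dataset content. parse_record and SCORE_KEYS are hypothetical names.

```python
from __future__ import annotations

import json

# The 8 Perspective-API score names listed above.
SCORE_KEYS = (
    "toxicity", "severe_toxicity", "profanity", "sexually_explicit",
    "identity_attack", "threat", "insult", "flirtation",
)

# Hypothetical JSONL line: the layout is assumed and the values are placeholders.
SAMPLE_LINE = json.dumps({
    "challenging": True,
    "prompt": {
        "text": "<prompt text>",
        "toxicity": 0.9, "severe_toxicity": 0.4, "profanity": 0.8,
        "sexually_explicit": 0.1, "identity_attack": 0.2, "threat": 0.1,
        "insult": 0.7, "flirtation": 0.3,
    },
})


def parse_record(line: str) -> tuple[str, dict[str, float], bool]:
    """Split one JSONL line into (prompt text, scores, challenging flag)."""
    record = json.loads(line)
    prompt = record["prompt"]
    # Skip scores that are absent or null instead of failing on them.
    scores = {key: prompt[key] for key in SCORE_KEYS if prompt.get(key) is not None}
    return prompt["text"], scores, bool(record.get("challenging", False))


text, scores, challenging = parse_record(SAMPLE_LINE)
print(text, sorted(scores), challenging)
```

Only prompt.text would be handed to PyRIT as the prompt value; the scores and the flag are metadata used for filtering and labelling.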
Proposed loader
- _DecodingTrustToxicityDataset(_RemoteDatasetLoader) following the
  _DangerousQADataset pattern (single URL, _fetch_from_url, plain-string extraction).
- Parameters: subset: Literal["toxic", "nontoxic", "all"] = "all",
  challenging_only: bool = False, toxicity_threshold: float = 0.5.
- Per-prompt harm_categories derived from the Perspective scores (e.g. include
  "profanity" when prompt.profanity >= toxicity_threshold).
- Source URL pinned to a specific commit SHA
  (current main HEAD: 161ae8321ced62f45fcd9ceb412e05b47c603cd4, 2024-09-16).
- Unit tests mock _fetch_from_url, mirroring
  tests/unit/datasets/test_dangerous_qa_dataset.py.
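The selection logic I have in mind could look roughly like the sketch below. It is standalone and deliberately does not touch PyRIT's loader classes. The helper names (source_urls, derive_harm_categories, select_prompts) are hypothetical, the AI-secure/DecodingTrust owner/repo segment in the URL is my assumption, and mapping subset="all" to two files is one possible reading of the parameter, not a settled design.

```python
from __future__ import annotations

from typing import Iterable, Literal

PINNED_SHA = "161ae8321ced62f45fcd9ceb412e05b47c603cd4"
# Assumption: the repository lives at AI-secure/DecodingTrust on GitHub.
BASE_URL = (
    "https://raw.githubusercontent.com/AI-secure/DecodingTrust/"
    f"{PINNED_SHA}/data/toxicity/user_prompts"
)

SCORE_KEYS = (
    "toxicity", "severe_toxicity", "profanity", "sexually_explicit",
    "identity_attack", "threat", "insult", "flirtation",
)


def source_urls(subset: Literal["toxic", "nontoxic", "all"] = "all") -> list[str]:
    """Return the pinned raw URL(s) for the requested subset."""
    names = ["toxic", "nontoxic"] if subset == "all" else [subset]
    return [f"{BASE_URL}/{name}.jsonl" for name in names]


def derive_harm_categories(prompt: dict, toxicity_threshold: float = 0.5) -> list[str]:
    """List every score name whose value is at or above the threshold."""
    # A missing or null score is treated as 0.0, so it never qualifies.
    return [k for k in SCORE_KEYS if (prompt.get(k) or 0.0) >= toxicity_threshold]


def select_prompts(
    records: Iterable[dict],
    *,
    challenging_only: bool = False,
    toxicity_threshold: float = 0.5,
) -> list[tuple[str, list[str]]]:
    """Return (prompt text, harm categories) pairs for already-parsed records."""
    selected = []
    for record in records:
        if challenging_only and not record.get("challenging", False):
            continue
        prompt = record["prompt"]
        selected.append(
            (prompt["text"], derive_harm_categories(prompt, toxicity_threshold))
        )
    return selected
```

In this sketch toxicity_threshold only controls which categories get attached to a prompt; it does not drop records. If the threshold should also filter prompts, that would be a small change to select_prompts.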
One question before I start
DecodingTrust's root LICENSE is CC BY-SA 4.0, while PyRIT is MIT. The existing
Stereotypes assets (pyrit/datasets/jailbreak/templates/dt_stereotypes_*.yaml) ship the
system prompts in-tree with attribution. For Toxicity I'd plan to fetch at runtime from
raw.githubusercontent.com (no vendoring) and add full attribution in the class
docstring. Is that the approach you'd like, or should I handle CC BY-SA sources
differently?
References
[1] Wang et al., 2023. DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. https://arxiv.org/abs/2306.11698
[2] Gehman et al., 2020. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. https://arxiv.org/abs/2009.11462
⚠️ Content warning: the prompts include profanity, sexual content, and identity attacks (standard for red-team toxicity datasets).