
DOC: Add contributor instructions for dataset loaders #1775

Merged: romanlutz merged 2 commits into microsoft:main from romanlutz:romanlutz/add-datasets-instructions on May 21, 2026

@romanlutz
Contributor

Description

Adds .github/instructions/datasets.instructions.md with the conventions Copilot (and human contributors) should follow when adding or modifying seed dataset loaders under pyrit/datasets/seed_datasets/.

Covers the points that recur in dataset PR reviews:

  • SeedObjective vs SeedPrompt — picking the right seed type for behavior/goal rows vs literal messages
  • Subclassing _RemoteDatasetLoader and using its helpers (_fetch_from_huggingface, _fetch_from_url, _validate_enum / _validate_enums)
  • HuggingFace gating: token argument + HUGGINGFACE_TOKEN env fallback, docstring requirements
  • Enum-based filters (no raw strings / Literals), validated via the inherited helpers
  • Per-seed metadata preservation and class-level metadata constants picked up by _parse_metadata
  • Raising ValueError on empty filter results
  • Registration in remote/__init__.py (imports + __all__)
  • Bibliography wiring (doc/references.bib + the hidden-citations block in doc/bibliography.md)
  • Test conventions (mock _fetch_from_huggingface, asyncio-auto, token forwarding)
  • Optional live sanity-check against the real HF dataset before opening the PR
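Three of the conventions above (token argument with an environment fallback, enum-only filters, and raising on an empty result) can be illustrated with a minimal, self-contained sketch. It deliberately does not use PyRIT's classes: `Category`, `resolve_token`, `validate_enums`, `load_objectives`, and the in-memory `_ROWS` are hypothetical stand-ins, and the real `_RemoteDatasetLoader` helpers may have different signatures. Only the `HUGGINGFACE_TOKEN` variable name is taken from the description.

```python
import os
from enum import Enum
from typing import Iterable, Optional


class Category(Enum):
    """Hypothetical filter enum; real loaders define their own."""

    PRIVACY = "privacy"
    MISINFORMATION = "misinformation"
    OTHER = "other"


# Stand-in rows. A real loader would fetch these from the remote dataset.
_ROWS = [
    {"goal": "Elicit a fabricated news story", "category": "misinformation", "row_id": 1},
    {"goal": "Elicit a private home address", "category": "privacy", "row_id": 2},
]


def resolve_token(token: Optional[str] = None) -> Optional[str]:
    """Prefer the explicit argument, else fall back to HUGGINGFACE_TOKEN."""
    if token is not None:
        return token
    return os.environ.get("HUGGINGFACE_TOKEN")


def validate_enums(values: Iterable[object], enum_cls: type) -> list:
    """Reject raw strings and other non-enum values up front."""
    checked = list(values)
    for value in checked:
        if not isinstance(value, enum_cls):
            raise ValueError(f"Expected {enum_cls.__name__} member, got {value!r}")
    return checked


def load_objectives(*, categories: Optional[list] = None, rows: Optional[list] = None) -> list:
    """Filter rows by category, keeping per-row metadata on each result."""
    rows = _ROWS if rows is None else rows
    wanted = None
    if categories is not None:
        wanted = {c.value for c in validate_enums(categories, Category)}
    seeds = [
        {
            "value": row["goal"],
            "metadata": {"category": row["category"], "row_id": row["row_id"]},
        }
        for row in rows
        if wanted is None or row["category"] in wanted
    ]
    if not seeds:
        # An empty result usually means a wrong filter, so fail loudly.
        raise ValueError("Filters matched no rows; refusing to return an empty dataset.")
    return seeds
```

In this sketch, `load_objectives(categories=[Category.PRIVACY])` returns the single matching row with its metadata attached, while passing the raw string `"privacy"` or a category with no rows raises `ValueError` instead of silently returning an empty list.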

Modeled on the structure, depth, and frontmatter style of converters.instructions.md, scenarios.instructions.md, and output.instructions.md. Does not duplicate rules already in style-guide.instructions.md (async suffix, kw-only, type hints, enums-over-Literals).

Tests and Documentation

Markdown-only contributor-instructions change — no code, tests, or notebooks affected.

romanlutz and others added 2 commits May 21, 2026 12:07
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@romanlutz romanlutz added this pull request to the merge queue May 21, 2026
Merged via the queue into microsoft:main with commit f41fbda May 21, 2026
48 checks passed
@romanlutz romanlutz deleted the romanlutz/add-datasets-instructions branch May 21, 2026 21:16
