Skip to content

Proposal: OWASP LLM02 output-side scorer pack (XSS / SQLi / Shell / Path) + companion seed dataset #1737

@ppcvote

Description

@ppcvote

Proposal

Add an "OWASP LLM02 (Insecure Output Handling)" scorer pack to PyRIT — 4 new TrueFalseScorer subclasses for output-side threat detection, plus an optional companion seed dataset.

This builds directly on the architecture introduced in #1704 (the new RegexScorer base class + CredentialLeakScorer subclass pattern). The proposal extends that pattern across the remaining OWASP LLM02 sub-categories.

Motivation

PyRIT has strong coverage of LLM01 (Prompt Injection — input side): SelfAsk*Scorers, PromptShieldScorer, GandalfScorer, garak scenarios, ATR via #1715, etc.

LLM02 (Insecure Output Handling — output side) is comparatively under-instrumented:

These four categories appear directly in OWASP's LLM02 sub-class list. Each is fast and deterministic to detect via the same RegexScorer pattern.

Proposed scope

Four new RegexScorer subclasses in pyrit/score/true_false/:

Scorer Purpose Default patterns
XSSOutputScorer LLM emits an XSS payload (script-tag, event-handler, javascript: URI, data:text/html, iframe srcdoc, svg+script) 6
SQLInjectionOutputScorer LLM emits a destructive / union / comment-bypass SQL payload 3
ShellCommandOutputScorer LLM emits a pipe-to-shell, destructive, reverse-shell, or env-exfil shell payload 4
PathTraversalOutputScorer LLM emits ../../etc/passwd-style path traversal to known-sensitive targets 1 (dual-condition)

Each follows the exact pattern CredentialLeakScorer uses: subclass RegexScorer, set _DEFAULT_PATTERNS, pass categories=["security", "owasp-llm02-<category>"] to super().__init__(). ~50-60 lines per scorer including docstrings.

Companion seed dataset (optional): OWASPLLM02SeedDataset — 33 hand-curated developer-style adversarial requests (29 adversarial + 4 benign controls) calibrated to elicit each of the above payloads. Useful for red-team eval scenarios that want the inputs as well as the scorers.

Out of scope (explicitly to avoid duplicating ongoing work)

Disclosure

The proposed patterns are derived from a published MIT-licensed regex catalog I (@ppcvote) maintain:

A companion arXiv preprint is in draft (cs.CR target, ~17K words, 8-model × 1,584-generation evaluation, with the regex taxonomy as the core methodology). Happy to share full text if useful for review context.

Surfacing the "author also contributor" dynamic up front: if maintainers would rather write the patterns yourselves using our work only as reference, that's a totally acceptable outcome — please say the word.

Questions before any PR

  1. 4 scorers or 1? Would you prefer 4 separate RegexScorer subclasses (mirroring CredentialLeakScorer), or a single OWASPLLM02Scorer parameterized by category list?
  2. Dependency on FEAT Add RegexScorer and CredentialLeakScorer for regex-based secret detection #1704? Should we wait for the RegexScorer base class to merge before opening dependent PRs, or is a stacked PR acceptable?
  3. Seed dataset placement? GitHub-hosted (per @romanlutz's note in Proposal: Add Agent Threat Rules (ATR) dataset loader and taxonomy scorer #1702) or HuggingFace-hosted?
  4. Scenario or atomic? Do you want a OWASPLLM02RedTeam scenario bundling these + existing scorers, or keep scorers atomic for scenario-author composition?

What we're not asking

Not asking you to merge anything blindly. Not asking to skip review. Happy to iterate on scope, drop any of the four scorers, or pause entirely if this conflicts with anything I've missed.

cc @rlundeen2 @romanlutz (per recent activity on #1703 / #1702 / #513)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions