feat(red-team): zero-friction report dashboard — auto-save + CLI + docs (#351)
Merged
Aryansharma28 merged 1 commit into feat/red-team-dynamic-techniques on Apr 15, 2026
Problem: the existing `save_redteam_report()` helper required users to
import it and call it explicitly after every `scenario.run()`, and
launching the Streamlit dashboard required memorizing the app path and
batch directory. The report module was untracked — no one discovered it.
This change wires auto-save into the pytest plugin and adds a CLI that
auto-discovers the latest batch.
## The three-command user flow

```shell
pip install 'langwatch-scenario[report]'
pytest path/to/redteam_tests.py   # reports save automatically
scenario redteam-report           # opens dashboard
```
No imports in test code, no explicit save call, no path arguments.
## What landed
**M1 — Auto-save via pytest plugin**
- `pytest_plugin.py::_auto_save_redteam_report` hooks
`auto_reporting_run` to detect `isinstance(agent, RedTeamAgent)` in the
agents list and call `save_redteam_report`. Swallows errors with a
warning so reporting failures never break tests.
- Env vars: `SCENARIO_REDTEAM_REPORT=0` disables, `..._REPORT_DIR=path`
overrides batch root.
**M2 — CLI `scenario redteam-report`**
- New `scenario/cli.py` with `_find_batch_dir()` (lexicographic sort of
timestamped dirs → latest first).
- `setup.py` registers `console_scripts: scenario = scenario.cli:main`.
- `pyproject.toml` adds `[project.optional-dependencies] report =
[streamlit, plotly, pandas]`.
- Flags: `--latest N`, `--batch <ts>`, `--dir <path>`, `--port`,
`--no-browser`.
**M3 — TypeScript auto-save parity**
- New `javascript/src/red-team-report.ts` with `isRedTeamAgent()` +
`saveRedTeamReport()`, mirroring Python's JSON shape.
- `runner/run.ts` calls it after `execution.execute()` when a
RedTeamAgent is found. Same env vars.
- TS version skips the LLM-based severity/suggestion analysis at save
time (placeholder fields, `analysis_pending: true`) to keep tests
fast; dashboard's on-demand aggregator computes them.
- Exports `saveRedTeamReport` from the package root for users running
scenarios outside the default runner.
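The deferred-analysis split can be pictured with a toy payload. Every field name below except `analysis_pending` is hypothetical, invented for illustration rather than taken from the library's real schema:

```python
import json

# TS saver writes placeholders and flags them; the dashboard's on-demand
# aggregator fills them in later. Python fills them at save time.
ts_saved_report = {
    "scenario": "refund-policy-goat",  # hypothetical name
    "strategy": "goat",
    "turns": 3,
    "severity": None,        # placeholder at save time (TS path)
    "suggestions": [],       # placeholder at save time (TS path)
    "analysis_pending": True,
}

python_saved_report = {
    **ts_saved_report,
    "severity": "high",
    "suggestions": ["..."],
    "analysis_pending": False,
}

print(json.dumps(ts_saved_report, indent=2))
```

Keeping the two payloads shape-identical is what lets one Streamlit app serve both runtimes.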
**M4 — Docs page under Red Teaming**
- `docs/docs/pages/advanced/red-teaming/report.mdx` — quick start, how
it works (with end-to-end data-flow diagram), dashboard viewing,
JSON shape, config env vars, CI / CD snippet, advanced manual save,
headless / SSH running, troubleshooting (7 common failure modes).
- `docs/vocs.config.tsx` — nav entry "Reports Dashboard" under the
existing Red Teaming section.
**Report module (previously untracked)**
- `python/scenario/report/` — moved into tracking. Contains
`_save.save_redteam_report`, `_aggregate.aggregate_fixes` (the
on-demand cross-batch LLM aggregator), and `app.py` (the 976-line
Streamlit dashboard).
## Verified end-to-end
Ran a 3-turn GOAT scenario against a mock always-refuse agent:
- Test passed in 71s.
- Report auto-saved to `./redteam-reports/20260415_153324/<ts>_<slug>_goat.json`.
- Payload contains all analysis fields (severity=high, break_severity=none,
4 suggestions, failure_summary populated).
- CLI `_find_batch_dir` resolves `--latest 1/2/3` correctly.
## Design notes (see docs for details)
- `isinstance` detection over pytest markers: zero-friction, nothing
for users to remember.
- Timestamp-named batch directories: lexicographic sort = chronological
sort; no metadata file needed.
- LLM analysis synchronous on save (Python only): dashboards open
instantly; no waiting for 20 LLM calls when you click.
- Same JSON shape across Python and TS: one Streamlit app serves both.
- Save errors swallowed with warning: reporting is observability, not
correctness.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
6e831d7 to
ddcff6c
Compare
**Automated low-risk assessment** — This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk. It requires a manual review before merging.
Merged 223c6da into feat/red-team-dynamic-techniques — 6 checks passed.
Aryansharma28 added a commit that referenced this pull request on Apr 29, 2026:
…gent (#346)

* feat: add RedTeamAgent.goat() strategy with dynamic technique selection

  Add GOAT (Generative Offensive Agent Tester) as a separate strategy alongside Crescendo. Based on Meta's GOAT paper (ICML 2025, 97% ASR).

  - GoatStrategy with 7-technique catalogue (hypothetical framing, persona modification, refusal suppression, response priming, dual response, topic splitting, authority & social engineering)
  - Soft progress stages (early/mid/late) instead of fixed phases
  - Dedicated GOAT metaprompt template for adaptive attack planning
  - Python: `RedTeamAgent.goat(target=..., model=...)`
  - TypeScript: `scenario.redTeamGoat({ target, model })`
  - Crescendo (`.crescendo()`) is completely untouched

  Closes #2143

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: default marathon_script turns to total_turns and decouple GOAT metaprompt from Crescendo phases

  `marathon_script(turns=...)` was required with no default, causing a TypeError in all test calls that omit it. Now defaults to `self.total_turns`. Also makes `_generate_attack_plan` strategy-aware: it only computes Crescendo phase boundaries when using CrescendoStrategy, removing unnecessary coupling for the GOAT strategy.
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address PR review — GoatStrategy export, JS factory defaults, and test coverage

  Python:
  - Export GoatStrategy from scenario.__init__ (was missing; CrescendoStrategy was exported but not GoatStrategy)
  - Add 25 unit tests for GoatStrategy: stage boundaries, prompt building, factory method defaults
  - Fix technique 6 example message (was placeholder "...")

  JavaScript:
  - Fix redTeamGoat() to always set GOAT_METAPROMPT_TEMPLATE (was conditionally falling back to the Crescendo template when attackPlan was supplied)
  - Add totalTurns: 30 default to redTeamGoat() to match Python
  - Add metapromptTemplate to CrescendoConfig so users can override via both factory APIs
  - Fix renderMetapromptTemplate: phase boundary vars only injected for Crescendo (via optional phaseEnds param)
  - generateAttackPlan passes phaseEnds only when strategy instanceof CrescendoStrategy
  - marathonScript turns param is now optional, defaults to this.totalTurns (matches the Python fix)
  - Make GoatStrategy.getStage() private (matches Python _get_stage())
  - Fix float notation 0.3/0.7 → 0.30/0.70 to match Python

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: update renderMetapromptTemplate tests to use explicit phaseEnds param

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: remove instanceof coupling, unify config types, add JS GOAT tests

  Architecture:
  - Add template_variables() to the RedTeamStrategy base (Python) and phaseEnds?() to the RedTeamStrategy interface (JS) — strategies declare their own template vars
  - CrescendoStrategy overrides to return phase boundary turn numbers; GoatStrategy returns nothing. Removes the isinstance(CrescendoStrategy) check from the orchestrator
  - Remove the _PHASES import from red_team_agent.py (no longer needed)

  JS:
  - CrescendoConfig = Omit<RedTeamAgentConfig, "strategy"> — eliminates 13-field duplication across three interfaces
  - GoatConfig gets a doc comment explaining it is a named hook for future GOAT params
  - CrescendoStrategy.getPhase() made private; tests updated to use getPhaseName()
  - Add 24 JS unit tests for GoatStrategy (stage boundaries, buildSystemPrompt, phaseEnds, redTeamGoat factory defaults)
  - Remove vestigial vi.doMock("ai") that never intercepted calls (module pre-loaded)
  - phaseEnds test uses the literal [2, 4, 7] instead of re-deriving the formula

  Python:
  - metaprompt_template falsy check fixed: `or` → `is not None` (matches JS `??`)
  - Add "Should not be reached" comment to the GoatStrategy fallback (matches Crescendo)

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: remove phaseEnds test on GoatStrategy — absence is enforced by types

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: docstrings, error messages, exports, and test coverage gaps

  - Fix GoatStrategy docstring: remove the benchmark-specific "in 5 turns" claim
  - Fix the get_phase_name base class docstring: strategy-agnostic return value wording
  - Wrap .format() in _generate_attack_plan with a helpful ValueError on KeyError (e.g. user passes a Crescendo template to a GOAT agent — was a silent crash)
  - Export GOAT_METAPROMPT_TEMPLATE from Python scenario.__init__ and JS index.ts so users can inspect/extend without importing from internal paths
  - Update the GoatConfig JSDoc to document inherited options and the totalTurns=30 default
  - Add Python test: goat() allows overriding metaprompt_template via kwargs
  - Add JS test: renderMetapromptTemplate leaves phase placeholders as literals when phaseEnds is omitted (documents the silent passthrough behavior)

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(red-team): close GOAT default-template foot-gun, document agent reuse

  The `goat()`/`redTeamGoat` factories used `setdefault` / object-spread patterns that left an explicit `metaprompt_template=None` (Python) or `metapromptTemplate: undefined` (TypeScript) in place. The constructor then fell back to the Crescendo `_DEFAULT_METAPROMPT_TEMPLATE`, which contains `{phase1_end}` placeholders that GoatStrategy.template_variables() does not provide — the first attack-plan render dies with a KeyError. Force the GOAT default whenever the caller's value is None/undefined.

  Also document the silent-stale-plan failure mode: `_attack_plan` is cached on the instance and survives across `scenario.run()` calls. Reusing the same agent across scenarios with different descriptions silently uses the first run's plan. Added `.. note::` blocks to both `goat()` and `crescendo()` Python docstrings and `@remarks` to the TS factories.

  Added a warning to `goat()` about combining `injection_probability` with GOAT — the GOAT metaprompt already steers the attacker toward encoding techniques, and post-hoc encoding desyncs H_attacker from what the target saw. Default 0.0 is the safe path.

  Verified: 156 Python + 108 JS tests pass.
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor(red-team): align GOAT with Meta's paper — drop pre-generated plan and stage hints (#340)

  Meta's GOAT paper (ICML 2025) does not pre-generate an attack plan via a metaprompt LLM call, and the attacker's system prompt has no early/mid/late stage hints. Adaptation is driven entirely by the per-turn score/hint feedback that lives in the attacker's private conversation history (H_attacker).

  Changes:
  - Add a `needs_metaprompt_plan` (Python) / `needsMetapromptPlan` (TS) property on the strategy interface, default True. GoatStrategy overrides to False; CrescendoStrategy keeps True.
  - The orchestrator (`call()`) consults the flag and skips `_generateAttackPlan` when False, saving one LLM call on turn 1 and eliminating the description-keyed stale-plan bug entirely for GOAT.
  - Remove the `_STAGES` / `STAGES` array and `_get_stage` / `getStage` methods from GoatStrategy. `build_system_prompt` no longer renders a "Stage:" line or an ATTACK PLAN section.
  - Keep `get_phase_name` returning a coarse progress bucket (`early`/`mid`/`late`) for telemetry dashboards only — this label is no longer surfaced to the attacker.
  - Drop the `GOAT_METAPROMPT_TEMPLATE` constant and its public export (Python `scenario.__init__.py` + JS `index.ts`). GOAT never renders a template.
  - Simplify the `goat()` / `redTeamGoat` factories: no more `setdefault` / object-spread template injection dance (and no accompanying foot-gun).
  - Update tests: rewrite GoatStrategy stage tests as progress-bucket tests; add explicit assertions that GOAT prompts contain no ATTACK PLAN section and no stage hints; drop obsolete template-rendering tests.

  Net effect: GOAT behaviour moves from ~60% to ~85% faithful to the paper. The remaining gap is structured output (observation/strategy/reply JSON), tracked in #2142 / #330.

  Tests: 152 Python + 103 JS passing.
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(red-team): structured attacker output — observation/strategy/reply JSON (#341)

  Meta's GOAT paper (ICML 2025) has the attacker emit a structured chain of thought at every turn: an observation of the target's last response, a strategy (which technique it will use and why), then the actual reply. We were letting the attacker emit free text — losing the reasoning signal, making per-turn technique selection invisible to telemetry, and leaving no gate against the attacker skipping the "Thought" step entirely. This commit implements the paper's contract:

  - New `JSON_OUTPUT_CONTRACT` module constant (Python `_red_team/base.py`, TypeScript `red-team-strategy.ts`) appended to every attacker system prompt. Instructs the attacker to emit {"observation": "...", "strategy": "...", "reply": "..."} and nothing else.
  - Added to both `CrescendoStrategy` and `GoatStrategy` system prompts so both strategies benefit.
  - `RedTeamAgent._parse_attacker_output` (Python) / `parseAttackerOutput` (TS) parses the attacker's raw output into `(reply, observation, strategy)`. Strips markdown fences, handles malformed JSON, coerces non-string fields. Fallback on parse failure: the whole raw response becomes the `reply` and a WARN-level log fires — the scenario keeps running.
  - Only `reply` reaches the target. `observation` and `strategy` are emitted as OpenTelemetry span attributes (`red_team.reasoning.observation`, `red_team.reasoning.strategy`, `red_team.reasoning.parse_failed`) so dashboards can answer "which technique works against which target?" — the paper's core selling point.
  - The raw JSON output is kept in H_attacker so the attacker sees its own format on subsequent turns (keeps the output shape consistent with the system prompt's directive).

  Paper fidelity: moves GOAT from ~85% to ~95% faithful.

  Tests: 163 Python (+11) and 115 JS (+12) passing.
  - Parser: well-formed JSON, code-fence stripping (with and without the json tag), malformed JSON fallback, missing/empty reply fallback, non-object JSON fallback, non-string field coercion, whitespace trimming.
  - Contract: OUTPUT FORMAT section present in both GOAT and Crescendo system prompts.
  - Existing tests (which mock the attacker with plain strings) continue to pass via the graceful fallback path.

  Closes langwatch/scenario#2142 (structured attacker output). Closes #330 (GOAT technique telemetry) once consumers wire the span attributes to dashboards.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor(red-team): scope JSON output contract to GOAT only

  Crescendo does not emit structured output in Microsoft's Crescendo paper — applying the JSON contract to both strategies was scope creep for a GOAT-focused PR stack. Roll it back on Crescendo behind a new per-strategy flag so the parser only runs when the attacker is actually instructed to emit JSON.

  Changes:
  - Add an `emits_structured_output` (Python) / `emitsStructuredOutput` (TS) property on RedTeamStrategy. Default False/falsy. GoatStrategy overrides to True.
  - Remove the `JSON_OUTPUT_CONTRACT` import and interpolation from CrescendoStrategy in both Python and TS.
  - Gate the parser in `call()`: run it only when the strategy's `emits_structured_output` flag is set. Otherwise use the raw attacker output as the reply with no parsing and no telemetry spam.
  - Update tests: GoatStrategy asserts OUTPUT FORMAT present + flag true; CrescendoStrategy asserts OUTPUT FORMAT absent + flag falsy.

  Tests: 164 Python (+1) / 116 JS (+1) passing.
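The parse-with-graceful-fallback contract described in the commits above can be sketched as follows. The function and field names come from the commit text; the dict return type, fence handling, and coercion details are illustrative assumptions, not the library's actual implementation.

```python
import json

def parse_attacker_output(raw: str) -> dict:
    """Parse {"observation", "strategy", "reply"}; fall back to raw on failure."""
    text = raw.strip()
    # Strip a surrounding markdown code fence (```json ... ``` or ``` ... ```).
    if text.startswith("```"):
        text = text.split("\n", 1)[1] if "\n" in text else ""
        if text.rstrip().endswith("```"):
            text = text.rstrip()[:-3]
    try:
        data = json.loads(text)
        # Must be an object with a non-empty reply, else treat as malformed.
        assert isinstance(data, dict) and str(data.get("reply", "")).strip()
    except (json.JSONDecodeError, AssertionError):
        # Fallback: the whole raw response becomes the reply; the run continues.
        return {"reply": raw, "observation": "", "strategy": "", "parse_failed": True}
    return {
        "reply": str(data["reply"]).strip(),
        "observation": str(data.get("observation", "")),  # coerce non-strings
        "strategy": str(data.get("strategy", "")),
        "parse_failed": False,
    }
```

Only `reply` would reach the target; `observation` and `strategy` feed telemetry, and `parse_failed` lets existing plain-string test mocks keep working.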
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(red-team): type-check Crescendo's optional emitsStructuredOutput via interface

  CI typecheck in vitest-examples failed on `expect(new CrescendoStrategy().emitsStructuredOutput).toBeUndefined();` because `emitsStructuredOutput` is declared on the RedTeamStrategy interface as optional but never added to the Crescendo class — so direct access on the concrete type is a TS2339 error under strict mode. Access via the interface type so the optional property is visible.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(red-team): scale backtrack budget with total_turns, add run telemetry (#347)

  - max_backtracks: new optional param, auto-scales max(1, total_turns // 3) so 5-turn runs don't over-provision and 100-turn runs aren't starved against hardened targets (closes #331)
  - red_team.progress span attr: current_turn/total_turns, enables timeline filters in dashboards without deriving from turn/total_turns pairs
  - red_team.parse_failure_count span attr: cumulative count of malformed structured-output responses per run, surfaces attacker output-format reliability per provider/model for GOAT runs
  - TS parity for max_backtracks; telemetry is Python-only for now (TS red-team has no OTel instrumentation yet)

  #336 (success_score default) deferred — the issue explicitly requires a data sweep before changing; it will be a separate follow-up PR.

  Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(red-team): techniques as data — typed catalogue + chosen_technique_ids telemetry (#348)

  Replace the inline TECHNIQUE_CATALOGUE string literal with a typed list of Technique records. Attackers still see the byte-identical rendered prompt, but the catalogue is now first-class data that downstream code can query, extend, and serialize.

  - New Technique dataclass/interface (Py + TS) with id/name/description/example
  - DEFAULT_GOAT_TECHNIQUES exports the paper's 7 techniques
  - render_catalogue(techniques) / renderCatalogue(techniques) produces the attacker-facing prompt, locked to byte-parity with the previous string
  - extract_chosen_ids parses the attacker's `strategy` field case-insensitively against both ID (HYPOTHETICAL_FRAMING) and name (HYPOTHETICAL FRAMING) forms
  - GoatStrategy(techniques=...) accepts a user override — closes the research ask "what if I add a new technique?" — with duplicate-ID validation
  - New span attr: red_team.chosen_technique_ids: list[str] per GOAT turn, groupable in dashboards to answer "which techniques work on which targets?" (Python only — TS red-team has no OTel instrumentation yet)
  - RedTeamStrategy.chosen_technique_ids() default returns [] so non-catalogue strategies contribute nothing (Crescendo unaffected)

  Closes #330. Partial toward #335 (Py↔TS parity — the two lists still exist separately; a shared JSON fixture is a follow-up).

  Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor(red-team): strategies own their output parsing (#349)

  Move `parse_attacker_output` from a RedTeamAgent static method into the strategy interface. Remove the `emits_structured_output` feature flag — the orchestrator always calls `strategy.parse_attacker_output(raw)`, and strategies themselves decide how to interpret the attacker's output.

  - New `AttackerOutput` dataclass / interface (reply, observation, strategy, parse_failed). Exported from the package root.
  - The base `RedTeamStrategy` provides a default `parse_attacker_output` that returns `AttackerOutput(reply=raw)` — the right shape for strategies without a JSON contract. GoatStrategy overrides with JSON parsing.
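The "techniques as data" shape from #348 above can be sketched with a small record type plus a renderer. Field names follow the commit; the rendered format below is purely illustrative (the real renderer is locked to byte-parity with the original prompt string), and the validation message is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Technique:
    id: str           # stable identifier, e.g. HYPOTHETICAL_FRAMING
    name: str         # human-readable name shown to the attacker
    description: str  # what the technique does
    example: str      # one-line example message

def render_catalogue(techniques: list[Technique]) -> str:
    """Render the attacker-facing catalogue text, rejecting duplicate IDs."""
    seen: set[str] = set()
    for t in techniques:
        if t.id in seen:
            raise ValueError(f"duplicate technique id: {t.id}")
        seen.add(t.id)
    # Illustrative format only; the real output matches the legacy string.
    return "\n".join(
        f"{t.id}: {t.name} — {t.description} (e.g. {t.example})" for t in techniques
    )
```

Because the catalogue is data rather than a string literal, a user-supplied `techniques=` override and the `chosen_technique_ids` telemetry both fall out naturally.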
  - Remove the `if self._strategy.emits_structured_output:` branch from `RedTeamAgent.call()`. Single code path, telemetry always emitted.
  - Delete the `RedTeamAgent._parse_attacker_output` static (Py) / the module-level `parseAttackerOutput` export (TS) — the parser is on the strategy now.
  - `parse_failed` is set by the strategy itself, not inferred downstream by inspecting obs/strategy emptiness — cleaner, and lets custom strategies signal failure without tripping the heuristic.

  New strategies can now add custom output schemas without orchestrator changes — adding a technique-interface extension on top of the flag-based design would have required the exact branch we just removed.

  Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(red-team): zero-friction report dashboard — auto-save + CLI + docs (#351)

  Auto-save via the pytest plugin, the `scenario redteam-report` CLI, TS parity, and a docs page; full details in this PR's description above.

  Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(red-team): three edge-case hardening fixes (#337, #333a, #333b) (#354)

  Three small correctness fixes on red-team edge cases, batched so each gets its own test but reviewers only open one PR. All three convert "fails silent" bugs into "fails loud" signals.

  ## #337 — Regression test for the KeyError helper

  `_generate_attack_plan` already wraps `template.format()` in a try/except that turns a raw KeyError into a friendly ValueError pointing at the common cause (Crescendo ↔ GOAT template mismatch). There was no test — a future refactor could silently lose the helpful message. Adds `TestMetapromptTemplateKeyErrorHelper`: construct an agent with a bogus placeholder, assert the ValueError fires and its message mentions the unknown key, the available keys, and the Crescendo/GOAT hint.

  ## #333a — Validate empty techniques list

  Before: `RedTeamAgent.crescendo(injection_probability=0.5, techniques=[])` silently skipped injection on every turn (the falsy empty list short-circuited the `and self._techniques` guard in `call()`), so users thought injection was active but got no encoded attacks. After: raise ValueError at construction with a clear message pointing at both workarounds ("disable injection, or provide at least one technique"). `techniques=None` still falls back to DEFAULT_TECHNIQUES — unchanged.
  ## #333b — Empty/whitespace plan must fail loud

  Strategies that use a plan (Crescendo, `needs_metaprompt_plan=True`) REQUIRE a non-empty plan by design. Previously `_generate_attack_plan` only rejected None content — an empty string or all-whitespace response from the metaprompt LLM would be silently stored as the plan, then rendered as a labeled-but-blank "ATTACK PLAN:" section. The attacker LLM reads that as "your plan is nothing" and degrades to generic attacks. The user gets no signal that the metaprompt call silently failed.

  Fix `_generate_attack_plan` to raise RuntimeError when the returned content is None, empty, or whitespace-only. The check moves from `build_system_prompt` (wrong layer — prettifies malformed state) to the generation step (right layer — prevents malformed state). An earlier version of this PR instead omitted the ATTACK PLAN section in the system prompt when empty; that hid the upstream bug. Adds `TestEmptyMetapromptPlanRaises` with three cases (empty, whitespace, real plan) using a monkeypatched `litellm.acompletion`.

  ## Tests

  187 Python red-team tests pass (+7 new).

  Closes #337. Addresses parts a + b of #333 (part c — scorer-failure cascade WARN threshold — remains open for a follow-up).

  Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(red-team): guard against cross-run RedTeamAgent reuse (#329) (#353)

  RedTeamAgent keeps mutable per-run state on the instance (attacker history, turn scores, backtrack counter, backtrack history, parse failure count). Reusing the same agent across scenario.run() calls — serial or parallel — silently interleaves that state, corrupting both runs. This lands Option A from #329: document the contract, enforce it at runtime.

  - Track self._run_thread_id / this.runThreadId (the first call's thread_id).
  - On turn 1, if a different thread_id was previously recorded, raise a clear RuntimeError directing the user to instantiate a fresh agent.
  - On turn > 1 with a changed thread_id, raise too (defensive — shouldn't happen under the normal orchestrator, but catches manual-call misuse).
  - Use a getattr fallback so existing MagicMock(spec=AgentInput) test fixtures that don't set thread_id don't trip the guard.
  - Update the crescendo()/goat() / redTeamCrescendo/redTeamGoat docstrings to state the single-use contract explicitly.

  Option B (ContextVar / AsyncLocalStorage isolation) would let users reuse agents transparently but hides the bug — the correct shape is to enforce the single-use contract loudly.

  Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(red-team): annotate H_attacker on post-hoc injection (closes #326, #334) (#365)

  When injection_probability fires, the target received the encoded form but the attacker's private history (H_attacker) only recorded the plaintext. The attacker LLM then reasoned against a conversation that didn't match what the target actually saw — score/hint feedback was computed on a turn that, from the attacker's point of view, didn't happen.

  Append an `[INJECTED <technique>]` system marker to H_attacker on every injected turn so the attacker's next-turn reasoning stays aligned. The same fix applies to Crescendo and GOAT because the injection block runs before the strategy branches.

  Also adds a defensive heuristic (`_looks_already_encoded` / `looksAlreadyEncoded`) that skips injection when the reply is already a long Base64-charset string, preventing double-encoding if a user extends the GOAT catalogue with encoding-style techniques.

  The docstring / JSDoc for goat() / redTeamGoat replaces the "not recommended" warning with a note describing the new behaviour.
Closes #326 Closes #334

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Potential fix for pull request finding 'Empty except'

  Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

* Potential fix for pull request finding 'Unused local variable'

  Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

* fix: update injection test for [INJECTED] marker and drop unused Any import

  - `test_injection_keeps_original_in_attacker_history` now accounts for the `[INJECTED <technique>]` system marker appended by PR #365 — assistant turn is at `[-2]`, marker at `[-1]`.
  - Remove unused `Any` import from `report/_aggregate.py` (CodeQL finding).

* fix(tests): cast result dict to satisfy pyright in injection tests

  Pyright couldn't narrow the `AgentReturnTypes` union to dict when indexing `result["content"]` — `assert isinstance` + `cast` keeps the test assertions identical while making the types explicit.

* docs(red-team): add GOAT Strategy page under Red Teaming (#367)

  The GOAT strategy shipped in #346 but was invisible to users — zero mentions in the docs site. Adds a dedicated page covering:

  - When to choose GOAT over Crescendo (comparison table)
  - Python + TS quick starts
  - How the per-turn attacker loop works + JSON contract
  - The 7-technique default catalogue and how to override it
  - Full config reference (including `injection_probability` behaviour after #365's `H_attacker` marker fix)
  - OpenTelemetry span attributes for observability
  - Paper-fidelity notes + known limitations (short runs, non-English)

  Also links to the new page from the Red Teaming overview and wires it into the sidebar nav in `vocs.config.tsx` between Overview and Reports.

* feat(red-team): GOAT DX polish — phase_kind, metaprompt warn, techniques split

  Three senior-review DX items merged into the GOAT strategy PR:

  1. `phase_kind` on `RedTeamStrategy` (Py) / `phaseKind` (JS): Crescendo returns `"staged"` (default), GOAT returns `"progress"`. The orchestrator now emits `red_team.phase` for staged strategies and `red_team.progress_bucket` for progress strategies, so dashboards don't conflate GOAT's coarse early/mid/late label with Crescendo's semantic warmup/probing/escalation/direct phases.
  2. Warn when `metaprompt_template` is passed to a strategy that ignores it (`needs_metaprompt_plan=False`). Previously the value was silently stored and never rendered — users couldn't tell their custom plan was being dropped. Now fires a `UserWarning` (Py) / `console.warn` (JS) at construction time.
  3. Split `.goat(techniques=...)` into `goat_techniques=` (semantic catalogue the attacker LLM picks from each turn) and `encoding_techniques=` (Base64/ROT13/... encoders driven by `injection_probability`). The old `techniques=` kwarg keeps working as a deprecated alias for `encoding_techniques=` with a `DeprecationWarning`; passing both raises `TypeError`. Same shape in TypeScript via `goatTechniques` / `encodingTechniques` on `GoatConfig`.

  Tests: 207 Python + 138 JS, all green. Adds `TestPhaseKind`, `TestMetapromptTemplateIgnoredWarning`, `TestGoatTechniquesKwargSplit` and the JS equivalents under `red-team.test.ts`.

* fix(tests): satisfy strategy ABC signature in backward-compat test

  Addresses the github-code-quality bot comment on the `OldCustomStrategy` override — declare the full `build_system_prompt` signature so the override matches the abstract base. Behavior unchanged; the test only probes the `phase_kind` default.

* fix(tests): satisfy pyright + tsc on DX polish tests

  - pyright: narrow `agent._strategy` to `GoatStrategy` before accessing `.techniques` in the goat/encoding independence test
  - tsc: `CrescendoStrategy` doesn't declare `phaseKind` (optional on the interface); access via a typed view in the default-behaviour test

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
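The kwarg split in the DX-polish commit above can be sketched as follows. The names `goat_techniques=`, `encoding_techniques=`, and the deprecated `techniques=` alias come from the commit message; the function body is an assumed illustration, not the shipped implementation.

```python
import warnings

def goat(goat_techniques=None, encoding_techniques=None, techniques=None):
    """Sketch of the split: techniques= survives as a deprecated alias."""
    if techniques is not None:
        if encoding_techniques is not None:
            # passing both the old alias and the new kwarg is ambiguous
            raise TypeError("pass either techniques= or encoding_techniques=, not both")
        warnings.warn("techniques= is deprecated; use encoding_techniques=",
                      DeprecationWarning)
        encoding_techniques = techniques
    return {"goat": goat_techniques or [], "encoding": encoding_techniques or []}

with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    cfg = goat(techniques=["base64"])  # old kwarg still works
print(cfg["encoding"])  # ['base64']
```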
Stacks on #340 (paper fidelity, merged into base). Adds the zero-friction report pipeline: auto-save via pytest plugin, `scenario redteam-report` CLI, TS parity, and a docs page.

## User flow (before → after)
Before — three friction points in every red-team test: import `save_redteam_report`, call it explicitly after every `scenario.run()`, and memorize the Streamlit app path and batch directory to launch the dashboard.
After — three commands, zero boilerplate in test code:
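As the commit summary puts it, the full flow is:

```bash
pip install 'langwatch-scenario[report]'
pytest path/to/redteam_tests.py  # reports save automatically
scenario redteam-report          # opens dashboard
```

No imports in test code, no explicit save call, no path arguments.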
## End-to-end data flow
## What's in this PR
**M1 — Auto-save via pytest plugin**
- `python/scenario/pytest_plugin.py::_auto_save_redteam_report` hooks the existing `auto_reporting_run` wrapper.
- Detection: `isinstance(agent, RedTeamAgent)` in the agents list — no marker/decorator needed.
- Env vars: `SCENARIO_REDTEAM_REPORT=0` disables, `..._REPORT_DIR=path` overrides the batch root.
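A minimal sketch of how the hook behaves, using stand-ins for `RedTeamAgent` and `save_redteam_report` (the real wiring lives in `pytest_plugin.py`; this is an illustration, not the shipped code):

```python
import os
import warnings

class RedTeamAgent:  # stand-in for scenario's real agent class
    pass

saved_batches = []

def save_redteam_report(agents):  # stand-in for the real saver
    saved_batches.append(len(agents))

def _auto_save_redteam_report(agents):
    if os.environ.get("SCENARIO_REDTEAM_REPORT", "1") == "0":
        return  # SCENARIO_REDTEAM_REPORT=0 disables auto-save
    red_team = [a for a in agents if isinstance(a, RedTeamAgent)]
    if not red_team:
        return  # plain isinstance detection; no marker/decorator needed
    try:
        save_redteam_report(red_team)
    except Exception as exc:
        # swallow with a warning so reporting failures never break tests
        warnings.warn(f"red-team report auto-save failed: {exc}")

_auto_save_redteam_report([RedTeamAgent(), object()])
print(saved_batches)  # only the red-team agent was counted
```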
**M2 — `scenario redteam-report` CLI**

- `python/scenario/cli.py`: `_find_batch_dir()` picks the latest timestamped dir via lexicographic sort.
- `setup.py` + `pyproject.toml` register the console script and `[report]` extras (`streamlit`, `plotly`, `pandas`).
- Flags: `--latest N`, `--batch <ts>`, `--dir <path>`, `--port`, `--no-browser`.
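The batch-discovery trick is that `YYYYMMDD_HHMMSS` names sort lexicographically in chronological order, so "latest" is a reverse sort. A sketch under that assumption (the real `_find_batch_dir` signature may differ):

```python
from pathlib import Path
import tempfile

def find_batch_dir(root: Path, latest: int = 1):
    """Return the Nth most recent timestamped batch dir under root, or None."""
    batches = sorted((d for d in root.iterdir() if d.is_dir()), reverse=True)
    return batches[latest - 1] if len(batches) >= latest else None

root = Path(tempfile.mkdtemp())
for ts in ("20260410_091500", "20260415_140000", "20260412_233000"):
    (root / ts).mkdir()

print(find_batch_dir(root).name)            # most recent batch
print(find_batch_dir(root, latest=3).name)  # third most recent
```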
**M3 — TypeScript auto-save parity**

- `javascript/src/red-team-report.ts` — same JSON shape as Python, duck-typed `isRedTeamAgent()`.
- `runner/run.ts` calls it after `execution.execute()` when a RedTeamAgent is present.
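Per the commit summary, the TS side skips LLM-based severity/suggestion analysis at save time and writes placeholder fields with `analysis_pending: true`. A Python sketch of that shape (the exact schema is an assumption; only `analysis_pending` is confirmed by the PR):

```python
import json

def build_placeholder_report(scenario_name: str) -> dict:
    """Mirror of the shared JSON shape with analysis deferred."""
    return {
        "scenario": scenario_name,
        "severity": None,          # normally filled by LLM analysis
        "suggestions": [],         # normally generated at save time
        "analysis_pending": True,  # signals that analysis was skipped
    }

report = build_placeholder_report("verification_goat")
print(json.dumps(report, sort_keys=True))
```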
**M4 — Docs page**

- `docs/docs/pages/advanced/red-teaming/report.mdx` — quick start, dashboard viewing, JSON shape, env-var config, CI/CD snippet, headless/SSH running, advanced manual save, troubleshooting (7 common failure modes).
- `docs/vocs.config.tsx` nav entry under Red Teaming.

**Report module (previously untracked)**
- `python/scenario/report/{__init__,_save,_aggregate,app}.py` moved into tracking.

## Key design decisions
- Batch dir resolved once per pytest session (`_BATCH_DIR`, lazy-init) and timestamped (`YYYYMMDD_HHMMSS`), so lexicographic order is chronological order.
- Aggregation results cached in `_aggregated_fixes.json`; refresh via a dashboard button.
- `isinstance` detection instead of `@pytest.mark.redteam`: if a test runs a `RedTeamAgent`, the report Just Appears.
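The lazy-init batch dir can be sketched as below. `_BATCH_DIR` and the timestamp format come from the PR; the helper body and the full env-var name `SCENARIO_REDTEAM_REPORT_DIR` (the PR abbreviates it to `..._REPORT_DIR`) are assumptions.

```python
import os
from datetime import datetime
from pathlib import Path

_BATCH_DIR = None  # module-level cache, resolved at most once per session

def get_batch_dir() -> Path:
    global _BATCH_DIR
    if _BATCH_DIR is None:
        # env-var name assumed for illustration; the PR abbreviates it
        root = Path(os.environ.get("SCENARIO_REDTEAM_REPORT_DIR",
                                   "./redteam-reports"))
        _BATCH_DIR = root / datetime.now().strftime("%Y%m%d_%H%M%S")
    return _BATCH_DIR  # every later call reuses the same dir

first, second = get_batch_dir(), get_batch_dir()
print(first is second)  # all reports in a session share one batch dir
```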
## Verified end-to-end

Ran `test_goat_short_run_produces_report` (3-turn GOAT vs. mock always-refuse agent):

- Report written to `./redteam-reports/<ts>/<ts>_goat_report_verification_goat.json`.
- `severity=high`, `break_severity=none`, non-empty `failure_summary`, 4 suggestions, correct `failing_turn_index=null` (the agent held).
- `_find_batch_dir` resolves `--latest 1/2/3` correctly.
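The verified fields can be expressed as a small validator over a hand-built stand-in report. Field names (`severity`, `break_severity`, `failure_summary`, `failing_turn_index`) come from the PR; the validator and sample values are illustrative.

```python
def check_report(report: dict) -> list:
    """Collect shape problems in a red-team report dict (sketch)."""
    problems = []
    if report.get("severity") not in {"none", "low", "medium", "high"}:
        problems.append("bad severity")
    if not report.get("failure_summary"):
        problems.append("empty failure_summary")
    if report.get("break_severity") == "none" and report.get("failing_turn_index") is not None:
        # agent never broke, so no turn index should be recorded
        problems.append("failing_turn_index should be null when the agent held")
    return problems

report = {
    "severity": "high",
    "break_severity": "none",
    "failure_summary": "attacker escalated but the agent refused throughout",
    "suggestions": ["s1", "s2", "s3", "s4"],
    "failing_turn_index": None,
}
print(check_report(report))  # []
```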
## Related

- Base branch: `feat/red-team-dynamic-techniques`.

🤖 Generated with Claude Code