feat(red-team): guard against cross-run RedTeamAgent reuse (#329) by Aryansharma28 · Pull Request #353 · langwatch/scenario

Aryansharma28 · 2026-04-15T15:42:04Z

Summary

Stacked on #346 (alongside #349 which is also open). Closes #329 via Option A from the issue: explicit single-use contract enforced at runtime.

`RedTeamAgent` carries mutable per-run state on the instance (attacker history, turn scores, backtrack counter/history, parse failure count). Reusing the same agent across `scenario.run()` calls — serial or parallel — silently interleaves that state, corrupting both runs. The issue mentions that the `running-in-parallel` docs even suggested this was supported.
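
A minimal sketch of the failure mode, using a hypothetical `ToyAgent` (not the real `RedTeamAgent`) whose per-run state lives on the instance. When two runs share one instance, their turns end up in the same history:

```python
from typing import List, Tuple


class ToyAgent:
    """Hypothetical stand-in: per-run state lives on the instance."""

    def __init__(self) -> None:
        self.history: List[Tuple[str, str]] = []

    def call(self, thread_id: str, message: str) -> None:
        # Every call appends to the same list, whichever run it belongs to.
        self.history.append((thread_id, message))


shared = ToyAgent()
shared.call("run-a", "turn 1")
shared.call("run-b", "turn 1")  # a second run reuses the same instance
shared.call("run-a", "turn 2")

# Both runs' turns now sit in one history.
threads_seen = {thread_id for thread_id, _ in shared.history}
print(sorted(threads_seen))
```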

- Guard logic: track the first call's `thread_id`. On turn 1 of a later call with a different `thread_id`, raise a clear `RuntimeError` telling the user to instantiate a fresh agent. On turn > 1 with a changed `thread_id`, also raise (defensive: catches manual-call misuse).
- Mock-friendly: use `getattr(input, "thread_id", None)` so existing `MagicMock(spec=AgentInput)` test fixtures that don't set `thread_id` don't trip the guard. Production `AgentInput` always has it (required field).
- Docstrings on `crescendo()` / `goat()` / `redTeamCrescendo` / `redTeamGoat` now state the single-use contract explicitly, replacing the old soft "attack plan might go stale" note that undersold the problem.
- TS mirror: same shape via `runThreadId`.
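
The guard described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `SingleUseGuard`, `check`, the `current_turn` parameter, and the error messages are hypothetical names, and `SimpleNamespace` stands in for `AgentInput`. Only `_run_thread_id` and the `getattr` fallback come from the description.

```python
from types import SimpleNamespace
from typing import Any, Optional


class SingleUseGuard:
    """Hypothetical stand-in for the guard state kept on the agent."""

    def __init__(self) -> None:
        self._run_thread_id: Optional[str] = None

    def check(self, agent_input: Any, current_turn: int) -> None:
        # getattr fallback: inputs without thread_id skip the guard.
        thread_id = getattr(agent_input, "thread_id", None)
        if thread_id is None:
            return
        if self._run_thread_id is None:
            # First call: remember which run this agent belongs to.
            self._run_thread_id = thread_id
            return
        if thread_id == self._run_thread_id:
            return  # legitimate multi-turn use within the same run
        if current_turn == 1:
            # A new run started with an agent that already served another run.
            raise RuntimeError(
                "This agent was already used in another run; "
                "instantiate a fresh agent for each scenario.run() call."
            )
        # Defensive path: thread_id changed in the middle of a run.
        raise RuntimeError("thread_id changed mid-run; instantiate a fresh agent.")


guard = SingleUseGuard()
guard.check(SimpleNamespace(thread_id="run-a"), current_turn=1)
guard.check(SimpleNamespace(thread_id="run-a"), current_turn=2)

try:
    guard.check(SimpleNamespace(thread_id="run-b"), current_turn=1)
    reuse_error = None
except RuntimeError as exc:
    reuse_error = str(exc)
```

Both failure paths raise the same exception type; the turn number only selects the message, which matches the "also raise" wording for the turn > 1 case.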

Why Option A, not ContextVars

The issue offered two options:

- A (this PR): document and enforce single-use. Runtime guard + docs.
- B: isolate per-run state via `ContextVar` / `AsyncLocalStorage`, letting users reuse agents transparently.

Option B hides the bug. A user who reuses an agent would still see consistent behaviour inside each run, but inspecting the instance would show interleaved state. Parity between Python's `contextvars` and Node's `AsyncLocalStorage` is also its own project: `AsyncLocalStorage` doesn't propagate reliably across some queue boundaries. Constructing a fresh agent is cheap (`RedTeamAgent.goat(...)` is one line), so there is no ergonomics reason to allow reuse. Catching misuse loudly is the right default.

Test plan

- Python: 184 tests pass (180 pre-existing + 4 new). New tests:
  - `test_first_run_records_thread_id`: records on turn 1
  - `test_same_thread_multi_turn_is_ok`: legitimate multi-turn doesn't trip
  - `test_cross_run_reuse_raises_on_turn_1`: the core guard
  - `test_mid_run_thread_change_raises`: defensive turn-2 path
- TypeScript: 345 tests pass (343 pre-existing + 2 new mirroring the core guard cases)
- Confirmed the `getattr` fallback keeps all 30+ existing tests that use `MagicMock(spec=AgentInput)` passing unchanged
- Manual: instantiate one `RedTeamAgent.goat(...)`, pass it to two sequential `scenario.run()` calls, confirm the second raises with a clear message (requires API keys)
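
The `getattr` fallback mentioned in the test plan can be illustrated with a small sketch. It assumes a fixture whose spec does not expose `thread_id` as an attribute; the list-of-names spec and the `read_thread_id` helper are hypothetical stand-ins for `MagicMock(spec=AgentInput)` and the guard's lookup.

```python
from unittest.mock import MagicMock


def read_thread_id(agent_input):
    # Same lookup the guard is described as using: a missing attribute yields None.
    return getattr(agent_input, "thread_id", None)


# Hypothetical spec without thread_id: attribute access raises AttributeError,
# which getattr() turns into the None default, so the guard would be skipped.
bare_fixture = MagicMock(spec=["messages"])

# A fixture that does set thread_id is seen by the guard as usual.
full_fixture = MagicMock(spec=["messages", "thread_id"])
full_fixture.thread_id = "run-a"

print(read_thread_id(bare_fixture), read_thread_id(full_fixture))
```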

Related

- Closes #329 (red team: shared mutable state on RedTeamAgent makes parallel scenario.run() unsafe)
- Stacks on #346 (feat: add GOAT strategy with dynamic technique selection for RedTeamAgent); sibling of #349 (refactor(red-team): strategies own their output parsing), not dependent on it
- Part of EPIC: langwatch#1713 (epic: scenarios red teaming)

🤖 Generated with Claude Code

RedTeamAgent keeps mutable per-run state on the instance (attacker history, turn scores, backtrack counter, backtrack history, parse failure count). Reusing the same agent across scenario.run() calls, serial or parallel, silently interleaves that state, corrupting both runs.

This lands Option A from #329: document the contract, enforce it at runtime.

- Track self._run_thread_id / this.runThreadId (first call's thread_id).
- On turn 1, if a different thread_id was previously recorded, raise a clear RuntimeError directing the user to instantiate a fresh agent.
- On turn > 1 with a changed thread_id, raise too (defensive: shouldn't happen under the normal orchestrator but catches manual-call misuse).
- Use getattr fallback so existing MagicMock(spec=AgentInput) test fixtures that don't set thread_id don't trip the guard.
- Update crescendo()/goat() / redTeamCrescendo/redTeamGoat docstrings to state the single-use contract explicitly.

Option B (ContextVar / AsyncLocalStorage isolation) would let users reuse agents transparently but hides the bug; the correct shape is to enforce the single-use contract loudly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-15T16:02:40Z

Automated low-risk assessment

This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk.

The PR changes production runtime behavior by enforcing a single-use contract on RedTeamAgent (adding thread_id tracking and raising errors on reuse or mid-run thread changes) in both TypeScript and Python. This is a behavioral/API change that can break callers and is not limited to docs/tests/config, so it does not meet the low-risk criteria.

This PR requires a manual review before merging.

…gent (#346) * feat: add RedTeamAgent.goat() strategy with dynamic technique selection Add GOAT (Generative Offensive Agent Tester) as a separate strategy alongside Crescendo. Based on Meta's GOAT paper (ICML 2025, 97% ASR). - GoatStrategy with 7-technique catalogue (hypothetical framing, persona modification, refusal suppression, response priming, dual response, topic splitting, authority & social engineering) - Soft progress stages (early/mid/late) instead of fixed phases - Dedicated GOAT metaprompt template for adaptive attack planning - Python: RedTeamAgent.goat(target=..., model=...) - TypeScript: scenario.redTeamGoat({ target, model }) - Crescendo (.crescendo()) is completely untouched Closes #2143 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: default marathon_script turns to total_turns and decouple GOAT metaprompt from Crescendo phases marathon_script(turns=...) was required with no default, causing TypeError in all test calls that omit it. Now defaults to self.total_turns. Also makes _generate_attack_plan strategy-aware: only computes Crescendo phase boundaries when using CrescendoStrategy, removing unnecessary coupling for the GOAT strategy. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: address PR review — GoatStrategy export, JS factory defaults, and test coverage Python: - Export GoatStrategy from scenario.__init__ (was missing, CrescendoStrategy was exported but not GoatStrategy) - Add 25 unit tests for GoatStrategy: stage boundaries, prompt building, factory method defaults - Fix technique 6 example message (was placeholder "...") JavaScript: - Fix redTeamGoat() always sets GOAT_METAPROMPT_TEMPLATE (was conditionally falling back to Crescendo template when attackPlan supplied) - Add totalTurns: 30 default to redTeamGoat() to match Python - Add metapromptTemplate to CrescendoConfig so users can override via both factory APIs - Fix renderMetapromptTemplate: phase boundary vars only injected for Crescendo (via optional phaseEnds param) - generateAttackPlan passes phaseEnds only when strategy instanceof CrescendoStrategy - marathonScript turns param is now optional, defaults to this.totalTurns (matches Python fix) - Make GoatStrategy.getStage() private (matches Python _get_stage()) - Fix float notation 0.3/0.7 → 0.30/0.70 to match Python Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: update renderMetapromptTemplate tests to use explicit phaseEnds param Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * refactor: remove instanceof coupling, unify config types, add JS GOAT tests Architecture: - Add template_variables() to RedTeamStrategy base (Python) and phaseEnds?() to RedTeamStrategy interface (JS) — strategies declare their own template vars - CrescendoStrategy overrides to return phase boundary turn numbers; GoatStrategy returns nothing. 
Removes isinstance(CrescendoStrategy) check from orchestrator - Remove _PHASES import from red_team_agent.py (no longer needed) JS: - CrescendoConfig = Omit<RedTeamAgentConfig, "strategy"> — eliminates 13-field duplication across three interfaces - GoatConfig gets doc comment explaining it is a named hook for future GOAT params - CrescendoStrategy.getPhase() made private; tests updated to use getPhaseName() - Add 24 JS unit tests for GoatStrategy (stage boundaries, buildSystemPrompt, phaseEnds, redTeamGoat factory defaults) - Remove vestigial vi.doMock("ai") that never intercepted calls (module pre-loaded) - phaseEnds test uses literal [2, 4, 7] instead of re-deriving the formula Python: - metaprompt_template falsy check fixed: `or` → `is not None` (matches JS `??`) - Add "Should not be reached" comment to GoatStrategy fallback (matches Crescendo) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: remove phaseEnds test on GoatStrategy — absence is enforced by types Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: docstrings, error messages, exports, and test coverage gaps - Fix GoatStrategy docstring: remove benchmark-specific "in 5 turns" claim - Fix get_phase_name base class docstring: strategy-agnostic return value wording - Wrap .format() in _generate_attack_plan with helpful ValueError on KeyError (e.g. 
user passes Crescendo template to GOAT agent — was silent crash) - Export GOAT_METAPROMPT_TEMPLATE from Python scenario.__init__ and JS index.ts so users can inspect/extend without importing from internal paths - Update GoatConfig JSDoc to document inherited options and totalTurns=30 default - Add Python test: goat() allows overriding metaprompt_template via kwargs - Add JS test: renderMetapromptTemplate leaves phase placeholders as literals when phaseEnds is omitted (documents silent passthrough behavior) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(red-team): close GOAT default-template foot-gun, document agent reuse The `goat()`/`redTeamGoat` factories used `setdefault` / object spread patterns that left an explicit `metaprompt_template=None` (Python) or `metapromptTemplate: undefined` (TypeScript) in place. The constructor then fell back to the Crescendo `_DEFAULT_METAPROMPT_TEMPLATE`, which contains `{phase1_end}` placeholders that GoatStrategy.template_variables() does not provide — first attack-plan render dies with KeyError. Force the GOAT default whenever the caller's value is None/undefined. Also document the silent-stale-plan failure mode: `_attack_plan` is cached on the instance and survives across `scenario.run()` calls. Reusing the same agent across scenarios with different descriptions silently uses the first run's plan. Added `.. note::` blocks to both `goat()` and `crescendo()` Python docstrings and `@remarks` to the TS factories. Added a warning to `goat()` about combining `injection_probability` with GOAT — the GOAT metaprompt already steers the attacker toward encoding techniques, and post-hoc encoding desyncs H_attacker from what the target saw. Default 0.0 is the safe path. Verified: 156 Python + 108 JS tests pass. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * refactor(red-team): align GOAT with Meta's paper — drop pre-generated plan and stage hints (#340) * refactor(red-team): align GOAT with Meta's paper — drop pre-generated plan + stage hints Meta's GOAT paper (ICML 2025) does not pre-generate an attack plan via a metaprompt LLM call, and the attacker's system prompt has no early/mid/late stage hints. Adaptation is driven entirely by the per-turn score/hint feedback that lives in the attacker's private conversation history (H_attacker). Changes: - Add `needs_metaprompt_plan` (Python) / `needsMetapromptPlan` (TS) property on the strategy interface, default True. GoatStrategy overrides to False; CrescendoStrategy keeps True. - Orchestrator (`call()`) consults the flag and skips `_generateAttackPlan` when False, saving one LLM call on turn 1 and eliminating the description-keyed stale-plan bug entirely for GOAT. - Remove `_STAGES` / `STAGES` array and `_get_stage` / `getStage` methods from GoatStrategy. `build_system_prompt` no longer renders a "Stage:" line or an ATTACK PLAN section. - Keep `get_phase_name` returning a coarse progress bucket (`early`/`mid`/`late`) for telemetry dashboards only — this label is no longer surfaced to the attacker. - Drop `GOAT_METAPROMPT_TEMPLATE` constant and its public export (Python `scenario.__init__.py` + JS `index.ts`). GOAT never renders a template. - Simplify `goat()` / `redTeamGoat` factories: no more `setdefault` / object-spread template injection dance (and no accompanying foot-gun). - Update tests: rewrite GoatStrategy stage tests as progress-bucket tests; add explicit assertions that GOAT prompts contain no ATTACK PLAN section and no stage hints; drop obsolete template-rendering tests. Net effect: GOAT behaviour moves from ~60% to ~85% faithful to the paper. The remaining gap is structured output (observation/strategy/reply JSON), tracked in #2142 / #330. Tests: 152 Python + 103 JS passing. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(red-team): structured attacker output — observation/strategy/reply JSON (#341) * feat(red-team): structured attacker output — observation / strategy / reply JSON Meta's GOAT paper (ICML 2025) has the attacker emit a structured chain of thought at every turn: observation of the target's last response, strategy (which technique it will use and why), then the actual reply. We were letting the attacker emit free text — losing the reasoning signal, making per-turn technique selection invisible to telemetry, and leaving no gate against the attacker skipping the "Thought" step entirely. This commit implements the paper's contract: - New `JSON_OUTPUT_CONTRACT` module constant (Python `_red_team/base.py`, TypeScript `red-team-strategy.ts`) appended to every attacker system prompt. Instructs the attacker to emit {"observation": "...", "strategy": "...", "reply": "..."} and nothing else. - Added to both `CrescendoStrategy` and `GoatStrategy` system prompts so both strategies benefit. - `RedTeamAgent._parse_attacker_output` (Python) / `parseAttackerOutput` (TS) parses the attacker's raw output into `(reply, observation, strategy)`. Strips markdown fences, handles malformed JSON, coerces non-string fields. Fallback on parse failure: the whole raw response becomes the `reply` and a WARN-level log fires — the scenario keeps running. - Only `reply` reaches the target. `observation` and `strategy` are emitted as OpenTelemetry span attributes (`red_team.reasoning.observation`, `red_team.reasoning.strategy`, `red_team.reasoning.parse_failed`) so dashboards can answer "which technique works against which target?" — the paper's core selling point. - The raw JSON output is kept in H_attacker so the attacker sees its own format on subsequent turns (keeps the output shape consistent with the system prompt's directive). Paper fidelity: moves GOAT from ~85% to ~95% faithful. 
Tests: 163 Python (+11) and 115 JS (+12) passing. - Parser: well-formed JSON, code-fence stripping (```json and ```), malformed JSON fallback, missing/empty reply fallback, non-object JSON fallback, non-string field coercion, whitespace trimming. - Contract: OUTPUT FORMAT section present in both GOAT and Crescendo system prompts. - Existing tests (which mock attacker with plain strings) continue to pass via the graceful fallback path. Closes langwatch/scenario#2142 (structured attacker output). Closes #330 (GOAT technique telemetry) once consumers wire the span attributes to dashboards. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * refactor(red-team): scope JSON output contract to GOAT only Crescendo does not emit structured output in Microsoft's Crescendo paper — applying the JSON contract to both strategies was scope creep for a GOAT-focused PR stack. Roll it back on Crescendo behind a new per-strategy flag so the parser only runs when the attacker is actually instructed to emit JSON. Changes: - Add `emits_structured_output` (Python) / `emitsStructuredOutput` (TS) property on RedTeamStrategy. Default False/falsy. GoatStrategy overrides to True. - Remove `JSON_OUTPUT_CONTRACT` import and interpolation from CrescendoStrategy in both Python and TS. - Gate the parser in `call()`: run it only when the strategy's `emits_structured_output` flag is set. Otherwise use raw attacker output as the reply with no parsing and no telemetry spam. - Update tests: GoatStrategy asserts OUTPUT FORMAT present + flag true; CrescendoStrategy asserts OUTPUT FORMAT absent + flag falsy. Tests: 164 Python (+1) / 116 JS (+1) passing. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(red-team): type-check Crescendo's optional emitsStructuredOutput via interface CI typecheck in vitest-examples failed on: expect(new CrescendoStrategy().emitsStructuredOutput).toBeUndefined(); because `emitsStructuredOutput` is declared on the RedTeamStrategy interface as optional but never added to the Crescendo class — so direct access on the concrete type is a TS2339 error under strict mode. Access via the interface type so the optional property is visible. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(red-team): scale backtrack budget with total_turns, add run telemetry (#347) - max_backtracks: new optional param, auto-scales max(1, total_turns // 3) so 5-turn runs don't over-provision and 100-turn runs aren't starved against hardened targets (closes #331) - red_team.progress span attr: current_turn/total_turns, enables timeline filters in dashboards without deriving from turn/total_turns pairs - red_team.parse_failure_count span attr: cumulative count of malformed structured-output responses per run, surfaces attacker output-format reliability per provider/model for GOAT runs - TS parity for max_backtracks; telemetry is Python-only for now (TS red-team has no OTel instrumentation yet) #336 (success_score default) deferred — issue explicitly requires a data sweep before changing; will be a separate follow-up PR. Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(red-team): techniques as data — typed catalogue + chosen_technique_ids telemetry (#348) Replace the inline TECHNIQUE_CATALOGUE string literal with a typed list of Technique records. 
Attackers still see the byte-identical rendered prompt, but the catalogue is now first-class data that downstream code can query, extend, and serialize. - New Technique dataclass/interface (Py + TS) with id/name/description/example - DEFAULT_GOAT_TECHNIQUES exports the paper's 7 techniques - render_catalogue(techniques) / renderCatalogue(techniques) produces the attacker-facing prompt, locked to byte-parity with the previous string - extract_chosen_ids parses the attacker's `strategy` field case-insensitively against both ID (HYPOTHETICAL_FRAMING) and name (HYPOTHETICAL FRAMING) forms - GoatStrategy(techniques=...) accepts a user override — closes the research ask "what if I add a new technique?" — with duplicate-ID validation - New span attr: red_team.chosen_technique_ids: list[str] per GOAT turn, groupable in dashboards to answer "which techniques work on which targets?" (Python only — TS red-team has no OTel instrumentation yet) - RedTeamStrategy.chosen_technique_ids() default returns [] so non-catalogue strategies contribute nothing (Crescendo unaffected) Closes #330. Partial toward #335 (Py↔TS parity — the two lists still exist separately; shared JSON fixture is a follow-up). Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * refactor(red-team): strategies own their output parsing (#349) Move `parse_attacker_output` from a RedTeamAgent static method into the strategy interface. Remove the `emits_structured_output` feature flag — the orchestrator always calls `strategy.parse_attacker_output(raw)`, and strategies themselves decide how to interpret the attacker's output. - New `AttackerOutput` dataclass / interface (reply, observation, strategy, parse_failed). Exported from the package root. - Base `RedTeamStrategy` provides a default `parse_attacker_output` that returns `AttackerOutput(reply=raw)` — the right shape for strategies without a JSON contract. GoatStrategy overrides with JSON parsing. 
- Remove the `if self._strategy.emits_structured_output:` branch from `RedTeamAgent.call()`. Single code path, telemetry always emitted. - Delete `RedTeamAgent._parse_attacker_output` static (Py) / the module-level `parseAttackerOutput` export (TS) — parser is on the strategy now. - `parse_failed` is set by the strategy itself, not inferred downstream by inspecting obs/strategy emptiness — cleaner and lets custom strategies signal failure without tripping the heuristic. New strategies can now add custom output schemas without orchestrator changes — adding a technique interface extension on top of the flag-based design would have required the exact branch we just removed. Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(red-team): zero-friction report dashboard — auto-save + CLI + docs (#351) feat(red-team): zero-friction report dashboard — auto-save + `scenario redteam-report` CLI Problem: the existing `save_redteam_report()` helper required users to import it and call it explicitly after every `scenario.run()`, and launching the Streamlit dashboard required memorizing the app path and batch directory. The report module was untracked — no one discovered it. This change wires auto-save into the pytest plugin and adds a CLI that auto-discovers the latest batch. ## The three-command user flow pip install 'langwatch-scenario[report]' pytest path/to/redteam_tests.py # reports save automatically scenario redteam-report # opens dashboard No imports in test code, no explicit save call, no path arguments. ## What landed **M1 — Auto-save via pytest plugin** - `pytest_plugin.py::_auto_save_redteam_report` hooks `auto_reporting_run` to detect `isinstance(agent, RedTeamAgent)` in the agents list and call `save_redteam_report`. Swallows errors with a warning so reporting failures never break tests. - Env vars: `SCENARIO_REDTEAM_REPORT=0` disables, `..._REPORT_DIR=path` overrides batch root. 
**M2 — CLI `scenario redteam-report`** - New `scenario/cli.py` with `_find_batch_dir()` (lexicographic sort of timestamped dirs → latest first). - `setup.py` registers `console_scripts: scenario = scenario.cli:main`. - `pyproject.toml` adds `[project.optional-dependencies] report = [streamlit, plotly, pandas]`. - Flags: `--latest N`, `--batch <ts>`, `--dir <path>`, `--port`, `--no-browser`. **M3 — TypeScript auto-save parity** - New `javascript/src/red-team-report.ts` with `isRedTeamAgent()` + `saveRedTeamReport()`, mirroring Python's JSON shape. - `runner/run.ts` calls it after `execution.execute()` when a RedTeamAgent is found. Same env vars. - TS version skips the LLM-based severity/suggestion analysis at save time (placeholder fields, `analysis_pending: true`) to keep tests fast; dashboard's on-demand aggregator computes them. - Exports `saveRedTeamReport` from the package root for users running scenarios outside the default runner. **M4 — Docs page under Red Teaming** - `docs/docs/pages/advanced/red-teaming/report.mdx` — quick start, how it works (with end-to-end data-flow diagram), dashboard viewing, JSON shape, config env vars, CI / CD snippet, advanced manual save, headless / SSH running, troubleshooting (7 common failure modes). - `docs/vocs.config.tsx` — nav entry "Reports Dashboard" under the existing Red Teaming section. **Report module (previously untracked)** - `python/scenario/report/` — moved into tracking. Contains `_save.save_redteam_report`, `_aggregate.aggregate_fixes` (the on-demand cross-batch LLM aggregator), and `app.py` (the 976-line Streamlit dashboard). ## Verified end-to-end Ran a 3-turn GOAT scenario against a mock always-refuse agent: - Test passed in 71s. - Report auto-saved to `./redteam-reports/20260415_153324/<ts>_<slug>_goat.json`. - Payload contains all analysis fields (severity=high, break_severity=none, 4 suggestions, failure_summary populated). - CLI `_find_batch_dir` resolves --latest 1/2/3 correctly. 
## Design notes (see docs for details) - `isinstance` detection over pytest markers: zero-friction, nothing for users to remember. - Timestamp-named batch directories: lexicographic sort = chronological sort; no metadata file needed. - LLM analysis synchronous on save (Python only): dashboards open instantly; no waiting for 20 LLM calls when you click. - Same JSON shape across Python and TS: one Streamlit app serves both. - Save errors swallowed with warning: reporting is observability, not correctness. Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(red-team): three edge-case hardening fixes (#337, #333a, #333b) (#354) Three small correctness fixes on red-team edge cases, batched so each gets its own test but reviewers only open one PR. All three convert "fails silent" bugs into "fails loud" signals. ## #337 — Regression test for the KeyError helper `_generate_attack_plan` already wraps `template.format()` in a try/except that turns a raw KeyError into a friendly ValueError pointing at the common cause (Crescendo ↔ GOAT template mismatch). There was no test — a future refactor could silently lose the helpful message. Adds `TestMetapromptTemplateKeyErrorHelper`: construct an agent with a bogus placeholder, assert the ValueError fires and its message mentions the unknown key, the available keys, and the Crescendo/GOAT hint. ## #333a — Validate empty techniques list Before: `RedTeamAgent.crescendo(injection_probability=0.5, techniques=[])` silently skipped injection on every turn (falsy empty list short-circuited the `and self._techniques` guard in `call()`), so users thought injection was active but got no encoded attacks. After: raise ValueError at construction with a clear message pointing at both workarounds ("disable injection, or provide at least one technique"). `techniques=None` still falls back to DEFAULT_TECHNIQUES — unchanged. 
## #333b — Empty/whitespace plan must fail loud Strategies that use a plan (Crescendo, `needs_metaprompt_plan=True`) REQUIRE a non-empty plan by design. Previously `_generate_attack_plan` only rejected None content — an empty string or all-whitespace response from the metaprompt LLM would be silently stored as the plan, then rendered as a labeled-but-blank "ATTACK PLAN:" section. The attacker LLM reads that as "your plan is nothing" and degrades to generic attacks. The user gets no signal the metaprompt call silently failed. Fix `_generate_attack_plan` to raise RuntimeError when the returned content is None, empty, or whitespace-only. The check moves from `build_system_prompt` (wrong layer — prettifies malformed state) to the generation step (right layer — prevents malformed state). An earlier version of this PR instead omitted the ATTACK PLAN section in the system prompt when empty; that hid the upstream bug. Adds `TestEmptyMetapromptPlanRaises` with three cases (empty, whitespace, real plan) using monkeypatched `litellm.acompletion`. ## Tests 187 Python red-team tests pass (+ 7 new). Closes #337 Addresses parts a + b of #333 (part c — scorer-failure cascade WARN threshold — remains open for a follow-up) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat(red-team): guard against cross-run RedTeamAgent reuse (#329) (#353) RedTeamAgent keeps mutable per-run state on the instance (attacker history, turn scores, backtrack counter, backtrack history, parse failure count). Reusing the same agent across scenario.run() calls — serial or parallel — silently interleaves that state, corrupting both runs. This lands Option A from #329: document the contract, enforce it at runtime. - Track self._run_thread_id / this.runThreadId (first call's thread_id). - On turn 1, if a different thread_id was previously recorded, raise a clear RuntimeError directing the user to instantiate a fresh agent. 
- On turn > 1 with a changed thread_id, raise too (defensive — shouldn't happen under the normal orchestrator but catches manual-call misuse). - Use getattr fallback so existing MagicMock(spec=AgentInput) test fixtures that don't set thread_id don't trip the guard. - Update crescendo()/goat() / redTeamCrescendo/redTeamGoat docstrings to state the single-use contract explicitly. Option B (ContextVar / AsyncLocalStorage isolation) would let users reuse agents transparently but hides the bug — the correct shape is to enforce the single-use contract loudly. Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(red-team): annotate H_attacker on post-hoc injection (closes #326, #334) (#365) fix(red-team): annotate H_attacker when post-hoc injection fires (#326, #334) When injection_probability fires, the target received the encoded form but the attacker's private history (H_attacker) only recorded the plaintext. The attacker LLM then reasoned against a conversation that didn't match what the target actually saw — score/hint feedback was computed on a turn that, from the attacker's point of view, didn't happen. Append a `[INJECTED <technique>]` system marker to H_attacker on every injected turn so the attacker's next-turn reasoning stays aligned. Same fix applies to Crescendo and GOAT because the injection block runs before the strategy branches. Also adds a defensive heuristic (`_looks_already_encoded` / `looksAlreadyEncoded`) that skips injection when the reply is already a long Base64-charset string, preventing double-encoding if a user extends the GOAT catalogue with encoding-style techniques. Docstring / JSDoc for goat() / redTeamGoat replaces the "not recommended" warning with a note describing the new behaviour. 
Closes #326 Closes #334 Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Potential fix for pull request finding 'Empty except' Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com> * Potential fix for pull request finding 'Unused local variable' Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com> * fix: update injection test for [INJECTED] marker and drop unused Any import - test_injection_keeps_original_in_attacker_history now accounts for the [INJECTED <technique>] system marker appended by PR #365 — assistant turn is at [-2], marker at [-1]. - Remove unused Any import from report/_aggregate.py (CodeQL finding). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(tests): cast result dict to satisfy pyright in injection tests Pyright couldn't narrow the AgentReturnTypes union to dict when indexing result["content"] — assert isinstance + cast keeps the test assertions identical while making the types explicit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(red-team): add GOAT Strategy page under Red Teaming (#367) The GOAT strategy shipped in #346 but was invisible to users — zero mentions in the docs site. Adds a dedicated page covering: - When to choose GOAT over Crescendo (comparison table) - Python + TS quick starts - How the per-turn attacker loop works + JSON contract - The 7-technique default catalogue and how to override it - Full config reference (including injection_probability behaviour after #365's H_attacker marker fix) - OpenTelemetry span attributes for observability - Paper-fidelity notes + known limitations (short runs, non-English) Also links to the new page from the Red Teaming overview and wires it into the sidebar nav in vocs.config.tsx between Overview and Reports. 
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(red-team): GOAT DX polish — phase_kind, metaprompt warn, techniques split Three senior-review DX items merged into the GOAT strategy PR: 1. phase_kind on RedTeamStrategy (Py) / phaseKind on RedTeamStrategy (JS) - Crescendo returns "staged" (default), GOAT returns "progress" - Orchestrator now emits red_team.phase for staged strategies and red_team.progress_bucket for progress strategies, so dashboards don't conflate GOAT's coarse early/mid/late label with Crescendo's semantic warmup/probing/escalation/direct phases. 2. Warn on metaprompt_template passed to a strategy that ignores it (needs_metaprompt_plan=False). Previously the value was silently stored and never rendered — users couldn't tell their custom plan was being dropped. Now fires a UserWarning (Py) / console.warn (JS) at construction time. 3. Split .goat(techniques=...) into goat_techniques= (semantic catalogue the attacker LLM picks from each turn) and encoding_techniques= (Base64/ROT13/... encoders driven by injection_probability). The old techniques= kwarg keeps working as a deprecated alias for encoding_techniques= with a DeprecationWarning. Passing both raises TypeError. Same shape in TypeScript via goatTechniques / encodingTechniques on GoatConfig. Tests: 207 Python + 138 JS, all green. Adds TestPhaseKind, TestMetapromptTemplateIgnoredWarning, TestGoatTechniquesKwargSplit and the JS equivalents under red-team.test.ts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(tests): satisfy strategy ABC signature in backward-compat test Addresses github-code-quality bot comment on the OldCustomStrategy override — declare the full build_system_prompt signature so the override matches the abstract base. Behavior unchanged; the test only probes phase_kind default. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(tests): satisfy pyright + tsc on DX polish tests - pyright: narrow `agent._strategy` to GoatStrategy before accessing `.techniques` in the goat/encoding independence test - tsc: CrescendoStrategy doesn't declare `phaseKind` (optional on the interface); access via a typed view in the default-behaviour test Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

Aryansharma28 force-pushed the feat/red-team-reuse-guard branch from 3e6bcaf to 0ec646f April 15, 2026 16:02

Aryansharma28 merged commit a57b45e into feat/red-team-dynamic-techniques Apr 15, 2026
3 checks passed

Aryansharma28 deleted the feat/red-team-reuse-guard branch April 15, 2026 16:02

Aryansharma28 mentioned this pull request Apr 20, 2026

feat: add GOAT strategy with dynamic technique selection for RedTeamAgent #346

Merged
