Skip to content

feat(red-team): guard against cross-run RedTeamAgent reuse (#329)#353

Merged
Aryansharma28 merged 1 commit intofeat/red-team-dynamic-techniquesfrom
feat/red-team-reuse-guard
Apr 15, 2026
Merged

feat(red-team): guard against cross-run RedTeamAgent reuse (#329)#353
Aryansharma28 merged 1 commit intofeat/red-team-dynamic-techniquesfrom
feat/red-team-reuse-guard

Conversation

@Aryansharma28
Copy link
Copy Markdown
Contributor

@Aryansharma28 Aryansharma28 commented Apr 15, 2026

Summary

Stacked on #346 (alongside #349 which is also open). Closes #329 via Option A from the issue: explicit single-use contract enforced at runtime.

`RedTeamAgent` carries mutable per-run state on the instance (attacker history, turn scores, backtrack counter/history, parse failure count). Reusing the same agent across `scenario.run()` calls — serial or parallel — silently interleaves that state, corrupting both runs. The issue mentions that the `running-in-parallel` docs even suggested this was supported.

  • Guard logic: track the first call's `thread_id`. On turn 1 of a later call with a different `thread_id`, raise a clear `RuntimeError` telling the user to instantiate a fresh agent. On turn > 1 with a changed `thread_id`, also raise (defensive — catches manual-call misuse).
  • Mock-friendly: use `getattr(input, "thread_id", None)` so existing `MagicMock(spec=AgentInput)` test fixtures that don't set `thread_id` don't trip the guard. Production `AgentInput` always has it (required field).
  • Docstrings on `crescendo()` / `goat()` / `redTeamCrescendo` / `redTeamGoat` now state the single-use contract explicitly — removes the old soft "attack plan might go stale" note that undersold the problem.
  • TS mirror: same shape via `runThreadId`.

Why Option A, not ContextVars

The issue offered two options:

  • A (this PR): document and enforce single-use. Runtime guard + docs.
  • B: isolate per-run state via `ContextVar` / `AsyncLocalStorage`, letting users reuse agents transparently.

Option B hides the bug. If a user reuses an agent, they still see consistent behaviour inside the run, but inspecting the instance shows weird interleaved state — and the `AsyncLocalStorage` parity story between Python contextvars and Node's async context is its own project (TS `AsyncLocalStorage` doesn't propagate reliably across some queue boundaries). The public API is cheap: `RedTeamAgent.goat(...)` is one line. There's no ergonomics reason to allow reuse. Catching misuse loudly is the right default.

Test plan

  • Python: 184 tests pass (180 pre-existing + 4 new). New tests:
    • `test_first_run_records_thread_id` — records on turn 1
    • `test_same_thread_multi_turn_is_ok` — legitimate multi-turn doesn't trip
    • `test_cross_run_reuse_raises_on_turn_1` — the core guard
    • `test_mid_run_thread_change_raises` — defensive turn-2 path
  • TypeScript: 345 tests pass (343 + 2 new mirroring the core guard cases)
  • Confirmed getattr fallback keeps all 30+ existing tests that use `MagicMock(spec=AgentInput)` passing unchanged
  • Manual: instantiate one `RedTeamAgent.goat(...)`, pass it to two sequential `scenario.run()` calls, confirm the second raises with a clear message (requires API keys)

Related

🤖 Generated with Claude Code

RedTeamAgent keeps mutable per-run state on the instance (attacker history,
turn scores, backtrack counter, backtrack history, parse failure count).
Reusing the same agent across scenario.run() calls — serial or parallel —
silently interleaves that state, corrupting both runs.

This lands Option A from #329: document the contract, enforce it at runtime.

- Track self._run_thread_id / this.runThreadId (first call's thread_id).
- On turn 1, if a different thread_id was previously recorded, raise a
  clear RuntimeError directing the user to instantiate a fresh agent.
- On turn > 1 with a changed thread_id, raise too (defensive — shouldn't
  happen under the normal orchestrator but catches manual-call misuse).
- Use getattr fallback so existing MagicMock(spec=AgentInput) test
  fixtures that don't set thread_id don't trip the guard.
- Update crescendo()/goat() / redTeamCrescendo/redTeamGoat docstrings
  to state the single-use contract explicitly.

Option B (ContextVar / AsyncLocalStorage isolation) would let users reuse
agents transparently but hides the bug — the correct shape is to enforce
the single-use contract loudly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Aryansharma28 Aryansharma28 force-pushed the feat/red-team-reuse-guard branch from 3e6bcaf to 0ec646f Compare April 15, 2026 16:02
@github-actions
Copy link
Copy Markdown
Contributor

Automated low-risk assessment

This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk.

The PR changes production runtime behavior by enforcing a single-use contract on RedTeamAgent (adding thread_id tracking and raising errors on reuse or mid-run thread changes) in both TypeScript and Python. This is a behavioral/API change that can break callers and is not limited to docs/tests/config, so it does not meet the low-risk criteria.

This PR requires a manual review before merging.

@Aryansharma28 Aryansharma28 merged commit a57b45e into feat/red-team-dynamic-techniques Apr 15, 2026
3 checks passed
@Aryansharma28 Aryansharma28 deleted the feat/red-team-reuse-guard branch April 15, 2026 16:02
Aryansharma28 added a commit that referenced this pull request Apr 29, 2026
…gent (#346)

* feat: add RedTeamAgent.goat() strategy with dynamic technique selection

Add GOAT (Generative Offensive Agent Tester) as a separate strategy
alongside Crescendo. Based on Meta's GOAT paper (ICML 2025, 97% ASR).

- GoatStrategy with 7-technique catalogue (hypothetical framing, persona
  modification, refusal suppression, response priming, dual response,
  topic splitting, authority & social engineering)
- Soft progress stages (early/mid/late) instead of fixed phases
- Dedicated GOAT metaprompt template for adaptive attack planning
- Python: RedTeamAgent.goat(target=..., model=...)
- TypeScript: scenario.redTeamGoat({ target, model })
- Crescendo (.crescendo()) is completely untouched

Closes #2143

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: default marathon_script turns to total_turns and decouple GOAT metaprompt from Crescendo phases

marathon_script(turns=...) was required with no default, causing TypeError
in all test calls that omit it. Now defaults to self.total_turns.

Also makes _generate_attack_plan strategy-aware: only computes Crescendo
phase boundaries when using CrescendoStrategy, removing unnecessary
coupling for the GOAT strategy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address PR review — GoatStrategy export, JS factory defaults, and test coverage

Python:
- Export GoatStrategy from scenario.__init__ (was missing, CrescendoStrategy was exported but not GoatStrategy)
- Add 25 unit tests for GoatStrategy: stage boundaries, prompt building, factory method defaults
- Fix technique 6 example message (was placeholder "...")

JavaScript:
- Fix redTeamGoat() always sets GOAT_METAPROMPT_TEMPLATE (was conditionally falling back to Crescendo template when attackPlan supplied)
- Add totalTurns: 30 default to redTeamGoat() to match Python
- Add metapromptTemplate to CrescendoConfig so users can override via both factory APIs
- Fix renderMetapromptTemplate: phase boundary vars only injected for Crescendo (via optional phaseEnds param)
- generateAttackPlan passes phaseEnds only when strategy instanceof CrescendoStrategy
- marathonScript turns param is now optional, defaults to this.totalTurns (matches Python fix)
- Make GoatStrategy.getStage() private (matches Python _get_stage())
- Fix float notation 0.3/0.7 → 0.30/0.70 to match Python

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: update renderMetapromptTemplate tests to use explicit phaseEnds param

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: remove instanceof coupling, unify config types, add JS GOAT tests

Architecture:
- Add template_variables() to RedTeamStrategy base (Python) and phaseEnds?() to
  RedTeamStrategy interface (JS) — strategies declare their own template vars
- CrescendoStrategy overrides to return phase boundary turn numbers; GoatStrategy
  returns nothing. Removes isinstance(CrescendoStrategy) check from orchestrator
- Remove _PHASES import from red_team_agent.py (no longer needed)

JS:
- CrescendoConfig = Omit<RedTeamAgentConfig, "strategy"> — eliminates 13-field
  duplication across three interfaces
- GoatConfig gets doc comment explaining it is a named hook for future GOAT params
- CrescendoStrategy.getPhase() made private; tests updated to use getPhaseName()
- Add 24 JS unit tests for GoatStrategy (stage boundaries, buildSystemPrompt,
  phaseEnds, redTeamGoat factory defaults)
- Remove vestigial vi.doMock("ai") that never intercepted calls (module pre-loaded)
- phaseEnds test uses literal [2, 4, 7] instead of re-deriving the formula

Python:
- metaprompt_template falsy check fixed: `or` → `is not None` (matches JS `??`)
- Add "Should not be reached" comment to GoatStrategy fallback (matches Crescendo)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: remove phaseEnds test on GoatStrategy — absence is enforced by types

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: docstrings, error messages, exports, and test coverage gaps

- Fix GoatStrategy docstring: remove benchmark-specific "in 5 turns" claim
- Fix get_phase_name base class docstring: strategy-agnostic return value wording
- Wrap .format() in _generate_attack_plan with helpful ValueError on KeyError
  (e.g. user passes Crescendo template to GOAT agent — was silent crash)
- Export GOAT_METAPROMPT_TEMPLATE from Python scenario.__init__ and JS index.ts
  so users can inspect/extend without importing from internal paths
- Update GoatConfig JSDoc to document inherited options and totalTurns=30 default
- Add Python test: goat() allows overriding metaprompt_template via kwargs
- Add JS test: renderMetapromptTemplate leaves phase placeholders as literals
  when phaseEnds is omitted (documents silent passthrough behavior)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(red-team): close GOAT default-template foot-gun, document agent reuse

The `goat()`/`redTeamGoat` factories used `setdefault` / object spread
patterns that left an explicit `metaprompt_template=None` (Python) or
`metapromptTemplate: undefined` (TypeScript) in place. The constructor
then fell back to the Crescendo `_DEFAULT_METAPROMPT_TEMPLATE`, which
contains `{phase1_end}` placeholders that GoatStrategy.template_variables()
does not provide — first attack-plan render dies with KeyError.

Force the GOAT default whenever the caller's value is None/undefined.

Also document the silent-stale-plan failure mode: `_attack_plan` is
cached on the instance and survives across `scenario.run()` calls. Reusing
the same agent across scenarios with different descriptions silently uses
the first run's plan. Added `.. note::` blocks to both `goat()` and
`crescendo()` Python docstrings and `@remarks` to the TS factories.

Added a warning to `goat()` about combining `injection_probability` with
GOAT — the GOAT metaprompt already steers the attacker toward encoding
techniques, and post-hoc encoding desyncs H_attacker from what the target
saw. Default 0.0 is the safe path.

Verified: 156 Python + 108 JS tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor(red-team): align GOAT with Meta's paper — drop pre-generated plan and stage hints (#340)

* refactor(red-team): align GOAT with Meta's paper — drop pre-generated plan + stage hints

Meta's GOAT paper (ICML 2025) does not pre-generate an attack plan via a
metaprompt LLM call, and the attacker's system prompt has no early/mid/late
stage hints. Adaptation is driven entirely by the per-turn score/hint
feedback that lives in the attacker's private conversation history
(H_attacker).

Changes:
  - Add `needs_metaprompt_plan` (Python) / `needsMetapromptPlan` (TS)
    property on the strategy interface, default True. GoatStrategy overrides
    to False; CrescendoStrategy keeps True.
  - Orchestrator (`call()`) consults the flag and skips `_generateAttackPlan`
    when False, saving one LLM call on turn 1 and eliminating the
    description-keyed stale-plan bug entirely for GOAT.
  - Remove `_STAGES` / `STAGES` array and `_get_stage` / `getStage` methods
    from GoatStrategy. `build_system_prompt` no longer renders a "Stage:" line
    or an ATTACK PLAN section.
  - Keep `get_phase_name` returning a coarse progress bucket
    (`early`/`mid`/`late`) for telemetry dashboards only — this label is no
    longer surfaced to the attacker.
  - Drop `GOAT_METAPROMPT_TEMPLATE` constant and its public export (Python
    `scenario.__init__.py` + JS `index.ts`). GOAT never renders a template.
  - Simplify `goat()` / `redTeamGoat` factories: no more `setdefault` /
    object-spread template injection dance (and no accompanying foot-gun).
  - Update tests: rewrite GoatStrategy stage tests as progress-bucket tests;
    add explicit assertions that GOAT prompts contain no ATTACK PLAN section
    and no stage hints; drop obsolete template-rendering tests.

Net effect: GOAT behaviour moves from ~60% to ~85% faithful to the paper.
The remaining gap is structured output (observation/strategy/reply JSON),
tracked in #2142 / #330.

Tests: 152 Python + 103 JS passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(red-team): structured attacker output — observation/strategy/reply JSON (#341)

* feat(red-team): structured attacker output — observation / strategy / reply JSON

Meta's GOAT paper (ICML 2025) has the attacker emit a structured chain of
thought at every turn: observation of the target's last response, strategy
(which technique it will use and why), then the actual reply. We were
letting the attacker emit free text — losing the reasoning signal, making
per-turn technique selection invisible to telemetry, and leaving no gate
against the attacker skipping the "Thought" step entirely.

This commit implements the paper's contract:

- New `JSON_OUTPUT_CONTRACT` module constant (Python `_red_team/base.py`,
  TypeScript `red-team-strategy.ts`) appended to every attacker system
  prompt. Instructs the attacker to emit
      {"observation": "...", "strategy": "...", "reply": "..."}
  and nothing else.

- Added to both `CrescendoStrategy` and `GoatStrategy` system prompts so
  both strategies benefit.

- `RedTeamAgent._parse_attacker_output` (Python) / `parseAttackerOutput`
  (TS) parses the attacker's raw output into `(reply, observation, strategy)`.
  Strips markdown fences, handles malformed JSON, coerces non-string
  fields. Fallback on parse failure: the whole raw response becomes the
  `reply` and a WARN-level log fires — the scenario keeps running.

- Only `reply` reaches the target. `observation` and `strategy` are emitted
  as OpenTelemetry span attributes (`red_team.reasoning.observation`,
  `red_team.reasoning.strategy`, `red_team.reasoning.parse_failed`) so
  dashboards can answer "which technique works against which target?" —
  the paper's core selling point.

- The raw JSON output is kept in H_attacker so the attacker sees its own
  format on subsequent turns (keeps the output shape consistent with the
  system prompt's directive).

Paper fidelity: moves GOAT from ~85% to ~95% faithful.

Tests: 163 Python (+11) and 115 JS (+12) passing.
  - Parser: well-formed JSON, code-fence stripping (```json and ```),
    malformed JSON fallback, missing/empty reply fallback, non-object JSON
    fallback, non-string field coercion, whitespace trimming.
  - Contract: OUTPUT FORMAT section present in both GOAT and Crescendo
    system prompts.
  - Existing tests (which mock attacker with plain strings) continue to
    pass via the graceful fallback path.

Closes langwatch/scenario#2142 (structured attacker output).
Closes #330 (GOAT technique telemetry) once consumers
wire the span attributes to dashboards.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor(red-team): scope JSON output contract to GOAT only

Crescendo does not emit structured output in Microsoft's Crescendo paper —
applying the JSON contract to both strategies was scope creep for a
GOAT-focused PR stack. Roll it back on Crescendo behind a new per-strategy
flag so the parser only runs when the attacker is actually instructed to
emit JSON.

Changes:
  - Add `emits_structured_output` (Python) / `emitsStructuredOutput` (TS)
    property on RedTeamStrategy. Default False/falsy. GoatStrategy overrides
    to True.
  - Remove `JSON_OUTPUT_CONTRACT` import and interpolation from
    CrescendoStrategy in both Python and TS.
  - Gate the parser in `call()`: run it only when the strategy's
    `emits_structured_output` flag is set. Otherwise use raw attacker
    output as the reply with no parsing and no telemetry spam.
  - Update tests: GoatStrategy asserts OUTPUT FORMAT present + flag true;
    CrescendoStrategy asserts OUTPUT FORMAT absent + flag falsy.

Tests: 164 Python (+1) / 116 JS (+1) passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(red-team): type-check Crescendo's optional emitsStructuredOutput via interface

CI typecheck in vitest-examples failed on:
  expect(new CrescendoStrategy().emitsStructuredOutput).toBeUndefined();
because `emitsStructuredOutput` is declared on the RedTeamStrategy interface
as optional but never added to the Crescendo class — so direct access on the
concrete type is a TS2339 error under strict mode.

Access via the interface type so the optional property is visible.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(red-team): scale backtrack budget with total_turns, add run telemetry (#347)

- max_backtracks: new optional param, auto-scales max(1, total_turns // 3)
  so 5-turn runs don't over-provision and 100-turn runs aren't starved
  against hardened targets (closes #331)
- red_team.progress span attr: current_turn/total_turns, enables timeline
  filters in dashboards without deriving from turn/total_turns pairs
- red_team.parse_failure_count span attr: cumulative count of malformed
  structured-output responses per run, surfaces attacker output-format
  reliability per provider/model for GOAT runs
- TS parity for max_backtracks; telemetry is Python-only for now (TS
  red-team has no OTel instrumentation yet)

#336 (success_score default) deferred — issue explicitly requires a data
sweep before changing; will be a separate follow-up PR.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(red-team): techniques as data — typed catalogue + chosen_technique_ids telemetry (#348)

Replace the inline TECHNIQUE_CATALOGUE string literal with a typed list of
Technique records. Attackers still see the byte-identical rendered prompt,
but the catalogue is now first-class data that downstream code can query,
extend, and serialize.

- New Technique dataclass/interface (Py + TS) with id/name/description/example
- DEFAULT_GOAT_TECHNIQUES exports the paper's 7 techniques
- render_catalogue(techniques) / renderCatalogue(techniques) produces the
  attacker-facing prompt, locked to byte-parity with the previous string
- extract_chosen_ids parses the attacker's `strategy` field case-insensitively
  against both ID (HYPOTHETICAL_FRAMING) and name (HYPOTHETICAL FRAMING) forms
- GoatStrategy(techniques=...) accepts a user override — closes the research
  ask "what if I add a new technique?" — with duplicate-ID validation
- New span attr: red_team.chosen_technique_ids: list[str] per GOAT turn,
  groupable in dashboards to answer "which techniques work on which targets?"
  (Python only — TS red-team has no OTel instrumentation yet)
- RedTeamStrategy.chosen_technique_ids() default returns [] so non-catalogue
  strategies contribute nothing (Crescendo unaffected)

Closes #330. Partial toward #335 (Py↔TS parity — the two lists still exist
separately; shared JSON fixture is a follow-up).

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor(red-team): strategies own their output parsing (#349)

Move `parse_attacker_output` from a RedTeamAgent static method into the
strategy interface. Remove the `emits_structured_output` feature flag — the
orchestrator always calls `strategy.parse_attacker_output(raw)`, and strategies
themselves decide how to interpret the attacker's output.

- New `AttackerOutput` dataclass / interface (reply, observation, strategy,
  parse_failed). Exported from the package root.
- Base `RedTeamStrategy` provides a default `parse_attacker_output` that
  returns `AttackerOutput(reply=raw)` — the right shape for strategies
  without a JSON contract. GoatStrategy overrides with JSON parsing.
- Remove the `if self._strategy.emits_structured_output:` branch from
  `RedTeamAgent.call()`. Single code path, telemetry always emitted.
- Delete `RedTeamAgent._parse_attacker_output` static (Py) / the module-level
  `parseAttackerOutput` export (TS) — parser is on the strategy now.
- `parse_failed` is set by the strategy itself, not inferred downstream by
  inspecting obs/strategy emptiness — cleaner and lets custom strategies
  signal failure without tripping the heuristic.

New strategies can now add custom output schemas without orchestrator
changes — adding a technique interface extension on top of the flag-based
design would have required the exact branch we just removed.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(red-team): zero-friction report dashboard — auto-save + CLI + docs (#351)

feat(red-team): zero-friction report dashboard — auto-save + `scenario redteam-report` CLI

Problem: the existing `save_redteam_report()` helper required users to
import it and call it explicitly after every `scenario.run()`, and
launching the Streamlit dashboard required memorizing the app path and
batch directory. The report module was untracked — no one discovered it.

This change wires auto-save into the pytest plugin and adds a CLI that
auto-discovers the latest batch.

## The three-command user flow

    pip install 'langwatch-scenario[report]'
    pytest path/to/redteam_tests.py              # reports save automatically
    scenario redteam-report                      # opens dashboard

No imports in test code, no explicit save call, no path arguments.

## What landed

**M1 — Auto-save via pytest plugin**
  - `pytest_plugin.py::_auto_save_redteam_report` hooks
    `auto_reporting_run` to detect `isinstance(agent, RedTeamAgent)` in the
    agents list and call `save_redteam_report`. Swallows errors with a
    warning so reporting failures never break tests.
  - Env vars: `SCENARIO_REDTEAM_REPORT=0` disables, `..._REPORT_DIR=path`
    overrides batch root.

**M2 — CLI `scenario redteam-report`**
  - New `scenario/cli.py` with `_find_batch_dir()` (lexicographic sort of
    timestamped dirs → latest first).
  - `setup.py` registers `console_scripts: scenario = scenario.cli:main`.
  - `pyproject.toml` adds `[project.optional-dependencies] report =
    [streamlit, plotly, pandas]`.
  - Flags: `--latest N`, `--batch <ts>`, `--dir <path>`, `--port`,
    `--no-browser`.

**M3 — TypeScript auto-save parity**
  - New `javascript/src/red-team-report.ts` with `isRedTeamAgent()` +
    `saveRedTeamReport()`, mirroring Python's JSON shape.
  - `runner/run.ts` calls it after `execution.execute()` when a
    RedTeamAgent is found. Same env vars.
  - TS version skips the LLM-based severity/suggestion analysis at save
    time (placeholder fields, `analysis_pending: true`) to keep tests
    fast; dashboard's on-demand aggregator computes them.
  - Exports `saveRedTeamReport` from the package root for users running
    scenarios outside the default runner.

**M4 — Docs page under Red Teaming**
  - `docs/docs/pages/advanced/red-teaming/report.mdx` — quick start, how
    it works (with end-to-end data-flow diagram), dashboard viewing,
    JSON shape, config env vars, CI / CD snippet, advanced manual save,
    headless / SSH running, troubleshooting (7 common failure modes).
  - `docs/vocs.config.tsx` — nav entry "Reports Dashboard" under the
    existing Red Teaming section.

**Report module (previously untracked)**
  - `python/scenario/report/` — moved into tracking. Contains
    `_save.save_redteam_report`, `_aggregate.aggregate_fixes` (the
    on-demand cross-batch LLM aggregator), and `app.py` (the 976-line
    Streamlit dashboard).

## Verified end-to-end

Ran a 3-turn GOAT scenario against a mock always-refuse agent:
  - Test passed in 71s.
  - Report auto-saved to `./redteam-reports/20260415_153324/<ts>_<slug>_goat.json`.
  - Payload contains all analysis fields (severity=high, break_severity=none,
    4 suggestions, failure_summary populated).
  - CLI `_find_batch_dir` resolves --latest 1/2/3 correctly.

## Design notes (see docs for details)

  - `isinstance` detection over pytest markers: zero-friction, nothing
    for users to remember.
  - Timestamp-named batch directories: lexicographic sort = chronological
    sort; no metadata file needed.
  - LLM analysis synchronous on save (Python only): dashboards open
    instantly; no waiting for 20 LLM calls when you click.
  - Same JSON shape across Python and TS: one Streamlit app serves both.
  - Save errors swallowed with warning: reporting is observability, not
    correctness.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(red-team): three edge-case hardening fixes (#337, #333a, #333b) (#354)

Three small correctness fixes on red-team edge cases, batched so each
gets its own test but reviewers only open one PR. All three convert
"fails silent" bugs into "fails loud" signals.

## #337 — Regression test for the KeyError helper

`_generate_attack_plan` already wraps `template.format()` in a try/except
that turns a raw KeyError into a friendly ValueError pointing at the
common cause (Crescendo ↔ GOAT template mismatch). There was no test —
a future refactor could silently lose the helpful message.

Adds `TestMetapromptTemplateKeyErrorHelper`: construct an agent with a
bogus placeholder, assert the ValueError fires and its message mentions
the unknown key, the available keys, and the Crescendo/GOAT hint.

## #333a — Validate empty techniques list

Before: `RedTeamAgent.crescendo(injection_probability=0.5, techniques=[])`
silently skipped injection on every turn (falsy empty list short-circuited
the `and self._techniques` guard in `call()`), so users thought injection
was active but got no encoded attacks.

After: raise ValueError at construction with a clear message pointing at
both workarounds ("disable injection, or provide at least one technique").
`techniques=None` still falls back to DEFAULT_TECHNIQUES — unchanged.

## #333b — Empty/whitespace plan must fail loud

Strategies that use a plan (Crescendo, `needs_metaprompt_plan=True`)
REQUIRE a non-empty plan by design. Previously `_generate_attack_plan`
only rejected None content — an empty string or all-whitespace response
from the metaprompt LLM would be silently stored as the plan, then
rendered as a labeled-but-blank "ATTACK PLAN:" section. The attacker LLM
reads that as "your plan is nothing" and degrades to generic attacks.
The user gets no signal the metaprompt call silently failed.

Fix `_generate_attack_plan` to raise RuntimeError when the returned
content is None, empty, or whitespace-only. The check moves from
`build_system_prompt` (wrong layer — prettifies malformed state) to the
generation step (right layer — prevents malformed state). An earlier
version of this PR instead omitted the ATTACK PLAN section in the
system prompt when empty; that hid the upstream bug.

Adds `TestEmptyMetapromptPlanRaises` with three cases (empty, whitespace,
real plan) using monkeypatched `litellm.acompletion`.

## Tests

187 Python red-team tests pass (+ 7 new).

Closes #337
Addresses parts a + b of #333 (part c — scorer-failure
cascade WARN threshold — remains open for a follow-up)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(red-team): guard against cross-run RedTeamAgent reuse (#329) (#353)

RedTeamAgent keeps mutable per-run state on the instance (attacker history,
turn scores, backtrack counter, backtrack history, parse failure count).
Reusing the same agent across scenario.run() calls — serial or parallel —
silently interleaves that state, corrupting both runs.

This lands Option A from #329: document the contract, enforce it at runtime.

- Track self._run_thread_id / this.runThreadId (first call's thread_id).
- On turn 1, if a different thread_id was previously recorded, raise a
  clear RuntimeError directing the user to instantiate a fresh agent.
- On turn > 1 with a changed thread_id, raise too (defensive — shouldn't
  happen under the normal orchestrator but catches manual-call misuse).
- Use getattr fallback so existing MagicMock(spec=AgentInput) test
  fixtures that don't set thread_id don't trip the guard.
- Update crescendo()/goat() / redTeamCrescendo/redTeamGoat docstrings
  to state the single-use contract explicitly.

Option B (ContextVar / AsyncLocalStorage isolation) would let users reuse
agents transparently but hides the bug — the correct shape is to enforce
the single-use contract loudly.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(red-team): annotate H_attacker on post-hoc injection (closes #326, #334) (#365)

fix(red-team): annotate H_attacker when post-hoc injection fires (#326, #334)

When injection_probability fires, the target received the encoded form but
the attacker's private history (H_attacker) only recorded the plaintext.
The attacker LLM then reasoned against a conversation that didn't match
what the target actually saw — score/hint feedback was computed on a turn
that, from the attacker's point of view, didn't happen.

Append a `[INJECTED <technique>]` system marker to H_attacker on every
injected turn so the attacker's next-turn reasoning stays aligned. Same
fix applies to Crescendo and GOAT because the injection block runs before
the strategy branches.

Also adds a defensive heuristic (`_looks_already_encoded` / `looksAlreadyEncoded`)
that skips injection when the reply is already a long Base64-charset string,
preventing double-encoding if a user extends the GOAT catalogue with
encoding-style techniques.

Docstring / JSDoc for goat() / redTeamGoat replaces the "not recommended"
warning with a note describing the new behaviour.

Closes #326
Closes #334

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Potential fix for pull request finding 'Empty except'

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

* Potential fix for pull request finding 'Unused local variable'

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

* fix: update injection test for [INJECTED] marker and drop unused Any import

- test_injection_keeps_original_in_attacker_history now accounts for the
  [INJECTED <technique>] system marker appended by PR #365 — assistant
  turn is at [-2], marker at [-1].
- Remove unused Any import from report/_aggregate.py (CodeQL finding).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(tests): cast result dict to satisfy pyright in injection tests

Pyright couldn't narrow the AgentReturnTypes union to dict when
indexing result["content"] — assert isinstance + cast keeps the
test assertions identical while making the types explicit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(red-team): add GOAT Strategy page under Red Teaming (#367)

The GOAT strategy shipped in #346 but was invisible to users — zero
mentions in the docs site. Adds a dedicated page covering:

- When to choose GOAT over Crescendo (comparison table)
- Python + TS quick starts
- How the per-turn attacker loop works + JSON contract
- The 7-technique default catalogue and how to override it
- Full config reference (including injection_probability behaviour after
  #365's H_attacker marker fix)
- OpenTelemetry span attributes for observability
- Paper-fidelity notes + known limitations (short runs, non-English)

Also links to the new page from the Red Teaming overview and wires it
into the sidebar nav in vocs.config.tsx between Overview and Reports.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(red-team): GOAT DX polish — phase_kind, metaprompt warn, techniques split

Three senior-review DX items merged into the GOAT strategy PR:

1. phase_kind on RedTeamStrategy (Py) / phaseKind on RedTeamStrategy (JS)
   - Crescendo returns "staged" (default), GOAT returns "progress"
   - Orchestrator now emits red_team.phase for staged strategies and
     red_team.progress_bucket for progress strategies, so dashboards
     don't conflate GOAT's coarse early/mid/late label with Crescendo's
     semantic warmup/probing/escalation/direct phases.

2. Warn on metaprompt_template passed to a strategy that ignores it
   (needs_metaprompt_plan=False). Previously the value was silently
   stored and never rendered — users couldn't tell their custom plan
   was being dropped. Now fires a UserWarning (Py) / console.warn (JS)
   at construction time.

3. Split .goat(techniques=...) into goat_techniques= (semantic catalogue
   the attacker LLM picks from each turn) and encoding_techniques=
   (Base64/ROT13/... encoders driven by injection_probability). The old
   techniques= kwarg keeps working as a deprecated alias for
   encoding_techniques= with a DeprecationWarning. Passing both raises
   TypeError. Same shape in TypeScript via goatTechniques /
   encodingTechniques on GoatConfig.

Tests: 207 Python + 138 JS, all green. Adds TestPhaseKind,
TestMetapromptTemplateIgnoredWarning, TestGoatTechniquesKwargSplit and
the JS equivalents under red-team.test.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(tests): satisfy strategy ABC signature in backward-compat test

Addresses github-code-quality bot comment on the OldCustomStrategy
override — declare the full build_system_prompt signature so the
override matches the abstract base. Behavior unchanged; the test only
probes phase_kind default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(tests): satisfy pyright + tsc on DX polish tests

- pyright: narrow `agent._strategy` to GoatStrategy before accessing
  `.techniques` in the goat/encoding independence test
- tsc: CrescendoStrategy doesn't declare `phaseKind` (optional on the
  interface); access via a typed view in the default-behaviour test

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant