feat: add GOAT strategy with dynamic technique selection for RedTeamAgent #346

Open
Aryansharma28 wants to merge 9 commits into main from feat/red-team-dynamic-techniques
@Aryansharma28
Contributor

Re-land of #306 — the original PR was squash-merged prematurely and subsequently reverted via #345. Opening this to restore the GOAT strategy on main through the intended stacked-PR workflow.

Same branch (feat/red-team-dynamic-techniques) and same commits as #306.

Summary

Adds RedTeamAgent.goat() (Python) / redTeamGoat() (TS), implementing Meta's GOAT methodology (ICML 2025) for dynamic per-turn technique selection. 7-technique catalogue, 3 soft progress stages, paper-sourced adaptive attacks. See #306 for the full description.

Stack

This PR is the base of a 3-PR stack:

  1. This PR — base GOAT strategy
  2. #340, refactor(red-team): align GOAT with Meta's paper — drop pre-generated plan and stage hints. Paper-fidelity refactor; stacks on this PR.
  3. #341, feat(red-team): structured attacker output — observation/strategy/reply JSON. Stacks on #340.

Merge bottom-up without --delete-branch until the top of the stack lands, then clean up branches.

Test plan

See #306 test plan — unchanged. 108 JS tests, 156 Python tests passing on the branch.

Context

🤖 Generated with Claude Code

Aryansharma28 and others added 9 commits March 23, 2026 16:29
Add GOAT (Generative Offensive Agent Tester) as a separate strategy
alongside Crescendo. Based on Meta's GOAT paper (ICML 2025, 97% ASR).

- GoatStrategy with 7-technique catalogue (hypothetical framing, persona
  modification, refusal suppression, response priming, dual response,
  topic splitting, authority & social engineering)
- Soft progress stages (early/mid/late) instead of fixed phases
- Dedicated GOAT metaprompt template for adaptive attack planning
- Python: RedTeamAgent.goat(target=..., model=...)
- TypeScript: scenario.redTeamGoat({ target, model })
- Crescendo (.crescendo()) is completely untouched

Closes #2143

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
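
The soft progress stages can be pictured with a small sketch. This is not the project's GoatStrategy code: the function name `get_stage` is hypothetical, the 0.30/0.70 cut-offs are assumed from the float values mentioned in a later commit, and whether each boundary is inclusive is a guess.

```python
def get_stage(current_turn: int, total_turns: int) -> str:
    """Map a 1-based turn number to a soft progress stage.

    Assumption: the stage depends only on the fraction of turns used,
    with cut-offs at 0.30 and 0.70 (inclusive on the lower stage).
    """
    if total_turns <= 0:
        raise ValueError("total_turns must be positive")
    progress = current_turn / total_turns
    if progress <= 0.30:
        return "early"
    if progress <= 0.70:
        return "mid"
    return "late"
```

Unlike fixed phases, a stage computed this way is only a hint about how far the conversation has progressed; the attacker can still pick any technique from the catalogue on any turn.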
…etaprompt from Crescendo phases

marathon_script(turns=...) was required with no default, causing TypeError
in all test calls that omit it. Now defaults to self.total_turns.

Also makes _generate_attack_plan strategy-aware: only computes Crescendo
phase boundaries when using CrescendoStrategy, removing unnecessary
coupling for the GOAT strategy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…d test coverage

Python:
- Export GoatStrategy from scenario.__init__ (was missing: CrescendoStrategy was exported but GoatStrategy was not)
- Add 25 unit tests for GoatStrategy: stage boundaries, prompt building, factory method defaults
- Fix technique 6 example message (was placeholder "...")

JavaScript:
- Fix: redTeamGoat() now always sets GOAT_METAPROMPT_TEMPLATE (it previously fell back to the Crescendo template when attackPlan was supplied)
- Add totalTurns: 30 default to redTeamGoat() to match Python
- Add metapromptTemplate to CrescendoConfig so users can override via both factory APIs
- Fix renderMetapromptTemplate: phase boundary vars only injected for Crescendo (via optional phaseEnds param)
- generateAttackPlan passes phaseEnds only when strategy instanceof CrescendoStrategy
- marathonScript turns param is now optional, defaults to this.totalTurns (matches Python fix)
- Make GoatStrategy.getStage() private (matches Python _get_stage())
- Fix float notation 0.3/0.7 → 0.30/0.70 to match Python

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…param

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… tests

Architecture:
- Add template_variables() to RedTeamStrategy base (Python) and phaseEnds?() to
  RedTeamStrategy interface (JS) — strategies declare their own template vars
- CrescendoStrategy overrides to return phase boundary turn numbers; GoatStrategy
  returns nothing. Removes isinstance(CrescendoStrategy) check from orchestrator
- Remove _PHASES import from red_team_agent.py (no longer needed)
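
A minimal sketch of this decoupling, assuming each strategy returns a dict of placeholder values. The class bodies, the `total_turns` argument, the `render_metaprompt` helper, and the percentage cut-offs are all illustrative assumptions, not the project's real code or formula.

```python
class RedTeamStrategy:
    """Base strategy: contributes no extra template placeholders by default."""

    def template_variables(self, total_turns: int) -> dict:
        return {}


class CrescendoStrategy(RedTeamStrategy):
    # Illustrative cut-offs as integer percentages; not the project's real formula.
    _PHASE_END_PERCENT = (20, 40, 70)

    def template_variables(self, total_turns: int) -> dict:
        ends = [max(1, total_turns * pct // 100) for pct in self._PHASE_END_PERCENT]
        return {f"phase{i}_end": end for i, end in enumerate(ends, start=1)}


class GoatStrategy(RedTeamStrategy):
    """Inherits the empty default: GOAT has no fixed phases."""


def render_metaprompt(template: str, strategy: RedTeamStrategy, total_turns: int) -> str:
    # The caller asks the strategy for its variables instead of checking its type.
    variables = {"total_turns": total_turns, **strategy.template_variables(total_turns)}
    return template.format(**variables)
```

With this shape, adding a third strategy does not require touching the orchestrator: it only needs to override `template_variables()` if its template uses extra placeholders.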

JS:
- CrescendoConfig = Omit<RedTeamAgentConfig, "strategy"> — eliminates 13-field
  duplication across three interfaces
- GoatConfig gets doc comment explaining it is a named hook for future GOAT params
- CrescendoStrategy.getPhase() made private; tests updated to use getPhaseName()
- Add 24 JS unit tests for GoatStrategy (stage boundaries, buildSystemPrompt,
  phaseEnds, redTeamGoat factory defaults)
- Remove vestigial vi.doMock("ai") that never intercepted calls (module pre-loaded)
- phaseEnds test uses literal [2, 4, 7] instead of re-deriving the formula

Python:
- metaprompt_template falsy check fixed: `or` → `is not None` (matches JS `??`)
- Add "Should not be reached" comment to GoatStrategy fallback (matches Crescendo)
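
The difference between the two checks shows up for falsy but non-None values such as an empty string. A small sketch, with a hypothetical `DEFAULT_TEMPLATE` constant and helper names standing in for the real ones:

```python
DEFAULT_TEMPLATE = "default metaprompt"  # stand-in value, not the real template


def resolve_with_or(template):
    # `or` replaces every falsy value, including an intentionally empty string.
    return template or DEFAULT_TEMPLATE


def resolve_with_is_not_none(template):
    # Only None triggers the default, similar to JavaScript's `??` operator.
    return template if template is not None else DEFAULT_TEMPLATE
```

Here `resolve_with_or("")` silently swaps in the default, while `resolve_with_is_not_none("")` keeps the caller's empty string; both return the default for `None`.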

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ypes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix GoatStrategy docstring: remove benchmark-specific "in 5 turns" claim
- Fix get_phase_name base class docstring: strategy-agnostic return value wording
- Wrap .format() in _generate_attack_plan with a helpful ValueError on KeyError
  (e.g. user passes Crescendo template to GOAT agent; this was a silent crash)
- Export GOAT_METAPROMPT_TEMPLATE from Python scenario.__init__ and JS index.ts
  so users can inspect/extend without importing from internal paths
- Update GoatConfig JSDoc to document inherited options and totalTurns=30 default
- Add Python test: goat() allows overriding metaprompt_template via kwargs
- Add JS test: renderMetapromptTemplate leaves phase placeholders as literals
  when phaseEnds is omitted (documents silent passthrough behavior)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
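
The KeyError wrapping described in this commit could look roughly like the sketch below. The helper name `render_metaprompt` and the error wording are assumptions; only the idea of converting a `KeyError` from `str.format()` into a descriptive `ValueError` comes from the commit message.

```python
def render_metaprompt(template: str, variables: dict) -> str:
    """Render a metaprompt, turning a missing placeholder into a clear error."""
    try:
        return template.format(**variables)
    except KeyError as exc:
        missing = exc.args[0]
        raise ValueError(
            f"Metaprompt template references {{{missing}}}, but the active "
            f"strategy does not provide it. Available variables: "
            f"{sorted(variables)}"
        ) from exc
```

Chaining with `from exc` keeps the original traceback available while the message tells the user which placeholder was unexpected, for example when a Crescendo template is handed to a GOAT agent.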
Resolves conflicts between GOAT strategy work and main's red-team additions
(injection_probability, AttackTechnique catalogue, marathon_script signature
cleanup).

Resolutions:
- Take main's marathon_script signature (drops turns param, uses total_turns)
  in both Python and TypeScript; supersedes the turns-optional fix.
- Keep HEAD's template_variables() decoupling so GOAT and Crescendo each
  contribute their own metaprompt placeholders; drop now-unused _PHASES and
  _marathon_script imports.
- Combine public API exports across both branches: GoatStrategy,
  GOAT_METAPROMPT_TEMPLATE, AttackTechnique, DEFAULT_TECHNIQUES.
- Add injection_probability and techniques kwargs to RedTeamAgent.goat()
  for parity with .crescendo().
- Combine test suites: GOAT stage/factory tests alongside main's injection
  probability and marathon-judge integration tests.

Verified: 156 Python tests pass, 108 JS tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…euse

The `goat()`/`redTeamGoat` factories used `setdefault` / object spread
patterns that left an explicit `metaprompt_template=None` (Python) or
`metapromptTemplate: undefined` (TypeScript) in place. The constructor
then fell back to the Crescendo `_DEFAULT_METAPROMPT_TEMPLATE`, which
contains `{phase1_end}` placeholders that GoatStrategy.template_variables()
does not provide — first attack-plan render dies with KeyError.

Force the GOAT default whenever the caller's value is None/undefined.
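
The pitfall is easy to reproduce in isolation. In this sketch the constant and both helper functions are hypothetical stand-ins for the factory logic; the `dict.setdefault` behaviour itself is standard Python.

```python
GOAT_TEMPLATE = "GOAT metaprompt for {target}"  # stand-in for the real template


def build_kwargs_buggy(**kwargs):
    # setdefault only fills in a missing key; an explicit None is left in place.
    kwargs.setdefault("metaprompt_template", GOAT_TEMPLATE)
    return kwargs


def build_kwargs_fixed(**kwargs):
    # Treat "key absent" and "key is None" the same way.
    if kwargs.get("metaprompt_template") is None:
        kwargs["metaprompt_template"] = GOAT_TEMPLATE
    return kwargs
```

Calling `build_kwargs_buggy(metaprompt_template=None)` keeps the `None`, which is what would let a downstream constructor fall back to a different default; the fixed variant replaces it while still honouring a caller-supplied template.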

Also document the silent-stale-plan failure mode: `_attack_plan` is
cached on the instance and survives across `scenario.run()` calls. Reusing
the same agent across scenarios with different descriptions silently uses
the first run's plan. Added `.. note::` blocks to both `goat()` and
`crescendo()` Python docstrings and `@remarks` to the TS factories.
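
The stale-plan failure mode can be illustrated with a toy class. `CachingAgent` and `plan_for` are hypothetical names; the sketch only mirrors the behaviour described above, where a plan cached on the instance is reused on later runs.

```python
class CachingAgent:
    """Toy agent that caches its first attack plan on the instance."""

    def __init__(self):
        self._attack_plan = None

    def plan_for(self, description: str) -> str:
        # The plan is generated once and then reused, whatever the description.
        if self._attack_plan is None:
            self._attack_plan = f"plan for: {description}"
        return self._attack_plan


shared = CachingAgent()
first = shared.plan_for("banking assistant")
second = shared.plan_for("medical assistant")  # still the banking plan

fresh = CachingAgent().plan_for("medical assistant")  # a new agent gets a new plan
```

Under these assumptions, creating one agent per scenario is the simplest way to avoid reusing a plan written for a different description.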

Added a warning to `goat()` about combining `injection_probability` with
GOAT — the GOAT metaprompt already steers the attacker toward encoding
techniques, and post-hoc encoding desyncs H_attacker from what the target
saw. Default 0.0 is the safe path.

Verified: 156 Python + 108 JS tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

Automated low-risk assessment

This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk.

The PR adds a new GOAT red‑team strategy (JS + Python), a GOAT metaprompt template, a redTeamGoat factory, and changes metaprompt rendering and orchestration logic that affect what is sent to LLMs. Because it modifies runtime attack logic and prompt/template handling (i.e., behavior of an integration with language models) rather than only docs/tests/UI, it does not meet the low‑risk criteria.

This PR requires a manual review before merging.
