feat: add GOAT strategy with dynamic technique selection for RedTeamAgent #346
Open
Aryansharma28 wants to merge 9 commits into main
Add GOAT (Generative Offensive Agent Tester) as a separate strategy
alongside Crescendo. Based on Meta's GOAT paper (ICML 2025, 97% ASR).
- GoatStrategy with 7-technique catalogue (hypothetical framing, persona
modification, refusal suppression, response priming, dual response,
topic splitting, authority & social engineering)
- Soft progress stages (early/mid/late) instead of fixed phases
- Dedicated GOAT metaprompt template for adaptive attack planning
- Python: RedTeamAgent.goat(target=..., model=...)
- TypeScript: scenario.redTeamGoat({ target, model })
- Crescendo (.crescendo()) is completely untouched
Closes #2143
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…etaprompt from Crescendo phases
marathon_script(turns=...) was required with no default, causing TypeError in all test calls that omit it. Now defaults to self.total_turns.
Also makes _generate_attack_plan strategy-aware: only computes Crescendo phase boundaries when using CrescendoStrategy, removing unnecessary coupling for the GOAT strategy.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…d test coverage
Python:
- Export GoatStrategy from scenario.__init__ (was missing; CrescendoStrategy was exported but not GoatStrategy)
- Add 25 unit tests for GoatStrategy: stage boundaries, prompt building, factory method defaults
- Fix technique 6 example message (was placeholder "...")
JavaScript:
- Fix: redTeamGoat() always sets GOAT_METAPROMPT_TEMPLATE (was conditionally falling back to the Crescendo template when attackPlan was supplied)
- Add totalTurns: 30 default to redTeamGoat() to match Python
- Add metapromptTemplate to CrescendoConfig so users can override via both factory APIs
- Fix renderMetapromptTemplate: phase boundary vars only injected for Crescendo (via optional phaseEnds param)
- generateAttackPlan passes phaseEnds only when strategy instanceof CrescendoStrategy
- marathonScript turns param is now optional, defaults to this.totalTurns (matches Python fix)
- Make GoatStrategy.getStage() private (matches Python _get_stage())
- Fix float notation 0.3/0.7 → 0.30/0.70 to match Python
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
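The soft progress stages and the 0.30/0.70 values mentioned in this commit can be pictured roughly as follows. This is only a sketch: `get_stage` is a hypothetical standalone helper, and the assumption that the two values are cutoffs on `turn / total_turns` is a guess, not something the commit text confirms about `GoatStrategy._get_stage()`.

```python
def get_stage(turn: int, total_turns: int) -> str:
    """Map a 1-based turn number to a soft progress stage (hypothetical helper)."""
    if total_turns <= 0:
        raise ValueError("total_turns must be positive")
    progress = turn / total_turns
    # Assumed cutoffs: the 0.30 / 0.70 values appear in the commit notes,
    # but how the real strategy compares against them is an assumption.
    if progress < 0.30:
        return "early"
    if progress < 0.70:
        return "mid"
    return "late"


# With the 30-turn default mentioned above, a few sample turns:
sample_stages = {turn: get_stage(turn, 30) for turn in (1, 8, 15, 30)}
```

Because the stage is derived from a ratio rather than fixed turn numbers, the same cutoffs scale with whatever `total_turns` the caller picks.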
…param
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… tests
Architecture:
- Add template_variables() to RedTeamStrategy base (Python) and phaseEnds?() to
RedTeamStrategy interface (JS) — strategies declare their own template vars
- CrescendoStrategy overrides to return phase boundary turn numbers; GoatStrategy
returns nothing. Removes isinstance(CrescendoStrategy) check from orchestrator
- Remove _PHASES import from red_team_agent.py (no longer needed)
JS:
- CrescendoConfig = Omit<RedTeamAgentConfig, "strategy"> — eliminates 13-field
duplication across three interfaces
- GoatConfig gets doc comment explaining it is a named hook for future GOAT params
- CrescendoStrategy.getPhase() made private; tests updated to use getPhaseName()
- Add 24 JS unit tests for GoatStrategy (stage boundaries, buildSystemPrompt,
phaseEnds, redTeamGoat factory defaults)
- Remove vestigial vi.doMock("ai") that never intercepted calls (module pre-loaded)
- phaseEnds test uses literal [2, 4, 7] instead of re-deriving the formula
Python:
- metaprompt_template falsy check fixed: `or` → `is not None` (matches JS `??`)
- Add "Should not be reached" comment to GoatStrategy fallback (matches Crescendo)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
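A minimal sketch of the decoupling this commit describes, where each strategy declares its own template variables and the orchestrator no longer checks strategy types. All class names, template strings, and the 0.2 phase fraction below are made up for illustration; only the idea of a `template_variables()` hook and the `is not None` check come from the commit text.

```python
from typing import Dict, Optional

# Placeholder templates (hypothetical, not the library's real metaprompts).
DEFAULT_TEMPLATE = "Plan {total_turns} turns; phase 1 ends at turn {phase1_end}."
STAGED_TEMPLATE = "Plan {total_turns} turns, choosing a technique each turn."


class Strategy:
    """Base strategy: contributes no extra template variables."""

    def template_variables(self, total_turns: int) -> Dict[str, int]:
        return {}


class PhasedStrategy(Strategy):
    """Stand-in for a phase-based strategy that exposes a phase boundary."""

    def template_variables(self, total_turns: int) -> Dict[str, int]:
        # The 0.2 fraction is an arbitrary illustrative value.
        return {"phase1_end": max(1, round(total_turns * 0.2))}


def render_plan(strategy: Strategy, total_turns: int,
                template: Optional[str] = None) -> str:
    # `is not None` instead of `or`: an explicitly passed empty string is
    # respected rather than silently replaced by the default.
    chosen = template if template is not None else DEFAULT_TEMPLATE
    variables = strategy.template_variables(total_turns)
    return chosen.format(total_turns=total_turns, **variables)
```

The orchestrator (`render_plan` here) never inspects the strategy's type, so adding a new strategy does not require touching it.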
…ypes
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix GoatStrategy docstring: remove benchmark-specific "in 5 turns" claim
- Fix get_phase_name base class docstring: strategy-agnostic return value wording
- Wrap .format() in _generate_attack_plan with helpful ValueError on KeyError (e.g. user passes Crescendo template to GOAT agent; was a silent crash)
- Export GOAT_METAPROMPT_TEMPLATE from Python scenario.__init__ and JS index.ts so users can inspect/extend without importing from internal paths
- Update GoatConfig JSDoc to document inherited options and totalTurns=30 default
- Add Python test: goat() allows overriding metaprompt_template via kwargs
- Add JS test: renderMetapromptTemplate leaves phase placeholders as literals when phaseEnds is omitted (documents silent passthrough behavior)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
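The `.format()` wrapping described in this commit might look something like the sketch below. `render_template` is a hypothetical helper, and the error wording is invented; the real `_generate_attack_plan` may structure this differently.

```python
def render_template(template: str, variables: dict) -> str:
    """Render a metaprompt template, turning a bare KeyError into a clearer error."""
    try:
        return template.format(**variables)
    except KeyError as exc:
        missing = exc.args[0]  # name of the placeholder that had no value
        raise ValueError(
            f"Metaprompt template references {{{missing}}}, which the active "
            "strategy does not provide. Was a template written for a "
            "different strategy passed in?"
        ) from exc
```

Chaining with `from exc` keeps the original `KeyError` in the traceback while surfacing a message that points at the likely cause.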
Resolves conflicts between GOAT strategy work and main's red-team additions (injection_probability, AttackTechnique catalogue, marathon_script signature cleanup).
Resolutions:
- Take main's marathon_script signature (drops turns param, uses total_turns) in both Python and TypeScript; supersedes the turns-optional fix.
- Keep HEAD's template_variables() decoupling so GOAT and Crescendo each contribute their own metaprompt placeholders; drop now-unused _PHASES and _marathon_script imports.
- Combine public API exports across both branches: GoatStrategy, GOAT_METAPROMPT_TEMPLATE, AttackTechnique, DEFAULT_TECHNIQUES.
- Add injection_probability and techniques kwargs to RedTeamAgent.goat() for parity with .crescendo().
- Combine test suites: GOAT stage/factory tests alongside main's injection probability and marathon-judge integration tests.
Verified: 156 Python tests pass, 108 JS tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…euse
The `goat()`/`redTeamGoat` factories used `setdefault` / object spread
patterns that left an explicit `metaprompt_template=None` (Python) or
`metapromptTemplate: undefined` (TypeScript) in place. The constructor
then fell back to the Crescendo `_DEFAULT_METAPROMPT_TEMPLATE`, which
contains `{phase1_end}` placeholders that GoatStrategy.template_variables()
does not provide — first attack-plan render dies with KeyError.
Force the GOAT default whenever the caller's value is None/undefined.
Also document the silent-stale-plan failure mode: `_attack_plan` is
cached on the instance and survives across `scenario.run()` calls. Reusing
the same agent across scenarios with different descriptions silently uses
the first run's plan. Added `.. note::` blocks to both `goat()` and
`crescendo()` Python docstrings and `@remarks` to the TS factories.
Added a warning to `goat()` about combining `injection_probability` with
GOAT — the GOAT metaprompt already steers the attacker toward encoding
techniques, and post-hoc encoding desyncs H_attacker from what the target
saw. Default 0.0 is the safe path.
Verified: 156 Python + 108 JS tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
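The `setdefault` pitfall described in this commit can be reproduced in isolation. The sketch below uses placeholder template strings and hypothetical function names (`build_agent_template`, `goat_buggy`, `goat_fixed`); it only illustrates the mechanism, not the actual factory code.

```python
# Placeholder values standing in for the two default metaprompt templates.
GOAT_TEMPLATE = "goat-template"
CRESCENDO_TEMPLATE = "crescendo-template"


def build_agent_template(metaprompt_template=None):
    """Stands in for the constructor, which falls back to the Crescendo default."""
    if metaprompt_template is not None:
        return metaprompt_template
    return CRESCENDO_TEMPLATE


def goat_buggy(**kwargs):
    # setdefault only fills in a MISSING key; an explicit None is left in place,
    # so the constructor then applies its own (Crescendo) fallback.
    kwargs.setdefault("metaprompt_template", GOAT_TEMPLATE)
    return build_agent_template(**kwargs)


def goat_fixed(**kwargs):
    # Force the GOAT default whenever the caller's value is missing or None.
    if kwargs.get("metaprompt_template") is None:
        kwargs["metaprompt_template"] = GOAT_TEMPLATE
    return build_agent_template(**kwargs)
```

Calling `goat_buggy(metaprompt_template=None)` yields the Crescendo placeholder, while `goat_fixed(metaprompt_template=None)` yields the GOAT one; both behave the same when the key is omitted or set to a real template.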
Contributor
Automated low-risk assessment
This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk.
This PR requires a manual review before merging.
Re-land of #306: the original PR was squash-merged prematurely and subsequently reverted via #345. Opening this to restore the GOAT strategy on `main` through the intended stacked-PR workflow.
Same branch (`feat/red-team-dynamic-techniques`) and same commits as #306.
Summary
Adds `RedTeamAgent.goat()` (Python) / `redTeamGoat()` (TS), implementing Meta's GOAT methodology (ICML 2025) for dynamic per-turn technique selection. 7-technique catalogue, 3 soft progress stages, paper-sourced adaptive attacks. See #306 for the full description.
Stack
This PR is the base of a 3-PR stack:
Merge bottom-up without `--delete-branch` until the top of the stack lands, then clean up branches.
Test plan
See the #306 test plan (unchanged). 108 JS tests, 156 Python tests passing on the branch.
Context
🤖 Generated with Claude Code