feat: add GOAT strategy with dynamic technique selection for RedTeamAgent#306
Merged
Aryansharma28 merged 9 commits into main on Apr 14, 2026
Add GOAT (Generative Offensive Agent Tester) as a separate strategy
alongside Crescendo. Based on Meta's GOAT paper (ICML 2025, 97% ASR).
- GoatStrategy with 7-technique catalogue (hypothetical framing, persona
modification, refusal suppression, response priming, dual response,
topic splitting, authority & social engineering)
- Soft progress stages (early/mid/late) instead of fixed phases
- Dedicated GOAT metaprompt template for adaptive attack planning
- Python: RedTeamAgent.goat(target=..., model=...)
- TypeScript: scenario.redTeamGoat({ target, model })
- Crescendo (.crescendo()) is completely untouched
Closes #2143
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
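The soft stages can be pictured as a function of how far the conversation has progressed. A minimal sketch, assuming the 0.30/0.70 cut-offs mentioned elsewhere in this PR are fractions of the total turn budget; the helper name `get_stage`, the 1-based turn convention and the boundary handling are illustrative assumptions, not the library's code:

```python
def get_stage(current_turn: int, total_turns: int) -> str:
    """Map a 1-based turn number to a soft progress stage (hypothetical helper)."""
    if total_turns <= 0:
        raise ValueError("total_turns must be positive")
    progress = current_turn / total_turns
    # Assumed cut-offs: below 30% is "early", below 70% is "mid", the rest is "late".
    if progress < 0.30:
        return "early"
    if progress < 0.70:
        return "mid"
    return "late"
```

Unlike fixed phases, a stage of this kind only tells the attacker roughly where it is in the budget; which technique to use on a given turn is still chosen dynamically.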
…etaprompt from Crescendo phases
marathon_script(turns=...) was required with no default, causing TypeError in all test calls that omit it. Now defaults to self.total_turns.
Also makes _generate_attack_plan strategy-aware: only computes Crescendo phase boundaries when using CrescendoStrategy, removing unnecessary coupling for the GOAT strategy.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
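The defaulting behaviour described in this commit might look roughly like the following. This is a sketch under assumptions: the `Agent` class, its constructor and the return value are stand-ins, and only the `turns`/`total_turns` fallback reflects the commit message.

```python
from typing import Optional


class Agent:
    """Stand-in for the real agent; only the defaulting logic is shown."""

    def __init__(self, total_turns: int = 30) -> None:
        self.total_turns = total_turns

    def marathon_script(self, turns: Optional[int] = None) -> list[str]:
        # Fall back to the agent's own turn budget when the caller omits `turns`.
        effective = self.total_turns if turns is None else turns
        return [f"turn {i}" for i in range(1, effective + 1)]
```

With a required `turns` parameter, a call such as `Agent().marathon_script()` would raise `TypeError`; with the `None` default it uses `total_turns` instead.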
a8aafff to 21fa64d
…d test coverage
Python:
- Export GoatStrategy from scenario.__init__ (was missing; CrescendoStrategy was exported but not GoatStrategy)
- Add 25 unit tests for GoatStrategy: stage boundaries, prompt building, factory method defaults
- Fix technique 6 example message (was placeholder "...")
JavaScript:
- Fix: redTeamGoat() now always sets GOAT_METAPROMPT_TEMPLATE (was conditionally falling back to Crescendo template when attackPlan supplied)
- Add totalTurns: 30 default to redTeamGoat() to match Python
- Add metapromptTemplate to CrescendoConfig so users can override via both factory APIs
- Fix renderMetapromptTemplate: phase boundary vars only injected for Crescendo (via optional phaseEnds param)
- generateAttackPlan passes phaseEnds only when strategy instanceof CrescendoStrategy
- marathonScript turns param is now optional, defaults to this.totalTurns (matches Python fix)
- Make GoatStrategy.getStage() private (matches Python _get_stage())
- Fix float notation 0.3/0.7 → 0.30/0.70 to match Python
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…param
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… tests
Architecture:
- Add template_variables() to RedTeamStrategy base (Python) and phaseEnds?() to
RedTeamStrategy interface (JS) — strategies declare their own template vars
- CrescendoStrategy overrides to return phase boundary turn numbers; GoatStrategy
returns nothing. Removes isinstance(CrescendoStrategy) check from orchestrator
- Remove _PHASES import from red_team_agent.py (no longer needed)
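The decoupling above can be sketched as a hook that each strategy overrides. This is an illustration only: the class bodies, the zero-argument `template_variables()` signature, the `render` helper and passing the boundaries into the constructor (to avoid guessing the real formula) are all assumptions; only the method name and the `{phase1_end}`-style placeholders come from the PR.

```python
class RedTeamStrategy:
    """Hypothetical base class; only the template hook is sketched."""

    def template_variables(self) -> dict:
        # By default a strategy contributes no extra placeholders.
        return {}


class GoatStrategy(RedTeamStrategy):
    """Inherits the empty default, so no phase variables are injected."""


class CrescendoStrategy(RedTeamStrategy):
    def __init__(self, phase_ends=(2, 4, 7)):
        # Boundaries are supplied directly; the real derivation is not shown here.
        self.phase_ends = phase_ends

    def template_variables(self) -> dict:
        p1, p2, p3 = self.phase_ends
        return {"phase1_end": p1, "phase2_end": p2, "phase3_end": p3}


def render(template: str, strategy: RedTeamStrategy, **common) -> str:
    # The orchestrator merges shared values with whatever the strategy declares,
    # so it never needs an isinstance() check.
    return template.format(**common, **strategy.template_variables())
```

The orchestrator then treats every strategy the same way, and adding a new strategy with its own placeholders does not require touching the rendering code.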
JS:
- CrescendoConfig = Omit<RedTeamAgentConfig, "strategy"> — eliminates 13-field
duplication across three interfaces
- GoatConfig gets doc comment explaining it is a named hook for future GOAT params
- CrescendoStrategy.getPhase() made private; tests updated to use getPhaseName()
- Add 24 JS unit tests for GoatStrategy (stage boundaries, buildSystemPrompt,
phaseEnds, redTeamGoat factory defaults)
- Remove vestigial vi.doMock("ai") that never intercepted calls (module pre-loaded)
- phaseEnds test uses literal [2, 4, 7] instead of re-deriving the formula
Python:
- metaprompt_template falsy check fixed: `or` → `is not None` (matches JS `??`)
- Add "Should not be reached" comment to GoatStrategy fallback (matches Crescendo)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
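The `or` → `is not None` change in the Python section matters because `or` replaces every falsy value, not just a missing one. A small self-contained illustration (the function names and the default string are made up for the example):

```python
DEFAULT_TEMPLATE = "default template"


def pick_with_or(template):
    # Any falsy value, including the empty string, is replaced by the default.
    return template or DEFAULT_TEMPLATE


def pick_with_none_check(template):
    # Only None falls back to the default; an empty string is kept as given.
    return template if template is not None else DEFAULT_TEMPLATE
```

The second form mirrors JavaScript's `??`, which only falls back on `null` or `undefined`.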
…ypes
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix GoatStrategy docstring: remove benchmark-specific "in 5 turns" claim
- Fix get_phase_name base class docstring: strategy-agnostic return value wording
- Wrap .format() in _generate_attack_plan with helpful ValueError on KeyError (e.g. user passes Crescendo template to GOAT agent; was silent crash)
- Export GOAT_METAPROMPT_TEMPLATE from Python scenario.__init__ and JS index.ts so users can inspect/extend without importing from internal paths
- Update GoatConfig JSDoc to document inherited options and totalTurns=30 default
- Add Python test: goat() allows overriding metaprompt_template via kwargs
- Add JS test: renderMetapromptTemplate leaves phase placeholders as literals when phaseEnds is omitted (documents silent passthrough behavior)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
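Wrapping `.format()` so that a mismatched template produces a readable error could look something like this. It is a sketch, not the code in `_generate_attack_plan`: the function name `render_plan_prompt` and the error wording are assumptions.

```python
def render_plan_prompt(template: str, **values: object) -> str:
    """Render a metaprompt, turning a missing placeholder into a clear error."""
    try:
        return template.format(**values)
    except KeyError as exc:
        missing = exc.args[0]
        # Chain the original KeyError so the traceback still shows the root cause.
        raise ValueError(
            f"Metaprompt template references unknown placeholder {{{missing}}}; "
            "check that the template matches the strategy in use."
        ) from exc
```

For example, rendering a template that contains `{phase1_end}` without supplying that value now raises a `ValueError` naming the placeholder rather than a bare `KeyError`.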
@copilot resolve the merge conflicts in this pull request
Resolves conflicts between GOAT strategy work and main's red-team additions (injection_probability, AttackTechnique catalogue, marathon_script signature cleanup).
Resolutions:
- Take main's marathon_script signature (drops turns param, uses total_turns) in both Python and TypeScript; supersedes the turns-optional fix.
- Keep HEAD's template_variables() decoupling so GOAT and Crescendo each contribute their own metaprompt placeholders; drop now-unused _PHASES and _marathon_script imports.
- Combine public API exports across both branches: GoatStrategy, GOAT_METAPROMPT_TEMPLATE, AttackTechnique, DEFAULT_TECHNIQUES.
- Add injection_probability and techniques kwargs to RedTeamAgent.goat() for parity with .crescendo().
- Combine test suites: GOAT stage/factory tests alongside main's injection probability and marathon-judge integration tests.
Verified: 156 Python tests pass, 108 JS tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…euse
The `goat()`/`redTeamGoat` factories used `setdefault` / object spread
patterns that left an explicit `metaprompt_template=None` (Python) or
`metapromptTemplate: undefined` (TypeScript) in place. The constructor
then fell back to the Crescendo `_DEFAULT_METAPROMPT_TEMPLATE`, which
contains `{phase1_end}` placeholders that GoatStrategy.template_variables()
does not provide — first attack-plan render dies with KeyError.
Force the GOAT default whenever the caller's value is None/undefined.
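The Python half of this bug comes down to how `dict.setdefault` treats a key that is present but set to `None`. A minimal reproduction, with made-up function names and a placeholder template string standing in for the real constants:

```python
GOAT_TEMPLATE = "goat template"


def with_setdefault(**kwargs):
    # setdefault only fills *missing* keys, so an explicit None is left in place.
    kwargs.setdefault("metaprompt_template", GOAT_TEMPLATE)
    return kwargs["metaprompt_template"]


def with_forced_default(**kwargs):
    # Treat both "missing" and "explicitly None" as "use the GOAT default".
    if kwargs.get("metaprompt_template") is None:
        kwargs["metaprompt_template"] = GOAT_TEMPLATE
    return kwargs["metaprompt_template"]
```

In the first version a caller passing `metaprompt_template=None` ends up with `None`, which is what let the constructor fall back to the Crescendo template; the second version closes that gap while still honouring a caller-supplied template.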
Also document the silent-stale-plan failure mode: `_attack_plan` is
cached on the instance and survives across `scenario.run()` calls. Reusing
the same agent across scenarios with different descriptions silently uses
the first run's plan. Added `.. note::` blocks to both `goat()` and
`crescendo()` Python docstrings and `@remarks` to the TS factories.
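The stale-plan failure mode can be reproduced with a toy class. This is only a model of the described caching: `PlanningAgent` and `plan_for` are hypothetical names, and the real agent generates its plan with a model call rather than a format string.

```python
class PlanningAgent:
    """Toy agent that caches its plan on the instance, as described above."""

    def __init__(self) -> None:
        self._attack_plan = None

    def plan_for(self, description: str) -> str:
        # The plan is generated once and then reused, whatever is passed later.
        if self._attack_plan is None:
            self._attack_plan = f"plan for: {description}"
        return self._attack_plan


shared = PlanningAgent()
first = shared.plan_for("bank assistant")
second = shared.plan_for("data assistant")  # silently reuses the first plan
```

Under this model, creating a fresh agent per scenario sidesteps the problem, which is presumably why the behaviour was documented rather than changed.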
Added a warning to `goat()` about combining `injection_probability` with
GOAT — the GOAT metaprompt already steers the attacker toward encoding
techniques, and post-hoc encoding desyncs H_attacker from what the target
saw. Default 0.0 is the safe path.
Verified: 156 Python + 108 JS tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Automated low-risk assessment
This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk.
This PR requires a manual review before merging.
Summary
- `RedTeamAgent.goat()` (Python) and `redTeamGoat()` (JS) implementing Meta's GOAT methodology (ICML 2025) for dynamic per-turn technique selection
- `marathon_script()` bug: `turns` parameter was required with no default; now optional, defaults to `total_turns` (Python + JS)
- … `_PHASES` constants: each strategy uses its own template
- … `GoatStrategy` from the public API (`scenario.__init__`) alongside `CrescendoStrategy`
Changes (latest commit)
Python
- `GoatStrategy` now exported from `scenario` root (was missing)
- … `.goat()` factory defaults
JavaScript
- `redTeamGoat()` always sets `GOAT_METAPROMPT_TEMPLATE`; previously fell back to Crescendo's template when `attackPlan` was supplied
- `redTeamGoat()` defaults `totalTurns` to 30 (matches Python)
- `metapromptTemplate` added to `CrescendoConfig` so users can override it via both factory APIs
- `renderMetapromptTemplate` only injects `{phase1End}`/`{phase2End}`/`{phase3End}` when called from the Crescendo path (via new optional `phaseEnds` param); GOAT path is clean
- `GoatStrategy.getStage()` made private (matches Python's `_get_stage()`)
- … `0.3`/`0.7` → `0.30`/`0.70` to match Python
Test Results (GOAT vs bank-demo & data-demo agents)
- … `pg_read_file`, `pg_sleep`, `lo_import`)
Test plan
- … `.goat()` strategy (5 attack surfaces)
- … `redTeamGoat()` strategy (5 attack surfaces)
- `marathon_script()` works without explicit `turns` argument
- `GoatStrategy` exported from public API
- … `totalTurns` default, `turns` optional, metaprompt template always set
Closes #2143
🤖 Generated with Claude Code