
feat: add GOAT strategy with dynamic technique selection for RedTeamAgent#306

Merged
Aryansharma28 merged 9 commits into main from
feat/red-team-dynamic-techniques
Apr 14, 2026

Conversation

@Aryansharma28
Contributor

@Aryansharma28 Aryansharma28 commented Mar 25, 2026

Summary

  • Adds RedTeamAgent.goat() (Python) and redTeamGoat() (JS) implementing Meta's GOAT methodology (ICML 2025) for dynamic per-turn technique selection
  • 7-technique catalogue (Hypothetical Framing, Persona Modification, Refusal Suppression, Response Priming, Dual Response, Topic Splitting, Authority & Social Engineering) with 3 soft progress stages instead of Crescendo's fixed phases
  • Fixes marathon_script() bug: turns parameter was required with no default — now optional, defaults to total_turns (Python + JS)
  • Decouples GOAT metaprompt from Crescendo's _PHASES constants — each strategy uses its own template
  • Exports GoatStrategy from the public API (scenario.__init__) alongside CrescendoStrategy
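The soft progress stages mentioned above can be illustrated with a minimal sketch. The 0.30 and 0.70 values appear elsewhere in this PR as float literals; treating them as cut-offs on `turn / total_turns` is an assumption, and `get_stage` is a hypothetical stand-in, not the library's private `_get_stage()`.

```python
def get_stage(turn: int, total_turns: int) -> str:
    """Map a 1-based turn number to a soft progress stage.

    Hypothetical sketch: the 0.30 / 0.70 cut-offs on turn / total_turns
    are an assumption about how the stages are derived.
    """
    if total_turns <= 0:
        raise ValueError("total_turns must be positive")
    progress = turn / total_turns
    if progress < 0.30:
        return "early"
    if progress < 0.70:
        return "mid"
    return "late"


if __name__ == "__main__":
    for turn in (1, 15, 30):
        print(turn, get_stage(turn, total_turns=30))
```

Unlike fixed phases, a stage here only hints at how far the conversation has progressed; technique choice would still be made per turn.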

Changes (latest commit)

Python

  • GoatStrategy now exported from scenario root (was missing)
  • 25 new unit tests: stage boundaries (early/mid/late), prompt building, .goat() factory defaults
  • Technique 6 example message was a placeholder — replaced with a complete realistic example

JavaScript

  • redTeamGoat() always sets GOAT_METAPROMPT_TEMPLATE — previously fell back to Crescendo's template when attackPlan was supplied
  • redTeamGoat() defaults totalTurns to 30 (matches Python)
  • metapromptTemplate added to CrescendoConfig so users can override it via both factory APIs
  • renderMetapromptTemplate only injects {phase1End} / {phase2End} / {phase3End} when called from the Crescendo path (via new optional phaseEnds param) — GOAT path is clean
  • GoatStrategy.getStage() made private (matches Python's _get_stage())
  • Float notation 0.3/0.7 → 0.30/0.70 to match Python
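The optional `phaseEnds` behaviour described for `renderMetapromptTemplate` can be sketched as follows. This is a Python analogue written for illustration only: `render_metaprompt_template`, its parameters, and the `{target}` placeholder are assumptions, not the JS implementation.

```python
from typing import Optional, Sequence


def render_metaprompt_template(
    template: str,
    target: str,
    phase_ends: Optional[Sequence[int]] = None,
) -> str:
    """Fill a metaprompt template (hypothetical Python analogue).

    Phase placeholders are substituted only when phase_ends is supplied,
    so a GOAT-style caller that omits it never touches them.
    """
    rendered = template.replace("{target}", target)
    if phase_ends is not None:
        for index, end in enumerate(phase_ends, start=1):
            # Builds the literal placeholder text, e.g. "{phase1End}".
            rendered = rendered.replace(f"{{phase{index}End}}", str(end))
    return rendered
```

When `phase_ends` is omitted, any phase placeholder in the template is left as literal text rather than raising, which mirrors the passthrough behaviour the JS test in this PR documents.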

Test Results (GOAT vs bank-demo & data-demo agents)

| Agent | Held | Broken | Vulnerabilities Found |
| --- | --- | --- | --- |
| Bank support (Python) | 3/5 | 2/5 | PII leak (email/phone/DOB), customer data leak via social engineering |
| Data analytics (TypeScript) | 4/5 | 1/5 | Agent generates forbidden SQL (pg_read_file, pg_sleep, lo_import) |

Test plan

  • Run bank-demo red team tests with .goat() strategy (5 attack surfaces)
  • Run data-demo red team tests with redTeamGoat() strategy (5 attack surfaces)
  • Verify marathon_script() works without explicit turns argument
  • Verify GOAT metaprompt generates correctly without Crescendo phase boundaries
  • Unit tests for GoatStrategy stage progression (25 tests added)
  • GoatStrategy exported from public API
  • JS/Python parity: totalTurns default, turns optional, metaprompt template always set

Closes #2143

🤖 Generated with Claude Code

Aryansharma28 and others added 2 commits March 23, 2026 16:29
Add GOAT (Generative Offensive Agent Tester) as a separate strategy
alongside Crescendo. Based on Meta's GOAT paper (ICML 2025, 97% ASR).

- GoatStrategy with 7-technique catalogue (hypothetical framing, persona
  modification, refusal suppression, response priming, dual response,
  topic splitting, authority & social engineering)
- Soft progress stages (early/mid/late) instead of fixed phases
- Dedicated GOAT metaprompt template for adaptive attack planning
- Python: RedTeamAgent.goat(target=..., model=...)
- TypeScript: scenario.redTeamGoat({ target, model })
- Crescendo (.crescendo()) is completely untouched

Closes #2143

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…etaprompt from Crescendo phases

marathon_script(turns=...) was required with no default, causing TypeError
in all test calls that omit it. Now defaults to self.total_turns.

Also makes _generate_attack_plan strategy-aware: only computes Crescendo
phase boundaries when using CrescendoStrategy, removing unnecessary
coupling for the GOAT strategy.
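A minimal sketch of the `turns` default described in this commit, assuming a toy agent class. `ToyRedTeamAgent` and the list of strings it returns are hypothetical; the real `marathon_script()` builds scenario script steps.

```python
from typing import List, Optional


class ToyRedTeamAgent:
    """Hypothetical stand-in that only models the `turns` default."""

    def __init__(self, total_turns: int = 30) -> None:
        self.total_turns = total_turns

    def marathon_script(self, turns: Optional[int] = None) -> List[str]:
        # Fall back to the agent's own total_turns when the caller omits turns.
        effective_turns = self.total_turns if turns is None else turns
        return [f"turn {i}" for i in range(1, effective_turns + 1)]


if __name__ == "__main__":
    agent = ToyRedTeamAgent(total_turns=3)
    print(agent.marathon_script())         # uses total_turns
    print(agent.marathon_script(turns=1))  # explicit override
```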

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Aryansharma28 Aryansharma28 force-pushed the feat/red-team-dynamic-techniques branch from a8aafff to 21fa64d March 26, 2026 13:21
Aryansharma28 and others added 5 commits April 7, 2026 17:50
…d test coverage

Python:
- Export GoatStrategy from scenario.__init__ (was missing, CrescendoStrategy was exported but not GoatStrategy)
- Add 25 unit tests for GoatStrategy: stage boundaries, prompt building, factory method defaults
- Fix technique 6 example message (was placeholder "...")

JavaScript:
- Fix redTeamGoat() always sets GOAT_METAPROMPT_TEMPLATE (was conditionally falling back to Crescendo template when attackPlan supplied)
- Add totalTurns: 30 default to redTeamGoat() to match Python
- Add metapromptTemplate to CrescendoConfig so users can override via both factory APIs
- Fix renderMetapromptTemplate: phase boundary vars only injected for Crescendo (via optional phaseEnds param)
- generateAttackPlan passes phaseEnds only when strategy instanceof CrescendoStrategy
- marathonScript turns param is now optional, defaults to this.totalTurns (matches Python fix)
- Make GoatStrategy.getStage() private (matches Python _get_stage())
- Fix float notation 0.3/0.7 → 0.30/0.70 to match Python

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…param

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… tests

Architecture:
- Add template_variables() to RedTeamStrategy base (Python) and phaseEnds?() to
  RedTeamStrategy interface (JS) — strategies declare their own template vars
- CrescendoStrategy overrides to return phase boundary turn numbers; GoatStrategy
  returns nothing. Removes isinstance(CrescendoStrategy) check from orchestrator
- Remove _PHASES import from red_team_agent.py (no longer needed)
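The decoupling described above can be sketched roughly as below. Only the class names, `template_variables()`, and the `{phase1_end}` placeholder come from this PR; the method signature, the boundary fractions, and the `build_plan_prompt` helper are assumptions made for illustration.

```python
from typing import Dict


class RedTeamStrategy:
    def template_variables(self, total_turns: int) -> Dict[str, int]:
        """Extra placeholders this strategy contributes (none by default)."""
        return {}


class CrescendoStrategy(RedTeamStrategy):
    # Illustrative fractions only; the real phase boundaries may differ.
    _FRACTIONS = (0.25, 0.50, 0.75)

    def template_variables(self, total_turns: int) -> Dict[str, int]:
        return {
            f"phase{i}_end": int(total_turns * fraction)
            for i, fraction in enumerate(self._FRACTIONS, start=1)
        }


class GoatStrategy(RedTeamStrategy):
    """Inherits the empty default: GOAT has no fixed phases."""


def build_plan_prompt(
    strategy: RedTeamStrategy, template: str, target: str, total_turns: int
) -> str:
    # The orchestrator asks the strategy for variables; no isinstance() check.
    variables = {"target": target, **strategy.template_variables(total_turns)}
    return template.format(**variables)
```

With this shape, adding a third strategy would not require touching the orchestrator: the new class would only declare its own placeholders.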

JS:
- CrescendoConfig = Omit<RedTeamAgentConfig, "strategy"> — eliminates 13-field
  duplication across three interfaces
- GoatConfig gets doc comment explaining it is a named hook for future GOAT params
- CrescendoStrategy.getPhase() made private; tests updated to use getPhaseName()
- Add 24 JS unit tests for GoatStrategy (stage boundaries, buildSystemPrompt,
  phaseEnds, redTeamGoat factory defaults)
- Remove vestigial vi.doMock("ai") that never intercepted calls (module pre-loaded)
- phaseEnds test uses literal [2, 4, 7] instead of re-deriving the formula

Python:
- metaprompt_template falsy check fixed: `or` → `is not None` (matches JS `??`)
- Add "Should not be reached" comment to GoatStrategy fallback (matches Crescendo)
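The `or` versus `is not None` distinction matters for falsy but intentional values such as an empty string. A small sketch, with `DEFAULT_TEMPLATE` and both helper names being hypothetical:

```python
from typing import Optional

DEFAULT_TEMPLATE = "default template"  # hypothetical placeholder value


def pick_with_or(template: Optional[str]) -> str:
    # Treats every falsy value (None, "") as "not provided".
    return template or DEFAULT_TEMPLATE


def pick_with_is_not_none(template: Optional[str]) -> str:
    # Only None means "not provided"; "" is kept, like JS `??`.
    return template if template is not None else DEFAULT_TEMPLATE
```

Both helpers agree on `None`; they differ only when the caller deliberately passes an empty template.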

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ypes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix GoatStrategy docstring: remove benchmark-specific "in 5 turns" claim
- Fix get_phase_name base class docstring: strategy-agnostic return value wording
- Wrap .format() in _generate_attack_plan with helpful ValueError on KeyError
  (e.g. user passes Crescendo template to GOAT agent — was silent crash)
- Export GOAT_METAPROMPT_TEMPLATE from Python scenario.__init__ and JS index.ts
  so users can inspect/extend without importing from internal paths
- Update GoatConfig JSDoc to document inherited options and totalTurns=30 default
- Add Python test: goat() allows overriding metaprompt_template via kwargs
- Add JS test: renderMetapromptTemplate leaves phase placeholders as literals
  when phaseEnds is omitted (documents silent passthrough behavior)
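The `.format()` wrapping mentioned in this commit could look roughly like the sketch below. The function name and the error wording are assumptions; only the idea of converting the `KeyError` into a descriptive `ValueError` comes from the commit message.

```python
from typing import Mapping


def render_attack_plan_prompt(template: str, variables: Mapping[str, object]) -> str:
    """Format a metaprompt template, turning a missing placeholder into a
    descriptive ValueError (hypothetical helper, not the library's code)."""
    try:
        return template.format(**variables)
    except KeyError as exc:
        missing = exc.args[0]
        raise ValueError(
            f"Metaprompt template references {{{missing}}}, which the current "
            "strategy does not provide. Was a template written for a "
            "different strategy passed in?"
        ) from exc
```

Chaining with `from exc` keeps the original `KeyError` in the traceback for anyone who needs the low-level detail.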

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Aryansharma28
Contributor Author

@copilot resolve the merge conflicts in this pull request

Aryansharma28 and others added 2 commits April 14, 2026 11:39
Resolves conflicts between GOAT strategy work and main's red-team additions
(injection_probability, AttackTechnique catalogue, marathon_script signature
cleanup).

Resolutions:
- Take main's marathon_script signature (drops turns param, uses total_turns)
  in both Python and TypeScript; supersedes the turns-optional fix.
- Keep HEAD's template_variables() decoupling so GOAT and Crescendo each
  contribute their own metaprompt placeholders; drop now-unused _PHASES and
  _marathon_script imports.
- Combine public API exports across both branches: GoatStrategy,
  GOAT_METAPROMPT_TEMPLATE, AttackTechnique, DEFAULT_TECHNIQUES.
- Add injection_probability and techniques kwargs to RedTeamAgent.goat()
  for parity with .crescendo().
- Combine test suites: GOAT stage/factory tests alongside main's injection
  probability and marathon-judge integration tests.

Verified: 156 Python tests pass, 108 JS tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…euse

The `goat()`/`redTeamGoat` factories used `setdefault` / object spread
patterns that left an explicit `metaprompt_template=None` (Python) or
`metapromptTemplate: undefined` (TypeScript) in place. The constructor
then fell back to the Crescendo `_DEFAULT_METAPROMPT_TEMPLATE`, which
contains `{phase1_end}` placeholders that GoatStrategy.template_variables()
does not provide — first attack-plan render dies with KeyError.

Force the GOAT default whenever the caller's value is None/undefined.
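The `setdefault` pitfall on the Python side can be reproduced in isolation. In this sketch `GOAT_TEMPLATE` and both helper functions are hypothetical; the point is only that `dict.setdefault` does not replace a key that is present with the value `None`.

```python
from typing import Any, Dict

GOAT_TEMPLATE = "goat template"  # hypothetical placeholder value


def goat_kwargs_buggy(**kwargs: Any) -> Dict[str, Any]:
    # setdefault only fills in a MISSING key; an explicit None is kept.
    kwargs.setdefault("metaprompt_template", GOAT_TEMPLATE)
    return kwargs


def goat_kwargs_fixed(**kwargs: Any) -> Dict[str, Any]:
    # Treat both "missing" and "explicitly None" as "use the GOAT default".
    if kwargs.get("metaprompt_template") is None:
        kwargs["metaprompt_template"] = GOAT_TEMPLATE
    return kwargs


if __name__ == "__main__":
    print(goat_kwargs_buggy(metaprompt_template=None))
    print(goat_kwargs_fixed(metaprompt_template=None))
```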

Also document the silent-stale-plan failure mode: `_attack_plan` is
cached on the instance and survives across `scenario.run()` calls. Reusing
the same agent across scenarios with different descriptions silently uses
the first run's plan. Added `.. note::` blocks to both `goat()` and
`crescendo()` Python docstrings and `@remarks` to the TS factories.
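The stale-plan failure mode can be shown with a toy model. `CachingAgent` and `plan_for` are hypothetical; only the idea that `_attack_plan` is cached on the instance and reused comes from this commit message.

```python
from typing import Optional


class CachingAgent:
    """Toy model of an agent that caches its attack plan on the instance."""

    def __init__(self) -> None:
        self._attack_plan: Optional[str] = None

    def plan_for(self, description: str) -> str:
        # The plan is generated once and then reused, whatever the description.
        if self._attack_plan is None:
            self._attack_plan = f"plan for: {description}"
        return self._attack_plan


if __name__ == "__main__":
    shared = CachingAgent()
    print(shared.plan_for("bank support agent"))
    print(shared.plan_for("data analytics agent"))  # still the bank plan
```

In this toy model, creating a fresh agent per scenario avoids the stale plan.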

Added a warning to `goat()` about combining `injection_probability` with
GOAT — the GOAT metaprompt already steers the attacker toward encoding
techniques, and post-hoc encoding desyncs H_attacker from what the target
saw. Default 0.0 is the safe path.

Verified: 156 Python + 108 JS tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

Automated low-risk assessment

This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk.

The PR adds a new GOAT red‑team strategy and changes runtime behavior across both Python and JavaScript (new strategy classes, metaprompt templates, factory functions, and default totalTurns), and it modifies metaprompt rendering and agent orchestration logic. These are behavioral changes to security/attack orchestration code and public APIs rather than purely documentation/tests or trivial formatting, so they do not meet the low‑risk criteria.

This PR requires a manual review before merging.

@Aryansharma28 Aryansharma28 added the "firefighting" label (Urgent fix that bypasses approval check) Apr 14, 2026
@Aryansharma28 Aryansharma28 merged commit e62c292 into main Apr 14, 2026
9 checks passed
@Aryansharma28 Aryansharma28 deleted the feat/red-team-dynamic-techniques branch April 14, 2026 13:07
@Aryansharma28 Aryansharma28 restored the feat/red-team-dynamic-techniques branch April 14, 2026 13:11
Aryansharma28 added a commit that referenced this pull request Apr 14, 2026
Aryansharma28 added a commit that referenced this pull request Apr 14, 2026
This reverts commit e62c292.

Reason: Claude squash-merged #306 without realizing it would auto-close #340
(whose base was #306's branch). Reverting so #306, #340, #341 can land in the
proper stacked order.
@Aryansharma28 Aryansharma28 removed the "firefighting" label (Urgent fix that bypasses approval check) Apr 14, 2026
Aryansharma28 added a commit that referenced this pull request Apr 14, 2026
…#345)

This reverts commit e62c292.

Reason: Claude squash-merged #306 without realizing it would auto-close #340
(whose base was #306's branch). Reverting so #306, #340, #341 can land in the
proper stacked order.