feat: add GOAT strategy with dynamic technique selection for RedTeamAgent by Aryansharma28 · Pull Request #346 · langwatch/scenario

Aryansharma28 · 2026-04-14T13:20:42Z

Re-land of #306 — the original PR was squash-merged prematurely and subsequently reverted via #345. Opening this to restore the GOAT strategy on main through the intended stacked-PR workflow.

Same branch (feat/red-team-dynamic-techniques) and same commits as #306.

Summary

Adds RedTeamAgent.goat() (Python) / redTeamGoat() (TS), implementing Meta's GOAT methodology (ICML 2025) for dynamic per-turn technique selection. 7-technique catalogue, 3 soft progress stages, paper-sourced adaptive attacks. See #306 for the full description.

Stack

This PR is the base of a 3-PR stack:

This PR — base GOAT strategy
refactor(red-team): align GOAT with Meta's paper — drop pre-generated plan and stage hints #340: paper-fidelity refactor (drops pre-generated attack plan + stage hints; stacks on this PR)
feat(red-team): structured attacker output — observation/strategy/reply JSON #341: structured attacker output (observation/strategy/reply JSON; stacks on #340)

Merge bottom-up without --delete-branch until the top of the stack lands, then clean up branches.

Test plan

See #306 test plan — unchanged. 108 JS tests, 156 Python tests passing on the branch.

Context

Original PR: feat: add GOAT strategy with dynamic technique selection for RedTeamAgent #306 (merged 2026-04-14, then reverted)
Revert: revert: premature squash merge of #306 to restore stacked PR workflow #345 (merged 2026-04-14)
Epic: epic: scenarios red teaming langwatch#1713

🤖 Generated with Claude Code

Add GOAT (Generative Offensive Agent Tester) as a separate strategy alongside Crescendo. Based on Meta's GOAT paper (ICML 2025, 97% ASR).

- GoatStrategy with 7-technique catalogue (hypothetical framing, persona modification, refusal suppression, response priming, dual response, topic splitting, authority & social engineering)
- Soft progress stages (early/mid/late) instead of fixed phases
- Dedicated GOAT metaprompt template for adaptive attack planning
- Python: RedTeamAgent.goat(target=..., model=...)
- TypeScript: scenario.redTeamGoat({ target, model })
- Crescendo (.crescendo()) is completely untouched

Closes #2143

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
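For readers who want a concrete picture of the catalogue, here is a minimal sketch that models the seven technique names listed above as an enum and renders them as a numbered list, the kind of text that could be embedded in an attacker prompt. `GoatTechniqueSketch` and `render_catalogue` are hypothetical names invented for illustration, and the numbering simply follows the order in this commit message; the real GoatStrategy may store and order its catalogue differently.

```python
from enum import Enum


class GoatTechniqueSketch(Enum):
    """Hypothetical stand-in for the 7-technique catalogue named in the commit."""

    HYPOTHETICAL_FRAMING = "hypothetical framing"
    PERSONA_MODIFICATION = "persona modification"
    REFUSAL_SUPPRESSION = "refusal suppression"
    RESPONSE_PRIMING = "response priming"
    DUAL_RESPONSE = "dual response"
    TOPIC_SPLITTING = "topic splitting"
    AUTHORITY_SOCIAL_ENGINEERING = "authority & social engineering"


def render_catalogue() -> str:
    """Render the catalogue as a numbered list, one technique per line."""
    return "\n".join(
        f"{index}. {technique.value}"
        for index, technique in enumerate(GoatTechniqueSketch, start=1)
    )


if __name__ == "__main__":
    print(render_catalogue())
```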

…etaprompt from Crescendo phases

marathon_script(turns=...) was required with no default, causing TypeError in all test calls that omit it. Now defaults to self.total_turns.

Also makes _generate_attack_plan strategy-aware: only computes Crescendo phase boundaries when using CrescendoStrategy, removing unnecessary coupling for the GOAT strategy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
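The defaulting fix described in this commit can be sketched as follows. This is an illustration only: `AgentSketch` is a hypothetical class, and the list of strings it returns is a placeholder for whatever the real `marathon_script` produces. Only the "optional `turns`, fall back to `self.total_turns`" behaviour comes from the commit message.

```python
from typing import List, Optional


class AgentSketch:
    """Hypothetical stand-in for the agent; only the defaulting logic matters."""

    def __init__(self, total_turns: int = 30) -> None:
        self.total_turns = total_turns

    def marathon_script(self, turns: Optional[int] = None) -> List[str]:
        # Fall back to the agent-level setting when the caller omits `turns`,
        # so calls without the argument no longer raise TypeError.
        effective_turns = self.total_turns if turns is None else turns
        # Placeholder script steps; the real return type is an assumption.
        return [f"turn {i}" for i in range(1, effective_turns + 1)]
```

Calling `AgentSketch(total_turns=5).marathon_script()` yields five placeholder steps, while an explicit `turns=2` still takes precedence.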

…d test coverage

Python:
- Export GoatStrategy from scenario.__init__ (was missing; CrescendoStrategy was exported but not GoatStrategy)
- Add 25 unit tests for GoatStrategy: stage boundaries, prompt building, factory method defaults
- Fix technique 6 example message (was placeholder "...")

JavaScript:
- Fix: redTeamGoat() now always sets GOAT_METAPROMPT_TEMPLATE (was conditionally falling back to the Crescendo template when attackPlan was supplied)
- Add totalTurns: 30 default to redTeamGoat() to match Python
- Add metapromptTemplate to CrescendoConfig so users can override via both factory APIs
- Fix renderMetapromptTemplate: phase boundary vars only injected for Crescendo (via optional phaseEnds param)
- generateAttackPlan passes phaseEnds only when strategy instanceof CrescendoStrategy
- marathonScript turns param is now optional, defaults to this.totalTurns (matches Python fix)
- Make GoatStrategy.getStage() private (matches Python _get_stage())
- Fix float notation 0.3/0.7 → 0.30/0.70 to match Python

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
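To make the "stage boundaries" idea concrete, here is a rough sketch of mapping a turn to a soft stage. It assumes the 0.30 and 0.70 values mentioned in the float-notation fix are the cutoffs between early, mid and late, applied to the fraction of turns used so far. The commit message does not spell this out, so the function name `soft_stage`, the 1-based turn numbering and the strict `<` comparisons are all assumptions.

```python
def soft_stage(current_turn: int, total_turns: int) -> str:
    """Map a 1-based turn number to 'early', 'mid' or 'late' (hypothetical helper)."""
    if total_turns <= 0:
        raise ValueError("total_turns must be positive")
    progress = current_turn / total_turns
    # Assumed cutoffs: below 0.30 is early, below 0.70 is mid, the rest is late.
    if progress < 0.30:
        return "early"
    if progress < 0.70:
        return "mid"
    # Final branch covers everything from 0.70 upward.
    return "late"
```

With the default of 30 turns this would put the first turns in the early stage, the middle of the conversation in mid, and the final stretch in late, without the hard phase switches that Crescendo uses.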

…param

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… tests

Architecture:
- Add template_variables() to RedTeamStrategy base (Python) and phaseEnds?() to RedTeamStrategy interface (JS), so strategies declare their own template vars
- CrescendoStrategy overrides to return phase boundary turn numbers; GoatStrategy returns nothing. Removes isinstance(CrescendoStrategy) check from orchestrator
- Remove _PHASES import from red_team_agent.py (no longer needed)

JS:
- CrescendoConfig = Omit<RedTeamAgentConfig, "strategy">, which eliminates 13-field duplication across three interfaces
- GoatConfig gets doc comment explaining it is a named hook for future GOAT params
- CrescendoStrategy.getPhase() made private; tests updated to use getPhaseName()
- Add 24 JS unit tests for GoatStrategy (stage boundaries, buildSystemPrompt, phaseEnds, redTeamGoat factory defaults)
- Remove vestigial vi.doMock("ai") that never intercepted calls (module pre-loaded)
- phaseEnds test uses literal [2, 4, 7] instead of re-deriving the formula

Python:
- metaprompt_template falsy check fixed: `or` → `is not None` (matches JS `??`)
- Add "Should not be reached" comment to GoatStrategy fallback (matches Crescendo)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
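The decoupling described under "Architecture" might look roughly like the sketch below: each strategy reports its own template variables and the orchestrator merges them without any isinstance check. All class and function names here are hypothetical. `{phase1_end}` appears elsewhere in this PR, but `phase2_end` and `phase3_end` are assumed names, and the boundaries are passed in as a literal rather than computed because the real formula is not shown here.

```python
from typing import Dict, Tuple


class StrategySketch:
    """Hypothetical base: by default a strategy contributes no template vars."""

    def template_variables(self) -> Dict[str, int]:
        return {}


class CrescendoSketch(StrategySketch):
    def __init__(self, phase_ends: Tuple[int, int, int]) -> None:
        self.phase_ends = phase_ends

    def template_variables(self) -> Dict[str, int]:
        first, second, third = self.phase_ends
        # Placeholder names beyond phase1_end are assumptions.
        return {"phase1_end": first, "phase2_end": second, "phase3_end": third}


class GoatSketch(StrategySketch):
    """Inherits the empty mapping: no phase placeholders for GOAT."""


def render(template: str, strategy: StrategySketch, total_turns: int) -> str:
    # The orchestrator no longer needs to know which strategy it is rendering for.
    return template.format(total_turns=total_turns, **strategy.template_variables())
```

For example, `render("Warm up until turn {phase1_end} of {total_turns}", CrescendoSketch((2, 4, 7)), 10)` fills in both placeholders, while a GOAT template that only uses `{total_turns}` renders with `GoatSketch()` unchanged.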

…ypes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Fix GoatStrategy docstring: remove benchmark-specific "in 5 turns" claim
- Fix get_phase_name base class docstring: strategy-agnostic return value wording
- Wrap .format() in _generate_attack_plan with a helpful ValueError on KeyError (e.g. user passes a Crescendo template to a GOAT agent; previously a silent crash)
- Export GOAT_METAPROMPT_TEMPLATE from Python scenario.__init__ and JS index.ts so users can inspect/extend without importing from internal paths
- Update GoatConfig JSDoc to document inherited options and totalTurns=30 default
- Add Python test: goat() allows overriding metaprompt_template via kwargs
- Add JS test: renderMetapromptTemplate leaves phase placeholders as literals when phaseEnds is omitted (documents silent passthrough behavior)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
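The `.format()` guard from the third bullet can be illustrated with a small sketch. `render_metaprompt` is a hypothetical helper, not the actual `_generate_attack_plan` code, and the error wording is invented. The pattern shown is simply to catch the KeyError raised by `str.format` and re-raise a ValueError that names the missing placeholder.

```python
from typing import Any, Dict


def render_metaprompt(template: str, variables: Dict[str, Any]) -> str:
    """Format a metaprompt, turning a missing placeholder into a clear ValueError."""
    try:
        return template.format(**variables)
    except KeyError as exc:
        missing = exc.args[0]
        # Hypothetical message; the real wording in the library may differ.
        raise ValueError(
            f"Metaprompt template references '{{{missing}}}', which the active "
            "strategy does not provide. Was a template written for another "
            "strategy passed in?"
        ) from exc
```

Passing a Crescendo-style template such as `"Phase 1 ends at turn {phase1_end}"` with only `total_turns` available would then fail with a message that points at `{phase1_end}` instead of a bare KeyError.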

Resolves conflicts between GOAT strategy work and main's red-team additions (injection_probability, AttackTechnique catalogue, marathon_script signature cleanup).

Resolutions:
- Take main's marathon_script signature (drops turns param, uses total_turns) in both Python and TypeScript; supersedes the turns-optional fix.
- Keep HEAD's template_variables() decoupling so GOAT and Crescendo each contribute their own metaprompt placeholders; drop now-unused _PHASES and _marathon_script imports.
- Combine public API exports across both branches: GoatStrategy, GOAT_METAPROMPT_TEMPLATE, AttackTechnique, DEFAULT_TECHNIQUES.
- Add injection_probability and techniques kwargs to RedTeamAgent.goat() for parity with .crescendo().
- Combine test suites: GOAT stage/factory tests alongside main's injection probability and marathon-judge integration tests.

Verified: 156 Python tests pass, 108 JS tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…euse

The `goat()`/`redTeamGoat` factories used `setdefault` / object spread patterns that left an explicit `metaprompt_template=None` (Python) or `metapromptTemplate: undefined` (TypeScript) in place. The constructor then fell back to the Crescendo `_DEFAULT_METAPROMPT_TEMPLATE`, which contains `{phase1_end}` placeholders that GoatStrategy.template_variables() does not provide, so the first attack-plan render dies with KeyError. Force the GOAT default whenever the caller's value is None/undefined.

Also document the silent-stale-plan failure mode: `_attack_plan` is cached on the instance and survives across `scenario.run()` calls. Reusing the same agent across scenarios with different descriptions silently uses the first run's plan. Added `.. note::` blocks to both `goat()` and `crescendo()` Python docstrings and `@remarks` to the TS factories.

Added a warning to `goat()` about combining `injection_probability` with GOAT: the GOAT metaprompt already steers the attacker toward encoding techniques, and post-hoc encoding desyncs H_attacker from what the target saw. Default 0.0 is the safe path.

Verified: 156 Python + 108 JS tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
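The `setdefault` pitfall in the first paragraph is easy to reproduce in isolation. The sketch below uses hypothetical function names and a placeholder template string; it only demonstrates that `dict.setdefault` leaves an explicit `None` untouched, whereas an `is None` check replaces it.

```python
GOAT_TEMPLATE_SKETCH = "GOAT plan for {total_turns} turns"  # placeholder text


def with_setdefault(**kwargs):
    # setdefault only fills in a missing key; an explicit None is kept as-is.
    kwargs.setdefault("metaprompt_template", GOAT_TEMPLATE_SKETCH)
    return kwargs["metaprompt_template"]


def with_forced_default(**kwargs):
    # Treat both "missing" and "explicitly None" as a request for the GOAT default.
    if kwargs.get("metaprompt_template") is None:
        kwargs["metaprompt_template"] = GOAT_TEMPLATE_SKETCH
    return kwargs["metaprompt_template"]
```

Here `with_setdefault(metaprompt_template=None)` returns `None`, which is the value a constructor could then replace with a different strategy's default, while `with_forced_default(metaprompt_template=None)` returns the GOAT placeholder. A caller-supplied template is preserved in both versions.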

github-actions · 2026-04-14T13:21:05Z

Automated low-risk assessment

This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk.

The PR adds a new GOAT red‑team strategy (JS + Python), a GOAT metaprompt template, a redTeamGoat factory, and changes metaprompt rendering and orchestration logic that affect what is sent to LLMs. Because it modifies runtime attack logic and prompt/template handling (i.e., behavior of an integration with language models) rather than only docs/tests/UI, it does not meet the low‑risk criteria.

This PR requires a manual review before merging.

Aryansharma28 and others added 9 commits March 23, 2026 16:29

fix: update renderMetapromptTemplate tests to use explicit phaseEnds …

0364511

…param Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: remove phaseEnds test on GoatStrategy — absence is enforced by t…

a9541ac

…ypes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Aryansharma28 wants to merge 9 commits into main from feat/red-team-dynamic-techniques
