feat: add RedTeamAgent with Crescendo strategy for TypeScript #255
Aryansharma28 wants to merge 3 commits into main
Port the Python RedTeamAgent to TypeScript for @langwatch/scenario,
including per-turn response scoring feedback loop and adaptive attacks.
## New files
- `red-team-strategy.ts` — RedTeamStrategy interface with scoring params
- `crescendo-strategy.ts` — 4-phase Crescendo strategy (warmup → probing →
escalation → direct) with feedback block injection
- `red-team-agent.ts` — RedTeamAgentImpl with attack plan generation,
per-turn response scoring (0-10), score caching, and adaptation hints
- `metaprompt-template.ts` — Rewritten template with external-user framing
and structured phase boundaries
- `red-team.test.ts` — 22 unit tests covering phases, scoring, prompts,
metaprompt rendering, and marathonScript
## Modified files
- `agents/index.ts` — export red-team barrel
- `script/index.ts` — add marathonScript() helper
## Key features
- `redTeamCrescendo()` factory — drop-in red team agent with Crescendo
- `redTeamAgent()` factory — bring your own strategy
- `marathonScript({ turns, checks, finalChecks })` — generate multi-turn
scripts with per-turn checks and a final judge
- Per-turn response scoring via metaprompt model (cached, with JSON
fallback and markdown fence stripping)
- Configurable `scoreResponses`, `metapromptTemperature`
- External-user framing in all prompts (attacker is never the agent/staff)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
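The per-turn scoring described in the key features above (cached, with JSON fallback, markdown fence stripping, and a 0-10 range) could be sketched roughly as follows. This is a minimal illustration, not the code in `red-team-agent.ts`: the `score`/`hint` field names, the fallback score of 0, and the helper names `parseTurnScore` and `scoreTurn` are all assumptions, and the real scorer calls a model asynchronously rather than taking a string producer.

```typescript
// Hypothetical shape of one scoring result (field names are assumptions).
interface TurnScore {
  score: number; // clamped to the 0-10 range
  hint: string;  // one-sentence adaptation hint
}

// Build the fence marker programmatically so this sample stays readable.
const FENCE = "`".repeat(3);
const OPENING_FENCE = new RegExp("^" + FENCE + "[a-zA-Z]*\\s*");
const CLOSING_FENCE = new RegExp("\\s*" + FENCE + "$");

// Remove a leading fence (with optional language tag) and a trailing fence.
function stripMarkdownFences(raw: string): string {
  return raw.trim().replace(OPENING_FENCE, "").replace(CLOSING_FENCE, "").trim();
}

// Parse the model output; fall back to a neutral result on malformed JSON.
// The fallback value (score 0, empty hint) is an assumption of this sketch.
function parseTurnScore(raw: string): TurnScore {
  const fallback: TurnScore = { score: 0, hint: "" };
  try {
    const parsed: unknown = JSON.parse(stripMarkdownFences(raw));
    if (typeof parsed !== "object" || parsed === null) return fallback;
    const { score, hint } = parsed as { score?: unknown; hint?: unknown };
    if (typeof score !== "number" || !Number.isFinite(score)) return fallback;
    return {
      score: Math.min(10, Math.max(0, score)), // clamp to 0-10
      hint: typeof hint === "string" ? hint : "",
    };
  } catch {
    return fallback;
  }
}

// Cache results per turn so a turn is only scored once.
const scoreCache = new Map<number, TurnScore>();

function scoreTurn(turn: number, produceRaw: () => string): TurnScore {
  const cached = scoreCache.get(turn);
  if (cached !== undefined) return cached;
  const result = parseTurnScore(produceRaw());
  scoreCache.set(turn, result);
  return result;
}
```

Keying the cache by turn number means a repeated request for the same turn does not trigger a second scoring call.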
Automated low-risk assessment
This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk.
This PR requires a manual review before merging.
`Math.min(phase.end, totalTurns)` with `phase.end=Infinity` produced `totalTurns*totalTurns` (e.g. 2500 for 50 turns). Changed to `Math.min(phase.end, 1.0)` to match the Python implementation. Added a regression test for the direct phase turn range.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
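The arithmetic behind this fix can be reproduced in isolation. The sketch below assumes that `phase.end` is a fraction of the run that gets multiplied by `totalTurns`, which is what the 2500-for-50-turns figure implies. The `Phase` shape, the `0.75` start value, the use of `Math.floor`, and the function names are hypothetical and not taken from `crescendo-strategy.ts`.

```typescript
// Hypothetical phase record: start/end are fractions of the total run.
interface Phase {
  name: string;
  start: number;
  end: number; // the last phase is open-ended, so end is Infinity
}

// Assumed boundary for illustration only.
const directPhase: Phase = { name: "direct", start: 0.75, end: Infinity };

// Before the fix: clamping the fraction against totalTurns lets
// Infinity collapse to totalTurns, which is then multiplied again.
function endTurnBeforeFix(phase: Phase, totalTurns: number): number {
  return Math.floor(Math.min(phase.end, totalTurns) * totalTurns);
}

// After the fix: clamp the fraction to 1.0 before scaling to turns.
function endTurnAfterFix(phase: Phase, totalTurns: number): number {
  return Math.floor(Math.min(phase.end, 1.0) * totalTurns);
}

console.log(endTurnBeforeFix(directPhase, 50)); // 2500
console.log(endTurnAfterFix(directPhase, 50));  // 50
```

Clamping in fraction space keeps the upper bound of the last phase equal to `totalTurns` regardless of run length.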
`generateText()` in the AI SDK uses `maxOutputTokens`, not `maxTokens`. Passing `maxTokens` caused a DTS build failure in CI.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Superseded by #271, which includes all TS red-team work (base + refusal detection + early exit + backtracking). Merged to main.
Summary
Port of the Python `RedTeamAgent` with Crescendo strategy to the `@langwatch/scenario` TypeScript package, including the per-turn response scoring feedback loop that makes attacks adaptive and reactive.
Closes langwatch/langwatch#1999
What's included
New exports
- `redTeamCrescendo(config)`: Factory to create a red-team agent using the 4-phase Crescendo strategy (warmup → probing → escalation → direct). Drop-in compatible with `scenario.run()`.
- `redTeamAgent(config)`: Factory for custom strategies (bring your own `RedTeamStrategy` implementation).
- `marathonScript({ turns, checks, finalChecks })`: Helper to generate multi-turn scripts with per-turn deterministic checks and a final judge evaluation.
Per-turn response scoring (Crescendo feedback loop)
After each turn, the metaprompt model scores the target agent's response on a 0-10 scale and produces a one-sentence adaptation hint. This score + hint is injected into the attacker's system prompt for the next turn, enabling:
Score scale:
Scoring is cached per turn, handles malformed JSON gracefully, strips markdown fences, and clamps to 0-10. Can be disabled with `scoreResponses: false`.
Crescendo strategy phases
Metaprompt template
Rewritten from the Python version with:
- `{phase1End}`, `{phase2End}`, `{phase3End}` placeholders computed from `totalTurns`
Files
- `agents/red-team/red-team-strategy.ts`: `RedTeamStrategy` interface with `buildSystemPrompt()` + `getPhaseName()`
- `agents/red-team/crescendo-strategy.ts`: `CrescendoStrategy`, 4-phase implementation with feedback block
- `agents/red-team/red-team-agent.ts`: `RedTeamAgentImpl`, attack plan generation, scoring, delegation
- `agents/red-team/metaprompt-template.ts`: `renderMetapromptTemplate()`
- `agents/red-team/index.ts`
- `agents/index.ts`: `export * from "./red-team"`
- `script/index.ts`: `marathonScript()`
- `agents/__tests__/red-team.test.ts`
Configuration
Usage example
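As a rough illustration of the step sequence that `marathonScript({ turns, checks, finalChecks })` is described as generating (per-turn checks followed by a final judge), here is a self-contained sketch. It uses stand-in step records instead of the package's real script step functions, and the exact ordering of steps within a turn is an assumption. `marathonScriptSketch`, `Step`, and `Check` are hypothetical names, not exports of `@langwatch/scenario`.

```typescript
// Stand-in types for illustration; the real package uses its own step type.
type Check = (state: unknown) => void;

type Step =
  | { kind: "user" }
  | { kind: "agent" }
  | { kind: "check"; run: Check }
  | { kind: "judge" };

interface MarathonOptions {
  turns: number;
  checks?: Check[];      // run after every turn
  finalChecks?: Check[]; // run once, after the last turn
}

// Assumed layout: (user, agent, per-turn checks) repeated `turns` times,
// then the final checks, then a single judge step.
function marathonScriptSketch({
  turns,
  checks = [],
  finalChecks = [],
}: MarathonOptions): Step[] {
  const steps: Step[] = [];
  for (let turn = 0; turn < turns; turn++) {
    steps.push({ kind: "user" }, { kind: "agent" });
    for (const run of checks) steps.push({ kind: "check", run });
  }
  for (const run of finalChecks) steps.push({ kind: "check", run });
  steps.push({ kind: "judge" });
  return steps;
}

// Example: 3 turns, one per-turn check, two final checks.
const noop: Check = () => {};
const script = marathonScriptSketch({
  turns: 3,
  checks: [noop],
  finalChecks: [noop, noop],
});
console.log(script.map((step) => step.kind).join(" "));
```

Generating the script from a turn count keeps long red-team runs declarative: only `turns` changes between a short smoke test and a long marathon.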
Test plan
- `@langwatch/scenario` tests
- `parentSpanId` type issue is resolved
🤖 Generated with Claude Code