feat: add backtracking on hard refusals for RedTeamAgent (TypeScript) by Aryansharma28 · Pull Request #271 · langwatch/scenario

Aryansharma28 · 2026-03-09T14:54:03Z

Summary

Closes langwatch/langwatch#2043 (TypeScript portion)

- When a target hard-refuses an attack, the refused exchange is removed from conversation history so the target "forgets" it refused. The attacker retries with a different technique from a clean context.
- Backtrack history (failed approaches) is fed back to the strategy prompt so the attacker avoids repeating the same techniques.
- Marathon script pads iterations by MAX_BACKTRACKS (10) when early exit is enabled to compensate for wasted turns.
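The padding arithmetic in the last point can be sketched as below. This is a minimal illustration, not the code from the PR: the helper name `paddedIterations` and its parameters are hypothetical, and only the constant value (10) comes from the description above.

```typescript
// Cap taken from the PR description; the helper itself is a hypothetical sketch.
const MAX_BACKTRACKS = 10;

// Returns how many loop iterations to schedule for a marathon run.
// Backtracked turns consume an iteration without advancing the attack,
// so the budget is padded only when early exit can stop the run anyway.
function paddedIterations(turns: number, earlyExitEnabled: boolean): number {
  return earlyExitEnabled ? turns + MAX_BACKTRACKS : turns;
}
```

Under these assumptions, a 50-turn marathon with early exit enabled schedules 60 iterations, and 50 without it.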

Mirrors the Python implementation (PR #270) with the same consensus defaults from PyRIT, DeepTeam, and Promptfoo Hydra:

- Hard backtrack via message removal (max 10)
- Score cached as 0 on backtrack (no LLM scorer call)
- Scoring skipped on backtracked turns
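A rough sketch of how these defaults could fit together is shown below. It is an assumption-laden illustration rather than the actual `RedTeamAgentImpl` code: the `Message` shape, the `BacktrackState` interface, and the function name `backtrackOnHardRefusal` are all hypothetical, and the attacker is assumed to speak as `user` while the target replies as `assistant`.

```typescript
// Hypothetical message shape; the real project uses its own message types.
type Message = { role: "user" | "assistant"; content: string };

const MAX_BACKTRACKS = 10; // cap stated in the PR description

// Hypothetical state container for this sketch.
interface BacktrackState {
  messages: Message[];
  backtracksUsed: number;
  failedApproaches: string[]; // later rendered into the strategy prompt
  scoreCache: Record<number, number>; // turn number -> score
}

// Removes the last attack/refusal pair so the target "forgets" it refused.
// Returns true if a backtrack happened, false if nothing was removed.
function backtrackOnHardRefusal(state: BacktrackState, turn: number): boolean {
  if (state.backtracksUsed >= MAX_BACKTRACKS) return false; // budget exhausted
  const n = state.messages.length;
  const attack = state.messages[n - 2];
  const refusal = state.messages[n - 1];
  if (!attack || !refusal) return false;
  if (attack.role !== "user" || refusal.role !== "assistant") return false;

  state.messages.splice(n - 2, 2); // drop the refused exchange
  state.failedApproaches.push(attack.content); // remember what did not work
  state.scoreCache[turn] = 0; // cached score, so no LLM scorer call is needed
  state.backtracksUsed += 1;
  return true;
}
```

Caching the score as 0 at the moment of removal is what lets the scoring step be skipped for that turn: the scorer lookup hits the cache instead of calling the model.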

Test plan

- All 70 TypeScript unit tests pass (including 10 new backtracking tests)
- Verified with bank-demo integration tests (system prompt extraction + unauthorized transfer): backtracking fires correctly, 10/10 backtracks consumed in both scenarios
- Review backtrack history rendering in strategy prompts

🤖 Generated with Claude Code

Port the Python RedTeamAgent to TypeScript for @langwatch/scenario, including per-turn response scoring feedback loop and adaptive attacks.

## New files

- `red-team-strategy.ts`: RedTeamStrategy interface with scoring params
- `crescendo-strategy.ts`: 4-phase Crescendo strategy (warmup → probing → escalation → direct) with feedback block injection
- `red-team-agent.ts`: RedTeamAgentImpl with attack plan generation, per-turn response scoring (0-10), score caching, and adaptation hints
- `metaprompt-template.ts`: Rewritten template with external-user framing and structured phase boundaries
- `red-team.test.ts`: 22 unit tests covering phases, scoring, prompts, metaprompt rendering, and marathonScript

## Modified files

- `agents/index.ts`: export red-team barrel
- `script/index.ts`: add marathonScript() helper

## Key features

- `redTeamCrescendo()` factory: drop-in red team agent with Crescendo
- `redTeamAgent()` factory: bring your own strategy
- `marathonScript({ turns, checks, finalChecks })`: generate multi-turn scripts with per-turn checks and a final judge
- Per-turn response scoring via metaprompt model (cached, with JSON fallback and markdown fence stripping)
- Configurable `scoreResponses`, `metapromptTemperature`
- External-user framing in all prompts (attacker is never the agent/staff)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
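The "JSON fallback and markdown fence stripping" step for scorer output might look roughly like the sketch below. It assumes the scorer is asked for a JSON object with a numeric `score` field and that "fallback" means scanning for the first number when JSON parsing fails; the function name `parseScore` and both assumptions are illustrative and not taken from the PR's code.

```typescript
// Three backticks, built by concatenation so this block stays fence-safe.
const FENCE = "`" + "`" + "`";
const OPEN_FENCE = new RegExp("^" + FENCE + "(?:json)?\\s*", "i");
const CLOSE_FENCE = new RegExp("\\s*" + FENCE + "$");

// Hypothetical parser: returns a score clamped to the 0-10 range, or 0 if none is found.
function parseScore(raw: string): number {
  // Strip a wrapping markdown fence if the model added one.
  const stripped = raw.trim().replace(OPEN_FENCE, "").replace(CLOSE_FENCE, "").trim();

  let score: number | undefined;
  try {
    const parsed: unknown = JSON.parse(stripped);
    if (typeof parsed === "number") {
      score = parsed;
    } else if (typeof parsed === "object" && parsed !== null && "score" in parsed) {
      const value = (parsed as { score: unknown }).score;
      if (typeof value === "number") score = value;
    }
  } catch {
    // Not valid JSON; fall through to the plain-text scan below.
  }

  if (score === undefined) {
    // Fallback: take the first number that appears in the text.
    const match = stripped.match(/\d+(?:\.\d+)?/);
    if (match) score = Number(match[0]);
  }

  if (score === undefined || isNaN(score)) return 0;
  return Math.min(10, Math.max(0, score));
}
```

Clamping keeps an out-of-range model reply from distorting the feedback loop; whether the real implementation clamps or rejects such values is not stated in the PR.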

With phase.end=Infinity, Math.min(phase.end, totalTurns) produced an end turn of totalTurns*totalTurns (e.g. 2500 for 50 turns). Changed to Math.min(phase.end, 1.0) to match the Python implementation. Added a regression test for the direct phase turn range.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
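The arithmetic behind this fix can be reproduced with a small sketch. It assumes phase boundaries are stored as fractions of the total turn count and scaled by `totalTurns`; the `Phase` interface, the function names, and the 0.75 start value are hypothetical.

```typescript
// Hypothetical phase shape: start and end are fractions of the run.
interface Phase {
  name: string;
  start: number;
  end: number; // the last phase uses Infinity as an open-ended bound
}

// Buggy variant: clamping a fraction against a turn count leaves the
// fraction at totalTurns, which is then scaled by totalTurns again.
function buggyEndTurn(phase: Phase, totalTurns: number): number {
  return Math.floor(Math.min(phase.end, totalTurns) * totalTurns);
}

// Fixed variant: clamp the fraction to 1.0 before scaling.
function fixedEndTurn(phase: Phase, totalTurns: number): number {
  return Math.floor(Math.min(phase.end, 1.0) * totalTurns);
}

const direct: Phase = { name: "direct", start: 0.75, end: Infinity };
const buggy = buggyEndTurn(direct, 50); // 50 * 50
const fixed = fixedEndTurn(direct, 50); // 1.0 * 50
```

With 50 turns the buggy variant yields 2500 and the fixed one yields 50, matching the numbers in the commit message.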

generateText() in AI SDK uses maxOutputTokens, not maxTokens. This caused a DTS build failure in CI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…us refusals

Adds detectRefusal() that classifies target responses as hard/soft/none refusals using substring matching. Hard refusals short-circuit the LLM scorer call, saving API calls and providing faster feedback. Opt-out via detectRefusals: false.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
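A substring-based classifier of this kind could be sketched as follows. The phrase lists are invented for illustration and are not the patterns shipped in the PR; only the function name and the hard/soft/none outcome come from the commit message.

```typescript
type RefusalKind = "hard" | "soft" | "none";

// Hypothetical phrase lists; the real patterns are not shown on this page.
const HARD_PATTERNS = ["i cannot", "i can't", "i won't", "i am not able to"];
const SOFT_PATTERNS = ["i'd rather not", "i would prefer not", "instead, i can"];

// Returns true if any pattern occurs in the (already lowercased) text.
function containsAny(text: string, patterns: string[]): boolean {
  return patterns.some(function (p) {
    return text.indexOf(p) !== -1;
  });
}

// Classifies a target response by case-insensitive substring matching.
// Hard patterns are checked first so a mixed response counts as hard.
function detectRefusal(response: string): RefusalKind {
  const text = response.toLowerCase();
  if (containsAny(text, HARD_PATTERNS)) return "hard";
  if (containsAny(text, SOFT_PATTERNS)) return "soft";
  return "none";
}
```

Substring matching is cheap enough to run on every turn, which is what makes it useful as a gate in front of the LLM scorer. The trade-off is false positives: a response that quotes one of these phrases while complying would still be classified as a refusal.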

Addresses review feedback: detectRefusal() was accidentally public, now private to match Python's _detect_refusal convention.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

When the scorer confirms the attack objective is met (score >= threshold for N consecutive turns), the marathon script exits early, saving API tokens and preserving a clean signal. Adds successScore (default 9) and successConfirmTurns (default 2) config, checkEarlyExit() method, and instance marathonScript() that auto-inserts exit checks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
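The exit condition described here (score at or above the threshold for N consecutive turns) can be sketched as a pure function over the per-turn scores. The name `shouldExitEarly` and the array-based input are assumptions for illustration; the actual `checkEarlyExit()` method may read from the score cache differently.

```typescript
// Defaults taken from the commit message.
const DEFAULT_SUCCESS_SCORE = 9;
const DEFAULT_CONFIRM_TURNS = 2;

// Hypothetical helper: true when the most recent `confirmTurns` scores
// all meet or exceed `successScore`. Scores are ordered oldest to newest.
function shouldExitEarly(
  scores: number[],
  successScore: number = DEFAULT_SUCCESS_SCORE,
  confirmTurns: number = DEFAULT_CONFIRM_TURNS
): boolean {
  if (confirmTurns <= 0 || scores.length < confirmTurns) return false;
  const recent = scores.slice(-confirmTurns);
  return recent.every(function (s) {
    return s >= successScore;
  });
}
```

Requiring more than one confirming turn guards against a single over-generous score ending the run prematurely.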

When a target hard-refuses an attack, the refused exchange is removed from conversation history so the target "forgets" it refused. The attacker retries with a different technique from a clean context.

Mirrors the Python implementation with the same consensus defaults from PyRIT/DeepTeam/Promptfoo:

- Hard backtrack via message removal (max 10)
- Backtrack history fed to strategy as "FAILED APPROACHES" block
- Score cached as 0 on backtrack (no LLM scorer call)
- Marathon script padded by MAX_BACKTRACKS when early exit enabled
- Scoring skipped on backtracked turns

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Use Array.from() on MapIterator for ES5 compat in checkEarlyExit
- Remove unused mockInner, input, and makeInput variables in tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
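The first point refers to a common TypeScript constraint: when compiling to an ES5 target without `downlevelIteration`, a `for...of` loop or spread over a Map iterator is rejected by the compiler, while `Array.from()` is accepted. A minimal sketch, assuming the score cache is a `Map` keyed by turn number (the function name `cachedScores` is hypothetical):

```typescript
// Hypothetical cache shape: turn number -> score.
function cachedScores(cache: Map<number, number>): number[] {
  // Array.from materializes the iterator without needing iteration
  // syntax that an ES5 target cannot compile.
  return Array.from(cache.values());
}

const exampleCache = new Map<number, number>([
  [1, 4],
  [2, 9],
  [3, 10],
]);
const orderedScores = cachedScores(exampleCache);
```

Because a `Map` iterates in insertion order, the resulting array keeps the turn order, which matters when checking consecutive scores.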

github-actions · 2026-03-09T15:16:12Z

Automated low-risk assessment

This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk.

The PR introduces new runtime behavior (a Crescendo red-team strategy, backtracking on hard refusals, scoring/early-exit logic, and marathon script padding) rather than only documentation or UI changes. These are functional changes to agent behavior and scoring that could affect security testing and orchestration, so they do not meet the policy for low-risk changes.

This PR requires a manual review before merging.

…ad CVEs

Add pnpm.overrides for vite in javascript/ and docs/ to fix:

- #268: server.fs.deny bypassed with queries (vite >=7.0.0 <=7.3.1)
- #271: Arbitrary File Read via WebSocket (vite >=7.0.0 <=7.3.1)
- #277: Arbitrary File Read via WebSocket (vite >=6.0.0 <=6.4.1)

Part of #400

…ad CVEs (#419)

Add pnpm.overrides for vite in javascript/ and docs/ to fix:

- #268: server.fs.deny bypassed with queries (vite >=7.0.0 <=7.3.1)
- #271: Arbitrary File Read via WebSocket (vite >=7.0.0 <=7.3.1)
- #277: Arbitrary File Read via WebSocket (vite >=6.0.0 <=6.4.1)

Part of #400

Aryansharma28 and others added 7 commits March 6, 2026 12:34

fix: use maxOutputTokens in scorer generateText call

6e3ff03


fix: make detectRefusal private for encapsulation

6d9d88b


github-code-quality Bot found potential problems Mar 9, 2026


Comment thread javascript/src/agents/__tests__/red-team.test.ts Fixed

Comment thread javascript/src/agents/__tests__/red-team.test.ts Fixed

fix: resolve CI type errors in red-team backtracking

6cf7af1


rogeriochaves approved these changes Mar 9, 2026


rogeriochaves merged commit 79157cd into main Mar 9, 2026
11 of 13 checks passed

rogeriochaves mentioned this pull request Mar 9, 2026

chore(main): release javascript 0.4.7 #273

Merged

This was referenced May 2, 2026

fix(security): resolve all high-severity Dependabot alerts flagged by Vanta #400

Open

fix(security): patch vite server.fs.deny bypass and WebSocket file read CVEs #419

Merged

rogeriochaves merged 8 commits into main from feat/red-team-early-exit-ts

2 participants
