feat: add backtracking on hard refusals for RedTeamAgent (TypeScript)#271

Merged
rogeriochaves merged 8 commits into main from feat/red-team-early-exit-ts
Mar 9, 2026

@Aryansharma28
Contributor

Summary

Closes langwatch/langwatch#2043 (TypeScript portion)

  • When a target hard-refuses an attack, the refused exchange is removed from conversation history so the target "forgets" it refused. The attacker retries with a different technique from a clean context.
  • Backtrack history (failed approaches) is fed back to the strategy prompt so the attacker avoids repeating the same techniques.
  • Marathon script pads iterations by MAX_BACKTRACKS (10) when early exit is enabled to compensate for wasted turns.

Mirrors the Python implementation (PR #270) with the same consensus defaults from PyRIT, DeepTeam, and Promptfoo Hydra:

  • Hard backtrack via message removal (max 10)
  • Score cached as 0 on backtrack (no LLM scorer call)
  • Scoring skipped on backtracked turns
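
The backtracking behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Message` shape, the `BacktrackState` record, and the `backtrackOnHardRefusal` helper are hypothetical names chosen for the example. Only the cap of 10, the removal of the refused exchange, and the cached score of 0 come from the description.

```typescript
// Hypothetical message shape; the real @langwatch/scenario types may differ.
type Message = { role: "user" | "assistant"; content: string };

const MAX_BACKTRACKS = 10; // cap stated in the PR summary

// Hypothetical bookkeeping record for this sketch.
interface BacktrackState {
  backtracksUsed: number;
  failedApproaches: string[]; // later fed back to the strategy prompt
  scoreCache: Map<number, number>; // turn number -> score
}

// Drops the last attack/refusal pair so the target no longer sees its own
// refusal. Returns the history unchanged once the backtrack budget is spent
// or when the tail of the history is not a user/assistant pair.
function backtrackOnHardRefusal(
  history: Message[],
  turn: number,
  state: BacktrackState,
): Message[] {
  if (state.backtracksUsed >= MAX_BACKTRACKS) return history;

  const refusal = history[history.length - 1];
  const attack = history[history.length - 2];
  if (!refusal || !attack) return history;
  if (refusal.role !== "assistant" || attack.role !== "user") return history;

  state.backtracksUsed += 1;
  state.failedApproaches.push(attack.content); // remember what did not work
  state.scoreCache.set(turn, 0); // cached as 0, so no LLM scorer call is needed
  return history.slice(0, -2); // copy without the refused exchange
}
```

Returning a new array instead of mutating `history` is a choice made for the sketch; it keeps the original transcript available for debugging.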

Test plan

  • All 70 TypeScript unit tests pass (including 10 new backtracking tests)
  • Verified with bank-demo integration tests (system prompt extraction + unauthorized transfer) — backtracking fires correctly, 10/10 backtracks consumed in both scenarios
  • Review backtrack history rendering in strategy prompts

🤖 Generated with Claude Code

Aryansharma28 and others added 7 commits March 6, 2026 12:34
Port the Python RedTeamAgent to TypeScript for @langwatch/scenario,
including per-turn response scoring feedback loop and adaptive attacks.

## New files
- `red-team-strategy.ts` — RedTeamStrategy interface with scoring params
- `crescendo-strategy.ts` — 4-phase Crescendo strategy (warmup → probing →
  escalation → direct) with feedback block injection
- `red-team-agent.ts` — RedTeamAgentImpl with attack plan generation,
  per-turn response scoring (0-10), score caching, and adaptation hints
- `metaprompt-template.ts` — Rewritten template with external-user framing
  and structured phase boundaries
- `red-team.test.ts` — 22 unit tests covering phases, scoring, prompts,
  metaprompt rendering, and marathonScript

## Modified files
- `agents/index.ts` — export red-team barrel
- `script/index.ts` — add marathonScript() helper

## Key features
- `redTeamCrescendo()` factory — drop-in red team agent with Crescendo
- `redTeamAgent()` factory — bring your own strategy
- `marathonScript({ turns, checks, finalChecks })` — generate multi-turn
  scripts with per-turn checks and a final judge
- Per-turn response scoring via metaprompt model (cached, with JSON
  fallback and markdown fence stripping)
- Configurable `scoreResponses`, `metapromptTemperature`
- External-user framing in all prompts (attacker is never the agent/staff)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
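
To illustrate the "JSON fallback and markdown fence stripping" mentioned in the key features, here is a small sketch of how a scorer reply might be parsed. The `parseScore` name, the `{"score": n}` reply shape, and the first-number fallback are assumptions made for this example; the PR does not show the actual parsing code.

```typescript
// Built programmatically so the fence marker never appears literally here.
const FENCE = "`".repeat(3);
const OPENING_FENCE = new RegExp("^" + FENCE + "[a-zA-Z]*\\s*");
const CLOSING_FENCE = new RegExp("\\s*" + FENCE + "$");

// Scores in this PR are on a 0-10 scale.
const clampScore = (n: number): number => Math.min(10, Math.max(0, n));

// Hypothetical parser: tries JSON first, then falls back to the first number
// found in the text. Returns null when nothing usable is present.
function parseScore(raw: string): number | null {
  const stripped = raw
    .trim()
    .replace(OPENING_FENCE, "")
    .replace(CLOSING_FENCE, "")
    .trim();

  try {
    const parsed: unknown = JSON.parse(stripped);
    if (typeof parsed === "object" && parsed !== null && "score" in parsed) {
      const value = Number((parsed as { score: unknown }).score);
      if (Number.isFinite(value)) return clampScore(value);
    }
  } catch {
    // Not valid JSON; fall through to the plain-text fallback.
  }

  const match = stripped.match(/\d+(?:\.\d+)?/);
  return match ? clampScore(Number(match[0])) : null;
}
```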
With phase.end=Infinity, Math.min(phase.end, totalTurns) evaluated to
totalTurns, so the computed end turn became totalTurns*totalTurns
(e.g. 2500 for 50 turns). Changed to Math.min(phase.end, 1.0) to match
the Python implementation. Added a regression test for the direct phase
turn range.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
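
The arithmetic behind this fix can be reproduced in a few lines. The sketch assumes that phase boundaries are stored as fractions of the run and that the final phase uses `Infinity` as its end; the `Phase` shape, the 0.75 start value, and the `phaseTurnRange` helper are hypothetical and only serve to show why clamping to 1.0 is the right bound.

```typescript
// Hypothetical phase record: start and end are fractions of the whole run.
interface Phase {
  name: string;
  start: number;
  end: number; // the last phase is open-ended, modelled here as Infinity
}

// Buggy variant: clamps the fraction against a turn count, then scales it.
function buggyLastTurn(phase: Phase, totalTurns: number): number {
  return Math.floor(Math.min(phase.end, totalTurns) * totalTurns);
}

// Fixed variant: clamps the fraction to 1.0 before scaling to turns.
function phaseTurnRange(phase: Phase, totalTurns: number): [number, number] {
  const first = Math.max(1, Math.floor(phase.start * totalTurns) + 1);
  const last = Math.floor(Math.min(phase.end, 1.0) * totalTurns);
  return [first, last];
}

const direct: Phase = { name: "direct", start: 0.75, end: Infinity };

console.log(buggyLastTurn(direct, 50)); // 2500, far past the end of the run
console.log(phaseTurnRange(direct, 50)); // [38, 50]
```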
generateText() in AI SDK uses maxOutputTokens, not maxTokens.
This caused a DTS build failure in CI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…us refusals

Adds detectRefusal() that classifies target responses as hard/soft/none refusals
using substring matching. Hard refusals short-circuit the LLM scorer call, saving
API calls and providing faster feedback. Opt-out via detectRefusals: false.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
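
A substring-based classifier of the kind this commit describes might look like the sketch below. The phrase lists are invented for illustration and are not the ones shipped in the PR. Returning 0 for a hard refusal is likewise an assumption, and `callScorer` is a synchronous stand-in for what would be an asynchronous LLM call in practice.

```typescript
type RefusalKind = "hard" | "soft" | "none";

// Illustrative phrase lists only; the real lists are not shown in this PR.
const HARD_PATTERNS = ["i cannot", "i can't", "i won't", "i'm not able to"];
const SOFT_PATTERNS = ["i'd rather not", "i'm not sure i should"];

function detectRefusal(response: string): RefusalKind {
  // Normalise case and curly apostrophes so the substrings match reliably.
  const text = response.toLowerCase().replace(/\u2019/g, "'");
  if (HARD_PATTERNS.some((p) => text.includes(p))) return "hard";
  if (SOFT_PATTERNS.some((p) => text.includes(p))) return "soft";
  return "none";
}

// Skips the scorer entirely on a hard refusal unless detection is disabled.
function scoreWithShortCircuit(
  response: string,
  callScorer: (response: string) => number, // stand-in for the LLM scorer
  detectRefusals = true,
): number {
  if (detectRefusals && detectRefusal(response) === "hard") return 0;
  return callScorer(response);
}
```

Substring matching is cheap but blunt: a response such as "I cannot overstate how happy I am to help" would be misread as a hard refusal, which is one reason an opt-out like `detectRefusals: false` is useful.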
Addresses review feedback: detectRefusal() was accidentally public,
now private to match Python's _detect_refusal convention.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the scorer confirms the attack objective is met (score >= threshold
for N consecutive turns), the marathon script exits early — saving API
tokens and preserving a clean signal. Adds successScore (default 9) and
successConfirmTurns (default 2) config, checkEarlyExit() method, and
instance marathonScript() that auto-inserts exit checks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
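
The early-exit rule (score at or above the threshold for N consecutive turns) can be sketched as below. The defaults of 9 and 2 come from the commit message; the `Map<number, number>` score cache and the function signature are assumptions for this example and may not match the real `checkEarlyExit()` method.

```typescript
// Returns true when the most recent `successConfirmTurns` scored turns are
// consecutive and each reached `successScore`.
function checkEarlyExit(
  scores: Map<number, number>, // hypothetical cache: turn number -> score
  successScore = 9,
  successConfirmTurns = 2,
): boolean {
  if (successConfirmTurns < 1) return false;

  // Array.from keeps this compilable without iterating a MapIterator directly.
  const ordered = Array.from(scores.entries()).sort((a, b) => a[0] - b[0]);
  if (ordered.length < successConfirmTurns) return false;

  const tail = ordered.slice(-successConfirmTurns);
  for (let i = 0; i < tail.length; i++) {
    const [turn, score] = tail[i];
    if (score < successScore) return false;
    // A gap in turn numbers means the successes were not back to back.
    if (i > 0 && turn !== tail[i - 1][0] + 1) return false;
  }
  return true;
}
```

Requiring more than one confirming turn guards against a single over-generous score ending the run prematurely.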
When a target hard-refuses an attack, the refused exchange is removed from
conversation history so the target "forgets" it refused. The attacker retries
with a different technique from a clean context. Mirrors the Python
implementation with the same consensus defaults from PyRIT/DeepTeam/Promptfoo:

- Hard backtrack via message removal (max 10)
- Backtrack history fed to strategy as "FAILED APPROACHES" block
- Score cached as 0 on backtrack (no LLM scorer call)
- Marathon script padded by MAX_BACKTRACKS when early exit enabled
- Scoring skipped on backtracked turns

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
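
Two of the points in this commit, the "FAILED APPROACHES" block and the marathon script padding, are simple enough to sketch directly. The exact wording and layout of the block are not shown in the PR, so `renderFailedApproaches` below is a guess at the general shape, and `paddedTurns` is a hypothetical helper name.

```typescript
const MAX_BACKTRACKS = 10; // value stated in the PR description

// Renders previously refused attempts as a numbered block for the strategy
// prompt. Returns an empty string when nothing has been refused yet.
function renderFailedApproaches(failed: string[]): string {
  if (failed.length === 0) return "";
  const lines = failed.map((approach, i) => `${i + 1}. ${approach}`);
  return ["FAILED APPROACHES (do not repeat these):", ...lines].join("\n");
}

// Backtracked turns consume iterations without advancing the attack, so the
// script is padded by the backtrack budget when early exit is enabled.
function paddedTurns(turns: number, earlyExitEnabled: boolean): number {
  return earlyExitEnabled ? turns + MAX_BACKTRACKS : turns;
}
```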
Comment thread javascript/src/agents/__tests__/red-team.test.ts Fixed
Comment thread javascript/src/agents/__tests__/red-team.test.ts Fixed
- Use Array.from() on MapIterator for ES5 compat in checkEarlyExit
- Remove unused mockInner, input, and makeInput variables in tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Contributor

github-actions Bot commented Mar 9, 2026

Automated low-risk assessment

This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk.

The PR introduces new runtime behavior (a Crescendo red-team strategy, backtracking on hard refusals, scoring/early-exit logic, and marathon script padding) rather than only documentation or UI changes. These are functional changes to agent behavior and scoring that could affect security testing and orchestration, so they do not meet the policy for low-risk changes.

This PR requires a manual review before merging.

@rogeriochaves rogeriochaves merged commit 79157cd into main Mar 9, 2026
11 of 13 checks passed
sergioestebance added a commit that referenced this pull request May 2, 2026
…ad CVEs

Add pnpm.overrides for vite in javascript/ and docs/ to fix:
- #268: server.fs.deny bypassed with queries (vite >=7.0.0 <=7.3.1)
- #271: Arbitrary File Read via WebSocket (vite >=7.0.0 <=7.3.1)
- #277: Arbitrary File Read via WebSocket (vite >=6.0.0 <=6.4.1)

Part of #400
sergioestebance added a commit that referenced this pull request May 2, 2026
…ad CVEs (#419)

Add pnpm.overrides for vite in javascript/ and docs/ to fix:
- #268: server.fs.deny bypassed with queries (vite >=7.0.0 <=7.3.1)
- #271: Arbitrary File Read via WebSocket (vite >=7.0.0 <=7.3.1)
- #277: Arbitrary File Read via WebSocket (vite >=6.0.0 <=6.4.1)

Part of #400
Development

Successfully merging this pull request may close these issues.

red teaming: backtracking on hard refusals

2 participants