feat: add backtracking on hard refusals for RedTeamAgent (TypeScript)#271
Merged
rogeriochaves merged 8 commits intomainfrom Mar 9, 2026
Merged
feat: add backtracking on hard refusals for RedTeamAgent (TypeScript)#271rogeriochaves merged 8 commits intomainfrom
rogeriochaves merged 8 commits intomainfrom
Conversation
Port the Python RedTeamAgent to TypeScript for @langwatch/scenario,
including per-turn response scoring feedback loop and adaptive attacks.
## New files
- `red-team-strategy.ts` — RedTeamStrategy interface with scoring params
- `crescendo-strategy.ts` — 4-phase Crescendo strategy (warmup → probing →
escalation → direct) with feedback block injection
- `red-team-agent.ts` — RedTeamAgentImpl with attack plan generation,
per-turn response scoring (0-10), score caching, and adaptation hints
- `metaprompt-template.ts` — Rewritten template with external-user framing
and structured phase boundaries
- `red-team.test.ts` — 22 unit tests covering phases, scoring, prompts,
metaprompt rendering, and marathonScript
## Modified files
- `agents/index.ts` — export red-team barrel
- `script/index.ts` — add marathonScript() helper
## Key features
- `redTeamCrescendo()` factory — drop-in red team agent with Crescendo
- `redTeamAgent()` factory — bring your own strategy
- `marathonScript({ turns, checks, finalChecks })` — generate multi-turn
scripts with per-turn checks and a final judge
- Per-turn response scoring via metaprompt model (cached, with JSON
fallback and markdown fence stripping)
- Configurable `scoreResponses`, `metapromptTemperature`
- External-user framing in all prompts (attacker is never the agent/staff)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Math.min(phase.end, totalTurns) with phase.end=Infinity produced totalTurns*totalTurns (e.g. 2500 for 50 turns). Changed to Math.min(phase.end, 1.0) to match Python implementation. Added regression test for direct phase turn range. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
generateText() in AI SDK uses maxOutputTokens, not maxTokens. This caused a DTS build failure in CI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…us refusals Adds detectRefusal() that classifies target responses as hard/soft/none refusals using substring matching. Hard refusals short-circuit the LLM scorer call, saving API calls and providing faster feedback. Opt-out via detectRefusals: false. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Addresses review feedback: detectRefusal() was accidentally public, now private to match Python's _detect_refusal convention. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the scorer confirms the attack objective is met (score >= threshold for N consecutive turns), the marathon script exits early — saving API tokens and preserving a clean signal. Adds successScore (default 9) and successConfirmTurns (default 2) config, checkEarlyExit() method, and instance marathonScript() that auto-inserts exit checks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a target hard-refuses an attack, the refused exchange is removed from conversation history so the target "forgets" it refused. The attacker retries with a different technique from a clean context. Mirrors the Python implementation with the same consensus defaults from PyRIT/DeepTeam/Promptfoo: - Hard backtrack via message removal (max 10) - Backtrack history fed to strategy as "FAILED APPROACHES" block - Score cached as 0 on backtrack (no LLM scorer call) - Marathon script padded by MAX_BACKTRACKS when early exit enabled - Scoring skipped on backtracked turns Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use Array.from() on MapIterator for ES5 compat in checkEarlyExit - Remove unused mockInner, input, and makeInput variables in tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor
|
Automated low-risk assessment This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk.
This PR requires a manual review before merging. |
This was referenced Mar 9, 2026
rogeriochaves
approved these changes
Mar 9, 2026
This was referenced Mar 9, 2026
This was referenced May 2, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes langwatch/langwatch#2043 (TypeScript portion)
MAX_BACKTRACKS(10) when early exit is enabled to compensate for wasted turns.Mirrors the Python implementation (PR #270) with the same consensus defaults from PyRIT, DeepTeam, and Promptfoo Hydra:
Test plan
🤖 Generated with Claude Code