feat: pattern-based refusal detection for RedTeamAgent (TS)#256
feat: pattern-based refusal detection for RedTeamAgent (TS)#256Aryansharma28 wants to merge 2 commits intofeat/red-teaming-tsfrom
Conversation
…us refusals Adds detectRefusal() that classifies target responses as hard/soft/none refusals using substring matching. Hard refusals short-circuit the LLM scorer call, saving API calls and providing faster feedback. Opt-out via detectRefusals: false. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Automated low-risk assessment This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk.
This PR requires a manual review before merging. |
What's changingThis PR adds a fast pattern-matching layer before the LLM scorer in the RedTeamAgent's Crescendo feedback loop. Previously, every turn after the first made an LLM API call to rate the target's response 0-10 — even when the response was an obvious hard refusal like "I cannot help with that." Now, 11 hard refusal patterns (e.g. "I cannot", "violates my policy", "I must decline") are detected via simple substring matching. When a hard refusal is found, we immediately set Soft refusals ("I'd prefer not to", "let me redirect") and normal responses still go through the LLM scorer as before, since they require nuanced judgment. Impact for users
Impact on attack qualityThe adaptation hint changes from whatever the LLM scorer would have said to a fixed |
Baseline comparisonBefore (baseline — current
After (this PR):
Net effect: Same attack quality, lower cost, faster execution. The Crescendo strategy doesn't lose anything because hard refusals were always score-0 anyway — we're just detecting that faster. |
How the baseline scoring flow works todayIn the current Crescendo implementation (before this PR), the feedback loop works like this on every turn after turn 1:
The problem: step 2 is an LLM API call every single turn. When the target says "I cannot help with that", the LLM scorer predictably returns This PR short-circuits step 2 for obvious hard refusals, going straight to step 4 with |
Turn flow with refusal detectionThe key branch is after
|
Addresses review feedback: detectRefusal() was accidentally public, now private to match Python's _detect_refusal convention. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Automated low-risk assessment This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk.
This PR requires a manual review before merging. |
|
Superseded by #271 which includes all TS red-team work. Merged to main. |
Summary
detectRefusal()method that classifies target responses ashard,soft, ornoneusing case-insensitive substring matching against 11 hard and 5 soft refusal patternsdetectRefusalsconfig option (defaulttrue) for opt-outgetLastAssistantContent()as reusable private methodCloses langwatch/langwatch#2044 (TypeScript half)
Parent: Crescendo improvements (langwatch/langwatch#2041)
EPIC: Scenarios Red Teaming (langwatch/langwatch#1713)
Test plan
npx vitest run src/agents/__tests__/red-team.test.ts)🤖 Generated with Claude Code