Multi-turn Jailbreak guardrail #42
steven10a
commented
Nov 17, 2025
- Updated the Jailbreak guardrail to use conversation history to improve performance on multi-turn jailbreaks
- Optimized the system prompt for better performance
- Updated the eval tool to accept a multi-turn flag when evaluating jailbreaks with multi-turn conversations
- Updated docs and tests
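As a rough illustration of what turn-by-turn evaluation of a conversation-aware guardrail could look like, here is a minimal sketch. Every name in it (`Turn`, `ConversationCheck`, `evaluateIncrementally`, `toyCheck`) is hypothetical and not taken from this PR, and the check is synchronous for brevity, whereas a real guardrail would presumably make an asynchronous LLM call.

```typescript
// Hypothetical types for illustration only; not the project's real API.
type Role = 'user' | 'assistant';

interface Turn {
  role: Role;
  content: string;
}

interface TurnResult {
  turnIndex: number;
  tripwireTriggered: boolean;
}

// Stand-in for a conversation-aware guardrail: it sees the latest user
// message plus every turn that came before it.
type ConversationCheck = (latest: string, history: Turn[]) => boolean;

// Replays the conversation one turn at a time, so each user turn is
// checked against only the history that existed when it was sent.
function evaluateIncrementally(
  conversation: Turn[],
  check: ConversationCheck,
): TurnResult[] {
  const results: TurnResult[] = [];
  conversation.forEach((turn, i) => {
    if (turn.role !== 'user') return; // this sketch only checks user turns
    const history = conversation.slice(0, i);
    results.push({ turnIndex: i, tripwireTriggered: check(turn.content, history) });
  });
  return results;
}

// Toy check: flags once two or more user messages (including the latest)
// mention "ignore". A real check would call an LLM instead.
const toyCheck: ConversationCheck = (latest, history) => {
  const userTexts = history
    .filter((t) => t.role === 'user')
    .map((t) => t.content)
    .concat(latest);
  return userTexts.filter((t) => t.toLowerCase().includes('ignore')).length >= 2;
};
```

The point of replaying prefixes is that a single-turn check would never see the pattern the toy check looks for, because neither message is suspicious enough on its own.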
Pull Request Overview
This PR enhances the Jailbreak guardrail to support multi-turn conversation analysis, enabling detection of adversarial manipulation patterns that emerge across multiple conversation turns. Key improvements include conversation history integration, an optimized system prompt with detailed taxonomy guidance, and evaluation tool support for incremental turn-by-turn testing.
- Renamed metadata field from requiresConversationHistory to usesConversationHistory for clarity
- Enhanced Jailbreak guardrail with conversation history analysis and detailed jailbreak taxonomy
- Added multi-turn evaluation mode to incrementally process conversation-aware guardrails
- Updated error handling to fail-open instead of fail-closed for LLM parsing errors
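To make the fail-open behavior in the last bullet concrete, here is a minimal sketch of parsing an LLM response so that a malformed response yields a non-blocking result instead of tripping the guardrail. The names `GuardrailVerdict`, `isVerdict`, and `parseFailOpen`, and the `flagged`/`confidence` fields, are assumptions made for illustration; the PR's actual types in src/checks/llm-base.ts may differ.

```typescript
// Hypothetical output shape; the real schema in the PR may differ.
interface GuardrailVerdict {
  flagged: boolean;
  confidence: number;
}

interface ParseOutcome {
  verdict: GuardrailVerdict;
  parseError: boolean; // recorded so callers can still log the failure
}

// Type guard: narrows unknown JSON to the expected verdict shape.
function isVerdict(value: unknown): value is GuardrailVerdict {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v['flagged'] === 'boolean' && typeof v['confidence'] === 'number';
}

// Fail-open: if the response is not valid JSON or has the wrong shape,
// return a "not flagged" verdict rather than blocking the request.
function parseFailOpen(raw: string): ParseOutcome {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (isVerdict(parsed)) {
      return { verdict: parsed, parseError: false };
    }
  } catch {
    // Invalid JSON: fall through to the fail-open default below.
  }
  return { verdict: { flagged: false, confidence: 0 }, parseError: true };
}
```

The trade-off is that a parsing failure no longer blocks legitimate traffic, at the cost of letting a request through unchecked, which is why this sketch keeps a `parseError` flag for monitoring.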
Reviewed Changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/spec.ts | Renamed metadata field to better reflect conversation history usage semantics |
| src/checks/jailbreak.ts | Complete rewrite with conversation-aware detection, expanded system prompt taxonomy, and history extraction logic |
| src/checks/llm-base.ts | Improved prompt building with schema introspection, changed error handling to fail-open, added type guards |
| src/evals/guardrail-evals.ts | Added multiTurn parameter for turn-by-turn evaluation mode |
| src/evals/core/async-engine.ts | Implemented incremental conversation processing with payload extraction and result annotation |
| src/cli.ts | Added --multi-turn flag for CLI evaluation command |
| src/base-client.ts | Updated to use renamed usesConversationHistory metadata field |
| src/checks/prompt_injection_detection.ts | Updated metadata field name |
| src/tests/unit/llm-base.test.ts | Updated tests to reflect fail-open error handling behavior |
| src/tests/unit/evals/async-engine.test.ts | New tests for multi-turn evaluation engine behavior |
| src/tests/unit/checks/jailbreak.test.ts | Comprehensive tests for conversation-aware jailbreak detection |
| src/tests/unit/base-client.test.ts | Updated test to use renamed metadata field |
| docs/ref/checks/jailbreak.md | Updated documentation with multi-turn support details |
| docs/evals.md | Added multi-turn evaluation documentation and examples |
| README.md | Updated example with multiTurn parameter |
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@codex review again pls
💡 Codex Review
Here are some automated review suggestions for this pull request.
    jailbreak,
    'Detects attempts to jailbreak or bypass AI safety measures using techniques such as prompt injection, role-playing requests, system prompt overrides, or social engineering.',
    'text/plain',
    JailbreakConfig as z.ZodType<JailbreakConfig>,
    undefined,
Register jailbreak guardrail with required context schema
jailbreak still calls runLLM via ctx.guardrailLlm, but this registry entry is created with undefined context requirements, so the spec advertises hasContext=false and skips validation that an OpenAI client is present. Consumers can now instantiate/run the guardrail without guardrailLlm, leading to runtime failures when runLLM dereferences it. Please register with the LLM context schema (as other LLM guardrails do) so the dependency is enforced.
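The concern above can be illustrated with a simplified, hypothetical registry. `defineSpec`, `ContextValidator`, and `requiresLlm` are invented for this sketch and are not the project's real API (which appears to use a zod schema for the context); only the names `jailbreak`, `guardrailLlm`, and `hasContext` come from the review comment.

```typescript
// Hypothetical, simplified registry; not the project's real API.
type ContextValidator = (ctx: unknown) => boolean;

interface Spec {
  name: string;
  hasContext: boolean;
  validateContext: (ctx: unknown) => void;
}

// When no validator is supplied, the spec advertises hasContext=false
// and validateContext accepts anything, including an empty context.
function defineSpec(name: string, requires?: ContextValidator): Spec {
  return {
    name,
    hasContext: requires !== undefined,
    validateContext(ctx: unknown): void {
      if (requires !== undefined && !requires(ctx)) {
        throw new Error(`${name}: context is missing a required dependency`);
      }
    },
  };
}

// Assumed requirement: the context must carry a non-null guardrailLlm.
const requiresLlm: ContextValidator = (ctx) => {
  if (typeof ctx !== 'object' || ctx === null) return false;
  const llm = (ctx as Record<string, unknown>)['guardrailLlm'];
  return llm !== undefined && llm !== null;
};

// Registered with undefined requirements: a missing client goes unnoticed
// until the guardrail actually tries to use it.
const unchecked = defineSpec('jailbreak');

// Registered with a requirement: the missing client is caught up front.
const checked = defineSpec('jailbreak', requiresLlm);
```

With the second form, a consumer who forgets to supply the client gets an error at validation time rather than a failure deep inside the LLM call.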
gabor-openai
left a comment
TY