Multi-turn jailbreak #51
Conversation
Pull Request Overview
This PR enhances the jailbreak guardrail to detect multi-turn attack patterns by leveraging conversation history. The implementation adds context-aware evaluation capabilities and improves the system prompt with a comprehensive taxonomy of jailbreak techniques.
Key Changes
- Refactored jailbreak guardrail to analyze conversation history (up to 10 most recent turns) for detecting multi-turn escalation patterns
- Added `uses_conversation_history` metadata flag to guardrail specifications for context-aware guardrails
- Introduced `--multi-turn` evaluation flag for turn-by-turn incremental processing of conversation-aware guardrails
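The metadata-based selection of conversation-aware guardrails could be sketched roughly as follows. This is a minimal illustration, not the SDK's actual types: the dataclass shapes and the `conversation_aware` helper are assumptions; only the `uses_conversation_history` flag name comes from the PR.

```python
from dataclasses import dataclass, field


@dataclass
class GuardrailSpecMetadata:
    """Illustrative stand-in for the SDK's spec metadata."""
    engine: str = "LLM"
    uses_conversation_history: bool = False  # flag added by this PR


@dataclass
class GuardrailSpec:
    name: str
    metadata: GuardrailSpecMetadata = field(default_factory=GuardrailSpecMetadata)


def conversation_aware(specs: list[GuardrailSpec]) -> list[GuardrailSpec]:
    # Select guardrails via metadata instead of a hardcoded list of names.
    return [s for s in specs if s.metadata.uses_conversation_history]


specs = [
    GuardrailSpec("Jailbreak", GuardrailSpecMetadata(uses_conversation_history=True)),
    GuardrailSpec("URL Filter"),
]
print([s.name for s in conversation_aware(specs)])  # ['Jailbreak']
```

The point of the flag is that the eval engine can discover which guardrails need turn-by-turn processing without maintaining a name list that drifts out of date.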
Reviewed Changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `src/guardrails/checks/text/jailbreak.py` | Complete refactor: added conversation history analysis, comprehensive jailbreak taxonomy in system prompt, and `JailbreakLLMOutput` schema with `reason` field |
| `src/guardrails/checks/text/llm_base.py` | Enhanced prompt building to dynamically generate field instructions based on output model schema |
| `src/guardrails/spec.py` | Added `uses_conversation_history` boolean field to `GuardrailSpecMetadata` for identifying context-aware guardrails |
| `src/guardrails/evals/guardrail_evals.py` | Integrated `multi_turn` parameter throughout evaluation flow and reduced default latency iterations from 50 to 25 |
| `src/guardrails/evals/core/async_engine.py` | Renamed `_run_incremental_prompt_injection` to `_run_incremental_guardrails`; refactored to use metadata-based detection for conversation-aware guardrails |
| `src/guardrails/checks/text/prompt_injection_detection.py` | Added `uses_conversation_history=True` to metadata |
| `tests/unit/checks/test_jailbreak.py` | Comprehensive new test suite covering conversation history handling, confidence thresholds, error handling, and edge cases |
| `tests/unit/checks/test_llm_base.py` | Updated test to pass `output_model` parameter to `_build_full_prompt` |
| `tests/unit/evals/test_async_engine.py` | Updated function name references from `_run_incremental_prompt_injection` to `_run_incremental_guardrails` |
| `docs/ref/checks/jailbreak.md` | Added comprehensive documentation for multi-turn support, conversation history handling, and expanded return field descriptions |
| `docs/evals.md` | Documented `--multi-turn` flag and updated multi-turn data format section with clearer examples |
| `.gitignore` | Added internal development files and directories (`scripts/`, `PROJECT_CONTEXT.md`, `PR_READINESS_CHECKLIST.md`, `sys_prompts/`) |
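The "up to 10 most recent turns" behavior described in the overview could look roughly like this. This is a hedged sketch: the `build_analysis_payload` name, the payload keys, and the JSON encoding are assumptions, not the actual implementation in `jailbreak.py`.

```python
import json

MAX_TURNS = 10  # the PR analyzes up to the 10 most recent turns


def build_analysis_payload(history: list[dict], latest: str) -> str:
    """Illustrative sketch: combine recent conversation turns with the latest input."""
    recent = history[-MAX_TURNS:]  # keep only the most recent turns
    return json.dumps({"conversation_history": recent, "latest_input": latest})


history = [{"role": "user", "content": f"turn {i}"} for i in range(15)]
payload = json.loads(build_analysis_payload(history, "final message"))
print(len(payload["conversation_history"]))  # 10
```

Bounding the window keeps prompt size predictable while still letting the LLM judge see the escalation pattern that single-turn analysis would miss.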
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
async def jailbreak(ctx: GuardrailLLMContextProto, data: str, config: LLMConfig) -> GuardrailResult:
    """Detect jailbreak attempts leveraging full conversation history when available."""
    conversation_history = ctx.get_conversation_history() or []
    analysis_payload = _build_analysis_payload(conversation_history, data)
```
Guardrail crashes when context lacks history accessor
The new jailbreak check now unconditionally calls ctx.get_conversation_history() (see src/guardrails/checks/text/jailbreak.py lines 235‑238), but most contexts that the SDK hands to guardrails don’t implement that method. For example, the default contexts returned by _create_default_context in src/guardrails/client.py and the public GuardrailsContext in src/guardrails/context.py only provide a guardrail_llm attribute and no get_conversation_history. When jailbreak runs without an explicit conversation_history (the normal single‑turn case or when users call run_guardrails with a GuardrailsContext), the call raises AttributeError: 'DefaultContext' object has no attribute 'get_conversation_history', causing the guardrail to fail for every request. The check needs to tolerate contexts without that method (e.g., via getattr(ctx, "get_conversation_history", lambda: None) or a protocol helper) before trying to read the history.
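The tolerant lookup Codex suggests can be sketched as follows. This is a minimal sketch, not the repo's code: `get_history_safely` and the stub context classes are hypothetical, standing in for the SDK's `DefaultContext` and a history-capable context.

```python
def get_history_safely(ctx) -> list:
    """Return conversation history, tolerating contexts without the accessor."""
    getter = getattr(ctx, "get_conversation_history", None)
    if getter is None:
        return []  # context doesn't support history; fall back to single-turn
    return getter() or []


class BareContext:
    """Stand-in for a default context exposing only guardrail_llm."""
    guardrail_llm = object()


class HistoryContext:
    """Stand-in for a context that does implement the accessor."""
    def get_conversation_history(self):
        return [{"role": "user", "content": "hi"}]


print(get_history_safely(BareContext()))     # []
print(get_history_safely(HistoryContext()))  # [{'role': 'user', 'content': 'hi'}]
```

With this guard, the single-turn path degrades gracefully to an empty history instead of raising `AttributeError`.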
gabor-openai left a comment
Excited for this, thank you!
- Updated the Jailbreak Guardrail to use conversation history as context
- Added a `--multi-turn` flag to the eval tool to run multi-turn evaluations
- Added a `use_conversation_history` flag to the registration of context-aware guardrails, instead of hardcoding a list of guardrail names
- gpt-4.1-mini initial eval results
Will have a separate PR with the full benchmark results