Multi-turn jailbreak #51
Conversation
Pull Request Overview
This PR enhances the jailbreak guardrail to detect multi-turn attack patterns by leveraging conversation history. The implementation adds context-aware evaluation capabilities and improves the system prompt with a comprehensive taxonomy of jailbreak techniques.
Key Changes
- Refactored jailbreak guardrail to analyze conversation history (up to 10 most recent turns) for detecting multi-turn escalation patterns
- Added `uses_conversation_history` metadata flag to guardrail specifications for context-aware guardrails
- Introduced `--multi-turn` evaluation flag for turn-by-turn incremental processing of conversation-aware guardrails
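The metadata-based selection of conversation-aware guardrails could be sketched roughly as follows. This is a minimal illustration, not the SDK's actual types: the dataclass shapes and the `conversation_aware` helper are assumptions; only the `uses_conversation_history` flag name comes from the PR.

```python
from dataclasses import dataclass, field


@dataclass
class GuardrailSpecMetadata:
    """Illustrative stand-in for the SDK's spec metadata."""
    engine: str = "LLM"
    uses_conversation_history: bool = False  # flag added by this PR


@dataclass
class GuardrailSpec:
    name: str
    metadata: GuardrailSpecMetadata = field(default_factory=GuardrailSpecMetadata)


def conversation_aware(specs: list[GuardrailSpec]) -> list[GuardrailSpec]:
    # Select guardrails via metadata instead of a hardcoded list of names.
    return [s for s in specs if s.metadata.uses_conversation_history]


specs = [
    GuardrailSpec("Jailbreak", GuardrailSpecMetadata(uses_conversation_history=True)),
    GuardrailSpec("URL Filter"),
]
print([s.name for s in conversation_aware(specs)])  # ['Jailbreak']
```

The point of the flag is that the eval engine can discover which guardrails need turn-by-turn processing without maintaining a name list that drifts out of date.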
Reviewed Changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `src/guardrails/checks/text/jailbreak.py` | Complete refactor: added conversation history analysis, comprehensive jailbreak taxonomy in system prompt, and `JailbreakLLMOutput` schema with `reason` field |
| `src/guardrails/checks/text/llm_base.py` | Enhanced prompt building to dynamically generate field instructions based on output model schema |
| `src/guardrails/spec.py` | Added `uses_conversation_history` boolean field to `GuardrailSpecMetadata` for identifying context-aware guardrails |
| `src/guardrails/evals/guardrail_evals.py` | Integrated `multi_turn` parameter throughout evaluation flow and reduced default latency iterations from 50 to 25 |
| `src/guardrails/evals/core/async_engine.py` | Renamed `_run_incremental_prompt_injection` to `_run_incremental_guardrails`; refactored to use metadata-based detection for conversation-aware guardrails |
| `src/guardrails/checks/text/prompt_injection_detection.py` | Added `uses_conversation_history=True` to metadata |
| `tests/unit/checks/test_jailbreak.py` | Comprehensive new test suite covering conversation history handling, confidence thresholds, error handling, and edge cases |
| `tests/unit/checks/test_llm_base.py` | Updated test to pass `output_model` parameter to `_build_full_prompt` |
| `tests/unit/evals/test_async_engine.py` | Updated function name references from `_run_incremental_prompt_injection` to `_run_incremental_guardrails` |
| `docs/ref/checks/jailbreak.md` | Added comprehensive documentation for multi-turn support, conversation history handling, and expanded return field descriptions |
| `docs/evals.md` | Documented `--multi-turn` flag and updated multi-turn data format section with clearer examples |
| `.gitignore` | Added internal development files and directories (`scripts/`, `PROJECT_CONTEXT.md`, `PR_READINESS_CHECKLIST.md`, `sys_prompts/`) |
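The "up to 10 most recent turns" behavior described in the overview could look roughly like this. This is a hedged sketch: the `build_analysis_payload` name, the payload keys, and the JSON encoding are assumptions, not the actual implementation in `jailbreak.py`.

```python
import json

MAX_TURNS = 10  # the PR analyzes up to the 10 most recent turns


def build_analysis_payload(history: list[dict], latest: str) -> str:
    """Illustrative sketch: combine recent conversation turns with the latest input."""
    recent = history[-MAX_TURNS:]  # keep only the most recent turns
    return json.dumps({"conversation_history": recent, "latest_input": latest})


history = [{"role": "user", "content": f"turn {i}"} for i in range(15)]
payload = json.loads(build_analysis_payload(history, "final message"))
print(len(payload["conversation_history"]))  # 10
```

Bounding the window keeps prompt size predictable while still letting the LLM judge see the escalation pattern that single-turn analysis would miss.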
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
async def jailbreak(ctx: GuardrailLLMContextProto, data: str, config: LLMConfig) -> GuardrailResult:
    """Detect jailbreak attempts leveraging full conversation history when available."""
    conversation_history = ctx.get_conversation_history() or []
    analysis_payload = _build_analysis_payload(conversation_history, data)
```
Guardrail crashes when context lacks history accessor
The new jailbreak check now unconditionally calls ctx.get_conversation_history() (see src/guardrails/checks/text/jailbreak.py lines 235‑238), but most contexts that the SDK hands to guardrails don’t implement that method. For example, the default contexts returned by _create_default_context in src/guardrails/client.py and the public GuardrailsContext in src/guardrails/context.py only provide a guardrail_llm attribute and no get_conversation_history. When jailbreak runs without an explicit conversation_history (the normal single‑turn case or when users call run_guardrails with a GuardrailsContext), the call raises AttributeError: 'DefaultContext' object has no attribute 'get_conversation_history', causing the guardrail to fail for every request. The check needs to tolerate contexts without that method (e.g., via getattr(ctx, "get_conversation_history", lambda: None) or a protocol helper) before trying to read the history.
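The tolerant lookup Codex suggests can be sketched as follows. This is a minimal sketch, not the repo's code: `get_history_safely` and the stub context classes are hypothetical, standing in for the SDK's `DefaultContext` and a history-capable context.

```python
def get_history_safely(ctx) -> list:
    """Return conversation history, tolerating contexts without the accessor."""
    getter = getattr(ctx, "get_conversation_history", None)
    if getter is None:
        return []  # context doesn't support history; fall back to single-turn
    return getter() or []


class BareContext:
    """Stand-in for a default context exposing only guardrail_llm."""
    guardrail_llm = object()


class HistoryContext:
    """Stand-in for a context that does implement the accessor."""
    def get_conversation_history(self):
        return [{"role": "user", "content": "hi"}]


print(get_history_safely(BareContext()))     # []
print(get_history_safely(HistoryContext()))  # [{'role': 'user', 'content': 'hi'}]
```

With this guard, the single-turn path degrades gracefully to an empty history instead of raising `AttributeError`.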
gabor-openai left a comment
Excited for this, thank you!
- Updated the Jailbreak Guardrail to use conversation history as context
- Added a `--multi-turn` flag to the eval tool to run multi-turn evaluations
- Added a `use_conversation_history` flag to the registration of context-aware guardrails, instead of hardcoding a list of guardrail names
- gpt-4.1-mini initial eval results
Will have a separate PR with the full benchmark results