Skip to content

Conversation

@steven10a
Copy link
Collaborator

@steven10a steven10a commented Nov 14, 2025

Updated the Jailbreak Guardrail to use conversation history as context

  • Improves multi-turn jailbreak detection
  • Developed a new multi-turn dataset to run evals
  • Optimized system prompt for better performance
  • Updated docs and tests
  • Added --multi-turn flag to the eval tool to run multi-turn evaluations
  • Added use_conversation_history flag to registration of guardrails that are context aware instead of hardcoding a list of guardrail names that are context aware

gpt-4.1-mini initial eval results

{
  "input": {
    "Jailbreak": {
      "true_positives": 4664,
      "false_positives": 19,
      "false_negatives": 336,
      "true_negatives": 4981,
      "total_samples": 10000,
      "precision": 0.9959427717275251,
      "recall": 0.9328,
      "f1_score": 0.963337808530414
    }
  }
}

Will have a separate PR with the full benchmark results

Copilot AI review requested due to automatic review settings November 14, 2025 16:59
Copilot finished reviewing on behalf of steven10a November 14, 2025 17:02
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances the jailbreak guardrail to detect multi-turn attack patterns by leveraging conversation history. The implementation adds context-aware evaluation capabilities and improves the system prompt with a comprehensive taxonomy of jailbreak techniques.

Key Changes

  • Refactored jailbreak guardrail to analyze conversation history (up to 10 most recent turns) for detecting multi-turn escalation patterns
  • Added uses_conversation_history metadata flag to guardrail specifications for context-aware guardrails
  • Introduced --multi-turn evaluation flag for turn-by-turn incremental processing of conversation-aware guardrails

Reviewed Changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated no comments.

Show a summary per file
File Description
src/guardrails/checks/text/jailbreak.py Complete refactor: added conversation history analysis, comprehensive jailbreak taxonomy in system prompt, and JailbreakLLMOutput schema with reason field
src/guardrails/checks/text/llm_base.py Enhanced prompt building to dynamically generate field instructions based on output model schema
src/guardrails/spec.py Added uses_conversation_history boolean field to GuardrailSpecMetadata for identifying context-aware guardrails
src/guardrails/evals/guardrail_evals.py Integrated multi_turn parameter throughout evaluation flow and reduced default latency iterations from 50 to 25
src/guardrails/evals/core/async_engine.py Renamed _run_incremental_prompt_injection to _run_incremental_guardrails, refactored to use metadata-based detection for conversation-aware guardrails
src/guardrails/checks/text/prompt_injection_detection.py Added uses_conversation_history=True to metadata
tests/unit/checks/test_jailbreak.py Comprehensive new test suite covering conversation history handling, confidence thresholds, error handling, and edge cases
tests/unit/checks/test_llm_base.py Updated test to pass output_model parameter to _build_full_prompt
tests/unit/evals/test_async_engine.py Updated function name references from _run_incremental_prompt_injection to _run_incremental_guardrails
docs/ref/checks/jailbreak.md Added comprehensive documentation for multi-turn support, conversation history handling, and expanded return field descriptions
docs/evals.md Documented --multi-turn flag and updated multi-turn data format section with clearer examples
.gitignore Added internal development files and directories (scripts/, PROJECT_CONTEXT.md, PR_READINESS_CHECKLIST.md, sys_prompts/)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines 235 to 238
async def jailbreak(ctx: GuardrailLLMContextProto, data: str, config: LLMConfig) -> GuardrailResult:
"""Detect jailbreak attempts leveraging full conversation history when available."""
conversation_history = ctx.get_conversation_history() or []
analysis_payload = _build_analysis_payload(conversation_history, data)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Guardrail crashes when context lacks history accessor

The new jailbreak check now unconditionally calls ctx.get_conversation_history() (see src/guardrails/checks/text/jailbreak.py lines 235‑238), but most contexts that the SDK hands to guardrails don’t implement that method. For example, the default contexts returned by _create_default_context in src/guardrails/client.py and the public GuardrailsContext in src/guardrails/context.py only provide a guardrail_llm attribute and no get_conversation_history. When jailbreak runs without an explicit conversation_history (the normal single‑turn case or when users call run_guardrails with a GuardrailsContext), the call raises AttributeError: 'DefaultContext' object has no attribute 'get_conversation_history', causing the guardrail to fail for every request. The check needs to tolerate contexts without that method (e.g., via getattr(ctx, "get_conversation_history", lambda: None) or a protocol helper) before trying to read the history.

Useful? React with 👍 / 👎.

Copy link
Collaborator

@gabor-openai gabor-openai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excited for this, thank you!

@gabor-openai gabor-openai merged commit 6f8839b into main Nov 17, 2025
3 checks passed
@gabor-openai gabor-openai deleted the dev/steven/jb_multi_turn branch November 17, 2025 21:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants