Skip to content

Conversation

@steven10a
Copy link
Collaborator

  • Updating Jailbreak guardrail to use conversation history to improve performance on multi-turn jailbreaks
  • optimized system prompt for better performance
  • Updated Eval tool to accept a multi-turn flag when evaluating jailbreaks with multi-turn conversations
  • Updated docs and tests

Copilot AI review requested due to automatic review settings November 17, 2025 18:17
Copilot finished reviewing on behalf of steven10a November 17, 2025 18:18
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances the Jailbreak guardrail to support multi-turn conversation analysis, enabling detection of adversarial manipulation patterns that emerge across multiple conversation turns. Key improvements include conversation history integration, an optimized system prompt with detailed taxonomy guidance, and evaluation tool support for incremental turn-by-turn testing.

  • Renamed metadata field from requiresConversationHistory to usesConversationHistory for clarity
  • Enhanced Jailbreak guardrail with conversation history analysis and detailed jailbreak taxonomy
  • Added multi-turn evaluation mode to incrementally process conversation-aware guardrails
  • Updated error handling to fail-open instead of fail-closed for LLM parsing errors

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/spec.ts Renamed metadata field to better reflect conversation history usage semantics
src/checks/jailbreak.ts Complete rewrite with conversation-aware detection, expanded system prompt taxonomy, and history extraction logic
src/checks/llm-base.ts Improved prompt building with schema introspection, changed error handling to fail-open, added type guards
src/evals/guardrail-evals.ts Added multiTurn parameter for turn-by-turn evaluation mode
src/evals/core/async-engine.ts Implemented incremental conversation processing with payload extraction and result annotation
src/cli.ts Added --multi-turn flag for CLI evaluation command
src/base-client.ts Updated to use renamed usesConversationHistory metadata field
src/checks/prompt_injection_detection.ts Updated metadata field name
src/tests/unit/llm-base.test.ts Updated tests to reflect fail-open error handling behavior
src/tests/unit/evals/async-engine.test.ts New tests for multi-turn evaluation engine behavior
src/tests/unit/checks/jailbreak.test.ts Comprehensive tests for conversation-aware jailbreak detection
src/tests/unit/base-client.test.ts Updated test to use renamed metadata field
docs/ref/checks/jailbreak.md Updated documentation with multi-turn support details
docs/evals.md Added multi-turn evaluation documentation and examples
README.md Updated example with multiTurn parameter

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@steven10a steven10a requested a review from Copilot November 17, 2025 20:15
Copilot finished reviewing on behalf of steven10a November 17, 2025 20:19
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

steven10a and others added 2 commits November 17, 2025 15:31
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@gabor-openai gabor-openai self-requested a review November 17, 2025 21:12
@gabor-openai
Copy link
Collaborator

@codex review again pls

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +244 to +248
jailbreak,
'Detects attempts to jailbreak or bypass AI safety measures using techniques such as prompt injection, role-playing requests, system prompt overrides, or social engineering.',
'text/plain',
JailbreakConfig as z.ZodType<JailbreakConfig>,
undefined,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Register jailbreak guardrail with required context schema

jailbreak still calls runLLM via ctx.guardrailLlm, but this registry entry is created with undefined context requirements, so the spec advertises hasContext=false and skips validation that an OpenAI client is present. Consumers can now instantiate/run the guardrail without guardrailLlm, leading to runtime failures when runLLM dereferences it. Please register with the LLM context schema (as other LLM guardrails do) so the dependency is enforced.

Useful? React with 👍 / 👎.

Copy link
Collaborator

@gabor-openai gabor-openai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TY

@gabor-openai gabor-openai merged commit c333e24 into main Nov 17, 2025
1 check passed
@steven10a steven10a deleted the dev/steven/multi_jb branch November 18, 2025 15:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants