Multi-turn Jailbreak guardrail #42

steven10a · 2025-11-17T18:17:09Z

Updating Jailbreak guardrail to use conversation history to improve performance on multi-turn jailbreaks
optimized system prompt for better performance
Updated Eval tool to accept a multi-turn flag when evaluating jailbreaks with multi-turn conversations
Updated docs and tests

Copilot

Pull Request Overview

This PR enhances the Jailbreak guardrail to support multi-turn conversation analysis, enabling detection of adversarial manipulation patterns that emerge across multiple conversation turns. Key improvements include conversation history integration, an optimized system prompt with detailed taxonomy guidance, and evaluation tool support for incremental turn-by-turn testing.

Renamed metadata field from requiresConversationHistory to usesConversationHistory for clarity
Enhanced Jailbreak guardrail with conversation history analysis and detailed jailbreak taxonomy
Added multi-turn evaluation mode to incrementally process conversation-aware guardrails
Updated error handling to fail-open instead of fail-closed for LLM parsing errors

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
src/spec.ts	Renamed metadata field to better reflect conversation history usage semantics
src/checks/jailbreak.ts	Complete rewrite with conversation-aware detection, expanded system prompt taxonomy, and history extraction logic
src/checks/llm-base.ts	Improved prompt building with schema introspection, changed error handling to fail-open, added type guards
src/evals/guardrail-evals.ts	Added multiTurn parameter for turn-by-turn evaluation mode
src/evals/core/async-engine.ts	Implemented incremental conversation processing with payload extraction and result annotation
src/cli.ts	Added --multi-turn flag for CLI evaluation command
src/base-client.ts	Updated to use renamed usesConversationHistory metadata field
src/checks/prompt_injection_detection.ts	Updated metadata field name
src/tests/unit/llm-base.test.ts	Updated tests to reflect fail-open error handling behavior
src/tests/unit/evals/async-engine.test.ts	New tests for multi-turn evaluation engine behavior
src/tests/unit/checks/jailbreak.test.ts	Comprehensive tests for conversation-aware jailbreak detection
src/tests/unit/base-client.test.ts	Updated test to use renamed metadata field
docs/ref/checks/jailbreak.md	Updated documentation with multi-turn support details
docs/evals.md	Added multi-turn evaluation documentation and examples
README.md	Updated example with multiTurn parameter

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

src/evals/core/async-engine.ts

src/checks/llm-base.ts

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

src/checks/jailbreak.ts

Copilot

Pull Request Overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

src/checks/llm-base.ts

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

gabor-openai · 2025-11-17T21:14:04Z

@codex review again pls

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2025-11-17T21:24:29Z

src/checks/jailbreak.ts

+  jailbreak,
+  'Detects attempts to jailbreak or bypass AI safety measures using techniques such as prompt injection, role-playing requests, system prompt overrides, or social engineering.',
+  'text/plain',
+  JailbreakConfig as z.ZodType<JailbreakConfig>,
+  undefined,


Register jailbreak guardrail with required context schema

jailbreak still calls runLLM via ctx.guardrailLlm, but this registry entry is created with undefined context requirements, so the spec advertises hasContext=false and skips validation that an OpenAI client is present. Consumers can now instantiate/run the guardrail without guardrailLlm, leading to runtime failures when runLLM dereferences it. Please register with the LLM context schema (as other LLM guardrails do) so the dependency is enforced.

Useful? React with 👍 / 👎.

gabor-openai

TY

Multi-turn Jailbreak guardrail

a192074

Copilot AI review requested due to automatic review settings November 17, 2025 18:17

Copilot started reviewing on behalf of steven10a November 17, 2025 18:17 View session

Copilot finished reviewing on behalf of steven10a November 17, 2025 18:18

Copilot AI reviewed Nov 17, 2025

View reviewed changes

src/evals/core/async-engine.ts Outdated Show resolved Hide resolved

src/checks/llm-base.ts Outdated Show resolved Hide resolved

chatgpt-codex-connector bot reviewed Nov 17, 2025

View reviewed changes

src/checks/jailbreak.ts Show resolved Hide resolved

Address comments

aceafa4

steven10a requested a review from Copilot November 17, 2025 20:15

Copilot started reviewing on behalf of steven10a November 17, 2025 20:16 View session

Copilot finished reviewing on behalf of steven10a November 17, 2025 20:19

Copilot AI reviewed Nov 17, 2025

View reviewed changes

src/checks/llm-base.ts Outdated Show resolved Hide resolved

src/checks/llm-base.ts Outdated Show resolved Hide resolved

steven10a and others added 2 commits November 17, 2025 15:31

Update src/checks/llm-base.ts

430cd6a

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update src/checks/llm-base.ts

e5d81f0

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

gabor-openai self-requested a review November 17, 2025 21:12

chatgpt-codex-connector bot reviewed Nov 17, 2025

View reviewed changes

gabor-openai approved these changes Nov 17, 2025

View reviewed changes

gabor-openai merged commit c333e24 into main Nov 17, 2025
1 check passed

steven10a deleted the dev/steven/multi_jb branch November 18, 2025 15:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Multi-turn Jailbreak guardrail #42

Multi-turn Jailbreak guardrail #42

Uh oh!

steven10a commented Nov 17, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

gabor-openai commented Nov 17, 2025

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Nov 17, 2025

Uh oh!

gabor-openai left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Multi-turn Jailbreak guardrail #42

Multi-turn Jailbreak guardrail #42

Uh oh!

Conversation

steven10a commented Nov 17, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Uh oh!

Uh oh!

gabor-openai commented Nov 17, 2025

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Nov 17, 2025

Choose a reason for hiding this comment

Uh oh!

gabor-openai left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants