Updated prompt injection check #27

steven10a · 2025-10-28T18:43:36Z

Updated system prompt of prompt injection guardrail for better performance
Small change to llm_base so all LLM based checks use a shared error reporter and updated other LLM checks to use it
Update eval tool to properly parse multi-turn input data
Updated evals with results of V2

Copilot

Pull Request Overview

This PR enhances the Prompt Injection Detection guardrail to focus exclusively on analyzing tool calls and tool outputs, while improving the evidence gathering and evaluation framework. The changes refine the security model to only flag content with direct evidence of malicious instructions, rather than inferring injection from behavioral symptoms.

Key changes:

Updated prompt injection detection to skip assistant content messages and only analyze tool calls/outputs
Added evidence field to PromptInjectionDetectionOutput for capturing specific injection indicators
Enhanced conversation history parsing to gracefully handle non-JSON data
Refactored error handling with shared create_error_result helper function

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
`src/guardrails/checks/text/prompt_injection_detection.py`	Core logic updates: skip assistant messages, add evidence field, enhance system prompt with detailed injection detection criteria
`tests/unit/checks/test_prompt_injection_detection.py`	Comprehensive test coverage for new skip behavior, assistant message handling, and tool output injection scenarios
`src/guardrails/evals/core/async_engine.py`	Enhanced conversation parsing to handle plain strings and non-conversation JSON, support for Jailbreak guardrail
`src/guardrails/evals/core/types.py`	Added `conversation_history` field and getter method to Context class
`src/guardrails/checks/text/llm_base.py`	Extracted `create_error_result` helper function for consistent error handling
`src/guardrails/checks/text/hallucination_detection.py`	Updated to use shared `create_error_result` helper
`tests/integration/test_suite.py`	Commented out multiple test cases, removed config fields
`src/guardrails/evals/.gitignore`	Added `PI_eval/` directory to ignore list

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

src/guardrails/evals/core/async_engine.py

tests/integration/test_suite.py

src/guardrails/checks/text/prompt_injection_detection.py

src/guardrails/evals/core/async_engine.py

Copilot

Pull Request Overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

src/guardrails/checks/text/prompt_injection_detection.py

tests/unit/checks/test_prompt_injection_detection.py

src/guardrails/evals/core/async_engine.py

Copilot

Pull Request Overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-10-28T19:16:26Z

src/guardrails/evals/core/async_engine.py

                            "guardrails": [
                                {
                                    "name": guardrail.definition.name,
                                    "config": (guardrail.config.__dict__ if hasattr(guardrail.config, "__dict__") else guardrail.config),
                                }
                                for guardrail in self.guardrails
-                                if guardrail.definition.name == "Prompt Injection Detection"
+                                if guardrail.definition.name in conversation_aware_names
                            ],


The configuration creation logic filters guardrails by name match with conversation_aware_names, but this creates a minimal config with only conversation-aware guardrails. If self.guardrails doesn't contain a guardrail matching the expected trigger name from sample.expected_triggers, the minimal_config will have an empty guardrails list, which could cause the evaluation to fail silently or produce incorrect results. The filtering should ensure at least one matching guardrail exists or handle the empty case.

This is in an if statement that handles that case

Copilot · 2025-10-28T19:16:26Z

src/guardrails/checks/text/prompt_injection_detection.py

    """
-    normalized_messages = normalize_conversation(messages)
-    user_texts = [entry["content"] for entry in normalized_messages if entry.get("role") == "user" and isinstance(entry.get("content"), str)]
+    user_texts = [entry["content"] for entry in messages if entry.get("role") == "user" and isinstance(entry.get("content"), str)]


This list comprehension will raise a TypeError if entry[\"content\"] is not a string but is a truthy non-string type (e.g., a list or dict). The isinstance check happens after the value is already accessed with entry[\"content\"], but the value could be any type. Consider using .get(\"content\") instead of direct access, or handle the case where content might be None before the isinstance check.

Suggested change

user_texts = [entry["content"] for entry in messages if entry.get("role") == "user" and isinstance(entry.get("content"), str)]

user_texts = [entry.get("content") for entry in messages if entry.get("role") == "user" and isinstance(entry.get("content"), str)]

We are receiving a normalized message list, so this is not an issue

gabor-openai

LGTM thank you

Updated prompt injection check

fae454b

Copilot AI review requested due to automatic review settings October 28, 2025 18:43

Copilot AI reviewed Oct 28, 2025

View reviewed changes

Formatting changes

98ab91e

steven10a requested a review from Copilot October 28, 2025 18:56

Copilot AI reviewed Oct 28, 2025

View reviewed changes

Removed legacy code

be2ced6

steven10a requested a review from Copilot October 28, 2025 19:14

Copilot AI reviewed Oct 28, 2025

View reviewed changes

steven10a requested a review from gabor-openai October 28, 2025 19:26

steven10a added 2 commits October 28, 2025 21:47

update results doc

c3b7eef

updating dataset details

f4ae9f6

gabor-openai approved these changes Oct 29, 2025

View reviewed changes

gabor-openai merged commit ab3f458 into main Oct 29, 2025
3 checks passed

gabor-openai deleted the dev/steve/PI_eval branch October 29, 2025 16:54

	user_texts = [entry["content"] for entry in messages if entry.get("role") == "user" and isinstance(entry.get("content"), str)]
	user_texts = [entry.get("content") for entry in messages if entry.get("role") == "user" and isinstance(entry.get("content"), str)]

Updated prompt injection check #27

Updated prompt injection check #27

Uh oh!

Conversation

steven10a commented Oct 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Copilot AI Oct 28, 2025

Choose a reason for hiding this comment

Uh oh!

steven10a Oct 28, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Oct 28, 2025

Choose a reason for hiding this comment

Uh oh!

steven10a Oct 28, 2025

Choose a reason for hiding this comment

Uh oh!

gabor-openai left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

steven10a commented Oct 28, 2025 •

edited

Loading