Updated prompt injection check #27
Conversation
Pull Request Overview
This PR enhances the Prompt Injection Detection guardrail to focus exclusively on analyzing tool calls and tool outputs, while improving the evidence gathering and evaluation framework. The changes refine the security model to only flag content with direct evidence of malicious instructions, rather than inferring injection from behavioral symptoms.
Key changes:
- Updated prompt injection detection to skip assistant content messages and only analyze tool calls/outputs
- Added `evidence` field to `PromptInjectionDetectionOutput` for capturing specific injection indicators
- Enhanced conversation history parsing to gracefully handle non-JSON data
- Refactored error handling with a shared `create_error_result` helper function
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `src/guardrails/checks/text/prompt_injection_detection.py` | Core logic updates: skip assistant messages, add `evidence` field, enhance system prompt with detailed injection detection criteria |
| `tests/unit/checks/test_prompt_injection_detection.py` | Comprehensive test coverage for new skip behavior, assistant message handling, and tool output injection scenarios |
| `src/guardrails/evals/core/async_engine.py` | Enhanced conversation parsing to handle plain strings and non-conversation JSON; support for Jailbreak guardrail |
| `src/guardrails/evals/core/types.py` | Added `conversation_history` field and getter method to `Context` class |
| `src/guardrails/checks/text/llm_base.py` | Extracted `create_error_result` helper function for consistent error handling |
| `src/guardrails/checks/text/hallucination_detection.py` | Updated to use shared `create_error_result` helper |
| `tests/integration/test_suite.py` | Commented out multiple test cases, removed config fields |
| `src/guardrails/evals/.gitignore` | Added `PI_eval/` directory to ignore list |
Pull Request Overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
Pull Request Overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
```diff
  "guardrails": [
      {
          "name": guardrail.definition.name,
          "config": (guardrail.config.__dict__ if hasattr(guardrail.config, "__dict__") else guardrail.config),
      }
      for guardrail in self.guardrails
-     if guardrail.definition.name == "Prompt Injection Detection"
+     if guardrail.definition.name in conversation_aware_names
  ],
```
Copilot AI (Oct 28, 2025):
The configuration creation logic filters guardrails by name match with conversation_aware_names, but this creates a minimal config with only conversation-aware guardrails. If self.guardrails doesn't contain a guardrail matching the expected trigger name from sample.expected_triggers, the minimal_config will have an empty guardrails list, which could cause the evaluation to fail silently or produce incorrect results. The filtering should ensure at least one matching guardrail exists or handle the empty case.
This is in an `if` statement that handles that case.
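As the reply notes, the empty case is handled by a surrounding conditional. A minimal sketch of that pattern, with hypothetical names (`build_minimal_config` and the dict-based guardrail shape are illustrative, not the project's actual API):

```python
def build_minimal_config(guardrails, conversation_aware_names):
    """Keep only conversation-aware guardrails; surface the empty case explicitly."""
    selected = [
        {"name": g["name"], "config": g.get("config", {})}
        for g in guardrails
        if g["name"] in conversation_aware_names
    ]
    if not selected:
        # No conversation-aware guardrail configured: return None so the caller
        # can skip conversation-based evaluation instead of silently running
        # against an empty guardrail list.
        return None
    return {"guardrails": selected}
```

Returning a sentinel (or raising) for the empty case is what prevents the "fail silently" outcome the review comment warns about.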
```diff
  """
- normalized_messages = normalize_conversation(messages)
- user_texts = [entry["content"] for entry in normalized_messages if entry.get("role") == "user" and isinstance(entry.get("content"), str)]
+ user_texts = [entry["content"] for entry in messages if entry.get("role") == "user" and isinstance(entry.get("content"), str)]
```
Copilot AI (Oct 28, 2025):
This list comprehension will raise a TypeError if `entry["content"]` is not a string but is a truthy non-string type (e.g., a list or dict). The isinstance check happens after the value is already accessed with `entry["content"]`, but the value could be any type. Consider using `.get("content")` instead of direct access, or handle the case where content might be None before the isinstance check.
```diff
- user_texts = [entry["content"] for entry in messages if entry.get("role") == "user" and isinstance(entry.get("content"), str)]
+ user_texts = [entry.get("content") for entry in messages if entry.get("role") == "user" and isinstance(entry.get("content"), str)]
```
We are receiving a normalized message list, so this is not an issue.
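For context, Python evaluates a comprehension's `if` clause before its output expression, so entries with non-string content are filtered out before `entry["content"]` is ever read. A self-contained illustration (the sample messages are made up for this demo):

```python
messages = [
    {"role": "user", "content": "hello"},
    {"role": "user", "content": ["not", "a", "string"]},  # filtered by isinstance, no TypeError
    {"role": "assistant", "content": "hi"},               # filtered by role
    {"role": "user"},                                     # .get returns None, isinstance is False
]
user_texts = [
    entry["content"]
    for entry in messages
    if entry.get("role") == "user" and isinstance(entry.get("content"), str)
]
# user_texts == ["hello"]
```

This is why the suggested `.get("content")` change is cosmetic here: by the time the output expression runs, the filter has already guaranteed the key exists and holds a string.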
gabor-openai left a comment:
LGTM thank you
Extracted `create_error_result` into `llm_base` so all LLM-based checks use a shared error reporter, and updated the other LLM checks to use it.