Skip to content

Conversation

@steven10a
Copy link
Collaborator

@steven10a steven10a commented Oct 28, 2025

  • Updated system prompt of prompt injection guardrail for better performance
  • Small change to llm_base so all LLM based checks use a shared error reporter and updated other LLM checks to use it
  • Update eval tool to properly parse multi-turn input data
  • Updated evals with results of V2

Copilot AI review requested due to automatic review settings October 28, 2025 18:43
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances the Prompt Injection Detection guardrail to focus exclusively on analyzing tool calls and tool outputs, while improving the evidence gathering and evaluation framework. The changes refine the security model to only flag content with direct evidence of malicious instructions, rather than inferring injection from behavioral symptoms.

Key changes:

  • Updated prompt injection detection to skip assistant content messages and only analyze tool calls/outputs
  • Added evidence field to PromptInjectionDetectionOutput for capturing specific injection indicators
  • Enhanced conversation history parsing to gracefully handle non-JSON data
  • Refactored error handling with shared create_error_result helper function

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/guardrails/checks/text/prompt_injection_detection.py Core logic updates: skip assistant messages, add evidence field, enhance system prompt with detailed injection detection criteria
tests/unit/checks/test_prompt_injection_detection.py Comprehensive test coverage for new skip behavior, assistant message handling, and tool output injection scenarios
src/guardrails/evals/core/async_engine.py Enhanced conversation parsing to handle plain strings and non-conversation JSON, support for Jailbreak guardrail
src/guardrails/evals/core/types.py Added conversation_history field and getter method to Context class
src/guardrails/checks/text/llm_base.py Extracted create_error_result helper function for consistent error handling
src/guardrails/checks/text/hallucination_detection.py Updated to use shared create_error_result helper
tests/integration/test_suite.py Commented out multiple test cases, removed config fields
src/guardrails/evals/.gitignore Added PI_eval/ directory to ignore list

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@steven10a steven10a requested a review from Copilot October 28, 2025 18:56
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@steven10a steven10a requested a review from Copilot October 28, 2025 19:14
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 237 to 244
"guardrails": [
{
"name": guardrail.definition.name,
"config": (guardrail.config.__dict__ if hasattr(guardrail.config, "__dict__") else guardrail.config),
}
for guardrail in self.guardrails
if guardrail.definition.name == "Prompt Injection Detection"
if guardrail.definition.name in conversation_aware_names
],
Copy link

Copilot AI Oct 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The configuration creation logic filters guardrails by name match with conversation_aware_names, but this creates a minimal config with only conversation-aware guardrails. If self.guardrails doesn't contain a guardrail matching the expected trigger name from sample.expected_triggers, the minimal_config will have an empty guardrails list, which could cause the evaluation to fail silently or produce incorrect results. The filtering should ensure at least one matching guardrail exists or handle the empty case.

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in an if statement that handles that case

"""
normalized_messages = normalize_conversation(messages)
user_texts = [entry["content"] for entry in normalized_messages if entry.get("role") == "user" and isinstance(entry.get("content"), str)]
user_texts = [entry["content"] for entry in messages if entry.get("role") == "user" and isinstance(entry.get("content"), str)]
Copy link

Copilot AI Oct 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This list comprehension will raise a TypeError if entry[\"content\"] is not a string but is a truthy non-string type (e.g., a list or dict). The isinstance check happens after the value is already accessed with entry[\"content\"], but the value could be any type. Consider using .get(\"content\") instead of direct access, or handle the case where content might be None before the isinstance check.

Suggested change
user_texts = [entry["content"] for entry in messages if entry.get("role") == "user" and isinstance(entry.get("content"), str)]
user_texts = [entry.get("content") for entry in messages if entry.get("role") == "user" and isinstance(entry.get("content"), str)]

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are receiving a normalized message list, so this is not an issue

Copy link
Collaborator

@gabor-openai gabor-openai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thank you

@gabor-openai gabor-openai merged commit ab3f458 into main Oct 29, 2025
3 checks passed
@gabor-openai gabor-openai deleted the dev/steve/PI_eval branch October 29, 2025 16:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants