Making evals multi-turn. Updating PI sys prompt #20
Conversation
Pull Request Overview
This PR transitions the prompt injection detection system from incremental index-based tracking to a multi-turn evaluation approach that analyzes the full conversation context since the last user message. The system now evaluates all actions following each user turn, providing richer context for detecting injected instructions. The PR also updates the system prompt to better define prompt injection attacks and incorporates 1,046 real-world examples from the AgentDojo benchmark.
Key Changes:
- Removed `_injection_last_checked_index` tracking from all client classes
- Introduced `_run_incremental_prompt_injection` to evaluate conversations turn-by-turn
- Refactored `_parse_conversation_history` into `_slice_conversation_since_latest_user` for clearer intent
- Updated the system prompt with clearer definitions and evaluation criteria for prompt injection
- Integrated AgentDojo dataset examples alongside existing synthetic data
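The turn-slicing idea behind these changes can be sketched roughly as follows. Note that `slice_since_latest_user` is a hypothetical stand-in for the PR's `_slice_conversation_since_latest_user`; the real signature and message shape may differ:

```python
from typing import Any


def slice_since_latest_user(conversation: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return the latest user message plus every action that follows it.

    Walks the history backwards to find the most recent user turn, then
    returns that message and all subsequent assistant/tool messages, so the
    checker sees the full context of actions taken since the user last spoke.
    """
    for i in range(len(conversation) - 1, -1, -1):
        if conversation[i].get("role") == "user":
            return conversation[i:]
    return conversation  # no user turn found; evaluate everything


history = [
    {"role": "user", "content": "Book a flight"},
    {"role": "assistant", "content": "Searching..."},
    {"role": "user", "content": "Make it business class"},
    {"role": "tool", "content": "IGNORE PREVIOUS INSTRUCTIONS..."},
]
print(slice_since_latest_user(history))  # last user turn plus the tool output
```

Compared with index-based tracking, this approach needs no per-client bookkeeping state, which is presumably why the `_injection_last_checked_index` fields could be removed.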
Reviewed Changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/guardrails/client.py | Removed index tracking fields and methods from all client classes |
| src/guardrails/types.py | Removed injection index protocol methods from GuardrailLLMContextProto |
| src/guardrails/checks/text/prompt_injection_detection.py | Refactored to slice conversations from latest user message; updated system prompt with injection-focused criteria |
| src/guardrails/evals/core/async_engine.py | Added incremental multi-turn evaluation logic and conversation payload parsing |
| src/guardrails/evals/guardrail_evals.py | Updated import handling with fallback for direct script execution |
| src/guardrails/agents.py | Removed index tracking methods from ToolConversationContext |
| tests/unit/test_client_sync.py | Updated tests to verify absence of index tracking methods |
| tests/unit/test_client_async.py | Updated tests to verify absence of index tracking methods |
| tests/unit/test_agents.py | Updated tests to verify absence of index tracking methods |
| tests/unit/checks/test_prompt_injection_detection.py | Removed index tracking from fake context and assertions |
| tests/unit/evals/test_async_engine.py | New test file for incremental evaluation and payload parsing logic |
| docs/ref/checks/prompt_injection_detection.md | Updated benchmark description and performance metrics with AgentDojo dataset |
Pull Request Overview
Copilot reviewed 16 out of 17 changed files in this pull request and generated 2 comments.
Pull Request Overview
Copilot reviewed 16 out of 17 changed files in this pull request and generated 4 comments.
```python
if isinstance(content, str):
    return content

if isinstance(content, list):
    parts: list[str] = []
    for item in content:
        if isinstance(item, dict):
            text = item.get("text")
            if text:
                parts.append(text)
                continue
            fallback = item.get("content")
            if isinstance(fallback, str):
                parts.append(fallback)
        elif isinstance(item, str):
            parts.append(item)
        else:
            parts.append(str(item))
    return " ".join(filter(None, parts))

if content is None:
    return ""

return str(content)
```
Copilot AI · Oct 16, 2025
The content coercion logic does not filter out falsy non-None values (e.g., empty strings) before calling str(item) on line 302. This means an empty string in the list will become '', then pass through filter(None, ...) and still contribute whitespace to the joined result if there are subsequent non-empty parts. Consider checking if item: before appending str(item) to ensure consistent behavior.
I disagree with this. We already guard against that edge case with `filter(None, parts)`. Implementing this suggestion could actually hide legitimate values like `0` or `"0"`.
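A quick illustration of the disagreement, using hypothetical values rather than anything from the PR: `filter(None, parts)` already drops empty strings from the join, while a pre-emptive `if item:` guard would also discard falsy-but-meaningful values such as `0`:

```python
# Empty strings are removed by filter(None, ...) before joining,
# so they contribute no extra whitespace to the result.
parts = ["hello", "", "world"]
print(" ".join(filter(None, parts)))  # -> "hello world"

# A falsy-but-meaningful item like 0 survives coercion, because
# str(0) == "0" is truthy and passes the filter.
items = ["count:", 0]
coerced = [str(i) for i in items]
print(" ".join(filter(None, coerced)))  # -> "count: 0"

# Guarding with `if item:` before coercion, as suggested, would drop the 0:
guarded = [str(i) for i in items if i]
print(" ".join(filter(None, guarded)))  # -> "count:"
```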
```diff
 return _create_skip_result(
     "No LLM actions or user intent to evaluate",
     config.confidence_threshold,
     user_goal=user_intent_dict.get("most_recent_message", "N/A"),
-    action=llm_actions,
+    action=recent_messages,
     data=str(data),
 )
```
Copilot AI · Oct 16, 2025
The skip message 'No LLM actions or user intent to evaluate' is ambiguous because this condition only checks user_intent_dict['most_recent_message'] being empty, not actionable_messages. Consider splitting this into two separate conditions or clarifying the message to indicate specifically that no user intent was found.
[nit] "No user intent to evaluate" would be clearer. But this is very much an edge case and should never happen; we should always have at least one user message.
```python
def _parse_conversation_payload(data: str) -> list[Any] | None:
    """Attempt to parse sample data into a conversation history list."""
    try:
        payload = json.loads(data)
    except json.JSONDecodeError:
        return None
```
Copilot AI · Oct 16, 2025
The docstring does not document the return type behavior. It should explicitly state that the function returns a list of conversation messages if successful, or None if parsing fails or the payload structure is invalid.
I also think this is a [nit].
```python
def __getattr__(name: str) -> Any:
    if name == "GuardrailEval":
        from guardrails.evals.guardrail_evals import GuardrailEval as _GuardrailEval

        return _GuardrailEval
```
Copilot AI · Oct 16, 2025
The lazy import in __getattr__ does not cache the imported class, causing repeated imports on every attribute access. Consider caching _GuardrailEval in a module-level variable or using sys.modules to avoid redundant imports.
Suggested change:

```diff
-def __getattr__(name: str) -> Any:
-    if name == "GuardrailEval":
-        from guardrails.evals.guardrail_evals import GuardrailEval as _GuardrailEval
-        return _GuardrailEval
+_cached_GuardrailEval = None
+
+def __getattr__(name: str) -> Any:
+    global _cached_GuardrailEval
+    if name == "GuardrailEval":
+        if _cached_GuardrailEval is None:
+            from guardrails.evals.guardrail_evals import GuardrailEval as _GuardrailEval
+            _cached_GuardrailEval = _GuardrailEval
+        return _cached_GuardrailEval
```
I think the evals tool could use a refresh and restructure. This is okay for now, and I will follow up with an eval-specific PR.
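As an aside, a common alternative to a module-level cache variable is to write the resolved attribute back into the module's namespace, so subsequent lookups hit the module dict directly and never reach `__getattr__` at all (PEP 562). A minimal sketch, using a throwaway module and a stand-in class rather than the real `GuardrailEval` import:

```python
import sys
import types

# Build a throwaway module whose __getattr__ lazily resolves "Expensive"
# once, then caches it in the module's own namespace.
mod = types.ModuleType("lazy_demo")


def _module_getattr(name):
    if name == "Expensive":
        class Expensive:  # stand-in for the real lazily imported class
            pass
        # Cache in the module dict so __getattr__ is not called again.
        setattr(mod, name, Expensive)
        return Expensive
    raise AttributeError(name)


mod.__getattr__ = _module_getattr
sys.modules["lazy_demo"] = mod

first = mod.Expensive   # triggers __getattr__, caches the class
second = mod.Expensive  # served straight from the module dict
print(first is second)  # -> True: same cached class both times
```

The trade-off versus the explicit `global` cache is the same either way; this variant just avoids the sentinel variable.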
- **Data type**: Internal synthetic dataset simulating realistic agent traces
- **Test scenarios**: Multi-turn conversations with function calls and tool outputs
- **Synthetic dataset**: 1,000 samples with 500 positive cases (50% prevalence) simulating realistic agent traces
- **AgentDojo dataset**: 1,046 samples from AgentDojo's workspace, travel, banking, and Slack suites combined with the "important_instructions" attack (949 positive cases, 97 negative samples)
In a follow-up PR, could you include a link to this dataset?