Dev/steven/nsfw docs #30

steven10a · 2025-10-29T02:02:50Z

Adding nsfw docs and results

Copilot

Pull Request Overview

This PR enhances the Prompt Injection Detection guardrail with improved analysis capabilities, better test coverage, and broader conversation-aware guardrail support. The changes focus on detecting malicious instructions in tool calls and tool outputs that deviate from user intent.

Key changes:

Enhanced prompt injection detection to analyze tool outputs for embedded injection directives (fake conversations, response manipulation)
Extended evaluation framework to support multiple conversation-aware guardrails beyond just prompt injection detection
Added comprehensive test coverage for various injection attack patterns and edge cases

Reviewed Changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
`src/guardrails/checks/text/prompt_injection_detection.py`	Enhanced detection logic with evidence field, improved prompts for tool output analysis, and updated docstrings to focus on tool calls/outputs
`src/guardrails/checks/text/llm_base.py`	Extracted `create_error_result` helper function for standardized error handling
`src/guardrails/checks/text/hallucination_detection.py`	Refactored to use new `create_error_result` helper for consistent error handling
`src/guardrails/evals/core/async_engine.py`	Extended conversation-aware support to multiple guardrails (Jailbreak, Prompt Injection), improved payload parsing to handle non-JSON strings
`src/guardrails/evals/core/types.py`	Added `conversation_history` field and `get_conversation_history` method to Context class
`tests/unit/checks/test_prompt_injection_detection.py`	Added comprehensive tests for injection patterns, assistant message handling, and edge cases
`tests/unit/evals/test_async_engine.py`	Updated test to reflect new behavior of wrapping non-JSON strings as user messages
`tests/integration/test_suite.py`	Removed redundant config fields from pipeline configuration
`tests/unit/test_resources_responses.py`	Added blank line for formatting
`src/guardrails/evals/.gitignore`	Added `PI_eval/` directory to gitignore
`mkdocs.yml`	Reorganized checks documentation alphabetically
`docs/ref/checks/nsfw.md`	Updated benchmark results with new model performance metrics

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-10-29T02:04:37Z

src/guardrails/evals/core/async_engine.py

+                    # Create a minimal guardrails config for conversation-aware checks
                    minimal_config = {
                        "version": 1,


The config dictionary is missing the stage_name key that was previously present. While this may be intentional cleanup, the code should ensure the minimal config structure is valid and matches what GuardrailsAsyncOpenAI expects. Consider adding a comment explaining the minimal required structure.

Suggested change

# Create a minimal guardrails config for conversation-aware checks

minimal_config = {

"version": 1,

# Create a minimal guardrails config for conversation-aware checks.

# The minimal required structure for GuardrailsAsyncOpenAI includes:

# - "version": config version

# - "stage_name": name of the stage (e.g., "output")

# - "output": { "guardrails": [ ... ] }

minimal_config = {

"version": 1,

"stage_name": "output",

gabor-openai

LGTM TY

steven10a added 4 commits October 28, 2025 14:37

Updated prompt injection check

fae454b

Formatting changes

98ab91e

Removed legacy code

be2ced6

add nsfw docs

10b868c

Copilot AI review requested due to automatic review settings October 29, 2025 02:02

Copilot AI reviewed Oct 29, 2025

View reviewed changes

steven10a requested a review from gabor-openai October 29, 2025 14:11

gabor-openai approved these changes Oct 29, 2025

View reviewed changes

gabor-openai merged commit 12c4add into main Oct 29, 2025
9 checks passed

gabor-openai deleted the dev/steven/nsfw_docs branch October 29, 2025 16:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Dev/steven/nsfw docs #30

Dev/steven/nsfw docs #30

Uh oh!

steven10a commented Oct 29, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Oct 29, 2025

Uh oh!

gabor-openai left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

-                    # Create a minimal guardrails config for conversation-aware checks
-                    minimal_config = {
-                        "version": 1,
+                    # Create a minimal guardrails config for conversation-aware checks.
+                    # The minimal required structure for GuardrailsAsyncOpenAI includes:
+                    # - "version": config version
+                    # - "stage_name": name of the stage (e.g., "output")
+                    # - "output": { "guardrails": [ ... ] }
+                    minimal_config = {
+                        "version": 1,
+                        "stage_name": "output",

Dev/steven/nsfw docs #30

Dev/steven/nsfw docs #30

Uh oh!

Conversation

steven10a commented Oct 29, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI Oct 29, 2025

Choose a reason for hiding this comment

Uh oh!

gabor-openai left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants