Skip to content

Conversation

@steven10a
Copy link
Collaborator

Adding nsfw docs and results

Copilot AI review requested due to automatic review settings October 29, 2025 02:02
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances the Prompt Injection Detection guardrail with improved analysis capabilities, better test coverage, and broader conversation-aware guardrail support. The changes focus on detecting malicious instructions in tool calls and tool outputs that deviate from user intent.

Key changes:

  • Enhanced prompt injection detection to analyze tool outputs for embedded injection directives (fake conversations, response manipulation)
  • Extended evaluation framework to support multiple conversation-aware guardrails beyond just prompt injection detection
  • Added comprehensive test coverage for various injection attack patterns and edge cases

Reviewed Changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/guardrails/checks/text/prompt_injection_detection.py Enhanced detection logic with evidence field, improved prompts for tool output analysis, and updated docstrings to focus on tool calls/outputs
src/guardrails/checks/text/llm_base.py Extracted create_error_result helper function for standardized error handling
src/guardrails/checks/text/hallucination_detection.py Refactored to use new create_error_result helper for consistent error handling
src/guardrails/evals/core/async_engine.py Extended conversation-aware support to multiple guardrails (Jailbreak, Prompt Injection), improved payload parsing to handle non-JSON strings
src/guardrails/evals/core/types.py Added conversation_history field and get_conversation_history method to Context class
tests/unit/checks/test_prompt_injection_detection.py Added comprehensive tests for injection patterns, assistant message handling, and edge cases
tests/unit/evals/test_async_engine.py Updated test to reflect new behavior of wrapping non-JSON strings as user messages
tests/integration/test_suite.py Removed redundant config fields from pipeline configuration
tests/unit/test_resources_responses.py Added blank line for formatting
src/guardrails/evals/.gitignore Added PI_eval/ directory to gitignore
mkdocs.yml Reorganized checks documentation alphabetically
docs/ref/checks/nsfw.md Updated benchmark results with new model performance metrics

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +233 to 235
# Create a minimal guardrails config for conversation-aware checks
minimal_config = {
"version": 1,
Copy link

Copilot AI Oct 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The config dictionary is missing the stage_name key that was previously present. While this may be intentional cleanup, the code should ensure the minimal config structure is valid and matches what GuardrailsAsyncOpenAI expects. Consider adding a comment explaining the minimal required structure.

Suggested change
# Create a minimal guardrails config for conversation-aware checks
minimal_config = {
"version": 1,
# Create a minimal guardrails config for conversation-aware checks.
# The minimal required structure for GuardrailsAsyncOpenAI includes:
# - "version": config version
# - "stage_name": name of the stage (e.g., "output")
# - "output": { "guardrails": [ ... ] }
minimal_config = {
"version": 1,
"stage_name": "output",

Copilot uses AI. Check for mistakes.
Copy link
Collaborator

@gabor-openai gabor-openai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM TY

@gabor-openai gabor-openai merged commit 12c4add into main Oct 29, 2025
9 checks passed
@gabor-openai gabor-openai deleted the dev/steven/nsfw_docs branch October 29, 2025 16:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants