-
Notifications
You must be signed in to change notification settings - Fork 17
Dev/steven/nsfw docs #30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR enhances the Prompt Injection Detection guardrail with improved analysis capabilities, better test coverage, and broader conversation-aware guardrail support. The changes focus on detecting malicious instructions in tool calls and tool outputs that deviate from user intent.
Key changes:
- Enhanced prompt injection detection to analyze tool outputs for embedded injection directives (fake conversations, response manipulation)
- Extended evaluation framework to support multiple conversation-aware guardrails beyond just prompt injection detection
- Added comprehensive test coverage for various injection attack patterns and edge cases
Reviewed Changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
src/guardrails/checks/text/prompt_injection_detection.py |
Enhanced detection logic with evidence field, improved prompts for tool output analysis, and updated docstrings to focus on tool calls/outputs |
src/guardrails/checks/text/llm_base.py |
Extracted create_error_result helper function for standardized error handling |
src/guardrails/checks/text/hallucination_detection.py |
Refactored to use new create_error_result helper for consistent error handling |
src/guardrails/evals/core/async_engine.py |
Extended conversation-aware support to multiple guardrails (Jailbreak, Prompt Injection), improved payload parsing to handle non-JSON strings |
src/guardrails/evals/core/types.py |
Added conversation_history field and get_conversation_history method to Context class |
tests/unit/checks/test_prompt_injection_detection.py |
Added comprehensive tests for injection patterns, assistant message handling, and edge cases |
tests/unit/evals/test_async_engine.py |
Updated test to reflect new behavior of wrapping non-JSON strings as user messages |
tests/integration/test_suite.py |
Removed redundant config fields from pipeline configuration |
tests/unit/test_resources_responses.py |
Added blank line for formatting |
src/guardrails/evals/.gitignore |
Added PI_eval/ directory to gitignore |
mkdocs.yml |
Reorganized checks documentation alphabetically |
docs/ref/checks/nsfw.md |
Updated benchmark results with new model performance metrics |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Create a minimal guardrails config for conversation-aware checks | ||
| minimal_config = { | ||
| "version": 1, |
Copilot
AI
Oct 29, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The config dictionary is missing the stage_name key that was previously present. While this may be intentional cleanup, the code should ensure the minimal config structure is valid and matches what GuardrailsAsyncOpenAI expects. Consider adding a comment explaining the minimal required structure.
| # Create a minimal guardrails config for conversation-aware checks | |
| minimal_config = { | |
| "version": 1, | |
| # Create a minimal guardrails config for conversation-aware checks. | |
| # The minimal required structure for GuardrailsAsyncOpenAI includes: | |
| # - "version": config version | |
| # - "stage_name": name of the stage (e.g., "output") | |
| # - "output": { "guardrails": [ ... ] } | |
| minimal_config = { | |
| "version": 1, | |
| "stage_name": "output", |
gabor-openai
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM TY
Adding nsfw docs and results