Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Binary file modified docs/benchmarking/alignment_roc_curves.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
22 changes: 16 additions & 6 deletions docs/evals.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,16 +4,26 @@ Evaluate guardrail performance against labeled datasets with precision, recall,

## Quick Start

### Invocation Options
Install the project (e.g., `pip install -e .`) and run the CLI entry point:
```bash
guardrails-evals --help
```
During local development you can run the module directly:
```bash
python -m guardrails.evals.guardrail_evals --help
```

### Basic Evaluation
```bash
python guardrail_evals.py \
guardrails-evals \
--config-path guardrails_config.json \
--dataset-path data.jsonl
```

### Benchmark Mode
```bash
python guardrail_evals.py \
guardrails-evals \
--config-path guardrails_config.json \
--dataset-path data.jsonl \
--mode benchmark \
Expand Down Expand Up @@ -154,15 +164,15 @@ The evaluation tool supports OpenAI, Azure OpenAI, and any OpenAI-compatible API

### OpenAI (Default)
```bash
python guardrail_evals.py \
guardrails-evals \
--config-path config.json \
--dataset-path data.jsonl \
--api-key sk-...
```

### Azure OpenAI
```bash
python guardrail_evals.py \
guardrails-evals \
--config-path config.json \
--dataset-path data.jsonl \
--azure-endpoint https://your-resource.openai.azure.com \
Expand All @@ -176,7 +186,7 @@ python guardrail_evals.py \
Any model which supports the OpenAI interface can be used with `--base-url` and `--api-key`.

```bash
python guardrail_evals.py \
guardrails-evals \
--config-path config.json \
--dataset-path data.jsonl \
--base-url http://localhost:11434/v1 \
Expand All @@ -198,4 +208,4 @@ python guardrail_evals.py \
## Next Steps

- See the [API Reference](./ref/eval/guardrail_evals.md) for detailed documentation
- Use [Wizard UI](https://guardrails.openai.com/) for configuring guardrails without code
- Use [Wizard UI](https://guardrails.openai.com/) for configuring guardrails without code
34 changes: 20 additions & 14 deletions docs/ref/checks/prompt_injection_detection.md
Original file line number Diff line number Diff line change
Expand Up @@ -67,8 +67,14 @@ Returns a `GuardrailResult` with the following `info` dictionary:
"confidence": 0.1,
"threshold": 0.7,
"user_goal": "What's the weather in Tokyo?",
"action": "get_weather(location='Tokyo')",
"checked_text": "Original input text"
"action": [
{
"type": "function_call",
"name": "get_weather",
"arguments": "{'location': 'Tokyo'}"
}
],
"checked_text": "[{'role': 'user', 'content': 'What is the weather in Tokyo?'}]"
}
```

Expand All @@ -77,18 +83,18 @@ Returns a `GuardrailResult` with the following `info` dictionary:
- **`confidence`**: Confidence score (0.0 to 1.0) that the action is misaligned
- **`threshold`**: The confidence threshold that was configured
- **`user_goal`**: The tracked user intent from conversation
- **`action`**: The specific action being evaluated
- **`checked_text`**: Original input text
- **`action`**: The list of function calls or tool outputs analyzed for alignment
- **`checked_text`**: Serialized conversation history inspected during analysis

## Benchmark Results

### Dataset Description

This benchmark evaluates model performance on a synthetic dataset of agent conversation traces:
This benchmark evaluates model performance on agent conversation traces:

- **Dataset size**: 1,000 samples with 500 positive cases (50% prevalence)
- **Data type**: Internal synthetic dataset simulating realistic agent traces
- **Test scenarios**: Multi-turn conversations with function calls and tool outputs
- **Synthetic dataset**: 1,000 samples with 500 positive cases (50% prevalence) simulating realistic agent traces
- **AgentDojo dataset**: 1,046 samples from AgentDojo's workspace, travel, banking, and Slack suite combined with the "important_instructions" attack (949 positive cases, 97 negative samples)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a follow-up PR, could you include a link to this dataset?

- **Test scenarios**: Multi-turn conversations with function calls and tool outputs across realistic workplace domains
- **Misalignment examples**: Unrelated function calls, harmful operations, and data leakage

**Example of misaligned conversation:**
Expand All @@ -107,12 +113,12 @@ This benchmark evaluates model performance on a synthetic dataset of agent conve

| Model | ROC AUC | Prec@R=0.80 | Prec@R=0.90 | Prec@R=0.95 | Recall@FPR=0.01 |
|---------------|---------|-------------|-------------|-------------|-----------------|
| gpt-5 | 0.9997 | 1.000 | 1.000 | 1.000 | 0.998 |
| gpt-5-mini | 0.9998 | 1.000 | 1.000 | 0.998 | 0.998 |
| gpt-5-nano | 0.9987 | 0.996 | 0.996 | 0.996 | 0.996 |
| gpt-4.1 | 0.9990 | 1.000 | 1.000 | 1.000 | 0.998 |
| gpt-4.1-mini (default) | 0.9930 | 1.000 | 1.000 | 1.000 | 0.986 |
| gpt-4.1-nano | 0.9431 | 0.982 | 0.845 | 0.695 | 0.000 |
| gpt-5 | 0.9604 | 0.998 | 0.995 | 0.963 | 0.431 |
| gpt-5-mini | 0.9796 | 0.999 | 0.999 | 0.966 | 0.000 |
| gpt-5-nano | 0.8651 | 0.963 | 0.963 | 0.951 | 0.056 |
| gpt-4.1 | 0.9846 | 0.998 | 0.998 | 0.998 | 0.000 |
| gpt-4.1-mini (default) | 0.9728 | 0.995 | 0.995 | 0.995 | 0.000 |
| gpt-4.1-nano | 0.8677 | 0.974 | 0.974 | 0.974 | 0.000 |

**Notes:**

Expand Down
1 change: 1 addition & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -76,6 +76,7 @@ packages = ["src/guardrails"]

[project.scripts]
guardrails = "guardrails.cli:main"
guardrails-evals = "guardrails.evals.guardrail_evals:main"

[tool.ruff]
line-length = 150
Expand Down
8 changes: 0 additions & 8 deletions src/guardrails/agents.py
Original file line number Diff line number Diff line change
Expand Up @@ -166,14 +166,6 @@ class ToolConversationContext:
def get_conversation_history(self) -> list:
return self.conversation_history

def get_injection_last_checked_index(self) -> int:
"""Return 0 to check all messages (required by prompt injection check)."""
return 0

def update_injection_last_checked_index(self, new_index: int) -> None:
"""No-op (required by prompt injection check interface)."""
pass

return ToolConversationContext(
guardrail_llm=base_context.guardrail_llm,
conversation_history=conversation_history,
Expand Down
Loading
Loading