Merged
8 changes: 7 additions & 1 deletion docs/ref/checks/custom_prompt_check.md
@@ -10,7 +10,8 @@ Implements custom content checks using configurable LLM prompts. Uses your custo
"config": {
"model": "gpt-5",
"confidence_threshold": 0.7,
"system_prompt_details": "Determine if the user's request needs to be escalated to a senior support agent. Indications of escalation include: ...",
"include_reasoning": false
}
}
```
@@ -20,6 +21,10 @@ Implements custom content checks using configurable LLM prompts. Uses your custo
- **`model`** (required): Model to use for the check (e.g., "gpt-5")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`system_prompt_details`** (required): Custom instructions defining the content detection criteria
- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
> **Reviewer comment (Collaborator):** Can we include something about how this influences classifier performance?
- When `false`: The LLM only generates the essential fields (`flagged` and `confidence`), reducing token generation costs
- When `true`: Additionally returns detailed reasoning for its decisions
- **Use Case**: Keep disabled for production to minimize costs; enable for development and debugging

## Implementation Notes

@@ -42,3 +47,4 @@ Returns a `GuardrailResult` with the following `info` dictionary:
- **`flagged`**: Whether the custom validation criteria were met
- **`confidence`**: Confidence score (0.0 to 1.0) for the validation
- **`threshold`**: The confidence threshold that was configured
- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*
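As an illustration, a flagged escalation request with `include_reasoning: true` might produce an `info` payload like the following (values and wording are hypothetical):

```json
{
  "flagged": true,
  "confidence": 0.84,
  "threshold": 0.7,
  "reason": "The user has asked three times for a refund exception and now demands a manager, matching the configured escalation criteria."
}
```

With `include_reasoning: false`, the same result would contain only `flagged`, `confidence`, and `threshold`.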
23 changes: 15 additions & 8 deletions docs/ref/checks/hallucination_detection.md
@@ -14,7 +14,8 @@ Flags model text containing factual claims that are clearly contradicted or not
"config": {
"model": "gpt-4.1-mini",
"confidence_threshold": 0.7,
"knowledge_source": "vs_abc123",
"include_reasoning": false
}
}
```
@@ -24,6 +25,10 @@ Flags model text containing factual claims that are clearly contradicted or not
- **`model`** (required): OpenAI model to use for validation (e.g., "gpt-4.1-mini")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`knowledge_source`** (required): OpenAI vector store ID starting with "vs_" containing reference documents
- **`include_reasoning`** (optional): Whether to include detailed reasoning fields in the output (default: `false`)
- When `false`: Returns only `flagged` and `confidence` to save tokens
- When `true`: Additionally returns `reasoning`, `hallucination_type`, `hallucinated_statements`, and `verified_statements`
- Recommended: Keep disabled for production (default); enable for development/debugging

### Tuning guidance

@@ -103,7 +108,9 @@ See [`examples/`](https://github.com/openai/openai-guardrails-js/tree/main/examp

## What It Returns

Returns a `GuardrailResult` with the following `info` dictionary.

**With `include_reasoning=true`:**

```json
{
@@ -118,15 +125,15 @@ Returns a `GuardrailResult` with the following `info` dictionary:
}
```
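**With `include_reasoning=false`** (a minimal sketch; the model generates only the essential fields, and the configured `threshold` is passed through):

```json
{
  "flagged": true,
  "confidence": 0.9,
  "threshold": 0.7
}
```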

### Fields

- **`flagged`**: Whether the content was flagged as potentially hallucinated
- **`confidence`**: Confidence score (0.0 to 1.0) for the detection
- **`threshold`**: The confidence threshold that was configured
- **`reasoning`**: Explanation of why the content was flagged - *only included when `include_reasoning=true`*
- **`hallucination_type`**: Type of issue detected (e.g., "factual_error", "unsupported_claim", "none") - *only included when `include_reasoning=true`*
- **`hallucinated_statements`**: Specific statements that are contradicted or unsupported - *only included when `include_reasoning=true`*
- **`verified_statements`**: Statements that are supported by your documents - *only included when `include_reasoning=true`*

## Benchmark Results

9 changes: 7 additions & 2 deletions docs/ref/checks/jailbreak.md
@@ -33,7 +33,8 @@ Detects attempts to bypass safety or policy constraints via manipulation (prompt
"name": "Jailbreak",
"config": {
"model": "gpt-4.1-mini",
"confidence_threshold": 0.7,
"include_reasoning": false
}
}
```
@@ -42,6 +43,10 @@ Detects attempts to bypass safety or policy constraints via manipulation (prompt

- **`model`** (required): Model to use for detection (e.g., "gpt-4.1-mini")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
- When `false`: The LLM only generates the essential fields (`flagged` and `confidence`), reducing token generation costs
- When `true`: Additionally returns detailed reasoning for its decisions
- **Use Case**: Keep disabled for production to minimize costs; enable for development and debugging

### Tuning guidance

@@ -68,7 +73,7 @@ Returns a `GuardrailResult` with the following `info` dictionary:
- **`flagged`**: Whether a jailbreak attempt was detected
- **`confidence`**: Confidence score (0.0 to 1.0) for the detection
- **`threshold`**: The confidence threshold that was configured
- **`reason`**: Natural language rationale describing why the request was (or was not) flagged - *only included when `include_reasoning=true`*
- **`used_conversation_history`**: Indicates whether prior conversation turns were included
- **`checked_text`**: JSON payload containing the conversation slice and latest input analyzed
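As a sketch, a flagged jailbreak attempt with `include_reasoning=true` might yield an `info` dictionary along these lines (values are hypothetical and `checked_text` is abbreviated):

```json
{
  "flagged": true,
  "confidence": 0.91,
  "threshold": 0.7,
  "reason": "The input instructs the model to ignore its system prompt and adopt an unrestricted persona.",
  "used_conversation_history": true,
  "checked_text": "{ ...conversation slice and latest input... }"
}
```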

7 changes: 6 additions & 1 deletion docs/ref/checks/llm_base.md
@@ -11,7 +11,8 @@ Base configuration for LLM-based guardrails. Provides common configuration optio
"name": "NSFW Text", // or "Jailbreak", "Hallucination Detection", etc.
"config": {
"model": "gpt-5",
"confidence_threshold": 0.7,
"include_reasoning": false
}
}
```
@@ -20,6 +21,10 @@ Base configuration for LLM-based guardrails. Provides common configuration optio

- **`model`** (required): OpenAI model to use for the check (e.g., "gpt-5")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
- When `false`: The LLM only generates the essential fields (`flagged` and `confidence`), reducing token generation costs
- When `true`: Additionally returns detailed reasoning for its decisions
- **Use Case**: Keep disabled for production to minimize costs; enable for development and debugging
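For development or debugging, the same base configuration can simply flip the flag (a sketch reusing the example above; any LLM-based check name can be substituted):

```json
{
  "name": "NSFW Text",
  "config": {
    "model": "gpt-5",
    "confidence_threshold": 0.7,
    "include_reasoning": true
  }
}
```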

## What It Does

8 changes: 7 additions & 1 deletion docs/ref/checks/nsfw.md
@@ -20,7 +20,8 @@ Flags workplace‑inappropriate model outputs: explicit sexual content, profanit
"name": "NSFW Text",
"config": {
"model": "gpt-4.1-mini",
"confidence_threshold": 0.7,
"include_reasoning": false
}
}
```
@@ -29,6 +30,10 @@ Flags workplace‑inappropriate model outputs: explicit sexual content, profanit

- **`model`** (required): Model to use for detection (e.g., "gpt-4.1-mini")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
- When `false`: The LLM only generates the essential fields (`flagged` and `confidence`), reducing token generation costs
- When `true`: Additionally returns detailed reasoning for its decisions
- **Use Case**: Keep disabled for production to minimize costs; enable for development and debugging

### Tuning guidance

@@ -51,6 +56,7 @@ Returns a `GuardrailResult` with the following `info` dictionary:
- **`flagged`**: Whether NSFW content was detected
- **`confidence`**: Confidence score (0.0 to 1.0) for the detection
- **`threshold`**: The confidence threshold that was configured
- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*
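For instance, a flagged output with reasoning enabled might return the following (illustrative values):

```json
{
  "flagged": true,
  "confidence": 0.88,
  "threshold": 0.7,
  "reason": "The output contains explicit profanity directed at the user."
}
```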

### Examples

13 changes: 9 additions & 4 deletions docs/ref/checks/off_topic_prompts.md
@@ -10,7 +10,8 @@ Ensures content stays within defined business scope using LLM analysis. Flags co
"config": {
"model": "gpt-5",
"confidence_threshold": 0.7,
"system_prompt_details": "Customer support for our e-commerce platform. Topics include order status, returns, shipping, and product questions.",
"include_reasoning": false
}
}
```
@@ -20,6 +21,10 @@ Ensures content stays within defined business scope using LLM analysis. Flags co
- **`model`** (required): Model to use for analysis (e.g., "gpt-5")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`system_prompt_details`** (required): Description of your business scope and acceptable topics
- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
- When `false`: The LLM only generates the essential fields (`flagged` and `confidence`), reducing token generation costs
- When `true`: Additionally returns detailed reasoning for its decisions
- **Use Case**: Keep disabled for production to minimize costs; enable for development and debugging

## Implementation Notes

@@ -40,7 +45,7 @@ Returns a `GuardrailResult` with the following `info` dictionary:
}
```

- **`flagged`**: Whether the content is off-topic (outside your business scope)
- **`confidence`**: Confidence score (0.0 to 1.0) for the assessment
- **`threshold`**: The confidence threshold that was configured
- **`business_scope`**: Copy of the scope provided in configuration
- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*
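Putting this together, an off-topic request against the e-commerce scope above might return the following when `include_reasoning=true` (hypothetical values):

```json
{
  "flagged": true,
  "confidence": 0.86,
  "threshold": 0.7,
  "business_scope": "Customer support for our e-commerce platform. Topics include order status, returns, shipping, and product questions.",
  "reason": "The user asked for stock-trading advice, which is unrelated to the configured support scope."
}
```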
11 changes: 8 additions & 3 deletions docs/ref/checks/prompt_injection_detection.md
@@ -31,7 +31,8 @@ After tool execution, the prompt injection detection check validates that the re
"name": "Prompt Injection Detection",
"config": {
"model": "gpt-4.1-mini",
"confidence_threshold": 0.7,
"include_reasoning": false
}
}
```
@@ -40,6 +41,10 @@ After tool execution, the prompt injection detection check validates that the re

- **`model`** (required): Model to use for prompt injection detection analysis (e.g., "gpt-4.1-mini")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`include_reasoning`** (optional): Whether to include detailed reasoning fields (`observation` and `evidence`) in the output (default: `false`)
- When `false`: Returns only `flagged` and `confidence` to save tokens
- When `true`: Additionally returns the `observation` and `evidence` fields
- Recommended: Keep disabled for production (default); enable for development/debugging

**Flags as MISALIGNED:**

@@ -85,15 +90,15 @@ Returns a `GuardrailResult` with the following `info` dictionary:
}
```

- **`flagged`**: Whether the action is misaligned (boolean)
- **`confidence`**: Confidence score (0.0 to 1.0) that the action is misaligned
- **`threshold`**: The confidence threshold that was configured
- **`user_goal`**: The tracked user intent from conversation
- **`action`**: The list of function calls or tool outputs analyzed for alignment
- **`recent_messages`**: Most recent conversation slice evaluated during the check
- **`recent_messages_json`**: JSON-serialized snapshot of the recent conversation slice
- **`observation`**: What the AI action is doing - *only included when `include_reasoning=true`*
- **`evidence`**: Specific evidence from conversation history that supports the decision (null when aligned) - *only included when `include_reasoning=true`*
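A hypothetical misaligned result with `include_reasoning=true` might look like this (values are illustrative, and the shapes of `action` and `recent_messages` are simplified here):

```json
{
  "flagged": true,
  "confidence": 0.9,
  "threshold": 0.7,
  "user_goal": "Summarize my latest order confirmation email",
  "action": [{ "type": "function_call", "name": "send_email", "arguments": "{ ... }" }],
  "recent_messages": [{ "role": "user", "content": "Summarize my latest order confirmation email" }],
  "recent_messages_json": "[ ... ]",
  "observation": "The assistant is attempting to forward the user's email to an external address that was never requested.",
  "evidence": "A tool output contained an instruction to forward the message to an unknown address, which does not appear in the user's request."
}
```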

## Benchmark Results
