A property-based testing framework for LLM agents, inspired by Haskell's QuickCheck. Automatically generates test cases, finds property violations, and shrinks failures to minimal reproducers.
Prerequisites: Python 3.9+
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Set OpenAI API key (optional, for non-deterministic mode)
export OPENAI_API_KEY='your-key-here'

- `--config`: Path to YAML configuration file
- `--deterministic`: Use stubbed LLM responses for reproducibility (no API key needed)
- `--seed`: Random seed for deterministic runs (ensures identical results)
- `--cases`: Number of test cases to generate
- `--parallel`: Run tests in parallel for faster execution
- `--output`: Output directory for results (default: `results/`)

# Deterministic mode (no API key needed, reproducible)
python3 quickcheck.py run --config config/spec.yaml --deterministic --seed 42 --cases 10
# Non-deterministic mode (uses real LLM)
python3 quickcheck.py run --config config/spec.yaml --cases 20
# Parallel execution (faster)
python3 quickcheck.py run --config config/spec.yaml --deterministic --parallel --cases 200

Deterministic mode uses stubbed LLM responses for reproducible testing:
python3 quickcheck.py run \
  --config config/spec.yaml \
  --deterministic \
  --seed 42 \
  --cases 100 \
  --parallel \
  --output results/my_test

Performance: Runs 200 test cases in ~5 seconds on a laptop.
Non-deterministic mode uses real LLM calls via OpenAI API:
export OPENAI_API_KEY='sk-...'
python3 quickcheck.py run \
  --config config/spec.yaml \
  --cases 50 \
  --parallel \
  --output results/llm_test

The input YAML configures test generation and execution:
templates:
  system: "You are a {{role}} assistant for ShopFast..."
  user: "Help me with {{task}}: {{context}}"
variables:
  role:
    type: enum
    values: [customer_support, order_specialist, returns_specialist]
  task:
    type: enum
    values: [order_status, refund_inquiry, order_cancellation]
  context:
    type: generated
    generator: llm  # or 'stub' for deterministic mode
    template: "Generate a realistic customer query about {{task}}"
mcp_tools:
  - name: order_lookup
    mock: true
    description: "Retrieve order information from database"
    responses:
      - condition: {order_id: {$gt: 0}}
        result: {status: "delivered", order_id: "{order_id}"}
requirements:
  - id: tool_citation
    type: protocol
    description: "Must cite tool name if used"
  - id: helpful_response
    type: domain
    description: "Response must be helpful and address the user's request"
execution:
  model: gpt-4o-mini
  temperature: 0
  max_tokens: 1024

Full Schema:
templates:
  system: string  # Prompt with {{variables}}
  user: string
variables:
  <name>:
    type: enum | generated
    values: [...]  # for enum
    generator: llm|stub  # for generated
    template: string  # generation prompt
mcp_tools:
  - name: string
    mock: boolean
    description: string
    responses: {...}  # mock responses
requirements:
  - id: string
    type: protocol | domain  # protocol=rules, domain=LLM judge
    description: string
execution:
  model: string
  temperature: float
  max_tokens: integer

Each run generates 5 files:
| File | Description |
|---|---|
| `*_summary.yaml` | Quick overview with one example per cluster + clickable trace refs |
| `*_reproducers.yaml` | Minimal reproducers with minimality proofs |
| `*_detailed.json` | All failures with trace references (no embedded traces) |
| `*_coverage.json` | Variable and property coverage stats |
| `*_traces.json` | Full execution traces (single source of truth) |

File Naming: run_<timestamp>_<hash>_<mode>_<seed>_<type>.<ext>
Example: run_20251009_030224_12ef9201_det_42_summary.yaml
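The naming scheme can be parsed mechanically, which helps when grouping the five files of one run. The sketch below is an illustration only: the regular expression and the `parse_run_filename` helper are assumptions, and the mode values `det` and `rand` and the optional seed are inferred from the sample filenames in this README rather than from the tool's source.

```python
import re
from typing import Optional

# Pattern inferred from the naming scheme above. The seed group is optional
# because the non-deterministic sample files in this README omit it.
RUN_FILE_RE = re.compile(
    r"^run_(?P<timestamp>\d{8}_\d{6})_(?P<hash>[0-9a-f]+)_(?P<mode>det|rand)"
    r"(?:_(?P<seed>\d+))?_(?P<type>[a-z]+)\.(?P<ext>yaml|json)$"
)


def parse_run_filename(name: str) -> Optional[dict]:
    """Split a report filename into its parts, or return None if it does not match."""
    match = RUN_FILE_RE.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    if parts["seed"] is not None:
        parts["seed"] = int(parts["seed"])
    return parts
```

For the example filename above, the helper would report mode `det`, seed `42`, and type `summary`.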
Cross-Reference Navigation: Files use clickable references:
"trace_ref": "run_xxx_traces.json", // Cmd+Click to open
"trace_location": "\"failure_id\": 0" // Cmd+F to find exact case

Properties are classified into two categories for optimal checking:
Protocol Properties (Deterministic)
- Structural/format requirements checkable with rules
- Examples: tool citation, response length, message format
- Implementation: Rule-based checkers in `src/judge/judge.py`
- Benefit: Fast, reliable, no API calls needed
Domain Properties (LLM-as-Judge)
- Semantic requirements requiring understanding
- Examples: helpfulness, appropriateness, completeness
- Implementation: LLM evaluation via adapter with fallback to heuristics
- Benefit: Rich semantic checking when needed
This separation enables fast deterministic testing while preserving semantic evaluation capability.
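To make the protocol side concrete, here is a minimal sketch of what a rule-based `tool_citation` check could look like. It is not the code in `src/judge/judge.py`; the `Violation` dataclass and the `check_tool_citation` function are hypothetical names, and the case-insensitive substring match is an assumed rule.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Violation:
    property_id: str
    description: str
    evidence: str
    severity: str = "high"


def check_tool_citation(response: str, tool_calls: List[dict]) -> Optional[Violation]:
    """Return a Violation if any called tool is not named in the response."""
    lowered = response.lower()
    for call in tool_calls:
        tool = call["tool"]
        # Assumed rule: the tool name must appear verbatim, ignoring case.
        if tool.lower() not in lowered:
            return Violation(
                property_id="tool_citation",
                description="Must cite tool name if used",
                evidence=f'Tool "{tool}" used but not mentioned in response',
            )
    return None
```

A check like this needs no API call, which is what makes protocol properties cheap to evaluate on every generated case.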
The shrinker defines a partial order over test cases where case1 < case2 if case1 is "simpler":
Ordering Rules:
- Enum variables: Prefer earlier values in the list (canonical representatives)
- Generated strings: Shorter is simpler (token/character count)
Shrinking Algorithm:
- Strategy: Greedy descent through the lattice
- Process: Apply simplification transformations one at a time
- Termination: Stop when no neighbor is both simpler and still fails
- Minimality Proof: Test all neighbors to confirm they pass
Example Lattice:
{role: returns_specialist, behavior: be_detailed, context: "Long query..."}
↓
{role: customer_support, behavior: be_detailed, context: "Long query..."} ← (canonical role)
↓
{role: customer_support, behavior: be_concise, context: "Long query..."} ← (canonical behavior)
↓
{role: customer_support, behavior: be_concise, context: "Short"} ← (truncated context)
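The greedy descent described above can be sketched in a few lines. This is an illustrative version under stated assumptions, not the shrinker's actual code: `simpler_neighbors` and `shrink` are hypothetical names, `fails` stands in for running the agent and the judge, and halving a string is just one possible simplification step.

```python
from typing import Callable, Dict, Iterator, List, Tuple

Case = Dict[str, str]


def simpler_neighbors(case: Case, enums: Dict[str, List[str]]) -> Iterator[Case]:
    """Yield cases that differ from `case` in one variable and are simpler."""
    for name, value in case.items():
        if name in enums:
            # Earlier enum values are simpler; the canonical (first) one is tried first.
            for candidate in enums[name][: enums[name].index(value)]:
                yield {**case, name: candidate}
        elif len(value) > 1:
            # Generated strings: a shorter prefix is simpler.
            yield {**case, name: value[: len(value) // 2]}


def shrink(case: Case, fails: Callable[[Case], bool],
           enums: Dict[str, List[str]]) -> Tuple[Case, List[Case]]:
    """Greedy descent: move to the first simpler neighbor that still fails."""
    current = dict(case)
    while True:
        neighbors = list(simpler_neighbors(current, enums))
        still_failing = next((n for n in neighbors if fails(n)), None)
        if still_failing is None:
            # Every simpler neighbor passes, which serves as the minimality evidence.
            return current, neighbors
        current = still_failing
```

Each step strictly lowers an enum index or a string length, so the loop terminates. The second return value lists the passing neighbors that back the minimality claim.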
Failures are clustered by violation signature:
signature = hash(property_id + severity + description_pattern)

Algorithm:
- Extract signature from each failure's violations
- Group failures with identical signatures
- Select minimal reproducer as cluster representative
- Report cluster frequency and examples
Benefits:
- De-duplicates similar failures
- Identifies distinct failure modes
- Prioritizes by frequency
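The clustering steps can be sketched as follows. The function names, the normalization used for `description_pattern`, and the `case_size` simplicity measure are all assumptions made for illustration; the reporter may compute these differently.

```python
import hashlib
import re
from collections import defaultdict


def violation_signature(violation: dict) -> str:
    """Hash property id, severity, and a normalized description pattern."""
    # Replace quoted fragments and numbers so case-specific details do not split clusters.
    pattern = re.sub(r'"[^"]*"|\d+', "_", violation.get("description", ""))
    raw = "|".join([violation["property_id"], violation.get("severity", ""), pattern])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:8]


def case_size(failure: dict) -> int:
    """Hypothetical simplicity measure: total length of the variable values."""
    return sum(len(str(v)) for v in failure["variables"].values())


def cluster_failures(failures: list) -> list:
    """Group failures by signature and report the most frequent clusters first."""
    groups = defaultdict(list)
    for failure in failures:
        signatures = {violation_signature(v) for v in failure["violations"]}
        for signature in sorted(signatures):
            groups[signature].append(failure)
    clusters = [
        {"cluster_id": sig, "count": len(members),
         "representative": min(members, key=case_size)}
        for sig, members in groups.items()
    ]
    return sorted(clusters, key=lambda c: (-c["count"], c["cluster_id"]))
```

A stable digest such as SHA-256 is used here instead of Python's built-in `hash()`, whose string hashes vary between processes and would give unstable cluster ids.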
We simplified MCP to a minimal testing interface:
Simplification Decisions:
- Mock-only mode: No real MCP server integration (task requirement: "simple")
- Minimal API: `call_tool(name, args) -> result`
- Deterministic responses: Hash-based for reproducibility
- No authentication: Testing doesn't need real credentials
Mock Implementation (src/mcp/mock.py):
- Deterministic response selection based on arguments + seed
- Configurable response templates in YAML
- Argument validation and error simulation
Off-the-shelf servers used: None (custom mock implementation)
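A hash-based mock along these lines would give the reproducibility described above. This is a sketch, not the contents of `src/mcp/mock.py`: the `MockToolAdapter` class, its constructor arguments, and the choice of SHA-256 over the tool name, arguments, and seed are assumptions.

```python
import hashlib
import json


class MockToolAdapter:
    """Minimal mock exposing call_tool(name, args) -> result."""

    def __init__(self, responses: dict, seed: int = 0):
        # responses maps a tool name to a non-empty list of candidate results.
        self.responses = responses
        self.seed = seed

    def call_tool(self, name: str, args: dict) -> dict:
        if name not in self.responses:
            raise KeyError(f"unknown tool: {name}")
        candidates = self.responses[name]
        # Serialize with sorted keys so argument order cannot change the choice.
        key = json.dumps({"tool": name, "args": args, "seed": self.seed}, sort_keys=True)
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(candidates)
        return candidates[index]
```

Because the selection depends only on the tool name, the arguments, and the seed, two runs with the same seed see the same tool results.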
Sample Config: config/spec.yaml - E-commerce customer support agent configuration
Sample Reports: See sample_reports/canonical_run/ for reference output from 10 test cases (seed 42)
The canonical run demonstrates the complete output structure. Each run generates 5 files:
- `*_summary.yaml` - Quick failure overview with clickable trace references
- `*_reproducers.yaml` - Minimal reproducers with minimality proofs
- `*_detailed.json` - All test cases with trace references (no embedded traces)
- `*_coverage.json` - Variable and property coverage statistics
- `*_traces.json` - Full execution traces (single source of truth)

Navigation: Start with *_summary.yaml, click trace_ref filenames, then Cmd+F the trace_location pattern to jump to exact failures
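The same navigation can be done programmatically. The sketch below assumes the traces file is a JSON array of entries carrying a `failure_id`, and that a failure's `case_id` matches that id, as the sample reports in this README suggest. `load_trace` is a hypothetical helper, not part of the tool.

```python
import json
from pathlib import Path


def load_trace(report_dir: str, failure: dict) -> dict:
    """Resolve a failure's trace_ref and return the matching trace entry."""
    trace_path = Path(report_dir) / failure["trace_ref"]
    with trace_path.open(encoding="utf-8") as handle:
        traces = json.load(handle)
    for entry in traces:
        # Assumption: case_id in the reports corresponds to failure_id in the traces.
        if entry.get("failure_id") == failure["case_id"]:
            return entry
    raise LookupError(f"no trace for case {failure['case_id']} in {trace_path.name}")
```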
Clear module boundaries as per task requirements:
src/
├── generator/ - Test case generation (enum + LLM)
├── runner/ - Agent execution with MCP tools
├── judge/ - Property evaluation (protocol + domain)
├── shrinker/ - Minimal reproducer finding
├── reporter/ - Failure clustering and reports
├── mcp/ - Mock MCP tool adapter
├── llm/ - LLM adapter (OpenAI + stub)
└── utils/ - Logging and utilities
3 Required Tests:
python3 quickcheck.py run --deterministic --seed 42 --cases 10

Evidence: See sample_reports/canonical_run/run_20251009_030224_12ef9201_det_42_reproducers.yaml:2
minimal_reproducers:
  - property: tool_citation
    violation: Must cite tool name if used
    evidence: 'Tool "order_lookup" used but not mentioned in response'
    variables:
      role: customer_support
      behavior: be_concise
      task: order_status
    minimality_proof:
      is_minimal: true
      reason: Changing any of 2 tested variables makes test pass
      passing_neighbors:
        - changed_variable: role
          from_value: customer_support
          to_value: returns_specialist
        - changed_variable: task
          from_value: order_status
          to_value: order_cancellation

✅ Verified: Shrinking reduced failure to minimal reproducer with proof
# Compare two runs with same seed
python3 scripts/compare_runs.py \
  sample_reports/determinism_proof/run1/run_20251009_030613_12ef9201_det_42_detailed.json \
  sample_reports/determinism_proof/run2/run_20251009_030614_12ef9201_det_42_detailed.json

Output:
================================================================================
DETERMINISTIC RUN COMPARISON
================================================================================
1. METADATA:
Run 1 seed: 42
Run 2 seed: 42
2. SUMMARY STATISTICS:
Total cases: Run1=10, Run2=10
Passes: Run1=2, Run2=2
Failures: Run1=8, Run2=8
✓ PASS: Runs are identical (deterministic)
✅ Verified: Same seed produces identical results
- Run 1: `sample_reports/determinism_proof/run1/run_20251009_030613_12ef9201_det_42_detailed.json`
- Run 2: `sample_reports/determinism_proof/run2/run_20251009_030614_12ef9201_det_42_detailed.json`
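Because the two run files carry different timestamps in their names, a comparison has to ignore fields that embed the run id. The sketch below shows one way to do that. It is not the contents of `scripts/compare_runs.py`; the `VOLATILE_KEYS` set and both function names are assumptions.

```python
import json

# Assumed run-specific keys: both embed the timestamp and so differ between runs.
VOLATILE_KEYS = {"run_id", "trace_ref"}


def normalize(node):
    """Recursively drop run-specific keys from a parsed report."""
    if isinstance(node, dict):
        return {k: normalize(v) for k, v in node.items() if k not in VOLATILE_KEYS}
    if isinstance(node, list):
        return [normalize(item) for item in node]
    return node


def runs_identical(path_a: str, path_b: str) -> bool:
    """Compare two detailed reports after removing run-specific fields."""
    with open(path_a, encoding="utf-8") as fa, open(path_b, encoding="utf-8") as fb:
        return normalize(json.load(fa)) == normalize(json.load(fb))
```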
python3 quickcheck.py run --config config/spec.yaml --cases 3 --parallel

Terminal Output:
Running 3 test cases from config/spec.yaml
Parallel execution enabled with 8 workers
Running tests... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
Found 3 failures
Failure Summary
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Property ┃ Failures ┃ Example ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ tool_citation │ 3 │ Tool "order_lookup" used but not mentioned in │
│ │ │ resp... │
└───────────────┴──────────┴───────────────────────────────────────────────────┘
Evidence: See sample_reports/test3_llm_run/run_20251009_051812_12ef9201_rand_summary.yaml:1
failure_clusters:
  - property: tool_citation
    description: Must cite tool name if used
    count: 3
    minimal_example: 'Tool "order_lookup" used but not mentioned in response:
      "The order status for order #452 is currently pending payment...."'

How It Was Detected: See sample_reports/test3_llm_run/run_20251009_051812_12ef9201_rand_traces.json:70
{
"violations": [
{
"property_id": "tool_citation",
"description": "Must cite tool name if used",
"evidence": "Tool \"order_lookup\" used but not mentioned in response: \"The order status for order #452 is currently pending payment....\"",
"severity": "high"
}
]
}

The judge detected that:
- Agent called `order_lookup` tool (timestamp: 1759967244.068966)
- Agent response: "The order status for order #452 is currently pending payment."
- Violation: Response does not mention "order_lookup" tool name ❌
Coverage: See sample_reports/test3_llm_run/run_20251009_051812_12ef9201_rand_coverage.json:23
{
"properties": {
"tool_citation": {
"tested": true,
"violated": true,
"violation_count": 3
},
"helpful_response": {
"tested": true,
"violated": false,
"violation_count": 0
}
}
}

✅ Verified: Protocol property (tool_citation) detected 3 violations via rule-based checking
Performance: Exceeds requirements (5s vs 120s target for 200 cases)
metadata:
  run_id: run_20251009_030224_12ef9201_det_42
  total_cases: 10
  failures: 8
  deterministic: true
  seed: 42
failure_clusters:
  - cluster_id: 5eaeb96e
    property: tool_citation
    count: 8
    failures:
      - case_id: 0
        shrunk_variables:
          role: customer_support
          task: order_status
        trace_ref: run_20251009_030224_12ef9201_det_42_traces.json
        trace_search: '"failure_id": 0'
        minimality_proof:
          is_minimal: true
          passing_neighbors_count: 2

minimal_reproducers:
  - cluster_id: 5eaeb96e
    property: tool_citation
    variables:
      role: customer_support
      behavior: be_concise
      task: order_status
    minimality_proof:
      is_minimal: true
      reason: Changing any of 2 tested variables makes test pass
      passing_neighbors:
        - changed_variable: role
          from_value: customer_support
          to_value: returns_specialist

[
{
"failure_id": 0,
"trace": {
"messages": [
{"role": "system", "content": "You are a customer_support..."},
{"role": "user", "content": "Help with order_status..."},
{"role": "assistant", "content": "The status is delivered..."}
],
"tool_calls": [
{"tool": "order_lookup", "arguments": {"order_id": 1}}
],
"timing": {"total": 0.0008, "llm": 0.0001, "tools": 0.1}
},
"violations": [
{
"property_id": "tool_citation",
"evidence": "Tool 'order_lookup' used but not mentioned"
}
]
}
]

{
"metadata": {
"run_id": "run_20251009_030224_12ef9201_det_42",
"total_cases": 10,
"passes": 2,
"failures": 8,
"deterministic": true,
"seed": 42
},
"summary": {
"pass_rate": 0.2,
"failure_rate": 0.8,
"unique_failure_modes": 1,
"properties_tested": 2,
"properties_violated": 1
},
"failure_clusters": [
{
"cluster_id": "5eaeb96e",
"property": "tool_citation",
"count": 8,
"failures": [
{
"case_id": 0,
"trace_ref": "run_20251009_030224_12ef9201_det_42_traces.json",
"trace_location": "\"failure_id\": 0",
"violations": [...],
"shrunk_variables": {...}
}
]
}
]
}

{
"variables": {
"role": {
"total_values": 3,
"covered_values": 3,
"coverage_percent": 100.0,
"uncovered_values": []
},
"task": {
"total_values": 4,
"covered_values": 4,
"coverage_percent": 100.0
}
},
"properties": {
"tool_citation": {
"tested": true,
"violated": true,
"violation_count": 8
},
"helpful_response": {
"tested": true,
"violated": false,
"violation_count": 0
}
},
"total_combinations_tested": 10
}
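
The per-variable figures in the coverage report could be derived from the generated cases along these lines. This is a sketch under assumptions: `variable_coverage` is a hypothetical helper, it covers only enum variables, and the output keys simply mirror the sample `*_coverage.json` shown in this README.

```python
from typing import Dict, List


def variable_coverage(enums: Dict[str, List[str]], cases: List[dict]) -> dict:
    """Report which enum values were exercised by the generated cases."""
    report = {}
    for name, values in enums.items():
        seen = {case[name] for case in cases if name in case}
        covered = [value for value in values if value in seen]
        percent = round(100.0 * len(covered) / len(values), 1) if values else 0.0
        report[name] = {
            "total_values": len(values),
            "covered_values": len(covered),
            "coverage_percent": percent,
            "uncovered_values": [value for value in values if value not in seen],
        }
    return report
```

Listing the uncovered values directly makes it easy to see whether raising `--cases` or changing the seed would be worthwhile.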