QuickCheck for LLM Agents

A property-based testing framework for LLM agents, inspired by Haskell's QuickCheck. Automatically generates test cases, finds property violations, and shrinks failures to minimal reproducers.

Quick Start

Installation

Prerequisites: Python 3.9+

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Set the OpenAI API key (only needed for non-deterministic mode)
export OPENAI_API_KEY='your-key-here'

CLI Flags

--config: Path to YAML configuration file
--deterministic: Use stubbed LLM responses for reproducibility (no API key needed)
--seed: Random seed for deterministic runs (ensures identical results)
--cases: Number of test cases to generate
--parallel: Run tests in parallel for faster execution
--output: Output directory for results (default: results/)

Running Tests

# Deterministic mode (no API key needed, reproducible)
python3 quickcheck.py run --config config/spec.yaml --deterministic --seed 42 --cases 10

# Non-deterministic mode (uses real LLM)
python3 quickcheck.py run --config config/spec.yaml --cases 20

# Parallel execution (faster)
python3 quickcheck.py run --config config/spec.yaml --deterministic --parallel --cases 200

How to Run

Deterministic Mode

Deterministic mode uses stubbed LLM responses for reproducible testing:

python3 quickcheck.py run \
  --config config/spec.yaml \
  --deterministic \
  --seed 42 \
  --cases 100 \
  --parallel \
  --output results/my_test

Performance: With --parallel, 200 deterministic test cases run in about 5 seconds on a laptop.

Non-Deterministic Mode

Non-deterministic mode makes real LLM calls through the OpenAI API:

export OPENAI_API_KEY='sk-...'

python3 quickcheck.py run \
  --config config/spec.yaml \
  --cases 50 \
  --parallel \
  --output results/llm_test

YAML Schemas

Input Configuration Schema

The input YAML configures test generation and execution:

templates:
  system: "You are a {{role}} assistant for ShopFast..."
  user: "Help me with {{task}}: {{context}}"

variables:
  role:
    type: enum
    values: [customer_support, order_specialist, returns_specialist]

  task:
    type: enum
    values: [order_status, refund_inquiry, order_cancellation]

  context:
    type: generated
    generator: llm  # or 'stub' for deterministic mode
    template: "Generate a realistic customer query about {{task}}"

mcp_tools:
  - name: order_lookup
    mock: true
    description: "Retrieve order information from database"
    responses:
      - condition: {order_id: {$gt: 0}}
        result: {status: "delivered", order_id: "{order_id}"}

requirements:
  - id: tool_citation
    type: protocol
    description: "Must cite tool name if used"

  - id: helpful_response
    type: domain
    description: "Response must be helpful and address the user's request"

execution:
  model: gpt-4o-mini
  temperature: 0
  max_tokens: 1024

Full Schema:

templates:
  system: string  # Prompt with {{variables}}
  user: string

variables:
  <name>:
    type: enum | generated
    values: [...]        # for enum
    generator: llm|stub  # for generated
    template: string     # generation prompt

mcp_tools:
  - name: string
    mock: boolean
    description: string
    responses: {...}     # mock responses

requirements:
  - id: string
    type: protocol | domain  # protocol=rules, domain=LLM judge
    description: string

execution:
  model: string
  temperature: float
  max_tokens: integer
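To make the role of `templates` and `variables` concrete, here is a minimal sketch of how one test case could be drawn from the enum variables and substituted into a `{{name}}` template. The helper names `render` and `sample_enums` are hypothetical and are not taken from the project's generator module; `generated` variables are skipped because they need an LLM or stub call.

```python
import random
import re

# Matches {{name}} placeholders as used in the templates section.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, values: dict) -> str:
    """Replace each {{name}} with values[name]; an unknown name raises KeyError."""
    return PLACEHOLDER.sub(lambda m: str(values[m.group(1)]), template)


def sample_enums(variables: dict, rng: random.Random) -> dict:
    """Pick one value per enum variable using the supplied (seeded) RNG."""
    return {
        name: rng.choice(spec["values"])
        for name, spec in variables.items()
        if spec["type"] == "enum"
    }


variables = {
    "role": {"type": "enum",
             "values": ["customer_support", "order_specialist", "returns_specialist"]},
    "task": {"type": "enum",
             "values": ["order_status", "refund_inquiry", "order_cancellation"]},
    "context": {"type": "generated", "generator": "stub"},  # not sampled here
}

case = sample_enums(variables, random.Random(42))
prompt = render("You are a {{role}} assistant for ShopFast...", case)
```

Passing an explicit `random.Random(seed)` instead of using the module-level RNG keeps sampling reproducible for a given `--seed`, independent of other code that draws random numbers.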

Output Files Schema

Each run generates 5 files:

File	Description
`*_summary.yaml`	Quick overview with one example per cluster + clickable trace refs
`*_reproducers.yaml`	Minimal reproducers with minimality proofs
`*_detailed.json`	All test cases with trace references (no embedded traces)
`*_coverage.json`	Variable and property coverage stats
`*_traces.json`	Full execution traces (single source of truth)

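File Naming: run_<timestamp>_<hash>_<mode>_<seed>_<type>.<ext>, where <mode> is det or rand. The sample non-deterministic reports omit the <seed> segment (for example run_20251009_051812_12ef9201_rand_summary.yaml).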

Example: run_20251009_030224_12ef9201_det_42_summary.yaml
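For scripts that post-process a results directory, the naming scheme can be parsed with a regular expression. This is a sketch inferred from the sample filenames in this README, not code from the project; the function name `parse_run_filename` is hypothetical, and the seed is treated as optional because the sample `rand` filenames do not include one.

```python
import re
from typing import Optional

# Pattern inferred from the sample filenames; the seed segment is optional.
RUN_FILE = re.compile(
    r"^run_(?P<timestamp>\d{8}_\d{6})"
    r"_(?P<hash>[0-9a-f]+)"
    r"_(?P<mode>det|rand)"
    r"(?:_(?P<seed>\d+))?"
    r"_(?P<type>summary|reproducers|detailed|coverage|traces)"
    r"\.(?P<ext>yaml|json)$"
)


def parse_run_filename(name: str) -> Optional[dict]:
    """Return the filename components as a dict, or None if the name does not match."""
    match = RUN_FILE.match(name)
    return match.groupdict() if match else None


det = parse_run_filename("run_20251009_030224_12ef9201_det_42_summary.yaml")
rand = parse_run_filename("run_20251009_051812_12ef9201_rand_traces.json")
```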

Cross-Reference Navigation: Files use clickable references:

"trace_ref": "run_xxx_traces.json",      // Cmd+Click to open
"trace_location": "\"failure_id\": 0"    // Cmd+F to find exact case

Design Notes

Property Decomposition Strategy

Properties are split into two categories so that each one is checked by the cheapest method that can evaluate it:

Protocol Properties (Deterministic)

Structural/format requirements checkable with rules
Examples: tool citation, response length, message format
Implementation: Rule-based checkers in src/judge/judge.py
Benefit: Fast, reliable, no API calls needed

Domain Properties (LLM-as-Judge)

Semantic requirements requiring understanding
Examples: helpfulness, appropriateness, completeness
Implementation: LLM evaluation via adapter with fallback to heuristics
Benefit: Rich semantic checking when needed

This separation enables fast deterministic testing while preserving semantic evaluation capability.
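As an illustration of a protocol property, the following is a minimal rule-based check for `tool_citation`. It is a sketch under assumptions, not the contents of src/judge/judge.py: the `Violation` class and `check_tool_citation` function are hypothetical names, and "cited" is taken to mean that the tool name appears verbatim in the response. The `tool_calls` shape follows the traces example in this README.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Violation:
    property_id: str
    evidence: str


def check_tool_citation(response: str, tool_calls: List[dict]) -> Optional[Violation]:
    """Return a Violation for the first called tool whose name is absent from the response."""
    for call in tool_calls:
        name = call["tool"]
        if name not in response:
            return Violation(
                property_id="tool_citation",
                evidence=f'Tool "{name}" used but not mentioned in response',
            )
    return None  # every called tool is named, or no tool was called


calls = [{"tool": "order_lookup", "arguments": {"order_id": 1}}]
bad = check_tool_citation("The status is delivered.", calls)
good = check_tool_citation("According to order_lookup, the status is delivered.", calls)
```

A check like this needs no model call, which is why protocol properties stay fast and reproducible in deterministic mode.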

Shrink Lattice / Partial Orders

The shrinker defines a partial order over test cases, where case1 < case2 means that case1 is simpler than case2.

Ordering Rules:

Enum variables: Prefer earlier values in the list (canonical representatives)
Generated strings: Shorter is simpler (token/character count)

Shrinking Algorithm:

Strategy: Greedy descent through the lattice
Process: Apply simplification transformations one at a time
Termination: Stop when no neighbor is both simpler and still fails
Minimality Proof: Test all neighbors to confirm they pass

Example Lattice:

{role: returns_specialist, behavior: be_detailed, context: "Long query..."}
                              ↓
{role: customer_support, behavior: be_detailed, context: "Long query..."}  ← (canonical role)
                              ↓
{role: customer_support, behavior: be_concise, context: "Long query..."}   ← (canonical behavior)
                              ↓
{role: customer_support, behavior: be_concise, context: "Short"}           ← (truncated context)
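The greedy descent described above can be sketched as follows. This is an illustrative version under assumptions rather than the project's shrinker: the names `simpler_neighbors`, `shrink`, and `is_minimal` are hypothetical, string shrinking is done by halving (the README only says shorter is simpler), and `fails` stands in for running the agent and judge on a candidate case.

```python
from typing import Callable, Dict, Iterator, List

Case = Dict[str, str]


def simpler_neighbors(case: Case, enums: Dict[str, List[str]]) -> Iterator[Case]:
    """Yield cases that are one step simpler than `case`."""
    # Enum variables: any earlier value in the declared list is simpler.
    for name, values in enums.items():
        current_index = values.index(case[name])
        for earlier in values[:current_index]:
            yield {**case, name: earlier}
    # Non-enum (generated) strings: a shorter string is simpler.
    for name, value in case.items():
        if name not in enums and len(value) > 1:
            yield {**case, name: value[: len(value) // 2]}


def shrink(case: Case, enums: Dict[str, List[str]],
           fails: Callable[[Case], bool]) -> Case:
    """Greedy descent: move to the first simpler neighbor that still fails."""
    current = case
    while True:
        for candidate in simpler_neighbors(current, enums):
            if fails(candidate):
                current = candidate
                break
        else:
            return current  # no simpler neighbor fails, so stop here


def is_minimal(case: Case, enums: Dict[str, List[str]],
               fails: Callable[[Case], bool]) -> bool:
    """True if every simpler neighbor passes."""
    return not any(fails(n) for n in simpler_neighbors(case, enums))


enums = {"role": ["customer_support", "order_specialist", "returns_specialist"]}
start = {"role": "returns_specialist", "context": "Long query about an order"}
always_fails = shrink(start, enums, lambda c: True)
role_bound = shrink(start, enums, lambda c: c["role"] == "returns_specialist")
```

Each accepted step either lowers an enum index or shortens a string, so the loop terminates. Greedy descent finds a local minimum in the lattice, which may differ from the global one when several failing paths exist.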

Failure Clustering Approach

Failures are clustered by violation signature:

signature = hash(property_id + severity + description_pattern)

Algorithm:

Extract signature from each failure's violations
Group failures with identical signatures
Select minimal reproducer as cluster representative
Report cluster frequency and examples

Benefits:

De-duplicates similar failures
Identifies distinct failure modes
Prioritizes by frequency
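A minimal sketch of signature-based clustering is shown below. Several details are assumptions rather than facts about the project: SHA-256 is used as the hash, the digest is truncated to 8 hex characters to resemble the sample cluster IDs, the description pattern is produced by replacing digit runs with `N`, and the representative is the member with the shortest variable values. Field names follow the output examples in this README.

```python
import hashlib
import re


def signature(violation: dict) -> str:
    """Stable short hash of property, severity, and a normalized description."""
    pattern = re.sub(r"\d+", "N", violation["description"])  # drop instance-specific numbers
    key = "|".join([violation["property_id"], violation["severity"], pattern])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:8]


def _size(failure: dict) -> int:
    """Rough simplicity measure: total length of the shrunk variable values."""
    return sum(len(str(v)) for v in failure["shrunk_variables"].values())


def cluster(failures: list) -> list:
    """Group failures by violation signature, most frequent cluster first."""
    groups = {}
    for failure in failures:
        seen = set()
        for violation in failure["violations"]:
            sig = signature(violation)
            if sig in seen:
                continue  # count each failure once per cluster
            seen.add(sig)
            group = groups.setdefault(
                sig, {"cluster_id": sig, "property": violation["property_id"], "members": []}
            )
            group["members"].append(failure)
    clusters = [
        {
            "cluster_id": g["cluster_id"],
            "property": g["property"],
            "count": len(g["members"]),
            "representative": min(g["members"], key=_size),
        }
        for g in groups.values()
    ]
    clusters.sort(key=lambda c: c["count"], reverse=True)
    return clusters
```

A cryptographic hash from `hashlib` is used instead of Python's built-in `hash()` because string hashes from the built-in are randomized per process by default, which would make cluster IDs differ between runs.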

MCP Simplifications

We simplified MCP to a minimal testing interface:

Simplification Decisions:

Mock-only mode: No real MCP server integration (task requirement: "simple")
Minimal API: call_tool(name, args) -> result
Deterministic responses: Hash-based for reproducibility
No authentication: Testing doesn't need real credentials

Mock Implementation (src/mcp/mock.py):

Deterministic response selection based on arguments + seed
Configurable response templates in YAML
Argument validation and error simulation

Off-the-shelf servers used: None (custom mock implementation)
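The hash-based response selection can be sketched as below. This is an assumption-laden illustration, not the contents of src/mcp/mock.py: the `MockTools` class and its constructor arguments are hypothetical, and only the `call_tool(name, args) -> result` shape comes from this README. Condition matching on arguments (as in the `responses` YAML) is left out for brevity.

```python
import hashlib
import json


class MockTools:
    """Mock tool adapter that picks a canned response from (name, args, seed)."""

    def __init__(self, responses: dict, seed: int = 0):
        self._responses = responses  # tool name -> list of canned results
        self._seed = seed

    def call_tool(self, name: str, args: dict) -> dict:
        if name not in self._responses:
            raise KeyError(f"unknown mock tool: {name!r}")
        candidates = self._responses[name]
        if not candidates:
            raise ValueError(f"no mock responses configured for {name!r}")
        # Canonical JSON so that argument order does not change the hash.
        key = json.dumps({"tool": name, "args": args, "seed": self._seed}, sort_keys=True)
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(candidates)
        return candidates[index]


mock = MockTools(
    {"order_lookup": [{"status": "delivered"}, {"status": "pending"}]},
    seed=42,
)
first = mock.call_tool("order_lookup", {"order_id": 1})
second = mock.call_tool("order_lookup", {"order_id": 1})
```

Because the index depends only on the tool name, arguments, and seed, the same call returns the same result across processes and across parallel workers.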

Sample Configuration & Reports

Sample Config: config/spec.yaml - E-commerce customer support agent configuration

Sample Reports: See sample_reports/canonical_run/ for reference output from 10 test cases (seed 42)

The canonical run demonstrates the complete output structure. Each run generates 5 files:

*_summary.yaml - Quick failure overview with clickable trace references
*_reproducers.yaml - Minimal reproducers with minimality proofs
*_detailed.json - All test cases with trace references (no embedded traces)
*_coverage.json - Variable and property coverage statistics
*_traces.json - Full execution traces (single source of truth)

Navigation: Start with *_summary.yaml, open the file named in trace_ref, then search (Cmd+F) for the trace_search / trace_location string to jump to the exact failure.

Module Structure

Clear module boundaries as per task requirements:

src/
├── generator/    - Test case generation (enum + LLM)
├── runner/       - Agent execution with MCP tools
├── judge/        - Property evaluation (protocol + domain)
├── shrinker/     - Minimal reproducer finding
├── reporter/     - Failure clustering and reports
├── mcp/          - Mock MCP tool adapter
├── llm/          - LLM adapter (OpenAI + stub)
└── utils/        - Logging and utilities

Testing & Verification

3 Required Tests:

Test 1: Property Violation Found & Shrunk

python3 quickcheck.py run --deterministic --seed 42 --cases 10

Evidence: See sample_reports/canonical_run/run_20251009_030224_12ef9201_det_42_reproducers.yaml:2

minimal_reproducers:
  - property: tool_citation
    violation: Must cite tool name if used
    evidence: 'Tool "order_lookup" used but not mentioned in response'
    variables:
      role: customer_support
      behavior: be_concise
      task: order_status
    minimality_proof:
      is_minimal: true
      reason: Changing any of 2 tested variables makes test pass
      passing_neighbors:
        - changed_variable: role
          from_value: customer_support
          to_value: returns_specialist
        - changed_variable: task
          from_value: order_status
          to_value: order_cancellation

✅ Verified: Shrinking reduced failure to minimal reproducer with proof

Test 2: Reproducibility via --seed

# Compare two runs with same seed
python3 scripts/compare_runs.py \
  sample_reports/determinism_proof/run1/run_20251009_030613_12ef9201_det_42_detailed.json \
  sample_reports/determinism_proof/run2/run_20251009_030614_12ef9201_det_42_detailed.json

Output:

================================================================================
DETERMINISTIC RUN COMPARISON
================================================================================

1. METADATA:
   Run 1 seed: 42
   Run 2 seed: 42

2. SUMMARY STATISTICS:
   Total cases: Run1=10, Run2=10
   Passes:      Run1=2, Run2=2
   Failures:    Run1=8, Run2=8

✓ PASS: Runs are identical (deterministic)

✅ Verified: Same seed produces identical results

Run 1: sample_reports/determinism_proof/run1/run_20251009_030613_12ef9201_det_42_detailed.json
Run 2: sample_reports/determinism_proof/run2/run_20251009_030614_12ef9201_det_42_detailed.json

Test 3: Protocol Property Violation Detection

python3 quickcheck.py run --config config/spec.yaml --cases 3 --parallel

Terminal Output:

Running 3 test cases from config/spec.yaml
Parallel execution enabled with 8 workers
Running tests... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
Found 3 failures

                                Failure Summary
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Property      ┃ Failures ┃ Example                                           ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ tool_citation │        3 │ Tool "order_lookup" used but not mentioned in     │
│               │          │ resp...                                           │
└───────────────┴──────────┴───────────────────────────────────────────────────┘

Evidence: See sample_reports/test3_llm_run/run_20251009_051812_12ef9201_rand_summary.yaml:1

failure_clusters:
  - property: tool_citation
    description: Must cite tool name if used
    count: 3
    minimal_example: 'Tool "order_lookup" used but not mentioned in response:
                      "The order status for order #452 is currently pending payment...."'

How It Was Detected: See sample_reports/test3_llm_run/run_20251009_051812_12ef9201_rand_traces.json:70

{
  "violations": [
    {
      "property_id": "tool_citation",
      "description": "Must cite tool name if used",
      "evidence": "Tool \"order_lookup\" used but not mentioned in response: \"The order status for order #452 is currently pending payment....\"",
      "severity": "high"
    }
  ]
}

The judge detected that:

Agent called order_lookup tool (timestamp: 1759967244.068966)
Agent response: "The order status for order #452 is currently pending payment."
Violation: Response does not mention "order_lookup" tool name ❌

Coverage: See sample_reports/test3_llm_run/run_20251009_051812_12ef9201_rand_coverage.json:23

{
  "properties": {
    "tool_citation": {
      "tested": true,
      "violated": true,
      "violation_count": 3
    },
    "helpful_response": {
      "tested": true,
      "violated": false,
      "violation_count": 0
    }
  }
}

✅ Verified: Protocol property (tool_citation) detected 3 violations via rule-based checking

Performance: Exceeds requirements (5s vs 120s target for 200 cases)

Output File Examples

Summary YAML (Quick Overview)

metadata:
  run_id: run_20251009_030224_12ef9201_det_42
  total_cases: 10
  failures: 8
  deterministic: true
  seed: 42

failure_clusters:
  - cluster_id: 5eaeb96e
    property: tool_citation
    count: 8
    failures:
      - case_id: 0
        shrunk_variables:
          role: customer_support
          task: order_status
        trace_ref: run_20251009_030224_12ef9201_det_42_traces.json
        trace_search: '"failure_id": 0'
        minimality_proof:
          is_minimal: true
          passing_neighbors_count: 2

Reproducers YAML (Minimal Cases)

minimal_reproducers:
  - cluster_id: 5eaeb96e
    property: tool_citation
    variables:
      role: customer_support
      behavior: be_concise
      task: order_status
    minimality_proof:
      is_minimal: true
      reason: Changing any of 2 tested variables makes test pass
      passing_neighbors:
        - changed_variable: role
          from_value: customer_support
          to_value: returns_specialist

Traces JSON (Full Execution Details)

[
  {
    "failure_id": 0,
    "trace": {
      "messages": [
        {"role": "system", "content": "You are a customer_support..."},
        {"role": "user", "content": "Help with order_status..."},
        {"role": "assistant", "content": "The status is delivered..."}
      ],
      "tool_calls": [
        {"tool": "order_lookup", "arguments": {"order_id": 1}}
      ],
      "timing": {"total": 0.0008, "llm": 0.0001, "tools": 0.1}
    },
    "violations": [
      {
        "property_id": "tool_citation",
        "evidence": "Tool 'order_lookup' used but not mentioned"
      }
    ]
  }
]
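The same lookup that the trace references support by hand can be done in a script. The sketch below assumes the traces file is a JSON array of objects with a `failure_id` key, as in the example above; `load_traces` and `find_trace` are hypothetical helper names, not part of the project.

```python
import json
from typing import List, Optional


def load_traces(path: str) -> List[dict]:
    """Read a *_traces.json file into a list of trace entries."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def find_trace(traces: List[dict], failure_id: int) -> Optional[dict]:
    """Return the entry with the given failure_id, or None if it is absent."""
    for entry in traces:
        if entry.get("failure_id") == failure_id:
            return entry
    return None


def tools_used(entry: dict) -> List[str]:
    """List the tool names called in one trace entry."""
    return [call["tool"] for call in entry["trace"].get("tool_calls", [])]
```

For example, `find_trace(load_traces(path), 0)` returns the entry that the search string `"failure_id": 0` points at.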

Detailed JSON (All Test Cases)

{
  "metadata": {
    "run_id": "run_20251009_030224_12ef9201_det_42",
    "total_cases": 10,
    "passes": 2,
    "failures": 8,
    "deterministic": true,
    "seed": 42
  },
  "summary": {
    "pass_rate": 0.2,
    "failure_rate": 0.8,
    "unique_failure_modes": 1,
    "properties_tested": 2,
    "properties_violated": 1
  },
  "failure_clusters": [
    {
      "cluster_id": "5eaeb96e",
      "property": "tool_citation",
      "count": 8,
      "failures": [
        {
          "case_id": 0,
          "trace_ref": "run_20251009_030224_12ef9201_det_42_traces.json",
          "trace_location": "\"failure_id\": 0",
          "violations": [...],
          "shrunk_variables": {...}
        }
      ]
    }
  ]
}

Coverage JSON (Variable & Property Stats)

{
  "variables": {
    "role": {
      "total_values": 3,
      "covered_values": 3,
      "coverage_percent": 100.0,
      "uncovered_values": []
    },
    "task": {
      "total_values": 4,
      "covered_values": 4,
      "coverage_percent": 100.0
    }
  },
  "properties": {
    "tool_citation": {
      "tested": true,
      "violated": true,
      "violation_count": 8
    },
    "helpful_response": {
      "tested": true,
      "violated": false,
      "violation_count": 0
    }
  },
  "total_combinations_tested": 10
}
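The per-variable numbers in the coverage file can be derived from the generated cases alone. The following is a minimal sketch assuming each test case is a dict of variable values; `variable_coverage` is a hypothetical name, and the output keys mirror the example above.

```python
from typing import Dict, List


def variable_coverage(enums: Dict[str, List[str]], cases: List[dict]) -> dict:
    """Compute which declared enum values were exercised by at least one case."""
    report = {}
    for name, values in enums.items():
        seen = {case[name] for case in cases if name in case}
        uncovered = [v for v in values if v not in seen]
        covered = len(values) - len(uncovered)
        percent = round(100.0 * covered / len(values), 1) if values else 0.0
        report[name] = {
            "total_values": len(values),
            "covered_values": covered,
            "coverage_percent": percent,
            "uncovered_values": uncovered,
        }
    return report


enums = {"role": ["customer_support", "order_specialist", "returns_specialist"]}
cases = [
    {"role": "customer_support"},
    {"role": "customer_support"},
    {"role": "returns_specialist"},
]
report = variable_coverage(enums, cases)
```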
