nishant1952/quickcheck

QuickCheck for LLM Agents

A property-based testing framework for LLM agents, inspired by Haskell's QuickCheck. Automatically generates test cases, finds property violations, and shrinks failures to minimal reproducers.

Quick Start

Installation

Prerequisites: Python 3.9+

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Set OpenAI API key (only needed for non-deterministic mode)
export OPENAI_API_KEY='your-key-here'

CLI Flags

  • --config: Path to YAML configuration file
  • --deterministic: Use stubbed LLM responses for reproducibility (no API key needed)
  • --seed: Random seed for deterministic runs (ensures identical results)
  • --cases: Number of test cases to generate
  • --parallel: Run tests in parallel for faster execution
  • --output: Output directory for results (default: results/)

Running Tests

# Deterministic mode (no API key needed, reproducible)
python3 quickcheck.py run --config config/spec.yaml --deterministic --seed 42 --cases 10

# Non-deterministic mode (uses real LLM)
python3 quickcheck.py run --config config/spec.yaml --cases 20

# Parallel execution (faster)
python3 quickcheck.py run --config config/spec.yaml --deterministic --parallel --cases 200

How to Run

Deterministic Mode

Deterministic mode uses stubbed LLM responses for reproducible testing:

python3 quickcheck.py run \
  --config config/spec.yaml \
  --deterministic \
  --seed 42 \
  --cases 100 \
  --parallel \
  --output results/my_test

Performance: Runs 200 test cases in ~5 seconds on a laptop.

Non-Deterministic Mode

Non-deterministic mode uses real LLM calls via OpenAI API:

export OPENAI_API_KEY='sk-...'

python3 quickcheck.py run \
  --config config/spec.yaml \
  --cases 50 \
  --parallel \
  --output results/llm_test

YAML Schemas

Input Configuration Schema

The input YAML configures test generation and execution:

templates:
  system: "You are a {{role}} assistant for ShopFast..."
  user: "Help me with {{task}}: {{context}}"

variables:
  role:
    type: enum
    values: [customer_support, order_specialist, returns_specialist]

  task:
    type: enum
    values: [order_status, refund_inquiry, order_cancellation]

  context:
    type: generated
    generator: llm  # or 'stub' for deterministic mode
    template: "Generate a realistic customer query about {{task}}"

mcp_tools:
  - name: order_lookup
    mock: true
    description: "Retrieve order information from database"
    responses:
      - condition: {order_id: {$gt: 0}}
        result: {status: "delivered", order_id: "{order_id}"}

requirements:
  - id: tool_citation
    type: protocol
    description: "Must cite tool name if used"

  - id: helpful_response
    type: domain
    description: "Response must be helpful and address the user's request"

execution:
  model: gpt-4o-mini
  temperature: 0
  max_tokens: 1024

Full Schema:

templates:
  system: string  # Prompt with {{variables}}
  user: string

variables:
  <name>:
    type: enum | generated
    values: [...]        # for enum
    generator: llm|stub  # for generated
    template: string     # generation prompt

mcp_tools:
  - name: string
    mock: boolean
    description: string
    responses: {...}     # mock responses

requirements:
  - id: string
    type: protocol | domain  # protocol=rules, domain=LLM judge
    description: string

execution:
  model: string
  temperature: float
  max_tokens: integer
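To make the schema concrete, here is a minimal sketch of how `{{variable}}` placeholders could be filled from enum variables. It is not the project's generator: the `SPEC` dictionary is a trimmed, hand-written copy of the config above, and `render` and `sample_cases` are hypothetical helper names.

```python
import random
import re

# Hypothetical, trimmed-down copy of the sample config (enum variables only).
SPEC = {
    "templates": {
        "system": "You are a {{role}} assistant for ShopFast...",
        "user": "Help me with {{task}}",
    },
    "variables": {
        "role": {"type": "enum",
                 "values": ["customer_support", "order_specialist", "returns_specialist"]},
        "task": {"type": "enum",
                 "values": ["order_status", "refund_inquiry", "order_cancellation"]},
    },
}

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render(template, variables):
    """Replace each {{name}} with its value; unknown names are left untouched."""
    return PLACEHOLDER.sub(
        lambda m: str(variables.get(m.group(1), m.group(0))), template
    )


def sample_cases(spec, n, seed):
    """Draw n variable assignments from the enum variables with a seeded RNG."""
    rng = random.Random(seed)
    names = sorted(spec["variables"])  # fixed order keeps draws reproducible
    return [
        {name: rng.choice(spec["variables"][name]["values"]) for name in names}
        for _ in range(n)
    ]


for case in sample_cases(SPEC, n=3, seed=42):
    print(render(SPEC["templates"]["system"], case))
    print(render(SPEC["templates"]["user"], case))
```

Because the random generator is seeded and the variable names are iterated in sorted order, two calls with the same seed return the same list of cases.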

Output Files Schema

Each run generates 5 files:

File                  Description
*_summary.yaml        Quick overview with one example per cluster + clickable trace refs
*_reproducers.yaml    Minimal reproducers with minimality proofs
*_detailed.json       All failures with trace references (no embedded traces)
*_coverage.json       Variable and property coverage stats
*_traces.json         Full execution traces (single source of truth)

File Naming: run_<timestamp>_<hash>_<mode>_<seed>_<type>.<ext>

Example: run_20251009_030224_12ef9201_det_42_summary.yaml
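The naming pattern can be split back into its parts with a regular expression. The sketch below is an illustration, not code from the repository: `parse_run_filename` is a hypothetical helper, and it assumes the mode is either `det` or `rand` and that the seed segment is absent for non-deterministic runs, as the sample file names in this README suggest.

```python
import re

# Assumed layout: run_<timestamp>_<hash>_<mode>[_<seed>]_<type>.<ext>
RUN_FILE = re.compile(
    r"^run_(?P<timestamp>\d{8}_\d{6})_(?P<hash>[0-9a-f]+)_(?P<mode>det|rand)"
    r"(?:_(?P<seed>\d+))?_(?P<type>[a-z]+)\.(?P<ext>yaml|json)$"
)


def parse_run_filename(name):
    """Return the components of a report file name as a dict."""
    match = RUN_FILE.match(name)
    if match is None:
        raise ValueError(f"not a recognized run file name: {name!r}")
    parts = match.groupdict()
    if parts["seed"] is not None:
        parts["seed"] = int(parts["seed"])
    return parts


print(parse_run_filename("run_20251009_030224_12ef9201_det_42_summary.yaml"))
```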

Cross-Reference Navigation: Files use clickable references:

"trace_ref": "run_xxx_traces.json",      // Cmd+Click to open
"trace_location": "\"failure_id\": 0"    // Cmd+F to find exact case

Design Notes

Property Decomposition Strategy

Properties are classified into two categories for optimal checking:

Protocol Properties (Deterministic)

  • Structural/format requirements checkable with rules
  • Examples: tool citation, response length, message format
  • Implementation: Rule-based checkers in src/judge/judge.py
  • Benefit: Fast, reliable, no API calls needed

Domain Properties (LLM-as-Judge)

  • Semantic requirements requiring understanding
  • Examples: helpfulness, appropriateness, completeness
  • Implementation: LLM evaluation via adapter with fallback to heuristics
  • Benefit: Rich semantic checking when needed

This separation enables fast deterministic testing while preserving semantic evaluation capability.
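A protocol property such as tool_citation can be checked with a few lines of rule-based code. The following is a minimal sketch rather than the contents of src/judge/judge.py: the `Violation` dataclass, the `check_tool_citation` function, and the case-insensitive substring match are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Violation:
    property_id: str
    description: str
    evidence: str
    severity: str = "high"


def check_tool_citation(response: str, tools_used: List[str]) -> List[Violation]:
    """Flag every tool that was called but is not named in the response."""
    lowered = response.lower()
    violations = []
    for tool in tools_used:
        if tool.lower() not in lowered:  # assumption: case-insensitive match
            violations.append(
                Violation(
                    property_id="tool_citation",
                    description="Must cite tool name if used",
                    evidence=f'Tool "{tool}" used but not mentioned in response',
                )
            )
    return violations


bad = check_tool_citation("The order is pending payment.", ["order_lookup"])
good = check_tool_citation("According to order_lookup, it shipped.", ["order_lookup"])
print(len(bad), len(good))
```

A domain property would replace the substring test with a call to an LLM judge, which is why the two kinds are kept separate.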

Shrink Lattice / Partial Orders

The shrinker defines a partial order over test cases where case1 < case2 if case1 is "simpler":

Ordering Rules:

  1. Enum variables: Prefer earlier values in the list (canonical representatives)
  2. Generated strings: Shorter is simpler (token/character count)

Shrinking Algorithm:

  • Strategy: Greedy descent through the lattice
  • Process: Apply simplification transformations one at a time
  • Termination: Stop when no neighbor is both simpler and still fails
  • Minimality Proof: Test all neighbors to confirm they pass

Example Lattice:

{role: returns_specialist, behavior: be_detailed, context: "Long query..."}
                              ↓
{role: customer_support, behavior: be_detailed, context: "Long query..."}  ← (canonical role)
                              ↓
{role: customer_support, behavior: be_concise, context: "Long query..."}   ← (canonical behavior)
                              ↓
{role: customer_support, behavior: be_concise, context: "Short"}           ← (truncated context)
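The greedy descent described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the project's shrinker: `simpler_neighbors` and `shrink` are hypothetical names, strings are simplified by halving their length, and `fails` stands in for running the agent and judging the result.

```python
def simpler_neighbors(case, enums):
    """Yield cases one simplification step below `case`."""
    for name, value in case.items():
        if name in enums:
            # Enum rule: any earlier value in the list counts as simpler.
            for earlier in enums[name][: enums[name].index(value)]:
                yield {**case, name: earlier}
        elif isinstance(value, str) and len(value) > 1:
            # String rule (assumed strategy): shorter is simpler, so halve it.
            yield {**case, name: value[: len(value) // 2]}


def shrink(case, fails, enums):
    """Greedy descent: move to the first simpler neighbor that still fails."""
    current = case
    while True:
        for candidate in simpler_neighbors(current, enums):
            if fails(candidate):
                current = candidate
                break
        else:
            return current  # no simpler neighbor fails, so stop here


ENUMS = {
    "role": ["customer_support", "order_specialist", "returns_specialist"],
    "behavior": ["be_concise", "be_detailed"],
}
start = {
    "role": "returns_specialist",
    "behavior": "be_detailed",
    "context": "Long query about an order",
}
# Toy failure predicate: the bug needs at least four characters of context.
result = shrink(start, lambda c: len(c["context"]) >= 4, ENUMS)
print(result)
```

Each step either moves an enum to an earlier list position or shortens a string, so the loop always terminates.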

Failure Clustering Approach

Failures are clustered by violation signature:

signature = hash(property_id + severity + description_pattern)

Algorithm:

  1. Extract signature from each failure's violations
  2. Group failures with identical signatures
  3. Select minimal reproducer as cluster representative
  4. Report cluster frequency and examples

Benefits:

  • De-duplicates similar failures
  • Identifies distinct failure modes
  • Prioritizes by frequency
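The four steps above can be sketched in a few functions. Everything here is an assumption made for illustration: the choice of SHA-256, the 8-character truncation (picked to resemble the cluster IDs in the sample reports), the way `description_pattern` masks quoted strings and numbers, and the size measure used by `representative`.

```python
import hashlib
import re
from collections import defaultdict


def description_pattern(text):
    """Mask quoted strings and numbers so similar messages share one pattern."""
    text = re.sub(r'"[^"]*"', '"<str>"', text)
    return re.sub(r"\d+", "<num>", text)


def signature(violation):
    """Hash property id, severity, and the normalized description."""
    raw = "|".join([
        violation["property_id"],
        violation["severity"],
        description_pattern(violation["description"]),
    ])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:8]


def cluster(failures):
    """Group failures by signature, most frequent cluster first."""
    groups = defaultdict(list)
    for failure in failures:
        for sig in {signature(v) for v in failure["violations"]}:
            groups[sig].append(failure)
    return sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)


def representative(members):
    """Pick the member whose variable values are shortest overall."""
    return min(members, key=lambda f: sum(len(str(v)) for v in f["variables"].values()))


citation = {"property_id": "tool_citation", "severity": "high",
            "description": "Must cite tool name if used"}
unhelpful = {"property_id": "helpful_response", "severity": "medium",
             "description": "Response must be helpful"}
failures = [
    {"case_id": 0, "variables": {"role": "returns_specialist"}, "violations": [citation]},
    {"case_id": 1, "variables": {"role": "customer_support"}, "violations": [citation]},
    {"case_id": 2, "variables": {"role": "customer_support"}, "violations": [unhelpful]},
]
clusters = cluster(failures)
for sig, members in clusters:
    print(sig, len(members), representative(members)["case_id"])
```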

MCP Simplifications

We simplified MCP to a minimal testing interface:

Simplification Decisions:

  • Mock-only mode: No real MCP server integration (task requirement: "simple")
  • Minimal API: call_tool(name, args) -> result
  • Deterministic responses: Hash-based for reproducibility
  • No authentication: Testing doesn't need real credentials

Mock Implementation (src/mcp/mock.py):

  • Deterministic response selection based on arguments + seed
  • Configurable response templates in YAML
  • Argument validation and error simulation

Off-the-shelf servers used: None (custom mock implementation)
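A hash-based mock in the spirit of the bullets above might look like the sketch below. It is not the code in src/mcp/mock.py: the `MockToolAdapter` class, its constructor arguments, and the use of SHA-256 over a JSON encoding of the call are assumptions; only the `call_tool(name, args) -> result` shape comes from this README.

```python
import hashlib
import json


class MockToolAdapter:
    """Mock tool layer: the same tool, args, and seed always give the same result."""

    def __init__(self, responses, seed=0):
        self.responses = responses  # {tool_name: [result_template, ...]}
        self.seed = seed

    def call_tool(self, name, args):
        if name not in self.responses:
            raise KeyError(f"unknown tool: {name}")
        # Hash a canonical encoding of the call so selection is reproducible.
        key = json.dumps({"tool": name, "args": args, "seed": self.seed}, sort_keys=True)
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        options = self.responses[name]
        template = options[int.from_bytes(digest[:4], "big") % len(options)]
        # Fill "{arg}" placeholders in string fields from the call arguments.
        return {
            field: value.format(**args) if isinstance(value, str) else value
            for field, value in template.items()
        }


RESPONSES = {
    "order_lookup": [
        {"status": "delivered", "order_id": "{order_id}"},
        {"status": "pending payment", "order_id": "{order_id}"},
    ]
}
adapter = MockToolAdapter(RESPONSES, seed=42)
first = adapter.call_tool("order_lookup", {"order_id": 452})
second = adapter.call_tool("order_lookup", {"order_id": 452})
print(first, first == second)
```

Python's built-in `hash()` is avoided here because string hashing is randomized per process, which would break reproducibility across runs.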


Sample Configuration & Reports

Sample Config: config/spec.yaml - E-commerce customer support agent configuration

Sample Reports: See sample_reports/canonical_run/ for reference output from 10 test cases (seed 42)

The canonical run demonstrates the complete output structure. Each run generates 5 files:

  • *_summary.yaml - Quick failure overview with clickable trace references
  • *_reproducers.yaml - Minimal reproducers with minimality proofs
  • *_detailed.json - All test cases with trace references (no embedded traces)
  • *_coverage.json - Variable and property coverage statistics
  • *_traces.json - Full execution traces (single source of truth)

Navigation: Start with *_summary.yaml, click a trace_ref filename, then Cmd+F the trace_location pattern to jump to the exact failure.

Module Structure

Clear module boundaries as per task requirements:

src/
├── generator/    - Test case generation (enum + LLM)
├── runner/       - Agent execution with MCP tools
├── judge/        - Property evaluation (protocol + domain)
├── shrinker/     - Minimal reproducer finding
├── reporter/     - Failure clustering and reports
├── mcp/          - Mock MCP tool adapter
├── llm/          - LLM adapter (OpenAI + stub)
└── utils/        - Logging and utilities

Testing & Verification

3 Required Tests:

Test 1: Property Violation Found & Shrunk

python3 quickcheck.py run --deterministic --seed 42 --cases 10

Evidence: See sample_reports/canonical_run/run_20251009_030224_12ef9201_det_42_reproducers.yaml:2

minimal_reproducers:
  - property: tool_citation
    violation: Must cite tool name if used
    evidence: 'Tool "order_lookup" used but not mentioned in response'
    variables:
      role: customer_support
      behavior: be_concise
      task: order_status
    minimality_proof:
      is_minimal: true
      reason: Changing any of 2 tested variables makes test pass
      passing_neighbors:
        - changed_variable: role
          from_value: customer_support
          to_value: returns_specialist
        - changed_variable: task
          from_value: order_status
          to_value: order_cancellation

Verified: Shrinking reduced failure to minimal reproducer with proof

Test 2: Reproducibility via --seed

# Compare two runs with same seed
python3 scripts/compare_runs.py \
  sample_reports/determinism_proof/run1/run_20251009_030613_12ef9201_det_42_detailed.json \
  sample_reports/determinism_proof/run2/run_20251009_030614_12ef9201_det_42_detailed.json

Output:

================================================================================
DETERMINISTIC RUN COMPARISON
================================================================================

1. METADATA:
   Run 1 seed: 42
   Run 2 seed: 42

2. SUMMARY STATISTICS:
   Total cases: Run1=10, Run2=10
   Passes:      Run1=2, Run2=2
   Failures:    Run1=8, Run2=8

✓ PASS: Runs are identical (deterministic)

Verified: Same seed produces identical results

  • Run 1: sample_reports/determinism_proof/run1/run_20251009_030613_12ef9201_det_42_detailed.json
  • Run 2: sample_reports/determinism_proof/run2/run_20251009_030614_12ef9201_det_42_detailed.json
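The idea behind such a comparison can be sketched as follows. This is not the contents of scripts/compare_runs.py: `comparable_view` and `runs_identical` are hypothetical names, and the reports are small hand-built dictionaries that follow the detailed JSON layout documented in this README. Fields that legitimately differ between runs, such as the timestamped run_id and trace_ref, are left out of the comparison.

```python
def comparable_view(report):
    """Keep only the fields that should match across runs with the same seed."""
    meta = report["metadata"]
    clusters = sorted(
        (c["property"], c["count"], tuple(sorted(f["case_id"] for f in c["failures"])))
        for c in report["failure_clusters"]
    )
    return {
        "seed": meta["seed"],
        "total_cases": meta["total_cases"],
        "passes": meta["passes"],
        "failures": meta["failures"],
        "clusters": clusters,
    }


def runs_identical(first, second):
    return comparable_view(first) == comparable_view(second)


def make_report(run_id):
    """Build an illustrative report; only run_id and trace_ref vary."""
    return {
        "metadata": {"run_id": run_id, "total_cases": 10, "passes": 2,
                     "failures": 8, "deterministic": True, "seed": 42},
        "failure_clusters": [
            {"property": "tool_citation", "count": 8,
             "failures": [{"case_id": 0, "trace_ref": run_id + "_traces.json"}]},
        ],
    }


run1 = make_report("run_20251009_030613_12ef9201_det_42")
run2 = make_report("run_20251009_030614_12ef9201_det_42")
print("identical" if runs_identical(run1, run2) else "different")
```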

Test 3: Protocol Property Violation Detection

python3 quickcheck.py run --config config/spec.yaml --cases 3 --parallel

Terminal Output:

Running 3 test cases from config/spec.yaml
Parallel execution enabled with 8 workers
Running tests... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
Found 3 failures

                                Failure Summary
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Property      ┃ Failures ┃ Example                                           ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ tool_citation │        3 │ Tool "order_lookup" used but not mentioned in     │
│               │          │ resp...                                           │
└───────────────┴──────────┴───────────────────────────────────────────────────┘

Evidence: See sample_reports/test3_llm_run/run_20251009_051812_12ef9201_rand_summary.yaml:1

failure_clusters:
  - property: tool_citation
    description: Must cite tool name if used
    count: 3
    minimal_example: 'Tool "order_lookup" used but not mentioned in response:
                      "The order status for order #452 is currently pending payment...."'

How It Was Detected: See sample_reports/test3_llm_run/run_20251009_051812_12ef9201_rand_traces.json:70

{
  "violations": [
    {
      "property_id": "tool_citation",
      "description": "Must cite tool name if used",
      "evidence": "Tool \"order_lookup\" used but not mentioned in response: \"The order status for order #452 is currently pending payment....\"",
      "severity": "high"
    }
  ]
}

The judge detected that:

  1. Agent called order_lookup tool (timestamp: 1759967244.068966)
  2. Agent response: "The order status for order #452 is currently pending payment."
  3. Violation: Response does not mention "order_lookup" tool name ❌

Coverage: See sample_reports/test3_llm_run/run_20251009_051812_12ef9201_rand_coverage.json:23

{
  "properties": {
    "tool_citation": {
      "tested": true,
      "violated": true,
      "violation_count": 3
    },
    "helpful_response": {
      "tested": true,
      "violated": false,
      "violation_count": 0
    }
  }
}

Verified: Protocol property (tool_citation) detected 3 violations via rule-based checking

Performance: Exceeds requirements (about 5 s in deterministic mode versus the 120 s target for 200 cases)

Output File Examples

Summary YAML (Quick Overview)

metadata:
  run_id: run_20251009_030224_12ef9201_det_42
  total_cases: 10
  failures: 8
  deterministic: true
  seed: 42

failure_clusters:
  - cluster_id: 5eaeb96e
    property: tool_citation
    count: 8
    failures:
      - case_id: 0
        shrunk_variables:
          role: customer_support
          task: order_status
        trace_ref: run_20251009_030224_12ef9201_det_42_traces.json
        trace_search: '"failure_id": 0'
        minimality_proof:
          is_minimal: true
          passing_neighbors_count: 2

Reproducers YAML (Minimal Cases)

minimal_reproducers:
  - cluster_id: 5eaeb96e
    property: tool_citation
    variables:
      role: customer_support
      behavior: be_concise
      task: order_status
    minimality_proof:
      is_minimal: true
      reason: Changing any of 2 tested variables makes test pass
      passing_neighbors:
        - changed_variable: role
          from_value: customer_support
          to_value: returns_specialist

Traces JSON (Full Execution Details)

[
  {
    "failure_id": 0,
    "trace": {
      "messages": [
        {"role": "system", "content": "You are a customer_support..."},
        {"role": "user", "content": "Help with order_status..."},
        {"role": "assistant", "content": "The status is delivered..."}
      ],
      "tool_calls": [
        {"tool": "order_lookup", "arguments": {"order_id": 1}}
      ],
      "timing": {"total": 0.0008, "llm": 0.0001, "tools": 0.1}
    },
    "violations": [
      {
        "property_id": "tool_citation",
        "evidence": "Tool 'order_lookup' used but not mentioned"
      }
    ]
  }
]
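Given this layout, the manual Cmd+F step can also be done programmatically. The sketch below assumes the traces file is a JSON list of entries keyed by failure_id, as in the example above; `failure_id_from_location` and `resolve_trace` are hypothetical helpers, and the inline JSON stands in for a real traces file.

```python
import json
import re


def failure_id_from_location(trace_location):
    """Extract the integer id from a pattern such as '"failure_id": 0'."""
    match = re.search(r'"failure_id":\s*(\d+)', trace_location)
    if match is None:
        raise ValueError(f"unrecognized trace_location: {trace_location!r}")
    return int(match.group(1))


def resolve_trace(failure, traces):
    """Return the trace entry a failure record points at, or None."""
    wanted = failure_id_from_location(failure["trace_location"])
    for entry in traces:
        if entry.get("failure_id") == wanted:
            return entry
    return None


# In practice `traces` would come from json.load() on the file named by trace_ref.
traces = json.loads(
    '[{"failure_id": 0, "violations": [{"property_id": "tool_citation"}]}]'
)
failure = {"trace_ref": "run_xxx_traces.json", "trace_location": '"failure_id": 0'}
entry = resolve_trace(failure, traces)
print(entry["violations"][0]["property_id"])
```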

Detailed JSON (All Test Cases)

{
  "metadata": {
    "run_id": "run_20251009_030224_12ef9201_det_42",
    "total_cases": 10,
    "passes": 2,
    "failures": 8,
    "deterministic": true,
    "seed": 42
  },
  "summary": {
    "pass_rate": 0.2,
    "failure_rate": 0.8,
    "unique_failure_modes": 1,
    "properties_tested": 2,
    "properties_violated": 1
  },
  "failure_clusters": [
    {
      "cluster_id": "5eaeb96e",
      "property": "tool_citation",
      "count": 8,
      "failures": [
        {
          "case_id": 0,
          "trace_ref": "run_20251009_030224_12ef9201_det_42_traces.json",
          "trace_location": "\"failure_id\": 0",
          "violations": [...],
          "shrunk_variables": {...}
        }
      ]
    }
  ]
}

Coverage JSON (Variable & Property Stats)

{
  "variables": {
    "role": {
      "total_values": 3,
      "covered_values": 3,
      "coverage_percent": 100.0,
      "uncovered_values": []
    },
    "task": {
      "total_values": 4,
      "covered_values": 4,
      "coverage_percent": 100.0
    }
  },
  "properties": {
    "tool_citation": {
      "tested": true,
      "violated": true,
      "violation_count": 8
    },
    "helpful_response": {
      "tested": true,
      "violated": false,
      "violation_count": 0
    }
  },
  "total_combinations_tested": 10
}
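The per-variable numbers in this report follow directly from the enum definitions and the generated cases. Here is a minimal sketch of that computation; `variable_coverage` is a hypothetical helper, not the project's reporter, and the two cases are made up for the example.

```python
def variable_coverage(enums, cases):
    """Compute per-variable coverage in the shape of the "variables" section."""
    report = {}
    for name, values in enums.items():
        seen = {case[name] for case in cases if name in case}
        covered = [v for v in values if v in seen]
        report[name] = {
            "total_values": len(values),
            "covered_values": len(covered),
            "coverage_percent": round(100.0 * len(covered) / len(values), 1),
            "uncovered_values": [v for v in values if v not in seen],
        }
    return report


ENUM_VALUES = {
    "role": ["customer_support", "order_specialist", "returns_specialist"],
}
CASES = [
    {"role": "customer_support"},
    {"role": "order_specialist"},
]
coverage = variable_coverage(ENUM_VALUES, CASES)
print(coverage["role"])
```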
