AQBE — Agentic Quantized Benchmark Evaluation

A generalized evaluation framework for quantized LLMs in agentic workflows.

Why AQBE

Standard LLM benchmarks (MMLU, HumanEval, GSM8K) test isolated single-turn capability. AQBE targets the failure modes that only appear in production agentic workflows:

Structured output reliability under token pressure
Tool chain stability across multi-step skill files
Schema compliance with constrained vocabularies
Context retention across sequential tool calls
Adversarial robustness in chat interfaces
Thinking mode token exhaustion on structured output

Usage

# Run all task packs with default models
python3 aqbe.py

# Custom model registry
python3 aqbe.py --models models.yaml

# Specific task pack
python3 aqbe.py --tasks task_packs/general_agentic.yaml

# Filter by category
python3 aqbe.py --category agentic

# Subset of models
python3 aqbe.py --model-filter small openai

# Custom output directory
python3 aqbe.py --output results/my_run

File Structure

aqbe/
  aqbe.py              # Main runner — model-agnostic, task-agnostic
  models.yaml          # Model registry (endpoints, auth, params)
  task_packs/
    financial_agentic.yaml   # Financial domain tasks
    general_agentic.yaml     # Domain-agnostic agentic tasks (add your own)
  results/             # Auto-created — JSON + MD + per-model responses

Adding Models

Edit models.yaml:

models:
  my_model:
    label: "My Model Name"
    url: "http://localhost:8081/v1"
    model: "my-model-id"
    headers:
      Authorization: "Bearer my-key"
    temperature: 0.1
    max_tokens: 1024
    timeout: 60

Adding Task Packs

Create a YAML file in task_packs/:

pack_name: my_pack
tasks:
  - id: my_task
    category: reasoning
    description: "What this tests"
    system: "System prompt for the model"
    prompt: "User prompt"
    eval: regex_checks    # or: json_fields, numeric
    checks:
      - name: has_answer
        pattern: "expected.*pattern"
        case_insensitive: true
      - name: not_too_long
        max_words: 200
      - name: no_bad_word
        pattern: "forbidden"
        invert: true

Eval Types

Type	Use for	Required fields
`regex_checks`	Keyword presence, length, invert	`checks` list
`json_fields`	JSON structure validation	`expected_fields`, `expected_values`
`numeric`	Arithmetic correctness	`expected` dict with key: value

Scoring

Each check is worth 2 points (configurable via points field)
Score = (checks passed / total possible) × 100%
Hard FAIL (0%) flags deployment blockers
Results saved per-run with timestamp to results/

Task Packs Included

Pack	Domain	Tasks	Key failure modes tested
`financial_agentic.yaml`	Finance	8	Canonical paths, tool grounding, valuation reasoning
`general_agentic.yaml`	General	14	Structured output, multi-turn, edge cases, agentic harness, adversarial

Run both together:

python3 aqbe.py --tasks task_packs/financial_agentic.yaml task_packs/general_agentic.yaml

Key Findings (from initial FLAME-Q / AQBE v1 run)

qwen3.5-4b non-thinking: best local model for agentic tasks (93.9%, 4.4s)
gemma-4-26b-a4b-it: tied at 93.9%, fastest large model at 3.5s
Thinking mode at ≤4b: degrades structured output — token budget exhausted by reasoning chain
/no_think prefix: fixes thinking mode for qwen3.5 models on llamacpp servers
gpt-oss-20b: underperforms its size — generic analysis, misses specific entities
Models <2b: fail canonical path schema and multi-tool selection

See EVAL_SUMMARY.md in model_test/ for full analysis.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
results		results
samples		samples
task_packs		task_packs
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
aqbe.py		aqbe.py
models.yaml		models.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AQBE — Agentic Quantized Benchmark Evaluation

Why AQBE

Usage

File Structure

Adding Models

Adding Task Packs

Eval Types

Scoring

Task Packs Included

Key Findings (from initial FLAME-Q / AQBE v1 run)

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

AQBE — Agentic Quantized Benchmark Evaluation

Why AQBE

Usage

File Structure

Adding Models

Adding Task Packs

Eval Types

Scoring

Task Packs Included

Key Findings (from initial FLAME-Q / AQBE v1 run)

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages