Open-source LLM benchmark built for real work, not leaderboard games.
Standard benchmarks test abstract reasoning — Bencher tests what actually matters: can a model call the right tools with the right parameters, and can it build a working chart from raw data? Four domains, two suites, difficulty scaling from trivial to brutal.
This started as a personal tool to stop guessing which model handles our workflows better. It's growing into something more general, but the philosophy stays the same: test on tasks that look like production, not on puzzles.
Tool Calling — give the model a set of tools and a natural language request, check if it picks the right tools, passes correct parameters, chains calls in the right order.
Chart Building — give the model raw data and a visualization request, check if the output is valid HTML/SVG with correct data, proper design tokens, and working interactivity where required.
Both suites run across 4 domains (trading, devops, HR analytics, API gateway), 4 difficulty levels, and 10 prompt quality variations per task.
```bash
pip install -e .
```

This gives you the `bench-run` CLI. Alternatively: `python -m llm_bench.cli`.

For visual chart evaluation (renders charts to screenshots via Playwright):

```bash
pip install -e ".[visual]"
playwright install chromium
```

Pass `--api-key` directly or create `.env` in the project root:

```
LLM_API_KEY=sk-or-v1-...
```

One variable, any provider. Works with any OpenAI-compatible endpoint.

Run everything on a model:

```bash
bench-run run --model google/gemma-4-31b-it
```

That's it. All domains, all suites, all difficulties, all quality levels. Results stream to a JSONL file as they complete, so even if something crashes you don't lose progress.
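Since each finished task appends one line to that JSONL, you can keep an eye on a long run from another terminal. A minimal sketch, assuming the default output directory and the `incremental_<model>_<timestamp>.jsonl` naming used in the examples below:

```bash
# Count completed tasks so far (one JSONL line per finished task).
# Paths are assumptions based on the default --output-dir.
wc -l llm_bench/data/results/incremental_*.jsonl

# Or follow results as they arrive
tail -f llm_bench/data/results/incremental_*.jsonl
```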
| Domain | ID prefix | What's inside |
|---|---|---|
| `trading` | E01, C_E01 | Trader's journal — trade history, strategies, PnL, risk analysis |
| `devops` | DO_E01, DO_C_E01 | Server monitoring — deployments, incidents, metrics, alerts |
| `data_analysis` | DA_E01, DA_C_E01 | HR analytics — employees, surveys, projects, training |
| `api_integration` | AI_E01, AI_C_E01 | API gateway — endpoints, keys, request logs, webhooks |
Each domain has its own fixture data, tools, scenarios, and expected results. The model sees realistic data structures — not toy examples.
| Flag | Default | What it does |
|---|---|---|
| `--model` | required | Model ID. OpenRouter format: `google/gemma-4-31b-it`. For other providers, whatever their API expects. |
| `--suite` | `all` | Which suite to run. `tools` = only tool calling, `charts` = only chart building, `all` = both. |
| `--domain` | `all` | Which domain(s). Single: `--domain trading`. Multiple: `--domain trading,devops`. All: `--domain all`. |
| `--task-id` | all tasks | Run specific tasks only. Comma-separated: `--task-id H01,DO_M03,C_X05`. Useful for debugging a single scenario. |
| `--mode` | `normal` | Set to `mixed` for balanced cross-domain sampling — picks an equal number of tasks per difficulty from each domain instead of running domains sequentially. |
| `--seed` | `42` | Random seed for `--mode mixed`. Same seed = same task selection. |
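For example, combining the flags above for a balanced cross-domain sample of the tool-calling suite:

```bash
bench-run run --model google/gemma-4-31b-it \
  --suite tools \
  --domain trading,devops \
  --mode mixed --seed 42
```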
Each task has a difficulty (easy, medium, hard, expert) and is tested at quality levels 1-10, where 1 is a vague, sloppy prompt and 10 is a precise, well-structured one.
| Flag | Default | What it does |
|---|---|---|
| `--min-quality` | `1` | Lower bound of prompt quality range. |
| `--max-quality` | `10` | Upper bound. `--min-quality 10 --max-quality 10` = only the best prompts. |
| `--min-difficulty` | `easy` | Lower bound of difficulty. |
| `--max-difficulty` | `expert` | Upper bound. `--max-difficulty medium` = skip hard and expert tasks. |
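For instance, to stress only the hardest tasks with deliberately sloppy prompts:

```bash
bench-run run --model google/gemma-4-31b-it \
  --min-difficulty hard \
  --min-quality 1 --max-quality 3
```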
| Flag | Default | What it does |
|---|---|---|
| `--temperature` | `0.0` | Sampling temperature. 0 = deterministic. |
| `--top-p` | provider default | Top-p (nucleus) sampling. |
| `--max-tokens` | provider default | Max tokens in response. |
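Runs are deterministic by default (temperature 0). To see how a model behaves with sampling turned on, for example:

```bash
bench-run run --model google/gemma-4-31b-it \
  --temperature 0.7 --top-p 0.95 --max-tokens 4096
```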
| Flag | Default | What it does |
|---|---|---|
| `--workers` | `4` | Parallel task runners. Higher = faster but more API load. |
| `--task-timeout` | `300` | Per-task timeout in seconds. Tasks that exceed this get an infra error (score=0, retryable). |
| `--resume` | — | Path to existing JSONL. Skips already-completed tasks, retries infra errors (timeouts, rate limits, 503s). Model errors are not retried — if the model failed, it failed. |
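For example, to push more parallelism and give slow tasks a longer budget:

```bash
bench-run run --model google/gemma-4-31b-it \
  --workers 8 --task-timeout 600
```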
These flags only apply when using OpenRouter as the backend. Ignored for other providers.
| Flag | Default | What it does |
|---|---|---|
| `--provider-sort` | — | How OpenRouter picks the backend: `price` (cheapest), `throughput` (fastest tokens/sec), `latency` (lowest TTFT). |
| `--provider-order` | — | Preferred providers in order: `--provider-order Google,Together`. OpenRouter tries these first. |
| `--provider-no-fallback` | `false` | If set, only use providers from `--provider-order`. No fallbacks. Useful when you need a specific provider for reproducibility. |
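For example, routing to the cheapest backend, or pinning specific providers for reproducible runs:

```bash
# Cheapest available backend
bench-run run --model google/gemma-4-31b-it --provider-sort price

# Pin providers, no fallback
bench-run run --model google/gemma-4-31b-it \
  --provider-order Google,Together \
  --provider-no-fallback
```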
| Flag | Default | What it does |
|---|---|---|
| `--llm-judge` | `false` | Enable LLM-as-judge scoring. A second model evaluates each response for correctness, reasoning quality, and hallucinations (tools) or visual clarity, data accuracy, and design quality (charts). Scores are separate from the main automated scoring. |
| `--judge-model` | — | Which model judges. Required with `--llm-judge`. Example: `--judge-model openai/gpt-4o`. |
| `--visual-judge` | `false` | Enable visual chart evaluation. Renders each chart HTML to a PNG screenshot via Playwright, then sends the image to a multimodal model for scoring (0-100). Falls back to sending HTML as text if Playwright isn't installed. |
| Flag | Default | What it does |
|---|---|---|
| `--api-key` | from env | API key. Falls back to `LLM_API_KEY` from environment. |
| `--base-url` | OpenRouter | Any OpenAI-compatible endpoint. `--base-url https://api.openai.com/v1` for direct OpenAI access. |
| `--output-dir` | `llm_bench/data/results/` | Where reports and chart HTMLs go. |
Compare two models on the same domain:

```bash
bench-run run --model google/gemma-4-31b-it --domain trading
bench-run run --model minimax/minimax-m2.5 --domain trading

bench-run compare \
  results/incremental_google_gemma-4-31b-it_*.jsonl \
  results/incremental_minimax_minimax-m2.5_*.jsonl \
  --models "Gemma 4 31B,Minimax M2.5"
```

Tool calling only:

```bash
bench-run run --model anthropic/claude-sonnet-4 --suite tools
```

Only the hardest tasks with the best prompts:

```bash
bench-run run --model openai/gpt-4o \
  --min-quality 10 --max-quality 10 \
  --min-difficulty hard
```

A quick, cheap pass (easy tasks, mid-range prompt quality):

```bash
bench-run run --model openai/gpt-4o-mini \
  --max-difficulty easy \
  --min-quality 5 --max-quality 5
```

Resume an interrupted run:

```bash
bench-run run --model google/gemma-4-31b-it \
  --resume results/incremental_google_gemma-4-31b-it_20260411_183154.jsonl
```

Completed tasks are skipped. Infra errors (timeouts, rate limits) are retried. Model errors stay as-is.

Run with LLM and visual judging enabled:

```bash
bench-run run --model google/gemma-4-31b-it \
  --llm-judge --judge-model openai/gpt-4o \
  --visual-judge
```

Run a local model via an OpenAI-compatible endpoint (here, Ollama):

```bash
bench-run run --model llama3 \
  --base-url http://localhost:11434/v1 \
  --api-key ollama
```

Judge existing results after the fact:

```bash
bench-run judge results/incremental_modelA.jsonl \
  --judge-model openai/gpt-4o \
  --model "Model A" \
  --visual
```

Debug a single task at top prompt quality:

```bash
bench-run run --model google/gemma-4-31b-it \
  --task-id DO_H03 \
  --min-quality 10 --max-quality 10
```

Harder tasks are worth more:
| Difficulty | Weight |
|---|---|
| Easy | 1.0x |
| Medium | 1.5x |
| Hard | 2.5x |
| Expert | 4.0x |
Each task at each quality level contributes raw_score * weight. Final percentage = achieved / max possible * 100%.
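For example, since each suite's criteria below sum to 100 points, a hard task scored 80/100 at one quality level contributes 80 × 2.5 = 200 points toward a possible 100 × 2.5 = 250.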
| Criterion | Max | What it checks |
|---|---|---|
| tool_selection | 25 | Did the model pick the right tool(s)? |
| required_params | 25 | Are all required parameters present and correct? |
| optional_params | 20 | Are optional parameters used when appropriate? |
| param_types | 15 | Are parameter types correct (string vs number vs bool)? |
| chain_order | 15 | For multi-tool tasks, are calls in the right sequence? |
| Criterion | Max | What it checks |
|---|---|---|
| data_presence | 30 | Is the source data actually in the output? |
| code_validity | 20 | Does the HTML/SVG parse without errors? |
| design_tokens | 20 | Are the specified colors, fonts, radii used? |
| chart_type_match | 10 | Is it the right chart type (bar vs line vs pie)? |
| design_rules | 10 | No shadows, gradients, or other banned patterns? |
| interactivity | 10 | Are interactive features present when required? |
- Model errors (bad output, parse failure) — score=0, counted in results. The model blew it.
- Infra errors (503, timeout, rate limit) — score=0, tracked separately. Not the model's fault. Retryable via `--resume`.
When --visual-judge is enabled, each chart is rendered to PNG via Playwright and sent to a multimodal model for visual scoring (0-100). This catches things automated scoring can't — like charts that technically have the right elements but look broken when rendered.
Falls back to sending HTML source as text if Playwright isn't installed (less accurate but still useful).
Each run produces:
- JSONL — one result per line, written as tasks complete. Crash-safe.
- JSON report — full breakdown by difficulty, quality, domain, suite. Usage stats, error summary.
- Terminal tables — scores, difficulty breakdown, domain comparison. Printed via Rich.
- Chart HTML files — every generated chart saved as `{task_id}_Q{quality}.html`.
| Suite | Difficulty | IDs |
|---|---|---|
| Tools | Easy | E01-E05 |
| Tools | Medium | M01-M06 |
| Tools | Hard | H01-H05 |
| Tools | Expert | X01-X04 |
| Charts | Easy | C_E01-C_E05 |
| Charts | Medium | C_M01-C_M06 |
| Charts | Hard | C_H01-C_H06 |
| Charts | Expert | C_X01-C_X05 |
The unprefixed IDs above belong to the trading domain. The other domains:

- `devops` (Tools: DO_E01-DO_X04, Charts: DO_C_E01-DO_C_X03)
- `data_analysis` (Tools: DA_E01-DA_X04, Charts: DA_C_E01-DA_C_X03)
- `api_integration` (Tools: AI_E01-AI_X04, Charts: AI_C_E01-AI_C_X03)