Open-source LLM benchmark built for real work, not leaderboard games.
Standard benchmarks test abstract reasoning — Bencher tests what actually matters: can a model call the right tools with the right parameters, and can it build a working chart from raw data? Four domains, two suites, difficulty scaling from trivial to brutal.
This started as a personal tool to stop guessing which model handles our workflows better. It's growing into something more general, but the philosophy stays the same: test on tasks that look like production, not on puzzles.
Tool Calling — give the model a set of tools and a natural language request, check if it picks the right tools, passes correct parameters, chains calls in the right order.
Chart Building — give the model raw data and a visualization request, check if the output is valid HTML/SVG with correct data, proper design tokens, and working interactivity where required.
Both suites run across 4 domains (trading, devops, HR analytics, API gateway), 4 difficulty levels, and 10 prompt quality variations per task.
```bash
pip install -e .
```

This gives you the `bench-run` CLI. Alternatively: `python -m llm_bench.cli`.

For visual chart evaluation (renders charts to screenshots via Playwright):

```bash
pip install -e ".[visual]"
playwright install chromium
```

Pass `--api-key` directly or create `.env` in the project root:

```
LLM_API_KEY=sk-or-v1-...
```

One variable, any provider. Works with any OpenAI-compatible endpoint.

Run everything on a model:

```bash
bench-run run --model google/gemma-4-31b-it
```

That's it. All domains, all suites, all difficulties, all quality levels. Results stream to a JSONL file as they complete, so even if something crashes you don't lose progress.
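Since each finished task appends one line to that JSONL, you can keep an eye on a long run from another terminal. A minimal sketch, assuming the default output directory and the `incremental_<model>_<timestamp>.jsonl` naming used in the examples below:

```bash
# Count completed tasks so far (one JSONL line per finished task).
# Paths are assumptions based on the default --output-dir.
wc -l llm_bench/data/results/incremental_*.jsonl

# Or follow results as they arrive
tail -f llm_bench/data/results/incremental_*.jsonl
```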
| Domain | ID prefix | What's inside |
|---|---|---|
| `trading` | E01, C_E01 | Trader's journal — trade history, strategies, PnL, risk analysis |
| `devops` | DO_E01, DO_C_E01 | Server monitoring — deployments, incidents, metrics, alerts |
| `data_analysis` | DA_E01, DA_C_E01 | HR analytics — employees, surveys, projects, training |
| `api_integration` | AI_E01, AI_C_E01 | API gateway — endpoints, keys, request logs, webhooks |
Each domain has its own fixture data, tools, scenarios, and expected results. The model sees realistic data structures — not toy examples.
| Flag | Default | What it does |
|---|---|---|
| `--model` | required | Model ID. OpenRouter format: `google/gemma-4-31b-it`. For other providers, whatever their API expects. |
| `--suite` | `all` | Which suite to run. `tools` = only tool calling, `charts` = only chart building, `all` = both. |
| `--domain` | `all` | Which domain(s). Single: `--domain trading`. Multiple: `--domain trading,devops`. All: `--domain all`. |
| `--task-id` | all tasks | Run specific tasks only. Comma-separated: `--task-id H01,DO_M03,C_X05`. Useful for debugging a single scenario. |
| `--mode` | `normal` | Set to `mixed` for balanced cross-domain sampling — picks an equal number of tasks per difficulty from each domain instead of running domains sequentially. |
| `--seed` | `42` | Random seed for `--mode mixed`. Same seed = same task selection. |
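For example, combining the flags above for a balanced cross-domain sample of the tool-calling suite:

```bash
bench-run run --model google/gemma-4-31b-it \
  --suite tools \
  --domain trading,devops \
  --mode mixed --seed 42
```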
Each task has a difficulty (easy, medium, hard, expert) and is tested at quality levels 1-10, where 1 is a vague, sloppy prompt and 10 is a precise, well-structured one.
| Flag | Default | What it does |
|---|---|---|
| `--min-quality` | `1` | Lower bound of prompt quality range. |
| `--max-quality` | `10` | Upper bound. `--min-quality 10 --max-quality 10` = only the best prompts. |
| `--min-difficulty` | `easy` | Lower bound of difficulty. |
| `--max-difficulty` | `expert` | Upper bound. `--max-difficulty medium` = skip hard and expert tasks. |
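For instance, to stress only the hardest tasks with deliberately sloppy prompts:

```bash
bench-run run --model google/gemma-4-31b-it \
  --min-difficulty hard \
  --min-quality 1 --max-quality 3
```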
| Flag | Default | What it does |
|---|---|---|
| `--temperature` | `0.0` | Sampling temperature. 0 = deterministic. |
| `--top-p` | provider default | Top-p (nucleus) sampling. |
| `--max-tokens` | provider default | Max tokens in response. |
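Runs are deterministic by default (temperature 0). To see how a model behaves with sampling turned on, for example:

```bash
bench-run run --model google/gemma-4-31b-it \
  --temperature 0.7 --top-p 0.95 --max-tokens 4096
```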
| Flag | Default | What it does |
|---|---|---|
| `--workers` | `4` | Parallel task runners. Higher = faster but more API load. |
| `--task-timeout` | `300` | Per-task timeout in seconds. Tasks that exceed this get an infra error (score=0, retryable). |
| `--resume` | — | Path to existing JSONL. Skips already-completed tasks, retries infra errors (timeouts, rate limits, 503s). Model errors are not retried — if the model failed, it failed. |
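For example, to push more parallelism and give slow tasks a longer budget:

```bash
bench-run run --model google/gemma-4-31b-it \
  --workers 8 --task-timeout 600
```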
These flags only apply when using OpenRouter as the backend. Ignored for other providers.
| Flag | Default | What it does |
|---|---|---|
| `--provider-sort` | — | How OpenRouter picks the backend: `price` (cheapest), `throughput` (fastest tokens/sec), `latency` (lowest TTFT). |
| `--provider-order` | — | Preferred providers in order: `--provider-order Google,Together`. OpenRouter tries these first. |
| `--provider-no-fallback` | `false` | If set, only use providers from `--provider-order`. No fallbacks. Useful when you need a specific provider for reproducibility. |
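For example, routing to the cheapest backend, or pinning specific providers for reproducible runs:

```bash
# Cheapest available backend
bench-run run --model google/gemma-4-31b-it --provider-sort price

# Pin providers, no fallback
bench-run run --model google/gemma-4-31b-it \
  --provider-order Google,Together \
  --provider-no-fallback
```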
| Flag | Default | What it does |
|---|---|---|
| `--llm-judge` | `false` | Enable LLM-as-judge scoring. A second model evaluates each response for correctness, reasoning quality, and hallucinations (tools) or visual clarity, data accuracy, and design quality (charts). Scores are separate from the main automated scoring. |
| `--judge-model` | — | Which model judges. Required with `--llm-judge`. Example: `--judge-model openai/gpt-4o`. |
| `--visual-judge` | `false` | Enable visual chart evaluation. Renders each chart HTML to a PNG screenshot via Playwright, then sends the image to a multimodal model for scoring (0-100). Falls back to sending HTML as text if Playwright isn't installed. |
| Flag | Default | What it does |
|---|---|---|
| `--api-key` | from env | API key. Falls back to `LLM_API_KEY` from environment. |
| `--base-url` | OpenRouter | Any OpenAI-compatible endpoint. `--base-url https://api.openai.com/v1` for direct OpenAI access. |
| `--output-dir` | `llm_bench/data/results/` | Where reports and chart HTMLs go. |
Compare two models on the same domain:

```bash
bench-run run --model google/gemma-4-31b-it --domain trading
bench-run run --model minimax/minimax-m2.5 --domain trading

bench-run compare \
  results/incremental_google_gemma-4-31b-it_*.jsonl \
  results/incremental_minimax_minimax-m2.5_*.jsonl \
  --models "Gemma 4 31B,Minimax M2.5"
```

Tool calling only:

```bash
bench-run run --model anthropic/claude-sonnet-4 --suite tools
```

Only the hardest tasks with the best prompts:

```bash
bench-run run --model openai/gpt-4o \
  --min-quality 10 --max-quality 10 \
  --min-difficulty hard
```

A quick, cheap pass (easy tasks, mid-range prompt quality):

```bash
bench-run run --model openai/gpt-4o-mini \
  --max-difficulty easy \
  --min-quality 5 --max-quality 5
```

Resume an interrupted run:

```bash
bench-run run --model google/gemma-4-31b-it \
  --resume results/incremental_google_gemma-4-31b-it_20260411_183154.jsonl
```

Completed tasks are skipped. Infra errors (timeouts, rate limits) are retried. Model errors stay as-is.

Run with LLM and visual judging enabled:

```bash
bench-run run --model google/gemma-4-31b-it \
  --llm-judge --judge-model openai/gpt-4o \
  --visual-judge
```

Run a local model via an OpenAI-compatible endpoint (here, Ollama):

```bash
bench-run run --model llama3 \
  --base-url http://localhost:11434/v1 \
  --api-key ollama
```

Judge existing results after the fact:

```bash
bench-run judge results/incremental_modelA.jsonl \
  --judge-model openai/gpt-4o \
  --model "Model A" \
  --visual
```

Debug a single task at top prompt quality:

```bash
bench-run run --model google/gemma-4-31b-it \
  --task-id DO_H03 \
  --min-quality 10 --max-quality 10
```

Harder tasks are worth more:
| Difficulty | Weight |
|---|---|
| Easy | 1.0x |
| Medium | 1.5x |
| Hard | 2.5x |
| Expert | 4.0x |
Each task at each quality level contributes raw_score * weight. Final percentage = achieved / max possible * 100%.
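For example, since each suite's criteria below sum to 100 points, a hard task scored 80/100 at one quality level contributes 80 × 2.5 = 200 points toward a possible 100 × 2.5 = 250.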
| Criterion | Max | What it checks |
|---|---|---|
| tool_selection | 25 | Did the model pick the right tool(s)? |
| required_params | 25 | Are all required parameters present and correct? |
| optional_params | 20 | Are optional parameters used when appropriate? |
| param_types | 15 | Are parameter types correct (string vs number vs bool)? |
| chain_order | 15 | For multi-tool tasks, are calls in the right sequence? |
| Criterion | Max | What it checks |
|---|---|---|
| data_presence | 30 | Is the source data actually in the output? |
| code_validity | 20 | Does the HTML/SVG parse without errors? |
| design_tokens | 20 | Are the specified colors, fonts, radii used? |
| chart_type_match | 10 | Is it the right chart type (bar vs line vs pie)? |
| design_rules | 10 | No shadows, gradients, or other banned patterns? |
| interactivity | 10 | Are interactive features present when required? |
- Model errors (bad output, parse failure) — score=0, counted in results. The model blew it.
- Infra errors (503, timeout, rate limit) — score=0, tracked separately. Not the model's fault. Retryable via `--resume`.
When --visual-judge is enabled, each chart is rendered to PNG via Playwright and sent to a multimodal model for visual scoring (0-100). This catches things automated scoring can't — like charts that technically have the right elements but look broken when rendered.
Falls back to sending HTML source as text if Playwright isn't installed (less accurate but still useful).
Each run produces:
- JSONL — one result per line, written as tasks complete. Crash-safe.
- JSON report — full breakdown by difficulty, quality, domain, suite. Usage stats, error summary.
- Terminal tables — scores, difficulty breakdown, domain comparison. Printed via Rich.
- Chart HTML files — every generated chart saved as `{task_id}_Q{quality}.html`.
| Suite | Difficulty | IDs |
|---|---|---|
| Tools | Easy | E01-E05 |
| Tools | Medium | M01-M06 |
| Tools | Hard | H01-H05 |
| Tools | Expert | X01-X04 |
| Charts | Easy | C_E01-C_E05 |
| Charts | Medium | C_M01-C_M06 |
| Charts | Hard | C_H01-C_H06 |
| Charts | Expert | C_X01-C_X05 |
The unprefixed IDs above belong to the trading domain. The other domains:

- `devops` (Tools: DO_E01-DO_X04, Charts: DO_C_E01-DO_C_X03)
- `data_analysis` (Tools: DA_E01-DA_X04, Charts: DA_C_E01-DA_C_X03)
- `api_integration` (Tools: AI_E01-AI_X04, Charts: AI_C_E01-AI_C_X03)