Bencher

Open-source LLM benchmark built for real work, not leaderboard games.

Standard benchmarks test abstract reasoning — Bencher tests what actually matters: can a model call the right tools with the right parameters, and can it build a working chart from raw data? Four domains, two suites, difficulty scaling from trivial to brutal.

This started as a personal tool to stop guessing which model handles our workflows better. It's growing into something more general, but the philosophy stays the same: test on tasks that look like production, not on puzzles.

What it tests

Tool Calling — give the model a set of tools and a natural-language request, then check whether it picks the right tools, passes correct parameters, and chains calls in the right order.

Chart Building — give the model raw data and a visualization request, then check whether the output is valid HTML/SVG with correct data, proper design tokens, and working interactivity where required.

Both suites run across 4 domains (trading, devops, HR analytics, API gateway), 4 difficulty levels, and 10 prompt quality variations per task.
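
For a concrete sense of what a tool-calling task looks like, here is a simplified sketch in Python. The field names and the scenario are illustrative assumptions, not Bencher's actual task schema.

# Hypothetical shape of a tool-calling task; every field name here is an
# illustrative assumption, not the real schema.
example_task = {
    "task_id": "DO_M03",             # domain prefix + difficulty letter + index
    "difficulty": "medium",
    "quality": 7,                    # 1 = vague prompt, 10 = precise prompt
    "request": "Roll back the last failed deploy on web-01 and page on-call.",
    "expected_calls": [
        {"tool": "get_deployments", "params": {"host": "web-01", "status": "failed"}},
        {"tool": "rollback_deployment", "params": {"deployment_id": "<from previous call>"}},
        {"tool": "page_oncall", "params": {"severity": "high"}},
    ],
}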

Install

pip install -e .

This gives you the bench-run CLI. Alternatively: python -m llm_bench.cli.

For visual chart evaluation (renders charts to screenshots via Playwright):

pip install -e ".[visual]"
playwright install chromium

API key

Pass --api-key directly or create a .env file in the project root:

LLM_API_KEY=sk-or-v1-...

One variable, any provider. Works with any OpenAI-compatible endpoint.
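
As a sanity check that the key and endpoint work, you can hit the same endpoint yourself. A minimal sketch using the openai Python client follows; this is not Bencher's internal client, and the OpenRouter base URL is just one option.

# Minimal connectivity check against an OpenAI-compatible endpoint using the
# same LLM_API_KEY; illustrative only, not Bencher's internal code.
import os
from openai import OpenAI  # pip install openai

client = OpenAI(
    api_key=os.environ["LLM_API_KEY"],
    base_url="https://openrouter.ai/api/v1",  # or any other OpenAI-compatible endpoint
)
reply = client.chat.completions.create(
    model="google/gemma-4-31b-it",
    messages=[{"role": "user", "content": "ping"}],
)
print(reply.choices[0].message.content)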


Quick start

Run everything on a model:

bench-run run --model google/gemma-4-31b-it

That's it. All domains, all suites, all difficulties, all quality levels. Results stream to a JSONL file as they complete, so even if something crashes you don't lose progress.
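
Because results stream to JSONL, you can peek at progress mid-run with a few lines of Python. The field names below (task_id aside, "score" in particular) are assumptions about the record format, not a documented schema.

# Peek at an in-progress run. Field names are assumptions, not a documented schema.
import json

records = []
with open("results/incremental_google_gemma-4-31b-it_20260411_183154.jsonl") as f:
    for line in f:
        records.append(json.loads(line))

scores = [r.get("score", 0) for r in records]
print(f"{len(records)} tasks done, mean score {sum(scores) / max(len(scores), 1):.1f}")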


Domains

Domain | ID prefix | What's inside
trading | E01, C_E01 | Trader's journal — trade history, strategies, PnL, risk analysis
devops | DO_E01, DO_C_E01 | Server monitoring — deployments, incidents, metrics, alerts
data_analysis | DA_E01, DA_C_E01 | HR analytics — employees, surveys, projects, training
api_integration | AI_E01, AI_C_E01 | API gateway — endpoints, keys, request logs, webhooks

Each domain has its own fixture data, tools, scenarios, and expected results. The model sees realistic data structures — not toy examples.


Flags

What to test

Flag | Default | What it does
--model | required | Model ID. OpenRouter format: google/gemma-4-31b-it. For other providers, whatever their API expects.
--suite | all | Which suite to run. tools = only tool calling, charts = only chart building, all = both.
--domain | all | Which domain(s). Single: --domain trading. Multiple: --domain trading,devops. All: --domain all.
--task-id | all tasks | Run specific tasks only. Comma-separated: --task-id H01,DO_M03,C_X05. Useful for debugging a single scenario.
--mode | normal | Set to mixed for balanced cross-domain sampling — picks an equal number of tasks per difficulty from each domain instead of running domains sequentially (see the sketch below).
--seed | 42 | Random seed for --mode mixed. Same seed = same task selection.
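
A rough sketch of what seeded, balanced cross-domain sampling could look like; it illustrates the idea behind --mode mixed and --seed, not the project's actual sampler.

# Illustrative balanced sampler: the same number of tasks per difficulty from
# each domain, reproducible via a fixed seed. Not Bencher's real implementation.
import random

def mixed_sample(tasks_by_domain, per_difficulty=2, seed=42):
    rng = random.Random(seed)
    picked = []
    for tasks in tasks_by_domain.values():
        for level in ("easy", "medium", "hard", "expert"):
            pool = [t for t in tasks if t["difficulty"] == level]
            picked.extend(rng.sample(pool, min(per_difficulty, len(pool))))
    return picked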

Difficulty and quality filters

Each task has a difficulty (easy, medium, hard, expert) and is tested at quality levels 1-10, where 1 is a vague, sloppy prompt and 10 is a precise, well-structured one.

Flag | Default | What it does
--min-quality | 1 | Lower bound of prompt quality range.
--max-quality | 10 | Upper bound. --min-quality 10 --max-quality 10 = only the best prompts.
--min-difficulty | easy | Lower bound of difficulty.
--max-difficulty | expert | Upper bound. --max-difficulty medium = skip hard and expert tasks.
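
The two ranges appear to be inclusive and combine as a simple AND (e.g. --max-difficulty medium still keeps medium tasks). A sketch of that filter logic, with field names assumed for illustration:

# Assumed filter semantics: a task runs only if its difficulty and prompt
# quality both fall inside the requested (inclusive) ranges.
DIFFICULTIES = ["easy", "medium", "hard", "expert"]

def selected(task, min_difficulty="easy", max_difficulty="expert",
             min_quality=1, max_quality=10):
    d = DIFFICULTIES.index(task["difficulty"])
    in_difficulty = DIFFICULTIES.index(min_difficulty) <= d <= DIFFICULTIES.index(max_difficulty)
    in_quality = min_quality <= task["quality"] <= max_quality
    return in_difficulty and in_quality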

Model parameters

Flag | Default | What it does
--temperature | 0.0 | Sampling temperature. 0 = deterministic.
--top-p | provider default | Top-p (nucleus) sampling.
--max-tokens | provider default | Max tokens in response.

Execution

Flag | Default | What it does
--workers | 4 | Parallel task runners. Higher = faster but more API load.
--task-timeout | 300 | Per-task timeout in seconds. Tasks that exceed this get an infra error (score=0, retryable).
--resume | | Path to existing JSONL. Skips already-completed tasks, retries infra errors (timeouts, rate limits, 503s). Model errors are not retried — if the model failed, it failed.

Provider routing (OpenRouter-specific)

These flags only apply when using OpenRouter as the backend. Ignored for other providers.

Flag | Default | What it does
--provider-sort | | How OpenRouter picks the backend: price (cheapest), throughput (fastest tokens/sec), latency (lowest TTFT).
--provider-order | | Preferred providers in order: --provider-order Google,Together. OpenRouter tries these first.
--provider-no-fallback | false | If set, only use providers from --provider-order. No fallbacks. Useful when you need a specific provider for reproducibility.
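
These flags correspond to OpenRouter's provider-routing preferences and plausibly end up in the request body along the lines of the sketch below (using the client from the API-key example above). The exact mapping is an assumption, not Bencher's code.

# Rough mapping of the provider flags onto OpenRouter's "provider" request
# object, sent via the openai client's extra_body. Assumed mapping, for
# illustration only.
reply = client.chat.completions.create(
    model="google/gemma-4-31b-it",
    messages=[{"role": "user", "content": "ping"}],
    extra_body={
        "provider": {
            "sort": "throughput",             # --provider-sort throughput
            "order": ["Google", "Together"],  # --provider-order Google,Together
            "allow_fallbacks": False,         # --provider-no-fallback
        }
    },
)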

Evaluation

Flag | Default | What it does
--llm-judge | false | Enable LLM-as-judge scoring. A second model evaluates each response for correctness, reasoning quality, and hallucinations (tools) or visual clarity, data accuracy, and design quality (charts). Scores are separate from the main automated scoring.
--judge-model | | Which model judges. Required with --llm-judge. Example: --judge-model openai/gpt-4o.
--visual-judge | false | Enable visual chart evaluation. Renders each chart HTML to a PNG screenshot via Playwright, then sends the image to a multimodal model for scoring (0-100). Falls back to sending HTML as text if Playwright isn't installed.

Connection

Flag | Default | What it does
--api-key | from env | API key. Falls back to LLM_API_KEY from environment.
--base-url | OpenRouter | Any OpenAI-compatible endpoint. --base-url https://api.openai.com/v1 for direct OpenAI access.
--output-dir | llm_bench/data/results/ | Where reports and chart HTMLs go.

Use cases

"I want to compare two models on our trading tasks"

bench-run run --model google/gemma-4-31b-it --domain trading
bench-run run --model minimax/minimax-m2.5 --domain trading

bench-run compare \
  results/incremental_google_gemma-4-31b-it_*.jsonl \
  results/incremental_minimax_minimax-m2.5_*.jsonl \
  --models "Gemma 4 31B,Minimax M2.5"

"I only care about tool calling, skip charts"

bench-run run --model anthropic/claude-sonnet-4 --suite tools

"Run only the hardest tasks with the best prompts"

bench-run run --model openai/gpt-4o \
  --min-quality 10 --max-quality 10 \
  --min-difficulty hard

"Quick sanity check — just easy tasks, one quality level"

bench-run run --model openai/gpt-4o-mini \
  --max-difficulty easy \
  --min-quality 5 --max-quality 5

"I got rate limited halfway through, continue where I left off"

bench-run run --model google/gemma-4-31b-it \
  --resume results/incremental_google_gemma-4-31b-it_20260411_183154.jsonl

Completed tasks are skipped. Infra errors (timeouts, rate limits) are retried. Model errors stay as-is.

"Full evaluation with LLM judge and visual scoring"

bench-run run --model google/gemma-4-31b-it \
  --llm-judge --judge-model openai/gpt-4o \
  --visual-judge

"Use a local model via Ollama"

bench-run run --model llama3 \
  --base-url http://localhost:11434/v1 \
  --api-key ollama

"Run judge on results I already have"

bench-run judge results/incremental_modelA.jsonl \
  --judge-model openai/gpt-4o \
  --model "Model A" \
  --visual

"Debug a single failing task"

bench-run run --model google/gemma-4-31b-it \
  --task-id DO_H03 \
  --min-quality 10 --max-quality 10

Scoring

Weights

Harder tasks are worth more:

Difficulty | Weight
Easy | 1.0x
Medium | 1.5x
Hard | 2.5x
Expert | 4.0x

Each task at each quality level contributes raw_score * weight. Final percentage = achieved / max possible * 100%.
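
A small worked example of the weighting, with invented scores:

# Two easy tasks and one expert task, raw scores out of 100 (numbers invented).
weights = {"easy": 1.0, "medium": 1.5, "hard": 2.5, "expert": 4.0}
results = [("easy", 90), ("easy", 70), ("expert", 50)]

achieved = sum(score * weights[d] for d, score in results)   # 90 + 70 + 200 = 360
possible = sum(100 * weights[d] for d, _ in results)         # 100 + 100 + 400 = 600
print(f"{achieved / possible * 100:.1f}%")                   # 60.0%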

Tool Calling (0-100 per task)

Criterion | Max | What it checks
tool_selection | 25 | Did the model pick the right tool(s)?
required_params | 25 | Are all required parameters present and correct?
optional_params | 20 | Are optional parameters used when appropriate?
param_types | 15 | Are parameter types correct (string vs number vs bool)?
chain_order | 15 | For multi-tool tasks, are calls in the right sequence?
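
To make the rubric concrete, here is a toy version of the first two checks against an expected call list. This illustrates the criteria only; it is not the project's scoring code, and the real checks are presumably stricter.

# Toy scoring of tool_selection and required_params against expected calls.
def score_tool_selection(expected, actual, max_points=25):
    expected_tools = {c["tool"] for c in expected}
    actual_tools = {c["tool"] for c in actual}
    return max_points * len(expected_tools & actual_tools) / len(expected_tools)

def score_required_params(expected, actual, max_points=25):
    actual_params = {c["tool"]: c.get("params", {}) for c in actual}
    checks = passed = 0
    for call in expected:
        for name, value in call["params"].items():
            checks += 1
            passed += actual_params.get(call["tool"], {}).get(name) == value
    return max_points * passed / max(checks, 1)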

Chart Building (0-100 per task)

Criterion | Max | What it checks
data_presence | 30 | Is the source data actually in the output?
code_validity | 20 | Does the HTML/SVG parse without errors?
design_tokens | 20 | Are the specified colors, fonts, radii used?
chart_type_match | 10 | Is it the right chart type (bar vs line vs pie)?
design_rules | 10 | No shadows, gradients, or other banned patterns?
interactivity | 10 | Are interactive features present when required?
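
Two of the cheaper chart checks are easy to picture as plain substring tests. A toy sketch follows; the token and data values are invented, and this is not necessarily how Bencher implements them.

# Toy versions of data_presence and design_tokens as substring checks.
def data_presence_fraction(html, values=("Q1", "Q2", "Q3", 42, 17.5)):
    return sum(str(v) in html for v in values) / len(values)

def design_tokens_fraction(html, tokens=("#1a1a2e", "Inter", "border-radius: 8px")):
    return sum(t.lower() in html.lower() for t in tokens) / len(tokens)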

Error handling

  • Model errors (bad output, parse failure) — score=0, counted in results. The model blew it.
  • Infra errors (503, timeout, rate limit) — score=0, tracked separately. Not the model's fault. Retryable via --resume.
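
This split is what makes --resume safe: only records marked as infra failures get retried. A sketch of that filtering, with field names assumed rather than documented:

# Collect task IDs to retry from an incremental JSONL. The "error_type" and
# "task_id" field names are assumptions about the record format.
import json

def tasks_to_retry(jsonl_path):
    retry = []
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            if record.get("error_type") == "infra":   # timeout, rate limit, 503
                retry.append(record["task_id"])
    return retry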

Visual judge

When --visual-judge is enabled, each chart is rendered to PNG via Playwright and sent to a multimodal model for visual scoring (0-100). This catches things automated scoring can't — like charts that technically have the right elements but look broken when rendered.

Falls back to sending HTML source as text if Playwright isn't installed (less accurate but still useful).
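
The render step itself is straightforward with Playwright's Python API. A minimal sketch, not the project's actual code:

# Render a chart HTML file to a PNG with headless Chromium. Minimal sketch;
# html_path must be absolute for the file:// URL to resolve.
from playwright.sync_api import sync_playwright

def render_chart(html_path, png_path):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        page.goto(f"file://{html_path}")
        page.screenshot(path=png_path, full_page=True)
        browser.close()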


Output

Each run produces:

  • JSONL — one result per line, written as tasks complete. Crash-safe.
  • JSON report — full breakdown by difficulty, quality, domain, suite. Usage stats, error summary.
  • Terminal tables — scores, difficulty breakdown, domain comparison. Printed via Rich.
  • Chart HTML files — every generated chart saved as {task_id}_Q{quality}.html.

Task IDs

Trading

Suite | Difficulty | IDs
Tools | Easy | E01-E05
Tools | Medium | M01-M06
Tools | Hard | H01-H05
Tools | Expert | X01-X04
Charts | Easy | C_E01-C_E05
Charts | Medium | C_M01-C_M06
Charts | Hard | C_H01-C_H06
Charts | Expert | C_X01-C_X05

DevOps (prefix: DO_)

Tools: DO_E01-DO_X04, Charts: DO_C_E01-DO_C_X03

Data Analysis (prefix: DA_)

Tools: DA_E01-DA_X04, Charts: DA_C_E01-DA_C_X03

API Integration (prefix: AI_)

Tools: AI_E01-AI_X04, Charts: AI_C_E01-AI_C_X03
