Benchmark local LLMs by what actually matters.
BenchLoop is a local-first CLI + web app for benchmarking LLMs running on your own hardware. It scores models across seven repeatable suites (speed, tool calling, coding, data extraction, instruction following, reasoning/math, and multi-turn agentic tool use) and gives you receipts: per-task outputs, latency, token counts, machine info, scores.
No accounts, no telemetry, no API keys. Your model, your machine, your numbers.
```
$ benchloop run --model qwen3:8b --suites speed,toolcall,agent
... 8 tasks, 4 tools, 6 turns avg, 74.6 tok/s ...
Overall  73.4  ████████░░
Quality  73.6  ████████░░
Speed    78.9  █████████░
Agent    96.9  █████████▌
```
Published runs live at https://bench-loop.com/leaderboard. Every completed local benchmark auto-publishes there.
Hosted LLM leaderboards answer "which model wins on a server farm someone else paid for?" BenchLoop answers "which model + harness + hardware combination actually works for me right now?" — the question you have when picking a local stack.
It is repeatable on purpose: every run persists to disk, the task set is frozen, the scorer is deterministic. If you say "qwen3:8b scored 89 on my 4090", anyone can install BenchLoop and verify it.
```
pipx install benchloop-cli
benchloop --version
```

The PyPI distribution is named `benchloop-cli` (the bare `benchloop` name was taken by an unrelated dataset library). The installed commands are still `benchloop` and `bench-loop`.
Plain `pip` works too:

```
pip install benchloop-cli
```

Or install from source:

```
git clone https://github.com/outsourc-e/bench-loop
cd bench-loop
pip install -e .
```

Make sure you have a local LLM endpoint running. Anything OpenAI-compatible or Ollama-flavored works:
- Ollama at `http://localhost:11434` (default)
- LM Studio at `http://localhost:1234` (`--provider openai_compat`)
- MLX / Osaurus at `http://localhost:8000` (`--provider openai_compat`)
- vLLM, Jan, llama-server, etc.
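If you're not sure an endpoint is actually up, a quick probe saves a failed run. A minimal Python sketch (it assumes only the standard listing routes: `/api/tags` on Ollama, `/v1/models` on OpenAI-compatible servers):

```python
import json
import urllib.request

def probe(url: str) -> None:
    """Print whatever model listing the endpoint advertises."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(url, "->", json.dumps(json.load(resp))[:200])
    except OSError as exc:
        print(url, "-> unreachable:", exc)

probe("http://localhost:11434/api/tags")   # Ollama
probe("http://localhost:1234/v1/models")   # LM Studio / other openai_compat
```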
Then:
```
benchloop run \
  --model qwen3:8b \
  --endpoint http://localhost:11434 \
  --provider ollama
```

This runs every default suite, scores them, prints a console report, and persists the full run to `~/.bench-loop/runs/`.
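Since every run is just files on disk, you can inspect the receipts directly. A sketch (it assumes nothing beyond the `~/.bench-loop/runs/` path above; the per-run file layout is whatever the CLI writes):

```python
from pathlib import Path

runs_dir = Path.home() / ".bench-loop" / "runs"
# Newest first, so the run you just finished sorts to the top.
for run in sorted(runs_dir.iterdir(), key=lambda p: p.stat().st_mtime, reverse=True):
    print(run.name)
```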
To run a subset of suites:

```
benchloop run --model qwen3:8b --suites speed,agent
```

Same model, four ways to talk to it:

```
benchloop run --model qwen3:8b --harness raw     # native tool calling
benchloop run --model qwen3:8b --harness hermes  # <tool_call>{...}</tool_call>
benchloop run --model qwen3:8b --harness qwen    # <function_call>{...}</function_call>
benchloop run --model qwen3:8b --harness pi      # <think>...</think> + Hermes tags
```
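The tag-based harnesses differ only in the wrapper around the JSON payload, which makes the adapter's job mostly extraction. A minimal sketch of that parsing (illustrative only, not the project's actual `harness.py`):

```python
import json
import re

# Wrapper tag per harness style, per the comments above.
TAGS = {"hermes": "tool_call", "qwen": "function_call"}

def extract_calls(text: str, style: str) -> list[dict]:
    """Pull JSON tool calls out of tag-wrapped model output."""
    if style == "pi":  # pi: strip <think> blocks first, then parse Hermes tags
        text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
        style = "hermes"
    tag = TAGS[style]
    pattern = rf"<{tag}>\s*(\{{.*?\}})\s*</{tag}>"
    return [json.loads(m) for m in re.findall(pattern, text, flags=re.DOTALL)]

out = 'On it. <tool_call>{"name": "get_weather", "arguments": {"city": "Berlin"}}</tool_call>'
print(extract_calls(out, "hermes"))
# [{'name': 'get_weather', 'arguments': {'city': 'Berlin'}}]
```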
To tag a run with the hardware it ran on:

```
benchloop run \
  --model qwen3:8b \
  --endpoint http://localhost:11435 \
  --hardware "NVIDIA RTX 4090 24GB" \
  --gpu "NVIDIA RTX 4090" \
  --gpu-memory-gb 24
```

v0.2.0+ ships the full FastAPI + React dashboard inside the wheel. After `pipx install benchloop-cli`:
```
benchloop dashboard
# → open http://127.0.0.1:8877
```

Need it to survive browser/terminal churn? Print a service template instead of keeping the dashboard tied to one shell:
```
benchloop dashboard --service-template launchd
benchloop dashboard --service-template systemd
benchloop dashboard --service-template windows-task
```

This serves the Models, Benchmark, Leaderboard, Compare, and Chat tabs on a single port, with auto-discovered local providers (Ollama, LM Studio, MLX/Osaurus, vLLM, Jan).
For hot-reload development against a clone of `bench-loop-web`:

```
benchloop dashboard --dev
```

| Suite | What it scores |
|---|---|
| `speed` | Latency, throughput, TTFT, generation tok/s across short/medium/long contexts |
| `toolcall` | Structured tool-call correctness across realistic tasks (weather, stocks, email, search) |
| `coding` | Executable Python tasks verified in a sandboxed subprocess (10s timeout) |
| `dataextract` | JSON / structured extraction from messy natural language |
| `instructfollow` | Constraint following, formatting, exactness |
| `reasonmath` | Small reasoning + math tasks with deterministic checks |
| `agent` | Multi-turn agentic tool use. BenchLoop drives a real loop: the model emits a tool call, BenchLoop executes it locally, feeds the result back, and the model iterates until done. Scores correctness, efficiency, no-hallucination, required-tool coverage. |
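The `agent` row describes a real tool loop, which is worth seeing in miniature. A generic Python sketch of that loop shape (illustrative; this is not BenchLoop's `runner/orchestrator.py`, and the `chat` callable and message shapes here are stand-ins, not a real provider API):

```python
import json
from typing import Any, Callable

# Hypothetical local tool registry; the name mirrors the toolcall suite examples.
TOOLS: dict[str, Callable[..., Any]] = {
    "get_weather": lambda city: {"city": city, "temp_c": 11},
}

def run_agent(chat: Callable[[list[dict]], dict], task: str, max_turns: int = 6) -> str:
    """One agentic task: ask the model, execute any tool call it emits
    locally, feed the result back, and iterate until it answers in text."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = chat(messages)            # one model turn
        call = reply.get("tool_call")
        if call is None:
            return reply["content"]       # plain text: the loop is done
        fn = TOOLS.get(call["name"])      # a missing name would count as a hallucinated tool
        result = fn(**call["arguments"]) if fn else {"error": "unknown tool"}
        messages.append({"role": "assistant", "tool_call": call})
        messages.append({"role": "tool", "content": json.dumps(result)})
    return "(stopped after max_turns)"

# Fake model for a dry run: calls the tool once, then answers.
def fake_chat(messages: list[dict]) -> dict:
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "get_weather", "arguments": {"city": "Berlin"}}}
    return {"content": "11°C in Berlin."}

print(run_agent(fake_chat, "What's the weather in Berlin?"))
```

Scoring then looks at exactly the four things in the table: did the final answer check out, how many turns it took, whether every required tool was called, and whether any tool name outside the registry showed up.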
Overall = 0.55 · quality + 0.20 · speed + 0.25 · reliability
- Quality = mean of non-speed suite scores (size-fair).
- Speed = `12.54 · log2(tok/s) + 0.9`, clamped to 0–100.
- Reliability = pass rate across all tasks.
- Agent = `correct_final + efficient + no_hallucinated_tools + all_required_called`, 25 pts each, averaged across tasks.
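To sanity-check the arithmetic, a small sketch of those formulas (illustrative only, not the project's scorer; the 74.6 tok/s input comes from the quickstart banner, and the reliability value below is an assumed number just to exercise the blend):

```python
import math

def speed_score(tok_per_s: float) -> float:
    # Speed = 12.54 · log2(tok/s) + 0.9, clamped to 0-100.
    return max(0.0, min(100.0, 12.54 * math.log2(tok_per_s) + 0.9))

def agent_task_score(correct_final: bool, efficient: bool,
                     no_hallucinated_tools: bool, all_required_called: bool) -> int:
    # 25 pts per criterion; the suite averages this across tasks.
    return 25 * sum([correct_final, efficient, no_hallucinated_tools, all_required_called])

def overall(quality: float, speed: float, reliability: float) -> float:
    # Overall = 0.55 · quality + 0.20 · speed + 0.25 · reliability
    return 0.55 * quality + 0.20 * speed + 0.25 * reliability

print(round(speed_score(74.6), 1))                 # 78.9, matching the banner above
print(agent_task_score(True, True, True, False))   # 75
print(round(overall(73.6, 78.9, 68.6), 1))         # 73.4 (reliability 68.6 is assumed)
```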
A FastAPI backend + React frontend bundle ships alongside the CLI for visualizing runs:
```
benchloop dashboard   # starts the local web app on :5180
```

Tabs: Models, Benchmark, Leaderboard, Compare runs, Chat, agent trace viewer.
Every completed benchmark auto-publishes to https://bench-loop.com/leaderboard via https://api.bench-loop.com/submit. Runs are deduped by (machine_id, run_id) so the same run from the same machine won't be double-counted.
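Conceptually, that dedup key makes re-submission idempotent; a toy sketch of the idea (not the server's actual code):

```python
seen: set[tuple[str, str]] = set()

def accept(machine_id: str, run_id: str) -> bool:
    """Count a submission only the first time this (machine, run) pair appears."""
    key = (machine_id, run_id)
    if key in seen:
        return False   # same run from the same machine: not double-counted
    seen.add(key)
    return True
```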
Opt out:

```
export BENCHLOOP_NO_SUBMIT=1
```

You can still manually export a snapshot for sharing / archiving:
```
benchloop export --output my-runs.json
```

```
bench-loop/                  ← this repo, the CLI + suites + scorers
  bench_loop/
    cli.py                   ← `benchloop` entrypoint
    suites/                  ← speed, toolcall, coding, agent, ...
    harness.py               ← raw / hermes / qwen / pi adapters
    providers/               ← ollama, openai_compat
    runner/orchestrator.py   ← drives suites + harnesses
    tasks/                   ← frozen task YAML fixtures

bench-loop-web/              ← the web app (separate repo)
  api/                       ← FastAPI wrapper around bench_loop
  ui/                        ← local dashboard
  site/                      ← public bench-loop.com static site
```
BenchLoop is v0.1 beta. The benchmark surface, scoring, web app, agent loop, and four harnesses all work end-to-end. Stuff still on the roadmap:
- Streaming TTFT for OpenAI-compatible providers (currently reported as 0 on those backends; Ollama TTFT works)
- Bigger task fixtures (each suite is intentionally small and frozen for v1)
- Hosted submission flow for community runs
- More provider adapters (TGI, Bedrock, etc. if there's demand)
MIT. See LICENSE.
