A small, self-hosted eval for local LLMs served via llama.cpp's OpenAI-compatible HTTP endpoint. It fires a fixed prompt set at the model, captures every response and its reasoning trace, grades each call, and produces a reproducible head-to-head comparison — across models and across the "thinking on / thinking off" axis.
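For orientation, each call is a plain OpenAI-style chat completion against the llama.cpp server, with the thinking toggle passed through `chat_template_kwargs`. A minimal sketch of one such call, using only the stdlib; the prompt and URL are placeholders, and the harness's real loop lives in `run_eval.py`:

```python
import json
import urllib.request

def ask(prompt: str, thinking: bool, base: str = "http://127.0.0.1:8080"):
    """One eval call: send a chat completion, return (answer, reasoning trace)."""
    body = {
        "messages": [{"role": "user", "content": prompt}],
        # llama.cpp (run with --jinja) forwards this into the chat template;
        # this is how the "thinking on / thinking off" axis is driven.
        "chat_template_kwargs": {"enable_thinking": thinking},
    }
    req = urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        msg = json.load(resp)["choices"][0]["message"]
    # With --jinja, the reasoning trace comes back in a separate field.
    return msg.get("content"), msg.get("reasoning_content")
```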
It deliberately tests crisp, gradeable sub-skills — reasoning, coding, code quality, instruction-following,
long-context retrieval, writing, and single-step tool calling — in a base tier (broad, a sanity floor) and a
hard tier (deliberately tricky, the discriminating set). Most grading is programmatic (numeric answer-checks,
code run against hidden unit tests, structural/format checks, static analysis, structured tool-call matching);
writing is rubric-scored.
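As a concrete illustration of the programmatic side, here is a hypothetical numeric answer-check grader plus the registry pattern that `graders.py` implies; the names (`GRADERS`, `grade_numeric`) and the tolerance are illustrative, not the repo's actual code:

```python
import re

GRADERS = {}  # name -> grader fn: the registry pattern graders.py implies

def grader(name):
    def register(fn):
        GRADERS[name] = fn
        return fn
    return register

@grader("numeric")
def grade_numeric(response: str, expected: float, tol: float = 1e-6) -> bool:
    """Pass iff the last number in the response matches the expected answer."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", response)
    return bool(nums) and abs(float(nums[-1]) - expected) <= tol

# GRADERS["numeric"]("... so the total is 408.", expected=408)  -> True
```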
| file | what's in it |
|---|---|
| METHODOLOGY.md | how it works — the harness flow, the two tiers, the capabilities, the grading system (programmatic + rubric), the coding_quality correctness+robustness+static-analysis hybrid, sampling, the results schema, reproducibility, how to add a model, and what it doesn't measure |
| TESTS.md | the full per-prompt catalog — every prompt, its id, what it probes, its grader (generated from prompts/*.jsonl) |
| ANALYSIS.md | narrative analysis of the runs done so far — 4 models compared (gemma-4-26b-a4b, gemma-4-31b, qwen3.6-35b-a3b, qwen3.5-122b-a10b at Q3_K_XL): TL;DR, per-model profiles, capability-by-capability, failure patterns, verdict |
| report_compare.md | the numbers tables for those runs (overall / by mode / by capability / by run / by capability×mode×run / per-rubric-criterion means / failures) — generated by report.py over all the result files |
For a single-model report, run `python3 scripts/report.py results/<one-run>.jsonl --out report.md` (one-off, not checked in).
```
prompts/            prompt sets, one .jsonl per capability/tier (see TESTS.md);
                    generated by gen_prompts.py — deterministic
scripts/
  gen_prompts.py    builds prompts/*.jsonl (seeded; re-run after edits)
  run_eval.py       runs a model through the eval → results/<tag>__<ts>.jsonl (+ .log + .meta.json)
  graders.py        the programmatic grader registry
  grade_rubrics.py  list/apply for the human-scored (writing/coherence) rubric items
  report.py         turns one or more results .jsonl files into a markdown comparison report
results/            one .jsonl per run + a .log mirror + a .meta.json with the run/server config
ANALYSIS.md  METHODOLOGY.md  TESTS.md  README.md   the top-level docs
```
```sh
# 1. serve a model with llama.cpp (OpenAI-compatible) on :8080, e.g.
# llama-server -m model.gguf -c 524288 --parallel 2 --jinja --port 8080 ...
# 2. (re)generate the prompt sets if needed
python3 scripts/gen_prompts.py
# 3. set scripts/run_eval.py's SAMPLING to the model's vendor recommendation, then smoke-test
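#    (SAMPLING is a dict of sampling parameters inside run_eval.py; a vendor recommendation
#     might look like {"temperature": 0.6, "top_p": 0.95, "top_k": 20} -- values illustrative)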
python3 scripts/run_eval.py --tag mymodel --caps reasoning --modes on,off --smoke
# 4. full run (all base + hard caps, both thinking modes, serial)
python3 scripts/run_eval.py --tag mymodel --caps everything --modes on,off --concurrency 1
# --caps options: 'all' (base caps) · 'hard' (the *_hard caps) · 'everything' (both) · or a comma list
# 5. grade the human-scored writing/coherence items
python3 scripts/grade_rubrics.py list results/mymodel__<ts>.jsonl # dump pending items for scoring
python3 scripts/grade_rubrics.py apply results/mymodel__<ts>.jsonl '{"<idx>": {"<criterion>": 1..5, ...}, ...}'
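#    e.g., with hypothetical index/criteria (take the real ones from the 'list' output):
#    python3 scripts/grade_rubrics.py apply results/mymodel__<ts>.jsonl '{"3": {"clarity": 4, "voice": 5}}'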
# 6. report (one or more result files → multi-model comparison, grouped by tag)
python3 scripts/report.py results/*.jsonl --out report.md
```

Requirements:

- Python 3 (the harness uses only the stdlib; `ruff` is optional — `pip install --break-system-packages ruff` — for the `code_quality` lint check, which is skipped if absent).
- A llama.cpp server with the OpenAI-compatible endpoint and `--jinja` (so reasoning traces come back in `reasoning_content` and the thinking toggle works via `chat_template_kwargs: {enable_thinking: …}`).
Runs done so far: gemma-4-26b-a4b (Q8), gemma-4-31b (Q5), qwen3.6-35b-a3b (Q8), qwen3.5-122b-a10b (Q3_K_XL)
— base + hard tiers + both code-quality tiers, both thinking modes, coherence excluded. See
ANALYSIS.md. Headline: the two Gemmas co-lead (~98% base; the 31B-dense edges the hard tier because
the 26B-A4B over-thinks into truncated outputs); Qwen3.6-35B-A3B is a clear third; Qwen3.5-122B-A10B at Q3_K_XL
finishes 4th despite ~4× the parameter count of Qwen3.6 — the aggressive Q3 quant erases the size advantage.
The tool_calling capability is new (added 2026-05-13, backfilled across all 4 models on 2026-05-14).
Gemma-4-26B-A4B is the perfect-score winner here at 52/52 (100%), followed by Gemma-31B and Qwen3.5-Q3 tied at
51/52 (98.1%), then Qwen3.6 at 50/52 (96.2%). Tool-call discipline is uniformly strong across the cohort — only the
hard tier (and only tc-h-03 Hindi-not-in-enum + tc-h-05 parallel mixed-enum) discriminates.