A small, self-hosted eval for local LLMs served via llama.cpp's OpenAI-compatible HTTP endpoint. It fires a fixed prompt set at the model, captures every response and its reasoning trace, grades each call, and produces a reproducible head-to-head comparison — across models and across the "thinking on / thinking off" axis.
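For orientation, each call is a plain OpenAI-style chat completion against the llama.cpp server, with the thinking toggle passed through `chat_template_kwargs`. A minimal sketch of one such call, using only the stdlib; the prompt and URL are placeholders, and the harness's real loop lives in `run_eval.py`:

```python
import json
import urllib.request

def ask(prompt: str, thinking: bool, base: str = "http://127.0.0.1:8080"):
    """One eval call: send a chat completion, return (answer, reasoning trace)."""
    body = {
        "messages": [{"role": "user", "content": prompt}],
        # llama.cpp (run with --jinja) forwards this into the chat template;
        # this is how the "thinking on / thinking off" axis is driven.
        "chat_template_kwargs": {"enable_thinking": thinking},
    }
    req = urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        msg = json.load(resp)["choices"][0]["message"]
    # With --jinja, the reasoning trace comes back in a separate field.
    return msg.get("content"), msg.get("reasoning_content")
```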
It deliberately tests crisp, gradeable sub-skills — reasoning, coding, code quality, instruction-following,
long-context retrieval, writing, and single-step tool calling — in a base tier (broad, a sanity floor) and a
hard tier (deliberately tricky, the discriminating set). Most grading is programmatic (numeric answer-checks,
code run against hidden unit tests, structural/format checks, static analysis, structured tool-call matching);
writing is rubric-scored.
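As a concrete illustration of the programmatic side, here is a hypothetical numeric answer-check grader plus the registry pattern that `graders.py` implies; the names (`GRADERS`, `grade_numeric`) and the tolerance are illustrative, not the repo's actual code:

```python
import re

GRADERS = {}  # name -> grader fn: the registry pattern graders.py implies

def grader(name):
    def register(fn):
        GRADERS[name] = fn
        return fn
    return register

@grader("numeric")
def grade_numeric(response: str, expected: float, tol: float = 1e-6) -> bool:
    """Pass iff the last number in the response matches the expected answer."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", response)
    return bool(nums) and abs(float(nums[-1]) - expected) <= tol

# GRADERS["numeric"]("... so the total is 408.", expected=408)  -> True
```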
| file | what's in it |
|---|---|
| METHODOLOGY.md | how it works — the harness flow, the two tiers, the capabilities, the grading system (programmatic + rubric), the coding_quality correctness+robustness+static-analysis hybrid, sampling, the results schema, reproducibility, how to add a model, and what it doesn't measure |
| TESTS.md | the full per-prompt catalog — every prompt, its id, what it probes, its grader (generated from prompts/*.jsonl) |
| ANALYSIS.md | narrative analysis of the runs done so far — 4 models compared (gemma-4-26b-a4b, gemma-4-31b, qwen3.6-35b-a3b, qwen3.5-122b-a10b at Q3_K_XL): TL;DR, per-model profiles, capability-by-capability, failure patterns, verdict |
| report_compare.md | the numbers tables for those runs (overall / by mode / by capability / by run / by capability×mode×run / per-rubric-criterion means / failures) — generated by report.py over all the result files |
For a single-model report, run `python3 scripts/report.py results/<one-run>.jsonl --out report.md` (one-off, not checked in).
```
prompts/            prompt sets, one .jsonl per capability/tier (see TESTS.md);
                    generated by gen_prompts.py — deterministic
scripts/
  gen_prompts.py    builds prompts/*.jsonl (seeded; re-run after edits)
  run_eval.py       runs a model through the eval → results/<tag>__<ts>.jsonl (+ .log + .meta.json)
  graders.py        the programmatic grader registry
  grade_rubrics.py  list/apply for the human-scored (writing/coherence) rubric items
  report.py         turns one or more results .jsonl files into a markdown comparison report
results/            one .jsonl per run + a .log mirror + a .meta.json with the run/server config
ANALYSIS.md  METHODOLOGY.md  TESTS.md  README.md   the top-level docs
```
```sh
# 1. serve a model with llama.cpp (OpenAI-compatible) on :8080, e.g.
# llama-server -m model.gguf -c 524288 --parallel 2 --jinja --port 8080 ...
# 2. (re)generate the prompt sets if needed
python3 scripts/gen_prompts.py
# 3. set scripts/run_eval.py's SAMPLING to the model's vendor recommendation, then smoke-test
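#    (SAMPLING is a dict of sampling parameters inside run_eval.py; a vendor recommendation
#     might look like {"temperature": 0.6, "top_p": 0.95, "top_k": 20} -- values illustrative)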
python3 scripts/run_eval.py --tag mymodel --caps reasoning --modes on,off --smoke
# 4. full run (all base + hard caps, both thinking modes, serial)
python3 scripts/run_eval.py --tag mymodel --caps everything --modes on,off --concurrency 1
# --caps options: 'all' (base caps) · 'hard' (the *_hard caps) · 'everything' (both) · or a comma list
# 5. grade the human-scored writing/coherence items
python3 scripts/grade_rubrics.py list results/mymodel__<ts>.jsonl # dump pending items for scoring
python3 scripts/grade_rubrics.py apply results/mymodel__<ts>.jsonl '{"<idx>": {"<criterion>": 1..5, ...}, ...}'
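#    e.g., with hypothetical index/criteria (take the real ones from the 'list' output):
#    python3 scripts/grade_rubrics.py apply results/mymodel__<ts>.jsonl '{"3": {"clarity": 4, "voice": 5}}'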
# 6. report (one or more result files → multi-model comparison, grouped by tag)
python3 scripts/report.py results/*.jsonl --out report.md
```

Requirements:

- Python 3 (the harness uses only the stdlib; `ruff` is optional — `pip install --break-system-packages ruff` — for the `code_quality` lint check, which is skipped if absent).
- A llama.cpp server with the OpenAI-compatible endpoint and `--jinja` (so reasoning traces come back in `reasoning_content` and the thinking toggle works via `chat_template_kwargs: {enable_thinking: …}`).
Runs done so far: gemma-4-26b-a4b (Q8), gemma-4-31b (Q5), qwen3.6-35b-a3b (Q8), qwen3.5-122b-a10b (Q3_K_XL)
— base + hard tiers + both code-quality tiers, both thinking modes, coherence excluded. See
ANALYSIS.md. Headline: the two Gemmas co-lead (~98% base; the 31B-dense edges the hard tier because
the 26B-A4B over-thinks into truncated outputs); Qwen3.6-35B-A3B is a clear third; Qwen3.5-122B-A10B at Q3_K_XL
finishes 4th despite ~4× the parameter count of Qwen3.6 — the aggressive Q3 quant erases the size advantage.
The tool_calling capability is new (added 2026-05-13, backfilled across all 4 models on 2026-05-14).
Gemma-4-26B-A4B is the perfect-score winner here at 52/52 (100%), followed by Gemma-31B and Qwen3.5-Q3 tied at
51/52 (98.1%), then Qwen3.6 at 50/52 (96.2%). Tool-call discipline is uniformly strong across the cohort — only the
hard tier (and only tc-h-03 Hindi-not-in-enum + tc-h-05 parallel mixed-enum) discriminates.