A benchmarking framework for comparing local LLMs running via llama.cpp server. Test different models, sizes, quantizations, and server settings to find the best configuration for your hardware.
Built for Apple Silicon Macs but works on any machine running llama-server.
- Speed: generation tokens/s, prompt eval tokens/s, time to complete
- Memory: peak RSS (captures unified memory on Apple Silicon)
- Quality: auto-scoring via code execution + test suites, keyword/pattern matching, and LLM-as-judge evaluation
33 prompts across 7 categories:
| Category | Prompts | Scoring method |
|---|---|---|
| Coding | FizzBuzz, binary search, REST API, debugging | Keyword matching + code execution |
| Reasoning | River crossing, knights & knaves, lateral thinking, causal reasoning | Pattern matching for correct conclusions |
| Math | Arithmetic, algebra, probability, optimization | Exact answer checking (handles LaTeX) |
| General | Summarization, comparison, instruction following | Structural checks (word count, format) |
| Agentic coding | Full project from spec, refactoring, async debugging, architecture decisions | Code substance + feature coverage |
| Executable | Sorting, dict flattening, cron parsing, matrix ops, graphs, LRU cache, expression eval | Code saved as file, run against hidden test suite |
| ML | Neural net, K-means, linear regression, CartPole RL, decision tree, genetic algorithm | Code run against performance thresholds (accuracy, reward) |
The executable and ML categories test agentic capabilities: the model's response is saved directly as a .py file and executed against test harnesses in tests/.
# Install
pip install -r requirements.txt
# Run benchmarks (edit config with your models first)
python3 bench.py run -c configs/quick.json
# View results
python3 bench.py report
# Score with a judge model
python3 bench.py judge --hf-repo unsloth/gemma-4-31B-it-GGUF:UD-Q4_K_XLBenchmark configs are JSON files listing models and settings to test. Models are loaded from HuggingFace via the -hf flag (downloaded automatically) or from local GGUF files.
{
"configs": [
{
"label": "Qwen3-8B-Q4_K_M",
"hf_repo": "unsloth/Qwen3-8B-GGUF",
"hf_file": "Qwen3-8B-Q4_K_M.gguf",
"quantization": "Q4_K_M",
"param_count": "8B",
"ctx_size": 8192,
"flash_attn": true
}
],
"prompt_dirs": ["prompts/"],
"max_tokens": 8192,
"temperature": 0.0
}| Config | Purpose |
|---|---|
configs/quick.json |
Fast test with small models (Gemma 4 E2B/E4B) |
configs/gemma4-full.json |
All Gemma 4 sizes and quantizations |
configs/multi-model.json |
Cross-model comparison (Gemma 4, Qwen 3, Qwen 3.5, DeepSeek-R1) |
configs/server-settings.json |
Test llama-server flags (context size, KV cache quant, flash attention, batch size, threads) |
Use extra_args to test llama-server flags:
{
"label": "kv-cache-q4",
"hf_repo": "unsloth/gemma-4-E4B-it-GGUF",
"hf_file": "gemma-4-E4B-it-Q4_K_M.gguf",
"extra_args": {"cache-type-k": "q4_0", "cache-type-v": "q4_0"}
}# Benchmarking
python3 bench.py run -c config.json # Run all configs
python3 bench.py run -c config.json --skip-existing # Skip already-benchmarked configs
# Results
python3 bench.py report # Comparison table
python3 bench.py report --run-ids ID1 ID2 # Compare specific runs
python3 bench.py list # List all runs
python3 bench.py show <run_id> # Detailed results
# Scoring
python3 bench.py autoscore # Re-run auto-evaluation
python3 bench.py autoscore --workers 8 # Parallel scoring
python3 bench.py judge --hf-repo repo:file # LLM-as-judge scoring
python3 bench.py score <run_id> # Manual interactive scoring
# Model management
python3 bench.py models # List downloaded GGUF files
python3 bench.py models --clean # Delete unused models
# Maintenance
python3 bench.py purge <run_id> # Delete a run
python3 bench.py export -o results.json # Export dataThree layers of evaluation:
-
Auto-score (immediate, during benchmark run): category-specific checks. Executable/ML prompts run the code against test suites. Coding/math/reasoning use pattern matching and answer verification.
-
LLM-as-judge (separate step): sends each response to a judge model with a scoring rubric. Returns correctness (0-4), completeness (0-3), and quality (0-3) scores.
-
Manual scoring (optional): interactive 0-10 scoring via
bench.py score.
- Python 3.10+
- llama.cpp server (
llama-serveron PATH or via Homebrew) - GGUF models from HuggingFace (downloaded automatically)