Local LLM Benchmark

A benchmarking framework for comparing local LLMs served by the llama.cpp server (llama-server). Test different models, sizes, quantizations, and server settings to find the best configuration for your hardware.

Built for Apple Silicon Macs but works on any machine running llama-server.

What it measures

Speed: generation tokens/s, prompt eval tokens/s, time to complete
Memory: peak resident set size (RSS), which includes unified memory on Apple Silicon
Quality: auto-scoring via code execution + test suites, keyword/pattern matching, and LLM-as-judge evaluation
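
As a rough illustration of the speed metrics above, throughput is a token count divided by elapsed time. The sketch below is not taken from metrics.py; the function name and the field names in the sample record are assumptions chosen for the example.

```python
def tokens_per_second(token_count: int, elapsed_ms: float) -> float:
    """Return throughput in tokens/s, or 0.0 when no time was recorded."""
    if elapsed_ms <= 0:
        return 0.0
    return token_count * 1000.0 / elapsed_ms


# Hypothetical timing record for one completion (field names are illustrative).
timings = {
    "prompt_n": 512,        # tokens in the prompt
    "prompt_ms": 800.0,     # time spent evaluating the prompt
    "predicted_n": 256,     # tokens generated
    "predicted_ms": 6400.0, # time spent generating
}

prompt_tps = tokens_per_second(timings["prompt_n"], timings["prompt_ms"])
gen_tps = tokens_per_second(timings["predicted_n"], timings["predicted_ms"])
print(f"prompt eval: {prompt_tps:.1f} tok/s, generation: {gen_tps:.1f} tok/s")
```

Reporting prompt evaluation and generation separately matters because the two phases usually run at very different speeds on the same hardware.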

Prompt test suite

33 prompts across 7 categories:

Category	Prompts	Scoring method
Coding	FizzBuzz, binary search, REST API, debugging	Keyword matching + code execution
Reasoning	River crossing, knights & knaves, lateral thinking, causal reasoning	Pattern matching for correct conclusions
Math	Arithmetic, algebra, probability, optimization	Exact answer checking (handles LaTeX)
General	Summarization, comparison, instruction following	Structural checks (word count, format)
Agentic coding	Full project from spec, refactoring, async debugging, architecture decisions	Code substance + feature coverage
Executable	Sorting, dict flattening, cron parsing, matrix ops, graphs, LRU cache, expression eval	Code saved as file, run against hidden test suite
ML	Neural net, K-means, linear regression, CartPole RL, decision tree, genetic algorithm	Code run against performance thresholds (accuracy, reward)
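
To make "exact answer checking (handles LaTeX)" more concrete, here is one way such a check could work. It is a minimal sketch under assumptions, not the logic in evaluate.py: the helper names are hypothetical, and it only understands \boxed{...}, $...$, thousands separators, and simple \frac{a}{b} forms.

```python
import re
from fractions import Fraction
from typing import Optional


def normalize_answer(text: str) -> Optional[Fraction]:
    """Reduce a plain or LaTeX-wrapped numeric answer to an exact Fraction."""
    s = text.strip()
    boxed = re.search(r"\\boxed\{(.+)\}", s)
    if boxed:
        s = boxed.group(1)
    # Drop math delimiters, LaTeX thin spaces, and thousands separators.
    s = s.replace("$", "").replace("\\,", "").replace(",", "").strip()
    frac = re.fullmatch(r"\\[dt]?frac\{(-?\d+)\}\{(-?\d+)\}", s)
    if frac:
        s = f"{frac.group(1)}/{frac.group(2)}"
    try:
        return Fraction(s)
    except (ValueError, ZeroDivisionError):
        return None  # not a number this sketch can interpret


def answers_match(model_answer: str, expected: str) -> bool:
    """Compare two answers exactly after normalization."""
    got, want = normalize_answer(model_answer), normalize_answer(expected)
    return got is not None and got == want


print(answers_match(r"$\boxed{\frac{3}{4}}$", "0.75"))
```

Comparing Fraction values rather than floats avoids rounding mismatches between forms such as 3/4 and 0.75.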

The executable and ML categories test agentic capabilities: the model's response is saved directly as a .py file and executed against test harnesses in tests/.
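
The save-and-execute flow described above can be sketched as follows. This is an illustration, not the code in runner.py or evaluate.py: the file names solution.py and harness.py, the helper names, and the stripping of a surrounding markdown fence are all assumptions made for the example.

```python
import re
import subprocess
import sys
import tempfile
from pathlib import Path

FENCE = "`" * 3  # markdown code fence, built indirectly to keep this listing clean
FENCE_RE = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> str:
    """Return the first fenced block, or the whole response if it is unfenced."""
    match = FENCE_RE.search(response)
    return match.group(1) if match else response


def run_against_harness(response: str, harness: str, timeout_s: float = 30.0) -> bool:
    """Save the response as solution.py and run a harness script next to it."""
    with tempfile.TemporaryDirectory() as workdir:
        Path(workdir, "solution.py").write_text(extract_code(response))
        Path(workdir, "harness.py").write_text(harness)
        try:
            result = subprocess.run(
                [sys.executable, "harness.py"],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # treat runaway code as a failure
        return result.returncode == 0  # harness assertions decide pass/fail


response = FENCE + "python\ndef add(a, b):\n    return a + b\n" + FENCE
harness = "from solution import add\nassert add(2, 3) == 5\n"
print("passed" if run_against_harness(response, harness) else "failed")
```

Running the code in a subprocess with a timeout keeps a crashing or looping solution from taking down the benchmark itself. It is not a security sandbox, so generated code should still be treated as untrusted.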

Quick start

# Install
pip install -r requirements.txt

# Run benchmarks (edit config with your models first)
python3 bench.py run -c configs/quick.json

# View results
python3 bench.py report

# Score with a judge model
python3 bench.py judge --hf-repo unsloth/gemma-4-31B-it-GGUF:UD-Q4_K_XL

Configuration

Benchmark configs are JSON files listing the models and settings to test. Models are either fetched from Hugging Face through llama-server's -hf flag (downloaded automatically) or loaded from local GGUF files.

{
  "configs": [
    {
      "label": "Qwen3-8B-Q4_K_M",
      "hf_repo": "unsloth/Qwen3-8B-GGUF",
      "hf_file": "Qwen3-8B-Q4_K_M.gguf",
      "quantization": "Q4_K_M",
      "param_count": "8B",
      "ctx_size": 8192,
      "flash_attn": true
    }
  ],
  "prompt_dirs": ["prompts/"],
  "max_tokens": 8192,
  "temperature": 0.0
}
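
Before a long run it can help to sanity-check a config file. The following sketch is not the validation in config.py; the function name and the rules it enforces (a unique label per entry, a positive integer ctx_size when present) are assumptions for illustration.

```python
import json


def load_configs(text: str) -> list:
    """Parse a benchmark config and apply a few basic sanity checks."""
    data = json.loads(text)
    entries = data.get("configs", [])
    seen = set()
    for i, entry in enumerate(entries):
        label = entry.get("label")
        if not label:
            raise ValueError(f"configs[{i}] has no label")
        if label in seen:
            raise ValueError(f"duplicate label: {label}")
        seen.add(label)
        ctx = entry.get("ctx_size")
        if ctx is not None and (not isinstance(ctx, int) or ctx <= 0):
            raise ValueError(f"{label}: ctx_size must be a positive integer")
    return entries


sample = """
{
  "configs": [
    {"label": "Qwen3-8B-Q4_K_M", "hf_repo": "unsloth/Qwen3-8B-GGUF", "ctx_size": 8192}
  ],
  "prompt_dirs": ["prompts/"],
  "max_tokens": 8192,
  "temperature": 0.0
}
"""
for entry in load_configs(sample):
    print(entry["label"], entry.get("ctx_size"))
```

Unique labels matter if results are keyed by label, which this sketch assumes.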

Included configs

Config	Purpose
`configs/quick.json`	Fast test with small models (Gemma 4 E2B/E4B)
`configs/gemma4-full.json`	All Gemma 4 sizes and quantizations
`configs/multi-model.json`	Cross-model comparison (Gemma 4, Qwen 3, Qwen 3.5, DeepSeek-R1)
`configs/server-settings.json`	Test llama-server flags (context size, KV cache quant, flash attention, batch size, threads)

Server settings to benchmark

Add an extra_args object to a config entry to pass additional flags to llama-server:

{
  "label": "kv-cache-q4",
  "hf_repo": "unsloth/gemma-4-E4B-it-GGUF",
  "hf_file": "gemma-4-E4B-it-Q4_K_M.gguf",
  "extra_args": {"cache-type-k": "q4_0", "cache-type-v": "q4_0"}
}
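
One plausible way to turn extra_args into command-line arguments is shown below. This is a sketch, not the logic in server.py: the function name is hypothetical, and the treatment of boolean values (true becomes a bare flag, false is omitted) is an assumption.

```python
def extra_args_to_flags(extra_args: dict) -> list:
    """Expand {"name": value} pairs into ["--name", "value", ...]."""
    flags = []
    for name, value in extra_args.items():
        if value is True:
            flags.append(f"--{name}")  # assumed: true means a bare switch
        elif value is False or value is None:
            continue  # assumed: false/null means "leave the flag out"
        else:
            flags.extend([f"--{name}", str(value)])
    return flags


print(extra_args_to_flags({"cache-type-k": "q4_0", "cache-type-v": "q4_0"}))
# ['--cache-type-k', 'q4_0', '--cache-type-v', 'q4_0']
```

Passing the result as a list to the process launcher, rather than joining it into a shell string, avoids quoting problems with unusual values.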

Commands

# Benchmarking
python3 bench.py run -c config.json              # Run all configs
python3 bench.py run -c config.json --skip-existing  # Skip already-benchmarked configs

# Results
python3 bench.py report                           # Comparison table
python3 bench.py report --run-ids ID1 ID2         # Compare specific runs
python3 bench.py list                             # List all runs
python3 bench.py show <run_id>                    # Detailed results

# Scoring
python3 bench.py autoscore                        # Re-run auto-evaluation
python3 bench.py autoscore --workers 8            # Parallel scoring
python3 bench.py judge --hf-repo repo:file        # LLM-as-judge scoring
python3 bench.py score <run_id>                   # Manual interactive scoring

# Model management
python3 bench.py models                           # List downloaded GGUF files
python3 bench.py models --clean                   # Delete unused models

# Maintenance
python3 bench.py purge <run_id>                   # Delete a run
python3 bench.py export -o results.json           # Export data

How scoring works

Three layers of evaluation:

Auto-score (immediate, during benchmark run): category-specific checks. Executable/ML prompts run the code against test suites. Coding/math/reasoning use pattern matching and answer verification.
LLM-as-judge (separate step): sends each response to a judge model with a scoring rubric. Returns correctness (0-4), completeness (0-3), and quality (0-3) scores.
Manual scoring (optional): interactive 0-10 scoring via bench.py score.
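
The judge rubric ranges add up to 10, the same scale as manual scoring. A simple way to represent a judged response is sketched below; the class name is hypothetical, and summing the three components into a single total is an assumption rather than documented behavior of judge.py.

```python
from dataclasses import dataclass

# Upper bound of each rubric component, as listed above.
RUBRIC_MAX = {"correctness": 4, "completeness": 3, "quality": 3}


@dataclass(frozen=True)
class JudgeScore:
    correctness: int   # 0-4
    completeness: int  # 0-3
    quality: int       # 0-3

    def __post_init__(self):
        # Reject scores outside the rubric, e.g. from a malformed judge reply.
        for name, upper in RUBRIC_MAX.items():
            value = getattr(self, name)
            if not 0 <= value <= upper:
                raise ValueError(f"{name}={value} is outside 0-{upper}")

    @property
    def total(self) -> int:
        """Combined score on a 0-10 scale (assumed to be a plain sum)."""
        return self.correctness + self.completeness + self.quality


score = JudgeScore(correctness=3, completeness=2, quality=3)
print(f"{score.total}/10")
```

Validating the ranges at construction time catches judge replies that drift outside the rubric before they skew a report.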

Requirements

Python 3.10+
llama.cpp server (llama-server available on PATH, for example installed via Homebrew)
GGUF models from Hugging Face (downloaded automatically)
