Local LLM inference benchmark for macOS Apple Silicon. Compare Ollama, vLLM-MLX, and Docker Model Runner (vLLM) side by side.
- Benchmarks three local inference backends through their OpenAI-compatible APIs
- Measures TTFT, tokens/sec, end-to-end latency, inter-token latency, and memory usage
- Tests at multiple concurrency levels (1, 2, 4, 8)
- Outputs rich terminal tables, JSON results, and comparison charts
- Configurable via TOML — models, backends, prompts, and parameters
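To make the metric definitions concrete, here is a minimal sketch of how TTFT, end-to-end latency, throughput, and inter-token latency (ITL) fall out of a request start time plus per-token arrival timestamps. This is an illustration only; mlxbench's actual metrics code lives in `src/llm_bench/metrics/`, and `stream_metrics` is a hypothetical helper.

```python
import statistics

def stream_metrics(t_start: float, token_times: list[float]) -> dict:
    """Derive streaming latency metrics from a request start time and
    per-token arrival timestamps (seconds). Assumes at least one token."""
    ttft = token_times[0] - t_start                       # time to first token
    e2e = token_times[-1] - t_start                       # end-to-end latency
    itls = [b - a for a, b in zip(token_times, token_times[1:])]
    return {
        "ttft_s": ttft,
        "e2e_s": e2e,
        "tokens_per_s": len(token_times) / e2e,
        "itl_mean_s": statistics.mean(itls) if itls else 0.0,
    }

# Example: 4 tokens, first after 0.2 s, then one every 50 ms
print(stream_metrics(0.0, [0.2, 0.25, 0.30, 0.35]))
```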
- macOS with Apple Silicon (M1/M2/M3/M4)
- Python 3.12+
- uv (recommended) or pip
- At least one backend installed:
- Ollama — manages GGUF models
- vLLM-MLX — MLX 4-bit models from `mlx-community/`
- Docker Model Runner — Docker Desktop with vLLM engine
```bash
# Clone and set up
git clone https://github.com/your-user/mlxbench.git
cd mlxbench
uv venv .venv --python 3.12
source .venv/bin/activate
uv pip install -e .

# Check which backends are available
mlxbench check

# See registered models
mlxbench models
```

Each backend must be running before you benchmark it. Start them in separate terminals:
```bash
# Ollama (daemon — auto-loads models on request)
ollama serve

# vLLM-MLX (must specify model and port)
vllm-mlx serve mlx-community/Qwen2.5-3B-Instruct-4bit --port 8100

# Docker Model Runner (always running via Docker Desktop — no action needed)
# Models load on first request. Pull models with: docker model pull ai/ministral3-vllm:14B
```

Use `mlxbench serve <backend> <model>` to see the exact command for any backend/model combo.
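This is roughly what `mlxbench check` does under the hood: probe each backend's OpenAI-compatible API. A minimal sketch, assuming only that a healthy server answers `GET <endpoint>/models` with JSON; the endpoints match `configs/default.toml`, and `backend_up` is a hypothetical helper.

```python
import json
import urllib.request

def backend_up(endpoint: str, timeout: float = 2.0) -> bool:
    """Return True if an OpenAI-compatible server answers GET <endpoint>/models."""
    try:
        with urllib.request.urlopen(f"{endpoint}/models", timeout=timeout) as resp:
            json.load(resp)  # a valid JSON body counts as healthy
            return True
    except Exception:
        return False

for name, endpoint in {
    "ollama": "http://localhost:11434/v1",
    "vllm-mlx": "http://localhost:8100/v1",
    "dmr-vllm": "http://localhost:12434/engines/vllm/v1",
}.items():
    print(f"{name}: {'up' if backend_up(endpoint) else 'down'}")
```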
```bash
# Benchmark all backends with a specific model
mlxbench run --model qwen-2.5-3b

# Benchmark a specific backend
mlxbench run --backend ollama --model ministral-14b

# Control concurrency levels
mlxbench run --model llama-3.2-1b --concurrency 1 --concurrency 4

# JSON output only (no tables/charts)
mlxbench run --model qwen-2.5-3b --json-only
```

| Command | Description |
|---|---|
| `mlxbench run` | Run the full benchmark suite |
| `mlxbench check` | Verify backends are reachable, show serve commands |
| `mlxbench models` | List model registry and backend mappings |
| `mlxbench serve <backend> <model>` | Show the command to start a backend |
| Flag | Description | Default |
|---|---|---|
| `--config` | TOML config file | `configs/default.toml` |
| `--backend` / `-b` | Backend(s) to test | all enabled |
| `--model` / `-m` | Model alias(es) to test | all registered |
| `--concurrency` / `-c` | Concurrency level(s) | 1, 2, 4, 8 |
| `--output-dir` | Output directory | `benchmarks/` |
| `--no-charts` | Skip chart generation | off |
| `--json-only` | Only output JSON | off |
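To see what a concurrency level means in practice: each level caps how many requests are in flight at once while latencies are recorded per request. A simplified, thread-based sketch (mlxbench's actual request execution lives in `src/llm_bench/runner/`; `run_level` and `send_request` are hypothetical):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_level(send_request, concurrency: int, total: int) -> list[float]:
    """Issue `total` requests with at most `concurrency` in flight,
    returning per-request latencies in seconds."""
    def timed(_):
        t0 = time.perf_counter()
        send_request()
        return time.perf_counter() - t0

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed, range(total)))

# Stand-in for a real API call: a 10 ms sleep
lats = run_level(lambda: time.sleep(0.01), concurrency=4, total=8)
print(f"{len(lats)} requests completed")
```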
| Backend | Port | Model Format | Model Management |
|---|---|---|---|
| Ollama | 11434 | GGUF | `ollama pull <model>` |
| vLLM-MLX | 8100 | MLX 4-bit | Auto-downloads from HuggingFace |
| Docker Model Runner (vLLM) | 12434 | Safetensors/MLX | `docker model pull <model>` |
Edit configs/default.toml to customize backends, ports, models, and benchmark parameters. This is the single source of truth for models — there are no hardcoded defaults in the code.
```toml
[general]
warmup_requests = 3
max_tokens = 256
temperature = 0.0
concurrency_levels = [1, 2, 4, 8]

[backends.ollama]
enabled = true
endpoint = "http://localhost:11434/v1"

[backends.vllm-mlx]
enabled = true
endpoint = "http://localhost:8100/v1"
port = 8100

[backends.dmr-vllm]
enabled = true
endpoint = "http://localhost:12434/engines/vllm/v1"

[models."qwen-2.5-3b"]
ollama = "qwen2.5:3b"
vllm-mlx = "mlx-community/Qwen2.5-3B-Instruct-4bit"
dmr-vllm = "ai/qwen3-vllm:3B"
```

Not every model needs an entry for every backend — the benchmark simply skips backends that don't have a mapping for the requested model.
All model configuration lives in `configs/default.toml`. To add a new model, add a `[models."your-alias"]` section with the model ID for each backend you want to test:
```toml
[models."phi-3.5-mini"]
ollama = "phi3.5:3.8b"
vllm-mlx = "mlx-community/Phi-3.5-mini-instruct-4bit"
```

Then benchmark it with `mlxbench run --model phi-3.5-mini`.
To find model IDs:
- Ollama: `ollama search <name>` or ollama.com/library
- vLLM-MLX: Browse huggingface.co/mlx-community
- Docker Model Runner: `docker model search <name>`
- Terminal: Rich tables showing median (p95) for TTFT, tok/s, latency, ITL, memory
- JSON: Full results exported to `benchmarks/<timestamp>_results.json`
- Charts: Grouped bar charts saved as `benchmarks/<timestamp>_<model>_comparison.png`
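For reference, the "median (p95)" figures in the tables are ordinary order statistics over the per-request samples. A sketch using the nearest-rank method for p95 (an assumption; mlxbench's statistics code in `src/llm_bench/metrics/` may interpolate differently, and `median_p95` is a hypothetical helper):

```python
import math

def median_p95(samples: list[float]) -> tuple[float, float]:
    """Median and 95th percentile (nearest-rank) of a list of samples."""
    s = sorted(samples)
    mid = len(s) // 2
    median = s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2
    p95 = s[min(len(s) - 1, math.ceil(0.95 * len(s)) - 1)]
    return median, p95

# Ten illustrative end-to-end latencies (seconds)
lat = [0.82, 0.85, 0.88, 0.91, 0.95, 1.02, 1.10, 1.31, 1.45, 2.10]
print(median_p95(lat))  # median 0.985, p95 2.10
```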
```
src/llm_bench/
├── cli.py            # Typer CLI
├── config.py         # TOML config + Pydantic models
├── models.py         # Model name registry across backends
├── backends/         # Backend health checks + serve commands
│   ├── ollama.py
│   ├── vllm_mlx.py
│   └── dmr_vllm.py   # Docker Model Runner (vLLM engine)
├── runner/           # Request executor, concurrency, orchestrator
├── metrics/          # Data collection, statistics, memory sampling
├── prompts/          # Standard benchmark prompts
└── output/           # Terminal, JSON, chart rendering
```
- Quantization varies: Ollama uses GGUF, vLLM-MLX uses MLX 4-bit, DMR uses safetensors/MLX. This is inherent to comparing inference stacks.
- Memory: Reports process RSS via psutil. GPU-specific memory is not separately tracked.
- Fairness: All backends receive identical prompts with `temperature=0.0` and the same `max_tokens`. Warmup requests run before measurement.
- DMR naming: Docker Model Runner has its own model registry (e.g., `ai/ministral3-vllm:14B`). Not all models have DMR equivalents. You can also use HuggingFace models via the `hf.co/` prefix.
MIT