Local LLM inference benchmark for macOS Apple Silicon. Compare Ollama, vLLM-MLX, and Docker Model Runner (vLLM) side by side.
- Benchmarks three local inference backends through their OpenAI-compatible APIs
- Measures TTFT, tokens/sec, end-to-end latency, inter-token latency, and memory usage
- Tests at multiple concurrency levels (1, 2, 4, 8)
- Outputs rich terminal tables, JSON results, and comparison charts
- Configurable via TOML — models, backends, prompts, and parameters
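To make the metric definitions concrete, here is a minimal sketch of how TTFT, end-to-end latency, throughput, and inter-token latency (ITL) fall out of a request start time plus per-token arrival timestamps. This is an illustration only; mlxbench's actual metrics code lives in `src/llm_bench/metrics/`, and `stream_metrics` is a hypothetical helper.

```python
import statistics

def stream_metrics(t_start: float, token_times: list[float]) -> dict:
    """Derive streaming latency metrics from a request start time and
    per-token arrival timestamps (seconds). Assumes at least one token."""
    ttft = token_times[0] - t_start                       # time to first token
    e2e = token_times[-1] - t_start                       # end-to-end latency
    itls = [b - a for a, b in zip(token_times, token_times[1:])]
    return {
        "ttft_s": ttft,
        "e2e_s": e2e,
        "tokens_per_s": len(token_times) / e2e,
        "itl_mean_s": statistics.mean(itls) if itls else 0.0,
    }

# Example: 4 tokens, first after 0.2 s, then one every 50 ms
print(stream_metrics(0.0, [0.2, 0.25, 0.30, 0.35]))
```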
- macOS with Apple Silicon (M1/M2/M3/M4)
- Python 3.12+
- uv (recommended) or pip
- At least one backend installed:
- Ollama — manages GGUF models
- vLLM-MLX — MLX 4-bit models from `mlx-community/`
- Docker Model Runner — Docker Desktop with vLLM engine
```bash
# Clone and set up
git clone https://github.com/your-user/mlxbench.git
cd mlxbench
uv venv .venv --python 3.12
source .venv/bin/activate
uv pip install -e .

# Check which backends are available
mlxbench check

# See registered models
mlxbench models
```

Each backend must be running before you benchmark it. Start them in separate terminals:
```bash
# Ollama (daemon — auto-loads models on request)
ollama serve

# vLLM-MLX (must specify model and port)
vllm-mlx serve mlx-community/Qwen2.5-3B-Instruct-4bit --port 8100

# Docker Model Runner (always running via Docker Desktop — no action needed)
# Models load on first request. Pull models with: docker model pull ai/ministral3-vllm:14B
```

Use `mlxbench serve <backend> <model>` to see the exact command for any backend/model combo.
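This is roughly what `mlxbench check` does under the hood: probe each backend's OpenAI-compatible API. A minimal sketch, assuming only that a healthy server answers `GET <endpoint>/models` with JSON; the endpoints match `configs/default.toml`, and `backend_up` is a hypothetical helper.

```python
import json
import urllib.request

def backend_up(endpoint: str, timeout: float = 2.0) -> bool:
    """Return True if an OpenAI-compatible server answers GET <endpoint>/models."""
    try:
        with urllib.request.urlopen(f"{endpoint}/models", timeout=timeout) as resp:
            json.load(resp)  # a valid JSON body counts as healthy
            return True
    except Exception:
        return False

for name, endpoint in {
    "ollama": "http://localhost:11434/v1",
    "vllm-mlx": "http://localhost:8100/v1",
    "dmr-vllm": "http://localhost:12434/engines/vllm/v1",
}.items():
    print(f"{name}: {'up' if backend_up(endpoint) else 'down'}")
```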
```bash
# Benchmark all backends with a specific model
mlxbench run --model qwen-2.5-3b

# Benchmark a specific backend
mlxbench run --backend ollama --model ministral-14b

# Control concurrency levels
mlxbench run --model llama-3.2-1b --concurrency 1 --concurrency 4

# JSON output only (no tables/charts)
mlxbench run --model qwen-2.5-3b --json-only
```

| Command | Description |
|---|---|
| `mlxbench run` | Run the full benchmark suite |
| `mlxbench check` | Verify backends are reachable, show serve commands |
| `mlxbench models` | List model registry and backend mappings |
| `mlxbench serve <backend> <model>` | Show the command to start a backend |
| Flag | Description | Default |
|---|---|---|
| `--config` | TOML config file | `configs/default.toml` |
| `--backend` / `-b` | Backend(s) to test | all enabled |
| `--model` / `-m` | Model alias(es) to test | all registered |
| `--concurrency` / `-c` | Concurrency level(s) | 1, 2, 4, 8 |
| `--output-dir` | Output directory | `benchmarks/` |
| `--no-charts` | Skip chart generation | off |
| `--json-only` | Only output JSON | off |
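To see what a concurrency level means in practice: each level caps how many requests are in flight at once while latencies are recorded per request. A simplified, thread-based sketch (mlxbench's actual request execution lives in `src/llm_bench/runner/`; `run_level` and `send_request` are hypothetical):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_level(send_request, concurrency: int, total: int) -> list[float]:
    """Issue `total` requests with at most `concurrency` in flight,
    returning per-request latencies in seconds."""
    def timed(_):
        t0 = time.perf_counter()
        send_request()
        return time.perf_counter() - t0

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed, range(total)))

# Stand-in for a real API call: a 10 ms sleep
lats = run_level(lambda: time.sleep(0.01), concurrency=4, total=8)
print(f"{len(lats)} requests completed")
```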
| Backend | Port | Model Format | Model Management |
|---|---|---|---|
| Ollama | 11434 | GGUF | `ollama pull <model>` |
| vLLM-MLX | 8100 | MLX 4-bit | Auto-downloads from HuggingFace |
| Docker Model Runner (vLLM) | 12434 | Safetensors/MLX | `docker model pull <model>` |
Edit configs/default.toml to customize backends, ports, models, and benchmark parameters. This is the single source of truth for models — there are no hardcoded defaults in the code.
```toml
[general]
warmup_requests = 3
max_tokens = 256
temperature = 0.0
concurrency_levels = [1, 2, 4, 8]

[backends.ollama]
enabled = true
endpoint = "http://localhost:11434/v1"

[backends.vllm-mlx]
enabled = true
endpoint = "http://localhost:8100/v1"
port = 8100

[backends.dmr-vllm]
enabled = true
endpoint = "http://localhost:12434/engines/vllm/v1"

[models."qwen-2.5-3b"]
ollama = "qwen2.5:3b"
vllm-mlx = "mlx-community/Qwen2.5-3B-Instruct-4bit"
dmr-vllm = "ai/qwen3-vllm:3B"
```

Not every model needs an entry for every backend — the benchmark simply skips backends that don't have a mapping for the requested model.
All model configuration lives in `configs/default.toml`. To add a new model, add a `[models."your-alias"]` section with the model ID for each backend you want to test:
```toml
[models."phi-3.5-mini"]
ollama = "phi3.5:3.8b"
vllm-mlx = "mlx-community/Phi-3.5-mini-instruct-4bit"
```

Then benchmark it with `mlxbench run --model phi-3.5-mini`.
To find model IDs:
- Ollama: `ollama search <name>` or ollama.com/library
- vLLM-MLX: Browse huggingface.co/mlx-community
- Docker Model Runner: `docker model search <name>`
- Terminal: Rich tables showing median (p95) for TTFT, tok/s, latency, ITL, memory
- JSON: Full results exported to `benchmarks/<timestamp>_results.json`
- Charts: Grouped bar charts saved as `benchmarks/<timestamp>_<model>_comparison.png`
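For reference, the "median (p95)" figures in the tables are ordinary order statistics over the per-request samples. A sketch using the nearest-rank method for p95 (an assumption; mlxbench's statistics code in `src/llm_bench/metrics/` may interpolate differently, and `median_p95` is a hypothetical helper):

```python
import math

def median_p95(samples: list[float]) -> tuple[float, float]:
    """Median and 95th percentile (nearest-rank) of a list of samples."""
    s = sorted(samples)
    mid = len(s) // 2
    median = s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2
    p95 = s[min(len(s) - 1, math.ceil(0.95 * len(s)) - 1)]
    return median, p95

# Ten illustrative end-to-end latencies (seconds)
lat = [0.82, 0.85, 0.88, 0.91, 0.95, 1.02, 1.10, 1.31, 1.45, 2.10]
print(median_p95(lat))  # median 0.985, p95 2.10
```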
```
src/llm_bench/
├── cli.py            # Typer CLI
├── config.py         # TOML config + Pydantic models
├── models.py         # Model name registry across backends
├── backends/         # Backend health checks + serve commands
│   ├── ollama.py
│   ├── vllm_mlx.py
│   └── dmr_vllm.py   # Docker Model Runner (vLLM engine)
├── runner/           # Request executor, concurrency, orchestrator
├── metrics/          # Data collection, statistics, memory sampling
├── prompts/          # Standard benchmark prompts
└── output/           # Terminal, JSON, chart rendering
```
- Quantization varies: Ollama uses GGUF, vLLM-MLX uses MLX 4-bit, DMR uses safetensors/MLX. This is inherent to comparing inference stacks.
- Memory: Reports process RSS via psutil. GPU-specific memory is not separately tracked.
- Fairness: All backends receive identical prompts with `temperature=0.0` and the same `max_tokens`. Warmup requests run before measurement.
- DMR naming: Docker Model Runner has its own model registry (e.g., `ai/ministral3-vllm:14B`). Not all models have DMR equivalents. You can also use HuggingFace models via the `hf.co/` prefix.
MIT