linusvwe/MLXBench

MLXBench

Local LLM inference benchmark for macOS Apple Silicon. Compare Ollama, vLLM-MLX, and Docker Model Runner (vLLM) side by side.

Features

  • Benchmarks three local inference backends through their OpenAI-compatible APIs
  • Measures TTFT, tokens/sec, end-to-end latency, inter-token latency, and memory usage
  • Tests at multiple concurrency levels (1, 2, 4, 8)
  • Outputs rich terminal tables, JSON results, and comparison charts
  • Configurable via TOML — models, backends, prompts, and parameters
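
The latency metrics above can all be derived from per-token arrival timestamps captured while streaming a response. A minimal sketch of that derivation (the function name and timestamp representation are illustrative, not MLXBench's actual internals):

```python
import statistics

def stream_metrics(request_start: float, token_times: list[float]) -> dict:
    """Derive TTFT, tokens/sec, end-to-end latency, and inter-token
    latency from the wall-clock times at which streamed tokens arrived."""
    ttft = token_times[0] - request_start        # time to first token
    e2e = token_times[-1] - request_start        # end-to-end latency
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    return {
        "ttft_s": ttft,
        "e2e_s": e2e,
        "tokens_per_sec": len(token_times) / e2e,
        "median_itl_s": statistics.median(itl) if itl else 0.0,
    }

# Example: 4 tokens, first after 0.2 s, then one every 0.1 s
m = stream_metrics(0.0, [0.2, 0.3, 0.4, 0.5])
print(m)  # TTFT 0.2 s, 8 tok/s, median ITL ~0.1 s
```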

Requirements

  • macOS with Apple Silicon (M1/M2/M3/M4)
  • Python 3.12+
  • uv (recommended) or pip
  • At least one backend installed: Ollama, vLLM-MLX, or Docker Model Runner (via Docker Desktop)

Quick Start

# Clone and set up
git clone https://github.com/your-user/mlxbench.git
cd mlxbench
uv venv .venv --python 3.12
source .venv/bin/activate
uv pip install -e .

# Check which backends are available
mlxbench check

# See registered models
mlxbench models

Starting Backends

Each backend must be running before you benchmark it. Start them in separate terminals:

# Ollama (daemon — auto-loads models on request)
ollama serve

# vLLM-MLX (must specify model and port)
vllm-mlx serve mlx-community/Qwen2.5-3B-Instruct-4bit --port 8100

# Docker Model Runner (always running via Docker Desktop — no action needed)
# Models load on first request. Pull models with: docker model pull ai/ministral3-vllm:14B

Use mlxbench serve <backend> <model> to see the exact command for any backend/model combo.

Running Benchmarks

# Benchmark all backends with a specific model
mlxbench run --model qwen-2.5-3b

# Benchmark a specific backend
mlxbench run --backend ollama --model ministral-14b

# Control concurrency levels
mlxbench run --model llama-3.2-1b --concurrency 1 --concurrency 4

# JSON output only (no tables/charts)
mlxbench run --model qwen-2.5-3b --json-only

CLI Commands

| Command | Description |
| --- | --- |
| mlxbench run | Run the full benchmark suite |
| mlxbench check | Verify backends are reachable and show serve commands |
| mlxbench models | List the model registry and backend mappings |
| mlxbench serve <backend> <model> | Show the command to start a backend |

run Options

| Flag | Description | Default |
| --- | --- | --- |
| --config | TOML config file | configs/default.toml |
| --backend / -b | Backend(s) to test | all enabled |
| --model / -m | Model alias(es) to test | all registered |
| --concurrency / -c | Concurrency level(s) | 1, 2, 4, 8 |
| --output-dir | Output directory | benchmarks/ |
| --no-charts | Skip chart generation | off |
| --json-only | Only output JSON | off |

Backends

| Backend | Port | Model Format | Model Management |
| --- | --- | --- | --- |
| Ollama | 11434 | GGUF | ollama pull <model> |
| vLLM-MLX | 8100 | MLX 4-bit | Auto-downloads from Hugging Face |
| Docker Model Runner (vLLM) | 12434 | Safetensors/MLX | docker model pull <model> |

Configuration

Edit configs/default.toml to customize backends, ports, models, and benchmark parameters. This is the single source of truth for models — there are no hardcoded defaults in the code.

[general]
warmup_requests = 3
max_tokens = 256
temperature = 0.0
concurrency_levels = [1, 2, 4, 8]

[backends.ollama]
enabled = true
endpoint = "http://localhost:11434/v1"

[backends.vllm-mlx]
enabled = true
endpoint = "http://localhost:8100/v1"
port = 8100

[backends.dmr-vllm]
enabled = true
endpoint = "http://localhost:12434/engines/vllm/v1"

[models."qwen-2.5-3b"]
ollama = "qwen2.5:3b"
vllm-mlx = "mlx-community/Qwen2.5-3B-Instruct-4bit"
dmr-vllm = "ai/qwen3-vllm:3B"

Not every model needs an entry for every backend — the benchmark simply skips backends that don't have a mapping for the requested model.

Adding or Changing Models

All model configuration lives in configs/default.toml. To add a new model, add a [models."your-alias"] section with the model ID for each backend you want to test:

[models."phi-3.5-mini"]
ollama = "phi3.5:3.8b"
vllm-mlx = "mlx-community/Phi-3.5-mini-instruct-4bit"

Then benchmark it with mlxbench run --model phi-3.5-mini.

To find model IDs:

  • Ollama: browse the library at ollama.com/library
  • vLLM-MLX: search the mlx-community organization on Hugging Face
  • Docker Model Runner: browse the ai/ namespace on Docker Hub, or run docker model ls to see locally pulled models

Output

  • Terminal: Rich tables showing median (p95) for TTFT, tok/s, latency, ITL, memory
  • JSON: Full results exported to benchmarks/<timestamp>_results.json
  • Charts: Grouped bar charts saved as benchmarks/<timestamp>_<model>_comparison.png

Architecture

src/llm_bench/
├── cli.py              # Typer CLI
├── config.py           # TOML config + Pydantic models
├── models.py           # Model name registry across backends
├── backends/           # Backend health checks + serve commands
│   ├── ollama.py
│   ├── vllm_mlx.py
│   └── dmr_vllm.py     # Docker Model Runner (vLLM engine)
├── runner/             # Request executor, concurrency, orchestrator
├── metrics/            # Data collection, statistics, memory sampling
├── prompts/            # Standard benchmark prompts
└── output/             # Terminal, JSON, chart rendering

Notes

  • Quantization varies: Ollama uses GGUF, vLLM-MLX uses MLX 4-bit, DMR uses safetensors/MLX. This is inherent to comparing inference stacks.
  • Memory: Reports process RSS via psutil. GPU-specific memory is not separately tracked.
  • Fairness: All backends receive identical prompts with temperature=0.0 and the same max_tokens. Warmup requests run before measurement.
  • DMR naming: Docker Model Runner has its own model registry (ai/ministral3-vllm:14B). Not all models have DMR equivalents. You can also use Hugging Face models via the hf.co/ prefix.
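
The fairness guarantee boils down to sending the same OpenAI-compatible chat-completions body to every backend, with only the base URL and model ID varying. A sketch of that request construction (build_request and the prompt text are illustrative placeholders; the URL is one of the endpoints from the config above):

```python
def build_request(base_url: str, model_id: str) -> tuple[str, dict]:
    """Same body for every backend; only the URL and model ID differ."""
    body = {
        "model": model_id,
        "messages": [{"role": "user", "content": "Explain unified memory."}],
        "max_tokens": 256,   # identical token budget everywhere
        "temperature": 0.0,  # greedy-style decoding for comparability
        "stream": True,      # streaming is needed to measure TTFT and ITL
    }
    return f"{base_url}/chat/completions", body

url, body = build_request("http://localhost:11434/v1", "qwen2.5:3b")
print(url)  # → http://localhost:11434/v1/chat/completions
```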

License

MIT
