oMLX 0.3.8 + M3 Ultra: 60-run eval matrix on Qwen3.6-35B-A3B (22,686 questions, 38.6 h sustained inference) #1230

Regis-RCR · 2026-05-13T12:49:13Z

Sharing a real-workload showcase of oMLX 0.3.8 on a Mac Studio M3 Ultra. One base model (Qwen3.6-35B-A3B, MoE), three quantized checkpoints, ten benchmarks, two reasoning modes, all driven by the oMLX accuracy-benchmark harness. The post is structured around the four oMLX pillars exercised end-to-end: quantization, distilled-checkpoint inference, benchmarking, host performance. Two concrete roadmap asks at the bottom.

Setup

Hardware: Mac Studio (Mac15,14), Apple M3 Ultra, 28 cores (20 performance + 8 efficiency), 96 GB unified memory
OS: macOS 26.4.1 (build 25E253)
Framework: oMLX 0.3.8 (app build), running inside a Python 3.13.12 venv
MLX stack: mlx 0.31.1, mlx-lm 0.31.2, mlx-vlm 0.4.0, mlx-embeddings 0.0.5, mlx-metal 0.31.1
Adjacent: transformers 5.3.0, huggingface-hub 1.7.2, numpy 2.4.3

Pillar 1: quantization (oQ8 vs oQ8e)

Three checkpoints under one base architecture, all published by community maintainers (full credits in the Acknowledgements section):

Checkpoint	Provider	Quant family	On-disk size
Qwen3.6-35B-A3B-MLX-oQ8-FP16	@deepsweet	oQ8 (vanilla)	34 GB
Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-oQ8e-fp16	@splats	oQ8e	35 GB
Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled-oQ8e-fp16	@splats	oQ8e	35 GB

Both quant variants share an 8-bit affine base at group_size=64. The interesting part is the per-submodule override pattern. Reading the two config.json files side by side surfaces a real policy difference, not a cosmetic relabel.
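To put a number on what the group size costs, here is a back-of-envelope sketch of effective bits per weight under affine quantization. It assumes each group stores one 16-bit scale and one 16-bit bias next to the packed weights; that storage layout is an assumption on my side, not something read out of oMLX, and the helper name is illustrative.

```python
def effective_bits_per_weight(bits: int, group_size: int, meta_bits: int = 16) -> float:
    """Bits stored per weight for affine quantization.

    Assumes one scale and one bias of `meta_bits` each per group
    (an assumption about the storage layout, not taken from oMLX).
    """
    return bits + 2 * meta_bits / group_size


fine = effective_bits_per_weight(8, 64)     # base tier
coarse = effective_bits_per_weight(8, 128)  # coarser override tier
print(f"gs=64: {fine} bpw, gs=128: {coarse} bpw, saving: {fine - coarse} bpw")
```

Under that assumption, moving a layer from gs=64 to gs=128 saves a quarter of a bit per weight, so the override pattern is mostly a decision about where to spend precision rather than a large size lever.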

The trade is visible at a glance:

linear_attn.in_proj_qkv (fused Q/K/V for linear-attention layers): oQ8 vanilla moves all 30 layers to coarser gs=128. oQ8e keeps 25 of 30 at the finer gs=64. Accuracy-critical projection stays tight.
linear_attn.out_proj: oQ8 leaves all 30 layers at base gs=64. oQ8e moves all 30 to gs=128. Output projection takes the coarser tier, where it's more robust.
self_attn.{q,k,v}_proj (a smaller set of 10 layers): oQ8 moves all 10 to the coarser tier. oQ8e applies the coarser override to only half as many.

The framework reads this directly from config.json and inference Just Works. No code path per variant.
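For anyone who wants to reproduce the side-by-side read, a minimal sketch follows. It assumes the quantization block in config.json looks like {"group_size": 64, "bits": 8, "<module path>": {"group_size": ..., "bits": ...}}; that layout and the helper names are assumptions for illustration, and the demo dictionary is synthetic rather than copied from either checkpoint.

```python
import json
import re
from collections import Counter
from pathlib import Path


def summarize_overrides(quant: dict) -> Counter:
    """Count per-module overrides by (submodule suffix, group_size)."""
    counts = Counter()
    for path, override in quant.items():
        if not isinstance(override, dict):
            continue  # skip the base "group_size" / "bits" scalars
        # Drop everything up to and including the last numeric index, so
        # "model.layers.3.linear_attn.out_proj" becomes "linear_attn.out_proj".
        suffix = re.sub(r"^.*\.\d+\.", "", path)
        counts[(suffix, override.get("group_size"))] += 1
    return counts


def load_quant(config_path: str) -> dict:
    """Read the quantization block from a checkpoint's config.json."""
    return json.loads(Path(config_path).read_text()).get("quantization", {})


# Synthetic example; real use would call load_quant() on each checkpoint.
demo = {
    "group_size": 64,
    "bits": 8,
    "model.layers.0.linear_attn.out_proj": {"group_size": 128, "bits": 8},
    "model.layers.1.linear_attn.out_proj": {"group_size": 128, "bits": 8},
    "model.layers.3.self_attn.q_proj": {"group_size": 128, "bits": 8},
}
for (suffix, gs), n in sorted(summarize_overrides(demo).items()):
    print(f"{suffix:<24} gs={gs}: {n} layer(s)")
```

Running the same summary on both checkpoints and comparing the two counters gives the per-submodule layer counts without reading the JSON by eye.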

Pillar 2: reasoning-distilled checkpoints

Both oQ8e checkpoints are reasoning-distilled variants of the same base (Claude-4.7-Opus and Kimi-K2.6 as teachers). A 3-way head-to-head ran on the same harness:

Claude-trained traces: 1.58x faster wall-clock, -45% output tokens, -3.6 pts mean accuracy vs vanilla. Strong compression effect.
Kimi-trained traces: 0.76x as fast (slower), +16% output tokens, -2.9 pts mean accuracy. No efficiency dividend.

oMLX surfaces the difference by simply running the harness on each model directory. No special configuration.
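The two figures per checkpoint can be combined into an implied decode rate. The sketch below assumes the speed figure is the ratio of total wall-clock times (vanilla time divided by distilled time) and the token figure is the ratio of total output tokens; both readings are my interpretation of the numbers above.

```python
def relative_throughput(speedup: float, token_ratio: float) -> float:
    """Tokens/s relative to baseline.

    speedup     = baseline_time / time
    token_ratio = tokens / baseline_tokens
    """
    return token_ratio * speedup


claude = relative_throughput(speedup=1.58, token_ratio=1 - 0.45)
kimi = relative_throughput(speedup=0.76, token_ratio=1 + 0.16)
print(f"Claude-distilled: {claude:.2f}x baseline tokens/s")
print(f"Kimi-distilled:   {kimi:.2f}x baseline tokens/s")
```

Read that way, both distilled checkpoints generate tokens at a somewhat lower rate than vanilla, and the Claude-distilled wall-clock win comes from emitting fewer tokens rather than from faster decoding.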

Pillar 3: benchmarking harness

Ten benchmarks ship in omlx/eval/: MMLU, MMLU-Pro, TruthfulQA, ARC-Challenge, MathQA, HumanEval, MBPP, LiveCodeBench, BBQ, SafetyBench. Each runs in two modes (no-thinking, thinking). The harness writes one JSON file per {model, benchmark, mode} tuple, with this schema:

model_id, benchmark, accuracy, correct, total, time_s, thinking_used,
category_scores, questions[{id, correct, expected, predicted, question,
                            raw_response, category, time_s}]

Per-question timings and raw responses persist, which makes downstream analysis cheap (token estimation, length distributions, category breakdowns, error sampling). Same seeds across runs keep the comparison apples to apples.
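A minimal sketch of that downstream analysis, using only the field names from the schema above. It assumes the per-question correct field is a boolean and the top-level time_s is the total run time; the demo record is synthetic.

```python
import json
from collections import defaultdict


def summarize_run(result: dict) -> dict:
    """Derive token and per-category stats from one result record."""
    per_cat = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    est_tokens = 0
    for q in result["questions"]:
        est_tokens += len(q["raw_response"]) // 4  # rough: ~4 chars per token
        bucket = per_cat[q.get("category") or "uncategorized"]
        bucket[0] += int(bool(q["correct"]))
        bucket[1] += 1
    total_s = result["time_s"]
    return {
        "est_output_tokens": est_tokens,
        "est_tokens_per_s": est_tokens / total_s if total_s else 0.0,
        "category_accuracy": {c: ok / n for c, (ok, n) in per_cat.items()},
    }


# Synthetic record in the schema's shape; a real one would be loaded
# with json.load() from a harness output file.
demo = {
    "model_id": "demo-model",
    "benchmark": "MathQA",
    "time_s": 2.0,
    "questions": [
        {"id": 1, "correct": True, "category": "math",
         "raw_response": "x" * 40, "time_s": 1.0},
        {"id": 2, "correct": False, "category": "math",
         "raw_response": "y" * 20, "time_s": 1.0},
    ],
}
print(json.dumps(summarize_run(demo), indent=2))
```

The characters-divided-by-4 estimate is crude for code-heavy or non-English output, so treat the token numbers as a consistent yardstick across runs rather than exact counts.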

Three observations:

Full-set evals on TruthfulQA (817), HumanEval (164), and MBPP (200) ran without sampling. The other benchmarks ran on stratified samples of 300 to 1000 questions, which preserved per-category coverage.
Switching between thinking and no-thinking is a CLI flag, not a separate code path on the caller side.
The harness writes results incrementally. The MMLU thinking pass on the slowest checkpoint ran ~2.5 hours and survived an unintended app reload mid-run with no JSON corruption.
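I have not checked how the harness implements its incremental writes. One common pattern that gives the same no-corruption property is write-to-temp-then-rename, sketched below; the function and file names are hypothetical and this is not a claim about oMLX internals.

```python
import json
import os
import tempfile


def write_json_atomic(path: str, payload: dict) -> None:
    """Replace `path` with `payload` so a reader never sees a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(payload, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # force bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


# Usage: rewrite the whole result file after each answered question.
with tempfile.TemporaryDirectory() as workdir:
    out = os.path.join(workdir, "mmlu_thinking.json")  # hypothetical file name
    result = {"benchmark": "MMLU", "questions": []}
    for qid in range(3):
        result["questions"].append({"id": qid, "correct": True})
        write_json_atomic(out, result)
    with open(out) as fh:
        saved = json.load(fh)
    print(len(saved["questions"]))
```

If the process dies mid-write, only the temporary file is damaged and the last complete result file stays readable.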

The side panel of the matrix tells the framework story directly: LiveCodeBench alone consumed 8.4 hours of compute across the three models, MMLU took 6.6 hours. Cost per benchmark varies by an order of magnitude, which makes selective re-runs important.

Pillar 4: performance on M3 Ultra

Total workload across the three Qwen variants:

22,686 questions evaluated end-to-end
38.6 hours of sustained inference wall-clock (no-thinking + thinking, all 10 benches, all 3 checkpoints)
~13.4 M output tokens generated (estimated from raw_response characters divided by 4)
96 tokens per second average overall throughput, prefill included

Per-checkpoint throughput tracks output length: vanilla, with long traces, reaches 105 tok/s wall-clock-effective, while Claude-distilled, with the shortest traces, reaches 90 tok/s (but with 45% fewer tokens per question). On the memory side, peak resident set sat around 35 GB during a single-model run. A second 35B checkpoint loaded on top stayed under the unified-RAM ceiling.
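The headline throughput follows from the totals above. A short check, using only the figures quoted in this section:

```python
output_tokens = 13.4e6      # estimated: raw_response characters / 4
wall_clock_s = 38.6 * 3600  # 38.6 hours of sustained inference
questions = 22_686

tokens_per_s = output_tokens / wall_clock_s
tokens_per_question = output_tokens / questions
seconds_per_question = wall_clock_s / questions

print(f"{tokens_per_s:.1f} tok/s")
print(f"{tokens_per_question:.0f} tokens/question")
print(f"{seconds_per_question:.1f} s/question")
```

That works out to about 96 tok/s overall, with roughly 590 output tokens and about 6 seconds per question when averaged across both modes.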

Reading the numbers

The takeaway for the framework: a single desktop hosted three 35B-parameter MoE checkpoints, ran a 60-job evaluation matrix, and produced reproducible JSON for every question. The oQ8 family compressed a 70 GB FP16 weight set into ~34 GB while preserving competitive accuracy. The thinking-mode dimension multiplies the cost (~10x on most benches) but stays inside a desktop budget.
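The 60-job count and the 70 GB FP16 figure are both simple products. The sketch below enumerates the matrix from the checkpoint and benchmark names listed earlier and uses the nominal 35B parameter count, which is an approximation of the real total.

```python
from itertools import product

models = [
    "Qwen3.6-35B-A3B-MLX-oQ8-FP16",
    "Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-oQ8e-fp16",
    "Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled-oQ8e-fp16",
]
benchmarks = [
    "MMLU", "MMLU-Pro", "TruthfulQA", "ARC-Challenge", "MathQA",
    "HumanEval", "MBPP", "LiveCodeBench", "BBQ", "SafetyBench",
]
modes = ["no-thinking", "thinking"]

# One job (and one result JSON) per {model, benchmark, mode} tuple.
jobs = list(product(models, benchmarks, modes))
print(len(jobs))

# FP16 stores 2 bytes per parameter; 35e9 is the nominal parameter count.
fp16_gb = 35e9 * 2 / 1e9
print(fp16_gb)
```

Because each job is independent, a selective re-run is just a filter over this list, for example keeping only the tuples whose benchmark is LiveCodeBench.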

The framework didn't crash, didn't OOM, didn't require a custom serving stack. The full pipeline ran inside the published Python venv plus the app.

Roadmap questions

Two specific asks, in priority order.

1. Surface live inference stats during accuracy-benchmark runs in the dashboard. When a benchmark queued via /admin/api/bench/accuracy/queue/add is active, the Serving Stats panel currently shows nothing. The bench loop is going through omlx.engine_pool, the same engine that serves OpenAI-API requests, yet none of its tokens/prefill/cache counters reach /admin/api/stats. From a user perspective the box looks idle while a 2-hour MMLU pass is grinding. The stats infrastructure already exists (omlx.cache.stats, the per-request counters). Wiring the bench path to the same publisher would let a user actually watch a long run, catch a thermal-throttle event, or notice a stuck queue. Is this on the 0.4.x roadmap, or do you see a reason to keep the two surfaces separate?

2. Add per-feature ablation flags to the bench harness. oMLX ships several cache layers that materially change throughput and memory: paged_cache, paged_ssd_cache, prefix_cache, hybrid_cache, plus the tiered_manager and boundary_snapshot_store. The harness currently runs with whatever the running app has configured. There is no way to quantify the contribution of any one feature by re-running the same bench with that feature disabled. A --disable-feature paged_ssd_cache flag (or an env-var ablation matrix) on the bench harness would let users produce concrete numbers like "paged_ssd_cache gave us +X tokens/s on this bench at this context length on this hardware." That data feeds back into your release notes and into the user's hardware-sizing decisions. Possible in 0.4?
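To make the second ask concrete, here is a sketch of the leave-one-out matrix I have in mind. The feature names are the cache layers listed above; the run names, the data shape, and the throughput numbers are made up for illustration, and nothing here corresponds to an existing oMLX flag.

```python
FEATURES = ["paged_cache", "paged_ssd_cache", "prefix_cache", "hybrid_cache"]


def ablation_matrix(features: list[str]) -> list[dict]:
    """Baseline (everything on) plus one run per feature with it disabled."""
    runs = [{"name": "baseline", "disabled": []}]
    runs += [{"name": f"no-{feat}", "disabled": [feat]} for feat in features]
    return runs


def contributions(tokens_per_s: dict[str, float]) -> dict[str, float]:
    """Tokens/s lost when each feature is turned off, relative to baseline."""
    base = tokens_per_s["baseline"]
    return {
        name.removeprefix("no-"): base - tps
        for name, tps in tokens_per_s.items()
        if name != "baseline"
    }


for run in ablation_matrix(FEATURES):
    print(run["name"], run["disabled"])

# Made-up throughput numbers, only to show the shape of the report.
fake = {"baseline": 100.0, "no-prefix_cache": 80.0, "no-paged_ssd_cache": 95.0}
print(contributions(fake))
```

Leave-one-out keeps the run count linear in the number of features. It does not capture interactions between cache layers, which would need a fuller grid.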

Acknowledgements

This showcase exists because three groups published the artifacts under permissive terms:

@splats for the two reasoning-distilled checkpoints (Claude-4.7-Opus and Kimi-K2.6 teachers). The 3-way teacher comparison would not be possible without both variants on the same base.
@deepsweet for the vanilla Qwen3.6-35B-A3B-MLX-oQ8-FP16 baseline. It is the reference checkpoint the entire 60-cell matrix is anchored against.
The Qwen team for the Qwen3.6-35B-A3B MoE base model.

A separate 3-way feedback post on the splats Discussion page covers the model-side findings in detail. This post focuses on the framework angle for the oMLX side.

Sign-off

Thanks for shipping oMLX. The framework let me run 38.6 hours of inference on a single workstation without a single hand-tuned config. The two asks above would close the only friction points I hit across this workload.

Regis
