oMLX 0.3.8 + M3 Ultra: 60-run eval matrix on Qwen3.6-35B-A3B (22,686 questions, 38.6 h sustained inference) #1230
Regis-RCR started this conversation in Show and tell
Hi @jundot,
Sharing a real-workload showcase of oMLX 0.3.8 on a Mac Studio M3 Ultra. One base model (Qwen3.6-35B-A3B, MoE), three quantized checkpoints, ten benchmarks, two reasoning modes, all driven by the oMLX accuracy-benchmark harness. The post is structured around the four oMLX pillars exercised end-to-end: quantization, distilled-checkpoint inference, benchmarking, host performance. Two concrete roadmap asks at the bottom.
Setup
Pillar 1: quantization (oQ8 vs oQ8e)
Three checkpoints under one base architecture, all published by community maintainers (full credits in the Acknowledgements section):
Both quant variants share an 8-bit affine base at group_size=64. The interesting part is the per-submodule override pattern. Reading the two config.json files side by side surfaces a real policy difference, not a cosmetic relabel.
The trade is visible at a glance:
- linear_attn.in_proj_qkv (fused Q/K/V for linear-attention layers): oQ8 vanilla moves all 30 layers to coarser gs=128. oQ8e keeps 25 of 30 at the finer gs=64. Accuracy-critical projection stays tight.
- linear_attn.out_proj: oQ8 leaves all 30 layers at base gs=64. oQ8e moves all 30 to gs=128. Output projection takes the coarser tier, where it's more robust.
- self_attn.{q,k,v}_proj (a smaller set of 10 layers): oQ8 quantizes all coarser, oQ8e halves the coverage.
The framework reads this directly from config.json and inference Just Works. No code path per variant.
Pillar 2: reasoning-distilled checkpoints
Both oQ8e checkpoints are reasoning-distilled variants of the same base (Claude-4.7-Opus and Kimi-K2.6 as teachers). A 3-way head-to-head ran on the same harness:
oMLX surfaces the difference by simply running the harness on each model directory. No special configuration.
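Because each model directory carries its own config.json, the per-submodule group-size policy described under Pillar 1 can be tallied with a few lines of Python. This is a minimal sketch, not oMLX code. It assumes the overrides sit under a quantization key as per-module entries keyed by module path (for example model.layers.0.linear_attn.in_proj_qkv), each with its own group_size. The function name tally_overrides and that layout are assumptions, so adjust them to whatever the actual files contain.

```python
import json
import re
from collections import Counter
from pathlib import Path


def tally_overrides(config: dict) -> dict:
    """Count per-submodule group sizes among the quantization overrides.

    Assumed layout (hypothetical, check your own config.json):
    config["quantization"] holds scalar defaults (group_size, bits) plus
    one dict per overridden module, keyed by the module path.
    """
    quant = config.get("quantization", {})
    base_gs = quant.get("group_size")
    tally: dict = {}
    for key, value in quant.items():
        if not isinstance(value, dict):
            continue  # scalar defaults such as group_size or bits
        # Drop the "...layers.<index>." prefix so all layers share one name.
        match = re.search(r"layers\.\d+\.(.+)$", key)
        name = match.group(1) if match else key
        group_size = value.get("group_size", base_gs)
        tally.setdefault(name, Counter())[group_size] += 1
    return tally


def tally_model_dir(model_dir: str) -> dict:
    """Load config.json from a model directory and tally its overrides."""
    config = json.loads(Path(model_dir, "config.json").read_text())
    return tally_overrides(config)
```

Calling tally_model_dir on two checkpoint directories and comparing the returned counters gives the same kind of "N of 30 layers at gs=64" summary quoted in the Pillar 1 list, provided the files follow the assumed layout.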
Pillar 3: benchmarking harness
Ten benchmarks shipped with omlx/eval/: MMLU, MMLU-Pro, TruthfulQA, ARC-Challenge, MathQA, HumanEval, MBPP, LiveCodeBench, BBQ, SafetyBench. Two modes per benchmark (no-thinking, thinking). One JSON per {model, benchmark, mode} tuple, schema:
Per-question timings and raw responses persist, which makes downstream analysis cheap (token estimation, length distributions, category breakdowns, error sampling). Same seeds across runs keep the comparison apples to apples.
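The 60-run figure follows from three checkpoints, ten benchmarks, and two modes. The sketch below enumerates that matrix and reports which cells still lack a result, which is handy when only some benchmarks need a re-run. The model labels and the helper names (build_matrix, missing_cells) are hypothetical placeholders, not names used by the oMLX harness.

```python
from itertools import product

# Hypothetical short labels for the three checkpoints; substitute real names.
MODELS = ["vanilla-oQ8", "claude-distilled-oQ8e", "kimi-distilled-oQ8e"]

# Benchmark and mode names as listed in the post.
BENCHMARKS = [
    "MMLU", "MMLU-Pro", "TruthfulQA", "ARC-Challenge", "MathQA",
    "HumanEval", "MBPP", "LiveCodeBench", "BBQ", "SafetyBench",
]
MODES = ["no-thinking", "thinking"]


def build_matrix() -> list:
    """Return every (model, benchmark, mode) cell of the evaluation matrix."""
    return list(product(MODELS, BENCHMARKS, MODES))


def missing_cells(done: set) -> list:
    """Return the cells that have no finished result yet, in matrix order."""
    return [cell for cell in build_matrix() if cell not in done]


if __name__ == "__main__":
    finished = {("vanilla-oQ8", "MMLU", "thinking")}
    print(len(build_matrix()), "cells total")
    print(len(missing_cells(finished)), "cells still to run")
```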
Three observations:
The side panel of the matrix tells the framework story directly: LiveCodeBench alone consumed 8.4 hours of compute across the three models, MMLU took 6.6 hours. Cost per benchmark varies by an order of magnitude, which makes selective re-runs important.
Pillar 4: performance on M3 Ultra
Total workload across the three Qwen variants:
(tokens estimated as raw_response characters divided by 4)
Per-checkpoint throughput tracks output length: vanilla emits the longest traces (105 tok/s wall-clock-effective), Claude distilled emits the shortest (90 tok/s, but 45% fewer tokens per question). Memory-side, peak resident set sat around 35 GB during a single-model run. A second 35B checkpoint loaded on top stayed under the unified-RAM ceiling.
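The token figures rest on a characters-divided-by-4 estimate over raw_response, so the wall-clock-effective throughput can be reproduced from the saved responses alone. A minimal sketch under that assumption follows. The function names are illustrative, and the only inputs taken from the post are the headline totals (22,686 questions, 38.6 hours).

```python
def estimate_tokens(raw_response: str) -> float:
    """Rough token count: characters divided by 4, as used in the post."""
    return len(raw_response) / 4


def effective_tok_per_s(raw_responses: list, wall_seconds: float) -> float:
    """Estimated output tokens per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    total_tokens = sum(estimate_tokens(r) for r in raw_responses)
    return total_tokens / wall_seconds


def seconds_per_question(total_hours: float, total_questions: int) -> float:
    """Average wall-clock seconds spent per question."""
    return total_hours * 3600 / total_questions


if __name__ == "__main__":
    # Headline totals from the post title.
    avg = seconds_per_question(38.6, 22_686)
    print(f"average wall-clock time per question: {avg:.2f} s")
```

Averaged over the whole matrix this comes to roughly six seconds per question, though thinking-mode runs sit far above that average and no-thinking runs far below it.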
Reading the numbers
The takeaway for the framework: a single desktop hosted three 35B-parameter MoE checkpoints, ran a 60-job evaluation matrix, and produced reproducible JSON for every question. The oQ8 family compressed a 70 GB FP16 weight set into ~34 GB while preserving competitive accuracy. The thinking-mode dimension multiplies the cost (~10x on most benches) but stays inside a desktop budget.
The framework didn't crash, didn't OOM, didn't require a custom serving stack. The full pipeline ran inside the published Python venv plus the app.
Roadmap questions
Two specific asks, in priority order.
1. Surface live inference stats during accuracy-benchmark runs in the dashboard. When a benchmark queued via /admin/api/bench/accuracy/queue/add is active, the Serving Stats panel currently shows nothing. The bench loop is going through omlx.engine_pool, the same engine that serves OpenAI-API requests, yet none of its tokens/prefill/cache counters reach /admin/api/stats. From a user perspective the box looks idle while a 2-hour MMLU pass is grinding. The stats infrastructure already exists (omlx.cache.stats, the per-request counters). Wiring the bench path to the same publisher would let a user actually watch a long run, catch a thermal-throttle event, or notice a stuck queue. Is this on the 0.4.x roadmap, or do you see a reason to keep the two surfaces separate?
2. Add per-feature ablation flags to the bench harness. oMLX ships several cache layers that materially change throughput and memory: paged_cache, paged_ssd_cache, prefix_cache, hybrid_cache, plus the tiered_manager and boundary_snapshot_store. The harness currently runs with whatever the running app has configured. There is no way to quantify the contribution of any one feature by re-running the same bench with that feature disabled. A --disable-feature paged_ssd_cache flag (or an env-var ablation matrix) on the bench harness would let users produce concrete numbers like "paged_ssd_cache gave us +X tokens/s on this bench at this context length on this hardware." That data feeds back into your release notes and into the user's hardware-sizing decisions. Possible in 0.4?
Acknowledgements
This showcase exists because three groups published the artifacts under permissive terms:
Qwen3.6-35B-A3B-MLX-oQ8-FP16 baseline. It is the reference checkpoint the entire 60-cell matrix is anchored against.
A separate 3-way feedback post on the splats Discussion page covers the model-side findings in detail. This post focuses on the framework angle for the oMLX side.
Sign-off
Thanks for shipping oMLX. The framework let me run 38 hours of inference on a single workstation without a single hand-tuned config. The two asks above would close the only friction points I hit across this workload.
Regis