RFC+PoC: MoE offload to disk with on-demand paging #1833

kisasexypantera94 · 2026-05-19T08:46:24Z

Hi, this is a cross-post from ggml-org/llama.cpp#23324. The integration would look almost identical, so I thought you might be interested as well.

When running Qwen3-30B-A3B-Q6_K on an M3 Pro 36GB, the expert weights alone wire ~27 GiB of memory. Wired memory on Apple Silicon can't be compressed or swapped, which leaves little headroom for other workloads.

This PoC implements on-demand expert loading: instead of keeping all 128 experts resident, we allocate a compact pool of N slots in Metal shared memory and page missing experts from the GGUF file via pread. This is exact inference without fallback to zero/random weights.
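
The paging step can be sketched as a positional read of one expert's bytes. This is a minimal illustration, not the PoC's code: it assumes each expert's slice is contiguous in the file, so expert e starts at tensor_offset + e * expert_nbytes, and the names read_expert, tensor_offset and expert_nbytes are hypothetical.

```python
import os

def read_expert(fd: int, tensor_offset: int, expert_nbytes: int, expert_id: int) -> bytes:
    """Read the raw bytes of one expert slice with positional reads.

    Assumption: experts are stored back to back, so the slice for
    expert_id starts at tensor_offset + expert_id * expert_nbytes.
    """
    offset = tensor_offset + expert_id * expert_nbytes
    buf = bytearray()
    # pread may return fewer bytes than requested, so loop until complete.
    while len(buf) < expert_nbytes:
        chunk = os.pread(fd, expert_nbytes - len(buf), offset + len(buf))
        if not chunk:
            raise EOFError("file ended before the expert slice was complete")
        buf += chunk
    return bytes(buf)
```

Because pread takes an explicit offset and does not move the file position, several layers could share one file descriptor without seeking.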

~~With lazy pool allocation this could enable running models larger than physical RAM.~~
UPD: I managed to run Qwen3-30B-A3B-Q6_K on an M1 Pro 16GB at 9.5 tok/s, so the concept already works.

How it works

Each MoE layer gets a compact pool tensor of N slots. A small Metal kernel (kernel_moe_interceptor) copies the selected expert IDs to shared CPU/GPU memory and publishes a request sequence number. A CPU sidecar thread sees the request, resolves expert-to-slot mappings via LRU, loads missing experts from disk with pread, writes remapped slot IDs back to shared memory, and signals an MTLSharedEvent. The GPU encoder waits on this event, then MUL_MAT_ID runs unchanged against the compact pool.
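
The CPU-side bookkeeping described above (resolve expert IDs to slots, evict by LRU, load misses) might look roughly like the sketch below. It models only the remapping; the Metal kernel, the sequence number and the MTLSharedEvent signalling are left out. SlotPool and the load_expert callback are hypothetical names, not identifiers from the PoC.

```python
from collections import OrderedDict

class SlotPool:
    """Maps expert IDs onto a fixed number of slots with LRU eviction."""

    def __init__(self, n_slots, load_expert):
        self.n_slots = n_slots
        self.load_expert = load_expert   # hypothetical callback: (expert_id, slot)
        self.slot_of = OrderedDict()     # expert_id -> slot, least recent first
        self.free = list(range(n_slots))

    def resolve(self, expert_ids):
        """Return the slot ID for each requested expert, loading misses."""
        if len(set(expert_ids)) > self.n_slots:
            raise ValueError("request needs more unique experts than slots")
        # Mark resident experts as recently used first, so that this
        # request never evicts an expert it still needs.
        for e in expert_ids:
            if e in self.slot_of:
                self.slot_of.move_to_end(e)
        for e in expert_ids:
            if e not in self.slot_of:
                if self.free:
                    slot = self.free.pop(0)
                else:
                    _, slot = self.slot_of.popitem(last=False)  # evict LRU
                self.load_expert(e, slot)  # e.g. read from disk into the slot
                self.slot_of[e] = slot
        return [self.slot_of[e] for e in expert_ids]
```

The returned list corresponds to the remapped slot IDs that the sidecar writes back to shared memory before signalling the GPU.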

Measurements (Qwen3-30B-A3B-Q6_K, M3 Pro 36GB)

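| n_slots | wired GiB | saved GiB | tok/s (after warmup) |
|---------|-----------|-----------|----------------------|
| vanilla | ~27.2     | -         | 39.9                 |
| 128     | ~27.1     | -         | 32.4                 |
| 96      | ~21.6     | ~5.6      | 31.2                 |
| 80      | ~18.9     | ~8.3      | 29.0                 |
| 64      | ~16.1     | ~11.1     | 26.5                 |
| 32      | ~10.6     | ~16.6     | 10.6                 |
| 16      | ~7.8      | ~19.4     | 6.2                  |
| 8       | ~6.3      | ~20.9     | 4.7                  |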

The n_slots=128 row (no eviction) shows the fixed synchronization overhead: a ~19% throughput tax from the interceptor mechanism alone. At n_slots=96 we save ~5.6 GiB with roughly the same overhead. n_slots=64 is the knee of the curve. At 32 slots and below, throughput collapses because the cache thrashes.
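
The percentages and savings quoted here can be rechecked from the table with a few lines of arithmetic. The sketch below only recomputes from the posted numbers; the helper names are made up for illustration.

```python
# (n_slots, wired GiB, tok/s) copied from the measurements table.
VANILLA_GIB, VANILLA_TPS = 27.2, 39.9
ROWS = [(128, 27.1, 32.4), (96, 21.6, 31.2), (80, 18.9, 29.0),
        (64, 16.1, 26.5), (32, 10.6, 10.6), (16, 7.8, 6.2), (8, 6.3, 4.7)]

def throughput_tax(tps):
    """Fraction of vanilla throughput lost at a given tok/s."""
    return 1.0 - tps / VANILLA_TPS

def gib_per_slot(a, b):
    """Wired-memory slope between two rows, in GiB per slot (all layers)."""
    return (a[1] - b[1]) / (a[0] - b[0])

for n, gib, tps in ROWS:
    saved = VANILLA_GIB - gib
    print(f"n_slots={n:3d} saved={saved:5.1f} GiB tax={throughput_tax(tps):5.1%}")
```

For the n_slots=128 row this gives a tax of about 18.8%, and the slope between the 128 and 8 rows is about 0.17 GiB per slot summed over all layers, which is a rough way to estimate wired memory for slot counts that were not measured.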

Prompt processing is slow (~4.8 tok/s) because cold misses are synchronous: the GPU waits while each missing expert is read from disk.

Portability

The core idea is backend-agnostic.
The Apple-specific parts are the synchronization mechanism (MTLSharedEvent, Metal shared memory)
and MADV_FREE for CPU page release.

Equivalent primitives exist on other backends:

- CUDA: cudaEventRecord/cudaStreamWaitEvent for sync, cudaHostAlloc(cudaHostAllocMapped) for zero-copy shared memory
- Vulkan: timeline semaphores

Would be interested to hear from people familiar with these backends whether a similar approach is feasible there.

How to run

./build/bin/llama-cli \
  --moe-n-slots 80 \
  --moe-n-layers 48 \
  --no-mmap \
  --override-tensor 'blk\.[0-9]+\.ffn_(gate|up|down|gate_up)_exps\.weight=CPU' \
  -m /path/to/model.gguf \
  -ub 10 \
  --temp 0 -p "your prompt"
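
To see which tensors the --override-tensor pattern in the command selects, the regular expression can be tried against a few names. This is only an illustration: the tensor names below are typical GGUF-style examples rather than names read from a model file, and it assumes Python's re behaves the same as the regex engine used by the CLI for this simple pattern.

```python
import re

# Same pattern as in the --override-tensor argument, without the "=CPU" suffix.
PATTERN = re.compile(r"blk\.[0-9]+\.ffn_(gate|up|down|gate_up)_exps\.weight")

def is_expert_tensor(name: str) -> bool:
    """True if the tensor name matches the expert-weight pattern."""
    return PATTERN.search(name) is not None

# Illustrative names only (assumed, not taken from an actual GGUF file).
for name in ["blk.0.ffn_gate_exps.weight",
             "blk.47.ffn_down_exps.weight",
             "blk.0.ffn_gate_inp.weight",
             "blk.0.attn_q.weight"]:
    print(name, is_expert_tensor(name))
```

Only the stacked expert weights match; the router (ffn_gate_inp) and attention tensors do not, so they stay on their default backend.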

Key constraints:

- --no-mmap is required because expert tensors must be in anonymous CPU pages for MADV_FREE to work.
- --override-tensor pins expert weights to CPU so Metal pools can be allocated separately.
- -ub × 8 (n_expert_used for Qwen3) must not exceed --moe-n-slots; otherwise a single ubatch can request more unique experts than there are slots.
- This approach should work for any MoE model as long as expert weights are laid out as [row, col, n_expert] tensors.
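
The ubatch constraint comes from a worst case in which every token in a ubatch routes to different experts. A small sketch of that check follows; the function names are hypothetical, and it assumes a request that exactly fills the pool is acceptable, as the example command (-ub 10, 8 experts per token, 80 slots) suggests.

```python
def worst_case_unique_experts(n_ubatch: int, n_expert_used: int, n_expert: int) -> int:
    """Upper bound on distinct experts one ubatch can request in a layer."""
    return min(n_ubatch * n_expert_used, n_expert)

def fits(n_ubatch: int, n_expert_used: int, n_expert: int, n_slots: int) -> bool:
    """True if the worst-case request still fits in the slot pool."""
    return worst_case_unique_experts(n_ubatch, n_expert_used, n_expert) <= n_slots

def max_ubatch(n_slots: int, n_expert_used: int) -> int:
    """Largest ubatch size that can never overflow the pool."""
    return n_slots // n_expert_used
```

With 8 experts per token, 80 slots allow a ubatch of at most 10 tokens and 64 slots allow at most 8.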

Known issues and open questions:

- CPU_REPACK: the ARM-optimized repacked CPU buffer layout is incompatible with reading quantized bytes in GGUF layout. Worked around by skipping REPACK buffer types during tensor override. Is there a cleaner way?
- Peak memory during warmup: all pools are allocated upfront, causing a temporary spike before the original CPU pages are released. Lazy per-layer allocation is a possible improvement.

Interested in feedback on whether this direction makes sense for llama.cpp, and whether there are cleaner integration points.

Replies: 0 comments

Select a reply

Uh oh!

RFC+PoC: MoE offload to disk with on-demand paging #1833

Uh oh!

Uh oh!

kisasexypantera94 May 19, 2026

How it works

Measurements (Qwen3-30B-A3B-Q6_K, M3 Pro 36GB)

Portability

How to run

Known issues and open questions:

Replies: 0 comments

kisasexypantera94
May 19, 2026