RFC+PoC: MoE offload to disk with on-demand paging #1833
kisasexypantera94
started this conversation in
Ideas
Replies: 0 comments
Running Qwen3-30B-A3B-Q6_K on M3 Pro 36GB, the expert weights alone wire ~27 GiB of memory. Wired memory on Apple Silicon can't be compressed or swapped, which leaves little headroom for other workloads.
This PoC implements on-demand expert loading: instead of keeping all 128 experts resident, we allocate a compact pool of N slots in Metal shared memory and page missing experts from the GGUF file via pread. This is exact inference without fallback to zero/random weights.
With lazy pool allocation this could enable running models larger than physical RAM.
UPD: managed to run Qwen3-30B-A3B-Q6_K on M1 Pro 16GB with 9.5 tok/s, so the concept works already.
How it works
Each MoE layer gets a compact pool tensor of N slots. A small Metal kernel (kernel_moe_interceptor) copies the selected expert IDs to shared CPU/GPU memory and publishes a request sequence number. A CPU sidecar thread sees the request, resolves expert-to-slot mappings via LRU, loads missing experts from disk with pread, writes remapped slot IDs back to shared memory, and signals an MTLSharedEvent. The GPU encoder waits on this event, then MUL_MAT_ID runs unchanged against the compact pool.
Measurements (Qwen3-30B-A3B-Q6_K, M3 Pro 36GB)
The n_slots=128 row (no eviction) shows the fixed synchronization overhead: ~19% throughput tax just from the interceptor mechanism. At n_slots=96 we get ~5.6 GiB saved with roughly the same overhead. n_slots=64 is the knee of the curve. Below 32 it's cache thrash.
Prompt processing is slow (~4.8 t/s) because cold misses are synchronous.
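To make the slot mechanics behind these numbers easier to reason about, here is a rough sketch of the expert-to-slot resolution with LRU eviction that the sidecar performs. It is an illustration only: ExpertSlotPool, acquire and remap are hypothetical names, and the actual PoC may structure this differently.

```cpp
#include <cstdint>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical per-layer pool (not the PoC's code): maps expert IDs to slots, evicting via LRU.
struct ExpertSlotPool {
    struct Entry { uint32_t expert_id; uint32_t slot; };

    uint32_t n_slots;
    std::list<Entry> lru; // front = most recently used
    std::unordered_map<uint32_t, std::list<Entry>::iterator> index;

    explicit ExpertSlotPool(uint32_t n) : n_slots(n) {}

    // Returns {slot, miss}. On a miss the caller would load the expert's
    // weights into the returned slot before the GPU is allowed to continue.
    std::pair<uint32_t, bool> acquire(uint32_t expert_id) {
        auto it = index.find(expert_id);
        if (it != index.end()) {
            lru.splice(lru.begin(), lru, it->second); // hit: mark as most recent
            return {it->second->slot, false};
        }
        uint32_t slot;
        if (lru.size() < n_slots) {
            slot = (uint32_t) lru.size();  // pool not full yet: take the next free slot
        } else {
            slot = lru.back().slot;        // pool full: reuse the least recently used slot
            index.erase(lru.back().expert_id);
            lru.pop_back();
        }
        lru.push_front({expert_id, slot});
        index[expert_id] = lru.begin();
        return {slot, true};
    }
};

// Rewrites the selected expert IDs of one request into slot IDs and returns the miss count.
// As long as the request has no more unique experts than n_slots, LRU never evicts
// an expert that was already touched in the same request.
uint32_t remap(ExpertSlotPool & pool, const std::vector<uint32_t> & ids,
               std::vector<uint32_t> & slots) {
    uint32_t misses = 0;
    slots.clear();
    for (uint32_t id : ids) {
        auto r = pool.acquire(id);
        slots.push_back(r.first);
        misses += r.second ? 1 : 0;
    }
    return misses;
}
```

Each miss counted by remap corresponds to one synchronous disk read, which is why small pools and cold prompts are expensive.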
Portability
The core idea is backend-agnostic. The Apple-specific parts are the synchronization mechanism (MTLSharedEvent, Metal shared memory) and MADV_FREE for CPU page release.
Equivalent primitives exist on other backends:
* cudaEventRecord/cudaStreamWaitEvent for sync, cudaHostAlloc(cudaHostAllocMapped) for zero-copy shared memory
Would be interested to hear from people familiar with these backends whether a similar approach is feasible there.
How to run
Key constraints:
* --no-mmap is required because expert tensors must be in anonymous CPU pages for MADV_FREE to work
* --override-tensor pins expert weights to CPU so Metal pools can be allocated separately
* -ub × 8 (n_expert_used for Qwen3) must be less than --moe-n-slots, otherwise a single ubatch can request more unique experts than slots available
* [row, col, n_expert] tensors
Known issues and open questions:
Interested in feedback on whether this direction makes sense for llama.cpp, and whether there are cleaner integration points.