Description
Context
During inference serving analysis, KV cache management was identified as a critical design problem that is fundamentally different on Apple Silicon compared to NVIDIA. On NVIDIA, PagedAttention manages scarce GPU VRAM. On Apple Silicon UMA, CPU and GPU share the same memory pool — enabling novel designs that nobody has documented. Current KB depth: 0%.
Gap Description
KV cache management determines serving throughput, maximum concurrent users, and maximum context length. vLLM's PagedAttention on NVIDIA was a breakthrough because it eliminated memory fragmentation in scarce GPU VRAM. On Apple Silicon, the constraints are different: memory is unified but bandwidth is lower (120-546 GB/s vs 3.35 TB/s on H100). Novel approaches exploiting UMA properties are possible but undocumented.
What We Have (relevant but indirect)
- #110: UMA serves as both system RAM and GPU VRAM simultaneously
- #224: mmap + bytesNoCopy creates file-backed Metal buffers with on-demand page faults
- #226: MADV_WILLNEED reduces page faults from ~110 to ~6 for mmap'd buffers (see the sketch after this list)
- #169: Residency sets prevent OS from reclaiming GPU memory (250ms improvement in llama.cpp)
- #250: 15-70x memory-to-storage bandwidth gap (GPU compute is memory-bound, not storage-bound)
- #113: SLC provides ~2x DRAM bandwidth (8-96MB cache depending on chip)
- #362: MLX KV cache stays in place — no device-to-device transfer needed
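
Taken together, #224 and #226 suggest a file-backed KV buffer along these lines. A minimal sketch in Swift; the path, size, and storage options are illustrative assumptions, not a measured configuration.

```swift
import Metal
import Darwin

// File-backed, zero-copy Metal buffer (#224), warmed with MADV_WILLNEED (#226).
let length = 64 * 1024 * 1024                       // 64 MiB, page-aligned
let fd = open("/tmp/kv_cache.bin", O_RDWR | O_CREAT, 0o644)
precondition(fd >= 0, "open failed")
ftruncate(fd, off_t(length))                        // size the backing file

// MAP_SHARED so evicted pages can be re-read from (and written back to) the file.
let base = mmap(nil, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0)!
precondition(base != UnsafeMutableRawPointer(bitPattern: -1), "mmap failed") // MAP_FAILED

// Hint the kernel to fault pages in ahead of first use (~110 -> ~6 faults, #226).
madvise(base, length, MADV_WILLNEED)

// Wrap the mapping in a Metal buffer without copying (#224). On UMA the GPU
// reads the same physical pages the CPU just touched.
let device = MTLCreateSystemDefaultDevice()!
let kvBuffer = device.makeBuffer(bytesNoCopy: base,
                                 length: length,
                                 options: .storageModeShared,
                                 deallocator: { ptr, len in _ = munmap(ptr, len) })!
```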
What We Need
- UMA-aware KV cache allocation: How should KV cache pages be allocated in unified memory? Can the CPU scheduler read KV cache metadata directly, without a GPU→CPU transfer? (On NVIDIA this requires an explicit copy.)
- mmap-backed KV cache for eviction: Can KV cache pages be backed by mmap'd files? When memory pressure rises, `MADV_DONTNEED` could evict pages to SSD, and page faults would reload them on demand. Is this viable given the 15-70x bandwidth gap? (See the eviction/prefetch sketch after this list.)
- SLC utilization for hot KV cache: The System Level Cache (8-96MB) provides ~2x DRAM bandwidth. For a serving workload, the most recent K/V tokens for active sequences are the hottest data. Can we engineer cache-friendly access patterns that keep hot KV pages in SLC?
- Residency set lifecycle for growing KV cache: As new tokens are generated, the KV cache grows. How should residency sets be managed? Add pages incrementally? Batch updates? What's the overhead? (See the residency-set sketch after this list.)
- Multi-model KV cache isolation: When serving multiple models, each has its own KV cache. How should unified memory be partitioned fairly? Can mmap with a separate file per model give clean isolation?
- PagedAttention adaptation for UMA: Does the original PagedAttention design (page table, block mapping) even make sense on UMA, where there is no separate GPU VRAM to manage? Or is a simpler design better?
- Prefetch strategies: For continuous batching, the scheduler knows which sequences will be processed next. Can `MADV_WILLNEED` or an explicit prefetch prepare KV cache pages before the attention kernel needs them? (Also covered in the eviction/prefetch sketch.)
- SSD-backed KV cache (radical approach): With MTLIOCommandQueue, could the KV cache live on SSD and be streamed to the GPU on demand? At 7.9 GB/s SSD bandwidth, how many tokens/s could this support? (See the streaming sketch after this list.)
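
To make the eviction and prefetch questions concrete, a minimal sketch over a file-backed KV region like the one above. Which page ranges belong to which sequence is a hypothetical scheduler detail, and whether macOS flushes dirty MAP_SHARED pages to the file before dropping them under `MADV_DONTNEED` is exactly one of the open questions here.

```swift
import Darwin

// Eviction and prefetch over the mmap'd KV region from the earlier sketch.
let pageSize = Int(getpagesize())

// Under memory pressure: drop a cold sequence's pages. For a MAP_SHARED file
// mapping the data should be recoverable from the backing file, but the exact
// macOS write-back behavior needs testing before trusting this with live KV.
func evictPages(base: UnsafeMutableRawPointer, pages: Range<Int>) {
    madvise(base + pages.lowerBound * pageSize, pages.count * pageSize, MADV_DONTNEED)
}

// Before the attention kernel runs: the continuous-batching scheduler knows
// the next sequences, so fault their pages in early (#226).
func prefetchPages(base: UnsafeMutableRawPointer, pages: Range<Int>) {
    madvise(base + pages.lowerBound * pageSize, pages.count * pageSize, MADV_WILLNEED)
}
```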
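For the residency-set lifecycle question, a minimal sketch using the macOS 15 MTLResidencySet API (#169). The batch-then-commit policy and its threshold are assumptions to be profiled, not a recommendation.

```swift
import Metal

// Incremental residency-set maintenance for a growing KV cache (#169).
let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!

let desc = MTLResidencySetDescriptor()
desc.label = "kv-cache"
desc.initialCapacity = 256
let residency = try! device.makeResidencySet(descriptor: desc)
queue.addResidencySet(residency)   // keeps committed allocations resident for this queue's work

var staged = 0

// Call as each new KV page (an MTLBuffer) is allocated during decode.
func onKVPageAllocated(_ page: MTLBuffer) {
    residency.addAllocation(page)  // stage the new page; not yet resident
    staged += 1
    if staged >= 16 {              // hypothetical batch size: amortize commit() overhead
        residency.commit()         // one commit per batch, not per token
        staged = 0
    }
}
```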
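And for the SSD-backed idea, a sketch of loading one KV page through MTLIOCommandQueue, with the back-of-envelope throughput math the question asks for. The file path, page size, offsets, and model geometry are illustrative assumptions.

```swift
import Foundation
import Metal

// Loading one KV page from SSD via MTLIOCommandQueue (Metal fast resource loading).
let device = MTLCreateSystemDefaultDevice()!
let ioQueue = try! device.makeIOCommandQueue(descriptor: MTLIOCommandQueueDescriptor())
let file = try! device.makeIOFileHandle(url: URL(fileURLWithPath: "/tmp/kv_cache.bin"))

let pageBytes = 512 * 1024
let gpuPage = device.makeBuffer(length: pageBytes, options: .storageModeShared)!

let io = ioQueue.makeCommandBuffer()
io.load(gpuPage, offset: 0, size: pageBytes, sourceHandle: file, sourceHandleOffset: 0)
io.commit()
io.waitUntilCompleted()

// Back-of-envelope for the tokens/s question (assumed Llama-7B-like geometry):
//   per-token KV = 2 (K+V) x 32 layers x 32 heads x 128 dim x 2 B (FP16) = 0.5 MiB
//   one decode step over a 4096-token context reads ~2 GiB of KV,
//   so 7.9 GB/s of SSD bandwidth caps a fully SSD-resident cache at roughly
//   3-4 tokens/s per sequence. Viable only if hot pages stay in DRAM.
```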
Research Areas
Area 1: vLLM-MLX KV Cache Implementation
Source: github.com/vllm-project/vllm MLX backend
Research targets:
- How vLLM-MLX manages KV cache (paged? contiguous? hybrid?)
- Memory allocation strategy (MLX metal allocator vs custom)
- Cache eviction/reuse policy
- Prefix caching: how shared prefixes avoid duplicate KV storage
Area 2: MLX Attention Kernel KV Cache Access
Source: mlx/backend/metal/kernels/steel/attn/
Research targets:
- How the attention kernel reads KV cache (contiguous buffer? offset table?)
- Memory access pattern (sequential vs strided vs random)
- Can KV cache be non-contiguous (page table indirection in the kernel)? See the block-table sketch after this list.
- Float32 vs FP16 KV cache storage tradeoffs
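
To ground the non-contiguity question, a minimal sketch of the PagedAttention-style indirection the kernel would need to perform per access. The struct, names, and block size are hypothetical.

```swift
// PagedAttention-style block mapping: logical token -> physical KV page.
struct BlockTable {
    let blockSize: Int   // tokens per physical KV page, e.g. 16
    var blocks: [Int]    // blocks[logicalBlock] = physical page index

    // Physical location of one token's K/V entry.
    func locate(token: Int) -> (page: Int, slot: Int) {
        (blocks[token / blockSize], token % blockSize)
    }
}

let table = BlockTable(blockSize: 16, blocks: [7, 2, 9])
let (page, slot) = table.locate(token: 37)   // -> page 9, slot 5
```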
Area 3: UMA Memory Management Under Pressure
Source: macOS unified memory documentation, mmap behavior testing
Research targets:
- What happens when GPU buffers and ordinary system allocations compete for the same memory pool?
- macOS memory pressure response for mmap'd Metal buffers
- Can `vm_allocate` + manual page management give finer control than mmap?
- Interaction between residency sets and the macOS memory compressor
Area 4: SLC Profiling for KV Cache
Source: Metal System Trace, GPU profiling tools
Research targets:
- SLC hit rate for attention kernel memory accesses
- Optimal KV cache layout for SLC utilization
- Whether interleaved K/V or separate K and V arrays are better for caching (see the layout sketch after this list)
- Impact of sequence length on cache behavior
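
To make the layout comparison concrete, the two candidate address schemes for one head's K stream, sketched as offset math. Dimensions and names are illustrative.

```swift
// Two KV layouts to compare for SLC hit rate (FP16, headDim 128 assumed).
let headDim = 128
let elemBytes = 2   // FP16

// Separate K and V arrays: K reads walk memory with stride headDim * elemBytes.
func kOffsetSeparate(token: Int) -> Int {
    token * headDim * elemBytes
}

// Interleaved [K0 V0 K1 V1 ...]: K stride doubles, but K and V for the same
// token land in adjacent cache lines, which may help decode steps reading both.
func kOffsetInterleaved(token: Int) -> Int {
    token * 2 * headDim * elemBytes
}
```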
Area 5: Prior Art — Offloading and Tiered Memory
Source: Academic papers on KV cache offloading
Research targets:
- FlexGen (NVIDIA) — KV cache offloading to CPU/SSD. How does this map to UMA?
- InfiniGen — prefetch-based KV cache management
- vLLM prefix caching — content-based deduplication
- SpecInfer — speculative decoding with shared KV cache
Impact
- Inference serving: KV cache is the primary memory bottleneck for serving concurrent users
- Competitive differentiation: UMA-aware KV cache exploits hardware properties that NVIDIA's discrete-VRAM architecture cannot replicate
- Distributed inference: Cross-node KV cache coordination is a prerequisite for multi-Mac serving
- Model router: Fast model switching with per-model KV cache isolation
Recommended KB Addition
Skills: unified-memory, mlx-compute, gpu-perf, gpu-io
Topics: KV cache allocation on UMA, mmap-backed cache, SLC utilization, PagedAttention adaptation
Estimated findings: 10-15 new findings (mix of extracted knowledge and original profiling results)