Description
Context
During inference serving analysis, KV cache management was identified as a critical design problem that is fundamentally different on Apple Silicon compared to NVIDIA. On NVIDIA, PagedAttention manages scarce GPU VRAM. On Apple Silicon UMA, CPU and GPU share the same memory pool — enabling novel designs that nobody has documented. Current KB depth: 0%.
Gap Description
KV cache management determines serving throughput, maximum concurrent users, and maximum context length. vLLM's PagedAttention on NVIDIA was a breakthrough because it eliminated memory fragmentation in scarce GPU VRAM. On Apple Silicon, the constraints are different: memory is unified but bandwidth is lower (120-546 GB/s vs 3.35 TB/s on H100). Novel approaches exploiting UMA properties are possible but undocumented.
What We Have (relevant but indirect)
- #110: UMA serves as both system RAM and GPU VRAM simultaneously
- #224: mmap + bytesNoCopy creates file-backed Metal buffers with on-demand page faults
- #226: MADV_WILLNEED reduces page faults from ~110 to ~6 for mmap'd buffers (see the sketch after this list)
- #169: Residency sets prevent OS from reclaiming GPU memory (250ms improvement in llama.cpp)
- #250: 15-70x memory-to-storage bandwidth gap (GPU compute is memory-bound, not storage-bound)
- #113: SLC provides ~2x DRAM bandwidth (8-96MB cache depending on chip)
- #362: MLX KV cache stays in place — no device-to-device transfer needed
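
Taken together, #224 and #226 suggest a file-backed KV buffer along these lines. A minimal sketch in Swift; the path, size, and storage options are illustrative assumptions, not a measured configuration.

```swift
import Metal
import Darwin

// File-backed, zero-copy Metal buffer (#224), warmed with MADV_WILLNEED (#226).
let length = 64 * 1024 * 1024                       // 64 MiB, page-aligned
let fd = open("/tmp/kv_cache.bin", O_RDWR | O_CREAT, 0o644)
precondition(fd >= 0, "open failed")
ftruncate(fd, off_t(length))                        // size the backing file

// MAP_SHARED so evicted pages can be re-read from (and written back to) the file.
let base = mmap(nil, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0)!
precondition(base != UnsafeMutableRawPointer(bitPattern: -1), "mmap failed") // MAP_FAILED

// Hint the kernel to fault pages in ahead of first use (~110 -> ~6 faults, #226).
madvise(base, length, MADV_WILLNEED)

// Wrap the mapping in a Metal buffer without copying (#224). On UMA the GPU
// reads the same physical pages the CPU just touched.
let device = MTLCreateSystemDefaultDevice()!
let kvBuffer = device.makeBuffer(bytesNoCopy: base,
                                 length: length,
                                 options: .storageModeShared,
                                 deallocator: { ptr, len in _ = munmap(ptr, len) })!
```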
What We Need
- UMA-aware KV cache allocation: How should KV cache pages be allocated in unified memory? Can the CPU scheduler read KV cache metadata directly, without a GPU→CPU transfer? (On NVIDIA this requires an explicit copy.)
- mmap-backed KV cache for eviction: Can KV cache pages be backed by mmap'd files? When memory pressure rises, `MADV_DONTNEED` could evict pages to SSD, and page faults would reload them on demand. Is this viable given the 15-70x bandwidth gap? (See the eviction/prefetch sketch after this list.)
- SLC utilization for hot KV cache: The System Level Cache (8-96MB) provides ~2x DRAM bandwidth. For a serving workload, the most recent K/V tokens for active sequences are the hottest data. Can we engineer cache-friendly access patterns that keep hot KV pages in SLC?
- Residency set lifecycle for growing KV cache: As new tokens are generated, the KV cache grows. How should residency sets be managed? Add pages incrementally? Batch updates? What's the overhead? (See the residency-set sketch after this list.)
- Multi-model KV cache isolation: When serving multiple models, each has its own KV cache. How should unified memory be partitioned fairly? Can mmap with a separate file per model give clean isolation?
- PagedAttention adaptation for UMA: Does the original PagedAttention design (page table, block mapping) even make sense on UMA, where there is no separate GPU VRAM to manage? Or is a simpler design better?
- Prefetch strategies: For continuous batching, the scheduler knows which sequences will be processed next. Can `MADV_WILLNEED` or an explicit prefetch prepare KV cache pages before the attention kernel needs them? (Also covered in the eviction/prefetch sketch.)
- SSD-backed KV cache (radical approach): With MTLIOCommandQueue, could the KV cache live on SSD and be streamed to the GPU on demand? At 7.9 GB/s SSD bandwidth, how many tokens/s could this support? (See the streaming sketch after this list.)
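
To make the eviction and prefetch questions concrete, a minimal sketch over a file-backed KV region like the one above. Which page ranges belong to which sequence is a hypothetical scheduler detail, and whether macOS flushes dirty MAP_SHARED pages to the file before dropping them under `MADV_DONTNEED` is exactly one of the open questions here.

```swift
import Darwin

// Eviction and prefetch over the mmap'd KV region from the earlier sketch.
let pageSize = Int(getpagesize())

// Under memory pressure: drop a cold sequence's pages. For a MAP_SHARED file
// mapping the data should be recoverable from the backing file, but the exact
// macOS write-back behavior needs testing before trusting this with live KV.
func evictPages(base: UnsafeMutableRawPointer, pages: Range<Int>) {
    madvise(base + pages.lowerBound * pageSize, pages.count * pageSize, MADV_DONTNEED)
}

// Before the attention kernel runs: the continuous-batching scheduler knows
// the next sequences, so fault their pages in early (#226).
func prefetchPages(base: UnsafeMutableRawPointer, pages: Range<Int>) {
    madvise(base + pages.lowerBound * pageSize, pages.count * pageSize, MADV_WILLNEED)
}
```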
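For the residency-set lifecycle question, a minimal sketch using the macOS 15 MTLResidencySet API (#169). The batch-then-commit policy and its threshold are assumptions to be profiled, not a recommendation.

```swift
import Metal

// Incremental residency-set maintenance for a growing KV cache (#169).
let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!

let desc = MTLResidencySetDescriptor()
desc.label = "kv-cache"
desc.initialCapacity = 256
let residency = try! device.makeResidencySet(descriptor: desc)
queue.addResidencySet(residency)   // keeps committed allocations resident for this queue's work

var staged = 0

// Call as each new KV page (an MTLBuffer) is allocated during decode.
func onKVPageAllocated(_ page: MTLBuffer) {
    residency.addAllocation(page)  // stage the new page; not yet resident
    staged += 1
    if staged >= 16 {              // hypothetical batch size: amortize commit() overhead
        residency.commit()         // one commit per batch, not per token
        staged = 0
    }
}
```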
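And for the SSD-backed idea, a sketch of loading one KV page through MTLIOCommandQueue, with the back-of-envelope throughput math the question asks for. The file path, page size, offsets, and model geometry are illustrative assumptions.

```swift
import Foundation
import Metal

// Loading one KV page from SSD via MTLIOCommandQueue (Metal fast resource loading).
let device = MTLCreateSystemDefaultDevice()!
let ioQueue = try! device.makeIOCommandQueue(descriptor: MTLIOCommandQueueDescriptor())
let file = try! device.makeIOFileHandle(url: URL(fileURLWithPath: "/tmp/kv_cache.bin"))

let pageBytes = 512 * 1024
let gpuPage = device.makeBuffer(length: pageBytes, options: .storageModeShared)!

let io = ioQueue.makeCommandBuffer()
io.load(gpuPage, offset: 0, size: pageBytes, sourceHandle: file, sourceHandleOffset: 0)
io.commit()
io.waitUntilCompleted()

// Back-of-envelope for the tokens/s question (assumed Llama-7B-like geometry):
//   per-token KV = 2 (K+V) x 32 layers x 32 heads x 128 dim x 2 B (FP16) = 0.5 MiB
//   one decode step over a 4096-token context reads ~2 GiB of KV,
//   so 7.9 GB/s of SSD bandwidth caps a fully SSD-resident cache at roughly
//   3-4 tokens/s per sequence. Viable only if hot pages stay in DRAM.
```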
Research Areas
Area 1: vLLM-MLX KV Cache Implementation
Source: github.com/vllm-project/vllm MLX backend
Research targets:
- How vLLM-MLX manages KV cache (paged? contiguous? hybrid?)
- Memory allocation strategy (MLX metal allocator vs custom)
- Cache eviction/reuse policy
- Prefix caching: how shared prefixes avoid duplicate KV storage
Area 2: MLX Attention Kernel KV Cache Access
Source: mlx/backend/metal/kernels/steel/attn/
Research targets:
- How the attention kernel reads KV cache (contiguous buffer? offset table?)
- Memory access pattern (sequential vs strided vs random)
- Can KV cache be non-contiguous (page table indirection in the kernel)? See the block-table sketch after this list.
- Float32 vs FP16 KV cache storage tradeoffs
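
To ground the non-contiguity question, a minimal sketch of the PagedAttention-style indirection the kernel would need to perform per access. The struct, names, and block size are hypothetical.

```swift
// PagedAttention-style block mapping: logical token -> physical KV page.
struct BlockTable {
    let blockSize: Int   // tokens per physical KV page, e.g. 16
    var blocks: [Int]    // blocks[logicalBlock] = physical page index

    // Physical location of one token's K/V entry.
    func locate(token: Int) -> (page: Int, slot: Int) {
        (blocks[token / blockSize], token % blockSize)
    }
}

let table = BlockTable(blockSize: 16, blocks: [7, 2, 9])
let (page, slot) = table.locate(token: 37)   // -> page 9, slot 5
```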
Area 3: UMA Memory Management Under Pressure
Source: macOS unified memory documentation, mmap behavior testing
Research targets:
- What happens when GPU buffers and ordinary system allocations compete for the same memory pool?
- macOS memory pressure response for mmap'd Metal buffers
- Can `vm_allocate` + manual page management give finer control than mmap?
- Interaction between residency sets and the macOS memory compressor
Area 4: SLC Profiling for KV Cache
Source: Metal System Trace, GPU profiling tools
Research targets:
- SLC hit rate for attention kernel memory accesses
- Optimal KV cache layout for SLC utilization
- Whether interleaved K/V or separate K and V arrays are better for caching (see the layout sketch after this list)
- Impact of sequence length on cache behavior
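
To make the layout comparison concrete, the two candidate address schemes for one head's K stream, sketched as offset math. Dimensions and names are illustrative.

```swift
// Two KV layouts to compare for SLC hit rate (FP16, headDim 128 assumed).
let headDim = 128
let elemBytes = 2   // FP16

// Separate K and V arrays: K reads walk memory with stride headDim * elemBytes.
func kOffsetSeparate(token: Int) -> Int {
    token * headDim * elemBytes
}

// Interleaved [K0 V0 K1 V1 ...]: K stride doubles, but K and V for the same
// token land in adjacent cache lines, which may help decode steps reading both.
func kOffsetInterleaved(token: Int) -> Int {
    token * 2 * headDim * elemBytes
}
```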
Area 5: Prior Art — Offloading and Tiered Memory
Source: Academic papers on KV cache offloading
Research targets:
- FlexGen (NVIDIA) — KV cache offloading to CPU/SSD. How does this map to UMA?
- InfiniGen — prefetch-based KV cache management
- vLLM prefix caching — content-based deduplication
- SpecInfer — speculative decoding with shared KV cache
Impact
- Inference serving: KV cache is the primary memory bottleneck for serving concurrent users
- Competitive differentiation: UMA-aware KV cache exploits hardware properties that NVIDIA's discrete-VRAM architecture cannot replicate
- Distributed inference: Cross-node KV cache coordination is a prerequisite for multi-Mac serving
- Model router: Fast model switching with per-model KV cache isolation
Recommended KB Addition
Skills: unified-memory, mlx-compute, gpu-perf, gpu-io
Topics: KV cache allocation on UMA, mmap-backed cache, SLC utilization, PagedAttention adaptation
Estimated findings: 10-15 new findings (mix of extracted knowledge and original profiling results)