feat(hardware): add KV cache memory estimator with 256-token rounding #65

Merged: inureyes merged 2 commits into main from feature/issue-54-kv-cache-estimator on May 21, 2026
@inureyes
Member

Summary

Replaces the flat KV_CACHE_HEADROOM_GB = 2 constant in recommend_quantization() with a proper KV-cache memory estimator. The new estimator computes exact bytes from model architecture parameters using the formula num_layers × 2 × num_kv_heads × head_dim × elem_bytes × round_up(ctx_len, 256) × batch, matching the actual buffer reservation performed by KVCache.
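
The formula above can be sketched as a small pure function. This is an illustrative reconstruction rather than the code merged in this PR: the argument order, the u64 types, and the round_up helper are assumptions, and only the names kv_cache_bytes and KV_CACHE_ALLOC_STEP come from the description.

```rust
/// Allocation step in tokens (the PR names this constant KV_CACHE_ALLOC_STEP).
const KV_CACHE_ALLOC_STEP: u64 = 256;

/// Round `n` up to the next multiple of `step`. Hypothetical helper; `step` must be non-zero.
fn round_up(n: u64, step: u64) -> u64 {
    (n + step - 1) / step * step
}

/// Estimated KV-cache reservation in bytes, summed over all layers.
/// The signature is assumed for illustration; the real function may differ.
fn kv_cache_bytes(
    num_layers: u64,
    num_kv_heads: u64,
    head_dim: u64,
    elem_bytes: u64, // 2 for FP16/BF16, 1 for INT8
    ctx_len: u64,
    batch: u64,
) -> u64 {
    // The context length is padded to the allocation step before multiplying.
    let tokens = round_up(ctx_len, KV_CACHE_ALLOC_STEP);
    // The factor 2 accounts for storing both K and V.
    num_layers * 2 * num_kv_heads * head_dim * elem_bytes * tokens * batch
}
```

Note that in this sketch a context length of 0 rounds to 0 tokens; whether KVCache reserves a minimum of one step in that case is not stated in the PR.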

What changed

  • src/lib/mlxcel-core/src/hardware.rs: Added KV_CACHE_ALLOC_STEP = 256 constant, kv_cache_bytes() pure function with 256-token rounding, KvCacheParams struct, kv_cache_bytes_from_params() config-driven wrapper, and updated recommend_quantization() to accept Option<u64> KV headroom instead of the hardcoded constant. All existing callers pass None for backward compatibility.
  • src/execution/quant_advisor.rs: Updated import to include new types, added estimate_kv_cache_bytes_from_path() / estimate_kv_cache_bytes_from_config() helpers that extract num_hidden_layers, num_key_value_heads, hidden_size, num_attention_heads from config.json, and wired them into advise_quantization() to supply computed KV headroom. Added kv_cache_bytes: Option<u64> to QuantAdvice for the future unified estimator.
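
As a rough illustration of how the four config.json fields listed above could map onto estimator inputs, the sketch below builds a parameter struct from an already-parsed config. Everything here is hypothetical: the ModelConfig type, the field layout of KvCacheParams, the params_from_config helper, the derivation head_dim = hidden_size / num_attention_heads, and the fallback of num_key_value_heads to num_attention_heads when the key is absent are all assumptions, not the merged implementation.

```rust
/// Hypothetical, already-parsed subset of config.json.
struct ModelConfig {
    num_hidden_layers: u64,
    num_attention_heads: u64,
    /// Assumed optional: when missing, treat the model as plain MHA.
    num_key_value_heads: Option<u64>,
    hidden_size: u64,
}

/// Assumed shape of KvCacheParams; the real struct may carry different fields.
struct KvCacheParams {
    num_layers: u64,
    num_kv_heads: u64,
    head_dim: u64,
    int8_kv: bool,
    ctx_len: u64,
    batch: u64,
}

impl KvCacheParams {
    /// 1 byte per element for INT8 KV, 2 bytes for FP16/BF16.
    fn elem_bytes(&self) -> u64 {
        if self.int8_kv { 1 } else { 2 }
    }
}

/// Build estimator parameters from a config. Returns None when the head count
/// is zero, since head_dim cannot be derived in that case.
fn params_from_config(cfg: &ModelConfig, ctx_len: u64, int8_kv: bool) -> Option<KvCacheParams> {
    if cfg.num_attention_heads == 0 {
        return None;
    }
    Some(KvCacheParams {
        num_layers: cfg.num_hidden_layers,
        // Assumption: fall back to the attention head count when no KV head count is given.
        num_kv_heads: cfg.num_key_value_heads.unwrap_or(cfg.num_attention_heads),
        // Assumption: head_dim is derived rather than read from an explicit key.
        head_dim: cfg.hidden_size / cfg.num_attention_heads,
        int8_kv,
        ctx_len,
        batch: 1,
    })
}
```

A config that declares an explicit head dimension different from hidden_size / num_attention_heads would need separate handling, which this sketch does not attempt.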

Test plan

  • cargo test --lib -p mlxcel-core hardware::tests — all 24 tests pass, covering: dense MHA, GQA (kv_heads < heads), long context (128K), INT8 KV (half bytes), 256-token rounding edge cases (ctx=1, 255, 256, 257), and recommend_quantization correctness with both flat and computed headroom
  • cargo test --lib -p mlxcel quant_advisor::tests — all 6 tests pass
  • cargo clippy --lib --tests -p mlxcel-core -- -D warnings — clean
  • cargo check --lib --tests -p mlxcel — clean

Closes #54

@inureyes added the labels type:enhancement (New features, capabilities, or significant additions), priority:medium (Medium priority), area:core (mlxcel-core: MLX FFI, primitives, KV cache, layers), status:review (Under review), and status:done (Completed), and removed the label status:review (Under review), May 21, 2026
inureyes added 2 commits May 21, 2026 21:48
Add kv_cache_bytes() pure function computing the total KV reservation across all layers using
the formula: num_layers × 2 (K+V) × num_kv_heads × head_dim × elem_bytes
× round_up(ctx_len, 256) × batch. Context length is rounded up to the next
multiple of 256 to match KVCache's step-aligned pre-allocation.

Add KvCacheParams struct and kv_cache_bytes_from_params() config-driven wrapper.
KvCacheParams.int8_kv controls elem_bytes (1 for INT8, 2 for FP16/BF16),
honoring --cache-type-k / --cache-type-v / --kv-cache-mode flags.

Replace the flat KV_CACHE_HEADROOM_GB = 2 constant in recommend_quantization()
with an optional kv_cache_headroom_bytes parameter. When None, the 2 GiB fallback
preserves backward compatibility; when Some, the computed bytes are converted to
GiB (ceiling) and used for accurate fit decisions. Existing callers pass None.
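
The fallback and ceiling conversion described above might look like the following. This is a minimal sketch under assumptions: the helper name kv_headroom_gib is invented for illustration, and the real recommend_quantization() may fold this logic inline.

```rust
const GIB: u64 = 1 << 30;

/// Flat fallback used when no computed estimate is supplied.
const KV_CACHE_HEADROOM_GB: u64 = 2;

/// Hypothetical helper: resolve the KV headroom in whole GiB.
/// `None` keeps the legacy 2 GiB value; `Some(bytes)` is rounded up.
fn kv_headroom_gib(kv_cache_headroom_bytes: Option<u64>) -> u64 {
    match kv_cache_headroom_bytes {
        // Ceiling division written so it cannot overflow near u64::MAX.
        Some(bytes) => bytes / GIB + u64::from(bytes % GIB != 0),
        None => KV_CACHE_HEADROOM_GB,
    }
}
```

Rounding up errs on the conservative side: an estimate of one byte over 1 GiB reserves 2 GiB, so a fit decision never undercounts the cache.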

Wire kv_cache_bytes_from_params into quant_advisor.advise_quantization(): reads
num_layers / num_kv_heads / head_dim from config.json (8192-token default context)
and passes the result to recommend_quantization, replacing the flat constant path.
Also expose kv_cache_bytes as QuantAdvice.kv_cache_bytes for future unified estimator.

Acceptance criteria satisfied:
- kv_cache_dense_mha: 32L/32H/D128/FP16/8K = 4 GiB
- kv_cache_gqa_fewer_kv_heads: 32L/8H/D128/FP16/8K = 1 GiB (GQA)
- kv_cache_long_context_128k: 128K ctx = 16 GiB
- kv_cache_int8_half_memory: INT8 (elem=1) is exactly half of FP16 (elem=2)
- kv_cache_256_token_rounding: ctx=255→256, ctx=257→512, ctx=256→256
- recommend_quant_long_context_tightens_headroom: 8B/24GB: 8K→FP16, 128K→INT4
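
The first three figures can be checked by hand, since every factor is a power of two. The working below assumes batch = 1 throughout, and assumes the 128K case reuses the GQA configuration (32 layers, 8 KV heads, head_dim 128, FP16), which the criterion does not state explicitly but which is the configuration that yields 16 GiB.

```latex
\begin{align*}
\text{dense MHA, 8K:}\quad
  & 32 \times 2 \times 32 \times 128 \times 2 \times 8192
    = 2^{5+1+5+7+1+13} = 2^{32}\ \text{bytes} = 4\ \text{GiB} \\
\text{GQA, 8K:}\quad
  & 32 \times 2 \times 8 \times 128 \times 2 \times 8192
    = 2^{5+1+3+7+1+13} = 2^{30}\ \text{bytes} = 1\ \text{GiB} \\
\text{GQA, 128K:}\quad
  & 32 \times 2 \times 8 \times 128 \times 2 \times 131072
    = 2^{5+1+3+7+1+17} = 2^{34}\ \text{bytes} = 16\ \text{GiB}
\end{align*}
```

Both 8192 and 131072 are already multiples of 256, so the rounding step leaves these three cases unchanged.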
@inureyes force-pushed the feature/issue-54-kv-cache-estimator branch from 6501224 to 5ab0fef, May 21, 2026 12:50
@inureyes merged commit a436b97 into main May 21, 2026
4 checks passed
Labels

area:core mlxcel-core: MLX FFI, primitives, KV cache, layers priority:medium Medium priority status:done Completed type:enhancement New features, capabilities, or significant additions

Development

Successfully merging this pull request may close these issues.

feat: KV cache memory estimator