feat(hardware): add KV cache memory estimator with 256-token rounding by inureyes · Pull Request #65 · lablup/mlxcel

inureyes · 2026-05-21T12:24:54Z

Summary

Replaces the flat KV_CACHE_HEADROOM_GB = 2 constant in recommend_quantization() with a proper KV-cache memory estimator. The new estimator computes exact bytes from model architecture parameters using the formula num_layers × 2 × num_kv_heads × head_dim × elem_bytes × round_up(ctx_len, 256) × batch, matching the actual buffer reservation performed by KVCache.
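The formula above can be sketched as a small pure function. This is an illustration only: the constant name KV_CACHE_ALLOC_STEP and the function name kv_cache_bytes come from this PR, but the parameter order, the integer types, and the round_up helper are assumptions and are not copied from hardware.rs.

```rust
/// Allocation step in tokens (constant name taken from the PR description).
const KV_CACHE_ALLOC_STEP: u64 = 256;

/// Hypothetical helper: round `n` up to the next multiple of `step`.
/// `step` must be non-zero.
fn round_up(n: u64, step: u64) -> u64 {
    n.div_ceil(step) * step
}

/// Sketch of the estimator. The signature is assumed for illustration.
fn kv_cache_bytes(
    num_layers: u64,
    num_kv_heads: u64,
    head_dim: u64,
    elem_bytes: u64,
    ctx_len: u64,
    batch: u64,
) -> u64 {
    // The cache is reserved in 256-token steps, so size it for the rounded length.
    let tokens = round_up(ctx_len, KV_CACHE_ALLOC_STEP);
    // The factor 2 accounts for the separate K and V buffers in every layer.
    num_layers * 2 * num_kv_heads * head_dim * elem_bytes * tokens * batch
}
```

With 32 layers, 32 KV heads, head_dim 128, FP16 (2 bytes per element), an 8192-token context, and batch 1, this yields 4 GiB, which agrees with the dense MHA figure quoted in the acceptance criteria.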

What changed

src/lib/mlxcel-core/src/hardware.rs: Added KV_CACHE_ALLOC_STEP = 256 constant, kv_cache_bytes() pure function with 256-token rounding, KvCacheParams struct, kv_cache_bytes_from_params() config-driven wrapper, and updated recommend_quantization() to accept Option<u64> KV headroom instead of the hardcoded constant. All existing callers pass None for backward compatibility.
src/execution/quant_advisor.rs: Updated import to include new types, added estimate_kv_cache_bytes_from_path() / estimate_kv_cache_bytes_from_config() helpers that extract num_hidden_layers, num_key_value_heads, hidden_size, num_attention_heads from config.json, and wired them into advise_quantization() to supply computed KV headroom. Added kv_cache_bytes: Option<u64> to QuantAdvice for the future unified estimator.
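As a rough sketch of how the four config.json fields listed above could map onto the estimator's inputs: the struct and function names below are hypothetical, and two behaviors are assumptions rather than something this PR states. First, head_dim is derived as hidden_size / num_attention_heads. Second, a missing num_key_value_heads falls back to num_attention_heads (plain MHA). Some model configs specify head_dim explicitly; the sketch ignores that case.

```rust
/// Hypothetical stand-in for the fields read from config.json.
struct ModelConfig {
    num_hidden_layers: u64,
    num_attention_heads: u64,
    /// Assumed optional: treated as equal to num_attention_heads when absent.
    num_key_value_heads: Option<u64>,
    hidden_size: u64,
}

/// Shape inputs for the KV estimator. Field names are illustrative.
#[derive(Debug, PartialEq)]
struct KvShape {
    num_layers: u64,
    num_kv_heads: u64,
    head_dim: u64,
}

/// Returns None when the config cannot yield a head dimension.
fn kv_shape_from_config(cfg: &ModelConfig) -> Option<KvShape> {
    if cfg.num_attention_heads == 0 {
        return None; // avoid dividing by zero on a malformed config
    }
    Some(KvShape {
        num_layers: cfg.num_hidden_layers,
        // Assumption: no explicit KV head count means one KV head per query head.
        num_kv_heads: cfg.num_key_value_heads.unwrap_or(cfg.num_attention_heads),
        // Assumption: head_dim is not given directly and is derived from hidden_size.
        head_dim: cfg.hidden_size / cfg.num_attention_heads,
    })
}
```

Returning an Option keeps a malformed config from producing a bogus estimate; a caller could then fall back to the flat headroom, although how the actual helpers handle that case is not described here.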

Test plan

cargo test --lib -p mlxcel-core hardware::tests — all 24 tests pass, covering: dense MHA, GQA (kv_heads < heads), long context (128K), INT8 KV (half bytes), 256-token rounding edge cases (ctx=1, 255, 256, 257), and recommend_quantization correctness with both flat and computed headroom
cargo test --lib -p mlxcel quant_advisor::tests — all 6 tests pass
cargo clippy --lib --tests -p mlxcel-core -- -D warnings — clean
cargo check --lib --tests -p mlxcel — clean

Closes #54

Add kv_cache_bytes() pure function computing the total KV reservation across all layers using the formula: num_layers × 2 (K+V) × num_kv_heads × head_dim × elem_bytes × round_up(ctx_len, 256) × batch. Context length is rounded up to the next multiple of 256 to match KVCache's step-aligned pre-allocation.

Add KvCacheParams struct and kv_cache_bytes_from_params() config-driven wrapper. KvCacheParams.int8_kv controls elem_bytes (1 for INT8, 2 for FP16/BF16), honoring --cache-type-k / --cache-type-v / --kv-cache-mode flags.

Replace the flat KV_CACHE_HEADROOM_GB = 2 constant in recommend_quantization() with an optional kv_cache_headroom_bytes parameter. When None, the 2 GiB fallback preserves backward compatibility; when Some, the computed bytes are converted to GiB (ceiling) and used for accurate fit decisions. Existing callers pass None.

Wire kv_cache_bytes_from_params into quant_advisor.advise_quantization(): reads num_layers / num_kv_heads / head_dim from config.json (8192-token default context) and passes the result to recommend_quantization, replacing the flat constant path. Also expose kv_cache_bytes as QuantAdvice.kv_cache_bytes for the future unified estimator.

Acceptance criteria satisfied:
- kv_cache_dense_mha: 32L/32H/D128/FP16/8K = 4 GiB
- kv_cache_gqa_fewer_kv_heads: 32L/8H/D128/FP16/8K = 1 GiB (GQA)
- kv_cache_long_context_128k: 128K ctx = 16 GiB
- kv_cache_int8_half_memory: INT8 (elem=1) is exactly half of FP16 (elem=2)
- kv_cache_256_token_rounding: ctx=255→256, ctx=257→512, ctx=256→256
- recommend_quant_long_context_tightens_headroom: 8B/24GB: 8K→FP16, 128K→INT4
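The optional-headroom behavior described in the commit message (2 GiB fallback for None, ceiling conversion to GiB for Some) might look roughly like the sketch below. The function name kv_headroom_gib and both constants are hypothetical, and whether recommend_quantization() works in integer or floating-point GiB is not stated in this PR, so integer arithmetic is an assumption.

```rust
const GIB: u64 = 1 << 30;

/// Assumed name for the legacy flat headroom of 2 GiB.
const FALLBACK_HEADROOM_GIB: u64 = 2;

/// Hypothetical helper: pick the KV headroom, in whole GiB, used for fit decisions.
fn kv_headroom_gib(kv_cache_headroom_bytes: Option<u64>) -> u64 {
    match kv_cache_headroom_bytes {
        // Round up so a partially used GiB still counts against available memory.
        Some(bytes) => bytes.div_ceil(GIB),
        // Callers that pass None keep the previous flat headroom.
        None => FALLBACK_HEADROOM_GIB,
    }
}
```

Rounding up errs on the side of reserving too much rather than too little: a computed requirement of 4 GiB plus one byte is treated as 5 GiB.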

inureyes added the type:enhancement, priority:medium, area:core, status:review, and status:done labels and removed the status:review label on May 21, 2026

inureyes added 2 commits May 21, 2026 21:48

style: apply cargo fmt

5ab0fef

inureyes force-pushed the feature/issue-54-kv-cache-estimator branch from 6501224 to 5ab0fef Compare May 21, 2026 12:50

inureyes merged commit a436b97 into main May 21, 2026
4 checks passed

inureyes mentioned this pull request May 21, 2026

Epic: Pre-load model memory requirement estimation #52

Closed

4 tasks

inureyes merged 2 commits into main from feature/issue-54-kv-cache-estimator
