fix: tighten memory estimator preflight coverage by inureyes · Pull Request #68 · lablup/mlxcel

inureyes · 2026-05-21T13:58:19Z

Summary

Follow-up audit for epic #52 after reviewing the merged sub-issue work (#53, #54, #55, #56 and PRs #64-#67). The original architecture was in place, but several CLI/runtime paths could still report a memory estimate that did not match the load that would actually happen.

This PR tightens those gaps so the estimator remains conservative and consistent across inspect, generate --estimate-memory, serve --estimate-memory, and --recommend-quant.

Refs #52.

Review Findings Fixed

estimate_total_memory(..., batch, ...) accepted a batch argument but still used a KV-cache helper hardcoded to batch 1.
generate --estimate-memory estimated KV cache from --max-tokens only, so long prompts were not counted before model load.
serve --estimate-memory ignored serving concurrency/batching knobs and sized KV as batch 1.
inspect / serve --estimate-memory run before runtime initialization, so MLXCEL_MEMORY_LIMIT was not reflected in available-memory calculation.
inspect, generate, and serve manually inferred int8 KV only from split K/V flags, missing the legacy --kv-cache-mode int8 path and shared validation behavior.
Runtime estimate-vs-actual logging labeled under/over-estimation backwards.
KV-cache config parsing ignored explicit head_dim and common field aliases, underestimating models whose head dimension differs from hidden_size / num_attention_heads.
Safetensors header parsing silently skipped non-numeric shape dimensions.
Weight footprint estimation trusted stale model.safetensors.index.json::metadata.total_size even when the referenced shards were not the files the loader would use.

Changes

Route unified memory estimation through kv_cache_params_from_path(..., batch) so batch scales KV bytes.
Parse MLXCEL_MEMORY_LIMIT directly in the estimator before checking the already-applied MLX allocator cap.
Expand KV config parsing for dim, model_dim, n_heads, n_head, num_kv_heads, n_kv_heads, n_head_kv, multi_query_group_num, head_dim, and head_size.
Move generate --estimate-memory preflight after prompt rendering/tokenization but before model load, using prompt_tokens + max_tokens for KV context length.
Make inspect, generate, and serve use the shared resolve_kv_cache_mode helper.
Make serve --estimate-memory use ctx_size, max_kv_size, max_batch_size, n_parallel, and no_batch to derive the preflight shape.
Make weight_footprint_bytes trust index metadata only after validating shard filenames and existence, sum valid shard headers when total size is missing, and fall back to local safetensors headers for stale indexes.
Reject malformed safetensors headers with non-numeric shape dimensions instead of undercounting them.
Add regression tests for batch scaling, explicit head_dim, env memory caps, prompt+generation sizing, serve preflight shape, stale indexes, and malformed headers.

Validation

Passed:

cargo fmt --all -- --check
cargo clippy -p mlxcel --lib --bin mlxcel --tests -- -D warnings
cargo clippy -p mlxcel-core --lib --tests -- -D warnings
cargo test -p mlxcel memory_estimate --lib
cargo test -p mlxcel quant_advisor --lib
cargo test -p mlxcel --bin mlxcel
cargo test -p mlxcel-core weights::tests

Additional broader checks:

cargo test -p mlxcel --lib ran 2,893 tests and failed only models::sanitize_tests::load_and_sanitize_weights_dequantizes_nvfp4_gemma4_checkpoint. I verified the same test fails on origin/main with the same assertion, so it is not introduced by this PR.
cargo test -p mlxcel-core --lib ran 774 tests and failed two existing environment-sensitive FFI tests: ffi_tests::test_from_bytes_fp16_native_dtype and ffi_tests::test_memory_functions. The focused weights::tests suite covering this PR passed.

Security / Performance Notes

The estimator no longer accepts malformed safetensors shapes by silently dropping invalid dimensions.
Stale or invalid index metadata cannot force the preflight to use an unrelated total_size; it must match validated local shards or fall back to actual local safetensors headers.
No tensor payloads are read for estimation; only safetensors headers and config JSON are inspected, preserving pre-load behavior.
Batch/concurrency sizing is now conservative for server preflight, reducing the chance of admitting a model that later OOMs under configured serving load.

fix: tighten memory estimator preflight coverage

7a577f5

inureyes merged commit d2e7a84 into main May 21, 2026
4 checks passed

inureyes deleted the issue-52-memory-estimator-audit branch May 21, 2026 13:59

inureyes self-assigned this May 21, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: tighten memory estimator preflight coverage#68

fix: tighten memory estimator preflight coverage#68
inureyes merged 1 commit into
mainfrom
issue-52-memory-estimator-audit

inureyes commented May 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

inureyes commented May 21, 2026

Summary

Review Findings Fixed

Changes

Validation

Security / Performance Notes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant