Skip to content

fix: tighten memory estimator preflight coverage#68

Merged
inureyes merged 1 commit into
mainfrom
issue-52-memory-estimator-audit
May 21, 2026
Merged

fix: tighten memory estimator preflight coverage#68
inureyes merged 1 commit into
mainfrom
issue-52-memory-estimator-audit

Conversation

@inureyes
Copy link
Copy Markdown
Member

Summary

Follow-up audit for epic #52 after reviewing the merged sub-issue work (#53, #54, #55, #56 and PRs #64-#67). The original architecture was in place, but several CLI/runtime paths could still report a memory estimate that did not match the load that would actually happen.

This PR tightens those gaps so the estimator remains conservative and consistent across inspect, generate --estimate-memory, serve --estimate-memory, and --recommend-quant.

Refs #52.

Review Findings Fixed

  • estimate_total_memory(..., batch, ...) accepted a batch argument but still used a KV-cache helper hardcoded to batch 1.
  • generate --estimate-memory estimated KV cache from --max-tokens only, so long prompts were not counted before model load.
  • serve --estimate-memory ignored serving concurrency/batching knobs and sized KV as batch 1.
  • inspect / serve --estimate-memory run before runtime initialization, so MLXCEL_MEMORY_LIMIT was not reflected in available-memory calculation.
  • inspect, generate, and serve manually inferred int8 KV only from split K/V flags, missing the legacy --kv-cache-mode int8 path and shared validation behavior.
  • Runtime estimate-vs-actual logging labeled under/over-estimation backwards.
  • KV-cache config parsing ignored explicit head_dim and common field aliases, underestimating models whose head dimension differs from hidden_size / num_attention_heads.
  • Safetensors header parsing silently skipped non-numeric shape dimensions.
  • Weight footprint estimation trusted stale model.safetensors.index.json::metadata.total_size even when the referenced shards were not the files the loader would use.

Changes

  • Route unified memory estimation through kv_cache_params_from_path(..., batch) so batch scales KV bytes.
  • Parse MLXCEL_MEMORY_LIMIT directly in the estimator before checking the already-applied MLX allocator cap.
  • Expand KV config parsing for dim, model_dim, n_heads, n_head, num_kv_heads, n_kv_heads, n_head_kv, multi_query_group_num, head_dim, and head_size.
  • Move generate --estimate-memory preflight after prompt rendering/tokenization but before model load, using prompt_tokens + max_tokens for KV context length.
  • Make inspect, generate, and serve use the shared resolve_kv_cache_mode helper.
  • Make serve --estimate-memory use ctx_size, max_kv_size, max_batch_size, n_parallel, and no_batch to derive the preflight shape.
  • Make weight_footprint_bytes trust index metadata only after validating shard filenames and existence, sum valid shard headers when total size is missing, and fall back to local safetensors headers for stale indexes.
  • Reject malformed safetensors headers with non-numeric shape dimensions instead of undercounting them.
  • Add regression tests for batch scaling, explicit head_dim, env memory caps, prompt+generation sizing, serve preflight shape, stale indexes, and malformed headers.

Validation

Passed:

  • cargo fmt --all -- --check
  • cargo clippy -p mlxcel --lib --bin mlxcel --tests -- -D warnings
  • cargo clippy -p mlxcel-core --lib --tests -- -D warnings
  • cargo test -p mlxcel memory_estimate --lib
  • cargo test -p mlxcel quant_advisor --lib
  • cargo test -p mlxcel --bin mlxcel
  • cargo test -p mlxcel-core weights::tests

Additional broader checks:

  • cargo test -p mlxcel --lib ran 2,893 tests and failed only models::sanitize_tests::load_and_sanitize_weights_dequantizes_nvfp4_gemma4_checkpoint. I verified the same test fails on origin/main with the same assertion, so it is not introduced by this PR.
  • cargo test -p mlxcel-core --lib ran 774 tests and failed two existing environment-sensitive FFI tests: ffi_tests::test_from_bytes_fp16_native_dtype and ffi_tests::test_memory_functions. The focused weights::tests suite covering this PR passed.

Security / Performance Notes

  • The estimator no longer accepts malformed safetensors shapes by silently dropping invalid dimensions.
  • Stale or invalid index metadata cannot force the preflight to use an unrelated total_size; it must match validated local shards or fall back to actual local safetensors headers.
  • No tensor payloads are read for estimation; only safetensors headers and config JSON are inspected, preserving pre-load behavior.
  • Batch/concurrency sizing is now conservative for server preflight, reducing the chance of admitting a model that later OOMs under configured serving load.

@inureyes inureyes merged commit d2e7a84 into main May 21, 2026
4 checks passed
@inureyes inureyes deleted the issue-52-memory-estimator-audit branch May 21, 2026 13:59
@inureyes inureyes self-assigned this May 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant