feat(ffi): wrap MLX runtime memory APIs (active/peak/limit) by inureyes · Pull Request #66 · lablup/mlxcel

inureyes · 2026-05-21T12:34:14Z

Summary

Bind MLX's mlx/memory.h runtime counters and limit setters through the C++ FFI bridge so mlxcel can finally measure actual MLX-allocator residency and cap the working set. None of get_active_memory / get_peak_memory / get_cache_memory / set_memory_limit / set_cache_limit / reset_peak_memory were exposed before; this PR adds the bridge entries, layers a typed mlxcel_core::memory module on top, and wires the post-load residency log into both the CLI generate path and the server worker.
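The typed layer described in this summary might look roughly like the sketch below. It is not the code from `memory.rs`: `RawCounters`, `from_raw`, and `to_raw_limit` are hypothetical names used so the example stands alone, and the definition of `used_bytes()` as active plus cache is an assumption.

```rust
/// Hypothetical stand-in for the four values the raw FFI getters return.
/// The bridge hands back pointer-width integers (`usize`).
#[derive(Clone, Copy)]
pub struct RawCounters {
    pub active: usize,
    pub peak: usize,
    pub cache: usize,
    pub limit: usize,
}

/// Fixed-width view of the allocator counters, in bytes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MemorySnapshot {
    pub active: u64,
    pub peak: u64,
    pub cache: u64,
    pub limit: u64,
}

impl MemorySnapshot {
    /// Widen `usize` to `u64`; this is lossless on 32-bit and 64-bit hosts.
    pub fn from_raw(raw: RawCounters) -> Self {
        Self {
            active: raw.active as u64,
            peak: raw.peak as u64,
            cache: raw.cache as u64,
            limit: raw.limit as u64,
        }
    }

    /// Assumption: "used" means active bytes plus bytes held in the cache.
    pub fn used_bytes(&self) -> u64 {
        self.active.saturating_add(self.cache)
    }
}

/// Narrow a `u64` limit back to `usize` for a setter call, saturating
/// instead of truncating on hosts where `usize` is narrower than 64 bits.
pub fn to_raw_limit(bytes: u64) -> usize {
    usize::try_from(bytes).unwrap_or(usize::MAX)
}
```

Keeping the conversion in one place means callers such as an estimator or a metrics exporter only ever see `u64`.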

What changed

C++ bridge (src/lib/mlxcel-core/cpp/mlx_cxx_bridge.{h,cpp}): seven new one-line forwarders to mlx::core::*. Per-backend semantics are documented inline: Metal and CUDA populate every counter, while the no-gpu CPU CommonAllocator populates active / peak / limit but returns 0 from get_cache_memory and treats set_cache_limit as a no-op, by MLX upstream design.
cxx FFI declarations (src/lib/mlxcel-core/src/lib.rs): the same set declared in the #[cxx::bridge] module. Auto-re-exported via the existing pub use ffi::* so they show up as mlxcel_core::get_active_memory(...) for parity with the existing set_wired_limit / gpu_max_memory_size precedent.
Typed wrappers (src/lib/mlxcel-core/src/memory.rs, new): active_memory() / peak_memory() / cache_memory() / memory_limit() returning u64, set_memory_limit(u64) -> u64, set_cache_limit(u64) -> u64, reset_peak_memory(), clear_cache(), plus a MemorySnapshot { active, peak, cache, limit } struct with snapshot() / used_bytes(). Wrappers normalise usize ↔ u64 so callers get a stable wire size irrespective of host pointer width.
FFI smoke test (src/lib/mlxcel-core/src/ffi_tests.rs::test_runtime_memory_apis_smoke): allocate an array, read every counter, round-trip set_memory_limit, call reset_peak_memory(). Guards the raw cxx surface.
Memory module unit tests: monotonic-relationship checks (peak >= active, peak climbs after a fresh allocation post-reset), set_memory_limit round-trip restores previous, set_cache_limit / clear_cache no-op on CPU. Tests serialize through a dedicated MEMORY_TEST_LOCK so reset_peak_memory() in one test does not trample a peak observation in another running in parallel.
CLI generate integration (src/commands/generate.rs): load_generation_model captures memory::snapshot() immediately after load_model* returns and appends "(resident: X.XX GB, peak: X.XX GB)" to the existing "Model loaded in N.NNNs." line. A tracing::info! companion event carries the full snapshot for structured-log consumers.
Server worker integration (src/server/model_worker.rs): the same snapshot + structured log on both the batched scheduler path and the legacy --no-batch worker, so HTTP deployments report the same residency figure as the CLI.
Preflight hook (src/execution/runtime.rs): new MLXCEL_MEMORY_LIMIT env var calls memory::set_memory_limit(...) at startup so MLX raises an exception on overflow during evaluation. This is the explicit "preflight hook" the capstone (feat: unified memory estimator with mlxcel inspect and generate/serve preflight #56) will consume. Syntax matches MLXCEL_WIRED_LIMIT ("32GB" / "1024MB" / raw bytes / "0" / "none"). RuntimeSetup gains a memory_limit_bytes: Option<usize> field; print_runtime_setup surfaces the cap at boot.
Help text (src/main.rs): document MLXCEL_MEMORY_LIMIT in the CLI after_help block alongside MLXCEL_WIRED_LIMIT.
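A minimal sketch of how a value in the MLXCEL_MEMORY_LIMIT syntax could be parsed is shown below. The function name `parse_memory_limit` is hypothetical and this is not the parser in `src/execution/runtime.rs`. Two assumptions are baked in: "GB" and "MB" are binary multiples (2^30 and 2^20 bytes), and both "0" and "none" mean "no cap".

```rust
/// Parse strings such as "32GB", "1024MB", "4096", "0", or "none".
/// Returns `Ok(None)` when no cap is requested.
pub fn parse_memory_limit(raw: &str) -> Result<Option<usize>, String> {
    let s = raw.trim();
    // Assumption: "0" and "none" both disable the limit.
    if s == "0" || s.eq_ignore_ascii_case("none") {
        return Ok(None);
    }

    let upper = s.to_ascii_uppercase();
    // Assumption: suffixes are binary multiples.
    let (digits, multiplier): (&str, usize) = if let Some(rest) = upper.strip_suffix("GB") {
        (rest, 1 << 30)
    } else if let Some(rest) = upper.strip_suffix("MB") {
        (rest, 1 << 20)
    } else {
        // No suffix: the value is a raw byte count.
        (upper.as_str(), 1)
    };

    let value: usize = digits
        .trim()
        .parse()
        .map_err(|_| format!("invalid memory limit: {raw:?}"))?;

    // Reject values that do not fit in `usize` rather than wrapping.
    value
        .checked_mul(multiplier)
        .map(Some)
        .ok_or_else(|| format!("memory limit overflows usize: {raw:?}"))
}
```

Returning `Option<usize>` lines up with the `memory_limit_bytes: Option<usize>` field that RuntimeSetup gains, so an unset or disabled limit and a concrete cap stay distinguishable.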

Acceptance criteria

FFI smoke test allocates an array and active_memory() is non-zero. cargo test --lib -p mlxcel-core memory::tests::active_memory_increases_after_allocation passes on Linux/CPU; the equivalent assertion in ffi_tests::test_runtime_memory_apis_smoke covers the raw FFI surface.
reset_peak_memory() followed by a known allocation → peak_memory() reflects it. cargo test --lib -p mlxcel-core memory::tests::reset_peak_memory_lowers_or_holds_peak exercises exactly this sequence.
Wired into the load path so a real mlxcel generate logs resident-after-load. Added to both src/commands/generate.rs (CLI) and src/server/model_worker.rs (server). The CLI now prints "Model loaded in 1.234s (resident: 7.42 GB, peak: 7.42 GB)." with a tracing::info! companion. Apple Silicon real-model validation is deferred to the capstone (feat: unified memory estimator with mlxcel inspect and generate/serve preflight #56) per the issue's "build on Linux, validate residency on Apple Silicon later" guidance.
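The shape of the load line quoted above can be reproduced with a small formatter. This is an illustrative sketch only: `format_load_line` is a hypothetical name, "resident" is assumed to be the snapshot's active byte count, and GB is assumed to be the decimal unit (10^9 bytes), which the PR does not state.

```rust
/// Assumption: decimal gigabytes. Swap in 1024.0 * 1024.0 * 1024.0 for GiB.
const BYTES_PER_GB: f64 = 1e9;

/// Build the human-readable line printed after a model finishes loading.
pub fn format_load_line(load_seconds: f64, active_bytes: u64, peak_bytes: u64) -> String {
    format!(
        "Model loaded in {:.3}s (resident: {:.2} GB, peak: {:.2} GB).",
        load_seconds,
        active_bytes as f64 / BYTES_PER_GB,
        peak_bytes as f64 / BYTES_PER_GB,
    )
}
```

With these assumptions, `format_load_line(1.234, 7_420_000_000, 7_420_000_000)` yields the example line from the acceptance criterion.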

Test plan

cargo check -p mlxcel-core — clean.
cargo check --lib --tests (workspace incl. main crate) — clean.
cargo build --bin mlxcel and ./target/debug/mlxcel --help — boots, new MLXCEL_MEMORY_LIMIT line shown in help.
cargo test --lib -p mlxcel-core memory:: — 6 tests pass on Linux/CPU.
cargo test --lib -p mlxcel-core ffi_tests::test_runtime_memory_apis_smoke — passes.
cargo test --lib execution::runtime — 10 existing tests still pass after the RuntimeSetup field addition.
cargo clippy --lib --tests -- -D warnings (mlxcel-core + main crate) — clean.
cargo fmt --all -- --check — clean.
Apple Silicon (Metal) real-model run — deferred to the capstone (feat: unified memory estimator with mlxcel inspect and generate/serve preflight #56) per issue guidance.

Cross-platform notes

MLX's allocator dispatch is transparent: every wrapper compiles and runs on Apple Silicon (Metal), Linux/CUDA, and the no-gpu CPU CommonAllocator. The CPU backend returns 0 for get_cache_memory and treats set_cache_limit as a no-op by upstream design; the wrappers reflect that without panicking, matching the precedent set by set_wired_limit.
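The contract in this note can be modelled with a small trait, shown here as a sketch rather than as MLX's or mlxcel's actual types. `AllocatorBackend`, `CpuBackend`, and `CachingBackend` are hypothetical, the CPU backend returning 0 from `set_cache_limit` is an assumption, and the trimming behaviour of `CachingBackend` is purely illustrative.

```rust
/// Hypothetical view of the two cache-related allocator operations.
pub trait AllocatorBackend {
    fn cache_memory(&self) -> usize;
    /// Sets a new cache limit and returns the previous one.
    fn set_cache_limit(&mut self, limit: usize) -> usize;
}

/// Stand-in for a backend with no buffer cache (the no-gpu CPU case).
pub struct CpuBackend;

impl AllocatorBackend for CpuBackend {
    fn cache_memory(&self) -> usize {
        0 // nothing is ever cached
    }
    fn set_cache_limit(&mut self, _limit: usize) -> usize {
        0 // assumption: the no-op reports a previous limit of 0
    }
}

/// Stand-in for a backend that keeps a buffer cache (Metal / CUDA style).
pub struct CachingBackend {
    pub cached: usize,
    pub cache_limit: usize,
}

impl AllocatorBackend for CachingBackend {
    fn cache_memory(&self) -> usize {
        self.cached
    }
    fn set_cache_limit(&mut self, limit: usize) -> usize {
        let previous = std::mem::replace(&mut self.cache_limit, limit);
        // Illustrative only: trim the cache down to the new limit.
        self.cached = self.cached.min(limit);
        previous
    }
}

/// Typed wrappers: same call on every backend, no panic on the no-op path.
pub fn cache_memory(backend: &dyn AllocatorBackend) -> u64 {
    backend.cache_memory() as u64
}

pub fn set_cache_limit(backend: &mut dyn AllocatorBackend, limit: u64) -> u64 {
    let raw = usize::try_from(limit).unwrap_or(usize::MAX);
    backend.set_cache_limit(raw) as u64
}
```

Callers therefore need no backend-specific branches: on the CPU stand-in the wrappers simply report zero.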

Closes #55

Bind MLX's `mlx/memory.h` runtime counters and limit setters through the C++ FFI bridge so mlxcel can finally measure actual MLX-allocator residency and cap the working set. Previously none of `get_active_memory` / `get_peak_memory` / `get_cache_memory` / `set_memory_limit` / `set_cache_limit` / `reset_peak_memory` were exposed, leaving the estimator unverifiable and the allocator unbounded.

Changes:

- `src/lib/mlxcel-core/cpp/mlx_cxx_bridge.{h,cpp}`: add `get_active_memory`, `get_peak_memory`, `get_cache_memory`, `set_memory_limit`, `get_memory_limit`, `set_cache_limit`, `reset_peak_memory` as thin one-line forwarders to `mlx::core::*` so the active allocator (Metal / CUDA / no-gpu CommonAllocator) decides per-backend semantics. Documented cross-backend behaviour in the header.
- `src/lib/mlxcel-core/src/lib.rs`: declare the same set in the cxx bridge. They are auto-re-exported via the existing `pub use ffi::*` so callers reach them as `mlxcel_core::get_active_memory(...)` for parity with the existing `set_wired_limit` / `gpu_max_memory_size` precedent.
- `src/lib/mlxcel-core/src/memory.rs` (new): typed wrappers: `active_memory() -> u64`, `peak_memory() -> u64`, `cache_memory() -> u64`, `memory_limit() -> u64`, `set_memory_limit(u64) -> u64`, `set_cache_limit(u64) -> u64`, `reset_peak_memory()`, `clear_cache()`, plus a `MemorySnapshot { active, peak, cache, limit }` struct with `snapshot()` / `used_bytes()`. Wrappers convert `usize` ↔ `u64` so callers (estimator, preflight, metrics) get a stable wire size independent of host pointer width.
- `src/lib/mlxcel-core/src/ffi_tests.rs`: add `test_runtime_memory_apis_smoke` covering the raw cxx surface: allocate, read every counter, round-trip `set_memory_limit`, call `reset_peak_memory`.
- `src/lib/mlxcel-core/src/memory.rs` unit tests: monotonic-relationship checks (`peak >= active`, `peak` grows after a fresh allocation post-reset), `set_memory_limit` round-trip restores previous, `set_cache_limit` / `clear_cache` no-op on the no-gpu CPU backend. All assertions are loose enough to survive parallel allocations from other tests in the same binary; tests serialize through a dedicated `MEMORY_TEST_LOCK` so `reset_peak_memory()` in one test does not trample a peak observation in another.

Integration (mandatory per issue acceptance):

- `src/commands/generate.rs`: `load_generation_model` now captures `mlxcel_core::memory::snapshot()` immediately after `load_model*` returns and appends "(resident: X.XX GB, peak: X.XX GB)" to the existing "Model loaded in N.NNNs." line. A `tracing::info!` companion event carries the full snapshot fields (`active_bytes`, `peak_bytes`, `cache_bytes`, `limit_bytes`, `load_seconds`) for structured-log consumers.
- `src/server/model_worker.rs`: matching snapshot + structured log on the server load path (both the batched scheduler and the legacy `--no-batch` worker) so HTTP deployments see the same residency figure as the CLI.
- `src/execution/runtime.rs`: new `MLXCEL_MEMORY_LIMIT` env var. When set, `initialize_runtime` calls `memory::set_memory_limit(...)` before model load so MLX raises an exception on overflow during evaluation instead of thrashing or being OOM-killed. This is the explicit "preflight hook" the future capstone (#56) consumes. Syntax matches `MLXCEL_WIRED_LIMIT` ("32GB" / "1024MB" / raw bytes / "0" / "none"). `RuntimeSetup` gains a `memory_limit_bytes: Option<usize>` field so `print_runtime_setup` can surface the cap at boot.
- `src/main.rs`: document `MLXCEL_MEMORY_LIMIT` in the CLI `after_help` block alongside `MLXCEL_WIRED_LIMIT`.

Cross-platform contract:

MLX's allocator selection is transparent: every wrapper here builds and runs on Apple Silicon (Metal), Linux/CUDA, and the no-gpu CPU CommonAllocator. The CPU allocator returns 0 for `get_cache_memory` and treats `set_cache_limit` as a no-op by upstream design; the wrappers reflect that without panicking, matching the precedent set by `set_wired_limit` (also a no-op on CPU). All new unit tests pass on the Linux/CUDA dev host; Apple Silicon real-model validation is deferred to the integration capstone (#56).

Closes #55

inureyes added the status:review, type:enhancement, priority:medium, and area:core labels, then added status:done and removed status:review, on May 21, 2026

inureyes merged commit 1af5069 into main May 21, 2026
4 checks passed

inureyes mentioned this pull request May 21, 2026

Epic: Pre-load model memory requirement estimation #52

Closed

4 tasks


inureyes merged 1 commit into main from feature/issue-55-mlx-memory-ffi
