feat(ffi): wrap MLX runtime memory APIs (active/peak/limit)#66
Merged
Conversation
Bind MLX's `mlx/memory.h` runtime counters and limit setters through the C++ FFI bridge so mlxcel can finally measure actual MLX-allocator residency and cap the working set. Previously none of `get_active_memory` / `get_peak_memory` / `get_cache_memory` / `set_memory_limit` / `set_cache_limit` / `reset_peak_memory` were exposed, leaving the estimator unverifiable and the allocator unbounded.
Changes:
- `src/lib/mlxcel-core/cpp/mlx_cxx_bridge.{h,cpp}`: add `get_active_memory`, `get_peak_memory`, `get_cache_memory`, `set_memory_limit`, `get_memory_limit`, `set_cache_limit`, `reset_peak_memory` as thin one-line forwarders to `mlx::core::*` so the active allocator (Metal / CUDA / no-gpu CommonAllocator) decides per-backend semantics. Documented cross-backend behaviour in the header.
- `src/lib/mlxcel-core/src/lib.rs`: declare the same set in the cxx bridge. They are auto-re-exported via the existing `pub use ffi::*` so callers reach them as `mlxcel_core::get_active_memory(...)` for parity with the existing `set_wired_limit` / `gpu_max_memory_size` precedent.
- `src/lib/mlxcel-core/src/memory.rs` (new): typed wrappers — `active_memory() -> u64`, `peak_memory() -> u64`, `cache_memory() -> u64`, `memory_limit() -> u64`, `set_memory_limit(u64) -> u64`, `set_cache_limit(u64) -> u64`, `reset_peak_memory()`, `clear_cache()`, plus a `MemorySnapshot { active, peak, cache, limit }` struct with `snapshot()` / `used_bytes()`. Wrappers convert `usize` ↔ `u64` so callers (estimator, preflight, metrics) get a stable wire size independent of host pointer width.
- `src/lib/mlxcel-core/src/ffi_tests.rs`: add `test_runtime_memory_apis_smoke` covering the raw cxx surface — allocate, read every counter, round-trip `set_memory_limit`, call `reset_peak_memory`.
- `src/lib/mlxcel-core/src/memory.rs` unit tests: monotonic-relationship checks (`peak >= active`, `peak` grows after a fresh allocation post-reset), `set_memory_limit` round-trip restores previous, `set_cache_limit` / `clear_cache` no-op on the no-gpu CPU backend. All assertions are loose enough to survive parallel allocations from other tests in the same binary; tests serialize through a dedicated `MEMORY_TEST_LOCK` so `reset_peak_memory()` in one test does not trample a peak observation in another.
Integration (mandatory per issue acceptance):
- `src/commands/generate.rs`: `load_generation_model` now captures `mlxcel_core::memory::snapshot()` immediately after `load_model*` returns and appends "(resident: X.XX GB, peak: X.XX GB)" to the existing "Model loaded in N.NNNs." line. A `tracing::info!` companion event carries the full snapshot fields (`active_bytes`, `peak_bytes`, `cache_bytes`, `limit_bytes`, `load_seconds`) for structured-log consumers.
- `src/server/model_worker.rs`: matching snapshot + structured log on the server load path (both the batched scheduler and the legacy `--no-batch` worker) so HTTP deployments see the same residency figure as the CLI.
- `src/execution/runtime.rs`: new `MLXCEL_MEMORY_LIMIT` env var. When set, `initialize_runtime` calls `memory::set_memory_limit(...)` before model load so MLX raises an exception on overflow during evaluation instead of thrashing or being OOM-killed. This is the explicit "preflight hook" the future capstone (#56) consumes. Syntax matches `MLXCEL_WIRED_LIMIT` ("32GB" / "1024MB" / raw bytes / "0" / "none"). `RuntimeSetup` gains a `memory_limit_bytes: Option<usize>` field so `print_runtime_setup` can surface the cap at boot.
- `src/main.rs`: document `MLXCEL_MEMORY_LIMIT` in the CLI `after_help` block alongside `MLXCEL_WIRED_LIMIT`.
Cross-platform contract:
MLX's allocator selection is transparent: every wrapper here builds and runs on Apple Silicon (Metal), Linux/CUDA, and the no-gpu CPU CommonAllocator. The CPU allocator returns 0 for `get_cache_memory` and treats `set_cache_limit` as a no-op by upstream design; the wrappers reflect that without panicking, matching the precedent set by `set_wired_limit` (also a no-op on CPU). All new unit tests pass on the Linux/CUDA dev host; Apple Silicon real-model validation is deferred to the integration capstone (#56).
Closes #55
4 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Bind MLX's
mlx/memory.hruntime counters and limit setters through the C++ FFI bridge so mlxcel can finally measure actual MLX-allocator residency and cap the working set. None ofget_active_memory/get_peak_memory/get_cache_memory/set_memory_limit/set_cache_limit/reset_peak_memorywere exposed before; this PR adds the bridge entries, layers a typedmlxcel_core::memorymodule on top, and wires the post-load residency log into both the CLI generate path and the server worker.What changed
src/lib/mlxcel-core/cpp/mlx_cxx_bridge.{h,cpp}): seven new one-line forwarders tomlx::core::*. Documented per-backend semantics inline — Metal / CUDA populate every counter, the no-gpu CPUCommonAllocatorpopulates active / peak / limit but treatsget_cache_memory/set_cache_limitas inert no-ops by MLX upstream design.src/lib/mlxcel-core/src/lib.rs): the same set declared in the#[cxx::bridge]module. Auto-re-exported via the existingpub use ffi::*so they show up asmlxcel_core::get_active_memory(...)for parity with the existingset_wired_limit/gpu_max_memory_sizeprecedent.src/lib/mlxcel-core/src/memory.rs, new):active_memory()/peak_memory()/cache_memory()/memory_limit()returningu64,set_memory_limit(u64) -> u64,set_cache_limit(u64) -> u64,reset_peak_memory(),clear_cache(), plus aMemorySnapshot { active, peak, cache, limit }struct withsnapshot()/used_bytes(). Wrappers normaliseusize↔u64so callers get a stable wire size irrespective of host pointer width.src/lib/mlxcel-core/src/ffi_tests.rs::test_runtime_memory_apis_smoke): allocate an array, read every counter, round-tripset_memory_limit, callreset_peak_memory(). Guards the raw cxx surface.peak >= active,peakclimbs after a fresh allocation post-reset),set_memory_limitround-trip restores previous,set_cache_limit/clear_cacheno-op on CPU. Tests serialize through a dedicatedMEMORY_TEST_LOCKsoreset_peak_memory()in one test does not trample a peak observation in another running in parallel.src/commands/generate.rs):load_generation_modelcapturesmemory::snapshot()immediately afterload_model*returns and appends "(resident: X.XX GB, peak: X.XX GB)" to the existing "Model loaded in N.NNNs." line. Atracing::info!companion event carries the full snapshot for structured-log consumers.src/server/model_worker.rs): the same snapshot + structured log on both the batched scheduler path and the legacy--no-batchworker, so HTTP deployments report the same residency figure as the CLI.src/execution/runtime.rs): newMLXCEL_MEMORY_LIMITenv var callsmemory::set_memory_limit(...)at startup so MLX raises an exception on overflow during evaluation. This is the explicit "preflight hook" the capstone (feat: unified memory estimator with mlxcel inspect and generate/serve preflight #56) will consume. Syntax matchesMLXCEL_WIRED_LIMIT("32GB" / "1024MB" / raw bytes / "0" / "none").RuntimeSetupgains amemory_limit_bytes: Option<usize>field;print_runtime_setupsurfaces the cap at boot.src/main.rs): documentMLXCEL_MEMORY_LIMITin the CLIafter_helpblock alongsideMLXCEL_WIRED_LIMIT.Acceptance criteria
active_memory()is non-zero.cargo test --lib -p mlxcel-core memory::tests::active_memory_increases_after_allocationpasses on Linux/CPU; equivalent assertion inffi_tests::test_runtime_memory_apis_smokepasses the raw-FFI surface.reset_peak_memory()followed by a known allocation →peak_memory()reflects it.cargo test --lib -p mlxcel-core memory::tests::reset_peak_memory_lowers_or_holds_peakexercises exactly this sequence.mlxcel generatelogs resident-after-load. Added to bothsrc/commands/generate.rs(CLI) andsrc/server/model_worker.rs(server). The CLI now prints "Model loaded in 1.234s (resident: 7.42 GB, peak: 7.42 GB)." with atracing::info!companion. Apple Silicon real-model validation is deferred to the capstone (feat: unified memory estimator with mlxcel inspect and generate/serve preflight #56) per the issue's "build on Linux, validate residency on Apple Silicon later" guidance.Test plan
cargo check -p mlxcel-core— clean.cargo check --lib --tests(workspace incl. main crate) — clean.cargo build --bin mlxceland./target/debug/mlxcel --help— boots, newMLXCEL_MEMORY_LIMITline shown in help.cargo test --lib -p mlxcel-core memory::— 6 tests pass on Linux/CPU.cargo test --lib -p mlxcel-core ffi_tests::test_runtime_memory_apis_smoke— passes.cargo test --lib execution::runtime— 10 existing tests still pass after theRuntimeSetupfield addition.cargo clippy --lib --tests -- -D warnings(mlxcel-core + main crate) — clean.cargo fmt --all -- --check— clean.Cross-platform notes
MLX's allocator dispatch is transparent: every wrapper compiles and runs on Apple Silicon (Metal), Linux/CUDA, and the no-gpu CPU
CommonAllocator. The CPU backend returns 0 forget_cache_memoryand treatsset_cache_limitas a no-op by upstream design; the wrappers reflect that without panicking, matching the precedent set byset_wired_limit.Closes #55