Retain bound buffers under untracked hazard mode#3462
Retain bound buffers under untracked hazard mode#3462TheTom wants to merge 1 commit intoml-explore:mainfrom
Conversation
941ddea to
9c2187a
Compare
The Metal allocator uses MTLResourceHazardTrackingModeUntracked and the command queue uses commandBufferWithUnretainedReferences(); both Apple APIs require the application to keep bound buffers alive until the command buffer completes. CommandEncoder::set_buffer / set_input_array did not take that retain, so a buffer could be destroyed mid-flight when the caller's shared_ptr<array::Data> dropped between encode and CB completion. Retain each MTL::Buffer on first sighting in the current command buffer (the existing all_inputs_ set is the dedup oracle), transfer the per-CB retained vector into the addCompletedHandler lambda, and release on completion. Also stop removing the output's data_shared_ptr from the captured set in eval() (the unordered_set already dedupes). Manifests as random Invalid Resource crashes under concurrent custom- kernel workloads at decode B>=16 on M-series. Ablation env knob MLX_METAL_RETAIN_BOUND_BUFFERS=0 reverts behaviour for bisection. Validated on Qwen3.5-35B-A3B turbo4v2 4K (M5 Max): 0/10 -> 10/10 at B=16 and B=17, 0/5 -> 5/5 at B=32. Steady-state decode 86.5 vs 88.6 t/s (~2.4% cost). Memory cost: zero (retain bumps a 32-bit counter inside the existing MTL::Buffer header). Mirrors ml-explore#3462 (issue ml-explore#3461).
9c2187a to
379f002
Compare
The Metal allocator uses MTLResourceHazardTrackingModeUntracked and the command queue uses commandBufferWithUnretainedReferences(); both Apple APIs require the application to keep bound buffers alive until the command buffer completes. CommandEncoder::set_buffer / set_input_array did not take that retain, so a buffer could be destroyed mid-flight when the caller's shared_ptr<array::Data> dropped between encode and CB completion. Retain each MTL::Buffer on first sighting in the current command buffer (the existing all_inputs_ set is the dedup oracle), transfer the per-CB retained vector into the addCompletedHandler lambda, and release on completion. Also stop removing the output's data_shared_ptr from the captured set in eval() (the unordered_set already dedupes). Manifests as random Invalid Resource crashes under concurrent custom- kernel workloads at decode B>=16 on M-series. Validated on Qwen3.5-35B-A3B turbo4v2 4K (M5 Max, mlx-swift): 0/10 -> 10/10 at B=16 and B=17, 0/5 -> 5/5 at B=32. Ablation via MLX_RETAIN_BOUND_BUFFERS=0 reverts behaviour and crashes return at the same rate. Steady-state decode 86.5 vs 88.6 t/s (~2.4% cost). Memory cost: zero (retain bumps a 32-bit counter inside the existing MTL::Buffer header).
379f002 to
57c88c0
Compare
The Metal allocator uses MTLResourceHazardTrackingModeUntracked and the command queue uses commandBufferWithUnretainedReferences(); both Apple APIs require the application to keep bound buffers alive until the command buffer completes. CommandEncoder::set_buffer / set_input_array did not take that retain, so a buffer could be destroyed mid-flight when the caller's shared_ptr<array::Data> dropped between encode and CB completion. Retain each MTL::Buffer on first sighting in the current command buffer (the existing all_inputs_ set is the dedup oracle), transfer the per-CB retained vector into the addCompletedHandler lambda, and release on completion. Also stop removing the output's data_shared_ptr from the captured set in eval() (the unordered_set already dedupes). Manifests as random Invalid Resource crashes under concurrent custom- kernel workloads at decode B>=16 on M-series. Ablation env knob MLX_METAL_RETAIN_BOUND_BUFFERS=0 reverts behaviour for bisection. Validated on Qwen3.5-35B-A3B turbo4v2 4K (M5 Max): 0/10 -> 10/10 at B=16 and B=17, 0/5 -> 5/5 at B=32. Steady-state decode 86.5 vs 88.6 t/s (~2.4% cost). Memory cost: zero (retain bumps a 32-bit counter inside the existing MTL::Buffer header). Mirrors ml-explore#3462 (issue ml-explore#3461).
The Metal allocator uses MTLResourceHazardTrackingModeUntracked and the command queue uses commandBufferWithUnretainedReferences(); both Apple APIs require the application to keep bound buffers alive until the command buffer completes. CommandEncoder::set_buffer / set_input_array did not take that retain, so a buffer could be destroyed mid-flight when the caller's shared_ptr<array::Data> dropped between encode and CB completion. Retain each MTL::Buffer on first sighting in the current command buffer (the existing all_inputs_ set is the dedup oracle), transfer the per-CB retained vector into the addCompletedHandler lambda, and release on completion. Also stop removing the output's data_shared_ptr from the captured set in eval() (the unordered_set already dedupes). Manifests as random Invalid Resource crashes under concurrent custom- kernel workloads at decode B>=16 on M-series. Ablation env knob MLX_METAL_RETAIN_BOUND_BUFFERS=0 reverts behaviour for bisection. Validated on Qwen3.5-35B-A3B turbo4v2 4K (M5 Max): 0/10 -> 10/10 at B=16 and B=17, 0/5 -> 5/5 at B=32. Steady-state decode 86.5 vs 88.6 t/s (~2.4% cost). Memory cost: zero (retain bumps a 32-bit counter inside the existing MTL::Buffer header). Mirrors ml-explore#3462 (issue ml-explore#3461).
|
I have confirmed this fixed the bug in my |
There was a problem hiding this comment.
buffers could be destroyed mid-flight when the caller's shared_ptrarray::Data dropped between encode and CB completion
The buffers are retained between encode and completion by code:
command_buffer->addCompletedHandler(
[s, buffers = std::move(buffers)](MTL::CommandBuffer* cbuf) {
...
}The buffers = std::move(buffers) would ensure that the buffers won't be released until the completion callback is ended.
I highly doubt that the crashes were caused by race conditions which should had been resolved by recent thread safety changes. I have tried to run the tests added by this PR in the main branch and they are passing.
|
Thanks for the comments. Some responses:
The captured set is held through CB completion, but the line right above the // Remove the output if it was donated to by an input
if (auto it = buffers.find(arr.data_shared_ptr()); it != buffers.end()) {
buffers.erase(it);
}So the lambda capture protects inputs and siblings but not the output. The output's lifetime is left to the Swift caller's Issue #3461 has the Metal Validator log naming the destroyed-mid-flight buffer directly. The patch closes this with two changes: drop the donation-erase in
Could you point me at the specific PR you're thinking of for the recent thread safety changes? Happy to retest against that baseline. The companion issue #3078 is in the same area but a different mechanism (encoder aliasing across threads), and this patch isn't intended to address that one. On the tests passing on main: that's right and the PR description says so up front, and this is part of what makes it a harder probelm. The included tests are smoke tests, not deterministic regressions for this specific race. I scaled the The deterministic evidence is the workloaad ablation from #3461: same patched binary, On the tests passing on main: that's right and the PR description says so up front and this is part of what makes it a harder probelm. The included tests are smoke tests, not deterministic regressions for this specific race. I scaled the The deterministic evidence is the workloaad ablation: same patched binary, |
|
If the root cause is the output getting release too early I think ensuring the The extra |
|
The ablation answers this directly. In the same patched binary, with the The two layers also protect different things: |
|
I think there are 2 possibilities:
To verify whether the first possibility is true, can you try retaining the buffers without calling std::vector<MTL::Buffer*> retained_buffers_;with std::set<std::shared_ptr<array::Data>> retained_buffers_; |
|
@zcbenz Workload: B=32 concurrent generation tasks via
Direct answer to your two cases: case 2 (missed wrapper coverage in eval.cpp's existing capture) is the right one on this workload. Case 1 (Apple-API-level retain strictly required) is not what's firing here. Wrapper-set keeps the Stack trace from Run B ( One thing I'd flag before fully retiring the bind-path retain: |
|
Thanks for testing the possibilities. The most likely causes are:
But they do not necessarily happen inside MLX, and unlikely so because it would have been reported long ago. You mentioned using a forked mlx-swift-lm with custom kernels, which in my opinion should be looked into first. We can not just retain everything passed to |
…to, vendor C ABI Three co-located fixes for `78c91aa` not building cleanly: 1. **mlx-swift pin reverted to `0a56f90`** (was `a21d2af`). `a21d2af` advanced the `osaurus-ai/mlx` submodule to `7086ba37`, an INCOMPLETE backport of upstream ml-explore/mlx#3462. The backport added `encoder.take_retained_buffers()` at `mlx/backend/metal/eval.cpp:62` but never carried over the `auto& encoder = metal::get_command_encoder(s);` declaration that line depends on. Result: the package fails to compile at HEAD. `0a56f90` is the last green pin (submodule at `mlx@96aa27a5`, layered on upstream `ce45c525`). Reverting drops the perf-oriented buffer-retain optimization but restores correctness. Re-introduce when an `osaurus-ai/mlx-swift` branch advances the submodule pointer to `e577ca02` (the corrected backport, currently on `backport/3462-retain-bound-buffers`). 2. **swift-crypto range loosened to `"3.0.0"..<"5.0.0"`** (was `from: "4.0.0"`). The hard 4.x lower-bound conflicted with hosts that pin `apple/containerization` (still on swift-crypto 3.x as of 0.32.0). The only crypto APIs touched by MLXDistributedTransport are `SHA256.hash` and `P256.Signing.PrivateKey()` — both stable since swift-crypto 1.x — so the wider range is safe. 3. **Vendored `mlx-c/mlx/c/distributed.cpp` and `distributed_group.cpp`** into `CmlxDistributedShim` as `MlxCDistributed.cpp` and `MlxCDistributedGroup.cpp`. The mlx-swift Package.swift excludes both files from the Cmlx target (only the abstract C++ layer is built; backends + the C ABI wrappers are not). Without them our `_mlx_distributed_*` C symbols are unresolved and `TPRankWorker` fails to link. Files are byte-identical to the upstream — re-vendor when bumping mlx-swift if the C ABI changes. Added `cxxLanguageStandard: .gnucxx20` at the package level so the vendored C++ uses the same standard as the upstream Cmlx target. Added matching `headerSearchPath`s for the mlx-c headers, the mlx C++ root, json/single_include/nlohmann, and fmt/include so the vendored files resolve their includes. Verified end-to-end on M5 Max (apple silicon, macOS 26.3.2): - `swift build -c release` → Build complete in 26.84s, including CmlxDistributedShim (3 files), MLXDistributedTP, TPRankWorker, RunBench. No errors, only pre-existing warnings. - `swift test --filter "LoadConfigurationTests|ShardingPlanTests"` → 28/28 pass. - Real-bundle smoke (Laguna-XS.2-JANGTQ, 9.4 GB): 3-turn coherent multi-turn, "blue" recall correct. - Real-bundle smoke (MiniMax-SLURPY-JANGTQ, BENCH_JPREG=1): - MiniMaxM2Minimal auto-engage confirmed in logs - Thinking probe PASS (off-reasoning=0c, on-reasoning=772c) - TQ disk round-trip PASS - 3/3 turn coherence, no looping Pre-existing flakes NOT introduced by this commit: - `EvalTests/testConcurrentSampling` and `testRandomStateIsolation` crash on `0a56f90` (the very symptoms `a21d2af` tried to fix). Will re-pass once the mlx-swift backport branch points at `e577ca02`. - `LoadConfigurationTests.autoFallsThroughOnBadEnv` flakes under parallel xctest because `withEnvironmentValue` uses process-global setenv. Pre-existing pattern, unrelated.
|
Closing this in favor of a targeted downstream fix. @zcbenz's reading was right: the lifetime race wasn't an Apple-API contract gap that needed bind-path retain. It was nine Symptom is workload-dependent. Metal Patch follows the SDPA pattern ( Fix is in our fork, not upstream. The affected primitives don't exist in Thanks @zcbenz for the patient pushback. "Look at the forked mlx-swift-lm with custom kernels first" was exactly the right pointer. Closing #3461 with the same context. |
Refs #3461.
The Metal backend allocates buffers with
MTLResourceHazardTrackingModeUntrackedand creates command buffers viacommandBufferWithUnretainedReferences(). Both Apple APIs require the application to keep bound buffers alive until the command buffer completes. The bind path (CommandEncoder::set_buffer/set_input_array) wasn't taking that retain — buffers could be destroyed mid-flight when the caller'sshared_ptr<array::Data>dropped between encode and CB completion. Manifests as random[METAL] Command buffer execution failed: Invalid Resourcecrashes under concurrent custom-kernel workloads (e.g.mlx-swiftSwift Tasks dispatchingMLXFast.metal_kernelat decode B>=16).Metal Validator (
METAL_DEVICE_WRAPPER_TYPE=1 MTL_DEBUG_LAYER=1) names it directly:Companion to #3078 — different bug (encoder aliasing across threads), but in the same area.
Proposed changes
CommandEncoderretains each boundMTL::Bufferon first sighting in the current command buffer (using the existingall_inputs_set as the dedup oracle). Per-CB cost: oneretain()/release()pair per unique buffer.eval()transfers the per-CB retained vector into theaddCompletedHandlerlambda and releases each pointer when the CB completes.eval()no longer removes the output'sdata_shared_ptrfrom the captured set — theunordered_setalready dedupes the input-donated-as-output case, and removing it leaked the output buffer's lifetime to the caller's wrapper.env::metal_retain_bound_buffers()accessor inmlx/utils.h(matchesmetal_fast_synch/metal_gpu_archpattern). Env var:MLX_METAL_RETAIN_BOUND_BUFFERS=0reverts to old behaviour for ablation/bisection.Diff: 70 lines across 4 files (
mlx/utils.h,mlx/backend/metal/device.h,mlx/backend/metal/device.cpp,mlx/backend/metal/eval.cpp) + 120-line tests file undertests/.Tests
tests/metal_buffer_lifetime_tests.cppadds twoTEST_CASEs, bothgpu::is_available()-guarded:test concurrent eval smoke— 16 threads × 8 iters ofmatmul/eval. Smoke test for built-in primitives.test custom kernel concurrent buffer lifetime— 32 threads × 64 iters ofmlx::fast::metal_kerneldispatch + drop. Designed to be the deterministic regression test for THIS bug.Both pass on M5 Max with the patch:
Honest caveat on the deterministic test: I expected
test custom kernel concurrent buffer lifetimeto fail in ablation (MLX_METAL_RETAIN_BOUND_BUFFERS=0) and it doesn't. I tried scaling to 64 threads × 256 iters × 4096-element arrays × 3 inputs and it still passes ablated, even with Metal Validator (METAL_DEVICE_WRAPPER_TYPE=1 MTL_DEBUG_LAYER=1) on. The race is real but workload-shaped — it requires the specific memory pressure + binding pattern of compressed-attention KV cache reads with concurrent decode (the original repro). Built-in primitives and small custom kernels both have well-rooted enoughshared_ptrchains that the race rarely surfaces in unit-test conditions.So the included tests are smoke tests in practice. The actual deterministic regression evidence is at the workload level (below). I'd appreciate maintainer guidance on whether a more aggressive
metal_kernel-based test is wanted or if the workload-level evidence +MLX_METAL_RETAIN_BOUND_BUFFERS=0ablation lever is enough.Full upstream
ctest(with patch applied):The 6 failing doctests + 1 aggregate are all in
linalg_tests.cpp(matrix factorisation: QR, eigh, inversion, cholesky, pseudo-inverse, lu). Verified pre-existing on unpatchedbdb6ff88(rebuilt without this patch, same failures, same lines, same assertions) — independent of this patch (Apple Accelerate / macOS specific). Other 240 tests pass clean.CPU-only build (
MLX_BUILD_METAL=OFF -DMLX_BUILD_CPU=ON) also builds clean — both new tests aregpu::is_available()-guarded so they no-op on CPU.Workload-level repro (downstream
mlx-swiftfork — for context)Empirically on a downstream
mlx-swift-lmfork running TurboQuant compressed-attention (a custom Metal kernel) at decode B>=16, Qwen3.5-35B-A3B turbo4v2 4K context, M5 Max:Ablation: with
MLX_METAL_RETAIN_BOUND_BUFFERS=0on the same binary, B=32 returns to 0/3 crash. This is the deterministic ablation that proves the patch is what causes the fix. The downstream fork carries additional patches (queue-depth default, swift-sidestopGradient+asyncEval); those numbers therefore include their costs too.Risks
retain()bumps a 32-bit counter inside the existingMTL::Bufferallocation header. Lifetime is extended from "Swift refcount drop" (could be mid-CB) to "CB completion" (typically a few ms later) — which is what the application has been promising Metal all along per Apple's contract.retain()/release()work identically for heap-allocatedMTL::Bufferobjects perMTLHeapdocs. No special-casing.MTL::Buffer::retain/releaseare atomic NSObject operations. The newretained_buffers_vector is encoder-owned and only accessed from the eval thread (same threading model asall_inputs_).MetalAllocator::freeon Swift refcount drop now recycles a buffer whoseMTL::Buffer*may still have refcount >= 1 from the CB. The cache holds the pointer; final destruction (refcount = 0) only happens when the cache evicts AND the completion handler releases. This is the correct serialisation.Open questions
MLX_METAL_RETAIN_BOUND_BUFFERSenv knob ship at all, or just default-on and remove? Argument for keeping one release cycle: bisection support if anyone hits a regression. Long-term answer per Apple's contract is "always on".Checklist
pre-commit run --all-filesto format my code / installed pre-commit prior to committing changes