Retain bound buffers under untracked hazard mode#3462

Closed
TheTom wants to merge 1 commit into ml-explore:main from TheTom:upstream-retain-bound-buffers

Conversation

@TheTom

@TheTom TheTom commented Apr 28, 2026

Refs #3461.

The Metal backend allocates buffers with MTLResourceHazardTrackingModeUntracked and creates command buffers via commandBufferWithUnretainedReferences(). Both Apple APIs require the application to keep bound buffers alive until the command buffer completes. The bind path (CommandEncoder::set_buffer / set_input_array) wasn't taking that retain — buffers could be destroyed mid-flight when the caller's shared_ptr<array::Data> dropped between encode and CB completion. Manifests as random [METAL] Command buffer execution failed: Invalid Resource crashes under concurrent custom-kernel workloads (e.g. mlx-swift Swift Tasks dispatching MLXFast.metal_kernel at decode B>=16).

Metal Validator (METAL_DEVICE_WRAPPER_TYPE=1 MTL_DEBUG_LAYER=1) names it directly:

The following Metal object is being destroyed while still required to be
alive by the command buffer 0x...:
<AGXG17XFamilyBuffer: 0x...>
    length = 32768
    hazardTrackingMode = MTLHazardTrackingModeUntracked

Companion to #3078 — different bug (encoder aliasing across threads), but in the same area.

Proposed changes

  • CommandEncoder retains each bound MTL::Buffer on first sighting in the current command buffer (using the existing all_inputs_ set as the dedup oracle). Per-CB cost: one retain()/release() pair per unique buffer.
  • eval() transfers the per-CB retained vector into the addCompletedHandler lambda and releases each pointer when the CB completes.
  • eval() no longer removes the output's data_shared_ptr from the captured set — the unordered_set already dedupes the input-donated-as-output case, and removing it leaked the output buffer's lifetime to the caller's wrapper.
  • New env::metal_retain_bound_buffers() accessor in mlx/utils.h (matches metal_fast_synch / metal_gpu_arch pattern). Env var: MLX_METAL_RETAIN_BOUND_BUFFERS=0 reverts to old behaviour for ablation/bisection.

Diff: 70 lines across 4 files (mlx/utils.h, mlx/backend/metal/device.h, mlx/backend/metal/device.cpp, mlx/backend/metal/eval.cpp) + 120-line tests file under tests/.

Tests

tests/metal_buffer_lifetime_tests.cpp adds two TEST_CASEs, both gpu::is_available()-guarded:

  1. test concurrent eval smoke — 16 threads × 8 iters of matmul/eval. Smoke test for built-in primitives.
  2. test custom kernel concurrent buffer lifetime — 32 threads × 64 iters of mlx::fast::metal_kernel dispatch + drop. Designed to be the deterministic regression test for THIS bug.

Both pass on M5 Max with the patch:

$ ./tests/tests --test-case="*concurrent*"
[doctest] test cases: 2 | 2 passed | 0 failed | 244 skipped

Honest caveat on the deterministic test: I expected test custom kernel concurrent buffer lifetime to fail in ablation (MLX_METAL_RETAIN_BOUND_BUFFERS=0) and it doesn't. I tried scaling to 64 threads × 256 iters × 4096-element arrays × 3 inputs and it still passes ablated, even with Metal Validator (METAL_DEVICE_WRAPPER_TYPE=1 MTL_DEBUG_LAYER=1) on. The race is real but workload-shaped — it requires the specific memory pressure + binding pattern of compressed-attention KV cache reads with concurrent decode (the original repro). Built-in primitives and small custom kernels both have well-rooted enough shared_ptr chains that the race rarely surfaces in unit-test conditions.

So the included tests are smoke tests in practice. The actual deterministic regression evidence is at the workload level (below). I'd appreciate maintainer guidance on whether a more aggressive metal_kernel-based test is wanted or if the workload-level evidence + MLX_METAL_RETAIN_BOUND_BUFFERS=0 ablation lever is enough.

Full upstream ctest (with patch applied):

240 passed, 7 failed of 247

The 6 failing doctests + 1 aggregate are all in linalg_tests.cpp (matrix factorisation: QR, eigh, inversion, cholesky, pseudo-inverse, lu). Verified pre-existing on unpatched bdb6ff88 (rebuilt without this patch, same failures, same lines, same assertions) — independent of this patch (Apple Accelerate / macOS specific). Other 240 tests pass clean.

A CPU-only build (-DMLX_BUILD_METAL=OFF -DMLX_BUILD_CPU=ON) also builds clean; both new tests are gpu::is_available()-guarded so they no-op on CPU.

Workload-level repro (downstream mlx-swift fork — for context)

Empirically on a downstream mlx-swift-lm fork running TurboQuant compressed-attention (a custom Metal kernel) at decode B>=16, Qwen3.5-35B-A3B turbo4v2 4K context, M5 Max:

B (concurrent decode tasks) | Pre-patch | Post-patch (this commit applied)
16                          | 0/10      | 10/10
17                          | 0/10      | 10/10
32                          | 0/5       | 5/5

Ablation: with MLX_METAL_RETAIN_BOUND_BUFFERS=0 on the same binary, B=32 reverts to 0/3 (crash). This ablation isolates the retain-bound-buffers path as the cause of the fix. The downstream fork carries additional patches (queue-depth default, swift-side stopGradient + asyncEval); those numbers therefore include their costs too.

Risks

  • Memory: no additional allocation. retain() bumps a 32-bit counter inside the existing MTL::Buffer allocation header. Lifetime is extended from "Swift refcount drop" (could be mid-CB) to "CB completion" (typically a few ms later), which is what the application has been promising Metal all along per Apple's contract.
  • Heap suballocations: retain()/release() work identically for heap-allocated MTL::Buffer objects per MTLHeap docs. No special-casing.
  • Threading: MTL::Buffer::retain / release are atomic NSObject operations. The new retained_buffers_ vector is encoder-owned and only accessed from the eval thread (same threading model as all_inputs_).
  • Buffer recycling: MetalAllocator::free on Swift refcount drop now recycles a buffer whose MTL::Buffer* may still have refcount >= 1 from the CB. The cache holds the pointer; final destruction (refcount = 0) only happens when the cache evicts AND the completion handler releases. This is the correct serialisation.

Open questions

  • Should the MLX_METAL_RETAIN_BOUND_BUFFERS env knob ship at all, or just default-on and remove? Argument for keeping one release cycle: bisection support if anyone hits a regression. Long-term answer per Apple's contract is "always on".
  • Is the smoke + custom-kernel test pair sufficient given the lack of unit-level deterministic ablation, or would maintainers prefer I cut the second test back to one combined smoke test?

Checklist

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

@TheTom TheTom force-pushed the upstream-retain-bound-buffers branch 4 times, most recently from 941ddea to 9c2187a Compare April 28, 2026 01:10
Author

@TheTom TheTom left a comment

Ready for review. PTAL

@TheTom TheTom marked this pull request as ready for review April 28, 2026 01:18
TheTom added a commit to TheTom/mlx that referenced this pull request Apr 28, 2026
The Metal allocator uses MTLResourceHazardTrackingModeUntracked and the
command queue uses commandBufferWithUnretainedReferences(); both Apple
APIs require the application to keep bound buffers alive until the
command buffer completes. CommandEncoder::set_buffer / set_input_array
did not take that retain, so a buffer could be destroyed mid-flight when
the caller's shared_ptr<array::Data> dropped between encode and CB
completion.

Retain each MTL::Buffer on first sighting in the current command buffer
(the existing all_inputs_ set is the dedup oracle), transfer the per-CB
retained vector into the addCompletedHandler lambda, and release on
completion. Also stop removing the output's data_shared_ptr from the
captured set in eval() (the unordered_set already dedupes).

Manifests as random Invalid Resource crashes under concurrent custom-
kernel workloads at decode B>=16 on M-series. Ablation env knob
MLX_METAL_RETAIN_BOUND_BUFFERS=0 reverts behaviour for bisection.

Validated on Qwen3.5-35B-A3B turbo4v2 4K (M5 Max): 0/10 -> 10/10 at B=16
and B=17, 0/5 -> 5/5 at B=32. Steady-state decode 86.5 vs 88.6 t/s
(~2.4% cost). Memory cost: zero (retain bumps a 32-bit counter inside
the existing MTL::Buffer header).

Mirrors ml-explore#3462 (issue ml-explore#3461).
@TheTom TheTom force-pushed the upstream-retain-bound-buffers branch from 9c2187a to 379f002 Compare April 28, 2026 01:35
@TheTom TheTom force-pushed the upstream-retain-bound-buffers branch from 379f002 to 57c88c0 Compare April 28, 2026 01:38
TheTom added a commit to TheTom/mlx that referenced this pull request Apr 28, 2026
ekryski pushed a commit to ekryski/mlx that referenced this pull request Apr 28, 2026
@ekryski

ekryski commented Apr 28, 2026

I have confirmed this fixed the bug in my mlx fork, and I have looked at this PR: the code is identical to my fix. Thanks for putting it up!

Collaborator

@zcbenz zcbenz left a comment

buffers could be destroyed mid-flight when the caller's shared_ptr<array::Data> dropped between encode and CB completion

The buffers are retained between encode and completion by code:

    command_buffer->addCompletedHandler(
        [s, buffers = std::move(buffers)](MTL::CommandBuffer* cbuf) {
      ...
    }

The buffers = std::move(buffers) would ensure that the buffers won't be released until the completion callback is ended.

I highly doubt that the crashes were caused by race conditions, which should have been resolved by recent thread safety changes. I have tried to run the tests added by this PR on the main branch and they pass.

@TheTom
Author

TheTom commented Apr 30, 2026

Thanks for the comments. Some responses:

buffers could be destroyed mid-flight when the caller's shared_ptr<array::Data> dropped between encode and CB completion

The buffers are retained between encode and completion by code:

    command_buffer->addCompletedHandler(
        [s, buffers = std::move(buffers)](MTL::CommandBuffer* cbuf) {
      ...
    }

The buffers = std::move(buffers) would ensure that the buffers won't be released until the completion callback is ended.

The captured set is held through CB completion, but the line right above the addCompletedHandler explicitly removes the output's data_shared_ptr from that set:

// Remove the output if it was donated to by an input
if (auto it = buffers.find(arr.data_shared_ptr()); it != buffers.end()) {
    buffers.erase(it);
}

So the lambda capture protects inputs and siblings but not the output. The output's lifetime is left to the Swift caller's MLXArray reference. Under structured concurrency, the caller's deinit fires on the Swift thread before the completion handler runs; the Data destructor calls allocator::free, the buffer returns to the pool, and the next concurrent malloc() can reissue it while the original CB is still using it.

Issue #3461 has the Metal Validator log naming the destroyed-mid-flight buffer directly.

The patch closes this with two changes: drop the donation-erase in eval.cpp (the unordered_set already dedupes the input-donated-as-output case), and add an explicit MTL::Buffer::retain at bind time as a second layer for chained lazy-op cases where an input is itself the output of a downstream op whose lifetime ends before this CB completes.

I highly doubt that the crashes were caused by race conditions which should had been resolved by recent thread safety changes. I have tried to run the tests added by this PR in the main branch and they are passing.

Could you point me at the specific PR you're thinking of for the recent thread safety changes? Happy to retest against that baseline. The companion issue #3078 is in the same area but a different mechanism (encoder aliasing across threads), and this patch isn't intended to address that one.

On the tests passing on main: that's right, and the PR description says so up front; this is part of what makes it a hard problem. The included tests are smoke tests, not deterministic regressions for this specific race. I scaled the metal_kernel test to 64 threads × 256 iters × 4096-element arrays × 3 inputs and it still passes on main with Metal Validator on. The race needs the specific binding pattern of the downstream mlx-swift-lm reproducer (compressed-attention KV reads with concurrent decode at B>=16).

The deterministic evidence is the workload ablation from #3461: same patched binary, MLX_METAL_RETAIN_BOUND_BUFFERS=0 env, B=32 reverts from 30/30 success to 0/3 crash on the original Qwen3.5-35B-A3B turbo4 setup. That A/B isolates the retain-bound-buffers path as the cause of the fix.

@zcbenz
Collaborator

zcbenz commented Apr 30, 2026

If the root cause is the output getting released too early, I think ensuring that arr gets retained (i.e. inserted into buffers) in eval would be enough to fix it?

The extra retain feels unnecessary because array::Data already holds the buffers with ref count 1, and the retain would merely increase ref count to 2.

@TheTom
Author

TheTom commented Apr 30, 2026

The ablation answers this directly. In the same patched binary, with the eval.cpp insert still in place, MLX_METAL_RETAIN_BOUND_BUFFERS=0 disables only the bind-path retain and the original B=32 repro returns to its pre-patch crash rate. So eval.cpp alone is not sufficient for this workload.

The two layers also protect different things: eval.cpp captures graph-level wrappers, while CommandEncoder is the only place that sees every MTL::Buffer actually bound to the CB. Since MLX uses untracked resources + commandBufferWithUnretainedReferences, the lifetime contract is on those bound MTL::Buffers, not on the wrappers.

@zcbenz
Collaborator

zcbenz commented Apr 30, 2026

I think there are 2 possibilities:

  • We need the extra retain() to actually retain the buffers (unlikely to be true IMO).
  • Some buffers that should have been retained were missed and released early.

To verify whether the first possibility is true, can you try retaining the buffers without calling retain()? i.e. replacing

std::vector<MTL::Buffer*> retained_buffers_;

with

std::set<std::shared_ptr<array::Data>> retained_buffers_;

@TheTom
Author

TheTom commented Apr 30, 2026

@zcbenz
Ran your wrapper-set experiment properly. Initial attempt had a copy-paste bug where the wrapper insert was env-gated on metal_retain_bound_buffers(), so at env=0 the wrapper-set was disabled too and Run C was actually testing "no retention at all". Caught it on re-read. With the gate removed, results below.

Workload: B=32 concurrent generation tasks via withThrowingTaskGroup on Qwen3.5-35B-A3B-4bit + turbo4v2 KV scheme, M5 Max 128GB, default wired cap. Test source + full logs at .

Run | env | Variant               | Result
A   | 1   | original retain       | 25/25 ✅
B   | 0   | original retain       | crash @ iter 8 (SIGABRT)
B2  | 0   | original retain       | crash @ iter 9 (SIGSEGV)
C   | 0   | wrapper-set (ungated) | 25/25 ✅
C2  | 0   | wrapper-set (ungated) | 25/25 ✅
C3  | 0   | wrapper-set (ungated) | 25/25 ✅

Direct answer to your two cases: case 2 (missed wrapper coverage in eval.cpp's existing capture) is the right one on this workload. Case 1 (Apple-API-level retain strictly required) is not what's firing here. Wrapper-set keeps the array::Data shared_ptr alive across CB completion and the race goes away.

Stack trace from Run B (MLXArray deallocated with non-zero retain count N then SIGABRT) points into mlx::core::metal::MetalAllocator::malloc + mlx::core::BufferCache<MTL::Buffer>::reuse_from_cache + mlx::core::array::~array + MLXArray.__deallocating_deinit. Pool reuse vs Swift wrapper deinit timing, exactly the case-2 picture.

One thing I'd flag before fully retiring the bind-path retain: set_buffer(MTL::Buffer*, ...) raw-pointer path has no wrapper to insert into the set, so wrapper-set leaves it uncovered. On this workload that path isn't hit hard enough to expose a problem since turbo4v2 mostly exercises set_input_array. But for a workload that does exercise raw-pointer binds (custom kernels, anyone passing MTL::Buffer* directly), wrapper-set wouldn't catch it and bind-path retain would. So my current PR is a strict superset of what wrapper-set covers. Happy to update it to a hybrid (wrapper-set in eval.cpp for the wrapper-bearing path, targeted retain only on set_buffer(MTL::Buffer*)) if you'd prefer the cleaner mechanism for the common case while keeping the raw-pointer path covered. Whichever lands, the case-2 framing is right.

@zcbenz
Collaborator

zcbenz commented May 4, 2026

Thanks for testing the possibilities. The most likely causes are:

  • In the eval_gpu we allocated some arrays and forgot to call add_temporary_array for them.
  • When creating primitives we forgot to set some inputs.

But they do not necessarily happen inside MLX, and unlikely so because it would have been reported long ago. You mentioned using a forked mlx-swift-lm with custom kernels, which in my opinion should be looked into first.

We cannot just retain everything passed to set_input_array because it would simply hide the bug, and it would retain temporary buffers longer than necessary, which would introduce performance problems.

jjang-ai added a commit to osaurus-ai/vmlx-swift-lm that referenced this pull request May 5, 2026
…to, vendor C ABI

Three co-located fixes for `78c91aa` not building cleanly:

1. **mlx-swift pin reverted to `0a56f90`** (was `a21d2af`).
   `a21d2af` advanced the `osaurus-ai/mlx` submodule to `7086ba37`,
   an INCOMPLETE backport of upstream ml-explore/mlx#3462. The
   backport added `encoder.take_retained_buffers()` at
   `mlx/backend/metal/eval.cpp:62` but never carried over the
   `auto& encoder = metal::get_command_encoder(s);` declaration that
   line depends on. Result: the package fails to compile at HEAD.

   `0a56f90` is the last green pin (submodule at `mlx@96aa27a5`,
   layered on upstream `ce45c525`). Reverting drops the perf-oriented
   buffer-retain optimization but restores correctness. Re-introduce
   when an `osaurus-ai/mlx-swift` branch advances the submodule
   pointer to `e577ca02` (the corrected backport, currently on
   `backport/3462-retain-bound-buffers`).

2. **swift-crypto range loosened to `"3.0.0"..<"5.0.0"`** (was
   `from: "4.0.0"`). The hard 4.x lower-bound conflicted with hosts
   that pin `apple/containerization` (still on swift-crypto 3.x as of
   0.32.0). The only crypto APIs touched by MLXDistributedTransport
   are `SHA256.hash` and `P256.Signing.PrivateKey()` — both stable
   since swift-crypto 1.x — so the wider range is safe.

3. **Vendored `mlx-c/mlx/c/distributed.cpp` and
   `distributed_group.cpp`** into `CmlxDistributedShim` as
   `MlxCDistributed.cpp` and `MlxCDistributedGroup.cpp`. The
   mlx-swift Package.swift excludes both files from the Cmlx target
   (only the abstract C++ layer is built; backends + the C ABI
   wrappers are not). Without them our `_mlx_distributed_*` C symbols
   are unresolved and `TPRankWorker` fails to link. Files are
   byte-identical to the upstream — re-vendor when bumping
   mlx-swift if the C ABI changes.

   Added `cxxLanguageStandard: .gnucxx20` at the package level so
   the vendored C++ uses the same standard as the upstream Cmlx
   target. Added matching `headerSearchPath`s for the mlx-c headers,
   the mlx C++ root, json/single_include/nlohmann, and fmt/include
   so the vendored files resolve their includes.

Verified end-to-end on M5 Max (apple silicon, macOS 26.3.2):

- `swift build -c release` → Build complete in 26.84s, including
  CmlxDistributedShim (3 files), MLXDistributedTP, TPRankWorker,
  RunBench. No errors, only pre-existing warnings.
- `swift test --filter "LoadConfigurationTests|ShardingPlanTests"`
  → 28/28 pass.
- Real-bundle smoke (Laguna-XS.2-JANGTQ, 9.4 GB): 3-turn coherent
  multi-turn, "blue" recall correct.
- Real-bundle smoke (MiniMax-SLURPY-JANGTQ, BENCH_JPREG=1):
  - MiniMaxM2Minimal auto-engage confirmed in logs
  - Thinking probe PASS (off-reasoning=0c, on-reasoning=772c)
  - TQ disk round-trip PASS
  - 3/3 turn coherence, no looping

Pre-existing flakes NOT introduced by this commit:
  - `EvalTests/testConcurrentSampling` and `testRandomStateIsolation`
    crash on `0a56f90` (the very symptoms `a21d2af` tried to fix).
    Will re-pass once the mlx-swift backport branch points at
    `e577ca02`.
  - `LoadConfigurationTests.autoFallsThroughOnBadEnv` flakes under
    parallel xctest because `withEnvironmentValue` uses
    process-global setenv. Pre-existing pattern, unrelated.
@TheTom
Author

TheTom commented May 5, 2026

Closing this in favor of a targeted downstream fix.

@zcbenz's reading was right: the lifetime race wasn't an Apple-API contract gap that needed bind-path retain. It was nine eval_gpu implementations in our fork that allocate intermediate contiguous_copy_gpu arrays, bind them via set_input_array, but never call compute_encoder.add_temporaries(...) on the copies. eval.cpp's existing completion handler captures arr.inputs() shared_ptrs but not these in-eval-allocated temporaries, so the local auto x = ensure_contiguous(...) variable drops at end of eval_gpu, the array::Data ref count hits zero, and BufferCache::reuse_from_cache is free to hand the buffer to a concurrent allocation while the original CB is still pending.

Symptom is workload-dependent. Metal Invalid Resource when the validator catches it, MLXArray deallocated with non-zero retain count N when the Swift wrapper's deinit fires after C++ has freed the underlying array, or downstream shape corruption like (1,16,0,64) from a half-freed slice. Root cause is the same missing registration in the fork primitives.

Patch follows the SDPA pattern (scaled_dot_product_attention.cpp:807): collect copies into a std::vector<array>, call add_temporaries(std::move(copies)) at the end of eval_gpu. Validated on the same Qwen3.5-35B-A3B turbo4v2 workload that originally produced the race: 15x unpatched runs at B=64 = 4/15 crashes (27%), 15x patched = 0/15 (p ~= 0.01), per-iter timing identical so no perf cost.

Fix is in our fork, not upstream. The affected primitives don't exist in ml-explore/mlx. Lands at ekryski#25.

Thanks @zcbenz for the patient pushback. "Look at the forked mlx-swift-lm with custom kernels first" was exactly the right pointer. Closing #3461 with the same context.

@TheTom TheTom closed this May 5, 2026