gds: query reflects cuFile completion #117
Merged
Codecov Report

❌ Your patch check has failed because the patch coverage (26.66%) is below the target coverage (70.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

| Coverage Diff | main | #117 | +/- |
|---|---|---|---|
| Coverage | 55.67% | 55.61% | -0.07% |
| Files | 49 | 49 | |
| Lines | 6890 | 6903 | +13 |
| Branches | 1232 | 1233 | +1 |
| Hits | 3836 | 3839 | +3 |
| Misses | 2576 | 2584 | +8 |
| Partials | 478 | 480 | +2 |
github-actions Bot added a commit that referenced this pull request on May 21, 2026 (5f9226d):

closes #116

## Approach

`store_fs_gds.c`'s `gds_event_query` previously returned 1 unconditionally for any non-sentinel `seq`, completely ignoring whether the `cuFileReadAsync` submitted earlier had actually retired on `stream_h2d`. Callers (the wave-pool scheduler) treat a 1 return as "destination bytes are safe to consume" and transition the slot to `SLOT_READY`. On the GDS path this can hand a wave a `dev_buf` that the device has not yet written to, producing illegal memory accesses downstream in decode kernels gated on cross-stream events that fire ahead of the read draining.

This PR makes the query honor what the caller would reasonably expect: it returns 1 only after the cuFile read has actually completed on the stream.

The mechanism reuses infrastructure already in `gds_submit_dev`: `cuLaunchHostFunc(stream, fs_gds_free_params_cb, ctx)` is enqueued after every `cuFileReadAsync`, so the callback runs in stream order *after* the reads drain. A new small `fs_gds_done { flag, claimed, rc }` struct is allocated per submit; the callback sets `flag=1` and drops one ref, while `gds_event_query` checks the flag (acquire) and CAS-claims the owner-side ref the first time it observes the flag set. Repeated queries are safe; an unqueried event is reclaimed by the callback alone (no leak).

`store_event` gains an opaque `void* impl`, which is backend-private and NULL for non-GDS stores. `gds_event_wait`, previously a no-op, now actually calls `cuStreamSynchronize` and reclaims.

## Key files

- `src/store/store_fs_gds.c:222-260`: the `fs_gds_done` refcount protocol.
- `src/store/store_fs_gds.c:367-407`: the new `gds_event_query` / `gds_event_wait`.
- `tests/test_store_fs_gds.c::test_event_query_reflects_completion`: the contract test. Uses `cuLaunchHostFunc` to park `stream_h2d` behind a host-side barrier, submits a read so cuFile is queued but not retired, asserts the query reports not-ready, then unblocks and asserts it reports ready. Deterministic, not race-dependent. Runs under cuFile compat mode (`CUFILE_FORCE_COMPAT_MODE=true` set in `main` before any cuFile init) so no nvidia-fs is required.

## Test plan

- [x] New test fails before the fix (verified during development: query returns 1 while the read is provably queued behind the barrier).
- [x] New test passes after the fix.
- [x] Existing `test_submit_fail_releases_pins` still passes when it doesn't trip the separate bug below.
- [x] Full test suite (25 tests, GDS build) passes.

## Related, not addressed here

`test_submit_fail_releases_pins` SEGVs at a ~4% rate on this hardware (filed as #118). It reproduces on this branch's parent commit, so it is not introduced by this PR, but it is a real damacy bug to chase, not upstream noise.
nclack added a commit that referenced this pull request on May 22, 2026
closes #116
## Approach

`store_fs_gds.c`'s `gds_event_query` previously returned 1 unconditionally for any non-sentinel `seq`, completely ignoring whether the `cuFileReadAsync` submitted earlier had actually retired on `stream_h2d`. Callers (the wave-pool scheduler) treat a 1 return as "destination bytes are safe to consume" and transition the slot to `SLOT_READY`. On the GDS path this can hand a wave a `dev_buf` that the device has not yet written to, producing illegal memory accesses downstream in decode kernels gated on cross-stream events that fire ahead of the read draining.

This PR makes the query honor what the caller would reasonably expect: it returns 1 only after the cuFile read has actually completed on the stream.

The mechanism reuses infrastructure already in `gds_submit_dev`: `cuLaunchHostFunc(stream, fs_gds_free_params_cb, ctx)` is enqueued after every `cuFileReadAsync`, so the callback runs in stream order *after* the reads drain. A new small `fs_gds_done { flag, claimed, rc }` struct is allocated per submit; the callback sets `flag=1` and drops one ref, while `gds_event_query` checks the flag (acquire) and CAS-claims the owner-side ref the first time it observes the flag set. Repeated queries are safe; an unqueried event is reclaimed by the callback alone (no leak).
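
For readers who want the shape of the flag/claim/refcount handoff, here is a minimal sketch in plain C11 atomics. It is not the code in `store_fs_gds.c`: only the field names `flag`, `claimed`, and `rc` come from this description. The initial refcount of 2, the helper names (`done_new`, `done_unref`, `done_signal_cb`, `sketch_event_query`), the `sketch_event` stand-in for `store_event`, and the `g_frees` counter are all assumptions made for illustration. The sketch assumes a single owner thread issuing queries and does not attempt the never-queried reclamation path.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical layout; only the field names come from the PR text. */
typedef struct fs_gds_done {
    atomic_int flag;    /* 1 once the stream-ordered callback has run */
    atomic_int claimed; /* 1 once the owner has taken its ref back */
    atomic_int rc;      /* assumed: one ref for the callback, one for the owner */
} fs_gds_done;

/* Stand-in for store_event; the real struct carries more fields. */
typedef struct { void *impl; } sketch_event;

static int g_frees; /* illustration-only counter so reclamation is observable */

static fs_gds_done *done_new(void) {
    fs_gds_done *d = malloc(sizeof *d);
    if (!d) return NULL;
    atomic_init(&d->flag, 0);
    atomic_init(&d->claimed, 0);
    atomic_init(&d->rc, 2);
    return d;
}

static void done_unref(fs_gds_done *d) {
    /* acq_rel: whoever drops the last ref sees all earlier writes before freeing. */
    if (atomic_fetch_sub_explicit(&d->rc, 1, memory_order_acq_rel) == 1) {
        free(d);
        g_frees++;
    }
}

/* Shaped like a cuLaunchHostFunc callback: void fn(void *userData).
 * Enqueued after the reads, so it runs once they have drained. */
static void done_signal_cb(void *p) {
    fs_gds_done *d = p;
    atomic_store_explicit(&d->flag, 1, memory_order_release);
    done_unref(d);
}

/* Returns 1 only after the callback has fired. Repeated calls from the
 * single owner thread are safe because impl is cleared after the claim. */
static int sketch_event_query(sketch_event *ev) {
    fs_gds_done *d = ev->impl;
    if (!d) return 1; /* already reclaimed, or a backend with no impl */
    if (!atomic_load_explicit(&d->flag, memory_order_acquire)) return 0;
    int expected = 0;
    if (atomic_compare_exchange_strong(&d->claimed, &expected, 1)) {
        done_unref(d);   /* owner-side ref released exactly once */
        ev->impl = NULL;
    }
    return 1;
}
```

The release store in the callback pairs with the acquire load in the query, so a caller that observes `flag == 1` also observes everything the stream did before the callback ran. The CAS on `claimed` is what keeps a second query from dropping the owner ref twice.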

`store_event` gains an opaque `void* impl`, which is backend-private and NULL for non-GDS stores. `gds_event_wait`, previously a no-op, now actually calls `cuStreamSynchronize` and reclaims.

## Key files

- `src/store/store_fs_gds.c:222-260`: the `fs_gds_done` refcount protocol.
- `src/store/store_fs_gds.c:367-407`: the new `gds_event_query` / `gds_event_wait`.
- `tests/test_store_fs_gds.c::test_event_query_reflects_completion`: the contract test. Uses `cuLaunchHostFunc` to park `stream_h2d` behind a host-side barrier, submits a read so cuFile is queued but not retired, asserts the query reports not-ready, then unblocks and asserts it reports ready. Deterministic, not race-dependent. Runs under cuFile compat mode (`CUFILE_FORCE_COMPAT_MODE=true` set in `main` before any cuFile init) so no nvidia-fs is required.

## Test plan

- [x] New test fails before the fix (verified during development: query returns 1 while the read is provably queued behind the barrier).
- [x] New test passes after the fix.
- [x] Existing `test_submit_fail_releases_pins` still passes when it doesn't trip the separate bug below.
- [x] Full test suite (25 tests, GDS build) passes.
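
The contract test is deterministic because the completion callback sits behind a barrier in a strictly ordered queue. The sketch below illustrates that ordering argument with a toy simulation: a pthread worker stands in for `stream_h2d` and drains a fixed FIFO of host functions. Nothing here is CUDA or cuFile code, and every name (`toy_stream`, `gate`, `fake_read_cb`, `barrier_demo`, and so on) is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

typedef void (*host_fn)(void *);
typedef struct { host_fn fn; void *arg; } task;

/* Toy "stream": tasks run strictly in order on one worker thread. */
typedef struct { task tasks[4]; int count; } toy_stream;

/* Host-side barrier the test controls. */
typedef struct { pthread_mutex_t mu; pthread_cond_t cv; int open; } gate;

static void gate_wait_cb(void *p) { /* parks the stream until the gate opens */
    gate *g = p;
    pthread_mutex_lock(&g->mu);
    while (!g->open) pthread_cond_wait(&g->cv, &g->mu);
    pthread_mutex_unlock(&g->mu);
}

static void gate_open(gate *g) {
    pthread_mutex_lock(&g->mu);
    g->open = 1;
    pthread_cond_signal(&g->cv);
    pthread_mutex_unlock(&g->mu);
}

static void fake_read_cb(void *p) { /* stand-in for the queued read */
    *(unsigned char *)p = 0xAB;
}

static void set_done_cb(void *p) {  /* stand-in for the completion callback */
    atomic_store_explicit((atomic_int *)p, 1, memory_order_release);
}

static void *toy_stream_run(void *p) {
    toy_stream *s = p;
    for (int i = 0; i < s->count; i++) s->tasks[i].fn(s->tasks[i].arg);
    return NULL;
}

/* Returns 0 when "not ready" is seen behind the barrier and "ready" after it. */
static int barrier_demo(void) {
    gate g = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 };
    unsigned char buf = 0;
    atomic_int done;
    atomic_init(&done, 0);
    /* Queue order: barrier, read, completion callback. */
    toy_stream s = { { { gate_wait_cb, &g }, { fake_read_cb, &buf },
                       { set_done_cb, &done } }, 3 };
    pthread_t t;
    if (pthread_create(&t, NULL, toy_stream_run, &s) != 0) return -1;
    /* The gate is still closed, so the callback cannot have run yet. */
    int before = atomic_load_explicit(&done, memory_order_acquire);
    gate_open(&g);
    pthread_join(t, NULL);
    int after = atomic_load_explicit(&done, memory_order_acquire);
    return (before == 0 && after == 1 && buf == 0xAB) ? 0 : 1;
}
```

Because the "not ready" observation happens while the test itself holds the gate closed, the result does not depend on thread timing. That is the same property the real test gets by parking the stream with `cuLaunchHostFunc`.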

## Related, not addressed here

`test_submit_fail_releases_pins` SEGVs at a ~4% rate on this hardware (filed as #118). It reproduces on this branch's parent commit, so it is not introduced by this PR, but it is a real damacy bug to chase, not upstream noise.