Skip to content

gds: query reflects cuFile completion#117

Merged
nclack merged 15 commits into
mainfrom
gds-event-query-fix
May 21, 2026
Merged

gds: query reflects cuFile completion#117
nclack merged 15 commits into
mainfrom
gds-event-query-fix

Conversation

@nclack
Copy link
Copy Markdown
Owner

@nclack nclack commented May 21, 2026

closes #116

Approach

store_fs_gds.c's gds_event_query previously returned 1 unconditionally
for any non-sentinel seq — completely ignoring whether the
cuFileReadAsync submitted earlier had actually retired on
stream_h2d. Callers (the wave-pool scheduler) treat a 1 return as
"destination bytes are safe to consume" and transition the slot to
SLOT_READY. On the GDS path this can hand a wave a dev_buf that
the device has not yet written to, producing illegal memory accesses
downstream in decode kernels gated on cross-stream events that fire
ahead of the read draining.

This PR makes the query honor what the caller would reasonably expect:
returns 1 only after the cuFile read has actually completed on the
stream.

The mechanism reuses infrastructure already in gds_submit_dev:
cuLaunchHostFunc(stream, fs_gds_free_params_cb, ctx) is enqueued
after every cuFileReadAsync, so the callback runs in stream order
after the reads drain. A new small fs_gds_done { flag, claimed, rc } struct is allocated per submit; the callback sets flag=1 and
drops one ref, gds_event_query checks the flag (acquire) and
CAS-claims the owner-side ref the first time it observes the flag
set. Repeated queries are safe; an unqueried event is reclaimed by
the callback alone (no leak).

store_event gains an opaque void* impl — backend-private,
NULL for non-GDS stores.

gds_event_wait, previously a no-op, now actually
cuStreamSynchronizes and reclaims.

Key files

  • src/store/store_fs_gds.c:222-260 — the fs_gds_done refcount
    protocol.
  • src/store/store_fs_gds.c:367-407 — the new gds_event_query /
    gds_event_wait.
  • tests/test_store_fs_gds.c::test_event_query_reflects_completion
    the contract test. Uses cuLaunchHostFunc to park stream_h2d
    behind a host-side barrier, submits a read so cuFile is queued but
    not retired, asserts the query reports not-ready, then unblocks and
    asserts it reports ready. Deterministic, not race-dependent. Runs
    under cuFile compat mode (CUFILE_FORCE_COMPAT_MODE=true set in
    main before any cuFile init) so no nvidia-fs is required.

Test plan

  • New test fails before the fix (verified during development —
    query returns 1 while the read is provably queued behind the
    barrier).
  • New test passes after the fix.
  • Existing test_submit_fail_releases_pins still passes when it
    doesn't trip the separate bug below.
  • Full test suite (25 tests, GDS build) passes.

Related, not addressed here

test_submit_fail_releases_pins SEGVs at ~4% on this hardware
(filed as #118). Reproduces on this branch's parent commit, so it is
not introduced by this PR, but it is a real damacy bug to chase, not
upstream noise.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 21, 2026

Codecov Report

❌ Patch coverage is 26.66667% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.61%. Comparing base (67de519) to head (4968a44).

Files with missing lines Patch % Lines
src/store/store.c 0.00% 7 Missing ⚠️
src/platform/platform.posix.c 0.00% 2 Missing ⚠️
src/store/store_fs.c 0.00% 1 Missing ⚠️
src/wave/wave_pool.c 80.00% 1 Missing ⚠️

❌ Your patch check has failed because the patch coverage (26.66%) is below the target coverage (70.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #117      +/-   ##
==========================================
- Coverage   55.67%   55.61%   -0.07%     
==========================================
  Files          49       49              
  Lines        6890     6903      +13     
  Branches     1232     1233       +1     
==========================================
+ Hits         3836     3839       +3     
- Misses       2576     2584       +8     
- Partials      478      480       +2     
Flag Coverage Δ
unittests 55.61% <26.66%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
src/store/store_fs.c 74.74% <0.00%> (+0.64%) ⬆️
src/wave/wave_pool.c 78.36% <80.00%> (-0.29%) ⬇️
src/platform/platform.posix.c 58.47% <0.00%> (-1.01%) ⬇️
src/store/store.c 43.28% <0.00%> (-5.05%) ⬇️

... and 8 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@nclack nclack merged commit 5f9226d into main May 21, 2026
5 checks passed
@nclack nclack deleted the gds-event-query-fix branch May 21, 2026 18:28
github-actions Bot added a commit that referenced this pull request May 21, 2026
closes #116

## Approach

`store_fs_gds.c`'s `gds_event_query` previously returned 1
unconditionally
for any non-sentinel `seq` — completely ignoring whether the
`cuFileReadAsync` submitted earlier had actually retired on
`stream_h2d`. Callers (the wave-pool scheduler) treat a 1 return as
"destination bytes are safe to consume" and transition the slot to
`SLOT_READY`. On the GDS path this can hand a wave a `dev_buf` that
the device has not yet written to, producing illegal memory accesses
downstream in decode kernels gated on cross-stream events that fire
ahead of the read draining.

This PR makes the query honor what the caller would reasonably expect:
returns 1 only after the cuFile read has actually completed on the
stream.

The mechanism reuses infrastructure already in `gds_submit_dev`:
`cuLaunchHostFunc(stream, fs_gds_free_params_cb, ctx)` is enqueued
after every `cuFileReadAsync`, so the callback runs in stream order
*after* the reads drain. A new small `fs_gds_done { flag, claimed,
rc }` struct is allocated per submit; the callback sets `flag=1` and
drops one ref, `gds_event_query` checks the flag (acquire) and
CAS-claims the owner-side ref the first time it observes the flag
set. Repeated queries are safe; an unqueried event is reclaimed by
the callback alone (no leak).

`store_event` gains an opaque `void* impl` — backend-private,
NULL for non-GDS stores.

`gds_event_wait`, previously a no-op, now actually
`cuStreamSynchronize`s and reclaims.

## Key files

- `src/store/store_fs_gds.c:222-260` — the `fs_gds_done` refcount
  protocol.
- `src/store/store_fs_gds.c:367-407` — the new `gds_event_query` /
  `gds_event_wait`.
- `tests/test_store_fs_gds.c::test_event_query_reflects_completion` —
  the contract test. Uses `cuLaunchHostFunc` to park `stream_h2d`
  behind a host-side barrier, submits a read so cuFile is queued but
  not retired, asserts the query reports not-ready, then unblocks and
  asserts it reports ready. Deterministic, not race-dependent. Runs
  under cuFile compat mode (`CUFILE_FORCE_COMPAT_MODE=true` set in
  `main` before any cuFile init) so no nvidia-fs is required.

## Test plan

- [x] New test fails before the fix (verified during development —
      query returns 1 while the read is provably queued behind the
      barrier).
- [x] New test passes after the fix.
- [x] Existing `test_submit_fail_releases_pins` still passes when it
      doesn't trip the separate bug below.
- [x] Full test suite (25 tests, GDS build) passes.

## Related, not addressed here

`test_submit_fail_releases_pins` SEGVs at ~4% on this hardware
(filed as #118). Reproduces on this branch's parent commit, so it is
not introduced by this PR, but it is a real damacy bug to chase, not
upstream noise. 5f9226d
nclack added a commit that referenced this pull request May 22, 2026
closes #116 

## Approach

`store_fs_gds.c`'s `gds_event_query` previously returned 1
unconditionally
for any non-sentinel `seq` — completely ignoring whether the
`cuFileReadAsync` submitted earlier had actually retired on
`stream_h2d`. Callers (the wave-pool scheduler) treat a 1 return as
"destination bytes are safe to consume" and transition the slot to
`SLOT_READY`. On the GDS path this can hand a wave a `dev_buf` that
the device has not yet written to, producing illegal memory accesses
downstream in decode kernels gated on cross-stream events that fire
ahead of the read draining.

This PR makes the query honor what the caller would reasonably expect:
returns 1 only after the cuFile read has actually completed on the
stream.

The mechanism reuses infrastructure already in `gds_submit_dev`:
`cuLaunchHostFunc(stream, fs_gds_free_params_cb, ctx)` is enqueued
after every `cuFileReadAsync`, so the callback runs in stream order
*after* the reads drain. A new small `fs_gds_done { flag, claimed,
rc }` struct is allocated per submit; the callback sets `flag=1` and
drops one ref, `gds_event_query` checks the flag (acquire) and
CAS-claims the owner-side ref the first time it observes the flag
set. Repeated queries are safe; an unqueried event is reclaimed by
the callback alone (no leak).

`store_event` gains an opaque `void* impl` — backend-private,
NULL for non-GDS stores.

`gds_event_wait`, previously a no-op, now actually
`cuStreamSynchronize`s and reclaims.

## Key files

- `src/store/store_fs_gds.c:222-260` — the `fs_gds_done` refcount
  protocol.
- `src/store/store_fs_gds.c:367-407` — the new `gds_event_query` /
  `gds_event_wait`.
- `tests/test_store_fs_gds.c::test_event_query_reflects_completion` —
  the contract test. Uses `cuLaunchHostFunc` to park `stream_h2d`
  behind a host-side barrier, submits a read so cuFile is queued but
  not retired, asserts the query reports not-ready, then unblocks and
  asserts it reports ready. Deterministic, not race-dependent. Runs
  under cuFile compat mode (`CUFILE_FORCE_COMPAT_MODE=true` set in
  `main` before any cuFile init) so no nvidia-fs is required.

## Test plan

- [x] New test fails before the fix (verified during development —
      query returns 1 while the read is provably queued behind the
      barrier).
- [x] New test passes after the fix.
- [x] Existing `test_submit_fail_releases_pins` still passes when it
      doesn't trip the separate bug below.
- [x] Full test suite (25 tests, GDS build) passes.

## Related, not addressed here

`test_submit_fail_releases_pins` SEGVs at ~4% on this hardware
(filed as #118). Reproduces on this branch's parent commit, so it is
not introduced by this PR, but it is a real damacy bug to chase, not
upstream noise.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

GDS path: CUDA illegal memory access on multi-GPU (reproducible)

1 participant