Metadata prefetch layer#120
Conversation
Review notesRead through the cache primitive, the prefetcher, and the three fetchers against Blockers1. Handle stability is broken by eviction — The handle stores The straightforward fix is to key the handle on the slot itself, not its index in a compact array — e.g. allocate from a fixed slot table (capacity-sized) and let This bug isn't currently caught because no unit test triggers eviction with a still-live handle to a non-evicted sibling slot. Worth adding before further consumers depend on handle stability. 2. Gate aliasing across batch reuse —
Significant3. Worker polls instead of using the blocking pop the design added —
4.
5. Per-prefetcher errors don't set the batch gate's error bit —
6.
Notes7. 8. Slot/batch tables are linear scans — 9. Test coverage gaps for the contracts that bite later — eviction with live sibling handle (#1), gate reuse race (#2), zero-shard sample (#4), saturated cache returning 10. Bottom line#1 and #2 are correctness issues that should land in this PR or as immediate follow-ups before the planner/scheduler integration goes in on top. The rest can sit until the integration PR exercises them. |
closes #116 ## Approach `store_fs_gds.c`'s `gds_event_query` previously returned 1 unconditionally for any non-sentinel `seq` — completely ignoring whether the `cuFileReadAsync` submitted earlier had actually retired on `stream_h2d`. Callers (the wave-pool scheduler) treat a 1 return as "destination bytes are safe to consume" and transition the slot to `SLOT_READY`. On the GDS path this can hand a wave a `dev_buf` that the device has not yet written to, producing illegal memory accesses downstream in decode kernels gated on cross-stream events that fire ahead of the read draining. This PR makes the query honor what the caller would reasonably expect: returns 1 only after the cuFile read has actually completed on the stream. The mechanism reuses infrastructure already in `gds_submit_dev`: `cuLaunchHostFunc(stream, fs_gds_free_params_cb, ctx)` is enqueued after every `cuFileReadAsync`, so the callback runs in stream order *after* the reads drain. A new small `fs_gds_done { flag, claimed, rc }` struct is allocated per submit; the callback sets `flag=1` and drops one ref, `gds_event_query` checks the flag (acquire) and CAS-claims the owner-side ref the first time it observes the flag set. Repeated queries are safe; an unqueried event is reclaimed by the callback alone (no leak). `store_event` gains an opaque `void* impl` — backend-private, NULL for non-GDS stores. `gds_event_wait`, previously a no-op, now actually `cuStreamSynchronize`s and reclaims. ## Key files - `src/store/store_fs_gds.c:222-260` — the `fs_gds_done` refcount protocol. - `src/store/store_fs_gds.c:367-407` — the new `gds_event_query` / `gds_event_wait`. - `tests/test_store_fs_gds.c::test_event_query_reflects_completion` — the contract test. Uses `cuLaunchHostFunc` to park `stream_h2d` behind a host-side barrier, submits a read so cuFile is queued but not retired, asserts the query reports not-ready, then unblocks and asserts it reports ready. Deterministic, not race-dependent. Runs under cuFile compat mode (`CUFILE_FORCE_COMPAT_MODE=true` set in `main` before any cuFile init) so no nvidia-fs is required. ## Test plan - [x] New test fails before the fix (verified during development — query returns 1 while the read is provably queued behind the barrier). - [x] New test passes after the fix. - [x] Existing `test_submit_fail_releases_pins` still passes when it doesn't trip the separate bug below. - [x] Full test suite (25 tests, GDS build) passes. ## Related, not addressed here `test_submit_fail_releases_pins` SEGVs at ~4% on this hardware (filed as #118). Reproduces on this branch's parent commit, so it is not introduced by this PR, but it is a real damacy bug to chase, not upstream noise.
Fixes #118. ## Approach In compat mode, libcufile lazily allocates per-stream state on the first `cuFileReadAsync`, and that lazy init races against itself. The passing test in the same file happens to enqueue a `cuLaunchHostFunc` barrier before the first read, which serializes the stream enough to mask the race. The failing test goes straight into `cuFileReadAsync` on an empty stream and SEGVs ~4% of the time deep inside libcufile. cuFile already exposes the hook for this: `cuFileStreamRegister` "allocates resources needed to support cuFile operations asynchronously for the cuda stream" — i.e. exactly the lazy state that was racing. Calling it eagerly when damacy adopts a stream removes the race window. The matching `cuFileStreamDeregister` is required before `cuStreamDestroy` per the cuFile contract. ## Change In `src/store/store_fs_gds.c`: - dlsym-bind `cuFileStreamRegister` / `cuFileStreamDeregister` as optional symbols (graceful no-op on older libcufile that doesn't ship them). - `store_fs_gds_set_stream`: deregister any previously-set stream, then register the new one. First `cuFileReadAsync` now finds per-stream state already allocated. - `gds_destroy`: deregister after the existing `cuStreamSynchronize`, before the caller's `cuStreamDestroy`. ## Verification - `cmake --build build` clean. - 100× loop of `CUFILE_FORCE_COMPAT_MODE=true ./build/tests/test_store_fs_gds`: 0 failures (was ~4/100). - Full `ctest`: 26/26 pass.
## Approach Close #101 by removing the dead static fallback and replacing the ad-hoc structural constants it derived from with explicit `damacy_tuning` knobs. `chunk_substreams_upper_bound` (formerly `chunk_zsubs_upper_bound`) in `src/wave/wave_pool.c` sizes the per-wave fanout SOA and the shared nvcomp zstd decoder scratch. Its `!sp->layout_probed` fallback returned a hardcoded `DAMACY_BLOSC_MAX_BLOCKS_PER_CHUNK = 32` — the adversarial worst case. But `wave_chunks_eligible` (per-chunk gate, runs before `prepare_decode_caps` in `kick_h2d`) rejects any wave containing an unprobed BLOSC_ZSTD chunk with `DAMACY_INVAL`, so the fallback is structurally unreachable. The "perf" framing of the original issue was moot. This PR: - **Turns the implicit gate-vs-sizer contract into an explicit check.** `chunk_substreams_upper_bound` now returns `enum damacy_status`; on unprobed BLOSC it returns `DAMACY_INVAL` with a `log_error("gate-vs-sizer contract violated")` at the caller. A future gate regression now fails loudly instead of silently undersizing the fanout SOA. - **Replaces the two compile-time constants** (`DAMACY_MAX_CHUNKS_PER_WAVE`, `DAMACY_BLOSC_MAX_BLOCKS_PER_CHUNK`) with `damacy_tuning.max_chunks_per_wave` and `damacy_tuning.max_substreams_per_chunk`. The parser, planner, coalesce, wave_pool, fanout, wave_budget, and meta_cache all thread the effective values through their existing param chains. New `DAMACY_DEFAULT_*` siblings preserve current behavior; `0` in either field resolves to the default. `WAVE_ZSUBS_STRUCTURAL_MAX` becomes a runtime field `wave_pool.max_substreams_per_wave` derived once at init. - **Drops the dead substream rename target.** `zsubs` was a contraction that read as zstd-specific; renames to `substreams` everywhere (the noun that matches both BLOSC1 spec language and the nvcomp batched-decode input it actually counts). - **Strips machinery wired only to the unreachable branch:** the `_Atomic(uint16_t) observed_max_nblocks_per_chunk` slot, its `atomic_u16_observe_max` CAS-loop helper (`src/util/atomic_max.h`), the meta-cache observer setter, the bump sites in `zarr_meta_cache_layout_set` / `_probe_layout`, and the wiring in `damacy_create`. `zarr/zarr_meta_cache.h` returns to `extern "C"` shape (matches main) — the C-only `static_assert` is no longer needed. ## API Two new optional fields on `damacy_tuning` (Python `Config`): - `max_chunks_per_wave: int = 0` — `0` → 512 (current behavior). Clamped to `0xFFFFu` (the 16-bit chunk_idx packing in `d_block_chunk_map`). - `max_substreams_per_chunk: int = 0` — `0` → 32 (current behavior). Parser rejects blosc1 layouts above this with `DAMACY_DECODE`. ## Key file `src/wave/wave_pool.c:355` — `chunk_substreams_upper_bound` (the contract check) and `prepare_decode_caps` (caller). Closes #101.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #120 +/- ##
==========================================
+ Coverage 55.67% 58.21% +2.54%
==========================================
Files 49 56 +7
Lines 6890 7948 +1058
Branches 1232 1396 +164
==========================================
+ Hits 3836 4627 +791
- Misses 2576 2746 +170
- Partials 478 575 +97
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
Review responsesThanks for the careful pass. Walking through each item with the fix or rationale. Blockers — fixed#1 Handle stability ( #2 Gate aliasing across batch reuse ( Significant — addressed#3 Worker polling vs blocking pop ( The deeper "burns the cache mutex on every tick" complaint is structural — #4 Zero-shard sample (already addressed in #5 Prefetcher-origin errors not setting gate.error (already in #6 shard_index_fetch peek pin Notes — deferred#7 advance_watermark walks active under the lock — perf tuning, will profile under the integration PR. |
Glossary
max_batch_id < watermark. Replaces refcount pinning._Atomic uint64_tper batch packing[error:1 | pending:63]. Lets a consumer check group readiness with one load.Summary
Host-side metadata reads (zarr.json, shard indices, blosc-layout probes) gate every batch plan; at 50–200 ms each on networked stores they starve the GPU. This PR adds an eager prefetch layer that hides that latency.
Approach (full rationale and integration diagram in
dev/metadata_prefetch.md):prefetch_cacheinstances, one per metadata phase. Each is a pinning LRU built onutil/lru.hwith the eviction predicate swapped from refcount tomax_batch_id < watermark. State per entry is{pending | ready | error}—errorfolds in upstream cancellation so consumers check one state.prefetcherthread blocking-pops samples from lookahead and advances each through array-meta → shard-index → chunk-layout. Stage N's keys derive from stage N−1's resolved value; no shared enumeration with the planner.prefetch_cache_try_get. A per-batchprefetch_gatelets the planner check "all handles ready, no errors" with a single atomic load.zarr/sample_shard_iteratorfactors the sample-AABB → unique-shard-coord enumeration out ofplanner.cso prefetcher and planner share one source of truth for "which shards does this sample touch."This PR lands the cache primitive, the three fetcher instances, the orchestrator, and the
sample_shard_iteratorextraction. Planner/scheduler wiring (DAMACY_PENDINGretry loop, watermark broadcast on plan success) is intentionally deferred — seedev/metadata_prefetch.md§Plan — so this PR can be reviewed independently against unit + cache-level integration tests.Reviewer's path through the diff: start with
dev/metadata_prefetch.md, thensrc/prefetch/prefetch_cache.h(the primitive surface), thensrc/prefetch/prefetcher.h(the orchestrator surface), then the three fetchers (array_meta.c,shard_index.c,chunk_layout.c) for what each phase'sfetch_fnactually does.Test plan
chunk_layoutfetcher does a 16-byte probe read; check that the read pattern is correct on a real shard (compressed blosc1 header may needbstartsoffset, not chunk start).prefetcher_drain) actually flushes all in-flight samples, not just the lookahead queue, beforeprefetcher_destroy.