Skip to content

Metadata prefetch layer#120

Open
nclack wants to merge 25 commits into
mainfrom
worktree-prefetch
Open

Metadata prefetch layer#120
nclack wants to merge 25 commits into
mainfrom
worktree-prefetch

Conversation

@nclack
Copy link
Copy Markdown
Owner

@nclack nclack commented May 22, 2026

Glossary

  • prefetch_cache — generic pinning LRU primitive, one instance per metadata phase (array meta, shard index, chunk layout).
  • prefetcher — orchestrator thread; consumes lookahead samples and walks each through the three caches in dependency order.
  • batch_id — monotonic ordinal stamped by lookahead on push; carried through every cache request.
  • watermark — global ordinal; an entry is evictable iff its max_batch_id < watermark. Replaces refcount pinning.
  • prefetch_gate — one _Atomic uint64_t per batch packing [error:1 | pending:63]. Lets a consumer check group readiness with one load.

Summary

Host-side metadata reads (zarr.json, shard indices, blosc-layout probes) gate every batch plan; at 50–200 ms each on networked stores they starve the GPU. This PR adds an eager prefetch layer that hides that latency.

Approach (full rationale and integration diagram in dev/metadata_prefetch.md):

  • Three prefetch_cache instances, one per metadata phase. Each is a pinning LRU built on util/lru.h with the eviction predicate swapped from refcount to max_batch_id < watermark. State per entry is {pending | ready | error}error folds in upstream cancellation so consumers check one state.
  • A single prefetcher thread blocking-pops samples from lookahead and advances each through array-meta → shard-index → chunk-layout. Stage N's keys derive from stage N−1's resolved value; no shared enumeration with the planner.
  • The planner never enumerates either; it receives resolved handles from the prefetcher and dereferences via prefetch_cache_try_get. A per-batch prefetch_gate lets the planner check "all handles ready, no errors" with a single atomic load.
  • New zarr/sample_shard_iterator factors the sample-AABB → unique-shard-coord enumeration out of planner.c so prefetcher and planner share one source of truth for "which shards does this sample touch."

This PR lands the cache primitive, the three fetcher instances, the orchestrator, and the sample_shard_iterator extraction. Planner/scheduler wiring (DAMACY_PENDING retry loop, watermark broadcast on plan success) is intentionally deferred — see dev/metadata_prefetch.md §Plan — so this PR can be reviewed independently against unit + cache-level integration tests.

Reviewer's path through the diff: start with dev/metadata_prefetch.md, then src/prefetch/prefetch_cache.h (the primitive surface), then src/prefetch/prefetcher.h (the orchestrator surface), then the three fetchers (array_meta.c, shard_index.c, chunk_layout.c) for what each phase's fetch_fn actually does.

Test plan

  • The chunk_layout fetcher does a 16-byte probe read; check that the read pattern is correct on a real shard (compressed blosc1 header may need bstarts offset, not chunk start).
  • Confirm the prefetcher's drain path (prefetcher_drain) actually flushes all in-flight samples, not just the lookahead queue, before prefetcher_destroy.

@nclack
Copy link
Copy Markdown
Owner Author

nclack commented May 22, 2026

Review notes

Read through the cache primitive, the prefetcher, and the three fetchers against dev/metadata_prefetch.md. The shape matches the design — these are the issues worth surfacing before more code lands on top.

Blockers

1. Handle stability is broken by evictionsrc/prefetch/prefetch_cache.c:194-203,304-307,470-482

The handle stores active_idx + 1. active_remove swap-and-pops c->active, moving an unrelated slot into the freed position and rewriting its active_idx. Any outstanding handle for that swapped slot is now stale: resolve_handle indexes into c->active[idx], finds the unrelated slot, and the generation check fails. So a ready handle silently degrades to "NULL" from try_get/query whenever any other entry is evicted. This isn't a sentinel collision — it's a misroute.

The straightforward fix is to key the handle on the slot itself, not its index in a compact array — e.g. allocate from a fixed slot table (capacity-sized) and let slot in the handle be that stable table index. The active array becomes a free-list, not the ID space.

This bug isn't currently caught because no unit test triggers eviction with a still-live handle to a non-evicted sibling slot. Worth adding before further consumers depend on handle stability.

2. Gate aliasing across batch reusesrc/prefetch/prefetcher.c:419-432, header prefetcher.h:74

prefetcher_release_batch flips in_use = 0, freeing the entry for batch_get_or_create_locked to reuse — but any in-flight prefetcher_slot for that batch still holds gate = &p->batches[i].gate. If the slot is reused before the worker drains, prefetch_gate_init zeroes the state and the still-pending fetch's gate_dec_pending lands on the new batch's gate. The header puts the burden on the caller ("ensure no slots still reference"), but the prefetcher exposes no way to verify that, and the design doc's intent (scheduler broadcasts watermark advance on plan success) doesn't naturally imply "no in-flight slots." Either gate it on a per-batch refcount, or have the prefetcher own the lifetime and refuse release while slots are active.

Significant

3. Worker polls instead of using the blocking pop the design addedsrc/prefetch/prefetcher.c:238-252

lookahead_pop_blocking exists per design step 5, but worker_fn does try_pop then platform_sleep_ns(1_000_000). With the design's target lead times (10k–100k samples), this gives a 1ms admission ceiling and burns the cache mutex on every tick (advance_all walks p->capacity). The state advance loop is the real polling work; the lookahead drain should block on the condvar instead.

4. advance_from_meta fails on zero-shard samplessrc/prefetch/prefetcher.c:117-120

CHECK(Bad, n > 0) errors the slot if a sample's AABB intersects no shards. That's a valid configuration (edge AABB, all-padding sample) and the planner today handles it without erroring — the prefetcher shouldn't be stricter than the planner. Treat as "skip stages 2+3, mark READY with zero shards" or document the contract that callers must not submit such samples.

5. Per-prefetcher errors don't set the batch gate's error bitsrc/prefetch/prefetcher.c:93-98

fail_slot advances local state but never touches s->gate. For errors that originate in a cache fetch (PREFETCH_STATE_ERROR observed via query), the cache already set the gate bit before the prefetcher saw it — fine. For errors that originate in the prefetcher itself (alloc fail at line 124, n == 0 once #4 is real, invalid handle from a saturated cache at line 132/156/218), the gate bit is never set. Design doc §Readiness gate promises "Error bit set → batch fails fast"; right now a fast-batch-fail consumer can miss these.

6. shard_index_fetch depends on array-meta entry staying pinned, but only peekssrc/prefetch/shard_index.c:106-112

prefetch_cache_peek doesn't widen the ordinal range and doesn't bump LRU recency. If a scheduler advances the watermark between the prefetcher's advance_from_meta and the shard-index worker running, the array-meta slot can be released and (under capacity pressure) evicted — fetch returns DAMACY_INVAL. The current setup avoids this because the slot is still pinned by the current batch's range, but the contract is implicit. Either document "stage-N+1 fetchers must run before watermark advances past stage-N requests" or have the fetcher widen the range explicitly.

Notes

7. prefetch_cache_advance_watermark walks n_active under the cache locksrc/prefetch/prefetch_cache.c:562. With design lead times of 10k–100k entries per cache, this is a real stall on every scheduler advance. Likely fine for the first integration but worth measuring before tuning lead time up.

8. Slot/batch tables are linear scansprefetcher.c:62-83,201-208. O(capacity) per worker tick plus O(batch_capacity) per admit. Same scale concern as #7; reach for an intrusive free list or a small hash table when this shows up in profiles.

9. Test coverage gaps for the contracts that bite later — eviction with live sibling handle (#1), gate reuse race (#2), zero-shard sample (#4), saturated cache returning PREFETCH_HANDLE_NONE mid-walk in the prefetcher. These are the non-obvious behaviors; the current suite covers the happy paths well.

10. struct damacy_lookahead is fully exposed in the public headersrc/lookahead/lookahead.h:17-27. Inconsistent with the opaque-handle pattern used by everything else here (prefetch_cache, prefetcher). Worth opaque-ing before downstream code starts touching la->size directly.

Bottom line

#1 and #2 are correctness issues that should land in this PR or as immediate follow-ups before the planner/scheduler integration goes in on top. The rest can sit until the integration PR exercises them.

nclack added 13 commits May 22, 2026 13:41
closes #116 

## Approach

`store_fs_gds.c`'s `gds_event_query` previously returned 1
unconditionally
for any non-sentinel `seq` — completely ignoring whether the
`cuFileReadAsync` submitted earlier had actually retired on
`stream_h2d`. Callers (the wave-pool scheduler) treat a 1 return as
"destination bytes are safe to consume" and transition the slot to
`SLOT_READY`. On the GDS path this can hand a wave a `dev_buf` that
the device has not yet written to, producing illegal memory accesses
downstream in decode kernels gated on cross-stream events that fire
ahead of the read draining.

This PR makes the query honor what the caller would reasonably expect:
returns 1 only after the cuFile read has actually completed on the
stream.

The mechanism reuses infrastructure already in `gds_submit_dev`:
`cuLaunchHostFunc(stream, fs_gds_free_params_cb, ctx)` is enqueued
after every `cuFileReadAsync`, so the callback runs in stream order
*after* the reads drain. A new small `fs_gds_done { flag, claimed,
rc }` struct is allocated per submit; the callback sets `flag=1` and
drops one ref, `gds_event_query` checks the flag (acquire) and
CAS-claims the owner-side ref the first time it observes the flag
set. Repeated queries are safe; an unqueried event is reclaimed by
the callback alone (no leak).

`store_event` gains an opaque `void* impl` — backend-private,
NULL for non-GDS stores.

`gds_event_wait`, previously a no-op, now actually
`cuStreamSynchronize`s and reclaims.

## Key files

- `src/store/store_fs_gds.c:222-260` — the `fs_gds_done` refcount
  protocol.
- `src/store/store_fs_gds.c:367-407` — the new `gds_event_query` /
  `gds_event_wait`.
- `tests/test_store_fs_gds.c::test_event_query_reflects_completion` —
  the contract test. Uses `cuLaunchHostFunc` to park `stream_h2d`
  behind a host-side barrier, submits a read so cuFile is queued but
  not retired, asserts the query reports not-ready, then unblocks and
  asserts it reports ready. Deterministic, not race-dependent. Runs
  under cuFile compat mode (`CUFILE_FORCE_COMPAT_MODE=true` set in
  `main` before any cuFile init) so no nvidia-fs is required.

## Test plan

- [x] New test fails before the fix (verified during development —
      query returns 1 while the read is provably queued behind the
      barrier).
- [x] New test passes after the fix.
- [x] Existing `test_submit_fail_releases_pins` still passes when it
      doesn't trip the separate bug below.
- [x] Full test suite (25 tests, GDS build) passes.

## Related, not addressed here

`test_submit_fail_releases_pins` SEGVs at ~4% on this hardware
(filed as #118). Reproduces on this branch's parent commit, so it is
not introduced by this PR, but it is a real damacy bug to chase, not
upstream noise.
Fixes #118.

## Approach

In compat mode, libcufile lazily allocates per-stream state on the first
`cuFileReadAsync`, and that lazy init races against itself. The passing
test in the same file happens to enqueue a `cuLaunchHostFunc` barrier
before the first read, which serializes the stream enough to mask the
race. The failing test goes straight into `cuFileReadAsync` on an empty
stream and SEGVs ~4% of the time deep inside libcufile.

cuFile already exposes the hook for this: `cuFileStreamRegister`
"allocates resources needed to support cuFile operations asynchronously
for the cuda stream" — i.e. exactly the lazy state that was racing.
Calling it eagerly when damacy adopts a stream removes the race window.
The matching `cuFileStreamDeregister` is required before
`cuStreamDestroy` per the cuFile contract.

## Change

In `src/store/store_fs_gds.c`:
- dlsym-bind `cuFileStreamRegister` / `cuFileStreamDeregister` as
optional symbols (graceful no-op on older libcufile that doesn't ship
them).
- `store_fs_gds_set_stream`: deregister any previously-set stream, then
register the new one. First `cuFileReadAsync` now finds per-stream state
already allocated.
- `gds_destroy`: deregister after the existing `cuStreamSynchronize`,
before the caller's `cuStreamDestroy`.

## Verification

- `cmake --build build` clean.
- 100× loop of `CUFILE_FORCE_COMPAT_MODE=true
./build/tests/test_store_fs_gds`: 0 failures (was ~4/100).
- Full `ctest`: 26/26 pass.
## Approach

Close #101 by removing the dead static fallback and replacing the
ad-hoc structural constants it derived from with explicit
`damacy_tuning` knobs.

`chunk_substreams_upper_bound` (formerly `chunk_zsubs_upper_bound`)
in `src/wave/wave_pool.c` sizes the per-wave fanout SOA and the
shared nvcomp zstd decoder scratch. Its `!sp->layout_probed` fallback
returned a hardcoded `DAMACY_BLOSC_MAX_BLOCKS_PER_CHUNK = 32` — the
adversarial worst case. But `wave_chunks_eligible` (per-chunk gate,
runs before `prepare_decode_caps` in `kick_h2d`) rejects any wave
containing an unprobed BLOSC_ZSTD chunk with `DAMACY_INVAL`, so the
fallback is structurally unreachable. The "perf" framing of the
original issue was moot.

This PR:

- **Turns the implicit gate-vs-sizer contract into an explicit
  check.** `chunk_substreams_upper_bound` now returns
  `enum damacy_status`; on unprobed BLOSC it returns `DAMACY_INVAL`
  with a `log_error("gate-vs-sizer contract violated")` at the
  caller. A future gate regression now fails loudly instead of
  silently undersizing the fanout SOA.
- **Replaces the two compile-time constants**
(`DAMACY_MAX_CHUNKS_PER_WAVE`,
`DAMACY_BLOSC_MAX_BLOCKS_PER_CHUNK`) with
`damacy_tuning.max_chunks_per_wave`
  and `damacy_tuning.max_substreams_per_chunk`. The parser, planner,
  coalesce, wave_pool, fanout, wave_budget, and meta_cache all thread
  the effective values through their existing param chains. New
  `DAMACY_DEFAULT_*` siblings preserve current behavior; `0` in either
  field resolves to the default. `WAVE_ZSUBS_STRUCTURAL_MAX` becomes
  a runtime field `wave_pool.max_substreams_per_wave` derived once at
  init.
- **Drops the dead substream rename target.** `zsubs` was a
  contraction that read as zstd-specific; renames to `substreams`
  everywhere (the noun that matches both BLOSC1 spec language and the
  nvcomp batched-decode input it actually counts).
- **Strips machinery wired only to the unreachable branch:** the
  `_Atomic(uint16_t) observed_max_nblocks_per_chunk` slot, its
  `atomic_u16_observe_max` CAS-loop helper (`src/util/atomic_max.h`),
  the meta-cache observer setter, the bump sites in
  `zarr_meta_cache_layout_set` / `_probe_layout`, and the wiring in
  `damacy_create`. `zarr/zarr_meta_cache.h` returns to `extern "C"`
  shape (matches main) — the C-only `static_assert` is no longer
  needed.

## API

Two new optional fields on `damacy_tuning` (Python `Config`):

- `max_chunks_per_wave: int = 0` — `0` → 512 (current behavior).
  Clamped to `0xFFFFu` (the 16-bit chunk_idx packing in
  `d_block_chunk_map`).
- `max_substreams_per_chunk: int = 0` — `0` → 32 (current behavior).
  Parser rejects blosc1 layouts above this with `DAMACY_DECODE`.

## Key file

`src/wave/wave_pool.c:355` — `chunk_substreams_upper_bound` (the
contract check) and `prepare_decode_caps` (caller).

Closes #101.
@codecov
Copy link
Copy Markdown

codecov Bot commented May 22, 2026

Codecov Report

❌ Patch coverage is 71.88525% with 343 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.21%. Comparing base (67de519) to head (23d95f0).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
src/prefetch/prefetch_cache.c 71.05% 62 Missing and 26 partials ⚠️
src/prefetch/prefetcher.c 77.80% 50 Missing and 29 partials ⚠️
src/prefetch/shard_index.c 47.86% 43 Missing and 18 partials ⚠️
src/lookahead/lookahead.c 72.34% 22 Missing and 4 partials ⚠️
src/prefetch/chunk_layout.c 65.07% 15 Missing and 7 partials ⚠️
src/planner/planner.c 73.77% 10 Missing and 6 partials ⚠️
src/prefetch/array_meta.c 79.48% 6 Missing and 2 partials ⚠️
src/wave/wave_pool.c 77.77% 4 Missing and 4 partials ⚠️
python/damacy/_api.c 0.00% 7 Missing ⚠️
src/store/store.c 0.00% 7 Missing ⚠️
... and 8 more
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #120      +/-   ##
==========================================
+ Coverage   55.67%   58.21%   +2.54%     
==========================================
  Files          49       56       +7     
  Lines        6890     7948    +1058     
  Branches     1232     1396     +164     
==========================================
+ Hits         3836     4627     +791     
- Misses       2576     2746     +170     
- Partials      478      575      +97     
Flag Coverage Δ
unittests 58.21% <71.88%> (+2.54%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
src/damacy.c 67.27% <100.00%> (+0.10%) ⬆️
src/damacy_limits.h 100.00% <100.00%> (ø)
src/planner/coalesce.c 89.65% <ø> (ø)
src/wave/host_slab.c 56.71% <100.00%> (ø)
src/zarr/zarr_chunk_layout.c 75.75% <100.00%> (+6.06%) ⬆️
src/zarr/zarr_meta_cache.c 71.72% <ø> (ø)
src/store/store_fs.c 75.25% <0.00%> (+1.16%) ⬆️
src/wave/wave.c 86.56% <80.00%> (+0.20%) ⬆️
src/platform/platform.posix.c 58.47% <0.00%> (-1.01%) ⬇️
src/wave/fanout.c 80.00% <0.00%> (+32.00%) ⬆️
... and 14 more

... and 6 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@nclack
Copy link
Copy Markdown
Owner Author

nclack commented May 22, 2026

Review responses

Thanks for the careful pass. Walking through each item with the fix or rationale.

Blockers — fixed

#1 Handle stability (edbc00d prefetch_cache: stable slot indices)
Slot table is now capacity-sized and indexed by a stable slot id. active becomes a free-list rather than the id space, so active_remove's swap-and-pop no longer invalidates handles to sibling slots. New test test_handle_stable_across_eviction covers the exact case the reviewer described.

#2 Gate aliasing across batch reuse (da117cb prefetcher: batch refcount + edges, 064ad82 prefetcher: hold batch until gate drains)
Batch entries now carry a refcount + a release_pending flag. release_batch defers the in_use=0 flip until refcount and gate.pending both hit zero, so the gate cannot be re-initialized while in-flight slots still reference it. New tests cover release-before-pop, distinct-gates-per-batch, and admit-fail-rollback.

Significant — addressed

#3 Worker polling vs blocking pop (9518104 lookahead: timed pop, worker drops sleep)
Added lookahead_pop_blocking_timeout. Worker now uses pure pop_blocking when there's a free slot and no in-flight work; switches to pop_blocking_timeout(1ms) when in-flight work needs periodic state advance. Kills the 1ms admission ceiling — a sample arriving 100us into the wait now wakes the worker via condvar instead of waiting out the sleep.

The deeper "burns the cache mutex on every tick" complaint is structural — advance_all walks p->capacity because there's no cache→prefetcher notification path. That refactor (cache callbacks) is left for the integration PR rather than dropped here.

#4 Zero-shard sample (already addressed in da117cb)
advance_from_meta now branches on n == 0: skips stages 2+3 by going straight to PREFETCHER_PENDING_CHUNK_LAYOUT with n_shards = 0. The chunk-layout request still goes through since it's per-uri, not per-shard; a downstream consumer that doesn't need layout for empty samples can ignore the handle.

#5 Prefetcher-origin errors not setting gate.error (already in c7bba1b)
fail_slot now calls prefetch_gate_set_error(s->gate) before transitioning to PREFETCHER_ERROR. Catches the alloc-fail / saturation paths the reviewer flagged. test_batch_gate_error_on_failed_sample exercises this.

#6 shard_index_fetch peek pin
Documented the contract inline. The structural fix (widen ordinal range explicitly) would require the cache to expose a pin/unpin API; left for the integration PR. The current invariant (stage-N+1 fetchers run before watermark advances past stage-N requests) is naturally upheld by the prefetcher's submission ordering.

Notes — deferred

#7 advance_watermark walks active under the lock — perf tuning, will profile under the integration PR.
#8 Linear scans for slot/batch tables — same; will reach for free-list / hash table when profiling shows it.
#9 Test gaps — added the eviction-with-live-sibling-handle test (#1). Saturated-cache test is harder to write deterministically; deferred.
#10 Opaque damacy_lookahead — agree it's inconsistent. Will opaque-ify in a follow-up before the planner/scheduler integration lands.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant