Metadata prefetch layer by nclack · Pull Request #120 · nclack/damacy

nclack · 2026-05-22T18:49:33Z

Glossary

prefetch_cache — generic pinning LRU primitive, one instance per metadata phase (array meta, shard index, chunk layout).
prefetcher — orchestrator thread; consumes lookahead samples and walks each through the three caches in dependency order.
batch_id — monotonic ordinal stamped by lookahead on push; carried through every cache request.
watermark — global ordinal; an entry is evictable iff its max_batch_id < watermark. Replaces refcount pinning.
prefetch_gate — one _Atomic uint64_t per batch packing [error:1 | pending:63]. Lets a consumer check group readiness with one load.

Summary

Host-side metadata reads (zarr.json, shard indices, blosc-layout probes) gate every batch plan; at 50–200 ms each on networked stores they starve the GPU. This PR adds an eager prefetch layer that hides that latency.

Approach (full rationale and integration diagram in dev/metadata_prefetch.md):

Three prefetch_cache instances, one per metadata phase. Each is a pinning LRU built on util/lru.h with the eviction predicate swapped from refcount to max_batch_id < watermark. State per entry is {pending | ready | error} — error folds in upstream cancellation so consumers check one state.
A single prefetcher thread blocking-pops samples from lookahead and advances each through array-meta → shard-index → chunk-layout. Stage N's keys derive from stage N−1's resolved value; no shared enumeration with the planner.
The planner never enumerates either; it receives resolved handles from the prefetcher and dereferences via prefetch_cache_try_get. A per-batch prefetch_gate lets the planner check "all handles ready, no errors" with a single atomic load.
New zarr/sample_shard_iterator factors the sample-AABB → unique-shard-coord enumeration out of planner.c so prefetcher and planner share one source of truth for "which shards does this sample touch."

This PR lands the cache primitive, the three fetcher instances, the orchestrator, and the sample_shard_iterator extraction. Planner/scheduler wiring (DAMACY_PENDING retry loop, watermark broadcast on plan success) is intentionally deferred — see dev/metadata_prefetch.md §Plan — so this PR can be reviewed independently against unit + cache-level integration tests.

Reviewer's path through the diff: start with dev/metadata_prefetch.md, then src/prefetch/prefetch_cache.h (the primitive surface), then src/prefetch/prefetcher.h (the orchestrator surface), then the three fetchers (array_meta.c, shard_index.c, chunk_layout.c) for what each phase's fetch_fn actually does.

Test plan

The chunk_layout fetcher does a 16-byte probe read; check that the read pattern is correct on a real shard (compressed blosc1 header may need bstarts offset, not chunk start).
Confirm the prefetcher's drain path (prefetcher_drain) actually flushes all in-flight samples, not just the lookahead queue, before prefetcher_destroy.

nclack · 2026-05-22T18:51:33Z

Review notes

Read through the cache primitive, the prefetcher, and the three fetchers against dev/metadata_prefetch.md. The shape matches the design — these are the issues worth surfacing before more code lands on top.

Blockers

1. Handle stability is broken by eviction — src/prefetch/prefetch_cache.c:194-203,304-307,470-482

The handle stores active_idx + 1. active_remove swap-and-pops c->active, moving an unrelated slot into the freed position and rewriting its active_idx. Any outstanding handle for that swapped slot is now stale: resolve_handle indexes into c->active[idx], finds the unrelated slot, and the generation check fails. So a ready handle silently degrades to "NULL" from try_get/query whenever any other entry is evicted. This isn't a sentinel collision — it's a misroute.

The straightforward fix is to key the handle on the slot itself, not its index in a compact array — e.g. allocate from a fixed slot table (capacity-sized) and let slot in the handle be that stable table index. The active array becomes a free-list, not the ID space.

This bug isn't currently caught because no unit test triggers eviction with a still-live handle to a non-evicted sibling slot. Worth adding before further consumers depend on handle stability.

2. Gate aliasing across batch reuse — src/prefetch/prefetcher.c:419-432, header prefetcher.h:74

prefetcher_release_batch flips in_use = 0, freeing the entry for batch_get_or_create_locked to reuse — but any in-flight prefetcher_slot for that batch still holds gate = &p->batches[i].gate. If the slot is reused before the worker drains, prefetch_gate_init zeroes the state and the still-pending fetch's gate_dec_pending lands on the new batch's gate. The header puts the burden on the caller ("ensure no slots still reference"), but the prefetcher exposes no way to verify that, and the design doc's intent (scheduler broadcasts watermark advance on plan success) doesn't naturally imply "no in-flight slots." Either gate it on a per-batch refcount, or have the prefetcher own the lifetime and refuse release while slots are active.

Significant

3. Worker polls instead of using the blocking pop the design added — src/prefetch/prefetcher.c:238-252

lookahead_pop_blocking exists per design step 5, but worker_fn does try_pop then platform_sleep_ns(1_000_000). With the design's target lead times (10k–100k samples), this gives a 1ms admission ceiling and burns the cache mutex on every tick (advance_all walks p->capacity). The state advance loop is the real polling work; the lookahead drain should block on the condvar instead.

4. advance_from_meta fails on zero-shard samples — src/prefetch/prefetcher.c:117-120

CHECK(Bad, n > 0) errors the slot if a sample's AABB intersects no shards. That's a valid configuration (edge AABB, all-padding sample) and the planner today handles it without erroring — the prefetcher shouldn't be stricter than the planner. Treat as "skip stages 2+3, mark READY with zero shards" or document the contract that callers must not submit such samples.

5. Per-prefetcher errors don't set the batch gate's error bit — src/prefetch/prefetcher.c:93-98

fail_slot advances local state but never touches s->gate. For errors that originate in a cache fetch (PREFETCH_STATE_ERROR observed via query), the cache already set the gate bit before the prefetcher saw it — fine. For errors that originate in the prefetcher itself (alloc fail at line 124, n == 0 once #4 is real, invalid handle from a saturated cache at line 132/156/218), the gate bit is never set. Design doc §Readiness gate promises "Error bit set → batch fails fast"; right now a fast-batch-fail consumer can miss these.

6. shard_index_fetch depends on array-meta entry staying pinned, but only peeks — src/prefetch/shard_index.c:106-112

prefetch_cache_peek doesn't widen the ordinal range and doesn't bump LRU recency. If a scheduler advances the watermark between the prefetcher's advance_from_meta and the shard-index worker running, the array-meta slot can be released and (under capacity pressure) evicted — fetch returns DAMACY_INVAL. The current setup avoids this because the slot is still pinned by the current batch's range, but the contract is implicit. Either document "stage-N+1 fetchers must run before watermark advances past stage-N requests" or have the fetcher widen the range explicitly.

Notes

7. prefetch_cache_advance_watermark walks n_active under the cache lock — src/prefetch/prefetch_cache.c:562. With design lead times of 10k–100k entries per cache, this is a real stall on every scheduler advance. Likely fine for the first integration but worth measuring before tuning lead time up.

8. Slot/batch tables are linear scans — prefetcher.c:62-83,201-208. O(capacity) per worker tick plus O(batch_capacity) per admit. Same scale concern as #7; reach for an intrusive free list or a small hash table when this shows up in profiles.

9. Test coverage gaps for the contracts that bite later — eviction with live sibling handle (#1), gate reuse race (#2), zero-shard sample (#4), saturated cache returning PREFETCH_HANDLE_NONE mid-walk in the prefetcher. These are the non-obvious behaviors; the current suite covers the happy paths well.

10. struct damacy_lookahead is fully exposed in the public header — src/lookahead/lookahead.h:17-27. Inconsistent with the opaque-handle pattern used by everything else here (prefetch_cache, prefetcher). Worth opaque-ing before downstream code starts touching la->size directly.

Bottom line

#1 and #2 are correctness issues that should land in this PR or as immediate follow-ups before the planner/scheduler integration goes in on top. The rest can sit until the integration PR exercises them.

closes #116 ## Approach `store_fs_gds.c`'s `gds_event_query` previously returned 1 unconditionally for any non-sentinel `seq` — completely ignoring whether the `cuFileReadAsync` submitted earlier had actually retired on `stream_h2d`. Callers (the wave-pool scheduler) treat a 1 return as "destination bytes are safe to consume" and transition the slot to `SLOT_READY`. On the GDS path this can hand a wave a `dev_buf` that the device has not yet written to, producing illegal memory accesses downstream in decode kernels gated on cross-stream events that fire ahead of the read draining. This PR makes the query honor what the caller would reasonably expect: returns 1 only after the cuFile read has actually completed on the stream. The mechanism reuses infrastructure already in `gds_submit_dev`: `cuLaunchHostFunc(stream, fs_gds_free_params_cb, ctx)` is enqueued after every `cuFileReadAsync`, so the callback runs in stream order *after* the reads drain. A new small `fs_gds_done { flag, claimed, rc }` struct is allocated per submit; the callback sets `flag=1` and drops one ref, `gds_event_query` checks the flag (acquire) and CAS-claims the owner-side ref the first time it observes the flag set. Repeated queries are safe; an unqueried event is reclaimed by the callback alone (no leak). `store_event` gains an opaque `void* impl` — backend-private, NULL for non-GDS stores. `gds_event_wait`, previously a no-op, now actually `cuStreamSynchronize`s and reclaims. ## Key files - `src/store/store_fs_gds.c:222-260` — the `fs_gds_done` refcount protocol. - `src/store/store_fs_gds.c:367-407` — the new `gds_event_query` / `gds_event_wait`. - `tests/test_store_fs_gds.c::test_event_query_reflects_completion` — the contract test. Uses `cuLaunchHostFunc` to park `stream_h2d` behind a host-side barrier, submits a read so cuFile is queued but not retired, asserts the query reports not-ready, then unblocks and asserts it reports ready. Deterministic, not race-dependent. Runs under cuFile compat mode (`CUFILE_FORCE_COMPAT_MODE=true` set in `main` before any cuFile init) so no nvidia-fs is required. ## Test plan - [x] New test fails before the fix (verified during development — query returns 1 while the read is provably queued behind the barrier). - [x] New test passes after the fix. - [x] Existing `test_submit_fail_releases_pins` still passes when it doesn't trip the separate bug below. - [x] Full test suite (25 tests, GDS build) passes. ## Related, not addressed here `test_submit_fail_releases_pins` SEGVs at ~4% on this hardware (filed as #118). Reproduces on this branch's parent commit, so it is not introduced by this PR, but it is a real damacy bug to chase, not upstream noise.

Fixes #118. ## Approach In compat mode, libcufile lazily allocates per-stream state on the first `cuFileReadAsync`, and that lazy init races against itself. The passing test in the same file happens to enqueue a `cuLaunchHostFunc` barrier before the first read, which serializes the stream enough to mask the race. The failing test goes straight into `cuFileReadAsync` on an empty stream and SEGVs ~4% of the time deep inside libcufile. cuFile already exposes the hook for this: `cuFileStreamRegister` "allocates resources needed to support cuFile operations asynchronously for the cuda stream" — i.e. exactly the lazy state that was racing. Calling it eagerly when damacy adopts a stream removes the race window. The matching `cuFileStreamDeregister` is required before `cuStreamDestroy` per the cuFile contract. ## Change In `src/store/store_fs_gds.c`: - dlsym-bind `cuFileStreamRegister` / `cuFileStreamDeregister` as optional symbols (graceful no-op on older libcufile that doesn't ship them). - `store_fs_gds_set_stream`: deregister any previously-set stream, then register the new one. First `cuFileReadAsync` now finds per-stream state already allocated. - `gds_destroy`: deregister after the existing `cuStreamSynchronize`, before the caller's `cuStreamDestroy`. ## Verification - `cmake --build build` clean. - 100× loop of `CUFILE_FORCE_COMPAT_MODE=true ./build/tests/test_store_fs_gds`: 0 failures (was ~4/100). - Full `ctest`: 26/26 pass.

## Approach Close #101 by removing the dead static fallback and replacing the ad-hoc structural constants it derived from with explicit `damacy_tuning` knobs. `chunk_substreams_upper_bound` (formerly `chunk_zsubs_upper_bound`) in `src/wave/wave_pool.c` sizes the per-wave fanout SOA and the shared nvcomp zstd decoder scratch. Its `!sp->layout_probed` fallback returned a hardcoded `DAMACY_BLOSC_MAX_BLOCKS_PER_CHUNK = 32` — the adversarial worst case. But `wave_chunks_eligible` (per-chunk gate, runs before `prepare_decode_caps` in `kick_h2d`) rejects any wave containing an unprobed BLOSC_ZSTD chunk with `DAMACY_INVAL`, so the fallback is structurally unreachable. The "perf" framing of the original issue was moot. This PR: - **Turns the implicit gate-vs-sizer contract into an explicit check.** `chunk_substreams_upper_bound` now returns `enum damacy_status`; on unprobed BLOSC it returns `DAMACY_INVAL` with a `log_error("gate-vs-sizer contract violated")` at the caller. A future gate regression now fails loudly instead of silently undersizing the fanout SOA. - **Replaces the two compile-time constants** (`DAMACY_MAX_CHUNKS_PER_WAVE`, `DAMACY_BLOSC_MAX_BLOCKS_PER_CHUNK`) with `damacy_tuning.max_chunks_per_wave` and `damacy_tuning.max_substreams_per_chunk`. The parser, planner, coalesce, wave_pool, fanout, wave_budget, and meta_cache all thread the effective values through their existing param chains. New `DAMACY_DEFAULT_*` siblings preserve current behavior; `0` in either field resolves to the default. `WAVE_ZSUBS_STRUCTURAL_MAX` becomes a runtime field `wave_pool.max_substreams_per_wave` derived once at init. - **Drops the dead substream rename target.** `zsubs` was a contraction that read as zstd-specific; renames to `substreams` everywhere (the noun that matches both BLOSC1 spec language and the nvcomp batched-decode input it actually counts). - **Strips machinery wired only to the unreachable branch:** the `_Atomic(uint16_t) observed_max_nblocks_per_chunk` slot, its `atomic_u16_observe_max` CAS-loop helper (`src/util/atomic_max.h`), the meta-cache observer setter, the bump sites in `zarr_meta_cache_layout_set` / `_probe_layout`, and the wiring in `damacy_create`. `zarr/zarr_meta_cache.h` returns to `extern "C"` shape (matches main) — the C-only `static_assert` is no longer needed. ## API Two new optional fields on `damacy_tuning` (Python `Config`): - `max_chunks_per_wave: int = 0` — `0` → 512 (current behavior). Clamped to `0xFFFFu` (the 16-bit chunk_idx packing in `d_block_chunk_map`). - `max_substreams_per_chunk: int = 0` — `0` → 32 (current behavior). Parser rejects blosc1 layouts above this with `DAMACY_DECODE`. ## Key file `src/wave/wave_pool.c:355` — `chunk_substreams_upper_bound` (the contract check) and `prepare_decode_caps` (caller). Closes #101.

codecov · 2026-05-22T21:59:09Z

Codecov Report

❌ Patch coverage is 71.88525% with 343 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.21%. Comparing base (67de519) to head (23d95f0).
⚠️ Report is 3 commits behind head on main.

Files with missing lines	Patch %	Lines
src/prefetch/prefetch_cache.c	71.05%	62 Missing and 26 partials ⚠️
src/prefetch/prefetcher.c	77.80%	50 Missing and 29 partials ⚠️
src/prefetch/shard_index.c	47.86%	43 Missing and 18 partials ⚠️
src/lookahead/lookahead.c	72.34%	22 Missing and 4 partials ⚠️
src/prefetch/chunk_layout.c	65.07%	15 Missing and 7 partials ⚠️
src/planner/planner.c	73.77%	10 Missing and 6 partials ⚠️
src/prefetch/array_meta.c	79.48%	6 Missing and 2 partials ⚠️
src/wave/wave_pool.c	77.77%	4 Missing and 4 partials ⚠️
python/damacy/_api.c	0.00%	7 Missing ⚠️
src/store/store.c	0.00%	7 Missing ⚠️
... and 8 more

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #120      +/-   ##
==========================================
+ Coverage   55.67%   58.21%   +2.54%     
==========================================
  Files          49       56       +7     
  Lines        6890     7948    +1058     
  Branches     1232     1396     +164     
==========================================
+ Hits         3836     4627     +791     
- Misses       2576     2746     +170     
- Partials      478      575      +97

Flag	Coverage Δ
unittests	`58.21% <71.88%> (+2.54%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
src/damacy.c	`67.27% <100.00%> (+0.10%)`	⬆️
src/damacy_limits.h	`100.00% <100.00%> (ø)`
src/planner/coalesce.c	`89.65% <ø> (ø)`
src/wave/host_slab.c	`56.71% <100.00%> (ø)`
src/zarr/zarr_chunk_layout.c	`75.75% <100.00%> (+6.06%)`	⬆️
src/zarr/zarr_meta_cache.c	`71.72% <ø> (ø)`
src/store/store_fs.c	`75.25% <0.00%> (+1.16%)`	⬆️
src/wave/wave.c	`86.56% <80.00%> (+0.20%)`	⬆️
src/platform/platform.posix.c	`58.47% <0.00%> (-1.01%)`	⬇️
src/wave/fanout.c	`80.00% <0.00%> (+32.00%)`	⬆️
... and 14 more

... and 6 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

nclack · 2026-05-22T21:59:44Z

Review responses

Thanks for the careful pass. Walking through each item with the fix or rationale.

Blockers — fixed

#1 Handle stability (edbc00d prefetch_cache: stable slot indices)
Slot table is now capacity-sized and indexed by a stable slot id. active becomes a free-list rather than the id space, so active_remove's swap-and-pop no longer invalidates handles to sibling slots. New test test_handle_stable_across_eviction covers the exact case the reviewer described.

#2 Gate aliasing across batch reuse (da117cb prefetcher: batch refcount + edges, 064ad82 prefetcher: hold batch until gate drains)
Batch entries now carry a refcount + a release_pending flag. release_batch defers the in_use=0 flip until refcount and gate.pending both hit zero, so the gate cannot be re-initialized while in-flight slots still reference it. New tests cover release-before-pop, distinct-gates-per-batch, and admit-fail-rollback.

Significant — addressed

#3 Worker polling vs blocking pop (9518104 lookahead: timed pop, worker drops sleep)
Added lookahead_pop_blocking_timeout. Worker now uses pure pop_blocking when there's a free slot and no in-flight work; switches to pop_blocking_timeout(1ms) when in-flight work needs periodic state advance. Kills the 1ms admission ceiling — a sample arriving 100us into the wait now wakes the worker via condvar instead of waiting out the sleep.

The deeper "burns the cache mutex on every tick" complaint is structural — advance_all walks p->capacity because there's no cache→prefetcher notification path. That refactor (cache callbacks) is left for the integration PR rather than dropped here.

#4 Zero-shard sample (already addressed in da117cb)
advance_from_meta now branches on n == 0: skips stages 2+3 by going straight to PREFETCHER_PENDING_CHUNK_LAYOUT with n_shards = 0. The chunk-layout request still goes through since it's per-uri, not per-shard; a downstream consumer that doesn't need layout for empty samples can ignore the handle.

#5 Prefetcher-origin errors not setting gate.error (already in c7bba1b)
fail_slot now calls prefetch_gate_set_error(s->gate) before transitioning to PREFETCHER_ERROR. Catches the alloc-fail / saturation paths the reviewer flagged. test_batch_gate_error_on_failed_sample exercises this.

#6 shard_index_fetch peek pin
Documented the contract inline. The structural fix (widen ordinal range explicitly) would require the cache to expose a pin/unpin API; left for the integration PR. The current invariant (stage-N+1 fetchers run before watermark advances past stage-N requests) is naturally upheld by the prefetcher's submission ordering.

Notes — deferred

#7 advance_watermark walks active under the lock — perf tuning, will profile under the integration PR.
#8 Linear scans for slot/batch tables — same; will reach for free-list / hash table when profiling shows it.
#9 Test gaps — added the eviction-with-live-sibling-handle test (#1). Saturated-cache test is harder to write deterministically; deferred.
#10 Opaque damacy_lookahead — agree it's inconsistent. Will opaque-ify in a follow-up before the planner/scheduler integration lands.

nclack added 9 commits May 22, 2026 11:32

add prefetch_cache primitive

249cf0f

add zarr/sample_shard_iterator

2e366a0

planner: use shard iterator, beg/end names

2327a07

lookahead: batch_id + blocking pop

d095fba

add prefetch/array_meta fetcher

b525197

add prefetch/shard_index fetcher

4cd27b7

add prefetch/chunk_layout fetcher

24e855f

add prefetch/prefetcher orchestrator

8abdc2a

add dev/metadata_prefetch.md

fcf5cb4

nclack added 13 commits May 22, 2026 13:41

prefetch_cache: stable slot indices

edbc00d

prefetcher: batch refcount + edges

da117cb

prefetcher: rollback new batch on admit fail

3af0eb6

prefetch_cache: doc executor post contract

ac70985

prefetcher: hold batch until gate drains

064ad82

prefetcher: defer n_shards until requests done

3f646ae

prefetcher: doc destroy precondition

7bba37d

prefetcher: lookup helper, fused slot scan

c7bba1b

lookahead: timed pop, worker drops sleep

9518104

chunk_layout: thread max_substreams param

0b773ab

nclack added 3 commits May 22, 2026 15:43

prefetcher: batch-aware pop_ready

dea0137

prefetcher: err_code on ready and slot

7faaa83

prefetcher: batch readiness queries

23d95f0

nclack mentioned this pull request May 23, 2026

damacy rewrite PR-1: file split + prefetcher producer #121

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Metadata prefetch layer#120

Metadata prefetch layer#120
nclack wants to merge 25 commits into
mainfrom
worktree-prefetch

nclack commented May 22, 2026

Uh oh!

nclack commented May 22, 2026

Uh oh!

codecov Bot commented May 22, 2026 •

edited

Loading

Uh oh!

nclack commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

nclack commented May 22, 2026

Glossary

Summary

Test plan

Uh oh!

nclack commented May 22, 2026

Review notes

Blockers

Significant

Notes

Bottom line

Uh oh!

codecov Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

nclack commented May 22, 2026

Review responses

Blockers — fixed

Significant — addressed

Notes — deferred

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

codecov Bot commented May 22, 2026 •

edited

Loading