Runtime tuning: chunks/wave + substreams/chunk #112
Codecov Report
❌ Patch coverage is
@@ Coverage Diff @@
## main #112 +/- ##
==========================================
+ Coverage 55.64% 56.30% +0.66%
==========================================
Files 49 50 +1
Lines 6903 6953 +50
Branches 1233 1238 +5
==========================================
+ Hits 3841 3915 +74
+ Misses 2585 2550 -35
- Partials 477 488 +11
GDS crash with this PR in the mix

I built a combined branch merging PRs #105, #110, #111, and #112 onto 614d080 (main), with

Multiple ranks (3, 5, 6, and others) crashed with

For reference, 614d080 (main without these PRs) works fine with GDS enabled: 100 steps, no crashes. I suspect the

Config: 8-GPU H100, 22 datasets,
@ritvikvasan, the illegal-memory-access pattern you hit is a latent bug in the GDS path itself, independent of this PR. Tracked as #116, fixed in #117.

Root cause:

This PR (#112) does not touch the GDS path. The combined-PR build likely perturbed scheduling enough to close the window where the race had been going unobserved on plain

#117 makes
## Approach

Close #101 by removing the dead static fallback and replacing the ad-hoc structural constants it derived from with explicit `damacy_tuning` knobs.

`chunk_substreams_upper_bound` (formerly `chunk_zsubs_upper_bound`) in `src/wave/wave_pool.c` sizes the per-wave fanout SOA and the shared nvcomp zstd decoder scratch. Its `!sp->layout_probed` fallback returned a hardcoded `DAMACY_BLOSC_MAX_BLOCKS_PER_CHUNK = 32`, the adversarial worst case. But `wave_chunks_eligible` (per-chunk gate, runs before `prepare_decode_caps` in `kick_h2d`) rejects any wave containing an unprobed BLOSC_ZSTD chunk with `DAMACY_INVAL`, so the fallback is structurally unreachable. The "perf" framing of the original issue was moot.

This PR:

- **Turns the implicit gate-vs-sizer contract into an explicit check.** `chunk_substreams_upper_bound` now returns `enum damacy_status`; on unprobed BLOSC it returns `DAMACY_INVAL` with a `log_error("gate-vs-sizer contract violated")` at the caller. A future gate regression now fails loudly instead of silently undersizing the fanout SOA.
- **Replaces the two compile-time constants** (`DAMACY_MAX_CHUNKS_PER_WAVE`, `DAMACY_BLOSC_MAX_BLOCKS_PER_CHUNK`) with `damacy_tuning.max_chunks_per_wave` and `damacy_tuning.max_substreams_per_chunk`. The parser, planner, coalesce, wave_pool, fanout, wave_budget, and meta_cache all thread the effective values through their existing param chains. New `DAMACY_DEFAULT_*` siblings preserve current behavior; `0` in either field resolves to the default. `WAVE_ZSUBS_STRUCTURAL_MAX` becomes a runtime field `wave_pool.max_substreams_per_wave` derived once at init.
- **Drops the dead substream rename target.** `zsubs` was a contraction that read as zstd-specific; renames to `substreams` everywhere (the noun that matches both BLOSC1 spec language and the nvcomp batched-decode input it actually counts).
- **Strips machinery wired only to the unreachable branch:** the `_Atomic(uint16_t) observed_max_nblocks_per_chunk` slot, its `atomic_u16_observe_max` CAS-loop helper (`src/util/atomic_max.h`), the meta-cache observer setter, the bump sites in `zarr_meta_cache_layout_set` / `_probe_layout`, and the wiring in `damacy_create`. `zarr/zarr_meta_cache.h` returns to `extern "C"` shape (matches main); the C-only `static_assert` is no longer needed.

## API

Two new optional fields on `damacy_tuning` (Python `Config`):

- `max_chunks_per_wave: int = 0`: `0` → 512 (current behavior). Clamped to `0xFFFFu` (the 16-bit chunk_idx packing in `d_block_chunk_map`).
- `max_substreams_per_chunk: int = 0`: `0` → 32 (current behavior). Parser rejects blosc1 layouts above this with `DAMACY_DECODE`.

## Key file

`src/wave/wave_pool.c:355`: `chunk_substreams_upper_bound` (the contract check) and `prepare_decode_caps` (caller).

Closes #101.

5672f7e
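The knob resolution and the gate-vs-sizer contract described above can be illustrated with a small standalone sketch. It is not the code in `src/wave/wave_pool.c`: every `sketch_*` name, the struct layouts, and the choice to sum per-chunk bounds into a per-wave total are assumptions made for illustration. Only the defaults (512 and 32), the `0xFFFFu` clamp, and the "return an error and log at the caller" shape come from the description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Values quoted in the PR description; the macro names are hypothetical. */
#define SKETCH_DEFAULT_MAX_CHUNKS_PER_WAVE 512u
#define SKETCH_DEFAULT_MAX_SUBSTREAMS_PER_CHUNK 32u
#define SKETCH_CHUNK_IDX_MAX 0xFFFFu /* 16-bit chunk index packing */

/* Hypothetical stand-in for enum damacy_status. */
enum sketch_status { SKETCH_OK = 0, SKETCH_INVAL = 1 };

/* Hypothetical per-chunk state: whether the layout was probed, and its block count. */
struct sketch_chunk {
  int layout_probed;
  uint32_t nblocks;
};

/* 0 resolves to the default; larger requests are clamped to the 16-bit limit. */
static uint32_t sketch_resolve_chunks_per_wave(uint32_t requested) {
  if (requested == 0) return SKETCH_DEFAULT_MAX_CHUNKS_PER_WAVE;
  return requested > SKETCH_CHUNK_IDX_MAX ? SKETCH_CHUNK_IDX_MAX : requested;
}

/* 0 resolves to the default; other values are used as given. */
static uint32_t sketch_resolve_substreams_per_chunk(uint32_t requested) {
  return requested == 0 ? SKETCH_DEFAULT_MAX_SUBSTREAMS_PER_CHUNK : requested;
}

/* Sizer: instead of falling back to a worst-case constant for an unprobed
 * chunk, report the contract violation through the return value. */
static enum sketch_status sketch_chunk_upper_bound(const struct sketch_chunk *c,
                                                   uint32_t *out) {
  assert(c != NULL && out != NULL);
  if (!c->layout_probed) return SKETCH_INVAL;
  *out = c->nblocks;
  return SKETCH_OK;
}

/* Caller: logs and propagates the failure so a gate regression is visible.
 * Summing the bounds is an assumption of this sketch. */
static enum sketch_status sketch_prepare_caps(const struct sketch_chunk *chunks,
                                              size_t n, uint32_t *total_out) {
  uint32_t total = 0;
  assert(chunks != NULL && total_out != NULL);
  for (size_t i = 0; i < n; ++i) {
    uint32_t bound = 0;
    if (sketch_chunk_upper_bound(&chunks[i], &bound) != SKETCH_OK) {
      fprintf(stderr, "gate-vs-sizer contract violated (chunk %zu)\n", i);
      return SKETCH_INVAL;
    }
    total += bound;
  }
  *total_out = total;
  return SKETCH_OK;
}
```

Returning a status from the sizer means the caller cannot obtain a capacity for an unprobed chunk without also seeing the error, which is the property the explicit check is meant to provide.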