Runtime tuning: chunks/wave + substreams/chunk by nclack · Pull Request #112 · nclack/damacy

nclack · 2026-05-19T15:37:40Z

Approach

Close #101 by removing the dead static fallback and replacing the
ad-hoc structural constants it derived from with explicit
damacy_tuning knobs.

chunk_substreams_upper_bound (formerly chunk_zsubs_upper_bound)
in src/wave/wave_pool.c sizes the per-wave fanout SOA and the
shared nvcomp zstd decoder scratch. Its !sp->layout_probed fallback
returned a hardcoded DAMACY_BLOSC_MAX_BLOCKS_PER_CHUNK = 32 — the
adversarial worst case. But wave_chunks_eligible (per-chunk gate,
runs before prepare_decode_caps in kick_h2d) rejects any wave
containing an unprobed BLOSC_ZSTD chunk with DAMACY_INVAL, so the
fallback is structurally unreachable. The "perf" framing of the
original issue was moot.

This PR:

Turns the implicit gate-vs-sizer contract into an explicit
check. chunk_substreams_upper_bound now returns
enum damacy_status; on unprobed BLOSC it returns DAMACY_INVAL
with a log_error("gate-vs-sizer contract violated") at the
caller. A future gate regression now fails loudly instead of
silently undersizing the fanout SOA.
Replaces the two compile-time constants (DAMACY_MAX_CHUNKS_PER_WAVE,
DAMACY_BLOSC_MAX_BLOCKS_PER_CHUNK) with damacy_tuning.max_chunks_per_wave
and damacy_tuning.max_substreams_per_chunk. The parser, planner,
coalesce, wave_pool, fanout, wave_budget, and meta_cache all thread
the effective values through their existing param chains. New
DAMACY_DEFAULT_* siblings preserve current behavior; 0 in either
field resolves to the default. WAVE_ZSUBS_STRUCTURAL_MAX becomes
a runtime field wave_pool.max_substreams_per_wave derived once at
init.
Drops the dead substream rename target. zsubs was a
contraction that read as zstd-specific; renames to substreams
everywhere (the noun that matches both BLOSC1 spec language and the
nvcomp batched-decode input it actually counts).
Strips machinery wired only to the unreachable branch: the
_Atomic(uint16_t) observed_max_nblocks_per_chunk slot, its
atomic_u16_observe_max CAS-loop helper (src/util/atomic_max.h),
the meta-cache observer setter, the bump sites in
zarr_meta_cache_layout_set / _probe_layout, and the wiring in
damacy_create. zarr/zarr_meta_cache.h returns to extern "C"
shape (matches main) — the C-only static_assert is no longer
needed.

API

Two new optional fields on damacy_tuning (Python Config):

max_chunks_per_wave: int = 0 — 0 → 512 (current behavior).
Clamped to 0xFFFFu (the 16-bit chunk_idx packing in
d_block_chunk_map).
max_substreams_per_chunk: int = 0 — 0 → 32 (current behavior).
Parser rejects blosc1 layouts above this with DAMACY_DECODE.

Key file

src/wave/wave_pool.c:355 — chunk_substreams_upper_bound (the
contract check) and prepare_decode_caps (caller).

Closes #101.

codecov · 2026-05-19T15:43:57Z

Codecov Report

❌ Patch coverage is 77.44361% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.30%. Comparing base (5f9226d) to head (601600a).

Files with missing lines	Patch %	Lines
python/damacy/_api.c	0.00%	7 Missing ⚠️
src/wave/wave_pool.c	77.41%	3 Missing and 4 partials ⚠️
src/damacy_config.c	57.14%	2 Missing and 4 partials ⚠️
python/damacy/__init__.py	77.77%	4 Missing ⚠️
src/wave/wave_budget.c	91.89%	1 Missing and 2 partials ⚠️
src/wave/fanout.c	0.00%	1 Missing and 1 partial ⚠️
src/wave/wave.c	80.00%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #112      +/-   ##
==========================================
+ Coverage   55.64%   56.30%   +0.66%     
==========================================
  Files          49       50       +1     
  Lines        6903     6953      +50     
  Branches     1233     1238       +5     
==========================================
+ Hits         3841     3915      +74     
+ Misses       2585     2550      -35     
- Partials      477      488      +11

Flag	Coverage Δ
unittests	`56.30% <77.44%> (+0.66%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
src/damacy.c	`67.27% <100.00%> (-0.23%)`	⬇️
src/damacy_limits.h	`100.00% <100.00%> (ø)`
src/planner/coalesce.c	`89.65% <ø> (ø)`
src/planner/planner.c	`75.08% <100.00%> (+0.16%)`	⬆️
src/wave/host_slab.c	`56.71% <100.00%> (-2.99%)`	⬇️
src/zarr/zarr_chunk_layout.c	`75.75% <100.00%> (+6.06%)`	⬆️
src/zarr/zarr_meta_cache.c	`71.72% <ø> (ø)`
src/wave/wave.c	`86.56% <80.00%> (+0.20%)`	⬆️
src/wave/fanout.c	`80.00% <0.00%> (+32.00%)`	⬆️
src/wave/wave_budget.c	`70.77% <91.89%> (+8.93%)`	⬆️
... and 4 more

... and 7 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

ritvikvasan · 2026-05-19T16:21:37Z

GDS crash with this PR in the mix

I built a combined branch merging PRs #105, #110, #111, and #112 onto 614d080 (main), with -DDAMACY_ENABLE_GDS=ON. The non-GDS path works fine (100 steps, no errors), but the GDS path hits CUDA memory corruption:

damacy._native.DamacyError: damacy: pop failed (cuda error)
NCCL WARN Cuda failure 'an illegal memory access was encountered'
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling cublasCreate(handle)

Multiple ranks (3, 5, 6, and others) crashed with pop failed (cuda error). Rank 0 appears to have completed all 100 steps, but the CUDA errors make the results unreliable.

For reference, 614d080 (main without these PRs) works fine with GDS enabled — 100 steps, no crashes.

I suspect the observed-max nblocks fallback in this PR could underestimate buffer sizes for some chunk layouts with GDS reads, causing out-of-bounds GPU writes. But I haven't isolated which of the 4 PRs is the culprit — this might also be #111 (atomic refcount race) or a combination.

Config: 8-GPU H100, 22 datasets, enable_gds=true, host_buffer_waves=8, n_io_threads=8.

nclack · 2026-05-21T17:02:10Z

@ritvikvasan — the illegal-memory-access pattern you hit is a latent bug in the GDS path itself, independent of this PR. Tracked as #116, fixed in #117.

Root cause: gds_event_query in src/store/store_fs_gds.c returned 1 unconditionally for any non-sentinel seq, ignoring whether cuFileReadAsync had actually retired on stream_h2d. The wave-pool scheduler treats 1 as "destination bytes are safe to consume" and transitions the slot to SLOT_READY. The decode kernel runs on stream_decode, gated only on cross-stream events recorded on stream_h2d; under real-GDS (non-compat) mode cuFileReadAsync does not always retire before the subsequent cuEventRecord, so the decoder can read from dev_buf before cuFile has written to it. Higher-rank devices consistently lose the race — matches your rank 3/5/6 pattern.

This PR (#112) does not touch the GDS path. The combined-PR build likely perturbed scheduling enough to close the window where the race had been going unobserved on plain main.

#117 makes gds_event_query actually reflect cuFile completion (refcounted done struct set by the existing free-params callback, acquire-load + CAS-claim on the query side). Could you retest your 8-GPU config with #117 in the mix?

## Approach Close #101 by removing the dead static fallback and replacing the ad-hoc structural constants it derived from with explicit `damacy_tuning` knobs. `chunk_substreams_upper_bound` (formerly `chunk_zsubs_upper_bound`) in `src/wave/wave_pool.c` sizes the per-wave fanout SOA and the shared nvcomp zstd decoder scratch. Its `!sp->layout_probed` fallback returned a hardcoded `DAMACY_BLOSC_MAX_BLOCKS_PER_CHUNK = 32` — the adversarial worst case. But `wave_chunks_eligible` (per-chunk gate, runs before `prepare_decode_caps` in `kick_h2d`) rejects any wave containing an unprobed BLOSC_ZSTD chunk with `DAMACY_INVAL`, so the fallback is structurally unreachable. The "perf" framing of the original issue was moot. This PR: - **Turns the implicit gate-vs-sizer contract into an explicit check.** `chunk_substreams_upper_bound` now returns `enum damacy_status`; on unprobed BLOSC it returns `DAMACY_INVAL` with a `log_error("gate-vs-sizer contract violated")` at the caller. A future gate regression now fails loudly instead of silently undersizing the fanout SOA. - **Replaces the two compile-time constants** (`DAMACY_MAX_CHUNKS_PER_WAVE`, `DAMACY_BLOSC_MAX_BLOCKS_PER_CHUNK`) with `damacy_tuning.max_chunks_per_wave` and `damacy_tuning.max_substreams_per_chunk`. The parser, planner, coalesce, wave_pool, fanout, wave_budget, and meta_cache all thread the effective values through their existing param chains. New `DAMACY_DEFAULT_*` siblings preserve current behavior; `0` in either field resolves to the default. `WAVE_ZSUBS_STRUCTURAL_MAX` becomes a runtime field `wave_pool.max_substreams_per_wave` derived once at init. - **Drops the dead substream rename target.** `zsubs` was a contraction that read as zstd-specific; renames to `substreams` everywhere (the noun that matches both BLOSC1 spec language and the nvcomp batched-decode input it actually counts). - **Strips machinery wired only to the unreachable branch:** the `_Atomic(uint16_t) observed_max_nblocks_per_chunk` slot, its `atomic_u16_observe_max` CAS-loop helper (`src/util/atomic_max.h`), the meta-cache observer setter, the bump sites in `zarr_meta_cache_layout_set` / `_probe_layout`, and the wiring in `damacy_create`. `zarr/zarr_meta_cache.h` returns to `extern "C"` shape (matches main) — the C-only `static_assert` is no longer needed. ## API Two new optional fields on `damacy_tuning` (Python `Config`): - `max_chunks_per_wave: int = 0` — `0` → 512 (current behavior). Clamped to `0xFFFFu` (the 16-bit chunk_idx packing in `d_block_chunk_map`). - `max_substreams_per_chunk: int = 0` — `0` → 32 (current behavior). Parser rejects blosc1 layouts above this with `DAMACY_DECODE`. ## Key file `src/wave/wave_pool.c:355` — `chunk_substreams_upper_bound` (the contract check) and `prepare_decode_caps` (caller). Closes #101. 5672f7e

## Approach Close #101 by removing the dead static fallback and replacing the ad-hoc structural constants it derived from with explicit `damacy_tuning` knobs. `chunk_substreams_upper_bound` (formerly `chunk_zsubs_upper_bound`) in `src/wave/wave_pool.c` sizes the per-wave fanout SOA and the shared nvcomp zstd decoder scratch. Its `!sp->layout_probed` fallback returned a hardcoded `DAMACY_BLOSC_MAX_BLOCKS_PER_CHUNK = 32` — the adversarial worst case. But `wave_chunks_eligible` (per-chunk gate, runs before `prepare_decode_caps` in `kick_h2d`) rejects any wave containing an unprobed BLOSC_ZSTD chunk with `DAMACY_INVAL`, so the fallback is structurally unreachable. The "perf" framing of the original issue was moot. This PR: - **Turns the implicit gate-vs-sizer contract into an explicit check.** `chunk_substreams_upper_bound` now returns `enum damacy_status`; on unprobed BLOSC it returns `DAMACY_INVAL` with a `log_error("gate-vs-sizer contract violated")` at the caller. A future gate regression now fails loudly instead of silently undersizing the fanout SOA. - **Replaces the two compile-time constants** (`DAMACY_MAX_CHUNKS_PER_WAVE`, `DAMACY_BLOSC_MAX_BLOCKS_PER_CHUNK`) with `damacy_tuning.max_chunks_per_wave` and `damacy_tuning.max_substreams_per_chunk`. The parser, planner, coalesce, wave_pool, fanout, wave_budget, and meta_cache all thread the effective values through their existing param chains. New `DAMACY_DEFAULT_*` siblings preserve current behavior; `0` in either field resolves to the default. `WAVE_ZSUBS_STRUCTURAL_MAX` becomes a runtime field `wave_pool.max_substreams_per_wave` derived once at init. - **Drops the dead substream rename target.** `zsubs` was a contraction that read as zstd-specific; renames to `substreams` everywhere (the noun that matches both BLOSC1 spec language and the nvcomp batched-decode input it actually counts). - **Strips machinery wired only to the unreachable branch:** the `_Atomic(uint16_t) observed_max_nblocks_per_chunk` slot, its `atomic_u16_observe_max` CAS-loop helper (`src/util/atomic_max.h`), the meta-cache observer setter, the bump sites in `zarr_meta_cache_layout_set` / `_probe_layout`, and the wiring in `damacy_create`. `zarr/zarr_meta_cache.h` returns to `extern "C"` shape (matches main) — the C-only `static_assert` is no longer needed. ## API Two new optional fields on `damacy_tuning` (Python `Config`): - `max_chunks_per_wave: int = 0` — `0` → 512 (current behavior). Clamped to `0xFFFFu` (the 16-bit chunk_idx packing in `d_block_chunk_map`). - `max_substreams_per_chunk: int = 0` — `0` → 32 (current behavior). Parser rejects blosc1 layouts above this with `DAMACY_DECODE`. ## Key file `src/wave/wave_pool.c:355` — `chunk_substreams_upper_bound` (the contract check) and `prepare_decode_caps` (caller). Closes #101.

ritvikvasan mentioned this pull request May 19, 2026

Default host_buffer_waves=2 causes GPU stalls on networked storage #91

Open

nclack force-pushed the worktree-issue-101-observed-max-nblocks branch from 3204d3d to 5e01d25 Compare May 19, 2026 17:50

ritvikvasan mentioned this pull request May 20, 2026

GDS path: CUDA illegal memory access on multi-GPU (reproducible) #116

Closed

nclack added 15 commits May 19, 2026 20:45

wave_pool: observed-max nblocks fallback

5074f44

atomic_max: generic u16 observe helper

47b86ce

wave: WAVE_ZSUBS_STRUCTURAL_MAX macro

a88c024

init order + comments + typed observer

04b9720

atomic_max: release/acquire pair

3d9883e

limits: static_asserts + observer asserts

8de0506

meta_cache.h: loud C++ rejection

89c3288

damacy: clarify observer ordering note

e28fba3

atomic_max: clarify CAS ordering note

6a07192

wave_pool: note observer init=1 invariant

05e7ae7

meta_cache.h: actionable C++ message

760d58c

test: observer-slot advance/no-op

7723b45

kick_h2d: eligible before decode caps

6b5b09b

meta_cache.h: static_assert for C++

4987670

test: exercise NULL observer slot

8f01759

nclack force-pushed the worktree-issue-101-observed-max-nblocks branch from 53f229a to 8f01759 Compare May 20, 2026 03:52

nclack added 9 commits May 20, 2026 10:20

Merge main into PR branch

1cd992e

zarr_meta_cache: assert -> CHECK

40f9d22

Strip observer; harden gate contract

504f7c7

Tuning knobs: chunks/wave + substreams/chunk

e7b9200

Rename zsubs -> substreams

f4919f1

Hard-cap max_substreams_per_chunk

f163ca8

Python: enforce 0xFFFF bound on wave caps

ad8e251

planner.h: fix sample_plan layout comment

456753f

wave_pool: drop unused max_substreams_per_chunk

a7db2d6

nclack added 2 commits May 21, 2026 08:23

planner_create: guard non-zero wave caps

dd01bf4

Group wave caps with chunk-size knobs

8b56e44

nclack added 10 commits May 21, 2026 10:30

Match _native.pyi arg order to _api.c

e939804

Widen max_substreams_per_chunk to uint32

e3e6474

Decouple IO queue initial cap comment

65c754c

Test substreams cap rejects excess blocks

5b9b59a

Test gate contract for substreams sizer

c6b8d5e

damacy_limits: clean static_assert formatting

812e58d

Centralize max_substreams_per_wave helper

a16f4bc

wave_pool.h: note test-only exposure

55ab403

Drive substream-cap grow with l1 fixture

328f8c3

Merge remote-tracking branch 'origin/main' into pr112-review-fixes

601600a

nclack changed the title ~~wave_pool: observed-max nblocks fallback~~ Runtime tuning: chunks/wave + substreams/chunk May 21, 2026

nclack merged commit 5672f7e into main May 21, 2026
6 checks passed

nclack deleted the worktree-issue-101-observed-max-nblocks branch May 21, 2026 18:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Runtime tuning: chunks/wave + substreams/chunk#112

Runtime tuning: chunks/wave + substreams/chunk#112
nclack merged 36 commits into
mainfrom
worktree-issue-101-observed-max-nblocks

nclack commented May 19, 2026 •

edited

Loading

Uh oh!

codecov Bot commented May 19, 2026 •

edited

Loading

Uh oh!

ritvikvasan commented May 19, 2026

Uh oh!

nclack commented May 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

nclack commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Approach

API

Key file

Uh oh!

codecov Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

ritvikvasan commented May 19, 2026

GDS crash with this PR in the mix

Uh oh!

nclack commented May 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

nclack commented May 19, 2026 •

edited

Loading

codecov Bot commented May 19, 2026 •

edited

Loading