Skip to content

Runtime tuning: chunks/wave + substreams/chunk#112

Merged
nclack merged 36 commits into
mainfrom
worktree-issue-101-observed-max-nblocks
May 21, 2026
Merged

Runtime tuning: chunks/wave + substreams/chunk#112
nclack merged 36 commits into
mainfrom
worktree-issue-101-observed-max-nblocks

Conversation

@nclack
Copy link
Copy Markdown
Owner

@nclack nclack commented May 19, 2026

Approach

Close #101 by removing the dead static fallback and replacing the
ad-hoc structural constants it derived from with explicit
damacy_tuning knobs.

chunk_substreams_upper_bound (formerly chunk_zsubs_upper_bound)
in src/wave/wave_pool.c sizes the per-wave fanout SOA and the
shared nvcomp zstd decoder scratch. Its !sp->layout_probed fallback
returned a hardcoded DAMACY_BLOSC_MAX_BLOCKS_PER_CHUNK = 32 — the
adversarial worst case. But wave_chunks_eligible (per-chunk gate,
runs before prepare_decode_caps in kick_h2d) rejects any wave
containing an unprobed BLOSC_ZSTD chunk with DAMACY_INVAL, so the
fallback is structurally unreachable. The "perf" framing of the
original issue was moot.

This PR:

  • Turns the implicit gate-vs-sizer contract into an explicit
    check.
    chunk_substreams_upper_bound now returns
    enum damacy_status; on unprobed BLOSC it returns DAMACY_INVAL
    with a log_error("gate-vs-sizer contract violated") at the
    caller. A future gate regression now fails loudly instead of
    silently undersizing the fanout SOA.
  • Replaces the two compile-time constants (DAMACY_MAX_CHUNKS_PER_WAVE,
    DAMACY_BLOSC_MAX_BLOCKS_PER_CHUNK) with damacy_tuning.max_chunks_per_wave
    and damacy_tuning.max_substreams_per_chunk. The parser, planner,
    coalesce, wave_pool, fanout, wave_budget, and meta_cache all thread
    the effective values through their existing param chains. New
    DAMACY_DEFAULT_* siblings preserve current behavior; 0 in either
    field resolves to the default. WAVE_ZSUBS_STRUCTURAL_MAX becomes
    a runtime field wave_pool.max_substreams_per_wave derived once at
    init.
  • Drops the dead substream rename target. zsubs was a
    contraction that read as zstd-specific; renames to substreams
    everywhere (the noun that matches both BLOSC1 spec language and the
    nvcomp batched-decode input it actually counts).
  • Strips machinery wired only to the unreachable branch: the
    _Atomic(uint16_t) observed_max_nblocks_per_chunk slot, its
    atomic_u16_observe_max CAS-loop helper (src/util/atomic_max.h),
    the meta-cache observer setter, the bump sites in
    zarr_meta_cache_layout_set / _probe_layout, and the wiring in
    damacy_create. zarr/zarr_meta_cache.h returns to extern "C"
    shape (matches main) — the C-only static_assert is no longer
    needed.

API

Two new optional fields on damacy_tuning (Python Config):

  • max_chunks_per_wave: int = 00 → 512 (current behavior).
    Clamped to 0xFFFFu (the 16-bit chunk_idx packing in
    d_block_chunk_map).
  • max_substreams_per_chunk: int = 00 → 32 (current behavior).
    Parser rejects blosc1 layouts above this with DAMACY_DECODE.

Key file

src/wave/wave_pool.c:355chunk_substreams_upper_bound (the
contract check) and prepare_decode_caps (caller).

Closes #101.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 19, 2026

Codecov Report

❌ Patch coverage is 77.44361% with 30 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.30%. Comparing base (5f9226d) to head (601600a).

Files with missing lines Patch % Lines
python/damacy/_api.c 0.00% 7 Missing ⚠️
src/wave/wave_pool.c 77.41% 3 Missing and 4 partials ⚠️
src/damacy_config.c 57.14% 2 Missing and 4 partials ⚠️
python/damacy/__init__.py 77.77% 4 Missing ⚠️
src/wave/wave_budget.c 91.89% 1 Missing and 2 partials ⚠️
src/wave/fanout.c 0.00% 1 Missing and 1 partial ⚠️
src/wave/wave.c 80.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #112      +/-   ##
==========================================
+ Coverage   55.64%   56.30%   +0.66%     
==========================================
  Files          49       50       +1     
  Lines        6903     6953      +50     
  Branches     1233     1238       +5     
==========================================
+ Hits         3841     3915      +74     
+ Misses       2585     2550      -35     
- Partials      477      488      +11     
Flag Coverage Δ
unittests 56.30% <77.44%> (+0.66%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
src/damacy.c 67.27% <100.00%> (-0.23%) ⬇️
src/damacy_limits.h 100.00% <100.00%> (ø)
src/planner/coalesce.c 89.65% <ø> (ø)
src/planner/planner.c 75.08% <100.00%> (+0.16%) ⬆️
src/wave/host_slab.c 56.71% <100.00%> (-2.99%) ⬇️
src/zarr/zarr_chunk_layout.c 75.75% <100.00%> (+6.06%) ⬆️
src/zarr/zarr_meta_cache.c 71.72% <ø> (ø)
src/wave/wave.c 86.56% <80.00%> (+0.20%) ⬆️
src/wave/fanout.c 80.00% <0.00%> (+32.00%) ⬆️
src/wave/wave_budget.c 70.77% <91.89%> (+8.93%) ⬆️
... and 4 more

... and 7 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ritvikvasan
Copy link
Copy Markdown

GDS crash with this PR in the mix

I built a combined branch merging PRs #105, #110, #111, and #112 onto 614d080 (main), with -DDAMACY_ENABLE_GDS=ON. The non-GDS path works fine (100 steps, no errors), but the GDS path hits CUDA memory corruption:

damacy._native.DamacyError: damacy: pop failed (cuda error)
NCCL WARN Cuda failure 'an illegal memory access was encountered'
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling cublasCreate(handle)

Multiple ranks (3, 5, 6, and others) crashed with pop failed (cuda error). Rank 0 appears to have completed all 100 steps, but the CUDA errors make the results unreliable.

For reference, 614d080 (main without these PRs) works fine with GDS enabled — 100 steps, no crashes.

I suspect the observed-max nblocks fallback in this PR could underestimate buffer sizes for some chunk layouts with GDS reads, causing out-of-bounds GPU writes. But I haven't isolated which of the 4 PRs is the culprit — this might also be #111 (atomic refcount race) or a combination.

Config: 8-GPU H100, 22 datasets, enable_gds=true, host_buffer_waves=8, n_io_threads=8.

@nclack nclack force-pushed the worktree-issue-101-observed-max-nblocks branch from 53f229a to 8f01759 Compare May 20, 2026 03:52
@nclack
Copy link
Copy Markdown
Owner Author

nclack commented May 21, 2026

@ritvikvasan — the illegal-memory-access pattern you hit is a latent bug in the GDS path itself, independent of this PR. Tracked as #116, fixed in #117.

Root cause: gds_event_query in src/store/store_fs_gds.c returned 1 unconditionally for any non-sentinel seq, ignoring whether cuFileReadAsync had actually retired on stream_h2d. The wave-pool scheduler treats 1 as "destination bytes are safe to consume" and transitions the slot to SLOT_READY. The decode kernel runs on stream_decode, gated only on cross-stream events recorded on stream_h2d; under real-GDS (non-compat) mode cuFileReadAsync does not always retire before the subsequent cuEventRecord, so the decoder can read from dev_buf before cuFile has written to it. Higher-rank devices consistently lose the race — matches your rank 3/5/6 pattern.

This PR (#112) does not touch the GDS path. The combined-PR build likely perturbed scheduling enough to close the window where the race had been going unobserved on plain main.

#117 makes gds_event_query actually reflect cuFile completion (refcounted done struct set by the existing free-params callback, acquire-load + CAS-claim on the query side). Could you retest your 8-GPU config with #117 in the mix?

@nclack nclack changed the title wave_pool: observed-max nblocks fallback Runtime tuning: chunks/wave + substreams/chunk May 21, 2026
@nclack nclack merged commit 5672f7e into main May 21, 2026
6 checks passed
@nclack nclack deleted the worktree-issue-101-observed-max-nblocks branch May 21, 2026 18:39
github-actions Bot added a commit that referenced this pull request May 21, 2026
## Approach

Close #101 by removing the dead static fallback and replacing the
ad-hoc structural constants it derived from with explicit
`damacy_tuning` knobs.

`chunk_substreams_upper_bound` (formerly `chunk_zsubs_upper_bound`)
in `src/wave/wave_pool.c` sizes the per-wave fanout SOA and the
shared nvcomp zstd decoder scratch. Its `!sp->layout_probed` fallback
returned a hardcoded `DAMACY_BLOSC_MAX_BLOCKS_PER_CHUNK = 32` — the
adversarial worst case. But `wave_chunks_eligible` (per-chunk gate,
runs before `prepare_decode_caps` in `kick_h2d`) rejects any wave
containing an unprobed BLOSC_ZSTD chunk with `DAMACY_INVAL`, so the
fallback is structurally unreachable. The "perf" framing of the
original issue was moot.

This PR:

- **Turns the implicit gate-vs-sizer contract into an explicit
  check.** `chunk_substreams_upper_bound` now returns
  `enum damacy_status`; on unprobed BLOSC it returns `DAMACY_INVAL`
  with a `log_error("gate-vs-sizer contract violated")` at the
  caller. A future gate regression now fails loudly instead of
  silently undersizing the fanout SOA.
- **Replaces the two compile-time constants**
(`DAMACY_MAX_CHUNKS_PER_WAVE`,
`DAMACY_BLOSC_MAX_BLOCKS_PER_CHUNK`) with
`damacy_tuning.max_chunks_per_wave`
  and `damacy_tuning.max_substreams_per_chunk`. The parser, planner,
  coalesce, wave_pool, fanout, wave_budget, and meta_cache all thread
  the effective values through their existing param chains. New
  `DAMACY_DEFAULT_*` siblings preserve current behavior; `0` in either
  field resolves to the default. `WAVE_ZSUBS_STRUCTURAL_MAX` becomes
  a runtime field `wave_pool.max_substreams_per_wave` derived once at
  init.
- **Drops the dead substream rename target.** `zsubs` was a
  contraction that read as zstd-specific; renames to `substreams`
  everywhere (the noun that matches both BLOSC1 spec language and the
  nvcomp batched-decode input it actually counts).
- **Strips machinery wired only to the unreachable branch:** the
  `_Atomic(uint16_t) observed_max_nblocks_per_chunk` slot, its
  `atomic_u16_observe_max` CAS-loop helper (`src/util/atomic_max.h`),
  the meta-cache observer setter, the bump sites in
  `zarr_meta_cache_layout_set` / `_probe_layout`, and the wiring in
  `damacy_create`. `zarr/zarr_meta_cache.h` returns to `extern "C"`
  shape (matches main) — the C-only `static_assert` is no longer
  needed.

## API

Two new optional fields on `damacy_tuning` (Python `Config`):

- `max_chunks_per_wave: int = 0` — `0` → 512 (current behavior).
  Clamped to `0xFFFFu` (the 16-bit chunk_idx packing in
  `d_block_chunk_map`).
- `max_substreams_per_chunk: int = 0` — `0` → 32 (current behavior).
  Parser rejects blosc1 layouts above this with `DAMACY_DECODE`.

## Key file

`src/wave/wave_pool.c:355` — `chunk_substreams_upper_bound` (the
contract check) and `prepare_decode_caps` (caller).

Closes #101. 5672f7e
nclack added a commit that referenced this pull request May 22, 2026
## Approach

Close #101 by removing the dead static fallback and replacing the
ad-hoc structural constants it derived from with explicit
`damacy_tuning` knobs.

`chunk_substreams_upper_bound` (formerly `chunk_zsubs_upper_bound`)
in `src/wave/wave_pool.c` sizes the per-wave fanout SOA and the
shared nvcomp zstd decoder scratch. Its `!sp->layout_probed` fallback
returned a hardcoded `DAMACY_BLOSC_MAX_BLOCKS_PER_CHUNK = 32` — the
adversarial worst case. But `wave_chunks_eligible` (per-chunk gate,
runs before `prepare_decode_caps` in `kick_h2d`) rejects any wave
containing an unprobed BLOSC_ZSTD chunk with `DAMACY_INVAL`, so the
fallback is structurally unreachable. The "perf" framing of the
original issue was moot.

This PR:

- **Turns the implicit gate-vs-sizer contract into an explicit
  check.** `chunk_substreams_upper_bound` now returns
  `enum damacy_status`; on unprobed BLOSC it returns `DAMACY_INVAL`
  with a `log_error("gate-vs-sizer contract violated")` at the
  caller. A future gate regression now fails loudly instead of
  silently undersizing the fanout SOA.
- **Replaces the two compile-time constants**
(`DAMACY_MAX_CHUNKS_PER_WAVE`,
`DAMACY_BLOSC_MAX_BLOCKS_PER_CHUNK`) with
`damacy_tuning.max_chunks_per_wave`
  and `damacy_tuning.max_substreams_per_chunk`. The parser, planner,
  coalesce, wave_pool, fanout, wave_budget, and meta_cache all thread
  the effective values through their existing param chains. New
  `DAMACY_DEFAULT_*` siblings preserve current behavior; `0` in either
  field resolves to the default. `WAVE_ZSUBS_STRUCTURAL_MAX` becomes
  a runtime field `wave_pool.max_substreams_per_wave` derived once at
  init.
- **Drops the dead substream rename target.** `zsubs` was a
  contraction that read as zstd-specific; renames to `substreams`
  everywhere (the noun that matches both BLOSC1 spec language and the
  nvcomp batched-decode input it actually counts).
- **Strips machinery wired only to the unreachable branch:** the
  `_Atomic(uint16_t) observed_max_nblocks_per_chunk` slot, its
  `atomic_u16_observe_max` CAS-loop helper (`src/util/atomic_max.h`),
  the meta-cache observer setter, the bump sites in
  `zarr_meta_cache_layout_set` / `_probe_layout`, and the wiring in
  `damacy_create`. `zarr/zarr_meta_cache.h` returns to `extern "C"`
  shape (matches main) — the C-only `static_assert` is no longer
  needed.

## API

Two new optional fields on `damacy_tuning` (Python `Config`):

- `max_chunks_per_wave: int = 0` — `0` → 512 (current behavior).
  Clamped to `0xFFFFu` (the 16-bit chunk_idx packing in
  `d_block_chunk_map`).
- `max_substreams_per_chunk: int = 0` — `0` → 32 (current behavior).
  Parser rejects blosc1 layouts above this with `DAMACY_DECODE`.

## Key file

`src/wave/wave_pool.c:355` — `chunk_substreams_upper_bound` (the
contract check) and `prepare_decode_caps` (caller).

Closes #101.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

wave_pool: replace DAMACY_BLOSC_MAX_BLOCKS_PER_CHUNK fallback with observed-max counter

2 participants