fix: tile subsampling for long audio to avoid ggml 2^31 tensor overflow on GPU by localai-bot · Pull Request #19 · mudler/parakeet.cpp

localai-bot · 2026-06-07T18:35:16Z

Problem

Transcribing audio longer than ~44 min on GPU crashed. There were two distinct CUDA limits in the long-audio path, both invisible to CPU tests (CPU has neither limit):

(1) Subsampling relu: int32 element overflow. The first subsampling conv produces a tensor of (n_mels/2)·(T/2)·conv_channels elements. For tdt-0.6b-v3 (n_mels=128, conv_channels=256), a 51-min clip yields 2,521,890,816 elements, which exceeds INT_MAX. ggml's CUDA unary (relu) kernel indexes elements with int, so the count wraps negative and the launch fails with "invalid configuration argument" in ggml_cuda_op_relu. (PyTorch hits the same 2³¹ wall, see pytorch/pytorch#80020, "canUse32BitIndexMath not working properly in Conv2D layer"; NeMo chunks the subsampling conv via subsampling_conv_chunking_factor.)
(2) Banded attention pad: gridDim.y cap. Once past (1), the encoder's banded local attention over-pads K/V to a contiguous axis of length Lk = (C+P-1)·ceil(T'/C) ≈ 77k for T'=38,481. ggml's CUDA pad kernel maps ne1 straight to gridDim.y, which CUDA caps at 65535, so the launch fails with "PAD failed" / "invalid argument".
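The arithmetic behind limit (1) can be checked with a short sketch. The helper names below are hypothetical and are not the project's safe_mel_window; T = 307,848 mel frames is inferred from T' = 38,481 under the assumption of exactly 8x subsampling.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Element count of the first subsampling conv output, using the formula from
// the text: (n_mels/2) * (T/2) * conv_channels. Computed in 64 bits so the
// count itself cannot wrap.
inline int64_t subsampling_conv_elems(int64_t n_mels, int64_t n_frames,
                                      int64_t conv_channels) {
    return (n_mels / 2) * (n_frames / 2) * conv_channels;
}

// True if an element count can still be indexed with a signed 32-bit int.
inline bool fits_int32(int64_t n) {
    return n <= std::numeric_limits<int32_t>::max();
}

// Largest mel-frame count T whose conv output still fits in int32.
// Hypothetical helper for illustration only.
inline int64_t max_frames_int32(int64_t n_mels, int64_t conv_channels) {
    assert(n_mels >= 2 && conv_channels >= 1);
    const int64_t per_half_frame = (n_mels / 2) * conv_channels;
    const int64_t max_half =
        std::numeric_limits<int32_t>::max() / per_half_frame;
    return 2 * max_half + 1;  // T/2 is integer division, so an odd T is allowed
}
```

For n_mels=128 and conv_channels=256 this gives a ceiling of 262,143 frames, which is about 44 min if the mel hop is 10 ms (an assumption, not stated in the PR).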

Fix

(1) Tile the subsampling stage over time (Subsampling::forward_tiled, Encoder::forward_batch_tiled) so no conv tensor exceeds 2³¹, then run the unchanged conformer stack on the full sequence. The subsampler's receptive field is ±7 mel frames, so tiling with an 8-frame halo is bit-exact on interior frames. Done in our code (no ggml change) — covers CPU/Metal too and bounds the ~10 GB activation spike. Model::transcribe_* route long audio to the tiled path above a model-derived threshold (safe_mel_window) via one subsampling_tile_for helper; both batched and single-clip (CLI / transcribe_path) paths are wired. PARAKEET_SUBSAMPLING_TILE=<frames> forces it (testing).
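Why a halo at least as wide as the receptive field keeps tiling exact can be seen on a toy model. This is a minimal sketch, not Subsampling::forward_tiled: it replaces the strided conv stack with an unstrided, zero-padded 1-D integer filter, and all function names are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Toy stand-in for the subsampler: the output at t depends on inputs
// t-radius .. t+radius, with zero padding past the ends of the input.
std::vector<int64_t> filter_full(const std::vector<int64_t>& x, int radius) {
    const int n = static_cast<int>(x.size());
    std::vector<int64_t> y(n, 0);
    for (int t = 0; t < n; ++t) {
        for (int k = -radius; k <= radius; ++k) {
            const int s = t + k;
            if (s >= 0 && s < n) y[t] += (k + radius + 1) * x[s];  // arbitrary weights
        }
    }
    return y;
}

// Same filter computed tile by tile. Each tile reads `halo` extra input frames
// on both sides (clipped at the global ends) and keeps only its own outputs.
std::vector<int64_t> filter_tiled(const std::vector<int64_t>& x, int radius,
                                  int tile, int halo) {
    assert(tile > 0 && halo >= 0);
    const int n = static_cast<int>(x.size());
    std::vector<int64_t> y(n, 0);
    for (int t0 = 0; t0 < n; t0 += tile) {
        const int t1 = std::min(t0 + tile, n);   // tile covers outputs [t0, t1)
        const int w0 = std::max(t0 - halo, 0);   // input window start
        const int w1 = std::min(t1 + halo, n);   // input window end
        const std::vector<int64_t> window(x.begin() + w0, x.begin() + w1);
        const std::vector<int64_t> yw = filter_full(window, radius);
        // Drop the halo outputs, which were computed with missing context.
        std::copy(yw.begin() + (t0 - w0), yw.begin() + (t1 - w0), y.begin() + t0);
    }
    return y;
}
```

With radius 7, a halo of 8 reproduces the full result exactly, while a halo of 3 corrupts outputs near tile boundaries. The toy uses integers so that equality is exact; floating-point kernels may still differ slightly in rounding between the fused and tiled paths.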

(2) Grid-stride the ggml-cuda pad kernel (third_party/ggml-patches/0004-cuda-pad-grid-stride.patch) so it handles ne1/ne2·ne3 > 65535. A kernel audit confirmed pad is the only op in the long-audio encoder that routes a large dim through the capped gridDim.y/z (softmax/norm use gridDim.x; add/mul auto-fall-back to a flattened x; im2col already grid-strides; cpy/scale/concat are x-only/int64). The fix is perf-neutral (when a dim ≤ 65535 the stride loop runs exactly once → identical launch geometry) and general — it lifts the ceiling to ~23 h (next limit is an unrelated bin_bcast int32 index). This is what PyTorch already does, and it's upstreamable.
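The grid-stride idea can be illustrated with a CPU emulation of the launch geometry. This is a sketch under assumptions, not the contents of 0004-cuda-pad-grid-stride.patch: the function names are hypothetical, and only the y dimension is modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int64_t kMaxGridY = 65535;  // CUDA limit on gridDim.y and gridDim.z

// Grid size a grid-stride launch would request for a dimension of ne1 rows:
// never more than the hardware cap.
inline int64_t launch_grid_y(int64_t ne1) {
    return std::min(ne1, kMaxGridY);
}

// Emulates the per-block loop on the CPU: block `by` handles rows
// by, by + grid_y, by + 2*grid_y, ... Returns the visit count of every row.
std::vector<int> emulate_grid_stride(int64_t ne1) {
    assert(ne1 > 0);
    std::vector<int> visits(static_cast<size_t>(ne1), 0);
    const int64_t grid_y = launch_grid_y(ne1);
    for (int64_t by = 0; by < grid_y; ++by) {
        for (int64_t i1 = by; i1 < ne1; i1 += grid_y) {
            ++visits[static_cast<size_t>(i1)];
        }
    }
    return visits;
}

// Largest number of loop iterations any single block performs.
inline int64_t max_iterations(int64_t ne1) {
    assert(ne1 > 0);
    const int64_t g = launch_grid_y(ne1);
    return (ne1 + g - 1) / g;
}
```

For ne1 ≤ 65535 the grid equals ne1 and every block iterates once, which matches the perf-neutral claim; for ne1 = 77,000 some blocks iterate twice and every row is still written exactly once.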

Validation

test_subsampling_tiling — forward_tiled vs forward: single-tile bit-exact; multi-tile worst per-frame rel ~1.8e-5.
test_encoder_long — forward_batch_tiled vs forward_batch: injection layout verified (large-tile worst rel 3.4e-3).
test_transcribe_tiled — full pipeline, fused vs forced-tiled, identical non-empty transcripts on batched and single-clip paths.
No GPU regression (A/B of the pad grid-stride, patched vs reverted libggml-cuda.so, GB10): 20-min banded clip +0.01%, 60-s clip +0.53% (both within run-to-run noise), proc_ms transcribe-only, output byte-identical. Re-verified end-to-end on the rebased PR head with a 52-min synthetic clip (exit 0, no CUDA error). See comment for the table.
GPU end-to-end on dgx.casa (GB10, CUDA 13, sm_121), real tdt-0.6b-v3, the 51-min file: PASS. Was a crash on master; now transcribes in ~16 s (≈192× realtime) to a complete 9,196-word transcript, exit 0, no CUDA error.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

localai-bot · 2026-06-07T21:27:38Z

Post-rebase GPU re-verification + perf regression check

After rebasing onto current master, re-verified on dgx.casa (GB10, CUDA 13, sm_121) at the PR head:

1. End-to-end re-run on the rebased commit. The private 51-min repro file was deleted for privacy, so this run used a ~52-min synthetic clip (the speech.wav fixture looped), which is longer than the original and triggers both fixes:

parakeet-cli transcribe exit 0, no CUDA error, ~16 s, full 9,266-word transcript. The earlier 51-min run and this 52-min run on the rebased head both pass.

2. Performance regression A/B — the only change on the common (short/normal) execution path is the pad grid-stride. A/B'd it by swapping libggml-cuda.so (patched vs the kernel reverted) on the same parakeet-cli, 3 reps each, GB10, proc_ms = transcribe-only (the bench warms up first):

| clip | audio | patched | unpatched | delta | patched RTFx |
| --- | --- | --- | --- | --- | --- |
| 20 min (banded local-attn, where `pad` runs hardest) | 1197 s | 5404.5 ms | 5404.1 ms | +0.01% | 221x |
| 60 s (full attention) | 59.5 s | 263.5 ms | 262.1 ms | +0.53% | 226x |

Both deltas are within run-to-run noise (rep spreads overlap), and text_identical=True for both clips — the grid-stride pad is perf-neutral and output-identical, as the kernel audit predicted (when a dim <= 65535 the stride loop runs exactly once). The subsampling-tiling change only engages above ~30 min, where the alternative was a crash, so there is no short-audio path to regress there.
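The delta and RTFx columns can be reproduced from the raw timings. The sketch below assumes delta is measured relative to the unpatched time and RTFx is audio duration divided by processing time; the helper names are illustrative only.

```cpp
#include <cassert>
#include <cmath>

// Relative slowdown of the patched build versus the unpatched one, in percent.
inline double delta_percent(double patched_ms, double unpatched_ms) {
    assert(unpatched_ms > 0.0);
    return (patched_ms - unpatched_ms) / unpatched_ms * 100.0;
}

// Realtime factor: seconds of audio processed per second of wall time.
inline double rtfx(double audio_s, double proc_ms) {
    assert(proc_ms > 0.0);
    return audio_s / (proc_ms / 1000.0);
}

// Rounds to the nearest integer, as the RTFx column appears to do.
inline long rounded(double v) {
    return std::lround(v);
}
```

Applied to the table rows, delta_percent(263.5, 262.1) is about 0.53 and rounded(rtfx(1197.0, 5404.5)) is 221.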

Conclusion: no performance or accuracy regression; long audio that crashed on master now transcribes.

mudler and others added 7 commits June 7, 2026 21:10

feat(subsampling): add subsample_len spatial-length helper

db4a19c

feat(subsampling): tiled long-audio path (parity vs forward)

11ebf46

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

refactor(subsampling): cache valid_out_len in forward_tiled; document…

5f37b9f

… tiling test invariant

feat(encoder): forward_batch_tiled from pre-subsampled features

17fc6fb

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(model): tile subsampling for long audio above safe threshold

162742b

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(model): tile single-clip transcribe for long audio (CLI/path C-API)

ed1bb9b

fix(ggml-cuda): grid-stride pad kernel for dims > 65535 (long-audio a…

236c688

…ttention)

mudler force-pushed the worktree-long-audio-tiling branch from 0cc3194 to 236c688 Compare June 7, 2026 21:11

mudler merged commit 96b81bb into master Jun 7, 2026
7 checks passed
