Batched encoder: run N clips through one fused ggml graph #1

Merged

mudler merged 41 commits into

worktree-batched-encoder

Jun 1, 2026

Owner

mudler commented May 31, 2026

Adds batched-encoder support: N audio clips run through ONE fused ggml encoder graph (subsampling, conformer stack, output) at batch size B, instead of one graph per clip. The single-clip (B=1) path is kept numerically faithful to before. Decode stays per-item over the batched encoder output (batched decode is intentionally a separate future milestone).

What landed:

MelBatch + Encoder::forward_batch (one fused batched graph).
Batched builders, each with B on the trailing axis and a per-item valid_len vector driving the masks; every scalar build_graph is now a thin B=1 adapter (DRY, no duplicated bodies): Subsampling::build_graph_batched, RelPosAttention::build_graph_batched (incl. the 4D rel-shift), ConformerLayer::build_graph_batched, and a B-aware build_conv_module.
Model::transcribe_pcm_batch (+ extracted decode_enc_out) and C-API parakeet_capi_transcribe_pcm_batch.
bench-batch CLI subcommand + BENCHMARK.md notes.
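To make the "B on the trailing axis plus a per-item valid_len vector" layout concrete, here is a minimal sketch of packing ragged clips into one zero-padded buffer. PaddedBatch and pack_clips are hypothetical names for illustration, not the PR's MelBatch, and the frame-major input layout is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a batched mel container; the real MelBatch may differ.
struct PaddedBatch {
    int n_mels = 0;              // features per frame (contiguous axis)
    int t_max = 0;               // padded frame count = longest clip in the batch
    int batch = 0;               // B, the trailing (slowest-varying) axis
    std::vector<int> valid_len;  // true frame count of each item, drives the masks
    std::vector<float> data;     // logical shape [n_mels, t_max, B], zero-padded
};

// Assumption: each input clip is frame-major, i.e. clip[t * n_mels + m].
PaddedBatch pack_clips(const std::vector<std::vector<float>>& clips, int n_mels) {
    assert(n_mels > 0);
    PaddedBatch out;
    out.n_mels = n_mels;
    out.batch = static_cast<int>(clips.size());
    for (const auto& c : clips) {
        assert(c.size() % static_cast<std::size_t>(n_mels) == 0);
        const int t = static_cast<int>(c.size() / static_cast<std::size_t>(n_mels));
        out.valid_len.push_back(t);
        out.t_max = std::max(out.t_max, t);
    }
    // Zero-fill once, then copy each clip into the head of its own slab.
    out.data.assign(static_cast<std::size_t>(n_mels) * out.t_max * out.batch, 0.0f);
    for (int b = 0; b < out.batch; ++b) {
        const std::size_t slab = static_cast<std::size_t>(b) * out.t_max * n_mels;
        std::copy(clips[b].begin(), clips[b].end(), out.data.begin() + slab);
    }
    return out;
}
```

Every item shares the padded width t_max, so a single graph can process the whole buffer; valid_len is what lets later stages tell real frames from padding.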

Correctness

Built test-first against the smallest validated model (parakeet-tdt_ctc-110m, f32 so the tight 1e-2 parity tolerances hold). Per-item batched-vs-standalone equivalence + padding-invariance tests at every level: test_subsampling_batch, test_relpos_attention_batch, test_conformer_batch, test_encoder_batch, test_capi_batch.

Attention and encoder equivalence are bit-exact (max|d| = 0) per item, including ragged batches; the full encoder is within ~2e-6 over 17 layers. Padding invariance is explicitly asserted (a shorter clip does not perturb a longer neighbor).
B=1 parity vs the NeMo baseline is unchanged across subsampling/relpos/conformer/encoder.

Two real bugs were caught by these gates and fixed:

Subsampling padding leakage: the non-causal path only masked the flattened output; in a batch the last valid frame read the bias-contaminated zero-pad region. Fixed with per-stage per-item trailing-pad input masking, matching NeMo's MaskedConvSequential.
Conformer depthwise conv: ggml's 1D im2col requires ne[3]==1, so it aborted for B>1. Fixed by running the depthwise per item and concatenating; B=1 is byte-identical.
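The padding-leakage fix can be illustrated with a toy model. The sketch below is not the PR's ggml code: it uses a single channel, a stride-1 kernel-3 convolution with bias, and hypothetical helper names. It only shows why the pad region must be zeroed before every stage: the bias turns zero padding into non-zero values that the next stage would otherwise read at the last valid frame.

```cpp
#include <cstddef>
#include <vector>

// Zero every frame at or beyond valid_len[b] in a [t_pad, B] buffer (time contiguous).
void mask_trailing_pad(std::vector<float>& x, int t_pad, const std::vector<int>& valid_len) {
    for (std::size_t b = 0; b < valid_len.size(); ++b)
        for (int t = valid_len[b]; t < t_pad; ++t)
            x[b * t_pad + t] = 0.0f;
}

// Toy stride-1, kernel-3 "same" convolution with bias, applied per item.
// Reads outside [0, t_pad) are treated as zero (the conv's own zero edge).
std::vector<float> conv3_bias(const std::vector<float>& x, int t_pad, int batch,
                              const float w[3], float bias) {
    std::vector<float> y(x.size());
    for (int b = 0; b < batch; ++b) {
        for (int t = 0; t < t_pad; ++t) {
            float acc = bias;
            for (int k = -1; k <= 1; ++k) {
                const int s = t + k;
                if (s >= 0 && s < t_pad) acc += w[k + 1] * x[static_cast<std::size_t>(b) * t_pad + s];
            }
            y[static_cast<std::size_t>(b) * t_pad + t] = acc;
        }
    }
    return y;
}

// Two conv stages with the pad region re-zeroed before each one.
// Simplification: stride 1, so valid_len is the same at both stages;
// a real subsampling stack would shrink it per stage.
std::vector<float> two_stage_masked(std::vector<float> x, int t_pad,
                                    const std::vector<int>& valid_len,
                                    const float w[3], float bias) {
    const int batch = static_cast<int>(valid_len.size());
    mask_trailing_pad(x, t_pad, valid_len);
    std::vector<float> y = conv3_bias(x, t_pad, batch, w, bias);
    mask_trailing_pad(y, t_pad, valid_len);  // without this, bias leaks into stage 2
    return conv3_bias(y, t_pad, batch, w, bias);
}
```

With the second mask in place, the valid frames of a padded item match what the same clip produces when run alone; dropping it changes the last valid frame, which is the kind of discrepancy the equivalence tests are meant to catch.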

Performance note (important)

On a CPU backend, batching does not improve throughput and slightly reduces it (B=1 is fastest): the GEMMs already saturate the thread pool at B=1, and padding every clip to the batch's longest clip just adds wasted work. The throughput win is a GPU-backend effect. Documented in BENCHMARK.md and surfaced by the new bench-batch subcommand. On CPU the value of this PR is the correct, fused batched encoder graph; the latency/throughput payoff lands on GPU.
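The cost of padding every clip to the batch's longest clip can be estimated with simple arithmetic. The helper below is an illustrative sketch (pad_waste_fraction is a hypothetical name, not part of the PR): it returns the share of padded frames in a batch, assuming encoder work scales roughly linearly with frame count.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Fraction of frames in a padded batch that are padding:
//   1 - sum(len) / (B * max(len))
// Returns 0 for an empty batch or when every clip has zero frames.
double pad_waste_fraction(const std::vector<int>& frame_counts) {
    if (frame_counts.empty()) return 0.0;
    long long total = 0;
    int longest = 0;
    for (int n : frame_counts) {
        total += n;
        longest = std::max(longest, n);
    }
    if (longest == 0) return 0.0;
    const double padded = static_cast<double>(longest) * static_cast<double>(frame_counts.size());
    return 1.0 - static_cast<double>(total) / padded;
}
```

For example, clips of 10, 5, and 5 frames are padded to 30 frames in total, of which 10 are padding, so a third of the batch is wasted work. Attention cost grows faster than linearly in sequence length, so this is a rough lower bound rather than an exact figure.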

Scope / limitations

Encoder only; decode is per-item (batched decode is future work).
Causal/streaming subsampling at B>1 is hard-guarded (GGML_ASSERT); only the offline non-causal path is batched.
No length bucketing (mixed-duration batches waste pad compute, same as NeMo's own transcribe).

Test plan

ctest -L model green: 100%, 0 failures (38 tests, with a converted 110m f32 model + NeMo baseline).
All 5 batch tests pass; B=1 parity unchanged vs baseline.

mudler and others added 30 commits

May 31, 2026 08:13


          feat(encoder): add MelBatch + forward_batch (B=1 per-item loop)

9b89a3f

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          feat(model): transcribe_pcm_batch + extract decode_enc_out helper

ab1b458

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          fix(model): guard invalid sample_rate in transcribe_pcm_batch (parity with single-clip)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          feat(capi): parakeet_capi_transcribe_pcm_batch

e786266

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          fix(capi): null out[] on entry so error paths leave a clean, uniform contract

bc582d4

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          test: batch API B=1 equivalence smoke

4786ca7

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          feat(subsampling): batched build_graph_batched; build_graph adapts B=1

8c76c1a

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          fix(subsampling): release-surviving guard for batched-causal; dedupe valid_out_len

fc391a3

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          test(subsampling): batched-vs-standalone per-item equivalence

eab74e6

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          fix(subsampling): per-stage per-item trailing-pad input masking for batched conv (matches NeMo MaskedConvSequential)

d355250

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          feat(conformer): batch axis through build_conv_module (B=1 unchanged)

b9f1cb1

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          feat(attention): batched build_graph_batched with 4D rel-shift

4efe312

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          test(attention): batched rel-shift equivalence + padding invariance

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          feat(conformer): build_graph_batched ([D,T,B]); build_graph adapts B=1

7a8a8a3

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          test(conformer): batched-vs-standalone per-item equivalence + padding invariance

b5bede7

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          fix(conformer): per-item depthwise conv for B>1 (ggml 1D im2col requires ne[3]==1); B=1 byte-identical

3a89768

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          feat(encoder): fused single-graph batched forward_batch

2257e67

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          test(encoder): fused batched equivalence + padding invariance

0b1f409

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          bench: add bench-batch CLI subcommand for batched-encoder throughput

50d2cfc

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          docs(encoder): correct forward_batch comment (fused graph, not internal loop)

3982fa7

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          feat(model): transcribe_pcm_batch_with_timestamps + extract decode_enc_out_with_timestamps

7cc5942

Adds batched timestamped transcription (N 16 kHz clips -> N Transcriptions)
built on the fused batched encoder, plus the public
transcribe_pcm_batch_with_timestamps entry point. The per-item decode tail of
transcribe_16k_with_timestamps is extracted verbatim into a file-scope
decode_enc_out_with_timestamps helper (behavior-preserving) and reused by both
the single-clip and batched paths.

Also fixes a pre-existing batched-encoder bug surfaced by the new equivalence
test: forward_batch emitted each enc_outs[b] at the padded width Tp, but the
decoders index enc_out[c*Tout + t] with Tout = valid_Tout[b]. For a padded
(shorter) item that stride mismatch misaligned every row after the first,
corrupting the decode (e.g. an 18-word clip collapsed to 3 garbage words). This
affected the text-only transcribe_pcm_batch path too. forward_batch now
compacts each enc_outs[b] to its own valid_Tout[b] columns so the row stride
matches and no pad-derived frames reach the decoder; test_encoder_batch's
slice is updated to the compacted per-item width.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
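The stride mismatch described in the commit above is easy to reproduce in isolation. The sketch below is illustrative only (compact_rows is a hypothetical helper, not the PR's code): it assumes a channel-major buffer where element (c, t) lives at c*width + t, and copies the first t_valid columns of each row so the row stride equals the width the decoder indexes with.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copy the first t_valid columns of every row from a buffer stored at padded
// width t_pad into a tight buffer of width t_valid.
// Layout assumption: element (c, t) is at index c * width + t.
std::vector<float> compact_rows(const std::vector<float>& padded,
                                int channels, int t_pad, int t_valid) {
    assert(t_valid <= t_pad);
    assert(padded.size() == static_cast<std::size_t>(channels) * t_pad);
    std::vector<float> out(static_cast<std::size_t>(channels) * t_valid);
    for (int c = 0; c < channels; ++c)
        for (int t = 0; t < t_valid; ++t)
            out[static_cast<std::size_t>(c) * t_valid + t] =
                padded[static_cast<std::size_t>(c) * t_pad + t];
    return out;
}
```

Indexing the uncompacted buffer with c*t_valid + t lands on the right element only for row 0; every later row is shifted by c*(t_pad - t_valid) elements, which matches the "misaligned every row after the first" symptom.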


          refactor(model): share build_mel_batch across batch paths; restore CTC span rationale

d2b76a5

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          feat(capi): parakeet_capi_transcribe_pcm_batch_json (batched timestamps, ABI bump)

9bd7407

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          docs(capi): state sum(n_samples) precondition for batch_json as caller-must-uphold

78f84f5

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          refactor(decode): share argmax/max_prob_conf in decode_common.hpp

b9ca3f6

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          feat(prediction): step_batch (batched LSTM, [H,N])

fc19481

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          docs(prediction): explain the 4H gate-slice stride in step_batch

e5846ef

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          feat(joint): step_logits_batch (batched joint, [V_plus,N])

cc9834d

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          feat(decode): transducer_greedy_batch (batched RNNT+TDT greedy, bit-exact parity)

093aacc

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          refactor(decode): extract commit_state lambda in transducer_greedy_batch; document max_symbols assumption

782a801

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mudler and others added 11 commits

May 31, 2026 22:49


          feat(model): batched transducer decode in transcribe_pcm_batch[_with_timestamps]

df4c857

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          bench: add bench-decode (batched vs serial transducer decode timing)

52cffea

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          perf(decode): cache prediction-net g across rounds in transducer_greedy_batch (bit-exact; recovers B=1, fewer LSTM calls)

85feda9

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          refactor(model): extract batch_enc_to_row_major helper (dedup batched-decode transpose)

c3161d2

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          docs(decode): correct header comment to describe the g_valid cache

da38ea1

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          Merge pull request #3 from mudler/batched-decode

543af2c

Batched decode


          refactor(subsampling): restore v1 2-D scalar build_graph for B=1

98bd731

Replace the delegating Subsampling::build_graph (which forwarded to
build_graph_batched at B=1) with the verbatim v1 2-D body from 60bd1bb.
The single-clip/forward path now runs the lean 2-D graph again: faster and
bit-exact with v1 (the batch test's item0, the full clip, is max|d|=0).

build_graph_batched is unchanged (forward_batch / B>1 still use it).

test_subsampling_batch: the short, zero-padded item1 now compares the
batched builder against the 2-D scalar path. They match to ~1e-3 on interior
frames; only the single trailing valid frame diverges (its downsampled
receptive field straddles the clip boundary, where per-stage masking and the
conv zero-edge round differently by design). Compare interior frames at a
modest 5e-3 and skip that last boundary frame. item0 stays exact at 1e-3.
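The interior-frame comparison described in this commit can be sketched as follows. This is a hypothetical helper rather than the actual test code: it assumes both outputs are frame-major with dim values per frame, and it takes the number of trailing boundary frames to skip as a parameter.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest absolute difference between two frame-major buffers over the first
// (n_frames - skip_tail) frames. Frames in the skipped tail are ignored.
float max_abs_diff_interior(const std::vector<float>& a, const std::vector<float>& b,
                            int n_frames, int dim, int skip_tail) {
    const int last = std::max(0, n_frames - skip_tail);
    float worst = 0.0f;
    for (int t = 0; t < last; ++t) {
        for (int d = 0; d < dim; ++d) {
            const std::size_t i = static_cast<std::size_t>(t) * dim + d;
            worst = std::max(worst, std::fabs(a[i] - b[i]));
        }
    }
    return worst;
}
```

A test in this style would assert max_abs_diff_interior(batched, scalar, valid_len, dim, 1) < 5e-3f for the short item and use skip_tail = 0 with the tighter tolerance for the full-length item.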


          refactor(conformer,attention): restore v1 2-D scalar builders for B=1

0c42127

Replace the delegating scalar builders (which forwarded to the batched
builders at B=1) with the verbatim v1 2-D bodies from 60bd1bb:
  - ConformerLayer::build_graph (full 2-D conformer layer)
  - RelPosAttention::build_graph (2-D/3D rel-pos attention)
  - build_conv_module gains a scalar (int valid_len) overload alongside the
    batched (int B, const std::vector<int>&) one; distinguished by signature.

The scalar callers (build_graph, forward_with_conv localization,
conv_module_forward) now route to the 2-D conv module. The batched builders
(build_graph_batched, the batched build_conv_module) are untouched; forward_batch
/ B>1 still use them.

Net effect: the single-clip / forward path runs the lean 2-D graph again
(faster) and is bit-exact with v1. test_conformer, test_conformer_batch,
test_relpos_attention_batch, test_encoder, test_encoder_batch all pass; the
24-layer 2-D-vs-batched accumulation stays within test_encoder_batch's existing
5e-2 tolerance, so no tolerance change was needed there.


          feat(bench): bench-decode --json + BENCHMARK.md batched-decode section

0b47f3a

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          docs(bench): add batched-decode throughput tables (CPU + GPU)

efc061c

Record serial-vs-batched decode speedups for the transducer models at
B=1,4,8,16, captured via bench-decode --json. CPU (this 20-core host,
q5_k) reaches ~3-5x at B=16; GPU (dgx GB10, f16) reaches ~10-12x. CTC
models are excluded (no autoregressive decode). gen_benchmark_md.py
renders the new 'Batched decode throughput' section from
benchmarks/results/decode_batch/{cpu,gpu}/.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


          docs(readme): add Batching section (CLI, C-API, when to use)

5e9bb6f

Document the opt-in batched-decode path: the bench-decode/bench-batch
CLI commands, the C++ transcribe_16k_batch and C-API
transcribe_pcm_batch[_json] entry points, the decode-batching win (GPU
~10-12x at B=16, CPU ~3-5x; encoder and CTC excluded), bit-exactness vs
single-clip, and a pointer to the BENCHMARK.md tables. Notes that
LocalAI exposes it via batch_max_size (off by default).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mudler merged commit 8a7c482 into master

4 checks passed
