
Batched encoder: run N clips through one fused ggml graph #1

Merged
mudler merged 41 commits into master from worktree-batched-encoder
Jun 1, 2026

@mudler mudler commented May 31, 2026

Adds batched-encoder support: N audio clips run through ONE fused ggml encoder graph (subsampling, conformer stack, output) at batch size B, instead of one graph per clip. The single-clip (B=1) path is kept numerically faithful to before. Decode stays per-item over the batched encoder output (batched decode is intentionally a separate future milestone).

What landed:

  • MelBatch + Encoder::forward_batch (one fused batched graph).
  • Batched builders, each with B on the trailing axis and a per-item valid_len vector driving the masks; every scalar build_graph is now a thin B=1 adapter (DRY, no duplicated bodies): Subsampling::build_graph_batched, RelPosAttention::build_graph_batched (incl. the 4D rel-shift), ConformerLayer::build_graph_batched, and a B-aware build_conv_module.
  • Model::transcribe_pcm_batch (+ extracted decode_enc_out) and C-API parakeet_capi_transcribe_pcm_batch.
  • bench-batch CLI subcommand + BENCHMARK.md notes.
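
To make the "B on the trailing axis plus a per-item valid_len vector" contract concrete, here is a minimal sketch of padding ragged clips into one buffer. `PaddedBatch` and `pad_clips` are hypothetical names for illustration; the real MelBatch fields and layout are not shown in this description, so treat the memory layout below as an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a batched feature container (not the PR's MelBatch).
struct PaddedBatch {
    int n_feat = 0;              // features per frame
    int t_max = 0;               // frame count of the longest clip
    std::vector<int> valid_len;  // valid (unpadded) frames per item
    std::vector<float> data;     // [B][t_max][n_feat]; batch is the slowest axis
};

// Each input clip is a flat [T_b][n_feat] array. Shorter clips are zero-padded
// up to t_max, and their true lengths are recorded in valid_len.
inline PaddedBatch pad_clips(const std::vector<std::vector<float>>& clips, int n_feat) {
    assert(n_feat > 0);
    PaddedBatch out;
    out.n_feat = n_feat;
    for (const auto& c : clips) {
        assert(c.size() % static_cast<std::size_t>(n_feat) == 0);
        const int t = static_cast<int>(c.size()) / n_feat;
        out.valid_len.push_back(t);
        out.t_max = std::max(out.t_max, t);
    }
    const std::size_t stride = static_cast<std::size_t>(out.t_max) * n_feat;
    out.data.assign(stride * clips.size(), 0.0f);
    for (std::size_t b = 0; b < clips.size(); ++b) {
        std::copy(clips[b].begin(), clips[b].end(),
                  out.data.begin() + static_cast<std::ptrdiff_t>(b * stride));
    }
    return out;
}
```

A batched builder would then consume `data` once and use `valid_len[b]` to build each item's masks, rather than building one graph per clip.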

Correctness

Built test-first against the smallest validated model (parakeet-tdt_ctc-110m, f32 so the tight 1e-2 parity tolerances hold). Per-item batched-vs-standalone equivalence + padding-invariance tests at every level: test_subsampling_batch, test_relpos_attention_batch, test_conformer_batch, test_encoder_batch, test_capi_batch.

  • Attention and conformer-layer equivalence is bit-exact (max|d| = 0) per item, including ragged batches; the full encoder is within ~2e-6 over 17 layers. Padding invariance is explicitly asserted (a shorter clip does not perturb a longer neighbor).
  • B=1 parity vs the NeMo baseline is unchanged across subsampling/relpos/conformer/encoder.
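
The per-item equivalence check above boils down to slicing one item's valid frames out of the batched output and taking max|d| against the standalone result. A rough sketch follows; `slice_item` and `max_abs_diff` are hypothetical helpers, and the `[B][t_max][n_feat]` layout is an assumption, not the test suite's actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// max|d| over two equally sized buffers; 0 means every element compares equal.
inline float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float m = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        m = std::max(m, std::fabs(a[i] - b[i]));
    }
    return m;
}

// Extract item b's first valid_len frames from a batched [B][t_max][n_feat]
// buffer, dropping the padded tail so it can be compared with a standalone run.
inline std::vector<float> slice_item(const std::vector<float>& batched,
                                     int t_max, int n_feat, int b, int valid_len) {
    assert(valid_len <= t_max);
    const std::size_t begin = static_cast<std::size_t>(b) * t_max * n_feat;
    const std::size_t count = static_cast<std::size_t>(valid_len) * n_feat;
    assert(begin + count <= batched.size());
    const auto first = batched.begin() + static_cast<std::ptrdiff_t>(begin);
    return std::vector<float>(first, first + static_cast<std::ptrdiff_t>(count));
}
```

Padding invariance can be checked the same way: run the long clip alone and next to a shorter neighbor, slice it out of both batches, and assert that `max_abs_diff` stays at or under the chosen tolerance.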

Two real bugs were caught by these gates and fixed:

  1. Subsampling padding leakage: the non-causal path only masked the flattened output; in a batch the last valid frame read the bias-contaminated zero-pad region. Fixed with per-stage per-item trailing-pad input masking, matching NeMo's MaskedConvSequential.
  2. Conformer depthwise conv: ggml's 1D im2col requires ne[3]==1, so it aborted for B>1. Fixed by running the depthwise per item and concatenating; B=1 is byte-identical.
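
The padding leakage in bug 1 can be reproduced with a toy model. The sketch below is an illustration under simplifying assumptions (a 1-D, stride-1, kernel-3 convolution with a bias, applied twice), not the PR's subsampling code: after the first stage the bias makes the padded tail non-zero, so the second stage's last valid frame reads contaminated values unless the tail is re-zeroed before each stage.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// "Same" 1-D convolution, kernel 3, stride 1, zero edges, plus a bias.
inline std::vector<float> conv1d_same(const std::vector<float>& x,
                                      const std::array<float, 3>& w, float bias) {
    const std::size_t n = x.size();
    std::vector<float> y(n);
    for (std::size_t t = 0; t < n; ++t) {
        const float left = (t > 0) ? x[t - 1] : 0.0f;
        const float right = (t + 1 < n) ? x[t + 1] : 0.0f;
        y[t] = w[0] * left + w[1] * x[t] + w[2] * right + bias;
    }
    return y;
}

// Zero every frame at or past valid_len (the trailing pad region).
inline void mask_tail(std::vector<float>& x, int valid_len) {
    for (std::size_t t = static_cast<std::size_t>(valid_len); t < x.size(); ++t) {
        x[t] = 0.0f;
    }
}

// Two conv stages; with mask_each_stage the pad is re-zeroed on every stage
// input, so a padded item sees the same zero edge as a standalone clip.
inline std::vector<float> two_stage(std::vector<float> x, int valid_len, bool mask_each_stage) {
    const std::array<float, 3> w = {1.0f, 1.0f, 1.0f};
    const float bias = 1.0f;
    for (int stage = 0; stage < 2; ++stage) {
        if (mask_each_stage) {
            mask_tail(x, valid_len);
        }
        x = conv1d_same(x, w, bias);
    }
    return x;
}
```

With a 2-frame clip padded to 4 frames, the unmasked run gives a different value at the last valid frame than the standalone run, while the per-stage masked run matches it exactly. The real subsampling is strided, so valid_len shrinks per stage there; this toy keeps it constant.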

Performance note (important)

On a CPU backend, batching does not improve throughput and slightly reduces it (B=1 is fastest): the GEMMs already saturate the thread pool at B=1, and padding every clip to the batch's longest clip just adds wasted work. The throughput win is a GPU-backend effect. Documented in BENCHMARK.md and surfaced by the new bench-batch subcommand. On CPU the value of this PR is the correct, fused batched encoder graph; the latency/throughput payoff lands on GPU.
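
As a rough way to reason about the "wasted work" from padding, one can estimate the fraction of frames in a batch that are pad. This is a hypothetical helper, not part of bench-batch, and frame count is only a coarse proxy for compute.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Fraction of frames in a padded batch that are padding:
// 1 - sum(valid_len) / (B * max(valid_len)). Returns 0 for an empty batch.
inline double pad_waste_fraction(const std::vector<int>& valid_len) {
    if (valid_len.empty()) {
        return 0.0;
    }
    long long total = 0;
    int t_max = 0;
    for (int t : valid_len) {
        total += t;
        t_max = std::max(t_max, t);
    }
    if (t_max == 0) {
        return 0.0;
    }
    const double padded = static_cast<double>(t_max) * static_cast<double>(valid_len.size());
    return 1.0 - static_cast<double>(total) / padded;
}
```

For example, clips of 100, 50 and 50 frames padded to 100 each spend about a third of their frames on padding, whereas a batch of equal-length clips spends none.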

Scope / limitations

  • Encoder only; decode is per-item (batched decode is future work).
  • Causal/streaming subsampling at B>1 is hard-guarded (GGML_ASSERT); only the offline non-causal path is batched.
  • No length bucketing (mixed-duration batches waste pad compute, same as NeMo's own transcribe).

Test plan

  • ctest -L model green: 100%, 0 failures (38 tests, with a converted 110m f32 model + NeMo baseline).
  • All 5 batch tests pass; B=1 parity unchanged vs baseline.

mudler and others added 30 commits May 31, 2026 08:13
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… with single-clip)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…contract

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…valid_out_len

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…atched conv (matches NeMo MaskedConvSequential)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… invariance

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…res ne[3]==1); B=1 byte-identical

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…al loop)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…c_out_with_timestamps

Adds batched timestamped transcription (N 16 kHz clips -> N Transcriptions)
built on the fused batched encoder, plus the public
transcribe_pcm_batch_with_timestamps entry point. The per-item decode tail of
transcribe_16k_with_timestamps is extracted verbatim into a file-scope
decode_enc_out_with_timestamps helper (behavior-preserving) and reused by both
the single-clip and batched paths.

Also fixes a pre-existing batched-encoder bug surfaced by the new equivalence
test: forward_batch emitted each enc_outs[b] at the padded width Tp, but the
decoders index enc_out[c*Tout + t] with Tout = valid_Tout[b]. For a padded
(shorter) item that stride mismatch misaligned every row after the first,
corrupting the decode (e.g. an 18-word clip collapsed to 3 garbage words). This
affected the text-only transcribe_pcm_batch path too. forward_batch now
compacts each enc_outs[b] to its own valid_Tout[b] columns so the row stride
matches and no pad-derived frames reach the decoder; test_encoder_batch's
slice is updated to the compacted per-item width.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
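
The stride mismatch described in the commit message above can be sketched in isolation. Assuming a row-major `[C][Tp]` buffer that the decoder later indexes as `enc_out[c*Tout + t]`, compaction copies the first Tout columns of every row into a tight `[C][Tout]` buffer. `compact_rows` is a hypothetical helper, not the code in forward_batch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Copy the first t_valid columns of each row from a row-major [channels][t_pad]
// buffer into a tight [channels][t_valid] buffer, so that index c*t_valid + t
// addresses row c, column t with no pad frames in between.
inline std::vector<float> compact_rows(const std::vector<float>& padded,
                                       int channels, int t_pad, int t_valid) {
    assert(t_valid <= t_pad);
    assert(padded.size() == static_cast<std::size_t>(channels) * t_pad);
    std::vector<float> out(static_cast<std::size_t>(channels) * t_valid);
    for (int c = 0; c < channels; ++c) {
        const auto src = padded.begin() + static_cast<std::ptrdiff_t>(c) * t_pad;
        std::copy(src, src + t_valid,
                  out.begin() + static_cast<std::ptrdiff_t>(c) * t_valid);
    }
    return out;
}
```

Reading the uncompacted buffer with the compact stride lands on a pad value for every row after the first, which is the misalignment the fix removes.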
…C span rationale

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ps, ABI bump)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…r-must-uphold

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…xact parity)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tch; document max_symbols assumption

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mudler and others added 11 commits May 31, 2026 22:49
…timestamps]

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…dy_batch (bit-exact; recovers B=1, fewer LSTM calls)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-decode transpose)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the delegating Subsampling::build_graph (which forwarded to
build_graph_batched at B=1) with the verbatim v1 2-D body from 60bd1bb.
The single-clip/forward path now runs the lean 2-D graph again: faster and
bit-exact with v1 (the batch test's item0, the full clip, is max|d|=0).

build_graph_batched is unchanged (forward_batch / B>1 still use it).

test_subsampling_batch: the short, zero-padded item1 now compares the
batched builder against the 2-D scalar path. They match to ~1e-3 on interior
frames; only the single trailing valid frame diverges (its downsampled
receptive field straddles the clip boundary, where per-stage masking and the
conv zero-edge round differently by design). Compare interior frames at a
modest 5e-3 and skip that last boundary frame. item0 stays exact at 1e-3.

Replace the delegating scalar builders (which forwarded to the batched
builders at B=1) with the verbatim v1 2-D bodies from 60bd1bb:
  - ConformerLayer::build_graph (full 2-D conformer layer)
  - RelPosAttention::build_graph (2-D/3-D rel-pos attention)
  - build_conv_module gains a scalar (int valid_len) overload alongside the
    batched (int B, const std::vector<int>&) one; distinguished by signature.

The scalar callers (build_graph, forward_with_conv localization,
conv_module_forward) now route to the 2-D conv module. The batched builders
(build_graph_batched, the batched build_conv_module) are untouched; forward_batch
/ B>1 still use them.

Net effect: the single-clip / forward path runs the lean 2-D graph again
(faster) and is bit-exact with v1. test_conformer, test_conformer_batch,
test_relpos_attention_batch, test_encoder, test_encoder_batch all pass; the
24-layer 2-D-vs-batched accumulation stays within test_encoder_batch's existing
5e-2 tolerance, so no tolerance change was needed there.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Record serial-vs-batched decode speedups for the transducer models at
B=1,4,8,16, captured via bench-decode --json. CPU (this 20-core host,
q5_k) reaches ~3-5x at B=16; GPU (dgx GB10, f16) reaches ~10-12x. CTC
models are excluded (no autoregressive decode). gen_benchmark_md.py
renders the new 'Batched decode throughput' section from
benchmarks/results/decode_batch/{cpu,gpu}/.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Document the opt-in batched-decode path: the bench-decode/bench-batch
CLI commands, the C++ transcribe_16k_batch and C-API
transcribe_pcm_batch[_json] entry points, the decode-batching win (GPU
~10-12x at B=16, CPU ~3-5x; encoder and CTC excluded), bit-exactness vs
single-clip, and a pointer to the BENCHMARK.md tables. Notes that
LocalAI exposes it via batch_max_size (off by default).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mudler mudler merged commit 8a7c482 into master Jun 1, 2026
4 checks passed