Batched encoder: run N clips through one fused ggml graph #1
Merged
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… with single-clip)
…contract
…valid_out_len
…atched conv (matches NeMo MaskedConvSequential)
… invariance
…res ne[3]==1); B=1 byte-identical
…al loop)
…c_out_with_timestamps
Adds batched timestamped transcription (N 16 kHz clips -> N Transcriptions) built on the fused batched encoder, plus the public transcribe_pcm_batch_with_timestamps entry point. The per-item decode tail of transcribe_16k_with_timestamps is extracted verbatim into a file-scope decode_enc_out_with_timestamps helper (behavior-preserving) and reused by both the single-clip and batched paths.
Also fixes a pre-existing batched-encoder bug surfaced by the new equivalence test: forward_batch emitted each enc_outs[b] at the padded width Tp, but the decoders index enc_out[c*Tout + t] with Tout = valid_Tout[b]. For a padded (shorter) item, that stride mismatch misaligned every row after the first and corrupted the decode (e.g. an 18-word clip collapsed to 3 garbage words). The text-only transcribe_pcm_batch path was affected too.
forward_batch now compacts each enc_outs[b] to its own valid_Tout[b] columns, so the row stride matches and no pad-derived frames reach the decoder. test_encoder_batch's slice is updated to the compacted per-item width.
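The stride mismatch and its fix can be illustrated with a minimal sketch. It assumes the row-major layout implied by the enc_out[c*Tout + t] indexing above; compact_rows and its parameter names are hypothetical and are not taken from the PR's code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the PR's implementation): copy the first
// `t_valid` columns of every row out of a row-major
// [channels x t_padded] buffer into a dense [channels x t_valid] buffer,
// so that a consumer indexing out[c * t_valid + t] reads the right element.
std::vector<float> compact_rows(const std::vector<float>& padded,
                                int channels, int t_padded, int t_valid) {
    assert(channels >= 0 && t_valid >= 0 && t_valid <= t_padded);
    assert(padded.size() == static_cast<std::size_t>(channels) * t_padded);

    std::vector<float> out(static_cast<std::size_t>(channels) * t_valid);
    for (int c = 0; c < channels; ++c) {
        for (int t = 0; t < t_valid; ++t) {
            out[static_cast<std::size_t>(c) * t_valid + t] =
                padded[static_cast<std::size_t>(c) * t_padded + t];
        }
    }
    return out;
}
```

With 2 rows padded from 3 to 4 columns, reading the uncompacted buffer at index 1*3 + 0 lands on the pad value at the end of row 0 rather than on the first element of row 1, which is the misalignment the commit describes. After compaction the same index reads row 1 correctly.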
…C span rationale
…ps, ABI bump)
…r-must-uphold
…xact parity)
…tch; document max_symbols assumption
…timestamps]
…dy_batch (bit-exact; recovers B=1, fewer LSTM calls)
…-decode transpose)
Batched decode
Replace the delegating Subsampling::build_graph (which forwarded to build_graph_batched at B=1) with the verbatim v1 2-D body from 60bd1bb. The single-clip/forward path now runs the lean 2-D graph again: faster and bit-exact with v1 (the batch test's item0, the full clip, is max|d|=0). build_graph_batched is unchanged (forward_batch / B>1 still use it).
test_subsampling_batch: the short, zero-padded item1 now compares the batched builder against the 2-D scalar path. They match to ~1e-3 on interior frames; only the single trailing valid frame diverges (its downsampled receptive field straddles the clip boundary, where per-stage masking and the conv zero-edge round differently by design). Compare interior frames at a modest 5e-3 and skip that last boundary frame. item0 stays exact at 1e-3.
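The "compare interior frames, skip the last boundary frame" rule can be sketched as follows. This is an illustrative helper only: it assumes a frame-major [frames x dim] layout, and the names max_abs_diff and interior_matches are hypothetical, not the test's actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest |a - b| over the first `n_frames` frames of two frame-major
// [frames x dim] buffers. The layout is an assumption of this sketch.
float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b,
                   int n_frames, int dim) {
    assert(n_frames >= 0 && dim > 0);
    const std::size_t n = static_cast<std::size_t>(n_frames) * dim;
    assert(a.size() >= n && b.size() >= n);

    float worst = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    }
    return worst;
}

// Compare every valid frame except the last one, which is the frame
// whose receptive field straddles the clip boundary.
bool interior_matches(const std::vector<float>& batched,
                      const std::vector<float>& scalar,
                      int valid_frames, int dim, float tol) {
    const int interior = std::max(valid_frames - 1, 0);
    return max_abs_diff(batched, scalar, interior, dim) <= tol;
}
```

A call such as interior_matches(batched_item1, scalar_item1, valid_frames, dim, 5e-3f) would then pass even when the trailing valid frame differs, while a full-clip item can still be checked over all frames with max_abs_diff at the tighter tolerance.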
Replace the delegating scalar builders (which forwarded to the batched builders at B=1) with the verbatim v1 2-D bodies from 60bd1bb:
- ConformerLayer::build_graph (full 2-D conformer layer)
- RelPosAttention::build_graph (2-D/3-D rel-pos attention)
- build_conv_module gains a scalar (int valid_len) overload alongside the batched (int B, const std::vector<int>&) one; distinguished by signature. The scalar callers (build_graph, forward_with_conv localization, conv_module_forward) now route to the 2-D conv module.
The batched builders (build_graph_batched, the batched build_conv_module) are untouched; forward_batch / B>1 still use them.
Net effect: the single-clip / forward path runs the lean 2-D graph again (faster) and is bit-exact with v1.
test_conformer, test_conformer_batch, test_relpos_attention_batch, test_encoder, test_encoder_batch all pass; the 24-layer 2-D-vs-batched accumulation stays within test_encoder_batch's existing 5e-2 tolerance, so no tolerance change was needed there.
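The scalar/batched overload pair can be illustrated with a toy example. The sketch below is not build_conv_module: it uses a hypothetical make_valid_mask function with the same two parameter shapes, (int valid_len) versus (int B, const std::vector<int>&), to show how the call site alone selects the path and how a B=1 batched result can be checked against the scalar one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Mask = std::vector<float>;

// Scalar overload (hypothetical): one clip, one mask of length t_max
// with 1.0 on valid frames and 0.0 on padding.
Mask make_valid_mask(int t_max, int valid_len) {
    assert(valid_len >= 0 && valid_len <= t_max);
    Mask m(static_cast<std::size_t>(t_max), 0.0f);
    std::fill(m.begin(), m.begin() + valid_len, 1.0f);
    return m;
}

// Batched overload (hypothetical): B clips, one mask per clip. It has
// its own body instead of delegating, mirroring the split in the commit.
std::vector<Mask> make_valid_mask(int t_max, int B,
                                  const std::vector<int>& valid_len) {
    assert(B >= 0 && valid_len.size() == static_cast<std::size_t>(B));
    std::vector<Mask> out(static_cast<std::size_t>(B),
                          Mask(static_cast<std::size_t>(t_max), 0.0f));
    for (int b = 0; b < B; ++b) {
        assert(valid_len[b] >= 0 && valid_len[b] <= t_max);
        for (int t = 0; t < valid_len[b]; ++t) {
            out[b][t] = 1.0f;
        }
    }
    return out;
}
```

Because the two overloads differ in arity and in the type of the last parameter, make_valid_mask(5, 3) and make_valid_mask(5, 1, lens) cannot be confused by the compiler, which is the property the commit relies on.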
Record serial-vs-batched decode speedups for the transducer models at
B=1,4,8,16, captured via bench-decode --json. CPU (this 20-core host,
q5_k) reaches ~3-5x at B=16; GPU (dgx GB10, f16) reaches ~10-12x. CTC
models are excluded (no autoregressive decode). gen_benchmark_md.py
renders the new 'Batched decode throughput' section from
benchmarks/results/decode_batch/{cpu,gpu}/.
Document the opt-in batched-decode path: the bench-decode/bench-batch CLI commands, the C++ transcribe_16k_batch and C-API transcribe_pcm_batch[_json] entry points, the decode-batching win (GPU ~10-12x at B=16, CPU ~3-5x; encoder and CTC excluded), bit-exactness vs single-clip, and a pointer to the BENCHMARK.md tables. Notes that LocalAI exposes it via batch_max_size (off by default).
Adds batched-encoder support: N audio clips run through ONE fused ggml encoder graph (subsampling, conformer stack, output) at batch size B, instead of one graph per clip. The single-clip (B=1) path is kept numerically faithful to its previous behavior. Decode stays per-item over the batched encoder output (batched decode is intentionally a separate future milestone).
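Running clips of different lengths through one graph means padding them to a shared length and recording each item's valid length, which is the role of the valid_len vector listed under "What landed". The sketch below shows that packing step on plain 1-D sequences. PaddedBatch, pack_clips, and pad_fraction are hypothetical names for illustration; the PR's MelBatch may be laid out differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container: B items padded with zeros to a common length.
struct PaddedBatch {
    int B = 0;
    int t_max = 0;
    std::vector<float> data;     // [B x t_max], row-major, zero-padded
    std::vector<int> valid_len;  // number of real (unpadded) entries per item
};

// Pad every clip to the longest clip in the batch and record its length.
PaddedBatch pack_clips(const std::vector<std::vector<float>>& clips) {
    PaddedBatch pb;
    pb.B = static_cast<int>(clips.size());
    for (const auto& c : clips) {
        pb.t_max = std::max(pb.t_max, static_cast<int>(c.size()));
    }
    pb.data.assign(static_cast<std::size_t>(pb.B) * pb.t_max, 0.0f);
    pb.valid_len.reserve(clips.size());
    for (std::size_t b = 0; b < clips.size(); ++b) {
        const auto offset = static_cast<std::ptrdiff_t>(b) * pb.t_max;
        std::copy(clips[b].begin(), clips[b].end(), pb.data.begin() + offset);
        pb.valid_len.push_back(static_cast<int>(clips[b].size()));
    }
    return pb;
}

// Fraction of the padded tensor that is padding rather than real input.
double pad_fraction(const PaddedBatch& pb) {
    assert(pb.valid_len.size() == static_cast<std::size_t>(pb.B));
    const std::size_t total = pb.data.size();
    if (total == 0) return 0.0;
    std::size_t valid = 0;
    for (int n : pb.valid_len) valid += static_cast<std::size_t>(n);
    return static_cast<double>(total - valid) / static_cast<double>(total);
}
```

pad_fraction gives a rough measure of the extra work that padding to the longest clip introduces: a batch of one 4-sample clip and one 2-sample clip is 25% padding.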
What landed:
- MelBatch + Encoder::forward_batch (one fused batched graph).
- valid_len vector driving the masks; every scalar build_graph is now a thin B=1 adapter (DRY, no duplicated bodies): Subsampling::build_graph_batched, RelPosAttention::build_graph_batched (incl. the 4D rel-shift), ConformerLayer::build_graph_batched, and a B-aware build_conv_module.
- Model::transcribe_pcm_batch (+ extracted decode_enc_out) and C-API parakeet_capi_transcribe_pcm_batch.
- bench-batch CLI subcommand + BENCHMARK.md notes.
Correctness
Built test-first against the smallest validated model (parakeet-tdt_ctc-110m, f32 so the tight 1e-2 parity tolerances hold). Per-item batched-vs-standalone equivalence + padding-invariance tests at every level: test_subsampling_batch, test_relpos_attention_batch, test_conformer_batch, test_encoder_batch, test_capi_batch.
Two real bugs were caught by these gates and fixed:
Performance note (important)
On a CPU backend, batching does not improve throughput and slightly reduces it (B=1 is fastest): the GEMMs already saturate the thread pool at B=1, and padding every clip to the batch's longest clip just adds wasted work. The throughput win is a GPU-backend effect. Documented in BENCHMARK.md and surfaced by the new bench-batch subcommand. On CPU the value of this PR is the correct, fused batched encoder graph; the latency/throughput payoff lands on GPU.
Scope / limitations
Test plan