feat(api/transcription): include segments + duration + language on stream done event #9709

Merged
mudler merged 1 commit into master from feat/stream-transcription-include-segments
May 7, 2026

@localai-bot
Collaborator

Summary

Extends the transcript.text.done SSE payload emitted by streamTranscription to additively carry language, duration, and a segments array so streaming clients can build the same TranscriptionResultSeconds shape they get from the non-streaming JSON path.
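
A minimal sketch of how the extended done payload could be modelled in Go. The type names `doneEvent` and `doneSegment`, the `type` discriminator field, the `encodeDone` helper, and the sample values are assumptions for illustration, not the actual LocalAI definitions. Only the field names (`text`, `language`, `duration`, `segments`, and `id`/`start`/`end`/`text` per segment) come from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// doneSegment is a hypothetical wire type for one entry of "segments".
// Start and End are float seconds, as described in this PR.
type doneSegment struct {
	ID    int     `json:"id"`
	Start float64 `json:"start"`
	End   float64 `json:"end"`
	Text  string  `json:"text"`
}

// doneEvent is a hypothetical wire type for the transcript.text.done payload.
// The three additive fields use omitempty so that empty values are left out.
type doneEvent struct {
	Type     string        `json:"type"` // assumed discriminator field
	Text     string        `json:"text"`
	Language string        `json:"language,omitempty"`
	Duration float64       `json:"duration,omitempty"`
	Segments []doneSegment `json:"segments,omitempty"`
}

// encodeDone marshals the event to the JSON that would follow "data: ".
func encodeDone(ev doneEvent) string {
	b, err := json.Marshal(ev)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	// Backend without timing signals: only type and text are emitted.
	fmt.Println(encodeDone(doneEvent{Type: "transcript.text.done", Text: "hello"}))

	// Backend with timing signals: the additive fields appear.
	fmt.Println(encodeDone(doneEvent{
		Type:     "transcript.text.done",
		Text:     "hello world",
		Language: "en",
		Duration: 1.5,
		Segments: []doneSegment{{ID: 0, Start: 0, End: 1.5, Text: "hello world"}},
	}))
}
```

Note that `id`, `start`, and `end` inside a segment are deliberately not `omitempty` in this sketch, since a first segment legitimately has `id` 0 and `start` 0.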

Why

Streaming clients (e.g. notetaker / notary) want streaming semantics for one big practical reason: the JSON path's ResponseHeaderTimeout trips when whisper requests queue behind each other on a SingleThread backend. streamTranscription fixes that by flushing 200 + headers immediately, but until now the done event only carried text — clients that needed per-utterance timings or audio duration had to fall back to the JSON path and accept the queue-induced timeouts again.

This change closes the loop: the streaming endpoint becomes a strict superset of the JSON endpoint's body, and clients can stay on SSE end-to-end.

Compatibility

  • OpenAI streaming spec: only text is mandated on transcript.text.done; spec-compliant clients ignore unknown fields.
  • Empty values are still omitted (Language == "", Duration == 0, len(Segments) == 0), so a transcription that came from a backend without these signals emits the same shape as before.
  • Wire format: start/end are float seconds (matching TranscriptionSegmentSeconds.Seconds() from the JSON path) so clients can share decoding logic.

Test plan

  • Manual: hit POST /v1/audio/transcriptions with stream=true against a model that emits segment timings (whisper-cpp). Verify the final data: {...} line before [DONE] includes segments and duration and matches the JSON-mode response for the same audio.
  • Manual: same request against a backend that doesn't produce segments. Verify the done event still includes text and the segments field is absent (omitempty preserved).
  • Existing non-streaming JSON path is untouched — no regression risk for spec-compliant clients.

Companion change

The notetaker repo's notary client switches its Transcribe to streaming and consumes these fields in the same shape as its existing JSON-path parser (internal/shared/localai/client.go::parseTranscriptionStream). Without this server change, the streaming client falls through to text-only results — so this PR completes the round trip but isn't a breaking dependency.

🤖 Generated with Claude Code

…ream done event

streamTranscription previously emitted a done event with just `text`,
matching the OpenAI streaming spec exactly. Streaming clients that need
per-utterance timings or audio duration had to fall back to the
non-streaming JSON path — and that path is exactly the one that trips
on ResponseHeaderTimeout when whisper requests queue behind each other
on a SingleThread backend.

Extend the done event to additively carry `language`, `duration`, and
a `segments` array (id, start, end, text — start/end as float seconds,
matching TranscriptionSegmentSeconds). Empty / zero values are still
omitted; spec-compliant clients ignore the new fields.

This unblocks notary's streaming Transcribe (companion change in the
notary repo) so it produces the same TranscriptionResult shape as the
JSON path while sidestepping the queue-induced header timeouts.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
@mudler mudler merged commit 595b6fd into master May 7, 2026
51 checks passed
@mudler mudler deleted the feat/stream-transcription-include-segments branch May 7, 2026 15:28
@localai-bot localai-bot added the enhancement New feature or request label May 9, 2026