feat(inference/tts): port aligned transcript / output_timestamps from Python (#5534) #1311
Merged
toubatbrian merged 3 commits into main on Apr 28, 2026
Ports livekit/agents#5534 to agents-js. The Inference gateway gained word/character-level timestamp support for Cartesia, ElevenLabs, and Inworld. This wires the client side:

- Add `hasAlignedTranscript(model, modelOptions)` mirroring the Python `_has_aligned_transcript` helper. It inspects the provider-specific opt-in flags inside `modelOptions` (`cartesia.add_timestamps`, `elevenlabs.sync_alignment`, `inworld.timestamp_type`).
- Pass the derived capability into the base TTS constructor and track it in a mutable subclass field so `capabilities.alignedTranscript` stays in sync after `updateOptions` reconfigures the model or the provider-specific options.
- Extend the inference TTS `updateOptions` signature to accept a partial `modelOptions` patch (merged shallowly, matching Python's `self._opts.extra_kwargs.update(...)` semantics) so the alignment flag can be toggled at runtime.
- Add an `output_timestamps` zod schema to the server event discriminated union, and handle it in the recv task by buffering the decoded words (or characters) as `TimedString` entries that are attached to the next synthesized audio frame via `SynthesizedAudio.timedTranscripts` (the JS analogue of Python's `output_emitter.push_timed_transcript`).

Tests cover the helper's provider matrix, capability wiring at construction time, and recomputation on `updateOptions`.

Automated port by the Claude Code routine (experimental) triggered by the merged upstream PR.
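The provider check described in the first bullet could be sketched roughly as follows. This is an illustrative reconstruction from the description, not the merged code: the strict `=== true` comparison, the prefix parsing via `split('/')`, and the `example-model` names are all assumptions.

```typescript
// Hypothetical sketch of the helper described above; the real
// implementation lives in agents/src/inference/tts.ts and may differ.
type ModelOptions = Record<string, unknown>;

function hasAlignedTranscript(model: string, modelOptions: ModelOptions = {}): boolean {
  // Assumption: the provider is the part of the model id before the first "/".
  const provider = model.split('/')[0];
  switch (provider) {
    case 'cartesia':
      return modelOptions.add_timestamps === true;
    case 'elevenlabs':
      return modelOptions.sync_alignment === true;
    case 'inworld':
      return (
        modelOptions.timestamp_type === 'WORD' ||
        modelOptions.timestamp_type === 'CHARACTER'
      );
    default:
      // Providers without a known opt-in flag never report alignment.
      return false;
  }
}
```

Accepting a loose `Record<string, unknown>` keeps the helper independent of the per-provider option interfaces, which is the trade-off the PR description mentions.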
🦋 Changeset detected

Latest commit: 9bad2b1

The changes in this PR will be included in the next version bump. This PR includes changesets to release 26 packages.
…ges session payload

Addresses a review comment on #1311 (codex review). The LiveKit Inference gateway only consumes `model` / `voice` / `language` / `extra` (`modelOptions`) when the WebSocket is first opened via `session.create`. Without invalidating the pool, calling `updateOptions({ modelOptions: { add_timestamps: true } })` after a prior stream has warmed the pool would report `capabilities.alignedTranscript = true` while the reused socket continued to run with the old session and never emit `output_timestamps`. In flows that use TTS-aligned transcripts this would silently switch transcription to an empty timed-text stream.

Call `pool.invalidate()` whenever a session-affecting option changes, so the next `stream()` opens a fresh socket with the up-to-date payload. Added a unit test that spies on `pool.invalidate` to pin the behaviour.
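A minimal sketch of the behaviour this commit describes, under stated assumptions: `FakePool`, `InferenceTTSSketch`, `SessionOptions`, and `alignedFor` are hypothetical stand-ins invented for illustration, and the capability check is reduced to the Cartesia flag to keep the example short. The real class extends the agents-js base `TTS` and uses the real connection pool.

```typescript
// Hypothetical stand-ins; not the classes from agents/src/inference/tts.ts.
type ModelOptions = Record<string, unknown>;

interface SessionOptions {
  model: string;
  voice?: string;
  language?: string;
  modelOptions: ModelOptions;
}

// Stand-in for the WebSocket pool: only counts invalidations.
class FakePool {
  invalidations = 0;
  invalidate(): void {
    this.invalidations += 1;
  }
}

// Simplified capability check (Cartesia only) for the sake of the sketch.
function alignedFor(opts: SessionOptions): boolean {
  return opts.model.split('/')[0] === 'cartesia' && opts.modelOptions.add_timestamps === true;
}

class InferenceTTSSketch {
  readonly pool = new FakePool();
  private opts: SessionOptions;
  private alignedTranscript: boolean;

  constructor(opts: SessionOptions) {
    this.opts = { ...opts, modelOptions: { ...opts.modelOptions } };
    this.alignedTranscript = alignedFor(this.opts);
  }

  // Returns the live value rather than one frozen at construction time.
  get capabilities(): { streaming: boolean; alignedTranscript: boolean } {
    return { streaming: true, alignedTranscript: this.alignedTranscript };
  }

  updateOptions(patch: Partial<SessionOptions>): void {
    const { modelOptions, ...rest } = patch;
    // Shallow merge: keys in the patch overwrite, other keys are kept.
    this.opts = {
      ...this.opts,
      ...rest,
      modelOptions: { ...this.opts.modelOptions, ...modelOptions },
    };
    this.alignedTranscript = alignedFor(this.opts);
    // Every key in this sketch is sent in the session payload, so any
    // non-empty patch forces the next stream onto a fresh socket.
    if (Object.keys(patch).length > 0) this.pool.invalidate();
  }
}
```

In this sketch the capability flag and the pool are updated in the same call, so a caller never observes `alignedTranscript = true` while a warmed socket from the old session is still eligible for reuse.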
This was referenced Apr 28, 2026
Summary

Automated port of livekit/agents#5534 ("feat(tts): add support for timestamps in Inference") into agents-js.

The upstream Python PR taught the Inference TTS client to surface the LiveKit Inference gateway's new `output_timestamps` WebSocket event as word- / character-level `TimedString`s, and to advertise the `aligned_transcript` capability automatically whenever a provider-specific opt-in flag is present in `extra_kwargs`. This PR brings the same behaviour to the Node.js client.

Triggered by the automated Claude Code routine (experimental).
Scope

This change is a plugin-layer improvement: all edits are confined to `agents/src/inference/` and its tests. No public API surface changes outside of the inference TTS namespace.

What's ported
1. `hasAlignedTranscript(model, modelOptions)` helper

Direct analogue of the Python `_has_aligned_transcript`. Inspects the provider prefix and returns `true` if the provider-specific alignment opt-in is present in `modelOptions`:

| Provider prefix | `modelOptions` flag that enables alignment |
| --- | --- |
| `cartesia/*` | `add_timestamps: true` |
| `elevenlabs/*` | `sync_alignment: true` |
| `inworld/*` | `timestamp_type: 'WORD' \| 'CHARACTER'` |

Any other provider (e.g. `deepgram`, `rime`) returns `false`, matching Python.

2. Dynamic `TTSCapabilities.alignedTranscript`

- Derived from `model` + `modelOptions` and forwarded to `super(..., { streaming: true, alignedTranscript })`.
- Recomputed in `updateOptions` when either `model` or `modelOptions` changes, mirroring Python's `self._capabilities.aligned_transcript = _has_aligned_transcript(...)`.
- Exposed through an overridden `get capabilities()` on the inference TTS subclass.

3. Runtime `modelOptions` updates

`TTS.updateOptions` / `SynthesizeStream.updateOptions` now accept `modelOptions` in addition to `model` / `voice` / `language`. The new options are merged shallowly into the existing `modelOptions`, matching Python's `self._opts.extra_kwargs.update(extra_kwargs)`.

4. `output_timestamps` WebSocket event

- Added `ttsOutputTimestampsEventSchema` to `agents/src/inference/api_protos.ts` as a new branch of the `ttsServerEventSchema` discriminated union. The schema supports both `words: [{word, start, end}]` and `chars: [{char, start, end}]` payloads.
- `SynthesizeStream.run` now buffers decoded timings as `TimedString[]` (`pendingTimedTranscripts`) and attaches them to the next audio frame via `SynthesizedAudio.timedTranscripts`. This matches the Cartesia plugin's pattern for word-level timing and the downstream `TranscriptionSynchronizer` consumer.

Implementation notes where JS differs from Python
Code-level parity is near-1:1, modulo the following language-level adjustments:

- Python mutates `self._capabilities.aligned_transcript` in place on the dataclass held on the base `TTS`. The JS base class stores capabilities in a hard-private `#capabilities` field accessed through a getter, so mutation from a subclass isn't possible. Instead, the inference TTS subclass tracks its own `#alignedTranscript` boolean and overrides `get capabilities()` to return the live value. Externally observable behaviour is identical: `tts.capabilities.alignedTranscript` reflects the latest `modelOptions`.
- `push_timed_transcript` → `SynthesizedAudio.timedTranscripts`. Python exposes `output_emitter.push_timed_transcript(ts)`, which streams timed strings alongside audio. The JS agents framework has no such abstraction; the established pattern (see `plugins/cartesia/src/tts.ts`) is to buffer `TimedString[]` and attach them to the next `SynthesizedAudio` packet via the optional `timedTranscripts` field. This is what `TranscriptionSynchronizer` already consumes.
- `update_options` signature. The Python `update_options` had always accepted `extra_kwargs`; JS `updateOptions` previously did not. To preserve Python semantics (recomputing alignment when options change at runtime), the JS `updateOptions` signature now also accepts a partial `modelOptions` patch, shallow-merged into the stored options. This is a strictly additive change to the type; existing callers compile unchanged.
- `TimedString`. Python passes the raw `word_info["word"]` / `char_info["char"]` with no trailing whitespace. The JS port does the same: unlike the Cartesia plugin port, no `+ ' '` is appended. If padding is desired in the future we can align both ports in a follow-up.
- `modelOptions` is carried as the generic `TTSOptions<TModel>` type on the way in, but `hasAlignedTranscript` intentionally accepts `Record<string, unknown>` to avoid having to widen every provider interface just to read three opt-in fields.

Files changed
- `agents/src/inference/api_protos.ts`: new `ttsWordTimestampSchema` / `ttsCharTimestampSchema` / `ttsOutputTimestampsEventSchema` and their inferred types, added to the server-event union.
- `agents/src/inference/tts.ts`: new `hasAlignedTranscript` helper, dynamic capability wiring in constructor + `updateOptions`, `output_timestamps` handling in `SynthesizeStream.run`, `TimedString` buffer attached to the next frame.
- `agents/src/inference/tts.test.ts`: new `describe` blocks for `hasAlignedTranscript` (provider matrix + edge cases) and `TTS alignedTranscript capability` (constructor-time computation and `updateOptions` recomputation).
- `.changeset/inference-tts-aligned-timestamps.md`: minor changeset entry.

Cross-reference comments
Every ported section carries an inline `// Ref: python livekit-agents/livekit/agents/inference/tts.py - <lines>` comment pointing back at the corresponding lines in the upstream diff, per the repo's porting convention.

Test plan
- `pnpm build:agents` succeeds (tsup + `tsc --declaration`).
- `pnpm exec vitest run src/inference/tts.test.ts src/inference/api_protos.test.ts`: 51 tests green.
- `pnpm exec eslint src/inference/tts.ts src/inference/tts.test.ts src/inference/api_protos.ts`: no errors introduced by this PR (4 pre-existing tsdoc warnings on the `Range >0.5, <=1.5.` JSDoc in `InworldOptions` / `RimeOptions`, unchanged).
- `pnpm format:check` clean.
- With ElevenLabs `modelOptions: { sync_alignment: true }`, confirm `output_timestamps` frames flow through to the `TranscriptionSynchronizer`.
- Same check with Cartesia `add_timestamps: true` and Inworld `timestamp_type: 'WORD'`.

This is an automated port from livekit/agents#5534 by the Claude Code automation routine (experimental).
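For reviewers who want a concrete picture of the `output_timestamps` handling, the buffering step could look roughly like the sketch below. The `TimedString`, `OutputTimestampsEvent`, and `SynthesizedAudio` shapes here are simplified, hypothetical stand-ins modelled on the description (the real types come from agents-js and the zod schemas in `api_protos.ts`), and preferring `words` over `chars` when both are present is an assumption.

```typescript
// Hypothetical, simplified shapes; not the real agents-js types.
interface TimedString {
  text: string;
  startTime: number;
  endTime: number;
}

interface OutputTimestampsEvent {
  type: 'output_timestamps';
  words?: { word: string; start: number; end: number }[];
  chars?: { char: string; start: number; end: number }[];
}

interface SynthesizedAudio {
  frame: Uint8Array;
  timedTranscripts?: TimedString[];
}

class TimedTranscriptBuffer {
  private pending: TimedString[] = [];

  // Decode an event into TimedString entries and hold them until the next frame.
  push(event: OutputTimestampsEvent): void {
    if (event.words && event.words.length > 0) {
      for (const w of event.words) {
        // Raw text, no trailing space appended.
        this.pending.push({ text: w.word, startTime: w.start, endTime: w.end });
      }
    } else if (event.chars) {
      for (const c of event.chars) {
        this.pending.push({ text: c.char, startTime: c.start, endTime: c.end });
      }
    }
  }

  // Attach everything buffered so far to the next audio frame, then reset.
  attach(frame: Uint8Array): SynthesizedAudio {
    const timedTranscripts = this.pending;
    this.pending = [];
    return timedTranscripts.length > 0 ? { frame, timedTranscripts } : { frame };
  }
}
```

Draining the buffer on each `attach` call means a frame only carries timings that arrived since the previous frame, so no entry is delivered twice.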
cc @toubatbrian @livekit/agent-devs
Generated by Claude Code