feat(inference/tts): port aligned transcript / output_timestamps from Python (#5534)#1311

Merged
toubatbrian merged 3 commits into main from claude/jolly-lovelace-FHw8W
Apr 28, 2026
Conversation

@toubatbrian
Contributor

Summary

Automated port of livekit/agents#5534 ("feat(tts): add support for timestamps in Inference") into agents-js.

The upstream Python PR taught the Inference TTS client to surface the LiveKit Inference gateway's new output_timestamps WebSocket event as word- / character-level TimedStrings, and to advertise the aligned_transcript capability automatically whenever a provider-specific opt-in flag is present in extra_kwargs. This PR brings the same behaviour to the Node.js client.

Triggered by the automated Claude Code routine (experimental).

Scope

This change is a plugin-layer improvement — all edits are confined to agents/src/inference/ and its tests. No public API surface changes outside of the inference TTS namespace.

What's ported

1. hasAlignedTranscript(model, modelOptions) helper

Direct analogue of the Python _has_aligned_transcript. Inspects the provider prefix and returns true if the provider-specific alignment opt-in is present in modelOptions:

Provider       modelOptions flag that enables alignment
cartesia/*     add_timestamps: true
elevenlabs/*   sync_alignment: true
inworld/*      timestamp_type: 'WORD' | 'CHARACTER'

Any other provider (e.g. deepgram, rime) returns false, matching Python.
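As a rough sketch of the dispatch described above (the real implementation lives in agents/src/inference/tts.ts; this standalone version only assumes the provider prefix and flag names listed in the table):

```typescript
// Illustrative sketch of the hasAlignedTranscript helper: the provider is
// the prefix of the model id ("cartesia/sonic-2" -> "cartesia"), and each
// provider has its own opt-in flag inside modelOptions.
function hasAlignedTranscript(model: string, modelOptions: Record<string, unknown>): boolean {
  const provider = model.split('/')[0];
  switch (provider) {
    case 'cartesia':
      return modelOptions['add_timestamps'] === true;
    case 'elevenlabs':
      return modelOptions['sync_alignment'] === true;
    case 'inworld': {
      const t = modelOptions['timestamp_type'];
      return t === 'WORD' || t === 'CHARACTER';
    }
    default:
      // Any other provider (deepgram, rime, ...) has no alignment opt-in.
      return false;
  }
}
```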

2. Dynamic TTSCapabilities.alignedTranscript

  • Computed at construction from the incoming model + modelOptions and forwarded to super(..., { streaming: true, alignedTranscript }).
  • Re-computed inside updateOptions when either model or modelOptions changes, mirroring Python's self._capabilities.aligned_transcript = _has_aligned_transcript(...).
  • Exposed via an overridden get capabilities() on the inference TTS subclass.
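The getter-override pattern can be sketched as below. BaseTTS here is a minimal stand-in for the framework base class (whose real capabilities live in a hard-private field), and the inline alignment check is a simplified placeholder for hasAlignedTranscript:

```typescript
interface TTSCapabilities {
  streaming: boolean;
  alignedTranscript: boolean;
}

// Stand-in for the framework base class: capabilities are hard-private,
// so a subclass cannot mutate them in place as the Python code does.
class BaseTTS {
  #capabilities: TTSCapabilities;
  constructor(capabilities: TTSCapabilities) {
    this.#capabilities = capabilities;
  }
  get capabilities(): TTSCapabilities {
    return this.#capabilities;
  }
}

class InferenceTTS extends BaseTTS {
  #alignedTranscript: boolean;

  constructor(model: string, modelOptions: Record<string, unknown>) {
    // Simplified stand-in for hasAlignedTranscript(model, modelOptions).
    const aligned = model.startsWith('cartesia/') && modelOptions['add_timestamps'] === true;
    super({ streaming: true, alignedTranscript: aligned });
    this.#alignedTranscript = aligned;
  }

  // Recomputed whenever model/modelOptions change at runtime.
  updateOptions(opts: { modelOptions?: Record<string, unknown> }): void {
    if (opts.modelOptions !== undefined) {
      this.#alignedTranscript = opts.modelOptions['add_timestamps'] === true;
    }
  }

  // The overridden getter always reflects the live value, so externally
  // observable behaviour matches Python's in-place dataclass mutation.
  override get capabilities(): TTSCapabilities {
    return { streaming: true, alignedTranscript: this.#alignedTranscript };
  }
}
```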

3. Runtime modelOptions updates

TTS.updateOptions / SynthesizeStream.updateOptions now accept modelOptions in addition to model / voice / language. The new options are merged shallowly into the existing modelOptions, matching Python's self._opts.extra_kwargs.update(extra_kwargs).
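The merge semantics are the TypeScript spread equivalent of Python's dict.update, roughly (function name is illustrative, not from the diff):

```typescript
// Shallow merge of a partial modelOptions patch into the stored options:
// keys in the patch win, everything else is preserved, and nested objects
// are replaced wholesale rather than deep-merged.
function mergeModelOptions(
  existing: Record<string, unknown>,
  patch?: Record<string, unknown>,
): Record<string, unknown> {
  return patch === undefined ? existing : { ...existing, ...patch };
}
```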

4. output_timestamps WebSocket event

  • Added ttsOutputTimestampsEventSchema to agents/src/inference/api_protos.ts as a new branch of the ttsServerEventSchema discriminated union. The schema supports both words: [{word, start, end}] and chars: [{char, start, end}] payloads.
  • SynthesizeStream.run now buffers decoded timings as TimedString[] (pendingTimedTranscripts) and attaches them to the next audio frame via SynthesizedAudio.timedTranscripts. Matches the Cartesia plugin's pattern for word-level timing and the downstream TranscriptionSynchronizer consumer.
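In plain-TypeScript terms (the real code validates with zod schemas in api_protos.ts; the event and TimedString shapes below are assumptions reconstructed from the description), the decode-and-buffer step looks roughly like:

```typescript
// Assumed shapes of the gateway event and the buffered entries.
interface TtsWordTimestamp { word: string; start: number; end: number; }
interface TtsCharTimestamp { char: string; start: number; end: number; }
interface TtsOutputTimestampsEvent {
  type: 'output_timestamps';
  words?: TtsWordTimestamp[];
  chars?: TtsCharTimestamp[];
}
interface TimedStringLike { text: string; startTime: number; endTime: number; }

// Decode one event into TimedString-like entries; callers append these to
// pendingTimedTranscripts and attach the buffer to the next audio frame
// via SynthesizedAudio.timedTranscripts.
function decodeTimedTranscripts(ev: TtsOutputTimestampsEvent): TimedStringLike[] {
  const out: TimedStringLike[] = [];
  for (const w of ev.words ?? []) {
    out.push({ text: w.word, startTime: w.start, endTime: w.end });
  }
  for (const c of ev.chars ?? []) {
    out.push({ text: c.char, startTime: c.start, endTime: c.end });
  }
  return out;
}
```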

Implementation notes where JS differs from Python

Code-level parity is near-1:1, modulo the following language-level adjustments:

  1. Capability mutation. Python mutates self._capabilities.aligned_transcript in place on the dataclass held on the base TTS. The JS base class stores capabilities in a hard-private #capabilities field accessed through a getter, so mutation from a subclass isn't possible. Instead, the inference TTS subclass tracks its own #alignedTranscript boolean and overrides get capabilities() to return the live value. Externally observable behaviour is identical — tts.capabilities.alignedTranscript reflects the latest modelOptions.
  2. push_timed_transcript → SynthesizedAudio.timedTranscripts. Python exposes output_emitter.push_timed_transcript(ts), which streams timed strings alongside audio. The JS agents framework has no such abstraction; the established pattern (see plugins/cartesia/src/tts.ts) is to buffer TimedString[] and attach them to the next SynthesizedAudio packet via the optional timedTranscripts field. This is what TranscriptionSynchronizer already consumes.
  3. update_options signature. The Python update_options had always accepted extra_kwargs; JS updateOptions previously did not. To preserve Python semantics (recomputing alignment when options change at runtime), the JS updateOptions signature now also accepts a partial modelOptions patch — shallow-merged into the stored options. This is a strict additive change to the type; existing callers compile unchanged.
  4. Text of the TimedString. Python passes the raw word_info["word"] / char_info["char"] with no trailing whitespace. The JS port does the same — unlike the Cartesia plugin port, no + ' ' is appended. If padding is desired in the future we can align both ports in a follow-up.
  5. Type guardrails. modelOptions is carried as the generic TTSOptions<TModel> type on the way in, but hasAlignedTranscript intentionally accepts Record<string, unknown> to avoid having to widen every provider interface just to read three opt-in fields.

Files changed

  • agents/src/inference/api_protos.ts — new ttsWordTimestampSchema / ttsCharTimestampSchema / ttsOutputTimestampsEventSchema and their inferred types, added to the server-event union.
  • agents/src/inference/tts.ts — new hasAlignedTranscript helper, dynamic capability wiring in constructor + updateOptions, output_timestamps handling in SynthesizeStream.run, TimedString buffer attached to the next frame.
  • agents/src/inference/tts.test.ts — new describe blocks for hasAlignedTranscript (provider matrix + edge cases) and TTS alignedTranscript capability (constructor-time computation and updateOptions recomputation).
  • .changeset/inference-tts-aligned-timestamps.md — minor changeset entry.

Cross-reference comments

Every ported section carries an inline // Ref: python livekit-agents/livekit/agents/inference/tts.py - <lines> comment pointing back at the corresponding lines in the upstream diff, per the repo's porting convention.

Test plan

  • pnpm build:agents succeeds (tsup + tsc --declaration).
  • pnpm exec vitest run src/inference/tts.test.ts src/inference/api_protos.test.ts — 51 tests green.
  • pnpm exec eslint src/inference/tts.ts src/inference/tts.test.ts src/inference/api_protos.ts — no errors introduced by this PR (the 4 pre-existing tsdoc warnings in the `Range >0.5, <=1.5.` JSDoc on InworldOptions / RimeOptions remain unchanged).
  • pnpm format:check clean.
  • Manual smoke test against the LiveKit Inference gateway with an ElevenLabs model and modelOptions: { sync_alignment: true } — confirm output_timestamps frames flow through to the TranscriptionSynchronizer.
  • Manual smoke test with Cartesia add_timestamps: true and Inworld timestamp_type: 'WORD'.

This is an automated port from livekit/agents#5534 by the Claude Code automation routine (experimental).

cc @toubatbrian @livekit/agent-devs


Generated by Claude Code

Ports livekit/agents#5534 to agents-js. The Inference gateway gained
word/character-level timestamp support for Cartesia, ElevenLabs, and
Inworld. This wires the client side:

- Add `hasAlignedTranscript(model, modelOptions)` mirroring the Python
  `_has_aligned_transcript` helper. It inspects the provider-specific
  opt-in flags inside `modelOptions` (`cartesia.add_timestamps`,
  `elevenlabs.sync_alignment`, `inworld.timestamp_type`).
- Pass the derived capability into the base TTS constructor and track
  it in a mutable subclass field so `capabilities.alignedTranscript`
  stays in sync after `updateOptions` reconfigures the model or the
  provider-specific options.
- Extend the inference TTS `updateOptions` signature to accept a
  partial `modelOptions` patch (merged shallowly, matching Python's
  `self._opts.extra_kwargs.update(...)` semantics) so the alignment
  flag can be toggled at runtime.
- Add an `output_timestamps` zod schema to the server event discriminated
  union, and handle it in the recv task by buffering the decoded words
  (or characters) as `TimedString` entries that are attached to the
  next synthesized audio frame via `SynthesizedAudio.timedTranscripts`
  (the JS analogue of Python's `output_emitter.push_timed_transcript`).

Tests cover the helper's provider matrix, capability wiring at
construction time, and recomputation on `updateOptions`.

Automated port by the Claude Code routine (experimental) triggered by
the merged upstream PR.
@changeset-bot

changeset-bot Bot commented Apr 24, 2026

🦋 Changeset detected

Latest commit: 9bad2b1

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 26 packages
Name Type
@livekit/agents Major
@livekit/agents-plugin-anam Major
@livekit/agents-plugin-assemblyai Major
@livekit/agents-plugin-baseten Major
@livekit/agents-plugin-bey Major
@livekit/agents-plugin-cartesia Major
@livekit/agents-plugin-cerebras Major
@livekit/agents-plugin-deepgram Major
@livekit/agents-plugin-elevenlabs Major
@livekit/agents-plugin-google Major
@livekit/agents-plugin-hedra Major
@livekit/agents-plugin-inworld Major
@livekit/agents-plugin-lemonslice Major
@livekit/agents-plugin-livekit Major
@livekit/agents-plugin-mistral Major
@livekit/agents-plugin-neuphonic Major
@livekit/agents-plugin-openai Major
@livekit/agents-plugin-phonic Major
@livekit/agents-plugin-resemble Major
@livekit/agents-plugin-rime Major
@livekit/agents-plugin-runway Major
@livekit/agents-plugin-sarvam Major
@livekit/agents-plugin-silero Major
@livekit/agents-plugins-test Major
@livekit/agents-plugin-trugen Major
@livekit/agents-plugin-xai Major


@CLAassistant

CLAassistant commented Apr 24, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ toubatbrian
❌ claude

Contributor

@devin-ai-integration devin-ai-integration Bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.


chatgpt-codex-connector[bot]

This comment was marked as resolved.

claude and others added 2 commits April 24, 2026 11:54
…ges session payload

Addresses a review comment on #1311 (codex review).

The LiveKit Inference gateway only consumes `model` / `voice` / `language`
/ `extra` (`modelOptions`) when the WebSocket is first opened via
`session.create`. Without invalidating the pool, calling
`updateOptions({ modelOptions: { add_timestamps: true } })` after a prior
stream has warmed the pool would report `capabilities.alignedTranscript =
true` while the reused socket continued to run with the old session and
never emit `output_timestamps`. In flows that use TTS-aligned transcripts
this would silently switch transcription to an empty timed-text stream.

Call `pool.invalidate()` whenever a session-affecting option changes, so
the next `stream()` opens a fresh socket with the up-to-date payload.
Added a unit test that spies on `pool.invalidate` to pin the behaviour.
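The "session-affecting option changed" check this commit adds can be sketched as follows (names and shapes are assumptions; the real code compares the stored options before calling pool.invalidate()):

```typescript
// The session.create payload is fixed when the WebSocket opens, so any
// change to these fields must invalidate the pooled connection; otherwise
// a reused socket keeps running with the stale session.
interface SessionOpts {
  model: string;
  voice?: string;
  language?: string;
  modelOptions: Record<string, unknown>;
}

function sessionAffectingOptionsChanged(prev: SessionOpts, next: SessionOpts): boolean {
  return (
    prev.model !== next.model ||
    prev.voice !== next.voice ||
    prev.language !== next.language ||
    // Shallow options object: JSON comparison is enough for a sketch.
    JSON.stringify(prev.modelOptions) !== JSON.stringify(next.modelOptions)
  );
}
```

When this returns true, updateOptions would call pool.invalidate() so the next stream() opens a fresh socket with the up-to-date payload.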
Contributor

@lukasIO lukasIO left a comment


lgtm!

@toubatbrian toubatbrian merged commit 1287430 into main Apr 28, 2026
8 of 9 checks passed
@toubatbrian toubatbrian deleted the claude/jolly-lovelace-FHw8W branch April 28, 2026 14:41

4 participants