feat(inference/tts): port aligned transcript / output_timestamps from Python (#5534) by toubatbrian · Pull Request #1311 · livekit/agents-js

toubatbrian · 2026-04-24T02:50:52Z

Summary

Automated port of livekit/agents#5534 ("feat(tts): add support for timestamps in Inference") into agents-js.

The upstream Python PR taught the Inference TTS client to surface the LiveKit Inference gateway's new output_timestamps WebSocket event as word- / character-level TimedStrings, and to advertise the aligned_transcript capability automatically whenever a provider-specific opt-in flag is present in extra_kwargs. This PR brings the same behaviour to the Node.js client.

Triggered by the automated Claude Code routine (experimental).

Scope

This change is a plugin-layer improvement — all edits are confined to agents/src/inference/ and its tests. No public API surface changes outside of the inference TTS namespace.

What's ported

1. `hasAlignedTranscript(model, modelOptions)` helper

Direct analogue of the Python _has_aligned_transcript. Inspects the provider prefix and returns true if the provider-specific alignment opt-in is present in modelOptions:

| Provider | `modelOptions` flag that enables alignment |
| --- | --- |
| `cartesia/*` | `add_timestamps: true` |
| `elevenlabs/*` | `sync_alignment: true` |
| `inworld/*` | `timestamp_type: 'WORD' \| 'CHARACTER'` |

Any other provider (e.g. deepgram, rime) returns false, matching Python.
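
For readers skimming the description, the provider matrix above can be sketched roughly as follows. This is an illustrative re-implementation, not the code in agents/src/inference/tts.ts: the strict `=== true` comparison and the placeholder model names are assumptions.

```typescript
// Illustrative sketch only; the real helper may compare flags differently
// (for example by truthiness rather than strict equality).
function hasAlignedTranscript(
  model: string,
  modelOptions: Record<string, unknown> = {},
): boolean {
  // The provider is the part of the model string before the first "/".
  const provider = model.split('/')[0];
  switch (provider) {
    case 'cartesia':
      return modelOptions.add_timestamps === true;
    case 'elevenlabs':
      return modelOptions.sync_alignment === true;
    case 'inworld':
      return (
        modelOptions.timestamp_type === 'WORD' ||
        modelOptions.timestamp_type === 'CHARACTER'
      );
    default:
      // deepgram, rime, and any other provider never advertise alignment.
      return false;
  }
}

// "placeholder" is a made-up model suffix used only for this example.
console.log(hasAlignedTranscript('cartesia/placeholder', { add_timestamps: true }));
console.log(hasAlignedTranscript('deepgram/placeholder', { add_timestamps: true }));
```

Accepting `Record<string, unknown>` keeps the helper independent of the per-provider option interfaces, which is the trade-off described under "Type guardrails".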

2. Dynamic `TTSCapabilities.alignedTranscript`

Computed at construction from the incoming model + modelOptions and forwarded to super(..., { streaming: true, alignedTranscript }).
Re-computed inside updateOptions when either model or modelOptions changes, mirroring Python's self._capabilities.aligned_transcript = _has_aligned_transcript(...).
Exposed via an overridden get capabilities() on the inference TTS subclass.
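
A minimal sketch of the capability wiring described above, under several assumptions: `BaseTTS`, `InferenceTTS`, and `detectAlignment` are simplified stand-ins, TypeScript `private` substitutes for the hard-private `#capabilities` field, and only one provider is checked to keep the example short.

```typescript
// Simplified stand-ins; names and shapes are assumptions for illustration.
interface TTSCapabilities {
  streaming: boolean;
  alignedTranscript: boolean;
}

// Reduced stand-in for hasAlignedTranscript (one provider only).
function detectAlignment(model: string, modelOptions: Record<string, unknown>): boolean {
  return model.split('/')[0] === 'elevenlabs' && modelOptions.sync_alignment === true;
}

class BaseTTS {
  // Stand-in for the hard-private field: a subclass cannot write to it.
  private readonly baseCapabilities: TTSCapabilities;

  constructor(capabilities: TTSCapabilities) {
    this.baseCapabilities = capabilities;
  }

  get capabilities(): TTSCapabilities {
    return this.baseCapabilities;
  }
}

class InferenceTTS extends BaseTTS {
  private model: string;
  private modelOptions: Record<string, unknown>;
  private aligned: boolean;

  constructor(model: string, modelOptions: Record<string, unknown> = {}) {
    const aligned = detectAlignment(model, modelOptions);
    super({ streaming: true, alignedTranscript: aligned });
    this.model = model;
    this.modelOptions = modelOptions;
    this.aligned = aligned;
  }

  // Overridden getter returns the live value tracked by the subclass.
  get capabilities(): TTSCapabilities {
    return { streaming: true, alignedTranscript: this.aligned };
  }

  updateOptions(opts: { model?: string; modelOptions?: Record<string, unknown> }): void {
    if (opts.model !== undefined) this.model = opts.model;
    if (opts.modelOptions !== undefined) {
      this.modelOptions = { ...this.modelOptions, ...opts.modelOptions };
    }
    // Recompute whenever model or modelOptions may have changed.
    this.aligned = detectAlignment(this.model, this.modelOptions);
  }
}

const tts = new InferenceTTS('elevenlabs/placeholder');
tts.updateOptions({ modelOptions: { sync_alignment: true } });
console.log(tts.capabilities.alignedTranscript);
```

The getter override is what lets callers keep reading `tts.capabilities.alignedTranscript` without knowing the value is tracked outside the base class.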

3. Runtime `modelOptions` updates

TTS.updateOptions / SynthesizeStream.updateOptions now accept modelOptions in addition to model / voice / language. The new options are merged shallowly into the existing modelOptions, matching Python's self._opts.extra_kwargs.update(extra_kwargs).
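
The shallow-merge semantics can be illustrated with a small sketch. The `applyOptionsPatch` function and the `speed` key are hypothetical and exist only for this example; the actual `updateOptions` implementation may be structured differently.

```typescript
type ModelOptions = Record<string, unknown>;

interface InferenceTTSOptions {
  model: string;
  voice?: string;
  language?: string;
  modelOptions: ModelOptions;
}

interface InferenceTTSOptionsPatch {
  model?: string;
  voice?: string;
  language?: string;
  modelOptions?: ModelOptions;
}

// Hypothetical helper: returns the stored options with the patch applied.
function applyOptionsPatch(
  current: InferenceTTSOptions,
  patch: InferenceTTSOptionsPatch,
): InferenceTTSOptions {
  return {
    model: patch.model !== undefined ? patch.model : current.model,
    voice: patch.voice !== undefined ? patch.voice : current.voice,
    language: patch.language !== undefined ? patch.language : current.language,
    // Shallow merge: top-level keys in the patch overwrite stored keys,
    // untouched keys survive, and nested objects are replaced wholesale.
    modelOptions: { ...current.modelOptions, ...(patch.modelOptions || {}) },
  };
}

const before: InferenceTTSOptions = {
  model: 'cartesia/placeholder',
  modelOptions: { speed: 1, add_timestamps: false }, // `speed` is a made-up key
};
const after = applyOptionsPatch(before, { modelOptions: { add_timestamps: true } });
console.log(after.modelOptions);
```

Because the merge is shallow, a caller who wants to clear a flag has to overwrite it explicitly (for example with `false`) rather than omit it from the patch.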

4. `output_timestamps` WebSocket event

Added ttsOutputTimestampsEventSchema to agents/src/inference/api_protos.ts as a new branch of the ttsServerEventSchema discriminated union. The schema supports both words: [{word, start, end}] and chars: [{char, start, end}] payloads.
SynthesizeStream.run now buffers decoded timings as TimedString[] (pendingTimedTranscripts) and attaches them to the next audio frame via SynthesizedAudio.timedTranscripts. Matches the Cartesia plugin's pattern for word-level timing and the downstream TranscriptionSynchronizer consumer.
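
The buffering pattern can be sketched without zod or the real framework types. Everything here is a simplified assumption: the `TimedString` field names, the `TimedTranscriptBuffer` class, and the choice to prefer `words` when both arrays are present are illustrative, not taken from the PR's code.

```typescript
// Simplified stand-ins for the real types; field names are assumptions.
interface TimedString {
  text: string;
  startTime: number;
  endTime: number;
}

interface OutputTimestampsEvent {
  type: 'output_timestamps';
  words?: { word: string; start: number; end: number }[];
  chars?: { char: string; start: number; end: number }[];
}

interface SynthesizedAudio {
  frame: Uint8Array;
  timedTranscripts?: TimedString[];
}

// Convert a gateway event into timed strings, passing the raw text through
// with no trailing whitespace appended.
function toTimedStrings(event: OutputTimestampsEvent): TimedString[] {
  if (event.words !== undefined) {
    return event.words.map((w) => ({ text: w.word, startTime: w.start, endTime: w.end }));
  }
  return (event.chars || []).map((c) => ({ text: c.char, startTime: c.start, endTime: c.end }));
}

class TimedTranscriptBuffer {
  private pending: TimedString[] = [];

  // Called for each output_timestamps event received on the socket.
  onTimestamps(event: OutputTimestampsEvent): void {
    this.pending = this.pending.concat(toTimedStrings(event));
  }

  // Called for each audio frame: attach whatever is buffered, then reset.
  attachTo(frame: Uint8Array): SynthesizedAudio {
    const audio: SynthesizedAudio = { frame };
    if (this.pending.length > 0) {
      audio.timedTranscripts = this.pending;
      this.pending = [];
    }
    return audio;
  }
}

const buffer = new TimedTranscriptBuffer();
buffer.onTimestamps({ type: 'output_timestamps', words: [{ word: 'hello', start: 0, end: 0.4 }] });
console.log(buffer.attachTo(new Uint8Array(2)).timedTranscripts);
```

Resetting the buffer after each attach means a timing entry is delivered with exactly one audio packet, which is the behaviour a downstream synchronizer would need to avoid duplicates.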

Implementation notes where JS differs from Python

Code-level parity is near-1:1, modulo the following language-level adjustments:

Capability mutation. Python mutates self._capabilities.aligned_transcript in place on the dataclass held on the base TTS. The JS base class stores capabilities in a hard-private #capabilities field accessed through a getter, so mutation from a subclass isn't possible. Instead, the inference TTS subclass tracks its own #alignedTranscript boolean and overrides get capabilities() to return the live value. Externally observable behaviour is identical — tts.capabilities.alignedTranscript reflects the latest modelOptions.
push_timed_transcript → SynthesizedAudio.timedTranscripts. Python exposes output_emitter.push_timed_transcript(ts), which streams timed strings alongside audio. The JS agents framework has no such abstraction; the established pattern (see plugins/cartesia/src/tts.ts) is to buffer TimedString[] and attach them to the next SynthesizedAudio packet via the optional timedTranscripts field. This is what TranscriptionSynchronizer already consumes.
update_options signature. The Python update_options has always accepted extra_kwargs; JS updateOptions previously did not. To preserve Python semantics (recomputing alignment when options change at runtime), the JS updateOptions signature now also accepts a partial modelOptions patch, which is shallow-merged into the stored options. This is a strictly additive change to the type; existing callers compile unchanged.
Text of the TimedString. Python passes the raw word_info["word"] / char_info["char"] with no trailing whitespace. The JS port does the same — unlike the Cartesia plugin port, no + ' ' is appended. If padding is desired in the future we can align both ports in a follow-up.
Type guardrails. modelOptions is carried as the generic TTSOptions<TModel> type on the way in, but hasAlignedTranscript intentionally accepts Record<string, unknown> to avoid having to widen every provider interface just to read three opt-in fields.

Files changed

agents/src/inference/api_protos.ts — new ttsWordTimestampSchema / ttsCharTimestampSchema / ttsOutputTimestampsEventSchema and their inferred types, added to the server-event union.
agents/src/inference/tts.ts — new hasAlignedTranscript helper, dynamic capability wiring in constructor + updateOptions, output_timestamps handling in SynthesizeStream.run, TimedString buffer attached to the next frame.
agents/src/inference/tts.test.ts — new describe blocks for hasAlignedTranscript (provider matrix + edge cases) and TTS alignedTranscript capability (constructor-time computation and updateOptions recomputation).
.changeset/inference-tts-aligned-timestamps.md — minor changeset entry.

Cross-reference comments

Every ported section carries an inline // Ref: python livekit-agents/livekit/agents/inference/tts.py - <lines> comment pointing back at the corresponding lines in the upstream diff, per the repo's porting convention.

Test plan

pnpm build:agents succeeds (tsup + tsc --declaration).
pnpm exec vitest run src/inference/tts.test.ts src/inference/api_protos.test.ts — 51 tests green.
pnpm exec eslint src/inference/tts.ts src/inference/tts.test.ts src/inference/api_protos.ts: no errors introduced by this PR (4 pre-existing tsdoc warnings on the `Range >0.5, <=1.5.` JSDoc in InworldOptions / RimeOptions, unchanged).
pnpm format:check clean.
Manual smoke test against the LiveKit Inference gateway with an ElevenLabs model and modelOptions: { sync_alignment: true } — confirm output_timestamps frames flow through to the TranscriptionSynchronizer.
Manual smoke test with Cartesia add_timestamps: true and Inworld timestamp_type: 'WORD'.

This is an automated port from livekit/agents#5534 by the Claude Code automation routine (experimental).

cc @toubatbrian @livekit/agent-devs

Generated by Claude Code

Ports livekit/agents#5534 to agents-js. The Inference gateway gained word/character-level timestamp support for Cartesia, ElevenLabs, and Inworld. This wires the client side:

- Add `hasAlignedTranscript(model, modelOptions)` mirroring the Python `_has_aligned_transcript` helper. It inspects the provider-specific opt-in flags inside `modelOptions` (`cartesia.add_timestamps`, `elevenlabs.sync_alignment`, `inworld.timestamp_type`).
- Pass the derived capability into the base TTS constructor and track it in a mutable subclass field so `capabilities.alignedTranscript` stays in sync after `updateOptions` reconfigures the model or the provider-specific options.
- Extend the inference TTS `updateOptions` signature to accept a partial `modelOptions` patch (merged shallowly, matching Python's `self._opts.extra_kwargs.update(...)` semantics) so the alignment flag can be toggled at runtime.
- Add an `output_timestamps` zod schema to the server event discriminated union, and handle it in the recv task by buffering the decoded words (or characters) as `TimedString` entries that are attached to the next synthesized audio frame via `SynthesizedAudio.timedTranscripts` (the JS analogue of Python's `output_emitter.push_timed_transcript`).

Tests cover the helper's provider matrix, capability wiring at construction time, and recomputation on `updateOptions`.

Automated port by the Claude Code routine (experimental) triggered by the merged upstream PR.

changeset-bot · 2026-04-24T02:50:56Z

🦋 Changeset detected

Latest commit: 9bad2b1

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 26 packages

| Name | Type |
| --- | --- |
| @livekit/agents | Major |
| @livekit/agents-plugin-anam | Major |
| @livekit/agents-plugin-assemblyai | Major |
| @livekit/agents-plugin-baseten | Major |
| @livekit/agents-plugin-bey | Major |
| @livekit/agents-plugin-cartesia | Major |
| @livekit/agents-plugin-cerebras | Major |
| @livekit/agents-plugin-deepgram | Major |
| @livekit/agents-plugin-elevenlabs | Major |
| @livekit/agents-plugin-google | Major |
| @livekit/agents-plugin-hedra | Major |
| @livekit/agents-plugin-inworld | Major |
| @livekit/agents-plugin-lemonslice | Major |
| @livekit/agents-plugin-livekit | Major |
| @livekit/agents-plugin-mistral | Major |
| @livekit/agents-plugin-neuphonic | Major |
| @livekit/agents-plugin-openai | Major |
| @livekit/agents-plugin-phonic | Major |
| @livekit/agents-plugin-resemble | Major |
| @livekit/agents-plugin-rime | Major |
| @livekit/agents-plugin-runway | Major |
| @livekit/agents-plugin-sarvam | Major |
| @livekit/agents-plugin-silero | Major |
| @livekit/agents-plugins-test | Major |
| @livekit/agents-plugin-trugen | Major |
| @livekit/agents-plugin-xai | Major |


CLAassistant · 2026-04-24T02:50:59Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ toubatbrian
❌ claude
You have signed the CLA already but the status is still pending? Let us recheck it.

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 5 additional findings.

…ges session payload

Addresses a review comment on #1311 (codex review).

The LiveKit Inference gateway only consumes `model` / `voice` / `language` / `extra` (`modelOptions`) when the WebSocket is first opened via `session.create`. Without invalidating the pool, calling `updateOptions({ modelOptions: { add_timestamps: true } })` after a prior stream has warmed the pool would report `capabilities.alignedTranscript = true` while the reused socket continued to run with the old session and never emit `output_timestamps`. In flows that use TTS-aligned transcripts this would silently switch transcription to an empty timed-text stream.

Call `pool.invalidate()` whenever a session-affecting option changes, so the next `stream()` opens a fresh socket with the up-to-date payload. Added a unit test that spies on `pool.invalidate` to pin the behaviour.
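
The fix described in this commit message can be sketched as follows. `FakeConnectionPool` and `PooledTTS` are hypothetical stand-ins written for this example, and treating any provided `model` / `voice` / `language` / `modelOptions` key as session-affecting is an assumption about how the check is done.

```typescript
// Hypothetical stand-in for the connection pool; counts invalidations so the
// behaviour is observable without a real WebSocket.
class FakeConnectionPool {
  invalidations = 0;

  invalidate(): void {
    this.invalidations += 1;
  }
}

interface SessionOptions {
  model: string;
  voice?: string;
  language?: string;
  modelOptions: Record<string, unknown>;
}

interface SessionOptionsPatch {
  model?: string;
  voice?: string;
  language?: string;
  modelOptions?: Record<string, unknown>;
}

class PooledTTS {
  readonly pool = new FakeConnectionPool();
  private opts: SessionOptions;

  constructor(opts: SessionOptions) {
    this.opts = opts;
  }

  updateOptions(patch: SessionOptionsPatch): void {
    // These fields are only sent when the socket is first opened, so any
    // change to them makes already-pooled sockets stale.
    const sessionAffecting =
      patch.model !== undefined ||
      patch.voice !== undefined ||
      patch.language !== undefined ||
      patch.modelOptions !== undefined;

    this.opts = {
      model: patch.model !== undefined ? patch.model : this.opts.model,
      voice: patch.voice !== undefined ? patch.voice : this.opts.voice,
      language: patch.language !== undefined ? patch.language : this.opts.language,
      modelOptions: { ...this.opts.modelOptions, ...(patch.modelOptions || {}) },
    };

    if (sessionAffecting) {
      // Force the next stream to open a fresh socket with the new payload.
      this.pool.invalidate();
    }
  }
}

const pooled = new PooledTTS({ model: 'cartesia/placeholder', modelOptions: {} });
pooled.updateOptions({ modelOptions: { add_timestamps: true } });
console.log(pooled.pool.invalidations);
```

A counter on the fake pool plays the role that a spy on `pool.invalidate` plays in the unit test mentioned above.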

lukasIO

lgtm!

devin-ai-integration Bot reviewed Apr 24, 2026

This comment was marked as resolved.

claude and others added 2 commits April 24, 2026 11:54

verified changes

9bad2b1

toubatbrian added the verified-port label Apr 28, 2026

lukasIO approved these changes Apr 28, 2026

toubatbrian merged commit 1287430 into main Apr 28, 2026
8 of 9 checks passed

toubatbrian deleted the claude/jolly-lovelace-FHw8W branch April 28, 2026 14:41

This was referenced Apr 28, 2026

Version Packages #1322

Open

Version Packages enriqueespaillat-gyde/agents-js#2

Open

toubatbrian merged 3 commits into main from claude/jolly-lovelace-FHw8W

4 participants
