Feature request: per-run target-language segments on stt.SpeechData #5685

@MSameerAbbas

Description

Feature Type

Would make my life easier

Feature Description

stt.SpeechData exposes per-segment data on the source side (source_languages, source_texts), but on the target side it exposes only a single concatenated text and a single language. Under Soniox two-way translation with code-switched input, the translated text legitimately contains spans in both target languages, but downstream consumers can't tell which spans are which because:

  1. SpeechData.language is set to the first translated token's language only — first-wins in livekit-plugins-soniox's _TokenAccumulator.update.
  2. The Soniox plugin already builds the per-run translated list (final._lang_segments, the same coalescing pass it uses to build the source side) but discards it in send_endpoint_transcript — only final_original._lang_segments is forwarded as source_languages / source_texts.

The proposal is purely additive and symmetric: add target_languages / target_texts to SpeechData (per consecutive-language run, mirroring the existing source-side semantics), and have the Soniox plugin populate them from final._lang_segments.

Why per-run, not per-token

The source side is already per-run — _TokenAccumulator.update coalesces consecutive same-language tokens into a single (lang, text) entry. Asking for the target side to be per-run keeps the API symmetric, requires the smallest possible plugin diff (the per-run list is already computed), and matches what consumers actually want for highlighting / display ([en, es, en] instead of [en, en, en, en, es, es, es, en, en]). Anyone who needs word-level granularity can already use the existing SpeechData.words: list[TimedString] field.
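The per-run coalescing described above can be sketched in one pass with itertools.groupby. This is illustrative only: the real plugin builds the runs incrementally inside _TokenAccumulator.update rather than in a batch pass, but the resulting (lang, text) list has the same shape.

```python
from itertools import groupby

# per-token output, as a list of (language, text) pairs
tokens = [
    ("en", "I "), ("en", "don't "), ("en", "speak "), ("en", "Spanish, "),
    ("es", "pero "), ("es", "hablo "), ("es", "inglés."),
]

# coalesce consecutive same-language tokens into runs
runs = [
    (lang, "".join(text for _, text in group))
    for lang, group in groupby(tokens, key=lambda t: t[0])
]

print([lang for lang, _ in runs])  # ['en', 'es'] rather than one entry per token
```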

Reproduction

from livekit.agents import AgentSession
from livekit.plugins import soniox

session = AgentSession(
    stt=soniox.STT(
        params=soniox.STTOptions(
            model="stt-rt-v4",
            language_hints=["en", "es"],
            language_hints_strict=True,
            translation=soniox.TranslationConfig(
                type="two_way", language_a="en", language_b="es",
            ),
        )
    ),
)

Speak: "No hablo español but I speak English"

Resulting SpeechData on the FINAL_TRANSCRIPT event:

SpeechData(
    text             = "I don't speak Spanish, pero hablo inglés.",
    language         = "en",                                  # misleading: second clause is Spanish
    source_languages = ["es", "en"],                          # per-run, correct
    source_texts     = ["No hablo español, ", " but I speak English."],
    # target_languages / target_texts: not surfaced today
)

Expected after this feature:

SpeechData(
    text             = "I don't speak Spanish, pero hablo inglés.",
    language         = "en",                                  # unchanged for back-compat (first run)
    source_languages = ["es", "en"],
    source_texts     = ["No hablo español, ", " but I speak English."],
    target_languages = ["en", "es"],                          # NEW
    target_texts     = ["I don't speak Spanish, ", "pero hablo inglés."],   # NEW
)

Proposed solution

In livekit-agents, extend stt.SpeechData:

@dataclass
class SpeechData:
    ...
    target_languages: list[LanguageCode] | None = None
    """Per consecutive-language-run target languages, parallel to `target_texts`.
    Mirrors the existing `source_languages` / `source_texts` semantics on the source side.
    Populated by STT services that support translation; None when translation is off."""
    target_texts: list[str] | None = None
    """The translated transcription segments in the target language(s).
    Each entry corresponds to the same-indexed entry in `target_languages`."""

In livekit-plugins-soniox, mirror the existing source-side block in send_endpoint_transcript:

src_segs = final_original._lang_segments
source_languages = [LanguageCode(lang) for lang, _ in src_segs] or None
source_texts     = [t for _, t in src_segs] or None

# NEW — target side, symmetric to source
tgt_segs = final._lang_segments
target_languages = [LanguageCode(lang) for lang, _ in tgt_segs] or None
target_texts     = [t for _, t in tgt_segs] or None

self._event_ch.send_nowait(
    stt.SpeechEvent(
        type=SpeechEventType.FINAL_TRANSCRIPT,
        alternatives=[
            final.to_speech_data(
                self.start_time_offset,
                source_languages=source_languages,
                source_texts=source_texts,
                target_languages=target_languages,
                target_texts=target_texts,
            )
        ],
    )
)

final._lang_segments is already computed by the existing _TokenAccumulator.update coalescing logic — no new accumulation work, no new state, no protocol-level changes. Consumers that don't read the new fields are unaffected (back-compat preserved).

The same change naturally applies to the INTERIM_TRANSCRIPT path, which already merges source-side segments via _merge_lang_segments.
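For the interim path, a seam-aware merge of two per-run lists could look like the sketch below. This is an assumption about _merge_lang_segments' behavior (joining adjacent runs of the same language at the boundary), not a copy of the real implementation; the function name and signature are illustrative.

```python
def merge_lang_segments(
    a: list[tuple[str, str]], b: list[tuple[str, str]]
) -> list[tuple[str, str]]:
    """Concatenate two per-run segment lists, coalescing at the seam when
    the last run of `a` and the first run of `b` share a language.
    Illustrative only — the real _merge_lang_segments may differ."""
    if not a:
        return list(b)
    if not b:
        return list(a)
    merged = list(a)
    if merged[-1][0] == b[0][0]:
        # same language on both sides of the seam: fuse into one run
        lang, text = merged[-1]
        merged[-1] = (lang, text + b[0][1])
        merged.extend(b[1:])
    else:
        merged.extend(b)
    return merged


print(merge_lang_segments([("en", "Hello ")], [("en", "world, "), ("es", "hola.")]))
# [('en', 'Hello world, '), ('es', 'hola.')]
```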

Workarounds / Alternatives

No response

Additional Context

No response

Labels

enhancement (New feature or request)