Feature request: per-run target-language segments on stt.SpeechData #5685

@MSameerAbbas

Description

Feature Type

Would make my life easier

Feature Description

stt.SpeechData exposes per-segment data on the source side (source_languages, source_texts), but on the target side it exposes only a single concatenated text and a single language. Under Soniox two-way translation with code-switched input, the translated text legitimately contains spans in both target languages, but downstream consumers can't tell which spans are which because:

  1. SpeechData.language is set to the first translated token's language only — first-wins in livekit-plugins-soniox's _TokenAccumulator.update.
  2. The Soniox plugin already builds the per-run translated list (final._lang_segments, the same coalescing pass it uses to build the source side) but discards it in send_endpoint_transcript — only final_original._lang_segments is forwarded as source_languages / source_texts.

The proposal is purely additive and symmetric: add target_languages / target_texts to SpeechData (per consecutive-language run, mirroring the existing source-side semantics), and have the Soniox plugin populate them from final._lang_segments.

Why per-run, not per-token

The source side is already per-run — _TokenAccumulator.update coalesces consecutive same-language tokens into a single (lang, text) entry. Asking for the target side to be per-run keeps the API symmetric, requires the smallest possible plugin diff (the per-run list is already computed), and matches what consumers actually want for highlighting / display ([en, es, en] instead of [en, en, en, en, es, es, es, en, en]). Anyone who needs word-level granularity can already use the existing SpeechData.words: list[TimedString] field.
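The per-run coalescing described above can be sketched in one pass with itertools.groupby. This is illustrative only: the real plugin builds the runs incrementally inside _TokenAccumulator.update rather than in a batch pass, but the resulting (lang, text) list has the same shape.

```python
from itertools import groupby

# per-token output, as a list of (language, text) pairs
tokens = [
    ("en", "I "), ("en", "don't "), ("en", "speak "), ("en", "Spanish, "),
    ("es", "pero "), ("es", "hablo "), ("es", "inglés."),
]

# coalesce consecutive same-language tokens into runs
runs = [
    (lang, "".join(text for _, text in group))
    for lang, group in groupby(tokens, key=lambda t: t[0])
]

print([lang for lang, _ in runs])  # ['en', 'es'] rather than one entry per token
```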

Reproduction

from livekit.agents import AgentSession
from livekit.plugins import soniox

session = AgentSession(
    stt=soniox.STT(
        params=soniox.STTOptions(
            model="stt-rt-v4",
            language_hints=["en", "es"],
            language_hints_strict=True,
            translation=soniox.TranslationConfig(
                type="two_way", language_a="en", language_b="es",
            ),
        )
    ),
)

Speak: "No hablo español but I speak English"

Resulting SpeechData on the FINAL_TRANSCRIPT event:

SpeechData(
    text             = "I don't speak Spanish, pero hablo inglés.",
    language         = "en",                                  # misleading: second clause is Spanish
    source_languages = ["es", "en"],                          # per-run, correct
    source_texts     = ["No hablo español, ", " but I speak English."],
    # target_languages / target_texts: not surfaced today
)

Expected after this feature:

SpeechData(
    text             = "I don't speak Spanish, pero hablo inglés.",
    language         = "en",                                  # unchanged for back-compat (first run)
    source_languages = ["es", "en"],
    source_texts     = ["No hablo español, ", " but I speak English."],
    target_languages = ["en", "es"],                          # NEW
    target_texts     = ["I don't speak Spanish, ", "pero hablo inglés."],   # NEW
)

Proposed solution

In livekit-agents, extend stt.SpeechData:

@dataclass
class SpeechData:
    ...
    target_languages: list[LanguageCode] | None = None
    """Per consecutive-language-run target languages, parallel to `target_texts`.
    Mirrors the existing `source_languages` / `source_texts` semantics on the source side.
    Populated by STT services that support translation; None when translation is off."""
    target_texts: list[str] | None = None
    """The translated transcription segments in the target language(s).
    Each entry corresponds to the same-indexed entry in `target_languages`."""

In livekit-plugins-soniox, mirror the existing source-side block in send_endpoint_transcript:

src_segs = final_original._lang_segments
source_languages = [LanguageCode(lang) for lang, _ in src_segs] or None
source_texts     = [t for _, t in src_segs] or None

# NEW — target side, symmetric to source
tgt_segs = final._lang_segments
target_languages = [LanguageCode(lang) for lang, _ in tgt_segs] or None
target_texts     = [t for _, t in tgt_segs] or None

self._event_ch.send_nowait(
    stt.SpeechEvent(
        type=SpeechEventType.FINAL_TRANSCRIPT,
        alternatives=[
            final.to_speech_data(
                self.start_time_offset,
                source_languages=source_languages,
                source_texts=source_texts,
                target_languages=target_languages,
                target_texts=target_texts,
            )
        ],
    )
)

final._lang_segments is already computed by the existing _TokenAccumulator.update coalescing logic — no new accumulation work, no new state, no protocol-level changes. Consumers that don't read the new fields are unaffected (back-compat preserved).

The same change naturally applies to the INTERIM_TRANSCRIPT path, which already merges source-side segments via _merge_lang_segments.
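For the interim path, a seam-aware merge of two per-run lists could look like the sketch below. This is an assumption about _merge_lang_segments' behavior (joining adjacent runs of the same language at the boundary), not a copy of the real implementation; the function name and signature are illustrative.

```python
def merge_lang_segments(
    a: list[tuple[str, str]], b: list[tuple[str, str]]
) -> list[tuple[str, str]]:
    """Concatenate two per-run segment lists, coalescing at the seam when
    the last run of `a` and the first run of `b` share a language.
    Illustrative only — the real _merge_lang_segments may differ."""
    if not a:
        return list(b)
    if not b:
        return list(a)
    merged = list(a)
    if merged[-1][0] == b[0][0]:
        # same language on both sides of the seam: fuse into one run
        lang, text = merged[-1]
        merged[-1] = (lang, text + b[0][1])
        merged.extend(b[1:])
    else:
        merged.extend(b)
    return merged


print(merge_lang_segments([("en", "Hello ")], [("en", "world, "), ("es", "hola.")]))
# [('en', 'Hello world, '), ('es', 'hola.')]
```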

Workarounds / Alternatives

No response

Additional Context

No response

Labels

enhancement (New feature or request)