feat(stt): back-date START_OF_SPEECH onset via server-provided timestamp #5479
gsharp-aai wants to merge 10 commits into livekit:main
Conversation
Adds an optional SpeechEvent.speech_start_time field for STT plugins that receive a separate speech-onset signal with timing data, and uses it in audio_recognition.py to back-date _speech_start_time on STT START_OF_SPEECH events when local VAD has not fired. Without this, when local VAD does not detect audio that the STT does (e.g. quiet utterances near the activation threshold), _speech_start_time gets pinned to message arrival wall-clock. Because providers like AssemblyAI gate SpeechStarted behind the first partial transcript (so SpeechStarted and the first transcript arrive in the same network burst), this collapses _speech_start_time and _last_speaking_time onto the same timestamp, producing MetricsReport.speech_duration = 0.0s exactly. The framework's existing None-guard makes this strictly additive: VAD wins when it fires (its back-date is more accurate, computed locally on the audio path with no network delay). The STT timestamp is consulted only when _speech_start_time remains None at STT SOS arrival. Populates the new field from the AssemblyAI plugin by parsing SpeechStarted.timestamp (stream-relative ms), anchored to wall-clock via a new _stream_wall_start recorded when the first audio frame is sent.
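A minimal sketch of the None-guarded fallback described above, with hypothetical names (the real logic lives in the STT SOS handler in `audio_recognition.py`):

```python
import time


def resolve_stt_sos_onset(speech_start_time, ev_speech_start_time):
    """Pick the authoritative onset at STT START_OF_SPEECH arrival.

    `speech_start_time` stands in for the framework's `_speech_start_time`;
    `ev_speech_start_time` for the new optional field on the event.
    Names are illustrative, not the framework's actual API.
    """
    if speech_start_time is not None:
        # Local VAD fired first: its locally computed back-date wins.
        return speech_start_time
    if ev_speech_start_time is not None:
        # VAD missed the audio: trust the STT server's onset.
        return ev_speech_start_time
    # No timing signal at all: fall back to message-arrival wall-clock.
    return time.time()
```

The VAD-wins branch is what makes the change strictly additive: every existing code path returns the same value it did before the field was added.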
The contributing guide says contributors don't need to touch the CHANGELOG or package manifests — maintainers handle versioning. Shortening the docstring to match local conventions on existing fields.
Previous implementation computed a local stt_speech_start_time and unconditionally passed it to the on_start_of_speech hook, even when local VAD had already fired and set _speech_start_time. Downstream consumers of the hook (e.g. DynamicEndpointing._utterance_started_at) unconditionally overwrote their own state with that value, causing the STT server's back-dated onset to shift endpointing statistics by up to ~750ms whenever the VAD-fires-first path was exercised. Tighten to a single source of truth: if _speech_start_time is already set (VAD fired first), preserve it and pass it through to the hook. Only fall back to the STT's server-provided onset when _speech_start_time is None (VAD didn't fire). Zero observable change in the common case; corrects downstream state in the edge case.
Previous revision had two sources of onset time in `on_start_of_speech`: an optional `speech_start_time` kwarg and a `VADEvent` that could be back-dated. The "who wins" policy lived partly inside the function and partly at the STT call site, making the contract harder to read. Make `speech_start_time` a required parameter and push back-dating to each call site. `audio_recognition` now computes the authoritative onset at both SOS handlers (VAD's back-dated time for the VAD handler; VAD's back-date or the STT server timestamp for the STT handler) and hands a single value in. `AgentActivity.on_start_of_speech` drops its internal fallback logic and simply uses what it's given. No behavior change.
…amp=0 Two fixes from review:

1. `_stream_wall_start` was set in `__init__` and only re-set on the first audio frame, so after the base class's `_run()` retry path reconnects the WebSocket, the anchor still pointed at the original connection's first frame while the server's timestamps restarted at 0. All subsequent SpeechStarted-derived onsets were shifted into the past by however long prior connections ran. Reset at the top of `_run()` so the next first-frame send re-anchors it.
2. `data.get("timestamp", 0)` plus a truthiness check conflated an absent field with a legitimate `timestamp=0` (onset at stream start). Use `data.get("timestamp")` plus `is not None` so a real 0-ms onset converts to wall-clock instead of falling back to arrival time.
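The second fix can be sketched as follows (illustrative helper, not the plugin's actual code):

```python
def onset_from_payload(data, stream_wall_start):
    """Convert a stream-relative SpeechStarted timestamp (ms) to wall-clock.

    Uses `is not None` rather than truthiness so a legitimate 0-ms onset
    (speech at the very start of the stream) is not mistaken for a missing
    field, which `data.get("timestamp", 0)` plus a truthy check would do.
    """
    ts_ms = data.get("timestamp")  # no default: absent stays None
    if ts_ms is not None and stream_wall_start is not None:
        return stream_wall_start + ts_ms / 1000.0
    return None  # caller falls back to message-arrival time
```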
hi @gsharp-aai, thank you for the PR! it seems like this is a specialized framework change for the plugin, is the
Mirror the VAD path (audio_recognition.py:877), which passes speech_start_time to _ensure_user_turn_span so the telemetry span starts at actual speech onset. The STT path was leaving start_time unset, so the span would default to wall-clock-now at message arrival — disagreeing with MetricsReport.started_speaking_at, which this PR already back-dates via self._speech_start_time. Feed both from the same authoritative value.
Hey @tinalenguyen! Thanks for the comment. Just to separate concerns clearly:

Framework change

Currently there's no way for a plugin to pass this information through on the `SpeechEvent`.

Plugin change

Yes — because …

Overall, definitely open to plugin-side logic here! But I believe there would need to be at least some code on the framework side to let plugins propagate onset timing through `SpeechEvent`.
Thank you for the context @gsharp-aai, that makes a lot of sense. To ensure accuracy for user_speaking spans, I'm open to adding that field, I'll get back to you on what the rest of the team thinks
tinalenguyen
left a comment
we can add speech_start_time to SpeechEvent, i just added a small comment. otherwise everything looks good to me!
```python
with trace.use_span(self._ensure_user_turn_span()):
    self._hooks.on_start_of_speech(None)
    # If the plugin provided a server onset timestamp, use it;
    # otherwise fall back to message arrival time.
```
maybe we can add a condition where:

```python
self._speech_start_time = ev.speech_start_time if ev.speech_start_time < self._speech_start_time else self._speech_start_time
```

for when the vad detects activity before the stt as well
Open to this! Just want to flag it changes behavior from the current PR. Two shapes:

1. Fallback only (current PR as-is): `_speech_start_time` is only set from STT when VAD hasn't already set it. VAD wins when it fires, preserving current behavior.
2. Earlier of VAD or STT (your suggestion): every STT SOS compares both and picks the earlier onset, even when VAD already fired.

I leaned toward #1 since local VAD's back-date is usually more accurate than the server timestamp (no network delay, no clock skew), and it's less of a behavioral change relative to what currently exists — but happy to flip to #2 if you think the "STT caught it earlier" case is common enough to trust by default.

Let me know which shape the team prefers!
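For concreteness, the two shapes differ only in how the two candidate onsets are merged (hypothetical free functions, not code from the PR):

```python
def merge_fallback_only(vad_onset, stt_onset):
    # Shape 1 (current PR): VAD wins whenever it fired; STT is only a fallback.
    return vad_onset if vad_onset is not None else stt_onset


def merge_earlier_of(vad_onset, stt_onset):
    # Shape 2 (suggestion): take the earlier of whatever onsets exist.
    candidates = [t for t in (vad_onset, stt_onset) if t is not None]
    return min(candidates) if candidates else None
```

The two policies diverge only when both signals fire and the STT server's onset is earlier than the VAD's back-date.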
```python
# Reset on each (re)connection — the server's stream-relative timestamps
# restart at 0 with every new WebSocket, so the wall-clock anchor must
# also be re-captured from this connection's first frame.
self._stream_wall_start = None
```
we have a field called `start_time_offset` in the STT stream that plays a similar role, and it is assigned when the stream is initialized:

```python
stream.start_time_offset = time.time() - _audio_input_started_at
```

I think we can add a second field `stream.start_time` so that other STT implementations can use it as well.
Good call!
The key consideration is that a "server-provided onset timestamp" is anchored to whatever zero-point that provider defines, which will vary with each provider's server-side implementation. Because of that, I don't think the framework can reliably pin a single wall-clock moment that aligns with every provider's "zero" simultaneously — each plugin knows its own server's semantics and should probably own the anchoring moment.
What about putting the field on the base class (shared, discoverable, other plugins can adopt), seeding a framework default at init so plugins that don't override still get some value, and letting each plugin overwrite it at whatever moment corresponds to its own server's zero? The framework can handle resetting it on retries centrally, same pattern as start_time_offset.
Shape:

```python
# base class SpeechStream
self._start_time: float = time.time()  # framework default

@property
def start_time(self) -> float: ...

@start_time.setter
def start_time(self, value: float) -> None: ...
```

Plus a reset in `_main_task` across retries, same pattern as `start_time_offset`.

What do you think?
Edit: updated to seed a framework default and let plugins overwrite it, instead of leaving it as purely plugin-set.
That sounds reasonable. The framework provides a default, and plugins can override it if needed.
> The framework provides a default, and plugins can override it if needed.
Updated PR to reflect this
Promote the AAI plugin's _stream_wall_start into a first-class field on the base SpeechStream class. Framework seeds self.start_time with time.time() in __init__ and re-seeds on each retry in _main_task. Plugins can override via the public setter to anchor at a more accurate moment (e.g., AAI overwrites on first ws.send_bytes so the anchor aligns with the server's stream-relative zero). Other STT plugins that receive server-side onset timing can adopt this shared idiom without plugin-local state, and the framework's default prevents silent breakage when a plugin doesn't override. Per PR livekit#5479 review discussion with chenghao-mou.
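A sketch of the idiom this commit describes, with illustrative class and method names (the actual field, setter, and retry reset live on the base `SpeechStream` and in `_main_task`):

```python
import time


class SpeechStreamSketch:
    """Framework side: seed a wall-clock anchor with a safe default."""

    def __init__(self) -> None:
        self._start_time: float = time.time()  # framework default

    @property
    def start_time(self) -> float:
        return self._start_time

    @start_time.setter
    def start_time(self, value: float) -> None:
        if value < 0:
            raise ValueError("start_time must be non-negative")
        self._start_time = value

    def _on_retry(self) -> None:
        # Re-seed on each reconnection so a plugin override from a prior
        # connection doesn't leak across the server's timestamp reset.
        self._start_time = time.time()


class AssemblyAIStreamSketch(SpeechStreamSketch):
    """Plugin side: re-anchor at the first frame of each connection."""

    def __init__(self) -> None:
        super().__init__()
        self._anchored = False

    def send_frame(self, frame: bytes, ws_send) -> None:
        if not self._anchored:
            # Align the anchor with the server's stream-relative zero.
            self.start_time = time.time()
            self._anchored = True
        ws_send(frame)
```

The framework default means a plugin that never overrides still gets a usable (if less precise) anchor instead of `None`.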
Framework-level tests (tests/test_stt_base.py):

- start_time seeded to time.time() on stream init
- setter accepts valid values, rejects negatives
- start_time is re-seeded on retry so plugin overrides don't leak across reconnection attempts

Plugin-level tests (tests/test_plugin_assemblyai_stt.py):

- SpeechStarted handler converts timestamp_ms to wall-clock via self.start_time + timestamp_ms/1000
- timestamp=0 is treated as a valid onset (not "field missing")
- missing timestamp leaves speech_start_time=None (framework falls back to message-arrival time)
- base-class default is always set before any plugin override

Unit-only tests — no network; runs via `make unit-tests` under `tests/`.
@tinalenguyen @chenghao-mou Thank you for the reviews! I have updated the PR based on this feedback, plus added a few tests. Ready for re-review when the team is able. Thank you!
Summary
When `turn_detection="stt"` is used alongside a local VAD plugin (e.g. Silero), the framework records `_speech_start_time` from whichever event handler — VAD or STT — sets it first.

When local VAD fires for the audio, this works fine — the VAD handler back-dates `_speech_start_time` via `time.time() - speech_duration - inference_duration`, and the existing None-guard prevents the later STT `START_OF_SPEECH` event from overwriting it.

But when the local VAD does not fire for that audio (different model version, different acoustic threshold, different preprocessing — common at quiet/borderline volumes), `_speech_start_time` stays `None` until the STT `START_OF_SPEECH` arrives — at which point the framework falls back to `time.time()` at message arrival. If the STT's speech-onset signal and its first transcript arrive close together, the framework's `_speech_start_time` and `_last_speaking_time` end up pinned near the same wall-clock instant. Result: `MetricsReport.started_speaking_at ≈ stopped_speaking_at`, i.e. `speech_duration ≈ 0s` for any turn the local VAD missed.

This is unphysical (real audio was transcribed), breaks downstream analytics keyed on speech duration, and creates state inconsistency where a user turn commits without ever entering a meaningful "speaking" window.
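The VAD handler's back-date from the summary above, as a standalone sketch (parameter names are illustrative):

```python
import time


def vad_backdated_onset(speech_duration: float, inference_duration: float) -> float:
    # "Now" minus the speech already buffered and the VAD inference latency
    # approximates the true onset on the local audio path, with no network
    # delay or clock skew involved.
    return time.time() - speech_duration - inference_duration
```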
What this PR changes

Framework

- Adds a `speech_start_time: float | None = None` field on `SpeechEvent`. Plugins that receive a separate speech-onset signal with timing can populate it; when left `None`, the framework's STT SOS handler falls back to `time.time()` at message arrival (its current behavior).
- Updates `audio_recognition.py` to set `_speech_start_time` from `ev.speech_start_time` only when it's still `None` (i.e. local VAD hasn't fired first). When VAD has already set it, the VAD-back-dated value is preserved.
- Tightens the `RecognitionHooks.on_start_of_speech` protocol with a required `speech_start_time: float` parameter and threads the authoritative onset through it from both SOS handlers. Each handler computes its own authoritative onset locally (VAD back-dates from the VAD event; STT reads `_speech_start_time`) and passes a concrete value in — no ambiguity about which input wins at call time. All downstream state — `DynamicEndpointing._utterance_started_at`, `_user_speaking_span.start_time`, `UserStateChangedEvent.created_at` — reads from that single value.
- Updates `AgentActivity.on_start_of_speech` to drop its internal VAD-event back-dating and `time.time()` fallback, since the caller now always provides the authoritative onset.

AssemblyAI plugin

- Parses the `SpeechStarted.timestamp` field (stream-relative ms) that the plugin currently discards, converts it to wall-clock via a `_stream_wall_start` anchor recorded when the first audio frame is sent, and populates `SpeechEvent.speech_start_time` on the emitted `START_OF_SPEECH` event.

Why this is a safe fallback
Strictly additive. Every turn where local VAD fires is unaffected — `_speech_start_time` is already set by the VAD handler (its back-date is more accurate, computed locally with no network delay), the None-guard preserves it, and the same value flows through the hook to every downstream consumer. The STT-provided timestamp is only consulted when `_speech_start_time` is still `None` at STT `START_OF_SPEECH` arrival, i.e. exactly the case where local VAD missed the audio the STT caught.

Provider-side fallback (if you'd prefer not to add the field)
If a `SpeechEvent` schema change isn't desirable, the same outcome can be achieved without touching the framework: a plugin can pass through the back-dated time on the existing `SpeechData.start_time` field by attaching a synthetic `SpeechData` to the `START_OF_SPEECH` event's `alternatives` list. Functionally equivalent but semantically off (`SpeechData` is meant for transcription hypotheses, not event metadata), so the explicit field is preferred. Happy to switch if maintainers prefer to defer the schema change.

Files changed
- `livekit-agents/livekit/agents/stt/stt.py` — new optional field on `SpeechEvent`
- `livekit-agents/livekit/agents/voice/audio_recognition.py` — tighten `RecognitionHooks.on_start_of_speech` to require `speech_start_time: float`; set `_speech_start_time` from `ev.speech_start_time` under the None-guard; pass `self._speech_start_time` to the hook at the STT SOS call site; pass the locally-computed back-date at the VAD SOS call site
- `livekit-agents/livekit/agents/voice/agent_activity.py` — accept the required `speech_start_time` kwarg on `AgentActivity.on_start_of_speech`; internal fallback logic removed
- `livekit-plugins/livekit-plugins-assemblyai/livekit/plugins/assemblyai/stt.py` — anchor `_stream_wall_start` on first frame; parse `SpeechStarted.timestamp` and populate `SpeechEvent.speech_start_time`

Test plan
- `make format lint type-check` pass