fix(amd): cancel short_greeting timer on late STT transcript #1390
Merged
toubatbrian merged 16 commits into main on May 7, 2026
Ports python livekit/agents#5584 (AMD improvement) into agents-js.
- Expose `humanSpeechThresholdMs`, `humanSilenceThresholdMs`, `machineSilenceThresholdMs`, and `prompt` as `AMDOptions` fields.
- Defer to the LLM (instead of forcing HUMAN) when a transcript is already available after a short greeting.
- Add `postpone_termination` LLM tool (capped at 3 extensions × 10s) alongside `save_prediction`; fall back to JSON-content parsing when the LLM does not emit tool calls.
- Add `participantIdentity` and `suppressCompatibilityWarning` options.
- Warn once when the resolved LLM is not in `EVALUATED_LLM_MODELS`.

Skipped (architectural divergence, see PR description): dedicated AMD STT pipeline, track-subscription wait, and the `start()` / `start_timers()` lifecycle split.
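The extension cap mentioned above (3 extensions of at most 10 s each) can be sketched as a small budget object. This is an illustrative sketch, not the code in `amd.ts`: `PostponeBudget`, `request`, `remaining`, and both constant names are hypothetical, and the sketch assumes `requestedMs` has already been validated as a positive finite number.

```typescript
// Hypothetical names throughout; only the "3 extensions x 10 s" cap comes from the commit message.
const MAX_EXTENSIONS = 3;
const MAX_EXTENSION_MS = 10_000;

class PostponeBudget {
  private used = 0;

  /**
   * Grant an extension of at most MAX_EXTENSION_MS, or return undefined
   * once MAX_EXTENSIONS extensions have already been granted.
   */
  request(requestedMs: number): number | undefined {
    if (this.used >= MAX_EXTENSIONS) return undefined;
    this.used += 1;
    return Math.min(requestedMs, MAX_EXTENSION_MS);
  }

  /** Number of extensions that can still be granted. */
  get remaining(): number {
    return MAX_EXTENSIONS - this.used;
  }
}
```

A caller would treat an `undefined` result as "budget exhausted" and let the existing timer run out instead of rescheduling it.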
- Gate `save_prediction` and `postpone_termination` tool side effects on the current `detectGeneration`. Stale in-flight classifications now no-op instead of mutating timers, budget, or capturing a verdict that belongs to a superseded transcript window.
- Normalize `save_prediction`'s `label` argument through `parseCategory` before storing, so an off-enum value from a misbehaving LLM (or our manual JSON path that bypasses Zod) is treated as UNCERTAIN rather than producing an `AMDResult` with an invalid category string.
- Fix `warnIfNotEvaluated` substring check to also handle date-suffixed model names (e.g. `openai/gpt-4.1-mini-2025-04-14`).
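One possible way to accept date-suffixed model names is to strip a trailing `-YYYY-MM-DD` snapshot suffix before the lookup. The sketch below is an assumption about the approach, not the actual `warnIfNotEvaluated` logic: `isEvaluatedModel`, `DATE_SUFFIX`, and the one-entry allow-list are hypothetical stand-ins, since the real contents of `EVALUATED_LLM_MODELS` are not shown here.

```typescript
// Hypothetical allow-list; the real EVALUATED_LLM_MODELS contents are not shown in this PR.
const EVALUATED_MODELS: readonly string[] = ['openai/gpt-4.1-mini'];

// Matches a trailing "-YYYY-MM-DD" snapshot suffix such as "-2025-04-14".
const DATE_SUFFIX = /-\d{4}-\d{2}-\d{2}$/;

/** True when the model, with or without a date suffix, appears in the allow-list. */
function isEvaluatedModel(
  model: string,
  evaluated: readonly string[] = EVALUATED_MODELS,
): boolean {
  const base = model.replace(DATE_SUFFIX, '');
  return evaluated.includes(model) || evaluated.includes(base);
}
```

Stripping only a strict date pattern keeps unrelated variants (for example a name ending in `-preview`) from being treated as evaluated.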
Without this, a postpone_termination tool call resolved after aclose() would still see isStale() === false (settled was never flipped) and install a fresh silenceTimer that survives cleanup, eventually firing scheduleLLMClassification + tryEmitResult and potentially triggering session.interrupt on a closed AMD.
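The staleness rule described in these commit messages can be sketched as a guard that combines a generation counter with a `settled` flag. This is an illustrative sketch only: `GenerationGuard`, `next`, and `close` are hypothetical names, and the real class keeps this state alongside its timers rather than in a separate object.

```typescript
// Hypothetical helper; it mirrors the idea of detectGeneration plus a settled flag.
class GenerationGuard {
  private generation = 0;
  private settled = false;

  /** Start a new transcript window; work captured under older generations becomes stale. */
  next(): number {
    this.generation += 1;
    return this.generation;
  }

  /** Mark the detector as closed; every outstanding generation becomes stale. */
  close(): void {
    this.settled = true;
  }

  /** A callback should no-op when this returns true for the generation it captured. */
  isStale(generation: number): boolean {
    return this.settled || generation !== this.generation;
  }
}
```

A tool callback would capture the generation when the classification starts and check `isStale` before touching timers, so a call that resolves after close cannot install a new timer.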
Without a lower bound and NaN guard, a misbehaving LLM passing a negative or non-numeric `seconds` argument would compute a clampedMs of NaN or a negative number, which setTimeout treats as 0 and fires immediately. The manual tool-execution path here bypasses the Zod schema, so this defense lives in execute().
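A minimal sketch of such a guard follows. It assumes the unusable values are rejected outright; the actual `execute()` may instead clamp to a minimum. `clampExtensionMs` and `MAX_EXTENSION_MS` are hypothetical names.

```typescript
// Hypothetical cap, based on the 10 s per-extension limit mentioned in the commit messages.
const MAX_EXTENSION_MS = 10_000;

/**
 * Convert an untrusted `seconds` argument into a safe setTimeout delay.
 * Returns undefined for non-numeric, NaN, infinite, zero, or negative input
 * so the caller can ignore the request instead of scheduling an immediate timer.
 */
function clampExtensionMs(seconds: unknown): number | undefined {
  if (typeof seconds !== 'number' || !Number.isFinite(seconds) || seconds <= 0) {
    return undefined;
  }
  return Math.min(seconds * 1000, MAX_EXTENSION_MS);
}
```

Taking `unknown` as the parameter type reflects that the manual tool-execution path receives arguments that have not passed schema validation.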
🦋 Changeset detected
Latest commit: a2c8caa
The changes in this PR will be included in the next version bump. This PR includes changesets to release 29 packages.
chenghao-mou approved these changes on May 5, 2026
chenghao-mou reviewed on May 5, 2026
Port of livekit/agents#5637. When a final STT transcript arrives inside the short-speech HUMAN_SILENCE_THRESHOLD window, cancel the pre-baked HUMAN/short_greeting silence timer and replace it with a long_speech timer anchored at speechEndedAt + MACHINE_SILENCE_THRESHOLD_MS so the LLM verdict gets the final word. https://claude.ai/code/session_017SqU9Zxmo439ZtcdwzKZp9
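The timer replacement described above comes down to one piece of arithmetic: the new delay is measured from when speech ended, not from when the transcript arrived. The sketch below isolates that decision as a pure function. `planAfterTranscript` and `TimerPlan` are hypothetical names, and the real code cancels and reschedules an actual timer rather than returning a plan.

```typescript
type SilenceTrigger = 'short_speech' | 'long_speech';

interface TimerPlan {
  trigger: SilenceTrigger;
  delayMs: number;
}

/**
 * Decide what to do with the silence timer when a final transcript arrives.
 * Returns undefined when the active timer should be left alone.
 */
function planAfterTranscript(
  current: SilenceTrigger | undefined,
  speechEndedAt: number | undefined,
  machineSilenceThresholdMs: number,
  now: number,
): TimerPlan | undefined {
  // Only a pre-baked short-speech timer is replaced.
  if (current !== 'short_speech' || speechEndedAt === undefined) return undefined;
  // Anchor at speechEndedAt, and never pass a negative delay to the scheduler.
  const delayMs = Math.max(0, speechEndedAt + machineSilenceThresholdMs - now);
  return { trigger: 'long_speech', delayMs };
}
```

For example, with speech ending at t = 1000 ms, a 300 ms machine threshold, and a transcript arriving at t = 1040 ms, the replacement timer is scheduled 260 ms out, so it still fires at t = 1300 ms.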
- added SIP code in the example;
- added support for separate STT;
- added support for participant wait;
- added default models
Summary
Automated port of livekit/agents#5637 (`fix(amd): reset timer for late stt transcript`) into `agents-js`.

Note
This is an automated Claude Code Routine created by @toubatbrian. It is currently at an experimental stage.

cc @toubatbrian @livekit/agent-devs for review.
Why
The python fix addresses a race in the AMD silence-timer state machine: after a short utterance, AMD pre-bakes a HUMAN/`short_greeting` verdict that will fire at `speechEndedAt + humanSilenceThresholdMs`. If the STT transcript arrives inside that window, which is common when STT settles a beat slow, the short-greeting timer would still fire before the LLM had a chance to look at the transcript, causing AMD to settle as HUMAN/`short_greeting` instead of running the LLM.

The JS implementation (`agents/src/voice/amd.ts`) had the same bug. This PR ports the python fix.

What was ported
All changes land in `agents/src/voice/amd.ts`. The python fix introduces a `_silence_timer_trigger` field tagging the active silence timer as either `"short_speech"` (pre-baked HUMAN) or `"long_speech"` (waiting for LLM/timeout). On `push_text`, if the trigger is `"short_speech"` the timer is cancelled and replaced with a `"long_speech"` timer anchored at `speech_ended_at + machine_silence_threshold`.

JS mirror:

| Python (`livekit-agents/livekit/agents/voice/amd/classifier.py`) | JS (`agents/src/voice/amd.ts`) |
| --- | --- |
| `_silence_timer_trigger: Literal["short_speech", "long_speech"] \| None` | `silenceTimerTrigger: 'short_speech' \| 'long_speech' \| undefined` |
| `_speech_ended_at: float \| None` | `speechEndedAt: number \| undefined` |
| `on_user_speech_ended` (short branch) | `handleUserStateChanged` short branch |
| `on_user_speech_ended` (long branch) | `handleUserStateChanged` long branch |
| `push_text` when trigger == `"short_speech"` | `handleTranscript` when trigger === `'short_speech'` |
| `_silence_timer_callback` / `on_user_speech_started` / `close` | `onSilenceTimerFired` / `clearTimer('silence')` (covers user-speech-started + cleanup) / `resetState` |

The replacement long-speech timer is computed against the current wall clock and uses the configurable `machineSilenceThresholdMs` field exposed by #1368. This preserves the python invariant that the timer fires at `speech_ended + machine_silence_threshold`, not `push_text + machine_silence_threshold`.

The new `silenceTimerTrigger` is also set on the existing-transcript short-speech branch (added in #1368) and on the long-speech branch, so future `push_text` calls correctly identify which timer is active.

What was intentionally not ported
- `tests/test_amd_classifier.py` (250 lines, exercises the python `_make_classifier` with custom thresholds): with `humanSilenceThresholdMs` / `machineSilenceThresholdMs` available as constructor options, the new JS tests use shorter custom thresholds (100 ms / 300 ms) so the suite stays fast. The python invariants (timer cancel, long-speech replacement, short-greeting still fires when no transcript arrives) are covered.
- `makefile` change to add `tests/test_amd_classifier.py` to the unit-test suite: not needed with `vitest` test discovery; the new tests are co-located in `amd.test.ts` and are picked up automatically.

Implementation nuances
- Python uses `time.time()` and stores `_speech_ended_at` as an epoch float, then schedules `call_later(remaining)` where `remaining = (_speech_ended_at + machine_silence_threshold) - time.time()`. JS uses `Date.now()` and stores `speechEndedAt` from `UserStateChangedEvent.createdAt` (also `Date.now()`-based, see `agents/src/voice/events.ts`). The arithmetic is identical.
- Python clears `_silence_timer` and `_silence_timer_trigger` at the top of `_silence_timer_callback`. JS does the same in `onSilenceTimerFired`, so a `push_text` arriving after the timer has fired (but before `tryEmitResult` settles the run) doesn't see a stale `'short_speech'` tag.
- `clearTimer('silence')` resets the trigger. Centralising the trigger reset in `clearTimer` covers all the python "cancel + null" sites (`on_user_speech_started`, the cancel-before-replace inside `push_text`, `close`, and the postpone-termination path in feat(amd): port tunable params and postpone-termination tool from python #1368).

Test plan
Two new unit tests in `agents/src/voice/amd.test.ts` (using the configurable thresholds from #1368, so total real-time cost is sub-second):
- `should not fire short_greeting when a transcript arrives late`: emits a 50 ms speech-ended transition, waits 40 ms (still inside the 100 ms HUMAN window), then emits a final transcript. Asserts the resolved verdict has `reason === 'llm-verified'` (from the StaticLLM), not `'short_greeting'`.
- `should still fire short_greeting when no transcript arrives`: emits the same speech transition with no transcript and asserts the resolved verdict has `reason === 'short_greeting'`. Guards against accidentally regressing the HUMAN heuristic.

Local verification:
- `pnpm --filter @livekit/agents build`: passes
- `pnpm exec eslint agents/src/voice/amd.ts agents/src/voice/amd.test.ts`: 0 errors / 0 warnings
- `pnpm exec prettier --check agents/src/voice/amd.ts agents/src/voice/amd.test.ts .changeset/amd-late-stt-cancel-short-greeting.md`: clean
- `pnpm exec vitest run agents/src/voice/amd.test.ts`: 8/8 pass (6 existing + 2 new)

Changeset
`patch` for `@livekit/agents` (per the routine's standing instructions).

Generated by Claude Code