Bug Description
When using `google.realtime.RealtimeModel` for multi-turn voice conversations, I observed response latency escalating from ~1s to 20–50s as conversations progress. Attempting to mitigate this by switching to external VAD (Silero + `automatic_activity_detection.disabled=True`) made things worse — introducing repeated `generate_reply timed out` errors on top of the existing latency escalation.
Investigation revealed two independent SDK-side issues:
**Bug 1 — `generate_reply()` conflicts with the activity-based audio flow**

When external VAD is used (`automatic_activity_detection.disabled=True`), the SDK calls `generate_reply()` after Gemini has already begun processing audio delivered via `activity_start/end` signals. This injects a redundant `ActivityEnd` plus a `"."` placeholder turn, causing a state conflict and a 5-second timeout.
The signal flow conflict:
```
1. User speaks → Silero VAD → activity_start → push_audio → activity_end
   → Gemini receives audio via activity signals, BEGINS PROCESSING
2. STT + EOU completes → on_user_turn_completed → SDK default path:
   → commit_audio()         ← no-op in Google plugin (L1380), just logs warning
   → _generate_reply()      ← user_message=None (see Bug 2)
   → _realtime_reply_task() → rt_session.generate_reply():
      a. Sends ActivityEnd  ← REDUNDANT, already sent in step 1
      b. Sends send_client_content(text=".", role="user", turn_complete=True)
         ← CONFLICTS with step 1: Gemini is mid-processing the audio turn
      c. Waits 5s for generation_created event ← NEVER FIRES → timeout
```
Code references (v1.5.2):
- `realtime_api.py` L730–733: `generate_reply()` sends `ActivityEnd` when `_in_user_activity`
- `realtime_api.py` L743–745: sends `send_client_content(text=".", turn_complete=True)`
- `realtime_api.py` L748–753: 5-second timeout → `generate_reply timed out` error
**Bug 2 — STT transcript unconditionally discarded**

After `on_user_turn_completed` returns, the SDK hardcodes:
```python
# agent_activity.py L1832
if isinstance(self.llm, llm.RealtimeModel):
    user_message = None  # type: ignore
```
This discards the STT transcript unconditionally. `_generate_reply()` receives `user_message=None`, and `_realtime_reply_task` passes `user_input=None` — the full STT transcript is never forwarded to Gemini.

This means there is no supported way to use external STT text as input to a `RealtimeModel`, which blocks the most effective workaround for Gemini's audio token accumulation problem (server-side issue tracked separately via Google Issue Tracker #493438050).
Expected Behavior
- When external VAD is used with `RealtimeModel`, `generate_reply()` should not conflict with Gemini's activity-based audio processing — no redundant `ActivityEnd` or `"."` placeholder should be injected while Gemini is mid-processing an audio turn.
- When external STT is configured alongside a `RealtimeModel`, the SDK should provide an option to forward the STT transcript to the realtime model instead of unconditionally discarding it.
Reproduction Steps
**Step 1 — Observe latency escalation with native Gemini VAD:**
```python
session = AgentSession(
    llm=google.realtime.RealtimeModel(
        model="gemini-2.5-flash-native-audio-preview-12-2025",
        # automatic_activity_detection defaults to True (native Gemini VAD)
    ),
)
```
Conduct a 5+ turn voice conversation. Response latency escalates (measured from OpenTelemetry traces: `user_speaking` END → `agent_speaking` START):
| Session | Turn gaps (s) | Avg | Max |
|---|---|---|---|
| Session A | 13.3 → 1.2 → 6.4 → **22.8** → **30.6** → 5.2 → 2.8 | 11.8 | 30.6 |
| Session B | 7.5 → 1.1 → **22.6** → 1.1 → **26.5** | 11.7 | 26.5 |
| Session C | 2.6 → 12.5 → 1.0 → 17.9 → 1.2 → 16.8 → **49.8** → 1.1 | 12.9 | 49.8 |
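For reference, the Avg/Max columns above can be recomputed from the gap lists (a quick check script, not part of the repro; one-decimal rounding of the displayed gaps can shift the average by ±0.1):

```python
# Recompute Avg/Max from the raw turn gaps listed in the table above.
sessions = {
    "A": [13.3, 1.2, 6.4, 22.8, 30.6, 5.2, 2.8],
    "B": [7.5, 1.1, 22.6, 1.1, 26.5],
    "C": [2.6, 12.5, 1.0, 17.9, 1.2, 16.8, 49.8, 1.1],
}
for name, gaps in sessions.items():
    print(f"Session {name}: avg={sum(gaps) / len(gaps):.1f}s  max={max(gaps):.1f}s")
```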
**Step 2 — Switch to external VAD → `generate_reply timed out`:**
Following the [Gemini turn detection docs](https://docs.livekit.io/agents/models/realtime/plugins/gemini/#turn-detection), switch to external VAD:
```python
session = AgentSession(
    vad=silero.VAD.load(...),
    stt=inference.STT(model="deepgram/nova-2", language="zh-TW"),
    turn_handling=TurnHandlingOptions(turn_detection=MultilingualModel()),
    llm=google.realtime.RealtimeModel(
        model="gemini-2.5-flash-native-audio-preview-12-2025",
        realtime_input_config=RealtimeInputConfig(
            automatic_activity_detection=AutomaticActivityDetection(disabled=True),
        ),
    ),
)
```
Result — latency gets **worse**, not better. Logs show:
```
WARN  commit_audio is not supported by Gemini Realtime API. (×5)
ERROR failed to generate a reply: generate_reply timed out (×4)
      waiting for generation_created event.
```
| Session | Turn gaps (s) | Avg | Max | 2nd-half ratio |
|---|---|---|---|---|
| Session D (ext VAD) | 6.2 → 9.4 → **14.7** → **26.4** → **23.8** → **38.1** | 19.8 | 38.1 | **2.9×** |
External VAD is worse (avg 19.8s vs native VAD 11–13s) because the `generate_reply timed out` errors add 5s+ of wasted time per turn on top of the audio token accumulation.
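Assuming "2nd-half ratio" means the average gap in the second half of the session divided by the average in the first half, Session D's value reproduces:

```python
# Session D turn gaps from the table above (external VAD).
gaps_d = [6.2, 9.4, 14.7, 26.4, 23.8, 38.1]
half = len(gaps_d) // 2
first, second = gaps_d[:half], gaps_d[half:]
ratio = (sum(second) / len(second)) / (sum(first) / len(first))
print(f"2nd-half ratio: {ratio:.1f}x")  # prints 2.9x: latency roughly triples within the session
```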
Operating System
Linux (LiveKit Cloud deployment); also reproduced locally on Ubuntu 24.04.
Models Used
gemini-2.5-flash-native-audio-preview-12-2025, deepgram/nova-2
Package Versions
livekit-agents==1.5.2
livekit-plugins-google==1.5.2
livekit-plugins-silero==1.5.2
livekit-plugins-turn-detector==1.5.2
Session/Room/Call IDs
No response
Proposed Solution
Two independent fixes:
**Fix 1 — Prevent `generate_reply()` conflict with activity-based flow (`realtime_api.py`)**
When Gemini has already received audio via the `activity_start/end` path and is processing it, `generate_reply()` should not inject a redundant `ActivityEnd` and `"."` placeholder turn. Options:
- Track whether an active audio turn is being processed; if so, skip the placeholder and wait for the natural `generation_created` event from the audio path
- Or skip `generate_reply()` entirely when `_in_user_activity` was recently `True`
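A runnable sketch of the first option — skip the placeholder when an audio turn is already in flight. All names here are hypothetical stand-ins, not the actual v1.5.2 internals:

```python
import asyncio

class RealtimeSessionSketch:
    """Illustrative stub only; names do not match the plugin's real classes."""

    def __init__(self) -> None:
        self._audio_turn_in_flight = False  # set on activity_start, cleared when generation_created arrives
        self.sent_placeholder = False

    async def generate_reply(self) -> str:
        if self._audio_turn_in_flight:
            # Audio already reached Gemini via activity_start/end: the server
            # will emit generation_created on its own, so injecting ActivityEnd
            # plus a "." placeholder turn would only cause a state conflict.
            return "wait-for-natural-generation"
        # Text-initiated turn: the existing placeholder path is still valid.
        self.sent_placeholder = True
        return "placeholder-turn"

async def main() -> None:
    sess = RealtimeSessionSketch()
    sess._audio_turn_in_flight = True  # external VAD delivered an audio turn
    print(await sess.generate_reply())

asyncio.run(main())
```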
**Fix 2 — Support text-input mode for `RealtimeModel` (`agent_activity.py`)**
Provide an opt-in to forward the STT transcript to the realtime model instead of discarding it. The current hardcoded `user_message = None` at L1832 makes a text-input workaround impossible without monkey-patching:
```python
# Current (agent_activity.py L1832):
if isinstance(self.llm, llm.RealtimeModel):
    user_message = None  # ignore stt transcription for realtime model

# Proposed: respect a capability flag or configuration option
if isinstance(self.llm, llm.RealtimeModel):
    if not getattr(self.llm.capabilities, "text_input_mode", False):
        user_message = None  # existing behavior for native audio mode
    # else: keep user_message → _realtime_reply_task forwards it via update_chat_ctx
```
This would let users opt in to text-input mode for realtime models, enabling the effective workaround for Gemini's audio token accumulation without requiring internal monkey-patches.
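The proposed branch can be exercised in isolation with stub types. The `text_input_mode` capability flag is this report's proposal, not an existing SDK option:

```python
from typing import Optional

class Capabilities:
    def __init__(self, text_input_mode: bool = False) -> None:
        self.text_input_mode = text_input_mode

class StubRealtimeModel:
    """Stand-in for llm.RealtimeModel carrying the proposed capability flag."""
    def __init__(self, text_input_mode: bool = False) -> None:
        self.capabilities = Capabilities(text_input_mode)

def resolve_user_message(model: StubRealtimeModel, user_message: Optional[str]) -> Optional[str]:
    # Proposed behavior: discard the STT transcript only in native audio
    # mode; forward it when the model opts in to text input.
    if not getattr(model.capabilities, "text_input_mode", False):
        return None  # existing v1.5.2 behavior
    return user_message

print(resolve_user_message(StubRealtimeModel(False), "hello"))  # None
print(resolve_user_message(StubRealtimeModel(True), "hello"))   # hello
```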
Additional Context
**Verified workaround (requires monkey-patching):**

By monkey-patching `push_audio` and `start_user_activity` to no-ops (preventing audio from reaching Gemini) and forwarding the STT transcript via `on_user_turn_completed` + `generate_reply(user_input=text)` + `StopResponse`, latency becomes completely stable:
| Session | Transport | Turn gaps (s) | Avg | Max | 2nd-half ratio |
|---|---|---|---|---|---|
| Session E | WebRTC | 1.9 → 1.9 → 2.2 → 2.0 → 2.3 → 1.9 → 1.9 | 2.0 | 2.3 | 1.0× |
| Session F | SIP | 17.2 → 4.1 → 5.3 → 2.2 → 6.1 → 2.1 → 1.8 | 5.6 | 17.2 | 0.3× ↓ |
Session E (WebRTC): 7 turns over 126s, all within 1.9–2.3s with zero escalation. No `generate_reply timed out` errors in either session.
Workaround code
```python
class TextInputRealtimeModel(google.realtime.RealtimeModel):
    """Intercept audio push; use text-input mode to avoid audio token accumulation."""

    def session(self):
        sess = super().session()
        sess.push_audio = lambda frame: None
        sess.start_user_activity = lambda: None
        return sess

# In Agent.on_user_turn_completed:
async def on_user_turn_completed(self, turn_ctx, new_message) -> None:
    user_text = new_message.text_content
    if user_text:
        self.session.generate_reply(user_input=user_text)
    raise StopResponse()  # block SDK default path (which would send "." placeholder)
```
**Background — Gemini audio token accumulation:**
The Gemini Live API accumulates audio tokens in the session context (~25 tokens/sec). Over a multi-turn conversation, the growing context causes Gemini's per-turn processing time to increase linearly. This is a server-side Gemini behavior (tracked separately via Google Issue Tracker #493438050), but it creates a strong need for a text-input mode to keep latency stable. The SDK currently blocks this approach due to Bugs #1 and #2 above.
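Assuming the ~25 tokens/sec figure, the audio context grows quickly enough to explain the per-turn escalation:

```python
# Back-of-envelope audio context growth at ~25 tokens/sec of input audio.
TOKENS_PER_SEC = 25
for minutes in (1, 5, 10):
    tokens = minutes * 60 * TOKENS_PER_SEC
    print(f"{minutes:>2} min of audio ≈ {tokens:,} audio tokens in the session context")
```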
Related issues:
- `generate_reply timed out` (closed, 20+ reports; timeout mechanism patched in "fix generate_reply timeout for gemini" #4237 — but the root cause here is different)
- `generate_reply timed out` (closed)
- `generate_reply` not working in Gemini realtime (closed, 32 comments)
Screenshots and Recordings
No response