feat(realtime): add_to_chat_ctx on generate_reply() #5605
Draft
cphoward wants to merge 7 commits into livekit:main
Conversation
…lity

Adds:

- `add_to_chat_ctx: bool = True` keyword-only parameter to the abstract `RealtimeSession.generate_reply`, with a Concurrency docstring section.
- `AgentSession.generate_reply`: same parameter; raises `NotImplementedError` when combined with a non-realtime LLM.
- `RealtimeCapabilities.ephemeral_response: bool = False` field.
- `AgentActivity` dispatcher capability gate: emits `DeprecationWarning` and falls back to `add_to_chat_ctx=True` for plugins that do not declare the capability; emits a separate `DeprecationWarning` when the caller combines `add_to_chat_ctx=False` with non-empty `tools`/`tool_choice`.
- OpenAI plugin sets `ephemeral_response=True` for non-Azure endpoints (the Azure path stays on the legacy fallback until `conversation: "none"` semantics are verified there).
- OpenAI plugin `generate_reply` accepts the new kwarg; the substrate-level behavior lands in a follow-up commit.

The default `add_to_chat_ctx=True` preserves all existing behavior; the default `ephemeral_response=False` preserves all existing plugin behavior.
…ency hardening for ephemeral responses

When `generate_reply` is called with `add_to_chat_ctx=False`, the OpenAI plugin now sets `conversation: "none"` on the outbound `response.create` event so the substrate does not enter the response into its persistent conversation state, force-overrides `params.tools=[]` / `params.tool_choice="none"` (with a `logger.warning` if the caller passed any), and suppresses the `openai_client_event_queued` / `openai_server_event_received` emits and `LK_OPENAI_DEBUG` log output for the response lifecycle.

Adds a single-isolated-call serialization contract: a second `generate_reply(add_to_chat_ctx=False)` issued while the first is in flight raises `RuntimeError` with diagnostic context (in-flight `client_event_id`, `response_id`, elapsed-since-issue, docstring section reference). Default `add_to_chat_ctx=True` calls retain their existing concurrency semantics (the substrate enforces serialization of default-conversation responses). The contract is the API behavior, not a temporary limitation.

Closes a shadow-state leak path: items belonging to an in-flight ephemeral response now skip `_remote_chat_ctx.insert()` and the `remote_item_added` emit in `_handle_conversion_item_added`, so the rendered text cannot leak via `session.chat_ctx` or the reconnection-replay state.

Concurrency hardening (re-included from the prior closed PR, commit 9241857): the orphan filter at `_handle_response_created` (a verbatim port, including the `isinstance(metadata, dict)` server-VAD bypass), restructured to run BEFORE `_current_generation` is assigned. Nine bare-assert handler sites are converted to early-return on `_current_generation is None` (`output_item_added`, `content_part_added`, `text_delta`, `text_done`, `audio_transcript_delta`, `audio_delta`, `audio_done`, `output_item_done`, `_handle_function_call`) so the substrate's parallel out-of-band path (reachable via orphans, server-VAD overlap, reconnection-mid-response, or timeout races) cannot crash the session.
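A minimal sketch of the single-isolated-call contract, assuming a hypothetical `EphemeralResponseSlot` helper (the real plugin tracks this state across several dicts, and the class name and method names here are illustrative):

```python
import time


class EphemeralResponseSlot:
    """Holds at most one in-flight isolated response per session."""

    def __init__(self) -> None:
        # (client_event_id, server response_id or None, issue timestamp)
        self._inflight: tuple[str, str | None, float] | None = None

    def begin(self, client_event_id: str) -> None:
        """Reserve the slot; raise with diagnostic context if occupied."""
        if self._inflight is not None:
            ev_id, resp_id, issued_at = self._inflight
            raise RuntimeError(
                "generate_reply(add_to_chat_ctx=False) already in flight: "
                f"client_event_id={ev_id!r}, response_id={resp_id!r}, "
                f"elapsed={time.monotonic() - issued_at:.2f}s "
                "(see generate_reply docstring, section Concurrency)"
            )
        self._inflight = (client_event_id, None, time.monotonic())

    def on_response_created(self, response_id: str) -> None:
        # Record the server-assigned id once response.created arrives.
        if self._inflight is not None:
            ev_id, _, issued_at = self._inflight
            self._inflight = (ev_id, response_id, issued_at)

    def on_response_done(self) -> None:
        # Natural completion releases the slot for the next isolated call.
        self._inflight = None
```

Both rejection arms in the tests map onto this shape: pre-creation (slot reserved, `response_id` still `None`) and post-creation (`response_id` filled in).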
Tests cover:

- serialization-contract rejection (pre-creation + post-creation arms),
- `conversation: "none"` on the wire (asserted on serialized JSON bytes),
- tool override at the plugin layer,
- default-during-isolated proceeds,
- ephemeral state cleanup on `response.done`,
- the orphan filter (with the metadata-`None` bypass),
- all 9 handler guards on `None` generation,
- and a live-substrate smoke test against gpt-realtime that verifies audibility and behavioral isolation.
…meral responses

Threads the post-capability-gate `effective_add_to_chat_ctx` value through `AgentActivity._realtime_reply_task` -> `AgentActivity._realtime_generation_task` -> `AgentActivity._realtime_generation_task_impl`, then gates all three `_chat_ctx._upsert_item` sites in the realtime-generate-reply path:

- the function-call upsert (defense-in-depth: tools are forced off at the plugin layer for isolated turns, so the callback should not fire, but the gate covers any future plugin that does not honor the override);
- the assistant-message upsert plus its dependent emits (`_conversation_item_added`, `speech_handle._item_added`) and the OTel `ATTR_RESPONSE_TEXT` span attribute that tracing backends would otherwise log;
- the function-call-output upsert plus the corresponding `_tool_items_added` emit (also defense-in-depth).

Adds a defensive gate at `AgentActivity._on_remote_item_added` that early-returns when the inbound item id is in `self._rt_session._ephemeral_remote_item_ids`. Uses a duck-typed `getattr(..., set())` lookup so plugins that have not opted into ephemeral support continue to behave normally. The OpenAI plugin already drops these items at the source in `_handle_conversion_item_added` (Phase 2); this gate is a defensive second layer for future plugin implementations.

Tests cover: ephemeral item skipped at remote-item-added, calibration that non-ephemeral items still pass through, behavior preserved for plugins without the ephemeral attribute, and a live-substrate end-to-end check that an isolated `generate_reply` against gpt-realtime does not pollute `session.chat_ctx` with the rendered text.
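The duck-typed gate can be sketched as follows (the attribute name follows the PR text; the helper function and the two plugin classes are toy stand-ins):

```python
def should_skip_remote_item(rt_session: object, item_id: str) -> bool:
    """Defensive gate at _on_remote_item_added: skip items belonging to an
    in-flight ephemeral response. The getattr default means plugins that
    never define _ephemeral_remote_item_ids fall through to normal
    behavior."""
    ephemeral_ids = getattr(rt_session, "_ephemeral_remote_item_ids", set())
    return item_id in ephemeral_ids


class _EphemeralPlugin:
    """Stand-in for a plugin that opted into ephemeral support."""
    def __init__(self) -> None:
        self._ephemeral_remote_item_ids = {"item_abc"}


class _LegacyPlugin:
    """Stand-in for a plugin with no ephemeral support declared."""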
…rks for isolated responses

For each in-flight isolated response, `interrupt()` now sends `ResponseCancelEvent` with the server-assigned `response_id` and drains the ephemeral tracking dicts (`_active_ephemeral_response_ids`, `_ephemeral_event_ids`, `_ephemeral_started_at`). Cancel-without-id is silently a no-op for out-of-band responses on the OpenAI substrate, so isolated turns could not be stopped before this change.

The default cancel-without-id is preserved as the race-window fallback: between the `response.create` send and the `response.created` arrival the server-assigned `response_id` is not yet known, and although cancel-without-id is a no-op for out-of-band responses, it remains safe to issue in that window.

The cleanup-on-`response.done` that the serialization contract relies on already landed in the previous fused commit; this commit adds the test confirming that all four ephemeral tracking structures drain when a response completes naturally.

Tests cover: cancel carries `response_id` when an isolated response is in flight (verified on the serialized event), race-window fallback issues the default no-id cancel without raising, no-active-generation interrupt is a no-op.
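A sketch of the cancel wiring, assuming a hypothetical `build_cancel_event` helper; the event shape follows the OpenAI Realtime API `response.cancel` client event, which accepts an optional `response_id`:

```python
def build_cancel_event(active_ephemeral_response_ids: set[str]) -> dict:
    """Attach the server-assigned id when an isolated response is in
    flight; otherwise keep the default no-id cancel as the race-window
    fallback (a no-op for out-of-band responses on the substrate)."""
    if active_ephemeral_response_ids:
        # At most one isolated response at a time per the serialization
        # contract, so any member of the set is the in-flight one.
        response_id = next(iter(active_ephemeral_response_ids))
        return {"type": "response.cancel", "response_id": response_id}
    return {"type": "response.cancel"}  # race-window fallback
```

This mirrors the tested behavior: the id is carried when known, and the pre-`response.created` window degrades to the legacy no-id cancel without raising.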
…rm contrast and capability-gate coverage

Reconnect cleanup (the meaningful behavioural fix): a websocket reconnect during an in-flight isolated `generate_reply` previously left stale entries in `_active_ephemeral_response_ids`, `_ephemeral_event_ids`, and `_ephemeral_started_at`. The serialization-contract check would then treat the next legitimate isolated call as already-in-flight and raise `RuntimeError` for the lifetime of the session. `_reconnect()` now drains all four ephemeral tracking structures alongside `_response_created_futures`. `_handle_response_done` now also discards the per-response remote-item ids registered during the response lifecycle, so the gate set does not grow unbounded across many ephemeral calls in a long-lived session.

Test coverage additions:

- `test_substrate_isolation_isolated_arm_live`: hard end-to-end isolation assertion against gpt-realtime through the LiveKit `RealtimeSession` wrapper.
- `test_substrate_isolation_calibration_arms_live`: BASELINE + SAFETY-CONTROL arms with a combined-arms calibration check (at least one must recall), so the ISOLATED arm's pass cannot be misattributed to model unwillingness or a safety filter. Per-arm flake from the live model is tolerated; full calibration failure is not.
- `test_reconnect_drains_ephemeral_tracking_state`: regression test for the reconnect bug above.
- `test_capability_gate_warns_and_falls_back_for_unsupporting_plugin`: exercises the dispatcher gate against a legacy 3-kwarg plugin signature (which would raise `TypeError` without the gate).
- `test_isolated_response_does_not_emit_public_events_for_response`: a real emit-guard test (predicate + emit branch), replacing the prior predicate-only check.
- Strengthened `test_concurrent_isolated_generate_reply_rejects_second_pre_creation` to assert the first future stays pending and no second `response.create` reaches the wire.

Cleaned up planning-stage comments in test docstrings for clarity. No production-code changes related to the cleanup.
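The reconnect drain can be sketched as follows. The first three attribute names come from the PR text; the fourth structure is assumed here to be `_ephemeral_remote_item_ids`, and the free function is a stand-in for logic that lives inside `_reconnect()`:

```python
from types import SimpleNamespace


def drain_ephemeral_state(session) -> None:
    """Clear all ephemeral tracking structures on reconnect so the
    serialization-contract check cannot wedge on stale entries for the
    lifetime of the session."""
    session._active_ephemeral_response_ids.clear()
    session._ephemeral_event_ids.clear()
    session._ephemeral_started_at.clear()
    session._ephemeral_remote_item_ids.clear()
```

With the stale entries gone, the next isolated `generate_reply` sees an empty slot and proceeds normally instead of raising `RuntimeError`.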
…ult-during-isolated test scope

Two refinements:

- The dispatcher capability gate previously emitted `DeprecationWarning` when a plugin without `RealtimeCapabilities.ephemeral_response` received an `add_to_chat_ctx=False` call. `DeprecationWarning` is filtered out by default, and the situation is not actually a deprecation (the API is not going away) — it is user misuse: the kwarg cannot be honored by the plugin. Switch to `UserWarning` so callers see the warning loud and clear, and so they understand the call did not isolate. The same category change applies to the secondary "tools-with-isolated" warning.
- `test_default_generate_reply_during_isolated_does_not_raise` had a misleading name and an under-specified assertion. Default-during-isolated does not raise, but under the single-slot `_current_generation` the second `response.created` clobbers the first, detaching its stream from the slot-resident handlers. The test docstring now records this as a documented limitation; the assertion is unchanged because correctness of the overlap is not in scope until `_current_generation` is refactored to a dict keyed by `response.id` (separate work).
Adds an `add_to_chat_ctx: bool = True` parameter to `RealtimeSession.generate_reply` and `AgentSession.generate_reply`, plus a new `RealtimeCapabilities.ephemeral_response` capability flag that plugins use to declare whether they honor the parameter. When the OpenAI plugin (on the public endpoint) sees `add_to_chat_ctx=False`, it sets `conversation: "none"` on the outbound `response.create`, force-overrides `tools=[]` / `tool_choice="none"` (with a warning if the caller passed any), and suppresses the LiveKit-internal `openai_client_event_queued` / `openai_server_event_received` emits and `LK_OPENAI_DEBUG` log output for that response's lifecycle. `AgentActivity` suppresses the assistant turn's writes to `agent._chat_ctx` at all three `_upsert_item` sites and at `_on_remote_item_added`. The OpenAI plugin also intercepts at the source in `_handle_conversion_item_added`, so ephemeral items never enter `_remote_chat_ctx` (the shadow state behind the public `chat_ctx` property and the reconnection-replay state).

Replaces #5569 (closed). The previous attempt rendered content via
`response.create(input=[assistant_message])`, which the GA model gpt-realtime ignored — the model produced an unrelated generic response instead of rendering the input text. Switching to `response.create(instructions=X, conversation: "none")` renders X audibly AND keeps it out of substrate state.

This enables use cases where the agent must speak content to the user without that content entering its own reasoning context on subsequent turns — for example, confirmation flows where the agent relays sensitive information from a trusted external source without the model gaining persistent access to that information.
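For illustration, the isolated-turn event might be serialized like this. `isolated_response_create` is a hypothetical helper, but the `conversation: "none"` / `instructions` fields follow the OpenAI Realtime API `response.create` client event, as used by this PR:

```python
import json


def isolated_response_create(content: str) -> bytes:
    """Build the outbound event for an isolated turn: conversation "none"
    keeps the response out of the substrate's persistent conversation
    state, instructions carry the content to render, and tools are forced
    off for the isolated turn."""
    event = {
        "type": "response.create",
        "response": {
            "conversation": "none",
            "instructions": content,
            "tools": [],
            "tool_choice": "none",
        },
    }
    return json.dumps(event).encode()
```

Asserting on the serialized bytes (rather than on an intermediate object) is also how the wire-format tests in this PR verify the field actually leaves the client.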
The OpenAI plugin enforces a single-isolated-call serialization contract:
`generate_reply(add_to_chat_ctx=False)` raises `RuntimeError` (with diagnostic context: `client_event_id`, `response_id`, elapsed-since-issue, docstring §Concurrency reference) if another isolated call is already in flight on the same session. Default `add_to_chat_ctx=True` calls retain their existing concurrency semantics. The contract is documented in the API docstring §Concurrency.

Independently of that contract, the PR re-includes the orphan filter at `_handle_response_created` (a verbatim port from the previous closed PR commit 92418578, including the `isinstance(metadata, dict)` server-VAD bypass), restructured to run BEFORE `_current_generation` is assigned. Nine bare-assert handlers in the OpenAI plugin convert from `assert self._current_generation is not None` to `if self._current_generation is None: return` so the substrate's parallel out-of-band path (still reachable via orphans, server-VAD overlap, reconnection-mid-response, or timeout races) cannot crash the session.

`interrupt()` is wired with the active server-assigned `response_id` so cancel actually stops in-flight isolated responses (cancel-without-id is silently a no-op for out-of-band responses on the OpenAI substrate). The default cancel-without-id is preserved as the race-window fallback for the small window between the `response.create` send and the `response.created` arrival, when the server-assigned id is not yet known.

Default
`add_to_chat_ctx=True` and the default `ephemeral_response=False` preserve all existing behavior; reverting this PR is a no-op for any current caller.

Behavior matrix

| `add_to_chat_ctx` | `ephemeral_response` | Behavior |
| --- | --- | --- |
| `True` (default) | any | Existing behavior, unchanged. |
| `False` | `True` (OpenAI public endpoint) | `conversation: "none"` on the wire; tools forced off; local context, remote context shadow, OTel span, and public events all suppressed. |
| `False` | `False` (Phonic, Google, Ultravox, AWS, Azure-OpenAI) | `UserWarning` from the dispatcher; falls back to `True` (legacy add-to-context path). |
| `False` | non-realtime LLM | `AgentSession.generate_reply` raises `NotImplementedError`. |

Known limitations
- `generate_reply(add_to_chat_ctx=False)` is serialized — concurrent issuance raises `RuntimeError`. This is the documented API contract (see docstring §Concurrency), not a temporary limitation. A follow-up PR could lift it via a `_current_generation` dict refactor keyed by `response.id`, but that work is not required for this PR's correctness.
- `interrupt()` race window: between the `response.create` send and the `response.created` arrival, the server-assigned `response.id` is not yet known. If `interrupt()` fires inside this window the cancel falls back to the existing no-id behavior (a no-op for out-of-band responses on the substrate). The audio output's `clear_buffer()` still fires locally, so the user stops hearing already-buffered audio.
- `RealtimeCapabilities.ephemeral_response` is set to `False` for Azure-backed sessions because `conversation: "none"` semantics are not verified there. Azure-backed `generate_reply(add_to_chat_ctx=False)` calls go through the dispatcher's `UserWarning` fallback to the legacy add-to-context path. A follow-up issue can be filed when Azure parity is verified.

Empirical foundation
Substrate behavior verified with content-asserted three-arm contrast against
gpt-realtime (ISOLATED with nonce content / BASELINE no isolation / SAFETY-CONTROL with PII-shaped content). Standalone reproduction probes are available on request — happy to share the scripts that exercise the substrate primitive directly and through the `RealtimeSession` wrapper. Audio recordings of model output are preserved separately for verification.

Test coverage
31 new tests in `tests/test_realtime/test_generate_reply_isolation.py` (30 active, 1 skipped when Azure env vars are not set), a mix of mock-based (always run) and live-substrate (require `OPENAI_API_KEY`):

- `AgentSession.generate_reply` signature/error tests.
- `conversation: "none"` asserted on serialized JSON bytes, calibration that the default does NOT set the field, tool override at the plugin layer.
- Serialization-contract rejection with diagnostic context (`client_event_id`, `response_id`, elapsed-since-issue, §Concurrency reference), default-during-isolated proceeds, succeeds-after-completes, ephemeral state cleanup on `response.done`.
- Orphan filter, including the `metadata=None` bypass.
- All nine handler guards on `_current_generation is None`.
- Ephemeral items skipped for `_remote_chat_ctx` and the `remote_item_added` emit.
- `_on_remote_item_added` skips ephemeral items, calibration that non-ephemeral items pass through, plugin-without-attribute behaves normally.
- `interrupt()`: sends cancel with `response_id` for in-flight isolated responses, race-window fallback issues the default no-id cancel without raising, no-active-generation is a no-op.
- `_reconnect()` drains all ephemeral tracking state so subsequent isolated calls are not blocked by stale entries.
- Live-substrate isolation arms against gpt-realtime via the LiveKit `RealtimeSession` wrapper, plus an end-to-end check that the local `chat_ctx` is not polluted.