fix(avatar): preserve audio wrappers across avatar hot-swaps #5863
longcw wants to merge 11 commits
AvatarSession.start() rebinds session.output.audio to a fresh DataStreamAudioOutput. On the first call the wrapper chain (Recorder, TranscriptSynchronizer) wraps it correctly, but a re-bind during a mid-session avatar switch overwrites the synchronizer-wrapped output with a raw sink, breaking audio/transcription sync and recording.

Introduce _AudioSinkProxy, a transparent proxy auto-inserted at the bottom of any wrapper chain. Wrappers cache the proxy (not the leaf), so the leaf can be hot-swapped via the proxy without invalidating upstream references. When the proxy has no inner sink, flush() synthesizes a playback_finished so upstream wrappers don't hang.

Add AgentOutput.set_audio_sink(sink, *, preserve_wrappers=False). With preserve_wrappers=True, walks the chain to find the proxy and swaps its downstream; otherwise behaves as the existing audio setter. Avatar plugins migrate to this API; AvatarSession.aclose() detaches the sink so the chain stays intact across aclose -> restart.

Drops the "may be replaced by the avatar" warning in AvatarSession.start since the proxy makes mid-session rebinding correct by construction.
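The proxy idea in this commit message can be illustrated with a minimal, self-contained sketch. `Leaf`, `SinkProxy`, and `Wrapper` are hypothetical stand-ins invented for this example; they are not the livekit-agents classes, and the real `_AudioSinkProxy` may differ in detail.

```python
from __future__ import annotations

from typing import Callable


class Leaf:
    """Hypothetical leaf sink (for example, one avatar's audio stream)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.flushed = 0

    def flush(self) -> None:
        self.flushed += 1


class SinkProxy:
    """Hypothetical proxy that owns the swappable leaf."""

    def __init__(self, inner: Leaf | None = None) -> None:
        self.inner = inner
        self.on_finished: Callable[[bool], None] | None = None

    def swap(self, inner: Leaf | None) -> None:
        # Only the proxy's slot changes; wrappers above keep pointing at the proxy.
        self.inner = inner

    def flush(self) -> None:
        if self.inner is not None:
            self.inner.flush()
        elif self.on_finished is not None:
            # No real sink: report an interrupted playback so callers don't wait forever.
            self.on_finished(True)


class Wrapper:
    """Hypothetical wrapper (recorder, synchronizer) that caches its downstream once."""

    def __init__(self, downstream: SinkProxy) -> None:
        self.downstream = downstream
        self.finished: list[bool] = []  # `interrupted` flag of each event received
        downstream.on_finished = self.finished.append

    def flush(self) -> None:
        self.downstream.flush()


old_leaf = Leaf("avatar-a")
proxy = SinkProxy(old_leaf)
wrapper = Wrapper(proxy)

new_leaf = Leaf("avatar-b")
proxy.swap(new_leaf)  # hot-swap: the wrapper is untouched
wrapper.flush()       # reaches the new leaf through the proxy

proxy.swap(None)      # detached, e.g. between aclose() and a restart
wrapper.flush()       # the proxy synthesizes the "finished" event
```

Because the wrapper only ever holds a reference to the proxy, swapping or detaching the leaf never invalidates anything the wrapper cached.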
…ers=True)

Route every avatar plugin's audio sink binding through the new AgentOutput.set_audio_sink API so mid-session hot-swaps (e.g. avatar switches) preserve the TranscriptSynchronizer / RecorderAudioOutput wrapper chain. Plugins migrated: anam, avatario, avatartalk, bey, bithuman, did, keyframe, liveavatar, runway, simli, tavus, trugen.
Covers:
- auto-wrap inserts the proxy between a wrapper and a bare leaf
- auto-wrap skipped when the downstream is already a proxy or a non-leaf
- set_audio_sink default replaces the chain
- set_audio_sink with preserve_wrappers swaps the proxy's inner in place
- preserve_wrappers fallback when no proxy exists in the chain
- proxy rejects a wrapper chain as inner (set_next_in_chain assert)
- detached proxy synthesizes playback_finished on flush
- swap routes new-leaf playback events to upstream listeners
- swap disconnects the old leaf from the chain
- on_attached/on_detached propagate to current inner and across swaps
Drop the leaf-only assertion in _AudioSinkProxy.set_next_in_chain: the base AudioOutput machinery cascades capture/flush and bubbles playback events through any chain, so the proxy can hold either a leaf or a wrapper chain without breaking the contract upstream.
The base class doesn't track which sink the avatar set, so nulling session.output.audio unconditionally could clobber a sink owned by someone else. The wrapper chain stays intact across hot-swaps anyway because the proxy preserves the wrappers regardless of what's in its downstream slot, so leaving the sink in place until it's replaced or the session tears down is fine.
        else:
            self._audio_sink.on_detached()

    def set_audio_sink(self, sink: AudioOutput | None, *, preserve_wrappers: bool = False) -> None:
Any way to avoid this method?
Ideally the audio.setter alone is sufficient
@audio.setter
def audio(self, sink: AudioOutput | None) -> None:
We can make preserve_wrappers the default for the setter, but then there is no way to fully replace the chain.
Drop the preserve_wrappers flag: the wrapper-preserving leaf swap is now its own method, and full replacement stays as output.audio = sink.
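A toy model of the resulting split, assuming a linked chain of nodes with one node marked as the proxy. `Node`, `Output`, and `names` are hypothetical; only the names `audio` and `swap_audio_endpoint` come from this PR, and the no-proxy fallback shown here is an assumption, not confirmed behavior.

```python
from __future__ import annotations


class Node:
    """Hypothetical chain element; `downstream` points one step closer to the leaf."""

    def __init__(self, name: str, downstream: Node | None = None, is_proxy: bool = False) -> None:
        self.name = name
        self.downstream = downstream
        self.is_proxy = is_proxy


class Output:
    """Toy model of the two entry points, not the real AgentOutput."""

    def __init__(self) -> None:
        self._audio: Node | None = None

    @property
    def audio(self) -> Node | None:
        return self._audio

    @audio.setter
    def audio(self, sink: Node | None) -> None:
        # Full replacement: whatever chain was installed is dropped.
        self._audio = sink

    def swap_audio_endpoint(self, sink: Node | None) -> None:
        # Walk down from the top of the chain until the proxy is found.
        node = self._audio
        while node is not None and not node.is_proxy:
            node = node.downstream
        if node is None:
            # Assumed fallback when no proxy exists: behave like the setter.
            self._audio = sink
        else:
            node.downstream = sink  # wrappers above the proxy stay attached


def names(node: Node | None) -> list[str]:
    """Collect node names from the top of the chain down to the leaf."""
    out: list[str] = []
    while node is not None:
        out.append(node.name)
        node = node.downstream
    return out


output = Output()
output.audio = Node("synchronizer", Node("proxy", Node("avatar-a"), is_proxy=True))

output.swap_audio_endpoint(Node("avatar-b"))
after_swap = names(output.audio)     # ['synchronizer', 'proxy', 'avatar-b']

output.audio = Node("avatar-c")
after_replace = names(output.audio)  # ['avatar-c']
```

Keeping the setter as plain replacement means existing callers see no behavior change, while callers that need the wrappers preserved opt in by name.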
…wrappers-on-avatar-swap
    def flush(self) -> None:
        super().flush()
        if self.next_in_chain:
            self.next_in_chain.flush()
        else:
            # no real sink; synthesize a playback_finished
            self.on_playback_finished(playback_position=0.0, interrupted=True)
Re-entrant on_playback_finished during flush() causes double segment rotation and skips end_audio_input()
When _AudioSinkProxy has no downstream (e.g. after swap_audio_endpoint(None)), its flush() at io.py:358-359 synchronously calls self.on_playback_finished(...) which emits playback_finished. This event propagates up to _SyncedAudioOutput.on_playback_finished (synchronizer.py:633-643) which calls self._synchronizer.rotate_segment() and resets self._pushed_duration = 0.0. Control then returns to _SyncedAudioOutput.flush() (synchronizer.py:598-606) where the if not self._pushed_duration: check at line 601 is now unexpectedly true (reset by the re-entrant on_playback_finished), triggering a second rotate_segment() and returning before self._synchronizer._impl.end_audio_input() at line 606 is ever called.
Trace of the re-entrant call
1. _SyncedAudioOutput.flush() -> self.next_in_chain.flush() (the proxy)
2. _AudioSinkProxy.flush() -> no downstream -> self.on_playback_finished(0.0, interrupted=True) (synchronous)
3. Proxy emits -> _SyncedAudioOutput._forward_next_playback_finished -> _SyncedAudioOutput.on_playback_finished -> rotate_segment(), _pushed_duration = 0.0
4. Returns to _SyncedAudioOutput.flush() line 601: _pushed_duration is now 0 -> second rotate_segment(), end_audio_input() never reached
Prompt for agents
The issue is that _AudioSinkProxy.flush() with no downstream synchronously calls self.on_playback_finished(), which re-enters _SyncedAudioOutput.on_playback_finished (via event forwarding) and mutates _pushed_duration to 0.0. When control returns to _SyncedAudioOutput.flush(), the _pushed_duration check wrongly triggers a second rotate_segment() and skips end_audio_input().
Possible fixes:
1. In _AudioSinkProxy.flush(), schedule the synthesized playback_finished asynchronously (e.g. via asyncio.get_event_loop().call_soon) so it doesn't re-enter during flush().
2. In _SyncedAudioOutput.flush(), save _pushed_duration before calling self.next_in_chain.flush() and use the saved value for the subsequent check.
3. In _AudioSinkProxy.flush(), set a flag and let the caller handle the no-sink case rather than emitting playback_finished inline.
Option 2 is the simplest local fix: save `had_audio = bool(self._pushed_duration)` before calling `self.next_in_chain.flush()` and check `had_audio` instead of `self._pushed_duration` afterward in synchronizer.py _SyncedAudioOutput.flush().
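The re-entrancy and the option 2 fix can be reproduced in isolation. The sketch below uses hypothetical stand-ins (`Synchronizer`, `DetachedProxy`, `SyncedOutput`, `run`) that only mirror the control flow described above; it is not the code in synchronizer.py, and the `snapshot` flag exists only to compare both behaviors side by side.

```python
class Synchronizer:
    """Hypothetical stand-in that only counts calls."""

    def __init__(self) -> None:
        self.rotations = 0
        self.ended_inputs = 0


class DetachedProxy:
    """Proxy with no downstream: flush() reports playback_finished synchronously."""

    def __init__(self) -> None:
        self.listener = None

    def flush(self) -> None:
        if self.listener is not None:
            self.listener()


class SyncedOutput:
    def __init__(self, proxy: DetachedProxy, *, snapshot: bool) -> None:
        self.sync = Synchronizer()
        self.proxy = proxy
        self.snapshot = snapshot  # True applies the option 2 fix
        self.pushed_duration = 0.0
        proxy.listener = self.on_playback_finished

    def capture(self, seconds: float) -> None:
        self.pushed_duration += seconds

    def on_playback_finished(self) -> None:
        self.sync.rotations += 1
        self.pushed_duration = 0.0

    def flush(self) -> None:
        had_audio = bool(self.pushed_duration)  # read before the downstream call
        self.proxy.flush()  # may re-enter on_playback_finished and reset the counter
        pushed = had_audio if self.snapshot else bool(self.pushed_duration)
        if not pushed:
            self.sync.rotations += 1
            return
        self.sync.ended_inputs += 1


def run(snapshot: bool) -> tuple[int, int]:
    out = SyncedOutput(DetachedProxy(), snapshot=snapshot)
    out.capture(1.0)
    out.flush()
    return out.sync.rotations, out.sync.ended_inputs


buggy = run(snapshot=False)  # (2, 0): rotated twice, input never ended
fixed = run(snapshot=True)   # (1, 1): rotated once, input ended
```

Snapshotting the value keeps the fix local to the caller, at the cost of leaving the proxy's synchronous emission in place for any other wrapper that reads state after flushing downstream.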
Summary
Avatar plugins set `session.output.audio = DataStreamAudioOutput(...)` on every start. On the first start this works because `AgentSession.start()` wraps the sink with the `TranscriptSynchronizer` / `RecorderAudioOutput` chain afterwards; on a mid-session rebind (avatar switch) the raw assignment blows the chain away, silently breaking transcription sync and recording.

Fix it by auto-inserting an `_AudioSinkProxy` at the bottom of any wrapper chain. Wrappers cache the proxy, the proxy holds the swappable leaf, so hot-swaps preserve the wrappers above. New `AgentOutput.swap_audio_endpoint(sink)` walks the chain to the proxy and swaps its downstream in place, leaving the wrappers attached; full replacement stays as `output.audio = sink`.

Plugin migration
All 13 avatar plugins migrated to `swap_audio_endpoint(...)`: anam, avatario, avatartalk, bey, bithuman, did, keyframe, lemonslice, liveavatar, runway, simli, tavus, trugen.

Example
`examples/avatar_agents/audio_wave` now demonstrates hot-swapping through a `swap_avatar` RPC method: it tears down the current avatar (removing it from the room) and launches a fresh one under the same identity, while the audio wrappers and the listeners attached to `session.output.audio` survive the swap.

Related
Clean avatar-worker shutdown on swap depends on livekit/python-sdks#699.