fix(room_io): disconnect locally before server delete to suppress spurious ERROR logs#6252
fix(room_io): disconnect locally before server delete to suppress spurious ERROR logs#6252tsushanth wants to merge 3 commits into
Conversation
…rious ERROR logs
When delete_room_on_close=True, the room is deleted from the server before the
agent's local RTC session closes its own connection. The rust-sdk introduced a
check (rust-sdks#1137) that logs ERROR when a publisher data channel closes while
the local session still considers itself connected. This produces three
misleading ERROR lines on every normal teardown:
publisher data channel '_reliable' closed unexpectedly
publisher data channel '_lossy' closed unexpectedly
publisher data channel '_data_track' closed unexpectedly
Fix: call room.disconnect() first so the local session transitions to the closed
state before the API delete triggers the server-side channel teardown. The
delete_room call is chained immediately after in the same async task, so the
net observable behaviour (room deleted on session close) is unchanged.
Fixes livekit#6250
There was a problem hiding this comment.
🚩 Pre-existing dead code: _close_session_atask is never assigned
The field self._close_session_atask is initialized to None at line 84 and used as a guard at line 406 (and not self._close_session_atask), but it is never assigned anywhere else in the codebase. This means the guard is always True (never prevents the close). This appears to be a leftover from a previous refactor and is not related to this PR's changes.
(Refers to line 84)
Was this helpful? React with 👍 or 👎 to provide feedback.
|
Thanks for creating the PR! The exception handling issue from Devin looks valid, could you fix that? type checking is failing too. |
chenghao-mou
left a comment
There was a problem hiding this comment.
Happy to merge it once the exception handling comment and CI error are fixed
|
Fixed both items:
|
| async def _disconnect_then_delete() -> api.DeleteRoomResponse: | ||
| # Disconnect locally before issuing the server-side delete so the | ||
| # rust-sdk's connection-closed flag is set before the server | ||
| # closes the publisher data channels. Without this the channels | ||
| # are torn down from the remote side while the local session is | ||
| # still "connected", which triggers spurious ERROR-level logs: | ||
| # "publisher data channel '_reliable' closed unexpectedly" | ||
| # See https://github.com/livekit/agents/issues/6250 | ||
| try: | ||
| await self._room.disconnect() | ||
| except Exception: | ||
| logger.exception("error disconnecting room before delete; proceeding with delete") | ||
| delete_fut: asyncio.Future[api.DeleteRoomResponse] = job_ctx.delete_room( | ||
| room_name=self._room.name | ||
| ) | ||
| return await delete_fut | ||
|
|
||
| self._delete_room_task = asyncio.ensure_future(_disconnect_then_delete()) | ||
| self._delete_room_task.add_done_callback(_on_delete_room_task_done) |
There was a problem hiding this comment.
🚩 Timeout budget now includes disconnect latency
In aclose() at line 249, asyncio.wait_for(task, timeout=DEFAULT_API_CONNECT_OPTIONS.timeout) now covers both the room disconnect AND the delete API call (previously it only covered the delete). If the disconnect is slow (e.g. network issues), it consumes time from the deletion budget, potentially causing the delete to never execute. The exception handler at line 493-494 mitigates the worst case (a hanging disconnect would eventually error out or be cancelled by the timeout), but in degraded network conditions the effective time for the delete call is reduced.
Was this helpful? React with 👍 or 👎 to provide feedback.
Fixes #6250
Root cause
When
delete_room_on_close=True,_on_agent_session_closecallsjob_ctx.delete_room()while the local RTC session is still in theconnected state. The server-side delete tears down the publisher data
channels from the remote end. The rust-sdk introduced a check in
rust-sdks#1137 that logs at ERROR level whenever a publisher data channel
closes while the local session has not yet set its own closed flag.
The result is three misleading ERROR lines on every clean teardown:
The call completes cleanly — these are false positives triggered by a race
between the server-initiated close and the local state.
Fix
Call
room.disconnect()first so the local RTC session marks itselfclosed before the API delete sends the server-side channel teardown.
The delete is chained immediately after in the same async task, so the
observable behaviour (room is deleted on session close) is unchanged.
The existing
aclosepath already awaits_delete_room_task, so thenew coroutine-backed task is drained correctly there too.
Testing
Reproduce with
delete_room_on_close=True(the default) andctx.shutdown()called a few seconds after room creation. Before thispatch the three ERROR lines appear on every run; after it they are gone.