PR #5519 (released in v1.5.7) changed _ExitCli(Exception) → _ExitCli(BaseException) so generic except Exception: blocks no longer swallow it during shutdown. Thanks for that fix — it closes one swallow path. There's a second swallow path it doesn't address that I wanted to flag.
The remaining swallow path
_handle_exit in livekit/agents/cli/cli.py raises _ExitCli() synchronously from inside a Python signal handler. CPython runs Python-level signal handlers in the main thread between bytecode instructions, so the frame the handler runs in is whatever the event loop happened to be executing when SIGTERM arrived. When SIGTERM lands while the loop is mid-callback (e.g., inside _set_result_unless_cancelled resolving a Future), the _ExitCli raise happens inside an asyncio.events.Handle._run frame.
CPython's Handle._run:
def _run(self):
    try:
        self._context.run(self._callback, *self._args)
    except (SystemExit, KeyboardInterrupt):
        raise  # ← propagates
    except BaseException as exc:
        ...
        self._loop.call_exception_handler(context)  # ← swallows
After PR #5519, _ExitCli is a direct BaseException subclass, so it falls through to the second except clause. It gets logged as "Exception in callback ... _handle_exit raise _ExitCli" and the loop keeps running. The drain path is never reached.
This is the exact production SEV I documented in prodigal-tech/agent-orchestrator#330 — a pod kept advertising itself to LiveKit Cloud through its full 15-minute terminationGracePeriodSeconds, accepted 29 new dispatches during that window, then got SIGKILL'd at grace expiry. We were on 1.3.12 at the time, but the bug is identical on 1.5.7 — the BaseException change doesn't help here.
Empirical reproduction
import asyncio


class ExitCliBase(BaseException):
    """Mirrors v1.5.7 _ExitCli."""
    pass


class ExitCliKbd(KeyboardInterrupt):
    """Alternative."""
    pass


def trial(label, cls):
    loop = asyncio.new_event_loop()
    main = loop.create_task(asyncio.sleep(10))
    # Raise cls() from inside a scheduled callback, i.e. inside Handle._run.
    loop.call_later(0.05, lambda: (_ for _ in ()).throw(cls()))
    try:
        loop.run_until_complete(main)
        print(f"{label}: SWALLOWED — drain never reached")
    except cls:
        print(f"{label}: PROPAGATED → drain runs")
    finally:
        loop.close()


trial("_ExitCli(BaseException)", ExitCliBase)  # SWALLOWED
trial("_ExitCli(KeyboardInterrupt)", ExitCliKbd)  # PROPAGATED
Output on Python 3.11:
Exception in callback ... ExitCliBase
_ExitCli(BaseException): SWALLOWED — drain never reached
_ExitCli(KeyboardInterrupt): PROPAGATED → drain runs
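The reproduction above throws from a scheduled callback rather than from a real signal handler. As a rough sketch of the same swallow going through signal.signal, the variant below sends SIGTERM to the current process while the loop is inside a callback. The names ExitCliBase, handle_exit, busy_callback and run_trial are stand-ins invented for this illustration, not livekit code, and it assumes CPython on a POSIX system with the code running in the main thread.

```python
import asyncio
import signal


class ExitCliBase(BaseException):
    """Stand-in for the v1.5.7 _ExitCli (hypothetical name)."""


def handle_exit(signum, frame):
    # Like _handle_exit: raise synchronously from the signal handler.
    raise ExitCliBase()


def busy_callback():
    # Deliver SIGTERM to ourselves while the loop is inside Handle._run.
    signal.raise_signal(signal.SIGTERM)
    # Give the interpreter a point at which to run the pending handler.
    for _ in range(1000):
        pass


def run_trial():
    swallowed = []
    loop = asyncio.new_event_loop()
    # Record what the loop's exception handler receives instead of logging it.
    loop.set_exception_handler(
        lambda _loop, ctx: swallowed.append(ctx.get("exception"))
    )
    previous = signal.signal(signal.SIGTERM, handle_exit)
    try:
        loop.call_soon(busy_callback)
        loop.run_until_complete(asyncio.sleep(0.1))
        completed = True  # the loop survived the exit exception
    except ExitCliBase:
        completed = False  # the exit exception escaped the loop
    finally:
        signal.signal(signal.SIGTERM, previous)
        loop.close()
    return completed, swallowed


if __name__ == "__main__":
    completed, swallowed = run_trial()
    print("loop kept running:", completed)
    print("swallowed:", [type(e).__name__ for e in swallowed])
```

If the analysis in this issue holds, the loop should keep running and the exception handler should receive the ExitCliBase instance instead of it reaching the caller of run_until_complete.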
Suggested fixes (one of)
1. Change _ExitCli to inherit from KeyboardInterrupt instead of BaseException. Handle._run re-raises KeyboardInterrupt explicitly; everything else still works (it's still BaseException via inheritance, still bypasses except Exception:). Smallest change.
2. Stop using exceptions for signal-driven shutdown. Use loop.add_signal_handler(SIGTERM, callback) instead of signal.signal(...). The callback can do main_task.cancel() or set a stopping_event that the main coroutine awaits. This avoids raising from arbitrary frames entirely.
Option 2 is more invasive but architecturally cleaner. Option 1 is a one-line change.
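To make the second option concrete, here is a minimal sketch of signal-driven shutdown through loop.add_signal_handler, which is only available on Unix event loops. The serve coroutine, the "drained" return value, and the self-sent SIGTERM are assumptions made for the demonstration; this is not how livekit-agents is structured today.

```python
import asyncio
import signal


async def serve(stopping_event: asyncio.Event) -> str:
    # Hypothetical main coroutine: work until asked to stop, then drain.
    await stopping_event.wait()
    return "drained"


async def main() -> str:
    loop = asyncio.get_running_loop()
    stopping_event = asyncio.Event()
    # The loop schedules this callback like any other; nothing is raised
    # from inside whatever frame happened to be running.
    loop.add_signal_handler(signal.SIGTERM, stopping_event.set)
    try:
        # Demonstration only: send SIGTERM to ourselves shortly after start.
        loop.call_later(0.05, signal.raise_signal, signal.SIGTERM)
        return await asyncio.wait_for(serve(stopping_event), timeout=5)
    finally:
        loop.remove_signal_handler(signal.SIGTERM)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Because the signal only sets an event, the drain logic runs in the main coroutine's own frame and cannot be intercepted by Handle._run.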
Workaround we're carrying
For now we monkey-patch _cli._ExitCli to a KeyboardInterrupt subclass at startup (in our main.py before agents.cli.run_app). Code: agent_orchestrator/patches/livekit_exit_cli.py on this branch. The harness that validated end-to-end recovery is under tests/manual/network_blip/docker/.
Happy to send a PR if option 1 is acceptable.