
_ExitCli still swallowed by asyncio Handle._run after PR #5519 (callback-frame race) #5856

@zaheerabbas-prodigal

PR #5519 (released in v1.5.7) changed _ExitCli(Exception) to _ExitCli(BaseException) so that generic except Exception: blocks no longer swallow it during shutdown. Thanks for that fix; it closes one swallow path. There is a second swallow path it doesn't address, which I want to flag here.

The remaining swallow path

_handle_exit in livekit/agents/cli/cli.py raises _ExitCli() synchronously from inside a Python signal handler. CPython runs Python-level signal handlers in the main thread, between bytecode instructions, so the handler executes on top of whatever frame the event loop happened to be running when SIGTERM arrived. When SIGTERM lands while the loop is mid-callback (e.g., inside _set_result_unless_cancelled resolving a Future), the _ExitCli raise happens inside an asyncio.events.Handle._run frame.

CPython's Handle._run:

def _run(self):
    try:
        self._context.run(self._callback, *self._args)
    except (SystemExit, KeyboardInterrupt):
        raise                       # ← propagates
    except BaseException as exc:
        ...
        self._loop.call_exception_handler(context)   # ← swallows

After PR #5519, _ExitCli derives from BaseException, so it matches the second except clause. It gets logged as "Exception in callback ... _handle_exit raise _ExitCli", the loop keeps running, and the drain path is never reached.
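To check this against a real signal rather than a simulated raise, here is a minimal sketch. It assumes a handler installed with signal.signal that raises a BaseException subclass. ExitSignal, raising_handler, deliver_sigterm and run_trial are hypothetical names for illustration, not livekit code. The callback sends SIGTERM to its own process so that the handler runs on top of a loop callback frame.

```python
import asyncio
import signal


class ExitSignal(BaseException):
    """Hypothetical stand-in for a BaseException-derived _ExitCli."""


def raising_handler(signum, frame):
    # Raise from inside the signal handler, as the issue describes.
    raise ExitSignal()


def deliver_sigterm() -> None:
    # Runs as an ordinary loop callback, i.e. underneath Handle._run.
    signal.raise_signal(signal.SIGTERM)
    # If the handler has not run by the time raise_signal returns, keep
    # executing bytecode in this frame so the interpreter runs it here.
    for _ in range(1000):
        pass


def run_trial() -> dict:
    seen = {"contexts": [], "propagated": False}
    loop = asyncio.new_event_loop()
    # Capture what the loop's exception handler receives instead of logging it.
    loop.set_exception_handler(lambda _loop, ctx: seen["contexts"].append(ctx))
    previous = signal.signal(signal.SIGTERM, raising_handler)
    try:
        loop.call_soon(deliver_sigterm)
        loop.run_until_complete(asyncio.sleep(0.2))
    except ExitSignal:
        seen["propagated"] = True
    finally:
        signal.signal(signal.SIGTERM, previous)
        loop.close()
    return seen


if __name__ == "__main__":
    result = run_trial()
    print("propagated:", result["propagated"])
    print("contexts captured:", len(result["contexts"]))
```

If Handle._run behaves as described above, run_trial() reports propagated as False and a single captured context whose "exception" entry is the ExitSignal instance, which means the loop absorbed the exception and kept running.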

This is the exact production SEV I documented in prodigal-tech/agent-orchestrator#330 — a pod kept advertising itself to LiveKit Cloud through its full 15-minute terminationGracePeriodSeconds, accepted 29 new dispatches during that window, then got SIGKILL'd at grace expiry. We were on 1.3.12 at the time, but the bug is identical on 1.5.7 — the BaseException change doesn't help here.

Empirical reproduction

import asyncio

class ExitCliBase(BaseException):
    """Mirrors v1.5.7 _ExitCli."""
    pass

class ExitCliKbd(KeyboardInterrupt):
    """Alternative."""
    pass

def trial(label, cls):
    loop = asyncio.new_event_loop()
    # Stand-in for the long-running main task of the worker.
    main = loop.create_task(asyncio.sleep(10))
    # Raise cls from inside a loop callback, i.e. inside Handle._run.
    loop.call_later(0.05, lambda: (_ for _ in ()).throw(cls()))
    try:
        # Returns normally only if the loop absorbed the exception.
        loop.run_until_complete(main)
        print(f"{label}: SWALLOWED — drain never reached")
    except cls:
        print(f"{label}: PROPAGATED → drain runs")
    finally:
        loop.close()

trial("_ExitCli(BaseException)", ExitCliBase)   # SWALLOWED
trial("_ExitCli(KeyboardInterrupt)", ExitCliKbd)  # PROPAGATED

Output on Python 3.11:

Exception in callback ... ExitCliBase
_ExitCli(BaseException): SWALLOWED — drain never reached
_ExitCli(KeyboardInterrupt): PROPAGATED → drain runs

Suggested fixes (one of)

  1. Change _ExitCli to inherit from KeyboardInterrupt instead of BaseException. Handle._run re-raises KeyboardInterrupt explicitly; everything else still works (it's still BaseException via inheritance, still bypasses except Exception:). Smallest change.
  2. Stop using exceptions for signal-driven shutdown. Use loop.add_signal_handler(SIGTERM, callback) instead of signal.signal(...). The callback can do main_task.cancel() or set a stopping_event that the main coroutine awaits. This avoids raising from arbitrary frames entirely.

Option 2 is more invasive but architecturally cleaner. Option 1 is a one-line change to the base class.
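For reference, a minimal sketch of what option 2 could look like, assuming a Unix event loop running in the main thread (loop.add_signal_handler is not available on Windows). The names serve, main and the "draining" step are hypothetical placeholders, not the livekit implementation. The signal callback only sets an asyncio.Event, so nothing is raised from an arbitrary frame.

```python
import asyncio
import signal


async def serve(stop: asyncio.Event, log: list) -> None:
    log.append("serving")
    # Block until the signal callback sets the event.
    await stop.wait()
    # Hypothetical placeholder for the real drain logic.
    log.append("draining")


async def main() -> list:
    loop = asyncio.get_running_loop()
    stop = asyncio.Event()
    log: list = []
    # The loop invokes stop.set() as a normal callback on a later
    # iteration; no exception is raised from inside a signal handler.
    loop.add_signal_handler(signal.SIGTERM, stop.set)
    try:
        # Simulate an external SIGTERM shortly after startup.
        loop.call_later(0.05, signal.raise_signal, signal.SIGTERM)
        await serve(stop, log)
    finally:
        loop.remove_signal_handler(signal.SIGTERM)
    return log


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The same shape works with main_task.cancel() in place of stop.set; the event variant is shown because it lets the main coroutine decide when and how to drain.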

Workaround we're carrying

For now we monkey-patch _cli._ExitCli to a KeyboardInterrupt subclass at startup (in our main.py before agents.cli.run_app). Code: agent_orchestrator/patches/livekit_exit_cli.py on this branch. The harness that validated end-to-end recovery is under tests/manual/network_blip/docker/.
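The shape of that patch, as a self-contained sketch: fake_cli below is a hypothetical stand-in module built in memory, not the livekit module, and PatchedExitCli, apply_patch and raised_type are illustrative names. It relies on the assumption that the handler resolves _ExitCli as a module global at call time, so rebinding the attribute before the signal fires changes what gets raised.

```python
import types

# Hypothetical stand-in for the CLI module; this is not livekit code.
fake_cli = types.ModuleType("fake_cli")
exec(
    "class _ExitCli(BaseException):\n"
    "    pass\n"
    "\n"
    "def _handle_exit(sig, frame):\n"
    "    raise _ExitCli()  # name looked up in module globals at call time\n",
    fake_cli.__dict__,
)


class PatchedExitCli(KeyboardInterrupt):
    """Replacement class; Handle._run re-raises KeyboardInterrupt."""


def apply_patch(cli_module: types.ModuleType) -> None:
    # Rebind the module attribute before the signal handler can fire.
    cli_module._ExitCli = PatchedExitCli


def raised_type(cli_module: types.ModuleType) -> type:
    # Call the handler directly and report the class it raises.
    try:
        cli_module._handle_exit(15, None)
    except BaseException as exc:
        return type(exc)
    raise AssertionError("handler did not raise")


if __name__ == "__main__":
    before = raised_type(fake_cli)
    apply_patch(fake_cli)
    after = raised_type(fake_cli)
    print(before.__name__, "->", after.__name__)
```

One caveat with this approach: code that bound the original class by name before the patch runs (for example via from ... import _ExitCli) keeps referring to the old class, which is why the patch has to be applied at startup.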

Happy to send a PR if option 1 is acceptable.
