Bug: jobs stall indefinitely when process initialization times out (ProcPool._warmed_proc_queue.get() has no timeout) #5868

@onur-yildirim-infinitusai

Description

Summary

When worker processes fail to initialize (e.g. initialize_process_timeout is exceeded), jobs dispatched to the pool stall indefinitely and never emit the expected "failed to launch job after N attempts" error. The 3-retry logic in launch_job is unreachable in this failure mode.

Affected versions

Confirmed on livekit-agents==1.5.11 (likely affects all versions).

How to reproduce

  1. Start a worker on a resource-constrained host where cold module imports take longer than initialize_process_timeout (default: 10 s).
  2. On Linux, the SDK uses forkserver by default, so child processes must re-import application code from scratch — heavy import trees can exceed 10 s under CPU contention.
  3. Dispatch a job while all pre-warm slots are failing.
  4. Observe: the job logs "no warmed process available for job, waiting for one to be created" and then hangs forever. No further logs appear for that job.

Root cause

ProcPool.launch_job waits for an initialized process via:

proc = await self._warmed_proc_queue.get()   # line ~133 in proc_pool.py

There is no timeout on this get(). When _proc_spawn_task fails (process init timeout), it cleans up and returns without ever adding anything to _warmed_proc_queue:

# _proc_spawn_task failure path
if not initialized:
    self._executors.remove(proc)
    await proc.aclose()
    self.emit("process_closed", proc)
    return   # ← silently returns, _warmed_proc_queue never receives an entry

The launch_job coroutine is now stuck in _warmed_proc_queue.get() indefinitely. The finally block that decrements _jobs_waiting_for_process cannot run until get() returns, so the counter stays elevated and no new spawn is triggered by subsequent calls. The 3-retry loop that emits "failed to launch job on process after N attempts" is after the get() call and is therefore never reached.
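The stall can be reproduced outside the SDK with plain asyncio. The following is a minimal sketch, not the library's code: `spawn_that_fails` and `launch_job` are hypothetical stand-ins for the spawn task's failure path and for the untimed `get()` described above.

```python
import asyncio


async def spawn_that_fails(queue: asyncio.Queue) -> None:
    """Stand-in for a spawn task whose process init timed out."""
    await asyncio.sleep(0.01)
    return  # returns without putting anything on the queue


async def launch_job(queue: asyncio.Queue) -> str:
    """Stand-in for launch_job: waits on the queue with no timeout."""
    return await queue.get()


async def demo() -> bool:
    queue: asyncio.Queue = asyncio.Queue()
    spawn = asyncio.create_task(spawn_that_fails(queue))
    job = asyncio.create_task(launch_job(queue))
    await spawn  # the spawn task has finished; the queue is still empty

    # Give the job a bounded window to finish; it cannot, nothing will be queued.
    _done, pending = await asyncio.wait({job}, timeout=0.2)
    stalled = job in pending

    # Clean up the stuck task so the event loop can shut down.
    job.cancel()
    try:
        await job
    except asyncio.CancelledError:
        pass
    return stalled


if __name__ == "__main__":
    print("job stalled:", asyncio.run(demo()))
```

The job stays pending after the only producer has exited, which is the same shape as the hang reported here.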

Observed log sequence (truncated)

ERROR  livekit.agents  error initializing process   # TimeoutError in supervised_proc.initialize()
ERROR  livekit.agents  error initializing process   # same for every pre-warm slot
WARNING livekit.agents  no warmed process available for job, waiting for one to be created
# ... silence forever, job is never served and never errors out

Expected behavior

  • If process initialization fails, launch_job should retry spawning (up to MAX_ATTEMPTS) rather than waiting forever on an empty queue.
  • At minimum, "failed to launch job on process after N attempts" should be logged so operators know the job was dropped.

Proposed fix

Option A — timeout on _warmed_proc_queue.get() (minimal change):

try:
    proc = await asyncio.wait_for(
        self._warmed_proc_queue.get(),
        timeout=self._opts.initialize_timeout + 5,
    )
except asyncio.TimeoutError:
    if attempt == MAX_ATTEMPTS - 1:
        raise RuntimeError(f"no process became available after {MAX_ATTEMPTS} attempts")
    continue  # retry: loop back, spawn a new process, wait again
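As a self-contained illustration of Option A, the sketch below wraps the timed `get()` in an explicit retry loop. It is written under assumptions: `get_proc_with_retry`, `flaky_spawn`, and the value of `MAX_ATTEMPTS` are hypothetical, and the real `launch_job` may structure its loop differently.

```python
import asyncio
from typing import Any, Callable, Coroutine, Tuple

MAX_ATTEMPTS = 3  # assumed value, mirroring the "3-retry" wording in this report


async def get_proc_with_retry(
    queue: asyncio.Queue,
    spawn: Callable[[], Coroutine[Any, Any, None]],
    timeout: float,
) -> str:
    """Spawn, then wait a bounded time for a warmed process; retry on timeout."""
    for _attempt in range(MAX_ATTEMPTS):
        spawn_task = asyncio.create_task(spawn())
        try:
            return await asyncio.wait_for(queue.get(), timeout=timeout)
        except asyncio.TimeoutError:
            pass  # nothing arrived in time: fall through to the next attempt
        finally:
            spawn_task.cancel()  # no-op if the spawn task already finished
    raise RuntimeError(f"no process became available after {MAX_ATTEMPTS} attempts")


async def demo() -> Tuple[str, int]:
    queue: asyncio.Queue = asyncio.Queue()
    calls = 0

    async def flaky_spawn() -> None:
        # Hypothetical spawn: the first two "initializations" fail silently.
        nonlocal calls
        calls += 1
        if calls < 3:
            return
        await queue.put(f"proc-{calls}")

    proc = await get_proc_with_retry(queue, flaky_spawn, timeout=0.05)
    return proc, calls


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The trade-off with this approach is latency: each failed attempt costs the full timeout before the next spawn starts, even if the process failed early.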

Option B — notify waiters on spawn failure (more surgical):

When _proc_spawn_task catches an init exception, put a sentinel on the queue (or fire an event) so that launch_job coroutines can unblock, detect the failure, and retry or raise.
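One way Option B could look, as a rough sketch only: the failing spawn path puts a sentinel on the queue, and the waiter treats it as "this attempt failed". `SPAWN_FAILED`, `spawn`, and this `launch_job` are hypothetical names, not the SDK's.

```python
import asyncio
from typing import Sequence

SPAWN_FAILED = object()  # hypothetical sentinel meaning "initialization failed"


async def spawn(queue: asyncio.Queue, init_ok: bool, name: str) -> None:
    """Stand-in spawn task: always notifies the waiter, on success or failure."""
    await asyncio.sleep(0)  # pretend to initialize
    await queue.put(name if init_ok else SPAWN_FAILED)


async def launch_job(init_outcomes: Sequence[bool]) -> str:
    """Try one spawn per scripted outcome; unblock promptly on each failure."""
    queue: asyncio.Queue = asyncio.Queue()
    for attempt, init_ok in enumerate(init_outcomes, start=1):
        task = asyncio.create_task(spawn(queue, init_ok, f"proc-{attempt}"))
        item = await queue.get()  # returns for both outcomes, so no hang
        await task
        if item is not SPAWN_FAILED:
            return item
    raise RuntimeError(
        f"no process became available after {len(init_outcomes)} attempts"
    )


if __name__ == "__main__":
    print(asyncio.run(launch_job([False, False, True])))
```

Note that a queue item is consumed by exactly one getter, so with several jobs waiting on a shared queue a sentinel wakes only one of them. A real implementation would need to account for that, for example by using an event or a per-job future instead.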

Option C — spawn a replacement immediately on failure:

In the _proc_spawn_task except block, if there are still jobs waiting (self._jobs_waiting_for_process > 0), immediately create a new _proc_spawn_task instead of just returning. This gives the waiting job another chance without changing the queue protocol.
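Option C can be sketched with a toy pool. `TinyPool` below is a hypothetical model, not `ProcPool`: its init outcomes are scripted, and only the respawn-on-failure branch is meant to mirror the proposal.

```python
import asyncio
from typing import Iterable, Set, Tuple


class TinyPool:
    """Toy pool: a failed spawn immediately respawns while jobs are waiting."""

    def __init__(self, init_results: Iterable[bool]) -> None:
        self._init_results = list(init_results)  # scripted init outcomes
        self._warmed: asyncio.Queue = asyncio.Queue()
        self._jobs_waiting = 0
        self._tasks: Set[asyncio.Task] = set()
        self.spawn_count = 0

    def _spawn(self) -> None:
        task = asyncio.create_task(self._proc_spawn_task())
        self._tasks.add(task)  # keep a reference so the task is not collected
        task.add_done_callback(self._tasks.discard)

    async def _proc_spawn_task(self) -> None:
        self.spawn_count += 1
        n = self.spawn_count
        initialized = self._init_results.pop(0) if self._init_results else True
        await asyncio.sleep(0)  # pretend to initialize
        if not initialized:
            if self._jobs_waiting > 0:
                self._spawn()  # give the waiting job another chance
            return
        await self._warmed.put(f"proc-{n}")

    async def launch_job(self) -> str:
        self._jobs_waiting += 1
        try:
            self._spawn()
            return await self._warmed.get()
        finally:
            self._jobs_waiting -= 1


async def demo() -> Tuple[str, int]:
    pool = TinyPool([False, False, True])
    proc = await asyncio.wait_for(pool.launch_job(), timeout=1.0)
    return proc, pool.spawn_count


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

As written, the toy respawns without any limit. If initialization fails persistently, a real fix would presumably want a cap or backoff so the pool does not spin forever.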

Workaround

Increase initialize_process_timeout in WorkerOptions (e.g. to 60 s) so processes have enough time to complete initialization before the timeout fires. This prevents the failure mode from triggering in practice but does not fix the underlying missing retry/timeout.

WorkerOptions(
    ...
    initialize_process_timeout=60.0,   # default is 10 s
)
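To pick a timeout value with some evidence behind it, one option is to time a cold import of the application module in a fresh interpreter. This is only a rough sketch using the standard library; `cold_import_seconds` is a hypothetical helper, and the figure includes interpreter startup time.

```python
import subprocess
import sys
import time


def cold_import_seconds(module: str) -> float:
    """Time `import <module>` in a fresh interpreter (includes startup cost)."""
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", f"import {module}"], check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Replace "json" with the application's entrypoint module.
    print(f"cold import took {cold_import_seconds('json'):.2f} s")
```

Running this on the constrained host, ideally with several copies in parallel to mimic simultaneous pre-warming, gives a more realistic number than a measurement on a development machine.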

Additional context

  • On Linux, multiprocessing_context defaults to "forkserver". The forkserver preloads registered plugin packages but not application code, so each worker subprocess must import the full application module tree from scratch.
  • On a pod cold-start the SDK pre-warms min(cpu_count, 4) processes simultaneously (default num_idle_processes). All of them compete for CPU during import, making each import slower than it would be sequentially — which is exactly when all slots can exceed the 10 s budget at once.
