Summary
When worker processes fail to initialize (e.g. initialize_process_timeout is exceeded), jobs dispatched to the pool stall indefinitely and never emit the expected "failed to launch job on process after N attempts" error. The 3-retry logic in launch_job is unreachable in this failure mode.
Affected versions
Confirmed on livekit-agents==1.5.11 (likely affects all versions).
How to reproduce
- Start a worker on a resource-constrained host where cold module imports take longer than initialize_process_timeout (default: 10 s).
- On Linux, the SDK uses forkserver by default, so child processes must re-import application code from scratch; heavy import trees can exceed 10 s under CPU contention.
- Dispatch a job while all pre-warm slots are failing.
- Observe: the job logs "no warmed process available for job, waiting for one to be created" and then hangs forever. No further logs appear for that job.
Root cause
ProcPool.launch_job waits for an initialized process via:
    proc = await self._warmed_proc_queue.get()  # line ~133 in proc_pool.py
There is no timeout on this get(). When _proc_spawn_task fails (process init timeout), it cleans up and returns without ever adding anything to _warmed_proc_queue:
    # _proc_spawn_task failure path
    if not initialized:
        self._executors.remove(proc)
        await proc.aclose()
        self.emit("process_closed", proc)
        return  # ← silently returns, _warmed_proc_queue never receives an entry
The launch_job coroutine is now stuck in _warmed_proc_queue.get() indefinitely. The finally block that decrements _jobs_waiting_for_process cannot run until get() returns, so the counter stays elevated and no new spawn is triggered by subsequent calls. The 3-retry loop that emits "failed to launch job on process after N attempts" is after the get() call and is therefore never reached.
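The hang can be modelled without the SDK. The following is a minimal sketch, not the library's code: ToyPool, spawn, and the init_ok flag are hypothetical stand-ins that keep only the queue and the waiting-jobs counter described above.

```python
import asyncio


class ToyPool:
    """Hypothetical stand-in for ProcPool: models only the queue and counter."""

    def __init__(self):
        self.warmed = asyncio.Queue()
        self.jobs_waiting = 0

    async def spawn(self, init_ok):
        # Mirrors the failure path: a failed init queues nothing.
        if init_ok:
            self.warmed.put_nowait("proc")

    async def launch_job(self):
        self.jobs_waiting += 1
        try:
            await self.spawn(init_ok=False)  # every init attempt fails
            return await self.warmed.get()   # no timeout: waits forever
        finally:
            self.jobs_waiting -= 1


async def demo():
    pool = ToyPool()
    task = asyncio.create_task(pool.launch_job())
    await asyncio.sleep(0.05)          # let the job reach get()
    stuck = not task.done()            # still blocked on the empty queue
    waiting = pool.jobs_waiting        # counter has not been decremented
    task.cancel()                      # only cancellation releases the waiter
    await asyncio.gather(task, return_exceptions=True)
    return stuck, waiting, pool.jobs_waiting


stuck, waiting_while_stuck, waiting_after_cancel = asyncio.run(demo())
print(stuck, waiting_while_stuck, waiting_after_cancel)
```

In this model the finally block runs only once the task is cancelled from outside, which matches the observation that the counter stays elevated for as long as the job is stuck.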
Observed log sequence (truncated)
ERROR livekit.agents error initializing process # TimeoutError in supervised_proc.initialize()
ERROR livekit.agents error initializing process # same for every pre-warm slot
WARNING livekit.agents no warmed process available for job, waiting for one to be created
# ... silence forever, job is never served and never errors out
Expected behavior
- If process initialization fails, launch_job should retry spawning (up to MAX_ATTEMPTS) rather than waiting forever on an empty queue.
- At minimum, "failed to launch job on process after N attempts" should be logged so operators know the job was dropped.
Proposed fix
Option A — timeout on _warmed_proc_queue.get() (minimal change):
    try:
        proc = await asyncio.wait_for(
            self._warmed_proc_queue.get(),
            timeout=self._opts.initialize_timeout + 5,
        )
    except asyncio.TimeoutError:
        if attempt == MAX_ATTEMPTS - 1:
            raise RuntimeError(f"no process became available after {MAX_ATTEMPTS} attempts")
        continue  # retry: loop back, spawn a new process, wait again
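To check the control flow of Option A in isolation, here is a self-contained sketch under stated assumptions: get_warmed_proc, the spawn callback, and both demo spawners are hypothetical and do not exist in livekit-agents. It shows only that a bounded wait around Queue.get() turns a silent hang into either a retry or an error.

```python
import asyncio

MAX_ATTEMPTS = 3  # mirrors the retry count described in this report


async def get_warmed_proc(queue, spawn, timeout):
    """Wait for a warmed process, spawning again after each timed-out wait."""
    for attempt in range(MAX_ATTEMPTS):
        await spawn()  # hypothetical hook that starts one process
        try:
            return await asyncio.wait_for(queue.get(), timeout=timeout)
        except asyncio.TimeoutError:
            if attempt == MAX_ATTEMPTS - 1:
                raise RuntimeError(
                    f"no process became available after {MAX_ATTEMPTS} attempts"
                ) from None


async def demo():
    queue = asyncio.Queue()
    calls = 0

    async def flaky_spawn():
        # The first two "inits" fail and queue nothing; the third succeeds.
        nonlocal calls
        calls += 1
        if calls == 3:
            queue.put_nowait("proc-3")

    async def dead_spawn():
        pass  # init never succeeds

    proc = await get_warmed_proc(queue, flaky_spawn, timeout=0.05)
    try:
        await get_warmed_proc(asyncio.Queue(), dead_spawn, timeout=0.05)
        error = None
    except RuntimeError as exc:
        error = str(exc)
    return proc, calls, error


proc, calls, error = asyncio.run(demo())
print(proc, calls, error)
```

The timeout is a blunt signal: a waiter cannot tell a failed init from a slow one, so the budget has to exceed the initialization timeout, as the snippet above does with its extra 5 s.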
Option B — notify waiters on spawn failure (more surgical):
When _proc_spawn_task catches an init exception, put a sentinel on the queue (or fire an event) so that launch_job coroutines can unblock, detect the failure, and retry or raise.
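A sketch of the sentinel variant, assuming a queue that carries either a process or a failure marker. InitFailed, spawn, and the outcomes iterator are hypothetical names used only for illustration; the real queue protocol may need a different shape.

```python
import asyncio

MAX_ATTEMPTS = 3


class InitFailed:
    """Hypothetical sentinel queued when a process fails to initialize."""

    def __init__(self, reason):
        self.reason = reason


async def spawn(queue, init_ok):
    # On failure, wake a waiter with a sentinel instead of returning silently.
    queue.put_nowait("proc" if init_ok else InitFailed("init timeout"))


async def launch_job(queue, outcomes):
    """`outcomes` yields one boolean per init attempt (True means success)."""
    for _ in range(MAX_ATTEMPTS):
        await spawn(queue, next(outcomes))
        item = await queue.get()
        if not isinstance(item, InitFailed):
            return item  # a usable process
        # Otherwise the waiter learned about the failure at once and retries.
    raise RuntimeError(
        f"failed to launch job on process after {MAX_ATTEMPTS} attempts"
    )


async def demo():
    recovered = await launch_job(asyncio.Queue(), iter([False, True]))
    try:
        await launch_job(asyncio.Queue(), iter([False, False, False]))
        error = None
    except RuntimeError as exc:
        error = str(exc)
    return recovered, error


recovered, error = asyncio.run(demo())
print(recovered, error)
```

Compared with a timeout, the waiter reacts as soon as the failure is known rather than after a fixed delay, at the cost of every consumer of the queue having to recognise the sentinel.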
Option C — spawn a replacement immediately on failure:
In the _proc_spawn_task except block, if there are still jobs waiting (self._jobs_waiting_for_process > 0), immediately create a new _proc_spawn_task instead of just returning. This gives the waiting job another chance without changing the queue protocol.
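The respawn idea can be sketched as follows. ToyPool, start_spawn, and max_spawns are hypothetical; the cap is an assumption added here so that a permanently failing init does not respawn without bound.

```python
import asyncio


class ToyPool:
    """Hypothetical pool that replaces a failed process while a job waits."""

    def __init__(self, outcomes, max_spawns=5):
        self.warmed = asyncio.Queue()
        self.jobs_waiting = 0
        self.outcomes = iter(outcomes)  # one boolean per init attempt
        self.spawn_count = 0
        self.max_spawns = max_spawns    # assumed cap, not part of the proposal
        self.tasks = set()

    def start_spawn(self):
        task = asyncio.create_task(self._spawn_task())
        self.tasks.add(task)  # keep a reference so the task is not collected
        task.add_done_callback(self.tasks.discard)

    async def _spawn_task(self):
        self.spawn_count += 1
        await asyncio.sleep(0)  # stand-in for process start and init
        if next(self.outcomes):
            self.warmed.put_nowait(f"proc-{self.spawn_count}")
        elif self.jobs_waiting > 0 and self.spawn_count < self.max_spawns:
            self.start_spawn()  # give the waiting job another chance

    async def launch_job(self):
        self.jobs_waiting += 1
        try:
            self.start_spawn()
            return await self.warmed.get()
        finally:
            self.jobs_waiting -= 1


async def demo():
    pool = ToyPool(outcomes=[False, False, True])
    proc = await asyncio.wait_for(pool.launch_job(), timeout=1.0)
    return proc, pool.spawn_count, pool.jobs_waiting


proc, spawns, still_waiting = asyncio.run(demo())
print(proc, spawns, still_waiting)
```

In this sketch the waiter would still block forever once the cap is reached, so respawning alone does not surface an error; it would likely need to be paired with a timeout or a failure notification.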
Workaround
Increase initialize_process_timeout in WorkerOptions (e.g. to 60 s) so processes have enough time to complete initialization before the timeout fires. This prevents the failure mode from triggering in practice but does not fix the underlying missing retry/timeout.
    WorkerOptions(
        ...
        initialize_process_timeout=60.0,  # default is 10 s
    )
Additional context
- On Linux, multiprocessing_context defaults to "forkserver". The forkserver preloads registered plugin packages but not application code, so each worker subprocess must import the full application module tree from scratch.
- On a pod cold start, the SDK pre-warms min(cpu_count, 4) processes simultaneously (default num_idle_processes). All of them compete for CPU during import, making each import slower than it would be sequentially, which is exactly when all slots can exceed the 10 s budget at once.