Bug: jobs stall indefinitely when process initialization times out (ProcPool._warmed_proc_queue.get() has no timeout)

## Summary

When worker processes fail to initialize (e.g. `initialize_process_timeout` is exceeded), jobs dispatched to the pool stall **indefinitely** and never emit the expected `"failed to launch job on process after N attempts"` error. The 3-retry logic in `launch_job` is unreachable in this failure mode.

## Affected versions

Confirmed on `livekit-agents==1.5.11` (likely affects all versions).

## How to reproduce

1. Start a worker on a resource-constrained host where cold module imports take longer than `initialize_process_timeout` (default: 10 s). On Linux, the SDK uses `forkserver` by default, so child processes must re-import application code from scratch; heavy import trees can exceed 10 s under CPU contention.
2. Dispatch a job while all pre-warm slots are failing.
3. Observe: the job logs `no warmed process available for job, waiting for one to be created` and then **hangs forever**. No further logs appear for that job.

## Root cause

`ProcPool.launch_job` waits for an initialized process via:

```python
proc = await self._warmed_proc_queue.get()   # line ~133 in proc_pool.py
```

There is **no timeout** on this `get()`. When `_proc_spawn_task` fails (process init timeout), it cleans up and returns without ever adding anything to `_warmed_proc_queue`:

```python
# _proc_spawn_task failure path
if not initialized:
    self._executors.remove(proc)
    await proc.aclose()
    self.emit("process_closed", proc)
    return   # ← silently returns, _warmed_proc_queue never receives an entry
```

The `launch_job` coroutine is now stuck in `_warmed_proc_queue.get()` indefinitely. The `finally` block that decrements `_jobs_waiting_for_process` cannot run until `get()` returns, so the counter stays elevated and no new spawn is triggered by subsequent calls. The 3-retry loop that emits `"failed to launch job on process after N attempts"` is **after** the `get()` call and is therefore never reached.
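
The blocking behaviour can be reproduced without the SDK. The following is a minimal sketch using only `asyncio`; `launch_job`, `state`, and `demo` here are illustrative stand-ins, not the SDK's code. It shows that a waiter on an empty queue never finishes on its own, and that its `finally` block (and therefore the counter decrement) only runs once the task is cancelled.

```python
import asyncio


async def launch_job(queue: asyncio.Queue, state: dict) -> None:
    # Same shape as described above: bump a waiter counter, block on get(),
    # and decrement the counter only in `finally`.
    state["waiting"] += 1
    try:
        await queue.get()  # no timeout, and nothing is ever put on the queue
    finally:
        state["waiting"] -= 1


async def demo() -> dict:
    queue: asyncio.Queue = asyncio.Queue()
    state = {"waiting": 0}
    task = asyncio.create_task(launch_job(queue, state))
    await asyncio.sleep(0.05)  # give the task time to reach get()
    snapshot = {"done": task.done(), "waiting": state["waiting"]}
    task.cancel()  # cancellation is the only thing that unblocks the waiter
    try:
        await task
    except asyncio.CancelledError:
        pass
    snapshot["waiting_after_cancel"] = state["waiting"]
    return snapshot


result = asyncio.run(demo())
print(result)
```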

## Observed log sequence (truncated)

```
ERROR  livekit.agents  error initializing process   # TimeoutError in supervised_proc.initialize()
ERROR  livekit.agents  error initializing process   # same for every pre-warm slot
WARNING livekit.agents  no warmed process available for job, waiting for one to be created
# ... silence forever, job is never served and never errors out
```

## Expected behavior

- If process initialization fails, `launch_job` should retry spawning (up to `MAX_ATTEMPTS`) rather than waiting forever on an empty queue.
- At minimum, `"failed to launch job on process after N attempts"` should be logged so operators know the job was dropped.

## Proposed fix

**Option A — timeout on `_warmed_proc_queue.get()`** (minimal change):

```python
try:
    proc = await asyncio.wait_for(
        self._warmed_proc_queue.get(),
        timeout=self._opts.initialize_timeout + 5,
    )
except asyncio.TimeoutError:
    if attempt == MAX_ATTEMPTS - 1:
        raise RuntimeError(f"no process became available after {MAX_ATTEMPTS} attempts")
    continue  # retry: loop back, spawn a new process, wait again
```
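
As a self-contained illustration of Option A, the sketch below wraps the bounded wait in a retry loop and runs it against an empty queue. The helper name `get_warmed_proc`, the `wait_timeout` parameter, and the value of `MAX_ATTEMPTS` are assumptions made for the example; the real `launch_job` may be structured differently.

```python
import asyncio

MAX_ATTEMPTS = 3  # assumed to match the retry count mentioned in this issue


async def get_warmed_proc(queue: asyncio.Queue, wait_timeout: float):
    """Wait for a warmed process, giving up after MAX_ATTEMPTS timed-out waits."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            return await asyncio.wait_for(queue.get(), timeout=wait_timeout)
        except asyncio.TimeoutError:
            if attempt == MAX_ATTEMPTS - 1:
                raise RuntimeError(
                    f"no process became available after {MAX_ATTEMPTS} attempts"
                )
            # A real pool would trigger a fresh spawn here before waiting again.


async def demo() -> str:
    empty: asyncio.Queue = asyncio.Queue()  # simulates every spawn failing
    try:
        await get_warmed_proc(empty, wait_timeout=0.01)
    except RuntimeError as exc:
        return str(exc)
    return "unexpectedly got a process"


message = asyncio.run(demo())
print(message)
```

With a bounded wait the job fails loudly after a known worst-case delay (roughly `MAX_ATTEMPTS` times the timeout) instead of hanging.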

**Option B — notify waiters on spawn failure** (more surgical):

When `_proc_spawn_task` catches an init exception, put a sentinel on the queue (or fire an event) so that `launch_job` coroutines can unblock, detect the failure, and retry or raise.
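
One way Option B could look, as a toy model rather than a patch: the spawn task always reports an outcome to the queue, using a sentinel for failure. `SPAWN_FAILED`, `spawn`, and the `failures` list are hypothetical names introduced for this sketch.

```python
import asyncio

MAX_ATTEMPTS = 3
SPAWN_FAILED = object()  # hypothetical sentinel; not an SDK name


async def spawn(queue: asyncio.Queue, fails: bool) -> None:
    """Stand-in for the spawn task: it always puts an outcome on the queue."""
    await asyncio.sleep(0)  # pretend to initialize
    queue.put_nowait(SPAWN_FAILED if fails else "proc")


async def launch_job(queue: asyncio.Queue, failures: list) -> str:
    """failures[i] says whether the i-th spawn attempt fails."""
    for attempt in range(MAX_ATTEMPTS):
        spawn_task = asyncio.create_task(spawn(queue, failures[attempt]))
        item = await queue.get()  # unblocks on success and on failure
        await spawn_task
        if item is not SPAWN_FAILED:
            return item
    raise RuntimeError(f"failed to launch job after {MAX_ATTEMPTS} attempts")


async def demo():
    recovered = await launch_job(asyncio.Queue(), [True, True, False])
    try:
        await launch_job(asyncio.Queue(), [True, True, True])
    except RuntimeError as exc:
        error = str(exc)
    else:
        error = None
    return recovered, error


recovered, error = asyncio.run(demo())
print(recovered, error)
```

With several concurrent waiters sharing one queue, a sentinel wakes whichever waiter is first in line, so a real implementation would need to decide whether that waiter retries or re-queues the signal.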

**Option C — spawn a replacement immediately on failure**:

In the `_proc_spawn_task` except block, if there are still jobs waiting (`self._jobs_waiting_for_process > 0`), immediately create a new `_proc_spawn_task` instead of just returning. This gives the waiting job another chance without changing the queue protocol.
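
A rough sketch of Option C on a toy pool follows. `TinyPool`, its attributes, and the `max_spawns` cap are all illustrative assumptions; only the idea of checking a waiting-jobs counter on failure comes from the description above.

```python
import asyncio


class TinyPool:
    """Toy model of the pool; names are illustrative, not the SDK's."""

    def __init__(self, init_results, max_spawns=5):
        self._init_results = iter(init_results)  # True = init succeeds
        self.warmed = asyncio.Queue()
        self.jobs_waiting = 0
        self.spawns = 0
        self.max_spawns = max_spawns  # assumed cap to avoid endless respawns
        self._tasks = set()

    def _start_spawn(self) -> None:
        task = asyncio.create_task(self._spawn())
        self._tasks.add(task)  # keep a reference so the task is not collected
        task.add_done_callback(self._tasks.discard)

    async def _spawn(self) -> None:
        self.spawns += 1
        await asyncio.sleep(0)  # pretend to initialize
        if next(self._init_results, False):
            self.warmed.put_nowait(f"proc-{self.spawns}")
            return
        # Option C: on failure, respawn while a job is still waiting.
        if self.jobs_waiting > 0 and self.spawns < self.max_spawns:
            self._start_spawn()

    async def launch_job(self) -> str:
        self.jobs_waiting += 1
        try:
            self._start_spawn()
            return await self.warmed.get()
        finally:
            self.jobs_waiting -= 1


async def demo():
    pool = TinyPool([False, False, True])  # two failed inits, then success
    proc = await pool.launch_job()
    return proc, pool.spawns


proc, spawns = asyncio.run(demo())
print(proc, spawns)
```

Note that in this sketch a job still hangs once the cap is reached, so Option C on its own probably needs to be paired with a timeout or failure signal for hosts that can never initialize.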

## Workaround

Increase `initialize_process_timeout` in `WorkerOptions` (e.g. to 60 s) so processes have enough time to complete initialization before the timeout fires. This prevents the failure mode from triggering in practice but does not fix the underlying missing retry/timeout.

```python
WorkerOptions(
    ...
    initialize_process_timeout=60.0,   # default is 10 s
)
```

## Additional context

- On Linux, `multiprocessing_context` defaults to `"forkserver"`. The forkserver preloads registered plugin packages but **not** application code, so each worker subprocess must import the full application module tree from scratch.
- On a pod cold-start the SDK pre-warms `min(cpu_count, 4)` processes simultaneously (default `num_idle_processes`). All of them compete for CPU during import, making each import slower than it would be sequentially — which is exactly when all slots can exceed the 10 s budget at once.
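
To make the contention argument concrete, here is a deliberately crude back-of-the-envelope model. It assumes imports are purely CPU-bound and that concurrent processes share the available CPU evenly; the function name and the numbers (4 s solo import, 4 processes, 1 usable CPU) are hypothetical, not measurements.

```python
def estimated_import_seconds(
    solo_seconds: float, concurrent_procs: int, usable_cpus: float
) -> float:
    """Rough estimate: CPU-bound work slows down by the oversubscription ratio."""
    slowdown = max(1.0, concurrent_procs / usable_cpus)
    return solo_seconds * slowdown


# Hypothetical numbers: an import that takes 4 s alone, 4 processes, 1 CPU.
estimate = estimated_import_seconds(4.0, concurrent_procs=4, usable_cpus=1.0)
exceeds_default_timeout = estimate > 10.0  # 10 s is the default cited above
print(estimate, exceeds_default_timeout)
```

Under these assumptions an import that fits comfortably within the budget when run alone overshoots it in every slot at once.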
