Worker pool stays "_active=true" with empty _workers after synchronized seppukus → runner.promise never resolves #592

@DivMode

Summary

Hit a wedged state in graphile-worker 0.16.6 where the pool ended up with _active=true, _shuttingDown=false, and _workers=[] after all workers committed seppuku within a few seconds of each other. runner.promise never resolved, the Node process stayed alive indefinitely, and the queue drained to 0 in_flight while the waiting count grew without bound.

graphile-worker itself logged the diagnosis:

[core] ERROR: Worker exited, but pool is in continuous mode, is active,
              and is not shutting down... Did something go wrong?

…and then never recovered.

Reproduction

Production setup running graphile-worker via run({ concurrency: 8, pollInterval: 2000, taskList }) in Docker on Hetzner. Task handlers dispatch HTTP requests to a Cloudflare Worker that drives a Workflow; when the workflow returns an errored status, the task throws, and the failure is normal graphile-worker behavior (retry up to max_attempts=25).
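The failure path in the handler is essentially the following sketch. The status union and the injected fetchStatus function are illustrative stand-ins for the HTTP call to the Cloudflare Worker, not our actual code:

```typescript
type WorkflowStatus = "complete" | "errored" | "running";

// Hypothetical sketch of the task-handler failure path. fetchStatus stands in
// for the HTTP round trip to the Cloudflare Worker that drives the Workflow.
function makeWorkflowTask(
  fetchStatus: (payload: unknown) => Promise<WorkflowStatus>,
) {
  return async (payload: unknown): Promise<void> => {
    const status = await fetchStatus(payload);
    if (status === "errored") {
      // Throwing is the normal failure signal: graphile-worker records the
      // error and retries with backoff, up to max_attempts (25 in our config).
      throw new Error("workflow reported errored status");
    }
  };
}
```

The thrown error is the ordinary retry path; the bug below is about what happens when releasing that failure back to Postgres itself fails.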

What triggered the wedge: at the moment 8 concurrent jobs were all in the failure-release path, a transient DNS hiccup made the next pg-pool connection lookup hit getaddrinfo ETIMEOUT. All 8 workers were unable to release their claims → all 8 committed seppuku:

[worker(W1)] ERROR: Failed to release job '64844' after failure '...'; committing seppuku
[worker(W2)] ERROR: Failed to release job '64846' after failure '...'; committing seppuku
[worker(W3)] ERROR: Failed to release job '64847' after failure '...'; committing seppuku
... all 8 seppuku within ~1 second of each other ...
getaddrinfo ETIMEOUT
[core] ERROR: Worker exited, but pool is in continuous mode, is active,
              and is not shutting down... Did something go wrong?
[core] ERROR: Worker exited with error: DNSException: getaddrinfo ETIMEOUT

After this, DNS recovered immediately (verified dns.lookup("postgres") worked seconds later). But:

  • pool._active === true
  • pool._shuttingDown === false
  • pool._workers.length === 0
  • runner.promise did not resolve or reject
  • Node process stayed alive
  • Zero jobs were picked up for 9 hours, until we noticed and ran docker restart.

Expected behavior

Either:

  1. Pool replaces dead workers when _active=true && !_shuttingDown && _workers.length === 0 — eventually-consistent recovery from synchronized worker fatalities.
  2. OR, if irrecoverable, runner.promise rejects (or resolves) so the host can react. Right now the only signal is the "Did something go wrong?" log line, which is text, not a state change anything can await.

Either would have let our orchestrator (Docker restart: unless-stopped) recover automatically. As it stands the host has to add its own watchdog that inspects pool._workers.length and process.exit() itself when the count stays at 0.

Workaround (in case anyone else hits this)

External watchdog using the @internal fields exposed on WorkerPool:

import { run, type WorkerPool } from "graphile-worker";

const runner = await run({ ... });

// The pool isn't exposed on Runner directly; grab it via the pool:create event.
let workerPool: WorkerPool | null = null;
runner.events.on("pool:create", ({ workerPool: pool }) => { workerPool = pool; });

// Exit (and let the orchestrator restart us) if the pool sits in the wedged
// state — active, not shutting down, zero workers — for two minutes straight.
let zombieSince: number | null = null;
setInterval(() => {
  if (!workerPool) return;
  const zombie =
    workerPool._workers.length === 0 &&
    workerPool._active &&
    !workerPool._shuttingDown;
  if (zombie) {
    zombieSince ??= Date.now();
    if (Date.now() - zombieSince >= 120_000) process.exit(1);
  } else {
    zombieSince = null;
  }
}, 15_000);

await runner.promise;

await runner.promise;

…but this relies on @internal fields and isn't a real fix.

Versions

  • graphile-worker@0.16.6
  • pg@8.13.1
  • Bun 1.3.9, Linux x86_64
  • Postgres 18 on a Docker bridge network (postgres resolved via embedded Docker DNS)

Possible upstream fix

In WorkerPool's worker-exit handler: when a worker's promise rejects and _active && !_shuttingDown, either (a) spawn a replacement worker, or (b) call gracefulShutdown() and let pool.promise reject so the runner can propagate up. The current third path — log a question mark and stay alive with zero workers — is the one that wedges hosts.
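Sketched as a pure decision function (hypothetical names and shapes, not the actual WorkerPool internals), the proposal is to make the third path impossible:

```typescript
type PoolState = { active: boolean; shuttingDown: boolean; workerCount: number };
type ExitAction = "replace" | "shutdown" | "none";

// Hypothetical sketch of the proposed worker-exit handling, not
// graphile-worker's real code: when a worker dies while the pool is active
// and not shutting down, either spawn a replacement (a) or shut down so
// pool.promise rejects (b) — never log-and-idle with zero workers.
function onWorkerExit(pool: PoolState, replaceDeadWorkers: boolean): ExitAction {
  if (!pool.active || pool.shuttingDown) return "none"; // normal shutdown path
  if (replaceDeadWorkers) return "replace";             // option (a)
  if (pool.workerCount === 0) return "shutdown";        // option (b)
  return "none"; // other workers still running; nothing to do yet
}
```

Either branch gives the host a state change it can await or react to, instead of a log line.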
