Summary
Hit a wedged state in graphile-worker 0.16.6 where the pool ended up with _active=true, _shuttingDown=false, and _workers=[] after all workers committed seppuku within a few seconds of each other. runner.promise never resolved, the process stayed alive indefinitely, and the queue drained to 0 in_flight while the waiting count grew without bound.
graphile-worker itself logged the diagnosis:
[core] ERROR: Worker exited, but pool is in continuous mode, is active,
and is not shutting down... Did something go wrong?
…and then never recovered.
Reproduction
Production setup running graphile-worker via run({ concurrency: 8, pollInterval: 2000, taskList }) in Docker on Hetzner. Task handlers dispatch HTTP requests to a Cloudflare Worker that drives a Workflow; when the workflow returns an errored status, the task throws, and the failure is handled by normal graphile-worker behavior (retry up to max_attempts=25).
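For context, a minimal sketch of that setup; the task name, payload shape, endpoint URL, and connection string are placeholders, not our real code:

import { run, type Task } from "graphile-worker";

// Illustrative handler: calls the Cloudflare Worker and throws when the
// workflow errored, so graphile-worker retries (up to max_attempts).
const dispatchWorkflow: Task = async (payload) => {
  const res = await fetch("https://workflow.example.workers.dev/dispatch", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(payload),
  });
  const { status } = (await res.json()) as { status: string };
  if (status === "errored") {
    throw new Error(`workflow errored: ${JSON.stringify(payload)}`);
  }
};

const runner = await run({
  connectionString: process.env.DATABASE_URL, // "postgres" hostname resolved via Docker DNS
  concurrency: 8,
  pollInterval: 2000,
  taskList: { dispatch_workflow: dispatchWorkflow },
});

await runner.promise;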
What triggered the wedge: at the moment 8 concurrent jobs were all in the failure-release path, a transient DNS hiccup made the next pg-pool connection lookup hit getaddrinfo ETIMEOUT. All 8 workers were unable to release their claims → all 8 committed seppuku:
[worker(W1)] ERROR: Failed to release job '64844' after failure '...'; committing seppuku
[worker(W2)] ERROR: Failed to release job '64846' after failure '...'; committing seppuku
[worker(W3)] ERROR: Failed to release job '64847' after failure '...'; committing seppuku
... all 8 seppuku within ~1 second of each other ...
getaddrinfo ETIMEOUT
[core] ERROR: Worker exited, but pool is in continuous mode, is active,
and is not shutting down... Did something go wrong?
[core] ERROR: Worker exited with error: DNSException: getaddrinfo ETIMEOUT
After this, DNS recovered immediately (verified dns.lookup("postgres") worked seconds later). But:
- pool._active === true
- pool._shuttingDown === false
- pool._workers.length === 0
- runner.promise did not resolve or reject
- The process stayed alive
- Zero jobs were picked up for 9 hours, until we noticed and docker restart'd
Expected behavior
Either:
- Pool replaces dead workers when _active=true && !_shuttingDown && _workers.length === 0, i.e. eventually-consistent recovery from synchronized worker fatalities.
- OR, if irrecoverable, runner.promise rejects (or resolves) so the host can react. Right now the only signal is the "Did something go wrong?" log line, which is text, not a state change anything can await.
Either would have let our orchestrator (Docker restart: unless-stopped) recover automatically. As it stands, the host has to add its own watchdog that inspects pool._workers.length and calls process.exit() itself when the count stays at 0.
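For illustration, option (b) would make recovery on our side trivial; this is hypothetical, since runner.promise currently never settles in this state:

// Hypothetical host-side handling if runner.promise rejected when the pool
// lost its last worker: exit non-zero and let Docker's restart policy recover.
try {
  await runner.promise;
} catch (err) {
  console.error("worker pool died:", err);
  process.exit(1);
}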
Workaround (in case anyone else hits this)
External watchdog using the @internal fields exposed on WorkerPool:
import { run, type WorkerPool } from "graphile-worker";

const runner = await run({ ... });

// Grab the pool instance as soon as it is created so the watchdog can inspect it.
let workerPool: WorkerPool | null = null;
runner.events.on("pool:create", ({ workerPool: pool }) => {
  workerPool = pool;
});

// If the pool stays active with zero workers for 2 minutes, assume it is wedged
// and exit so the orchestrator restarts the container.
let zombieSince: number | null = null;
setInterval(() => {
  if (!workerPool) return;
  const zombie =
    workerPool._workers.length === 0 &&
    workerPool._active &&
    !workerPool._shuttingDown;
  if (zombie) {
    zombieSince ??= Date.now();
    if (Date.now() - zombieSince >= 120_000) process.exit(1);
  } else {
    zombieSince = null;
  }
}, 15_000);

await runner.promise;
…but this relies on @internal fields and isn't a real fix.
Versions
- graphile-worker@0.16.6
- pg@8.13.1
- Bun 1.3.9, Linux x86_64
- Postgres 18 on a Docker bridge network (postgres resolved via embedded Docker DNS)
Possible upstream fix
In WorkerPool's worker-exit handler: when a worker's promise rejects and _active && !_shuttingDown, either (a) spawn a replacement worker, or (b) call gracefulShutdown() and let pool.promise reject so the runner can propagate it up. The current third path (log a question mark and stay alive with zero workers) is the one that wedges hosts.
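A rough sketch of option (b), with hypothetical names standing in for the pool internals (I haven't mapped this onto the actual source):

// Hypothetical sketch only: PoolLike and onWorkerExit are illustrative names,
// not graphile-worker's real internals.
interface PoolLike {
  _active: boolean;
  _shuttingDown: boolean;
  _workers: unknown[];
  gracefulShutdown(message: string): Promise<void>;
}

function onWorkerExit(pool: PoolLike, error: unknown): void {
  // Normal shutdown paths: nothing to do.
  if (!pool._active || pool._shuttingDown) return;
  // Other workers still running: the pool can limp along.
  if (pool._workers.length > 0) return;
  // Option (a) would spawn a replacement worker here instead.
  // Option (b): give up loudly so pool.promise / runner.promise settles
  // and the host can restart.
  void pool.gracefulShutdown(`All workers exited unexpectedly: ${String(error)}`);
}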