Worker pool stays "_active=true" with empty _workers after synchronized seppukus → runner.promise never resolves #592

@DivMode

Summary

Hit a wedged state in graphile-worker 0.16.6 where the pool ended up with _active=true, _shuttingDown=false, and _workers=[] after all workers committed seppuku within a few seconds of each other. runner.promise never resolved, the Node process stayed alive indefinitely, and the queue drained to 0 in_flight while the waiting count grew without bound.

graphile-worker itself logged the diagnosis:

[core] ERROR: Worker exited, but pool is in continuous mode, is active,
              and is not shutting down... Did something go wrong?

…and then never recovered.

Reproduction

Production setup running graphile-worker via run({ concurrency: 8, pollInterval: 2000, taskList }) in Docker on Hetzner. Task handlers dispatch HTTP requests to a Cloudflare Worker that drives a Workflow; when the workflow returns an errored status, the task throws, and the failure is normal graphile-worker behavior (retry up to max_attempts=25).
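The failure path in the handler is essentially the following sketch. The status union and the injected fetchStatus function are illustrative stand-ins for the HTTP call to the Cloudflare Worker, not our actual code:

```typescript
type WorkflowStatus = "complete" | "errored" | "running";

// Hypothetical sketch of the task-handler failure path. fetchStatus stands in
// for the HTTP round trip to the Cloudflare Worker that drives the Workflow.
function makeWorkflowTask(
  fetchStatus: (payload: unknown) => Promise<WorkflowStatus>,
) {
  return async (payload: unknown): Promise<void> => {
    const status = await fetchStatus(payload);
    if (status === "errored") {
      // Throwing is the normal failure signal: graphile-worker records the
      // error and retries with backoff, up to max_attempts (25 in our config).
      throw new Error("workflow reported errored status");
    }
  };
}
```

The thrown error is the ordinary retry path; the bug below is about what happens when releasing that failure back to Postgres itself fails.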

What triggered the wedge: at the moment 8 concurrent jobs were all in the failure-release path, a transient DNS hiccup made the next pg-pool connection lookup hit getaddrinfo ETIMEOUT. All 8 workers were unable to release their claims → all 8 committed seppuku:

[worker(W1)] ERROR: Failed to release job '64844' after failure '...'; committing seppuku
[worker(W2)] ERROR: Failed to release job '64846' after failure '...'; committing seppuku
[worker(W3)] ERROR: Failed to release job '64847' after failure '...'; committing seppuku
... all 8 seppuku within ~1 second of each other ...
getaddrinfo ETIMEOUT
[core] ERROR: Worker exited, but pool is in continuous mode, is active,
              and is not shutting down... Did something go wrong?
[core] ERROR: Worker exited with error: DNSException: getaddrinfo ETIMEOUT

After this, DNS recovered immediately (verified dns.lookup("postgres") worked seconds later). But:

  • pool._active === true
  • pool._shuttingDown === false
  • pool._workers.length === 0
  • runner.promise did not resolve or reject
  • Node process stayed alive
  • Zero jobs were picked up for 9 hours, until we noticed and ran docker restart.

Expected behavior

Either:

  1. Pool replaces dead workers when _active=true && !_shuttingDown && _workers.length === 0 — eventually-consistent recovery from synchronized worker fatalities.
  2. OR, if irrecoverable, runner.promise rejects (or resolves) so the host can react. Right now the only signal is the "Did something go wrong?" log line, which is text, not a state change anything can await.

Either would have let our orchestrator (Docker restart: unless-stopped) recover automatically. As it stands the host has to add its own watchdog that inspects pool._workers.length and process.exit() itself when the count stays at 0.

Workaround (in case anyone else hits this)

External watchdog using the @internal fields exposed on WorkerPool:

import { run, type WorkerPool } from "graphile-worker";

const runner = await run({ ... });

// The pool isn't exposed on Runner directly; grab it via the pool:create event.
let workerPool: WorkerPool | null = null;
runner.events.on("pool:create", ({ workerPool: pool }) => { workerPool = pool; });

// Exit (and let the orchestrator restart us) if the pool sits in the wedged
// state — active, not shutting down, zero workers — for two minutes straight.
let zombieSince: number | null = null;
setInterval(() => {
  if (!workerPool) return;
  const zombie =
    workerPool._workers.length === 0 &&
    workerPool._active &&
    !workerPool._shuttingDown;
  if (zombie) {
    zombieSince ??= Date.now();
    if (Date.now() - zombieSince >= 120_000) process.exit(1);
  } else {
    zombieSince = null;
  }
}, 15_000);

await runner.promise;

await runner.promise;

…but this relies on @internal fields and isn't a real fix.

Versions

  • graphile-worker@0.16.6
  • pg@8.13.1
  • Bun 1.3.9, Linux x86_64
  • Postgres 18 on a Docker bridge network (postgres resolved via embedded Docker DNS)

Possible upstream fix

In WorkerPool's worker-exit handler: when a worker's promise rejects and _active && !_shuttingDown, either (a) spawn a replacement worker, or (b) call gracefulShutdown() and let pool.promise reject so the runner can propagate up. The current third path — log a question mark and stay alive with zero workers — is the one that wedges hosts.
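Sketched as a pure decision function (hypothetical names and shapes, not the actual WorkerPool internals), the proposal is to make the third path impossible:

```typescript
type PoolState = { active: boolean; shuttingDown: boolean; workerCount: number };
type ExitAction = "replace" | "shutdown" | "none";

// Hypothetical sketch of the proposed worker-exit handling, not
// graphile-worker's real code: when a worker dies while the pool is active
// and not shutting down, either spawn a replacement (a) or shut down so
// pool.promise rejects (b) — never log-and-idle with zero workers.
function onWorkerExit(pool: PoolState, replaceDeadWorkers: boolean): ExitAction {
  if (!pool.active || pool.shuttingDown) return "none"; // normal shutdown path
  if (replaceDeadWorkers) return "replace";             // option (a)
  if (pool.workerCount === 0) return "shutdown";        // option (b)
  return "none"; // other workers still running; nothing to do yet
}
```

Either branch gives the host a state change it can await or react to, instead of a log line.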
