Graceful shutdown: worker join has no timeout; stuck perform blocks SIGTERM #7

@jerry7991

Description

Problem

bin/nebula_queue_worker:

  • Uses a global $running flag (lines 75–83): untestable, and it limits each process to a single worker instance.
  • threads.each(&:join) (line 114) has no timeout.
  • trap('TERM') flips the flag, but a thread in the middle of a long perform (HTTP call, slow DB query) does not check the flag until perform returns. Heroku sends SIGKILL 30s after SIGTERM, and K8s does the same by default → the process dies exactly the way we feared in the reliable-fetch issue.
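
To make the join problem concrete, here is a minimal standalone illustration (not code from bin/nebula_queue_worker). A sleeping thread stands in for a long perform. Ruby's Thread#join accepts an optional limit in seconds and returns nil if the thread is still running when the limit expires, which is the primitive a bounded shutdown can build on.

```ruby
# A sleeping thread stands in for a job stuck inside a long `perform`.
worker = Thread.new { sleep 5 }

# `worker.join` with no argument would block here for the full 5 seconds.
# With a limit, join gives up and returns nil once the limit has elapsed.
result = worker.join(0.1)
still_running = worker.alive?

# Clean up the stand-in thread so this example exits promptly.
worker.kill
worker.join
```

Here `result` is nil and `still_running` is true, so the caller knows the thread did not finish in time and can decide what to do with its job.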

Impact

  • Deploys either hang for up to the platform's kill grace period or forcibly lose in-flight jobs.
  • No way to run two isolated workers in one process (tests, embedded use).

Fix

  1. Replace $running with a NebulaQueue::Launcher instance holding its own @running flag (e.g. a Concurrent::AtomicBoolean).
  2. On SIGTERM:
    • Stop the fetcher loop (no new jobs pulled).
    • Wait up to shutdown_timeout (default 25s on Heroku, configurable) for in-flight threads to finish.
    • Force-requeue any jobs still checked out (pairs with the reliable-fetch working lists) before exiting.
  3. Exit non-zero if forced termination was required, so the platform logs a clear signal.
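
The steps above could be sketched roughly as follows. This is an illustration under assumptions, not the proposed implementation: the class name LauncherSketch, the fetch/perform/requeue callables, and the constructor arguments are all hypothetical, and a Mutex-guarded flag stands in for Concurrent::AtomicBoolean so the example needs only the standard library.

```ruby
# Hypothetical sketch of an instance-scoped launcher; names and the
# callable-based interface are assumptions made for illustration.
class LauncherSketch
  def initialize(fetch:, perform:, requeue:, concurrency: 2, shutdown_timeout: 25)
    @fetch = fetch                 # -> job or nil when the queue is empty
    @perform = perform             # ->(job) runs the job
    @requeue = requeue             # ->(job) puts the job back on the queue
    @concurrency = concurrency
    @shutdown_timeout = shutdown_timeout
    @lock = Mutex.new
    @running = false               # per-instance flag, replaces $running
    @in_flight = {}                # thread => job currently checked out
    @threads = []
  end

  def running?
    @lock.synchronize { @running }
  end

  def start
    @lock.synchronize { @running = true }
    @threads = Array.new(@concurrency) { Thread.new { work_loop } }
    self
  end

  # Returns true on a clean shutdown, false if threads had to be killed
  # or jobs had to be requeued.
  def stop
    @lock.synchronize { @running = false }        # no new jobs are pulled
    deadline = monotonic_now + @shutdown_timeout
    # One shared deadline for all threads, not shutdown_timeout per thread.
    @threads.each { |t| t.join([deadline - monotonic_now, 0].max) }

    stragglers = @threads.select(&:alive?)
    stragglers.each(&:kill)
    stragglers.each(&:join)                       # wait until they are gone

    leftover = @lock.synchronize do
      jobs = @in_flight.values
      @in_flight.clear
      jobs
    end
    leftover.each { |job| @requeue.call(job) }    # force-requeue checked-out jobs
    stragglers.empty? && leftover.empty?
  end

  private

  def monotonic_now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end

  def work_loop
    while running?
      job = @fetch.call
      if job.nil?
        sleep 0.01
        next
      end
      @lock.synchronize { @in_flight[Thread.current] = job }
      begin
        @perform.call(job)
      rescue StandardError
        # Retry and error reporting are out of scope for this sketch.
      end
      # Not in an `ensure`: a killed thread must leave its job registered
      # so that `stop` can requeue it.
      @lock.synchronize { @in_flight.delete(Thread.current) }
    end
  end
end
```

The worker script would then call something like `exit(launcher.stop ? 0 : 1)` to cover step 3. Two caveats: Ruby does not allow locking a Mutex inside a trap handler, so the handler should only signal the main thread, which then calls stop. A job killed mid-perform may have partially run before being requeued, so this gives at-least-once semantics and jobs need to tolerate re-execution.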

Acceptance

  • SIGTERM during a long-running job returns within shutdown_timeout seconds.
  • In-flight jobs are either completed or cleanly requeued — never silently dropped.
  • Shutdown is testable without process-level globals.
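
For the testability point, one common approach (an assumption here, not something the current script does) is the self-pipe pattern: the trap handler only writes a byte to a pipe, and ordinary code reads it and triggers shutdown. A test can then deliver SIGTERM to its own process and observe the result without any global flag. A minimal sketch:

```ruby
# Self-pipe sketch: the handler does the bare minimum inside trap context.
reader, writer = IO.pipe
previous = Signal.trap('TERM') { writer.write_nonblock('T') }

# Deliver SIGTERM to this very process, as a test could do.
Process.kill('TERM', Process.pid)

# Ordinary (non-trap) code waits for the byte; a real worker would call
# its launcher's shutdown method at this point.
ready = IO.select([reader], nil, nil, 2)
received = ready ? reader.read_nonblock(1) : nil

# Restore whatever handler was installed before.
Signal.trap('TERM', previous || 'DEFAULT')
```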
