Skip to content

fix: reject in-flight task promises when worker crashes#146

Merged
jerome-benoit merged 12 commits into
masterfrom
fix/reject-in-flight-promises-on-worker-crash
May 14, 2026
Merged

fix: reject in-flight task promises when worker crashes#146
jerome-benoit merged 12 commits into
masterfrom
fix/reject-in-flight-promises-on-worker-crash

Conversation

@jerome-benoit
Copy link
Copy Markdown
Contributor

Summary

When a worker crashes, reject all associated task promises (in-flight and queued) so pool.execute() callers never hang indefinitely.

Port of poolifier/poolifier#3211 adapted for the web worker context.

Problem

When a worker crashes (onerror event), tasks already dispatched to it had their promises stored in promiseResponseMap but were never settled. The error handler only:

  • Restarted the worker
  • Redistributed queued tasks

In-flight tasks (already sent to the crashed worker) were silently dropped — their pool.execute() promises hung forever.

Changes

Promise settlement on crash

  • Add rejectInFlightTaskPromises(): iterates promiseResponseMap, finds entries targeting the crashed worker by stable workerId, and rejects them
  • Add rejectRemainingQueuedTaskPromises(): rejects queued tasks that could not be redistributed (no eligible destination worker)
  • Crash handling order: reject in-flight → redistribute queued → reject remaining

Architectural improvement

  • Replace mutable workerNodeKey (array index) with stable workerId in PromiseResponseWrapper for crash-time promise matching — array indices become stale after worker removal
  • Resolve workerNodeKey dynamically from workerId at task response time
  • Add updatePromiseResponseWorkerId() to update promise ownership on task steal/redistribute

Hardening

  • Guard against undefined workerId in rejectInFlightTaskPromises to prevent erroneous matching
  • Gate worker restart in exit handler on restartWorkerOnError option (previously unconditional)
  • Guard handleTaskExecutionResponse against removed worker nodes (workerNodeKey === -1)
  • Call errorEvent.preventDefault() in onerror to prevent Deno from propagating handled errors

Web worker adaptations (vs upstream Node.js PR)

  • No cluster worker exit-code crash detection (web workers have no exit codes)
  • No AsyncResource/runInAsyncScope helpers (not available in browser/Deno/Bun)
  • ErrorEvent used instead of Error for event dispatch (standard web API)
  • Synchronous terminate() instead of async (web worker API)

Validation

  • deno check passes (0 type errors)
  • deno lint passes (0 lint errors)
  • All existing tests pass (fixed pool: 15/15, dynamic pool: 9/9, worker-node: 3/3)
  • New crash regression test passes

When a worker crashes (onerror event), tasks already dispatched to it had
their promises stored in promiseResponseMap but were never settled. The
error handler only restarted the worker and redistributed queued tasks —
in-flight tasks were silently dropped, causing pool.execute() callers to
hang indefinitely.

The fix iterates promiseResponseMap on worker error, finds all entries
targeting the crashed worker by stable workerId, and rejects them with a
descriptive error before termination.

Changes:
- Replace workerNodeKey (array index) with stable workerId in
  PromiseResponseWrapper — indices become stale after removeWorkerNode
- Add handleWorkerNodeCrash() to unify crash recovery logic
- Add rejectInFlightTaskPromises() and rejectRemainingQueuedTaskPromises()
- Add updatePromiseResponseWorkerId() for task steal/redistribute tracking
- Resolve workerNodeKey dynamically from message.workerId at response time
- Gate exit handler worker restart on restartWorkerOnError option
- Add crash worker test and regression test for promise rejection

Port of poolifier/poolifier#3211
Copilot AI review requested due to automatic review settings May 14, 2026 17:27
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR ensures pool.execute() callers don’t hang indefinitely when a worker crashes by explicitly rejecting promises for tasks that were in-flight on the crashed worker (and any queued tasks that can’t be redistributed). It also switches crash-time task/promise association from a mutable worker array index to a stable workerId.

Changes:

  • Reject in-flight (and remaining unredistributable queued) task promises when a worker hits onerror.
  • Track promise ownership via stable workerId (and update it on task steal/redistribution); resolve worker node key dynamically from response workerId.
  • Add a regression test and a crash-simulating worker to validate promise rejection behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
tests/worker-files/thread/crashWorker.mjs Adds a worker that crashes via an unhandled exception to reproduce the hanging-promise scenario.
tests/pools/thread/fixed.test.mjs Adds a regression test for crash-time promise rejection and adjusts queue assertions.
src/utility-types.ts Updates PromiseResponseWrapper to store workerId instead of workerNodeKey.
src/pools/abstract-pool.ts Implements crash handling that rejects affected promises, redistributes queued work, and updates promise ownership on task moves.

Comment thread tests/pools/thread/fixed.test.mjs
- Add crashHandled flag to WorkerInfo to prevent double crash handling
  between onerror and exit handlers
- Move rejectRemainingQueuedTaskPromises outside started/destroying guard
  (queued promises must always be settled regardless of pool state)
- Handle workerId == null case in rejectInFlightTaskPromises (crash before
  worker ID assignment)
- Add { cause } to crash Error constructors for error chain traceability
- Update task statistics (executing/failed) on crash rejection
- Add workerNodeKey === -1 guards in updatePromiseResponseWorkerId and
  handleWorkerReadyResponse
- Exit handler: detect unexpected exit via crashHandled flag, condition
  restart on restartWorkerOnError or normal exit
- Tighten test bounds for stolen/sequentiallyStolen task counts

Aligns with poolifier/poolifier#3211 latest changes
- Guard getWorkerNodeKeyByWorkerId against undefined workerId to prevent
  erroneous matching of uninitialized worker nodes
- Call updatePromiseResponseWorkerId BEFORE handleTask in redistribute
  and stealTask for correctness-by-construction (prevents stale workerId
  if task response arrives synchronously)
- Construct crashError once in handleWorkerNodeCrash and pass as param
  to rejectInFlightTaskPromises/rejectRemainingQueuedTaskPromises (DRY)
- Add executing--/failed++ stats in rejectInFlightTaskPromises null-path
  for consistency with the non-null path
- Fix handleWorkerNodeCrash JSDoc to accurately describe behavior
Copilot AI review requested due to automatic review settings May 14, 2026 21:00
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.

…exit

Add flushWorkerNodePromises() method as a catch-all to reject any
unsettled promises when a worker node exits without crash handling
(e.g., unexpected exit without onerror, or termination during pool
destroy).

The exit handler now has three distinct blocks:
1. Crash detection: handleWorkerNodeCrash if ready && !crashHandled
2. Promise flush: flushWorkerNodePromises for remaining unsettled
   promises (guarded by ready && !crashHandled to skip intentional exits)
3. Cleanup: removeWorkerNode + conditional restart

Aligns with poolifier/poolifier#3211 latest changes (flushWorkerNodePromises
catch-all pattern)
Copilot AI review requested due to automatic review settings May 14, 2026 21:37
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.

Comment thread src/pools/abstract-pool.ts Outdated
Comment on lines +2438 to +2439
const { resolve, reject } = promiseResponse
const workerNodeKey = this.getWorkerNodeKeyByWorkerId(workerId)
Comment on lines +2456 to +2458
if (workerNodeKey !== -1) {
this.afterTaskExecutionHook(workerNodeKey, message)
}
Comment on lines +1709 to +1720
if (
workerNodeKey !== -1 &&
!workerNode.info.crashHandled &&
workerNode.info.ready
) {
this.flushWorkerNodePromises(
workerNode,
this.destroying
? new Error('Worker node terminated during pool destroy')
: exitError,
)
}
Comment thread src/pools/abstract-pool.ts Outdated
Comment on lines +2028 to +2031
* Rejects all unsettled promises targeting the given worker node.
* Used as a catch-all when crash handling was bypassed (e.g., during pool
* destroy or for non-ready worker exits). Idempotent: already-deleted
* entries are simply skipped.
Comment thread src/pools/abstract-pool.ts Outdated
Comment on lines +1956 to +1969
const workerNode = this.workerNodes[workerNodeKey]
const crashedWorkerId = workerNode.info.id
if (crashedWorkerId == null) {
for (const [taskId, promiseResponse] of this.promiseResponseMap) {
if (promiseResponse.workerId == null) {
promiseResponse.reject(crashError)
this.promiseResponseMap.delete(taskId)
if (workerNode.usage.tasks.executing > 0) {
--workerNode.usage.tasks.executing
}
++workerNode.usage.tasks.failed
workerNode.dispatchEvent(new Event('taskFinished'))
}
}
Comment thread src/pools/abstract-pool.ts Outdated
Comment on lines +1984 to +1990
promiseResponse.reject(crashError)
this.promiseResponseMap.delete(taskId)
if (workerNode.usage.tasks.executing > 0) {
--workerNode.usage.tasks.executing
}
++workerNode.usage.tasks.failed
workerNode.dispatchEvent(new Event('taskFinished'))
Comment thread src/pools/abstract-pool.ts Outdated
Comment on lines +2017 to +2020
promiseResponse.reject(crashError)
this.promiseResponseMap.delete(task.taskId)
++workerNode.usage.tasks.failed
workerNode.dispatchEvent(new Event('taskFinished'))
Comment thread src/pools/abstract-pool.ts Outdated
Comment on lines +2045 to +2051
promiseResponse.reject(error)
this.promiseResponseMap.delete(taskId)
if (workerNode.usage.tasks.executing > 0) {
--workerNode.usage.tasks.executing
}
++workerNode.usage.tasks.failed
workerNode.dispatchEvent(new Event('taskFinished'))
jerome-benoit and others added 2 commits May 14, 2026 23:58
…mplify null-path

- Extract rejectTaskPromise() helper to eliminate duplicated reject +
  delete + stats + taskFinished pattern across 3 methods
- Add fallback to promiseResponse.workerId in handleTaskExecutionResponse
  when message.workerId lookup fails (worker already removed)
- Simplify rejectInFlightTaskPromises null-ID path to early-return
- Keep flush guard as workerNode.info.ready (not this.destroying) since
  synchronous terminate in web workers causes unhandled rejections during
  pool.destroy() — destroyWorkerNode already handles graceful shutdown
Copilot AI review requested due to automatic review settings May 14, 2026 21:58
@jerome-benoit jerome-benoit review requested due to automatic review settings May 14, 2026 21:58
Copilot AI review requested due to automatic review settings May 14, 2026 22:29
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Comment on lines 1189 to +1193
)
this.promiseResponseMap.set(task.taskId!, {
reject,
resolve,
workerNodeKey,
workerId: this.workerNodes[workerNodeKey].info.id,
The JSDoc was copied from upstream where exitCode !== 0 allows flush
during pool destroy. In our web worker adaptation, synchronous terminate
causes unhandled rejections if flush runs during destroy, so the guard
uses workerNode.info.ready instead. Updated doc to reflect this.
The abort signal handler captured workerNodeKey at task submission time.
After steal/redistribute, the stale index dispatches abortTask to the
wrong worker node. Resolve the current worker dynamically from the
stored workerId (kept up-to-date by updatePromiseResponseWorkerId).
Copilot AI review requested due to automatic review settings May 14, 2026 22:43
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Comment on lines +1736 to +1737
(this.opts.restartWorkerOnError === true ||
!workerNode.info.crashHandled)
@sonarqubecloud
Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
63.9% Coverage on New Code (required ≥ 85%)

See analysis details on SonarQube Cloud

@jerome-benoit jerome-benoit merged commit 74d9529 into master May 14, 2026
19 of 20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants