Fix XTaskQueue delayed-callback timing and wait-timer teardown races #975
Merged
…races Retarget delayed-callback wake handling so stale timer notifications cannot promote future work too early, keep the same-port empty-sweep rescue path iterative, and harden the STL wait timer so teardown cannot race an in-flight callback after a due heap entry is popped. The regression and hook changes remain grouped here because they cover the same delayed-callback scheduling line and the Linux timer backend that services it.
Keep the iOS backend on monotonic due times without truncating sub-millisecond delays, and implement the missing WaitTimer termination path so the Apple backend matches the shared TaskQueue lifecycle contract.
brianpepin
approved these changes
May 12, 2026
Contributor
brianpepin
left a comment
This change looks good and I appreciate the addl testing.
Summary
This PR fixes several correctness bugs in the XTaskQueueSubmitDelayedCallback
pipeline and the Linux STL wait-timer backend so delayed callbacks are neither
dispatched before their requested delay has elapsed nor stranded waiting for an
unrelated wake, and STL timer teardown can no longer race an in-flight callback
dispatch. See issues #971, #973, and #974 for additional detail.
The public XTaskQueue API surface is unchanged. The fixes are internal to the
delayed-callback scheduling path and timer backends, but they matter to any
consumer relying on timed waits, deferred continuations, retry back-off,
timeout enforcement, or queue teardown.
This branch also includes a small iOS backend cleanup identified during the
same audit: fractional-millisecond delays now round up instead of down, and the
iOS wait timer now routes destruction through WaitTimer::Terminate() for
consistency with the shared backend lifecycle. Unlike the STL change, however,
this iOS work does not add new teardown-quiescing logic.
Bugs addressed
This PR addresses five distinct correctness issues in delayed-callback
scheduling and the STL backend, and also includes one small iOS cleanup:
- XTaskQueueTerminate on a composite queue could disturb shared delayed-port
  state and make a sibling queue's not-yet-due callback appear ready
  immediately.
- … deadline could arrive after the timer had already been re-armed for a later
  entry. The old code treated any timer fire as proof that the current armed
  deadline had elapsed.
- … uint64_t timestamp instead of by whether the deadline had actually passed.
  Two entries with near-identical due times could see one promoted and the
  other stranded until the next unrelated timer fire.
- … on the same port while the dispatcher was clearing m_timerDue after an
  empty sweep, the new work could strand itself with no timer armed.
- … pointer, drop the queue lock, and then invoke the callback after
  WaitTimer::Terminate() had already destroyed the owning timer object.
- … fractional-millisecond delays and now routes destruction through
  WaitTimer::Terminate() for consistency with the shared WaitTimer lifecycle.

Across the five core correctness fixes, the external symptom is the same: a
delayed callback becomes runnable at the wrong time, either too early, only
after unrelated later work happens to wake the queue, or after teardown has
already started. The iOS cleanup is narrower and is limited to due-time
conversion, rounding, and lifecycle consistency.
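The stale-notification case can be illustrated with a small decision function. This is only a sketch under assumptions: `TimerFireAction` and `ClassifyTimerFire` are hypothetical names that do not appear in the PR, and the real code path operates on queue state rather than two bare integers.

```cpp
#include <cstdint>

// Hypothetical outcome of handling one timer notification.
enum class TimerFireAction
{
    PromoteReady,      // the armed deadline has passed; ready entries may run
    RearmForArmedDue   // the fire is stale or early; keep waiting
};

// Decide what a timer notification means. Both arguments are monotonic
// timestamps in the same unit (assumption: ticks from a steady clock).
inline TimerFireAction ClassifyTimerFire(std::uint64_t nowTicks,
                                         std::uint64_t armedDueTicks)
{
    // A notification that arrives before the currently armed deadline belongs
    // to an earlier arm (or fired early). It proves nothing about the current
    // deadline, so the timer is re-armed instead of promoting work.
    if (nowTicks < armedDueTicks)
    {
        return TimerFireAction::RearmForArmedDue;
    }
    return TimerFireAction::PromoteReady;
}
```

The point of the comparison is that readiness is decided by the clock, not by the mere fact that a timer callback ran.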
What changed
Delayed callback scheduling (TaskQueue.cpp)
- PromoteReadyPendingCallbacks sweeps all pending entries whose
  enqueueTime <= now, promotes every ready callback, and then re-arms the
  timer for the next future deadline. This replaces the old single-entry,
  exact-timestamp flow.
- SubmitPendingCallback now treats an early timer fire as a stale or early
  notification and retries against the currently armed due time instead of
  silently returning.
- … m_timerDue to UINT64_MAX, using a loop instead of recursive re-entry, which
  prevents the empty-sweep lost-wake race from leaving a newly queued delayed
  callback without an armed timer.
- CancelPendingEntries no longer mutates shared timer state during
  termination. It gathers canceled entries first and only publishes them after
  the pending-list traversal completes.
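The sweep-and-re-arm idea in the first bullet can be sketched as follows. This is a simplified, single-threaded illustration, not the code in TaskQueue.cpp: `PendingEntry`, `SweepResult`, and `PromoteReady` are hypothetical names, and the real implementation works on the queue's own pending list under its own locking.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical stand-in for one queued delayed callback.
struct PendingEntry
{
    std::uint64_t dueTime;  // monotonic due time (the PR text calls it enqueueTime)
    int id;                 // identifies the callback in this sketch
};

struct SweepResult
{
    std::vector<int> promoted;  // ids of entries that became runnable
    std::uint64_t nextDue;      // deadline to re-arm for, or UINT64_MAX if none
};

// Promote every entry whose deadline has passed, keep the rest pending, and
// report the earliest remaining deadline so the caller can re-arm the timer.
inline SweepResult PromoteReady(std::vector<PendingEntry>& pending,
                                std::uint64_t now)
{
    SweepResult result{ {}, std::numeric_limits<std::uint64_t>::max() };
    std::vector<PendingEntry> stillPending;

    for (const PendingEntry& entry : pending)
    {
        if (entry.dueTime <= now)
        {
            // Readiness is decided by the deadline, not by matching one
            // exact timestamp, so near-identical due times are all promoted.
            result.promoted.push_back(entry.id);
        }
        else
        {
            result.nextDue = std::min(result.nextDue, entry.dueTime);
            stillPending.push_back(entry);
        }
    }

    pending.swap(stillPending);
    return result;
}
```

With entries due at 100, 101, 300, and 500 and `now == 101`, the first two are promoted together and the timer would be re-armed for 300.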
Timer backends (WaitTimer_win32.cpp, WaitTimer_stl.cpp, ios_WaitTimer.mm)
- … steady_clock-based due times instead of mixing absolute wall-clock and
  relative timers.
- … stored due time back to a relative wait only when arming the platform
  timer.
- … generation-tags heap entries instead of null-scanning the queue on every
  re-arm, and uses an explicit quiescing Terminate() path so teardown can wait
  for in-flight dispatch safely without leaking the timer queue for process
  lifetime.
- … std::chrono::ceil<std::chrono::milliseconds> instead of truncating with
  duration_cast<std::chrono::milliseconds>.
- … WaitTimer::Terminate(), and its destructor routes through that entry point
  so timer cleanup follows the shared WaitTimer lifecycle shape. This is a
  narrow consistency cleanup, not the same quiescing teardown hardening used by
  STL.
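The rounding difference between the two standard conversions is easy to see in isolation. The helper names `TruncatedDelay` and `CeilDelay` below are hypothetical and exist only for this illustration; `std::chrono::ceil` requires C++17.

```cpp
#include <chrono>

// Truncating conversion: drops the fractional millisecond, so a positive
// delay can come out shorter than requested.
inline std::chrono::milliseconds TruncatedDelay(std::chrono::microseconds delay)
{
    return std::chrono::duration_cast<std::chrono::milliseconds>(delay);
}

// Ceiling conversion: rounds up to the next whole millisecond, so the
// resulting wait is never shorter than the requested delay.
inline std::chrono::milliseconds CeilDelay(std::chrono::microseconds delay)
{
    return std::chrono::ceil<std::chrono::milliseconds>(delay);
}
```

For a requested delay of 400 microseconds, the truncating form yields 0 ms (fire immediately) while the ceiling form yields 1 ms, which is what preserves the not-before-deadline behavior for sub-millisecond delays.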
Thread pool (ThreadPool_stl.cpp)
- SubmitWork now uses notify_one instead of notify_all when waking a worker
  thread. Each queued call issues its own wake, so thundering-herd wakes on
  every submit are unnecessary.
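The reasoning behind notify_one can be shown with a minimal worker pool. This is an illustrative sketch, not the contents of ThreadPool_stl.cpp: `TinyPool` and its members are hypothetical, and the real pool has its own lifecycle and callback types.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

class TinyPool
{
public:
    explicit TinyPool(unsigned threadCount)
    {
        for (unsigned i = 0; i < threadCount; ++i)
        {
            m_threads.emplace_back([this] { Run(); });
        }
    }

    ~TinyPool()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_stop = true;
        }
        m_cv.notify_all();  // shutdown is the one case where every worker must wake
        for (std::thread& t : m_threads)
        {
            t.join();
        }
    }

    void Submit(std::function<void()> work)
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_work.push_back(std::move(work));
        }
        // One new item needs at most one worker; each Submit issues its own wake.
        m_cv.notify_one();
    }

private:
    void Run()
    {
        for (;;)
        {
            std::function<void()> work;
            {
                std::unique_lock<std::mutex> lock(m_mutex);
                m_cv.wait(lock, [this] { return m_stop || !m_work.empty(); });
                if (m_work.empty())
                {
                    return;  // stopping and nothing left to drain
                }
                work = std::move(m_work.front());
                m_work.pop_front();
            }
            work();  // run outside the lock
        }
    }

    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::deque<std::function<void()>> m_work;
    bool m_stop = false;
    std::vector<std::thread> m_threads;  // declared last so other members exist first
};
```

Because every waiter checks the same predicate under the same mutex, waking a single worker per submitted item is sufficient; notify_all would only wake additional workers that find the queue empty and go back to sleep.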
Test hooks and regression coverage
- … HC_UNITTEST_API).
- … to passing fix for each bug.
Regression coverage
Deterministic coverage now includes:
- VerifyDelayedCallbackTimerRaceOnManualQueue
- VerifyFutureDelayedCallbackQueuedDuringEmptySweepDoesNotStall
- VerifyTerminationDoesNotEarlyPromoteSiblingDelayedCallback
- VerifyStaleDelayedCallbackDoesNotEarlyPromoteNextPendingEntry

Together these cover the original stale callback repro, the shared-port
termination bug, the stale-timer retargeting bug, and the same-port lost-wake
race. Correctness and hardening are further exercised by a downstream Win32
and Linux coroutines test suite and the benchmark validation called out below.
Public API surface
- No XTaskQueue signatures change.
- XTaskQueueSubmitDelayedCallback now correctly preserves its
  not-before-deadline behavior under shared delayed ports, concurrent
  termination, timer retargeting, and same-port requeue races.
- WaitTimer::Terminate() remains an internal backend concern. The STL fix only
  changes the internal lifetime model so delayed callbacks cannot outlive their
  timer object during teardown.
- XTaskQueueTerminate is unchanged from the caller's perspective: callbacks on
  the terminated queue still complete as canceled. The fix only stops that
  termination from affecting sibling queues.
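One way to picture a teardown that quiesces in-flight dispatch is an in-flight counter guarded by a mutex. This is a sketch under assumptions, not the STL backend's actual lifetime model: `QuiescingTimer`, `Dispatch`, and the counter approach are hypothetical and chosen only to show the shape of the guarantee.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <utility>

class QuiescingTimer
{
public:
    explicit QuiescingTimer(std::function<void()> callback)
        : m_callback(std::move(callback))
    {
    }

    // Called by the (hypothetical) timer thread after a due entry is popped.
    // Returns false if teardown has begun and the callback was skipped.
    bool Dispatch()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            if (m_terminated)
            {
                return false;  // block new dispatch once teardown starts
            }
            ++m_inFlight;
        }

        m_callback();  // runs outside the lock

        {
            std::lock_guard<std::mutex> lock(m_mutex);
            --m_inFlight;
            // Notify while still holding the lock so Terminate() cannot
            // return, and the object cannot be destroyed, before this call.
            m_idle.notify_all();
        }
        return true;
    }

    // Stops new dispatch and waits for any in-flight callback to finish.
    // Must not be called from inside the callback itself (it would deadlock).
    void Terminate()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_terminated = true;
        m_idle.wait(lock, [this] { return m_inFlight == 0; });
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_idle;
    std::function<void()> m_callback;
    int m_inFlight = 0;
    bool m_terminated = false;
};
```

After `Terminate()` returns, no callback is running and none can start, so the owner can destroy the timer without racing a dispatch that had already popped its entry.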
Validation
- libHttpClient Windows unit test suite: … 87/87)

Additional downstream integration validation:
- … delegate queues and delayed callbacks to implement sleep, timeout, and
  cancellation behavior.
- … on that coroutine layer.
- … the fix, including workloads that had previously exposed the original
  shutdown hang.
Not performed:
- … was not exercised on-device or in an iOS CI environment.
- … sustained load: the timer bugs cause crashes during connection scaling and
  throttle delayed-callback throughput to polling frequency, so no meaningful
  pre-fix performance baseline exists for comparison.
Scope
- … narrow due-time and lifecycle cleanup.
Immediate dispatch and unrelated task-queue behavior are unchanged.
Review focus
- … enqueueTime <= now criterion.
- … SubmitPendingCallback.
- … m_timerDue is cleared.
- … STL teardown quiesce in-flight dispatch instead of relying on raw timer
  pointers.
- … through WaitTimer::Terminate().

Potential follow-up: iOS hardening
The current PR deliberately stops short of a full iOS timer hardening pass.
The iOS changes here are limited to monotonic due-time conversion, ceiling
rounding, and routing destruction through WaitTimer::Terminate(); they do not
attempt to add the shared-state teardown quiescing used by the STL backend.
If we decide to pursue that Apple-platform hardening in a follow-up, the likely
shape is:
- … NSTimer plus raw WaitTimerImpl* userInfo handoff with a one-shot
  dispatch_source_t timer or another non-run-loop primitive.
- … block new dispatch and wait for any in-flight callback to quiesce before
  the timer object is destroyed.
- … dispatch later, for example by per-arm source replacement or generation
  tagging.
- … it, since moving away from NSTimer would change both the timer primitive
  and the execution context.
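Generation tagging, mentioned above as one option, can be sketched independently of any platform timer. The `GenerationGate` class below is hypothetical and written in plain C++ rather than Objective-C++; it is not thread-safe on its own and assumes the caller holds whatever lock protects the timer state.

```cpp
#include <cstdint>

class GenerationGate
{
public:
    // Arm (or re-arm) the timer and return the tag that the scheduled fire
    // must carry. Any previously issued tag becomes stale.
    std::uint64_t Arm()
    {
        return ++m_generation;
    }

    // Cancel without re-arming; this also invalidates the outstanding tag.
    void Cancel()
    {
        ++m_generation;
    }

    // A fire is current only if its tag matches the latest arm.
    bool IsCurrent(std::uint64_t tag) const
    {
        return tag == m_generation;
    }

private:
    std::uint64_t m_generation = 0;
};
```

A timer handler would capture the tag returned by `Arm()` and check `IsCurrent(tag)` before dispatching, so a fire that was scheduled before a re-arm or cancel is recognized and dropped instead of promoting work early.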
That would be a reasonable direction if we later decide to fix all timer
backends to the same lifetime standard, but it is intentionally out of scope
for this PR because we do not currently have Apple-platform correctness or
performance validation in hand.