Skip to content

Centralize timer-arming to close TOCTOU race#983

Open
jhugard wants to merge 3 commits into
mainfrom
user/jhugard/fix-timer-arm-toctou
Open

Centralize timer-arming to close TOCTOU race#983
jhugard wants to merge 3 commits into
mainfrom
user/jhugard/fix-timer-arm-toctou

Conversation

@jhugard
Copy link
Copy Markdown
Collaborator

@jhugard jhugard commented May 21, 2026

Summary

Follow-up to #975 and #980: retargeting m_timerDue still had a TOCTOU window
around the non-atomic CAS -> Start sequence. Thread A could publish a later due
time, thread B could then publish and arm an earlier one, and thread A's
trailing Start() could overwrite the OS timer with the later deadline and
strand the earlier callback until unrelated traffic arrived, as in #973.

The previous fix in #980 added a post-Start verification check in
SubmitPendingCallbacks, but the same unguarded pattern existed in QueueItem
and PromoteReadyPendingCallbacks.

This change centralizes the CAS+Start+verify logic into three focused helpers
so every timer-arm path uses the same post-Start verification pattern without
burying it in call-site control flow.

The public XTaskQueue API surface is unchanged. The fix is internal to the
delayed-callback scheduling path and matters to any consumer relying on timed
waits, deferred continuations, retry back-off, or timeout enforcement.

Bug addressed

Timer-arm TOCTOU overwrite. When two threads race to arm m_timerDue, the
sequence CAS → Start is not atomic. Between a successful CAS and the subsequent
Start() call, a concurrent thread can:

  1. CAS a new (earlier) deadline into m_timerDue.
  2. Call Start() with that earlier deadline.

The first thread's Start() then fires after thread B's, overwriting the OS
timer with a later deadline. The earlier pending entry is stranded until some
unrelated timer fire or new submission happens to flush it.

The external symptom matches the stale-callback bug fixed in #975: a delayed
callback becomes runnable only after unrelated later work wakes the queue.
Under sustained concurrent delayed-callback submission this can show up as
latency jitter or starvation for a particular deadline bucket.

What changed

New helpers (TaskQueue.cpp)

Three private methods replace all inline CAS+Start patterns:

Helper Semantics
ArmTimerIfEarlier(dueTime) Min-wins CAS with post-Start verification. Treats equality as a verified re-arm of an already-published deadline. Returns false only when the published due time moved later and the caller should re-evaluate.
ArmTimerForNextPendingDueTime(previousDueTime, nextDueTime) Publishes the next surviving future deadline after a ready sweep. This is the only helper allowed to move m_timerDue later. Returns false when the caller must rescan because the published value changed again.
RearmTimerIfDueTimeUnchanged(dueTime) Re-arms the exact deadline seen by an early or stale timer callback. Refuses to resurrect a deadline that another thread already consumed. Returns false if the observed due time is stale.

Call-site changes

  • QueueItem: Replaced the inline CAS loop with a single call to
    ArmTimerIfEarlier(entry.enqueueTime). The return value is intentionally
    ignored because the entry is already safe in m_pendingList even if another
    thread retargets the timer first.
  • PromoteReadyPendingCallbacks: Replaced the inline CAS loop that
    published nextItem.enqueueTime with ArmTimerForNextPendingDueTime(dueTime, nextItem.enqueueTime). On false (world changed), the function refreshes
    now and dueTime and rescans.
  • SubmitPendingCallbacks: Replaced the inline CAS + post-verification
    stanza from Update TaskQueue with changes from Windows repo #980 with RearmTimerIfDueTimeUnchanged(dueTime). On false,
    the outer loop reloads state and re-evaluates.

The helper extraction removes the duplicated inline CAS logic from all three
call sites.

Design rationale

Why three helpers instead of one?

Each call site has distinct preconditions on the direction the published
deadline may move:

  • QueueItem: only moves earlier (new entry).
  • PromoteReadyPendingCallbacks: moves later (just-fired deadline replaced
    with next future one).
  • SubmitPendingCallbacks: re-arms the same value (early/stale timer fire).

A single function would need mode flags or extra parameters to express these
constraints, and its post-Start verification logic would diverge per mode. Three
small functions with clear names and focused contracts are simpler to audit.

Why <= in ArmTimerIfEarlier?

The <= comparison keeps equality on the same verified path. That matters for
benign races where concurrent submitters compute the same due time, and it
keeps ArmTimerIfEarlier correct for callers that need to re-arm a deadline
that is already published rather than strictly earlier.

Post-Start verification pattern

All three helpers perform the same post-Start() check against the published
value:

afterDue = m_timerDue.load();
if (afterDue == targetDue) → stable, return true
if (afterDue < targetDue) → an earlier deadline won; repair or defer to it
if (afterDue > targetDue) → caller re-evaluates

That verification closes the TOCTOU window: if another thread's CAS lands
between ours and Start(), we detect it and either repair the earlier arm or
yield to the thread that already published it, rather than leaving it stranded.

ABA safety

If m_timerDue goes through X → Y → X between our CAS and verification, we
read back our own value and return true. The concurrent thread that moved it
to Y and back called its own Start(X), so the timer is correctly armed for
X. ABA here is benign by construction.

Public API surface

  • No public XTaskQueue signatures change.
  • XTaskQueueSubmitDelayedCallback behavior is unchanged from the caller's
    perspective: the not-before-deadline guarantee is now upheld more reliably
    under concurrent submission and timer retargeting races.

Validation

libHttpClient Windows unit test suite:

These tests were added in #975 and exercise the exact interleaving scenarios
that this change hardens. The TOCTOU window itself is probabilistic and
difficult to deterministically reproduce in a unit test (the window is a few
nanoseconds between CAS and Start), but the structural guarantee — that
post-Start verification is now uniformly applied — eliminates the class of bug
rather than papering over individual instances.

Additional downstream integration validation:

  • Exercised through an internal coroutines library that uses XTaskQueue delegate
    queues and delayed callbacks to implement sleep, timeout, and cancellation.
  • Also exercised through an internal WebSocket scale/throughput benchmark built
    on that coroutine layer.
  • Repeated Windows soak and end-to-end runs remained stable after the fix.

Scope

  • No public API additions or removals.
  • No new test cases (existing regression suite already covers the affected
    code paths).
  • Change is scoped to delayed-callback timer arming. Immediate dispatch and
    unrelated task-queue behavior are unchanged.
  • No backend-specific timer code changes were required; all platforms pick up
    the fix through the shared TaskQueuePortImpl scheduling logic.

Review focus

  • The post-Start verification loop in each helper (the core TOCTOU fix).
  • The <= semantics in ArmTimerIfEarlier, especially around equal-deadline
    races between concurrent submitters.
  • ArmTimerForNextPendingDueTime's decision to return true when another thread
    publishes an earlier deadline (delegating responsibility to that thread's own
    verify cycle).
  • RearmTimerIfDueTimeUnchanged's refusal to resurrect a stale deadline when
    m_timerDue has already advanced past it.
  • Whether the return-value contract (true = stable/covered, false =
    re-evaluate) is clear and consistently used at each call site.

@jhugard jhugard requested a review from brianpepin May 21, 2026 03:31
Follow-up to #975 and #980: the CAS-then-Start pattern for retargeting
m_timerDue had a TOCTOU window where thread A could win the CAS but
before calling Start(), thread B could CAS+Start an earlier deadline,
which thread A's Start() would then overwrite — stranding the earlier
callback until independent traffic arrived.

The fix in #980 added post-Start verification in SubmitPendingCallbacks,
but the same unguarded pattern existed in QueueItem and
PromoteReadyPendingCallbacks.

This change extracts the CAS+Start+verify logic into three helpers
(ArmTimerIfEarlier, ArmNextPendingCallback, RearmObservedDueTime) so
the post-Start verification is applied uniformly at every call site.
~67 lines of duplicated inline CAS logic removed.
@jhugard jhugard force-pushed the user/jhugard/fix-timer-arm-toctou branch from 499ff7d to 8386b37 Compare May 21, 2026 06:57
Copy link
Copy Markdown
Contributor

@brianpepin brianpepin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much cleaner. Thanks for the effort!

@jhugard
Copy link
Copy Markdown
Collaborator Author

jhugard commented May 21, 2026

Much cleaner. Thanks for the effort!

I should have seen this race earlier, so thanks for pointing it out!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants