Skip to content

fix(timewheel): use absolute target sequence in addCmd#6

Merged
gocronx merged 1 commit into
masterfrom
fix/timewheel-orphaned-on-cursor-race
May 8, 2026
Merged

fix(timewheel): use absolute target sequence in addCmd#6
gocronx merged 1 commit into
masterfrom
fix/timewheel-orphaned-on-cursor-race

Conversation

@gocronx
Copy link
Copy Markdown
Owner

@gocronx gocronx commented May 8, 2026

Closes #5.

Root cause

add() used to precompute (pos, rounds) from cursor.Load() and write the result to addCh. If the producer goroutine was descheduled between the cursor read and the channel send — common under -race, where the detector measurably slows producer goroutines — the handler could drain the cmd long after cursor had advanced past that slot. The entry would then be appended to a slot the cursor had already passed, stranding it for a full wheel cycle (65.5s with the default 64K-slot / 1ms-tick configuration). The stress test's 10s deadline elapsed first, surfacing as fired N-1 of N.

Fix

Carry the absolute targetSeq in addCmd instead of pre-resolving the slot. The handler computes (pos, rounds) against the up-to-date seq at drain time, and fires immediately when targetSeq <= seq (deadline already passed). The pre- and post-slot drain loops both delegate to a new placeAdd helper, which also folds in the cancelled-set check that previously lived in two places.

Test plan

  • Two unit regression tests on placeAdd covering past-deadline-fire and past-deadline-but-cancelled-drop
  • Full suite passes under -race (go test -race ./...)
  • TestStress_TimeWheel_Concurrent_4Writers passes 8/8 under -race locally; previously ~2/3 failure rate on master

The producer used to precompute (pos, rounds) from cursor.Load() and
write the result to addCh. If the producer was descheduled between the
cursor read and the channel send — common under -race — the handler
could drain the cmd long after cursor had advanced past that slot,
leaving the entry stranded for a full wheel cycle (65.5s with the
default 64K capacity / 1ms tick), which exceeded the stress-test
deadline and surfaced as "fired N-1 of N".

Carry the absolute targetSeq instead. The handler resolves the slot
at drain time against the up-to-date seq via a new placeAdd helper,
and fires immediately when targetSeq <= seq (i.e. the deadline has
already passed). placeAdd is shared by the pre- and post-slot drain
loops, removing the duplicated slot-append logic in Handle.

Stress tests now pass 8/8 under -race (previously ~2/3 failure rate
on master). Two unit regression tests cover the past-deadline and
past-deadline-but-cancelled paths directly.
@gocronx gocronx merged commit cc15e4c into master May 8, 2026
3 checks passed
@gocronx gocronx deleted the fix/timewheel-orphaned-on-cursor-race branch May 8, 2026 15:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TestStress_TimeWheel_Concurrent_4Writers flaky under -race (last 1-2 of 1M tasks not fired)

1 participant