dispatcher: Fix issue with dispatching to a contended route#5179

Merged
ultrotter merged 6 commits into prometheus:main from ultrotter:dispatcherloadedfalse
Apr 20, 2026

Conversation

@ultrotter
Contributor

@ultrotter ultrotter commented Apr 18, 2026

When multiple goroutines dispatch to the same route, a goroutine may fail to swap the value in the map because another goroutine beat it to it. In that case it should reload the value from the map instead of trying to swap again, so we set loaded = false.

We also add metrics showing how many retries are happening, and whether the "giving up" situation is ever reached.

The test is a little convoluted because it has to force that path and confirm that we never reach the "give up" state. If it becomes flaky we can remove it, or simplify it and drop the call to require.Positive(t, testutil.ToFloat64(metrics.aggrGroupCreationRetries), "contended CAS path was not exercised in %d rounds — scheduler is unusually serial", rounds). As is, it demonstrates both the issue and the fix.
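
For illustration only, here is a minimal sketch of the reload-after-failed-swap behaviour described above. It is not the code in dispatch/dispatch.go: the names group, getOrCreate, runRound and maxRetries are hypothetical, the retry limit is an assumed value, and two plain atomic counters stand in for the Prometheus metrics. It uses only the standard library's sync.Map.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// group is a hypothetical stand-in for an aggregation group.
type group struct {
	id        int
	destroyed atomic.Bool
}

// maxRetries is an assumed limit, not the value used by the dispatcher.
const maxRetries = 100

// Plain atomic counters stand in for the two Prometheus counters.
var retries, givenUp atomic.Int64

// getOrCreate returns the live group stored under key, installing fresh
// if the key is empty or currently holds a destroyed group.
func getOrCreate(m *sync.Map, key string, fresh *group) (*group, bool) {
	for i := 0; i < maxRetries; i++ {
		el, loaded := m.LoadOrStore(key, fresh)
		if !loaded {
			return fresh, true // key was empty; fresh is now stored
		}
		cur := el.(*group)
		if !cur.destroyed.Load() {
			return cur, true // a live group already exists; reuse it
		}
		if m.CompareAndSwap(key, cur, fresh) {
			return fresh, true // we replaced the destroyed group
		}
		// Another goroutine won the swap. Count the retry and go back to
		// LoadOrStore to pick up whatever is in the map now.
		retries.Add(1)
	}
	givenUp.Add(1)
	return nil, false
}

// runRound pre-seeds a destroyed group and lets n goroutines race to
// replace it. It reports whether every goroutine got the same live group.
func runRound(n int) bool {
	var m sync.Map
	dead := &group{id: -1}
	dead.destroyed.Store(true)
	m.Store("route", dead)

	results := make([]*group, n)
	start := make(chan struct{})
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			<-start // release all goroutines together to encourage contention
			results[i], _ = getOrCreate(&m, "route", &group{id: i})
		}(i)
	}
	close(start)
	wg.Wait()

	for _, g := range results {
		if g == nil || g != results[0] || g.destroyed.Load() {
			return false
		}
	}
	return true
}

func main() {
	ok := true
	for round := 0; round < 50; round++ {
		ok = ok && runRound(8)
	}
	fmt.Println("all rounds consistent:", ok)
	fmt.Println("given up:", givenUp.Load())
	fmt.Println("retries observed:", retries.Load() > 0) // depends on scheduling
}
```

In this sketch a goroutine can lose the swap at most once per round, because the map then holds a live group that the next LoadOrStore returns. Whether any retries are counted at all depends on how the scheduler interleaves the goroutines.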

Which user-facing changes does this PR introduce?

[BUGFIX] Fix the dispatcher in more contended cases

Summary by CodeRabbit

  • New Features

    • Added Prometheus metrics to track alert-group creation retries and when creation attempts are abandoned for improved observability.
  • Bug Fixes

    • Improved concurrent alert-group creation handling so creation races recover reliably and avoid leaving groups in a destroyed state.
  • Tests

    • Added a concurrency-focused regression test validating recovery from CAS-style failures and that retry/give-up metrics are recorded.

@ultrotter ultrotter requested a review from a team as a code owner April 18, 2026 12:12
@coderabbitai

coderabbitai Bot commented Apr 18, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 9abba2da-2a67-4c8a-8407-f0e3db9142b3

📥 Commits

Reviewing files that changed from the base of the PR and between f96ad9b and 7883300.

📒 Files selected for processing (1)
  • dispatch/dispatch_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • dispatch/dispatch_test.go

📝 Walkthrough

Added two Prometheus counters to DispatcherMetrics for aggregation-group creation contention, adjusted group creation retry control flow to continue via the LoadOrStore path after a successful CAS in races, and incremented the new metrics during retry/give-up conditions. Added a concurrency regression test exercising the CAS-recovery path.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Dispatcher metrics & retry logic (dispatch/dispatch.go) | Added aggrGroupCreationRetries and aggrGroupCreationGivenUp to DispatcherMetrics. Modified groupAlert loop: after a successful CompareAndSwap in contention, continue with loaded=false so subsequent iterations use LoadOrStore; increment retries on each CAS failure and increment given-up when retry limit is exceeded. |
| Concurrency regression test (dispatch/dispatch_test.go) | Added TestGroupAlert_RecoversWhenCASFails, which pre-seeds a destroyed aggregation group and concurrently calls groupAlert to verify a new non-destroyed replacement group contains the round's alerts, that no give-up occurred, and that retries were exercised when applicable. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'dispatcher: Fix issue with dispatching to a contended route' directly describes the main change: fixing a dispatcher bug related to contended route dispatching. |
| Description check | ✅ Passed | The description covers the bug fix, explains the root cause, documents added metrics, and includes a comprehensive test. The PR checklist is properly filled with 'Fixes #5176', bugfix confirmation, test addition, signoff confirmation, and release notes provided. |
| Linked Issues check | ✅ Passed | The PR addresses the linked issue #5176 by fixing the CAS contention bug that causes excessive retries during aggregation group creation. Code changes include the retry loop fix and metrics to track retry failures and give-up situations. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to fixing the dispatcher's CAS contention issue: metrics additions, retry loop control flow fix, and a focused regression test exercising the contended path. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Guido Trotter added 3 commits April 18, 2026 08:15
Signed-off-by: Guido Trotter <guido@hudson-trading.com>
Signed-off-by: Guido Trotter <guido@hudson-trading.com>
Signed-off-by: Guido Trotter <guido@hudson-trading.com>
@ultrotter ultrotter force-pushed the dispatcherloadedfalse branch from 3052269 to 29b8863 Compare April 18, 2026 12:15

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@dispatch/dispatch_test.go`:
- Around line 842-846: The comment is wrong because NewDispatcher leaves d.state
at DispatcherStateUnknown; instead of leaving it, set the dispatcher state
explicitly to WaitingToStart after creating it (e.g., call
dispatcher.state.Store(WaitingToStart)) so groupAlert falls into the explicit
no-op branch and avoids the warn logs, and update/remove the inaccurate comment
referencing WaitingToStart vs DispatcherStateUnknown.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: ed5cb43e-c682-4ce6-9358-775805b0d823

📥 Commits

Reviewing files that changed from the base of the PR and between dcbdc8c and 3052269.

📒 Files selected for processing (4)
  • config/notifiers.go
  • dispatch/dispatch.go
  • dispatch/dispatch_test.go
  • notify/sns/sns.go

Comment thread dispatch/dispatch_test.go Outdated

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
dispatch/dispatch_test.go (1)

843-845: ⚠️ Potential issue | 🟡 Minor

Set the dispatcher state explicitly or fix this comment.

NewDispatcher leaves the state as DispatcherStateUnknown; since this test skips Run, groupAlert hits the default switch branch and logs "unknown state detected" for created groups. Consider dispatcher.state.Store(DispatcherStateWaitingToStart) before calling groupAlert, or update the comment to describe the actual state.
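
A minimal sketch of why the zero-value state ends up in the default branch, assuming an atomic state field and a switch with an explicit case per known state. The type dispatcher, the constants and onGroupCreated are hypothetical stand-ins, not the Alertmanager definitions.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Hypothetical state values; the zero value is deliberately "unknown".
const (
	stateUnknown int32 = iota
	stateWaitingToStart
	stateRunning
)

// dispatcher is a stand-in that only carries the state field.
type dispatcher struct {
	state atomic.Int32
}

// onGroupCreated reports what a newly created group would do in each state.
func (d *dispatcher) onGroupCreated() string {
	switch d.state.Load() {
	case stateWaitingToStart:
		return "waiting: group is not run yet"
	case stateRunning:
		return "running: group is started"
	default:
		return "unknown state detected"
	}
}

func main() {
	d := &dispatcher{} // a constructor that never sets the state
	fmt.Println(d.onGroupCreated())

	d.state.Store(stateWaitingToStart) // the explicit store suggested above
	fmt.Println(d.onGroupCreated())
}
```

The first call falls through to the default branch because nothing ever stored a state; after the explicit Store, the same call takes the waiting-to-start branch.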

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dispatch/dispatch_test.go` around lines 843 - 845, The test assumes the
dispatcher is in WaitingToStart but NewDispatcher leaves dispatcher.state as
DispatcherStateUnknown, causing groupAlert to hit the default branch; either set
the dispatcher state explicitly by calling
dispatcher.state.Store(DispatcherStateWaitingToStart) before invoking groupAlert
(or before assigning routeGroupsSlice) so the created groups see the correct
state, or change the comment to accurately state that the dispatcher remains in
DispatcherStateUnknown and will log "unknown state detected" when groupAlert
runs; reference dispatcher.state.Store, DispatcherStateWaitingToStart,
NewDispatcher, and groupAlert when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@dispatch/dispatch_test.go`:
- Around line 847-890: Change the final assertion so the test is skipped (not
failed) when no contention was observed: after the loop that increments rounds
and before asserting retries, check
testutil.ToFloat64(metrics.aggrGroupCreationRetries) and if it is zero call
t.Skipf with a message mentioning rounds and that the contended CAS path wasn't
exercised (so CI/single-thread runs don't fail); otherwise keep the existing
require.Positive assertion to ensure the contended path is tested when retries >
0. Target the symbols metrics.aggrGroupCreationRetries, rounds, and t (the
*testing.T in dispatch_test.go) where the current require.Positive call is made.
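
A minimal sketch of the skip-instead-of-fail idea, written against a small interface so it can run outside `go test`. The names reporter, recorder and skipIfUncontended are hypothetical; *testing.T satisfies reporter because it has a Skipf method with this signature.

```go
package main

import "fmt"

// reporter is the subset of *testing.T used below.
type reporter interface {
	Skipf(format string, args ...any)
}

// recorder is a fake reporter that remembers the skip message.
// Unlike the real Skipf, it does not stop the calling goroutine.
type recorder struct {
	skipped string
}

func (r *recorder) Skipf(format string, args ...any) {
	r.skipped = fmt.Sprintf(format, args...)
}

// skipIfUncontended skips when no retries were observed, so a run in
// which the goroutines never collided is not reported as a failure.
// It returns true if the skip was requested.
func skipIfUncontended(t reporter, retries float64, rounds int) bool {
	if retries == 0 {
		t.Skipf("contended CAS path was not exercised in %d rounds", rounds)
		return true
	}
	return false
}

func main() {
	quiet := &recorder{}
	fmt.Println(skipIfUncontended(quiet, 0, 200), quiet.skipped)

	busy := &recorder{}
	fmt.Println(skipIfUncontended(busy, 3, 200), busy.skipped == "")
}
```

In a real test the retries value would come from testutil.ToFloat64(metrics.aggrGroupCreationRetries), and the existing require.Positive assertion would stay in place for the case where retries were observed.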

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 791bde71-072c-4bc2-a973-977e5a53c264

📥 Commits

Reviewing files that changed from the base of the PR and between 3052269 and 29b8863.

📒 Files selected for processing (2)
  • dispatch/dispatch.go
  • dispatch/dispatch_test.go

Comment thread dispatch/dispatch_test.go Outdated
ultrotter and others added 3 commits April 19, 2026 07:06
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Guido Trotter <ultrotter@gmail.com>
Signed-off-by: Guido Trotter <guido@hudson-trading.com>
Signed-off-by: Guido Trotter <guido@hudson-trading.com>
@ultrotter ultrotter merged commit ef66ce9 into prometheus:main Apr 20, 2026
7 checks passed
SoloJacobs pushed a commit that referenced this pull request Apr 28, 2026
* Fix bug in dispatching to contended route
* Add metrics for dispatching to contended routes
* Add test about dispatching to contended routes
* Skip requiring retries > 0 if a single concurrent goroutine is run

Signed-off-by: Guido Trotter <ultrotter@gmail.com>
Co-authored-by: Guido Trotter <guido@hudson-trading.com>
Signed-off-by: Solomon Jacobs <solomonjacobs@protonmail.com>

Development

Successfully merging this pull request may close these issues.

"excessive retries creating aggregation group" error after upgrading to v0.32.0

3 participants