Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ruler: Sync rules when ruler JOINING the ring instead of ACTIVE #4451

Merged
merged 6 commits into from
Apr 5, 2023

Conversation

ying-jeanne
Copy link
Contributor

@ying-jeanne ying-jeanne commented Mar 9, 2023

What this PR does

The ruler becomes active in the ring after all its services are started. When the ruler becomes ready it automatically starts "owning" rule groups. So other rules in the ring stop evaluating these rule groups because they now shard to the new ruler.

After becoming ACTIVE, the ruler proceeds to sync the rule groups for all the tenants that it own; to do that it needs to iterate the bucket and fetch the rule groups for all tenants. For larger cells with many tenants the time to iterate the bucket can exceed a minute. And this causes ruler to miss one iteration as a consequence alerts are auto solved. When the second alert arrives another alert notification is sent.

To solve this, instead of sync the rules after ruler becoming ACTIVE in the Ring, sync the rules when ruler JOINING the Ring, and when the other tenant would drop the rule groups once new ruler become ACTIVE.

Which issue(s) this PR fixes or relates to

Fixes #4117

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@ying-jeanne ying-jeanne changed the title Ruler: reduce time between becoming ACTIVE in ring and starting ruler… Ruler: sync rules on ruler joining the ring instead of running Mar 9, 2023
@ying-jeanne ying-jeanne changed the title Ruler: sync rules on ruler joining the ring instead of running Ruler: Sync rules on ruler joining the ring instead of running Mar 9, 2023
@ying-jeanne ying-jeanne changed the title Ruler: Sync rules on ruler joining the ring instead of running Ruler: Sync rules when ruler JOINING the ring instead of ACTIVE Mar 10, 2023
@ying-jeanne ying-jeanne force-pushed the rulerstart branch 5 times, most recently from 923fd0b to 8882167 Compare March 14, 2023 16:55
@ying-jeanne ying-jeanne marked this pull request as ready for review March 14, 2023 16:56
@ying-jeanne ying-jeanne requested review from a team as code owners March 14, 2023 16:56
Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! I have a few minor comments.

I was surprised that there are already tests for a JOINING ruler in TestSharding. Thanks @pstibrany!

Also, I think there's a typo in the PR description

To solve this, instead of sync the rules after ruler becoming ACTIVE in the Ring, sync the rules when ruler JOINING the Ring, and when the other tenant would drop the rule groups once new ruler become ACTIVE.

Shouldn't it be the other rulers?

pkg/ruler/ruler.go Outdated Show resolved Hide resolved
pkg/ruler/ruler.go Outdated Show resolved Hide resolved
pkg/ruler/ruler.go Outdated Show resolved Hide resolved
pkg/ruler/ruler_ring.go Outdated Show resolved Hide resolved
pkg/ruler/ruler.go Outdated Show resolved Hide resolved
pkg/ruler/manager.go Outdated Show resolved Hide resolved
Copy link
Contributor

@56quarters 56quarters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me but I'll let @dimitarvdimitrov 👍

Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! I left some nitpicks about comments and what args are passed, but other than that LGTM.

I'll leave this for a review from someone in @grafana/mimir-ruler-and-alertmanager-maintainers as well before we merge it.

CHANGELOG.md Outdated Show resolved Hide resolved
pkg/ruler/manager.go Show resolved Hide resolved
pkg/ruler/ruler.go Outdated Show resolved Hide resolved
pkg/ruler/ruler.go Outdated Show resolved Hide resolved
pkg/ruler/ruler.go Outdated Show resolved Hide resolved
pkg/ruler/ruler.go Show resolved Hide resolved
pkg/ruler/ruler_test.go Outdated Show resolved Hide resolved
pkg/ruler/ruler_test.go Outdated Show resolved Hide resolved
@dimitarvdimitrov
Copy link
Contributor

Something which we talked about, but didn't write anywhere, is that after this change the likelihood of duplicate samples being written is higher.

The reason is that during a restart, after the new ruler is ACTIVE in the ring, the old rulers will detect a ring change and will start syncing their rules from the bucket. That bucket sync is likely to be slow. With the current implementation on main this was somewhat avoided because both new and old rulers sync the bucket concurrently, whereas with this change the new ruler would have already synced the bucket.

We can probably remediate that by first checking and stopping local rules after a ring change instead of directly syncing from the bucket, but for now I think it might not be a big problem.

ying-jeanne and others added 6 commits April 5, 2023 15:53
@ying-jeanne
Copy link
Contributor Author

ying-jeanne commented Apr 5, 2023

I just reached out to alert, they are ok with the ruler changes, since we didn't change alert manager, this should be fine.

Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nicely done!

@dimitarvdimitrov
Copy link
Contributor

@56quarters do you want to take another look at the PR before we merge it?

Copy link
Contributor

@56quarters 56quarters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@56quarters 56quarters merged commit 82bc2a1 into main Apr 5, 2023
@56quarters 56quarters deleted the rulerstart branch April 5, 2023 15:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ruler: reduce time between becoming ACTIVE in ring and starting ruler managers
3 participants