-
Notifications
You must be signed in to change notification settings - Fork 476
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ruler: Sync rules when ruler JOINING the ring instead of ACTIVE #4451
Conversation
923fd0b
to
8882167
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! I have a few minor comments.
I was surprised that there are already tests for a JOINING
ruler in TestSharding
. Thanks @pstibrany!
Also, I think there's a typo in the PR description
To solve this, instead of sync the rules after ruler becoming ACTIVE in the Ring, sync the rules when ruler JOINING the Ring, and when the other tenant would drop the rule groups once new ruler become ACTIVE.
Shouldn't it be the other rulers?
146d80f
to
cba476d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great to me but I'll let @dimitarvdimitrov 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! I left some nitpicks about comments and what args are passed, but other than that LGTM.
I'll leave this for a review from someone in @grafana/mimir-ruler-and-alertmanager-maintainers as well before we merge it.
Something which we talked about, but didn't write anywhere, is that after this change the likelihood of duplicate samples being written is higher. The reason is that during a restart, after the new ruler is ACTIVE in the ring, the old rulers will detect a ring change and will start syncing their rules from the bucket. That bucket sync is likely to be slow. With the current implementation on We can probably remediate that by first checking and stopping local rules after a ring change instead of directly syncing from the bucket, but for now I think it might not be a big problem. |
Co-authored-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
Co-authored-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
I just reached out to alert, they are ok with the ruler changes, since we didn't change alert manager, this should be fine. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, nicely done!
@56quarters do you want to take another look at the PR before we merge it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
What this PR does
The ruler becomes active in the ring after all its services are started. When the ruler becomes ready it automatically starts "owning" rule groups. So other rules in the ring stop evaluating these rule groups because they now shard to the new ruler.
After becoming ACTIVE, the ruler proceeds to sync the rule groups for all the tenants that it own; to do that it needs to iterate the bucket and fetch the rule groups for all tenants. For larger cells with many tenants the time to iterate the bucket can exceed a minute. And this causes ruler to miss one iteration as a consequence alerts are auto solved. When the second alert arrives another alert notification is sent.
To solve this, instead of sync the rules after ruler becoming ACTIVE in the Ring, sync the rules when ruler JOINING the Ring, and when the other tenant would drop the rule groups once new ruler become ACTIVE.
Which issue(s) this PR fixes or relates to
Fixes #4117
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]