-
Notifications
You must be signed in to change notification settings - Fork 11.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alerting: Scheduler use rule fingerprint instead of version #66531
Conversation
if v.Version > msg.Version { | ||
msg = v | ||
} | ||
case <-a.updateCh: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not needed since after #64780 only scheduler sends the commands and therefore there cannot be any concurrent requests with different versions. So we can just drop whatever is in the channel and put a new message
cdcfb1e
to
e711103
Compare
e711103
to
da351da
Compare
// and there were two concurrent messages in updateCh and evalCh, and the eval's one got processed first. | ||
// therefore, at the time when message from updateCh is processed the current rule will have | ||
// at least the same version (or greater) and the state created for the new version of the rule. | ||
if currentRuleVersion >= int64(ctx.Version) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this also does not make sense anymore because only the scheduler makes updates
Hello @yuri-tceretian!
Please, if the current pull request addresses a bug fix, label it with the |
} | ||
writeString := func(s string) { | ||
// avoid allocation when converting string to byte slice | ||
writeBytes(*(*[]byte)(unsafe.Pointer(&s))) //nolint:gosec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this needs bounds checks? I found this https://groups.google.com/g/golang-nuts/c/Zsfk-VMd_fU/m/O1ru4fO-BgAJ?pli=1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I honestly have no idea why those boundaries are checked there (its 65535 less than Int32.Max), and the author did not provide explanations. Although I do not think it is physically possible to submit such a large string via database, I ran a quick test and confirmed that the logic works even with strings whose length is int32.max.
The thread is very interesting, but I think the solution explained there was for a general case when: 1. a string could be GCed while it is used as byte slice, and 2. to correctly specify capacity of the slice. This is not our case because: we hold a reference to the alert rule, and the lifetime of the byte slice is very short (just a single iteration), and we do not use capacity of the string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left TODO to convert it to unsafe.StringData once we switch to Go 1.20
// We need to reset state if the loop has started and the alert is already paused. It can happen, | ||
// if we have an alert with state and we do file provision with stateful Grafana, that state | ||
// lingers in DB and won't be cleaned up until next alert rule update. | ||
needReset = needReset || (currentFingerprint == 0 && isPaused) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to confirm the logic change here was intentional. It seems to better match the comment description now, so maybe this was a bugfix?
Previously a state would reset in these two cases:
currentRuleVersion != newVersion && currentRuleVersion > 0
currentRuleVersion != newVersion && isPaused
Now, it reset in the following two:
currentFingerprint != f && currentFingerprint != 0
currentFingerprint == 0 && isPaused
Number 1s are effectively the same in both cases. The 2s have changed though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did that intentionally, which I think is clearer: basically the logic is that:
- do not reset when the rule routine has just started (it's fingerprint, used to be version is 0), reset only if version (and now fingerprint are different).
- however, if it's just started and the first eval isPaused=true then reset the state because of provisioning that does not reset the state of rule when it is provisioned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🚀
* implement calculation of fingerprint for ruleWithFolder * update scheduler to use fingerprint instead of rule's version (cherry picked from commit 9eb10be)
* implement calculation of fingerprint for ruleWithFolder * update scheduler to use fingerprint instead of rule's version
You can find out why the rule was updated by checking logs (only if debug logs are enabled). Message |
Thanks @yuri-tceretian. Edit: New case We got the same issue because we simply remove one annotation. Here is the diff of annotations, in |
Currently, any change of any field of a rule causes the reset of the state. The reason behind that is that a change could result in different alerts. The notification pipeline uses fingerprints calculated from the set of labels to match the alert. That's how it creates new, maintains active, and resolves alerts - we just create a new alert that has the same set of labels but with different status. Before this PR, we reset the state every time the rule's version changed. This PR improved that and let us reset the state only when the fingerprint changes. This opened possibilities to approach the resetting state more precisely and ignore certain fields while calculating the fingerprint. Changes in annotations do not affect the alert fingerprint, indeed. So, it can be deleted. I do not think that the time range affects it too. So, I think it makes sense to remove them from the fingerprint. We will discuss it. Thanks! |
I opened an issue to improve it #83250 Please upvote it if you think it makes sense to do. |
What is this feature?
ruleWithFolder
that calculates a 64-bit FNV-1 hash. The method is optimized to have as few allocations as possible (the current benchmark gives 4 allocations per operation, ~2000 ns/op on 11th Gen Intel(R) Core(TM) i9-11900H @ 2.50GHz).Why do we need this feature?
This allows the scheduler to not reset the rule's state every time another rule in the same group is changed. See #64256
The field
AlertRule.Version
has been used for two purposes: optimistic concurrency and as a key for state management. In the latter case, every time the rule version is increased the scheduler assumes that the rule has changed and the results of previous evaluations (state) will not match with the results of the current evaluation by state key and resets the state of the rule because otherwise, that would produce orphaned states that would be maintained until they get marked as stale and removed.However, not all fields are used to calculate the state. For example, the rule version does not affect the state key at all, and therefore, it does not need to be used to decide if a state needs to be reset.
Changes in this PR will make the version to be used for only optimistic concurrency in the database and not relied upon anywhere else.
If merged, it will allow us to remove the code that watches for folder changes and increments the rule version every time folder is renamed.
grafana/pkg/services/ngalert/ngalert.go
Lines 301 to 315 in da48327
Which issue(s) does this PR fix?:
Fixes #64843 and #64256.
Special notes for your reviewer:
Please check that: