Alerting: Scheduler use rule fingerprint instead of version #66531

Merged
merged 7 commits into from
Apr 28, 2023

yuri-tceretian
Contributor

@yuri-tceretian yuri-tceretian commented Apr 13, 2023

What is this feature?

  • Adds a Fingerprint method to ruleWithFolder that calculates a 64-bit FNV-1 hash. The method is optimized to allocate as little as possible (the current benchmark shows 4 allocations per operation at ~2000 ns/op on an 11th Gen Intel(R) Core(TM) i9-11900H @ 2.50GHz).
  • Updates the rule evaluation routine to compare fingerprints instead of versions and to reset the state every time the fingerprint differs.
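
To illustrate the idea, here is a minimal sketch of a fingerprint built with the standard library's 64-bit FNV-1 hash. The `rule` struct, its fields, and the `fingerprint` function are hypothetical stand-ins, not the actual ruleWithFolder implementation, and this version makes no attempt to minimize allocations. The point it shows is that the version is left out of the hash, so bumping it alone does not change the fingerprint.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
	"sort"
)

// rule is a hypothetical, trimmed-down stand-in for ruleWithFolder.
type rule struct {
	UID             string
	Title           string
	FolderTitle     string
	Query           string
	IntervalSeconds int64
	Labels          map[string]string
	Version         int64 // deliberately not part of the fingerprint
}

// fingerprint hashes the fields that describe the rule with 64-bit FNV-1.
func fingerprint(r rule) uint64 {
	h := fnv.New64() // New64 is FNV-1; New64a would be FNV-1a
	// 0xff never occurs in valid UTF-8, so it separates fields unambiguously.
	sep := []byte{0xff}
	writeString := func(s string) {
		_, _ = h.Write([]byte(s))
		_, _ = h.Write(sep)
	}
	writeInt := func(i int64) {
		var buf [8]byte
		binary.LittleEndian.PutUint64(buf[:], uint64(i))
		_, _ = h.Write(buf[:])
	}

	writeString(r.UID)
	writeString(r.Title)
	writeString(r.FolderTitle)
	writeString(r.Query)
	writeInt(r.IntervalSeconds)

	// Map iteration order is random, so hash labels in sorted key order.
	keys := make([]string, 0, len(r.Labels))
	for k := range r.Labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		writeString(k)
		writeString(r.Labels[k])
	}
	return h.Sum64()
}

func main() {
	a := rule{UID: "u1", Title: "cpu", FolderTitle: "infra", Version: 1}
	b := a
	b.Version = 2 // only the version changed
	c := a
	c.FolderTitle = "platform" // the folder was renamed

	fmt.Println(fingerprint(a) == fingerprint(b)) // true
	fmt.Println(fingerprint(a) == fingerprint(c)) // false
}
```

Including the folder title in the hash is what would let a folder rename be detected without touching the version of every rule in the folder.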

Why do we need this feature?
This allows the scheduler to not reset the rule's state every time another rule in the same group is changed. See #64256

The field AlertRule.Version has been used for two purposes: optimistic concurrency and as a key for state management. In the latter case, every time the rule version increases, the scheduler assumes that the rule has changed and that the results of previous evaluations (the state) will no longer match the results of the current evaluation by state key. It therefore resets the state of the rule, because otherwise the old states would become orphaned and would be kept until they are marked as stale and removed.
However, not all fields are used to calculate the state. For example, the rule version does not affect the state key at all, and therefore, it does not need to be used to decide if a state needs to be reset.

The changes in this PR make the version used only for optimistic concurrency in the database; nothing else relies on it.

If merged, it will allow us to remove the code that watches for folder changes and increments the rule version every time a folder is renamed.

func subscribeToFolderChanges(logger log.Logger, bus bus.Bus, dbStore api.RuleStore) {
	// if folder title is changed, we update all alert rules in that folder to make sure that all peers (in HA mode) will update folder title and
	// clean up the current state
	bus.AddEventListener(func(ctx context.Context, e *events.FolderTitleUpdated) error {
		// do not block the upstream execution
		go func(evt *events.FolderTitleUpdated) {
			logger.Info("Got folder title updated event. updating rules in the folder", "folderUID", evt.UID)
			_, err := dbStore.IncreaseVersionForAllRulesInNamespace(ctx, evt.OrgID, evt.UID)
			if err != nil {
				logger.Error("Failed to update alert rules in the folder after its title was changed", "error", err, "folderUID", evt.UID, "folder", evt.Title)
				return
			}
		}(e)
		return nil
	})
}

Which issue(s) does this PR fix?:
Fixes #64843 and #64256.

Special notes for your reviewer:

Please check that:

  • It works as expected from a user's perspective.
  • If this is a pre-GA feature, it is behind a feature toggle.
  • The docs are updated, and if this is a notable improvement, it's added to our What's New doc.

if v.Version > msg.Version {
msg = v
}
case <-a.updateCh:
Contributor Author

@yuri-tceretian yuri-tceretian Apr 13, 2023

This is no longer needed: since #64780, only the scheduler sends commands, so there cannot be concurrent requests with different versions. We can just drop whatever is in the channel and put a new message.
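
The "drop whatever is in the channel and put a new message" pattern can be sketched as follows. This is an illustration only: the `update` type and the `replace` helper are hypothetical names, and the sketch assumes a channel with capacity 1 and a single sender, which is the situation described above.

```go
package main

import "fmt"

// update is a hypothetical message type; the real payload is not shown here.
type update struct {
	Fingerprint uint64
}

// replace discards a pending message, if any, and enqueues msg.
// It assumes ch has capacity 1 and that there is exactly one sender,
// so the send below cannot block after the drain.
// It reports whether a stale message was dropped.
func replace(ch chan update, msg update) bool {
	dropped := false
	select {
	case <-ch: // a stale message was waiting: drop it
		dropped = true
	default: // nothing pending
	}
	ch <- msg
	return dropped
}

func main() {
	ch := make(chan update, 1)
	fmt.Println(replace(ch, update{Fingerprint: 1})) // false
	fmt.Println(replace(ch, update{Fingerprint: 2})) // true
	fmt.Println((<-ch).Fingerprint)                  // 2
}
```

With more than one sender the final send could block, which is why no version comparison is needed only once the scheduler is the sole producer.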

@yuri-tceretian yuri-tceretian force-pushed the yuri-tceretian/alertrule-fingerprint branch from cdcfb1e to e711103 Compare April 13, 2023 20:12
@yuri-tceretian yuri-tceretian added area/alerting Grafana Alerting add to changelog no-backport Skip backport of PR labels Apr 13, 2023
@yuri-tceretian yuri-tceretian added this to the 10.0.0 milestone Apr 13, 2023
@yuri-tceretian yuri-tceretian force-pushed the yuri-tceretian/alertrule-fingerprint branch from e711103 to da351da Compare April 13, 2023 20:46
// and there were two concurrent messages in updateCh and evalCh, and the eval's one got processed first.
// therefore, at the time when message from updateCh is processed the current rule will have
// at least the same version (or greater) and the state created for the new version of the rule.
if currentRuleVersion >= int64(ctx.Version) {
Contributor Author

@yuri-tceretian yuri-tceretian Apr 13, 2023

This also no longer makes sense because only the scheduler makes updates.

@yuri-tceretian yuri-tceretian added backport v9.5.x Bot will automatically open backport PR and removed no-backport Skip backport of PR labels Apr 25, 2023
@grafanabot
Contributor

Hello @yuri-tceretian!
Backport pull requests need to be either:

  • Pull requests which address bugs,
  • Urgent fixes which need product approval, in order to get merged,
  • Docs changes.

Please, if the current pull request addresses a bug fix, label it with the type/bug label.
If it already has the product approval, please add the product-approved label. For docs changes, please add the type/docs label.
If the pull request modifies CI behaviour, please add the type/ci label.
If none of the above applies, please consider removing the backport label and target the next major/minor release.
Thanks!

}
writeString := func(s string) {
// avoid allocation when converting string to byte slice
writeBytes(*(*[]byte)(unsafe.Pointer(&s))) //nolint:gosec
Contributor

Contributor Author

@yuri-tceretian yuri-tceretian Apr 26, 2023

I honestly have no idea why those boundaries are checked there (the limit is 65535 less than the int32 maximum), and the author did not provide an explanation. Although I do not think it is physically possible to submit such a large string via the database, I ran a quick test and confirmed that the logic works even with strings whose length is the int32 maximum.
The thread is very interesting, but I think the solution explained there addresses the general case, where (1) a string could be GCed while it is used as a byte slice, and (2) the capacity of the slice must be specified correctly. This is not our case: we hold a reference to the alert rule, the lifetime of the byte slice is very short (just a single iteration), and we do not use the capacity of the resulting slice.

Contributor Author

@yuri-tceretian yuri-tceretian Apr 26, 2023

I left a TODO to convert it to unsafe.StringData once we switch to Go 1.20.

pkg/services/ngalert/schedule/schedule.go (outdated)
// We need to reset state if the loop has started and the alert is already paused. It can happen,
// if we have an alert with state and we do file provision with stateful Grafana, that state
// lingers in DB and won't be cleaned up until next alert rule update.
needReset = needReset || (currentFingerprint == 0 && isPaused)
Member

I would like to confirm the logic change here was intentional. It seems to better match the comment description now, so maybe this was a bugfix?

Previously a state would reset in these two cases:

  1. currentRuleVersion != newVersion && currentRuleVersion > 0
  2. currentRuleVersion != newVersion && isPaused

Now it resets in the following two:

  1. currentFingerprint != f && currentFingerprint != 0
  2. currentFingerprint == 0 && isPaused

The first conditions are effectively the same in both cases. The second ones have changed, though.

Contributor Author

I did that intentionally, and I think it is clearer. Basically, the logic is:

  • do not reset when the rule routine has just started (its fingerprint, previously its version, is 0); otherwise reset only if the fingerprint (previously the version) is different.
  • however, if the routine has just started and the first evaluation has isPaused=true, reset the state, because provisioning does not reset the state of a rule when it is provisioned.
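
The two conditions discussed in this thread can be written as a small pure function. `needsReset` is a hypothetical name used for illustration; in the PR the check is inline in the rule evaluation routine.

```go
package main

import "fmt"

// needsReset mirrors the two conditions discussed in this thread:
//  1. the routine already has a fingerprint and the new one differs, or
//  2. the routine has just started (fingerprint 0) and the rule is paused.
func needsReset(current, next uint64, isPaused bool) bool {
	justStarted := current == 0
	changed := !justStarted && current != next
	return changed || (justStarted && isPaused)
}

func main() {
	fmt.Println(needsReset(0, 42, false))  // false: just started, not paused
	fmt.Println(needsReset(0, 42, true))   // true: just started and paused
	fmt.Println(needsReset(42, 42, false)) // false: nothing changed
	fmt.Println(needsReset(42, 43, false)) // true: fingerprint changed
}
```

Note that a paused rule whose fingerprint is already known and unchanged does not trigger a reset under these conditions.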

@armandgrillet armandgrillet added product-approved Pull requests that are approved by product/managers and are allowed to be backported and removed missing-labels kata:alerting-ease-of-use labels Apr 26, 2023
Member

@JacobsonMT JacobsonMT left a comment

LGTM 🚀

@yuri-tceretian yuri-tceretian merged commit 9eb10be into main Apr 28, 2023
9 checks passed
@yuri-tceretian yuri-tceretian deleted the yuri-tceretian/alertrule-fingerprint branch April 28, 2023 14:42
grafanabot pushed a commit that referenced this pull request Apr 28, 2023
* implement calculation of fingerprint for ruleWithFolder
* update scheduler to use fingerprint instead of rule's version

(cherry picked from commit 9eb10be)
ryantxu pushed a commit that referenced this pull request May 3, 2023
* implement calculation of fingerprint for ruleWithFolder
* update scheduler to use fingerprint instead of rule's version
@zerok zerok modified the milestones: 10.0.0, 10.0.0-preview May 31, 2023
@benoittgt
Contributor

Hello

We update alerts via a script every 30 min, but they rarely change. Today there were no changes, yet we are still seeing an issue like this. We are on 10.2.2.

Screenshot_2024-02-19_at_16_48_21

Any idea how we can understand why it's marked as "Normal (updated)"?

@yuri-tceretian
Contributor Author

You can find out why the rule was updated by checking the logs (only if debug logs are enabled). The message "Updating rule", with the context parameter rule_uid matching the rule's UID, also contains a diff.
If debug logs are not enabled, you can look for the message "Clearing the state of the rule because it was updated", which also has the context parameter rule_uid as well as fingerprint.
You can also check the database table alert_rule_version, which contains the history of rule changes.

@benoittgt
Contributor

benoittgt commented Feb 21, 2024

Thanks @yuri-tceretian. alert_rule_version helped us. We saw the state "normal (updated)" because we changed the relativeTimeRange. But still, the alert should not be marked as normal (updated): for us it means an invalid state for a few seconds, plus the firing "since" time is reset, which should not be the case. Why not pending (updated)? :)

Screenshot 2024-02-21 at 09 45 43

Edit: New case

We got the same issue because we simply removed one annotation. Here is the diff of the annotations in alert_rule_version:

Screenshot_2024-02-22_at_16_01_06

@yuri-tceretian
Contributor Author

Currently, a change to any field of a rule causes the state to be reset. The reason is that a change could result in different alerts. The notification pipeline uses fingerprints calculated from the set of labels to match alerts. That's how it creates new alerts, maintains active ones, and resolves them: we just create a new alert that has the same set of labels but a different status.
A rule change could result in alerts with a different set of labels and therefore different fingerprints. It could therefore cause duplicates in the notification pipeline, i.e. active alerts produced by the previous version get orphaned and are no longer maintained, while a new set produced by the new version of the rule gets created. In this case, the user could receive multiple notifications. That's why we decided to just reset the state once we detect that the rule was updated.
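
A rough sketch of why a label change matters while an annotation change does not. `labelFingerprint` is a hypothetical function written for this illustration; it is not the hashing used by the notification pipeline, only an example of a fingerprint that depends on the label set alone.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// alert is a hypothetical, simplified alert: labels identify it,
// annotations only carry extra information.
type alert struct {
	Labels      map[string]string
	Annotations map[string]string
}

// labelFingerprint hashes the label set in sorted name order.
// Annotations are intentionally ignored.
func labelFingerprint(a alert) uint64 {
	names := make([]string, 0, len(a.Labels))
	for n := range a.Labels {
		names = append(names, n)
	}
	sort.Strings(names)

	h := fnv.New64a()
	sep := []byte{0xff} // never part of valid UTF-8 text
	for _, n := range names {
		_, _ = h.Write([]byte(n))
		_, _ = h.Write(sep)
		_, _ = h.Write([]byte(a.Labels[n]))
		_, _ = h.Write(sep)
	}
	return h.Sum64()
}

func main() {
	before := alert{
		Labels:      map[string]string{"alertname": "HighCPU", "team": "infra"},
		Annotations: map[string]string{"summary": "CPU is high"},
	}

	// Removing an annotation leaves the identity of the alert untouched.
	noAnnotation := alert{Labels: before.Labels}
	fmt.Println(labelFingerprint(before) == labelFingerprint(noAnnotation)) // true

	// Adding a label yields a different alert as far as matching goes,
	// so the old one would be orphaned if the state were kept.
	relabeled := alert{Labels: map[string]string{
		"alertname": "HighCPU", "team": "infra", "severity": "critical",
	}}
	fmt.Println(labelFingerprint(before) == labelFingerprint(relabeled)) // false
}
```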

Before this PR, we reset the state every time the rule's version changed. This PR improved that and lets us reset the state only when the fingerprint changes. This opened up the possibility of resetting the state more precisely by ignoring certain fields when calculating the fingerprint.

Changes in annotations indeed do not affect the alert fingerprint, so they can be removed. I do not think the time range affects it either. So I think it makes sense to remove both from the rule fingerprint. We will discuss it. Thanks!

@yuri-tceretian
Contributor Author

I opened an issue to improve this: #83250. Please upvote it if you think it makes sense.

Labels
add to changelog area/alerting Grafana Alerting area/backend backport v9.5.x Bot will automatically open backport PR product-approved Pull requests that are approved by product/managers and are allowed to be backported
Development

Successfully merging this pull request may close these issues.

Alerting: replace rule version with rule's hash in Scheduler\State manager
7 participants