Alerting: Scheduler use rule fingerprint instead of version #66531

yuri-tceretian · 2023-04-13T19:46:59Z

What is this feature?

Adds a Fingerprint method to ruleWithFolder that calculates a 64-bit FNV-1 hash. The method is optimized to allocate as little as possible (the current benchmark gives 4 allocations per operation, ~2000 ns/op on an 11th Gen Intel(R) Core(TM) i9-11900H @ 2.50GHz).
Updates the rule evaluation routine to compare fingerprints instead of versions and to reset the state every time the fingerprint differs.
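To illustrate the idea, here is a minimal sketch of a field-based fingerprint built on the standard library's 64-bit FNV-1 hash. The `rule` struct and its fields are hypothetical stand-ins for ruleWithFolder, not the actual type. Unlike the method in this PR, the sketch makes no attempt to avoid allocations.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// rule is a hypothetical, trimmed-down stand-in for ruleWithFolder.
type rule struct {
	FolderTitle string
	Title       string
	Query       string
	IntervalSec int64
	Version     int64 // deliberately NOT part of the fingerprint
}

// fingerprint hashes only the fields that are assumed to influence the
// evaluation results. Version is skipped on purpose.
func fingerprint(r rule) uint64 {
	h := fnv.New64() // 64-bit FNV-1
	var buf [8]byte
	writeString := func(s string) {
		_, _ = h.Write([]byte(s))
		// 0xff never occurs in valid UTF-8, so it separates fields safely:
		// ("ab", "c") and ("a", "bc") produce different byte streams.
		_, _ = h.Write([]byte{0xff})
	}
	writeInt := func(v int64) {
		binary.LittleEndian.PutUint64(buf[:], uint64(v))
		_, _ = h.Write(buf[:])
	}
	writeString(r.FolderTitle)
	writeString(r.Title)
	writeString(r.Query)
	writeInt(r.IntervalSec)
	return h.Sum64()
}

func main() {
	a := rule{FolderTitle: "infra", Title: "cpu", Query: "avg(cpu) > 90", IntervalSec: 60, Version: 1}
	b := a
	b.Version = 2 // e.g. another rule in the same group was saved
	c := a
	c.Query = "avg(cpu) > 80"

	fmt.Println("version bump keeps fingerprint:", fingerprint(a) == fingerprint(b))
	fmt.Println("query change alters fingerprint:", fingerprint(a) != fingerprint(c))
}
```

Because Version is left out of the hash, a version bump alone does not look like a rule change to the scheduler.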

Why do we need this feature?
This allows the scheduler to not reset the rule's state every time another rule in the same group is changed. See #64256

The field AlertRule.Version has been used for two purposes: optimistic concurrency and as a key for state management. In the latter case, every time the rule version increases, the scheduler assumes that the rule has changed and that the results of previous evaluations (the state) will no longer match the results of the current evaluation by state key. It therefore resets the state of the rule, because otherwise it would produce orphaned states that would linger until they are marked as stale and removed.
However, not all fields are used to calculate the state. For example, the rule version does not affect the state key at all, so it does not need to be used to decide whether the state must be reset.

With the changes in this PR, the version is used only for optimistic concurrency in the database and is not relied upon anywhere else.

If merged, it will allow us to remove the code that watches for folder changes and increments the rule version every time a folder is renamed.

grafana/pkg/services/ngalert/ngalert.go

Lines 301 to 315 in da48327

    
    func subscribeToFolderChanges(logger log.Logger, bus bus.Bus, dbStore api.RuleStore) {
    	// if folder title is changed, we update all alert rules in that folder to make sure that all peers (in HA mode) will update folder title and
    	// clean up the current state
    	bus.AddEventListener(func(ctx context.Context, e *events.FolderTitleUpdated) error {
    		// do not block the upstream execution
    		go func(evt *events.FolderTitleUpdated) {
    			logger.Info("Got folder title updated event. updating rules in the folder", "folderUID", evt.UID)
    			_, err := dbStore.IncreaseVersionForAllRulesInNamespace(ctx, evt.OrgID, evt.UID)
    			if err != nil {
    				logger.Error("Failed to update alert rules in the folder after its title was changed", "error", err, "folderUID", evt.UID, "folder", evt.Title)
    				return
    			}
    		}(e)
    		return nil
    	})

Which issue(s) does this PR fix?:
Fixes #64843 and #64256.

Special notes for your reviewer:

Please check that:

It works as expected from a user's perspective.
If this is a pre-GA feature, it is behind a feature toggle.
The docs are updated, and if this is a notable improvement, it's added to our What's New doc.

yuri-tceretian · 2023-04-13T19:48:13Z

pkg/services/ngalert/schedule/registry.go

-		if v.Version > msg.Version {
-			msg = v
-		}
+	case <-a.updateCh:


This is not needed because, after #64780, only the scheduler sends the commands, so there cannot be any concurrent requests with different versions. We can just drop whatever is in the channel and put in a new message.
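The "drop whatever is in the channel and put a new message" pattern can be sketched as follows. It assumes a channel with capacity 1 and exactly one sender, as described above. The `updateMsg` type and `sendLatest` helper are hypothetical names for illustration, not the ones used in registry.go.

```go
package main

import "fmt"

// updateMsg is a hypothetical payload carried by the update channel.
type updateMsg struct {
	Fingerprint uint64
}

// sendLatest replaces any pending message in ch with msg.
// It assumes ch has capacity 1 and that there is a single sender. Under
// those assumptions the final send cannot block: after the non-blocking
// drain the buffer is empty, and only receivers can touch it concurrently.
func sendLatest(ch chan updateMsg, msg updateMsg) {
	select {
	case <-ch:
		// a stale message was pending, discard it
	default:
		// nothing pending
	}
	ch <- msg
}

func main() {
	ch := make(chan updateMsg, 1)
	sendLatest(ch, updateMsg{Fingerprint: 1})
	sendLatest(ch, updateMsg{Fingerprint: 2}) // replaces the first message
	fmt.Println("pending messages:", len(ch))
	fmt.Println("delivered fingerprint:", (<-ch).Fingerprint)
}
```

With multiple concurrent senders this would no longer be safe, which is why the version comparison was needed before #64780.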

yuri-tceretian · 2023-04-13T20:48:23Z

pkg/services/ngalert/schedule/schedule.go

-			// and there were two concurrent messages in updateCh and evalCh, and the eval's one got processed first.
-			// therefore, at the time when message from updateCh is processed the current rule will have
-			// at least the same version (or greater) and the state created for the new version of the rule.
-			if currentRuleVersion >= int64(ctx.Version) {


This also no longer makes sense, because only the scheduler makes updates.

grafanabot · 2023-04-25T01:35:02Z

Hello @yuri-tceretian!
Backport pull requests need to be either:

Pull requests which address bugs,
Urgent fixes which need product approval, in order to get merged,
Docs changes.

Please, if the current pull request addresses a bug fix, label it with the type/bug label.
If it already has the product approval, please add the product-approved label. For docs changes, please add the type/docs label.
If the pull request modifies CI behaviour, please add the type/ci label.
If none of the above applies, please consider removing the backport label and target the next major/minor release.
Thanks!

grobinson-grafana · 2023-04-25T12:47:09Z

pkg/services/ngalert/schedule/registry.go

+	}
+	writeString := func(s string) {
+		// avoid allocation when converting string to byte slice
+		writeBytes(*(*[]byte)(unsafe.Pointer(&s))) //nolint:gosec


Does this need bounds checks? I found this: https://groups.google.com/g/golang-nuts/c/Zsfk-VMd_fU/m/O1ru4fO-BgAJ?pli=1

yuri-tceretian · 2023-04-26

I honestly have no idea why those boundaries are checked there (it's 65535 less than Int32.Max), and the author did not provide an explanation. Although I do not think it is physically possible to submit such a large string via the database, I ran a quick test and confirmed that the logic works even with strings whose length is int32.max.
The thread is very interesting, but I think the solution explained there was for the general case, where: 1. the string could be GCed while it is used as a byte slice, and 2. the capacity of the slice must be specified correctly. Neither applies here: we hold a reference to the alert rule, the lifetime of the byte slice is very short (just a single iteration), and we do not use the capacity of the resulting slice.

I left a TODO to convert it to unsafe.StringData once we switch to Go 1.20.
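For reference, a minimal sketch of what the Go 1.20+ variant mentioned in the TODO could look like, using unsafe.StringData together with unsafe.Slice. The helper names are hypothetical, and this is not necessarily the exact code that later landed. The returned slice aliases the string's memory, so it must never be modified or kept beyond the lifetime of the string.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"unsafe"
)

// stringBytes returns a read-only view of s without copying it.
// Requires Go 1.20 or newer. The result must not be mutated.
func stringBytes(s string) []byte {
	if len(s) == 0 {
		// unsafe.StringData is unspecified for the empty string.
		return nil
	}
	return unsafe.Slice(unsafe.StringData(s), len(s))
}

// hashString feeds s into FNV-1 without the string-to-slice copy.
func hashString(s string) uint64 {
	h := fnv.New64()
	_, _ = h.Write(stringBytes(s))
	return h.Sum64()
}

// hashStringCopy is the ordinary, copying variant, kept for comparison.
func hashStringCopy(s string) uint64 {
	h := fnv.New64()
	_, _ = h.Write([]byte(s))
	return h.Sum64()
}

func main() {
	for _, s := range []string{"", "folder", "alert rule title"} {
		fmt.Printf("%q same hash: %v\n", s, hashString(s) == hashStringCopy(s))
	}
}
```

The zero-length check mirrors the "add 0 len check" fixup: an empty string has no backing data worth pointing at.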

JacobsonMT · 2023-04-25T14:59:20Z

pkg/services/ngalert/schedule/schedule.go

+					// We need to reset state if the loop has started and the alert is already paused. It can happen,
+					// if we have an alert with state and we do file provision with stateful Grafana, that state
+					// lingers in DB and won't be cleaned up until next alert rule update.
+					needReset = needReset || (currentFingerprint == 0 && isPaused)


I would like to confirm that the logic change here was intentional. It seems to match the comment description better now, so maybe this was a bugfix?

Previously, the state would reset in these two cases:

1. currentRuleVersion != newVersion && currentRuleVersion > 0
2. currentRuleVersion != newVersion && isPaused

Now it resets in the following two:

1. currentFingerprint != f && currentFingerprint != 0
2. currentFingerprint == 0 && isPaused

The number 1s are effectively the same in both cases. The 2s have changed, though.

yuri-tceretian · 2023-04-26

I did that intentionally, and I think it is clearer now. The logic is:

- Do not reset when the rule routine has just started (its fingerprint, which used to be the version, is 0); reset only if the fingerprint (previously the version) is different.
- However, if it has just started and the first evaluation has isPaused=true, reset the state, because provisioning does not reset the state of a rule when it is provisioned.
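The two reset conditions discussed in this thread can be written as a small pure function. This is only a sketch: `needReset` and the `fingerprint` type are illustrative names, and the zero value stands for "the rule routine has just started".

```go
package main

import "fmt"

// fingerprint is an illustrative type; 0 means that the rule routine has
// just started and has not evaluated any rule definition yet.
type fingerprint uint64

// needReset reports whether the rule state should be cleared.
func needReset(current, next fingerprint, isPaused bool) bool {
	// Case 1: the routine has already evaluated a rule, and the rule changed.
	changed := current != 0 && current != next
	// Case 2: the routine has just started and the rule is already paused,
	// e.g. after file provisioning left stale state in the database.
	pausedOnStart := current == 0 && isPaused
	return changed || pausedOnStart
}

func main() {
	fmt.Println("fresh start, not paused:", needReset(0, 42, false))
	fmt.Println("fresh start, paused:    ", needReset(0, 42, true))
	fmt.Println("unchanged rule:         ", needReset(42, 42, false))
	fmt.Println("changed rule:           ", needReset(42, 43, false))
}
```

Note that a paused rule whose fingerprint is unchanged and non-zero does not trigger a reset under these conditions.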

JacobsonMT

LGTM 🚀

* implement calculation of fingerprint for ruleWithFolder * update scheduler to use fingerprint instead of rule's version (cherry picked from commit 9eb10be)


benoittgt · 2024-02-19T16:05:44Z

Hello

We update alerts via a script every 30 minutes, but they rarely change. Today there were no changes, yet we are still seeing an issue like this. We are on 10.2.2.

Any idea how we can find out why a rule is marked as "Normal (updated)"?

yuri-tceretian · 2024-02-20T14:46:42Z

You can find out why the rule was updated by checking the logs (only if debug logs are enabled). The message "Updating rule", with the context parameter rule_uid matching the rule's UID, also contains a diff.
If debug logs are not enabled, you can look for the message "Clearing the state of the rule because it was updated", which also has the context parameter rule_uid as well as fingerprint.
You can also check the database table alert_rule_version, which contains the history of rule changes.

benoittgt · 2024-02-21T08:48:51Z

Thanks @yuri-tceretian. alert_rule_version helped us. We saw a state of "normal (updated)" because we changed the relativeTimeRange. But still, the alert should not be marked as normal (updated): for us it means an invalid state for a few seconds, plus the firing "since" time is reset, which should not be the case. Why not pending (updated)? :)

Edit: New case

We got the same issue because we simply removed one annotation. Here is the diff of annotations in alert_rule_version

yuri-tceretian · 2024-02-22T15:25:17Z

Currently, a change to any field of a rule causes a reset of the state. The reason is that a change could result in different alerts. The notification pipeline uses fingerprints calculated from the set of labels to match alerts. That is how it creates new alerts, maintains active ones, and resolves them: we just create a new alert that has the same set of labels but a different status.
A rule change could result in alerts with a different set of labels and therefore different fingerprints. That could cause duplicates in the notification pipeline, i.e. the active alerts produced by the previous version get orphaned and are no longer maintained, while a new set produced by the new version of the rule gets created. In this case, the user could receive multiple notifications. That's why we decided to simply reset the state once we detect that the rule was updated.

Before this PR, we reset the state every time the rule's version changed. This PR improved that and lets us reset the state only when the fingerprint changes. This opened up the possibility of resetting the state more precisely by ignoring certain fields when calculating the fingerprint.

Changes in annotations do not affect the alert fingerprint, indeed, so they can be removed from the rule fingerprint. I do not think that the time range affects it either. So I think it makes sense to remove both. We will discuss it. Thanks!
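The point about label sets can be illustrated with a simplified label fingerprint. This sketch is a stand-in and not the hashing that the notification pipeline actually uses. It only shows that identity depends on the labels alone, so annotations cannot change it, while any label change yields a new identity and leaves the old alert orphaned.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// alert is a hypothetical, minimal alert model.
type alert struct {
	Labels      map[string]string
	Annotations map[string]string
}

// labelsFingerprint hashes the label set in sorted key order, so the
// result does not depend on map iteration order. Annotations are ignored.
func labelsFingerprint(a alert) uint64 {
	keys := make([]string, 0, len(a.Labels))
	for k := range a.Labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := fnv.New64a()
	sep := []byte{0xff} // not valid UTF-8, so a safe separator
	for _, k := range keys {
		_, _ = h.Write([]byte(k))
		_, _ = h.Write(sep)
		_, _ = h.Write([]byte(a.Labels[k]))
		_, _ = h.Write(sep)
	}
	return h.Sum64()
}

func main() {
	base := alert{
		Labels:      map[string]string{"alertname": "HighCPU", "host": "a"},
		Annotations: map[string]string{"summary": "CPU is high"},
	}
	noAnnotation := alert{Labels: base.Labels}
	relabeled := alert{Labels: map[string]string{"alertname": "HighCPU", "host": "a", "team": "infra"}}

	fmt.Println("annotation removed, same identity:", labelsFingerprint(base) == labelsFingerprint(noAnnotation))
	fmt.Println("label added, new identity:", labelsFingerprint(base) != labelsFingerprint(relabeled))
}
```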

yuri-tceretian · 2024-02-22T15:57:13Z

I opened an issue to improve this: #83250. Please upvote it if you think it makes sense.

yuri-tceretian requested a review from a team April 13, 2023 19:47

yuri-tceretian mentioned this pull request Apr 13, 2023

Alerting: Scheduler to check rule's fingerprint to reset state when version is changed #66424

Closed

3 tasks

yuri-tceretian self-assigned this Apr 13, 2023

grafanabot added the area/backend label Apr 13, 2023

yuri-tceretian force-pushed the yuri-tceretian/alertrule-fingerprint branch from cdcfb1e to e711103 Compare April 13, 2023 20:12

yuri-tceretian added area/alerting Grafana Alerting add to changelog no-backport Skip backport of PR labels Apr 13, 2023

yuri-tceretian added this to the 10.0.0 milestone Apr 13, 2023

yuri-tceretian added 3 commits April 13, 2023 16:36

implement calculation of fingerprint for ruleWithFolder

9148ecb

move getting folder title up

d20363f

update scheduler to support fingerprint instead of version

da351da

yuri-tceretian force-pushed the yuri-tceretian/alertrule-fingerprint branch from e711103 to da351da Compare April 13, 2023 20:46

yuri-tceretian added the kata:alerting-ease-of-use label Apr 18, 2023

yuri-tceretian requested review from JacobsonMT and grobinson-grafana April 18, 2023 15:38

change fingerprint formatting

3591301

yuri-tceretian mentioned this pull request Apr 24, 2023

Alerting: replace rule version with rule's hash in Scheduler\State manager #64843

Closed

yuri-tceretian added backport v9.5.x Bot will automatically open backport PR and removed no-backport Skip backport of PR labels Apr 25, 2023

grafanabot added the missing-labels label Apr 25, 2023

grobinson-grafana reviewed Apr 25, 2023

Merge branch 'up/main' into yuri-tceretian/alertrule-fingerprint

6f0bc92

JacobsonMT reviewed Apr 25, 2023

yuri-tceretian added 2 commits April 25, 2023 20:58

add 0 len check and todo

f22d3da

fixup

534d402

This was referenced Apr 26, 2023

Alerting: Reloading provisioned alerts resets all alerts states #61917

Closed

File-rovisioned alerts resets after reload #53302

Closed

armandgrillet added product-approved Pull requests that are approved by product/managers and are allowed to be backported and removed missing-labels kata:alerting-ease-of-use labels Apr 26, 2023

JacobsonMT approved these changes Apr 28, 2023

grobinson-grafana approved these changes Apr 28, 2023

yuri-tceretian merged commit 9eb10be into main Apr 28, 2023
9 checks passed

yuri-tceretian deleted the yuri-tceretian/alertrule-fingerprint branch April 28, 2023 14:42

grafanabot mentioned this pull request Apr 28, 2023

[v9.5.x] Alerting: Scheduler use rule fingerprint instead of version #67516

Merged

yuri-tceretian mentioned this pull request Apr 28, 2023

Alerting: Any change to a rule always increments version and resets state for all other rules in the same group #64256

Closed

ryantxu pushed a commit that referenced this pull request May 3, 2023

Alerting: Scheduler use rule fingerprint instead of version (#66531)

3a5d66c

* implement calculation of fingerprint for ruleWithFolder * update scheduler to use fingerprint instead of rule's version

zerok modified the milestones: 10.0.0, 10.0.0-preview May 31, 2023

yuri-tceretian mentioned this pull request Jun 30, 2023

Alerting: Use unsafe.Slice for hashing a string during rule fingerprint calculation #71000

Merged

yuri-tceretian mentioned this pull request Dec 7, 2023

Alert are marked as updated even if nothing changed #79235

Closed

yuri-tceretian mentioned this pull request Feb 22, 2024

Reduce set of fields which changes in alert rule that cause the reset of state. #83250

Open
