Alerting: In migration, create one label per channel #76527

JacobsonMT · 2023-10-13T10:43:02Z

This PR changes how routing is done by the legacy alerting migration.

Previously, we created a single label on each alert rule that contained an array of contact point names. Ex: __contact__="slack legacy testing","slack legacy testing2"

This label was then routed against a series of regex-matching policies with continue=true. Ex: __contacts__ =~ .*"slack legacy testing".*
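To make the previous behaviour concrete, here is a minimal sketch of how a single array-style label value would be built and then matched by an anchored regex per contact point. The helper names `legacyContactsLabel` and `matchesContact` are hypothetical and exist only for illustration; this is not the migration's actual code.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// legacyContactsLabel builds the single array-style label value,
// quoting each contact point name and joining them with commas.
// Hypothetical helper, for illustration only.
func legacyContactsLabel(names []string) string {
	quoted := make([]string, 0, len(names))
	for _, n := range names {
		quoted = append(quoted, `"`+n+`"`)
	}
	return strings.Join(quoted, ",")
}

// matchesContact mimics a policy of the form =~ .*"name".* by testing
// the label value against an anchored regex for one contact point.
func matchesContact(labelValue, name string) bool {
	re := regexp.MustCompile(`^.*"` + regexp.QuoteMeta(name) + `".*$`)
	return re.MatchString(labelValue)
}

func main() {
	v := legacyContactsLabel([]string{"slack legacy testing", "slack legacy testing2"})
	fmt.Println(v)
	fmt.Println(matchesContact(v, "slack legacy testing"))
	fmt.Println(matchesContact(v, "email"))
}
```

Changing which contact points a rule sends to means editing the quoted, comma-separated value by hand, which is the pain point described in this PR.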

More details here: #52071

In the case of many contact points, this array could quickly become difficult to manage and difficult to grok at-a-glance.

This PR replaces the single __contact__ label with multiple __legacy_c_{contactname}__ labels and simple equality-matching policies. These channel-specific policies are nested in a single route under the top-level route which matches against __legacy_use_channels__ = true for ease of organization.

This should improve the experience for users wanting to keep the default migrated routing strategy but who also want to modify which contact points an alert sends to.
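The new scheme can be sketched roughly as follows. `Matcher` and `Route` are simplified, hypothetical stand-ins for the Alertmanager types, and the sketch assumes each migrated rule also carries `__legacy_use_channels__ = true` so that it enters the nested route; the real migration code may differ in detail.

```go
package main

import "fmt"

// Matcher and Route are simplified stand-ins (assumptions), not the real
// Alertmanager types. Only equality matching is modelled.
type Matcher struct{ Name, Value string }

type Route struct {
	Receiver string
	Matchers []Matcher
	Continue bool
	Routes   []*Route
}

const useChannelsLabel = "__legacy_use_channels__"

// channelLabel returns the per-channel label name for a contact point.
func channelLabel(name string) string {
	return fmt.Sprintf("__legacy_c_%s__", name)
}

// ruleLabels returns the routing labels for a rule sending to the given channels.
func ruleLabels(channels []string) map[string]string {
	lbls := map[string]string{useChannelsLabel: "true"}
	for _, c := range channels {
		lbls[channelLabel(c)] = "true"
	}
	return lbls
}

// buildNestedRoute creates one route matching the use-channels label, with
// one equality-matching child route per channel (Continue=true).
func buildNestedRoute(channels []string) *Route {
	nested := &Route{Matchers: []Matcher{{useChannelsLabel, "true"}}, Continue: true}
	for _, c := range channels {
		nested.Routes = append(nested.Routes, &Route{
			Receiver: c,
			Matchers: []Matcher{{channelLabel(c), "true"}},
			Continue: true,
		})
	}
	return nested
}

func (r *Route) matches(lbls map[string]string) bool {
	for _, m := range r.Matchers {
		if lbls[m.Name] != m.Value {
			return false
		}
	}
	return true
}

// receiversFor lists the receivers an alert with these labels would reach.
func receiversFor(nested *Route, lbls map[string]string) []string {
	if !nested.matches(lbls) {
		return nil
	}
	var out []string
	for _, child := range nested.Routes {
		if child.matches(lbls) {
			out = append(out, child.Receiver)
			if !child.Continue {
				break
			}
		}
	}
	return out
}

func main() {
	route := buildNestedRoute([]string{"slack", "email", "pagerduty"})
	fmt.Println(receiversFor(route, ruleLabels([]string{"slack", "email"})))
}
```

With this layout, adding or removing a contact point on a rule is a matter of adding or removing one label rather than editing a quoted array.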

Special notes for your reviewer:

Please check that:

It works as expected from a user's perspective.
If this is a pre-GA feature, it is behind a feature toggle.
The docs are updated, and if this is a notable improvement, it's added to our What's New doc.

JacobsonMT · 2023-10-13T10:43:34Z

/deploy-to-hg

ephemeral-instances-bot · 2023-10-13T10:44:10Z

Preparing your instance. A comment containing your instance's url will be added to this PR when the instance is ready.
Your instance will be ready in ~10 minutes.
Check the GitHub actions tab to follow the workflow progress
Slack channel: #proj-ephemeral-hg-instances
Building instance with jacobsonmt/migration_improve_contact_point_creation oss branch and main enterprise branch. How to choose a branch

ephemeral-instances-bot · 2023-10-13T10:55:09Z

Your instance can be accessed at: https://ephemeral1511182176527jacobso.grafana-dev.net
The instance is not using the CDN assets.
How to access / How to update instance config / How to build a specific branch

yuri-tceretian · 2023-10-17T19:30:32Z

Related to #56582

yuri-tceretian · 2023-10-17T20:14:35Z

pkg/services/ngalert/migration/channel.go

+	// These will match two routes as they are all defined with Continue=true.
+
+	label := fmt.Sprintf(ContactLabelTemplate, channel.UID)
+	mat, _ := labels.NewMatcher(labels.MatchEqual, label, "true")


labels is an external dependency and its implementation details can change in the future. I think it is good practice to handle the function result correctly, in this case the error.

That's fair, I went back and forth on this one. I'll add it back.
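A minimal sketch of what handling the error might look like. To keep it self-contained, `newEqualMatcher` is a hypothetical stand-in for the external `labels.NewMatcher` call, and the `contactLabelTemplate` value is an assumption rather than the constant used in the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// contactLabelTemplate is an assumed value, used only for illustration.
const contactLabelTemplate = "__contact_%s__"

type matcher struct{ name, value string }

// newEqualMatcher is a hypothetical stand-in for the external constructor.
// It can fail, so callers should not discard its error.
func newEqualMatcher(name, value string) (*matcher, error) {
	if name == "" {
		return nil, errors.New("matcher name must not be empty")
	}
	return &matcher{name: name, value: value}, nil
}

// channelMatcher builds the equality matcher for one channel and
// propagates any constructor error with context instead of ignoring it.
func channelMatcher(channelUID string) (*matcher, error) {
	label := fmt.Sprintf(contactLabelTemplate, channelUID)
	m, err := newEqualMatcher(label, "true")
	if err != nil {
		return nil, fmt.Errorf("create matcher for channel %q: %w", channelUID, err)
	}
	return m, nil
}

func main() {
	m, err := channelMatcher("abc123")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(m.name, "=", m.value)
}
```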

pkg/services/ngalert/migration/channel.go

pkg/services/ngalert/migration/models/alertmanager.go

yuri-tceretian · 2023-10-17T20:31:51Z

pkg/services/ngalert/migration/alert_rule.go

+			uid, err := om.migrationStore.GetAlertNotificationUidWithId(ctx, orgID, ui.ID)
+			if err != nil {
+				l.Error("Failed to get alert notification UID", "notificationId", ui.ID, "err", err)
+			}


I think this answers my question above: we are going to drop a notification without a UID. I wonder what the reason for this decision is?

This is basically a simplified version of how it works in legacy alerting; see grafana/pkg/services/alerting/rule.go, lines 170 to 190 in 00d9543:

	for _, v := range ruleDef.Settings.Get("notifications").MustArray() {
		jsonModel := simplejson.NewFromAny(v)
		if id, err := jsonModel.Get("id").Int64(); err == nil {
			uid, err := translateNotificationIDToUID(ctx, store, id, ruleDef.OrgID)
			if err != nil {
				if !errors.Is(err, models.ErrAlertNotificationFailedTranslateUniqueID) {
					logger.Error("Failed to translate notification id to uid", "error", err.Error(), "dashboardId", model.DashboardID, "alert", model.Name, "panelId", model.PanelID, "notificationId", id)
				}

				if logTranslationFailures {
					logger.Warn("Unable to translate notification id to uid", "dashboardId", model.DashboardID, "alert", model.Name, "panelId", model.PanelID, "notificationId", id)
				}
			} else {
				model.Notifications = append(model.Notifications, uid)
			}
		} else if uid, err := jsonModel.Get("uid").String(); err == nil {
			model.Notifications = append(model.Notifications, uid)
		} else {
			return nil, ValidationError{Reason: "Neither id nor uid is specified in 'notifications' block, " + err.Error(), DashboardID: model.DashboardID, AlertID: model.ID, PanelID: model.PanelID}
		}
	}

At least one of the id or uid is guaranteed to be present on each entry of parsedSettings.Notifications. So, we perform a cached lookup of the UID for a given ID, or use the UID if one exists.

There was a point in time a while ago where UIDs didn't exist for notification channels, but we are guaranteed to have one now since we run after sqlstore migrations (specifically grafana/pkg/services/sqlstore/migrations/alert_mig.go, lines 187 to 190 in 2212c6d):

	mg.AddMigration("Update uid column values in alert_notification", new(RawSQLMigration).
		SQLite("UPDATE alert_notification SET uid=printf('%09d',id) WHERE uid IS NULL;").
		Postgres("UPDATE alert_notification SET uid=lpad('' || id::text,9,'0') WHERE uid IS NULL;").
		Mysql("UPDATE alert_notification SET uid=lpad(id,9,'0') WHERE uid IS NULL;"))

If neither id nor uid is present on the entry, or if the id is invalid, then that legacy alert would be failing to send to the channel in legacy alerting anyway.

pkg/services/ngalert/migration/alert_rule.go

pkg/services/ngalert/migration/cond_trans.go

yuri-tceretian · 2023-10-17T20:53:17Z

pkg/services/ngalert/migration/cond_trans.go

@@ -265,7 +307,7 @@ func getNewRefID(refIDs map[string][]int) (string, error) {
 		}
 		return sR, nil
 	}
-	return "", fmt.Errorf("failed to generate unique RefID")
+	return "", fmt.Errorf("generate unique RefID")


I do not think I can agree with this. The chained message "failed to migrate alert rules: failed to migrate alert rule: failed to generate unique RefID" is more readable than "migrate alert rules: migrate rule: generate unique RefID".

Yes, you're right. The leaf errors should mention the failure, this was accidental collateral in a mass replace.

Though in this case it would be:

Error "migration failed: executing migration: migrate org 1: migrate alerts: migrate and save dashboard '1': migrate alert 'alert 1': transform conditions: failed to generate unique RefID"

pkg/services/ngalert/migration/ualert.go

yuri-tceretian

Tested and it works as it should. LGTM

github-actions · 2023-11-19T01:49:58Z

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 2 weeks if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

JacobsonMT · 2023-11-19T19:23:31Z

Rebased onto main and added the following changes:

8e164e9: Fixes bug where we weren't caching folder permissions correctly.
107199b: Use channel names instead of UIDs in routing labels.
3a95337: Rename the routing labels so they all use the same __legacy prefix. Should make it more obvious that they came from the legacy migration and are related.

yuri-tceretian

Code-wise LGTM.

However, I wonder whether it does not contradict what we want to achieve as part of the simplification of notification policy management where users will be able to pick only one contact point. If we go with a different approach and create a contact point per set of the selected notification policies, then we will be able to leverage the new functionality we're adding. WDYT?

Previously, we used the notification channel names to create the routing labels on an alert rule. This makes future work to individually re-migrate single channels difficult, as channel names can be easily changed and are not guaranteed unique by any db constraint. This change modifies the logic to instead use the channel uid, which is guaranteed unique per org at the db level and is significantly less likely to change (indeed, even if it does change, we can likely assume that the new channel is not equal in identity to the old one).

In addition, we move away from a single `__contact__` label with an array of receiver names combined with regex-based route matching to multiple `__contact_{uid}__` labels and simple equality-based route matching. These routes are now all nested under a single top-level route matched against `__use_legacy_channels__ = true` for ease of organization. This is done to improve the experience for users wanting to keep the default migrated routing strategy but modify which contact points an alert sends to. Editing the array could be difficult when large.

This has the added benefit of greatly simplifying the logic around contact point migration as well, and removes the (now unnecessary) DashAlert wrapper for legacymodels.Alert.

JacobsonMT added area/alerting Grafana Alerting area/backend add to changelog no-backport Skip backport of PR labels Oct 13, 2023

JacobsonMT added this to the 10.2.x milestone Oct 13, 2023

JacobsonMT requested review from rwwiv and yuri-tceretian October 13, 2023 10:43

JacobsonMT requested a review from a team as a code owner October 13, 2023 10:43

JacobsonMT requested review from grobinson-grafana and removed request for a team October 13, 2023 10:43

JacobsonMT force-pushed the jacobsonmt/migration_improve_contact_point_creation branch from 8f8dd4a to 99f5270 Compare October 17, 2023 20:01

JacobsonMT requested review from a team as code owners October 17, 2023 20:01

JacobsonMT requested review from zserge, mildwonkey and nikimanoledaki and removed request for a team October 17, 2023 20:01

JacobsonMT changed the base branch from main to jacobsonmt/migration_fix_sqlite_provisioning_contention October 17, 2023 20:01

JacobsonMT removed request for zserge, mildwonkey, nikimanoledaki and a team October 17, 2023 20:02

yuri-tceretian reviewed Oct 17, 2023


JacobsonMT requested a review from yuri-tceretian October 18, 2023 18:51

yuri-tceretian approved these changes Oct 18, 2023


Base automatically changed from jacobsonmt/migration_fix_sqlite_provisioning_contention to main October 19, 2023 14:03

github-actions bot added the stale Issue with no recent activity label Nov 19, 2023

JacobsonMT force-pushed the jacobsonmt/migration_improve_contact_point_creation branch from 3ab8995 to 3a95337 Compare November 19, 2023 19:19

JacobsonMT removed the stale Issue with no recent activity label Nov 19, 2023

JacobsonMT requested a review from yuri-tceretian November 19, 2023 19:23

JacobsonMT changed the title ~~Alerting: In migration, create one label per channel using UID instead of name~~ Alerting: In migration, create one label per channel Nov 20, 2023

JacobsonMT mentioned this pull request Nov 21, 2023

Alerting: During legacy migration reduce the number of created silences #78505

Merged

1 task

JacobsonMT added the area/alerting/migration Issues relating to legacy alerting migration label Nov 24, 2023

JacobsonMT force-pushed the jacobsonmt/migration_improve_contact_point_creation branch 3 times, most recently from ea3ade5 to eae4d32 Compare November 30, 2023 18:26

yuri-tceretian approved these changes Dec 4, 2023


JacobsonMT added 8 commits December 18, 2023 11:16

Address PR review comments

2642ae1

Error to Warn log when notification id has no associated uid

36719bd

Fix small bug where folder permissions were calculated too often

a959711

Use channel names in label instead of uids

4a69987

Modify routing labels to start with the same prefix __legacy

79bbacf

Fix rebase conflict

ac05217

Fix linting

218cd04

JacobsonMT force-pushed the jacobsonmt/migration_improve_contact_point_creation branch from eae4d32 to 218cd04 Compare December 18, 2023 18:21

JacobsonMT merged commit 0424d44 into main Dec 19, 2023
13 checks passed

JacobsonMT deleted the jacobsonmt/migration_improve_contact_point_creation branch December 19, 2023 18:25

grafana-delivery-bot bot modified the milestones: 10.2.x, 10.3.x Dec 19, 2023

summerwollin modified the milestones: 10.3.x, 10.3.0 Jan 22, 2024
