New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alerting: In migration, create one label per channel #76527
Alerting: In migration, create one label per channel #76527
Conversation
/deploy-to-hg |
|
|
Related to #56582 |
8f8dd4a
to
99f5270
Compare
// These will match two routes as they are all defined with Continue=true. | ||
|
||
label := fmt.Sprintf(ContactLabelTemplate, channel.UID) | ||
mat, _ := labels.NewMatcher(labels.MatchEqual, label, "true") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
labels
is an external dependency and it's implementation details can change in future. I think it is a good practice to correctly handle the function result, in this case, error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair, I went back and forth on this one. I'll add it back.
uid, err := om.migrationStore.GetAlertNotificationUidWithId(ctx, orgID, ui.ID) | ||
if err != nil { | ||
l.Error("Failed to get alert notification UID", "notificationId", ui.ID, "err", err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this answers my question above: we are going to drop a notification without UID. I wonder what is the reason for such decision?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is basically a simplified version of how it works in legacy alerting, see
grafana/pkg/services/alerting/rule.go
Lines 170 to 190 in 00d9543
for _, v := range ruleDef.Settings.Get("notifications").MustArray() { | |
jsonModel := simplejson.NewFromAny(v) | |
if id, err := jsonModel.Get("id").Int64(); err == nil { | |
uid, err := translateNotificationIDToUID(ctx, store, id, ruleDef.OrgID) | |
if err != nil { | |
if !errors.Is(err, models.ErrAlertNotificationFailedTranslateUniqueID) { | |
logger.Error("Failed to translate notification id to uid", "error", err.Error(), "dashboardId", model.DashboardID, "alert", model.Name, "panelId", model.PanelID, "notificationId", id) | |
} | |
if logTranslationFailures { | |
logger.Warn("Unable to translate notification id to uid", "dashboardId", model.DashboardID, "alert", model.Name, "panelId", model.PanelID, "notificationId", id) | |
} | |
} else { | |
model.Notifications = append(model.Notifications, uid) | |
} | |
} else if uid, err := jsonModel.Get("uid").String(); err == nil { | |
model.Notifications = append(model.Notifications, uid) | |
} else { | |
return nil, ValidationError{Reason: "Neither id nor uid is specified in 'notifications' block, " + err.Error(), DashboardID: model.DashboardID, AlertID: model.ID, PanelID: model.PanelID} | |
} | |
} |
At least one of the id
or uid
is guaranteed to be present on each entry of parsedSettings.Notifications
. So, we performa cached lookup of the UID for a given ID, or use the UID if one exists.
There was a point in time a while ago where UIDs didn't exist for notification channels, but we are guaranteed to have one now since we run after sqlstore migrations (specifically
grafana/pkg/services/sqlstore/migrations/alert_mig.go
Lines 187 to 190 in 2212c6d
mg.AddMigration("Update uid column values in alert_notification", new(RawSQLMigration). | |
SQLite("UPDATE alert_notification SET uid=printf('%09d',id) WHERE uid IS NULL;"). | |
Postgres("UPDATE alert_notification SET uid=lpad('' || id::text,9,'0') WHERE uid IS NULL;"). | |
Mysql("UPDATE alert_notification SET uid=lpad(id,9,'0') WHERE uid IS NULL;")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If neither id
not uid
is present on the entry, or if the id
is invalid, then that legacy alert would be failing to send to the channel in legacy alerting anyways.
@@ -265,7 +307,7 @@ func getNewRefID(refIDs map[string][]int) (string, error) { | |||
} | |||
return sR, nil | |||
} | |||
return "", fmt.Errorf("failed to generate unique RefID") | |||
return "", fmt.Errorf("generate unique RefID") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think I can agree with this. This makes the chained message be more readable message failed to migrate alert rules: failed to migrate alert rule:failed to generate unique RefID
is more readable than migrate alert rules: migrate rule: generate unique RefID
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you're right. The leaf errors should mention the failure, this was accidental collateral in a mass replace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though in this case it would be:
Error "migration failed: executing migration: migrate org 1: migrate alerts: migrate and save dashboard '1': migrate alert 'alert 1': transform conditions: failed to generate unique RefID"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested and it works as it should. LGTM
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 2 weeks if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
3ab8995
to
3a95337
Compare
Rebased onto main and added the following changes: 8e164e9: Fixes bug where we weren't caching folder permissions correctly. |
ea3ade5
to
eae4d32
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code-wise LGTM.
However, I wonder whether it does not contradict what we want to achieve as part of the simplification of notification policy management where users will be able to pick only one contact point. If we go with a different approach and create a contact point per set of the selected notification policies, then we will be able to leverage the new functionality we're adding. WDYT?
Previously, we used the notification channel names to create the routing labels on an alert rule. This makes future work to individually re-migrate single channels difficult as channel names can be easily changed and are not guaranteed unique by any db constraint. This change modifies the logic to instead use the channel uid, which is guaranteed unique per org at the db level and is significantly less likely to change (indeed even if it does change, we can likely assume that the new channel is not equal in identity to the old one). In addition, we move away from a single `__contact__` label with an array of receiver names combined with regex-based route matching to multiple `__contact_{uid}__` labels and a simple equality-based route matching. These routes are now all nested under a single top-level route matched against `__use_legacy_channels__ = true` for ease of organization. This is done to improve the experience for users wanting to keep the default migrated routing strategy but modify which contact points an alert sends to. Editing the array could be difficult when large. This has the added benefit of great simplifying the logic around contact point migration as well and removes the (now unnecessary) DashAlert wrapper for legacymodels.Alert.
eae4d32
to
218cd04
Compare
This PR changes how routing is done by the legacy alerting migration.
Previously, we created a single label on each alert rule that contained an array of contact point names. Ex:
__contact__="slack legacy testing","slack legacy testing2"
This label was then routed against a series of regex-matching policies with
continue=true
. Ex:__contacts__ =~ .*"slack legacy testing".*
More details here: #52071
In the case of many contact points, this array could quickly become difficult to manage and difficult to grok at-a-glance.
This PR replaces the single
__contact__
label with multiple__legacy_c_{contactname}__
labels and simple equality-matching policies. These channel-specific policies are nested in a single route under the top-level route which matches against__legacy_use_channels__ = true
for ease of organization.This should improve the experience for users wanting to keep the default migrated routing strategy but who also want to modify which contact points an alert sends to.
Special notes for your reviewer:
Please check that: