Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alerting: In migration, create one label per channel #76527

Merged
merged 8 commits into from Dec 19, 2023

Conversation

JacobsonMT
Copy link
Member

@JacobsonMT JacobsonMT commented Oct 13, 2023

This PR changes how routing is done by the legacy alerting migration.

Previously, we created a single label on each alert rule that contained an array of contact point names. Ex: __contact__="slack legacy testing","slack legacy testing2"

This label was then routed against a series of regex-matching policies with continue=true. Ex: __contacts__ =~ .*"slack legacy testing".*

More details here: #52071

In the case of many contact points, this array could quickly become difficult to manage and difficult to grok at-a-glance.

This PR replaces the single __contact__ label with multiple __legacy_c_{contactname}__ labels and simple equality-matching policies. These channel-specific policies are nested in a single route under the top-level route which matches against __legacy_use_channels__ = true for ease of organization.

This should improve the experience for users wanting to keep the default migrated routing strategy but who also want to modify which contact points an alert sends to.

Special notes for your reviewer:

image

image

Please check that:

  • It works as expected from a user's perspective.
  • If this is a pre-GA feature, it is behind a feature toggle.
  • The docs are updated, and if this is a notable improvement, it's added to our What's New doc.

@JacobsonMT JacobsonMT added this to the 10.2.x milestone Oct 13, 2023
@JacobsonMT JacobsonMT requested a review from a team as a code owner October 13, 2023 10:43
@JacobsonMT JacobsonMT requested review from grobinson-grafana and removed request for a team October 13, 2023 10:43
@JacobsonMT
Copy link
Member Author

/deploy-to-hg

@ephemeral-instances-bot
Copy link

  • Preparing your instance. A comment containing your instance's url will be added to this PR when the instance is ready.
  • Your instance will be ready in ~10 minutes.
  • Check the GitHub actions tab to follow the workflow progress
  • Slack channel: #proj-ephemeral-hg-instances
  • Building instance with jacobsonmt/migration_improve_contact_point_creation oss branch and main enterprise branch. How to choose a branch

@ephemeral-instances-bot
Copy link

@yuri-tceretian
Copy link
Contributor

Related to #56582

@JacobsonMT JacobsonMT force-pushed the jacobsonmt/migration_improve_contact_point_creation branch from 8f8dd4a to 99f5270 Compare October 17, 2023 20:01
@JacobsonMT JacobsonMT requested review from a team as code owners October 17, 2023 20:01
@JacobsonMT JacobsonMT requested review from zserge, mildwonkey and nikimanoledaki and removed request for a team October 17, 2023 20:01
@JacobsonMT JacobsonMT changed the base branch from main to jacobsonmt/migration_fix_sqlite_provisioning_contention October 17, 2023 20:01
// These will match two routes as they are all defined with Continue=true.

label := fmt.Sprintf(ContactLabelTemplate, channel.UID)
mat, _ := labels.NewMatcher(labels.MatchEqual, label, "true")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

labels is an external dependency and it's implementation details can change in future. I think it is a good practice to correctly handle the function result, in this case, error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair, I went back and forth on this one. I'll add it back.

pkg/services/ngalert/migration/channel.go Outdated Show resolved Hide resolved
pkg/services/ngalert/migration/models/alertmanager.go Outdated Show resolved Hide resolved
uid, err := om.migrationStore.GetAlertNotificationUidWithId(ctx, orgID, ui.ID)
if err != nil {
l.Error("Failed to get alert notification UID", "notificationId", ui.ID, "err", err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this answers my question above: we are going to drop a notification without UID. I wonder what is the reason for such decision?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is basically a simplified version of how it works in legacy alerting, see

for _, v := range ruleDef.Settings.Get("notifications").MustArray() {
jsonModel := simplejson.NewFromAny(v)
if id, err := jsonModel.Get("id").Int64(); err == nil {
uid, err := translateNotificationIDToUID(ctx, store, id, ruleDef.OrgID)
if err != nil {
if !errors.Is(err, models.ErrAlertNotificationFailedTranslateUniqueID) {
logger.Error("Failed to translate notification id to uid", "error", err.Error(), "dashboardId", model.DashboardID, "alert", model.Name, "panelId", model.PanelID, "notificationId", id)
}
if logTranslationFailures {
logger.Warn("Unable to translate notification id to uid", "dashboardId", model.DashboardID, "alert", model.Name, "panelId", model.PanelID, "notificationId", id)
}
} else {
model.Notifications = append(model.Notifications, uid)
}
} else if uid, err := jsonModel.Get("uid").String(); err == nil {
model.Notifications = append(model.Notifications, uid)
} else {
return nil, ValidationError{Reason: "Neither id nor uid is specified in 'notifications' block, " + err.Error(), DashboardID: model.DashboardID, AlertID: model.ID, PanelID: model.PanelID}
}
}
.

At least one of the id or uid is guaranteed to be present on each entry of parsedSettings.Notifications. So, we performa cached lookup of the UID for a given ID, or use the UID if one exists.

There was a point in time a while ago where UIDs didn't exist for notification channels, but we are guaranteed to have one now since we run after sqlstore migrations (specifically

mg.AddMigration("Update uid column values in alert_notification", new(RawSQLMigration).
SQLite("UPDATE alert_notification SET uid=printf('%09d',id) WHERE uid IS NULL;").
Postgres("UPDATE alert_notification SET uid=lpad('' || id::text,9,'0') WHERE uid IS NULL;").
Mysql("UPDATE alert_notification SET uid=lpad(id,9,'0') WHERE uid IS NULL;"))
)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If neither id not uid is present on the entry, or if the id is invalid, then that legacy alert would be failing to send to the channel in legacy alerting anyways.

pkg/services/ngalert/migration/alert_rule.go Outdated Show resolved Hide resolved
pkg/services/ngalert/migration/cond_trans.go Outdated Show resolved Hide resolved
pkg/services/ngalert/migration/cond_trans.go Outdated Show resolved Hide resolved
@@ -265,7 +307,7 @@ func getNewRefID(refIDs map[string][]int) (string, error) {
}
return sR, nil
}
return "", fmt.Errorf("failed to generate unique RefID")
return "", fmt.Errorf("generate unique RefID")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think I can agree with this. This makes the chained message be more readable message failed to migrate alert rules: failed to migrate alert rule:failed to generate unique RefID is more readable than migrate alert rules: migrate rule: generate unique RefID.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right. The leaf errors should mention the failure, this was accidental collateral in a mass replace.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though in this case it would be:

Error "migration failed: executing migration: migrate org 1: migrate alerts: migrate and save dashboard '1': migrate alert 'alert 1': transform conditions: failed to generate unique RefID"

pkg/services/ngalert/migration/ualert.go Show resolved Hide resolved
Copy link
Contributor

@yuri-tceretian yuri-tceretian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested and it works as it should. LGTM

Base automatically changed from jacobsonmt/migration_fix_sqlite_provisioning_contention to main October 19, 2023 14:03
Copy link
Contributor

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 2 weeks if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot added the stale Issue with no recent activity label Nov 19, 2023
@JacobsonMT JacobsonMT force-pushed the jacobsonmt/migration_improve_contact_point_creation branch from 3ab8995 to 3a95337 Compare November 19, 2023 19:19
@JacobsonMT JacobsonMT removed the stale Issue with no recent activity label Nov 19, 2023
@JacobsonMT
Copy link
Member Author

Rebased onto main and added the following changes:

8e164e9: Fixes bug where we weren't caching folder permissions correctly.
107199b: Use channel names instead of UIDs in routing labels.
3a95337: Rename the routing labels so they all use the same __legacy prefix. Should make it more obvious that they came from the legacy migration and are related.

@JacobsonMT JacobsonMT changed the title Alerting: In migration, create one label per channel using UID instead of name Alerting: In migration, create one label per channel Nov 20, 2023
@JacobsonMT JacobsonMT added the area/alerting/migration Issues relating to legacy alerting migration label Nov 24, 2023
@JacobsonMT JacobsonMT force-pushed the jacobsonmt/migration_improve_contact_point_creation branch 3 times, most recently from ea3ade5 to eae4d32 Compare November 30, 2023 18:26
Copy link
Contributor

@yuri-tceretian yuri-tceretian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code-wise LGTM.

However, I wonder whether it does not contradict what we want to achieve as part of the simplification of notification policy management where users will be able to pick only one contact point. If we go with a different approach and create a contact point per set of the selected notification policies, then we will be able to leverage the new functionality we're adding. WDYT?

Previously, we used the notification channel names to create the routing labels
on an alert rule. This makes future work to individually re-migrate
single channels difficult as channel names can be easily changed and are not
guaranteed unique by any db constraint.

This change modifies the logic to instead use the channel uid, which is
guaranteed unique per org at the db level and is significantly less likely to
change (indeed even if it does change, we can likely assume that the new channel
 is not equal in identity to the old one).

In addition, we move away from a single `__contact__` label with an array of
receiver names combined with regex-based route matching to multiple
`__contact_{uid}__` labels and a simple equality-based route matching. These
routes are now all nested under a single top-level route matched against
`__use_legacy_channels__ = true` for ease of organization.

This is done to improve the experience for users wanting to keep the
default migrated routing strategy but modify which contact points an alert
sends to. Editing the array could be difficult when large.

This has the added benefit of great simplifying the logic around contact point
migration as well and removes the (now unnecessary) DashAlert wrapper for
 legacymodels.Alert.
@JacobsonMT JacobsonMT force-pushed the jacobsonmt/migration_improve_contact_point_creation branch from eae4d32 to 218cd04 Compare December 18, 2023 18:21
@JacobsonMT JacobsonMT merged commit 0424d44 into main Dec 19, 2023
13 checks passed
@JacobsonMT JacobsonMT deleted the jacobsonmt/migration_improve_contact_point_creation branch December 19, 2023 18:25
@grafana-delivery-bot grafana-delivery-bot bot modified the milestones: 10.2.x, 10.3.x Dec 19, 2023
@summerwollin summerwollin modified the milestones: 10.3.x, 10.3.0 Jan 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
add to changelog area/alerting/migration Issues relating to legacy alerting migration area/alerting Grafana Alerting area/backend no-backport Skip backport of PR
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

3 participants