Alerting: During legacy migration reduce the number of created silences #78505

Merged
merged 4 commits into from Jan 24, 2024

JacobsonMT
Member

During legacy migration, every migrated rule was given a label rule_uid=<uid>. This label was used to silence DatasourceError and DatasourceNoData alerts for migrated rules that had ExecutionErrorState or NoDataState, respectively, set to keep_state.

This could create a large number of silences and a high-cardinality label, both of which hurt CPU load and latency in unified alerting.

Instead, this change adds one label for ExecutionErrorState and one for NoDataState when they are set to keep_state, and creates up to two silence rules, one for each label that was actually applied to a rule during migration. These silence rules match:

  • __legacy_silence_error_keep_state__ = true
  • __legacy_silence_nodata_keep_state__ = true

In most cases this drastically reduces the number of silence rules created, and it avoids creating the potentially high-cardinality rule_uid label.
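
To make the reduction concrete, here is a rough sketch comparing the two strategies. It is not the migration code: the `rule` type and both counting functions are hypothetical, and the old behaviour is modelled under the assumption that each keep_state setting on a rule needed its own silence keyed on rule_uid.

```go
package main

import "fmt"

// rule is a hypothetical, simplified view of a migrated legacy alert.
type rule struct {
	UID             string
	ErrorKeepState  bool // ExecutionErrorState == "keep_state"
	NoDataKeepState bool // NoDataState == "keep_state"
}

// perRuleSilences models the old behaviour, assuming every keep_state
// setting on a rule needs its own silence matching rule_uid.
func perRuleSilences(rules []rule) int {
	n := 0
	for _, r := range rules {
		if r.ErrorKeepState {
			n++
		}
		if r.NoDataKeepState {
			n++
		}
	}
	return n
}

// sharedSilences models the new behaviour: one silence per shared label,
// created only if at least one rule carries that label.
func sharedSilences(rules []rule) int {
	var hasError, hasNoData bool
	for _, r := range rules {
		hasError = hasError || r.ErrorKeepState
		hasNoData = hasNoData || r.NoDataKeepState
	}
	n := 0
	if hasError {
		n++
	}
	if hasNoData {
		n++
	}
	return n
}

func main() {
	rules := make([]rule, 0, 1000)
	for i := 0; i < 1000; i++ {
		rules = append(rules, rule{
			UID:             fmt.Sprintf("uid-%d", i),
			ErrorKeepState:  true,
			NoDataKeepState: i%2 == 0,
		})
	}
	fmt.Println("per-rule silences:", perRuleSilences(rules))
	fmt.Println("shared silences:", sharedSilences(rules))
}
```

Under this model the per-rule count grows with the number of migrated rules, while the shared count never exceeds two.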

Who is this feature for?

Users migrating from legacy alerting.

Special notes for your reviewer:

This is a forward port of #77642, modified to fit with the new migration structure and better tested. Note the change in label names to fit with the new prefix from #76527.

Depends on #76527, so should wait for that to be merged.

Please check that:

  • It works as expected from a user's perspective.

@JacobsonMT JacobsonMT added the area/alerting (Grafana Alerting), area/backend, add to changelog, no-backport (Skip backport of PR), and no-changelog (Skip including change in changelog/release notes) labels Nov 21, 2023
@JacobsonMT JacobsonMT requested a review from a team as a code owner November 21, 2023 21:26
@JacobsonMT JacobsonMT requested review from rwwiv, yuri-tceretian and grobinson-grafana and removed request for a team November 21, 2023 21:26
@grafana-delivery-bot grafana-delivery-bot bot added this to the 10.3.x milestone Nov 21, 2023
@JacobsonMT JacobsonMT force-pushed the jacobsonmt/migration_improve_silence_labels branch from 71b6da0 to 6ce88f9 Compare November 21, 2023 21:27
@JacobsonMT JacobsonMT added the area/alerting/migration Issues relating to legacy alerting migration label Nov 24, 2023
@JacobsonMT JacobsonMT force-pushed the jacobsonmt/migration_improve_contact_point_creation branch 2 times, most recently from 8e41dd9 to ea3ade5 Compare November 30, 2023 15:14
@JacobsonMT JacobsonMT force-pushed the jacobsonmt/migration_improve_silence_labels branch from 6ce88f9 to 5a71d5b Compare November 30, 2023 15:28
@JacobsonMT JacobsonMT force-pushed the jacobsonmt/migration_improve_contact_point_creation branch from ea3ade5 to eae4d32 Compare November 30, 2023 18:26
@JacobsonMT JacobsonMT force-pushed the jacobsonmt/migration_improve_silence_labels branch from 5a71d5b to 32685c6 Compare November 30, 2023 18:27
@JacobsonMT JacobsonMT force-pushed the jacobsonmt/migration_improve_contact_point_creation branch from eae4d32 to 218cd04 Compare December 18, 2023 18:21
Base automatically changed from jacobsonmt/migration_improve_contact_point_creation to main December 19, 2023 18:25
@JacobsonMT JacobsonMT force-pushed the jacobsonmt/migration_improve_silence_labels branch from 32685c6 to 6297513 Compare January 10, 2024 21:35
@JacobsonMT JacobsonMT removed the no-changelog Skip including change in changelog/release notes label Jan 10, 2024
Contributor

@rwwiv rwwiv left a comment

Much-needed change 🎉 Left some feedback about moving the silence logic below the service level, but the rest LGTM.

}
if parsedSettings.ExecutionErrorState == "keep_state" {
	om.rulesWithErrorSilenceLabels++
	n, v := getLabelForErrorSilenceMatching()
Contributor

Nit: Now that these strings are constant we don't need to keep them as function return values.
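
A minimal sketch of what this nit could look like, assuming the label names and value from the PR description. The constant names and the `addSilenceLabels` helper are hypothetical and are not taken from the actual change.

```go
package main

import "fmt"

// Label names and value as listed in the PR description.
// The constant identifiers themselves are made up for this sketch.
const (
	errorSilenceLabelName  = "__legacy_silence_error_keep_state__"
	noDataSilenceLabelName = "__legacy_silence_nodata_keep_state__"
	silenceLabelValue      = "true"
)

// keepState is the legacy setting value that triggers the silence labels.
const keepState = "keep_state"

// addSilenceLabels is a hypothetical helper that reads the constants
// directly instead of calling a getter function for each name/value pair.
func addSilenceLabels(labels map[string]string, executionErrorState, noDataState string) {
	if executionErrorState == keepState {
		labels[errorSilenceLabelName] = silenceLabelValue
	}
	if noDataState == keepState {
		labels[noDataSilenceLabelName] = silenceLabelValue
	}
}

func main() {
	labels := map[string]string{"alertname": "cpu high"}
	addSilenceLabels(labels, "keep_state", "no_data")
	fmt.Println(labels)
}
```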

@@ -47,6 +49,7 @@ type migrationService struct {
	migrationStore migrationStore.Store

	encryptionService secrets.Service
	silenceFile       func(filename string) (io.WriteCloser, error)
Contributor

It feels like we're leaking details up to the service level that don't need to be there. WDYT about abstracting this further, e.g. into a "silence handler" if you want to keep using a string builder in tests, or just reading the file directly (and optionally calling those tests integration tests if I/O is a concern)?

Member Author

I'm not sure it's leaking details per se, but I agree it's a bit clunky for the sake of the test. I like your silenceHandler idea though, I went with that.
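
For readers following along, one possible shape of such a handler is sketched below. This is an assumption about the general idea only, not the code that was merged: the `silenceHandler` interface, both implementations, the `persist` method, and the `silences.<orgID>` file name are all hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"path/filepath"
)

// silenceHandler is a hypothetical abstraction that hides how silences
// are persisted from the migration service.
type silenceHandler interface {
	persist(orgID int64, data []byte) error
}

// fileSilenceHandler writes one file per org under dir. The file name
// pattern is an assumption made for this sketch.
type fileSilenceHandler struct{ dir string }

func (h fileSilenceHandler) persist(orgID int64, data []byte) error {
	name := filepath.Join(h.dir, fmt.Sprintf("silences.%d", orgID))
	return os.WriteFile(name, data, 0o600)
}

// memorySilenceHandler keeps the bytes in memory so tests avoid disk I/O.
type memorySilenceHandler struct{ byOrg map[int64]*bytes.Buffer }

func (h *memorySilenceHandler) persist(orgID int64, data []byte) error {
	if h.byOrg == nil {
		h.byOrg = map[int64]*bytes.Buffer{}
	}
	buf, ok := h.byOrg[orgID]
	if !ok {
		buf = &bytes.Buffer{}
		h.byOrg[orgID] = buf
	}
	_, err := buf.Write(data)
	return err
}

// Compile-time checks that both types satisfy the interface.
var (
	_ silenceHandler = fileSilenceHandler{}
	_ silenceHandler = (*memorySilenceHandler)(nil)
)

// migrationService depends only on the interface, not on file details.
type migrationService struct{ silences silenceHandler }

func (ms *migrationService) writeSilences(orgID int64, data []byte) error {
	if err := ms.silences.persist(orgID, data); err != nil {
		return fmt.Errorf("write silences for org %d: %w", orgID, err)
	}
	return nil
}

func main() {
	mem := &memorySilenceHandler{}
	ms := &migrationService{silences: mem}
	if err := ms.writeSilences(1, []byte("example")); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(mem.byOrg[1].String())
}
```

With this shape the service holds a single dependency, and tests can swap in the in-memory handler without the service knowing about file names or writers.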

Comment on lines 493 to 507
var silences []*pb.MeshSilence
if om.rulesWithErrorSilenceLabels > 0 {
	om.log.Info("Creating silence for rules with ExecutionErrorState = keep_state", "rules", om.rulesWithErrorSilenceLabels)
	silences = append(silences, errorSilence())
}
if om.rulesWithNoDataSilenceLabels > 0 {
	om.log.Info("Creating silence for rules with NoDataState = keep_state", "rules", om.rulesWithNoDataSilenceLabels)
	silences = append(silences, noDataSilence())
}
if len(silences) > 0 {
	om.log.Debug("Writing silences file", "silences", len(silences))
	if err := ms.writeSilencesFile(om.orgID, silences); err != nil {
		return fmt.Errorf("write silence file for org %d: %w", o.ID, err)
	}
}
Contributor

There's enough here that it probably shouldn't be under migrateAllOrgs anymore, especially since the logic is actually being tested in silences_test.go. WDYT about moving this block to its own function/method in silences.go and testing that directly?
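
A rough sketch of the suggested extraction, with the decision of which silences to create separated from the write so it can be tested directly. The `silence` struct is a simplified stand-in for *pb.MeshSilence, and the method names and the `write` callback are hypothetical.

```go
package main

import "fmt"

// silence is a simplified stand-in for *pb.MeshSilence: one equality matcher.
type silence struct {
	labelName  string
	labelValue string
}

// orgMigration mirrors only the counters used in the snippet under review.
type orgMigration struct {
	orgID                        int64
	rulesWithErrorSilenceLabels  int
	rulesWithNoDataSilenceLabels int
}

// silencesToCreate returns one silence per label that at least one
// migrated rule carries. It has no side effects, so it is easy to test.
func (om *orgMigration) silencesToCreate() []silence {
	var silences []silence
	if om.rulesWithErrorSilenceLabels > 0 {
		silences = append(silences, silence{
			labelName:  "__legacy_silence_error_keep_state__",
			labelValue: "true",
		})
	}
	if om.rulesWithNoDataSilenceLabels > 0 {
		silences = append(silences, silence{
			labelName:  "__legacy_silence_nodata_keep_state__",
			labelValue: "true",
		})
	}
	return silences
}

// writeSilences calls write only when there is something to persist.
// The write callback is a hypothetical seam for the real file writer.
func (om *orgMigration) writeSilences(write func(orgID int64, silences []silence) error) error {
	silences := om.silencesToCreate()
	if len(silences) == 0 {
		return nil
	}
	if err := write(om.orgID, silences); err != nil {
		return fmt.Errorf("write silence file for org %d: %w", om.orgID, err)
	}
	return nil
}

func main() {
	om := &orgMigration{orgID: 1, rulesWithErrorSilenceLabels: 3}
	err := om.writeSilences(func(orgID int64, silences []silence) error {
		fmt.Println("org", orgID, "silences", silences)
		return nil
	})
	if err != nil {
		fmt.Println("error:", err)
	}
}
```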

@grobinson-grafana
Contributor

Should I be able to see the label in the alert? Here I am testing Keep Last State for a NoData alert:

Screenshot 2024-01-18 at 11 10 42 AM

@JacobsonMT
Member Author

> Should I be able to see the label in the alert? Here I am testing Keep Last State for a NoData alert:
> Screenshot 2024-01-18 at 11 10 42 AM

I guess we hide internal labels by default in this view. We should probably be more consistent overall, though I'm not sure if un-hiding all system labels is what we want in this one or not.

Contributor

@rwwiv rwwiv left a comment

LGTM :shipit:

@JacobsonMT JacobsonMT merged commit 71e70c4 into main Jan 24, 2024
11 checks passed
@JacobsonMT JacobsonMT deleted the jacobsonmt/migration_improve_silence_labels branch January 24, 2024 20:56
@grafana-delivery-bot grafana-delivery-bot bot modified the milestones: 10.3.x, 10.4.x Jan 24, 2024
Ukochka pushed a commit that referenced this pull request Feb 14, 2024
…es (#78505)

@aangelisc aangelisc modified the milestones: 10.4.x, 10.4.0 Mar 6, 2024