New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alerting: During legacy migration reduce the number of created silences #78505
Conversation
71b6da0
to
6ce88f9
Compare
8e41dd9
to
ea3ade5
Compare
6ce88f9
to
5a71d5b
Compare
ea3ade5
to
eae4d32
Compare
5a71d5b
to
32685c6
Compare
eae4d32
to
218cd04
Compare
During legacy migration every migrated rule was given a label rule_uid=<uid>. This was used to silence DatasourceError/DatasourceNoData alerts for migrated rules that had either ExecutionErrorState/NoDataState set to keep_state, respectively. This could potentially create a large amount of silences and a high cardinality label. Both of these scenarios have poor outcomes for CPU load and latency in unified alerting. Instead, this change creates one label per ExecutionErrorState/NoDataState when they are set to keep_state as well as two silence rules, if rules with said labels were created during migration. These silence rules are: - __legacy_silence_error_keep_state__ = true - __legacy_silence_nodata_keep_state__ = true This will drastically reduce the number of created silence rules in most cases as well as not create the potentially high cardinality label `rule_uid`.
32685c6
to
6297513
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well needed change 🎉 Left some feedback about moving the silence logic below the service level but the rest LGTM
} | ||
if parsedSettings.ExecutionErrorState == "keep_state" { | ||
om.rulesWithErrorSilenceLabels++ | ||
n, v := getLabelForErrorSilenceMatching() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Now that these strings are constant we don't need to keep them as function return values.
@@ -47,6 +49,7 @@ type migrationService struct { | |||
migrationStore migrationStore.Store | |||
|
|||
encryptionService secrets.Service | |||
silenceFile func(filename string) (io.WriteCloser, error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like we're leaking details we don't need to up to the service level here. WDYT about abstracting this further to e.g. a "silence handler" if you want to keep using a string builder in tests, or just reading the file directly (and optionally calling those tests integration tests if I/O is a concern).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure it's leaking details per se, but I agree it's a bit clunky for the sake of the test. I like your silenceHandler
idea though, I went with that.
var silences []*pb.MeshSilence | ||
if om.rulesWithErrorSilenceLabels > 0 { | ||
om.log.Info("Creating silence for rules with ExecutionErrorState = keep_state", "rules", om.rulesWithErrorSilenceLabels) | ||
silences = append(silences, errorSilence()) | ||
} | ||
if om.rulesWithNoDataSilenceLabels > 0 { | ||
om.log.Info("Creating silence for rules with NoDataState = keep_state", "rules", om.rulesWithNoDataSilenceLabels) | ||
silences = append(silences, noDataSilence()) | ||
} | ||
if len(silences) > 0 { | ||
om.log.Debug("Writing silences file", "silences", len(silences)) | ||
if err := ms.writeSilencesFile(om.orgID, silences); err != nil { | ||
return fmt.Errorf("write silence file for org %d: %w", o.ID, err) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's enough here that it probably shouldn't be under migrateAllOrgs
anymore, especially since the logic is actually being tested in silences_test.go
. WDYT about moving this block to its own function/method in silences.go
and testing that directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…es (#78505) * Alerting: During legacy migration reduce the number of created silences During legacy migration every migrated rule was given a label rule_uid=<uid>. This was used to silence DatasourceError/DatasourceNoData alerts for migrated rules that had either ExecutionErrorState/NoDataState set to keep_state, respectively. This could potentially create a large amount of silences and a high cardinality label. Both of these scenarios have poor outcomes for CPU load and latency in unified alerting. Instead, this change creates one label per ExecutionErrorState/NoDataState when they are set to keep_state as well as two silence rules, if rules with said labels were created during migration. These silence rules are: - __legacy_silence_error_keep_state__ = true - __legacy_silence_nodata_keep_state__ = true This will drastically reduce the number of created silence rules in most cases as well as not create the potentially high cardinality label `rule_uid`.
During legacy migration every migrated rule was given a label
rule_uid=<uid>
. This was used to silenceDatasourceError
/DatasourceNoData
alerts for migrated rules that had eitherExecutionErrorState
/NoDataState
set tokeep_state
, respectively.This could potentially create a large amount of silences and a high cardinality label. Both of these scenarios have poor outcomes for CPU load and latency in unified alerting.
Instead, this change creates one label per
ExecutionErrorState
/NoDataState
when they are set tokeep_state
as well as two silence rules, if rules with said labels were created during migration. These silence rules are:__legacy_silence_error_keep_state__ = true
__legacy_silence_nodata_keep_state__ = true
This will drastically reduce the number of created silence rules in most cases
as well as not create the potentially high cardinality label
rule_uid
.Who is this feature for?
Users migrating from legacy alerting.
Special notes for your reviewer:
This is a forward port of #77642, modified to fit with the new migration structure and better tested. Note the change in label names to fit with the new prefix from #76527.
Depends on #76527, so should wait for that to be merged.
Please check that: