Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alerting: Stop persisting user-defined templates to disk #83456

Merged
merged 9 commits into from
Mar 4, 2024

Conversation

JacobsonMT
Copy link
Member

What is this feature?

Updates Grafana Alertmanager to work with new interface from grafana/alerting#161. This change stops passing user-defined templates to the Grafana Alertmanager by persisting them to disk and instead passes them by string.

Why do we need this feature?

This PR is a general improvement and simplification, but there are also some specific reasons why we might want to do this:

  • Incremental step towards abstracting data layer for Grafana's Alertmanager interface away from needing disk-based storage. Next step would likely focus on FileStore.
  • Reduces likelihood of consistency complexities on save/apply, especially in the context of future Remote AM work.
  • Since DefaultTemplateString is now treated like other defaults and therefore consistently applied first. User-defined templates can now rely on being able to always override defaults.

Who is this feature for?

Developers mostly, though alert users wanting to consistently override default templates will see improvement here.

Which issue(s) does this PR fix?:

Fixes #76476

Special notes for your reviewer:

  • Accept template contents as strings instead of reading from disk alerting#161 needs to be merged first and the change to go.mod updated with the new ref.
  • I chose not to remove Config.Templates entirely. We should no longer be using it anywhere in local Grafana AM but it's still possible external AMs could use the filenames somehow. We should make a decision one way or the other eventually, but for now keeping it is the safe bet.
  • We used to rely on the disk-persisted templates to decide when a template had changed and update the AM config accordingly. This is no longer possible, so instead we replace the hash of PostableApiAlertingConfig with the hash of PostableUserConfig, which includes the template contents. This change also necessitates we modify GetStatus to expect PostableUserConfig as the config.

@JacobsonMT JacobsonMT requested a review from a team February 27, 2024 00:13
@JacobsonMT JacobsonMT requested a review from a team as a code owner February 27, 2024 00:13
@JacobsonMT JacobsonMT requested review from rwwiv, yuri-tceretian, grobinson-grafana, suntala, idafurjes and yangkb09 and removed request for a team February 27, 2024 00:13
@grafana-delivery-bot grafana-delivery-bot bot added this to the 11.0.x milestone Feb 27, 2024
}

if err := json.Unmarshal(status, config); err != nil {
am.logger.Error("Unable to unmarshall alertmanager config", "Err", err)
}

return *apimodels.NewGettableStatus(config)
return *apimodels.NewGettableStatus(&config.AlertmanagerConfig)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not seem to be related to the PR. Is this the fix that makes it parse the correct format?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is related, but not obviously so. I'll try to make it more obvious in a future PR though. Currently, GrafanaAlertmanager.GetStatus() just returns the raw config you initially provided to it. It was PostableApiAlertingConfig but now it's a PostableUserConfig.

Copy link
Contributor

@yuri-tceretian yuri-tceretian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
I am not 100% sure about the ignoring of "Templates" field, though. In Mimir and GMA it's certainly redundant but that's the AM's contract.

@JacobsonMT
Copy link
Member Author

LGTM. I am not 100% sure about the ignoring of "Templates" field, though. In Mimir and GMA it's certainly redundant but that's the AM's contract.

This should be a no-op from my understanding, since we didn't read paths from the Templates field before either. I agree that we should consider whether we want to allow that in the future. Unless you meant that this was allowed before and I've broken it somehow?

@yuri-tceretian
Copy link
Contributor

AFAIU, the vanilla Alertmanager uses Templates to control what files to read from the templates directory. So the field Template can potentially contain a subset of TemplateFiles. In Mimir and GMA they are always equal because it's hardcoded to be so.
Making it empty or completely ignoring it, we kinda make it work differently, which is fine by me.

Copy link
Contributor

@santihernandezc santihernandezc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@@ -63,11 +63,6 @@ func (t *TemplateService) SetTemplate(ctx context.Context, orgID int64, tmpl def
revision.cfg.TemplateFiles = map[string]string{}
}
revision.cfg.TemplateFiles[tmpl.Name] = tmpl.Template
tmpls := make([]string, 0, len(revision.cfg.TemplateFiles))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIU the config still expects the template file "names" to be listed in the config under Templates, even if the templates themselves are passed in memory, or it won't be able to load them. Is something else expanding this back out? (or, did this change in alertmanager?)

Copy link
Member Author

@JacobsonMT JacobsonMT Mar 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIU the config still expects the template file "names" to be listed in the config under Templates, even if the templates themselves are passed in memory, or it won't be able to load them.

My understanding of the existing code is that we always overwrite the Templates field with the information from TemplateFiles:

PersistTemplates uses only TemplateFiles when parsing the config: https://github.com/grafana/grafana/pull/83456/files#diff-786d520b4c7dd72f03ab0422ad82bc8f1c23c8874b60159b77431588f415a151L17

applyConfig then overwrites the Templates field with the results of PersistTemplates: https://github.com/grafana/grafana/pull/83456/files#diff-4aa6a5bc542bc278b3cddc81fe825af177f1f354c9dbb08d0f01d86b7dbe414eL345

Is something else expanding this back out? (or, did this change in alertmanager?)

This PR and the grafana/alerting sister PR effectively make this temporary saving to disk and transfer to the Templates field unnecessary. I might be misunderstanding this part of your question though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I think I misunderstood you entirely. If you mean upstream prometheus alertmanager needing the Templates file names then I don't think it does, or at least this PR appears to be working as intended. When upstream is ran through main then it relies on the configCoordinator to subscribe to config changes and would likely require the Templates field, but we don't use main and instead piece together an alertmanager part-by-part in grafana/alerting GrafanaAlertmanager.ApplyConfig and manually call Run() on the Dispatcher and Inhibitor.

am.logger.Debug("Neither config nor template have changed, skipping configuration sync.")
// If configuration hasn't changed, we've got nothing to do.
configHash := md5.Sum(rawConfig)
if am.Base.ConfigHash() == configHash {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does changing the level at which the hash operates break remote alertmanagers? (@santihernandezc may know more than me here). Will the remote alertmanager still return the hash of the inner, AM config and not this new outer object (PostableUserConfig), it'll not match up and reload needlessly - is this possible?

Do all implementations return the hash at the right level?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hash is computed in Grafana and sent to the remote Alertmanager, when we compare configurations we do it using that same hash we sent.
If all config hashes are derived from the same struct, then nothing should break.
Plus, the integration tests are passing 👌

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way it's currently implemented, this should not affect the remote alertmanagers as this code is internal to the local alertmanager and does not affect what we store in the database.

However, if this somehow changes then we'd likely want all the hashes to be based on PostableUserConfig anyways since it contains TemplateFiles which is how we'll be passing templates to the remote alertmanager, correct?

I'll let @santihernandezc confirm though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should not affect the remote alertmanagers as this code is internal to the local alertmanager and does not affect what we store in the database.

Agree, we don't use the internal Alertmanager's hash for the remote/forked AM, we use what we receive via the MOA.

if this somehow changes then we'd likely want all the hashes to be based on PostableUserConfig anyways since it contains TemplateFiles which is how we'll be passing templates to the remote alertmanager, correct?

Yep, AFAIU that's the way we currently do it. The hash we send to the remote Alertmanager is the one we get from the database in ApplyConfig. This hash is derived from cmd.AlertmanagerConfiguration (permalink), which is a marshaled PostableUserConfig (permalink).

Copy link
Contributor

@alexweav alexweav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@JacobsonMT JacobsonMT force-pushed the jacobsonmt/template-as-string branch from a9b111e to dc31023 Compare March 4, 2024 17:58
@JacobsonMT JacobsonMT enabled auto-merge (squash) March 4, 2024 18:00
@JacobsonMT JacobsonMT merged commit 2e8c514 into main Mar 4, 2024
12 checks passed
@JacobsonMT JacobsonMT deleted the jacobsonmt/template-as-string branch March 4, 2024 18:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Alerting: Remove disk sync code for templates
5 participants