
extra_labels are not usable in alert content #1218

Closed
mwitkow opened this Issue Nov 13, 2015 · 13 comments

mwitkow commented Nov 13, 2015

With Prometheus 0.15 we were using the labels to specify which cluster the given Prometheus instance is monitoring. The content and title of an alert contained this information to provide useful human-readable output such as "Service %job failing in %cluster is down".

Currently, extra_labels are only added as metadata around an alert and are not used when evaluating the body of the alert.
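
For context, this is roughly the kind of rule involved, in the 0.15-era alerting syntax (the rule name, expression, and label names here are hypothetical):

```
ALERT ServiceDown
  IF up{job="my-service"} == 0
  FOR 5m
  WITH { severity = "page" }
  SUMMARY "Service {{$labels.job}} in {{$labels.cluster}} is down"
  DESCRIPTION "{{$labels.job}} in {{$labels.cluster}} has been down for more than 5 minutes."
```

Under 0.15 the globally configured labels were stored on every series, so {{$labels.cluster}} expanded in the summary; once those labels became external-only metadata, the same template silently renders an empty string.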

brian-brazil commented Nov 13, 2015

This is by design: alerts should generally be oblivious to the configuration of the Prometheus server, as how your Prometheus servers are organised isn't the concern of an alert.

I'm currently working on notification templates in the Alertmanager, which will let you do what you want.

mwitkow commented Nov 13, 2015

I understand promoting good workflows through a Best Practices doc, but I do think that systems should be built to be flexible in usage, if it doesn't require much work.

What if someone didn't want to use the Alertmanager for this and had their own system? They'd need to write their own templating system, yet again.
And why allow an alert to have templatable content for some parameters of the issue but not for others?

Not to mention that the move to extra_labels, which I really do think is a great way to keep time series from being cluttered, silently breaks existing ways in which users use Prometheus.

fabxc commented Nov 13, 2015

I see little harm in supporting this. The intuitive understanding of the $labels parameter for anyone not concerned with low-level details would be that it equals the label set pushed by Prometheus.

Alerts are not oblivious to configuration by definition because time series are not.

The alternative is that people add the label through relabeling to hack the old behavior back in. Let's have a clean solution instead.
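
A sketch of that relabeling workaround, as a config fragment with a hypothetical job and cluster name; a static replacement re-attaches the label to every target, so it is stored with the series and visible to alert templates again:

```yaml
scrape_configs:
  - job_name: 'my-service'        # hypothetical job
    relabel_configs:
      # With no source_labels, the default regex matches, so the
      # literal replacement is written into the 'cluster' target
      # label for every target of this job.
      - target_label: cluster
        replacement: eu-west-1    # hypothetical cluster name
```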

brian-brazil commented Nov 13, 2015

We already discussed this in #1128.

There's harm in supporting this, as it encourages users to do alert notifications in the wrong place and in a less-maintainable fashion.

The clean solution here is notification templates, not relabelling.
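
For comparison, a sketch of what such a notification template might look like in the new Alertmanager's Go templating (field names as in later Alertmanager releases; the template name and label names are hypothetical):

```
{{ define "custom.summary" }}
Service {{ .CommonLabels.job }} in {{ .CommonLabels.cluster }} is down
{{ end }}
```

Because external labels are attached to alerts as they are sent to the Alertmanager, the cluster label is available here even though it is never stored with the series.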

fabxc commented Nov 13, 2015

Looking forward to seeing those so we can cut a beta version.

But for the record, signing something off as 'discussed' just because you voiced your opinion is not valid.

mwitkow commented Nov 13, 2015

We were actually holding off on upgrading 0.15->0.16 because of this, as we had to fix our queries and add some relabel hacks. We eventually needed to upgrade anyway for the hashmod change and the query interpolation fix.

Nonetheless, we didn't know about the change to alert content interpolation, as the release notes only mentioned the Alertmanager, so we thought we'd be safe, and there's no way of having tests for this. It's only after we started getting "High RPC failure rate on SomeService in " pages that we realised our alerting setup was completely hosed, and we found ourselves checking each cluster one by one. Not really a nice SRE experience.

As a (happy) user of Prometheus, it's really sad to see that our setup was completely broken by a decision motivated not by a feature or optimisation, but by enforcing "best practices" in favour of a "cleaner solution". Especially since the "cleaner solution" was not ready at the time of the breakage, and the impact of the change wasn't really clear.

brian-brazil commented Nov 14, 2015

I'm sorry that you were broken; as a pre-1.0 project we have to break things sometimes in order to move forward and provide you with a better result overall. Sometimes that includes breaking things that are the only way to do something right now, where we're a little behind on the full solution.

> Looking forward to seeing those so we can cut a beta version.

I'm hoping to finish off a first usable cut this weekend.

fabxc commented Nov 16, 2015

@mwitkow The lack of clear documentation for the breaking change was a mistake on our end. Sorry about that.

What are your thoughts on solving this? For the time being, hacking the old behavior back in via relabeling obviously works, but it might not be the nicest solution.

Suppose you are defining your alerts in a global Prometheus on aggregations of per-cluster metrics. Those will have the external labels attached. Consequently, they will be available to those alert templates. That doesn't seem to be too much of a difference to me. Thoughts @brian-brazil?
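
For reference, the label set in question lives in the global config section, roughly like this (called extra_labels in this issue's title; current Prometheus releases name the option external_labels; the cluster value is hypothetical):

```yaml
global:
  external_labels:
    # Attached when alerts are sent and when series are federated,
    # but not stored with the local series themselves.
    cluster: eu-west-1   # hypothetical cluster name
```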

Prometheus servers for different purposes might be dealing with different external labels. Making separate notification templates for each of them in the Alertmanager requires additional complexity in the AM and will impact user experience, as alerts are no longer configurable in one place.

brian-brazil commented Nov 16, 2015

> Consequently, they will be available to those alert templates. That doesn't seem to be too much of a difference to me. Thoughts @brian-brazil?

That's problematic for other reasons: if the per-cluster label is hanging around on alerts at the global level, then the alert should probably be in the cluster Prometheus rather than the global one. There are some cases, like outlier clusters, that can only be calculated at the global level, but in those cases the usual alert notification templates will handle it fine, as the same templates work for cluster as for global.

> Prometheus servers for different purposes might be dealing with different external labels. Making separate notification templates for each of them in the Alertmanager requires additional complexity in the AM and will impact user experience, as alerts are no longer configurable in one place.

I don't think a given team is likely to have more than one set of external labels, as they'll generally have one taxonomy of Prometheus servers for their purposes.

Even if they did, it doesn't affect the argument that notification templates don't belong individually in each alert.

mwitkow commented Nov 17, 2015

@fabxc, I don't really have an opinion on how it should be solved long term, but I don't really see why we can't have some form of expansion in alert content. I understand not wanting to mix external metadata with internal stuff... so maybe static_labels would have been useful as a stop-gap? We don't need it ourselves, as we've already spent half a day adding relabel values to all of our configs.
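
Purely as an illustration of that stop-gap idea, such a hypothetical static_labels option (it does not exist in Prometheus) might look like this, attaching labels to stored series the way the pre-0.16 global labels did:

```yaml
global:
  static_labels:         # hypothetical option, never implemented
    # Unlike external labels, these would be stored with every
    # local series and so be usable in alert content.
    cluster: eu-west-1
```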

As for the argument that as a "pre-1.0 project we have to break things sometimes in order to move forward", I wholeheartedly agree. The Prometheus team should have the ability to break things when they prohibit progress or optimisations. But at the same time, Prometheus has a large user base and substantial adoption with varying use cases. As such, when planning a breaking change, please weigh the benefits against user pain.

brian-brazil commented Nov 17, 2015

The PR for the relevant feature in the new alertmanager is prometheus/alertmanager#147 if you want to try it out.

brian-brazil commented Dec 16, 2015

The new alertmanager has the feature that's required.
