extra_labels are not usable in alert content #1218
Comments
This is by design; alerts should generally be oblivious to the configuration of the Prometheus server, as how your Prometheus servers are organised isn't the concern of an alert. I'm currently working on notification templates on the alertmanager, which will let you do what you want.
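(For context on what such notification templates would look like: they live in the Alertmanager rather than in the alert rule, and can read the labels that arrive with the alerts, including external labels such as `cluster` when the label is common to the alerts in a notification group. A minimal sketch with a made-up template name; the data fields shown follow the later, released Alertmanager and may have differed in the work-in-progress version being discussed here.)

```
{{/* Sketch: render the notification's common summary annotation together with
     the cluster label that the sending Prometheus attached as an external label. */}}
{{ define "custom.text" }}
{{ .CommonAnnotations.summary }} (cluster: {{ .CommonLabels.cluster }})
{{ end }}
```

A receiver would then reference the template from its notification configuration, for example a Slack receiver whose `text` field is set to `{{ template "custom.text" . }}`.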
I understand promoting good workflows through the Best Practices docs, but I do think that systems should be built to be flexible in usage if it doesn't require much work. What if someone didn't want to use the AlertManager for this and had their own system? They'd need to write their own templating system, yet again. Not to mention the fact that moving to …
I see little harm in supporting this. The intuitive understanding is that alerts are not oblivious to configuration by definition, because time series are not. The alternative is that people add the label through relabeling to hack the old behavior back in. Let's have a clean solution instead.
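(For reference, the relabeling workaround mentioned here amounts to re-attaching the label at scrape time, so that it is stored on every series and visible to `{{$labels.cluster}}` again. A minimal sketch; the job name and label value are made up, and field names follow current Prometheus documentation, so older versions may differ slightly.)

```yaml
# prometheus.yml fragment (sketch): re-attach a cluster label to every scraped
# series so alert templates can reference {{$labels.cluster}} as under 0.15.
scrape_configs:
  - job_name: someservice          # hypothetical job name
    relabel_configs:
      - target_label: cluster      # action defaults to "replace"
        replacement: eu-west-1     # hypothetical cluster name
```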
We already discussed this in #1128. There's harm in supporting this, as it encourages users to do alert notifications in the wrong place and in a less-maintainable fashion. The clean solution here is notification templates, not relabelling.
Looking forward to seeing those so we can cut a beta version. But for the record, signing something off as 'discussed' just because you stated your opinion is not valid.
We actually were holding off upgrading 0.15 -> 0.16 due to this, as we had to fix our queries and add some relabel hacks. We eventually needed to move on to the … Nonetheless, we didn't know about the alerting content interpolation change, as the release notes only mentioned the Alertmanager, so we thought we'd be safe, and there's no way of having tests for this.

It's only after we started getting "High RPC failure rate on SomeService in " pages that we realised our alerting setup was completely hosed, and we found ourselves checking each cluster one by one. Not really a nice SRE experience.

As a (happy) user of Prometheus, it's really sad to see that our setup was completely broken by a decision motivated not by a feature/optimisation, but by enforcing "best practices" in favour of a "cleaner solution". Especially since the "cleaner solution" was not ready at the time of the breakage and the impact of the change wasn't really clear.
I'm sorry that ye were broken; as a pre-1.0 project we have to break things sometimes in order to move forward and provide you with a better result overall. Sometimes that includes things that are the only way to do things right now, for which we're a little behind on the full solution.
I'm hoping to finish off a first usable cut this weekend.
@mwitkow The missing clear documentation of the breaking change was a mistake on our end. Sorry about that. What are your thoughts on solving this? For the time being, hacking the old behavior back in via relabeling obviously works, but might not be the nicest solution.

Suppose you are defining your alerts in a global Prometheus on aggregations of per-cluster metrics. Those will have the external labels attached, and consequently they will be available to those alert templates. That doesn't seem to be too much of a difference to me. Thoughts, @brian-brazil?

Prometheus servers for different purposes might be dealing with different external labels. Making separate notification templates for each in the Alertmanager requires additional complexity in the AM and will impact user experience, as alerts are no longer configurable in one place.
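(For context, the external labels being discussed are set in the global section of the server configuration and are attached to alerts and federated samples as they leave the server, rather than to locally stored series. A minimal sketch of the 0.16-style configuration; the label name and value are illustrative.)

```yaml
# prometheus.yml (sketch): external labels identify this Prometheus server and
# are added to outgoing alerts and federated samples, not to local storage.
global:
  external_labels:
    cluster: eu-west-1   # hypothetical value identifying this server's cluster
```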
That's problematic for other reasons: if the per-cluster label is hanging around on alerts at a global level, then the alert should probably be in the cluster Prometheus rather than the global one. There are some cases, like outlier clusters, that can only be calculated at the global level, but in that case the usual alert notification templates will handle it fine, as the same templates work for cluster as for global.
I don't think that a given team is likely to have more than one set of external labels, as they're likely going to have one taxonomy of Prometheus servers for their purposes. Even if they did, it doesn't affect the argument that notification templates don't belong individually in each alert.
@fabxc, I don't really have an opinion on how it should be solved long term, but I don't really see why we can't have some form of expansion in alert content. I understand not wanting to mix external metadata with internal stuff... so maybe …

As for the argument that as a "pre-1.0 project we have to break things sometimes in order to move forward", I wholeheartedly agree. The Prometheus team should have the ability to break things if they prohibit progress or optimisations. But at the same time, Prometheus has a large user base and substantial adoption with varying use cases. As such, while planning a breaking feature, please weigh the benefits against user pain.
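(One hypothetical shape for such an expansion, purely as illustration and not something that existed at the time of this discussion, would be a dedicated template variable exposing external labels separately from the series' own labels.)

```
# Hypothetical: a separate variable keeps external metadata distinct from
# the series' own labels when expanding alert summaries/descriptions.
SUMMARY "Service {{$labels.job}} failing in {{$externalLabels.cluster}} is down"
```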
The PR for the relevant feature in the new alertmanager is prometheus/alertmanager#147 if you want to try it out.
The new alertmanager has the feature that's required. |
brian-brazil closed this on Dec 16, 2015
lock bot commented on Mar 24, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
mwitkow commented on Nov 13, 2015
With prometheus 0.15 we were using the `labels` to specify which `cluster` the given Prometheus instance is monitoring. The content and title of an alert contained this information to provide useful human-readable output, such as "Service %job failing in %cluster is down".

Currently the `extra_labels` is only added as metadata around an alert, and not used in the evaluation of the body of the alert.
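(To make the breakage concrete: a rule along the following lines, written against 0.15, relied on the globally configured cluster label being present on stored series so that the summary could interpolate it. The sketch shows the pre-0.16 rule syntax only approximately, with made-up metric, job, and label values; under 0.16 the `{{$labels.cluster}}` expansion comes out empty because external labels are no longer part of the stored series.)

```
# Sketch (approximate pre-0.16 rule syntax, hypothetical metric and values).
# Under 0.15 the cluster label came from the global `labels` section and was
# attached to every stored series, so {{$labels.cluster}} expanded here.
ALERT ServiceDown
  IF up{job="someservice"} == 0
  FOR 5m
  WITH { severity = "page" }
  SUMMARY "Service {{$labels.job}} failing in {{$labels.cluster}} is down"
  DESCRIPTION "{{$labels.job}} in cluster {{$labels.cluster}} has been down for 5 minutes."
```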