Join GitHub today
GitHub is home to over 31 million developers working together to host and review code, manage projects, and build software together.
Sign upCreate a section ANNOTATIONS with user-defined payload and generalize RUNBOOK, DESCRIPTION, SUMMARY into fields therein. #1002
Comments
This comment has been minimized.
This comment has been minimized.
|
Counterpoint from someone who doesn't feel too strongly either way: Having a special field for the runbook entry gives it clear semantic meaning and integrates well with notification mechanisms like PagerDuty, which do have a native runbook field and nicely display it when looking at an alert. It seems this is mostly a debate between machine-readability (explicit semantics) of data vs. having to add code for each of these semantic special cases. Then there's an argument for keeping it just because it exists already and removing it would be a breaking change. I know @beorn7 feels strongest about this, so I think it's best if he discusses this directly with @brian-brazil before the rest of us will try to mediate :) |
This comment has been minimized.
This comment has been minimized.
Are you sure about this? I can't find it in the PagerDuty docs (https://developer.pagerduty.com/documentation/integration/events/trigger). Adding new keywords in Prometheus for everything each notification method offers is not scalable or maintainable.
If that's the case, we should set a higher bar on making changes - particularly ones added in a hurry. We're still not at 1.0, so shouldn't be scared of the occasional breakage. |
This comment has been minimized.
This comment has been minimized.
|
On Tue, Aug 18, 2015 at 4:14 PM, Brian Brazil notifications@github.com
|
This comment has been minimized.
This comment has been minimized.
|
My strong feelings are about the relative importance of the runbook link compared to other fields. It's for example much more important than the summary. (I'm talking about large-scale organizations where you will be on-call for more than the few systems you work with day-to-day. Small-scare organizations arguably don't need Prometheus in the first place.) I'm very much in favor of solving all of this with non-grouping labels. The runbook link was not introduced in a hurry but because we had already solved way less important alert components via dedicated fields. |
beorn7
changed the title
Remove RUNBOOK from alerts
Allow specification of custom non-grouping alert labels. Then remove RUNBOOK (and possibly SUMMARY and DESCRIPTION) as an explicit field and generalize as label.
Aug 26, 2015
This comment has been minimized.
This comment has been minimized.
|
Renamed the issue to reflect the discussion we had back in June. |
This comment has been minimized.
This comment has been minimized.
|
I think this is a little trickier, as there will be labels coming from the alert that you don't want to group on (e.g. in a HA setup where each half has a different label to distinguish them) which is different from providing information about the alert itself as distinct from a particular instance of it. |
This comment has been minimized.
This comment has been minimized.
|
Yeah, we have to solve that problem. But as long as we have no generic way to pass labels for things like runbook, summary, description or whatever else will show up in whatever notification mechanism is used, we cannot remove RUNBOOK. It's in active use at SC, and it is of crucial importance to enable receivers of an alert to act on it. |
This comment has been minimized.
This comment has been minimized.
We have always had one of those in one labels.
We shouldn't be tying prometheus to the needs of a single company, and the question isn't whether we should support runbook but whether it deserves first-class support and all the maintenance and development burden it brings. The current behaviour of runbook does not require a non-grouping label, as it's always a constant so it can be replaced by a grouping label today. |
This comment has been minimized.
This comment has been minimized.
This is not my experience, I find looking at a playbook to be an act of desperation and worse than useless in general. This is why we need to be wary about adding features like this, not all companies and people have the same workflow and we need to keep Prometheus generic. |
This comment has been minimized.
This comment has been minimized.
|
We have to conclude that our experiences differ. I emphatically disagree that the RUNBOOK entry is based on the needs of a single company. The runbook link is not necessarily a constant. A setup where the link depends on the alert name is easily imaginable ;). But we have to keep things even more flexible that that. Runbook link, summary, and description are all very similar as they need to be created from some kind of templating. More fields of that kind are possible depending on the use case and the used notification system. That's why we need a generic solution here, giving each user the flexibility to cater for their specific use case. |
This comment has been minimized.
This comment has been minimized.
|
This. Having any of them hard coded into the statement was anti-Promethean in the first place.
Where the difference between WITH and PAYLOAD are attachment to time series and that PAYLOAD values should be templated. |
This comment has been minimized.
This comment has been minimized.
Indeed and we will support that someday, though that's still constant relative to the alertname.
It's not just those - all alert labels need to be able to be created from templating.
I'd propose templated labels as the solution, and then similarly templated notifications on the alertmanger side. None of this requires handling runbook specially. |
This comment has been minimized.
This comment has been minimized.
It's a bit tricker, the most common case will be that the non-grouping labels are global labels so won't be part of the alert definition. This indicates towards a global setting of some form. |
This comment has been minimized.
This comment has been minimized.
|
On Wed, Aug 26, 2015 at 1:13 PM, Brian Brazil notifications@github.com wrote:
And neither description or summary. That's my whole point. I'm just I tried to express that in the title of this issue. If somebody can Björn Rabenstein, Engineer SoundCloud Ltd. | Rheinsberger Str. 76/77, 10115 Berlin, Germany |
This comment has been minimized.
This comment has been minimized.
|
Are we really thinking about even removing SUMMARY or DESCRIPTION as language-level constructs? Basically every notification mechanism depends on them, and thus we benefit a lot from statically checking that they are specified. |
This comment has been minimized.
This comment has been minimized.
I don't see how this helps your use case as non-grouping is orthogonal to runbook as it never affects grouping. What exactly are you trying to get into your notifications with this? I'm not so sure about description and summary as they are going to be the main uses of templates, and unlike almost all other potential templated labels can vary significantly from eval to eval (e.g. a top 10 worst servers list or a string with the current value) - so you don't want to group on those or put them in timeseries. |
This comment has been minimized.
This comment has been minimized.
|
Perhaps the "non-grouping" is distracting... I want a very simple and obvious way to allow users to create labels via templates that can then be used to assemble the notification. I'm opposed to treating "description" and/or "summary" special. Everything that applies to them also applies to "runbook" (and possibly other fields). I don't support the assumption that every notification mechanism depends on "description" and "summary" and those therefore need to be checked statically by Prometheus in contrast to other fields. "runbook" or any other custom label potentially need exactly the same flexibility and variability as "description" and/or "summary". We should not put any assumption about const'ness of "runbook" into our design. |
This comment has been minimized.
This comment has been minimized.
We're in agreement on that. Right now all we have is labels and hardcoded templates, but ultimately we should have go templates for all notification fields in the alertmanager.
Technically they're the same, in practice they're special. As indicated above they tend to vary from eval cycle to eval cycle, so can't be used safely as either grouping or non-grouping labels. The other use cases that initially come to mind that are similarly special (e.g. parametrised urls based on alert value) can be handled via summary/description so I'm not sure we need to genericise that - let's wait for a use case. |
This comment has been minimized.
This comment has been minimized.
|
Based on this discussion, I've updated the summary. I also realised there's another way to implement this without code changes: Put it in the description. Reviewing this discussion, the only argument I see as to why this feature needs to stay as-is is that SoundCloud is using it. This has not prevented us from removing features in the past. |
This comment has been minimized.
This comment has been minimized.
|
Obviously you can put into the description what you want. It could be a runbook link. You could also put the description in the runbook field. However, my whole point is that a runbook link is so crucial that it most be treated as a 1st class citizen, for one so that you can prominently place the link in your resulting notification but also to signal to the naive user that they really have to fill in that field for meaningful alerts. Now I completely realize that you disagree. And others. Possibly all of you. But I have to restate (over and over again) that we are not keeping the feature because Soundcloud is using it. We are keeping it because I deem it so important. (And Soundcloud uses it because I happen to be an a TL role there.) You could all decide to overrule me. Unsolvable disagreements happen, and at some point we have to go one way or the other. (And this particular feature will surely not doom the project one way or the other.) However, it's moot to discuss over and over again if it matters if Soundcloud uses the feature because it's not about that. It's about the opinion of an important project member. I restate that my suggestion is to acknowledge that use cases differ and generalize "annotations" (or whatever to call them). Let's implement that first, instead of kicking out an existing feature that is important (at least for me). |
This comment has been minimized.
This comment has been minimized.
|
I recognise that there's a use case. My contention is that it's not important enough to justify increasing complexity for all users of alerting by forcing them to deal with this as a special case, compared to the benefit it brings.
What are you thinking here? |
This comment has been minimized.
This comment has been minimized.
|
And I recognize that you don't think it's important enough. We have to agree that we disagree here. The generalized labels would probably only need the ability to be expanded via the usual templating. If it's the right thing to enable template exansion in general within
Ideally, we want some escaping magic or features to make it easy to render valid URLs. |
This comment has been minimized.
This comment has been minimized.
I'm for adding that for all labels (there's advanced use cases for it), so I can put that PR together. We can't make summary/description labels though (and need to be a tad careful with runbook, if you're hitting that case you probably should put it in the description). |
This comment has been minimized.
This comment has been minimized.
|
Cross-inspirations from #1043 : labels that go via template expansion change a lot, so are not really suitable as proper labels on the
|
This comment has been minimized.
This comment has been minimized.
Depends how you do it, where it usually comes up they're constant relative to a given alert (mostly for advanced alert routing) - so I'm not worried there. Where the real problem is if someone does We've got four options there:
My instinct is that generalising will cause more problems than it will solve though making alert notifications too generic, and that we should leave summary/description as the only special cases. They do map pretty well to the notification methods we have thus far. |
This comment has been minimized.
This comment has been minimized.
Just for my understanding: how would a templated non-grouping label break |
This comment has been minimized.
This comment has been minimized.
It it had labels that weren't constant relative to the alert (e.g. contained In practice, I think we'd recommend linking to a console (passing required labels) that'd display whatever time-series you want and avoid this problem. Accordingly I think side-stepping this problem is practical, and likely the best approach overall. |
This comment has been minimized.
This comment has been minimized.
|
But I thought the whole purpose of non-grouping labels is that they
|
This comment has been minimized.
This comment has been minimized.
You want non-grouping labels attached to alerts and timeseries, as it's how you'll distinguish time-series from different HA prometheis in your long-term storage. These would be typically configured at the prometheus-level (usually as global labels once we fix those) rather than the alert label. What Beorn is proposing results in a 3rd class of key/value pair that only applies to alerts. |
This comment has been minimized.
This comment has been minimized.
With Matthias's proposal they would see have to see those attributes in all examples, so that does change something in adoption. If we ignore that bit, it causes a problem when they to to the next stage beyond simple alerts where they want to tweak the notification a little bit. Now rather than having a field that likely covers their need, they've instead got theory to learn and choices to make. More concretely rather than a user thinking "I'll add it to description" they now think more along the lines of "should I put this in description, or a label or in payload and what do I name them and how do I lay out my notification templates". Similarly if they want to do something simple with routing (which is a common initial use case) they have to learn that only labels work for that, and we'll start to get questions and confusion around why payload doesn't as syntactically they're identical in the alert expression. Most beginner alerts wouldn't even have a "WITH" clause, so users would think that "PAYLOAD" was the place to put labels for routing as it looks like a place to put labels. Trying to prevent that requires more docs, which means more reading and confusion, which means less adoption.
Yes, the question is at what stage we require them to consult the docs and how complex those docs are. With Matthias's proposal they have to read them much sooner and they're more complex. I'd like to push out the need to read the full docs as far as possible and try to limit how may concepts docs have to explain, particularly where they're only really relevant to the more advanced use cases. If you look at target relabelling, we're already seeing these type of problems due to the system being too generic and difficult to learn for a new user. While I still believe that is the least-worst decision in that case, let's not introduce another one of those into the system without being sure it'll overall lead to an increased number of happy users. What's being proposed here is a feature that's mainly of use to advanced users, we shouldn't be forcing beginner and intermediate users to have to deal with complexity that doesn't apply to their use case when we've a reasonable way to avoid that. |
This comment has been minimized.
This comment has been minimized.
barkerd427
commented
Sep 26, 2015
|
I'll admit that I haven't read this whole thread, but I've read a large portion of it. Let me offer my perspective as a naive, basic user. When I started looking for an alerting system for work, I went through several options before settling on Prometheus. Nagios has a lot of complexity. It was hard just to get installed. Then I went to Icinga2. That was easier to setup and install, but still not great with my companies restrictive access. It then took another day to figure out that I'd still have to open a bunch of ports to get metrics out. Prometheus was dead simple to install. It requires very little effort to configure. When I started to add alerts, the docs were very easy to follow. There are still gaps in the docs, but experience easily filled in those. However, there is something lacking. I think that is what we're covering here. As most enterprise user, I live in a world that has both old and ancient technology. Some of it written in-house. This gives me some more complex cases with respect to the information these systems need. Our alerting/paging system requires a long complex string to be sent as an SNMP trap or through a tcp connection. I'm currently passing this through the description, but that makes my emails look cryptic. Perhaps I'm just not architecting it correctly. This string is populated by Prometheus, but I could feasibly do this in my webhook receiver. Either way, I feel like another option for additional payload is missing. When I first saw the runbook attribute, I had a good idea what it was, but it wasn't clear how to use it. Payload makes more sense to me assuming it's optional and the internal attributes are flexible. It also needs to support templating. |
This comment has been minimized.
This comment has been minimized.
|
Thanks for the insight, this sounds like a concrete use case.
That's the way to do it currently, and as long as you're talking non-binary strings I think this is something we should probably cover within Prometheus. Once we have notification templates there'll be ways to clean up the emails (regexes), but that's obviously a bit messy. What sort of information is in this string? I want to verify that this isn't something that will be covered by notification templates (roughly speaking, if you're not using
I think everyone is onboard with that. Presuming this works out as a use case, the debate then is around a) how we expose this to users configuration-wise and b) keeping summary/description first class. |
This comment has been minimized.
This comment has been minimized.
barkerd427
commented
Sep 27, 2015
|
I haven't been able to get back to my work laptop to pull the actual
|
This comment has been minimized.
This comment has been minimized.
barkerd427
commented
Sep 28, 2015
|
@brian-brazil, here's the format for my alerts:
And this is the pertinent part of my alert rule:
I would prefer a method of passing this string as extra data as this same description is sent to the alerting system and through email and Slack which are not part of the alerting system. Alternatively, I'd be very happy with a way of passing other metadata information that can be added to the body of an email or as extra comments on Slack. I may also need to send this string to handle any situations where I need to use SNMP instead of TCP:
I'm not sure yet how I'll input the numbers, but again this doesn't look nice to a human and I'd rather pass more meaningful data. Adding more lines to the description section will just make it hard to parse out the piece I need for our alerting system. |
This comment has been minimized.
This comment has been minimized.
|
The inevitability of user-defined payload looks more convincing than ever... The decision is irrelevant for the super-naive user who doesn't use description or summary at all. For the description and/or summary using user, it's the difference between
and
with all the gotchas about detecting typos on different levels. (Note that PAYLOAD is probably not a good word, we need something better, or implement other ideas, but I don't want to add those to this already quite convoluted discussion.) The concern I have is that for those who actually need payload, it's between
and
The former syntax suggests that summary/description are special, but they are not special for those who need their own payload anyway. We essentially need to weigh the degree of confusion for new/naive users by the first comparison against the degree of confusion for ambitious/advanced users by the second comparison. I personally see the former as negligible but the latter as quite confusing and warty. |
This comment has been minimized.
This comment has been minimized.
|
Hmm, most of what you're doing there belongs in the alertmanager rather than Prometheus I'd say as you're mainly formatting a custom notification string rather than something specific to that alert. The only wrinkles there are app_eventcode, which would be plausible as a label, and app_eventmsg which sounds like the summary. I think that notification templates will cover your use case. The SNMP one is more interesting, my suspicion is that that's so custom that it'd have to go via the webhook and the rest per the TCP string. |
This comment has been minimized.
This comment has been minimized.
|
There's the other option of
and configuration the the global prometheus level that foo and bar are special (which we'll have for non-grouping labels too, as those can't be configured per-alert).
I'm not sure about that, I'd still expect alerts using this to also go out via notifiers using summary/description in the usual fashion. Payloads are about additional information. |
This comment has been minimized.
This comment has been minimized.
|
After 85 comments, many considerations and opinions, I now propose the following solution. I encourage everyone to simply state
Example:
The new Alertmanager will have configurable handling of annotation fields with reasonable defaults, e.g. |
This comment has been minimized.
This comment has been minimized.
|
|
This comment has been minimized.
This comment has been minimized.
I presume you mean NOT part of the identity of the alert here, otherwise it'd be the same as labels. For the sake of closing this off This implies several breaking changes, which is okay as we're talking the alertmanager. If we're going to do this it's we'll also have to redesign the json that goes between prometheus and the alertmanager (basically switch what's currently outside and inside the "payload" field) and hopefully avoid having to break that interface again. |
This comment has been minimized.
This comment has been minimized.
|
You are correct, fixed. |
This comment has been minimized.
This comment has been minimized.
|
I was thinking for quite some time about the mental barrier to understand the difference between labels and annotations, and I had some ideas how to hide it to the newcomer, but in the end, I came to the conclusion that clearly separating it is ultimately the best solution. A well worded explanation seems possible ("shows up in your alert timeseries" and "is part of the identity of the alert" are after all quite obvious concepts). And yes, the JSON format has to be updated, too, but that was obvious for a while... |
This comment has been minimized.
This comment has been minimized.
|
Yep, this change should go hand in hand with the Alertmanager rewrite, which needs a breaking of the transfer format anyways. So with the Prometheus version that introduces this, people will have to upgrade to the new Alertmanager in lockstep. |
This comment has been minimized.
This comment has been minimized.
|
And: \o/ for reaching a decision! |
beorn7
changed the title
Remove RUNBOOK from alerts
Create a section ANNOTATIONS with user-defined payload and generalize RUNBOOK, DESCRIPTION, SUMMARY into fields therein.
Sep 29, 2015
This comment has been minimized.
This comment has been minimized.
Actually I think we do this one sooner, what I'd be tying this to more is the notifications templates. |
brian-brazil
referenced this issue
Sep 29, 2015
Closed
SUMMARY and DESCRIPTION should be optional for alerts #1043
This comment has been minimized.
This comment has been minimized.
|
If we do that, we will break Prometheus<->AM twice instead of once. Not so happy about that, I mean, it's not that urgent or is it? |
This comment has been minimized.
This comment has been minimized.
|
I'm not sure why we'd need to break it twice, I'm not forseeing any fundamental structural changes to the interaction that'd require breaking changes, it's more a cleanup of the existing API. It'd also be good not to block notification templates on the rewrite. |
This comment has been minimized.
This comment has been minimized.
|
From #1128 we see that users are already having problems with the lack of notification templates and working around them which is causing other issues, so I'd like to get notification templates in at least in inside the next month. |
This comment has been minimized.
This comment has been minimized.
|
With some effort we can have a new AM by then that has the new features we want and the persistence of the current one. |
This comment has been minimized.
This comment has been minimized.
|
I see the notification stuff as orthogonal to the new AM, the only wrinkle is the API changes this issue requires. |
This comment has been minimized.
This comment has been minimized.
|
AIUI the new AM will require other API changes besides those fields - @fabxc will know more. If that's true, it would create a second breaking change to the transfer format. |
This comment has been minimized.
This comment has been minimized.
|
A small compatibility layer in the AM is probably best then. No point in a breaking change in notification templates too. |
This comment has been minimized.
This comment has been minimized.
|
Resolved by 7c90db2 |
brian-brazil
closed this
Dec 16, 2015
This comment has been minimized.
This comment has been minimized.
lock
bot
commented
Mar 24, 2019
|
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
brian-brazil commentedAug 18, 2015
RUNBOOK was added in a hurry in #843 for an internal demo of one of our users, which didn't give it enough time to be fully discussed. The demo has been done, so we can reconsider this.
I think we should revert this change, and remove RUNBOOK:
Runbooks are not a fundamental aspect of an alert, are not in use by all of our users and thus I don't believe they meet the bar for first-class support within prometheus. This is especially true considering that they don't add anything that isn't already possible with labels.