Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create a section ANNOTATIONS with user-defined payload and generalize RUNBOOK, DESCRIPTION, SUMMARY into fields therein. #1002

Closed
brian-brazil opened this Issue Aug 18, 2015 · 102 comments

Comments

Projects
None yet
7 participants
@brian-brazil
Copy link
Member

brian-brazil commented Aug 18, 2015

RUNBOOK was added in a hurry in #843 for an internal demo of one of our users, which didn't give it enough time to be fully discussed. The demo has been done, so we can reconsider this.

I think we should revert this change, and remove RUNBOOK:

  • Our general policy is that if it can be done with labels, do it with labels
  • All notification methods in the alertmanager will need extra code to deal with this
  • In future, all alertmanager notification templates will need extra code to deal with this
  • In general, all user code touching the alertmanager will need extra code to deal with this
  • This presumes a certain workflow in that you have something called a "runbook" (and not any other name - playbook is also common) and that you have exactly one of them

Runbooks are not a fundamental aspect of an alert, are not in use by all of our users and thus I don't believe they meet the bar for first-class support within prometheus. This is especially true considering that they don't add anything that isn't already possible with labels.

@juliusv

This comment has been minimized.

Copy link
Member

juliusv commented Aug 18, 2015

Counterpoint from someone who doesn't feel too strongly either way: Having a special field for the runbook entry gives it clear semantic meaning and integrates well with notification mechanisms like PagerDuty, which do have a native runbook field and nicely display it when looking at an alert.

It seems this is mostly a debate between machine-readability (explicit semantics) of data vs. having to add code for each of these semantic special cases.

Then there's an argument for keeping it just because it exists already and removing it would be a breaking change.

I know @beorn7 feels strongest about this, so I think it's best if he discusses this directly with @brian-brazil before the rest of us will try to mediate :)

@brian-brazil

This comment has been minimized.

Copy link
Member Author

brian-brazil commented Aug 18, 2015

Having a special field for the runbook entry gives it clear semantic meaning and integrates well with notification mechanisms like PagerDuty, which do have a native runbook field and nicely display it when looking at an alert.

Are you sure about this? I can't find it in the PagerDuty docs (https://developer.pagerduty.com/documentation/integration/events/trigger). Adding new keywords in Prometheus for everything each notification method offers is not scalable or maintainable.

Then there's an argument for keeping it just because it exists already and removing it would be a breaking change.

If that's the case, we should set a higher bar on making changes - particularly ones added in a hurry. We're still not at 1.0, so shouldn't be scared of the occasional breakage.

@juliusv

This comment has been minimized.

Copy link
Member

juliusv commented Aug 18, 2015

On Tue, Aug 18, 2015 at 4:14 PM, Brian Brazil notifications@github.com
wrote:

Having a special field for the runbook entry gives it clear semantic
meaning and integrates well with notification mechanisms like PagerDuty,
which do have a native runbook field and nicely display it when looking at
an alert.

Are you sure about this? I can't find it in the PagerDuty docs (
https://developer.pagerduty.com/documentation/integration/events/trigger).
Adding new keywords in Prometheus for everything each notification method
offers is not scalable or maintainable.

Oh, you're right. I confused this with the client and client_url
fields, which get special treatment in PagerDuty's UI. I agree that we
shouldn't add new keywords for things that are too special for a given
notification method. Again, one could argue that RUNBOOK is general enough.

Then there's an argument for keeping it just because it exists already and
removing it would be a breaking change.

If that's the case, we should set a higher bar on making changes -
particularly ones added in a hurry. We're still not at 1.0, so shouldn't be
scared of the occasional breakage.

Agreed on the former! For the latter, breaking changes will always hurt
even before 1.0, but yeah, they're acceptable.

@beorn7

This comment has been minimized.

Copy link
Member

beorn7 commented Aug 26, 2015

My strong feelings are about the relative importance of the runbook link compared to other fields. It's for example much more important than the summary. (I'm talking about large-scale organizations where you will be on-call for more than the few systems you work with day-to-day. Small-scare organizations arguably don't need Prometheus in the first place.)

I'm very much in favor of solving all of this with non-grouping labels. The runbook link was not introduced in a hurry but because we had already solved way less important alert components via dedicated fields.

@beorn7 beorn7 changed the title Remove RUNBOOK from alerts Allow specification of custom non-grouping alert labels. Then remove RUNBOOK (and possibly SUMMARY and DESCRIPTION) as an explicit field and generalize as label. Aug 26, 2015

@beorn7

This comment has been minimized.

Copy link
Member

beorn7 commented Aug 26, 2015

Renamed the issue to reflect the discussion we had back in June.

@brian-brazil

This comment has been minimized.

Copy link
Member Author

brian-brazil commented Aug 26, 2015

I think this is a little trickier, as there will be labels coming from the alert that you don't want to group on (e.g. in a HA setup where each half has a different label to distinguish them) which is different from providing information about the alert itself as distinct from a particular instance of it.

@beorn7

This comment has been minimized.

Copy link
Member

beorn7 commented Aug 26, 2015

Yeah, we have to solve that problem. But as long as we have no generic way to pass labels for things like runbook, summary, description or whatever else will show up in whatever notification mechanism is used, we cannot remove RUNBOOK. It's in active use at SC, and it is of crucial importance to enable receivers of an alert to act on it.

@brian-brazil

This comment has been minimized.

Copy link
Member Author

brian-brazil commented Aug 26, 2015

we have no generic way to pass labels for things like runbook, summary, description or whatever else will show up in whatever notification mechanism is used,

We have always had one of those in one labels.

It's in active use at SC, and it is of crucial importance to enable receivers of an alert to act on it.

We shouldn't be tying prometheus to the needs of a single company, and the question isn't whether we should support runbook but whether it deserves first-class support and all the maintenance and development burden it brings.

The current behaviour of runbook does not require a non-grouping label, as it's always a constant so it can be replaced by a grouping label today.

@brian-brazil

This comment has been minimized.

Copy link
Member Author

brian-brazil commented Aug 26, 2015

it is of crucial importance to enable receivers of an alert to act on it.

This is not my experience, I find looking at a playbook to be an act of desperation and worse than useless in general. This is why we need to be wary about adding features like this, not all companies and people have the same workflow and we need to keep Prometheus generic.

@beorn7

This comment has been minimized.

Copy link
Member

beorn7 commented Aug 26, 2015

We have to conclude that our experiences differ.

I emphatically disagree that the RUNBOOK entry is based on the needs of a single company.

The runbook link is not necessarily a constant. A setup where the link depends on the alert name is easily imaginable ;). But we have to keep things even more flexible that that. Runbook link, summary, and description are all very similar as they need to be created from some kind of templating. More fields of that kind are possible depending on the use case and the used notification system. That's why we need a generic solution here, giving each user the flexibility to cater for their specific use case.

@fabxc

This comment has been minimized.

Copy link
Member

fabxc commented Aug 26, 2015

This. Having any of them hard coded into the statement was anti-Promethean in the first place.
Something like this is more fitting:

ALERT <name>
  IF <expr> FOR <dur>
  WITH {
    <label> = "<value>",
    ...
  }
  PAYLOAD {
    <label> = "<value>",
    ...
  }

Where the difference between WITH and PAYLOAD are attachment to time series and that PAYLOAD values should be templated.

@brian-brazil

This comment has been minimized.

Copy link
Member Author

brian-brazil commented Aug 26, 2015

The runbook link is not necessarily a constant. A setup where the link depends on the alert name is easily imaginable ;).

Indeed and we will support that someday, though that's still constant relative to the alertname.

But we have to keep things even more flexible that that. Runbook link, summary, and description are all very similar as they need to be created from some kind of templating.

It's not just those - all alert labels need to be able to be created from templating.

That's why we need a generic solution here, giving each user the flexibility to cater for their specific use case.

I'd propose templated labels as the solution, and then similarly templated notifications on the alertmanger side. None of this requires handling runbook specially.

@brian-brazil

This comment has been minimized.

Copy link
Member Author

brian-brazil commented Aug 26, 2015

Where the difference between WITH and PAYLOAD are attachment to time series and that PAYLOAD values should be templated.

It's a bit tricker, the most common case will be that the non-grouping labels are global labels so won't be part of the alert definition. This indicates towards a global setting of some form.

@beorn7

This comment has been minimized.

Copy link
Member

beorn7 commented Aug 26, 2015

On Wed, Aug 26, 2015 at 1:13 PM, Brian Brazil notifications@github.com wrote:

I'd propose templated labels as the solution, and then similarly templated notifications on the alertmanger side. None of this requires handling runbook specially.

And neither description or summary. That's my whole point. I'm just
opposed to removing RUNBOOK (or SUMMARY or DESCRIPTION, although less
so) before we have a solution in place.

I tried to express that in the title of this issue. If somebody can
come up with a better title, be my guest.

Björn Rabenstein, Engineer
http://soundcloud.com/brabenstein

SoundCloud Ltd. | Rheinsberger Str. 76/77, 10115 Berlin, Germany
Managing Director: Alexander Ljung | Incorporated in England & Wales
with Company No. 6343600 | Local Branch Office | AG Charlottenburg |
HRB 110657B

@juliusv

This comment has been minimized.

Copy link
Member

juliusv commented Aug 26, 2015

Are we really thinking about even removing SUMMARY or DESCRIPTION as language-level constructs? Basically every notification mechanism depends on them, and thus we benefit a lot from statically checking that they are specified.

@brian-brazil

This comment has been minimized.

Copy link
Member Author

brian-brazil commented Aug 26, 2015

And neither description or summary. That's my whole point. I'm just
opposed to removing RUNBOOK

I don't see how this helps your use case as non-grouping is orthogonal to runbook as it never affects grouping.

What exactly are you trying to get into your notifications with this?
If you had used a label it'd have appeared in all notifications with no code changes required (at the moment it's only in pagerduty and opsgenie).

I'm not so sure about description and summary as they are going to be the main uses of templates, and unlike almost all other potential templated labels can vary significantly from eval to eval (e.g. a top 10 worst servers list or a string with the current value) - so you don't want to group on those or put them in timeseries.
I'm for keeping description and summary.

@beorn7

This comment has been minimized.

Copy link
Member

beorn7 commented Aug 26, 2015

Perhaps the "non-grouping" is distracting... I want a very simple and obvious way to allow users to create labels via templates that can then be used to assemble the notification.

I'm opposed to treating "description" and/or "summary" special. Everything that applies to them also applies to "runbook" (and possibly other fields). I don't support the assumption that every notification mechanism depends on "description" and "summary" and those therefore need to be checked statically by Prometheus in contrast to other fields.

"runbook" or any other custom label potentially need exactly the same flexibility and variability as "description" and/or "summary". We should not put any assumption about const'ness of "runbook" into our design.

@brian-brazil

This comment has been minimized.

Copy link
Member Author

brian-brazil commented Aug 26, 2015

I want a very simple and obvious way to allow users to create labels via templates that can then be used to assemble the notification.

We're in agreement on that. Right now all we have is labels and hardcoded templates, but ultimately we should have go templates for all notification fields in the alertmanager.

I'm opposed to treating "description" and/or "summary" special. Everything that applies to them also applies to "runbook" (and possibly other fields).

Technically they're the same, in practice they're special. As indicated above they tend to vary from eval cycle to eval cycle, so can't be used safely as either grouping or non-grouping labels. The other use cases that initially come to mind that are similarly special (e.g. parametrised urls based on alert value) can be handled via summary/description so I'm not sure we need to genericise that - let's wait for a use case.

@brian-brazil

This comment has been minimized.

Copy link
Member Author

brian-brazil commented Sep 1, 2015

Based on this discussion, I've updated the summary.

I also realised there's another way to implement this without code changes: Put it in the description.

Reviewing this discussion, the only argument I see as to why this feature needs to stay as-is is that SoundCloud is using it. This has not prevented us from removing features in the past.

@beorn7

This comment has been minimized.

Copy link
Member

beorn7 commented Sep 1, 2015

Obviously you can put into the description what you want. It could be a runbook link. You could also put the description in the runbook field. However, my whole point is that a runbook link is so crucial that it most be treated as a 1st class citizen, for one so that you can prominently place the link in your resulting notification but also to signal to the naive user that they really have to fill in that field for meaningful alerts.

Now I completely realize that you disagree. And others. Possibly all of you. But I have to restate (over and over again) that we are not keeping the feature because Soundcloud is using it. We are keeping it because I deem it so important. (And Soundcloud uses it because I happen to be an a TL role there.)

You could all decide to overrule me. Unsolvable disagreements happen, and at some point we have to go one way or the other. (And this particular feature will surely not doom the project one way or the other.) However, it's moot to discuss over and over again if it matters if Soundcloud uses the feature because it's not about that. It's about the opinion of an important project member.

I restate that my suggestion is to acknowledge that use cases differ and generalize "annotations" (or whatever to call them). Let's implement that first, instead of kicking out an existing feature that is important (at least for me).

@brian-brazil

This comment has been minimized.

Copy link
Member Author

brian-brazil commented Sep 1, 2015

I recognise that there's a use case. My contention is that it's not important enough to justify increasing complexity for all users of alerting by forcing them to deal with this as a special case, compared to the benefit it brings.

generalize "annotations" (or whatever to call them).

What are you thinking here?

@beorn7

This comment has been minimized.

Copy link
Member

beorn7 commented Sep 1, 2015

And I recognize that you don't think it's important enough. We have to agree that we disagree here.

The generalized labels would probably only need the ability to be expanded via the usual templating. If it's the right thing to enable template exansion in general within WITHor have a separate section I can't say.

# Current approach.
ALERT InstanceDown
  IF up == 0
  FOR 5m
  WITH {
    severity="page"
  }
  SUMMARY "Instance {{$labels.instance}} down"
  DESCRIPTION "{{$labels.instance}} of job {{$labels.job}} has been down for more than 5 minutes."
  RUNBOOK "http://wiki.int.example.com/InstanceDown?service={{$labels.job}}&zone={{$labels.zone}}"
# WITH with templating, to be vetted.
ALERT InstanceDown
  IF up == 0
  FOR 5m
  WITH {
    severity="page"
    summary="Instance {{$labels.instance}} down"
    description="{{$labels.instance}} of job {{$labels.job}} has been down for more than 5 minutes."
    runbook="http://wiki.int.example.com/InstanceDown?service={{$labels.job}}&zone={{$labels.zone}}"
  }

Ideally, we want some escaping magic or features to make it easy to render valid URLs.

@brian-brazil

This comment has been minimized.

Copy link
Member Author

brian-brazil commented Sep 1, 2015

The generalized labels would probably only need the ability to be expanded via the usual templating.

I'm for adding that for all labels (there's advanced use cases for it), so I can put that PR together.

We can't make summary/description labels though (and need to be a tad careful with runbook, if you're hitting that case you probably should put it in the description).

@beorn7

This comment has been minimized.

Copy link
Member

beorn7 commented Sep 1, 2015

Cross-inspirations from #1043 : labels that go via template expansion change a lot, so are not really suitable as proper labels on the ALERT timeseries. So we could have a separate stanza for template expansions that are not labels.

ALERT InstanceDown
  IF up == 0
  FOR 5m
  WITH LABELS {
    severity="page"
  }
  WITH ANNOTATIONS {
    summary="Instance {{$labels.instance}} down"
    description="{{$labels.instance}} of job {{$labels.job}} has been down for more than 5 minutes."
    runbook="http://wiki.int.example.com/InstanceDown?service={{$labels.job}}&zone={{$labels.zone}}"
  }
@brian-brazil

This comment has been minimized.

Copy link
Member Author

brian-brazil commented Sep 1, 2015

labels that go via template expansion change a lot, so are not really suitable as proper labels on the ALERT timeseries.

Depends how you do it, where it usually comes up they're constant relative to a given alert (mostly for advanced alert routing) - so I'm not worried there.

Where the real problem is if someone does runbook="{{$value}}"

We've got four options there:

  • Don't support it
  • List it as a non-grouping label (still breaks ALERT)
  • Put it in the description
  • Generalise as you suggest

My instinct is that generalising will cause more problems than it will solve though making alert notifications too generic, and that we should leave summary/description as the only special cases. They do map pretty well to the notification methods we have thus far.

@juliusv

This comment has been minimized.

Copy link
Member

juliusv commented Sep 1, 2015

  • List it as a non-grouping label (still breaks ALERT)

Just for my understanding: how would a templated non-grouping label break ALERT?

@brian-brazil

This comment has been minimized.

Copy link
Member Author

brian-brazil commented Sep 1, 2015

Just for my understanding: how would a templated non-grouping label break ALERT?

It it had labels that weren't constant relative to the alert (e.g. contained {{$value}}) it'd create a new timeseries every eval cycle, which would be very difficult to deal with if you cared about the content of ALERT.

In practice, I think we'd recommend linking to a console (passing required labels) that'd display whatever time-series you want and avoid this problem. Accordingly I think side-stepping this problem is practical, and likely the best approach overall.

@juliusv

This comment has been minimized.

Copy link
Member

juliusv commented Sep 1, 2015

But I thought the whole purpose of non-grouping labels is that they
wouldn't be attached to any time series, but just serve as extra metadata
in alerts.
On Sep 1, 2015 15:57, "Brian Brazil" notifications@github.com wrote:

Just for my understanding: how would a templated non-grouping label break
ALERT?

It it had labels that weren't constant relative to the alert (e.g.
contained {{$value}}) it'd create a new timeseries every eval cycle,
which would be very difficult to deal with if you cared about the content
of ALERT.

In practice, I think we'd recommend linking to a console (passing required
labels) that'd display whatever time-series you want and avoid this
problem. Accordingly I think side-stepping this problem is practical, and
likely the best approach overall.


Reply to this email directly or view it on GitHub
#1002 (comment)
.

@brian-brazil

This comment has been minimized.

Copy link
Member Author

brian-brazil commented Sep 1, 2015

But I thought the whole purpose of non-grouping labels is that they
wouldn't be attached to any time series, but just serve as extra metadata
in alerts.

You want non-grouping labels attached to alerts and timeseries, as it's how you'll distinguish time-series from different HA prometheis in your long-term storage. These would be typically configured at the prometheus-level (usually as global labels once we fix those) rather than the alert label.

What Beorn is proposing results in a 3rd class of key/value pair that only applies to alerts.

@brian-brazil

This comment has been minimized.

Copy link
Member Author

brian-brazil commented Sep 26, 2015

With that premise, why would not having these two fields as first class attributes change anything in adoption for beginners?

With Matthias's proposal they would see have to see those attributes in all examples, so that does change something in adoption.

If we ignore that bit, it causes a problem when they to to the next stage beyond simple alerts where they want to tweak the notification a little bit. Now rather than having a field that likely covers their need, they've instead got theory to learn and choices to make. More concretely rather than a user thinking "I'll add it to description" they now think more along the lines of "should I put this in description, or a label or in payload and what do I name them and how do I lay out my notification templates".

Similarly if they want to do something simple with routing (which is a common initial use case) they have to learn that only labels work for that, and we'll start to get questions and confusion around why payload doesn't as syntactically they're identical in the alert expression. Most beginner alerts wouldn't even have a "WITH" clause, so users would think that "PAYLOAD" was the place to put labels for routing as it looks like a place to put labels. Trying to prevent that requires more docs, which means more reading and confusion, which means less adoption.

As soon as they want to customize alerts further, they'll have to consult the documentation anyway. I don't see why the specific syntax used (first class attributes vs. just labels) will make any difference or make things anymore confusing.

Yes, the question is at what stage we require them to consult the docs and how complex those docs are. With Matthias's proposal they have to read them much sooner and they're more complex. I'd like to push out the need to read the full docs as far as possible and try to limit how may concepts docs have to explain, particularly where they're only really relevant to the more advanced use cases.

If you look at target relabelling, we're already seeing these type of problems due to the system being too generic and difficult to learn for a new user. While I still believe that is the least-worst decision in that case, let's not introduce another one of those into the system without being sure it'll overall lead to an increased number of happy users. What's being proposed here is a feature that's mainly of use to advanced users, we shouldn't be forcing beginner and intermediate users to have to deal with complexity that doesn't apply to their use case when we've a reasonable way to avoid that.

@barkerd427

This comment has been minimized.

Copy link

barkerd427 commented Sep 26, 2015

I'll admit that I haven't read this whole thread, but I've read a large portion of it. Let me offer my perspective as a naive, basic user.

When I started looking for an alerting system for work, I went through several options before settling on Prometheus. Nagios has a lot of complexity. It was hard just to get installed. Then I went to Icinga2. That was easier to setup and install, but still not great with my companies restrictive access. It then took another day to figure out that I'd still have to open a bunch of ports to get metrics out. Prometheus was dead simple to install. It requires very little effort to configure.

When I started to add alerts, the docs were very easy to follow. There are still gaps in the docs, but experience easily filled in those. However, there is something lacking. I think that is what we're covering here.

As most enterprise user, I live in a world that has both old and ancient technology. Some of it written in-house. This gives me some more complex cases with respect to the information these systems need.

Our alerting/paging system requires a long complex string to be sent as an SNMP trap or through a tcp connection. I'm currently passing this through the description, but that makes my emails look cryptic. Perhaps I'm just not architecting it correctly. This string is populated by Prometheus, but I could feasibly do this in my webhook receiver.

Either way, I feel like another option for additional payload is missing. When I first saw the runbook attribute, I had a good idea what it was, but it wasn't clear how to use it. Payload makes more sense to me assuming it's optional and the internal attributes are flexible. It also needs to support templating.

@brian-brazil

This comment has been minimized.

Copy link
Member Author

brian-brazil commented Sep 26, 2015

Thanks for the insight, this sounds like a concrete use case.

I'm currently passing this through the description, but that makes my emails look cryptic. Perhaps I'm just not architecting it correctly. This string is populated by Prometheus, but I could feasibly do this in my webhook receiver.

That's the way to do it currently, and as long as you're talking non-binary strings I think this is something we should probably cover within Prometheus. Once we have notification templates there'll be ways to clean up the emails (regexes), but that's obviously a bit messy.

What sort of information is in this string? I want to verify that this isn't something that will be covered by notification templates (roughly speaking, if you're not using query you will probably be able to handle this entirely in the alertmanager).

It also needs to support templating.

I think everyone is onboard with that.

Presuming this works out as a use case, the debate then is around a) how we expose this to users configuration-wise and b) keeping summary/description first class.

@barkerd427

This comment has been minimized.

Copy link

barkerd427 commented Sep 27, 2015

I haven't been able to get back to my work laptop to pull the actual
string, but I inject the application name and instance into the string. I
also add a custom message and error name. It also includes the type of
event and some general application information. The SNMP trap string is
just a standard format. Let me know if you want to see the actual format of
each.
On Sep 26, 2015 13:19, "Brian Brazil" notifications@github.com wrote:

Thanks for the insight, this sounds like a concrete use case.

I'm currently passing this through the description, but that makes my
emails look cryptic. Perhaps I'm just not architecting it correctly. This
string is populated by Prometheus, but I could feasibly do this in my
webhook receiver.

That's the way to do it currently, and as long as you're talking
non-binary strings I think this is something we should probably cover
within Prometheus. Once we have notification templates there'll be ways to
clean up the emails (regexes), but that's obviously a bit messy.

What sort of information is in this string? I want to verify that this
isn't something that will be covered by notification templates (roughly
speaking, if you're not using query you will probably be able to handle
this entirely in the alertmanager).

It also needs to support templating.

I think everyone is onboard with that.

Presuming this works out as a use case, the debate then is around a) how
we expose this to users configuration-wise and b) keeping
summary/description first class.


Reply to this email directly or view it on GitHub
#1002 (comment)
.

@barkerd427

This comment has been minimized.

Copy link

barkerd427 commented Sep 28, 2015

@brian-brazil, here's the format for my alerts:

The event format is the word "EVENT" followed by key=value pairs separated by "~%~".  The keys you will be sending are:
    msg_class=DST_Trap
    app_host=<servername>
    app_name=<your app name>
    app_object=<reference object for alert>
    app_severity=<severity: one of CRITICAL, MAJOR, MINOR, WARNING, INFO, OK>
    app_client=<client if applicable>
    app_eventcode=<event code>
    app_eventmsg=<event message>

And this is the pertinent part of my alert rule:

DESCRIPTION "EVENT msg_class=DST_Trap~%~app_host={{$labels.instance}}~%~app_name={{$labels.job}}~%~app_object={{$labels.task}}~%~app_severity=CRITICAL~%~app_client=NA~%~app_eventcode=ANALYTICS_CALC~%~app_eventmsg=Error in analytics calculations"

I would prefer a method of passing this string as extra data as this same description is sent to the alerting system and through email and Slack which are not part of the alerting system. Alternatively, I'd be very happy with a way of passing other metadata information that can be added to the body of an email or as extra comments on Slack.

I may also need to send this string to handle any situations where I need to use SNMP instead of TCP:

1.3.6.1.4.1.2329.1.30.1 string "{{$labels.instance}}" 1.3.6.1.4.1.2329.1.30.2 string "{{$labels.job}}" 1.3.6.1.4.1.2329.1.30.3 string "{{$labels.task}}" 1.3.6.1.4.1.2329.1.30.4 string "CRITICAL" 1.3.6.1.4.1.2329.1.30.5 string "NA" 1.3.6.1.4.1.2329.1.30.6 string "ANALYTICS_CALC" 1.3.6.1.4.1.2329.1.30.7 string "Error in analytics calculations"

I'm not sure yet how I'll input the numbers, but again this doesn't look nice to a human and I'd rather pass more meaningful data. Adding more lines to the description section will just make it hard to parse out the piece I need for our alerting system.

@beorn7

This comment has been minimized.

Copy link
Member

beorn7 commented Sep 28, 2015

The inevitability of user-defined payload looks more convincing than ever...
We can still discuss how to implement it syntactically to make it a nice and non-confusing as possible (I have some ideas), but please assume it will happen while discussing if description and summary should be hard-coded and keywords in the syntax.

The decision is irrelevant for the super-naive user who doesn't use description or summary at all.

For the description and/or summary using user, it's the difference between

  SUMMARY "Instance {{$labels.instance}} down"
  DESCRIPTION "{{$labels.instance}} of job {{$labels.job}} has been down for more than 5 minutes."

and

  PAYLOAD {
    summary = "Instance {{$labels.instance}} down"
    description = "{{$labels.instance}} of job {{$labels.job}} has been down for more than 5 minutes."
  }

with all the gotchas about detecting typos on different levels. (Note that PAYLOAD is probably not a good word, we need something better, or implement other ideas, but I don't want to add those to this already quite convoluted discussion.)

The concern I have is that for those who actually need payload, it's between

  SUMMARY "Instance {{$labels.instance}} down"
  DESCRIPTION "{{$labels.instance}} of job {{$labels.job}} has been down for more than 5 minutes."
  PAYLOAD {
    foo = "some template"
    bar = "another template"
  }

and

  PAYLOAD {
    summary = "Instance {{$labels.instance}} down"
    description = "{{$labels.instance}} of job {{$labels.job}} has been down for more than 5 minutes."
    foo = "some template"
    bar = "another template"
  }

The former syntax suggests that summary/description are special, but they are not special for those who need their own payload anyway.

We essentially need to weigh the degree of confusion for new/naive users by the first comparison against the degree of confusion for ambitious/advanced users by the second comparison.

I personally see the former as negligible but the latter as quite confusing and warty.

@brian-brazil

This comment has been minimized.

Copy link
Member Author

brian-brazil commented Sep 28, 2015

Hmm, most of what you're doing there belongs in the alertmanager rather than Prometheus I'd say as you're mainly formatting a custom notification string rather than something specific to that alert. The only wrinkles there are app_eventcode, which would be plausible as a label, and app_eventmsg which sounds like the summary. I think that notification templates will cover your use case.

The SNMP one is more interesting, my suspicion is that that's so custom that it'd have to go via the webhook and the rest per the TCP string.

@brian-brazil

This comment has been minimized.

Copy link
Member Author

brian-brazil commented Sep 28, 2015

There's the other option of

SUMMARY "Instance {{$labels.instance}} down"
DESCRIPTION "{{$labels.instance}} of job {{$labels.job}} has been down for more than 5 minutes."
WITH  {
  foo = "some template"
  bar = "another template"
}

and configuration the the global prometheus level that foo and bar are special (which we'll have for non-grouping labels too, as those can't be configured per-alert).

The former syntax suggests that summary/description are special, but they are not special for those who need their own payload anyway.

I'm not sure about that, I'd still expect alerts using this to also go out via notifiers using summary/description in the usual fashion. Payloads are about additional information.

@fabxc

This comment has been minimized.

Copy link
Member

fabxc commented Sep 29, 2015

After 85 comments, many considerations and opinions, I now propose the following solution.

I encourage everyone to simply state 👍 or 👎 so this can function as a vote for a final conclusion. Pros and cons have been discussed extensively above.

  • The RUNBOOK field will be removed as requested.
  • The SUMMARY and DESCRIPTION fields will be removed.
  • The WITH clause will be renamed into LABELS.
  • The ANNOTATIONS clause is added to allow free-form key/value data that are not considered in the identity of an alert.
  • Values in LABELS and ANNOTATIONS are template-processed.

Example:

ALERT InstanceDown
  IF up == 0 FOR 10m
  LABELS {
    severity = "critical"
  }
  ANNOTATIONS {
    summary = "instance {{$labels.instance}} down"
    description = "instance {{$labels.instance}} has not been reachable for more than 10 minutes"
  }

The new Alertmanager will have configurable handling of annotation fields with reasonable defaults, e.g. summary and description.
The goal is no more configuration than previously for simple setups.

@juliusv

This comment has been minimized.

Copy link
Member

juliusv commented Sep 29, 2015

👍 Yes. I've been leaning both ways for a while and unfortunately it won't be possible to make everyone happy on this issue, but this now looks like the best way forward for me.

@brian-brazil

This comment has been minimized.

Copy link
Member Author

brian-brazil commented Sep 29, 2015

The ANNOTATIONS clause is added to allow free-form key/value data that are is considered in the identity of an alert.

I presume you mean NOT part of the identity of the alert here, otherwise it'd be the same as labels.

For the sake of closing this off 👍

This implies several breaking changes, which is okay as we're talking the alertmanager. If we're going to do this it's we'll also have to redesign the json that goes between prometheus and the alertmanager (basically switch what's currently outside and inside the "payload" field) and hopefully avoid having to break that interface again.

@fabxc

This comment has been minimized.

Copy link
Member

fabxc commented Sep 29, 2015

You are correct, fixed.

@beorn7

This comment has been minimized.

Copy link
Member

beorn7 commented Sep 29, 2015

👍

I was thinking for quite some time about the mental barrier to understand the difference between labels and annotations, and I had some ideas how to hide it to the newcomer, but in the end, I came to the conclusion that clearly separating it is ultimately the best solution. A well worded explanation seems possible ("shows up in your alert timeseries" and "is part of the identity of the alert" are after all quite obvious concepts).

And yes, the JSON format has to be updated, too, but that was obvious for a while...

@juliusv

This comment has been minimized.

Copy link
Member

juliusv commented Sep 29, 2015

Yep, this change should go hand in hand with the Alertmanager rewrite, which needs a breaking of the transfer format anyways. So with the Prometheus version that introduces this, people will have to upgrade to the new Alertmanager in lockstep.

@juliusv

This comment has been minimized.

Copy link
Member

juliusv commented Sep 29, 2015

And: \o/ for reaching a decision!

@beorn7 beorn7 changed the title Remove RUNBOOK from alerts Create a section ANNOTATIONS with user-defined payload and generalize RUNBOOK, DESCRIPTION, SUMMARY into fields therein. Sep 29, 2015

@brian-brazil

This comment has been minimized.

Copy link
Member Author

brian-brazil commented Sep 29, 2015

Yep, this change should go hand in hand with the Alertmanager rewrite, which needs a breaking of the transfer format anyways.

Actually I think we do this one sooner, what I'd be tying this to more is the notifications templates.

@juliusv

This comment has been minimized.

Copy link
Member

juliusv commented Sep 29, 2015

If we do that, we will break Prometheus<->AM twice instead of once. Not so happy about that, I mean, it's not that urgent or is it?

@brian-brazil

This comment has been minimized.

Copy link
Member Author

brian-brazil commented Sep 29, 2015

I'm not sure why we'd need to break it twice, I'm not forseeing any fundamental structural changes to the interaction that'd require breaking changes, it's more a cleanup of the existing API. It'd also be good not to block notification templates on the rewrite.

@brian-brazil

This comment has been minimized.

Copy link
Member Author

brian-brazil commented Sep 29, 2015

From #1128 we see that users are already having problems with the lack of notification templates and working around them which is causing other issues, so I'd like to get notification templates in at least in inside the next month.

@fabxc

This comment has been minimized.

Copy link
Member

fabxc commented Sep 29, 2015

With some effort we can have a new AM by then that has the new features we want and the persistence of the current one.

@brian-brazil

This comment has been minimized.

Copy link
Member Author

brian-brazil commented Sep 29, 2015

I see the notification stuff as orthogonal to the new AM, the only wrinkle is the API changes this issue requires.

@juliusv

This comment has been minimized.

Copy link
Member

juliusv commented Sep 29, 2015

AIUI the new AM will require other API changes besides those fields - @fabxc will know more. If that's true, it would create a second breaking change to the transfer format.

@brian-brazil

This comment has been minimized.

Copy link
Member Author

brian-brazil commented Sep 29, 2015

A small compatibility layer in the AM is probably best then. No point in a breaking change in notification templates too.

@brian-brazil

This comment has been minimized.

Copy link
Member Author

brian-brazil commented Dec 16, 2015

Resolved by 7c90db2

@lock

This comment has been minimized.

Copy link

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 24, 2019

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
You can’t perform that action at this time.