Alerting: Alert Rule Categorizing & Sorting, Severity, Annotation & Notification filtering #6007

Closed
torkelo opened this issue Sep 12, 2016 · 14 comments
Labels
area/alerting Grafana Alerting type/discussion Issue to start a discussion

Comments

@torkelo
Member

torkelo commented Sep 12, 2016

We have 2-3 design decisions to make soon, and they are all somewhat interconnected.

Alert Rule Severity

Today we have a severity option (Critical & Warning) and Alert States (OK, Unknown(NoData), Critical, Warning, Execution Error).

The problem here is that we are mixing Alert States (OK, Unknown, Execution Error) and letting the Firing State be viewed differently depending on severity (degree of importance associated with words like Critical & Warning).

Mixing two different things, severity (degree of importance, sortable) and rule status (not sortable), is a bit tricky.

We are a bit tempted to remove the severity option, or at least change it and separate it from the state. So we only end up with these alert states:

  • OK
  • NoData (Formerly Unknown)
  • Alerting (Other alternative is Firing)
  • Execution Error
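
A minimal sketch of what the reduced state set could look like once severity is taken out of it. The enum, the `evaluate` helper, and its parameters are hypothetical names chosen for illustration; this is not Grafana's actual implementation.

```python
from enum import Enum
from typing import Optional


class AlertState(Enum):
    """The four proposed states; none of them implies a ranking."""
    OK = "ok"
    NO_DATA = "no_data"
    ALERTING = "alerting"
    EXECUTION_ERROR = "execution_error"


def evaluate(value: Optional[float], threshold: float, failed: bool = False) -> AlertState:
    """Map one rule evaluation to a state (hypothetical helper)."""
    if failed:
        return AlertState.EXECUTION_ERROR  # the query or evaluation itself failed
    if value is None:
        return AlertState.NO_DATA  # the query returned no data points
    return AlertState.ALERTING if value > threshold else AlertState.OK
```

With this split, a rule is either alerting or not. How important that is would be expressed somewhere else (a category, a notification target), not in the state itself.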

Alert Rule Categorization

Right now there is no way to organize a large set of alert rules and the annotations (i.e. state change history) they create. The only thing that approaches categorization is notifications, where each alert can have its own set of notifications. But these are not handled as proper categorization properties, as there is no way to filter/group by notifications, nor are they included in the alert annotation (state change event). This seems like it can quickly become a problem for even medium-sized Grafana installs where more than one team is using the Alert feature.

Possible solutions:

  1. Use notification groups but index them and include it as a filterable property, and include it in alert annotations
  2. Use dashboard tags
  3. Use a new concept, alert tags, that are set on each alert rule (like dashboard tags).
  4. Introduce a new concept, Alert Category, that can work like a single tag. Here we can allow users to create these alert categories and specify options for each (color, defaults, message template, default notifications).

Of all these solutions, Alert Category is probably a lot easier and less complex to implement than tags. Tags require a many-to-many lookup table between alert and alert category, and another between the annotation and annotation category tables. Searching and filtering tables with many-to-many relations in SQL is a PITA.
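
To make the trade-off concrete, here is a small sketch using Python's built-in `sqlite3` module. The table and column names (`alert`, `tag`, `alert_tag`, `category`) are assumptions made up for the example and do not reflect Grafana's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Single category: one plain column on the alert table.
    CREATE TABLE alert (id INTEGER PRIMARY KEY, name TEXT, category TEXT);

    -- Tags: a tag table plus a many-to-many lookup table.
    CREATE TABLE tag (id INTEGER PRIMARY KEY, term TEXT UNIQUE);
    CREATE TABLE alert_tag (alert_id INTEGER, tag_id INTEGER);

    INSERT INTO alert VALUES (1, 'disk full', 'team-a'), (2, 'high latency', 'team-b');
    INSERT INTO tag VALUES (1, 'team-a'), (2, 'team-b'), (3, 'storage');
    INSERT INTO alert_tag VALUES (1, 1), (1, 3), (2, 2);
""")

# Filtering by a single category is a simple WHERE clause.
by_category = conn.execute(
    "SELECT name FROM alert WHERE category = ? ORDER BY id", ("team-a",)
).fetchall()

# Filtering for alerts that carry BOTH tags needs two joins plus GROUP BY/HAVING.
by_tags = conn.execute(
    """
    SELECT a.name
    FROM alert a
    JOIN alert_tag link ON link.alert_id = a.id
    JOIN tag t ON t.id = link.tag_id
    WHERE t.term IN (?, ?)
    GROUP BY a.id
    HAVING COUNT(DISTINCT t.id) = 2
    """,
    ("team-a", "storage"),
).fetchall()
```

Both queries find the same alert here, but the tag version is noticeably heavier, and the same pattern would have to be repeated for annotations.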

Skipping the many-to-many (tags) ability and having only one alert category makes it simpler for the user as well, and makes it possible to treat the alert category as a sort of severity too (users who want Warning and Critical categories could set those up). The restriction is that two teams who each want their own categories would have to create, for example, Team B Critical and Team B Warning.

Another reason that a single category property might be preferable is that it allows for it to include defaults for things like notification message & notifications.

Feedback would be much appreciated

@RichiH
Member

RichiH commented Sep 12, 2016

Severity

You need severity; if anything, you should also allow splitting it into urgency (call and wake on-call people) and importance (this needs to be done first, but potentially only during business hours). There should be a way to have a dashboard for "work everything off from top to bottom", sorted by importance, overridden by urgency. Otherwise you will create pager fatigue, because the smallest of blips creates the same alarms as a huge outage.
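
One way to read "sorted by importance, overridden by urgency" is a two-part sort key. The sketch below assumes a hypothetical `FiringAlert` record with an integer `importance` and a boolean `urgent` flag; neither exists in Grafana.

```python
from dataclasses import dataclass


@dataclass
class FiringAlert:
    name: str
    importance: int  # higher means "work on this first"
    urgent: bool     # True means "page the on-call person now"


def work_queue(alerts):
    """Urgent alerts first, then by descending importance."""
    return sorted(alerts, key=lambda a: (not a.urgent, -a.importance))


alerts = [
    FiringAlert("disk 95% full", importance=2, urgent=False),
    FiringAlert("site down", importance=3, urgent=True),
    FiringAlert("cert expires in 30 days", importance=1, urgent=False),
    FiringAlert("replica lagging", importance=1, urgent=True),
]
queue = [a.name for a in work_queue(alerts)]
```

Here `queue` lists the two urgent alerts first ("site down", then "replica lagging"), followed by the non-urgent ones in order of importance.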

Categorization

I would vote for arbitrary labels with arbitrary values. That way, users can slice and dice their own data. Of course, a set of sane presets should be included to try and keep most people along a common set of nomenclature. @bergquist think Prometheus labels ;)

@torkelo torkelo added type/discussion Issue to start a discussion area/alerting Grafana Alerting labels Sep 12, 2016
@torkelo
Member Author

torkelo commented Sep 12, 2016

Do we need severity? Many alerting systems seem to skip it (Prometheus, DataDog, Librato).

@RichiH
Member

RichiH commented Sep 12, 2016

I think you do. And Prometheus allows arbitrary severities which you can define in freeform.

@torkelo
Member Author

torkelo commented Sep 12, 2016

@RichiH but there is no Severity concept in Prometheus, right? Just labels & label values.

So no way to filter on say alerts with severity above Warning.

@wleese
Contributor

wleese commented Sep 12, 2016

IMHO severity, something revolving around how users respond to an alert, could be omitted completely if the notification provides enough flexibility.

Those who need 'warning' and 'critical' alerts could define 2 rules that would have different notifications like different endpoints (mail and sms/push notification).

Looking at our Nagios-based setup (>100.000 checks), the warning severity is merely used on dashboards, so that if someone notices the warning, they can take some preventive action. But in reality this rarely happens, with the exception of fairly specific checks such as disk space, or when the warning hangs around long enough for someone to get annoyed by it.

That said, asking this question seems a bit like telling someone they can get a car for free and then asking which features they want. Expected answer: everything ;)

@RichiH
Member

RichiH commented Sep 12, 2016

@torkelo You can use the labels in whatever way you want and you can route different alarms into different receivers. I.e. I don't escalate for a "disk over 95% full", but I show it in a dashboard. I do escalate on a "disk full in x hours/before normal working times start" - that's an importance & urgency concept baked into the language, even if it's not called that.
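
A rough sketch of label-based routing in this spirit. It is a simplified imitation written for this thread, not Alertmanager's configuration format or API; the `route` function, the label names, and the receiver names are all hypothetical.

```python
def route(labels, routes, default="dashboard"):
    """Return the receiver of the first route whose matchers all equal the alert's labels."""
    for matchers, receiver in routes:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver
    return default  # nothing matched: only show the alert on a dashboard


# Freeform labels: "severity" is just a label name the user picked.
routes = [
    ({"severity": "page"}, "pager"),
    ({"severity": "ticket", "team": "storage"}, "storage-email"),
]

disk_full_soon = {"alertname": "DiskFullIn4h", "severity": "page"}
disk_over_95 = {"alertname": "DiskOver95", "severity": "info"}
```

With these routes, `disk_full_soon` goes to the pager while `disk_over_95` falls through to the dashboard-only default, without the system itself knowing what "severity" means.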

@wleese "warning" and "critical" is too short-sighted in my experience. You need a way to inform about stuff, warn about stuff, say "this is important" and "this is burning" - One-dimensionally, this maps to info, warning, high, critical.

I do agree that this issue has potential to descend into bikeshedding, though.

@hgomez-sonarsource

Zabbix uses 5 levels:

  • Disaster
  • High
  • Average
  • Warning
  • Information

Disaster is quite meaningful (stop everything and fix it now). High is something to take care of ASAP; the other levels are mostly informational.

@theist

theist commented Sep 12, 2016

Agreed with @wleese: warning is normally treated as "an alert that does not alert anyone". It is there because it is worth seeing, but nothing is broken. It is also worth having for an API, instead of doing active watching. You run a program like nagstamon during work hours to prevent warnings from degrading into criticals. But warning levels are rarely used to wake on-call rotations.

Beyond ok/warning/critical everyone has their own book. This fits well for Nagios / Shinken users. Zabbix users may feel that it is not enough, but imho ok/warn/crit covers 80% of use cases and 50% of complex ones. If you need more complicated alerting I would advise going to the source, writing scripts that cover more complex stuff using the influx/prometheus/graphite/etc data.

Regarding categorization, I'll vote for using dashboard tags or alert tags. Since Grafana is a tool for making dashboards, I'd want the alerts to be related to the dashboards in some way.

@s4z

s4z commented Sep 12, 2016

Would it be possible to include templated values from the dashboard as metadata in alerts?

I'd say OK, Warning and Critical (or even OK/Crit) are all that's required for levels. For categorisation, however, dashboard tags wouldn't work for our use case. Our dashboards can view many environments, and templating is used to select specific one(s). For probably more than 50% of the envs we wouldn't alert to PagerDuty, but some of them we would want to route to webhooks to feed back into performance testing systems.

Quite a while ago now I wrote a simple graphite monitor that used regexp groups to pull out key=value pairs from metrics and used these to route triggered events. It plugged into both Elastic and a performance testing system - triggers for testing systems of course routed to both and triggers for non-testing systems routed to Elastic only. The metric names included the environment names among other metadata.

That way I could use one set of rules to monitor all environments and it was possible to trigger different things based on the metric and the level (ok/warn/crit). I believe Bosun can do this as well if I'm not mistaken.
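
The idea can be sketched roughly as follows. This is not the original monitor; the metric naming scheme, the `perf` prefix for testing environments, and the target names are assumptions invented for the example.

```python
import re

# Hypothetical metric naming scheme: <env>.<service>.<measurement>
METRIC_PATTERN = re.compile(
    r"^(?P<env>[^.]+)\.(?P<service>[^.]+)\.(?P<measurement>.+)$"
)


def metadata(metric):
    """Pull key=value pairs out of a metric name via named regexp groups."""
    match = METRIC_PATTERN.match(metric)
    return match.groupdict() if match else {}


def targets(metric):
    """Every trigger goes to Elastic; testing environments also feed the test system."""
    meta = metadata(metric)
    result = ["elastic"]
    if meta.get("env", "").startswith("perf"):  # assumed marker for testing envs
        result.append("perf-testing")
    return result
```

For example, `targets("perf01.api.latency")` returns both targets, while `targets("prod.api.latency")` returns only `["elastic"]`, so one rule set can cover every environment.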

@mhiller

mhiller commented Sep 12, 2016

Critical, Warn and OK are pretty much industry standard. I think they should be supported at a minimum.

@fdelapena

Concerning levels, RFC 3877 mentions some levels based on ITU M.3100. The details of RFC 5674 may also be worth checking for some ideas from syslog.

@torkelo
Member Author

torkelo commented Sep 13, 2016

After talking to a few people and getting some feedback from this thread we are moving forward with these alert states:

  • OK
  • Alerting
  • NoData
  • ExecError

We will remove the severity option and replace it with a more general Alert Category property which will allow users a simple way to organize & filter alerts. Users will of course be able to specify their own alert categories. The big limitation here will be that this categorization will only be 1-dimensional (only one category per alert).

I know many want the most flexible and powerful alerting system, where you can specify severity, labels, label values, etc. and filter by everything. Well, we can't do everything, so we will have to start simple :)

@torkelo torkelo closed this as completed Sep 13, 2016
@GowthamShanmugam

GowthamShanmugam commented Aug 11, 2017

@torkelo Hi, I am Gowtham. I have one question about Grafana: when I send an HTTP API request to fetch a particular alert (api/alerts/1), it returns the alert details, but severity always comes back as an empty string "".

How can I set a severity like warning or critical?

{
"Id": 1,
"Version": 0,
"OrgId": 1,
"DashboardId": 1,
"PanelId": 1,
"Name": "Panel Title alert",
"Message": "",
"Severity": "",
.
.
.

@torkelo
Member Author

torkelo commented Aug 11, 2017

There is no severity option to set


9 participants