Alerting: Alert Rule Categorizing & Sorting, Severity, Annotation & Notification filtering #6007

torkelo · 2016-09-12T05:11:49Z

We have two or three design decisions to make soon, and they are all somewhat interconnected.

Alert Rule Severity

Today we have a severity option (Critical & Warning) and Alert States (OK, Unknown(NoData), Critical, Warning, Execution Error).

The problem here is that we are mixing Alert States (OK, Unknown, Execution Error) with severity, and letting the Firing State be viewed differently depending on severity (the degree of importance associated with words like Critical & Warning).

Mixing these two different things, severity (degree of importance, sortable) and rule status (not sortable), is a bit tricky.

We are a bit tempted to remove the severity option, or at least change it and separate it from the state. That way we end up with only these alert states:

OK
NoData (Formerly Unknown)
Alerting (Other alternative is Firing)
Execution Error
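
A minimal sketch of what a state-only model could look like, with no severity mixed in. The names `AlertState`, `AlertRule` and `firing` are illustrative assumptions, not Grafana's actual code:

```python
from dataclasses import dataclass
from enum import Enum


class AlertState(Enum):
    """The four proposed states; deliberately unordered (not sortable)."""
    OK = "ok"
    NO_DATA = "no_data"            # formerly Unknown
    ALERTING = "alerting"          # the alternative name would be Firing
    EXECUTION_ERROR = "execution_error"


@dataclass
class AlertRule:
    name: str
    state: AlertState = AlertState.NO_DATA


def firing(rules):
    """Return the names of the rules that are currently alerting."""
    return [r.name for r in rules if r.state is AlertState.ALERTING]


rules = [
    AlertRule("cpu load", AlertState.OK),
    AlertRule("disk space", AlertState.ALERTING),
    AlertRule("latency"),  # no data received yet
]
print(firing(rules))
```

Because `AlertState` is a plain `Enum`, comparing two states with `<` raises a `TypeError`, which matches the point that rule status is not something to sort by.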

Alert Rule Categorization

Right now there is no way to organize a large set of alert rules and the annotations (i.e. state change history) they create. The only thing that approaches categorization is notifications, where each alert can have its own set of notifications. But these are not handled as proper categorization properties: there is no way to filter or group by notifications, and they are not included in the alert annotation (state change event). This seems like it can quickly become a problem even for medium-sized Grafana installs where more than one team is using the Alert feature.

Possible solutions:

Use notification groups but index them and include it as a filterable property, and include it in alert annotations
Use dashboard tags
Use a new concept, alert tags, that are set on each alert rule (like dashboard tags).
Introduce a new concept, Alert Category, that works like a single tag. Here we could allow users to create these alert categories and specify options for each (color, defaults, message template, default notifications).

Of all these solutions, Alert Category is probably a lot easier and less complex to implement than tags. Tags require a many-to-many lookup table between alert and alert category, as well as between the annotation and annotation category tables. Searching and filtering tables with many-to-many relations in SQL is a PITA.
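
To make the difference concrete, here is a rough sketch using SQLite from Python. The schema (`alert`, `tag`, `alert_tag`, and a `category` column) is a hypothetical illustration, not Grafana's real schema. It compares filtering on a single category column with filtering on alerts that carry all of a set of tags:

```python
import sqlite3


def build_db():
    """Create an in-memory database with hypothetical alert/tag tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE alert (id INTEGER PRIMARY KEY, name TEXT, category TEXT);
        CREATE TABLE tag (id INTEGER PRIMARY KEY, term TEXT UNIQUE);
        CREATE TABLE alert_tag (alert_id INTEGER, tag_id INTEGER);
    """)
    db.executemany("INSERT INTO alert VALUES (?, ?, ?)", [
        (1, "disk full", "team-a"),
        (2, "high latency", "team-b"),
        (3, "cpu load", "team-a"),
    ])
    db.executemany("INSERT INTO tag VALUES (?, ?)",
                   [(1, "team-a"), (2, "team-b"), (3, "prod")])
    db.executemany("INSERT INTO alert_tag VALUES (?, ?)",
                   [(1, 1), (1, 3), (2, 2), (2, 3), (3, 1)])
    return db


def by_category(db, category):
    """Single category: one indexed-column comparison, no joins."""
    rows = db.execute(
        "SELECT name FROM alert WHERE category = ? ORDER BY id", (category,))
    return [r[0] for r in rows]


def by_all_tags(db, tags):
    """Tags: two joins plus GROUP BY/HAVING to require every tag."""
    placeholders = ",".join("?" for _ in tags)
    sql = f"""
        SELECT a.name FROM alert a
        JOIN alert_tag link ON link.alert_id = a.id
        JOIN tag t ON t.id = link.tag_id
        WHERE t.term IN ({placeholders})
        GROUP BY a.id
        HAVING COUNT(DISTINCT t.term) = ?
        ORDER BY a.id
    """
    rows = db.execute(sql, (*tags, len(tags)))
    return [r[0] for r in rows]


db = build_db()
print(by_category(db, "team-a"))
print(by_all_tags(db, ["team-a", "prod"]))
```

The same join pattern would have to be repeated for annotations, which is where most of the extra complexity comes from.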

Skipping the many-to-many (tags) ability and having only one alert category makes things simpler for the user as well, and makes it possible to treat the alert category as a sort of severity (users who want Warning and Critical categories could set those up). The restriction is that two teams that each want their own categories would have to create team-specific ones, such as Team B Critical and Team B Warning.

Another reason that a single category property might be preferable is that it allows for it to include defaults for things like notification message & notifications.

Feedback would be much appreciated

RichiH · 2016-09-12T05:43:19Z

Severity

You need severity; if anything, you should also allow splitting it into urgency (call and wake on-call people) and importance (this needs to be done first, but potentially only during business hours). There should be a way to have a dashboard with "work everything off from top to bottom", sorted by importance, overridden by urgency. Otherwise you will create pager fatigue, because the smallest of blips creates the same alarms as a huge outage.

Categorization

I would vote for arbitrary labels with arbitrary values. That way, users can slice and dice their own data. Of course, a set of sane presets should be included to try and keep most people along a common set of nomenclature. @bergquist think Prometheus labels ;)
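
As a rough illustration of the "slice and dice" idea, assuming alerts carry a free-form `labels` dictionary (the label names `team`, `env` and `severity` below are made up for the example):

```python
def matches(labels, selector):
    """True when every key/value pair in selector is present in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def select(alerts, **selector):
    """Return the names of alerts whose labels match the selector."""
    return [a["name"] for a in alerts if matches(a["labels"], selector)]


ALERTS = [
    {"name": "disk full",
     "labels": {"team": "storage", "env": "prod", "severity": "critical"}},
    {"name": "slow queries",
     "labels": {"team": "db", "env": "prod", "severity": "warning"}},
    {"name": "disk full (staging)",
     "labels": {"team": "storage", "env": "staging", "severity": "warning"}},
]

print(select(ALERTS, team="storage"))
print(select(ALERTS, env="prod", severity="warning"))
```

Any label can serve as a filter dimension, which is the flexibility being asked for; the cost is that nothing in the model knows which labels exist or what their values mean.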

torkelo · 2016-09-12T06:11:41Z

Do we need severity? Many alert systems seem to skip it (Prometheus, DataDog, Librato).

RichiH · 2016-09-12T06:48:46Z

I think you do. And Prometheus allows arbitrary severities which you can define in freeform.

torkelo · 2016-09-12T06:51:29Z

@RichiH but there is no Severity concept in Prometheus right? just labels & label values.

So no way to filter on say alerts with severity above Warning.

wleese · 2016-09-12T06:51:49Z

IMHO severity, something revolving around how users respond to an alert, could be omitted completely if the notification provides enough flexibility.

Those who need 'warning' and 'critical' alerts could define 2 rules that would have different notifications like different endpoints (mail and sms/push notification).

Looking at our Nagios-based setup (>100,000 checks), the warning severity is merely used on dashboards, so that if someone notices the warning, they can take some preventive action. In reality this rarely happens, with the exception of fairly specific checks such as disk space, or when the warning hangs around long enough for someone to get annoyed by it.

That said, asking this question seems a bit like telling someone they can get a car for free and then asking which features they want. Expected answer: everything ;)

RichiH · 2016-09-12T07:01:27Z

@torkelo You can use the labels in whatever way you want, and you can route different alarms into different receivers. I.e. I don't escalate for a "disk over 95% full", but I show it in a dashboard. I do escalate on a "disk full in x hours/before normal working times start" - that's an importance & urgency concept baked into the language, even if it's not called that.

@wleese "warning" and "critical" is too short-sighted in my experience. You need a way to inform about stuff, warn about stuff, say "this is important" and "this is burning" - One-dimensionally, this maps to info, warning, high, critical.
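
A small sketch of that one-dimensional mapping, assuming the four level names above are given an explicit numeric order (the `Severity` enum and `at_or_above` helper are hypothetical). It also shows the kind of "severity above Warning" filter that free-form label values cannot offer without such an ordering:

```python
from enum import IntEnum


class Severity(IntEnum):
    """One-dimensional scale; a higher value means more severe."""
    INFO = 1
    WARNING = 2
    HIGH = 3
    CRITICAL = 4


def at_or_above(alerts, threshold):
    """Keep alerts at or above the threshold, most severe first."""
    kept = [a for a in alerts if a[1] >= threshold]
    return sorted(kept, key=lambda a: a[1], reverse=True)


alerts = [
    ("disk over 95% full", Severity.WARNING),
    ("disk full in 2 hours", Severity.CRITICAL),
    ("cert expires in 30 days", Severity.INFO),
    ("replica lag", Severity.HIGH),
]

for name, severity in at_or_above(alerts, Severity.HIGH):
    print(severity.name, name)
```

Splitting urgency from importance would need a second field with its own ordering; this sketch only covers the single-axis case.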

I do agree that this issue has potential to descend into bikeshedding, though.

hgomez-sonarsource · 2016-09-12T08:35:55Z

Zabbix uses 5 levels:

Disaster
High
Average
Warning
Information

Disaster is quite meaningful (stop everything and fix it now). High is something to take care of ASAP; the other levels are mostly informational.

theist · 2016-09-12T09:37:16Z

Agreed with @wleese: a warning is normally treated as "an alert that does not alert anyone". It is there because it is worth seeing, but nothing is bad yet. It is also useful through an API, instead of doing active watching: you run a program like nagstamon during work hours to prevent warnings from degrading into criticals. But warning levels are rarely used to wake on-call rotations.

Beyond ok/warning/critical everyone has their own book. This fits well for Nagios / Shinken users. Zabbix users may feel that it is not enough, but IMHO ok/warn/crit covers 80% of use cases and 50% of complex ones. If you need more complicated alerting, I would advise going to the source and writing scripts that cover the more complex stuff using the influx/prometheus/graphite/etc. data.

Regarding categorization, I'll vote for using dashboard tags or alert tags. Since Grafana is a tool for making dashboards, I'll want the alerts to be related to the dashboards in some way.

s4z · 2016-09-12T10:11:01Z

Would it be possible to include templated values from the dashboard as meta data in alerts?

I'd say OK, Warning and Critical (or even OK/Crit) are all that's required for levels. For categorisation, however, dashboard tags wouldn't work for our use case. Our dashboards can view many environments, and templating is used to select specific one(s). For probably more than 50% of the envs we wouldn't alert to PagerDuty, but some of them we would want to route to webhooks to feed back into performance testing systems.

Quite a while ago now I wrote a simple graphite monitor that used regexp groups to pull out key=value pairs from metrics and used these to route triggered events. It plugged into both Elastic and a performance testing system - triggers for testing systems of course routed to both and triggers for non-testing systems routed to Elastic only. The metric names included the environment names among other metadata.

That way I could use one set of rules to monitor all environments and it was possible to trigger different things based on the metric and the level (ok/warn/crit). I believe Bosun can do this as well if I'm not mistaken.
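
Roughly, the approach looked like the sketch below. This is a reconstruction of the idea, not the original monitor: the metric layout `<env>.<host>.<metric>`, the environment names and the target names are all assumptions made for the example.

```python
import re

# Assumed metric layout: <env>.<host>.<metric...>, e.g. "perf.web01.cpu.load"
METRIC_PATTERN = re.compile(r"^(?P<env>[^.]+)\.(?P<host>[^.]+)\.(?P<metric>.+)$")

# Hypothetical set of environments that belong to performance testing
TESTING_ENVS = {"perf", "loadtest"}


def extract(metric_name):
    """Pull key=value metadata out of a metric name via named groups."""
    match = METRIC_PATTERN.match(metric_name)
    return match.groupdict() if match else {}


def route(metric_name, level):
    """Decide where a triggered event goes, based on extracted metadata."""
    meta = extract(metric_name)
    targets = ["elastic"]                  # every trigger is recorded
    if meta.get("env") in TESTING_ENVS:    # testing systems get feedback too
        targets.append("perf-testing")
    return {"meta": meta, "level": level, "targets": targets}


print(route("perf.web01.cpu.load", "crit"))
print(route("prod.db02.disk.used", "warn"))
```

One pattern and one routing function cover every environment, because the environment comes from the metric name rather than from a per-environment rule.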

mhiller · 2016-09-12T16:37:50Z

Critical, Warn and OK are pretty much industry standard. I think they should be supported at a minimum.

fdelapena · 2016-09-12T17:31:04Z

Concerning levels, RFC 3877 mentions some levels based on ITU M.3100. The details in RFC 5674 may also be interesting for some ideas from syslog.

torkelo · 2016-09-13T09:34:42Z

After talking to a few people and getting some feedback from this thread we are moving forward with these alert states:

OK
Alerting
NoData
ExecError

We will remove the severity option and replace it with a more general Alert Category property which will allow users a simple way to organize & filter alerts. Users will of course be able to specify their own alert categories. The big limitation here will be that this categorization will only be 1-dimensional (only one category per alert).

I know many want the most flexible and powerful alerting system, where you can specify severity, labels, label values, etc. and filter by everything. Well, we can't do everything, so we will have to start simple :)

GowthamShanmugam · 2017-08-11T07:09:47Z

@torkelo Hi, I am Gowtham. I have one question about Grafana: when I send an HTTP API request to get a particular alert (api/alerts/1), it returns the alert details, but severity always comes back as an empty string ("").

How can I set a severity such as warning or critical?

{
"Id": 1,
"Version": 0,
"OrgId": 1,
"DashboardId": 1,
"PanelId": 1,
"Name": "Panel Title alert",
"Message": "",
"Severity": "",
.
.
.

torkelo · 2017-08-11T19:50:47Z

There is no severity option to set.

torkelo mentioned this issue Sep 12, 2016

building alerting system for grafana #2209

Closed

torkelo added the type/discussion and area/alerting labels Sep 12, 2016

torkelo closed this as completed Sep 13, 2016
