Alerting: Alert Rule Categorizing & Sorting, Severity, Annotation & Notification filtering #6007
**Severity**

You need severity; if anything, you should also allow splitting up urgency (call and wake on-call people) and importance (this needs to be done first, but potentially only during business hours). There should be a way to have a dashboard you can "work off from top to bottom", sorted by importance, overridden by urgency. Otherwise you will create pager fatigue, because the smallest of blips creates the same alarms as a huge outage.

**Categorization**

I would vote for arbitrary labels with arbitrary values. That way, users can slice and dice their own data. Of course, a set of sane presets should be included to try and keep most people along a common nomenclature. @bergquist think Prometheus labels ;)
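The "work off from top to bottom" queue described above could be sketched as a sort by importance with urgency overriding. This is a hypothetical illustration; the alert shape and the `urgency`/`importance` label names are my own assumptions, not a Grafana API.

```python
# Hypothetical sketch: order a work queue by importance, with urgency
# overriding, as described above. Alert shape and label names are assumptions.
from dataclasses import dataclass, field

@dataclass
class Alert:
    name: str
    labels: dict = field(default_factory=dict)

def work_queue(alerts):
    """Urgent alerts first, then descending importance."""
    return sorted(
        alerts,
        key=lambda a: (
            0 if a.labels.get("urgency") == "page" else 1,  # urgency overrides
            -int(a.labels.get("importance", 0)),            # then importance
        ),
    )

alerts = [
    Alert("disk 95% full", {"importance": 2}),
    Alert("site down", {"urgency": "page", "importance": 5}),
    Alert("cert expires in 20 days", {"importance": 4}),
]
print([a.name for a in work_queue(alerts)])
# ['site down', 'cert expires in 20 days', 'disk 95% full']
```

A huge outage with `urgency: page` always sorts above a small blip, which is exactly the pager-fatigue point made above.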
Do we need severity? Many alert systems seem to skip it (Prometheus, DataDog, Librato).
I think you do. And Prometheus allows arbitrary severities which you can define in freeform.
@RichiH but there is no ordering to them. So there is no way to filter on, say, alerts with severity above Warning.
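One common workaround for freeform severities having no inherent order is a site-defined rank table, which makes "severity above Warning" filterable again. This is a hypothetical sketch, not anything Prometheus or Grafana provides.

```python
# Hypothetical workaround: impose an order on freeform severity labels with a
# site-defined rank table, so "severity at or above X" becomes filterable.
SEVERITY_RANK = {"info": 0, "warning": 1, "high": 2, "critical": 3}

def at_least(alerts, threshold):
    """Keep alerts whose severity label ranks at or above the threshold."""
    cutoff = SEVERITY_RANK[threshold]
    return [a for a in alerts
            if SEVERITY_RANK.get(a.get("severity", "info"), 0) >= cutoff]

alerts = [
    {"name": "cpu", "severity": "warning"},
    {"name": "db down", "severity": "critical"},
    {"name": "fyi", "severity": "info"},
]
print(at_least(alerts, "high"))
# [{'name': 'db down', 'severity': 'critical'}]
```

The cost of the freeform approach is visible here: every site has to maintain its own rank table, and unknown severities need a default.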
IMHO severity, something revolving around how users respond to an alert, could be omitted completely if the notifications provide enough flexibility. Those who need 'warning' and 'critical' alerts could define two rules with different notifications, e.g. different endpoints (mail vs. SMS/push notification).

Looking at our Nagios-based setup (>100,000 checks), the warning severity is merely used on dashboards. If someone notices the warning, they could take some preventive action. But in reality this rarely happens, with the exception of fairly specific checks such as disk space, or when the warning hangs around long enough for someone to get annoyed by it.

That said, asking this question seems a bit like telling someone they can get a car for free and then asking which features they want. Expected answer: everything ;)
@torkelo You can use the labels in whatever way you want, and you can route different alarms to different receivers. E.g. I don't escalate for "disk over 95% full", but I show it in a dashboard. I do escalate on "disk full in x hours / before normal working hours start". That's an importance & urgency concept baked into the language, even if it's not called that.

@wleese "warning" and "critical" are too short-sighted in my experience. You need a way to inform about stuff, warn about stuff, and say "this is important" and "this is burning". One-dimensionally, this maps to info, warning, high, critical.

I do agree that this issue has the potential to descend into bikeshedding, though.
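The "disk full in x hours" alert mentioned above boils down to extrapolating the fill rate. A minimal sketch, where the `(unix_seconds, bytes_used)` sample format and all names are my own assumptions:

```python
# Sketch of a "disk full in N hours" predicate via linear extrapolation.
# Sample format (unix_seconds, bytes_used) and thresholds are assumptions.
def hours_until_full(samples, capacity_bytes):
    """Extrapolate fill rate from first/last sample; None if not growing."""
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    rate = (u1 - u0) / (t1 - t0)           # bytes per second
    if rate <= 0:
        return None                        # shrinking or flat: never fills
    return (capacity_bytes - u1) / rate / 3600.0

samples = [(0, 50_000), (3600, 60_000)]    # grew 10 kB over one hour
print(hours_until_full(samples, 100_000))  # ~4.0 hours until full
```

Paging on `hours_until_full(...) < 8` rather than on a raw "95% full" threshold is the urgency-versus-importance distinction from the comment above: one is actionable now, the other is dashboard material.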
Zabbix uses 5 levels. Disaster is quite meaningful (stop everything and fix it now), High is something to take care of ASAP, and the other levels are mostly informational.
Agreed with @wleese: warning is normally treated as "an alert that does not alert anyone". It is there because it is worth seeing, but nothing is bad yet. It is also useful via an API, instead of active watching: you run a program like nagstamon during work hours to prevent warnings from degrading into criticals. But warning levels are rarely used to wake on-call rotations.

Beyond ok/warning/critical everyone has their own playbook. This fits well for Nagios/Shinken users. Zabbix users may feel it is not enough, but IMHO ok/warn/crit covers 80% of use cases and 50% of the complex ones. If you need more complicated alerting I would advise going to the source: writing scripts that cover the more complex stuff using the Influx/Prometheus/Graphite/etc. data.

Regarding categorization, I'll vote for using dashboard tags or alert tags. Since Grafana is a tool for making dashboards, I'd want the alerts to be related to the dashboards in some way.
Would it be possible to include templated values from the dashboard as metadata in alerts? I'd say OK, Warning and Critical (or even OK/Crit) are all that's required for levels. However, for categorisation, dashboard tags wouldn't work for our use case: our dashboards can view many environments, and templating is used to select specific one(s). For probably more than 50% of the envs we wouldn't alert to PagerDuty, but some of them we would want to route to webhooks that feed back into performance testing systems.

Quite a while ago I wrote a simple Graphite monitor that used regexp groups to pull key=value pairs out of metric names and used these to route triggered events. It plugged into both Elastic and a performance testing system; triggers for testing systems routed to both, and triggers for non-testing systems routed to Elastic only. The metric names included the environment names among other metadata. That way I could use one set of rules to monitor all environments, and it was possible to trigger different things based on the metric and the level (ok/warn/crit). I believe Bosun can do this as well, if I'm not mistaken.
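The regexp-group routing described above could look something like this. It is a hypothetical sketch, not the commenter's actual monitor; the metric-name layout, environment names, and target names are all invented for illustration.

```python
# Hypothetical sketch of regexp-group routing: pull key=value pairs out of a
# metric name, then route the triggered event on the extracted fields.
import re

KV = re.compile(r"(\w+)=([\w-]+)")

def parse_metadata(metric_name):
    """Extract key=value pairs embedded in a metric name."""
    return dict(KV.findall(metric_name))

def route(metric_name, level):
    meta = parse_metadata(metric_name)
    targets = ["elastic"]                       # everything is logged
    if meta.get("env", "").startswith("perf"):
        targets.append("perf-testing-webhook")  # feed perf-test systems
    if level == "crit" and meta.get("env") == "prod":
        targets.append("pagerduty")             # only prod crits page
    return targets

print(route("env=prod.host=web-01.cpu", "crit"))
# ['elastic', 'pagerduty']
```

One rule set covers every environment; the metadata embedded in the metric name decides where each trigger goes.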
Critical, Warn and OK are pretty much industry standard. I think they should be supported at a minimum.
After talking to a few people and getting some feedback from this thread, we are moving forward with these alert states: OK, Alerting, NoData and Execution Error. We will remove the severity option and replace it with a more general Alerting state.

I know many want the most flexible and powerful alerting system, where you can specify severity, labels, label values, etc. and filter by everything. Well, we can't do everything, so we will have to start simple :)
@torkelo Hi, I am Gowtham. I have a question about Grafana: when I send an HTTP API request to list a particular alert (api/alerts/1), it gives the alert details, but severity always comes back as an empty string "". How can I set severity to something like warning or critical?
There is no severity option to set.
We have 2-3 design decisions we have to decide on soon that are all somewhat interconnected.
Alert Rule Severity
Today we have a severity option (Critical & Warning) and Alert States (OK, Unknown (NoData), Critical, Warning, Execution Error).
The problem here is that we are mixing Alert States (OK, Unknown, Execution Error) and letting the Firing State be viewed differently depending on severity (degree of importance associated with words like Critical & Warning).
Mixing these different things, severity (degree of importance, sortable) with rule status (not sortable), is a bit tricky.
We are a bit tempted to remove the severity option, or at least change it and separate it from the state. So we only end up with these alert states:
- OK
- NoData (formerly Unknown)
- Alerting (other alternative is Firing)
- Execution Error
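The four proposed states could be modeled as a simple enum. This is an illustrative sketch of the proposal, not Grafana's implementation; the `next_state` transition logic is my own assumption about how the states relate.

```python
# Sketch: the four proposed alert states as an enum. Names follow the list
# above; the transition check is illustrative, not Grafana's implementation.
from enum import Enum

class AlertState(Enum):
    OK = "ok"
    NO_DATA = "no_data"                  # formerly Unknown
    ALERTING = "alerting"                # alternative name: Firing
    EXECUTION_ERROR = "execution_error"

def next_state(query_failed, got_data, condition_breached):
    """Pick the state after one rule evaluation (assumed precedence)."""
    if query_failed:
        return AlertState.EXECUTION_ERROR
    if not got_data:
        return AlertState.NO_DATA
    return AlertState.ALERTING if condition_breached else AlertState.OK

print(next_state(False, True, True))  # AlertState.ALERTING
```

Note how severity disappears entirely: whether a firing rule is "warning-ish" or "critical-ish" is no longer part of the state, which is exactly the separation argued for above.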
Alert Rule Categorization
Right now there is no way to organize a large set of alert rules and the annotations (i.e. state change history) they create. The only thing that approaches categorization is notifications, where each alert can have its own set of notifications. But these are not handled as proper categorization properties, as there is no way to filter/group by notifications, nor is this something included in the alert annotation (state change event). This seems like it can quickly become a problem for even medium-sized Grafana installs where more than one team is using the alert feature.
Possible solutions:

- Tags on alert rules (arbitrary, many-to-many)
- A single `Alert Category` that can work like a single tag, but where we allow users to create these alert categories and specify options for each (color, defaults, message template, default notifications)

Of all these solutions, `Alert Category` is probably a lot easier/less complex to implement compared to using tags. Tags require a many-to-many lookup table between alert and alert category, as well as between the annotation and annotation category tables. Searching and filtering tables with many-to-many relations in SQL is a PITA.

Skipping the many-to-many (tags) ability and having only one alert category makes it simpler for the user as well, and makes it possible to look at the alert category as a sort of severity too (if some users want to set up `Warning` and `Critical` categories they could do that). The restriction here is that if two teams want their own categories, they would have to create `Team B Critical` and `Team B Warning`.

Another reason that a single category property might be preferable is that it allows it to include defaults for things like notification message & notifications.
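The schema trade-off above can be made concrete with a tiny sqlite3 illustration. The table and column names here are hypothetical, not Grafana's actual schema:

```python
# Illustration of the single-category vs. many-to-many-tags schema trade-off
# described above, using sqlite3. Names are hypothetical, not Grafana's schema.
import sqlite3

db = sqlite3.connect(":memory:")
# Single category: one column, filtering is a plain WHERE.
db.execute("CREATE TABLE alert (id INTEGER PRIMARY KEY, name TEXT, category TEXT)")
# Tags: a many-to-many join table, filtering needs a JOIN per lookup.
db.execute("CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE alert_tag (alert_id INTEGER, tag_id INTEGER)")

db.execute("INSERT INTO alert VALUES (1, 'disk full', 'Team B Critical')")
db.execute("INSERT INTO alert VALUES (2, 'high latency', 'Team B Warning')")

# Category filter: trivial.
rows = db.execute(
    "SELECT name FROM alert WHERE category = ?", ("Team B Critical",)
).fetchall()
print(rows)  # [('disk full',)]

# The tag equivalent of the same filter needs two joins:
#   SELECT a.name FROM alert a
#     JOIN alert_tag at ON at.alert_id = a.id
#     JOIN tag t ON t.id = at.tag_id
#    WHERE t.name = ?
```

The single-column version also makes "group annotations by category" a one-table `GROUP BY`, which is the filtering/grouping gap the issue body identifies.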
Feedback would be much appreciated