Alerting: Firing/Notification Severity (Critical / Warning / Info) #6553
Comments
I know you are trying to keep it simple, but for something like disk space free % I would definitely prefer multi-stage alerts with varying priorities.
Severities for alerts are something one expects, and when there are multiple alerts per graph, people will start working around this missing feature by, for example, sending emails to different email addresses. My take is that the severities should be per condition (e.g. for the same CPU Load alert you should configure the conditions for Warning/Critical/No Data). Also, the No Data condition should not fire an alert when transitioning from the initial state to the no-data state (e.g. when a new server is added). Btw. thanks for Grafana!
There should be Critical and Warning levels specified per condition. It helps users prioritize events. Also, alerting is a feature for power users; there is no need to keep it that simple.
There was a talk I saw a while back from the Etsy team or the Stack Exchange team that brought up a very good point: all alerts should be actionable. Which basically meant doing away with warning alerts, because when you get many of them you will consider them noise and they will be ignored.

For example, free disk space. Common practice is to warn on 10% free and go critical on 5% free. Regardless of the value, it requires someone to take action; if it's not done from the warning message then it's done from the critical message. What you have is just one alert state, the one that makes you take action.

What is still needed is an escalation path. If an alert is not acknowledged or fixed within some period of time, then send a second alert to a different group or channel. "Disk space is at 10% free" is the first alert and requires an action: free up some space, add more space, modify the alert, do something to handle it. If it's not fixed after 3 hours, for example, alerts could be sent to a second engineer, a team lead, or the stakeholders.
@elvarb I think escalation policies can be set in PagerDuty (obviously one has to pay for their subscription), which is integrated with Grafana!
True, but that is only one tool of many. In a sense, though, I agree with that method: have a dedicated alert handling tool, alerta.io for example. The groups I always see in that picture.
In a perfect world
Rather than explicitly adding a severity property to alerts, consider instead allowing multiple alerts per panel (this is more generally useful, too). To handle severity (and various other scenarios), you might add a colour property to the alerts for annotations, and implement variables that can be referenced in notifications/annotations. Perhaps the template variables would be best configured globally within the top-level Alerting configuration: a user could create a template var called 'severity' (with a default value?) which would then be available for population from each alert, and which could be referenced by notifiers. The notifiers would expose their output content as templates, allowing interpolation of the template variables. Exposing the output templates would be useful in general for users to be able to customise notifications. Including a severity var by default might be a good way to close off this issue, and would provide a default example of the functionality.
Multiple alerts per panel would be a fantastic addition. Figuring out how to display the alert lines would be interesting. Also, having alert severity levels would be extremely useful. A common example would be disk usage, with a warning at 80% and a critical alert at 90%. Having different alerting channels and custom notification messages for each alert level would be required.
I was super excited to get started with Grafana. Best tool I've worked with in so long, in so many ways. I was excited to find a way to model the capacity (0-100%) of multiple different services (of the same type) within just one graph, instead of having to create a new graph per service. However, I was disappointed to see 1) the single-alert-per-graph limit and 2) that the alert did not re-trigger when the metric value changed, provided that the overall alert state had not changed. E.g. it might be desirable to manage ~10-20 states / series within one graph. And (sorry if I'm waffling, but I got super excited with this tool) I think that's what #6041 is about. I think that'd be a great addition.
Any word on the progress of adding Criticality to alerting? |
Any update on this? |
Somewhat related PR: #19425, which lets a PagerDuty notification channel specify the severity instead of hardcoding it.
Any update on this? We would like to specify the severity of an alert; specifically, we'd like it to be included in the dashboard JSON schema. We saw that PR #19425 merged this functionality for PagerDuty, but we'd like it to be included for all alerts.
That PR puts the severity attribute on the notification channel. You could create a different channel for each severity: "Notify critical", "Notify warning", etc., and that channel selection would be in the dashboard JSON (I think). Agreed that's a bit of a hack; a severity on the actual alert would be a bit nicer, so the dashboard config isn't dependent on the channel config.
Ah I see, yes, we could use the Alerting API to create those various channels and then assign the "notifications" section of the alert JSON to that channel UID. Agreed, a bit of a workaround, but it would be great to see this feature come to the dashboard directly.
Any update that is not related to PagerDuty?
I think this is still relevant even with the addition of the next-gen alerts. With the new expressions, however, there seems to be a very easy way of achieving the required result (feel free to tell me this is already possible). It'd be great to be able to add a label based on the evaluated metrics/expressions. Example: if C could somehow add a label depending on some math output, this whole issue would be solved, as notifications can already be bound to specific label values.
Is there any update on this topic regarding NG-Alerts? For example, I want to build a RAM check. Currently I need three rules which each do: A: a PromQL query that results in some time series, where I then change the threshold of C to, let's say, 85 for Rule 1, 90 for Rule 2, and 95 for Rule 3. These rules then have a label called severity to push to a webhook, which is warning for Rule 1, major for Rule 2, and critical for Rule 3. Ideally there would be a way to either set my label text dynamically based on the value of $B, or to have multiple expressions in $C which then set a label depending on what matches. Is this still not possible without multiple rules?
That's how it currently works in Datadog: you have "severity thresholds" on top of the priority a specific Monitor gets.
For anyone still having this issue, this actually works, thanks to a member of the Slack community (https://grafana.slack.com/archives/C0Y4TLW74/p1662366010165239?thread_ts=1662121247.062299&cid=C0Y4TLW74): 'You can create a custom label severity with a value like `{{ if and (gt $values.B0.Value 5.0) (lt $values.B0.Value 7.5) }}critical{{ else }}warning{{ end }}`, adjusting the conditions to your needs.' I tried it and it worked, so I can dynamically assign a severity label.
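Building on that, the same templating approach can be extended to more than two tiers. A minimal sketch for the three-tier RAM check described earlier, assuming `B0` is the reduce expression holding the memory-usage percentage (the 90/95 thresholds and the warning/major/critical label values are illustrative):

```
{{ if gt $values.B0.Value 95.0 }}critical{{ else if gt $values.B0.Value 90.0 }}major{{ else }}warning{{ end }}
```

With this, a single rule can carry a severity label whose value depends on how far past the threshold the metric is, and notification policies can then route on that label.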
Hi! 👋 Just to add to what @m-wack said, this is the preferred approach to adding severity in Grafana Managed Alerts for the time being. I appreciate that adding this for each alert is quite laborious though. Perhaps we could look into providing something in the UI that made this easier if there is enough demand for it. Another approach you can use is to have two alerts with different severity labels. For example, in Prometheus/Mimir:
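The Prometheus/Mimir example referenced above does not appear in the thread text (likely an image). A minimal sketch of what such a rule group might look like, assuming a hypothetical `disk_used_percent` metric and illustrative thresholds — in Prometheus, two rules may share the same alert name as long as their label sets differ:

```yaml
groups:
  - name: disk-alerts
    rules:
      # Same alert name, distinguished by the severity label
      - alert: HighDiskUsage
        expr: disk_used_percent > 80
        for: 5m
        labels:
          severity: warning
      - alert: HighDiskUsage
        expr: disk_used_percent > 90
        for: 5m
        labels:
          severity: critical
```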
This also works in Grafana Managed Alerts, but with the exception that each alert must have a different name. For example, |
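A sketch of the Grafana-managed equivalent, where the two rules must be named differently (names, metric, and thresholds are illustrative):

```
HighDiskUsageWarning:  disk_used_percent > 80, labels: severity=warning
HighDiskUsageCritical: disk_used_percent > 90, labels: severity=critical
```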
When an alert rule is firing (in state Alerting), should there be different severity states as well?
By Severity I mean: Critical, Warn, Info, etc
This is very far from being worked on, but is a placeholder issue for feature requests & discussions of this nature.