Alerting: Firing/Notification Severity (Critical / Warning / Info) #6553

Open
torkelo opened this issue Nov 11, 2016 · 21 comments
Labels
area/alerting/notifications · area/alerting · type/discussion · type/feature-request

Comments

@torkelo
Member

torkelo commented Nov 11, 2016

When an alert rule is firing (in state Alerting), should there be different severity states as well?

By Severity I mean: Critical, Warn, Info, etc

  • How should severity be specified?
  • Per alert rule or per condition / threshold?

This is very far from being worked on, but serves as a placeholder issue for feature requests & discussions of this nature.

@roscoe57

I know you are trying to keep it simple, but for something like disk space free % I would definitely prefer multiple stages with varying priorities,
e.g. free space < 20% info (alert but optional notification), < 10% warning & notify, < 5% critical & notify.
My other use case is temperature crossing info/warn/severe thresholds.
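
For illustration, a minimal sketch of how such tiers can be expressed today as separate Prometheus-style alert rules, one per severity; the node_filesystem_* metric names and the rule name are assumptions for the example, not part of the original request:

groups:
- name: disk-space
  rules:
  # informational tier: less than 20% free
  - alert: DiskSpaceLow
    expr: node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 20
    labels:
      severity: info
  # warning tier: less than 10% free
  - alert: DiskSpaceLow
    expr: node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 10
    labels:
      severity: warning
  # critical tier: less than 5% free
  - alert: DiskSpaceLow
    expr: node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 5
    labels:
      severity: critical

A downstream notification policy can then route each severity label to a different channel, which is roughly the pattern discussed later in this thread.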

@calind

calind commented Nov 15, 2016

Severities for alerts are something one expects, and once there are multiple alerts per graph, people will start working around this missing feature by, for example, sending emails to different email addresses.

My take is that severities should be per condition (e.g. for the same CPU Load alert you would configure conditions for Warning/Critical/No Data). Also, the No Data condition should not fire an alert when transitioning from the initial state to the no-data state (e.g. when a new server is added).

Btw. thanks for Grafana!

@haron

haron commented Nov 15, 2016

There should be Critical and Warning levels specified per condition. It helps users prioritize events. Also, alerting is a feature for power users; there is no need to keep it that simple.

@elvarb

elvarb commented Nov 19, 2016

There was a talk I saw a while back, from the Etsy team or the Stack Exchange team, that brought up a very good point:

All alerts should be actionable.

Which basically means doing away with warning alerts, because when you get many of them you start to consider them noise and they get ignored.

For example, free disk space. Common practice is to warn at 10% free and go critical at 5% free. Regardless of the value, it requires someone to take action; if it's not done in response to the warning message, then it's done in response to the critical message.

What you have is just one alert state, the one that makes you take action.

What is still needed is an escalation path.

If an alert is not acknowledged or fixed within some period of time then send a second alert to a different group or channel.

"Disk space is at 10% free" is the first alert and requires an action: free up some space, add more space, modify the alert; something is done to handle it.

If it's not fixed after 3 hours, for example, alerts could be sent to a second engineer, a team lead, or the stakeholders.

@utkarshcmu
Collaborator

@elvarb I think escalation policies can be set in PagerDuty (obviously one has to pay for their subscription), which is integrated with Grafana!

@elvarb

elvarb commented Nov 19, 2016

True, but that is only one tool of many.

But in a sense I agree with that approach: have a dedicated alert-handling tool, alerta.io for example.

The groups I always see in that picture are:

  • metric collectors
  • queues
  • parsing
  • storing
  • visualizing
  • alert triggers
  • alert handling
  • issue tracking

In a perfect world

@pdf

pdf commented Jan 14, 2017

Rather than explicitly adding a severity property to alerts, consider instead allowing multiple alerts per panel (this is more generally useful, too).

To handle severity (and various other scenarios), you might add a colour property to the alerts for annotations, and implement variables that can be referenced in notifications/annotations.

Perhaps the template variables would be best configured globally within the top level Alerting configuration - a user could create a template var called 'severity' (with a default value?) which would then be available for population from each alert, and which could be referenced by notifiers.

The notifiers would expose their output content as templates, allowing interpolation of the template variables. Exposing the output templates would be useful in general for users to be able to customise notifications.

Including a severity var by default might be a good way to close off this issue, and provide a default example for the functionality.

@magnuspwhite

Multiple alerts per panel would be a fantastic addition. Figuring out how to display the alert lines would be interesting.

Also, having alert severity levels would be extremely useful. A common example would be disk usage, with a warning state at 80% and then a critical alert at 90%. Having different alerting channels and custom notification messages for each alert level would be required.

@yannispanousis

I was super excited to get started with Grafana. Best tool I've worked with in a long time, in so many ways. I was excited to find a way to model the capacity (0-100%) of multiple different services (of the same type) within just one graph, instead of having to create a new graph per service.

However, I was disappointed to see 1) the single-alert-per-graph limit and 2) that the alert did not re-trigger when the metric value changed, as long as the overall alert state had not changed.

E.g. it might be desirable to manage ~10-20 states/series within one graph. And (sorry if I'm waffling, but I got super excited about this tool) I think that's what #6041 is about. I think that'd be a great addition.

@SilentGlasses

Any word on the progress of adding Criticality to alerting?

@MichaelMitchellM

Any update on this?

@yemble
Contributor

yemble commented Sep 25, 2019

Somewhat related PR: #19425

Lets a PagerDuty notification channel specify the severity instead of just hardcoding critical.

@jpmcb
Contributor

jpmcb commented Dec 24, 2019

Any update on this?

We would like to specify the severity of an alert; specifically, we'd like it to be included in the dashboard JSON schema.

We saw that PR #19425 merged this functionality for PagerDuty, but we'd like it to be available for all alerts.

@yemble
Contributor

yemble commented Dec 24, 2019

That PR puts the severity attribute on the notification channel. You could create a different channel for each severity: "Notify critical", "Notify warning", etc., and that channel selection would be in the dashboard JSON (I think).

Agreed, that's a bit of a hack; a severity in the actual alert would be a bit nicer so the dashboard config isn't dependent on the channel config.

@jpmcb
Contributor

jpmcb commented Dec 24, 2019

Ah I see, yes, we could use the Alerting API to create those various channels and then assign the "notifications" section of the alert JSON to that channel UID.

Agreed, a bit of a workaround, but it would be great to see this feature come to the dashboard directly.

@aviadbi1

Any update that is not related to PagerDuty?

@siegenthalerroger

I think this is still relevant even with the addition of the next-gen alerts. With the new expressions, however, there seems to be a very easy way of achieving the required result (feel free to tell me this is already possible).

It'd be great to be able to add a label based on the evaluated metrics/expressions.

Example:
A: A PromQL query that results in some timeseries
B: A reduce expression (let's say average)
C: A math expression that evaluates whether $B is above/below a certain threshold.

Now if C could somehow add a label depending on the math output, this whole issue would be solved, since notifications can already be bound to specific label values.
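
For reference, a minimal sketch of binding notifications to a severity label via an Alertmanager-style notification policy; the receiver names (pagerduty-critical, slack-default) are assumptions for the example:

route:
  receiver: slack-default
  routes:
    # route critical alerts to the paging receiver
    - matchers:
        - severity = critical
      receiver: pagerduty-critical
    # warnings go to the default chat channel
    - matchers:
        - severity = warning
      receiver: slack-default

With routing like this in place, a single rule that sets the severity label dynamically would cover the use case described above.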

@m-wack

m-wack commented Oct 14, 2022

Is there any update on this topic regarding NG-Alerts?

For example, I want to build a RAM check. Currently I need three rules, each of which does:

A: A PromQL query that results in some timeseries
B: A reduce expression (let's say average)
C: A math expression that evaluates whether $B is above/below a certain threshold.

I then change the threshold in C to, let's say, 85 for Rule 1, 90 for Rule 2, and 95 for Rule 3.

These rules then carry a label called severity that is pushed to a webhook: warning for Rule 1, major for Rule 2, and critical for Rule 3.

Ideally there would be a way to either set my label text dynamically based on the value of $B, or to have multiple expressions in $C that set a label depending on which one matches.

Is this still not possible, so that multiple rules are needed?

@hajdukda

That's how it currently works in Datadog: you have "severity thresholds" on top of the priority a specific Monitor gets.

@m-wack

m-wack commented Dec 9, 2022

For anyone still facing this issue, this actually works, thanks to a member of the Slack community (https://grafana.slack.com/archives/C0Y4TLW74/p1662366010165239?thread_ts=1662121247.062299&cid=C0Y4TLW74).

'You can create a custom label severity with value like {{ if and (gt $values.B0.Value 5.0) (lt $values.B0.Value 7.5) }}critical{{ else }}warning{{ end }} adjusting the conditions to your needs.'

I tried it and it did work, so I can dynamically assign a severity label.
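
Adapted to the 85/90/95 RAM thresholds from the earlier comment, a hedged sketch of such a severity label value (assuming the reduce expression is exposed as $values.B0 as in the quoted snippet; newer Grafana versions may expose it as $values.B instead). Since the alert condition itself only fires above 85, everything below 90 maps to warning:

{{ if gt $values.B0.Value 95.0 }}critical{{ else if gt $values.B0.Value 90.0 }}major{{ else }}warning{{ end }}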

@grobinson-grafana
Contributor

Hi! 👋

Just to add to what @m-wack said, this is the preferred approach to adding severity to Grafana Managed Alerts for the time being. I appreciate that adding this for each alert is quite laborious, though. Perhaps we could look into providing something in the UI that makes this easier if there is enough demand for it.

Another approach you can use is to have two alerts with different severity labels. For example, in Prometheus/Mimir:

groups:
- name: example
  rules:
  # 95th percentile latency above 500ms for 5 minutes
  - alert: HighLatency
    expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
    for: 5m
    labels:
      severity: high
  # 95th percentile latency above 100ms for 5 minutes
  - alert: HighLatency
    expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.1
    for: 5m
    labels:
      severity: low

This also works in Grafana Managed Alerts, but with the exception that each alert must have a different name. For example, HighLatencyLowSeverity and HighLatencyHighSeverity.

Projects
Status: Backlog