Alerting support for queries using template variables #6557

Open
calind opened this Issue Nov 12, 2016 · 126 comments

@calind

calind commented Nov 12, 2016

It would be pretty useful if Grafana supported alerting for queries that use template variables. The way I see it working is as follows:

  1. Generate queries for each template variable combination (discarding the All option)
  2. When generating queries, use the frozen list if the template variable is set to never refresh; otherwise update the template variable list
  3. Allow filtering (through a regex or by providing a static value) for each template variable

The current workaround is to use an invisible wildcard metric, but the problem I see with this approach is that it loses context.
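
As a rough illustration of the three steps above, here is a minimal Python sketch. It is not Grafana code: `expand_queries`, the `$name` substitution rule, and the example metric path are assumptions made up for this example, and the variable values are assumed to be already resolved (frozen or refreshed, as in step 2) before the function is called.

```python
import itertools
import re


def expand_queries(template, variables, filters=None):
    """Return (values, query) pairs, one per template variable combination.

    template  -- query text containing $name placeholders (hypothetical syntax)
    variables -- dict: variable name -> list of already-resolved values
    filters   -- optional dict: variable name -> regex that a value must fully match
    """
    filters = filters or {}
    names = sorted(variables)
    allowed = []
    for name in names:
        pattern = filters.get(name)
        # Step 3: keep only the values that pass the per-variable filter.
        allowed.append([v for v in variables[name]
                        if pattern is None or re.fullmatch(pattern, v)])

    results = []
    # Step 1: one concrete query per combination of the remaining values.
    for combo in itertools.product(*allowed):
        values = dict(zip(names, combo))
        query = re.sub(r"\$(\w+)",
                       lambda m: values.get(m.group(1), m.group(0)),
                       template)
        results.append((values, query))
    return results


queries = expand_queries(
    "cpu.$host.$env",
    {"host": ["web1", "web2", "db1"], "env": ["prod"]},
    filters={"host": r"web\d+"},
)
for values, query in queries:
    print(values, query)
```

Each generated query would then be evaluated as its own alert, which is what would give one notification per failing host instead of one for the whole panel.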

@oiooj

Contributor

oiooj commented Nov 13, 2016

+1

@bergquist

Contributor

bergquist commented Nov 14, 2016

  1. What would be the difference compared to just using all?
@antoiner77

antoiner77 commented Nov 14, 2016

+1
It would be nice to be able to add alerting on servers with a short lifetime (AWS auto scaling). Auto-registering the servers in Grafana is easy with templating, but it's sad not to be able to put alerting on them.

@calind

Author

calind commented Nov 15, 2016

@bergquist it's impractical to use all, for example, when you have more than a dozen hosts.

[image]

If, for example, only a few of them are failing (let's say 5), it is very useful to receive an email for each failing alert. This also makes it much easier to integrate with other tools, which in general expect one alert per metric.

The current approach (using all) is pretty neat though when there are fewer instances or when you are alerting at service level (eg. # of jobs in queue).

@Deshke

Deshke commented Nov 15, 2016

What @calind said. I've got multiple $host variables which work fine with InfluxDB but not with the alerts.

@NotSoCleverLogin

NotSoCleverLogin commented Nov 18, 2016

+1 as well.

Just a thought, since you are able to query with a template variable, wouldn't you just be able to do the same query with the alerting metrics and maybe iterate through the results to see which meet the alert criteria?
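
To make the idea concrete, here is a small Python sketch of "run one query, then iterate through the results". The function name `firing_series`, the shape of the result (a dict of series name to a list of values), and the "last value above threshold" condition are all assumptions for illustration, not how Grafana evaluates alert rules.

```python
def firing_series(series, threshold):
    """Return the names of series whose most recent value exceeds the threshold.

    series -- dict: series name -> list of numeric values, oldest first
              (a hypothetical, simplified stand-in for a query result)
    """
    firing = []
    for name, points in series.items():
        # Series with no data are skipped rather than treated as alerting.
        if points and points[-1] > threshold:
            firing.append(name)
    return firing


result = {
    "web1": [10, 95],
    "web2": [20, 30],
    "db1": [],
}
print(firing_series(result, threshold=90))
```

Because every series is checked independently, each one could in principle carry its own alert state and its own notification.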

@bergquist

Contributor

bergquist commented Nov 18, 2016

@NotSoCleverLogin It would be possible. But would you want the behavior of the alert rule to change based on which template values are selected?

Using the all option for the template is the only way that makes sense for me.

@mstaalesen

mstaalesen commented Nov 22, 2016

+1

I have a setup of X environments with the same components in each environment. We are currently using prometheus to alert on e.g cpu usage/disk usage etc. There we specify an alert for a query, and when the alert is triggered it will just state which environment the alert was triggered from.

If we would do this with the All variable, that would work to some extent. But, using @calind's example, the screenshot would be filled with the trend of all cpus from all of my environments, and not just the environment where I would want to be informed about said problem. The graph will (or can) be obscured with information from other environments. In some scenarios it could be interesting to compare cpu in other environments, but there are no guarantees that what is happening in a test environment is happening in our production environment, etc.

We are also looking into creating dashboards that can be used by operations, showing annotations for alerts in the "standard" overview dashboard. Given that we use 'env' template variables for these kinds of dashboards, it's not really possible for us to do that with how it is implemented right now. I would have to manually (at least to some extent) generate a "shadow" dashboard where the alerts are triggered (which makes me lose the annotations in the overview dashboard).

Another thing I think template variables can help you do is to route the alerts (should you choose to implement such a feature) to different sources (some to operations if in production, to qa/developers if in test environments etc).

@StianOvrevage

StianOvrevage commented Nov 23, 2016

+1 for supporting alerts on templated queries.

@calind

Author

calind commented Nov 24, 2016

@bergquist, some dashboards don't have an All option. For example system metrics by collectd (https://grafana.net/dashboards/24). Having an All option would certainly not be practical for, let's say, 10 or more servers. That's why we need to iterate through template variables.

@StianOvrevage

StianOvrevage commented Nov 25, 2016

Allowing use of All is a good and welcomed start.

In Prometheus, queries need to be written in a different way to allow All:

some.metric{hostname=~"$Hostname"}

Notice the extra tilde there, allowing for regular expression searching (and the wildcard in All).

I have not benchmarked the possible performance impact of going from a straight query to a regex search query but at least for now it would apparently solve our problems.
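
For anyone wondering why the tilde matters, here is a small Python sketch of how a multi-value or All selection could be turned into a regex for that matcher. This is only an illustration under assumptions: `interpolate_regex` is a made-up helper, `None` stands in for All, and the values are assumed to contain no regex metacharacters, so nothing is escaped.

```python
def interpolate_regex(query, name, selected):
    """Replace $name in the query with a regex built from the selected values.

    selected -- list of chosen values, or None to mean "All" (an assumption
                of this sketch, not Grafana's representation)
    """
    if selected is None:
        value = ".*"                 # All: match every value of the label
    else:
        value = "|".join(selected)   # several values: regex alternation
    return query.replace("$" + name, value)


query = 'some.metric{hostname=~"$Hostname"}'
print(interpolate_regex(query, "Hostname", ["web1", "web2"]))
print(interpolate_regex(query, "Hostname", None))
```

With a plain `=` matcher the interpolated `web1|web2` or `.*` would be compared as a literal string, which is why the query has to use `=~`.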

@max3163

max3163 commented Nov 29, 2016

+1

@jordandev

jordandev commented Nov 29, 2016

+1

@steverweber

steverweber commented Dec 2, 2016

not sure how it should be implemented, just know it's needed..

@Krylon360

Krylon360 commented Dec 2, 2016

+1
We use Prometheus as the datasource to monitor our Kubernetes infrastructure for both our on-prem K8S clusters and our AWS K8S clusters.
All of our dashboards use templated variables for the datasource ($Environment), $Instance/Node, $Namespace, and $Pod.
Because of how Prometheus queries are structured, all of our queries contain templated variables, which prevents the alert rules from being saved.
I would love to see Templated Variable Queries added to the alerting.

@andrewawagner

andrewawagner commented Dec 6, 2016

+1

@bergquist bergquist referenced this issue Dec 7, 2016

Closed

alert not working with templating #6230

0 of 3 tasks complete
@shervinkh

shervinkh commented Dec 7, 2016

+1
We use templated dashboards for a multi-server environment, which is the logical way (and what many people do), so we can't use alerting with Grafana right now. The only options are to keep a separate non-templated dashboard or to set up alerting in Prometheus itself, which is not easy.

@steverweber

steverweber commented Dec 8, 2016

Perhaps if there were an option or a simple way to save/export a dashboard with the template variables baked/pre-rendered into all the fields... this would perhaps be a good halfway point until another solution is found.
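
A very rough Python sketch of what such an export could look like, working on an exported dashboard loaded as plain JSON data. Everything here is an assumption for illustration: `bake_variables` is a made-up helper, the `$name` replacement rule is simplified, and the example dashboard is a heavily trimmed, hypothetical layout rather than the real schema.

```python
import re


def bake_variables(node, values):
    """Return a copy of a JSON-like structure with $name placeholders
    replaced in every string; unknown placeholders are left untouched."""
    if isinstance(node, str):
        return re.sub(r"\$(\w+)",
                      lambda m: values.get(m.group(1), m.group(0)),
                      node)
    if isinstance(node, list):
        return [bake_variables(item, values) for item in node]
    if isinstance(node, dict):
        return {key: bake_variables(item, values) for key, item in node.items()}
    return node  # numbers, booleans and null are copied as-is


# Hypothetical, simplified dashboard fragment.
dashboard = {
    "title": "CPU $env",
    "panels": [{"id": 1, "targets": [{"expr": 'cpu{env="$env"}'}]}],
}
print(bake_variables(dashboard, {"env": "prod"}))
```

Running this once per environment would produce one variable-free dashboard per value, which alert rules could then be attached to.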

@daraeburn

daraeburn commented Dec 12, 2016

+1 for supporting alerts on templated queries. We currently use templating on all our dashboards so can't take advantage of this really cool feature.

@tsn77130

tsn77130 commented Dec 12, 2016

+1, we have a lot of templated dashboards and can't use alerting for now. We have to duplicate dashboards to get alerts, and so we lose the power of templating.

@drewboswell

drewboswell commented Dec 12, 2016

+1, Almost all of our dashboards use template variables (and nested template variables).

We would like to be able to set alerts on repeat panels to get individual alerts per template-variable group if needed. Plus this means that the alerting is dynamic and not super manual as it is now.

DANGER: Variables in theory will be good to have, but we need to keep in mind that if some guy goes into your dashboard and changes the value and saves, the resulting alerting will be affected. Don't know if that's ok behaviour or not, will be complicated.

@ebirukov

ebirukov commented Dec 12, 2016

+1

@erSitzt

erSitzt commented Dec 12, 2016

When working with grafana it feels like templating is encouraged everywhere and it feels wrong to create an extra set of graphs not using variables just to use the alerting feature...

@kanwangzjm

kanwangzjm commented Dec 13, 2016

+1 for supporting alerts on templated queries.
also, we found that when we use Chinese ruleName or Chinese title, we received abnormal email with rule triggered. For example, we expected “个股分时线接口请求时间(getTimeTrend) alert” but received "个è�¡å��æ�¶çº¿æ�¥å�£è¯·æ±�æ�¶é�´(getTimeTrend) alert", maybe the charset is not correct.

@JonathanTroyer

JonathanTroyer commented Dec 20, 2017

So one case where we don't have any of these issues is where the template variable is a constant type. For instance, we have multiple dashboards relying on a constant variable to limit the data on that dashboard to a particular resource (the reason we didn't use a multi-value variable is because each dashboard is different enough to justify different setups but close enough to justify a "template" dashboard). At least in this case (constant variables) nothing about the current alerting behavior needs to change.

@crazy-canux

crazy-canux commented Jan 24, 2018

Any news about this topic?

@ashuw018

ashuw018 commented Jan 28, 2018

Hi,

Is there any hope of getting this feature? I just wonder how other systems provide it, as they must also be using some sort of templating: when we install an agent, it automatically appears in the portal and alerting can be set for it. (This is my experience with New Relic.)

@vishwanathh I liked the approach of having a separate section for alerting (if it is too complicated to have it in the graph panel) where we can put queries used just for alerting. This way our users will not see the placeholder panel (used for alerting).

Sorry for the extra noise but this would be a really great feature to have in Grafana.

@tangyong

tangyong commented Jan 30, 2018

+1, very very important feature!

@tangyong

tangyong commented Jan 30, 2018

In addition, modifying the Prometheus metric query expressions to remove template variables is not feasible at all for me. So I think this feature is the most important one for getting Prometheus + Grafana into production!

Anyway, could the team please consider the priority? Thanks!

@pdf

pdf commented Feb 2, 2018

With 5.0 heading out the door shortly, I'd love to see some significant focus given to alerting during the next release series. Looking at GitHub reactions, alerting-related deficiencies appear to have far-and-away the most interest from users.

I know there has been some reluctance to tackle these things due to UI/UX complexity concerns, however I'm not convinced these concerns are necessarily justified. Is there anything we as users can do that might help planning/design or to move these issues forward, short of pull-requests with actual code?

@ashuw018

ashuw018 commented Feb 2, 2018

@torkelo This has helped me set up alerting for all of my hosts using tags, and now each of my alerting graphs contains multiple series formed by the combinations of tags. Everything seemed to be working fine. But going through the docs and other issues, I realized that if any series within a graph is already in the alerting state, alerts for other series will not be triggered even if they also cross the limit.

That's a limitation again.

Thanks.

@deiv061

deiv061 commented Feb 9, 2018

Any news about this feature ?

@spiffytech

spiffytech commented Feb 9, 2018

What's the effort for a new contributor to add this feature?

@sjayaraman

sjayaraman commented Feb 14, 2018

+1

@nookalavikas

nookalavikas commented Mar 19, 2018

Please allow template variables to be used for Alert Notifications.
+1

@amihura

amihura commented Mar 22, 2018

+1

@Moon-Tae-Kwon

Moon-Tae-Kwon commented Apr 2, 2018

We hope this gets resolved.

@rpelau

rpelau commented Apr 2, 2018

+1

@calebtote

Contributor

calebtote commented Apr 5, 2018

I don't want to beat a dead horse here, but we're having the same issue, and I want to provide some context as to why the existing proposals don't work in all circumstances. I also have a couple of ideas for workarounds, along with some features we'd need to make those workarounds sufficient.

For all scenarios below, we're using a single templated variable: $env

"Why not just create alerting dashboards?"

We want to alert on a couple different environments, not just production. So we'd now need to have the same metric in at least 3 different places (the troubleshooting dashboard with all metrics, not just the metric we alert for; the prod alerts dashboard; the integration alerts dashboard). This can get out of hand pretty quick, and is prone to user error.

Equally as important, this nullifies much of the gain from automated annotations from alerts. If I have to go back and forth from my exploratory dashboard to my alerts dashboard to see the annotations for when an event started and when it ended, that's going to be pretty tedious.

Attempted Solution

What we've done to try to get around this is we've added duplicate metrics specifically for alerting to our dashboards. So if there's a metric we want to alert on, we go to the panel and add explicit metrics for those alerts (and hide them).

Our series list for a given panel that needs to alert will look along the lines of:
screen shot 2018-04-05 at 4 53 57 pm

With the non-templated series marked as hidden. Then in the alerting tab, we set thresholds for these series, not the variable series.

screen shot 2018-04-05 at 4 40 19 pm

Problems with this solution

This doesn't work great though. For example:
screen shot 2018-04-05 at 4 43 21 pm

As you can see, the Alerts panel doesn't allow us to specify which environment is alerting -- so we have to drill down into the alert to figure out which environment is borked at the moment. However, an easy fix for this might be just allowing the description to be as verbose as the Alert History panel that shows state transitions:

screen shot 2018-04-05 at 4 44 33 pm

This is at least somewhat helpful, but even in this panel there's no indication of which alert has gone back to Healthy (the description from the above screenshot was derived from the alias we set on the series if anyone is wondering how to at least get that much to show up).

Things that would help until this specific ticket is resolved

  • Allow the option for displaying the series alias instead of a typed description for the alert (this way the alias can at least specify the $variable it's alerting for)
  • Allow the state transition back to healthy to also show the series alias (in the History screenshot above)
  • Allow a legend value for active alerts (using the series alias I assume) for a given panel

Things I'm not sure how to fix

Annotations on graphs that have alerts configured for multiple environments/variables:
screen shot 2018-04-05 at 4 42 41 pm

With this we can't really tell which alert is firing without going into the panel. The legend suggestion could help clarify this, but doesn't do much for the annotation if the correct $env isn't selected (in the above picture, int is alerting, but prod is the variable selected on the dashboard, so we're displaying annotations from the int alert over top of the graph using prod).

@unixway

unixway commented Apr 9, 2018

+1

@PheonixS

PheonixS commented Apr 9, 2018

plus one :)

@adamcstephens

adamcstephens commented Apr 9, 2018

Please stop +1'ing this issue. It generates unnecessary spam emails. The ability to add a reaction to a github issue comment has existed for a while now, and over 429 people have figured out how to like the initial comment instead of spamming everybody who is subscribed.

@manueligno78

manueligno78 commented May 23, 2018

Please, we really need this feature. We would like to use templating, but in our case it is more important to have a clear alerting system. So to work around this we are avoiding templating in our dashboards... it's a mess.

@thiagocorredor

thiagocorredor commented May 30, 2018

I agree and this feature will help us a lot !!!!

@kamzhuyuqing

kamzhuyuqing commented Jun 28, 2018

+1 please

@cmuzyunda

cmuzyunda commented Jun 28, 2018

we need this

@ChahatB

ChahatB commented Jun 29, 2018

+1 please. It's really needed.

@pdf

pdf commented Jun 29, 2018

@bergquist @torkelo can we please lock this issue to stop the +1 spam?

@grafana grafana locked as spam and limited conversation to collaborators Jun 29, 2018