building alerting system for grafana #2209
Comments
I'd love to help out with this! My suggestion would be to stick with the nagios-style guidelines. That way the tools could easily be used with other monitoring tools. e.g. Nagios, Zenoss, Icinga, etc.. |
The biggest thing about this feature is getting the basic architecture right. Some questions i would like to explore
Going more in depth into 1) |
I agree with torkelo. In my experience with other projects with everything "built-in" it can get quite cumbersome to troubleshoot. I like the idea of the service running externally, but a nice config page in grafana that talks to the service through the HTTP api to handle managing all the alerts. Also, for large scale deployments this would probably end up being a requirement as performance would eventually degrade (I would at least have this as a configuration option).
I think that could be a good place to start. Alert if it's set, don't if it's not. Back to number 1. I think that if the bosun service could run separately but still have the ability to completely configure everything through grafana, that would be, in my opinion, ideal. Keep up the awesome work. |
The only shortcoming I have seen with bosun is the data sources it can use. If you could leverage the language for expressing bosun alerting but also integrate with existing data sources that are configured via the regular grafana UI it would certainly be ideal. Being able to represent alerting thresholds, when you are close to them, as well as automatically push annotations for when they have triggered in my mind make an ideal single pane UI. Looking forward to the work that will be done here! |
Lastly; as we depend on Grafana more I admit i'm willing to say 2. could be something i'd be willing to pay for. |
I'm curious why people think this should be included into Grafana at all? |
Absolutely correct @dennisjac; Grafana only renders things. But as we've moved things server side it's no longer just client rendering; the possibilities of a worker process that could check your metrics and alert; is less difficult. Data is in a database; provided it's sprinkled with the data that tells it to check the metric ... Some people may agree or disagree that we should not cross the streams and make Grafana do more than visualize it (roughly) but I'm not them. |
I'm not really opposed to the feature for people who want it to be integrated but I hope it will be made optional for people who already have monitoring/alerting systems available. The new Telegraf project (metric collector from the influxdb guys) is also looking at monitoring/alerting features, which I dislike for the same reason. I elaborated on this here: |
I think torkelo has done a really good job at giving us features in Grafana2 that we don't have to enable. As far as influxdb they're going to have to make some money somehow; either off of support of influxdb and professional services or products for it. The latter sounds much more viable |
Another angle on this. There seems to be upcoming support for elasticsearch as a metric storage for grafana. Bosun can right now query elasticsearch for log data. Would it make sense when designing the alerting system to allow for alerts from log data as well? Maybe not a feature for the first version, but something that can be implemented later. Also I agree with the idea of splitting the processes. Have Grafana the interface to view and create alerts, have something else handle the alerting. Having the alerting part api based would also allow other tools to interface with it. |
+1 to Alerting. Outside DevOps usage, applications built for end users need to provide user defined alerts. Nice to have it in the visualization tool... |
+1 this will close the loop - the purpose of getting metrics. |
+1 Alerting from Grafana + a Horizontally Scaling Backend from InfluxDB will make them the standard to beat for Metrics Alerting Configurations |
+1 I'd love horizontal scaling of the alerting on multiple grafana nodes. |
It would be great if one could associate a "debounce" like behavior with an alert. For example, I want to fire an alert only if the defined threshold exceeds X for N minutes. I have seen this with some of the alerting tools, unfortunately we are currently using Seyren which doesn't appear to provide such an option. We are using Grafana for our dashboard development and are looking forward to pulling the alerting into Grafana as well. Keep up the good work. |
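The "debounce" behavior requested above can be sketched in a few lines. This is only an illustration of the idea, not anything Grafana or Seyren ships; the class name `DebouncedAlert` and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DebouncedAlert:
    """Fires only after the value has stayed above `threshold` for `hold_seconds`."""
    threshold: float
    hold_seconds: float
    breach_started: Optional[float] = None  # timestamp of the first breaching sample

    def observe(self, timestamp: float, value: float) -> bool:
        """Feed one sample; return True while the alert should be firing."""
        if value <= self.threshold:
            # Any sample back under the threshold resets the timer.
            self.breach_started = None
            return False
        if self.breach_started is None:
            self.breach_started = timestamp
        return timestamp - self.breach_started >= self.hold_seconds
```

With `DebouncedAlert(threshold=90, hold_seconds=300)`, a spike that lasts two minutes never fires, while a breach that persists for five minutes does.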
We have two use cases:
We would love to have a unified alerting system that handles alerts, flap detection, escalation and contacts. That helps us record and correlate events/operations in the same source of truth. A lot of systems have solved the alerting problem. I hope Grafana can do better at this in the long term; in the short term, not reinventing existing systems would be helpful in terms of deliverables. One suggestion is that Grafana could provide an API for extracting monitoring definitions (alerting state), so third parties can contribute configuration export plugins. This would be ideal for our use case of exporting nagios configuration. More importantly, I would love to see some integrated anomaly detection solution too!
|
I agree with @activars. I don't really see why a dashboard solution should handle alerting which is a more or less solved problem by lots of other tools, mostly quite mature. Do one thing and do it well. IMHO it would make more sense to focus on the integration part. Example: Define dynamic warn/crit thresholds in grafana (e.g. like in @Dieterbe example above) and provide an API (REST?) that returns the state (normal, warn, crit) of exactly this graph. A nagios, icinga, bosun etc. could request all the "monitoring" enabled graphs (another API feature), iterate through the individual states and do the necessary alerting. In our case service catalogs and defined actions are the hard part - which service is how business critical, where to send emails to, flapping etc. Also you would not have to worry about user / group management in grafana which most companies already have in a central place (AD, LDAP, Crowd etc.) and integrated with the alerting system. Also we have to consider that unlike a dashboard solution the quality requirements for an alerting tool can be considered much higher in terms of reliability, resilience, stability etc. which creates (testing) effort that shouldn't be underestimated. Also what about non-timeseries related checks, like calling a webservice, pinging a machine, running custom scripts...would you want that in grafana as well? I guess the bosun adoption would provide all this but I'm not really familiar with it. On the other hand I can imagine how a simple alerting system would make a lot of users happy that don't have a good alternative in place, but this could maybe be resolved with some example integration patterns for other alerting tools. |
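A rough sketch of the state API proposed in this comment, assuming each "monitoring" enabled graph carries its latest values plus a warn and a crit threshold. The function names and the dictionary layout are hypothetical, and "latest value at or above the threshold" is just one possible evaluation rule.

```python
from enum import Enum
from typing import Dict, Sequence


class State(str, Enum):
    NORMAL = "normal"
    WARN = "warn"
    CRIT = "crit"


def graph_state(values: Sequence[float], warn: float, crit: float) -> State:
    """Classify a graph by its most recent value against warn/crit thresholds."""
    if not values:
        raise ValueError("no datapoints to evaluate")
    latest = values[-1]
    if latest >= crit:
        return State.CRIT
    if latest >= warn:
        return State.WARN
    return State.NORMAL


def monitored_states(graphs: Dict[str, dict]) -> Dict[str, str]:
    """Return {graph name: state}, the shape an external poller could iterate over."""
    return {
        name: graph_state(g["values"], g["warn"], g["crit"]).value
        for name, g in graphs.items()
    }
```

An external tool such as nagios or icinga would then poll something like `monitored_states(...)` over HTTP and keep escalation, contacts and flap handling on its own side.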
As much as I want Grafana to solve all of my problems, I think falkenbt hit the nail on the head with this one. An API to expose the mentioned data, some plumbing in bosun, and some integration patterns with common alerting platforms makes a lot of sense. |
Congratulations on your new job at raintank @Dieterbe! I have been reading your blog for a while and you have some really sound ideas on monitoring, particularly regarding metrics and its place in alerting. I am confident that you will find a good way implementing alerting in grafana. As you probably would agree upon, the people behind Bosun are pretty much doing alerting the right way. The lacking thing with Bosun is really the visualizations. I would like to see Bosun behind the Grafana UI. Combining Grafanas dashboard and bosuns alerting behind the same interface would make for an awesome and complete monitoring solution. Also i think it would be a shame to fragment the open source monitoring community further, your ideas on monitoring seem to be really compatible with the ideas of the people behind Bosun. If you would unite i am sure the result would be great. Where i work we are using Elastic for logs/events and have just begun using InfluxDB for metrics. We have been exploring different solutions for monitoring and are currently leaning towards Bosun. We are already using Grafana for dashboards, but would like to access all our monitoring information through the same interface, it would be great if Grafana could become that interface. Keep up the great job, and good luck! |
@sudharsh your implementation sounds really interesting. Are you planning on releasing this to the wild? |
lots of good ideas, thanks everyone. As long as your handler supports querying the datastore you use. We would start off with simple static thresholds but later also want to make it easy to choose reduction functions, boolean expressions between multiple conditions, etc. @sudharsh that is a very nice approach. I like how your solution can talk directly to a remote API, bypassing the intermediate step described above (of course this does imply it only works for 1 given backend which we try to avoid), and that it can automatically reload the configuration. (you're right, bosun currently does not support it, it might in the future. FWIW the litmus handler does handle this fine and it uses bosun's expression evaluation mechanism). I never really got into riemann much. Mostly I've been concerned about adding such a different language to the stack that not many people understand or can debug when things go wrong. But I'm very curious to learn more about your system and about Riemann's CLJ code. (I'd love it if my suspicions are incorrect) @dennisjac yes it would be optional. Does anyone have any thoughts on what kind of context to ship in notifications like emails? |
I like the general approach of docker - batteries included, but removeable. So a basic alerting implementation that can be swapped out would be a good approach imho. |
influxdb will be supported for alerting ? or only graphite ? |
One thing I would like to see is the idea of hierarchical alert trees. There are simply too many facets being monitored and stand alone alert states have an unmanageable cardinality. With a hierarchy tree, I can define all these low level alerts which roll up to medium level alerts which roll up to high level ...... As such, each rolled up alert automatically assumes the highest severity of all the children below it. In that way, I can get an impression of [and manage] system health accurately with a much lower surface area of analysis. This is an example I have borrowed from an old document I wrote a while ago. Yes, please chuckle away at the use of the word "Struts". It's OLD, ok? This presents a very simple hierarchy for one server. At some point, the server experiences sustained 75% CPU utilization, so this trips these alerts into a warning state: CPU-# --> CPU --> Host/OS --> System If one really applied themselves, one could keep an eye on an entire data center with one indicator. (yeah, not really, but this serves as a thought exercise) |
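The roll-up rule described here (a parent takes the worst severity found anywhere below it) can be sketched as a small tree. The `AlertNode` class and the three severity names are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List

# Higher number means more severe.
SEVERITY = {"ok": 0, "warning": 1, "critical": 2}


@dataclass
class AlertNode:
    name: str
    own_state: str = "ok"
    children: List["AlertNode"] = field(default_factory=list)

    def state(self) -> str:
        """Return the most severe state among this node and all its descendants."""
        states = [self.own_state] + [child.state() for child in self.children]
        return max(states, key=SEVERITY.__getitem__)


# The CPU-# --> CPU --> Host/OS --> System chain from the comment above.
cpu0 = AlertNode("CPU-0", own_state="warning")
cpu1 = AlertNode("CPU-1")
cpu = AlertNode("CPU", children=[cpu0, cpu1])
host = AlertNode("Host/OS", children=[cpu, AlertNode("Memory")])
system = AlertNode("System", children=[host])
```

Here `system.state()` evaluates to `"warning"` because one leaf is in a warning state, while `cpu1.state()` stays `"ok"`.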
Why not use graphite-beacon? I think you could merge graphite-beacon, which is very light, with grafana. |
@felixbarny I like that terminology. we'll likely adopt that wording. |
@Dieterbe is it possible to have an update of the current status ? for alerting system |
@Dieterbe Any ETA for alerting support for OpenTSDB? |
@sofixa Thanks, should have looked at the roadmap myself, case of not RTFMing. Appreciated nonetheless. |
i don't work on alerting anymore. maybe @torkelo or @bergquist can answer. |
Any ETA for alerting support for OpenTSDB |
@LoaderMick @naveen-tirupattur OpenTSDB alerting is added to Grafana, should be a part of the next release. Also, the alerting for OpenTSDB is working in the nightly builds. |
Any ETA for alerting support for influxDB and prometheus too? |
@nnsaln alerting for both data sources is already in master branch. |
I can't seem to get the alerting working with OpenTSDB (Grafana v4.0.0-pre1 (commit: 578507a)). I tested the email system (working) but the alerts just don't fire even when I have a very low threshold. Is there any way to run the queries manually and see the data that it is pulling? |
Grafana v4.0.0-pre1 (commit: 9b28bf2) |
@torkelo |
Hi guys, will Grafana support alerting for queries using template variables or is there a target release for this? |
All, please try 4.0 beta; if something is missing, open new issues.
Richard
|
I've tried 4.0 beta, but I still got this error |
I cannot save the "send to" field of alert notifications: after I save, the "send to" row becomes blank again |
@nnsaln You're supposed to fill notification target there, not email address. Open the grafana side menu and hover over the Alerting menu option, then hit the Notifications menu options. There you can setup a notification target that you can use from your alert rules. |
Is there any plan to support template variables along with alerting ? I do
understand each graph generated by a (or set) template variable corresponds
to a different graph and hence generating alert against a static value is
not correct.
|
No, there is currently no support to do this. Maybe in far future but |
99% of dashboards use template variables. They were designed with template
variables to avoid "dashboard explosion" problem.
|
Yes, but a generic exploration dashboard is not the same as a dashboard designed for alert rules. So far there has not been a proposal for how to support template variables in an intuitive / understandable way. What should an alert query with a variable do? Interpolate with the current saved variable value, with all? Should it treat every value as a separate rule and keep state for each, etc.? Supporting templating variables opens up a can of worms for complexity and potentially confusing behavior. It might be added some day if someone comes up with a simple and understandable way. |
In the meantime nothing stops you from creating separate alert dashboards.
Alerting is new and a huge addition to grafana. It will evolve over time,
but in the short time it was implemented it added huge value to grafana,
and thanks to all contributors for that!
|
+1 Torkel.
It does make alerting fairly complicated.
|
@bergquist regarding this comment
Is there a ticket to track the progress? Any branch to contribute? And big thanks for the nice job! |
Kern,
<3 grafana.
I was just trying to share thoughts around alerting with template
dashboards.
…On Fri, Dec 9, 2016 at 2:53 AM, Dmitry Zhukov ***@***.***> wrote:
@bergquist <https://github.com/bergquist> regarding this comment
alerting within grafana does not support HA yet. Our plan is to add
support to partition alerts between servers in the future
Is there a ticket to track the progress? Any branch to contribute?
And big thanks for the nice job!
|
@jaimegago to create alerts programmatically use the dashboard api, alerts are saved along with a panel & dashboard. |
@torkelo How about notifications targets (e.g. create a new notification email via API) ? edit: Answering to myself here, I found the api/alert-notifications endpoint. I think it just needs to be documented |
Of course there is an http api for that, just go to alerting notifications page, add a notification and check the http api call grafana makes |
@torkelo, is there any API that can be used to create an alert (not an alert notification) programmatically? |
@CCWeiZ Alerts are part of the dashboard JSON, so you can only create a dashboard that contains alerts, not alerts on their own. You can read more about the dashboard api on http://docs.grafana.org/http_api/dashboard/ |
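To make the "alerts live inside the dashboard JSON" point concrete, here is a sketch that only builds a request body of the kind one would POST to the dashboard API (`/api/dashboards/db`); it does not send anything. The alert field names (`conditions`, `evaluator`, `reducer`, `frequency`, and so on) and the `rows`/`panels` layout are my recollection of what Grafana 4.x writes into exported dashboards and should be treated as assumptions: export a dashboard with an alert from your own instance and compare before relying on them. The function name is hypothetical.

```python
import json
from typing import Any, Dict


def dashboard_with_alert(title: str, metric_target: str, threshold: float) -> Dict[str, Any]:
    """Build a dashboard-save request body containing one graph panel with an alert."""
    # Assumed Grafana 4.x alert shape: avg of query A over the last 5m is above threshold.
    alert = {
        "name": f"{title} alert",
        "frequency": "60s",
        "handler": 1,
        "noDataState": "no_data",
        "executionErrorState": "alerting",
        "notifications": [],
        "conditions": [{
            "type": "query",
            "query": {"params": ["A", "5m", "now"]},
            "reducer": {"type": "avg", "params": []},
            "evaluator": {"type": "gt", "params": [threshold]},
            "operator": {"type": "and"},
        }],
    }
    panel = {
        "id": 1,
        "type": "graph",
        "title": title,
        "targets": [{"refId": "A", "target": metric_target}],
        "alert": alert,  # the alert is saved along with the panel
    }
    dashboard = {"id": None, "title": title, "rows": [{"panels": [panel]}]}
    return {"dashboard": dashboard, "overwrite": False}


body = dashboard_with_alert("Requests", "stats.site.requests", 100.0)
print(json.dumps(body, indent=2))
```

The printed JSON is what would go into the body of the dashboard save request, together with whatever authentication your Grafana instance requires.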
Is this available: I want to set up an alert for when a value, compared to 3 days ago, is not increasing. (Say for requests: if the current value minus the value from 3 days ago is < 100, then we say there are not many requests.) How can I do this? |
Hi everyone,
I recently joined raintank and I will be working with @torkelo, @mattttt , and you, on alerting support for Grafana.
From the results of the Grafana User Survey it is obvious that alerting is the most commonly missed feature for Grafana.
I have worked on/with a few alerting systems in the past (nagios, bosun, graph-explorer, etsy's kale stack, ...) and I'm excited about the opportunity in front of us:
we can take the best of said systems, but combine them with Grafana's focus on a polished user experience, resulting in a powerful alerting system, well-integrated and smooth to work with.
First of all, terminology sync:
I want to spec out requirements, possible implementation ideas and their pros/cons. With your feedback, we can adjust, refine and choose a specific direction.
General thoughts:
Many alerting systems are more basic (define expression/threshold, get notification when breached), for those it seems integration is not worth the pain (though I won't stop you)
The integrations are a long term effort. I think the low hanging fruit ("meet 80% of the needs with 20% of the effort") can be met with a system
that is more closely tied to Grafana, i.e. compiled into the grafana binary.
That said, a lot of people confuse separation of concerns with "must be different services".
If the code is sane, it'll be decoupled packages but there's nothing necessarily wrong with compiling them together. i.e. you could run:
That said, we don't want to reinvent the wheel: we want alerting code and functionality to integrate well with Grafana, but if high-quality code is compatible, we should use it. In fact, I have a prototype that leverages some existing bosun code. (see "Current state")
but they should be able to take the same or similar alerting rule definitions (thresholds, boolean logic, ..), they mostly are about how the actual rules are executed and don't
change much about how rules are defined. Since polling is much simpler and should be able to scale fairly far this should IMHO be our initial focus.
Current state
The raintank/grafana version currently has an alerting package
with a simple scheduler, an in-process worker bus as well as rabbitmq based, an alert executor and email notifications.
It uses the bosun expression libraries which gives us the ability to evaluate arbitrarily complex expressions (use several metrics, use boolean logic, math, etc).
This package is currently raintank-specific but we will merge a generic version of this into upstream grafana. This will provide an alert execution platform but notably still missing is
these are harder problems, which I hope to tackle with your input.
Requirements, Future implementations
First off, I think bosun is a pretty fantastic system for alerting (not so much for visualization)
You can make your alerting rules as advanced as you want, and it enables you to fine-tune over time, backtest on historical data, so you can get them just right.
And it has a good state machine.
In theory we could just compile bosun straight into grafana, and leverage bosun via its REST api instead of Golang api, but then we have less finegrained control and
for now I feel more comfortable trying out piece by piece (piece meaning golang package) and make the integration decision on a case by case basis. Though the integration
may look different down the road based on experience and as we figure out what we want our alerting to look like.
Either way, we don't just want great alerting. We want great alerting combined with great visualizations, notifications with context, and a smooth workflow where you can manage
your alerts in the same place you manage your visualizations. So it needs to be nicely integrated into Grafana. To that end, there's a few things to consider:
to the input series
Basically, there's a bunch of stuff you may want visualized (V), and a bunch of stuff you want alerts (A), and V and A have some overlap.
I need to think about this a bit more and wonder what y'all think.
There will definitely need to be 1 central place where you can get an overview of all the things you're alerting on, irrespective of where those rules are defined.
There's a few more complications which I'll explain through an example sketch of how alerting could look like:
let's say we have a timeseries for requests (A) and one for erroneous requests (B) and this is what we want to plot.
we then use fields C,D,E to put stuff that we don't want to alert on.
C contains the formula for ratio of error requests against the total.
we may for example want to alert (see E) if the median of this ratio in the last 5min is more than 1.5 times what the ratio was in the same 5-minute period last week, and also
if the errors seen in the last 5min is worse than the errors seen since 2 months ago until 5min ago.
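As a rough illustration of the first condition in this sketch (median error ratio over the last 5 minutes versus the same window a week earlier), here is a small Python version. It is not the bosun expression itself; the function names are hypothetical, and the inputs are assumed to be already-fetched, equally spaced samples for the two windows.

```python
from statistics import median
from typing import List, Sequence


def error_ratio(requests: Sequence[float], errors: Sequence[float]) -> List[float]:
    """Per-sample ratio of erroneous requests (B) to total requests (A)."""
    # Samples with zero requests are skipped to avoid dividing by zero.
    return [e / r for r, e in zip(requests, errors) if r > 0]


def ratio_regressed(
    now_requests: Sequence[float],
    now_errors: Sequence[float],
    last_week_requests: Sequence[float],
    last_week_errors: Sequence[float],
    factor: float = 1.5,
) -> bool:
    """True if the current median ratio exceeds `factor` times last week's median.

    Raises statistics.StatisticsError if either window has no usable samples.
    """
    current = median(error_ratio(now_requests, now_errors))
    baseline = median(error_ratio(last_week_requests, last_week_errors))
    return current > factor * baseline
```

Using the median rather than the mean keeps a single bad sample in either window from flipping the result.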
notes:
other ponderings:
display a threshold line
stats.$site.requests and stats.$site.errors, and we want to have separate alert instances for every site (but only set up the rule once)? what if we only want it for a select few of the sites. what if we want different parameters based on which site? bosun actually supports all these features, and we could expose them though we should probably build a UI around them.
I think for an initial implementation every graph could have two fields, like so:
where the expression is something like what I put in E in the sketch.
for logic/data that we don't want to visualize, we just toggle off the visibility icon.
grafana would replace the variables in the formulas, execute the expression (with the current bosun based executor). results (state changes) could be fed into something like elasticsearch and displayed via the annotations system.
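One way to picture "replace the variables in the formulas" with one alert instance per site is a simple expansion step. This sketch uses Python's `string.Template`, whose `$name` syntax happens to match the `$site` notation above; the expression text and site names are made up, and this says nothing about how bosun or Grafana actually interpolate.

```python
from string import Template
from typing import Dict, Iterable


def expand_rule(expression: str, sites: Iterable[str]) -> Dict[str, str]:
    """Produce one concrete expression per site from a single templated rule."""
    template = Template(expression)
    return {site: template.substitute(site=site) for site in sites}


rules = expand_rule(
    "stats.$site.errors / stats.$site.requests > 0.05",
    ["eu", "us"],
)
for site, expr in rules.items():
    print(site, "->", expr)
```

Restricting the rule to a select few sites is then just a matter of which values are passed in, and each expanded expression can keep its own alert state.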
Thoughts?
Do you have concerns or needs that I didn't address?