Add an alerting engine #71

Closed
jd opened this issue Jun 5, 2017 · 17 comments

Comments

@jd
Member

jd commented Jun 5, 2017

There's no alerting engine in Gnocchi. Gnocchi should offer an easy way to trigger actions when, e.g., a threshold is crossed.

The Aodh project from OpenStack supports Gnocchi, but it does that mainly by polling the API regularly. It's pretty slow at the end of the day and OpenStack specific.

We might need to add some other features first, but this issue should be a placeholder to discuss a design.

@jd jd added the enhancement label Jun 5, 2017
@chungg
Member

chungg commented Jun 28, 2017

random idea: if we mark a metric as 'has alert', when we aggregate, we would know this, and we can broadcast to an alert engine to evaluate rule? although this might be a bit aggressive and create a lot of extra rule evaluations at the beginning of new metric... not sure how we schedule triggers for alert evaluations either.
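
A minimal sketch of this idea, assuming a hypothetical hook that the aggregation worker calls once it has computed a new aggregate for a metric. `ThresholdRule`, `AlertHook`, and `on_aggregated` are illustrative names, not part of Gnocchi:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ThresholdRule:
    """Hypothetical rule: breached when the latest aggregate exceeds `threshold`."""
    metric_id: str
    threshold: float

    def breached(self, value: float) -> bool:
        return value > self.threshold


class AlertHook:
    """Hypothetical hook invoked by the aggregation worker after each metric."""

    def __init__(self, notify: Callable[[str, float], None]) -> None:
        # Only metrics marked as "has alert" appear as keys here.
        self._rules: Dict[str, List[ThresholdRule]] = {}
        self._notify = notify

    def add_rule(self, rule: ThresholdRule) -> None:
        self._rules.setdefault(rule.metric_id, []).append(rule)

    def on_aggregated(self, metric_id: str, latest_value: float) -> None:
        # Metrics without rules cost a single dictionary lookup.
        for rule in self._rules.get(metric_id, ()):
            if rule.breached(latest_value):
                self._notify(metric_id, latest_value)
```

Keeping the rule lookup keyed by metric ID is one way to limit the extra evaluations mentioned above: a metric with no alert attached never reaches the rule code.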

@jd
Member Author

jd commented Jun 29, 2017

@sileht has exactly the same idea. Since metricd loads all the data to do aggregation, it seems it's the best place to also do the alerting evaluation.

@sileht
Member

sileht commented Jun 30, 2017

Ceph has an interesting API to monitor objects: http://docs.ceph.com/docs/master/rados/api/librados/#rados_watch3

But I have no idea how it scales.

@chungg
Member

chungg commented Jul 4, 2017

watch seems interesting. i'm not sure what the notification it's sending is... if we read the object, will it notify as well?

it does create a watch object so it will bump up the storage size (not sure how much)

@lucasponce

Hi @jd,

In Hawkular Alerting we are working on a 2.x version built from lightweight components that can be used separately and independently.
We have started to work on the Federated Alerting concept, adding support for several sources such as Prometheus, Elasticsearch, and Kafka.
I have also thought about Gnocchi, and I wonder whether it would be interesting to add a plugin for it.
We have written a blog post about it:
http://www.hawkular.org/blog/2017/08/alerts-multiple-sources.html
And a demo showing what it would look like:
https://www.youtube.com/watch?v=mM1mwJneKO4

The aim is to fill a gap where today an integration between systems is mandatory.

I would be interested in your feedback and in whether this could be useful for the Gnocchi ecosystem.

Thanks !

@chungg
Member

chungg commented Aug 25, 2017

@lucasponce is hawkular polling prometheus/elasticsearch with the query defined in 'expression'?

@lucasponce

lucasponce commented Aug 25, 2017

@chungg Yes. Hawkular defines Triggers, which contain Conditions. Through these objects Hawkular can read data from multiple sources, such as Prometheus, Apache Kafka, or Elasticsearch (or your own source of data, using an API).
We can also define several modes, like polling or streaming, and add value with more complex scenarios such as:
http://www.hawkular.org/blog/2017/08/alerts-nelson-rules.html
http://www.hawkular.org/blog/2017/01/13/events-aggregation-extension.html
An intro to this component can be found here:
https://www.slideshare.net/ponceballesteros/hawkular-alerting
(In the 2.x version we have eliminated the Cassandra dependency, and Hawkular Alerting is just a single process.)
Please let us know what you think.

@chungg
Member

chungg commented Aug 25, 2017

i see. i imagine it'd behave similar to how Aodh interacts with Gnocchi currently (ie. create an alarm definition with query, run query periodically, trigger action, repeat).
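
A rough sketch of that polling cycle, not Aodh's actual implementation. `AlarmDefinition`, `evaluate_once`, and `poll` are hypothetical names, and the `query` callable stands in for whatever request would fetch the latest aggregate from Gnocchi:

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AlarmDefinition:
    """Hypothetical alarm: a query, a threshold, and an action to trigger."""
    query: Callable[[], Optional[float]]  # returns the latest aggregate, or None
    threshold: float
    action: Callable[[float], None]


def evaluate_once(alarm: AlarmDefinition) -> bool:
    """Run the query once and trigger the action if the threshold is exceeded."""
    value = alarm.query()
    if value is not None and value > alarm.threshold:
        alarm.action(value)
        return True
    return False


def poll(alarm: AlarmDefinition, interval_s: float, rounds: int) -> int:
    """Evaluate the alarm `rounds` times, sleeping `interval_s` between rounds."""
    fired = 0
    for i in range(rounds):
        if evaluate_once(alarm):
            fired += 1
        if i < rounds - 1:
            time.sleep(interval_s)
    return fired
```

The detection delay in this model is bounded by `interval_s`, which is the cost the streaming approach discussed in this thread tries to avoid.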

i don't see having Hawkular support Gnocchi as a bad thing; a good thing actually, since it removes barriers to adoption. that said, i think the hope is that we don't need to define another polling interval and instead leverage the fact that when Gnocchi stores data, it already has time series loaded, and therefore it can check and trigger alarms as data is entered into Gnocchi rather than be dependent on a polling cycle. if we can do that, i still think we can connect to Hawkular by possibly 'streaming' the triggered alarm to Hawkular rather than have Hawkular poll Gnocchi.

@lucasponce

Yes, the streaming approach was the primary integration for Hawkular Metrics as well.
When a metric is inserted into the system, it queries a cache to check whether the metric has a trigger defined; if so, the data is sent to the alerting engine for evaluation.
We also introduced the complementary 'polling' approach, as there are scenarios where polling could be more interesting.
For example, as described in the Nelson rules post, streaming could be good for detecting advanced behaviours around a metric; but for other scenarios, such as comparing periods (average of today vs. yesterday), polling could be more appropriate.
The interesting thing about Hawkular Alerting is that the alerting engine was born independent (it is not tied to Hawkular Metrics), which lets us send data from different sources without being tied to the specific format of any one implementation.
Auto-resolution of problems is another nice feature that we think adds value: detect when something went wrong (e.g. a bad response time of an application) and detect when it's back to normal. Dealing with lifecycle states can help and enrich the model.
I agree that alerting on thresholds doesn't require a complex alerting engine (it can even be done externally with simple scripts polling regularly), but when collecting a high number of metrics, complexity grows, and then this type of engine can be very interesting.

Perhaps we can explore the streaming approach in Gnocchi. A plugin would then probably be necessary, so that Gnocchi could find out which metrics have definitions and send their data to the alerting engine.
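
As an illustration of the cache check described above, here is a minimal filter-and-forward sketch. It is not Hawkular's or Gnocchi's code: `TriggerFilter`, `watch`, and `ingest` are hypothetical names, and `forward` stands in for whatever transport delivers data to the alerting engine:

```python
from typing import Callable, Iterable, List, Set, Tuple

# (metric_id, timestamp, value): an assumed shape for an incoming measure.
Measure = Tuple[str, float, float]


class TriggerFilter:
    """Hypothetical ingest-side filter: forward only metrics that have triggers."""

    def __init__(self, forward: Callable[[List[Measure]], None]) -> None:
        self._watched: Set[str] = set()  # in-memory cache of metric IDs with triggers
        self._forward = forward

    def watch(self, metric_id: str) -> None:
        self._watched.add(metric_id)

    def unwatch(self, metric_id: str) -> None:
        self._watched.discard(metric_id)

    def ingest(self, measures: Iterable[Measure]) -> int:
        """Forward the watched subset of `measures`; return how many were sent."""
        batch = [m for m in measures if m[0] in self._watched]
        if batch:
            self._forward(batch)
        return len(batch)
```

Measures for metrics without a trigger are dropped before any network call, so the cost for unwatched metrics is one set lookup each.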

@chungg
Member

chungg commented Aug 29, 2017

are there existing 'streaming' plugins? i'm not sure what plans the other contributors have for implementing this functionality. i don't have it queued up so not sure when we can move forward with this.

@lucasponce

I think we can try two ways.
The first wouldn't need any Gnocchi development: we can create a polling plugin in Hawkular Alerting (similar to the ones we have for Elasticsearch/Kafka/Prometheus) that fetches data and processes it in streaming. It is not a pure "streaming" approach, but it could help to provide a first Gnocchi/Hawkular Alerting integration.
The second approach would implement a pattern similar to the one we used in Hawkular Metrics, meaning that the Gnocchi side would push data into Hawkular Alerting for evaluation, filtering for those metrics that have triggers defined.
This last proposal would be pure streaming, but it would need some kind of plugin on the Gnocchi side.
I am not familiar with the Gnocchi architecture; is this kind of logic doable, and could it be implemented as something optional?

@chungg
Member

chungg commented Aug 29, 2017

seems like a sane approach to me 👍

@seuf

seuf commented Aug 30, 2017

InfluxDB has the same approach: you can create "subscribers" on time series, and each time a new value arrives for a series, it is streamed to the subscriber (for example Kapacitor for alerting).
Maybe metricd should implement a streaming method to forward all the metrics to another process, which will choose whether or not to process the data.

@chungg
Member

chungg commented Sep 7, 2017

it streams the entire series?

i was thinking maybe after it processes a metric, it just sends a msg to a udp/MQ target and whatever alerting engine is consuming the message will know that it should trigger a request to gnocchi for data and an evaluation on their end.

this is different from my original thought of leveraging the fact that gnocchi already loads the series when adding new measures. i don't believe that works because gnocchi doesn't actually load the entire series when adding new measures (just the back_window, which i guess still works if the back_window matches the timespan of your alarm)
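
A small sketch of the UDP variant of that notification, using only the Python standard library. The function name and the JSON field names (`metric_id`, `event`) are assumptions for illustration; nothing in this thread defines a message format:

```python
import json
import socket


def notify_metric_processed(metric_id: str, host: str, port: int) -> bytes:
    """Send a fire-and-forget datagram saying that `metric_id` has new aggregates.

    The payload layout is hypothetical. The consumer is expected to query
    Gnocchi for the data it needs and run its own evaluation.
    """
    payload = json.dumps({"metric_id": metric_id, "event": "processed"}).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload
```

UDP gives no delivery guarantee, so a lost datagram means a missed evaluation until the next measure arrives; an MQ target would trade that simplicity for reliability.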

@lucasponce

lucasponce commented Sep 8, 2017

Hawkular Alerting doesn't need the entire series.
When you have a new data point, and that metric has a trigger associated, just send it for evaluation.
With the right design this can be made efficient and avoid overloading the system with unnecessary data movement (i.e. only move data when there are triggers defined), and yes, in essence it is the fastest way.
An alternative (which is what I have in mind as a first approach) is to query from the alerting engine in a buffered way but still process in streaming. This is less efficient, as you need a poll design versus a push, but it can offer a preliminary way of integrating (I plan to prepare a prototype in the following days; sorry for my delay in the thread, as I was on PTO).

@chungg
Member

chungg commented Sep 8, 2017

oh. so hawkular keeps a running history of the datapoints corresponding to its evaluation period? that works as well.

is there a specific format/schema and protocol used to stream data?

@lucasponce

> oh. so hawkular keeps a running history of the datapoints corresponding to its evaluation period? that works as well.

Yes, it depends on the type of condition and evaluation performed, but the alerting engine is stateful in the sense that the state of an evaluation is stored. This helps to perform operations like "dampenings" (I want to be alerted when x < 10 happens 5 times out of 10 evaluations, or during a period of time), "rates"/"stateful evaluation" (when I need to evaluate a stream of data points), or even "missing" (alert me when I haven't received data, or something has not happened in the system, in the last 5 minutes). Also, when an alert is fired, there is a second part to explain why, and for that purpose we store this info to provide details (this is also optional).
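
To make the "5 times out of 10 evaluations" dampening concrete, here is a minimal stateful sketch. It is an illustration of the idea only, not Hawkular Alerting's implementation; the `Dampening` class and its `update` method are hypothetical names:

```python
from collections import deque
from typing import Deque


class Dampening:
    """Fire only when the condition held at least `required` times
    within the last `window` evaluations."""

    def __init__(self, required: int, window: int) -> None:
        if not 0 < required <= window:
            raise ValueError("need 0 < required <= window")
        self._required = required
        # Bounded history: the oldest evaluation is dropped automatically.
        self._results: Deque[bool] = deque(maxlen=window)

    def update(self, condition_met: bool) -> bool:
        """Record one evaluation result and report whether the alert should fire."""
        self._results.append(condition_met)
        return sum(self._results) >= self._required
```

For the example above, the engine would call `Dampening(required=5, window=10).update(x < 10)` for each incoming data point, which is why the evaluation state has to be stored between data points.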

> is there a specific format/schema and protocol used to stream data?

Yeah, a REST API with JSON objects to send data/events to evaluate:

http://www.hawkular.org/docs/rest/rest-alerts-v2.html#POST__data
http://www.hawkular.org/docs/rest/rest-alerts-v2.html#POST__events_data

In version 1.x we could also use a native Java API to access these services, but for the 2.x version we have slimmed down the whole architecture and we are focusing on the REST API only.

If interested, Hawkular Alerting provides a nice set of examples to jump into:
https://github.com/hawkular/hawkular-alerts/tree/master/examples

But it is also good to collect requirements about what the ideal way to integrate could be.


6 participants