
RFC: a new design for alert system #3963

Closed
paulfantom opened this issue Jul 13, 2018 · 19 comments

Comments

@paulfantom
Contributor

As of today, alarm-notify.sh has 2304 lines of Bash with multiple code repetitions and extensive boilerplate. Adding a new alert destination comes down to copying a bunch of code and pasting it with minor alterations. There is an attempt to reduce this in #3960, but that does not solve the fundamental problem: Bash simply isn't made for this.

To solve that, I can think of two possible solutions, which aren't mutually exclusive:

  1. gradually rewrite alarm-notify.sh in Python, Go, or another language
  2. introduce a modular approach, the same way it is done now for gathering data.

As for the second point, I was thinking of something like python.d.plugin but for the alert system; temporarily, let's call it alert.d.plugin. This plugin would just send alerts to a specific alert module (like slack.alert) and take care of things like alert deduplication, management, and routing. (A rough sketch follows the example flow below.)

Example alert flow:

                 email.alert
                     ^
+-------+            |
|netdata|    +--------------+
| core  |--->|alert.d.plugin|-----> slack.alert
+-------+    +--------------+
  1. netdata core emits a generic alert
  2. alert.d.plugin parses the generic alert and routes it to the specified module(s), e.g. email.alert AND/OR slack.alert
  3. the specific module takes care of actually sending the alert to the end user
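
To make the routing step concrete, here is a minimal sketch - the alert fields, module names, and routing table are all invented for illustration, in Python only because python.d.plugin is the model:

```python
# Hypothetical sketch of the alert.d.plugin routing step; the alert fields,
# module names, and routing table below are invented for illustration.
import json

def email_alert(alert):                        # stands in for email.alert
    print("email ->", alert["name"], alert["status"])

def slack_alert(alert):                        # stands in for slack.alert
    print("slack ->", alert["name"], alert["status"])

MODULES = {"email.alert": email_alert, "slack.alert": slack_alert}
ROUTES = {"cpu_usage": ["email.alert", "slack.alert"]}  # invented routing config

def handle(raw_line):
    """Parse a generic alert emitted by netdata core and fan it out."""
    alert = json.loads(raw_line)
    for name in ROUTES.get(alert["name"], []):
        MODULES[name](alert)

handle('{"name": "cpu_usage", "status": "CRITICAL", "value": 95.0}')
```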

Since alerting should be available on every system and Python isn't available everywhere, I would suggest using Go, but I don't have any strong opinion on the choice of language.

@ktsaou @l2isbad @Ferroin

@paulfantom changed the title from "RFC for a new design for alert system" to "RFC - a new design for alert system" on Jul 13, 2018
@Ferroin
Member

Ferroin commented Jul 13, 2018

So, somewhat ironically, I'm actually working on a notification daemon in Python that replicates a lot of the functionality of alarm-notify.sh, plus a few extra features:

  • Jinja2 based templating for formatting alerts.
  • Queueing, with retries on failure, and the ability to resend notifications after a configured interval.
  • An easy-to-work-with API for adding new delivery methods (modeled after netdata's python.d.plugin interface).

It's not yet complete to the point of being able to post it, but I've got most of it written right now (core logic, logging, basic queue backends, and a couple of basic delivery methods; I'm working on the configuration parsing at the moment, after which I should be able to start testing all of it as one unit instead of as individual modules). I have no idea whether this would be of direct interest for netdata, but I was planning on providing the pieces with it so it can work with netdata as a (near) drop-in replacement for alarm-notify.sh (and of course writing up a netdata plugin to monitor it too). The only dependencies outside the Python standard library at the moment are Jinja2 and PyYAML, though it requires Python 3.6 or newer (once I have it working, I plan to try to get it working with older versions too, but I'm not optimistic about that).
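
For a rough idea of the shape of such a delivery-method API, here is a minimal sketch - the base class, method names, and template are my invention, not the daemon's actual code:

```python
# A minimal sketch of what a python.d-style delivery method could look like;
# the DeliveryMethod base class, method names, and template are invented.
from jinja2 import Template

class DeliveryMethod:
    """Every delivery module would subclass this and implement send()."""
    def send(self, alert):
        raise NotImplementedError

class StdoutDelivery(DeliveryMethod):
    """Trivial example: render the alert with a Jinja2 template and print it."""
    template = Template("[{{ status }}] {{ host }}: {{ name }} = {{ value }}")

    def send(self, alert):
        # 'alert' is assumed to be a dict of fields received from netdata
        print(self.template.render(**alert))

StdoutDelivery().send(
    {"status": "CRITICAL", "host": "web1", "name": "cpu_usage", "value": 97}
)
```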

@paulfantom
Contributor Author

notification daemon in Python

So essentially, what you are writing is a separate application which would run alongside netdata?

@ktsaou
Member

ktsaou commented Jul 13, 2018

hm... interesting ideas. Thank you for this!

Another approach would be to move alarm-notify.sh inside netdata. This would give the following benefits:

  1. netdata will dispatch notifications even when the disk is suddenly unresponsive (quite a frequent situation in VMs - AWS, Azure, etc. - when the storage layer is having difficulties).

  2. netdata already has an alarm log in it and keeps track of various statuses; it also saves the log and loads it back, so there is no need to replicate it. It already deduplicates to avoid sending notifications for alerts that have recovered.

  3. no other dependencies - netdata should be IoT friendly, and another language is another issue to solve there.

There would be two drawbacks:

  • to reconfigure notifications, we would have to send SIGUSR2 to netdata (reload health configuration)
  • I am the only C dev currently available - so this could take time

If we mix the above with the Go approach, alert.d.plugin could be a plugin that is always running (with a pipe from netdata to it), and netdata could just push the transaction log to it in real time. The plugin would have to report back a status for each notification, so that netdata could let the dashboard know.

It will be a little bit complicated, because the netdata daemon is multi-host (due to streaming), so the plugin would have to add/remove hosts and alarm logs as they are added/removed in netdata.

So, this would also require a small protocol to be implemented for the bidirectional communication between netdata and the plugin.
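
For illustration only, such a protocol could look like netdata's existing line-based plugin protocol - every verb and field below is invented:

```
# netdata writes '>', the plugin answers '<'; all verbs/fields are invented
> HOST CREATE host1                         # a streamed host appeared
> ALERT 42 host1 cpu_usage WARNING CRITICAL 95.0
< ACK 42                                    # accepted for dispatch
< STATUS 42 slack.alert OK                  # dispatch result, reported async
> HOST REMOVE host1                         # the host disconnected
```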

@ktsaou
Member

ktsaou commented Jul 13, 2018

Yet another solution could be to create a daemon for managing the alarms, to which one or more netdata instances could connect permanently to push the alert log.

That daemon could also expose an API for the dashboards to learn the status of the alarms.

@paulfantom
Contributor Author

Somehow I am not in favor of another daemon for this; the all-in-one solution works right now and is easy for new users. Preferably, an implementation following the hexagonal architecture pattern would fit here, allowing quick extraction into a separate application if needed in the future.

I was also considering moving alerting into core netdata, as it seemed the most optimal option, but since only @ktsaou knows C, this would potentially harm development time and future expansion.

It seems to me that the mixed approach might be the most future-proof, despite some initial implementation problems.

As for the bi-directional communication protocol - gRPC?

@ktsaou
Member

ktsaou commented Jul 15, 2018

Somehow I am not in favor of another daemon for this

me neither, but it is an option, so it should be stated.

Ideally, each netdata should be autonomous, but it should also allow centralized notifications. Centralized notifications would allow things we can't support today, like web push notifications and mobile push notifications; they would eliminate the need for a working MTA on all servers for mail delivery, simplify the dependencies for containers, etc.

So, a simple solution could be to improve alarm-notify.sh a bit to make it more efficient and at the same time support a mechanism to bypass alarm-notify.sh entirely and use a central server for dispatching all notifications.

So, the netdata daemon could be enhanced with at least these:

  1. retry notifications that failed to be dispatched
  2. repeat notifications that have not cleared ([RFE] send notifications repeatedly until the alert is resolved #2956)
  3. allow variables in alarms' info lines (health monitoring enhancements #809)
  4. simplify and modularize alarm-notify.sh (Generalize the recipient finding logic and reduce the boilerplate code. #3960) - shell is a good choice for most sysadmins to customize
  5. push all alarm changes to a central server in real time - i.e. the exec line of alarms and the script to execute on alarm option in netdata.conf could support push SERVER:PORT to send the notifications to a registry-like netdata server (see the sketch below)
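
A hypothetical illustration of point 5 in a health entity file - today the exec line takes a script path, and the push SERVER:PORT form plus the server address are invented here:

```
# hypothetical syntax - 'exec' currently takes a script path; the
# 'push SERVER:PORT' form and the server address are invented
 alarm: ram_in_use
    on: system.ram
  exec: push alerts.example.com:19999
```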

@paulfantom
Contributor Author

I don't think the netdata core (the part written in C) should manage alerts itself; that part should be offloaded to a plugin or external daemon. The core would just focus on emitting alerts when a condition applies and would not have any logic for resending, deduplicating, and all that. In that model there is a question of whether we want to resend an alert to the plugin/daemon every X seconds or only on a rising/falling edge (a trigger-based approach), but that is just an implementation detail.

If we could put the whole logic of managing alerts into a plugin/daemon application (points 1, 2, 3, and 5 from @ktsaou's post), then it would be easier to maintain and possibly extend that part. Let's use the experience of other monitoring projects like Prometheus or InfluxDB, where an instance emits alerts to some alert management app which in turn routes them via the proper channel.

The idea is to mix the two worlds: create a modular plugin which would communicate with netdata over gRPC via a unix socket. This would allow us to easily package and ship netdata while still keeping the possibility of centralized alert management, by switching from the unix socket to TCP/IP communication.

One could probably ask: why not completely offload alert management to a daemon app? The answer is simple - better user experience. This way we ship a complete package with everything in it, without the need for another program. Just consider how much it simplifies netdata deployment in containerized environments.

Idea schematics:

[schematic image]

  • NC - Netdata Core
  • alert.d.daemon - either netdata with alert.d.plugin or a standalone application
  • XYZ.alert - module responsible for sending alerts

The left side shows a simple setup for one netdata instance - basically the one which should be enabled by default.
The right side shows a centralized alert management solution.
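
A minimal sketch of the unix-socket vs TCP/IP swap, using Python's grpcio - the socket path, hostname, and port are invented, and the service definition is not shown:

```python
# Only the channel target changes between the two deployment modes shown in
# the schematic; the paths and addresses here are invented for illustration.
import grpc

# left side of the schematic: plugin reachable over a local unix socket
channel = grpc.insecure_channel("unix:///run/netdata/alert.d.sock")

# right side: identical client code, only the target changes
# channel = grpc.insecure_channel("alerts.example.com:50051")
```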

@ktsaou
Member

ktsaou commented Jul 15, 2018

hm... I don't see a gRPC implementation for C - it only exists for C++. But a simple socket protocol could also do the job.

The whole idea with the communication between the netdata core and alert.d is that it has to be async. So, the netdata core will send the request and receive an ACK immediately/synchronously, but the actual status of the notification will come later, once the notification has been dispatched. So, it is purely 2-way async communication.

Also, in standalone mode, it could also work over pipes, to avoid having a separate daemon running.

Generally, the above operation would probably be better implemented with a queue, so that it will survive disruption - for example, using http://zeromq.org/. Check this video about ZeroMQ, it is nice: https://www.youtube.com/watch?v=_JCBphyciAs

@paulfantom
Contributor Author

Maybe this one https://github.com/Juniper/grpc-c ?

As for pipes - I don't know whether it wouldn't be better to focus on one communication system/protocol instead of creating two separate implementations.

ZeroMQ could also work; it is used as the default communication protocol for SaltStack to push configuration changes, so it is quite proven.

@paulfantom
Contributor Author

Both gRPC and ZeroMQ could be used for async communication. However, there is one advantage of gRPC over ZeroMQ: it is layered over HTTP, so it can use existing network infrastructure. And since more people know how HTTP works, it would probably be easier for them to configure things like firewalls, data encryption, flow control, etc.

However, ZeroMQ looks like a slightly faster protocol (for async communication it needs a ROUTER-DEALER connection, not REQ-REP).

There is quite a nice comparison of both on Stack Overflow: https://stackoverflow.com/questions/39350681/grpc-and-zeromq-comparsion
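
A minimal pyzmq sketch of that ROUTER-DEALER pattern - the endpoint and message contents are invented; unlike REQ-REP, the DEALER side can keep sending without waiting for replies, and the ROUTER side can answer out of order:

```python
# Minimal ROUTER-DEALER demo with pyzmq; endpoint and messages are invented.
import zmq

ctx = zmq.Context()

router = ctx.socket(zmq.ROUTER)            # alert management side
router.bind("ipc:///tmp/alert.d.sock")

dealer = ctx.socket(zmq.DEALER)            # netdata side
dealer.connect("ipc:///tmp/alert.d.sock")

# fire off two alerts back to back - no lockstep reply required
dealer.send(b"ALERT 1 cpu_usage CRITICAL")
dealer.send(b"ALERT 2 ram_in_use WARNING")

# the router receives an identity frame first and can reply whenever it wants
identity, msg = router.recv_multipart()
router.send_multipart([identity, b"ACK 1"])

print(dealer.recv())                       # -> b'ACK 1'
```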

@paulfantom
Contributor Author

Also, as for the C library for gRPC, it is here: https://github.com/grpc/grpc/tree/master/src/core

@Ferroin
Member

Ferroin commented Jul 16, 2018

@paulfantom Yes, it's designed to run as a separate daemon. My primary motivation for writing it is to have only one thing on my systems actually dispatching notifications, instead of the current situation of half a dozen, and thus to have all the notification configuration in one place. Long term, I don't think it's likely to be the best option for netdata, but I wanted it out there as something that could be an alternative delivery mechanism (and I am serious about providing the required pieces to make it trivial to use with netdata; the same goes for everything else on my systems that sends alerts).

As far as long-term usage, what I see as an ideal solution would meet all these criteria:

  1. It's trivially possible to add new delivery mechanisms on a single system without needing a compiler or development tools. The current alarm-notify.sh mechanism allows for that. As long as custom notification support is preserved (ideally by allowing a script to be called with the same parameter format that alarm-notify.sh takes), this is taken care of pretty trivially.
  2. It's efficient for the common case of infrequent notifications. Stupid as it sounds, the current approach actually fits that to a degree: it has zero overhead until there's a notification (and, believe it or not, it's probably more efficient than if it were written in Python against the same interface - CPython interpreters take forever to start up). This really matters for IoT systems and other cases with limited resources.
  3. It's easy enough to understand that people can add new delivery mechanisms without needing a lot of knowledge about how it all works. alarm-notify.sh does not currently meet this requirement, but that's mostly because of how much boilerplate code is needed.
  4. It's reasonably independent of netdata's core functionality. Not everybody needs alerts, and it might turn out to be useful enough on its own that it could be split into a separate piece of software which netdata bundles.

@paulfantom
Contributor Author

paulfantom commented Jul 16, 2018

Ad 1: That's why I wanted to have the alert.d.plugin manager plugin, so anyone could add another XYZ.alert script responsible for sending alerts. This script could then be written in any language - compiled or not - as it would be executed by alert.d.plugin when needed.

Ad 2: As for the efficiency of Bash vs Python - it depends on many variables. I am also not saying that alert.d.plugin should be written in Python, but I am sure it shouldn't be written in Bash (for the same reasons netdata doesn't use charts.d.plugin so much).
Also, time is not so much of an issue when sending alerts, since delivery depends on many external providers. Basically, there is no difference whether an alert is sent 0.001s or 2s after the incident. On the other hand, the time of the actual incident is important.
System resources are a different thing, and they should be given a lot of consideration.

common case of infrequent notifications

Actually, that is something that should be considered. Trigger-based alerting works well in non-distributed systems. However, in distributed ones, constant alert resending is considered a feature, since it allows a master to "see" whether a slave is still active or has become unavailable (disconnected from the network, or under high system load) - it is a heartbeat signal. If there were a way of providing the heartbeat feature in another, less expensive way, then a trigger-based system would still be OK.

Ad 3: If there were an alert.d.plugin with a pluggable architecture, then adding a new delivery system could be as easy as adding a new metrics module. A developer would only need to know what the API looks like and where to put the new file.

Ad 4: That's what I proposed in the chart. 😄 It should communicate with netdata via a specified protocol (I am looking into whether there is some open-source spec), so anyone who would like to send alerts through it would just communicate using that protocol.

From what I can see, the daemon you are writing is an extended version of prometheus/alertmanager. It is efficient (written in Go), easy to understand (the config is a YAML file), and it is a separate application. What is different is the ability to add new delivery mechanisms on the fly.

I also want to add that we SHOULD NOT create new communication protocols and should rely as much as possible on ones that are already established. Relevant: https://xkcd.com/927/.

@Ferroin
Member

Ferroin commented Jul 16, 2018

On the note of item 2 in particular, my point about efficiency is more about minimizing the impact on the rest of the system when an alarm fires than about timely delivery (though timely delivery is still a plus IMO). If your delivery mechanism has to compute something before sending a notification, it should do so as efficiently as possible, to minimize how much it slows down everything else.

As far as what I'm working on goes, it's conceptually pretty similar to alertmanager, the biggest differences aside from language being how routing is configured and the lack of some alertmanager features that I don't care about (like clustering and silences). It doesn't really support 'on-the-fly' addition of delivery mechanisms, but I plan to have things set up to work kind of like netdata's python.d module (that is, you drop a file exposing the right API into the correct directory, restart the daemon, and it just works).

Also, I entirely agree that there is no reason for a new protocol, with the specific caveat that I would like to avoid doing stupid things like using HTTP as a transport layer when it's not needed.

@paulfantom
Contributor Author

paulfantom commented Jul 16, 2018

If your delivery mechanism has to compute something before it is sending a notification

In my proposal, that sort of processing would live in the XYZ.alert module, since you need to process data differently for various systems. We could be exchanging JSON or another data object between the netdata core, alert.d.plugin, and XYZ.alert, so this part could be optimized as much as possible. However, the delivery system depends on the provider and can be tuned only to some extent (compare passing unprocessed data to some webhook vs. sending an email).

Silencing is actually an important feature. You don't want to send alerts at 1 a.m. for systems that are not crucial, but you do want to send those alerts during work hours.

Using HTTP as the transport layer can actually be a good idea. It is supported by every middle-layer device. Everyone knows how to configure a firewall to allow HTTP traffic, how to tag it, how to secure it, and so on. If you use something else based on TCP (I don't even consider UDP for alerts), it isn't so simple. It could be fast, but it is a pain for end users - and what is the point of a fast but unusable system?
Also, almost every dev knows how to use HTTP for new projects 😄

@Ferroin
Member

Ferroin commented Jul 16, 2018

Silencing is actually an important feature. You don't want to send alerts on 1am for systems that are not crucial, but you want to send those alerts on work hours.

In this case, yes; for my usage I just do the silencing at the level of the sender, not the broker, so I don't (currently) need it in what I'm writing. I might add it eventually, but it's not going to be in the initial version, simply because I have to draw the line somewhere for features.

As far as HTTP goes, I'm referring more to cases like Microsoft's SSDP, which runs on top of HTTP on top of UDP (so they didn't need it for framing), shouldn't be crossing firewalls (so they didn't need it for simplicity there), and really needs to be efficient (which HTTP does not help with).

@paulfantom
Contributor Author

I don't have much knowledge of Microsoft's SSDP, but from what you are saying it looks like shit.

Here we might or might not end up using HTTP for communication over a unix socket on localhost. That isn't ideal, but it solves some problems with scalability.

@paulfantom changed the title from "RFC - a new design for alert system" to "[RFC] a new design for alert system" on Aug 8, 2018
@stale

stale bot commented Nov 23, 2018

Currently the netdata team doesn't have enough capacity to work on this issue. We would be more than glad to accept a pull request with a solution to the problem described here. This issue will be closed after another 60 days of inactivity.

The stale bot added the stale label on Nov 23, 2018
@cakrit self-assigned this on Nov 23, 2018
The stale bot removed the stale label on Nov 23, 2018
@paulfantom changed the title from "[RFC] a new design for alert system" to "RFC: a new design for alert system" on Nov 24, 2018
@cakrit assigned Ferroin and unassigned cakrit on Nov 26, 2018
@cakrit added the feature request label and removed the discussion label on Dec 7, 2018
@cakrit
Contributor

cakrit commented Dec 7, 2018

@Ferroin let's use this one for the new refactoring of alarm-notify.sh. Limit the scope to splitting the code into individual files that will reside in the directories for each notification method, so that alarm-notify.sh becomes smaller and cleaner.

@cakrit added this to the v1.12-rc2 milestone on Dec 7, 2018
@Ferroin modified the milestones: v1.12-rc2 → v1.13-rc0 on Jan 2, 2019
@Ferroin removed this from the v1.12 milestone on Jan 31, 2019