[health] Dispatch some alarms into health.log instead of debug #7576
I've reworked the script and am now testing it on my production setup.
When all logs are enabled, I got:
There's a burst of messages at first initialization, but after that only useful messages from state changes and alarm execution.
@Saruspete hi. What are your plans for this PR? Are you still interested in it, and are you going to finish it?
Hello, it has been running in production under 1.26 and 1.28 for about 2 months. I was waiting to check that it didn't cause any issues, and so far it looks good to me. I'll remove the WIP tag.
A thing just hit me: I removed the mutex that was copy/pasted from |
I do not remember. We have a make check, but I am not sure it does exactly what you want. @Ferroin, can you help us here?
About these last errors: I compiled the latest master and I am not getting them. Could you please try rebasing your PR?
It should be rebased already (during my clean):
Firstly, sorry for the delay. I think you are doing an awesome job here, and this log could one day be exported to our dashboard or to a file. My suggestion is to create a mutex for it, because it will be necessary in the future.
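To illustrate the mutex suggestion, here is a minimal sketch of a log writer whose writes are serialized by a lock. It is written in Python purely for illustration; netdata itself is C and would use its own locking primitives. The names `LockedLog`, `write_line` and `demo` are hypothetical and do not come from this PR.

```python
import io
import threading


class LockedLog:
    """Hypothetical log writer: one lock serializes every write."""

    def __init__(self, stream):
        self._stream = stream
        self._lock = threading.Lock()

    def write_line(self, message):
        # Hold the lock for the whole write so lines from different
        # threads can never interleave inside one another.
        with self._lock:
            self._stream.write(message + "\n")
            self._stream.flush()


def demo(n_threads=4, per_thread=50):
    """Write from several threads and return the resulting lines."""
    buffer = io.StringIO()
    log = LockedLog(buffer)

    def worker(tid):
        for i in range(per_thread):
            log.write_line(f"thread {tid} message {i}")

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return buffer.getvalue().splitlines()
```

Calling `demo()` returns one complete line per write, whatever the thread scheduling was. The order of the lines is not deterministic.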
I am not a git specialist. @ilyam8, do you have any tips for us?
Hello @Saruspete , I tested your PR now and I saw that the compilation error was fixed, thank you!
As you can see, the alarm is useless in the real world, but it was good enough for me to check the details of your PR. Firstly, take a look at my error.log:
bash-5.1# grep telegram /var/log/netdata/error.log
2021-03-23 22:12:38: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template_idle is CRITICAL to '-3'
2021-03-23 22:12:39: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template_system is WARNING to '-3'
2021-03-23 22:12:39: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template_user is CRITICAL to '-3'
2021-03-23 22:12:41: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template_system is CRITICAL to '-3'
2021-03-23 22:12:41: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template_iowait is WARNING to '-3'
2021-03-23 22:12:44: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template_iowait is CLEAR to '-3'
2021-03-23 22:12:50: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template_system is WARNING to '-3'
2021-03-23 22:12:53: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template_system is CRITICAL to '-3'
2021-03-23 22:13:04: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template_system is WARNING to '-3'
and now the health.log:
bash-5.1# cat /var/log/netdata/health.log
2021-03-23 22:12:39: done executing command '1' - returned with code 0 at 1616537557
2021-03-23 22:12:39: done executing command '2' - returned with code 0 at 1616537557
2021-03-23 22:12:39: done executing command '3' - returned with code 0 at 1616537557
2021-03-23 22:12:41: done executing command '4' - returned with code 0 at 1616537560
2021-03-23 22:12:41: done executing command '5' - returned with code 0 at 1616537560
2021-03-23 22:12:44: done executing command '6' - returned with code 0 at 1616537563
2021-03-23 22:12:50: done executing command '7' - returned with code 0 at 1616537568
2021-03-23 22:12:53: done executing command '8' - returned with code 0 at 1616537572
2021-03-23 22:13:04: done executing command '9' - returned with code 0 at 1616537581
I am missing a unique identifier shared by the two files that would let us compare the data; as you can see, the timestamps are not the same in both. Please, can you add the alarm name to
In my view, we could also move the information that was kept inside
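The missing identifier can be made concrete with a small sketch that tries to pair the two files by timestamp alone. It is a Python illustration, not part of the PR. The regular expressions are assumptions inferred only from the sample lines above, and `group_by_timestamp` and `ambiguous_timestamps` are hypothetical helper names.

```python
import re
from collections import defaultdict

# Patterns inferred from the sample lines in this comment (assumption).
ERROR_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): alarm-notify\.sh: INFO: "
    r"sent telegram notification for: (?P<host>\S+) (?P<alarm>\S+) "
    r"is (?P<status>\S+) to "
)
HEALTH_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): done executing command "
    r"'(?P<cmd>\d+)' - returned with code (?P<code>-?\d+) at (?P<epoch>\d+)"
)

ERROR_LOG = """\
2021-03-23 22:12:38: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template_idle is CRITICAL to '-3'
2021-03-23 22:12:39: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template_system is WARNING to '-3'
2021-03-23 22:12:39: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template_user is CRITICAL to '-3'
2021-03-23 22:12:44: alarm-notify.sh: INFO: sent telegram notification for: hades system.cpu.dev_dim_template_iowait is CLEAR to '-3'
""".splitlines()

HEALTH_LOG = """\
2021-03-23 22:12:39: done executing command '1' - returned with code 0 at 1616537557
2021-03-23 22:12:39: done executing command '2' - returned with code 0 at 1616537557
2021-03-23 22:12:39: done executing command '3' - returned with code 0 at 1616537557
2021-03-23 22:12:44: done executing command '6' - returned with code 0 at 1616537563
""".splitlines()


def group_by_timestamp(lines, pattern):
    """Map each log timestamp to the list of parsed entries carrying it."""
    groups = defaultdict(list)
    for line in lines:
        match = pattern.match(line.strip())
        if match:
            groups[match.group("ts")].append(match.groupdict())
    return groups


def ambiguous_timestamps(error_lines, health_lines):
    """Timestamps that do not map one error.log line to one health.log line."""
    errors = group_by_timestamp(error_lines, ERROR_RE)
    health = group_by_timestamp(health_lines, HEALTH_RE)
    return sorted(
        ts
        for ts in errors.keys() | health.keys()
        if len(errors.get(ts, [])) != 1 or len(health.get(ts, [])) != 1
    )
```

On this excerpt only `22:12:44` pairs one notification with one command. The other timestamps either have several candidates or no counterpart, which is why an identifier present in both files is needed.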
You should have a
The log you're seeing is enabled by
Also, I couldn't find a way to get the executed command from the ae pointer (it doesn't seem to be saved in it). Do you have any idea here? Should we copy the
You are right, we do not store the |
Ok, there's no point storing the command long-term, so I don't want to change the alarm_entry struct for that. |
I do not think the hash will help us, because we dispatch the command with different arguments, and the timestamp is among them. I think we could use the following fields to help with the identification:
These four components will help us create a unique identifier.
@thiagoftsm do we need this work to be merged? If yes, please update the PR and merge it. If not, let's close it. |
After a quick talk with our team, I rebased the PR, and we are now proceeding with the next steps to merge it.
Also update [logs] section options to reflect the new health.log
configuration option?
Closing this because we are planning to redesign log messages in general and we'll take care of the health logs as well. |
Summary
Fixes: #6822
Goal: to provide more detail on the "alarm script" execution (#6822)
This script can be used to send simple notifications (features provided
by default), but can also be used to notify more complex systems.
When netdata registers an alarm but the upstream alerting system doesn't
receive the alarm/clear, it's complicated to pinpoint where the alert
was dropped.
Issues where this would be useful:
the health thread is stopped too. Adding an execution-time log with the
notification gives some trace of a slow script.
user doesn't want to be flooded by notifications. Many bug tickets were
due to this feature silently discarding notifications (like "How do alerts work?" #3326, "Telegram Alarms works only by hand execution." #3489,
"httpcheck plugin doesn't send clear event to alerta" #3590, "alarm-notify.sh not always executed" #2952, and more). Here we can have a log without being in debug
return code and special path for exit != 0
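The execution-time and return-code items above can be sketched as follows. This is a Python illustration under stated assumptions, not the C code of this PR. `run_and_log`, the `log_lines` list and the message wording are all hypothetical.

```python
import subprocess
import sys
import time


def run_and_log(argv, log_lines, alarm_name):
    """Run a notification command, then record its duration and exit code.

    Hypothetical helper: appends one message to log_lines and returns
    (returncode, elapsed_seconds).
    """
    start = time.monotonic()
    completed = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    elapsed = time.monotonic() - start

    # Special path for exit != 0: raise the severity so that a failed
    # notification stands out from the successful ones.
    level = "INFO" if completed.returncode == 0 else "ERROR"
    log_lines.append(
        f"{level}: alarm '{alarm_name}' command exited with code "
        f"{completed.returncode} after {elapsed:.3f}s"
    )
    return completed.returncode, elapsed
```

For example, `run_and_log([sys.executable, "-c", "import sys; sys.exit(3)"], log, "demo_alarm")` appends an `ERROR:` line that names the alarm and exit code 3. A command that exits with 0 produces an `INFO:` line instead.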
New configuration options in the [health] section to control logging. By default, all are off, which is the same state as currently (debug).
More log lines should be added to cover the different paths where alerts come from (like repeating alerts, transient, etc.)
Component Name
health + libnetdata/logs
Additional Information
Sample output (with all options = yes):