[dev.icinga.com #7579] only notify users on recovery which have been notified before (not-ok state) #2197

Closed
icinga-migration opened this issue Nov 5, 2014 · 5 comments
Labels
area/notifications Notification events bug Something isn't working

@icinga-migration
This issue has been migrated from Redmine: https://dev.icinga.com/issues/7579

Created by mfriedrich on 2014-11-05 09:20:21 +00:00

Assignee: mfriedrich
Status: Resolved (closed on 2014-11-09 18:55:05 +00:00)
Target Version: 2.2.0
Last Update: 2014-11-09 18:55:05 +00:00 (in Redmine)

Icinga Version: 2.1.1

Problem description

Nagios and Icinga 1.x use the direct host/service-contact relationship. When a notification is sent for a hard state change, the core stores that information as service->notified_on_$state. The stored state may change while the object is not-ok, e.g. from critical to warning, to unknown, and vice versa. Users get notified if the notification passes their notification filter.

Once the host/service recovers from a problem state (a not-ok -> ok transition), this triggers a hard state change and a notification again.

Steps to reproduce with Nagios/Icinga 1.x

Install Nagios 3.5.1 into /tmp/nagios3/install

$ mkdir /tmp/nagios3 && cd /tmp/nagios3

$ wget http://downloads.sourceforge.net/project/nagios/nagios-3.x/nagios-3.5.1/nagios-3.5.1.tar.gz
$ tar xzf nagios-3.5.1.tar.gz
$ cd nagios

$ ./configure --with-nagios-user=michi --with-nagios-group=michi --with-nagios-cmd-group=michi --prefix=/tmp/nagios3/install
$ make all
$ make fullinstall install-config install-commandmode

Add the attached configuration into the newly created config dir and include it in nagios.cfg (attached as well).

$ mkdir /tmp/nagios3/install/etc/trivago
$ ls -la  /tmp/nagios3/install/etc/trivago
test_trivago.cfg

$ vim /tmp/nagios3/install/etc/nagios.cfg

cfg_dir=/tmp/nagios3/install/etc/trivago

Call the test runner sending passive check results

Run nagios in foreground with "run_trivago_nagios".

Put the test script somewhere and call it.

sudo ./trigger_trivago_crit_recovery_nagios

Check /tmp/nagios3/install/var/trivago.log for the output of the log-notifications command, which records all notification information.

Reproduce with Icinga 2.x

  • Add the attached configuration trivago.conf to /etc/icinga2/conf.d
  • icinga2 feature enable command notification debuglog && service icinga2 restart
  • Run the test runner using "trigger_trivago_crit_recovery"
  • tail -f /var/log/icinga2/debug.log
  • test -f /var/log/icinga2/trivago.log && tail -f /var/log/icinga2/trivago.log
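
The attached test runner is not reproduced in this issue. Judging by the external commands visible in the debug.log excerpts further down, it submits passive check results in the form `[timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;exit_code;output`. A minimal sketch of building such a line could look like this. The function names and the `pipe_path` argument are hypothetical, and the path of the command pipe depends on the local installation.

```python
import time


def passive_service_result(host, service, exit_code, output, now=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT external command line.

    The layout is copied from the debug.log lines in this issue.
    """
    ts = int(time.time() if now is None else now)
    return f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{exit_code};{output}"


def submit(command_line, pipe_path):
    """Write one command line to the external command pipe.

    pipe_path is an assumption: pass whatever path the local
    ExternalCommandListener is configured to read from.
    """
    with open(pipe_path, "w") as pipe:
        pipe.write(command_line + "\n")


# Sequence used in this issue: OK, several CRITICAL results, then recovery.
sequence = [
    (0, "1. everything is ok"),
    (2, "critical"),
    (2, "critical"),
    (2, "critical"),
    (0, "ok - recovery"),
]
lines = [
    passive_service_result("trivago-host", "trivago-service", code, text)
    for code, text in sequence
]
```

Each entry in `lines` can then be passed to `submit()` with a short pause in between, so that the soft and hard state changes show up separately in the debug log.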

Conclusion

While it's reasonable to store only the state in which this host/service was notified, the problem is different in Icinga 2.x: the notification object needs to store which user has been notified for which problem state in the past (only the last not-ok state is important). By matching the user, it will only send recoveries to that specific user if a problem notification occurred before. Nagios/Icinga 1.x do not store the contact, but only check whether someone was notified ($count > 0).

On Recovery, the "notified user on state" history must be reset.

Proposed Fix

  • add a new map to notification.ti

2 stages of notification types and recoveries

  • notification
  • user

Notifications:

  • check the notification type and checkable state
    • on recovery, verify that this user has been notified before. Reset the history every time.
    • else notify the user and store the user and the state it has been notified for

Users:

  • check notification type and checkable state against the user's state filter
    • on recovery, verify that the user has been notified for a state matching the filter. Reset the history afterwards.
    • else notify the user and store the user and the state
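
The map-based idea above can be sketched as follows. This is only an illustration in Python, not the Icinga 2 implementation: the class names, the `notified_on` attribute and the state filter values assigned to the two users are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    states: frozenset  # hypothetical state filter, e.g. {"Critical"}


@dataclass
class Notification:
    users: list
    # user name -> last not-OK state this user was notified for
    notified_on: dict = field(default_factory=dict)

    def process(self, ntype, state):
        """Return the names of the users that should be notified."""
        recipients = []
        for user in self.users:
            if ntype == "Recovery":
                # Only users with a stored problem state matching
                # their own filter get the recovery.
                last = self.notified_on.get(user.name)
                if last is not None and last in user.states:
                    recipients.append(user.name)
            elif state in user.states:
                # Problem notification: remember user and state.
                self.notified_on[user.name] = state
                recipients.append(user.name)
        if ntype == "Recovery":
            # The history is reset on every recovery.
            self.notified_on.clear()
        return recipients


notification = Notification(users=[
    User("critical-user", frozenset({"Critical"})),
    User("warning-user", frozenset({"Warning"})),
])
problem_recipients = notification.process("Problem", "Critical")
recovery_recipients = notification.process("Recovery", "OK")
```

With the two hypothetical users above, only `critical-user` receives both the problem and the recovery notification, and the map is empty again afterwards.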

Attachments

Changesets

2014-11-09 18:47:24 +00:00 by (unknown) 885e770

Only notify users on recovery who have been notified on not-OK before

Also ensure that type NotificationRecovery always
passes the state filter (missing `OK` is totally fine).

Also fix that notification delays set the correct
next notification time to the begin time window.

fixes #7579
fixes #7623
fixes #6547

2014-11-14 17:11:58 +00:00 by (unknown) f73d696

Make sure that notified users are stored in state file

refs #7579

Relations:

@icinga-migration

Updated by mfriedrich on 2014-11-09 18:31:48 +00:00

  • Relates set to 7623

@icinga-migration

Updated by mfriedrich on 2014-11-09 18:34:46 +00:00

  • File deleted trivago.conf

@icinga-migration

Updated by mfriedrich on 2014-11-09 18:34:58 +00:00

  • File deleted trigger_trivago_crit_recovery

@icinga-migration

Updated by mfriedrich on 2014-11-09 18:35:50 +00:00

  • File added trivago.conf
  • File added trigger_trivago_crit_recovery
  • Subject changed from Problem notifications must store user/state for Recovery notification filters to only notify users on recovery which have been notified before (not-ok state)

Use the updated test script.

Test against unpatched Icinga 2

debug.log

[2014-11-09 19:20:12 +0100] information/ExternalCommandListener: Executing external command: [1415557212] PROCESS_SERVICE_CHECK_RESULT;trivago-host;trivago-service;0;1. everything is ok
[2014-11-09 19:20:12 +0100] notice/ExternalCommandProcessor: Processing passive check result for service 'trivago-service'
[2014-11-09 19:20:12 +0100] notice/Checkable: State Change: Checkable trivago-host!trivago-service hard state change from UNKNOWN to OK detected.
[2014-11-09 19:20:13 +0100] information/ExternalCommandListener: Executing external command: [1415557213] PROCESS_SERVICE_CHECK_RESULT;trivago-host;trivago-service;2;critical
[2014-11-09 19:20:13 +0100] notice/ExternalCommandProcessor: Processing passive check result for service 'trivago-service'
[2014-11-09 19:20:13 +0100] notice/Checkable: State Change: Checkable trivago-host!trivago-service soft state change from OK to CRITICAL detected.
[2014-11-09 19:20:14 +0100] information/ExternalCommandListener: Executing external command: [1415557213] PROCESS_SERVICE_CHECK_RESULT;trivago-host;trivago-service;2;critical
[2014-11-09 19:20:14 +0100] notice/ExternalCommandProcessor: Processing passive check result for service 'trivago-service'
[2014-11-09 19:20:15 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 16; Checks/s: 0
[2014-11-09 19:20:15 +0100] information/ExternalCommandListener: Executing external command: [1415557213] PROCESS_SERVICE_CHECK_RESULT;trivago-host;trivago-service;2;critical
[2014-11-09 19:20:15 +0100] notice/ExternalCommandProcessor: Processing passive check result for service 'trivago-service'
[2014-11-09 19:20:15 +0100] notice/Checkable: State Change: Checkable trivago-host!trivago-service hard state change from CRITICAL to CRITICAL detected.
[2014-11-09 19:20:15 +0100] information/Checkable: Checking for configured notifications for object 'trivago-host!trivago-service'
[2014-11-09 19:20:15 +0100] debug/Checkable: Checkable 'trivago-host!trivago-service' has 1 notification(s).
[2014-11-09 19:20:15 +0100] debug/Notification: FType=32, TypeFilter=96
[2014-11-09 19:20:15 +0100] notice/Notification: Not sending notifications for notification object 'trivago-host!trivago-service!trivago-notify-service and user 'warning-user': state filter does not match
[2014-11-09 19:20:15 +0100] information/Notification: Sending notification for user 'critical-user'
[2014-11-09 19:20:15 +0100] notice/Process: Running command 'sh' '-c' '/bin/echo "`date +%s` type: 'PROBLEM' host state: 'DOWN' service state: 'CRITICAL' recipient: 'critical-user'" >> /var/log/icinga2/trivago.log': PID 25950
[2014-11-09 19:20:15 +0100] information/Notification: Completed sending notification for object 'trivago-host!trivago-service'
[2014-11-09 19:20:15 +0100] notice/Process: PID 25950 ('sh' '-c' '/bin/echo "`date +%s` type: 'PROBLEM' host state: 'DOWN' service state: 'CRITICAL' recipient: 'critical-user'" >> /var/log/icinga2/trivago.log') terminated with exit code 0
[2014-11-09 19:20:16 +0100] information/ExternalCommandListener: Executing external command: [1415557216] PROCESS_SERVICE_CHECK_RESULT;trivago-host;trivago-service;2;critical
[2014-11-09 19:20:16 +0100] notice/ExternalCommandProcessor: Processing passive check result for service 'trivago-service'
[2014-11-09 19:20:17 +0100] information/ExternalCommandListener: Executing external command: [1415557217] PROCESS_SERVICE_CHECK_RESULT;trivago-host;trivago-service;2;critical
[2014-11-09 19:20:17 +0100] notice/ExternalCommandProcessor: Processing passive check result for service 'trivago-service'
[2014-11-09 19:20:18 +0100] information/ExternalCommandListener: Executing external command: [1415557218] PROCESS_SERVICE_CHECK_RESULT;trivago-host;trivago-service;0;ok - recovery
[2014-11-09 19:20:18 +0100] notice/ExternalCommandProcessor: Processing passive check result for service 'trivago-service'
[2014-11-09 19:20:18 +0100] notice/Checkable: State Change: Checkable trivago-host!trivago-service hard state change from CRITICAL to OK detected.
[2014-11-09 19:20:18 +0100] information/Checkable: Checking for configured notifications for object 'trivago-host!trivago-service'
[2014-11-09 19:20:18 +0100] debug/Checkable: Checkable 'trivago-host!trivago-service' has 1 notification(s).
[2014-11-09 19:20:18 +0100] debug/Notification: FType=64, TypeFilter=96
[2014-11-09 19:20:18 +0100] information/Notification: Sending notification for user 'warning-user'
[2014-11-09 19:20:18 +0100] information/Notification: Sending notification for user 'critical-user'
[2014-11-09 19:20:18 +0100] notice/Process: Running command 'sh' '-c' '/bin/echo "`date +%s` type: 'RECOVERY' host state: 'DOWN' service state: 'OK' recipient: 'critical-user'" >> /var/log/icinga2/trivago.log': PID 25966
[2014-11-09 19:20:18 +0100] notice/Process: Running command 'sh' '-c' '/bin/echo "`date +%s` type: 'RECOVERY' host state: 'DOWN' service state: 'OK' recipient: 'warning-user'" >> /var/log/icinga2/trivago.log': PID 25967
[2014-11-09 19:20:18 +0100] information/Notification: Completed sending notification for object 'trivago-host!trivago-service'
[2014-11-09 19:20:18 +0100] information/Notification: Completed sending notification for object 'trivago-host!trivago-service'
[2014-11-09 19:20:18 +0100] notice/Process: PID 25967 ('sh' '-c' '/bin/echo "`date +%s` type: 'RECOVERY' host state: 'DOWN' service state: 'OK' recipient: 'warning-user'" >> /var/log/icinga2/trivago.log') terminated with exit code 0
[2014-11-09 19:20:18 +0100] notice/Process: PID 25966 ('sh' '-c' '/bin/echo "`date +%s` type: 'RECOVERY' host state: 'DOWN' service state: 'OK' recipient: 'critical-user'" >> /var/log/icinga2/trivago.log') terminated with exit code 0
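
The `FType`/`TypeFilter` lines in the log above read like a bitmask check. The values are inferred from the log only: `FType=32` appears alongside the PROBLEM notification, `FType=64` alongside the RECOVERY notification, and `TypeFilter=96` is their sum. The constant and function names below are hypothetical. A small sketch shows why the type filter alone lets the recovery through for every user:

```python
# Values as they appear in the debug log; names are made up for this sketch.
NOTIFICATION_PROBLEM = 32
NOTIFICATION_RECOVERY = 64


def type_passes(ftype, type_filter):
    """True if the notification type bit is set in the type filter."""
    return (ftype & type_filter) != 0


type_filter = NOTIFICATION_PROBLEM | NOTIFICATION_RECOVERY  # 96, as logged

problem_ok = type_passes(NOTIFICATION_PROBLEM, type_filter)
recovery_ok = type_passes(NOTIFICATION_RECOVERY, type_filter)
```

Both checks pass, so the type filter cannot distinguish a user who was notified about the problem from one who was not. That decision needs the per-user history discussed in this issue.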

notifications.log

1415557215 type: 'PROBLEM' host state: 'DOWN' service state: 'CRITICAL' recipient: 'critical-user'
1415557218 type: 'RECOVERY' host state: 'DOWN' service state: 'OK' recipient: 'warning-user'
1415557218 type: 'RECOVERY' host state: 'DOWN' service state: 'OK' recipient: 'critical-user'
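
To spot the unwanted recovery automatically, the log lines can be checked for RECOVERY recipients that never received a PROBLEM notification. The following is a minimal sketch that assumes the line format shown above. The function name is hypothetical, and the sketch does not reset the history between several problem/recovery cycles.

```python
import re

# Matches the lines written by the notification command in this issue.
LINE_RE = re.compile(
    r"^(\d+) type: '(\w+)' host state: '(\w+)' "
    r"service state: '(\w+)' recipient: '([^']+)'$"
)


def unexpected_recoveries(lines):
    """Return recipients that got a RECOVERY without a prior PROBLEM."""
    problem_users = set()
    unexpected = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # ignore lines in any other format
        _, ntype, _, _, recipient = match.groups()
        if ntype == "PROBLEM":
            problem_users.add(recipient)
        elif ntype == "RECOVERY" and recipient not in problem_users:
            unexpected.append(recipient)
    return unexpected


log_lines = [
    "1415557215 type: 'PROBLEM' host state: 'DOWN' service state: 'CRITICAL' recipient: 'critical-user'",
    "1415557218 type: 'RECOVERY' host state: 'DOWN' service state: 'OK' recipient: 'warning-user'",
    "1415557218 type: 'RECOVERY' host state: 'DOWN' service state: 'OK' recipient: 'critical-user'",
]
result = unexpected_recoveries(log_lines)
```

For the three lines above, `result` contains only `warning-user`, which is exactly the notification this issue wants to suppress.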

Proposed fix

Neither the state nor the state change is important. The notification object just stores which users have been notified, similar to what is sent to db_ido for the notification history.

Once a recovery happens, the to-be-notified users are checked against the list of notified users. A user on that list gets the recovery notification; for any other user the notification is skipped. After all users have been processed, the list is reset. We'll use a set of user names here; maybe there are better solutions.
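
The set-based approach can be sketched in a few lines. This is a Python illustration of the idea only, not the code from the changeset, and the class and method names are hypothetical.

```python
class RecoveryGate:
    """Tracks which users received a problem notification."""

    def __init__(self):
        self.notified_users = set()

    def on_problem_sent(self, user_name):
        # Called after a problem notification was sent to this user.
        self.notified_users.add(user_name)

    def recovery_recipients(self, candidates):
        """Filter recovery candidates, then reset the history."""
        allowed = [name for name in candidates if name in self.notified_users]
        # Reset only after all candidates have been checked.
        self.notified_users.clear()
        return allowed


gate = RecoveryGate()
gate.on_problem_sent("critical-user")
first = gate.recovery_recipients(["warning-user", "critical-user"])
second = gate.recovery_recipients(["warning-user", "critical-user"])
```

The first recovery reaches only `critical-user`. A second recovery without a new problem notification reaches nobody, because the set was cleared.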

Test the fix

Same configuration, same input, different code.

debug.log

[2014-11-09 19:33:06 +0100] information/ExternalCommandListener: Executing external command: [1415557986] PROCESS_SERVICE_CHECK_RESULT;trivago-host;trivago-service;0;1. everything is ok
[2014-11-09 19:33:06 +0100] notice/ExternalCommandProcessor: Processing passive check result for service 'trivago-service'
[2014-11-09 19:33:06 +0100] notice/Checkable: State Change: Checkable trivago-host!trivago-service hard state change from UNKNOWN to OK detected.
[2014-11-09 19:33:07 +0100] information/ExternalCommandListener: Executing external command: [1415557987] PROCESS_SERVICE_CHECK_RESULT;trivago-host;trivago-service;2;critical
[2014-11-09 19:33:07 +0100] notice/ExternalCommandProcessor: Processing passive check result for service 'trivago-service'
[2014-11-09 19:33:07 +0100] notice/Checkable: State Change: Checkable trivago-host!trivago-service soft state change from OK to CRITICAL detected.
[2014-11-09 19:33:08 +0100] information/ExternalCommandListener: Executing external command: [1415557987] PROCESS_SERVICE_CHECK_RESULT;trivago-host;trivago-service;2;critical
[2014-11-09 19:33:08 +0100] notice/ExternalCommandProcessor: Processing passive check result for service 'trivago-service'
[2014-11-09 19:33:09 +0100] notice/Process: PID 27038 ('/usr/lib64/nagios/plugins/check_ping' '-4' '-H' '127.0.0.1' '-c' '200,15%' '-w' '100,5%') terminated with exit code 0
[2014-11-09 19:33:09 +0100] notice/Checkable: State Change: Checkable nbmif!ping4 hard state change from UNKNOWN to OK detected.
[2014-11-09 19:33:09 +0100] information/ExternalCommandListener: Executing external command: [1415557987] PROCESS_SERVICE_CHECK_RESULT;trivago-host;trivago-service;2;critical
[2014-11-09 19:33:09 +0100] notice/ExternalCommandProcessor: Processing passive check result for service 'trivago-service'
[2014-11-09 19:33:09 +0100] notice/Checkable: State Change: Checkable trivago-host!trivago-service hard state change from CRITICAL to CRITICAL detected.
[2014-11-09 19:33:09 +0100] information/Checkable: Checking for configured notifications for object 'trivago-host!trivago-service'
[2014-11-09 19:33:09 +0100] debug/Checkable: Checkable 'trivago-host!trivago-service' has 1 notification(s).
[2014-11-09 19:33:09 +0100] debug/Notification: FType=32, TypeFilter=96
[2014-11-09 19:33:09 +0100] information/Notification: Sending notification for user 'critical-user'
[2014-11-09 19:33:09 +0100] notice/Notification: Not sending notifications for notification object 'trivago-host!trivago-service!trivago-notify-service and user 'warning-user': state filter does not match
[2014-11-09 19:33:09 +0100] notice/Notification: Notification filters for user 'warning-user' not matched. Not sending notification.
[2014-11-09 19:33:09 +0100] notice/Process: Running command 'sh' '-c' '/bin/echo "`date +%s` type: 'PROBLEM' host state: 'DOWN' service state: 'CRITICAL' recipient: 'critical-user'" >> /var/log/icinga2/trivago.log': PID 27059
[2014-11-09 19:33:09 +0100] information/Notification: Completed sending notification for object 'trivago-host!trivago-service'
[2014-11-09 19:33:09 +0100] notice/Process: PID 27059 ('sh' '-c' '/bin/echo "`date +%s` type: 'PROBLEM' host state: 'DOWN' service state: 'CRITICAL' recipient: 'critical-user'" >> /var/log/icinga2/trivago.log') terminated with exit code 0
[2014-11-09 19:33:10 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 16; Checks/s: 0.8
[2014-11-09 19:33:10 +0100] debug/CheckerComponent: Executing check for 'trivago-host!ping4'
[2014-11-09 19:33:10 +0100] debug/CheckerComponent: Executing check for 'nbmif!http'
[2014-11-09 19:33:10 +0100] debug/CheckerComponent: Executing check for 'nbmif!ping6'
[2014-11-09 19:33:10 +0100] debug/CheckerComponent: Executing check for 'nbmif!ssh'
[2014-11-09 19:33:10 +0100] notice/Process: Running command '/usr/lib64/nagios/plugins/check_ping' '-4' '-H' '127.0.0.1' '-c' '200,15%' '-w' '100,5%': PID 27063
[2014-11-09 19:33:10 +0100] debug/CheckerComponent: Check finished for object 'trivago-host!ping4'
[2014-11-09 19:33:10 +0100] notice/Process: Running command '/usr/lib64/nagios/plugins/check_ssh' '127.0.0.1': PID 27064
[2014-11-09 19:33:10 +0100] debug/CheckerComponent: Check finished for object 'nbmif!ssh'
[2014-11-09 19:33:10 +0100] notice/Process: Running command '/usr/lib64/nagios/plugins/check_ping' '-6' '-H' '::1' '-c' '200,15%' '-w' '100,5%': PID 27066
[2014-11-09 19:33:10 +0100] debug/CheckerComponent: Check finished for object 'nbmif!ping6'
[2014-11-09 19:33:10 +0100] notice/Process: Running command '/usr/lib64/nagios/plugins/check_http' '-I' '127.0.0.1' '-u' '/': PID 27069
[2014-11-09 19:33:10 +0100] debug/CheckerComponent: Check finished for object 'nbmif!http'
[2014-11-09 19:33:10 +0100] notice/Process: PID 27069 ('/usr/lib64/nagios/plugins/check_http' '-I' '127.0.0.1' '-u' '/') terminated with exit code 1
[2014-11-09 19:33:10 +0100] notice/Checkable: State Change: Checkable nbmif!http hard state change from UNKNOWN to WARNING detected.
[2014-11-09 19:33:10 +0100] notice/Process: PID 27064 ('/usr/lib64/nagios/plugins/check_ssh' '127.0.0.1') terminated with exit code 0
[2014-11-09 19:33:10 +0100] notice/Checkable: State Change: Checkable nbmif!ssh hard state change from UNKNOWN to OK detected.
[2014-11-09 19:33:10 +0100] notice/ThreadPool: Pool #1: Pending tasks: 0; Average latency: 0ms; Threads: 16; Pool utilization: 0.0407649%
[2014-11-09 19:33:10 +0100] information/ExternalCommandListener: Executing external command: [1415557990] PROCESS_SERVICE_CHECK_RESULT;trivago-host;trivago-service;2;critical
[2014-11-09 19:33:10 +0100] notice/ExternalCommandProcessor: Processing passive check result for service 'trivago-service'
[2014-11-09 19:33:11 +0100] information/ExternalCommandListener: Executing external command: [1415557991] PROCESS_SERVICE_CHECK_RESULT;trivago-host;trivago-service;2;critical
[2014-11-09 19:33:11 +0100] notice/ExternalCommandProcessor: Processing passive check result for service 'trivago-service'
[2014-11-09 19:33:12 +0100] information/ExternalCommandListener: Executing external command: [1415557992] PROCESS_SERVICE_CHECK_RESULT;trivago-host;trivago-service;0;ok - recovery
[2014-11-09 19:33:12 +0100] notice/ExternalCommandProcessor: Processing passive check result for service 'trivago-service'
[2014-11-09 19:33:12 +0100] notice/Checkable: State Change: Checkable trivago-host!trivago-service hard state change from CRITICAL to OK detected.
[2014-11-09 19:33:12 +0100] information/Checkable: Checking for configured notifications for object 'trivago-host!trivago-service'
[2014-11-09 19:33:12 +0100] debug/Checkable: Checkable 'trivago-host!trivago-service' has 1 notification(s).
[2014-11-09 19:33:12 +0100] debug/Notification: FType=64, TypeFilter=96
[2014-11-09 19:33:12 +0100] information/Notification: Sending notification for user 'critical-user'
[2014-11-09 19:33:12 +0100] notice/Notification: We did not notify user 'warning-user' before. Not sending recovery notification.
[2014-11-09 19:33:12 +0100] notice/Process: Running command 'sh' '-c' '/bin/echo "`date +%s` type: 'RECOVERY' host state: 'DOWN' service state: 'OK' recipient: 'critical-user'" >> /var/log/icinga2/trivago.log': PID 27083
[2014-11-09 19:33:12 +0100] information/Notification: Completed sending notification for object 'trivago-host!trivago-service'
[2014-11-09 19:33:12 +0100] notice/Process: PID 27083 ('sh' '-c' '/bin/echo "`date +%s` type: 'RECOVERY' host state: 'DOWN' service state: 'OK' recipient: 'critical-user'" >> /var/log/icinga2/trivago.log') terminated with exit code 0

notifications.log

1415557989 type: 'PROBLEM' host state: 'DOWN' service state: 'CRITICAL' recipient: 'critical-user'
1415557992 type: 'RECOVERY' host state: 'DOWN' service state: 'OK' recipient: 'critical-user'

There is also a problem with type = Recovery and a missing OK state filter. That's handled in #7623.

@icinga-migration

Updated by Anonymous on 2014-11-09 18:55:05 +00:00

  • Status changed from Assigned to Resolved
  • Done % changed from 0 to 100

Applied in changeset 885e770.
