[dev.icinga.com #10208] Eventhandler trigger on all endpoints in high available zone #3431

icinga-migration · 2015-09-24T07:54:05Z

This issue has been migrated from Redmine: https://dev.icinga.com/issues/10208

Created by dgoetz on 2015-09-24 07:54:05 +00:00

Assignee: (none)
Status: New
Target Version: (none)
Last Update: 2016-03-18 17:09:10 +00:00 (in Redmine)

Backport?: Not yet backported
Include in Changelog: 1

Event handlers are triggered on all endpoints in a high-availability setup.
We see an event handler command triggered on both systems in the high-availability zone: first on the check source, then, after replication, on the other endpoint.
In the worst case, the event handler restarts a problematic service twice within a few seconds, which could cause more problems than it fixes!

Platform is RHEL 7 64-bit, Icinga 2 release 2.3.10 from packages.icinga.org

Attachments

node1-debug.log dgoetz - 2015-09-25 07:28:08 +00:00
node2-debug.log dgoetz - 2015-09-25 07:28:39 +00:00

icinga-migration · 2015-09-24T19:59:26Z

Updated by mfriedrich on 2015-09-24 19:59:26 +00:00

Status changed from New to Feedback
Assigned to set to dgoetz

Please add the debug logs from both ends as well as the configuration objects (services).

icinga-migration · 2015-09-25T07:30:39Z

Updated by dgoetz on 2015-09-25 07:30:39 +00:00

File added node1-debug.log
File added node2-debug.log

Configuration object:

apply Service "aix_daemon_nrpe" {
   import "infrastructure-view-aix"
   import "generic-service"
   import "nrpe_ssl-service"
   if (host.vars.environment == "test") {
     check_interval = 15m
   }
   check_command = "nrpe"
   check_interval = 1m
   assign where host.vars.os == "aix"
   event_command = "default_eventhandler"
   vars.eventhandler_command = "/usr/local/nagios/admin/restart_nrpe.ksh"
   vars.eventhandler_plattform = "ssh"
   vars.eventhandler_expectedstatetype = "HARD"
   vars.eventhandler_expectedattempts = 1
}

I added only the lines matching the hostname su01003910, because the full debug log exceeds the upload limit even though it had been running for less than 5 minutes!

icinga-migration · 2015-10-15T12:47:37Z

Updated by mfriedrich on 2015-10-15 12:47:37 +00:00

Status changed from Feedback to New
Assigned to deleted ~~dgoetz~~

icinga-migration · 2016-03-18T17:09:10Z

Updated by mfriedrich on 2016-03-18 17:09:10 +00:00

Tracker changed from Bug to Feature
Priority changed from Normal to Low

Event handlers are triggered locally when the check result is processed. If multiple instances are involved, each of them will of course trigger the event handler.

IMHO this is not a bug but a feature request to deal with event handlers inside a cluster zone somehow, though I have no idea for a possible implementation. Leaving this open for suggestions.

leeclemens · 2017-03-04T19:24:08Z

I'm not sure why this is marked low/enhancement: an EventCommand that performs a system restart with some timeout could likely reboot the server twice. This would be completely unexpected behavior. The same goes for any event_command restarting a service (think of services that take a while to start up, like databases): the second execution would knock it back down.

> triggered locally when the check result is processed

I thought checks were scheduled between cluster members (forgive me if I'm wrong), so the same check result would only be processed by one cluster member for any given check execution? If so, this wouldn't be a problem. (I have logged check state, state id, check_attempt to verify duplication.)

Ref: https://lists.icinga.org/pipermail/icinga-users/2017-February/011844.html

Edit: The OP describes the result being replicated. So this is "of course" the expected behavior. The issue is that I do not believe it should be the expected behavior (executing the same command multiple times in response to the same service check result). Rather than requiring EventCommand scripts to perform their own locking/synchronization, I think it makes sense for that to be performed by the Icinga Cluster as a matter of design.
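
To illustrate what script-side locking could look like, here is a minimal sketch of a guard that an event handler script might call before acting. It is not part of Icinga 2. The function name claim_run, the state file path, and the 60-second window are all hypothetical. It also assumes that both duplicate invocations end up on the same host with a local filesystem (for example, when the handler runs on the monitored machine itself), because flock only serializes processes on one host.

```python
import fcntl
import os
import time


def claim_run(state_path, window_seconds, now=None):
    """Return True if the caller may run the handler, False if another
    invocation already ran it within the last `window_seconds`.

    Hypothetical helper: the state file stores the time of the last run.
    """
    now = time.time() if now is None else now
    # Open (or create) the state file without truncating existing content.
    fd = os.open(state_path, os.O_RDWR | os.O_CREAT, 0o600)
    with os.fdopen(fd, "r+") as handle:
        # Exclusive lock: read-check-write below is atomic across processes.
        # The lock is released when the file is closed.
        fcntl.flock(handle, fcntl.LOCK_EX)
        text = handle.read().strip()
        try:
            last_run = float(text) if text else None
        except ValueError:
            last_run = None  # unreadable content: treat as "never ran"
        if last_run is not None and now - last_run < window_seconds:
            return False  # a duplicate invocation, skip the action
        handle.seek(0)
        handle.truncate()
        handle.write(repr(now))
        return True
```

A handler script would call something like claim_run("/var/tmp/restart_nrpe.state", 60) first and exit without restarting anything when it returns False. This only papers over the duplication and pushes the burden onto every script, which is the drawback described above.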

leeclemens · 2017-03-06T19:27:26Z

tag: @dgoetz @dnsmichi

MarcusCaepio · 2017-11-14T07:23:02Z

Hi all,
is this still not fixed?

Thomas-Gelf · 2017-11-23T15:01:03Z

@MarcusCaepio: I talked to @dgoetz; as far as he knows, this isn't fixed.

dnsmichi · 2017-11-24T15:31:32Z

I got the same question from @MarcusCaepio, and I believe no one has looked into this yet. That's what I told him at OSMC.

dnsmichi · 2018-05-11T11:00:02Z

Since questions came up about why no one has looked into this for such a long time: event handlers were ported from 1.x, where no cluster or HA feature was implemented yet. Event handlers in the current feature set would need a redesign, especially when it comes to command_endpoint enabled clients (see #5658 for details) or how parameters can be passed. That is the reason why I tagged this as a feature a while ago.

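In terms of the problem itself, each instance in an HA-enabled zone decides on its own whether to execute event handlers when it receives a check result. The object authority for checkable objects can be determined by the paused attribute: the endpoint that actively checks an object reports paused as false, the other endpoint reports it as true.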
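
The intended behavior can be sketched with a toy model. This is not the Icinga 2 C++ code, only a simplified illustration under stated assumptions: the Endpoint class, the replicate helper, and the handler_due flag are hypothetical, and the model does not reproduce Icinga's actual rules for when an event handler is due. The endpoint and host names are taken from the test setup in this comment.

```python
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    """Toy stand-in for one Icinga 2 instance in an HA zone (hypothetical)."""
    name: str
    paused: dict = field(default_factory=dict)    # checkable name -> paused flag
    executed: list = field(default_factory=list)  # handlers this endpoint ran

    def process_check_result(self, checkable: str, handler_due: bool) -> None:
        # Every endpoint processes the replicated check result ...
        if not handler_due:
            return
        # ... but only the endpoint with authority (paused == False) acts.
        if self.paused.get(checkable, False):
            return
        self.executed.append(checkable)


def replicate(endpoints, checkable, handler_due=True):
    """Deliver the same check result to every endpoint in the zone."""
    for endpoint in endpoints:
        endpoint.process_check_result(checkable, handler_due)


a = Endpoint("icinga2a", paused={"testmaster2": False})
b = Endpoint("icinga2b", paused={"testmaster2": True})
replicate([a, b], "testmaster2")
print(a.executed, b.executed)  # ['testmaster2'] []
```

Without the paused check, both endpoints would append to executed, which corresponds to the duplicate execution reported in this issue.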

Tests

Configuration

mbmif /usr/local/tests/icinga2/master-slave (master *+) # cat icinga2a/etc/icinga2/zones.d/master/hosts.conf
object Host "testmaster" {
  check_command = "dummy"
  address = "127.0.0.1"
}
object Host "testmaster1" {
  check_command = "dummy"
}
object Host "testmaster2" {
  check_command = "random"
  event_command = "testevent"
  check_interval = 1s
  retry_interval = 1s
}
object EventCommand "testevent" {
  command = [ "/usr/bin/true" ]
}

michi@mbmif ~/coding/icinga/icinga2 (fix/eventhandler-ha-zone *) $ curl -k -s -u root:icinga 'https://localhost:7000/v1/objects/hosts/testmaster2?attrs=paused&attrs=name&pretty=1'
{
    "results": [
        {
            "attrs": {
                "name": "testmaster2",
                "paused": false
            },
            "joins": {

            },
            "meta": {

            },
            "name": "testmaster2",
            "type": "Host"
        }
    ]
}

michi@mbmif ~/coding/icinga/icinga2 (fix/eventhandler-ha-zone *) $ curl -k -s -u root:icinga 'https://localhost:8000/v1/objects/hosts/testmaster2?attrs=paused&attrs=name&pretty=1'
{
    "results": [
        {
            "attrs": {
                "name": "testmaster2",
                "paused": true
            },
            "joins": {

            },
            "meta": {

            },
            "name": "testmaster2",
            "type": "Host"
        }
    ]
}
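
The comparison of the two API responses above can also be scripted. The following is a small sketch, assuming the response shape shown in the curl output; the function names paused_flag and authoritative_endpoints are hypothetical, and the sample payloads are trimmed copies of the responses from ports 7000 and 8000.

```python
import json


def paused_flag(response_text):
    """Extract the 'paused' attribute from a single-object API response."""
    results = json.loads(response_text).get("results", [])
    if len(results) != 1:
        raise ValueError("expected exactly one result, got %d" % len(results))
    return bool(results[0]["attrs"]["paused"])


def authoritative_endpoints(responses):
    """Return the labels of endpoints that report paused == false.

    `responses` maps an endpoint label to the raw JSON text it returned.
    """
    return [label for label, text in responses.items() if not paused_flag(text)]


# Trimmed copies of the two responses shown above.
sample = {
    "localhost:7000": '{"results": [{"attrs": {"name": "testmaster2", "paused": false}, "name": "testmaster2", "type": "Host"}]}',
    "localhost:8000": '{"results": [{"attrs": {"name": "testmaster2", "paused": true}, "name": "testmaster2", "type": "Host"}]}',
}
print(authoritative_endpoints(sample))  # ['localhost:7000']
```

In a healthy HA zone exactly one label should come back for a given object; an empty list or two labels would be worth investigating before relying on event handlers.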

Fix

Git Master

icinga2a

[2018-05-11 12:48:12 +0200] notice/Checkable: State Change: Checkable 'testmaster2' hard state change from DOWN to UP detected.
[2018-05-11 12:48:12 +0200] notice/Checkable: Executing event handler 'testevent' for service 'testmaster2'

icinga2b

[2018-05-11 12:48:12 +0200] notice/Checkable: State Change: Checkable 'testmaster2' hard state change from DOWN to UP detected.
[2018-05-11 12:48:12 +0200] notice/Checkable: Executing event handler 'testevent' for service 'testmaster2'

Applied

icinga2a

[2018-05-11 12:41:25 +0200] notice/Checkable: State Change: Checkable 'testmaster2' soft state change from DOWN to DOWN detected.
[2018-05-11 12:41:25 +0200] notice/Checkable: Executing event handler 'testevent' for checkable 'testmaster2'

icinga2b

[2018-05-11 12:41:25 +0200] notice/Checkable: State Change: Checkable 'testmaster2' soft state change from DOWN to DOWN detected.
[2018-05-11 12:41:25 +0200] critical/Checkable: Skipping event handler for HA-paused checkable 'testmaster2'
Context:
	(0) Executing event handler for object 'testmaster2'

Note: The critical logging will be turned into notice inside the patch.

Single instance test

object Zone "master" {
  //endpoints = [ "icinga2a", "icinga2b" ]
  endpoints = [ "icinga2a" ]
}

[2018-05-11 12:54:08 +0200] notice/Checkable: State Change: Checkable 'testmaster2' soft state change from UP to DOWN detected.
[2018-05-11 12:54:08 +0200] notice/Checkable: Executing event handler 'testevent' for checkable 'testmaster2'

There won't be another 2.8.x minor release, and since this change also requires further tests with the snapshot packages, I'll schedule it for 2.9 as an exception (the issue set is actually frozen).

icinga-migration added the Low, enhancement (New feature or request), and area/distributed (Distributed monitoring: master, satellites, clients) labels Jan 17, 2017

dnsmichi assigned N-o-X May 8, 2018

dnsmichi added the bug (Something isn't working) label and removed the enhancement (New feature or request) and low-priority labels May 11, 2018

dnsmichi pushed a commit that referenced this issue May 11, 2018

Execute event commands only on actively checked host/service objects in an HA zone

fixes #3431

2d87b66

dnsmichi mentioned this issue May 11, 2018

Execute event commands only on actively checked host/service objects in an HA zone #6297

Merged

dnsmichi assigned dnsmichi and unassigned N-o-X May 11, 2018

dnsmichi added this to the 2.9.0 milestone May 11, 2018

dnsmichi closed this as completed in #6297 May 11, 2018
