[dev.icinga.com #10208] Eventhandler trigger on all endpoints in high available zone #3431

Closed
icinga-migration opened this issue Sep 24, 2015 · 10 comments · Fixed by #6297
Labels: area/distributed, bug

@icinga-migration

This issue has been migrated from Redmine: https://dev.icinga.com/issues/10208

Created by dgoetz on 2015-09-24 07:54:05 +00:00

Assignee: (none)
Status: New
Target Version: (none)
Last Update: 2016-03-18 17:09:10 +00:00 (in Redmine)

Backport?: Not yet backported
Include in Changelog: 1

Event handlers are triggered on all endpoints in a high-availability setup.
We see an event handler command triggered on both systems in the high-availability zone: first on the check source, then, after replication, on the other endpoint.
In the worst case, the event handler restarts a problematic service twice, a few seconds apart, which could cause more problems than it fixes!

Platform is RHEL 7 64-bit, Icinga 2 release 2.3.10 from packages.icinga.org

@icinga-migration
Author

Updated by mfriedrich on 2015-09-24 19:59:26 +00:00

  • Status changed from New to Feedback
  • Assigned to set to dgoetz

Please add the debug logs from both ends as well as the configuration objects (services).

@icinga-migration
Author

Updated by dgoetz on 2015-09-25 07:30:39 +00:00

  • File added node1-debug.log
  • File added node2-debug.log

Configuration object:

apply Service "aix_daemon_nrpe" {
   import "infrastructure-view-aix"
   import "generic-service"
   import "nrpe_ssl-service"
   if (host.vars.environment == "test") {
     check_interval = 15m
   }
   check_command = "nrpe"
   check_interval = 1m
   assign where host.vars.os == "aix"
   event_command = "default_eventhandler"
   vars.eventhandler_command = "/usr/local/nagios/admin/restart_nrpe.ksh"
   vars.eventhandler_plattform = "ssh"
   vars.eventhandler_expectedstatetype = "HARD"
   vars.eventhandler_expectedattempts = 1
}

I added only the lines matching the hostname su01003910, because the debug log exceeds the upload limit even though it ran for less than 5 minutes!

@icinga-migration
Author

Updated by mfriedrich on 2015-10-15 12:47:37 +00:00

  • Status changed from Feedback to New
  • Assigned to deleted dgoetz

@icinga-migration
Author

Updated by mfriedrich on 2016-03-18 17:09:10 +00:00

  • Tracker changed from Bug to Feature
  • Priority changed from Normal to Low

Event handlers are triggered locally when the check result is processed. If multiple instances are involved, each of them will of course trigger the event handler.

Imho this is not a bug but a feature request to deal with event handlers inside a cluster zone somehow. Though I have no idea for a possible implementation. Leaving this open for suggestions.

@icinga-migration added the Low, enhancement, and area/distributed labels on Jan 17, 2017
@leeclemens
Contributor

leeclemens commented Mar 4, 2017

I'm not sure why this is marked low/enhancement. An EventCommand that performs a system restart with some timeout could likely reboot the server twice, which would be completely unexpected behavior. The same goes for any event_command that restarts a service (think of services that take a while to start up, like databases): the second execution would knock it back down.

"triggered locally when the check result is processed"

I thought checks were scheduled between cluster members (forgive me if I'm wrong), so the same check result would only be processed by one cluster member for any given check execution? If so, this wouldn't be a problem. (I have logged check state, state id, check_attempt to verify duplication.)

Ref: https://lists.icinga.org/pipermail/icinga-users/2017-February/011844.html

Edit: The OP describes the result being replicated. So this is "of course" the expected behavior. The issue is that I do not believe it should be the expected behavior (executing the same command multiple times in response to the same service check result). Other than requiring EventCommand scripts to perform their own locking/synchronization, I think it makes sense for that to be performed by the Icinga Cluster as a matter of design.
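
For illustration only, here is a minimal sketch of what such script-side locking could look like. It is not part of Icinga 2, and the helper name `should_run`, the stamp file, and the cooldown value are all assumptions. It also only helps if the stamp file lives on the machine where the restart actually happens (for example the target of an ssh-based handler), so that handlers fired from both endpoints contend for the same file.

```python
import fcntl
import os
import time


def should_run(stamp_path, cooldown_seconds, now=None):
    """Return True if no run was recorded within the cooldown window.

    Hypothetical helper, not an Icinga 2 API. The stamp file must be on
    the host where the restart happens, otherwise the two endpoints
    never see each other's runs.
    """
    now = time.time() if now is None else now
    # Open (or create) the stamp file without truncating it.
    fd = os.open(stamp_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Exclusive lock: near-simultaneous handlers are serialized here.
        fcntl.flock(fd, fcntl.LOCK_EX)
        raw = os.read(fd, 64).decode().strip()
        last = float(raw) if raw else None
        if last is not None and now - last < cooldown_seconds:
            return False  # someone already acted on this state change
        # Record this run so that later callers back off.
        os.lseek(fd, 0, os.SEEK_SET)
        os.ftruncate(fd, 0)
        os.write(fd, repr(now).encode())
        return True
    finally:
        os.close(fd)  # closing the descriptor releases the lock
```

A restart script would call `should_run(...)` first and exit without doing anything when it returns False. This pushes the burden onto every handler script, which is exactly the drawback described above.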

@leeclemens
Contributor

tag: @dgoetz @dnsmichi

@MarcusCaepio
Contributor

Hi all,
is this still not fixed?

@Thomas-Gelf
Contributor

@MarcusCaepio: I talked to @dgoetz; as far as he knows, this isn't fixed.

@dnsmichi
Contributor

I've got the same question from @MarcusCaepio and I believe no-one looked into this yet. That's what I've told him at OSMC.

@dnsmichi added the bug label and removed the enhancement and low-priority labels on May 11, 2018
@dnsmichi
Contributor

Since the question came up why no one looked into this for such a long time: event handlers were ported from 1.x, where no cluster or HA feature had been implemented yet. Event handlers in the current feature set would need a redesign, especially when it comes to command_endpoint-enabled clients (see #5658 for details) or how parameters can be passed. That is the reason why I tagged this as a feature a while ago.

In terms of the problem itself, each instance determines whether to execute event handlers upon received check results in an HA-enabled zone. The object authority for checkable objects can be determined by the paused attribute.
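
To make the idea concrete, here is a rough sketch of the decision in Python. It is not the actual patch (Icinga 2 is written in C++), and the `Checkable` class and `handle_state_change` function are hypothetical names for illustration. The only thing taken from this thread is the rule itself: an endpoint where the object is paused does not own it and should skip the event handler.

```python
from dataclasses import dataclass


@dataclass
class Checkable:
    """Hypothetical stand-in for a host or service object on one endpoint."""
    name: str
    paused: bool  # True on the endpoint that is NOT authoritative


def handle_state_change(checkable, run_event_handler, log):
    """Run the event handler only on the endpoint that owns the object."""
    if checkable.paused:
        # The other endpoint in the HA zone has authority; do nothing here.
        log(f"Skipping event handler for HA-paused checkable '{checkable.name}'")
        return False
    log(f"Executing event handler for checkable '{checkable.name}'")
    run_event_handler(checkable)
    return True
```

With one object view per endpoint (paused false on one, true on the other), the same replicated check result leads to exactly one handler execution across the zone.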

Tests

Configuration

mbmif /usr/local/tests/icinga2/master-slave (master *+) # cat icinga2a/etc/icinga2/zones.d/master/hosts.conf
object Host "testmaster" {
  check_command = "dummy"
  address = "127.0.0.1"
}
object Host "testmaster1" {
  check_command = "dummy"
}
object Host "testmaster2" {
  check_command = "random"
  event_command = "testevent"
  check_interval = 1s
  retry_interval = 1s
}
object EventCommand "testevent" {
  command = [ "/usr/bin/true" ]
}
michi@mbmif ~/coding/icinga/icinga2 (fix/eventhandler-ha-zone *) $ curl -k -s -u root:icinga 'https://localhost:7000/v1/objects/hosts/testmaster2?attrs=paused&attrs=name&pretty=1'
{
    "results": [
        {
            "attrs": {
                "name": "testmaster2",
                "paused": false
            },
            "joins": {

            },
            "meta": {

            },
            "name": "testmaster2",
            "type": "Host"
        }
    ]
}

michi@mbmif ~/coding/icinga/icinga2 (fix/eventhandler-ha-zone *) $ curl -k -s -u root:icinga 'https://localhost:8000/v1/objects/hosts/testmaster2?attrs=paused&attrs=name&pretty=1'
{
    "results": [
        {
            "attrs": {
                "name": "testmaster2",
                "paused": true
            },
            "joins": {

            },
            "meta": {

            },
            "name": "testmaster2",
            "type": "Host"
        }
    ]
}
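
The two responses above can also be compared programmatically. The following is a small sketch that assumes the response bodies have already been fetched (for example with the curl calls shown) and only inspects the `results[0].attrs.paused` field seen in that output. The function names and the endpoint labels passed in are assumptions made for this example.

```python
import json


def paused_flag(response_text):
    """Extract 'paused' from a single-object API response like the ones above."""
    results = json.loads(response_text)["results"]
    if len(results) != 1:
        raise ValueError(f"expected exactly one result, got {len(results)}")
    return results[0]["attrs"]["paused"]


def authoritative_endpoints(responses):
    """Given {label: response_text}, return the labels where paused is false."""
    return sorted(
        label for label, text in responses.items() if not paused_flag(text)
    )
```

In a healthy two-node HA zone one would expect this to return exactly one label per object; an empty list or two labels would be worth investigating.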

Fix

Git Master

icinga2a

[2018-05-11 12:48:12 +0200] notice/Checkable: State Change: Checkable 'testmaster2' hard state change from DOWN to UP detected.
[2018-05-11 12:48:12 +0200] notice/Checkable: Executing event handler 'testevent' for service 'testmaster2'

icinga2b

[2018-05-11 12:48:12 +0200] notice/Checkable: State Change: Checkable 'testmaster2' hard state change from DOWN to UP detected.
[2018-05-11 12:48:12 +0200] notice/Checkable: Executing event handler 'testevent' for service 'testmaster2'

Applied

icinga2a

[2018-05-11 12:41:25 +0200] notice/Checkable: State Change: Checkable 'testmaster2' soft state change from DOWN to DOWN detected.
[2018-05-11 12:41:25 +0200] notice/Checkable: Executing event handler 'testevent' for checkable 'testmaster2'

icinga2b

[2018-05-11 12:41:25 +0200] notice/Checkable: State Change: Checkable 'testmaster2' soft state change from DOWN to DOWN detected.
[2018-05-11 12:41:25 +0200] critical/Checkable: Skipping event handler for HA-paused checkable 'testmaster2'
Context:
	(0) Executing event handler for object 'testmaster2'

Note: The critical logging will be turned into notice inside the patch.

Single instance test

object Zone "master" {
  //endpoints = [ "icinga2a", "icinga2b" ]
  endpoints = [ "icinga2a" ]
}
[2018-05-11 12:54:08 +0200] notice/Checkable: State Change: Checkable 'testmaster2' soft state change from UP to DOWN detected.
[2018-05-11 12:54:08 +0200] notice/Checkable: Executing event handler 'testevent' for checkable 'testmaster2'

There won't be another 2.8.x minor release, and since this change also requires further tests with the snapshot packages, I'll schedule this for 2.9 as an exception (the issue set is actually frozen).

6 participants