CheckCommand 'icinga' seems to ignore retry interval via command_endpoint #6603

miso231 · 2018-09-06T14:31:03Z

The icinga CheckCommand ignores soft states when a problem occurs. It jumps straight into the hard state and sends a notification. It appears to ignore the retry_interval parameter of the service object (30s in my case): in the Web UI I can see all attempts happening within just two seconds, as the screenshot attached to the issue shows.

Expected Behavior

The icinga CheckCommand should respect the retry_interval and max_check_attempts parameters of the service object.

Context

I'm using a 3-level architecture with an HA master zone and 3 child zones. The issue happens on all 3 levels (masters, satellites, clients). Runtime service object:

Object 'generic-host!Icinga service health' of type 'Service':
  % declared in '/etc/icinga2/zones.d/global-templates/services.conf', lines 253:1-253:37
  * __name = "generic-host!Icinga service health"
  * action_url = ""
  * check_command = "icinga"
    % = modified in '/etc/icinga2/zones.d/global-templates/services.conf', lines 255:3-255:31
  * check_interval = 60
    % = modified in '/etc/icinga2/zones.d/global-templates/templates.conf', lines 13:3-13:25
    % = modified in '/etc/icinga2/zones.d/global-templates/services.conf', lines 258:3-258:25
  * check_period = ""
  * check_timeout = null
  * command_endpoint = "generic-host"
    % = modified in '/etc/icinga2/zones.d/global-templates/services.conf', lines 264:3-264:32
  * display_name = "Icinga service health"
  * enable_active_checks = true
  * enable_event_handler = true
  * enable_flapping = false
  * enable_notifications = true
  * enable_passive_checks = true
  * enable_perfdata = false
    % = modified in '/etc/icinga2/zones.d/global-templates/services.conf', lines 260:3-260:24
  * event_command = ""
  * flapping_threshold = 0
  * flapping_threshold_high = 30
  * flapping_threshold_low = 25
  * groups = [ ]
  * host_name = "generic-host"
    % = modified in '/etc/icinga2/zones.d/global-templates/services.conf', lines 253:1-253:37
    % = modified in '/etc/icinga2/zones.d/global-templates/services.conf', lines 257:3-257:40
  * icon_image = ""
  * icon_image_alt = ""
  * max_check_attempts = 5
    % = modified in '/etc/icinga2/zones.d/global-templates/templates.conf', lines 12:3-12:24
    % = modified in '/etc/icinga2/zones.d/global-templates/services.conf', lines 256:3-256:24
  * name = "Icinga service health"
    % = modified in '/etc/icinga2/zones.d/global-templates/services.conf', lines 253:1-253:37
  * notes = ""
  * notes_url = ""
  * package = "_etc"
    % = modified in '/etc/icinga2/zones.d/global-templates/services.conf', lines 253:1-253:37
  * retry_interval = 30
    % = modified in '/etc/icinga2/zones.d/global-templates/templates.conf', lines 14:3-14:25
    % = modified in '/etc/icinga2/zones.d/global-templates/services.conf', lines 259:3-259:26
  * source_location
    * first_column = 1
    * first_line = 253
    * last_column = 37
    * last_line = 253
    * path = "/etc/icinga2/zones.d/global-templates/services.conf"
  * templates = [ "Icinga service health", "generic-service" ]
    % = modified in '/etc/icinga2/zones.d/global-templates/services.conf', lines 253:1-253:37
    % = modified in '/etc/icinga2/zones.d/global-templates/templates.conf', lines 11:1-11:34
  * type = "Service"
  * vars
    * priority = "p2"
      % = modified in '/etc/icinga2/zones.d/global-templates/services.conf', lines 262:3-262:27
  * volatile = false
  * zone = "generic-zone"
    % = modified in '/etc/icinga2/zones.d/global-templates/services.conf', lines 253:1-253:37

Your Environment

Version used (icinga2 --version): r2.9.1-1
Operating System and version: CentOS Linux 7 (Core)
Enabled features on master (icinga2 feature list): api checker graphite ido-pgsql mainlog notification
Enabled features on satellite (icinga2 feature list): api checker mainlog
Enabled features on client (icinga2 feature list): api mainlog
Icinga Web 2 version and modules (System - About): 2.6.1
Config validation (icinga2 daemon -C):

[2018-09-06 14:12:19 +0000] information/cli: Icinga application loader (version: r2.9.1-1)
[2018-09-06 14:12:19 +0000] information/cli: Loading configuration file(s).
[2018-09-06 14:12:19 +0000] information/ConfigItem: Committing config item(s).
[2018-09-06 14:12:19 +0000] information/ApiListener: My API identity: icinga-master-001
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 7293 Services.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 7 ServiceGroups.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 114 HostGroups.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 1 FileLogger.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 1 NotificationComponent.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 6 NotificationCommands.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 9857 Notifications.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 1 IcingaApplication.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 488 Hosts.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 1 ApiListener.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 68 Downtimes.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 6428 Dependencies.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 1 GraphiteWriter.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 19 Comments.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 1 CheckerComponent.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 456 Zones.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 458 Endpoints.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 2 ApiUsers.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 3 UserGroups.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 240 CheckCommands.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 1 IdoPgsqlConnection.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 4 TimePeriods.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 8 Users.
[2018-09-06 14:12:22 +0000] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2018-09-06 14:12:22 +0000] information/cli: Finished validating the configuration file(s).

If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes.

object Endpoint "icinga-master-001" {
    host = "10.0.0.1"
}
object Endpoint "icinga-master-002" {
    host = "10.0.0.2"
}
object Zone "master" {
    endpoints = [ "icinga-master-002", "icinga-master-001" ]
}

object Zone "zone1" {
    endpoints = [ "icinga-satelite-001" ]
    parent = "master"
}
object Endpoint "icinga-satelite-001" {
    host = "10.1.0.1"
}

object Zone "zone2" {
    endpoints = [ "icinga-satelite-003", "icinga-satelite-002" ]
    parent = "master"
}
object Endpoint "icinga-satelite-003" {
    host = "10.1.0.3"
}
object Endpoint "icinga-satelite-002" {
    host = "10.1.0.2"
}

object Zone "zone3" {
    endpoints = [ "icinga-satelite-004", "icinga-satelite-005" ]
    parent = "master"
}
object Endpoint "icinga-satelite-004" {
    host = "10.1.0.4"
}
object Endpoint "icinga-satelite-005" {
    host = "10.1.0.5"
}

object Zone "global-templates"{
    global = true
}

dnsmichi · 2018-09-11T15:07:20Z

Can you try to connect to the API event streams and verify that the check result is received multiple times a second? It looks strange from your Icinga Web 2 screenshot.

https://www.icinga.com/docs/icinga2/latest/doc/15-troubleshooting/#checks-are-not-executed (at the bottom of the section).
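
A minimal sketch of subscribing to the event stream from Python, in case curl is not at hand. It assumes the documented `/v1/events` endpoint with its `queue`, `types` and `filter` parameters. The host `localhost`, the ApiUser `root` with password `icinga`, the queue name `retry-debug` and the helper names are placeholders, not values from this setup.

```python
import base64
import json
import ssl
import urllib.parse
import urllib.request


def build_event_request(host, user, password, service_name,
                        queue="retry-debug", port=5665):
    """Build a POST request for the Icinga 2 /v1/events stream.

    Only CheckResult events for the given service name are requested.
    The queue name is arbitrary; it just has to be unique per consumer.
    """
    query = urllib.parse.urlencode(
        {
            "queue": queue,
            "types": "CheckResult",
            "filter": f'event.service=="{service_name}"',
        },
        quote_via=urllib.parse.quote,  # encode spaces as %20, not '+'
    )
    url = f"https://{host}:{port}/v1/events?{query}"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {
        "Accept": "application/json",
        "Authorization": f"Basic {token}",
    }
    return urllib.request.Request(url, method="POST", headers=headers)


def stream_check_results(request, verify_tls=False):
    """Yield one decoded event per line of the (endless) response body."""
    context = ssl.create_default_context()
    if not verify_tls:
        # Assumption: self-signed cluster certificates, as with `curl -k`.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context) as response:
        for raw_line in response:
            line = raw_line.strip()
            if line:
                yield json.loads(line)


if __name__ == "__main__":
    req = build_event_request("localhost", "root", "icinga",
                              "Icinga service health")
    for event in stream_check_results(req):
        after = event["check_result"]["vars_after"]
        print(event["timestamp"], after["attempt"], after["state_type"])
```

Printing `timestamp` next to `vars_after.attempt` makes it easy to see whether several attempts arrive within the same second.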

miso231 · 2018-09-12T15:40:59Z

The check result is indeed received multiple times a second. This is the output from the event stream:

{"check_result":{"active":true,"check_source":"generic-host","command":null,"execution_end":1536766192.8711719513,"execution_start":1536766192.8710689545,"exit_status":0.0,"output":"Remote Icinga instance 'generic-host' is not connected to 'satelite'","performance_data":[],"schedule_end":1536766192.8711719513,"schedule_start":1536766192.8708860874,"state":3.0,"ttl":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":3.0,"state_type":0.0},"vars_before":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0}},"host":"generic-host","service":"Icinga service health","timestamp":1536766192.8902659416,"type":"CheckResult"}

{"check_result":{"active":true,"check_source":"generic-host","command":null,"execution_end":1536766195.2035400867,"execution_start":1536766195.2034769058,"exit_status":0.0,"output":"Remote Icinga instance 'generic-host' is not connected to 'satelite'","performance_data":[],"schedule_end":1536766195.2035400867,"schedule_start":1536766195.203152895,"state":3.0,"ttl":0.0,"type":"CheckResult","vars_after":{"attempt":2.0,"reachable":true,"state":3.0,"state_type":0.0},"vars_before":{"attempt":1.0,"reachable":true,"state":3.0,"state_type":0.0}},"host":"generic-host","service":"Icinga service health","timestamp":1536766195.2083389759,"type":"CheckResult"}

{"check_result":{"active":true,"check_source":"generic-host","command":null,"execution_end":1536766195.2039070129,"execution_start":1536766195.2038550377,"exit_status":0.0,"output":"Remote Icinga instance 'generic-host' is not connected to 'satelite'","performance_data":[],"schedule_end":1536766195.2039070129,"schedule_start":1536766195.2036719322,"state":3.0,"ttl":0.0,"type":"CheckResult","vars_after":{"attempt":3.0,"reachable":true,"state":3.0,"state_type":0.0},"vars_before":{"attempt":2.0,"reachable":true,"state":3.0,"state_type":0.0}},"host":"generic-host","service":"Icinga service health","timestamp":1536766195.2128579617,"type":"CheckResult"}

{"check_result":{"active":true,"check_source":"generic-host","command":null,"execution_end":1536766195.2140190601,"execution_start":1536766195.2139348984,"exit_status":0.0,"output":"Remote Icinga instance 'generic-host' is not connected to 'satelite'","performance_data":[],"schedule_end":1536766195.2140190601,"schedule_start":1536766195.2021250725,"state":3.0,"ttl":0.0,"type":"CheckResult","vars_after":{"attempt":4.0,"reachable":true,"state":3.0,"state_type":0.0},"vars_before":{"attempt":3.0,"reachable":true,"state":3.0,"state_type":0.0}},"host":"generic-host","service":"Icinga service health","timestamp":1536766195.2393369675,"type":"CheckResult"}
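
To make the timing explicit, here is a small sketch that compares the gap between consecutive attempts with the configured retry_interval. The `attempt` and `execution_end` values are copied from the four events above; the trimmed event layout and the function names are only illustrative.

```python
# Trimmed copies of the four CheckResult events: only the attempt counter
# (vars_after.attempt) and the execution_end timestamp are kept.
EVENTS = [
    {"attempt": 1, "execution_end": 1536766192.8711719513},
    {"attempt": 2, "execution_end": 1536766195.2035400867},
    {"attempt": 3, "execution_end": 1536766195.2039070129},
    {"attempt": 4, "execution_end": 1536766195.2140190601},
]

RETRY_INTERVAL = 30.0  # retry_interval of the service object, in seconds


def retry_gaps(events):
    """Return the seconds elapsed between each pair of consecutive attempts."""
    ordered = sorted(events, key=lambda e: e["attempt"])
    return [
        later["execution_end"] - earlier["execution_end"]
        for earlier, later in zip(ordered, ordered[1:])
    ]


def too_fast(events, retry_interval):
    """Return the attempt numbers that arrived sooner than retry_interval
    after the previous attempt."""
    ordered = sorted(events, key=lambda e: e["attempt"])
    return [
        later["attempt"]
        for earlier, later in zip(ordered, ordered[1:])
        if later["execution_end"] - earlier["execution_end"] < retry_interval
    ]


if __name__ == "__main__":
    for gap in retry_gaps(EVENTS):
        print(f"{gap:.4f}s since previous attempt")
    print("attempts violating retry_interval:", too_fast(EVENTS, RETRY_INTERVAL))
```

Attempt 2 follows attempt 1 after roughly 2.3 seconds, and attempts 3 and 4 follow within milliseconds, so none of the retries waited for the 30 second interval.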

dnsmichi · 2018-09-14T17:39:29Z

Hmmm, so the checks are actually executed on the satellites themselves. Which zone from your zones.conf is affected here?

miso231 · 2018-09-17T08:15:10Z

This would be the zone 'zone3' from the zones.conf

dnsmichi · 2018-09-17T11:39:23Z

Ok, thanks. Still strange: schedule_start points to nearly the same second for each attempt. I'd say this is a bug and needs a reproducer.

One last question: does this happen only with the icinga check via command endpoint, or are other check commands affected too?

miso231 · 2018-09-17T11:48:16Z

Yes, it happens only with the icinga check. Other checks work fine.

miso231 · 2018-09-20T12:59:28Z

I've just found out that this issue occurs only when the icinga2 agent on the host is not running, which results in an UNKNOWN state. If icinga2 is running and has another problem (in my case a problem with reload: ... Last reload attempt failed ...), it works as expected and the retry interval is respected.

marcelfischer · 2018-09-21T14:50:13Z

I have a similar problem but with the cluster-zone command:
https://www.monitoring-portal.org/t/service-with-cluster-zone-command-goes-hard-state-immediately/1301

dnsmichi · 2018-09-27T11:07:42Z

Are you using dependencies by chance?

miso231 · 2018-09-27T12:21:51Z

Yes, I do. Here it is:

apply Dependency "disable-icinga-agent-checks" to Service {
  parent_service_name   = "Icinga service health"
  states                = [ OK, Warning ]
  disable_checks        = false
  disable_notifications = true

  assign where true
  ignore where service.name == "Icinga service health"
  ignore where host.vars.services != "common"
}

dnsmichi · 2018-09-27T12:35:40Z

So, when the check initially fails, the parents defined by the dependency are re-checked immediately, so that reachability is known sooner the next time the service is checked. That's what you see in Icinga Web 2 for anything you define as a parent, e.g. icinga or cluster-zone checks. I guess it is the same as #5022 and #5375.
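
The effect of those immediate re-checks on the soft/hard transition can be illustrated with a deliberately simplified model. This is a sketch, not Icinga's actual implementation: it only assumes that every non-OK result increments the attempt counter and that the state turns hard once the counter reaches max_check_attempts, regardless of how far apart the results are. The class, the function and the "immediate" timestamps are hypothetical.

```python
from dataclasses import dataclass

OK = 0
UNKNOWN = 3


@dataclass
class ServiceState:
    """Simplified soft/hard state tracking for a single service."""
    max_check_attempts: int
    state: int = OK
    attempt: int = 1
    hard: bool = True

    def process(self, new_state):
        """Apply one check result and return (attempt, is_hard)."""
        if new_state == OK:
            self.state, self.attempt, self.hard = OK, 1, True
        elif self.state == OK:
            # First failure after OK: attempt 1, normally a soft state.
            self.state, self.attempt = new_state, 1
            self.hard = self.attempt >= self.max_check_attempts
        elif not self.hard:
            # Every further failure counts, no matter how quickly it arrives.
            self.state = new_state
            self.attempt += 1
            self.hard = self.attempt >= self.max_check_attempts
        return self.attempt, self.hard


def time_to_hard(result_times, max_check_attempts, failing_state=UNKNOWN):
    """Feed one failing result per timestamp; return the seconds from the
    first failure until the hard state, or None if it is never reached."""
    service = ServiceState(max_check_attempts)
    for when in result_times:
        _, hard = service.process(failing_state)
        if hard:
            return when - result_times[0]
    return None


if __name__ == "__main__":
    # Retries spaced by retry_interval = 30 with max_check_attempts = 5.
    print(time_to_hard([0, 30, 60, 90, 120], 5))
    # Hypothetical timestamps for re-checks triggered in quick succession.
    print(time_to_hard([0.0, 2.3, 2.3, 2.3, 2.4], 5))
```

With regularly scheduled retries the model needs 120 seconds to reach the hard state. When results arrive back to back, the same five attempts are used up within a couple of seconds, which matches what the Web UI showed.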

miso231 · 2018-10-01T07:52:31Z

Yes, you are right. Everything works as expected after removing the dependencies.

dnsmichi · 2018-10-08T11:09:57Z

2.10 contains a PR which should fix this behaviour.

dnsmichi changed the title from "[Bug] CheckCommand 'icinga' ignores soft states" to "CheckCommand 'icinga' seems to ignore retry interval via command_endpoint" Sep 11, 2018

dnsmichi added the area/distributed, needs feedback, and area/checks labels Sep 11, 2018

dnsmichi added the bug label Sep 17, 2018

dnsmichi removed the needs feedback label Oct 8, 2018

dnsmichi added this to the 2.10.0 milestone Oct 8, 2018

dnsmichi closed this as completed Oct 8, 2018
