CheckCommand 'icinga' seems to ignore retry interval via command_endpoint #6603

Closed
miso231 opened this issue Sep 6, 2018 · 13 comments
Labels
area/checks Check execution and results area/distributed Distributed monitoring (master, satellites, clients) bug Something isn't working

miso231 commented Sep 6, 2018

CheckCommand icinga ignores soft states when a problem occurs. It jumps straight into the hard state and sends a notification. Perhaps the icinga command ignores the retry_interval parameter of the service object (30s in my case): the Web UI shows all attempts happening within just two seconds, see the image below:

[Image: screenshot from 2018-09-06 15-09-14]

Expected Behavior

CheckCommand icinga should respect the retry_interval and max_check_attempts parameters of the service object.

Context

I'm using a 3-level architecture with an HA master zone and 3 child zones. The issue happens on all 3 levels (masters, satellites, clients). Runtime service object:

Object 'generic-host!Icinga service health' of type 'Service':
  % declared in '/etc/icinga2/zones.d/global-templates/services.conf', lines 253:1-253:37
  * __name = "generic-host!Icinga service health"
  * action_url = ""
  * check_command = "icinga"
    % = modified in '/etc/icinga2/zones.d/global-templates/services.conf', lines 255:3-255:31
  * check_interval = 60
    % = modified in '/etc/icinga2/zones.d/global-templates/templates.conf', lines 13:3-13:25
    % = modified in '/etc/icinga2/zones.d/global-templates/services.conf', lines 258:3-258:25
  * check_period = ""
  * check_timeout = null
  * command_endpoint = "generic-host"
    % = modified in '/etc/icinga2/zones.d/global-templates/services.conf', lines 264:3-264:32
  * display_name = "Icinga service health"
  * enable_active_checks = true
  * enable_event_handler = true
  * enable_flapping = false
  * enable_notifications = true
  * enable_passive_checks = true
  * enable_perfdata = false
    % = modified in '/etc/icinga2/zones.d/global-templates/services.conf', lines 260:3-260:24
  * event_command = ""
  * flapping_threshold = 0
  * flapping_threshold_high = 30
  * flapping_threshold_low = 25
  * groups = [ ]
  * host_name = "generic-host"
    % = modified in '/etc/icinga2/zones.d/global-templates/services.conf', lines 253:1-253:37
    % = modified in '/etc/icinga2/zones.d/global-templates/services.conf', lines 257:3-257:40
  * icon_image = ""
  * icon_image_alt = ""
  * max_check_attempts = 5
    % = modified in '/etc/icinga2/zones.d/global-templates/templates.conf', lines 12:3-12:24
    % = modified in '/etc/icinga2/zones.d/global-templates/services.conf', lines 256:3-256:24
  * name = "Icinga service health"
    % = modified in '/etc/icinga2/zones.d/global-templates/services.conf', lines 253:1-253:37
  * notes = ""
  * notes_url = ""
  * package = "_etc"
    % = modified in '/etc/icinga2/zones.d/global-templates/services.conf', lines 253:1-253:37
  * retry_interval = 30
    % = modified in '/etc/icinga2/zones.d/global-templates/templates.conf', lines 14:3-14:25
    % = modified in '/etc/icinga2/zones.d/global-templates/services.conf', lines 259:3-259:26
  * source_location
    * first_column = 1
    * first_line = 253
    * last_column = 37
    * last_line = 253
    * path = "/etc/icinga2/zones.d/global-templates/services.conf"
  * templates = [ "Icinga service health", "generic-service" ]
    % = modified in '/etc/icinga2/zones.d/global-templates/services.conf', lines 253:1-253:37
    % = modified in '/etc/icinga2/zones.d/global-templates/templates.conf', lines 11:1-11:34
  * type = "Service"
  * vars
    * priority = "p2"
      % = modified in '/etc/icinga2/zones.d/global-templates/services.conf', lines 262:3-262:27
  * volatile = false
  * zone = "generic-zone"
    % = modified in '/etc/icinga2/zones.d/global-templates/services.conf', lines 253:1-253:37

Your Environment

  • Version used (icinga2 --version): r2.9.1-1
  • Operating System and version: CentOS Linux 7 (Core)
  • Enabled features on master (icinga2 feature list): api checker graphite ido-pgsql mainlog notification
  • Enabled features on satellite (icinga2 feature list): api checker mainlog
  • Enabled features on client (icinga2 feature list): api mainlog
  • Icinga Web 2 version and modules (System - About): 2.6.1
  • Config validation (icinga2 daemon -C):
[2018-09-06 14:12:19 +0000] information/cli: Icinga application loader (version: r2.9.1-1)
[2018-09-06 14:12:19 +0000] information/cli: Loading configuration file(s).
[2018-09-06 14:12:19 +0000] information/ConfigItem: Committing config item(s).
[2018-09-06 14:12:19 +0000] information/ApiListener: My API identity: icinga-master-001
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 7293 Services.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 7 ServiceGroups.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 114 HostGroups.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 1 FileLogger.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 1 NotificationComponent.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 6 NotificationCommands.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 9857 Notifications.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 1 IcingaApplication.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 488 Hosts.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 1 ApiListener.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 68 Downtimes.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 6428 Dependencies.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 1 GraphiteWriter.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 19 Comments.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 1 CheckerComponent.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 456 Zones.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 458 Endpoints.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 2 ApiUsers.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 3 UserGroups.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 240 CheckCommands.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 1 IdoPgsqlConnection.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 4 TimePeriods.
[2018-09-06 14:12:22 +0000] information/ConfigItem: Instantiated 8 Users.
[2018-09-06 14:12:22 +0000] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2018-09-06 14:12:22 +0000] information/cli: Finished validating the configuration file(s).
  • If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes.
object Endpoint "icinga-master-001" {
    host = "10.0.0.1"
}
object Endpoint "icinga-master-002" {
    host = "10.0.0.2"
}
object Zone "master" {
    endpoints = [ "icinga-master-002", "icinga-master-001" ]
}

object Zone "zone1" {
    endpoints = [ "icinga-satelite-001" ]
    parent = "master"
}
object Endpoint "icinga-satelite-001" {
    host = "10.1.0.1"
}

object Zone "zone2" {
    endpoints = [ "icinga-satelite-003", "icinga-satelite-002" ]
    parent = "master"
}
object Endpoint "icinga-satelite-003" {
    host = "10.1.0.3"
}
object Endpoint "icinga-satelite-002" {
    host = "10.1.0.2"
}

object Zone "zone3" {
    endpoints = [ "icinga-satelite-004", "icinga-satelite-005" ]
    parent = "master"
}
object Endpoint "icinga-satelite-004" {
    host = "10.1.0.4"
}
object Endpoint "icinga-satelite-005" {
    host = "10.1.0.5"
}

object Zone "global-templates"{
    global = true
}
@dnsmichi
Contributor

Can you try to connect to the API event streams and verify that the check result is received multiple times a second? It looks strange from your Icinga Web 2 screenshot.

https://www.icinga.com/docs/icinga2/latest/doc/15-troubleshooting/#checks-are-not-executed (at the bottom of the section).
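
For anyone following along, subscribing to the event stream could look roughly like the sketch below. It is only an illustration: the base URL, the ApiUser credentials and the queue name `retry-debug` are placeholders, and the ApiUser is assumed to have permission to read CheckResult events. The request is a POST to `/v1/events` with an `Accept: application/json` header, and the API answers with one JSON object per line.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_event_stream_request(base_url, user, password, queue="retry-debug"):
    """Build a POST request for the Icinga 2 /v1/events endpoint.

    base_url, user, password and queue are placeholders for illustration.
    """
    query = urllib.parse.urlencode({"queue": queue, "types": "CheckResult"})
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{base_url}/v1/events?{query}",
        method="POST",
        headers={
            "Accept": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


def watch_service(request, host, service, ssl_context=None):
    """Print timestamp, attempt and state type for each matching check result."""
    with urllib.request.urlopen(request, context=ssl_context) as stream:
        for line in stream:
            if not line.strip():
                continue  # skip blank lines between events
            event = json.loads(line)
            if event.get("host") == host and event.get("service") == service:
                after = event["check_result"]["vars_after"]
                print(event["timestamp"], after["attempt"], after["state_type"])
```

Calling `watch_service(build_event_stream_request("https://localhost:5665", "root", "icinga"), "generic-host", "Icinga service health")` would then print one line per received result; an `ssl.SSLContext` that trusts the Icinga CA can be passed as `ssl_context`.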

@dnsmichi dnsmichi changed the title from "[Bug] CheckCommand 'icinga' ignores soft states" to "CheckCommand 'icinga' seems to ignore retry interval via command_endpoint" Sep 11, 2018
@dnsmichi dnsmichi added the area/distributed, needs feedback, and area/checks labels Sep 11, 2018
@miso231
Author

miso231 commented Sep 12, 2018

The check result is indeed received multiple times a second. This is the output from the event stream:

{"check_result":{"active":true,"check_source":"generic-host","command":null,"execution_end":1536766192.8711719513,"execution_start":1536766192.8710689545,"exit_status":0.0,"output":"Remote Icinga instance 'generic-host' is not connected to 'satelite'","performance_data":[],"schedule_end":1536766192.8711719513,"schedule_start":1536766192.8708860874,"state":3.0,"ttl":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":3.0,"state_type":0.0},"vars_before":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0}},"host":"generic-host","service":"Icinga service health","timestamp":1536766192.8902659416,"type":"CheckResult"}

{"check_result":{"active":true,"check_source":"generic-host","command":null,"execution_end":1536766195.2035400867,"execution_start":1536766195.2034769058,"exit_status":0.0,"output":"Remote Icinga instance 'generic-host' is not connected to 'satelite'","performance_data":[],"schedule_end":1536766195.2035400867,"schedule_start":1536766195.203152895,"state":3.0,"ttl":0.0,"type":"CheckResult","vars_after":{"attempt":2.0,"reachable":true,"state":3.0,"state_type":0.0},"vars_before":{"attempt":1.0,"reachable":true,"state":3.0,"state_type":0.0}},"host":"generic-host","service":"Icinga service health","timestamp":1536766195.2083389759,"type":"CheckResult"}

{"check_result":{"active":true,"check_source":"generic-host","command":null,"execution_end":1536766195.2039070129,"execution_start":1536766195.2038550377,"exit_status":0.0,"output":"Remote Icinga instance 'generic-host' is not connected to 'satelite'","performance_data":[],"schedule_end":1536766195.2039070129,"schedule_start":1536766195.2036719322,"state":3.0,"ttl":0.0,"type":"CheckResult","vars_after":{"attempt":3.0,"reachable":true,"state":3.0,"state_type":0.0},"vars_before":{"attempt":2.0,"reachable":true,"state":3.0,"state_type":0.0}},"host":"generic-host","service":"Icinga service health","timestamp":1536766195.2128579617,"type":"CheckResult"}

{"check_result":{"active":true,"check_source":"generic-host","command":null,"execution_end":1536766195.2140190601,"execution_start":1536766195.2139348984,"exit_status":0.0,"output":"Remote Icinga instance 'generic-host' is not connected to 'satelite'","performance_data":[],"schedule_end":1536766195.2140190601,"schedule_start":1536766195.2021250725,"state":3.0,"ttl":0.0,"type":"CheckResult","vars_after":{"attempt":4.0,"reachable":true,"state":3.0,"state_type":0.0},"vars_before":{"attempt":3.0,"reachable":true,"state":3.0,"state_type":0.0}},"host":"generic-host","service":"Icinga service health","timestamp":1536766195.2393369675,"type":"CheckResult"}
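
To make the timing easier to read, here is a small sketch that computes the gaps between the attempts. The `execution_start` values are copied from the four events above and the 30 second retry interval comes from the service object; the helper names are made up for this example.

```python
RETRY_INTERVAL = 30.0  # seconds, retry_interval of the service object

# execution_start values copied from the four CheckResult events above
execution_starts = [
    1536766192.8710689545,
    1536766195.2034769058,
    1536766195.2038550377,
    1536766195.2139348984,
]


def gaps(timestamps):
    """Return the time difference between each pair of consecutive results."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]


def too_fast(timestamps, retry_interval):
    """Return the gaps that are shorter than the configured retry interval."""
    return [gap for gap in gaps(timestamps) if gap < retry_interval]


for gap in gaps(execution_starts):
    print(f"{gap:.4f}s between attempts (expected about {RETRY_INTERVAL:.0f}s)")
```

All three gaps are far below the configured retry interval, which matches what the Web UI shows.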

@dnsmichi
Contributor

Hmmm, so the checks are actually executed on the satellites themselves. Which zone is affected here from your configuration zones.conf?

@miso231
Author

miso231 commented Sep 17, 2018

This would be the zone 'zone3' from the zones.conf

@dnsmichi
Contributor

Ok, thanks. Still strange: schedule_start points to nearly the same second for each attempt. I'd say this is a bug and needs a reproducer.

One last question: does this happen only with the icinga check via command endpoint, or are other check commands affected too?

@dnsmichi dnsmichi added the bug Something isn't working label Sep 17, 2018
@miso231
Author

miso231 commented Sep 17, 2018

Yes, it happens only with the icinga check. Other checks work fine.

@miso231
Author

miso231 commented Sep 20, 2018

I've just found out that this issue occurs only when the icinga2 agent on the host is not running (unknown state). If icinga2 is running and has some other problem (in my case a problem with reload: ... Last reload attempt failed ...), it works as expected and the retry interval is respected.

@marcelfischer

I have a similar problem but with the cluster-zone command:
https://www.monitoring-portal.org/t/service-with-cluster-zone-command-goes-hard-state-immediately/1301

@dnsmichi
Contributor

Are you using dependencies by chance?

@miso231
Author

miso231 commented Sep 27, 2018

Yes, I do. Here it is:

apply Dependency "disable-icinga-agent-checks" to Service {
  parent_service_name   = "Icinga service health"
  states                = [ OK, Warning ]
  disable_checks        = false
  disable_notifications = true

  assign where true
  ignore where service.name == "Icinga service health"
  ignore where host.vars.services != "common"
}

@dnsmichi
Contributor

So, when a check initially fails, the parents defined by the dependency are re-checked immediately, so that reachability is already known the next time the service is checked. That's what you see within Icinga Web 2 for anything you define as a parent, e.g. icinga or cluster-zone checks. I guess it is the same as with #5022 and #5375.
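
A toy model of that effect, as a rough sketch only: it is not Icinga's scheduler code. It assumes that every non-OK result for the parent counts as one attempt, and that the state turns hard once the attempt counter reaches max_check_attempts (5 in the service object above). The `triggered` offsets are loosely based on the event stream output earlier in this thread; the fifth value is hypothetical.

```python
SOFT, HARD = "SOFT", "HARD"


def replay(result_times, max_check_attempts):
    """Return (time, attempt, state_type) for a run of consecutive non-OK results."""
    history = []
    attempt = 0
    for t in result_times:
        # every result counts as an attempt, no matter how it was triggered
        attempt = min(attempt + 1, max_check_attempts)
        state_type = HARD if attempt >= max_check_attempts else SOFT
        history.append((t, attempt, state_type))
    return history


def seconds_until_hard(result_times, max_check_attempts):
    """Seconds between the first failure and the first hard state, or None."""
    for t, _attempt, state_type in replay(result_times, max_check_attempts):
        if state_type == HARD:
            return t - result_times[0]
    return None


# Retries spaced by retry_interval = 30s, as the reporter expected.
scheduled = [i * 30.0 for i in range(5)]

# Re-checks triggered by failing child services (offsets in seconds,
# the last one is a made-up value for the fifth attempt).
triggered = [0.0, 2.33, 2.33, 2.34, 2.40]

print(seconds_until_hard(scheduled, 5), seconds_until_hard(triggered, 5))
```

With regular retries the hard state is reached four retry intervals after the first failure; with dependency-triggered re-checks the same number of attempts is used up within a few seconds.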

@miso231
Author

miso231 commented Oct 1, 2018

Yes, you are right. Everything works as expected after removing the dependencies.

@dnsmichi
Contributor

dnsmichi commented Oct 8, 2018

2.10 contains a PR which should fix this behaviour.

@dnsmichi dnsmichi removed the needs feedback label Oct 8, 2018
@dnsmichi dnsmichi added this to the 2.10.0 milestone Oct 8, 2018
@dnsmichi dnsmichi closed this as completed Oct 8, 2018