Shinken stops doing anything intermittently #1636

dmsimard · 2015-05-19T18:15:40Z

We have a pretty annoying issue that persists even after upgrading to Shinken 2.4: Every hour or so, Shinken schedulers and pollers will stop doing anything for up to 10 minutes.

We've tried many things in order to try and resolve the issue but running out of ideas right now..

Some insight I believe could be relevant on what the setup and configuration looks like:

All servers are running Ubuntu 14.04 with the latest packages, including Shinken 2.4.
Scheduler retention is in redis on the host where the arbiter is, latest version of the module.
All servers are ntp-synchronized (logs seem to imply something is off..)
Schedulers and pollers are SSL but not hard_ssl.
1k+ hosts, 15k+ services across 5 realms (5 realms member of "All", nothing is directly in "All")
Livestatus and Thruk

The realm configuration looks like this:

Realm "All":

Arbiter
Broker
Reactioner

Realm "Default":

Two active schedulers
One poller

Realm "01":

One scheduler
One poller

Realm "N":

One scheduler
One poller

Some data on the problem:

Looking at the logs, this seems to correspond to regular re-distribution of the configuration:

[Tue May 19 02:16:56 2015] INFO: [Shinken] Retention information updated in Redis
[Tue May 19 02:16:56 2015] WARNING: [Shinken] 6 actions never came back for the satellite 'node.monitoring.tld'.I reenable them for polling
[Tue May 19 02:16:56 2015] WARNING: [Shinken] A system time change of 932 has been detected. Compensating...
[Tue May 19 02:17:33 2015] INFO: [Shinken] Retention information updated in Redis
[Tue May 19 02:17:33 2015] INFO: [Shinken] Waiting for initial configuration
[Tue May 19 02:17:33 2015] INFO: [Shinken] New configuration received
[Tue May 19 02:17:33 2015] INFO: [Shinken] Correctly loaded dummy_scheduler as an old-new-style shinken module :|
[Tue May 19 02:17:33 2015] INFO: [Shinken] Correctly loaded dummy_poller as an old-new-style shinken module :|
[Tue May 19 02:17:33 2015] INFO: [Shinken] Correctly loaded dummy_broker_external as an old-new-style shinken module :|
[Tue May 19 02:17:33 2015] INFO: [Shinken] Correctly loaded retention-memcache as an old-new-style shinken module :|
[Tue May 19 02:17:33 2015] INFO: [Shinken] Correctly loaded retention-redis as an old-new-style shinken module :|
[Tue May 19 02:17:33 2015] INFO: [Shinken] Correctly loaded dummy_broker as an old-new-style shinken module :|
[Tue May 19 02:17:33 2015] INFO: [Shinken] Correctly loaded dummy_arbiter as an old-new-style shinken module :|
[Tue May 19 02:17:33 2015] INFO: [Shinken] Trying to init module: RedisRetention
[Tue May 19 02:17:33 2015] INFO: [Shinken] I correctly loaded the modules: [RedisRetention]
[Tue May 19 02:17:33 2015] INFO: [Shinken] Loading configuration.
[Tue May 19 02:17:33 2015] INFO: [Shinken] New configuration loaded
[Tue May 19 02:17:35 2015] INFO: [Shinken] [RedisRetention] Retention objects loaded successfully.
[Tue May 19 02:17:35 2015] INFO: [Shinken] [node.monitoring.tld] First scheduling launched
[Tue May 19 02:17:36 2015] INFO: [Shinken] [node.monitoring.tld] First scheduling done
[Tue May 19 02:21:59 2015] INFO: [Shinken] A new broker just connected : broker.monitoring.tld
[Tue May 19 02:22:00 2015] INFO: [Shinken] [node.monitoring.tld] Created 1245 initial Broks for broker broker.monitoring.tld
[Tue May 19 03:33:54 2015] INFO: [Shinken] Retention information updated in Redis
[Tue May 19 03:33:54 2015] WARNING: [Shinken] 11 actions never came back for the satellite 'node.monitoring.tld'.I reenable them for polling
[Tue May 19 03:33:54 2015] WARNING: [Shinken] A system time change of 933 has been detected. Compensating...
[Tue May 19 03:34:11 2015] INFO: [Shinken] Retention information updated in Redis
[Tue May 19 03:34:11 2015] INFO: [Shinken] Waiting for initial configuration
[Tue May 19 03:34:11 2015] INFO: [Shinken] New configuration received
[Tue May 19 03:34:11 2015] INFO: [Shinken] Correctly loaded dummy_scheduler as an old-new-style shinken module :|
[Tue May 19 03:34:11 2015] INFO: [Shinken] Correctly loaded dummy_poller as an old-new-style shinken module :|
[Tue May 19 03:34:11 2015] INFO: [Shinken] Correctly loaded dummy_broker_external as an old-new-style shinken module :|
[Tue May 19 03:34:11 2015] INFO: [Shinken] Correctly loaded retention-memcache as an old-new-style shinken module :|
[Tue May 19 03:34:11 2015] INFO: [Shinken] Correctly loaded retention-redis as an old-new-style shinken module :|
[Tue May 19 03:34:11 2015] INFO: [Shinken] Correctly loaded dummy_broker as an old-new-style shinken module :|
[Tue May 19 03:34:11 2015] INFO: [Shinken] Correctly loaded dummy_arbiter as an old-new-style shinken module :|
[Tue May 19 03:34:11 2015] INFO: [Shinken] Trying to init module: RedisRetention
[Tue May 19 03:34:11 2015] INFO: [Shinken] I correctly loaded the modules: [RedisRetention]
[Tue May 19 03:34:11 2015] INFO: [Shinken] Loading configuration.
[Tue May 19 03:34:11 2015] INFO: [Shinken] New configuration loaded
[Tue May 19 03:34:13 2015] INFO: [Shinken] [RedisRetention] Retention objects loaded successfully.
[Tue May 19 03:34:13 2015] INFO: [Shinken] [node.monitoring.tld] First scheduling launched
[Tue May 19 03:34:13 2015] INFO: [Shinken] [node.monitoring.tld] First scheduling done
[Tue May 19 03:34:13 2015] INFO: [Shinken] A new broker just connected : broker.monitoring.tld
[Tue May 19 03:34:14 2015] INFO: [Shinken] [node.monitoring.tld] Created 1245 initial Broks for broker broker.monitoring.tld
[Tue May 19 04:50:43 2015] INFO: [Shinken] Retention information updated in Redis
[Tue May 19 04:50:43 2015] WARNING: [Shinken] 15 actions never came back for the satellite 'node.monitoring.tld'.I reenable them for polling
[Tue May 19 04:50:43 2015] WARNING: [Shinken] A system time change of 944 has been detected. Compensating...
[Tue May 19 04:51:36 2015] INFO: [Shinken] Retention information updated in Redis
[Tue May 19 04:51:36 2015] INFO: [Shinken] Waiting for initial configuration
[Tue May 19 04:51:36 2015] INFO: [Shinken] New configuration received
[Tue May 19 04:51:37 2015] INFO: [Shinken] Correctly loaded dummy_scheduler as an old-new-style shinken module :|
[Tue May 19 04:51:37 2015] INFO: [Shinken] Correctly loaded dummy_poller as an old-new-style shinken module :|
[Tue May 19 04:51:37 2015] INFO: [Shinken] Correctly loaded dummy_broker_external as an old-new-style shinken module :|
[Tue May 19 04:51:37 2015] INFO: [Shinken] Correctly loaded retention-memcache as an old-new-style shinken module :|
[Tue May 19 04:51:37 2015] INFO: [Shinken] Correctly loaded retention-redis as an old-new-style shinken module :|
[Tue May 19 04:51:37 2015] INFO: [Shinken] Correctly loaded dummy_broker as an old-new-style shinken module :|
[Tue May 19 04:51:37 2015] INFO: [Shinken] Correctly loaded dummy_arbiter as an old-new-style shinken module :|
[Tue May 19 04:51:37 2015] INFO: [Shinken] Trying to init module: RedisRetention
[Tue May 19 04:51:37 2015] INFO: [Shinken] I correctly loaded the modules: [RedisRetention]
[Tue May 19 04:51:37 2015] INFO: [Shinken] Loading configuration.
[Tue May 19 04:51:37 2015] INFO: [Shinken] New configuration loaded
[Tue May 19 04:51:39 2015] INFO: [Shinken] [RedisRetention] Retention objects loaded successfully.
[Tue May 19 04:51:39 2015] INFO: [Shinken] [node.monitoring.tld] First scheduling launched
[Tue May 19 04:51:39 2015] INFO: [Shinken] [node.monitoring.tld] First scheduling done
[Tue May 19 04:51:39 2015] INFO: [Shinken] A new broker just connected : broker.monitoring.tld
[Tue May 19 04:51:39 2015] INFO: [Shinken] [node.monitoring.tld] Created 1245 initial Broks for broker broker.monitoring.tld

We definitely don't reload the configuration on a regular basis so it looks like this is Shinken's own doing.

There is also what might be related but could be otherwise another issue - our Arbiter often gets timeouts when distributing the configuration to the schedulers and pollers. This isn't a network issue - if you curl the https link directly, it will not respond "OK", it will really timeout and doing a tcpdump, you see that the curl request indeed makes it to the server. It looks like the poller or the scheduler gets stuck on something but eventually comes back available by itself. Stopping and starting the services resolves this but is not a viable solution.
I turned on debug logging on the pollers/schedulers but haven't found anything that seemed relevant.

Any ideas ? I can provide more details if need be, this is a critical issue for us.

The text was updated successfully, but these errors were encountered:

olivierHa · 2015-05-19T18:32:49Z

Did you try without the redis retention module ?
Le 19 mai 2015 8:15 PM, "dmsimard" notifications@github.com a écrit :

We have a pretty annoying issue that persists even after upgrading to
Shinken 2.4: Every hour or so, Shinken schedulers and pollers will stop
doing anything for up to 10 minutes.

We've tried many things in order to try and resolve the issue but running
out of ideas right now..

Some insight I believe could be relevant on what the setup and
configuration looks like:

All servers are running Ubuntu 14.04 with the latest packages,
including Shinken 2.4.

Scheduler retention is in redis on the host where the arbiter is,
latest version of the module.

All servers are ntp-synchronized (logs seem to imply something is
off..)

Schedulers and pollers are SSL but not hard_ssl.

1k+ hosts, 15k+ services across 5 realms (5 realms member of "All",
nothing is directly in "All")

Livestatus and Thruk

The realm configuration looks like this:

Realm "All":

Arbiter

Broker

Reactioner

Realm "Default":

Two active schedulers

One poller

Realm "01":

One scheduler

One poller

Realm "N":

One scheduler

One poller

Some data on the problem:
[image: Performance info]
https://camo.githubusercontent.com/3a4cfbb7cc7e6cd60294f1480889a19052984002/687474703a2f2f692e696d6775722e636f6d2f3869504d4e64362e706e67

[image: Graphs]
https://camo.githubusercontent.com/a8fd9aea2afa74f51e1d9cf38b714f4e68423d8d/687474703a2f2f692e696d6775722e636f6d2f344276366d78662e706e67

Looking at the logs, this seems to correspond to regular re-distribution
of the configuration:

[Tue May 19 02:16:56 2015] INFO: [Shinken] Retention information updated in Redis
[Tue May 19 02:16:56 2015] WARNING: [Shinken] 6 actions never came back for the satellite 'node.monitoring.tld'.I reenable them for polling
[Tue May 19 02:16:56 2015] WARNING: [Shinken] A system time change of 932 has been detected. Compensating...
[Tue May 19 02:17:33 2015] INFO: [Shinken] Retention information updated in Redis
[Tue May 19 02:17:33 2015] INFO: [Shinken] Waiting for initial configuration
[Tue May 19 02:17:33 2015] INFO: [Shinken] New configuration received
[Tue May 19 02:17:33 2015] INFO: [Shinken] Correctly loaded dummy_scheduler as an old-new-style shinken module :|
[Tue May 19 02:17:33 2015] INFO: [Shinken] Correctly loaded dummy_poller as an old-new-style shinken module :|
[Tue May 19 02:17:33 2015] INFO: [Shinken] Correctly loaded dummy_broker_external as an old-new-style shinken module :|
[Tue May 19 02:17:33 2015] INFO: [Shinken] Correctly loaded retention-memcache as an old-new-style shinken module :|
[Tue May 19 02:17:33 2015] INFO: [Shinken] Correctly loaded retention-redis as an old-new-style shinken module :|
[Tue May 19 02:17:33 2015] INFO: [Shinken] Correctly loaded dummy_broker as an old-new-style shinken module :|
[Tue May 19 02:17:33 2015] INFO: [Shinken] Correctly loaded dummy_arbiter as an old-new-style shinken module :|
[Tue May 19 02:17:33 2015] INFO: [Shinken] Trying to init module: RedisRetention
[Tue May 19 02:17:33 2015] INFO: [Shinken] I correctly loaded the modules: [RedisRetention]
[Tue May 19 02:17:33 2015] INFO: [Shinken] Loading configuration.
[Tue May 19 02:17:33 2015] INFO: [Shinken] New configuration loaded
[Tue May 19 02:17:35 2015] INFO: [Shinken] [RedisRetention] Retention objects loaded successfully.
[Tue May 19 02:17:35 2015] INFO: [Shinken] [node.monitoring.tld] First scheduling launched
[Tue May 19 02:17:36 2015] INFO: [Shinken] [node.monitoring.tld] First scheduling done
[Tue May 19 02:21:59 2015] INFO: [Shinken] A new broker just connected : broker.monitoring.tld
[Tue May 19 02:22:00 2015] INFO: [Shinken] [node.monitoring.tld] Created 1245 initial Broks for broker broker.monitoring.tld
[Tue May 19 03:33:54 2015] INFO: [Shinken] Retention information updated in Redis
[Tue May 19 03:33:54 2015] WARNING: [Shinken] 11 actions never came back for the satellite 'node.monitoring.tld'.I reenable them for polling
[Tue May 19 03:33:54 2015] WARNING: [Shinken] A system time change of 933 has been detected. Compensating...
[Tue May 19 03:34:11 2015] INFO: [Shinken] Retention information updated in Redis
[Tue May 19 03:34:11 2015] INFO: [Shinken] Waiting for initial configuration
[Tue May 19 03:34:11 2015] INFO: [Shinken] New configuration received
[Tue May 19 03:34:11 2015] INFO: [Shinken] Correctly loaded dummy_scheduler as an old-new-style shinken module :|
[Tue May 19 03:34:11 2015] INFO: [Shinken] Correctly loaded dummy_poller as an old-new-style shinken module :|
[Tue May 19 03:34:11 2015] INFO: [Shinken] Correctly loaded dummy_broker_external as an old-new-style shinken module :|
[Tue May 19 03:34:11 2015] INFO: [Shinken] Correctly loaded retention-memcache as an old-new-style shinken module :|
[Tue May 19 03:34:11 2015] INFO: [Shinken] Correctly loaded retention-redis as an old-new-style shinken module :|
[Tue May 19 03:34:11 2015] INFO: [Shinken] Correctly loaded dummy_broker as an old-new-style shinken module :|
[Tue May 19 03:34:11 2015] INFO: [Shinken] Correctly loaded dummy_arbiter as an old-new-style shinken module :|
[Tue May 19 03:34:11 2015] INFO: [Shinken] Trying to init module: RedisRetention
[Tue May 19 03:34:11 2015] INFO: [Shinken] I correctly loaded the modules: [RedisRetention]
[Tue May 19 03:34:11 2015] INFO: [Shinken] Loading configuration.
[Tue May 19 03:34:11 2015] INFO: [Shinken] New configuration loaded
[Tue May 19 03:34:13 2015] INFO: [Shinken] [RedisRetention] Retention objects loaded successfully.
[Tue May 19 03:34:13 2015] INFO: [Shinken] [node.monitoring.tld] First scheduling launched
[Tue May 19 03:34:13 2015] INFO: [Shinken] [node.monitoring.tld] First scheduling done
[Tue May 19 03:34:13 2015] INFO: [Shinken] A new broker just connected : broker.monitoring.tld
[Tue May 19 03:34:14 2015] INFO: [Shinken] [node.monitoring.tld] Created 1245 initial Broks for broker broker.monitoring.tld
[Tue May 19 04:50:43 2015] INFO: [Shinken] Retention information updated in Redis
[Tue May 19 04:50:43 2015] WARNING: [Shinken] 15 actions never came back for the satellite 'node.monitoring.tld'.I reenable them for polling
[Tue May 19 04:50:43 2015] WARNING: [Shinken] A system time change of 944 has been detected. Compensating...
[Tue May 19 04:51:36 2015] INFO: [Shinken] Retention information updated in Redis
[Tue May 19 04:51:36 2015] INFO: [Shinken] Waiting for initial configuration
[Tue May 19 04:51:36 2015] INFO: [Shinken] New configuration received
[Tue May 19 04:51:37 2015] INFO: [Shinken] Correctly loaded dummy_scheduler as an old-new-style shinken module :|
[Tue May 19 04:51:37 2015] INFO: [Shinken] Correctly loaded dummy_poller as an old-new-style shinken module :|
[Tue May 19 04:51:37 2015] INFO: [Shinken] Correctly loaded dummy_broker_external as an old-new-style shinken module :|
[Tue May 19 04:51:37 2015] INFO: [Shinken] Correctly loaded retention-memcache as an old-new-style shinken module :|
[Tue May 19 04:51:37 2015] INFO: [Shinken] Correctly loaded retention-redis as an old-new-style shinken module :|
[Tue May 19 04:51:37 2015] INFO: [Shinken] Correctly loaded dummy_broker as an old-new-style shinken module :|
[Tue May 19 04:51:37 2015] INFO: [Shinken] Correctly loaded dummy_arbiter as an old-new-style shinken module :|
[Tue May 19 04:51:37 2015] INFO: [Shinken] Trying to init module: RedisRetention
[Tue May 19 04:51:37 2015] INFO: [Shinken] I correctly loaded the modules: [RedisRetention]
[Tue May 19 04:51:37 2015] INFO: [Shinken] Loading configuration.
[Tue May 19 04:51:37 2015] INFO: [Shinken] New configuration loaded
[Tue May 19 04:51:39 2015] INFO: [Shinken] [RedisRetention] Retention objects loaded successfully.
[Tue May 19 04:51:39 2015] INFO: [Shinken] [node.monitoring.tld] First scheduling launched
[Tue May 19 04:51:39 2015] INFO: [Shinken] [node.monitoring.tld] First scheduling done
[Tue May 19 04:51:39 2015] INFO: [Shinken] A new broker just connected : broker.monitoring.tld
[Tue May 19 04:51:39 2015] INFO: [Shinken] [node.monitoring.tld] Created 1245 initial Broks for broker broker.monitoring.tld

We definitely don't reload the configuration on a regular basis so it
looks like this is Shinken's own doing.

Any ideas ? I can provide more details if need be, this is a critical
issue for us.

—
Reply to this email directly or view it on GitHub
#1636.

dmsimard · 2015-05-19T18:41:36Z

@olivierHa Part of the things we tried was to switch to Redis, we previously used Memcached retention.

olivierHa · 2015-05-19T19:43:06Z

Have you tried to launch a daemon using the cli to see more "output/debug"
messages ?
And did you try without retention module at all ?

2015-05-19 20:41 GMT+02:00 dmsimard notifications@github.com:

@olivierHa https://github.com/olivierHa Part of the things we tried was
to switch to Redis, we previously used Memcached retention.

—
Reply to this email directly or view it on GitHub
#1636 (comment).

dmsimard · 2015-05-19T22:19:08Z

There's no obvious errors in the logs, even in debug output.
Using retention to file (pickle) for scheduler inhibits this problem (shinken doesn't stop anything every hour or so) but reloading the arbiter flushes all the retention.

We don't want to lose the retention (acknowlegements, service statuses, etc.) across shinken reloads.

olivierHa · 2015-05-20T05:56:39Z

In debug mode you should see when redis retention is called ('asking ...')
Maybe saving to redis is taking too much time.
And there is something wrong with the time.
Le 20 mai 2015 12:19 AM, "dmsimard" notifications@github.com a écrit :

There's no obvious errors in the logs, even in debug output.
Using retention to file (pickle) for scheduler inhibits this problem
(shinken doesn't stop anything every hour or so) but reloading the arbiter
flushes all the retention.

We don't want to lose the retention (acknowlegements, service statuses,
etc.) across shinken reloads.

—
Reply to this email directly or view it on GitHub
#1636 (comment).

dmsimard · 2015-05-21T14:43:33Z

So further troubleshooting this got me thinking the issue might have been related to the fact that there are two active schedulers in one of my realms (the bulk of the hosts and services are there).

My understanding when you have multiple schedulers in one realm is that the arbiter will distribute evenly your highest level of hosts/services (parent hosts) across your schedulers. It looks like this distribution is random and non-determinist and might be the root cause of the problem where hosts and services are shuffled around, causing retention-related issues.

Yesterday, I've removed one of the two schedulers in this specific realm and even with the redis retention, I am no longer having the issue where shinken stops doing anything for 10 minutes every hour.

fpeterschmitt · 2016-01-22T13:25:10Z

Might be related to #1807 too. Shinken seems not to like many schedulers...

naparuba added the TOSORT label Apr 30, 2016

naparuba added this to the Not prioritized milestone Apr 30, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Shinken stops doing anything intermittently #1636

Shinken stops doing anything intermittently #1636

dmsimard commented May 19, 2015

olivierHa commented May 19, 2015

dmsimard commented May 19, 2015

olivierHa commented May 19, 2015

dmsimard commented May 19, 2015

olivierHa commented May 20, 2015

dmsimard commented May 21, 2015

fpeterschmitt commented Jan 22, 2016

Shinken stops doing anything intermittently #1636

Shinken stops doing anything intermittently #1636

Comments

dmsimard commented May 19, 2015

olivierHa commented May 19, 2015

dmsimard commented May 19, 2015

olivierHa commented May 19, 2015

dmsimard commented May 19, 2015

olivierHa commented May 20, 2015

dmsimard commented May 21, 2015

fpeterschmitt commented Jan 22, 2016