Improved heartbeat controller to engine monitoring for long running tasks #3290
Conversation
…rom controller checking engine connectivity
…eat monitoring. Shuts down engines if controller fails to ping them for 1 consecutive minute
Excellent, I've been meaning to do this for ages.
@@ -74,6 +74,9 @@ class HeartMonitor(LoggingConfigurable):
        help='The frequency at which the Hub pings the engines for heartbeats '
        '(in ms)',
    )
    max_heartmonitor_misses = Integer(20, config=True,
        help='Allow consecutive misses from engine to controller heart monitor before shutting down.',
change 'shutting down' to 'unregistering'
and 20 seems awfully high - maybe 10?
Great points, thank you. I updated the documentation and default.
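For anyone hitting the same disconnects, both knobs can be set in the usual IPython config files. A minimal sketch, with illustrative values rather than recommendations (`HeartMonitor.period` is the existing ping interval from the diff above; `EngineFactory.max_heartbeat_misses` is the engine-side counterpart, default 50):

```python
# ipcontroller_config.py -- illustrative values, not recommendations
c = get_config()

# existing trait: how often the Hub pings engines, in ms
c.HeartMonitor.period = 3000

# new trait from this PR: consecutive misses tolerated before
# an engine is unregistered
c.HeartMonitor.max_heartmonitor_misses = 10
```

```python
# ipengine_config.py -- the engine-side counterpart
c = get_config()
c.EngineFactory.max_heartbeat_misses = 50
```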
…elp handle long running engines. Requires ipython/ipython#3290 for improved large/long running cluster handling
Excellent, thanks!
I'm using IPython parallel for long running tasks (hours to days) on larger clusters (100+ engines) and have been intermittently losing connectivity with engines. The ipcontroller log would periodically report a block of 5 to 10 `registration::unregister_engine` messages and remove those engines.

In digging into the problem, it appears the current behavior is to ping engines every 3 seconds and kill an engine if it fails to respond to pings twice: the first failed ping puts the engine `on_probation`, and the second triggers unregistration. For longer running engines this is too restrictive, and it is at odds with the new `EngineFactory.max_heartbeat_misses`, which defaults to 50 misses.

This pull request provides a configuration variable, `HeartMonitor.max_heartmonitor_misses`, which lets the ipcontroller specify how many consecutive misses to allow before unregistering an engine. The default tolerates failing to contact an engine for 1 minute. I'm happy to adjust or include different defaults if desired.

Thanks much for all your work on IPython.

Brad
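To make the behavior concrete, here is a simplified, hypothetical sketch of the miss-counting logic described above. It is not the actual `HeartMonitor` source; the class, method names, and bookkeeping are illustrative, with only the trait names and defaults taken from the diff and discussion.

```python
# Hypothetical sketch of the miss-counting behavior this PR describes --
# not the actual HeartMonitor source. Only the trait names (period,
# max_heartmonitor_misses) come from the diff above.

class HeartMonitorSketch(object):
    period = 3000                 # ms between controller pings
    max_heartmonitor_misses = 20  # consecutive misses tolerated (PR default)

    def __init__(self):
        self.misses = {}  # engine heart id -> consecutive missed pings

    def handle_pong(self, heart):
        # any response from the engine resets its miss counter
        self.misses[heart] = 0

    def handle_miss(self, heart):
        # old behavior: first miss -> on_probation, second -> unregister.
        # new behavior: tolerate up to max_heartmonitor_misses in a row,
        # e.g. 20 misses * 3000 ms = 1 minute without contact.
        self.misses[heart] = self.misses.get(heart, 0) + 1
        if self.misses[heart] >= self.max_heartmonitor_misses:
            self.unregister(heart)

    def unregister(self, heart):
        print("registration::unregister_engine %s" % heart)
```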