Improved heartbeat controller to engine monitoring for long running tasks #3290
Conversation
…rom controller checking engine connectivity
…eat monitoring. Shuts down engines if controller fails to ping them for 1 consecutive minute
Excellent, I've been meaning to do this for ages.
@@ -74,6 +74,9 @@ class HeartMonitor(LoggingConfigurable):
        help='The frequency at which the Hub pings the engines for heartbeats '
        '(in ms)',
    )
    max_heartmonitor_misses = Integer(20, config=True,
        help='Allow consecutive misses from engine to controller heart monitor before shutting down.',
change 'shutting down' to 'unregistering'
and 20 seems awfully high - maybe 10?
Great points, thank you. I updated the documentation and default.
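For anyone hitting the same disconnects, both knobs can be set in the usual IPython config files. A minimal sketch, with illustrative values rather than recommendations (`HeartMonitor.period` is the existing ping interval from the diff above; `EngineFactory.max_heartbeat_misses` is the engine-side counterpart, default 50):

```python
# ipcontroller_config.py -- illustrative values, not recommendations
c = get_config()

# existing trait: how often the Hub pings engines, in ms
c.HeartMonitor.period = 3000

# new trait from this PR: consecutive misses tolerated before
# an engine is unregistered
c.HeartMonitor.max_heartmonitor_misses = 10
```

```python
# ipengine_config.py -- the engine-side counterpart
c = get_config()
c.EngineFactory.max_heartbeat_misses = 50
```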
…elp handle long running engines. Requires ipython/ipython#3290 for improved large/long running cluster handling
Excellent, thanks!
I'm using IPython parallel for long running tasks (hours to days) on larger clusters (100+ engines) and have been intermittently losing connectivity with engines. The ipcontroller log would periodically report a block of 5 to 10 `registration::unregister_engine` messages and remove those engines.

In digging into the problem, it appears the current behavior is to ping engines every 3 seconds and kill an engine if it fails to respond to pings twice: the first failed ping puts the engine `on_probation`, and the second triggers unregistration. For longer running engines this is too restrictive, and it is at odds with the new `EngineFactory.max_heartbeat_misses`, which defaults to 50 misses.

This pull request provides a configuration variable, `HeartMonitor.max_heartmonitor_misses`, which lets the ipcontroller specify how many consecutive misses to allow before unregistering an engine. The default tolerates failing to contact an engine for 1 minute. I'm happy to adjust or include different defaults if desired.

Thanks much for all your work on IPython.

Brad
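To make the behavior concrete, here is a simplified, hypothetical sketch of the miss-counting logic described above. It is not the actual `HeartMonitor` source; the class, method names, and bookkeeping are illustrative, with only the trait names and defaults taken from the diff and discussion.

```python
# Hypothetical sketch of the miss-counting behavior this PR describes --
# not the actual HeartMonitor source. Only the trait names (period,
# max_heartmonitor_misses) come from the diff above.

class HeartMonitorSketch(object):
    period = 3000                 # ms between controller pings
    max_heartmonitor_misses = 20  # consecutive misses tolerated (PR default)

    def __init__(self):
        self.misses = {}  # engine heart id -> consecutive missed pings

    def handle_pong(self, heart):
        # any response from the engine resets its miss counter
        self.misses[heart] = 0

    def handle_miss(self, heart):
        # old behavior: first miss -> on_probation, second -> unregister.
        # new behavior: tolerate up to max_heartmonitor_misses in a row,
        # e.g. 20 misses * 3000 ms = 1 minute without contact.
        self.misses[heart] = self.misses.get(heart, 0) + 1
        if self.misses[heart] >= self.max_heartmonitor_misses:
            self.unregister(heart)

    def unregister(self, heart):
        print("registration::unregister_engine %s" % heart)
```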