
Improved heartbeat controller to engine monitoring for long running tasks #3290

Merged
merged 4 commits into ipython:master May 9, 2013

Conversation

chapmanb (Contributor) commented May 8, 2013

I'm using IPython parallel for long-running tasks (hours to days) on larger clusters (100+ engines) and have been losing connectivity with engines. The ipcontroller log would intermittently report a block of 5 to 10 registration::unregister_engine messages and remove those engines.

Digging into the problem, it appears that the current behavior is to ping engines every 3 seconds and unregister an engine if it fails to respond to pings twice in a row: the first failed ping puts an engine on_probation and the second triggers unregistration.
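To make that two-strike behavior concrete, here is a simplified sketch of the bookkeeping described above (not the actual HeartMonitor implementation; names other than on_probation are illustrative):

```python
# Simplified sketch of the current two-strike heartbeat policy described above.
# The real HeartMonitor drives this over zmq sockets; only the bookkeeping is shown.
class TwoStrikeMonitor(object):
    period = 3000  # ms between pings (the current default interval)

    def __init__(self):
        self.hearts = set()        # registered engines
        self.responses = set()     # engines that answered the current ping
        self.on_probation = set()  # engines that missed the previous ping

    def beat_check(self):
        """Called once per period, after responses have been collected."""
        missed = self.hearts - self.responses
        for heart in missed:
            if heart in self.on_probation:
                self.unregister(heart)        # second consecutive miss: drop it
            else:
                self.on_probation.add(heart)  # first miss: one more chance
        self.on_probation -= self.responses   # a reply clears probation
        self.responses = set()

    def unregister(self, heart):
        self.hearts.discard(heart)
        self.on_probation.discard(heart)
```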

For longer-running engines this is too restrictive, and it is at odds with the new EngineFactory.max_heartbeat_misses, which defaults to 50 misses.

This pull request adds a configuration variable, HeartMonitor.max_heartmonitor_misses, which lets ipcontroller specify how many consecutive misses to allow before unregistering an engine. The default allows roughly 1 minute of failed contact before an engine is unregistered.
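For illustration, the new option would be set through IPython's normal configuration mechanism, alongside the existing engine-side limit mentioned above; the values here are examples only, not recommended defaults:

```python
# ipcontroller_config.py (controller side) -- example value, not a recommendation
c = get_config()
c.HeartMonitor.max_heartmonitor_misses = 10  # consecutive misses before unregistering

# ipengine_config.py (engine side) -- the existing engine-side limit
# c.EngineFactory.max_heartbeat_misses = 50
```

The same setting should also be accepted on the command line in the usual trait syntax, e.g. ipcontroller --HeartMonitor.max_heartmonitor_misses=10.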

I'm happy to adjust or include different defaults if desired. Thanks much for all your work on IPython.
Brad

…eat monitoring. Shuts down engines if controller fails to ping them for 1 consecutive minute
minrk (Member) commented May 8, 2013

Excellent, I've been meaning to do this for ages.

@@ -74,6 +74,9 @@ class HeartMonitor(LoggingConfigurable):
         help='The frequency at which the Hub pings the engines for heartbeats '
         '(in ms)',
     )
+    max_heartmonitor_misses = Integer(20, config=True,
+        help='Allow consecutive misses from engine to controller heart monitor before shutting down.',
minrk (Member) commented on the diff:

change 'shutting down' to 'unregistering'

and 20 seems awfully high - maybe 10?

chapmanb (Contributor, Author) commented May 8, 2013

Great points, thank you. I updated the documentation and default.

chapmanb added a commit to roryk/ipython-cluster-helper that referenced this pull request May 8, 2013
…elp handle long running engines. Requires ipython/ipython#3290 for improved large/long running cluster handling
minrk (Member) commented May 9, 2013

Excellent, thanks!

minrk added a commit that referenced this pull request May 9, 2013
Improved heartbeat controller to engine monitoring for long running tasks

minrk merged commit 2b0be41 into ipython:master May 9, 2013
mattvonrocketstein pushed a commit to mattvonrocketstein/ipython that referenced this pull request Nov 3, 2014
Improved heartbeat controller to engine monitoring for long running tasks
