Initial bare-metal implementation of elastic mode for fault tolerance and auto-scaling #1849
Merged
1b87e1f
Initial commit of Elastic Horovod
tgaddair 2ad2107
Fixed unit tests
tgaddair b1e559f
Removed elastic from run
tgaddair 3b3f960
Fixed Buildkite tests
tgaddair c6cd306
Fixed blacklist check
tgaddair 3d3f864
Added fault tolerance without scaling
tgaddair 8bc8f73
Refactored registration
tgaddair 6fa884d
Fixed unit tests
tgaddair e40f034
Added fault tolerance unit test
tgaddair e6a080c
Fixed file paths
tgaddair c4a39e7
Added modules
tgaddair 0c12de9
Fixed imports
tgaddair 4a4d128
More tests
tgaddair a0126ef
Reverting more tests
tgaddair eb6e0f5
Fixed slots
tgaddair a139280
Fixed keras state
tgaddair ac3963a
Removed requirement for host group
tgaddair cda7f8d
Fixed interactive tests
tgaddair 077c33d
Fixed keras tests
tgaddair 65e8e01
Fixed imports
tgaddair 8269341
Only test elastic
tgaddair 376b6ce
Fixed Spark tests
tgaddair 489d880
Fixed broadcast_object args
tgaddair c7edff4
Fixed spark test
tgaddair 0288423
Test remove test
tgaddair a4c80f3
TensorFlow 1.15
tgaddair 37a3d05
TensorFlow CPU mode
tgaddair 185d4b1
Fixed unit tests
tgaddair 3017726
Back to 1.14
tgaddair 75c7cdd
Renable all test
tgaddair f5ee955
Fixed unit tests
tgaddair a3446e4
Drop support for very old versions of frameworks
tgaddair b30838f
Merge branch 'master' into elastic
tgaddair 990240e
Added ncclCommAbort checks to ensure safe clean-up of GPU memory
tgaddair 6eab532
Fixed compilation
tgaddair 6a418db
Adasum Error Check
tgaddair 9307c66
Updated docs
tgaddair 9465fda
Merge branch 'master' into elastic
EnricoMi 2fd8719
Fixed wait_for_available_hosts
tgaddair 4cc79c7
Updated Buildkite tests
tgaddair 130c7b2
Merged master
tgaddair 484d271
Addressed comments
tgaddair 3c22e30
Added Keras example to doc
tgaddair c1dc521
Renamed sigterm_received -> signal_received
tgaddair 526b290
Added emphasis
tgaddair 2ac644e
Added more emphasis
tgaddair 30f99f9
Fixed host assignment to spawn processes for all new slots
tgaddair 9d4f7c3
Fixed pending slots
tgaddair 26e9e81
Fixed behavior of discovery background thread to fail if the first up…
tgaddair bdfb993
Added comments explaining barrier reset
tgaddair 8ee9fb9
Ensure that at least one previously active host is still assigned whe…
tgaddair c2a41e2
Skip notifying workers when host changes would not result in changes …
tgaddair 7164aa3
Only manage notifications on coordinator
tgaddair f6f6c7b
Fixed notification tests
tgaddair 5219909
Only check host message on rank 0 in integration tests
tgaddair 98409fc
Renamed assigned_hosts -> ordered_available_hosts
tgaddair 9458b0b
Directly compare host assignments with proposed next assignments
tgaddair 338d42d
Added rank assignments and removed iteration over worker clients
tgaddair 7789efd
Removed unused functions
tgaddair 6558a49
Fixed setting rank_assignments
tgaddair c1e10f0
Merge branch 'master' into elastic
EnricoMi ae6ddcc
Fix previous merge master
EnricoMi 1bd317b
Fixed flakiness in testing forward_stream by joining in all cases exc…
tgaddair d6ec90a
Try-except http requests
tgaddair 4875450
Revert "Try-except http requests"
tgaddair 9503b2b
Merge branch 'master' into elastic
EnricoMi 582801e
Added logging
tgaddair 61181b4
Experimental safe_shell_exec execute without fork
tgaddair 2429563
Remove joing_streams, set stop signal for background threads when pro…
tgaddair c8a6c16
Close streams
tgaddair 7a37356
Removed forking safe_shell_exec
tgaddair 767ecbc
Remove Python 3 code
tgaddair 4efa0e1
Fixed process termination in safe_shell_exec
tgaddair 69ae18d
Updated barrier reset comments
tgaddair 563b84e
Renamed slot_info -> coordinator_slot_info
tgaddair 9eb6e4f
Added comment about stability
tgaddair cdf7ded
Fixed oneCCL config
tgaddair ecf164e
Restored middleman to safe_shell_exec
tgaddair 000441a
Fix host updates check to avoid checking rank information explicitly
tgaddair c352d2d
Merged master
tgaddair 88544d1
Gen pipeline fix
tgaddair e9b2133
Merge
tgaddair b65b749
Refactored host management to ensure that host updates do not conflic…
tgaddair 46693a4
Removed redundant hosts variables and unified with host_slots
tgaddair 8f2bcd3
Added additional checks for auto-scaling jobs to detect common interf…
tgaddair 6717d54
Mock start hosts
tgaddair 0a9c531
Fixed updating current hosts with latest blacklist information
tgaddair f5f884d
Fixed previous timestamp
tgaddair 0b143cd
Test on CPU
tgaddair 6b3f62e
Broadcast tests on CPU
tgaddair 3ae4cb6
Added unit test coverage for size == 1 at start
tgaddair a0f0ec9
Fixed for TensorFlow
tgaddair 8eee5fe
Update gradient average divisor on world reset
tgaddair 5f204ea
Local size
tgaddair ab546b7
Added more robust exception handling to controller
tgaddair a8176ad
Fixed raw pointer accesses
tgaddair d8150c8
Added tests for killing process in addition to raising exceptions
tgaddair d842e2a
Add elastic_timeout to ElasticSettings
EnricoMi d628073
Fix earlier commit
EnricoMi 267fbb2
Guard against more psutil.NoSuchProcess
EnricoMi b8c7f62
Added tests
tgaddair a81e681
Force shutdown when initial host discovery fails and added test
tgaddair 8804eab
Added test for min hosts
tgaddair 220a868
Changed min hosts check condition
tgaddair 617091c
Added additional checks around Torch Horovod calls to raise HorovodIn…
tgaddair 4506a1d
Renaming for consistency
tgaddair f4d7c0b
Do not call _get_host_assignments when we have insufficient slots
tgaddair 959ef05
Removed caching
tgaddair 1bcd240
Removed size variable updating, will do in separate PR in C++
tgaddair c0869b0
Merge remote-tracking branch 'upstream/master' into elastic
EnricoMi b894f9c
Make rsh handle interrupt events
EnricoMi 090b8b5
Removed obsolete test image
EnricoMi d0fb6f4
Fix gloo test excludes for python 2
EnricoMi 08f136e
Upgrade to TensorFlow 2.2, fix skip tests for TensorFlow < 1.15
tgaddair dbe703f
Updated Buildkite
tgaddair 4fb7cb6
Fixed tests for TensorFlow 2.2
tgaddair
Are we focusing only on py3 for the elastic feature?
It makes sense to me.
Yes, the `Barrier` feature used in the driver does not exist in Python 2. So rather than finding a less elegant solution, I thought it would be better to just drop Python 2 support :).
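For readers unfamiliar with the primitive: assuming the reply refers to `threading.Barrier` from the standard library (added in Python 3.2, with no Python 2 equivalent), a minimal sketch of how such a barrier synchronizes threads might look like the following. The `run_workers` helper and its `rank` argument are hypothetical names chosen for illustration; this is not the driver code from this PR.

```python
import threading


def run_workers(num_workers):
    """Start num_workers threads that all rendezvous at a single barrier.

    Hypothetical illustration only; not taken from the Horovod driver.
    """
    barrier = threading.Barrier(num_workers)
    results = []
    lock = threading.Lock()

    def worker(rank):
        # wait() blocks until num_workers threads have called it, then
        # returns a distinct integer in range(num_workers) to each thread.
        index = barrier.wait(timeout=5)
        with lock:
            results.append((rank, index))

    threads = [threading.Thread(target=worker, args=(rank,))
               for rank in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)


if __name__ == "__main__":
    for rank, index in run_workers(4):
        print("rank", rank, "left the barrier with index", index)
```

No thread proceeds past `barrier.wait()` until every party has arrived, which is the behavior that would otherwise have to be rebuilt by hand from conditions and counters on Python 2.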