New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial bare-metal implementation of elastic mode for fault tolerance and auto-scaling #1849
Conversation
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Renamed HOROVOD_ELASTIC_START_TIMEOUT to HOROVOD_ELASTIC_TIMEOUT Renamed BaseSettings.timeout to BaseSettings.start_timeout Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Looks like commit 5f204ea broke tensorflow. |
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
…ternalError Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
return False | ||
return host in self._host_assignments and len(self._host_assignments[host]) > slot | ||
|
||
@property |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why a @property
, why not world_size
, coordinator_info
, results
and finished
as well? In any case, should not return the mutable dict self._host_assignments
but a copy of it, deep copy not necessary here
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All comments have been addressed, LGTM!
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Resolves #1907.
RFC.