Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial bare-metal implementation of elastic mode for fault tolerance and auto-scaling #1849

Merged
merged 116 commits into from May 15, 2020

Conversation

tgaddair
Copy link
Collaborator

@tgaddair tgaddair commented Apr 7, 2020

Resolves #1907.

RFC.

Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
tgaddair and others added 6 commits May 11, 2020 12:10
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Renamed HOROVOD_ELASTIC_START_TIMEOUT to HOROVOD_ELASTIC_TIMEOUT
Renamed BaseSettings.timeout to BaseSettings.start_timeout

Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
@EnricoMi
Copy link
Collaborator

Looks like commit 5f204ea broke tensorflow.

Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
…ternalError

Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
return False
return host in self._host_assignments and len(self._host_assignments[host]) > slot

@property
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why a @property, why not world_size, coordinator_info, results and finished as well? In any case, should not return the mutable dict self._host_assignments but a copy of it, deep copy not necessary here

tgaddair and others added 3 commits May 13, 2020 09:04
Copy link
Collaborator

@EnricoMi EnricoMi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All comments have been addressed, LGTM!

EnricoMi and others added 5 commits May 13, 2020 23:37
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Elastic mode to enable fault tolerance and auto-scaling
7 participants