New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
torch estimator: check all ranks have same device type #2942
Conversation
8efcb87
to
10b9493
Compare
0bfaefa
to
9576eaa
Compare
horovod/spark/lightning/remote.py
Outdated
# We need to check all ranks have same device type for traning. | ||
# Horovod doesn't support heterogeneous allreduce for gradients. | ||
cuda_avail_list = hvd.allgather_object(cuda_available, name='device type') | ||
assert cuda_avail_list.count(cuda_available) == hvd.size(), "All ranks don't have same device type!" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I forgot to point out: this should be an exception (probably RuntimeError
) rather than an assertion. Two reasons for this:
- Exceptions can be caught, whereas assertions effectively terminate the program, so there's nothing the caller can do to recover.
- Assertions can be removed by the interpreter at runtime when optimizers are enabled.
Assertions are primarily for debugging application code, whereas Exceptions are for errors at runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, I was thinking about raise assertion error instead of RuntimeError before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
Limitation: currently it cannot be applied to keras estimator, since allgather_object() requires tensorflow session is ready at that time. Signed-off-by: Chongxiao Cao <chongxiaoc@uber.com>
9576eaa
to
a9364c8
Compare
@tgaddair Can you take one more look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Signed-off-by: Chongxiao Cao chongxiaoc@uber.com
Checklist before submitting
Description
Fixes # (issue).
Review process to land