
torch estimator: check all ranks have same device type #2942

Merged
merged 1 commit into horovod:master from check_same_device on Jun 2, 2021

Conversation

@chongxiaoc (Collaborator) commented May 27, 2021

Signed-off-by: Chongxiao Cao chongxiaoc@uber.com

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

Fixes # (issue).

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.


github-actions bot commented May 27, 2021

Unit Test Results

     801 files  ±0       801 suites  ±0   6h 15m 17s ⏱️ ±0s
     600 tests ±0       565 ✔️ ±0       35 💤 ±0  0 ❌ ±0 
16 651 runs  ±0  12 509 ✔️ ±0  4 142 💤 ±0  0 ❌ ±0 

Results for commit ef31a42. ± Comparison against base commit ef31a42.

♻️ This comment has been updated with latest results.

@chongxiaoc chongxiaoc changed the title spark: check all ranks have same device type torch estimator: check all ranks have same device type May 27, 2021
@chongxiaoc chongxiaoc force-pushed the check_same_device branch 2 times, most recently from 0bfaefa to 9576eaa on May 28, 2021 21:00
# We need to check that all ranks have the same device type for training.
# Horovod doesn't support heterogeneous allreduce for gradients.
cuda_avail_list = hvd.allgather_object(cuda_available, name='device type')
assert cuda_avail_list.count(cuda_available) == hvd.size(), "All ranks don't have the same device type!"
@tgaddair (Collaborator) commented May 29, 2021

One thing I forgot to point out: this should be an exception (probably RuntimeError) rather than an assertion. Two reasons for this:

  1. Exceptions can be caught, whereas assertions effectively terminate the program, so there's nothing the caller can do to recover.
  2. Assertions can be removed by the interpreter at runtime when optimizations are enabled (e.g. when running Python with -O).

Assertions are primarily for debugging application code, whereas exceptions are for handling errors at runtime.
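A minimal sketch of what the suggested change could look like, assuming the check stays based on hvd.allgather_object (the helper name below is hypothetical, not taken from the PR diff):

import horovod.torch as hvd

def _check_all_ranks_same_device_type(cuda_available):
    # Gather each rank's device flag so every worker sees the full list.
    cuda_avail_list = hvd.allgather_object(cuda_available, name='device type')
    if cuda_avail_list.count(cuda_available) != hvd.size():
        # A RuntimeError can be caught by the caller and is never stripped
        # by the interpreter, unlike a failed assert under python -O.
        raise RuntimeError('All ranks do not have the same device type!')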

@chongxiaoc (Collaborator, Author)

Okay, I was thinking about raising an AssertionError instead of a RuntimeError before.

@chongxiaoc (Collaborator, Author)

fixed.

Limitation: currently this cannot be applied to the Keras estimator, since
allgather_object() requires the TensorFlow session to be ready at that point.

Signed-off-by: Chongxiao Cao <chongxiaoc@uber.com>
@chongxiaoc (Collaborator, Author)

@tgaddair Can you take one more look?

@tgaddair (Collaborator) left a comment

LGTM!

@tgaddair tgaddair merged commit ef31a42 into horovod:master Jun 2, 2021