
torch estimator: check all ranks have same device type #2942

Merged
merged 1 commit into horovod:master from check_same_device on Jun 2, 2021

Conversation

@chongxiaoc (Collaborator) commented May 27, 2021

Signed-off-by: Chongxiao Cao chongxiaoc@uber.com

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

Fixes # (issue).

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.


github-actions bot commented May 27, 2021

Unit Test Results

     801 files  ±0       801 suites  ±0   6h 15m 17s ⏱️ ±0s
     600 tests ±0       565 ✔️ ±0       35 💤 ±0  0 ❌ ±0 
16 651 runs  ±0  12 509 ✔️ ±0  4 142 💤 ±0  0 ❌ ±0 

Results for commit ef31a42. ± Comparison against base commit ef31a42.

♻️ This comment has been updated with latest results.

@chongxiaoc chongxiaoc changed the title spark: check all ranks have same device type torch estimator: check all ranks have same device type May 27, 2021
@chongxiaoc chongxiaoc force-pushed the check_same_device branch 2 times, most recently from 0bfaefa to 9576eaa on May 28, 2021 21:00
# We need to check that all ranks have the same device type for training.
# Horovod doesn't support heterogeneous allreduce for gradients.
cuda_avail_list = hvd.allgather_object(cuda_available, name='device type')
assert cuda_avail_list.count(cuda_available) == hvd.size(), "All ranks don't have the same device type!"
@tgaddair (Collaborator) commented May 29, 2021

One thing I forgot to point out: this should be an exception (probably RuntimeError) rather than an assertion. Two reasons for this:

  1. Exceptions can be caught, whereas assertions effectively terminate the program, so there's nothing the caller can do to recover.
  2. Assertions can be removed by the interpreter at runtime when optimizations are enabled (e.g. when running Python with -O).

Assertions are primarily for debugging application code, whereas exceptions are for handling errors at runtime.
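A minimal sketch of what the suggested change could look like, assuming the check stays based on hvd.allgather_object (the helper name below is hypothetical, not taken from the PR diff):

import horovod.torch as hvd

def _check_all_ranks_same_device_type(cuda_available):
    # Gather each rank's device flag so every worker sees the full list.
    cuda_avail_list = hvd.allgather_object(cuda_available, name='device type')
    if cuda_avail_list.count(cuda_available) != hvd.size():
        # A RuntimeError can be caught by the caller and is never stripped
        # by the interpreter, unlike a failed assert under python -O.
        raise RuntimeError('All ranks do not have the same device type!')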

@chongxiaoc (Collaborator, Author)

Okay, I was thinking about raising an AssertionError instead of a RuntimeError before.

@chongxiaoc (Collaborator, Author)

fixed.

Limitation: currently this cannot be applied to the Keras estimator, since
allgather_object() requires the TensorFlow session to be ready at that point.

Signed-off-by: Chongxiao Cao <chongxiaoc@uber.com>
@chongxiaoc (Collaborator, Author)

@tgaddair Can you take one more look?

@tgaddair (Collaborator) left a comment

LGTM!

@tgaddair tgaddair merged commit ef31a42 into horovod:master Jun 2, 2021