Add NCCL Broadcast operation #1579

Merged: tgaddair merged 6 commits into horovod:master from nvcastet:add_nccl_bcast on Jan 8, 2020

nvcastet (Contributor) commented Dec 10, 2019

Horovod needs to be compiled with HOROVOD_GPU_BROADCAST=NCCL.
Implements #1521.

jessebenson (Contributor) commented Dec 12, 2019

Good to see more NCCL ops!

tgaddair (Collaborator) commented Dec 17, 2019

Hey @nvcastet, looks like some unit tests are failing:

ERROR:tensorflow:2 root error(s) found.
  (0) Unknown: Type uint8 is not supported in NCCL mode.
     [[node HorovodBroadcast_Cast_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
     [[All/_7]]
  (1) Unknown: Type uint8 is not supported in NCCL mode.
     [[node HorovodBroadcast_Cast_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.
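
The error above is raised when the broadcast path meets a tensor type it has no mapping for. A broadcast only copies data and performs no arithmetic, so one way to avoid this class of error is to size the transfer in bytes rather than map every element type. The following is a minimal sketch of that idea, not the change made in this PR: `SketchDataType`, `ElementSize`, and `BroadcastByteCount` are hypothetical names, and the 1-byte size for `Bool` is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for a framework's tensor dtype enum (not Horovod's).
enum class SketchDataType {
  UInt8, Int8, Int32, Int64, Float16, Float32, Float64, Bool
};

// Size in bytes of one element of the given dtype.
// Assumption: Bool is stored as a single byte.
inline std::size_t ElementSize(SketchDataType dtype) {
  switch (dtype) {
    case SketchDataType::UInt8:
    case SketchDataType::Int8:
    case SketchDataType::Bool:
      return 1;
    case SketchDataType::Float16:
      return 2;
    case SketchDataType::Int32:
    case SketchDataType::Float32:
      return 4;
    case SketchDataType::Int64:
    case SketchDataType::Float64:
      return 8;
  }
  throw std::invalid_argument("unknown dtype");
}

// Number of bytes a broadcast has to move for `num_elements` elements.
// With this count the buffer can be sent as raw bytes, whatever its dtype.
inline std::size_t BroadcastByteCount(SketchDataType dtype,
                                      std::size_t num_elements) {
  const std::size_t element_size = ElementSize(dtype);
  // Guard against overflow of the byte count.
  assert(num_elements <= SIZE_MAX / element_size);
  return element_size * num_elements;
}
```

Under these assumptions, a uint8 tensor of 10 elements yields a byte count of 10, and a float32 tensor of 3 elements yields 12.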

nvcastet force-pushed the nvcastet:add_nccl_bcast branch 6 times, most recently from e08fc16 to c140a81, on Dec 19, 2019
nvcastet force-pushed the nvcastet:add_nccl_bcast branch 3 times, most recently from 26fa419 to 225003a, on Jan 3, 2020

nvcastet (Contributor, Author) commented Jan 6, 2020

@tgaddair It is finally ready for review :)

tgaddair (Collaborator) left a comment

Looks great, @nvcastet! Just a couple questions.


  Status Execute(std::vector<TensorTableEntry>& entries,
                 const Response& response) override;
  NCCLOp(NCCLContext* nccl_context, HorovodGlobalState* global_state)

tgaddair (Collaborator) commented Jan 6, 2020

In a previous PR (https://github.com/horovod/horovod/pull/1480/files#diff-5798de9e6fdc83e7f7bf0fa8344a56c0R65) we opted to use delegation to solve this problem by introducing CUDAOpContext. It looks like we could do something similar here, which would make things in our codebase more consistent. What do you think?

nvcastet (Contributor, Author) commented Jan 6, 2020

I agree that we should keep things consistent throughout the codebase. Let me give it a try.
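
The delegation approach discussed in this thread can be sketched as follows: instead of inheriting shared setup logic from a common base class, each op owns a small context object and forwards to it. This is only an illustration of the pattern. `SketchNcclOpContext`, `SketchBroadcast`, and `SketchAllreduce` are hypothetical names and do not reproduce Horovod's classes or the NCCL API.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical shared helper: holds the setup logic that several ops need.
class SketchNcclOpContext {
 public:
  // Records that a communicator was prepared for the given device list.
  void InitCommunicator(const std::vector<int>& devices) {
    devices_ = devices;
    ++init_calls_;
  }
  int init_calls() const { return init_calls_; }
  const std::vector<int>& devices() const { return devices_; }

 private:
  std::vector<int> devices_;
  int init_calls_ = 0;
};

// Each op owns a context and delegates to it rather than inheriting from it.
class SketchBroadcast {
 public:
  std::string Execute(const std::vector<int>& devices, int root_rank) {
    assert(root_rank >= 0);
    op_context_.InitCommunicator(devices);  // shared setup, via delegation
    return "broadcast from rank " + std::to_string(root_rank);
  }
  const SketchNcclOpContext& context() const { return op_context_; }

 private:
  SketchNcclOpContext op_context_;
};

// A second op reuses the same helper without sharing a base class.
class SketchAllreduce {
 public:
  std::string Execute(const std::vector<int>& devices) {
    op_context_.InitCommunicator(devices);  // same shared setup
    return "allreduce over " + std::to_string(devices.size()) + " devices";
  }
  const SketchNcclOpContext& context() const { return op_context_; }

 private:
  SketchNcclOpContext op_context_;
};
```

With this layout the two ops stay free to inherit from different framework base classes while still sharing one implementation of the setup code.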

@@ -28,7 +28,8 @@ enum DataType:byte {
   HOROVOD_FLOAT16 = 6,
   HOROVOD_FLOAT32 = 7,
   HOROVOD_FLOAT64 = 8,
-  HOROVOD_BOOL = 9
+  HOROVOD_BOOL = 9,
+  HOROVOD_BYTE = 10

tgaddair (Collaborator) commented Jan 6, 2020

I'm a little confused by HOROVOD_BYTE. In what cases would it be used in place of HOROVOD_UINT8?

nvcastet (Contributor, Author) commented Jan 6, 2020

I was under the impression that we needed to keep the enums consistent between enum DataType in message.h and enum DataType in message.fbs/message_generated.h. Am I incorrect? HOROVOD_BYTE was already in message.h.

tgaddair (Collaborator) commented Jan 6, 2020

Ah, I see what happened. The HOROVOD_BYTE dtype was added in aa605d6 to abstract an API call to MPI_Bcast, but was subsequently simplified in 669eb9e.

The result is that HOROVOD_BYTE is no longer used for anything, so I think we can safely remove it.
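
The concern raised in this thread, two enum definitions that must stay in step, can be checked at compile time. The sketch below assumes two hypothetical enums, `WireDataType` (standing in for a generated schema enum) and `CoreDataType` (standing in for a hand-written one). Only the four values visible in the diff above are mirrored, and the macro and function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the enum generated from a schema file.
enum class WireDataType : std::int8_t {
  Float16 = 6, Float32 = 7, Float64 = 8, Bool = 9
};

// Hypothetical stand-in for the hand-written enum used by the core code.
enum class CoreDataType : int {
  Float16 = 6, Float32 = 7, Float64 = 8, Bool = 9
};

// Fails the build if an enumerator has different values in the two enums,
// or if it exists in only one of them.
#define SKETCH_CHECK_SAME(name)                                  \
  static_assert(static_cast<int>(WireDataType::name) ==          \
                    static_cast<int>(CoreDataType::name),        \
                "enum mismatch: " #name)

SKETCH_CHECK_SAME(Float16);
SKETCH_CHECK_SAME(Float32);
SKETCH_CHECK_SAME(Float64);
SKETCH_CHECK_SAME(Bool);

// With the checks above in place, converting by numeric value is safe.
constexpr CoreDataType ToCore(WireDataType t) {
  return static_cast<CoreDataType>(static_cast<int>(t));
}
```

For example, `assert(ToCore(WireDataType::Float32) == CoreDataType::Float32);` holds. Adding an enumerator to only one enum and listing it in a `SKETCH_CHECK_SAME` line would stop compilation, which surfaces an unused or stale value early.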

nvcastet added 5 commits Dec 10, 2019
Horovod needs to be compiled with HOROVOD_GPU_BROADCAST=NCCL.
Implements #1521.

Signed-off-by: Nicolas V Castet <nvcastet@us.ibm.com>
nvcastet force-pushed the nvcastet:add_nccl_bcast branch 3 times, most recently from 2ef7ed5 to db24f30, on Jan 7, 2020
Signed-off-by: Nicolas V Castet <nvcastet@us.ibm.com>
nvcastet force-pushed the nvcastet:add_nccl_bcast branch from db24f30 to 62c41cf on Jan 7, 2020
nvcastet requested a review from tgaddair on Jan 7, 2020

tgaddair (Collaborator) left a comment

LGTM! Thanks for the quick turnaround following the review comments.

tgaddair merged commit 80167f6 into horovod:master on Jan 8, 2020
3 checks passed:
build
DCO
buildkite/horovod/pr: Build #1799 passed (1 hour, 3 minutes, 59 seconds)