-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add NCCL Broadcast operation #1579
Conversation
Good to see more NCCL ops! |
Hey @nvcastet, looks like some unit tests are failing:
|
e08fc16
to
c140a81
Compare
26fa419
to
225003a
Compare
@tgaddair It is finally ready for review :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, @nvcastet! Just a couple questions.
horovod/common/ops/nccl_operations.h
Outdated
|
||
Status Execute(std::vector<TensorTableEntry>& entries, | ||
const Response& response) override; | ||
NCCLOp(NCCLContext* nccl_context, HorovodGlobalState* global_state) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a previous PR (https://github.com/horovod/horovod/pull/1480/files#diff-5798de9e6fdc83e7f7bf0fa8344a56c0R65) we opted to use delegation to solve this problem by introducing CUDAOpContext
. It looks like we could do something similar here, which would make things in our codebase more consistent. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree to try to keep things consistent through the code base. Let me give it a try.
horovod/common/wire/message.fbs
Outdated
@@ -28,7 +28,8 @@ enum DataType:byte { | |||
HOROVOD_FLOAT16 = 6, | |||
HOROVOD_FLOAT32 = 7, | |||
HOROVOD_FLOAT64 = 8, | |||
HOROVOD_BOOL = 9 | |||
HOROVOD_BOOL = 9, | |||
HOROVOD_BYTE = 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little confused by HOROVOD_BYTE
. In what cases would it be used in place of HOROVOD_UINT8
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was under the impression that we needed to keep the enums consistent between enum DataType
in message.h and enum DataType
in message.fbs/message_generated.h. Am i incorrect? HOROVOD_BYTE
was already in message.h.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Horovod needs to be compiled with HOROVOD_GPU_BROADCAST=NCCL. Implements horovod#1521. Signed-off-by: Nicolas V Castet <nvcastet@us.ibm.com>
Signed-off-by: Nicolas V Castet <nvcastet@us.ibm.com>
Signed-off-by: Nicolas V Castet <nvcastet@us.ibm.com>
Signed-off-by: Nicolas V Castet <nvcastet@us.ibm.com>
Signed-off-by: Nicolas V Castet <nvcastet@us.ibm.com>
2ef7ed5
to
db24f30
Compare
Signed-off-by: Nicolas V Castet <nvcastet@us.ibm.com>
db24f30
to
62c41cf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for the quick turnaround following the review comments.
Horovod needs to be compiled with HOROVOD_GPU_BROADCAST=NCCL.
Implements #1521.