Add Join op (with only support for AllReduce and PyTorch for now) #1058
Conversation
Thanks for starting this! I left some stylistic feedback. To answer your questions:
1. I don't think the Join op needs a name, but we should make sure only one Join op is enqueued on any given rank. Duplicate-name checking may be a way around it.
2. Yes, it should return the last joined rank.
3. It feels like we need to expand the operations API to allow for no-ops. All kinds of operations would then have to implement this additional API.
4. If I understand correctly, you're thinking about bit vectors in the accelerated orchestration logic? I think two bits would be sufficient to indicate that some/all of the ranks did Join() (see the sketch after this list).
5-6. Rank-tracking logic is all done in the orchestration layer. We should implement this for both the bitvector and the standard approach. The logic for both is here; I recommend studying it carefully, as it has become a bit complicated over time.
7. I'd suggest sticking to a single PR for now, until it becomes unmanageable (usually it doesn't).
8. We'd need to eventually support PyTorch, TensorFlow, and Apache MXNet. Legacy PyTorch could be made unsupported if it's too much work.
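To make the two-bit idea in item 4 concrete, here is a rough Python sketch; the bit positions, constants, and helper names are invented for illustration and are not Horovod's actual cache layout:

```python
# Hypothetical illustration of reserving two status bits in the coordinator's
# bit vector: one for "some ranks joined", one for "all ranks joined".
JOIN_SOME_BIT = 0  # set when at least one rank has called Join()
JOIN_ALL_BIT = 1   # set when every rank has called Join()

def encode_join_status(num_joined, world_size):
    """Pack the world's join status into two low-order bits."""
    bits = 0
    if num_joined > 0:
        bits |= 1 << JOIN_SOME_BIT
    if num_joined == world_size:
        bits |= 1 << JOIN_ALL_BIT
    return bits

def decode_join_status(bits):
    some = bool(bits & (1 << JOIN_SOME_BIT))
    all_ = bool(bits & (1 << JOIN_ALL_BIT))
    return some, all_

# Example: 3 of 4 ranks have joined -> "some" is set, "all" is not.
print(decode_join_status(encode_join_status(3, 4)))  # (True, False)
```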
Rebased on master, addressed style comments, and fixed a "Segmentation fault" caused by trying to cache the Join op.
Hi @alsrgv, I've coded some basic logic for the JoinOp (the code is not in this PR yet): global state on the coordinator stores the number of ranks that have already joined, and each rank knows whether it has joined. The problem is that the ranks that actually perform the reduce (the not-yet-joined ones) use MPI or NCCL reduce, which expects all ranks to participate. One way to solve this is to make every rank know the list of already-joined ranks and create a new MPI communicator that uses only a subset of all nodes; this does not seem really feasible... Another idea is that joined ranks still participate in the MPI reduce (the Join op's Execute calls MPI reduce), but with zero tensors; in this case the joined nodes need to somehow know the shape of the reduced tensor... Any suggestions?
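To make the zero-tensor idea concrete, here is a minimal mpi4py sketch (my own illustration, not code from this PR): every rank still enters Allreduce, but a "joined" rank contributes zeros, so the collective completes and the sum is unchanged. The shape problem mentioned above shows up as the joined rank needing to know `shape` in advance.

```python
# Run with: mpirun -np 4 python zero_tensor_allreduce.py
# Illustrative only: "joined" ranks still enter the collective, but contribute zeros.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

shape = (2, 3)              # the joined rank must somehow learn this shape
joined = rank == 0          # pretend rank 0 has already run out of data

if joined:
    sendbuf = np.zeros(shape, dtype=np.float32)     # contributes nothing to the sum
else:
    sendbuf = np.full(shape, rank, dtype=np.float32)

recvbuf = np.empty(shape, dtype=np.float32)
comm.Allreduce(sendbuf, recvbuf, op=MPI.SUM)        # all ranks participate, so no deadlock
print(f"rank {rank}: {recvbuf[0]}")
```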
An update after my last comment: the coordinator tracks how many ranks have sent Join requests, and each rank knows whether it has joined. @alsrgv, could you review whether this makes sense?
Another problem I have is how to keep the Python thread alive after one rank joins earlier than the others. The wait must be in the Python thread, not in the background thread, and I'm not sure how to communicate back that the wait is over (all ranks have sent Join requests).
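One generic way to hand the "all ranks joined" signal back to the Python thread is an event it blocks on; below is a minimal sketch with a plain threading.Event (the names and the mechanism are hypothetical, not necessarily what this PR ends up doing):

```python
import threading
import time

join_done = threading.Event()

def background_loop():
    # Stand-in for the Horovod background thread: pretend the coordinator
    # reports that all ranks have sent Join requests after a short delay.
    time.sleep(1.0)
    join_done.set()          # signal the Python thread that the wait is over

threading.Thread(target=background_loop, daemon=True).start()

# In the Python thread, join() would block here until every rank has joined.
join_done.wait()
print("all ranks joined; safe to continue")
```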
So I've implemented this design (similar to my previous comment, but with some modifications): the coordinator tracks how many ranks have already sent Join requests, and each rank knows whether it has joined. This seems to work OK with the one small test I have. If you think this approach makes sense, I'll fix all the TODOs and test/fix more scenarios (fused responses, all data types, cache...).
@kit1980, sorry for going dark on you. Your proposed approach sounds good! |
Note that it should work with AllGather, too (send empty slices from joined nodes). |
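A quick mpi4py illustration of the empty-slice idea for allgather (again just a sketch under the assumption that joined ranks contribute zero-length slices; not code from this PR):

```python
# Run with: mpirun -np 4 python empty_slice_allgather.py
# Joined ranks contribute an empty slice; the gathered result simply omits them.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

joined = rank == 0                       # pretend rank 0 already joined
local = np.empty(0, dtype=np.float32) if joined \
    else np.arange(3, dtype=np.float32) + rank

# Object-based allgather handles unequal (including zero) lengths for us.
pieces = comm.allgather(local)
result = np.concatenate(pieces)
print(f"rank {rank}: gathered {result}")
```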
Force-pushed from b131034 to 7374086.
Force-pushed from 084f1c1 to 530e8a9.
Join with AllReduce now passes tests for all data types for MPI and NCCL.
I have a simple question: how different is this from the following?

```python
from mpi4py import MPI
MPI.COMM_WORLD.Barrier()  # Waiting for all MPI processes to sync
```

Looks like an identical objective to me.
@DEKHTIARJonathan, this is for the case when data is unevenly distributed between workers, so different workers do a different number of steps. An MPI barrier can't help in this case: the workers that still have data will wait forever in the allreduce.
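For context, the intended usage looks roughly like the hypothetical sketch below: workers get shards of different sizes, keep allreducing while they still have data, and call hvd.join() when they run out, so the remaining workers' allreduces can complete instead of hanging the way they would behind a plain barrier.

```python
# Hypothetical sketch of the join() use case with unevenly sized shards.
import horovod.torch as hvd
import torch

hvd.init()
# Each worker gets a different number of batches, e.g. rank 0 has fewer.
num_batches = 5 if hvd.rank() == 0 else 8

for _ in range(num_batches):
    grad = torch.ones(10)
    # Workers that still have data keep allreducing; a plain MPI barrier on the
    # finished worker would leave these calls waiting forever.
    hvd.allreduce(grad)

# Finished workers "join" so the remaining allreduces can complete without them.
hvd.join()
```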
Oh, I see. Thanks for the pointer. I've never had this use case.
@alsrgv Could you take a look at this?
Left a few comments to help me understand the PR better. Could you rebase on the latest master to resolve conflicts?
Force-pushed from 0b4c542 to d762c2a.
Rebased on master (which was painful because of recent large code moves with simultaneous logic and formatting changes).
Rebased on master again. Please review.
Looks great, just a few minor changes left.
```
@@ -211,6 +214,8 @@ class OpContext {
                                          std::shared_ptr<PersistentBuffer>* tensor) = 0;
  virtual Status AllocateOutput(TensorShape shape,
                                std::shared_ptr<Tensor>* tensor) = 0;
  virtual Status AllocateZeros(int64_t num_elements, DataType dtype,
```
Can you add stub implementations of this method to the other OpContext children, including TFOpContext in tensorflow/mpi_ops.cc and MXOpContext in mxnet/adapter.{h,cc}? These implementations should raise an error similar to the one for PyTorch <1.0.
Done for TF and MXNet.
@kit1980 looks like TensorFlow just switched nightly over to 2.0. I'll make a quick update to our test script and then your tests should pass.
Thanks, I was trying to understand what's going on - the previous nightly build passed.
All the tests passed.
LGTM! Nice job!
Hi @huoliquankai7,
Hey @huoliquankai7, any update on the TensorFlow PR you mentioned? Now that PyTorch has landed, would be great to get it for TF as well!
Hey @kit1980, what is your plan for adding Allgather, Broadcast, TensorFlow, and MXNet support for this operation? No worries if there are no immediate plans, we can add some of this work to the backlog, but just wanted to make sure we don't duplicate our efforts.
Hi,
Hey @kit1980, any update on the cache issue?
Thanks for the update, @kit1980. When the PR is ready, we should also get @romerojosh to take a look, as he's the creator of the bit cache. |
This is a very early work-in-progress PR to add hvd.join() as described in #832.
It is mostly boilerplate code; it compiles, and I can call hvd.join() from the PyTorch test (the Join op goes to the queue but does nothing).
The main purpose of this PR is for me to understand whether I'm going in the right direction.
Some other questions I have (in no particular order):