Initial support for ROCm: rename CUDA to GPU et al, pimpl for GPU details #1632
Conversation
Force-pushed from ec5614e to 72d0c4a
It's not clear to me whether the current test failures are related to the changes in this PR.
Hey @jeffdaily, thanks for the new PR. Overall, I like the approach to abstracting GPU operations in this way. One thing I noticed is that it looks like the DDL codepaths haven't been updated to use the new abstraction. There was also a PR landed yesterday that added NCCL broadcast support, which will require a rebase.

For the unit tests, it looks like at least one test was failing because the Horovod MXNet plugin failed to compile, so we may need to dig into why that is. I'm rerunning some of the other tests to see if it was a transient error.

I'll try and put together inline feedback as well. I think it would be good to have @nvcastet and @romerojosh take a look at this PR if they have the bandwidth.
Rebased to pick up NCCL broadcast change. Also updated the DDL sources, though I did not test them.
This PR looks good to me.
@tgaddair are the last two test failures transient or do they indicate a real issue? Also, this PR currently does not integrate any kind of CI build for ROCm yet. What are your acceptance criteria?
@jeffdaily one of the errors appears to be real. The MXNet test for mixed install is consistently failing because the Horovod MXNet plugin did not compile successfully. Once we figure that out, I think we should be safe to land.
@tgaddair I tried building the docker container for the mixed install so I could see how horovod is failing to build the mxnet extension, and I'm seeing the following error:
@tgaddair disregard previous comment concerning the dockerfile build. I forgot to fetch submodules prior to attempting the dockerfile build. I was able to reproduce the mxnet build failure. Fix will be ready shortly.
@tgaddair fixed the MXNet test for mixed install. Other errors have popped up now in tests that were passing in the previous attempt.
Hey @jeffdaily, looks like test failures may be transient. Rerunning them now.
Hey @jeffdaily, took a closer look over the PR and discussed it with some other Horovod contributors. Overall, I think the abstraction layer is good.

The one thing I think we should still address is the possibility for the ROCm/RCCL and CUDA/NCCL APIs to diverge in the future. With the pimpl pattern, we're locked into the API exposed by the GPUContext. In this design, how would a contributor add functionality specific to one of these frameworks but not the other?

Also, are you familiar with / can you comment on the current state of support for GPU MPI with AMD hardware?
I am rather at a loss to suggest how to address a future ROCm/RCCL and CUDA/NCCL API divergence. With respect to HIP (ROCm) versus CUDA APIs, parity is intended by design. I am not privy to the CUDA development roadmap, so I can't predict new or deprecated APIs. Are there any plans for Horovod to use less common CUDA APIs?

Concerning RCCL versus NCCL, again parity is by design in order to make it easier to support AMD hardware through the same interface. I think if there ever exists some CUDA or NCCL feature that isn't reflected by HIP or RCCL, we have the option of separating implementation details within the same source file using preprocessor guards.

GPUDirect, or ROCmRDMA, is very well supported with AMD hardware.
Rebased again to resolve conflicts from d4d54bc.
@jeffdaily I'm fine punting on the API divergence thing for now. I think the right long-term solution is to create separate NCCL vs RCCL derived classes with separate CUDA and ROCm implementations of the GPUContext.
@jeffdaily @tgaddair Using derived classes would also cause issues because even if we use the base type throughout the code, we will still need to construct the object with the derived type, and to avoid compilation errors we would have to have a bunch of "ifdef"s, which is not really clean code in modern C++. Were you thinking of another way to do that?
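For illustration, the derived-class alternative under discussion might look roughly like the sketch below. The class names, the factory function, and the `HAVE_CUDA` guard are assumptions made for the example, not the actual Horovod sources; the point is that the concrete backend still has to be selected with at least one preprocessor guard somewhere.

```cpp
// Illustrative sketch of the derived-class alternative (not actual Horovod code).
#include <memory>

class GPUContext {                         // backend-agnostic base used everywhere else
public:
  virtual ~GPUContext() = default;
  virtual void StreamSynchronize() = 0;
};

class CUDAContext : public GPUContext {    // CUDA/NCCL backend
public:
  void StreamSynchronize() override { /* cudaStreamSynchronize(...) */ }
};

class ROCmContext : public GPUContext {    // ROCm/RCCL backend
public:
  void StreamSynchronize() override { /* hipStreamSynchronize(...) */ }
};

// Even with virtual dispatch, the concrete type must be chosen somewhere,
// which is the "ifdef" nvcastet is pointing at.
std::unique_ptr<GPUContext> CreateGPUContext() {
#if HAVE_CUDA
  return std::unique_ptr<GPUContext>(new CUDAContext());
#else
  return std::unique_ptr<GPUContext>(new ROCmContext());
#endif
}
```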
@nvcastet unfortunately, there is not one main design pattern for supporting AMD hardware in open source projects. As one might expect, long-established open source projects tend to resist adding HIP APIs, and certainly resist fully converting their code bases to use HIP alone, even though HIP can compile to CUDA. That said, I've been on the teams to support TensorFlow, PyTorch, and now Horovod -- and each project maintainer has dictated a different approach.

TensorFlow already had a device abstraction layer to differentiate between CPU and GPU devices. The GPU was itself further abstracted as a stream executor. That was all done prior to our involvement, but it made the most sense to integrate our ROCm stack as an implementation of the stream executor abstract interface. Even so, there are plenty of cases where APIs or device features are not the same. For example, CUDNN has either different behavior or a different API interface compared to our MIOpen, and so such code must be protected with preprocessor guards.

PyTorch has mandated that the code base is "hipified" during the build, so a "hipify" python script was developed and maintained as part of the PyTorch code base. There are very few pure-HIP source files; the rest of the PyTorch source looks like any other CUDA-based project. But macros can still be used to protect CUDA-only code and features. We have further been mandated to use this hipify script for building all ROCm PyTorch extensions. Those changes are prepared in a separate branch that will become a new PR if this current PR is accepted.
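As a rough illustration of the hipify step described above, the script rewrites CUDA runtime calls into their HIP equivalents. The snippet below is a generic, hypothetical example rather than code taken from PyTorch or Horovod.

```cpp
// Before hipify: a plain CUDA call site. After hipify, the equivalents noted
// in the comments would be substituted (hip/hip_runtime.h, hip* calls).
#include <cstddef>
#include <cuda_runtime.h>          // after: #include <hip/hip_runtime.h>

void copy_to_host(void* dst, const void* src, std::size_t bytes) {
  cudaStream_t stream;             // after: hipStream_t
  cudaStreamCreate(&stream);       // after: hipStreamCreate(&stream)
  cudaMemcpyAsync(dst, src, bytes,
                  cudaMemcpyDeviceToHost, stream);  // after: hipMemcpyAsync(..., hipMemcpyDeviceToHost, ...)
  cudaStreamSynchronize(stream);   // after: hipStreamSynchronize(stream)
  cudaStreamDestroy(stream);       // after: hipStreamDestroy(stream)
}
```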
@nvcastet Currently we use preprocessor macros to determine which set of ops to use here. Otherwise, the code is isolated into its own classes / files by framework so it's easier for developers to reason about the code paths. I think that if the APIs ever diverged, we would necessarily need to either do something similar, or mix the preprocessor macros into the GPUContext implementation.

So long as the APIs are identical, I agree the pimpl pattern is cleaner. And since there isn't any indication that they will diverge at the moment, I also agree it's the better solution for now.
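A simplified sketch of what selecting ops via preprocessor macros can look like; the `HAVE_NCCL` / `HAVE_DDL` guard names follow Horovod's build options, but the op types here are self-contained stand-ins rather than the real Horovod classes.

```cpp
// Simplified, self-contained sketch of op selection via preprocessor macros.
#include <memory>
#include <vector>

struct AllreduceOp { virtual ~AllreduceOp() = default; };
struct NCCLAllreduce : AllreduceOp {};   // stand-in for the NCCL-backed op
struct DDLAllreduce  : AllreduceOp {};   // stand-in for the DDL-backed op
struct MPIAllreduce  : AllreduceOp {};   // stand-in for the CPU/MPI fallback

std::vector<std::shared_ptr<AllreduceOp>> BuildAllreduceOps() {
  std::vector<std::shared_ptr<AllreduceOp>> ops;
#if HAVE_NCCL
  ops.push_back(std::make_shared<NCCLAllreduce>());  // only when built with NCCL
#endif
#if HAVE_DDL
  ops.push_back(std::make_shared<DDLAllreduce>());   // only when built with DDL
#endif
  ops.push_back(std::make_shared<MPIAllreduce>());   // always available
  return ops;
}
```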
- add ROCm and ROCm TensorFlow support to setup.py
- Renamed symbols and files
  - CUDA -> GPU
  - Cuda -> Gpu
  - cuda -> gpu
  - HAVE_CUDA -> HAVE_GPU --- iff referring generically to any GPU
- CUDAContext becomes GPUContext and uses pimpl idiom to decouple CUDA and HIP implementations of GPU interface.
- New GPUContext methods
  - StreamCreate
  - StreamSynchronize
  - GetDevice, SetDevice
  - MemcpyAsyncD2D, MemcpyAsyncH2D, MemcpyAsyncD2H

Signed-off-by: Jeff Daily <jeff.daily@amd.com>
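A minimal sketch of the pimpl arrangement the commit message describes: the public `GPUContext` header stays free of CUDA or HIP types, and each backend supplies its own `impl`. The method names follow the commit message; the signatures, member names, and file layout are assumptions for illustration only.

```cpp
// Illustrative pimpl sketch only -- not the actual Horovod sources.
#include <cstddef>
#include <memory>

// Public interface (would live in a backend-neutral header).
class GPUContext {
public:
  GPUContext();
  ~GPUContext();

  void SetDevice(int device);
  void StreamSynchronize();
  void MemcpyAsyncD2H(void* dst, const void* src, std::size_t bytes);

private:
  class impl;                    // one impl per backend, chosen at build time
  std::unique_ptr<impl> pimpl_;
};

// A CUDA translation unit would define GPUContext::impl in terms of the CUDA
// runtime; a HIP translation unit would mirror it with hip* calls.
class GPUContext::impl {
public:
  void SetDevice(int device) { /* cudaSetDevice(device) or hipSetDevice(device) */ }
  void StreamSynchronize()   { /* cudaStreamSynchronize(...) or hipStreamSynchronize(...) */ }
  void MemcpyAsyncD2H(void* dst, const void* src, std::size_t bytes) {
    /* cudaMemcpyAsync(...) or hipMemcpyAsync(...) */
  }
};

GPUContext::GPUContext() : pimpl_(new impl()) {}
GPUContext::~GPUContext() = default;
void GPUContext::SetDevice(int device) { pimpl_->SetDevice(device); }
void GPUContext::StreamSynchronize() { pimpl_->StreamSynchronize(); }
void GPUContext::MemcpyAsyncD2H(void* dst, const void* src, std::size_t bytes) {
  pimpl_->MemcpyAsyncD2H(dst, src, bytes);
}
```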
LGTM, thanks @jeffdaily! Will land once tests pass.