Initial support for ROCm: rename CUDA to GPU et al, pimpl for GPU details #1632

Merged (1 commit into horovod:master on Jan 20, 2020)

Conversation

@jeffdaily (Contributor) commented Jan 3, 2020

  • add ROCm and ROCm TensorFlow support to setup.py
  • Renamed symbols and files
    • CUDA -> GPU
    • Cuda -> Gpu
    • cuda -> gpu
    • HAVE_CUDA -> HAVE_GPU --- iff generically referring to any GPU device
  • CUDAContext becomes GPUContext and uses pimpl pattern to decouple
    CUDA and HIP implementations of GPU interface.
  • New GPUContext methods
    • StreamCreate
    • StreamSynchronize
    • GetDevice, SetDevice
    • MemcpyAsyncD2D, MemcpyAsyncH2D, MemcpyAsyncD2H

Signed-off-by: Jeff Daily jeff.daily@amd.com
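
For readers unfamiliar with the pimpl idiom referenced in the description, the sketch below shows roughly how a GPUContext of this shape can keep CUDA/HIP details out of the shared header. It is illustrative only: the method names follow the list above, but the signatures and class layout are assumptions, not the actual horovod/common/ops/gpu_operations.h.

  // gpu_context.h (hypothetical): no CUDA or HIP headers are included here;
  // only the backend-specific .cc file that defines GPUContext::impl sees them.
  #include <cstddef>
  #include <memory>

  class GPUContext {
  public:
    GPUContext();
    ~GPUContext();

    // Device and stream management (parameter types are illustrative).
    int  GetDevice() const;
    void SetDevice(int device) const;
    void StreamCreate(void** stream) const;
    void StreamSynchronize(void* stream) const;

    // Asynchronous copies: device-to-device, host-to-device, device-to-host.
    void MemcpyAsyncD2D(void* dst, const void* src, std::size_t count, void* stream) const;
    void MemcpyAsyncH2D(void* dst, const void* src, std::size_t count, void* stream) const;
    void MemcpyAsyncD2H(void* dst, const void* src, std::size_t count, void* stream) const;

  private:
    class impl;                   // defined once per backend (CUDA or HIP)
    std::unique_ptr<impl> pimpl;  // opaque pointer: the "pimpl" in the pattern
  };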

@jeffdaily (Contributor, Author) commented:

It's not clear to me whether the current test failures are related to the changes in this PR.

@tgaddair (Collaborator) commented Jan 8, 2020

Hey @jeffdaily, thanks for the new PR. Overall, I like the approach to abstracting GPU operations in this way.

One thing I noticed is it looks like the DDL codepaths haven't been updated to use the new gpu_operations.h. Those aren't currently under unit test, but the changes should be fairly straightforward.

A PR that added NCCL broadcast support also landed yesterday, so this will require a rebase.

For the unit tests, it looks like at least one test was failing because the Horovod MXNet plugin failed to compile, so we may need to dig into why that is. I'm rerunning some of the other tests to see if it was a transient error.

I'll try to put together inline feedback as well. I think it would be good to have @nvcastet and @romerojosh take a look at this PR if they have the bandwidth.

@jeffdaily (Contributor, Author) commented:

Rebased to pick up NCCL broadcast change. Also updated the DDL sources, though I did not test them.

@nvcastet (Collaborator) left a comment:

This PR looks good to me.

Review comment on horovod/common/ops/gpu_operations.h (resolved)
@jeffdaily (Contributor, Author) commented:

@tgaddair are the last two test failures transient, or do they indicate a real issue? Also, this PR does not yet integrate any CI build for ROCm. What are your acceptance criteria?

@tgaddair (Collaborator) commented:

@jeffdaily one of the errors appears to be real:
https://buildkite.com/horovod/horovod/builds/1844#c24a6237-3f92-4cc5-9310-bcf714008444

The MXNet test for mixed install is consistently failing because the Horovod MXNet plugin did not compile successfully.

Once we figure that out, I think we should be safe to land.

@jeffdaily (Contributor, Author) commented:

@tgaddair I tried building the docker container for the mixed install so I could see how horovod is failing to build the mxnet extension, and I'm seeing the following error:

  CMake Error: The source directory "/tmp/pip-req-build-7pquo6gf/third_party/gloo" does not exist.
  Specify --help for usage, or press the help button on the CMake GUI.
  INFO: Unable to build MXNet plugin, will skip it.
...
  error: None of TensorFlow, PyTorch, or MXNet plugins were built. See errors above.
  Building wheel for horovod (setup.py): finished with status 'error'
  ERROR: Failed building wheel for horovod

@jeffdaily (Contributor, Author) commented:

@tgaddair disregard my previous comment concerning the dockerfile build; I forgot to fetch the submodules before attempting it.

I was able to reproduce the mxnet build failure. Fix will be ready shortly.

@jeffdaily (Contributor, Author) commented:

@tgaddair fixed the MXNet test for the mixed install. Other errors have now popped up in tests that were passing in the previous attempt.

@tgaddair (Collaborator) commented:

Hey @jeffdaily, looks like test failures may be transient. Rerunning them now.

@tgaddair (Collaborator) commented Jan 15, 2020

Hey @jeffdaily, took a closer look over the PR and discussed it with some other Horovod contributors. Overall, I think the abstraction layer is good.

The one thing I think we should still address is the possibility of the ROCm/RCCL and CUDA/NCCL APIs diverging in the future. With the pimpl pattern, we're locked into the API exposed by the GPUContext. In this design, how would a contributor add functionality specific to one of these frameworks but not the other?

Also, can you comment on the current state of GPU MPI support with AMD hardware?

@jeffdaily (Contributor, Author) commented:

I am rather at a loss to suggest how to address a future ROCm/RCCL and CUDA/NCCL API divergence. With respect to HIP (ROCm) versus CUDA APIs, parity is intended by design. I am not privy to the CUDA development roadmap, so I can't predict new or deprecated APIs. Are there any plans for Horovod to use less common CUDA APIs?

Concerning RCCL versus NCCL, again parity is by design in order to make it easier to support AMD hardware through the same interface.

I think that if there is ever a CUDA or NCCL feature that isn't reflected by HIP or RCCL, we have the option of separating the implementation details within the same source file using #if HAVE_ROCM, or of using a source file that is compiled for CUDA/NCCL only. It would need to be documented that Horovod feature X is not supported for ROCm builds.

GPUDirect, or ROCmRDMA, is very well supported with AMD hardware.
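
To make the #if HAVE_ROCM option mentioned above concrete, a CUDA-only feature could be fenced off roughly as follows; HAVE_ROCM follows the naming used in this PR, but the function and "feature X" are made up for the example:

  #include <stdexcept>

  // Hypothetical sketch: EnableFeatureX and the feature itself are illustrative.
  void EnableFeatureX() {
  #if HAVE_ROCM
    // Not reflected by HIP/RCCL: document it and fail loudly in ROCm builds.
    throw std::logic_error("Horovod feature X is not supported for ROCm builds.");
  #else
    // The CUDA/NCCL-only path would go here.
  #endif
  }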

@jeffdaily (Contributor, Author) commented:

Rebased again to resolve conflicts from d4d54bc

Review comment on setup.py (outdated, resolved)
@tgaddair (Collaborator) commented:

@jeffdaily I'm fine punting on the API divergence question for now. I think the right long-term solution is to create separate NCCL vs RCCL derived classes with separate CUDA and ROCm implementations of the GPUContext. But perhaps we can cross that bridge when we come to it, or do it in a follow-up PR.
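
For context, one possible shape of the derived-class idea being suggested here, with hypothetical names rather than an existing Horovod design:

  // Hypothetical sketch of "separate NCCL vs RCCL derived classes".
  class GPUAllreduce {
  public:
    virtual ~GPUAllreduce() = default;
    virtual void Execute() = 0;
  };

  class NCCLAllreduce : public GPUAllreduce {
  public:
    void Execute() override { /* drive ncclAllReduce via a CUDA-backed context */ }
  };

  class RCCLAllreduce : public GPUAllreduce {
  public:
    void Execute() override { /* drive the RCCL equivalent via a HIP-backed context */ }
  };

Some caller still has to construct one concrete type or the other, which is where build-time switches would come back into the picture.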

@nvcastet (Collaborator) commented Jan 17, 2020

@jeffdaily @tgaddair Using derived classes would also cause issues: even if we use the base type throughout the code, we would still need to construct the object with the derived type, and to avoid compilation errors we would need a bunch of "ifdef"s, which is not really clean code in modern C++. Were you thinking of another way to do that?
OOP polymorphism works better when everything can be built; in this case, the two APIs will never be used in the same build.
The C++11 version of the pimpl pattern seems to be a good fit.
With any interface/encapsulation design, we will have the possibility of divergence in the implementations. If one implementation does not support a feature, we can always document it and throw a not_implemented exception with a clear error message.
Is anyone thinking of a better pattern? @jeffdaily, is pimpl the main design pattern AMD has been using in open source projects where only a single GPU API existed? I would assume so for C++ code.
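
A minimal sketch of the C++11-style pimpl approach being discussed, assuming the hypothetical gpu_context.h header sketched earlier in this thread; only the chosen backend's source file is compiled into a given build, so no ifdefs leak into callers:

  // gpu_context_hip.cc (hypothetical): the only translation unit that sees HIP
  // headers. A CUDA counterpart would include cuda_runtime.h instead, and the
  // build compiles exactly one of the two.
  #include <hip/hip_runtime.h>
  #include <stdexcept>
  #include "gpu_context.h"

  class GPUContext::impl {
  public:
    void SetDevice(int device) const { Check(hipSetDevice(device)); }
    void StreamSynchronize(void* stream) const {
      Check(hipStreamSynchronize(static_cast<hipStream_t>(stream)));
    }

  private:
    static void Check(hipError_t err) {
      // As suggested above, failing or unsupported calls can surface as an
      // exception with a clear message instead of diverging the interface.
      if (err != hipSuccess) {
        throw std::runtime_error(hipGetErrorString(err));
      }
    }
  };

  GPUContext::GPUContext() : pimpl(new impl()) {}
  GPUContext::~GPUContext() = default;

  void GPUContext::SetDevice(int device) const { pimpl->SetDevice(device); }
  void GPUContext::StreamSynchronize(void* stream) const {
    pimpl->StreamSynchronize(stream);
  }

Only two methods are shown; the remaining GPUContext methods would forward to the impl the same way.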

@jeffdaily (Contributor, Author) commented:

@nvcastet unfortunately, there is not one main design pattern for supporting AMD hardware in open source projects.

As one might expect, long-established open source projects tend to resist adding HIP APIs, and certainly resist fully converting their code bases to use HIP alone, even though HIP can compile to CUDA. That said, I've been on the teams that added support to TensorFlow, PyTorch, and now Horovod, and each project's maintainers have dictated a different approach.

TensorFlow already had a device abstraction layer to differentiate between CPU and GPU devices, and the GPU was itself further abstracted as a stream executor. That was all done prior to our involvement, but it made the most sense to integrate our ROCm stack as an implementation of the stream executor's abstract interface. Even so, there are plenty of cases where APIs or device features are not the same. For example, cuDNN has different behavior or a different API compared to our MIOpen, so such code must be protected with #if TENSORFLOW_USE_ROCM. In summary, TensorFlow uses both this preprocessor macro and separately compiled sources, even though attempts were made to use abstract interfaces for the GPU device and libraries (BLAS, cuDNN, etc.).

PyTorch has mandated that the code base is "hipified" during the build, so a "hipify" Python script was developed and is maintained as part of the PyTorch code base. There are very few pure-HIP source files; the rest of the PyTorch source looks like any other CUDA-based project. But macros can still be used to protect CUDA-only code and features. We have further been mandated to use this hipify script for building all ROCm PyTorch extensions. Those changes are prepared in a separate branch that will become a new PR if this current PR is accepted.

@tgaddair (Collaborator) commented:

@nvcastet Currently we use preprocessor macros to determine which set of ops to use here. Otherwise, the code is isolated into its own classes / files by framework so it's easier for developers to reason about the code paths.

I think that if the APIs ever diverged, we would need to either do something similar or mix the preprocessor macros into the nccl_operations.cc file; the latter, I think, is less ideal from a maintainability standpoint.

So long as the APIs are identical, I agree the pimpl pattern is cleaner. And since there isn't any indication that they will diverge at the moment, I also agree it's the better solution for now.
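
As a rough, self-contained illustration of the macro-based selection described above (the HAVE_NCCL/HAVE_DDL macro names mirror Horovod's build flags, but the classes and factory are made up for the example, not the project's actual registration code):

  #include <memory>
  #include <string>

  struct AllreduceOp {
    virtual ~AllreduceOp() = default;
    virtual std::string Name() const = 0;
  };

  #if HAVE_NCCL
  // Only compiled when the build found NCCL (or RCCL in a ROCm build).
  struct NcclAllreduceOp : AllreduceOp {
    std::string Name() const override { return "NCCL/RCCL allreduce"; }
  };
  #endif

  #if HAVE_DDL
  // Only compiled when the build found DDL.
  struct DdlAllreduceOp : AllreduceOp {
    std::string Name() const override { return "DDL allreduce"; }
  };
  #endif

  // Fallback that is always available.
  struct MpiAllreduceOp : AllreduceOp {
    std::string Name() const override { return "MPI allreduce"; }
  };

  std::unique_ptr<AllreduceOp> CreateAllreduceOp() {
  #if HAVE_NCCL
    return std::unique_ptr<AllreduceOp>(new NcclAllreduceOp());
  #elif HAVE_DDL
    return std::unique_ptr<AllreduceOp>(new DdlAllreduceOp());
  #else
    return std::unique_ptr<AllreduceOp>(new MpiAllreduceOp());
  #endif
  }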

@tgaddair (Collaborator) left a comment:

LGTM, thanks @jeffdaily! Will land once tests pass.

@tgaddair changed the title from "rename CUDA to GPU et al, pimpl for GPU details" to "Initial support for ROCm: rename CUDA to GPU et al, pimpl for GPU details" on Jan 20, 2020
@tgaddair merged commit a385b64 into horovod:master on Jan 20, 2020
maxhgerlach added a commit to maxhgerlach/horovod that referenced this pull request Dec 2, 2021
…tion from PR horovod#1632

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>