Add AMD GPU XLA Op Implementation #3486
Conversation
Hi @maxhgerlach, could you please help me move this PR forward? Thank you.
Can anybody help kick off the CI tests?
Hi @weihanmines, sorry for the long delay! Keeping the code paths separated initially to make it easy to ensure they build with either ROCM or CUDA sounds like a good plan. I think Horovod's full CI pipeline is blocked by the DCO step. To kick it off you would need to sign off your commits by rebasing as described here: https://github.com/horovod/horovod/pull/3486/checks?check_run_id=5636286517
Force-pushed from 159b5b6 to 19e9b48.
Hi Max @maxhgerlach, thank you for replying to my messages. I have followed your suggestion from the previous message. Let me make one relatively small change at a time to reach the final goal (unifying the GPU interfaces). Thanks again.
Unit Test Results: 803 files (-18), 803 suites (-18), 9h 6m 50s ⏱️ (-35m 35s). For more details on these errors, see this check. Results for commit ba0c354. ± Comparison against base commit 7707267. ♻️ This comment has been updated with latest results.
Unit Test Results (with flaky tests): 961 files (+19), 961 suites (+19), 9h 39m 39s ⏱️ (-34m 36s). For more details on these failures and errors, see this check. Results for commit ba0c354. ± Comparison against base commit 7707267. ♻️ This comment has been updated with latest results.
Hi @weihanmines, just to clarify: as you are planning to continue work on this PR, is it going to fully supersede #3310? (That one was facing some build issues.) In any case, please let us know when you feel this PR is ready to be reviewed.
Hi Max @maxhgerlach, I need to fix something in the current PR. Once I have it fixed, I will let you know as soon as possible so that you can help review the changes. Thank you.
Signed-off-by: Wei Han <wei.han3@amd.com> Signed-off-by: weihanmines <wei.han3@amd.com>
Signed-off-by: weihanmines <wei.han3@amd.com>
Force-pushed from 8502607 to 8fa52b7.
Hi Max @maxhgerlach, could you please start reviewing this PR? Once it is merged, Horovod will be available to ROCM users. I will file separate PRs in the future to remove the redundancy and reach the goal of unifying the GPU interface. Thank you for your help.
Thanks @weihanmines! I'm reserving some time next week to review this.
Hey @weihanmines,
the changes look mostly fine to me. At this point I think it would already be good to avoid the duplication of code inside gpu_operations.{cc,h}. Please see the more detailed comments that I've left for the individual files. More generalization of the two code paths (the APIs are mostly identical except for cuda vs. hip or rocm name particles) would indeed be a promising road to go down in future PRs.
On a more general note: Do you think it would be feasible to integrate some CI for AMD GPUs? Currently, Horovod's CI system only runs builds and tests on systems with NVIDIA GPUs so it's very easy to accidentally break compatibility with AMD GPUs. It wouldn't have to be much, just one build configuration and a few tests would already be very helpful (maybe similar in scope to the ppc64le-checks that are in place now).
Hi Max @maxhgerlach, yes, thank you for your feedback. The purpose of PR #3310 was to unify the CUDA and HIP interfaces so that we don't have code repetition. Unfortunately, I do not know how to fix the CI issues in that PR. I will open follow-up PRs to achieve the same goal. I think we will get there. Thanks again for your help.
May I ask if the upstream has AMD GPUs available for CI?
I'm not sure if we would be able to set up a multi-GPU system with AMD GPUs on Buildkite here. An alternative might be to integrate a CI job running on an appropriate external system, if anything like that is available on your end. A good first step might also just be a job that builds Horovod in a container meant to be run on ROCM systems, even if we don't actually run any tests there initially. That would at least catch many build-time regressions.
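A build-only check along those lines could be sketched as a Dockerfile. This is only an illustration: the base image tag and the `HOROVOD_*` build flags below are assumptions that would need to be checked against the actual ROCM setup and Horovod's install documentation.

```dockerfile
# Hypothetical build-only CI job: compile Horovod inside a ROCM dev
# container so build-time regressions are caught without needing a GPU.
FROM rocm/dev-ubuntu-20.04
RUN apt-get update && apt-get install -y --no-install-recommends \
        git cmake build-essential python3-pip && \
    rm -rf /var/lib/apt/lists/*
# Flag names are illustrative; HOROVOD_GPU selects the GPU framework.
ENV HOROVOD_GPU=ROCM HOROVOD_GPU_OPERATIONS=NCCL
RUN pip3 install --no-cache-dir --verbose \
    git+https://github.com/horovod/horovod.git
```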
Let me see what I can do to help here. Thank you for your suggestions.
Signed-off-by: weihanmines <wei.han3@amd.com>
Hi Max @maxhgerlach, I removed the duplication in the xla_mpi_ops.cc file in the latest commit. Could you please review my updated changes? Thank you.
@weihanmines any updates on how we could test continuously on a ROCM system?
minor comments
Signed-off-by: weihanmines <wei.han3@amd.com>
Co-authored-by: Enrico Minack <github@enrico.minack.dev> Signed-off-by: weihanmines <wei.han3@amd.com>
Signed-off-by: weihanmines <wei.han3@amd.com>
Hi Enrico @EnricoMi, is it possible for us to have the upstream tokens for pull requests and pushes?
We need the token for the external CI. Thank you.
Can you please explain what kind of token you mean and how it would be used to integrate the external CI?
We need help configuring our Jenkins CI in the Horovod repo and using the webhook secret.
As I understand GitHub Webhooks: whenever there is a […] I am referring to this: […] If this is not what you are trying to set up, please be more specific about what you need. Maybe a link to documentation or tutorials outlining the full picture of what you are setting up and which steps you need us to action.
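For context on how a webhook secret is typically used: GitHub signs each webhook delivery with an HMAC-SHA256 of the raw payload under that secret and sends it in the `X-Hub-Signature-256` header; the receiving CI endpoint recomputes the digest and compares. A minimal sketch (the secret and payload values are made up):

```python
import hashlib
import hmac

def verify_signature(secret: bytes, payload: bytes, signature_header: str) -> bool:
    """Check a GitHub X-Hub-Signature-256 header against the raw payload."""
    expected = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid timing side channels.
    return hmac.compare_digest(expected, signature_header)

secret = b"webhook-secret"          # shared between GitHub and the CI server
payload = b'{"action": "opened"}'   # raw request body of the delivery
header = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
print(verify_signature(secret, payload, header))                # True
print(verify_signature(secret, payload, "sha256=" + "0" * 64))  # False
```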
Hi Enrico,
Webhook created, and the initial ping request was sent out successfully. I have enabled this webhook for push and pull request events.
Please let me know if there is anything else I could do to get this PR merged. Thank you.
@maxhgerlach what do you think? Any objections?
Hi @weihanmines,
sorry for the radio silence, I was off the grid for a bit, but am back from vacation now.
Thank you very much for your updates! I appreciate reducing the code duplication in gpu_operations.{cc,h} and xla_mpi_ops.cc, too. The new changes all look good to me. I've only left two minor comments. Addressing them would not be critical, I think.
Also thanks for chiming in on the review, @EnricoMi!
Great that we could get started on integrating some CI with AMD GPUs. 👍
Signed-off-by: weihanmines <wei.han3@amd.com>
LGTM!
@weihanmines when can I expect Horovod to appear in http://ml-ci.amd.com:21096/?
I will talk to my colleague to figure it out and will let you know as soon as I hear anything from him.
@maxhgerlach @EnricoMi It is strange that four tests fail in the latest commit while they pass in commit 4f5a5db, even though the two commits are identical. Could you please help me rerun the tests? Thank you.
Hi @weihanmines, as far as I can tell, those new test failures are restricted to builds with current nightly versions of the frameworks. So they're out of your control and should not block the merge. On master there are some failures with head versions, too: https://github.com/horovod/horovod/runs/6591350696?check_suite_focus=true So that's a separate issue to be investigated.
Checklist before submitting
Description
The current HIP implementation is totally independent of its CUDA counterpart. In order not to touch the CUDA files, I have separated the ROCM code completely from the CUDA code. I only make additions (no removals at all) to CMakeLists.txt. This implementation leaves the CUDA compilation path untouched and therefore introduces duplicate code that mirrors the CUDA code. We are doing this for now to avoid anything that might break the upstream CI with respect to the CUDA implementation. After this PR is accepted, we will create a few follow-up PRs to remove the duplicate code.
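The addition-only CMakeLists.txt approach described above could look roughly like the following sketch; the option and file names are illustrative, not the PR's actual diff:

```cmake
# Hypothetical sketch of an additive source selection: the CUDA branch
# is left untouched and a parallel ROCM/HIP branch is appended.
if(HAVE_ROCM)
  list(APPEND SOURCES horovod/common/ops/hip_operations.cc)
elseif(HAVE_CUDA)
  list(APPEND SOURCES horovod/common/ops/cuda_operations.cc)
endif()
```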
Fixes # (issue).
Implement HIP kernels which are completely separate from CUDA kernels.
Review process to land