Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AMD GPU XLA Op Implementation #3486

Merged
merged 11 commits into from May 26, 2022
Merged

Conversation

weihanmines
Copy link
Contributor

@weihanmines weihanmines commented Mar 21, 2022

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

The current HIP implementation is totally independent CUDA counterpart. In order not to touch on the CUDA files, I separate ROCM code completely from CUDA file. I only do addition (no subtraction at all) to the CMakeLists.txt. Such implementation will leave CUDA compilation path untouched and therefore introduce duplicate code which is the same as CUDA code. We are doing so right now just to avoid anything which might break upstream CI with respect to the CUDA implementation. After this PR is accepted, we will create a few PRs to remove duplicate code.

Fixes # (issue).

Implement HIP kernels which are completely separate from CUDA kernels.

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

@weihanmines
Copy link
Contributor Author

Hi @maxhgerlach could you please help me to move this PR forward? Thank you.

@weihanmines
Copy link
Contributor Author

weihanmines commented Apr 4, 2022

can anybody help to kick off the CI tests?

@maxhgerlach
Copy link
Collaborator

Hi @weihanmines, sorry for the long delay! Keeping the code paths separated initially to make it easy to ensure they build with either ROCM or CUDA sounds like a good plan.

I think Horovod's full CI pipeline is blocked by the DCO step. To kick it off you would need to sign off your commits by rebasing as described here: https://github.com/horovod/horovod/pull/3486/checks?check_run_id=5636286517

@weihanmines
Copy link
Contributor Author

Hi @weihanmines, sorry for the long delay! Keeping the code paths separated initially to make it easy to ensure they build with either ROCM or CUDA sounds like a good plan.

I think Horovod's full CI pipeline is blocked by the DCO step. To kick it off you would need to sign off your commits by rebasing as described here: https://github.com/horovod/horovod/pull/3486/checks?check_run_id=5636286517

Hi Max @maxhgerlach, thank you for replying my messages. I have followed your suggestion in the previous message. Let me do a relatively small change at a time to achieve final goal (unifying GPU interfaces). Thanks again.

@github-actions
Copy link

github-actions bot commented Apr 25, 2022

Unit Test Results

     803 files   -   18       803 suites   - 18   9h 6m 50s ⏱️ - 35m 35s
     776 tests +    8       725 ✔️ ±    0       43 💤 ±    0  0 ±0    8 🔥 +  8 
18 306 runs   - 644  13 114 ✔️  - 541  5 176 💤  - 119  0 ±0  16 🔥 +16 

For more details on these errors, see this check.

Results for commit ba0c354. ± Comparison against base commit 7707267.

♻️ This comment has been updated with latest results.

@github-actions
Copy link

github-actions bot commented Apr 25, 2022

Unit Test Results (with flaky tests)

     961 files  +  19       961 suites  +19   9h 39m 39s ⏱️ - 34m 36s
     776 tests +    8       724 ✔️ ±    0       43 💤 ±    0  1 ±0    8 🔥 +  8 
21 389 runs   - 659  15 077 ✔️  - 568  6 263 💤  - 139  1 ±0  48 🔥 +48 

For more details on these failures and errors, see this check.

Results for commit ba0c354. ± Comparison against base commit 7707267.

♻️ This comment has been updated with latest results.

@maxhgerlach
Copy link
Collaborator

Hi @weihanmines,
thanks for signing off the commits! The CI looks good: no build or test failures on any CUDA system. 🎉

Just to clarify: As you are planning to continue work on this PR, is it going to fully supersede #3310? (that one was facing some build issues)

In any case, please let us know when you feel that this PR is ready to be reviewed.

@weihanmines
Copy link
Contributor Author

Hi @weihanmines, sorry for the long delay! Keeping the code paths separated initially to make it easy to ensure they build with either ROCM or CUDA sounds like a good plan.

I think Horovod's full CI pipeline is blocked by the DCO step. To kick it off you would need to sign off your commits by rebasing as described here: https://github.com/horovod/horovod/pull/3486/checks?check_run_id=5636286517

Hi Max @maxhgerlach, I need to fix something in the current PR. Once I get it fixed, I will let you know as soon as possible so that you could help to review the changes. Thank you.

weihanmines and others added 4 commits April 27, 2022 05:37
Signed-off-by: Wei Han <wei.han3@amd.com>
Signed-off-by: weihanmines <wei.han3@amd.com>
Signed-off-by: Wei Han <wei.han3@amd.com>
Signed-off-by: weihanmines <wei.han3@amd.com>
Signed-off-by: Wei Han <wei.han3@amd.com>
Signed-off-by: weihanmines <wei.han3@amd.com>
Signed-off-by: weihanmines <wei.han3@amd.com>
@weihanmines
Copy link
Contributor Author

Hi @weihanmines, thanks for signing off the commits! The CI looks good: no build or test failures on any CUDA system. 🎉

Just to clarify: As you are planning to continue work on this PR, is it going to fully supersede #3310? (that one was facing some build issues)

In any case, please let us know when you feel that this PR is ready to be reviewed.

Hi Max @maxhgerlach could you please help me to start reviewing this PR? Once this PR is merged, horovod would be available for ROCM users. I will file different PRs in the future to remove redundancy and reach the goal of unifying GPU interface. Thank you for your help.

@maxhgerlach
Copy link
Collaborator

Hi Max @maxhgerlach could you please help me to start reviewing this PR? Once this PR is merged, horovod would be available for ROCM users. I will file different PRs in the future to remove redundancy and reach the goal of unifying GPU interface. Thank you for your help.

Thanks @weihanmines! I'm reserving some time next week to review this.

Copy link
Collaborator

@maxhgerlach maxhgerlach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @weihanmines,

the changes look mostly fine to me. At this point I think it would already be good to avoid the duplication of code inside gpu_operations.{cc,h}. Please see the more detailed comments that I've left for the individual files. More generalization of the two code paths (the APIs are mostly identical except for cuda vs hip or rocm name particles) would in deed be a promising road to go in future PRs.

On a more general note: Do you think it would be feasible to integrate some CI for AMD GPUs? Currently, Horovod's CI system only runs builds and tests on systems with NVIDIA GPUs so it's very easy to accidentally break compatibility with AMD GPUs. It wouldn't have to be much, just one build configuration and a few tests would already be very helpful (maybe similar in scope to the ppc64le-checks that are in place now).

horovod/common/common.h Show resolved Hide resolved
horovod/common/ops/gpu_operations.h Outdated Show resolved Hide resolved
horovod/common/ops/rocm/CMakeLists.txt Show resolved Hide resolved
horovod/common/ops/rocm/hip_kernels.cu Show resolved Hide resolved
horovod/common/ops/gpu_operations.cc Outdated Show resolved Hide resolved
horovod/common/ops/hip_operations.cc Show resolved Hide resolved
horovod/tensorflow/xla_mpi_ops.cc Outdated Show resolved Hide resolved
@weihanmines
Copy link
Contributor Author

Hey @weihanmines,

the changes look mostly fine to me. At this point I think it would already be good to avoid the duplication of code inside gpu_operations.{cc,h}. Please see the more detailed comments that I've left for the individual files. More generalization of the two code paths (the APIs are mostly identical except for cuda vs hip or rocm name particles) would in deed be a promising road to go in future PRs.

Hi Max @maxhgerlach, yes, thank you for your feedback. The purpose of the PR #3310 is to unify the CUDA and HIP interface so that we don't have code repetitions. Unfortunately, I do not know how fix the CI issues in that PR. I will have follow-up PRs to achieve the same goal. I think we will get there. Thanks again for your help.

On a more general note: Do you think it would be feasible to integrate some CI for AMD GPUs? Currently, Horovod's CI system only runs builds and tests on systems with NVIDIA GPUs so it's very easy to accidentally break compatibility with AMD GPUs. It wouldn't have to be much, just one build configuration and a few tests would already be very helpful (maybe similar in scope to the ppc64le-checks that are in place now).

May I ask if the upstream has AMD GPUs available for CI?

@maxhgerlach
Copy link
Collaborator

May I ask if the upstream has AMD GPUs available for CI?

I'm not sure if we would be able to set up a multi-GPU system with AMD GPUs on Buildkite here. An alternative might also be to integrate some CI job running on an appropriate external system if there is anything like that available on your end.

A good first step might also just be a job that builds Horovod in a container meant to be run on ROCM systems even if we don't actually run any actual tests there initially. That would catch many build time regressions at least.

@weihanmines
Copy link
Contributor Author

May I ask if the upstream has AMD GPUs available for CI?

I'm not sure if we would be able to set up a multi-GPU system with AMD GPUs on Buildkite here. An alternative might also be to integrate some CI job running on an appropriate external system if there is anything like that available on your end.

A good first step might also just be a job that builds Horovod in a container meant to be run on ROCM systems even if we don't actually run any actual tests there initially. That would catch many build time regressions at least.

Let me see what I can do to help here. Thank you for your suggestions.

Signed-off-by: weihanmines <wei.han3@amd.com>
Signed-off-by: weihanmines <wei.han3@amd.com>
@weihanmines
Copy link
Contributor Author

weihanmines commented May 10, 2022

Hi Max @maxhgerlach, I removed duplications in xla_mpi_ops.cc file in the latest commit. Could you please help to review my updated changes? Thank you.

@EnricoMi
Copy link
Collaborator

@weihanmines any updates on how we could test continuously on a ROCM system?

Copy link
Collaborator

@EnricoMi EnricoMi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments

horovod/common/ops/rocm/CMakeLists.txt Outdated Show resolved Hide resolved
horovod/common/ops/hip_operations.cc Outdated Show resolved Hide resolved
horovod/common/ops/rocm/hip_kernels.cu Show resolved Hide resolved
weihanmines and others added 2 commits May 16, 2022 18:34
Signed-off-by: weihanmines <wei.han3@amd.com>
Co-authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: weihanmines <wei.han3@amd.com>
Signed-off-by: weihanmines <wei.han3@amd.com>
Signed-off-by: weihanmines <wei.han3@amd.com>
@weihanmines
Copy link
Contributor Author

Hi Enrico @EnricoMi Is it possible for us to have the upstream tokens for the pull requests and pushes?

@weihanmines
Copy link
Contributor Author

@weihanmines any updates on how we could test continuously on a ROCM system?

Hi Enrico @EnricoMi Is it possible for us to have the upstream tokens for the pull requests and pushes?

We need the token for the external CI. Thank you.

@EnricoMi
Copy link
Collaborator

Can you please explain what kind of token and how that would be used to integrate the external CI?

@weihanmines
Copy link
Contributor Author

Can you please explain what kind of token and how that would be used to integrate the external CI?

We need help in configuring our Jenkins-ci in Horvod repo and use the webhook secret
to let GitHub-webhooks trigger
the push event and pull requests. Thank you.

@EnricoMi
Copy link
Collaborator

As I understand GitHub Webhooks: Whenever there is a push or pull-request event on our repository, GitHub can call the Jenkins webhook to trigger the CI on your side. For that, Horovod needs a secret from Jenkins to authenticate to Jenkins. You can generate the secret. Please send it to my e-mail address if you are comfortable with that.

I am referring to this:
https://docs.github.com/en/developers/webhooks-and-events/webhooks/about-webhooks
https://docs.github.com/en/developers/webhooks-and-events/webhooks/creating-webhooks
https://docs.github.com/en/developers/webhooks-and-events/webhooks/securing-your-webhooks

If this is not what you are trying to setup, please be more specific on what you need. Maybe a link to documentation or tutorials outlining the full picture of what you are setting up and which steps you need us to action on.

@weihanmines
Copy link
Contributor Author

As I understand GitHub Webhooks: Whenever there is a push or pull-request event on our repository, GitHub can call the Jenkins webhook to trigger the CI on your side. For that, Horovod needs a secret from Jenkins to authenticate to Jenkins. You can generate the secret. Please send it to my e-mail address if you are comfortable with that.

I am referring to this: https://docs.github.com/en/developers/webhooks-and-events/webhooks/about-webhooks https://docs.github.com/en/developers/webhooks-and-events/webhooks/creating-webhooks https://docs.github.com/en/developers/webhooks-and-events/webhooks/securing-your-webhooks

If this is not what you are trying to setup, please be more specific on what you need. Maybe a link to documentation or tutorials outlining the full picture of what you are setting up and which steps you need us to action on.

Hi Enrico,
I am not familiar with DevOps. My colleague tells me that you only need the payload URL. For your convenience, I attach the image he sent to me. Please free to contact me if you have any question. Thanks.
MicrosoftTeams-image
.

@EnricoMi
Copy link
Collaborator

Webhook created, initial ping request sent out successfully. I have enabled this webhook for push and pull request events.

@weihanmines
Copy link
Contributor Author

Webhook created, initial ping request sent out successfully. I have enabled this webhook for push and pull request events.

Please let me know if there is anything else I could do to get this PR merged. Thank you.

@EnricoMi
Copy link
Collaborator

@maxhgerlach what do you think? Any objections?

Copy link
Collaborator

@maxhgerlach maxhgerlach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @weihanmines,

sorry for the radio silence, I was off the grid for a bit, but am back from vacation now.

Thank you very much for your updates! I appreciate reducing the code duplication in gpu_operations.{cc,h} and xla_mpi_ops.cc, too. The new changes all look good to me. I've only left two minor comments. Addressing them would not be critical, I think.

Also thanks for chiming in on the review, @EnricoMi!

Great that we could get started on integrating some CI with AMD GPUs. 👍

horovod/tensorflow/xla_mpi_ops.cc Outdated Show resolved Hide resolved
horovod/tensorflow/xla_mpi_ops.cc Show resolved Hide resolved
Signed-off-by: weihanmines <wei.han3@amd.com>
Copy link
Collaborator

@EnricoMi EnricoMi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@EnricoMi
Copy link
Collaborator

@weihanmines when can I expect Horovod to appear in http://ml-ci.amd.com:21096/?

@weihanmines
Copy link
Contributor Author

@weihanmines when can I expect Horovod to appear in http://ml-ci.amd.com:21096/?

I will talk to my colleague to figure it out. I will let you know as soon as I hear anything from him.

@weihanmines
Copy link
Contributor Author

@maxhgerlach @EnricoMi It is strange that four tests fail in the latest commit, while they pass in the commit 4f5a5db. These two commits are identical. Could you please help me to rerun the test again? Thank you.

@maxhgerlach
Copy link
Collaborator

Hi @weihanmines,

as far as I can tell, those new test failures are restricted to builds with current nightly versions of frameworks. So it's out of your control and should not block the merge.

On master there are some failures with head versions, too: https://github.com/horovod/horovod/runs/6591350696?check_suite_focus=true So that's a separate issue to be investigated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants