Add AMD GPU XLA Op Implementation #3486

weihanmines · 2022-03-21T23:34:56Z

Checklist before submitting

Did you read the contributor guide?
Did you update the docs?
Did you write any tests to validate this change?
Did you update the CHANGELOG, if this change affects users?

Description

The current HIP implementation is completely independent of its CUDA counterpart. To avoid touching the CUDA files, I have separated the ROCM code completely from the CUDA files, and I only add to CMakeLists.txt (nothing is removed). This leaves the CUDA compilation path untouched, at the cost of introducing code that duplicates the CUDA code. We are doing this for now just to avoid anything that might break upstream CI for the CUDA implementation. After this PR is accepted, we will create a few PRs to remove the duplicate code.

Fixes # (issue).

Implement HIP kernels which are completely separate from CUDA kernels.

Review process to land

All tests and other checks must succeed.
At least one member of the technical steering committee must review and approve.
If any member of the technical steering committee requests changes, they must be addressed.

weihanmines · 2022-04-01T03:16:51Z

Hi @maxhgerlach could you please help me to move this PR forward? Thank you.

weihanmines · 2022-04-04T18:05:47Z

Can anybody help kick off the CI tests?

maxhgerlach · 2022-04-22T21:14:34Z

Hi @weihanmines, sorry for the long delay! Keeping the code paths separated initially to make it easy to ensure they build with either ROCM or CUDA sounds like a good plan.

I think Horovod's full CI pipeline is blocked by the DCO step. To kick it off you would need to sign off your commits by rebasing as described here: https://github.com/horovod/horovod/pull/3486/checks?check_run_id=5636286517

weihanmines · 2022-04-25T03:07:24Z

Hi Max @maxhgerlach, thank you for replying to my messages. I have followed your suggestion in the previous message. Let me make one relatively small change at a time to reach the final goal (unifying the GPU interfaces). Thanks again.

github-actions · 2022-04-25T15:24:08Z

Unit Test Results

    803 files -   18     803 suites - 18 9h 6m 50s ⏱️ - 35m 35s
    776 tests +    8     725 ✔️ ±    0     43 💤 ±    0 0 ❌ ±0   8 🔥 +  8
18 306 runs - 644 13 114 ✔️ - 541 5 176 💤 - 119 0 ❌ ±0 16 🔥 +16

For more details on these errors, see this check.

Results for commit ba0c354. ± Comparison against base commit 7707267.

♻️ This comment has been updated with latest results.

github-actions · 2022-04-25T15:24:25Z

Unit Test Results (with flaky tests)

    961 files +  19     961 suites +19 9h 39m 39s ⏱️ - 34m 36s
    776 tests +    8     724 ✔️ ±    0     43 💤 ±    0 1 ❌ ±0   8 🔥 +  8
21 389 runs - 659 15 077 ✔️ - 568 6 263 💤 - 139 1 ❌ ±0 48 🔥 +48

For more details on these failures and errors, see this check.

Results for commit ba0c354. ± Comparison against base commit 7707267.

♻️ This comment has been updated with latest results.

maxhgerlach · 2022-04-25T16:25:00Z

Hi @weihanmines,
thanks for signing off the commits! The CI looks good: no build or test failures on any CUDA system. 🎉

Just to clarify: As you are planning to continue work on this PR, is it going to fully supersede #3310? (that one was facing some build issues)

In any case, please let us know when you feel that this PR is ready to be reviewed.

weihanmines · 2022-04-26T05:02:33Z

Hi Max @maxhgerlach, I need to fix something in the current PR. Once I get it fixed, I will let you know as soon as possible so that you could help to review the changes. Thank you.

weihanmines · 2022-04-27T05:45:44Z

Hi Max @maxhgerlach, could you please start reviewing this PR? Once this PR is merged, Horovod will be available to ROCM users. I will file separate PRs in the future to remove the redundancy and reach the goal of unifying the GPU interface. Thank you for your help.

maxhgerlach · 2022-04-29T15:20:00Z

Thanks @weihanmines! I'm reserving some time next week to review this.

maxhgerlach

Hey @weihanmines,

the changes look mostly fine to me. At this point I think it would already be good to avoid the duplication of code inside gpu_operations.{cc,h}. Please see the more detailed comments that I've left for the individual files. More generalization of the two code paths (the APIs are mostly identical except for cuda vs hip or rocm name particles) would indeed be a promising road to take in future PRs.
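As a rough illustration of the generalization mentioned above, one option is to route both backends through a single alias so that shared code never spells out `cuda` or `hip`. The sketch below is not Horovod's code. The `fake_cuda` and `fake_hip` namespaces are hypothetical stand-ins for the real runtime headers (the corresponding real calls would be `cudaStreamSynchronize` and `hipStreamSynchronize`), and `SKETCH_USE_HIP` is an invented build flag.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins so the sketch compiles without a GPU toolkit.
// A real build would include <cuda_runtime.h> or <hip/hip_runtime.h> instead.
namespace fake_cuda {
using Stream = int;
using Error = int;
constexpr Error kSuccess = 0;
inline Error StreamSynchronize(Stream) { return kSuccess; }
inline const char* BackendName() { return "cuda"; }
} // namespace fake_cuda

namespace fake_hip {
using Stream = int;
using Error = int;
constexpr Error kSuccess = 0;
inline Error StreamSynchronize(Stream) { return kSuccess; }
inline const char* BackendName() { return "hip"; }
} // namespace fake_hip

// Select the backend exactly once. SKETCH_USE_HIP is an invented flag.
#if defined(SKETCH_USE_HIP)
namespace gpu = fake_hip;
#else
namespace gpu = fake_cuda;
#endif

// Shared code is written once against the backend-neutral alias.
inline void SynchronizeChecked(gpu::Stream stream) {
  const gpu::Error status = gpu::StreamSynchronize(stream);
  assert(status == gpu::kSuccess);
  (void)status; // avoid an unused-variable warning when NDEBUG is defined
}

inline std::string DescribeBackend() {
  return std::string("backend: ") + gpu::BackendName();
}
```

With this layout, only the two small backend namespaces differ; functions such as `SynchronizeChecked` exist once and compile against whichever backend the flag selects.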

On a more general note: Do you think it would be feasible to integrate some CI for AMD GPUs? Currently, Horovod's CI system only runs builds and tests on systems with NVIDIA GPUs so it's very easy to accidentally break compatibility with AMD GPUs. It wouldn't have to be much, just one build configuration and a few tests would already be very helpful (maybe similar in scope to the ppc64le-checks that are in place now).

horovod/common/common.h

horovod/common/ops/gpu_operations.h

horovod/common/ops/rocm/CMakeLists.txt

horovod/common/ops/rocm/hip_kernels.cu

horovod/common/ops/gpu_operations.cc

horovod/common/ops/hip_operations.cc

horovod/tensorflow/xla_mpi_ops.cc

weihanmines · 2022-05-04T19:01:46Z

Hi Max @maxhgerlach, yes, thank you for your feedback. The purpose of PR #3310 is to unify the CUDA and HIP interfaces so that we don't have repeated code. Unfortunately, I do not know how to fix the CI issues in that PR. I will open follow-up PRs to achieve the same goal. I think we will get there. Thanks again for your help.

May I ask if the upstream has AMD GPUs available for CI?

maxhgerlach · 2022-05-05T16:57:17Z

I'm not sure if we would be able to set up a multi-GPU system with AMD GPUs on Buildkite here. An alternative might also be to integrate some CI job running on an appropriate external system if there is anything like that available on your end.

A good first step might also just be a job that builds Horovod in a container meant to be run on ROCM systems, even if we don't run any actual tests there initially. That would catch many build-time regressions at least.

weihanmines · 2022-05-05T16:59:50Z

Let me see what I can do to help here. Thank you for your suggestions.

weihanmines · 2022-05-10T21:44:57Z

Hi Max @maxhgerlach, I removed the duplication in the xla_mpi_ops.cc file in the latest commit. Could you please review my updated changes? Thank you.

EnricoMi · 2022-05-16T10:32:08Z

@weihanmines any updates on how we could test continuously on a ROCM system?

EnricoMi

minor comments

horovod/common/ops/rocm/CMakeLists.txt

horovod/common/ops/hip_operations.cc

horovod/common/ops/rocm/hip_kernels.cu

weihanmines · 2022-05-17T18:44:52Z

Hi Enrico @EnricoMi Is it possible for us to have the upstream tokens for the pull requests and pushes?

weihanmines · 2022-05-18T15:24:08Z

We need the token for the external CI. Thank you.

EnricoMi · 2022-05-18T20:53:16Z

Can you please explain what kind of token and how that would be used to integrate the external CI?

weihanmines · 2022-05-18T23:00:20Z

We need help configuring our Jenkins CI for the Horovod repo, using the webhook secret so that GitHub webhooks trigger it on push and pull request events. Thank you.

EnricoMi · 2022-05-20T07:59:26Z

As I understand GitHub Webhooks: Whenever there is a push or pull-request event on our repository, GitHub can call the Jenkins webhook to trigger the CI on your side. For that, Horovod needs a secret from Jenkins to authenticate to Jenkins. You can generate the secret. Please send it to my e-mail address if you are comfortable with that.

I am referring to this:
https://docs.github.com/en/developers/webhooks-and-events/webhooks/about-webhooks
https://docs.github.com/en/developers/webhooks-and-events/webhooks/creating-webhooks
https://docs.github.com/en/developers/webhooks-and-events/webhooks/securing-your-webhooks

If this is not what you are trying to set up, please be more specific about what you need. Maybe a link to documentation or tutorials outlining the full picture of what you are setting up and which steps you need us to take.

weihanmines · 2022-05-20T18:13:32Z

Hi Enrico,
I am not familiar with DevOps. My colleague tells me that you only need the payload URL. For your convenience, I attach the image he sent to me. Please feel free to contact me if you have any questions. Thanks.

EnricoMi · 2022-05-22T18:22:52Z

Webhook created, initial ping request sent out successfully. I have enabled this webhook for push and pull request events.

weihanmines · 2022-05-22T23:14:58Z

Please let me know if there is anything else I could do to get this PR merged. Thank you.

EnricoMi · 2022-05-24T19:55:42Z

@maxhgerlach what do you think? Any objections?

maxhgerlach

Hi @weihanmines,

sorry for the radio silence, I was off the grid for a bit, but am back from vacation now.

Thank you very much for your updates! I appreciate reducing the code duplication in gpu_operations.{cc,h} and xla_mpi_ops.cc, too. The new changes all look good to me. I've only left two minor comments. Addressing them would not be critical, I think.

Also thanks for chiming in on the review, @EnricoMi!

Great that we could get started on integrating some CI with AMD GPUs. 👍

horovod/tensorflow/xla_mpi_ops.cc

EnricoMi

LGTM!

EnricoMi · 2022-05-25T20:05:26Z

@weihanmines when can I expect Horovod to appear in http://ml-ci.amd.com:21096/?

weihanmines · 2022-05-26T00:35:38Z

I will talk to my colleague to figure it out. I will let you know as soon as I hear anything from him.

weihanmines · 2022-05-26T00:48:34Z

@maxhgerlach @EnricoMi It is strange that four tests fail in the latest commit, while they pass in commit 4f5a5db. These two commits are identical. Could you please help me rerun the tests? Thank you.

maxhgerlach · 2022-05-26T08:41:10Z

Hi @weihanmines,

as far as I can tell, those new test failures are restricted to builds with current nightly versions of frameworks. So it's out of your control and should not block the merge.

On master there are some failures with head versions, too: https://github.com/horovod/horovod/runs/6591350696?check_suite_focus=true So that's a separate issue to be investigated.

EnricoMi requested a review from maxhgerlach April 22, 2022 18:03

weihanmines force-pushed the sep-hip-impl branch from 159b5b6 to 19e9b48 Compare April 25, 2022 03:03

weihanmines and others added 4 commits April 27, 2022 05:37

separate rocm implementation added

59f17f8

Signed-off-by: Wei Han <wei.han3@amd.com> Signed-off-by: weihanmines <wei.han3@amd.com>

add rocm kenrels

e9d381e

Signed-off-by: Wei Han <wei.han3@amd.com> Signed-off-by: weihanmines <wei.han3@amd.com>

complete ROCm implementaiton

db4e412

Signed-off-by: Wei Han <wei.han3@amd.com> Signed-off-by: weihanmines <wei.han3@amd.com>

platform string changed in XLA backend

8fa52b7

Signed-off-by: weihanmines <wei.han3@amd.com>

weihanmines force-pushed the sep-hip-impl branch from 8502607 to 8fa52b7 Compare April 27, 2022 05:39

maxhgerlach reviewed May 4, 2022

maxhgerlach mentioned this pull request May 4, 2022

Integrating ROCm implementation with XLA auto clustering and unifying CUDA and ROCm Interfaces for TensorFlow(contd. #3178) #3310

Closed

3 tasks

remove duplates in gpu operations

703ec25

Signed-off-by: weihanmines <wei.han3@amd.com>

weihanmines force-pushed the sep-hip-impl branch from 3b98d94 to 703ec25 Compare May 9, 2022 20:56

use HAVE_GPU instead of HAVE_CUDA || HAVE_ROCM

7ad8abe

Signed-off-by: weihanmines <wei.han3@amd.com>
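The commit title above describes replacing the combined check `HAVE_CUDA || HAVE_ROCM` with a single `HAVE_GPU` flag. A minimal sketch of that pattern follows. How `HAVE_GPU` is derived here is an assumption made for illustration (in the actual project it may well be set by the build system), and `BuildFlavor`, `KernelSourceFile`, and `CheckFlagsConsistent` are hypothetical helper names.

```cpp
#include <cassert>
#include <string>

// Assumption: the build passes -DHAVE_CUDA=1 or -DHAVE_ROCM=1 (or neither).
// Undefined macros evaluate to 0 inside #if, so no defaults are needed.
#ifndef HAVE_GPU
#if HAVE_CUDA || HAVE_ROCM
#define HAVE_GPU 1
#else
#define HAVE_GPU 0
#endif
#endif

// Code shared by both GPU backends checks only the umbrella flag.
inline std::string BuildFlavor() {
#if HAVE_GPU
  return "gpu";
#else
  return "cpu-only";
#endif
}

// Backend-specific code still checks the individual flags.
inline std::string KernelSourceFile() {
#if HAVE_CUDA
  return "cuda_kernels.cu";
#elif HAVE_ROCM
  return "hip_kernels.cu";
#else
  return "";
#endif
}

// Sanity check: the umbrella flag is set exactly when a backend is selected.
inline void CheckFlagsConsistent() {
  assert((BuildFlavor() == "gpu") == !KernelSourceFile().empty());
}
```

Compiled without either backend flag, both helpers report a CPU-only build. Adding `-DHAVE_ROCM=1` switches the shared path to the GPU branch without any change to the shared code.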

EnricoMi reviewed May 16, 2022

horovod/common/ops/rocm/CMakeLists.txt

horovod/common/ops/hip_operations.cc

horovod/common/ops/rocm/hip_kernels.cu

weihanmines and others added 2 commits May 16, 2022 18:34

remove duplication in xla mpi ops impl

7fde67a

Signed-off-by: weihanmines <wei.han3@amd.com>

Update horovod/common/ops/rocm/CMakeLists.txt

b75eba9

Co-authored-by: Enrico Minack <github@enrico.minack.dev> Signed-off-by: weihanmines <wei.han3@amd.com>

weihanmines added 2 commits May 16, 2022 18:34

removed an extra line in hip_operations.cc

f7aa621

Signed-off-by: weihanmines <wei.han3@amd.com>

Added sync warning message in [cuda|hip]_kernels.[h|cu] files

4f5a5db

Signed-off-by: weihanmines <wei.han3@amd.com>

weihanmines force-pushed the sep-hip-impl branch from db44bb6 to 4f5a5db Compare May 16, 2022 18:35

maxhgerlach approved these changes May 25, 2022

horovod/tensorflow/xla_mpi_ops.cc

horovod/tensorflow/xla_mpi_ops.cc

fixed a comment string and added a preprocessor branch for ROCM

ba0c354

Signed-off-by: weihanmines <wei.han3@amd.com>

EnricoMi approved these changes May 25, 2022

maxhgerlach merged commit 9fd2dfe into horovod:master May 26, 2022

maxhgerlach mentioned this pull request Jun 7, 2022

Can't pip install horovod for rocm 5.0+ #3537

Closed

romerojosh mentioned this pull request Sep 7, 2022

Enable use of native ncclAvg op for NCCL allreduces. #3646

Merged

4 tasks
