Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL][AMDGCN] Fix up and down shuffles and reductions #5359

Merged
merged 3 commits into from Jan 26, 2022

Conversation

npmiller
Copy link
Contributor

This patch fixes the group collective implementation for AMDGCN, which
had two main issues, in one place it was calling a regular shuffle
instead of a shuffleUp which ended up breaking the reduction
algorithm. In addition it was also not using the correct interface for
the SPIR-V shuffleUp function.

Which leads to the second part of this patch which fixes the shuffleUp
and shuffleDown functions, mostly for the AMDGCN built-ins but also in
the SYCL header, as the SYCL built-ins were not implemented properly on
top of the SPIR-V built-ins.

At the SYCL level, the shuffleUp and shuffleDown built-ins take a
value to participate in the shuffle and a delta. The delta is used to
compute which thread to take the value from during the shuffle
operation. For shuffleUp it will be substracted from the thread id,
and for shuffleDown it will be added. And so in SYCL this delta must
be defined such as subgroup_local_id - delta falls within [0, subgroup_local_size[ for shuffleUp, and subgroup_local_id + delta
falls within [0, subgroup_local_size[ for shuffleDown.

However in SPIR-V, these built-ins are a bit more complicated and take
two values to participate in the shuffle and support twice the delta
range as the SYCL built-ins. For example for shuffleUp the valid range
for subgroup_local_id - delta is [-subgroup_local_size, subgroup_local_size[ and in this instance if it falls within
[-subgroup_local_size, 0[ the first value will be used to participate
in the shuffle, and if it falls within [0, subgroup_local_size[ the
second value will be used to participate in the shuffle. And it works in
a similar way for shuffleDown.

And so when implementing the SYCL built-ins using the SPIR-V built-ins,
only half of the range can be used in a properly defined way, which
means only one of the value parameters of the SPIR-V built-ins actually
matters. Therefore the SYCL built-ins are implemented passing in the
same value to both value parameters of the SPIR-V built-ins.

The complete definition of the SPIR-V built-ins can be found here:

This patch fixes the group collective implementation for AMDGCN, which
had two main issues, in one place it was calling a regular `shuffle`
instead of a `shuffleUp` which ended up breaking the reduction
algorithm. In addition it was also not using the correct interface for
the SPIR-V `shuffleUp` function.

Which leads to the second part of this patch which fixes the `shuffleUp`
and `shuffleDown` functions, mostly for the AMDGCN built-ins but also in
the SYCL header, as the SYCL built-ins were not implemented properly on
top of the SPIR-V built-ins.

At the SYCL level, the `shuffleUp` and `shuffleDown` built-ins take a
value to participate in the shuffle and a delta. The delta is used to
compute which thread to take the value from during the shuffle
operation. For `shuffleUp` it will be substracted from the thread id,
and for `shuffleDown` it will be added.  And so in SYCL this delta must
be defined such as `subgroup_local_id - delta` falls within `[0,
subgroup_local_size[` for `shuffleUp`, and `subgroup_local_id + delta`
falls within `[0, subgroup_local_size[` for `shuffleDown`.

However in SPIR-V, these built-ins are a bit more complicated and take
two values to participate in the shuffle and support twice the delta
range as the SYCL built-ins. For example for `shuffleUp` the valid range
for `subgroup_local_id - delta` is `[-subgroup_local_size,
subgroup_local_size[` and in this instance if it falls within
`[-subgroup_local_size, 0[` the first value will be used to participate
in the shuffle, and if it falls within `[0, subgroup_local_size[` the
second value will be used to participate in the shuffle. And it works in
a similar way for `shuffleDown`.

And so when implementing the SYCL built-ins using the SPIR-V built-ins,
only half of the range can be used in a properly defined way, which
means only one of the value parameters of the SPIR-V built-ins actually
matters. Therefore the SYCL built-ins are implemented passing in the
same value to both value parameters of the SPIR-V built-ins.

The complete definition of the SPIR-V built-ins can be found here:

* https://github.com/KhronosGroup/SPIRV-Registry/blob/main/extensions/INTEL/SPV_INTEL_subgroups.asciidoc#instructions
Pennycook
Pennycook previously approved these changes Jan 21, 2022
Copy link
Contributor

@Pennycook Pennycook left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I have some thoughts about long-term direction (see below) but don't want to block this.

Comment on lines +266 to +276
int val;
if (index >= 0 && index < size) {
val = current;
} else if (index < 0 && index > -size) {
val = previous;
index = index + size;
} else {
// index out of bounds so return arbitrary data
val = current;
index = self;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a correct implementation of ShuffleUpINTEL to me, but I'm concerned about the extra complexity it introduces. We know that previous and current are always the same variable (if this intrinsic is called from the SYCL headers.

The SPIR-V non-uniform shuffle instructions (e.g. OpGroupNonUniformShuffleUp) are much closer semantically to the functionality required by SYCL, and this PR has me wondering whether it would be a good idea to switch over to those. It might be worth opening an issue or something to track this -- somebody would have to run some experiments to see which backends actually support these instructions, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, we should definitely investigate the performance of this at some point.

Also for what it's worth the CUDA backend implements this directly in the header using NVPTX built-ins rather than going through the SPIR-V interface. So that mostly would affect the HIP backend and maybe others as well.

I've filed a ticket to keep track of this:

@npmiller
Copy link
Contributor Author

The failing CI is because of an "Unexpectedly Passed" test which makes sense because this if fixing an issue, I've updated the XFAIL here:

However I was expecting a lot more new passes as locally this fixed a lot more than one test. So there may be another issue that doesn't show up on gfx908, I'm not too sure what the CI is using currently.

npmiller and others added 2 commits January 25, 2022 14:08
Using defines to figure out the wavefront size there is incorrect
because libclc is not built for a specific amdgcn version, so it will
always default to `64`.

Instead use the `__oclc_wavefront64` global variable provided by ROCm,
which will be set to a different value depending on the architecture.
Co-authored-by: Alexey Bader <alexey.bader@intel.com>
@npmiller
Copy link
Contributor Author

Added the commits from #5386 to see how testing on the CI goes with both of the PRs.

@npmiller
Copy link
Contributor Author

Okay, so with the changes from #5386 this PR isn't getting any failures and exactly unexpected passes I was expecting from this patch.

Plus it also fixes the broadcast tests disabled in intel/llvm-test-suite#740

So I'm, going to combine the two PRs into this one as we need both for the testing to work and the other patch is fairly small, closing #5386 and adding enabling the broadcast tests to intel/llvm-test-suite#759

npmiller added a commit to npmiller/llvm-test-suite that referenced this pull request Jan 25, 2022
@bader bader requested a review from Pennycook January 25, 2022 16:48
@bader
Copy link
Contributor

bader commented Jan 25, 2022

Thanks!

Copy link
Contributor

@steffenlarsen steffenlarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff! And an additional 👍 for the thorough description.

@bader bader merged commit d99e957 into intel:sycl Jan 26, 2022
bader pushed a commit to intel/llvm-test-suite that referenced this pull request Jan 26, 2022
aelovikov-intel pushed a commit that referenced this pull request Feb 17, 2023
aelovikov-intel pushed a commit to aelovikov-intel/llvm that referenced this pull request Mar 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants