New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL][AMDGCN] Fix up and down shuffles and reductions #5359
Conversation
This patch fixes the group collective implementation for AMDGCN, which had two main issues, in one place it was calling a regular `shuffle` instead of a `shuffleUp` which ended up breaking the reduction algorithm. In addition it was also not using the correct interface for the SPIR-V `shuffleUp` function. Which leads to the second part of this patch which fixes the `shuffleUp` and `shuffleDown` functions, mostly for the AMDGCN built-ins but also in the SYCL header, as the SYCL built-ins were not implemented properly on top of the SPIR-V built-ins. At the SYCL level, the `shuffleUp` and `shuffleDown` built-ins take a value to participate in the shuffle and a delta. The delta is used to compute which thread to take the value from during the shuffle operation. For `shuffleUp` it will be substracted from the thread id, and for `shuffleDown` it will be added. And so in SYCL this delta must be defined such as `subgroup_local_id - delta` falls within `[0, subgroup_local_size[` for `shuffleUp`, and `subgroup_local_id + delta` falls within `[0, subgroup_local_size[` for `shuffleDown`. However in SPIR-V, these built-ins are a bit more complicated and take two values to participate in the shuffle and support twice the delta range as the SYCL built-ins. For example for `shuffleUp` the valid range for `subgroup_local_id - delta` is `[-subgroup_local_size, subgroup_local_size[` and in this instance if it falls within `[-subgroup_local_size, 0[` the first value will be used to participate in the shuffle, and if it falls within `[0, subgroup_local_size[` the second value will be used to participate in the shuffle. And it works in a similar way for `shuffleDown`. And so when implementing the SYCL built-ins using the SPIR-V built-ins, only half of the range can be used in a properly defined way, which means only one of the value parameters of the SPIR-V built-ins actually matters. Therefore the SYCL built-ins are implemented passing in the same value to both value parameters of the SPIR-V built-ins. The complete definition of the SPIR-V built-ins can be found here: * https://github.com/KhronosGroup/SPIRV-Registry/blob/main/extensions/INTEL/SPV_INTEL_subgroups.asciidoc#instructions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. I have some thoughts about long-term direction (see below) but don't want to block this.
int val; | ||
if (index >= 0 && index < size) { | ||
val = current; | ||
} else if (index < 0 && index > -size) { | ||
val = previous; | ||
index = index + size; | ||
} else { | ||
// index out of bounds so return arbitrary data | ||
val = current; | ||
index = self; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a correct implementation of ShuffleUpINTEL
to me, but I'm concerned about the extra complexity it introduces. We know that previous
and current
are always the same variable (if this intrinsic is called from the SYCL headers.
The SPIR-V non-uniform shuffle instructions (e.g. OpGroupNonUniformShuffleUp) are much closer semantically to the functionality required by SYCL, and this PR has me wondering whether it would be a good idea to switch over to those. It might be worth opening an issue or something to track this -- somebody would have to run some experiments to see which backends actually support these instructions, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, we should definitely investigate the performance of this at some point.
Also for what it's worth the CUDA backend implements this directly in the header using NVPTX built-ins rather than going through the SPIR-V interface. So that mostly would affect the HIP backend and maybe others as well.
I've filed a ticket to keep track of this:
The failing CI is because of an "Unexpectedly Passed" test which makes sense because this if fixing an issue, I've updated the However I was expecting a lot more new passes as locally this fixed a lot more than one test. So there may be another issue that doesn't show up on |
Using defines to figure out the wavefront size there is incorrect because libclc is not built for a specific amdgcn version, so it will always default to `64`. Instead use the `__oclc_wavefront64` global variable provided by ROCm, which will be set to a different value depending on the architecture.
Co-authored-by: Alexey Bader <alexey.bader@intel.com>
Added the commits from #5386 to see how testing on the CI goes with both of the PRs. |
Okay, so with the changes from #5386 this PR isn't getting any failures and exactly unexpected passes I was expecting from this patch. Plus it also fixes the broadcast tests disabled in intel/llvm-test-suite#740 So I'm, going to combine the two PRs into this one as we need both for the testing to work and the other patch is fairly small, closing #5386 and adding enabling the broadcast tests to intel/llvm-test-suite#759 |
This was fixed in the CI by intel/llvm#5359
Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good stuff! And an additional 👍 for the thorough description.
These tests are fixed by intel/llvm#5359
These tests are fixed by #5359
These tests are fixed by intel#5359
This patch fixes the group collective implementation for AMDGCN, which
had two main issues, in one place it was calling a regular
shuffle
instead of a
shuffleUp
which ended up breaking the reductionalgorithm. In addition it was also not using the correct interface for
the SPIR-V
shuffleUp
function.Which leads to the second part of this patch which fixes the
shuffleUp
and
shuffleDown
functions, mostly for the AMDGCN built-ins but also inthe SYCL header, as the SYCL built-ins were not implemented properly on
top of the SPIR-V built-ins.
At the SYCL level, the
shuffleUp
andshuffleDown
built-ins take avalue to participate in the shuffle and a delta. The delta is used to
compute which thread to take the value from during the shuffle
operation. For
shuffleUp
it will be substracted from the thread id,and for
shuffleDown
it will be added. And so in SYCL this delta mustbe defined such as
subgroup_local_id - delta
falls within[0, subgroup_local_size[
forshuffleUp
, andsubgroup_local_id + delta
falls within
[0, subgroup_local_size[
forshuffleDown
.However in SPIR-V, these built-ins are a bit more complicated and take
two values to participate in the shuffle and support twice the delta
range as the SYCL built-ins. For example for
shuffleUp
the valid rangefor
subgroup_local_id - delta
is[-subgroup_local_size, subgroup_local_size[
and in this instance if it falls within[-subgroup_local_size, 0[
the first value will be used to participatein the shuffle, and if it falls within
[0, subgroup_local_size[
thesecond value will be used to participate in the shuffle. And it works in
a similar way for
shuffleDown
.And so when implementing the SYCL built-ins using the SPIR-V built-ins,
only half of the range can be used in a properly defined way, which
means only one of the value parameters of the SPIR-V built-ins actually
matters. Therefore the SYCL built-ins are implemented passing in the
same value to both value parameters of the SPIR-V built-ins.
The complete definition of the SPIR-V built-ins can be found here: