Implement SYCL TeamPolicy for vector_size > 1 #4183
Conversation
Force-pushed from 5a5bef6 to 38e18c8
This seems to also work on NVIDIA GPUs now. I don't have a good explanation for the deadlock in the subgroup barrier.
Also, we can't specify a width for the sub-group shuffle operations, which means that we have to do some more index calculation ourselves compared to the corresponding CUDA implementation.
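For illustration, here is a small host-side sketch (hypothetical, not code from this PR) of that extra index calculation: CUDA's `__shfl_down_sync` accepts a width so lanes shuffle within groups of that size, whereas a width-less `shuffle_down` spans the whole sub-group, so values shuffled in across a group boundary must be filtered out by hand.

```cpp
#include <array>
#include <cstdio>

constexpr int kSubgroupSize = 8;

// Width-less shuffle_down across the whole sub-group (SYCL-style): lane i
// reads the value held by lane i + delta; out-of-range lanes keep their own.
int shuffle_down(const std::array<int, kSubgroupSize>& lanes, int lane,
                 int delta) {
  const int src = lane + delta;
  return src < kSubgroupSize ? lanes[src] : lanes[lane];
}

int main() {
  std::array<int, kSubgroupSize> lanes{10, 11, 12, 13, 14, 15, 16, 17};
  const int width = 4;  // emulated CUDA-style shuffle width (hypothetical)
  const int delta = 2;
  for (int lane = 0; lane < kSubgroupSize; ++lane) {
    const int raw = shuffle_down(lanes, lane, delta);
    // The extra index calculation: discard values that came from the next
    // width-sized group, mimicking __shfl_down_sync(mask, v, delta, width).
    const bool crosses = (lane % width) + delta >= width;
    const int value = crosses ? lanes[lane] : raw;
    std::printf("lane %d: raw %d, width-limited %d\n", lane, raw, value);
  }
}
```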
// FIXME_SYCL This deadlocks in the subgroup_barrier when running on CUDA
// devices.
"CUDA devices" means "NVIDIA GPUs"?
I would expect it to be an issue when compiling for the CUDA backend in SYCL (and thus in the generated code) rather than something related to NVIDIA GPUs directly. Either way, I am happy to change the wording if that is preferred.
So that is SYCL terminology and not Kokkos, is that right?
I see more CUDA than NVIDIA in git@github.com/intel/llvm if that is the question. The specifications don't mention any of that, of course. In this context, I use CUDA and NVIDIA GPUs as synonyms. Again, I'm happy to change this WIP comment.
You should open an issue on this since a deadlock indicates a bug either in the code or in the generated instructions.
See #4204.
Looks good to me given that we fix the `static` problem for `vector_length_max`.
Force-pushed from f16d0f8 to 66591bb
Got a couple questions.
typename ReducerType::value_type tmp2 = tmp;

for (int i = grange1; (i >>= 1);) {
  tmp2 = sg.shuffle_down(tmp, i);
Is this guaranteed to be a subgroup barrier?
https://www.khronos.org/registry/OpenCL/extensions/intel/cl_intel_subgroups.html says:
> The Intel extension adds a rich set of subgroup "shuffle" functions to allow work items within a work group to interchange data without the use of local memory and work group barriers.
It maps to `shfl.sync.down.b32` in PTX for CUDA, see https://github.com/intel/llvm/blob/sycl/sycl/doc/cuda/opencl-subgroup-vs-cuda-crosslane-op.md.
So I believe the answer is "yes".
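For reference, here is a host-side sketch (hypothetical, simulating the sub-group serially) of the tree reduction the quoted loop performs: each step shuffles values down by half the remaining range and combines them, so after log2(grange1) steps lane 0 holds the full reduction. It assumes `grange1` is a power of two, which matches the divisibility requirement discussed in the PR description.

```cpp
#include <array>
#include <cstdio>

int main() {
  constexpr int grange1 = 4;           // vector length, assumed a power of two
  std::array<int, grange1> lanes{1, 2, 3, 4};
  for (int i = grange1; (i >>= 1);) {  // i = 2, then i = 1
    std::array<int, grange1> shifted = lanes;
    // Emulate sg.shuffle_down(value, i): lane j reads from lane j + i.
    for (int lane = 0; lane + i < grange1; ++lane)
      shifted[lane] = lanes[lane + i];
    // Combine (here: sum) the shuffled-in value into each lane.
    for (int lane = 0; lane < grange1; ++lane) lanes[lane] += shifted[lane];
  }
  std::printf("lane 0 holds the reduction: %d\n", lanes[0]);  // prints 10
}
```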
const auto grange1 = item.get_local_range(1);
const auto sg = item.get_sub_group();
if (item.get_local_id(1) == 0) lambda(val);
val = sg.shuffle(val, (sg.get_local_id() / grange1) * grange1);
I don't get this. Shouldn't `item.get_local_id(1) == sg.get_local_id()`? And shouldn't `0 <= sg.get_local_id() < grange1` be true, too?
`grange1` is the requested vector_size and might be less than the subgroup size. Thus, `sg.get_local_id()` might be greater than or equal to `grange1`. Similarly, `item.get_local_id(1)` might be larger than `sg.get_local_id()` if there are multiple subgroups.
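To illustrate the mapping (a hypothetical host-side sketch, not the PR's code): with a sub-group of 8 lanes and `grange1 = 4`, each sub-group holds two vectors, and `(sg.get_local_id() / grange1) * grange1` is the first lane of the vector a given lane belongs to, i.e. the lane whose value the `sg.shuffle` above broadcasts.

```cpp
#include <cstdio>

int main() {
  const int subgroup_size = 8;  // assumed sub-group size
  const int grange1 = 4;        // requested vector length
  for (int lane = 0; lane < subgroup_size; ++lane) {
    // First lane of the vector this sub-group lane belongs to.
    const int leader = (lane / grange1) * grange1;
    std::printf("sub-group lane %d -> vector leader lane %d\n", lane, leader);
  }
  // Lanes 0..3 map to leader 0 and lanes 4..7 to leader 4: sg.get_local_id()
  // can be >= grange1 even though item.get_local_id(1) stays below grange1.
}
```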
Most of the implementation is mirrored from CUDA/HIP. For SYCL kernels, we don't know the subgroup size (warp size) at runtime; we can only obtain a vector of possible values from which we set the maximal vector size. In the end, the requirement is that the subgroup size is divisible by the vector size, and this pull request adds a check when compiling with `Kokkos_ENABLE_DEBUG`. Also, some of the tests needed to be adapted for Intel hardware since the maximum workgroup size there is limited to 256.
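As a rough sketch of the described check (assuming standard SYCL 2020 device queries rather than the exact Kokkos code): the device reports a vector of supported sub-group sizes, the maximal vector size is derived from it, and a requested vector length must divide the chosen sub-group size.

```cpp
#include <sycl/sycl.hpp>
#include <algorithm>
#include <cassert>
#include <vector>

int main() {
  sycl::queue q;
  // SYCL only exposes a list of possible sub-group sizes, not one fixed value.
  const std::vector<size_t> sizes =
      q.get_device().get_info<sycl::info::device::sub_group_sizes>();
  const size_t max_subgroup = *std::max_element(sizes.begin(), sizes.end());

  const size_t requested_vector_length = 4;  // hypothetical user request
  // The requirement stated above: the sub-group size must be divisible by
  // the vector size (checked in debug builds per the PR description).
  assert(max_subgroup % requested_vector_length == 0 &&
         "vector length must divide the sub-group size");
}
```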