
Implement SYCL TeamPolicy for vector_size > 1 #4183

Merged
11 commits merged into kokkos:develop from masterleinad:sycl_team_policy_work on Aug 2, 2021

Conversation

masterleinad (Contributor)

Most of the implementation is mirrored from CUDA/HIP. For SYCL kernels, we don't know the subgroup size (the warp size) at runtime; we can only obtain a vector of possible values, from which we set the maximal vector size. In the end, the requirement is that the subgroup size is divisible by the vector size, and this pull request adds a corresponding check when compiling with Kokkos_ENABLE_DEBUG.
Also, some of the tests needed to be adapted for Intel hardware since the maximum workgroup size there is limited to 256.
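
For illustration, a minimal sketch of the two ingredients described above, assuming the standard sycl::info::device::sub_group_sizes query; the helper names (max_vector_length, vector_length_is_valid) are made up for this example, and the header may be <sycl/sycl.hpp> or <CL/sycl.hpp> depending on the SYCL implementation:

#include <CL/sycl.hpp>
#include <algorithm>
#include <vector>

// Sketch: the device reports a list of supported sub-group sizes; the largest
// one is a natural upper bound for the vector length.
int max_vector_length(const sycl::device& dev) {
  const std::vector<size_t> sizes =
      dev.get_info<sycl::info::device::sub_group_sizes>();
  return static_cast<int>(*std::max_element(sizes.begin(), sizes.end()));
}

// Sketch of the debug-mode requirement: the sub-group size a kernel ends up
// using must be divisible by the requested vector length.
bool vector_length_is_valid(int sub_group_size, int vector_length) {
  return vector_length > 0 && sub_group_size % vector_length == 0;
}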

@masterleinad masterleinad added this to In progress in Kokkos Release 3.5 via automation Jul 22, 2021
@masterleinad masterleinad added this to In progress in Developer: Daniel Arndt Jul 22, 2021
masterleinad (Contributor, Author)

This seems to also work on NVIDIA GPUs now. I don't have a good explanation for the deadlock in the subgroup barrier in Test12a_ThreadScratch and Test12b_TeamScratch, though. The corresponding vector range is executed with vector_size==1 anyway, so replacing it with a serial loop as a workaround should not hurt too much; see the sketch below.
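
Roughly, the workaround looks like the following sketch (hypothetical names, not the actual Kokkos code): with a vector length of 1 there is nothing to distribute across sub-group lanes, so the range can be traversed serially and the problematic sub-group barrier is never reached.

// Sketch of the workaround: fall back to a plain serial loop when the
// requested vector length is 1 and skip the sub-group machinery entirely.
template <class Closure>
void vector_range_loop(int begin, int end, int vector_length,
                       const Closure& closure) {
  if (vector_length == 1) {
    for (int i = begin; i < end; ++i) closure(i);  // serial fallback
  } else {
    // sub-group-parallel path (shuffles, sub-group barrier, ...)
  }
}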

@masterleinad masterleinad marked this pull request as ready for review July 23, 2021 20:25
@masterleinad masterleinad requested a review from nliber July 23, 2021 20:25
@masterleinad masterleinad moved this from In progress to Awaiting Feedback in Developer: Daniel Arndt Jul 23, 2021
@masterleinad masterleinad moved this from In progress to Awaiting Feedback in Kokkos Release 3.5 Jul 23, 2021
masterleinad (Contributor, Author)

Also, we can't specify a width for the sub-group shuffle operations, which means that we have to do some more index calculation ourselves compared to the corresponding CUDA instructions.
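
To illustrate the difference (placeholder names sg, val, width, delta, not the PR's code): CUDA's __shfl_down_sync takes a width argument that restricts the shuffle to logical groups of that many lanes, while the SYCL sub-group shuffles always act on the whole sub-group, so the source lane of a "grouped" shuffle has to be computed by hand.

// CUDA: the width argument confines the shuffle to logical groups of lanes:
//   val = __shfl_down_sync(0xffffffff, val, delta, width);
//
// SYCL: no width argument, so restrict the shuffle manually, e.g. to broadcast
// from the first lane of each logical group of `width` lanes:
const int lane        = sg.get_local_id();       // lane id within the sub-group
const int group_start = (lane / width) * width;  // first lane of this logical group
val = sg.shuffle(val, group_start);              // read explicitly from that lane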

core/src/SYCL/Kokkos_SYCL_Parallel_Team.hpp (outdated review thread, resolved)
Comment on lines +77 to +78
// FIXME_SYCL This deadlocks in the subgroup_barrier when running on CUDA
// devices.
Member

"CUDA devices" means "NVIDIA GPUs"?

Contributor Author

I would expect it to be an issue when compiling for the CUDA backend in SYCL (and thus the generated code) rather than it being related to NVIDIA GPUs directly.
Either way, I am happy to change the wording if that is preferred.

Member

So that is SYCL terminology and not Kokkos, is that right?

Contributor Author

I see more CUDA than NVIDIA in git@github.com/intel/llvm if that is the question. The specifications don't mention any of that, of course.
In this context, I use CUDA and NVIDIA GPUs as synonyms. Again, I'm happy to change this WIP comment.

Member

You should open an issue on this since a deadlock indicates a bug in the code or in the generated instructions.

Contributor Author

See #4204.

core/src/SYCL/Kokkos_SYCL.cpp (review thread, resolved)
core/src/SYCL/Kokkos_SYCL_Team.hpp (review thread, resolved)
core/unit_test/TestTeam.hpp (review thread, resolved)
@Rombur (Member) left a comment

Looks good to me given that we fix the static problem for vector_length_max.

@crtrott (Member) left a comment

Got a couple questions.

typename ReducerType::value_type tmp2 = tmp;

for (int i = grange1; (i >>= 1);) {
tmp2 = sg.shuffle_down(tmp, i);
Member

Is this guaranteed to be a subgroup barrier?

Contributor Author

https://www.khronos.org/registry/OpenCL/extensions/intel/cl_intel_subgroups.html says:

The Intel extension adds a rich set of subgroup "shuffle" functions to allow work items within a work group to interchange data without the use of local memory and work group barriers.

It maps to shfl.sync.down.b32 in PTX for CUDA, see https://github.com/intel/llvm/blob/sycl/sycl/doc/cuda/opencl-subgroup-vs-cuda-crosslane-op.md.

So I believe the answer is "yes".
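
For context, the excerpt above is the core of a sub-group shuffle-down tree reduction. A sketch of the full pattern, with a plain sum standing in for the reducer's join (so not the PR's exact code):

// Shuffle-down reduction over a logical group of grange1 lanes: in each step
// every lane reads the value held i lanes to its right and combines it with
// its own; after log2(grange1) steps, lane 0 of the group holds the result.
auto sg = item.get_sub_group();
double tmp = my_contribution;               // placeholder per-lane value
for (int i = grange1; (i >>= 1);) {
  double tmp2 = sg.shuffle_down(tmp, i);    // exchange data across lanes
  tmp += tmp2;                              // placeholder combine (sum)
}
// lane 0 of the logical group now holds the reduction of its grange1 values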

const auto grange1 = item.get_local_range(1);
const auto sg = item.get_sub_group();
if (item.get_local_id(1) == 0) lambda(val);
val = sg.shuffle(val, (sg.get_local_id() / grange1) * grange1);
Member

I don't get this. Shouldn't item.get_local_id(1) == sg.get_local_id()? And shouldn't 0 <= sg.get_local_id() < grange1 be true, too?

Contributor Author

grange1 is the vector_size as requested and might be less than the subgroup size. Thus, sg.get_local_id() might be larger than or equal to grange1.
Similarly, item.get_local_id(1) might be larger than sg.get_local_id() if there are multiple subgroups.
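
A small worked example may make the index arithmetic clearer (illustrative numbers only, not taken from the PR):

// Suppose the sub-group size is 16 and grange1 (the requested vector length) is 4.
// (sg.get_local_id() / grange1) * grange1 rounds each lane down to the first lane
// of its logical vector group:
//   lanes  0.. 3 -> shuffle source lane  0
//   lanes  4.. 7 -> shuffle source lane  4
//   lanes  8..11 -> shuffle source lane  8
//   lanes 12..15 -> shuffle source lane 12
// so the value produced by "vector lane 0" of each group is broadcast only to
// the other lanes of that same group.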


@crtrott crtrott merged commit 14434f9 into kokkos:develop Aug 2, 2021
Kokkos Release 3.5 automation moved this from Awaiting Feedback to Done Aug 2, 2021
@masterleinad masterleinad deleted the sycl_team_policy_work branch December 2, 2021 19:28