Implement SYCL TeamPolicy for vector_size > 1 #4183
Conversation
Force-pushed from 5a5bef6 to 38e18c8
This seems to also work on NVIDIA GPUs now. I don't have a good explanation for the deadlock in the subgroup barrier.
Also, we can't specify a width for the sub-group shuffle operations, which means that we have to do some more index calculation ourselves compared to the corresponding CUDA implementation.
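For illustration, here is a small host-side sketch (hypothetical, not code from this PR) of that extra index calculation: CUDA's `__shfl_down_sync` accepts a width so lanes shuffle within groups of that size, whereas a width-less `shuffle_down` spans the whole sub-group, so values shuffled in across a group boundary must be filtered out by hand.

```cpp
#include <array>
#include <cstdio>

constexpr int kSubgroupSize = 8;

// Width-less shuffle_down across the whole sub-group (SYCL-style): lane i
// reads the value held by lane i + delta; out-of-range lanes keep their own.
int shuffle_down(const std::array<int, kSubgroupSize>& lanes, int lane,
                 int delta) {
  const int src = lane + delta;
  return src < kSubgroupSize ? lanes[src] : lanes[lane];
}

int main() {
  std::array<int, kSubgroupSize> lanes{10, 11, 12, 13, 14, 15, 16, 17};
  const int width = 4;  // emulated CUDA-style shuffle width (hypothetical)
  const int delta = 2;
  for (int lane = 0; lane < kSubgroupSize; ++lane) {
    const int raw = shuffle_down(lanes, lane, delta);
    // The extra index calculation: discard values that came from the next
    // width-sized group, mimicking __shfl_down_sync(mask, v, delta, width).
    const bool crosses = (lane % width) + delta >= width;
    const int value = crosses ? lanes[lane] : raw;
    std::printf("lane %d: raw %d, width-limited %d\n", lane, raw, value);
  }
}
```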
// FIXME_SYCL This deadlocks in the subgroup_barrier when running on CUDA
// devices.
"CUDA devices" means "NVIDIA GPUs"?
I would expect it to be an issue when compiling for the CUDA backend in SYCL (and thus in the generated code) rather than something related to NVIDIA GPUs directly. Either way, I am happy to change the wording if that is preferred.
So that is SYCL terminology and not Kokkos, is that right?
I see more CUDA than NVIDIA in git@github.com/intel/llvm if that is the question. The specifications don't mention any of that, of course. In this context, I use CUDA and NVIDIA GPUs as synonyms. Again, I'm happy to change this WIP comment.
You should open an issue on this since a deadlock indicates a bug either in the code or in the generated instructions.
See #4204.
Looks good to me given that we fix the `static` problem for `vector_length_max`.
Force-pushed from f16d0f8 to 66591bb
Got a couple questions.
typename ReducerType::value_type tmp2 = tmp;

for (int i = grange1; (i >>= 1);) {
  tmp2 = sg.shuffle_down(tmp, i);
Is this guaranteed to be a subgroup barrier?
https://www.khronos.org/registry/OpenCL/extensions/intel/cl_intel_subgroups.html says:
> The Intel extension adds a rich set of subgroup "shuffle" functions to allow work items within a work group to interchange data without the use of local memory and work group barriers.
It maps to `shfl.sync.down.b32` in PTX for CUDA, see https://github.com/intel/llvm/blob/sycl/sycl/doc/cuda/opencl-subgroup-vs-cuda-crosslane-op.md.
So I believe the answer is "yes".
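For reference, here is a host-side sketch (hypothetical, simulating the sub-group serially) of the tree reduction the quoted loop performs: each step shuffles values down by half the remaining range and combines them, so after log2(grange1) steps lane 0 holds the full reduction. It assumes `grange1` is a power of two, which matches the divisibility requirement discussed in the PR description.

```cpp
#include <array>
#include <cstdio>

int main() {
  constexpr int grange1 = 4;           // vector length, assumed a power of two
  std::array<int, grange1> lanes{1, 2, 3, 4};
  for (int i = grange1; (i >>= 1);) {  // i = 2, then i = 1
    std::array<int, grange1> shifted = lanes;
    // Emulate sg.shuffle_down(value, i): lane j reads from lane j + i.
    for (int lane = 0; lane + i < grange1; ++lane)
      shifted[lane] = lanes[lane + i];
    // Combine (here: sum) the shuffled-in value into each lane.
    for (int lane = 0; lane < grange1; ++lane) lanes[lane] += shifted[lane];
  }
  std::printf("lane 0 holds the reduction: %d\n", lanes[0]);  // prints 10
}
```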
const auto grange1 = item.get_local_range(1);
const auto sg = item.get_sub_group();
if (item.get_local_id(1) == 0) lambda(val);
val = sg.shuffle(val, (sg.get_local_id() / grange1) * grange1);
I don't get this. Shouldn't `item.get_local_id(1) == sg.get_local_id()`? And shouldn't `0 <= sg.get_local_id() < grange1` be true, too?
`grange1` is the requested vector_size and might be less than the subgroup size. Thus, `sg.get_local_id()` might be greater than or equal to `grange1`. Similarly, `item.get_local_id(1)` might be larger than `sg.get_local_id()` if there are multiple subgroups.
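To illustrate the mapping (a hypothetical host-side sketch, not the PR's code): with a sub-group of 8 lanes and `grange1 = 4`, each sub-group holds two vectors, and `(sg.get_local_id() / grange1) * grange1` is the first lane of the vector a given lane belongs to, i.e. the lane whose value the `sg.shuffle` above broadcasts.

```cpp
#include <cstdio>

int main() {
  const int subgroup_size = 8;  // assumed sub-group size
  const int grange1 = 4;        // requested vector length
  for (int lane = 0; lane < subgroup_size; ++lane) {
    // First lane of the vector this sub-group lane belongs to.
    const int leader = (lane / grange1) * grange1;
    std::printf("sub-group lane %d -> vector leader lane %d\n", lane, leader);
  }
  // Lanes 0..3 map to leader 0 and lanes 4..7 to leader 4: sg.get_local_id()
  // can be >= grange1 even though item.get_local_id(1) stays below grange1.
}
```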
Most of the implementation is mirrored from CUDA/HIP. For SYCL kernels, we don't know the subgroup size (warp size) at runtime; we can only obtain a vector of possible values from which we set the maximal vector size. In the end, the requirement is that the subgroup size is divisible by the vector size, and this pull request adds a check when compiling with `Kokkos_ENABLE_DEBUG`. Also, some of the tests needed to be adapted for Intel hardware since the maximum workgroup size there is limited to 256.
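As a rough sketch of the described check (assuming standard SYCL 2020 device queries rather than the exact Kokkos code): the device reports a vector of supported sub-group sizes, the maximal vector size is derived from it, and a requested vector length must divide the chosen sub-group size.

```cpp
#include <sycl/sycl.hpp>
#include <algorithm>
#include <cassert>
#include <vector>

int main() {
  sycl::queue q;
  // SYCL only exposes a list of possible sub-group sizes, not one fixed value.
  const std::vector<size_t> sizes =
      q.get_device().get_info<sycl::info::device::sub_group_sizes>();
  const size_t max_subgroup = *std::max_element(sizes.begin(), sizes.end());

  const size_t requested_vector_length = 4;  // hypothetical user request
  // The requirement stated above: the sub-group size must be divisible by
  // the vector size (checked in debug builds per the PR description).
  assert(max_subgroup % requested_vector_length == 0 &&
         "vector length must divide the sub-group size");
}
```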