SYCL Feature level 4 (parallel_for) #3474

masterleinad · 2020-10-09T13:55:28Z

Based on #3451. The most appropriate launch mechanism is still not quite clear, so I am including the first one @nliber came up with (which has the severe restriction that data types must be trivially copyable in order to be copied to the device) and an indirect launch mechanism using shared memory.
I think that we can discuss further improvements to the launch mechanism in a separate pull request.

nliber · 2020-10-09T16:00:40Z

@masterleinad Just to clarify: the one currently checked in here handles trivially copyable types by placement-newing them into USM shared memory. It should work functionally, although it isn't the mechanism we'll be using. I think our best course is to use this one for now unless problems show up, in which case we update it to the latest.

masterleinad · 2020-10-12T20:33:07Z

Rebased.

core/src/SYCL/Kokkos_SYCL_Parallel_Range.hpp

core/src/SYCL/Kokkos_SYCL_Instance.hpp

crtrott

Have a couple of questions and change requests.

core/src/SYCL/Kokkos_SYCL_Instance.hpp

crtrott · 2020-10-13T14:55:29Z

core/src/SYCL/Kokkos_SYCL_KernelLaunch.hpp

+
+  q.submit([&](cl::sycl::handler& cgh) {
+    cgh.parallel_for(range, [=](cl::sycl::item<1> item) {
+      int id = item.get_linear_id();


Haha, and here we use an int and not the size_type of the execution space. OK, what it really should be is the index type of the policy type.

Also, I don't get this. Shouldn't this call functor()? I mean, the ParallelFor class only has an argument-free operator and then gets the id to call on to the actual functor?

We don't have any implicit arguments like blockDim and gridDim for SYCL, so we have to do the work mapping explicitly.
Currently, the ParallelFor call operator is not used and it's not quite clear to me if it needs to exist. On the other hand, it's also not quite clear to me if we can/want to generalize KernelLaunch in a way to support parallel_for, parallel_reduce and parallel_scan.
In #3480, everything is defined inline in ParallelReduce for example.

The thing is, if you want to reuse the kernel launch for TeamPolicy and MDRangePolicy, you need to do something like what CUDA does. Considering that we already have at least the direct vs. indirect launch mechanisms, and this is before we hit any of the optimizations we consider for CUDA, like occupancy and whatnot, I think we should go the same route here.

@nliber Any thoughts about this? I want to avoid that we step on each other's toes with respect to anything related to the launch mechanism.

I'm not sure how general it can be. There are different SYCL calls for parallel_for and parallel_reduce (although SYCL parallel_reduce was not implemented at the time I initially wrote this). I want them to follow the same pattern but I want to use the correct SYCL call when it is there.

I agree that they should look the same. I guess sycl_indirect_launch could possibly be shared, but I think we can refactor when we at least also have parallel_reduce.

crtrott · 2020-10-13T14:56:42Z

core/src/SYCL/Kokkos_SYCL_KernelLaunch.hpp

+  // Placement new a copy of functor into USM shared memory
+  //
+  // Store it in a unique_ptr to call its destructor on scope exit
+  std::unique_ptr<Functor, Kokkos::Impl::destruct_delete> kernelFunctorPtr(


Do we need a fence before this?

At the moment, we fence in sycl_direct_launch and call the destructor at the end of this scope. We should reconsider this when refactoring the launch mechanism.
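The lifetime question raised here can be illustrated without SYCL. The following is only a sketch under stated assumptions: a plain host buffer stands in for USM shared memory, a serial loop stands in for the device kernel, and `DestructOnly` is a hypothetical stand-in for `Kokkos::Impl::destruct_delete`, whose definition is not shown in this thread.

```cpp
#include <cassert>
#include <memory>
#include <new>

// Hypothetical stand-in for Kokkos::Impl::destruct_delete: run the destructor
// but do not free the storage, which is owned elsewhere.
struct DestructOnly {
  template <class T>
  void operator()(T* p) const noexcept {
    p->~T();
  }
};

// Example functor with a non-trivial destructor that counts destructions.
struct CountingFunctor {
  int* destroyed;
  explicit CountingFunctor(int* d) : destroyed(d) {}
  CountingFunctor(const CountingFunctor&) = default;
  ~CountingFunctor() { ++*destroyed; }
  void operator()(int) const {}
};

// Placement-news a copy of `f` into `storage` (assumed suitably sized and
// aligned; in the PR this would be USM shared memory), runs it over [0, n),
// and destroys the copy on scope exit. Returns the number of calls made.
template <class Functor>
int launch_from_storage(void* storage, const Functor& f, int n) {
  assert(storage != nullptr);
  std::unique_ptr<Functor, DestructOnly> copy(new (storage) Functor(f));
  int calls = 0;
  for (int i = 0; i < n; ++i) {  // serial stand-in for the device kernel
    (*copy)(i);
    ++calls;
  }
  // An asynchronous launch would have to be fenced before this point,
  // because the copy's destructor runs as soon as `copy` leaves scope.
  return calls;
}
```

With a real asynchronous queue, the position of the fence relative to the end of this scope is what decides whether the kernel can still be reading the functor when it is destroyed.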

crtrott · 2020-10-13T14:57:39Z

core/src/SYCL/Kokkos_SYCL_Parallel_Range.hpp

+  typedef ExecPolicy Policy;
+
+ private:
+  typedef typename Policy::member_type Member;


what about our rule to use using everywhere??

obviously I don't mind typedefs ;-)

I'll run clang-tidy over it.

crtrott · 2020-10-13T15:00:43Z

core/src/SYCL/Kokkos_SYCL_Parallel_Range.hpp

+ public:
+  typedef FunctorType functor_type;
+
+  inline void operator()(cl::sycl::item<1> item) const {


how does this work? The direct launch passes in an int?? Isn't that weird? Should direct_launch just pass on the sycl item?

It's unused. Let me remove it for now to avoid confusion.

crtrott · 2020-10-13T15:01:18Z

core/src/SYCL/Kokkos_SYCL_Parallel_Range.hpp

+  typedef FunctorType functor_type;
+
+  inline void operator()(cl::sycl::item<1> item) const {
+    int id = item.get_linear_id();


this needs to use the member/index_type from the policy.

I'll just change to auto.

no the expectation is that the user functor gets the member type from the policy.

sycl::handler::parallel_for calls the function it is passed with a sycl::item. We can't change what item.get_linear_id() returns (although it should be a size_t). Should I just cast it to the member type from the policy before passing it in?

This discussion is outdated. We are doing this already, see https://github.com/kokkos/kokkos/pull/3474/files#diff-3b75050857c4c9a6392a5b0dd07241be60a4fed7c8c9fa56ce2242a4324e551fR80. 🙂
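For readers following the index-type discussion, here is a minimal sketch of the cast being described. It is not the code from the PR: `MockItem`, `MockPolicy`, `IndexAdaptor`, `mock_parallel_for` and `FillDoubled` are hypothetical names, the item type only mimics `get_linear_id()`, and a serial loop replaces the SYCL launch.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical stand-in for cl::sycl::item<1>; only get_linear_id() is mocked.
struct MockItem {
  std::size_t id;
  std::size_t get_linear_id() const { return id; }
};

// Hypothetical policy exposing the index type the user functor expects.
struct MockPolicy {
  using index_type = long;
};

// Wraps a user functor so it is always called with Policy::index_type,
// whatever type the runtime's item returns.
template <class Policy, class Functor>
struct IndexAdaptor {
  static_assert(std::is_integral<typename Policy::index_type>::value,
                "index_type is assumed to be an integer type");
  Functor functor;
  void operator()(MockItem item) const {
    const auto id =
        static_cast<typename Policy::index_type>(item.get_linear_id());
    // The cast must not lose information for the ranges in use.
    assert(static_cast<std::size_t>(id) == item.get_linear_id());
    functor(id);
  }
};

// Serial stand-in for handler::parallel_for over a 1D range of n items.
template <class Policy, class Functor>
void mock_parallel_for(std::size_t n, Functor f) {
  IndexAdaptor<Policy, Functor> kernel{f};
  for (std::size_t i = 0; i < n; ++i) kernel(MockItem{i});
}

// Example user functor written against the policy's index type.
struct FillDoubled {
  long* out;
  void operator()(MockPolicy::index_type i) const { out[i] = 2 * i; }
};
```

Calling `mock_parallel_for<MockPolicy>(3, FillDoubled{out})` on a three-element array writes 0, 2 and 4, with the functor only ever seeing `long` indices.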

masterleinad · 2020-10-13T15:56:47Z

Can we please change this to a signed type? I ALWAYS regret that we made this unsigned on CUDA.

See #3484.
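The comment does not say which problems motivated the request. One commonly cited pitfall with unsigned index types, shown here purely as an illustration with hypothetical helper names, is that index arithmetic wraps around instead of going negative:

```cpp
#include <cassert>
#include <limits>

// Index of the left neighbour of element i, as a stencil kernel might compute.
template <class Index>
Index left_neighbour(Index i) {
  return i - 1;
}

// Offset needed to get from index `from` to index `to`.
template <class Index>
Index offset_between(Index from, Index to) {
  return to - from;
}

// Usage: identical expressions give different answers for signed and
// unsigned index types.
inline void compare_signed_and_unsigned() {
  assert(left_neighbour<int>(0) == -1);
  assert(left_neighbour<unsigned>(0u) ==
         std::numeric_limits<unsigned>::max());
  assert(offset_between<int>(5, 2) == -3);
  assert(offset_between<unsigned>(5u, 2u) ==
         std::numeric_limits<unsigned>::max() - 2u);
}
```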

crtrott · 2020-10-13T22:28:20Z

core/src/SYCL/Kokkos_SYCL_KernelLaunch.hpp

+  q.submit([&](cl::sycl::handler& cgh) {
+    cgh.parallel_for(range, [=](cl::sycl::item<1> item) {
+      const typename Policy::index_type id = item.get_linear_id();
+      functor(id);


So I now get that this is actually ONLY for the parallel_for dispatch. Since this is not reusable the way it is written, why not make this part of the ParallelFor class? Or do we intend to modify this later to make it reusable?

That's fine with me (and more in line with #3480). I pushed a corresponding commit. Maybe @nliber wants to comment?

I'd like to put all the launch stuff together, because if one of them needs to change, they probably all will. It's only a very weak preference though.

core/src/SYCL/Kokkos_SYCL_Instance.hpp

Request was addressed

dalg24 · 2020-10-14T16:14:04Z

Please rewrite history and we will merge. Make sure you make Nevin a co-author

An indirect kernel is one where we have a functor that is not trivially copyable and so is explicitly constructed by the host in USM shared memory before being passed "by pointer" (inside a reference_wrapper) to SYCL parallel_for. This is to address the limitation that SYCL data types can only be implicitly copied to the device if they are trivially copyable.
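The distinction between direct and indirect kernels can be sketched in plain C++. This is an illustration under assumptions, not the PR's implementation: `mock_device_launch` is a hypothetical serial stand-in for SYCL parallel_for, the `storage` pointer stands in for USM shared memory, and the dispatch on `std::is_trivially_copyable` is one possible way to choose between the two paths.

```cpp
#include <cassert>
#include <functional>
#include <new>
#include <string>
#include <type_traits>

// Serial stand-in for a device launch. Like a SYCL kernel, it takes the
// callable by value and only accepts trivially copyable types.
template <class Kernel>
void mock_device_launch(int n, Kernel kernel) {
  static_assert(std::is_trivially_copyable<Kernel>::value,
                "kernel must be trivially copyable");
  for (int i = 0; i < n; ++i) kernel(i);
}

// Direct launch: the functor itself is copied to the "device".
template <class Functor>
void launch_impl(void*, int n, const Functor& f, std::true_type) {
  mock_device_launch(n, f);
}

// Indirect launch: the host constructs a copy in `storage` and passes only a
// reference_wrapper to it, which is trivially copyable on common standard
// library implementations (guaranteed since C++17).
template <class Functor>
void launch_impl(void* storage, int n, const Functor& f, std::false_type) {
  assert(storage != nullptr);
  Functor* copy = new (storage) Functor(f);
  mock_device_launch(n, std::ref(*copy));
  copy->~Functor();  // safe here because the mock launch is synchronous
}

// Picks the path from the functor type; returns true for a direct launch.
template <class Functor>
bool launch(void* storage, int n, const Functor& f) {
  using direct = std::is_trivially_copyable<Functor>;
  launch_impl(storage, n, f, direct{});
  return direct::value;
}

// Example functors: the std::string member makes the first one
// non-trivially copyable, so it takes the indirect path.
struct LabelledFunctor {
  std::string label;
  int* out;
  void operator()(int i) const { out[i] = i + static_cast<int>(label.size()); }
};

struct PlainFunctor {
  int* out;
  void operator()(int i) const { out[i] = -i; }
};
```

In this sketch `launch` returns `false` for `LabelledFunctor` and `true` for `PlainFunctor`, while both produce their results through the same trivially-copyable-only launch function.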

masterleinad · 2020-10-14T16:22:44Z

Please rewrite history and we will merge. Make sure you make Nevin a co-author

Here you go!

masterleinad force-pushed the sycl_parallel_for branch from 700f88d to 62d808a Compare October 9, 2020 14:12

masterleinad requested a review from nliber October 9, 2020 15:31

nliber approved these changes Oct 9, 2020

SYCL Feature level 4 (parallel_for)

68bfe5b

masterleinad force-pushed the sycl_parallel_for branch from 62d808a to d7ccafb Compare October 12, 2020 20:31

masterleinad marked this pull request as ready for review October 12, 2020 20:33

masterleinad mentioned this pull request Oct 12, 2020

SYCL feature level 5 #3480

Merged

dalg24 reviewed Oct 13, 2020

core/src/SYCL/Kokkos_SYCL_Parallel_Range.hpp (outdated)

core/src/SYCL/Kokkos_SYCL_Instance.hpp (outdated)

crtrott previously requested changes Oct 13, 2020

crtrott reviewed Oct 13, 2020

dalg24 reviewed Oct 13, 2020

core/src/SYCL/Kokkos_SYCL_Instance.hpp (outdated)

masterleinad force-pushed the sycl_parallel_for branch from 4202a86 to 2f3f8e7 Compare October 14, 2020 16:22

dalg24 merged commit ed83187 into kokkos:develop Oct 14, 2020
