New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SYCL Feature level 4 (parallel_for) #3474
Conversation
700f88d
to
62d808a
Compare
@masterleinad Just to clarity: the one currently checked in here handles trivially copyable types by placement-newing them into USM shared memory. It should functionally work although isn't the mechanism we'll be using. I think our best course is to use this one for now unless problems show up, in which case update it to the latest. |
62d808a
to
d7ccafb
Compare
Rebased. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have a couple of questions and change requests.
|
||
q.submit([&](cl::sycl::handler& cgh) { | ||
cgh.parallel_for(range, [=](cl::sycl::item<1> item) { | ||
int id = item.get_linear_id(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
haha and here we use an int and not the size_type of the execution space. Ok what it really should be is the index type of policy type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I don't get this. Shouldn't this call functor()? I mean the ParallelFor class only has an argument free operator and then gets the id to call on to the actual functor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have any implicit arguments like blockDim
and gridDim
for SYCL
so we have to due to the work mapping explicitly.
Currently, the ParallelFor
call operator is not used and it's not quite clear to me if it needs to exist. On the other hand, it's also not quite clear to me if we can/want to generalize KernelLaunch
in a way to support parallel_for
, parallel_reduce
and parallel_scan
.
In #3480, everything is defined inline in ParallelReduce
for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the thing is if you want to reuse the kernel launch for TeamPolicy and MDRangePolicy you need to do something like what CUDA does, and consideirng that we have at least the direct vs indirect launch mechanism, and this is before we hit any of the optimizations we consider for CUDA like occupancy and what not I think we should go the same route here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nliber Any thoughts about this? I want to avoid that we step on each other's toes with respect to anything related to the launch mechansim.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how general it can be. There are different SYCL calls for parallel_for and parallel_reduce (although SYCL parallel_reduce was not implemented at the time I initially wrote this). I want them to follow the same pattern but I want to use the correct SYCL call when it is there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that they should look the same. I guess sycl_indirect_lauch
could possibly be shared but I think we can refactor when we at least also have parallel_reduce
.
// Placement new a copy of functor into USM shared memory | ||
// | ||
// Store it in a unique_ptr to call its destructor on scope exit | ||
std::unique_ptr<Functor, Kokkos::Impl::destruct_delete> kernelFunctorPtr( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a fence before this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the moment, we fence in sycl_direct_launch
and call the destructor at the end of this scope. We should reconsider this when refactoring the launch mechanism.
typedef ExecPolicy Policy; | ||
|
||
private: | ||
typedef typename Policy::member_type Member; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about our rule to use using everywhere??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
obviously I don't mind typedefs ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll run clang-tidy
over it.
public: | ||
typedef FunctorType functor_type; | ||
|
||
inline void operator()(cl::sycl::item<1> item) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how does this work? The direct launch passes in an int?? Isn't that weird? Should direct_launch just pass on the sycl item?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unused. Let me remove it for now to avoid confusion.
typedef FunctorType functor_type; | ||
|
||
inline void operator()(cl::sycl::item<1> item) const { | ||
int id = item.get_linear_id(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs to use the member/index_type from the policy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll just change to auto
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no the expectation is that the user functor gets the member type from the policy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sycl::handler::parallel_for
calls the function it is passed with a sycl::item
. We can't change what item.get_linear_id()
returns (although it should be a size_t
). Should I just cast it before to the member type from the policy before passing it in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This discussion is outdated. We are doing this already, see https://github.com/kokkos/kokkos/pull/3474/files#diff-3b75050857c4c9a6392a5b0dd07241be60a4fed7c8c9fa56ce2242a4324e551fR80. 🙂
See #3484. |
q.submit([&](cl::sycl::handler& cgh) { | ||
cgh.parallel_for(range, [=](cl::sycl::item<1> item) { | ||
const typename Policy::index_type id = item.get_linear_id(); | ||
functor(id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so I now get that this is actually ONLY for the parallel for dispatch. Since this is not reusable the way it is written why not make this part of the ParallelFor class? Or do we intend to modify this later to make it reusable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to put all the launch stuff together, because if one of them needs to change, they probably all will. It's only a very weak preference though.
Please rewrite history and we will merge. Make sure you make Nevin a co-author |
An indirect kernel is one where we have a functor that is not trivially copyable and so is explicitly constructed by the host in USM shared memory before being passed "by pointer" (inside a reference_wrapper) to SYCL parallel_for. This is to address the limitation that SYCL data types can only be implicitly copied to the device if they are trivially copyable.
4202a86
to
2f3f8e7
Compare
Here you go! |
Based on #3451. The most appropriate launch mechansim is still not quite clear so I am including the first one @nliber came up with (which has some severe restrictions on data types being trivially copyable to copy them to the device) and an indirect launch mechanism using shared memory.
I think that we can discuss further improvements to the lauch mechanism in a separate pull request.