Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SYCL Feature level 4 (parallel_for) #3474

Merged
merged 2 commits into from Oct 14, 2020

Conversation

masterleinad
Copy link
Contributor

Based on #3451. The most appropriate launch mechansim is still not quite clear so I am including the first one @nliber came up with (which has some severe restrictions on data types being trivially copyable to copy them to the device) and an indirect launch mechanism using shared memory.
I think that we can discuss further improvements to the lauch mechanism in a separate pull request.

@nliber
Copy link
Contributor

nliber commented Oct 9, 2020

@masterleinad Just to clarity: the one currently checked in here handles trivially copyable types by placement-newing them into USM shared memory. It should functionally work although isn't the mechanism we'll be using. I think our best course is to use this one for now unless problems show up, in which case update it to the latest.

@masterleinad
Copy link
Contributor Author

Rebased.

@masterleinad masterleinad marked this pull request as ready for review October 12, 2020 20:33
core/src/SYCL/Kokkos_SYCL_Parallel_Range.hpp Outdated Show resolved Hide resolved
core/src/SYCL/Kokkos_SYCL_Instance.hpp Outdated Show resolved Hide resolved
Copy link
Member

@crtrott crtrott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have a couple of questions and change requests.

core/src/SYCL/Kokkos_SYCL_Instance.hpp Outdated Show resolved Hide resolved

q.submit([&](cl::sycl::handler& cgh) {
cgh.parallel_for(range, [=](cl::sycl::item<1> item) {
int id = item.get_linear_id();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haha and here we use an int and not the size_type of the execution space. Ok what it really should be is the index type of policy type.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I don't get this. Shouldn't this call functor()? I mean the ParallelFor class only has an argument free operator and then gets the id to call on to the actual functor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have any implicit arguments like blockDim and gridDim for SYCL so we have to due to the work mapping explicitly.
Currently, the ParallelFor call operator is not used and it's not quite clear to me if it needs to exist. On the other hand, it's also not quite clear to me if we can/want to generalize KernelLaunch in a way to support parallel_for, parallel_reduce and parallel_scan.
In #3480, everything is defined inline in ParallelReduce for example.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the thing is if you want to reuse the kernel launch for TeamPolicy and MDRangePolicy you need to do something like what CUDA does, and consideirng that we have at least the direct vs indirect launch mechanism, and this is before we hit any of the optimizations we consider for CUDA like occupancy and what not I think we should go the same route here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nliber Any thoughts about this? I want to avoid that we step on each other's toes with respect to anything related to the launch mechansim.

Copy link
Contributor

@nliber nliber Oct 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how general it can be. There are different SYCL calls for parallel_for and parallel_reduce (although SYCL parallel_reduce was not implemented at the time I initially wrote this). I want them to follow the same pattern but I want to use the correct SYCL call when it is there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that they should look the same. I guess sycl_indirect_lauch could possibly be shared but I think we can refactor when we at least also have parallel_reduce.

// Placement new a copy of functor into USM shared memory
//
// Store it in a unique_ptr to call its destructor on scope exit
std::unique_ptr<Functor, Kokkos::Impl::destruct_delete> kernelFunctorPtr(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a fence before this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment, we fence in sycl_direct_launch and call the destructor at the end of this scope. We should reconsider this when refactoring the launch mechanism.

typedef ExecPolicy Policy;

private:
typedef typename Policy::member_type Member;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about our rule to use using everywhere??

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

obviously I don't mind typedefs ;-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll run clang-tidy over it.

public:
typedef FunctorType functor_type;

inline void operator()(cl::sycl::item<1> item) const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does this work? The direct launch passes in an int?? Isn't that weird? Should direct_launch just pass on the sycl item?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unused. Let me remove it for now to avoid confusion.

typedef FunctorType functor_type;

inline void operator()(cl::sycl::item<1> item) const {
int id = item.get_linear_id();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to use the member/index_type from the policy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll just change to auto.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no the expectation is that the user functor gets the member type from the policy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sycl::handler::parallel_forcalls the function it is passed with a sycl::item. We can't change what item.get_linear_id() returns (although it should be a size_t). Should I just cast it before to the member type from the policy before passing it in?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@masterleinad
Copy link
Contributor Author

can we please change this to a signed type. I ALWAYS regret that we made this unsigned on CUDA.

See #3484.

q.submit([&](cl::sycl::handler& cgh) {
cgh.parallel_for(range, [=](cl::sycl::item<1> item) {
const typename Policy::index_type id = item.get_linear_id();
functor(id);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I now get that this is actually ONLY for the parallel for dispatch. Since this is not reusable the way it is written why not make this part of the ParallelFor class? Or do we intend to modify this later to make it reusable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine with me (and more in line with #3480). I pushed a corresponding commit. Maybe @nliber wants to comment?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to put all the launch stuff together, because if one of them needs to change, they probably all will. It's only a very weak preference though.

@dalg24 dalg24 dismissed crtrott’s stale review October 14, 2020 16:09

Request was addressed

@dalg24
Copy link
Member

dalg24 commented Oct 14, 2020

Please rewrite history and we will merge. Make sure you make Nevin a co-author

An indirect kernel is one where we have a functor
that is not trivially copyable and so is explicitly
constructed by the host in USM shared memory before being passed
"by pointer" (inside a reference_wrapper) to SYCL parallel_for.

This is to address the limitation that SYCL
data types can only be implicitly copied to the device if they
are trivially copyable.
@masterleinad
Copy link
Contributor Author

Please rewrite history and we will merge. Make sure you make Nevin a co-author

Here you go!

@dalg24 dalg24 merged commit ed83187 into kokkos:develop Oct 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants