SYCL RangePolicy: manually specify workgroup size through chunk size #4875

Merged: 1 commit into kokkos:develop on Sep 18, 2022

Conversation

@masterleinad (Contributor)

In some cases, selecting a custom workgroup size gives significantly better performance than relying on the compiler to choose.

@pvelesko left a comment

With the changes I left in the review, the code runs correctly and now outperforms Kokkos::OpenMPTarget for my example.

@masterleinad masterleinad marked this pull request as ready for review March 16, 2022 12:52
@masterleinad (Contributor, Author)

CUDA-10.1-Clang-Tidy timing out is clearly unrelated.

@nmm0 (Contributor) left a comment

lgtm pending approval by @pvelesko

Two resolved review comments on core/src/SYCL/Kokkos_SYCL_Parallel_Range.hpp (outdated).
@masterleinad (Contributor, Author)

Retest this please.

@crtrott (Member) left a comment

We need to have a discussion about this. Note that what you use launch bounds for here is NOT the intended use, and is NOT the effect you get on either HIP or CUDA; i.e., it does not actually set ranges. If people now start using it for that purpose because of SYCL, we potentially create issues. I'd rather have you use chunk_size or something like that.

@masterleinad (Contributor, Author)

> We need to have a discussion about this. [...] I'd rather have you use chunk_size or something like that.

Sure, it's slightly abusing the concept. My thought was that it fits close enough to what I'm trying to do here, since it's an optimization hint (one that is actually taken into account). Also, the values for LaunchBounds are normally backend-specific, and I don't see any other (upcoming) use for LaunchBounds in SYCL. Anyway, I'm happy to discuss alternatives. Note that this is for RangePolicy, where we don't have chunk_size.
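(For context: launch bounds are normally attached to a policy as an occupancy hint. A minimal sketch, with illustrative values that are not from this PR:)

    #include <Kokkos_Core.hpp>

    // LaunchBounds<maxThreadsPerBlock, minBlocksPerMultiprocessor> is an
    // occupancy hint for CUDA/HIP kernel compilation; the original version
    // of this PR reused maxTperB as the SYCL workgroup size, which is the
    // repurposing objected to above.
    using BoundedPolicy = Kokkos::RangePolicy<Kokkos::LaunchBounds<256, 1>>;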

@dalg24 (Member) commented Mar 28, 2022

Actually, we do (I had to look it up):

    /** \brief set chunk_size to a discrete value */
    inline RangePolicy set_chunk_size(int chunk_size_) const {
      RangePolicy p        = *this;
      p.m_granularity      = chunk_size_;
      p.m_granularity_mask = p.m_granularity - 1;
      return p;
    }

and, as a side comment, I don't quite understand why we don't just update *this and return a reference to it.
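(For illustration, a minimal usage sketch of set_chunk_size on a RangePolicy; the label, kernel body, and sizes are hypothetical. After this PR, the SYCL backend uses the requested chunk size as the workgroup size for parallel_for:)

    #include <Kokkos_Core.hpp>

    int main(int argc, char* argv[]) {
      Kokkos::initialize(argc, argv);
      {
        // Request a granularity of 128; on the SYCL backend (after this PR)
        // the parallel_for launches with workgroups of that size.
        auto policy = Kokkos::RangePolicy<>(0, 1 << 20).set_chunk_size(128);
        Kokkos::parallel_for(
            "example", policy, KOKKOS_LAMBDA(const int i) {
              (void)i;  // per-iteration work would go here
            });
        Kokkos::fence();
      }
      Kokkos::finalize();
    }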

@masterleinad (Contributor, Author)

OK, fair enough.

@masterleinad changed the title from "SYCL RangePolicy: manually specify workgroup size through LaunchBounds" to "SYCL RangePolicy: manually specify workgroup size through chunk size" on Apr 1, 2022
@masterleinad masterleinad added this to the Tentative 3.7 Release milestone Jun 6, 2022
Diff excerpt (core/src/SYCL/Kokkos_SYCL_Parallel_Range.hpp):

            (actual_range + wgroup_size - 1) / wgroup_size * wgroup_size;
        FunctorWrapperRangePolicyParallelForCustom<Functor, Policy> f{
            policy.begin(), functor, actual_range};
        sycl::nd_range<1> range(launch_range, Policy::launch_bounds::maxTperB);
A Member left a comment:

What happens if maxTperB is not set (i.e. -1, I think)?

@masterleinad (Contributor, Author)

Fixed.
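(To illustrate the pattern in the hunk above: the global range is rounded up to the next multiple of the workgroup size, so the kernel has to mask out the padding work-items. A standalone SYCL sketch under assumed names and sizes, not the PR's actual code:)

    #include <sycl/sycl.hpp>

    int main() {
      constexpr int actual_range = 1000;  // iteration count (assumed)
      constexpr int wgroup_size  = 128;   // requested workgroup size (assumed)
      // Round up as in the hunk above: 1000 -> 1024, i.e. 8 workgroups of 128.
      constexpr int launch_range =
          (actual_range + wgroup_size - 1) / wgroup_size * wgroup_size;

      sycl::queue q;
      int* data = sycl::malloc_shared<int>(actual_range, q);
      q.parallel_for(sycl::nd_range<1>(launch_range, wgroup_size),
                     [=](sycl::nd_item<1> item) {
                       const int i = item.get_global_id(0);
                       // Mask out the padding work-items past the real range.
                       if (i < actual_range) data[i] = i;
                     })
          .wait();
      sycl::free(data, q);
      return 0;
    }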

@crtrott (Member) commented Sep 18, 2022

@masterleinad we need some clear documentation for this, since this in particular is different from how OpenMP works, where chunk size is the number of consecutive iterations given to a single thread. In practice, though, the only performant choice is almost always 1, so that usage of chunk size is kinda useless anyway, and reinterpreting it as something like the block size/work-group size is probably not a bad idea. We should consider doing this for the other backends, I guess.
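(To make the contrast concrete, a small illustrative OpenMP program, not Kokkos code, showing what chunk size means under static scheduling:)

    #include <cstdio>
    #include <omp.h>

    int main() {
      constexpr int n = 16, chunk = 4;
      // OpenMP meaning: each thread receives `chunk` consecutive iterations
      // at a time (static,4 gives thread 0 iterations 0..3, thread 1
      // iterations 4..7, and so on). The SYCL backend instead reinterprets
      // the same number as the workgroup size of the launched kernel.
    #pragma omp parallel for schedule(static, chunk)
      for (int i = 0; i < n; ++i)
        std::printf("iteration %2d -> thread %d\n", i, omp_get_thread_num());
      return 0;
    }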

@crtrott crtrott merged commit 673a0ef into kokkos:develop Sep 18, 2022
@dalg24 (Member) commented Sep 19, 2022

The title line is misleading because the change only applies to parallel_for, not parallel_reduce.
Please comment on whether this can/should be implemented for the other parallel constructs with a RangePolicy.

@masterleinad (Contributor, Author)

> Please comment on whether this can/should be implemented for the other parallel constructs with a RangePolicy.

parallel_for was just the most relevant case, but when I opened the pull request I intended to add versions for the other parallel constructs after merging this one. Since we want to discuss what to do with the different backends anyway, though, I would hold off on that for now.
