SYCL RangePolicy: manually specify workgroup size through chunk size #4875
Conversation
With the changes I left in the review, the code runs correctly and now outperforms Kokkos::OpenMPTarget for my example
lgtm pending approval by @pvelesko
Retest this please.
We need to have a discussion about this. Note that what you use launch bounds for here is NOT the intended use, and is NOT the effect you get on either HIP or CUDA; i.e., it does not actually set ranges. If people now start using it for that purpose because of SYCL, we potentially create issues. I'd rather have you use chunk_size or something like that.
Sure, it's slightly abusing the concept. My thought was that it fits close enough to what I'm trying to do here since it's an optimization hint (one that is actually taken into account). Also, the values for
Actually, we do (I had to look it up): kokkos/core/src/Kokkos_ExecPolicy.hpp, lines 201 to 207 in e765d32
and, as a side comment, I don't quite understand why we don't just update *this and return a reference to it.
OK, fair enough.
```cpp
    (actual_range + wgroup_size - 1) / wgroup_size * wgroup_size;
FunctorWrapperRangePolicyParallelForCustom<Functor, Policy> f{
    policy.begin(), functor, actual_range};
sycl::nd_range<1> range(launch_range, Policy::launch_bounds::maxTperB);
```
What happens if maxTperB is not set (i.e., -1, I think)?
Fixed.
@masterleinad we need some clear documentation for this, since this in particular differs from how OpenMP works, where the chunk size is the number of consecutive iterations given to a single thread. In practice, however, the only performant choice there is almost always 1, so that usage of chunk size is fairly useless anyway, and reinterpreting it as something like the block size/work group size is probably not a bad idea. We should consider doing this for the other backends, I guess.
Title line is misleading because it only applies to parallel_for, not parallel_reduce.
In some cases, selecting a custom workgroup size gives significantly better performance than relying on the compiler to choose.