Enable vectorized global loads for the reduction algorithms #1470
Conversation
// Empirically found tuning parameters for typical devices.
constexpr _Size __max_iters_per_work_item = 32;
constexpr ::std::size_t __max_work_group_size = 256;
Probably it is better to make a request to the device, like oneapi::dpl::__internal::__max_work_group_size(...) and oneapi::dpl::__internal::__max_sub_group_size(...)?
These values are empirically found to achieve the highest throughput. The device-specific work-group limits are checked a couple of lines down.
The comments I have are mostly stylistic to better understand the flow of the code.
scalar_reduction_remainder(const _Size __start_idx, const _Size __adjusted_n, const _Size __max_iters, _Res& __res,
                           const _Acc&... __acc) const
{
    const _Size __no_iters = std::min(static_cast<_Size>(__adjusted_n - __start_idx), __max_iters);
static_cast<_Size> here and everywhere looks very suspicious and bulky... Probably we can pass a "right" integer type as _Size? And/or use auto where it is applicable and doesn't break correctness?
I agree. The _Size type is provided by SYCL and I don't think we should change it. Instead, I've used auto and added some temporaries to overcome this.
const _Size __global_idx = __item_id.get_global_id(0);
if (__iters_per_work_item == 1)
{
    new (&__res.__v) _Tp(__unary_op(__global_idx, __acc...));
Since __local_idx is not used within this scope, the definition of __local_idx (line 247) may be moved down after the if statement.
Done.
I have a few small comments and am ready to approve once these are considered.
__adjust_iters_per_work_item(_Size __iters_per_work_item) -> _Size
{
    if (__iters_per_work_item > 1)
        return ((__iters_per_work_item + _VecSize - 1) / _VecSize) * _VecSize;
I think this can be written with __dpl_ceiling_div, like:
return __dpl_ceiling_div(__iters_per_work_item, _VecSize) * _VecSize;
Done.
Apologies for this late review. Mostly minor things.
I'm still wrapping my head around things fully, so I probably shouldn't be the approver. However, hopefully some of these comments can help. Continuing to look as time permits as well.
Only a minor comment. Otherwise, I think this looks good, especially now since we have some time to react to any issues which may arise before a release.
I went through the PR with fresh eyes and couldn't find any real issues.
LGTM. Probably good to at least check for objections from others who have reviewed this PR before merging.
(and wait for green CI)
@julianmi, what do you think, should we introduce a dedicated type for the union
union __storage
{
    _Tp __v;
    __storage() {}
};
?
I've added a union type to reduce the code duplication.
LGTM
LGTM
@@ -47,11 +47,19 @@ class __reduce_mid_work_group_kernel;
template <typename... _Name>
class __reduce_kernel;

// Storage helper since _Tp may not have a default constructor.
template <typename _Tp>
union __storage
If we are going to lift this type definition out, we probably need to rename it as well. (trying to think of a good name...)
Additionally, if we are going to lift this type definition out, we may want to cover the case where we have an array of elements too.
I'm not sure about the array of elements; perhaps that reaches too far beyond the scope of this PR. But maybe something like __delayed_ctor_storage?
I think we need something which describes its purpose.
__optional_ctor_storage? __lazy_ctor_storage?
I don't know how far we want to go in the context of this PR, but this trick is also used in oneDPL/include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_radix_sort_one_wg.h (line 169 in a9aabb2):
union __storage { _ValT __v[__block_size]; __storage(){} } __values;
union __storage { _ValueT __v; __storage(){} } __in_val;
If we are lifting this, it would be great to unify all the uses to a single type. Then future improvements can be had by all, and it will improve readability.
I suppose the first is the array case Sergey was referring to. I'd be fine with leaving that one out for now to limit the scope of the PR if it makes it significantly more complicated.
I propose to make additional changes with it in some separate PR.
Sure, for this PR let's just rename it; we can unify, etc. in a separate PR.
My vote is for __lazy_ctor_storage because I think "optional" advertises more functionality than is provided here.
Thanks for this discussion. I agree that larger changes are outside the scope of this PR, and I've changed the naming to __lazy_ctor_storage.
LGTM
LGTM
Vectorization is performance critical on SIMD architectures. This patch enables vectorization by unrolling vector-size-wide loop iterations for both coalesced (commutative algorithms) and consecutive (non-commutative algorithms) loads. Coalesced loads will then load vectors of consecutive elements. This change improves the coalesced loads on Intel SIMD GPUs without decreasing the throughput on SIMT GPUs. Coalesced loads are therefore enabled on SPIR-V backends as well. min_element and max_element continue using consecutive loads on SPIR-V backends due to the performance penalty of the required index check when using coalesced global loads.

Secondly, the vectorization enables a dynamic number of elements to be processed per work-item. Launch-parameter tuning with compile-time constants is therefore no longer needed. This reduces the number of template instantiations from 13 to 3, which improves compile times significantly (e.g., half the time for sycl_iterator_reduce.pass).

Thirdly, branch divergence is minimized by adding a flag indicating whether the work-group can process full sequences of the input array. If so, branching within the inner kernel can be removed. If not, all work-items in a group follow the same boundary-checked implementation.