Enable vectorized global loads for the reduction algorithms #1470
Conversation
// Empirically found tuning parameters for typical devices.
constexpr _Size __max_iters_per_work_item = 32;
constexpr ::std::size_t __max_work_group_size = 256;
Probably it is better to make a request to the device, like oneapi::dpl::__internal::__max_work_group_size(...) and oneapi::dpl::__internal::__max_sub_group_size(...)?
These values are empirically found to achieve the highest throughput. The device-specific work-group limits are checked a couple of lines down.
The comments I have are mostly stylistic to better understand the flow of the code.
scalar_reduction_remainder(const _Size __start_idx, const _Size __adjusted_n, const _Size __max_iters, _Res& __res,
                           const _Acc&... __acc) const
{
    const _Size __no_iters = std::min(static_cast<_Size>(__adjusted_n - __start_idx), __max_iters);
static_cast<_Size> here and everywhere looks very suspicious and bulky... Probably we can pass a "right" integer type as _Size? And/or use auto where it is applicable and doesn't break correctness?
I agree. The _Size type is provided by SYCL and I don't think we should change it. Instead, I've used auto and added some temporaries to overcome this.
const _Size __global_idx = __item_id.get_global_id(0);
if (__iters_per_work_item == 1)
{
    new (&__res.__v) _Tp(__unary_op(__global_idx, __acc...));
Since __local_idx is not used within this scope, the definition of __local_idx (line 247) may be moved down after the if statement.
Done.
I have a few small comments and am ready to approve once these are considered.
__adjust_iters_per_work_item(_Size __iters_per_work_item) -> _Size
{
    if (__iters_per_work_item > 1)
        return ((__iters_per_work_item + _VecSize - 1) / _VecSize) * _VecSize;
I think this can be written with __dpl_ceiling_div, like:
return __dpl_ceiling_div(__iters_per_work_item, _VecSize) * _VecSize;
Done.
Apologies for this late review. Mostly minor things.
I'm still wrapping my head around things fully, so I probably shouldn't be the approver. However, hopefully some of these comments can help. Continuing to look as time permits as well.
Only a minor comment. Otherwise, I think this looks good, especially now since we have some time to react to any issues which may arise before a release.
I went through the PR with fresh eyes and couldn't find any real issues.
LGTM. Probably good to at least check for objections from others who have reviewed this PR before merging.
(and wait for green CI)
@julianmi, what do you think, should we introduce a dedicated type for the union
union __storage
{
    _Tp __v;
    __storage() {}
};
?
I've added a union type to reduce the code duplication.
LGTM
LGTM
@@ -47,11 +47,19 @@ class __reduce_mid_work_group_kernel;
template <typename... _Name>
class __reduce_kernel;

// Storage helper since _Tp may not have a default constructor.
template <typename _Tp>
union __storage
If we are going to lift this type definition out, we probably need to rename it as well. (trying to think of a good name...)
Additionally, if we are going to lift this type definition out, we may want to cover the case where we have an array of elements too.
I'm not sure about the array of elements; perhaps that reaches too far beyond the scope of this PR. But maybe something like __delayed_ctor_storage?
I think we need something which describes its purpose.
__optional_ctor_storage? __lazy_ctor_storage?
I don't know how far we want to go in the context of this PR, but this trick is also used in oneDPL/include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_radix_sort_one_wg.h (line 169 in a9aabb2):
union __storage { _ValT __v[__block_size]; __storage(){} } __values;
union __storage { _ValueT __v; __storage(){} } __in_val;
If we are lifting this, it would be great to unify all the uses to a single type. Then future improvements can be had by all, and it will improve readability.
I suppose the first is the array case Sergey was referring to. I'd be fine with leaving that one out for now to limit the scope of the PR if it makes it significantly more complicated.
I propose to make additional changes with it in some separate PR.
Sure, for this PR let's just rename it; we can unify, etc. in a separate PR.
My vote is for __lazy_ctor_storage because I think "optional" advertises more functionality than is provided here.
Thanks for this discussion. I agree that larger changes are outside the scope of this PR, and I've changed the naming to __lazy_ctor_storage.
LGTM
LGTM
Vectorization is performance critical on SIMD architectures. This patch enables vectorization by unrolling vector-size-wide loop iterations for both coalesced (commutative algorithms) and consecutive (non-commutative algorithms) loads. Coalesced loads will then load vectors of consecutive elements. This change improves the coalesced loads on Intel SIMD GPUs without decreasing the throughput on SIMT GPUs. Coalesced loads are therefore enabled on SPIR-V backends as well. min_element and max_element continue using consecutive loads on SPIR-V backends due to the performance penalty of the required index check when using coalesced global loads.

Secondly, the vectorization enables a dynamic number of elements to be processed per work-item. Launch-parameter tuning with compile-time constants is therefore no longer needed. This reduces the number of template instantiations from 13 to 3, which improves compile times significantly (e.g., half the time for sycl_iterator_reduce.pass).

Thirdly, branch divergence is minimized by adding a flag indicating whether the work-group can process full sequences of the input array. If so, branching within the inner kernel can be removed. If not, all work-items in a group follow the same boundary-checked implementation.