[SYCL][Reduction] Prefer fast group reduce over fast atomics #6890

aelovikov-intel · 2022-09-27T23:24:07Z

#6434 enabled treating "float" as suitable for Reduction::has_fast_atomics implementation but that is slower than the one available under Reduction::has_fast_reduce. Make sure to check for the latter first.

intel#6434 enabled treating "float" as suitable for Reduction::has_fast_atomics implementation but that is slower than the one available under Reduction::has_fast_reduce. Make sure to check for the latter first.

JackAKirk · 2022-09-28T11:37:57Z

Apparently the fast_reduce is faster than atomics for array size of ~4million by around 70% on the Nvidia A100.
Probably you are already aware but I suspect that fast_atomics may still faster than fast_reduce for array sizes less than ~1 million (At least on the A100, although we've not made recent checks on this). Since there is no template parameter for array size in the interface I suppose a compromise has to be made, and probably it makes sense to go with this change.

There are many applications that can make use of the sycl 2020 reduction/parallel_for interfaces where there is a lot more going on in the kernel than just the reduction part of the kernel, such that the reduction will not be the main bottle neck, and the reduction itself could be much much less than ~4 million but the card is still saturated. Perhaps in the future we could construct some benchmarks where the reduction is not the only compute in the kernel so we can properly take these important use cases into account. What do you think?

By the way it looks like the sample used for benchmarking could be made faster (probably in both cases and definitely in the atomics case: this is the case we have tested) by the use of local memory for the array that is reduced by the implementation behind the scenes per workgroup. I think I remember this made ~10% difference, so not enough to affect the benchmark conclusions.
See for example the following kernel which uses the "local" local accessor.

     queue.submit([&](handler &h) {
        accessor d_in{buf_in, h, read_only};  
        accessor d_sum{buf_sum, h, read_write};  
accessor<double, 1, access::mode::read_write, access::target::local> local(range<1>{workgroup_size}, h); 
        h.parallel_for<class kernel_reduction>(nd_range{global_range, local_range},
sycl::reduction(d_sum, sycl::plus<>()), [=] (nd_item<1> id, auto& reducer) {
           int global_id = id.get_global_id(0);
           int locid = id.get_local_id();      
           local[locid] = d_in[global_id];
           reducer += local[locid];
        });
     });

aelovikov-intel · 2022-09-28T15:33:41Z

What do you think?

Yes, having a better precommit performance testing strategy for reduction is something we need in order to make improvements in the area.

My current focus is more on the maintainability and less on performance (yet) though. What I tried to investigate recently (to no success, unfortunately) is making a clearer boundary between reduction implementation and the sycl::handler's internals so that we could make it easier to pull the reduction code out to the user code for the experiments/prototyping.

Complementary change to intel/llvm#6890.

aelovikov-intel · 2022-09-28T16:53:32Z

/verify with intel/llvm-test-suite#1294

aelovikov-intel · 2022-09-29T16:35:51Z

Failure on

SYCL :: XPTI/kernel/content.cpp

needed llvm-test-suite change and passed in "/verify with".

@intel/llvm-gatekeepers , this PR is ready.

Complementary change to intel/llvm#6890.

Complementary change to intel#6890.

…-suite#1294) Complementary change to intel#6890.

[SYCL][Reduction] Prefer fast group reduce over fast atomics for FP32

43d856b

intel#6434 enabled treating "float" as suitable for Reduction::has_fast_atomics implementation but that is slower than the one available under Reduction::has_fast_reduce. Make sure to check for the latter first.

aelovikov-intel requested a review from a team as a code owner September 27, 2022 23:24

aelovikov-intel requested review from bso-intel, steffenlarsen, v-klochkov, pgorlani and JackAKirk September 27, 2022 23:24

steffenlarsen approved these changes Sep 28, 2022

View reviewed changes

pgorlani approved these changes Sep 28, 2022

View reviewed changes

aelovikov-intel added a commit to aelovikov-intel/llvm-test-suite that referenced this pull request Sep 28, 2022

[SYCL][Reduction] Prefer fast_reduce to fast_atomics

230a2e5

Complementary change to intel/llvm#6890.

aelovikov-intel mentioned this pull request Sep 28, 2022

[SYCL][Reduction] Prefer fast_reduce to fast_atomics intel/llvm-test-suite#1294

Merged

steffenlarsen pushed a commit to intel/llvm-test-suite that referenced this pull request Sep 29, 2022

[SYCL][Reduction] Prefer fast_reduce to fast_atomics (#1294)

7015143

Complementary change to intel/llvm#6890.

steffenlarsen merged commit 97725f1 into intel:sycl Sep 29, 2022

aelovikov-intel deleted the red-float branch November 8, 2022 20:46

aelovikov-intel added a commit to aelovikov-intel/llvm that referenced this pull request Feb 23, 2023

[SYCL][Reduction] Prefer fast_reduce to fast_atomics (intel#1294)

5343e43

Complementary change to intel#6890.

aelovikov-intel added a commit to aelovikov-intel/llvm that referenced this pull request Mar 27, 2023

[SYCL][Reduction] Prefer fast_reduce to fast_atomics (intel/llvm-test…

184714a

…-suite#1294) Complementary change to intel#6890.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SYCL][Reduction] Prefer fast group reduce over fast atomics #6890

[SYCL][Reduction] Prefer fast group reduce over fast atomics #6890

aelovikov-intel commented Sep 27, 2022

JackAKirk commented Sep 28, 2022 •

edited

Loading

aelovikov-intel commented Sep 28, 2022

aelovikov-intel commented Sep 28, 2022

aelovikov-intel commented Sep 29, 2022

[SYCL][Reduction] Prefer fast group reduce over fast atomics #6890

[SYCL][Reduction] Prefer fast group reduce over fast atomics #6890

Conversation

aelovikov-intel commented Sep 27, 2022

JackAKirk commented Sep 28, 2022 • edited Loading

aelovikov-intel commented Sep 28, 2022

aelovikov-intel commented Sep 28, 2022

aelovikov-intel commented Sep 29, 2022

JackAKirk commented Sep 28, 2022 •

edited

Loading