Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL][Reduction] Prefer fast group reduce over fast atomics #6890

Merged
merged 1 commit into from
Sep 29, 2022

Conversation

aelovikov-intel
Copy link
Contributor

#6434 enabled treating "float" as suitable for Reduction::has_fast_atomics implementation but that is slower than the one available under Reduction::has_fast_reduce. Make sure to check for the latter first.

intel#6434 enabled treating "float" as
suitable for Reduction::has_fast_atomics implementation but that is
slower than the one available under Reduction::has_fast_reduce. Make
sure to check for the latter first.
@JackAKirk
Copy link
Contributor

JackAKirk commented Sep 28, 2022

Apparently the fast_reduce is faster than atomics for array size of ~4million by around 70% on the Nvidia A100.
Probably you are already aware but I suspect that fast_atomics may still faster than fast_reduce for array sizes less than ~1 million (At least on the A100, although we've not made recent checks on this). Since there is no template parameter for array size in the interface I suppose a compromise has to be made, and probably it makes sense to go with this change.

There are many applications that can make use of the sycl 2020 reduction/parallel_for interfaces where there is a lot more going on in the kernel than just the reduction part of the kernel, such that the reduction will not be the main bottle neck, and the reduction itself could be much much less than ~4 million but the card is still saturated. Perhaps in the future we could construct some benchmarks where the reduction is not the only compute in the kernel so we can properly take these important use cases into account. What do you think?

By the way it looks like the sample used for benchmarking could be made faster (probably in both cases and definitely in the atomics case: this is the case we have tested) by the use of local memory for the array that is reduced by the implementation behind the scenes per workgroup. I think I remember this made ~10% difference, so not enough to affect the benchmark conclusions.
See for example the following kernel which uses the "local" local accessor.

     queue.submit([&](handler &h) {
        accessor d_in{buf_in, h, read_only};  
        accessor d_sum{buf_sum, h, read_write};  
accessor<double, 1, access::mode::read_write, access::target::local> local(range<1>{workgroup_size}, h); 
        h.parallel_for<class kernel_reduction>(nd_range{global_range, local_range},
sycl::reduction(d_sum, sycl::plus<>()), [=] (nd_item<1> id, auto& reducer) {
           int global_id = id.get_global_id(0);
           int locid = id.get_local_id();      
           local[locid] = d_in[global_id];
           reducer += local[locid];
        });
     });

@aelovikov-intel
Copy link
Contributor Author

What do you think?

Yes, having a better precommit performance testing strategy for reduction is something we need in order to make improvements in the area.

My current focus is more on the maintainability and less on performance (yet) though. What I tried to investigate recently (to no success, unfortunately) is making a clearer boundary between reduction implementation and the sycl::handler's internals so that we could make it easier to pull the reduction code out to the user code for the experiments/prototyping.

aelovikov-intel added a commit to aelovikov-intel/llvm-test-suite that referenced this pull request Sep 28, 2022
@aelovikov-intel
Copy link
Contributor Author

/verify with intel/llvm-test-suite#1294

@aelovikov-intel
Copy link
Contributor Author

Failure on

SYCL :: XPTI/kernel/content.cpp

needed llvm-test-suite change and passed in "/verify with".

@intel/llvm-gatekeepers , this PR is ready.

steffenlarsen pushed a commit to intel/llvm-test-suite that referenced this pull request Sep 29, 2022
@steffenlarsen steffenlarsen merged commit 97725f1 into intel:sycl Sep 29, 2022
@aelovikov-intel aelovikov-intel deleted the red-float branch November 8, 2022 20:46
aelovikov-intel added a commit to aelovikov-intel/llvm that referenced this pull request Feb 23, 2023
aelovikov-intel added a commit to aelovikov-intel/llvm that referenced this pull request Mar 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants