[SYCL][Reduction] Prefer fast group reduce over fast atomics #6890
intel#6434 enabled treating "float" as suitable for the Reduction::has_fast_atomics implementation, but that implementation is slower than the one available under Reduction::has_fast_reduce. Make sure to check for the latter first.
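The ordering described above can be sketched in plain C++ as a compile-time dispatch. This is only an illustration, not the code changed by this PR: `ReductionTraits`, `Strategy`, `select_strategy`, `to_string` and the three aliases are hypothetical names, and only the two flag names `has_fast_reduce` and `has_fast_atomics` come from the description.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical stand-in for a reduction type. Only the two flag names are
// taken from the PR description; the rest is assumed for illustration.
template <bool FastReduce, bool FastAtomics>
struct ReductionTraits {
  static constexpr bool has_fast_reduce = FastReduce;
  static constexpr bool has_fast_atomics = FastAtomics;
};

// Hypothetical labels for the implementation that gets picked.
enum class Strategy { GroupReduce, Atomics, Generic };

// Test has_fast_reduce before has_fast_atomics, so a type that qualifies
// for both (as the description says "float" does) takes the group-reduce path.
template <typename Reduction>
constexpr Strategy select_strategy() {
  if constexpr (Reduction::has_fast_reduce)
    return Strategy::GroupReduce;
  else if constexpr (Reduction::has_fast_atomics)
    return Strategy::Atomics;
  else
    return Strategy::Generic;
}

constexpr std::string_view to_string(Strategy s) {
  switch (s) {
    case Strategy::GroupReduce: return "group-reduce";
    case Strategy::Atomics:     return "atomics";
    case Strategy::Generic:     return "generic";
  }
  return "unknown";
}

// Hypothetical example types covering the three cases.
using BothFast    = ReductionTraits<true, true>;
using AtomicsOnly = ReductionTraits<false, true>;
using Neither     = ReductionTraits<false, false>;

static_assert(select_strategy<BothFast>() == Strategy::GroupReduce,
              "group reduce must win when both flags are set");
```

With the two checks swapped, `BothFast` would resolve to `Strategy::Atomics`, which is the behaviour the description wants to avoid.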
Apparently fast_reduce is faster than atomics by around 70% on the Nvidia A100 for an array size of ~4 million.

Many applications can use the SYCL 2020 reduction/parallel_for interfaces in kernels that do much more than the reduction itself, so the reduction is not the main bottleneck, and the reduction size can be far smaller than ~4 million while the card is still saturated. Perhaps in the future we could construct some benchmarks where the reduction is not the only compute in the kernel, so that we can properly take these important use cases into account. What do you think?

By the way, it looks like the sample used for benchmarking could be made faster by using local memory for the per-workgroup array that the implementation reduces behind the scenes. This probably holds in both cases and definitely holds in the atomics case, which is the one we have tested. I think I remember this made a ~10% difference, so not enough to affect the benchmark conclusions.
Yes, a better pre-commit performance testing strategy for reductions is something we need in order to make improvements in this area. My current focus is more on maintainability than on performance (for now), though. What I tried to investigate recently (without success, unfortunately) is making a clearer boundary between reduction implementation and the
Complementary change to intel/llvm#6890.
/verify with intel/llvm-test-suite#1294
The failure on SYCL :: XPTI/kernel/content.cpp needed the llvm-test-suite change and passed with "/verify with". @intel/llvm-gatekeepers, this PR is ready.