Improve performance for SYCL parallel_reduce #3732

Merged
merged 4 commits into kokkos:develop on Feb 4, 2021

Conversation

masterleinad (Contributor) commented on Jan 19, 2021

  • let each thread load two values (see the sketch after this list)
  • use a better memory access pattern
  • let the last block do the final reduction

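As an illustration of the first two bullets (not the actual Kokkos implementation): a minimal standalone SYCL sketch in which each work-item accumulates values_per_thread elements spaced a work-group apart, so that neighbouring work-items read neighbouring addresses, before the usual tree reduction in local memory. All names here (partial_sums, in, out, n_wgroups) are made up for the sketch; in and out are assumed to be USM pointers.

// Sketch only: plain SYCL 2020, not Kokkos code. Writes one partial sum per
// work-group into out[]; a second pass (or recursive call) would combine them,
// and the caller is expected to wait on the queue.
#include <sycl/sycl.hpp>

void partial_sums(sycl::queue& q, const float* in, float* out, size_t size) {
  constexpr size_t wgroup_size        = 128;
  constexpr size_t values_per_thread  = 2;
  const size_t elements_per_group     = wgroup_size * values_per_thread;
  const size_t n_wgroups = (size + elements_per_group - 1) / elements_per_group;

  q.submit([&](sycl::handler& cgh) {
    sycl::local_accessor<float, 1> local_mem(sycl::range<1>(wgroup_size), cgh);
    cgh.parallel_for(
        sycl::nd_range<1>(sycl::range<1>(n_wgroups * wgroup_size),
                          sycl::range<1>(wgroup_size)),
        [=](sycl::nd_item<1> item) {
          const size_t local_id  = item.get_local_id(0);
          const size_t group_id  = item.get_group(0);
          const size_t global_id = group_id * elements_per_group + local_id;

          // Each work-item loads values_per_thread elements spaced wgroup_size
          // apart, so consecutive work-items touch consecutive addresses.
          const size_t upper_bound =
              global_id + values_per_thread * wgroup_size < size
                  ? global_id + values_per_thread * wgroup_size
                  : size;
          float partial = 0.f;
          for (size_t i = global_id; i < upper_bound; i += wgroup_size)
            partial += in[i];

          // Standard tree reduction of the per-item partials in local memory.
          local_mem[local_id] = partial;
          for (size_t stride = wgroup_size / 2; stride > 0; stride /= 2) {
            sycl::group_barrier(item.get_group());
            if (local_id < stride)
              local_mem[local_id] += local_mem[local_id + stride];
          }
          if (local_id == 0) out[group_id] = local_mem[0];
        });
  });
}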
masterleinad force-pushed the sycl_reduce_performance branch 2 times, most recently from 7c46954 to d238a47, on January 20, 2021 at 20:23
masterleinad changed the title from "[WIP] Improve performance for SYCL parallel_reduce" to "Improve performance for SYCL parallel_reduce" on Jan 27, 2021
masterleinad (Contributor, Author):

Benchmarks showed better performance for the recursive implementation, so I dropped the commit that lets the last block do the final reduction. A backup of that implementation can be found at https://github.com/masterleinad/kokkos/tree/sycl_reduce_performance_backup.

masterleinad force-pushed the sycl_reduce_performance branch 2 times, most recently from 19e1c92 to 1f32a62, on January 31, 2021 at 22:17
dalg24 (Member) left a comment

It is hard to parse what is going on in this reduction algorithm :/

const typename Policy::index_type upper_bound =
    (global_id + values_per_thread * wgroup_size) < size
        ? global_id + values_per_thread * wgroup_size
        : size;
dalg24 (Member):

std::min might be more readable than this ternary op
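For illustration, the quoted expression rewritten with std::min (assuming it is usable in the device code here, which is exactly what the reply below sets out to check):

// Hypothetical rewrite of the quoted fragment; the template argument is spelled
// out so both arguments of std::min have the same type.
const typename Policy::index_type upper_bound =
    std::min<typename Policy::index_type>(
        global_id + values_per_thread * wgroup_size, size);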

masterleinad (Contributor, Author):

Yes, I wasn't quite sure if that's supported on the device. I can check.

masterleinad (Contributor, Author):

> It is hard to parse what is going on in this reduction algorithm :/

I thought I didn't change it too much, but I'm happy to add some more comments.

kokkos deleted a comment from dalg24 on Feb 3, 2021
const auto init_size = std::max<std::size_t>(
    ((size + values_per_thread - 1) / values_per_thread + wgroup_size - 1) /
        wgroup_size,
    1);
const auto value_count =
    FunctorValueTraits<ReducerTypeFwd, WorkTagFwd>::value_count(
        selected_reducer);
const auto results_ptr =
    static_cast<pointer_type>(Experimental::SYCLSharedUSMSpace().allocate(
        "SYCL parallel_reduce result storage",
        sizeof(*m_result_ptr) * std::max(value_count, 1u) * init_size));
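To make the quoted formula concrete: init_size is the number of work-groups once each thread handles values_per_thread elements, i.e. ceil(ceil(size / values_per_thread) / wgroup_size). For example, with size = 10000, values_per_thread = 2, and wgroup_size = 128 this gives ceil(5000 / 128) = 40 partial results, and the std::max keeps at least one slot for empty ranges.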
masterleinad (Contributor, Author):

Yes, sizeof(value_type) should be the same. Let me try that.

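The deleted comment itself is not visible, but going by this reply it presumably suggested allocating in terms of sizeof(value_type) rather than sizeof(*m_result_ptr). Purely as an illustration, that change to the quoted allocation would read:

// Illustrative only; per the reply above, sizeof(value_type) and
// sizeof(*m_result_ptr) should be the same here.
const auto results_ptr =
    static_cast<pointer_type>(Experimental::SYCLSharedUSMSpace().allocate(
        "SYCL parallel_reduce result storage",
        sizeof(value_type) * std::max(value_count, 1u) * init_size));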
masterleinad added the Blocks Promotion label (Overview issue for release-blocking bugs) on Feb 3, 2021
crtrott (Member) left a comment

I think the values-per-thread approach is good for dot products but bad for physics codes with reductions.

const auto init_size =
    std::max<std::size_t>((size + wgroup_size - 1) / wgroup_size, 1);
constexpr size_t wgroup_size       = 128;
constexpr size_t values_per_thread = 2;
crtrott (Member):

This values_per_thread is probably right for something like a dot product, but probably wrong for LAMMPS, where the reduction is over rather beefy kernels that often don't have enough parallelism (think 30k atoms, each looping serially over its neighbors and then doing a reduction over all of them). Not sure if you care right now.
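As an illustration using the constants quoted above: 30,000 reduction iterations with values_per_thread = 2 and wgroup_size = 128 amount to only about 15,000 active work-items (roughly 118 work-groups of 128), which may not be enough to keep a large GPU busy when each iteration is already a heavy serial loop.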

dalg24 merged commit fe74423 into kokkos:develop on Feb 4, 2021
Labels: Blocks Promotion (Overview issue for release-blocking bugs)
3 participants