Allow non-power-of-two team sizes for team reductions and scans #4809

masterleinad · 2022-02-21T19:41:41Z

Fixes #4146. The idea is to shift all indices in a warp so that the last indices are used, and to ignore all contributions from indices that have not been mapped. The inter-warp algorithm does the same: it shifts all individual contributions to the end of the power-of-two range covered and again ignores contributions from unmapped indices.
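To make the shifting idea concrete, here is a minimal host-side sketch in plain C++. It is not the Kokkos CUDA/HIP implementation: the names `round_up_to_power_of_two` and `shifted_tree_sum` are hypothetical, the "threads" are just vector entries, and the join is hard-coded to addition. It only illustrates how a tree reduction over a power-of-two range can serve a non-power-of-two number of participants when they are shifted to the end of the range and unmapped slots are skipped.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Smallest power of two that is not less than n (requires n >= 1).
// Hypothetical helper for this sketch only.
inline std::size_t round_up_to_power_of_two(std::size_t n) {
  assert(n >= 1);
  std::size_t p = 1;
  while (p < n) p <<= 1;
  return p;
}

// Host-side simulation of a tree reduction over values.size() "threads".
// Thread i is treated as if it occupied slot i + shift of a power-of-two
// range, so the last thread always lands on the last slot. Slots below
// `shift` are unmapped and never contribute.
inline long shifted_tree_sum(std::vector<long> values) {
  assert(!values.empty());
  const std::size_t n      = values.size();
  const std::size_t padded = round_up_to_power_of_two(n);
  const std::size_t shift  = padded - n;  // number of unmapped leading slots

  for (std::size_t stride = 1; stride < padded; stride <<= 1) {
    // Slots whose (slot + 1) is a multiple of 2 * stride receive a join.
    for (std::size_t slot = 2 * stride - 1; slot < padded;
         slot += 2 * stride) {
      const std::size_t source = slot - stride;
      // Skip the join when the partner slot is unmapped. Unmapped slots form
      // a prefix, so a mapped source implies a mapped destination.
      if (source < shift) continue;
      values[slot - shift] += values[source - shift];
    }
  }
  // The last slot of the padded range belongs to the last real thread.
  return values[n - 1];
}
```

For example, `shifted_tree_sum({1, 2, 3})` pads to four slots, leaves slot 0 unmapped, and accumulates the total in the entry of the last thread.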

dalg24 · 2022-05-31T21:50:24Z

core/src/Cuda/Kokkos_Cuda_ReduceScan.hpp

+  const unsigned not_less_power_of_two =
+      (1 << (Impl::int_log2(blockDim.y - 1) + 1));


How did you come up with that formula?

for (int i = 0; i < 10; ++i) std::cout << i << " " << (2 << (Kokkos::Impl::int_log2(i - 1) + 1)) << '\n';

yields

0 2
1 2
2 4
3 8
4 8
5 16
6 16
7 16
8 16
9 32

which I assume is not what you wanted

Should I resurrect #4577?
Isn't bit_ceil what you want? https://godbolt.org/z/TdPcWeGqY

https://godbolt.org/z/1396qsaME works fine for me. Note that I'm doing

(1 << (Kokkos::Impl::int_log2(i - 1) + 1))

and not

(2 << (Kokkos::Impl::int_log2(i - 1) + 1))

I'm not quite sure if the single use here justifies introducing another helper function but I wouldn't be surprised to find other places.

Duh. You're right. Sorry about the noise.

I had been looking into these because there were a bunch of use cases throughout the codebase. The main issue with the PR was that the intrinsics are not usable in a constexpr context. If I recall correctly, we were debating whether to also have a "fast" version that is not constexpr...
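The exchange above turns on the difference between `1 <<` and `2 <<` in the rounding formula. The following standalone sketch reproduces both variants without Kokkos. `floor_log2` is a hypothetical stand-in for `Kokkos::Impl::int_log2`; it assumes that function returns the floor of the base-2 logarithm and -1 for an input of 0, which is consistent with the output quoted in the thread but is not taken from the Kokkos sources.

```cpp
#include <cassert>

// Hypothetical stand-in for Kokkos::Impl::int_log2 (assumed behavior):
// floor(log2(i)) for i > 0, and -1 for i == 0.
inline int floor_log2(unsigned i) {
  int result = -1;
  while (i != 0) {
    i >>= 1;
    ++result;
  }
  return result;
}

// Formula from the diff: smallest power of two that is not less than n.
// Only meaningful for 1 <= n <= 2^31; n == 0 would shift by the full
// width of unsigned, which is undefined behavior.
inline unsigned not_less_power_of_two(unsigned n) {
  assert(n >= 1);
  return 1u << (floor_log2(n - 1) + 1);
}

// The variant with `2 <<` from the review comment: it yields twice the
// value of the formula above for every valid n (for n <= 2^30).
inline unsigned doubled_variant(unsigned n) {
  assert(n >= 1);
  return 2u << (floor_log2(n - 1) + 1);
}
```

With these definitions, `not_less_power_of_two(9)` is 16 while `doubled_variant(9)` is 32, matching the last line of the output quoted in the review comment.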

dalg24 · 2022-06-02T18:01:38Z

core/unit_test/TestTeamScan.hpp

+    // Set team size explicitly to check whether non-power-of-two team sizes can
+    // be used.
+    if (ExecutionSpace().concurrency() > 10000)
+      Kokkos::parallel_for(policy_type(M, 127), *this);
+    else if (ExecutionSpace().concurrency() > 2)
+      Kokkos::parallel_for(policy_type(M, 3), *this);
+    else
+      Kokkos::parallel_for(policy_type(M, 1), *this);


Should we keep the two code paths now that non-power-of-two teams are supported?

I wanted to make sure that we indeed test with a team size that is not a power of two. A team size of 3 seems unrealistic for the GPU backends, but higher values are not really feasible for the host parallel backends. That being said, I'm happy to adapt if you have a good suggestion.

No, that's OK. I haven't looked in detail at what this unit test does. BTW, what about reductions?

BTW what about reductions?

Ping

I'll find another test that wouldn't have worked before.

After inspecting the CUDA backend implementation thoroughly, I found that we can never hit this code path with a non-power-of-two block size if we are not also performing scans. For RangePolicy and MDRangePolicy, we enforce a power-of-two block size internally by adapting it if necessary. For TeamPolicy, we would call it with the shmem path, which is never used.

masterleinad force-pushed the fix_reduce_power_two branch from d080dcd to 1367342 Compare February 21, 2022 23:26

masterleinad marked this pull request as ready for review February 22, 2022 20:40

masterleinad requested a review from Rombur February 22, 2022 20:40

masterleinad force-pushed the fix_reduce_power_two branch from c75e360 to 87fd95c Compare February 23, 2022 13:49

PhilMiller reviewed Mar 2, 2022

masterleinad and others added 6 commits March 11, 2022 11:57

Allow non-power-of-two team sizes for team reductions and scans

b7b7a7d

Update test

dbe76be

Fix shadowing warning

c0d54e9

Update HIP

be91614

Make the HIP implementation more similar to the previous one relying …

579f275

…on threads in warps executing in lock step

Fix typo in TestTeamScan.hpp

d633e3f

Co-authored-by: Phil Miller <unmobile+gh@gmail.com>

masterleinad force-pushed the fix_reduce_power_two branch from 5f9a7be to d633e3f Compare March 11, 2022 17:01

Replace ValueJoin

1d4a0af

masterleinad force-pushed the fix_reduce_power_two branch from 20a02d1 to 1d4a0af Compare March 11, 2022 20:03

masterleinad requested a review from PhilMiller April 6, 2022 21:03

nliber approved these changes May 31, 2022

dalg24 reviewed May 31, 2022

Rombur approved these changes Jun 1, 2022

dalg24 reviewed Jun 2, 2022

masterleinad added this to the Tentative 3.7 Release milestone Jun 6, 2022

dalg24 merged commit 52ead94 into kokkos:develop Jun 21, 2022
