-
Notifications
You must be signed in to change notification settings - Fork 407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow non-power-of-two team sizes for team reductions and scans #4809
Conversation
d080dcd
to
1367342
Compare
c75e360
to
87fd95c
Compare
…on threads in warps executing in lock step
Co-authored-by: Phil Miller <unmobile+gh@gmail.com>
5f9a7be
to
d633e3f
Compare
20a02d1
to
1d4a0af
Compare
const unsigned not_less_power_of_two = | ||
(1 << (Impl::int_log2(blockDim.y - 1) + 1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How did you come up with that formula?
for (int i = 0; i < 10; ++i)
std::cout << i << " " << (2 << (Kokkos::Impl::int_log2(i - 1) + 1)) << '\n';
yields
0 2
1 2
2 4
3 8
4 8
5 16
6 16
7 16
8 16
9 32
which I assume is not what you wanted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I resurrect #4577 ?
Isn't bit_ceil
what you want? https://godbolt.org/z/TdPcWeGqY
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://godbolt.org/z/1396qsaME works fine for me. Note that I'm doing
(1 << (Kokkos::Impl::int_log2(i - 1) + 1))
and not
(2 << (Kokkos::Impl::int_log2(i - 1) + 1))
I'm not quite sure if the single use here justifies introducing another helper function but I wouldn't be surprised to find other places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duh. You're right. Sorry about the noise.
I had been looking into these because they were a bunch of use cases throughout the codebase. The main issue with the PR was that the intrinsics are not usable in a constexpr context. If I recall correctly we were debating whether to also have a "fast" version that is not constexpr...
// Set team size explicitly to check whether non-power-of-two team sizes can | ||
// be used. | ||
if (ExecutionSpace().concurrency() > 10000) | ||
Kokkos::parallel_for(policy_type(M, 127), *this); | ||
else if (ExecutionSpace().concurrency() > 2) | ||
Kokkos::parallel_for(policy_type(M, 3), *this); | ||
else | ||
Kokkos::parallel_for(policy_type(M, 1), *this); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we keep the two code paths now that non-power-of-two teams are supported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to make sure that we indeed test with a team size that is not a power of two and 3 seems to be unrealistic for the GPU backends but higher values are not really feasible for the host parallel backends. That being said, I'm happy to adapt if you have a good suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No that's ok. I haven't looked in details what this unit test does. BTW what about reductions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW what about reductions?
Ping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll find another test that wouldn't have worked before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After inspecting CUDA backend implementation thoroughly, I found that we can never hit this code path with a non-power-of-two block size if were are not also performing scans. For RangePolicy
and MDRangePolicy
we enforce using a power-of-two block size internally by adapting it if necessary. For TeamPolicy
, we would call it with the shmem path which is never used.
Fixes #4146. Basically, the idea is to shift all indices in a warp so that the last indices are used and ignore all contributions from indices that have not been mapped. Then, for the inter-warp algorithm, again shift all individual contributions to the end of the power-of-two range covered and again ignore contributions from unmapped indices.