Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow non-power-of-two team sizes for team reductions and scans #4809

Merged
merged 7 commits into from
Jun 21, 2022

Conversation

masterleinad
Copy link
Contributor

Fixes #4146. Basically, the idea is to shift all indices in a warp so that the last indices are used and ignore all contributions from indices that have not been mapped. Then, for the inter-warp algorithm, again shift all individual contributions to the end of the power-of-two range covered and again ignore contributions from unmapped indices.

Comment on lines +458 to +459
const unsigned not_less_power_of_two =
(1 << (Impl::int_log2(blockDim.y - 1) + 1));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you come up with that formula?

for (int i = 0; i < 10; ++i)
  std::cout << i << "  " << (2 << (Kokkos::Impl::int_log2(i - 1) + 1)) << '\n';

yields

0  2
1  2
2  4
3  8
4  8
5  16
6  16
7  16
8  16
9  32

which I assume is not what you wanted

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I resurrect #4577 ?
Isn't bit_ceil what you want? https://godbolt.org/z/TdPcWeGqY

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://godbolt.org/z/1396qsaME works fine for me. Note that I'm doing

(1 << (Kokkos::Impl::int_log2(i - 1) + 1))

and not

(2 << (Kokkos::Impl::int_log2(i - 1) + 1))

I'm not quite sure if the single use here justifies introducing another helper function but I wouldn't be surprised to find other places.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duh. You're right. Sorry about the noise.

I had been looking into these because they were a bunch of use cases throughout the codebase. The main issue with the PR was that the intrinsics are not usable in a constexpr context. If I recall correctly we were debating whether to also have a "fast" version that is not constexpr...

Comment on lines +96 to +103
// Set team size explicitly to check whether non-power-of-two team sizes can
// be used.
if (ExecutionSpace().concurrency() > 10000)
Kokkos::parallel_for(policy_type(M, 127), *this);
else if (ExecutionSpace().concurrency() > 2)
Kokkos::parallel_for(policy_type(M, 3), *this);
else
Kokkos::parallel_for(policy_type(M, 1), *this);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we keep the two code paths now that non-power-of-two teams are supported?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to make sure that we indeed test with a team size that is not a power of two and 3 seems to be unrealistic for the GPU backends but higher values are not really feasible for the host parallel backends. That being said, I'm happy to adapt if you have a good suggestion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No that's ok. I haven't looked in details what this unit test does. BTW what about reductions?

Copy link
Member

@dalg24 dalg24 Jun 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW what about reductions?

Ping

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll find another test that wouldn't have worked before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After inspecting CUDA backend implementation thoroughly, I found that we can never hit this code path with a non-power-of-two block size if were are not also performing scans. For RangePolicy and MDRangePolicy we enforce using a power-of-two block size internally by adapting it if necessary. For TeamPolicy, we would call it with the shmem path which is never used.

@masterleinad masterleinad added this to the Tentative 3.7 Release milestone Jun 6, 2022
@dalg24 dalg24 merged commit 52ead94 into kokkos:develop Jun 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants