Fix TeamThreadMDRange parallel_reduce #6511
Conversation
The fix resolved the issue I posted in the above-mentioned Slack discussion.
Force-pushed from 960890e to dadb200
Force-pushed from bcadb77 to 8b01efd
@@ -983,7 +983,10 @@ template <typename Rank, typename TeamHandle, typename Lambda,
 KOKKOS_INLINE_FUNCTION void parallel_reduce(
     TeamThreadMDRange<Rank, TeamHandle> const& policy, Lambda const& lambda,
     ReducerValueType& val) {
+  val = ReducerValueType{};
I have a very vague memory that we intentionally did not default-initialize the result variable here, expecting it to be handled further down the line when the non-MD team policies are eventually called. But since the code does not reflect that thought, I think initializing the result value here works.
Do we exclusively support sum reductions?
Otherwise initializing this way does not look right to me.
Yes, we currently only allow sum reductions for this interface.
Do we have a static_assert to detect if a non-sum reducer or a functor with an init/join is passed?
I added some static_asserts for the value_type in 1c85dd0.
It shouldn't be that difficult to extend the interface to support the full reduction interface but I would rather do that in a follow-up pull request.
Also, note that the nested reduction interface doesn't allow functors as reducers yet, see #6317.
Also, note that the current interface doesn't even allow passing in values the way we often do for reducers.
if constexpr (false
#ifdef KOKKOS_ENABLE_CUDA
              || std::is_same_v<typename TeamHandle::execution_space,
                                Kokkos::Cuda>
#elif defined(KOKKOS_ENABLE_HIP)
              || std::is_same_v<typename TeamHandle::execution_space,
                                Kokkos::HIP>
#elif defined(KOKKOS_ENABLE_SYCL)
              || std::is_same_v<typename TeamHandle::execution_space,
                                Kokkos::Experimental::SYCL>
#endif
)
Do we prefer this over if constexpr (!std::is_same_v<typename TeamHandle::execution_space, Kokkos::Serial>)?
Let's see for which of the backends we need this in the end for the test to pass. We can discuss separately whether we need a uniform vector_reduce implementation for all backends or not.
Force-pushed from f93d91c to 7435619
core/unit_test/TestTeamMDRange.hpp
Outdated
Kokkos::TeamPolicy<ExecSpace>(
    leagueSize, Kokkos::AUTO
#ifndef KOKKOS_ENABLE_OPENMPTARGET
    ,
    Kokkos::TeamPolicy<ExecSpace>::vector_length_max()
#endif
),
We need at least 32 threads in a team, and setting vector_length_max would give us 1 thread with vector_length 32.
Does that happen even when you manually specify a number for team_size, or only when Kokkos::AUTO is used along with vector_length_max?
With the combination the compiler complains. I didn't try too many things here and would rather leave it to @rgayatri23 to clean this up.
We are now using vector_length==2 if OpenMPTarget is enabled, which seems to work fine as well.
@@ -149,8 +155,9 @@ class OpenMPTargetExecTeamMember {
   }
 #pragma omp barrier
 }
-  return team_scratch[0];
+  value = team_scratch[0];
Shouldn't this have given a warning at least, since it is now a void function?
Discussed offline that an earlier change in this pull request missed writing back to value when removing the return statement.
nvm wrong question
Force-pushed from 46810ef to 44fa290
Why is the OpenMPTarget fix not proposed as its own PR?
Are there not more unit tests that can be enabled with that fix?
My motivation was to minimize the number of places where we would disable/modify a test (one that wasn't using the respective functionality in the intended way) and this fix in the
Force-pushed from 1c85dd0 to 1c15e18
@@ -983,7 +983,16 @@ template <typename Rank, typename TeamHandle, typename Lambda,
 KOKKOS_INLINE_FUNCTION void parallel_reduce(
     TeamThreadMDRange<Rank, TeamHandle> const& policy, Lambda const& lambda,
     ReducerValueType& val) {
+  static_assert(/*!Kokkos::is_view_v<ReducerValueType> &&*/
is_view_v requires pulling in Kokkos_View.hpp, which in turn pulls in this header. To avoid the circular dependency, I just commented out this particular check.
Not quite sure. The motivation here was to not disable more tests. I wanted to let @rgayatri23 explore (in a different pull request) whether more unit tests could be enabled when providing the correct interface, but the one enabled additionally here is the only one that seemed obvious.
Fixes #6513, fixes #6530. I decided to do both in the same pull request to minimize the number of pull requests and the load on the CI.
It came up in https://kokkosteam.slack.com/archives/C5BGU5NDQ/p1697207384819959?thread_ts=1697194377.868819&cid=C5BGU5NDQ that the behavior of TeamThreadMDRange parallel_reduce is surprising: it just does the reduction for a single thread and is missing the reduction across multiple team members. Also, we don't initialize the result variable. This pull request fixes that. It appears that the same holds true for ThreadVectorMDRange and TeamVectorMDRange, but we are missing a uniform vector_reduce implementation, so that is commented out at the moment.