Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenMPTarget: Adding Implementation for nested reducers #3845

Merged
merged 18 commits into from
Apr 7, 2021

Conversation

rgayatri23
Copy link
Contributor

The PR contains the following updates in the OpenMPTarget backend:

  1. nested array reduction
  2. nested kokkos-reducers
  3. nested reducers with join

Major contributor to the PR - Patrick Steinbrecher (@psteinbrecher)

@masterleinad
Copy link
Contributor

Major contributor to the PR - Patrick Steinbrecher (@psteinbrecher)

Then he should also co-author the commits here (https://docs.github.com/en/github/committing-changes-to-your-project/creating-a-commit-with-multiple-authors). 🙂

@masterleinad
Copy link
Contributor

You should enable the corresponding tests in this pull request.

@rgayatri23
Copy link
Contributor Author

You should enable the corresponding tests in this pull request.

Yes working on the unit tests right now. Will update them soon.

@rgayatri23 rgayatri23 removed the [WIP] label Mar 12, 2021
@masterleinad
Copy link
Contributor

The indentation is off. 😉

#pragma omp barrier

if constexpr (std::is_arithmetic<ValueType>::value) {
#pragma omp for reduction(+ : tmp_scratch[:1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is wrong: it does + for the inter thread reduction, and then join within a thread.

@crtrott crtrott added the Blocks Promotion Overview issue for release-blocking bugs label Mar 15, 2021
@dhollman dhollman self-requested a review March 17, 2021 20:41
@rgayatri23 rgayatri23 removed the [WIP] label Mar 24, 2021
Copy link

@dhollman dhollman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is way too much new code duplication in this pull request for me to be comfortable saying I could maintain it. Please consolidate.

template <class FunctorType, class ReducerType, class PointerType,
class ValueType, class... PolicyArgs>
struct ParallelReduceSpecialize<FunctorType, TeamPolicyInternal<PolicyArgs...>,
ReducerType, PointerType, ValueType, 0, 1> {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/* functor has join = */ 0,
/* use reducer type = */ 1

Also, why are these ints instead of bools?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should still address this.


#pragma omp declare reduction( \
custom:ValueType \
: OpenMPTargetReducerWrapper <ReducerType>::join(omp_out, omp_in)) \

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

practically everything except for this pragma here is copied from the 0, 0 case above, but the only way to figure that out is to run diff on the code segments. This is way too much code duplication for me to feel comfortable approving this. It needs to be restructured so that the repeated lines aren't in two places; otherwise we can't maintain this.

inline static
typename std::enable_if<!std::is_same<TagType, void>::value>::type
execute_impl(const FunctorType& f, const PolicyType& p,
PointerType result_ptr) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This differs by one line from the 50+ lines of code in the non-tag version. Please refactor this to remove the copy-pasted code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This issue is addressed . The check for tag is pushed inside the function and the duplicated function is deleted.

@@ -1286,8 +1286,16 @@ struct TestTeamBroadcast<
using policy_type_f =
Kokkos::TeamPolicy<ScheduleType, ExecSpace, BroadcastTag>;

// FIXME_OPENMPTARGET temporary restriction for team size to be at least 32
#ifdef KOKKOS_ENABLE_OPENMPTARGET

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since execution spaces are unconditionally declared now, we shouldn't need this ifdef any more (right)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we haven't merged #3838.

@rgayatri23
Copy link
Contributor Author

Major contributor to the PR - Patrick Steinbrecher (@psteinbrecher)

Then he should also co-author the commits here (https://docs.github.com/en/github/committing-changes-to-your-project/creating-a-commit-with-multiple-authors). 🙂

Unfortunately he cannot be associated with any PRs due to conflicts in his work policy.

@rgayatri23
Copy link
Contributor Author

There is way too much new code duplication in this pull request for me to be comfortable saying I could maintain it. Please consolidate.

I was thinking of consolidating the duplicated code in a separate PR once this gets merged. It has been an issue since before this PR. But if there is a strong opinion on doing it in this PR itself, I have no issues.

@dhollman
Copy link

I was thinking of consolidating the duplicated code in a separate PR once this gets merged. It has been an issue since before this PR. But if there is a strong opinion on doing it in this PR itself, I have no issues.

If other reviewers think that's fine, then I'm okay with it. I just think it's easier to avoid producing duplicate code in the first place, though. At the very least, mark the duplicated lines with comments like "begin copy-paste X" and "end copy-paste from X" so that we can remove it later.

@@ -66,7 +66,8 @@ TEST(TEST_CATEGORY, team_for) {
}

// FIXME_OPENMPTARGET wrong results
#ifndef KOKKOS_ENABLE_OPENMPTARGET
// FIXME_SYCL team reduction and broadcast not yet implemented
#if !defined(KOKKOS_ENABLE_SYCL) && !defined(KOKKOS_ENABLE_OPENMPTARGET)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks like an involuntary change. @masterleinad

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this works for SYCL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove the ifdef around SYCL

}
#endif // KOKKOS_IMPL_HIERARCHICAL_REDUCERS_WORKAROUND
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would undef KOKKOS_IMPL_HIERARCHICAL_REDUCERS_WORKAROUND right there

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping

value = *local_value;
#endif*/
KOKKOS_INLINE_FUNCTION void team_broadcast(ValueType& value,
const int& thread_id) const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to quit passing arithmetic types by const reference.

@@ -51,6 +51,11 @@
#include <Kokkos_Atomic.hpp>
#include "Kokkos_OpenMPTarget_Abort.hpp"

// FIXME_OPENMPTARGET - Using this macro to implement a workaround for
// hierarchical reducers. It avoids hitting the code path which we wanted to
// write but dosen't work. undef'ed at the end.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// write but dosen't work. undef'ed at the end.
// write but doesn't work. undef'ed at the end.

Comment on lines 627 to 628
using type = typename if_c<sizeof(ValueType) < TEAM_REDUCE_SIZE, ValueType,
void>::type;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't you use std::conditional_t here?

using type = typename if_c<sizeof(ValueType) < TEAM_REDUCE_SIZE, ValueType,
void>::type;
type* team_scratch =
(type*)((char*)m_glb_scratch + TEAM_REDUCE_SIZE * omp_get_team_num());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C-style casts for pointers 🙁

Comment on lines 1179 to 1180
typename std::enable_if<!Kokkos::is_reducer_type<ValueType>::value>::type
parallel_reduce(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't you implement this one in terms of the reducer implementation using Kokkos::Sum?


#if !defined(KOKKOS_IMPL_HIERARCHICAL_REDUCERS_WORKAROUND)
// For some reason the actual version we wanted to write doesn't work
// and crashes. We should trye this with every new compiler
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// and crashes. We should trye this with every new compiler
// and crashes. We should try this with every new compiler

Comment on lines +1553 to +1554
static_assert(sizeof(ValueType) <=
Impl::OpenMPTargetExecTeamMember::TEAM_REDUCE_SIZE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a FIXME?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't this be implemented in terms of the reducer variant using Kokkos::Sum?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a FIXME?

No this is not a FIXME, essentially we are restricting the nested array-reduction size to be <= 32 elements per team.
The reduction uses a global scratch, so 32 elements with 16 bytes per element is essentially 512 bytes per team. Anything more than this might be detrimental for the performance.

Can't this be implemented in terms of the reducer variant using Kokkos::Sum?

We were trying to rely on OpenMP constructs as much as possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I am misunderstanding but does that mean that you can't do reductions for large types (and that is never intended to work)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to say its better to not perform reductions larger than 32 elements as long as we have the current implementation strategy where we allocate 16 bytes per element in the scratch memory. But maybe @crtrott can answer this better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May question solely is if we should put a FIXME there or not. 🙂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you are right. We should put a FIXME there. Will do it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I forgot that we have already added a FIXME_OPENMPTARGET for this issue in one of the variants of the parallel_reduce functions. I will repeat that everywhere.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This static_assert seems to be duplicated now looking at lines 1573-1577.

rgayatri added 2 commits March 26, 2021 13:21
… with c++ style. Added FIXME messages to the restriction of 32 elements per team in the array reduction of parallel_reduce.
Comment on lines 1626 to 1627
static_assert(sizeof(ValueType) <=
Impl::OpenMPTargetExecTeamMember::TEAM_REDUCE_SIZE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one seems duplicated looking at line 1614-1618.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed it.

Copy link
Member

@dalg24 dalg24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You forgot to undef KOKKOS_IMPL_HIERARCHICAL_REDUCERS_WORKAROUND

#ifdef KOKKOS_ENABLE_OPENMPTARGET
if constexpr (!std::is_same<TEST_EXECSPACE,
Kokkos::Experimental::OpenMPTarget>::value)
#endif
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
{

Same comment below

}
#endif // KOKKOS_IMPL_HIERARCHICAL_REDUCERS_WORKAROUND
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping

static_assert(sizeof(ValueType) <=
Impl::OpenMPTargetExecTeamMember::TEAM_REDUCE_SIZE);

ValueType* TeamThread_scratch =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style: prefer team_thread_scratch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted it to match TeamThreadRangeBoundariesStruct in the function definition. But I can change it if its about being consistent in the way we name variables.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not blocking

const Lambda& lambda, ValueType& result) {
ValueType* tmp_scratch =
KOKKOS_INLINE_FUNCTION
typename std::enable_if<!Kokkos::is_reducer_type<ValueType>::value>::type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
typename std::enable_if<!Kokkos::is_reducer_type<ValueType>::value>::type
std::enable_if_t<!Kokkos::is_reducer_type<ValueType>::value>

// This is the variant we actually wanted to write
template <typename iType, class Lambda, typename ReducerType>
KOKKOS_INLINE_FUNCTION
typename std::enable_if<Kokkos::is_reducer_type<ReducerType>::value>::type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
typename std::enable_if<Kokkos::is_reducer_type<ReducerType>::value>::type
std::enable_if_t<Kokkos::is_reducer_type<ReducerType>::value>

Copy link
Contributor

@masterleinad masterleinad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks OK to me.

@dalg24 dalg24 dismissed dhollman’s stale review March 31, 2021 21:50

Changes requested were partially addressed and will be fully addressed in follow-up PR

@dalg24 dalg24 merged commit 8b7a01f into kokkos:develop Apr 7, 2021
@rgayatri23 rgayatri23 deleted the OpenMPTarget_NestedReducers branch March 24, 2023 22:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Blocks Promotion Overview issue for release-blocking bugs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants