Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TeamThreadRange loop count on device is passed by reference to host static constexpr #1733

Closed
Char-Aznable opened this issue Aug 10, 2018 · 8 comments
Assignees
Labels
Enhancement Improve existing capability; will potentially require voting

Comments

@Char-Aznable
Copy link
Contributor

Char-Aznable commented Aug 10, 2018

Hi, I've got this weird behavior when running a TeamThreadRange parallel_for:

#include <Kokkos_Core.hpp>

using ExecutionSpace = Kokkos::DefaultExecutionSpace;

template < std::size_t T, std::size_t N >
struct Foo {
  static constexpr std::size_t nTrials = T;
  static constexpr std::size_t nReplica = N;
};

using FT = Foo<8,1>;

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc,argv); {

  using TeamPolicy = Kokkos::TeamPolicy<ExecutionSpace>;
  using Team = TeamPolicy::member_type;
  const std::size_t nDomainsTotal = 8;
  const std::size_t nTeams = nDomainsTotal;
  const std::size_t nTasksPerDomain = FT::nTrials * FT::nReplica;
  const std::size_t teamSize = std::min(nTasksPerDomain,
    std::size_t(TeamPolicy::team_size_max([=](Team){})));

  TeamPolicy policy = TeamPolicy(nTeams,teamSize);
  Kokkos::parallel_for(policy, KOKKOS_LAMBDA (const Team& team) {

    Kokkos::single(Kokkos::PerTeam(team),[&]() {
      printf("Team:%10lu%10lu\n",FT::nTrials,FT::nReplica);
    });
    Kokkos::parallel_for(Kokkos::TeamThreadRange(team,FT::nReplica),
      [&] (const int& iReplica) {
        printf("TeamThreadRange:%10d\n",iReplica);
      });
    team.team_barrier();
  });

  Kokkos::fence();


  } Kokkos::finalize();
}

This only outputs:

Team:         8         1
Team:         8         1
Team:         8         1
Team:         8         1
Team:         8         1
Team:         8         1
Team:         8         1
Team:         8         1

Same behavior is observed if nTrials and nReplica are file scope constexpr:

static constexpr std::size_t nTrials = 8;
static constexpr std::size_t nReplica = 1;

or if they are declared inside main with static storage:

...
int main(int argc, char* argv[]) {
  static constexpr std::size_t nTrials = 8;
  static constexpr std::size_t nReplica = 1;
  ...
}
...

Only when they are declared and defined inside main without static storage:

...
int main(int argc, char* argv[]) {
  constexpr std::size_t nTrials = 8;
  constexpr std::size_t nReplica = 1;
  ...
}
...

will I get the expected output:

Team:         8         1
Team:         8         1
Team:         8         1
Team:         8         1
Team:         8         1
Team:         8         1
Team:         8         1
Team:         8         1
TeamThreadRange:         0
TeamThreadRange:         0
TeamThreadRange:         0
TeamThreadRange:         0
TeamThreadRange:         0
TeamThreadRange:         0
TeamThreadRange:         0
TeamThreadRange:         0

This only happens on GPU. The CPU version runs fine. All these examples are compiled using clang 6.0.1 and cuda 9.0.

The weird thing is that this seems to only affect the TeamThreadRange loop but not the team policy loop.

Does anyone got an explanation for this and how do I get around this issue for my original example where I need to use template parameters of a class to specify the loop count for TeamThreadRange?

@Char-Aznable Char-Aznable changed the title TeamThreadRange loop not launched if loop count is static TeamThreadRange loop not launched on GPU if loop count is static Aug 10, 2018
@ibaned ibaned added the Question For Kokkos internal and external contributors and users label Aug 10, 2018
@Char-Aznable
Copy link
Contributor Author

OK. I think the problem is that the loop count variable of TeamThreadRange on CUDA is passed by reference and causes a ODR-use of the static variable. I tried passing it by value and I got the expected output. Here's what the changes are to the Kokkos code:

diff --git a/core/src/Cuda/Kokkos_Cuda_Team.hpp b/core/src/Cuda/Kokkos_Cuda_Team.hpp
index 73ec409b..48091c18 100644
--- a/core/src/Cuda/Kokkos_Cuda_Team.hpp
+++ b/core/src/Cuda/Kokkos_Cuda_Team.hpp
@@ -647,7 +647,7 @@ struct ThreadVectorRangeBoundariesStruct<iType,CudaTeamMember> {
 template<typename iType>
 KOKKOS_INLINE_FUNCTION
 Impl::TeamThreadRangeBoundariesStruct< iType, Impl::CudaTeamMember >
-TeamThreadRange( const Impl::CudaTeamMember & thread, const iType & count ) {
+TeamThreadRange( const Impl::CudaTeamMember & thread, const iType count ) {
   return Impl::TeamThreadRangeBoundariesStruct< iType, Impl::CudaTeamMember >( thread, count );
 }

Would such a change to Kokkos make sense?

@Char-Aznable
Copy link
Contributor Author

This is more obvious with nvcc, which directly complains about the ODR use:

testTeam.cpp(30): error: identifier "Foo<(unsigned long)8ul, (unsigned long)1ul> ::nReplica" is undefined in device code

I would like to propose this as a bug from kokkos if it makes sense :)

@Char-Aznable Char-Aznable changed the title TeamThreadRange loop not launched on GPU if loop count is static TeamThreadRange loop count on device is passed by reference to host static constexpr Aug 11, 2018
@Char-Aznable
Copy link
Contributor Author

Char-Aznable commented Aug 11, 2018

Change the title a bit to make it more clear. @ibaned

@crtrott
Copy link
Member

crtrott commented Aug 14, 2018

One question: if you change it to not have the & will it still be wrong behavior (or not compile)? Or does that change the lambda capture behavior in a way that it ends up working?

@Char-Aznable
Copy link
Contributor Author

Removing the reference works as expected.

@Char-Aznable
Copy link
Contributor Author

Actually, which & did you mean? Removing the one in core/src/Cuda/Kokkos_Cuda_Team.hpp works but removing the lambda capture one ([&]) ends up having the same issue.

@crtrott crtrott added Enhancement Improve existing capability; will potentially require voting and removed Question For Kokkos internal and external contributors and users labels Aug 15, 2018
@crtrott crtrott added this to the 2018 September milestone Aug 15, 2018
@crtrott
Copy link
Member

crtrott commented Aug 15, 2018

Ok this needs fixing. Generally we gonna reduce the use of const &

@dhollman
Copy link

#2022 should fix this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Improve existing capability; will potentially require voting
Projects
None yet
Development

No branches or pull requests

5 participants