TeamScan for CUDA, Pthreads, OpenMPTarget, HIP #3536

Merged (22 commits) on Nov 19, 2020

Conversation

@jrmadsen (Contributor)

  • implemented team-level parallel scan in CUDA

@crtrott (Member) commented Oct 28, 2020

Implementation sketch:
Threaded:

parallel_scan(..., N, ...) {
  value_type my_sum = 0;
  // Pass 1: each thread accumulates its partial sum over its chunk (no output yet).
  for(int i = N/team_size*team_rank; ... ) {
     f(i, my_sum, false);
  }
  // Exclusive scan of the per-thread partials gives each thread its offset.
  offset = team.team_scan(my_sum);
  my_sum = 0;
  // Pass 2: replay with the offset applied, this time writing final values.
  for(int i = N/team_size*team_rank; ... ) {
     f(i, my_sum + offset, true);
  }
}

Cuda:

parallel_scan(..., N, ...) {
  value_type my_sum;
  offset = 0;
  // Process the range one team-sized chunk at a time.
  for(int chunk = 0; chunk < N/team_size; ++chunk) {
    my_sum = 0;
    f(chunk*team_size + team_rank, my_sum, false);
    // Exclusive scan within the team for this chunk.
    local_offset = team.team_scan(my_sum);
    f(chunk*team_size + team_rank, my_sum + local_offset + offset, true);
    // The last thread knows the chunk total; carry it into the next chunk.
    if(team_rank() == team_size - 1)
      offset += ...
    team.team_broadcast(offset, team_size - 1);
  }
}
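
For context, here is a minimal usage sketch of the team-level scan interface added in this PR, using the TeamThreadRange plus three-argument lambda form quoted later in this thread; the view name, sizes, and kernel label are illustrative, not taken from the PR.

#include <Kokkos_Core.hpp>

// Exclusive prefix sum of each row of a 2D view, one team per row (sketch).
void row_prefix_sums(Kokkos::View<double**> data) {
  const int rows = static_cast<int>(data.extent(0));
  const int cols = static_cast<int>(data.extent(1));
  using policy_t = Kokkos::TeamPolicy<>;
  Kokkos::parallel_for(
      "row_prefix_sums", policy_t(rows, Kokkos::AUTO),
      KOKKOS_LAMBDA(const policy_t::member_type& team) {
        const int row = team.league_rank();
        Kokkos::parallel_scan(
            Kokkos::TeamThreadRange(team, cols),
            [&](const int i, double& update, const bool is_final) {
              const double val = data(row, i);
              if (is_final) data(row, i) = update;  // exclusive scan result
              update += val;
            });
      });
}

Note that, as discussed further down in this thread, the implementation merged here requires a power-of-two team size (and, per the unit tests, work item count) on CUDA.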

@jrmadsen (Contributor, author) commented Nov 2, 2020

@crtrott I ended up with a slightly different implementation than the one you recommended. This appears to work as long as the team size is a power of 2; otherwise team_scan throws an error.

@dalg24 (Member) commented Nov 4, 2020

Retest this please

@jrmadsen changed the title from "WIP: TeamScan for CUDA" to "WIP: TeamScan for CUDA, Pthreads, OpenMPTarget, HIP" on Nov 4, 2020
@crtrott (Member) left a review comment:

We met today and discussed correctness issues. Jonathan is working on this.

- Added TestTeamScan.hpp
- Renamed team_scan test to team_reduction_scan in TeamReductionScan due to naming conflict
@jrmadsen changed the title from "WIP: TeamScan for CUDA, Pthreads, OpenMPTarget, HIP" to "TeamScan for CUDA, Pthreads, OpenMPTarget, HIP" on Nov 11, 2020
@jrmadsen (Contributor, author)

@dalg24 @masterleinad Do either of y'all know why the Jenkins build failed? I searched the Jenkins log for every instance of "error" and "fail", and essentially the only failure was "script exited with error code 2"; the build and tests appear to be fine.

@masterleinad (Contributor)

/var/jenkins/workspace/Kokkos/core/unit_test/TestTeamScan.hpp:66:16: error: unused variable 'teamSize' [clang-diagnostic-unused-variable]
          auto teamSize   = team.team_size();
               ^
/var/jenkins/workspace/Kokkos/core/unit_test/TestTeamScan.hpp:87:5: note: in instantiation of member function 'Test::TestTeamScan<Kokkos::Serial, short>::operator()' requested here
    (*this)(M, N, a_d, a_r);
    ^
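
A common way to resolve this kind of clang-tidy diagnostic, if the variable is kept rather than deleted, is to mark it as used. A minimal sketch, not necessarily the change that was actually committed:

// Keep the variable but mark it as used, so template instantiations that
// never reference it still pass the unused-variable check.
template <typename TeamMember>
void query_team_size(const TeamMember& team) {
  auto teamSize = team.team_size();
  (void)teamSize;
}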

@masterleinad (Contributor)

/var/jenkins/workspace/Kokkos/install/include/Cuda/Kokkos_Cuda_Parallel.hpp(680): error: calling a constexpr __host__ function("operator()") from a __device__ function("exec_team") is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this.
          detected during:
            instantiation of "std::enable_if<std::is_same<TagType, void>::value, void>::type Kokkos::Impl::ParallelFor<FunctorType, Kokkos::TeamPolicy<Properties...>, Kokkos::Cuda>::exec_team<TagType>(const Kokkos::Impl::ParallelFor<FunctorType, Kokkos::TeamPolicy<Properties...>, Kokkos::Cuda>::Member &) const [with FunctorType=lambda [](const Kokkos::Impl::CudaTeamMember &)->void, Properties=<Kokkos::Cuda>, TagType=void]" 
(731): here
            instantiation of "void Kokkos::Impl::ParallelFor<FunctorType, Kokkos::TeamPolicy<Properties...>, Kokkos::Cuda>::operator()() const [with FunctorType=lambda [](const Kokkos::Impl::CudaTeamMember &)->void, Properties=<Kokkos::Cuda>]" 
/var/jenkins/workspace/Kokkos/install/include/Cuda/Kokkos_Cuda_KernelLaunch.hpp(121): here
            instantiation of "void Kokkos::Impl::cuda_parallel_launch_local_memory(DriverType) [with DriverType=Kokkos::Impl::ParallelFor<lambda [](const Kokkos::Impl::CudaTeamMember &)->void, Kokkos::TeamPolicy<Kokkos::Cuda>, Kokkos::Cuda>]" 
/var/jenkins/workspace/Kokkos/install/include/Cuda/Kokkos_Cuda_KernelLaunch.hpp(319): here
            instantiation of "std::decay_t<decltype((<expression>))> Kokkos::Impl::CudaParallelLaunchKernelFunc<DriverType, Kokkos::LaunchBounds<0U, 0U>, Kokkos::Impl::Experimental::CudaLaunchMechanism::LocalMemory>::get_kernel_func() [with DriverType=Kokkos::Impl::ParallelFor<lambda [](const Kokkos::Impl::CudaTeamMember &)->void, Kokkos::TeamPolicy<Kokkos::Cuda>, Kokkos::Cuda>]" 
/var/jenkins/workspace/Kokkos/install/include/Cuda/Kokkos_Cuda_KernelLaunch.hpp(646): here
            instantiation of "cudaFuncAttributes Kokkos::Impl::CudaParallelLaunchImpl<DriverType, Kokkos::LaunchBounds<MaxThreadsPerBlock, MinBlocksPerSM>, LaunchMechanism>::get_cuda_func_attributes() [with DriverType=Kokkos::Impl::ParallelFor<lambda [](const Kokkos::Impl::CudaTeamMember &)->void, Kokkos::TeamPolicy<Kokkos::Cuda>, Kokkos::Cuda>, MaxThreadsPerBlock=0U, MinBlocksPerSM=0U, LaunchMechanism=Kokkos::Impl::Experimental::CudaLaunchMechanism::LocalMemory]" 
(764): here
            instantiation of "Kokkos::Impl::ParallelFor<FunctorType, Kokkos::TeamPolicy<Properties...>, Kokkos::Cuda>::ParallelFor(const FunctorType &, const Kokkos::Impl::ParallelFor<FunctorType, Kokkos::TeamPolicy<Properties...>, Kokkos::Cuda>::Policy &) [with FunctorType=lambda [](const Kokkos::Impl::CudaTeamMember &)->void, Properties=<Kokkos::Cuda>]" 
/var/jenkins/workspace/Kokkos/install/include/Kokkos_Parallel.hpp(168): here
            instantiation of "void Kokkos::parallel_for(const ExecPolicy &, const FunctorType &, const std::__cxx11::string &, std::enable_if<Kokkos::is_execution_policy<ExecPolicy>::value, void>::type *) [with ExecPolicy=Kokkos::TeamPolicy<Kokkos::Cuda>, FunctorType=lambda [](const Kokkos::Impl::CudaTeamMember &)->void]" 
/var/jenkins/workspace/Kokkos/core/unit_test/TestTeamScan.hpp(80): here
            instantiation of "void Test::TestTeamScan<Device, DataType>::operator()(int32_t, int32_t, Test::TestTeamScan<Device, DataType>::view_type, Test::TestTeamScan<Device, DataType>::view_type) const [with Device=Kokkos::Cuda, DataType=int16_t]" 
/var/jenkins/workspace/Kokkos/core/unit_test/TestTeamScan.hpp(87): here
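
The trace shows a plain host lambda of type [](const Kokkos::Impl::CudaTeamMember&)->void being instantiated inside the CUDA ParallelFor. The usual remedy for this class of error is to mark the lambda with KOKKOS_LAMBDA so its operator() is callable from device code; a sketch of the pattern, assuming that was the cause here:

#include <Kokkos_Core.hpp>

// A plain [=](...) lambda is host-only under nvcc, which triggers the
// "calling a constexpr __host__ function from a __device__ function" error
// above. KOKKOS_LAMBDA annotates the capture so it can run in device code.
void launch_team_kernel(int league_size) {
  using policy_t = Kokkos::TeamPolicy<>;
  Kokkos::parallel_for(
      "launch_team_kernel", policy_t(league_size, Kokkos::AUTO),
      KOKKOS_LAMBDA(const policy_t::member_type& team) {
        (void)team;  // per-team work would go here
      });
}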

@masterleinad (Contributor)

/var/jenkins/workspace/Kokkos/core/unit_test/TestTeamScan.hpp:125:57: error: typedef 'using TEST_POLICY = class Kokkos::TeamPolicy<Kokkos::OpenMP>' locally defined but not used [-Werror=unused-local-typedefs]
   using TEST_POLICY = Kokkos::TeamPolicy<TEST_EXECSPACE>;
                                                         ^

@jrmadsen (Contributor, author)

@crtrott @dalg24 So everything is passing here except for some SYCL stuff, which I didn't touch; it looks like the new tests instantiate something that is incomplete in the SYCL backend. How do we proceed here?

@crtrott (Member) left a review comment:

Aside from adding the test to the SYCL exclude list, this looks good!

@Rombur (Member) left a review comment:

The HIP backend looks good.

@dalg24 (Member) left a review comment:

Please clarify the tolerance used for floating-point numbers.

(Resolved review threads on core/unit_test/TestTeamScan.hpp and core/src/Cuda/Kokkos_Cuda_Team.hpp; comments outdated.)
Co-authored-by: Damien L-G <dalg24+github@gmail.com>
@jrmadsen requested a review from dalg24 on November 18, 2020
@crtrott (Member) commented Nov 18, 2020

Retest this please.

@crtrott (Member) commented Nov 19, 2020

So, interesting question: should we just merge? Windows is likely failing because of the MSVC stuff, Jenkins is the AMD node, and Travis is one timeout ...

@crtrott (Member) commented Nov 19, 2020

Also: Jonathan, do you want to rewrite history, or should I squash-commit?

@Char-Aznable (Contributor) commented Jul 6, 2021

Hi @jrmadsen @crtrott, I have code using something like parallel_scan(TeamThreadRange(team, n), [&](const int i, double& update, const bool isFinal) { ... }); with n = 200, and it aborts at the line

if (BlockSizeMask & blockDim.y) {
  Kokkos::abort("Cuda::cuda_intra_block_scan requires power-of-two blockDim");
}

and

(cuda-gdb) p blockDim.y
$4 = 224
(cuda-gdb) p blockDim.y - 1
$5 = 223

Any idea what went wrong here? Does the code in this PR require the item count n to be the same as the number of threads in the team? It seems to be calling team_member.team_scan(), judging from the stack trace.

@Char-Aznable (Contributor)

Judging from the unit test cases, I guess this implementation only works if the work item count passed to TeamThreadRange is a power of 2?

@jrmadsen deleted the team-parallel-scan branch on July 6, 2021
@jrmadsen (Contributor, author) commented Jul 6, 2021

Judging from the unit test cases, I guess this implementation only works if the work item count passed to TeamThreadRange is a power of 2?

Yes
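
Based on the abort message quoted above (the CUDA intra-block scan requires a power-of-two blockDim), one interim workaround is to request an explicit power-of-two team size in the TeamPolicy. A minimal sketch; the helper below is hypothetical and not part of Kokkos:

#include <Kokkos_Core.hpp>

// Round a positive team size down to a power of two so the CUDA intra-block
// scan precondition (power-of-two blockDim.y) is satisfied. Hypothetical helper.
inline int round_down_to_pow2(int n) {
  int p = 1;
  while (2 * p <= n) p *= 2;
  return p;
}

// Usage sketch: pick an explicit power-of-two team size for the TeamPolicy.
inline Kokkos::TeamPolicy<> make_scan_policy(int league_size,
                                             int desired_team_size) {
  return Kokkos::TeamPolicy<>(league_size,
                              round_down_to_pow2(desired_team_size));
}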

@Char-Aznable (Contributor)

Would it be difficult to support non-power-of-2 work counts? I can help implement it, at least for the CUDA backend. I have a lot of small loops with work counts of a few hundred, and it would be a big performance hit if I were forced to use power-of-2 loops, especially on the host, because I would need to convert all the host-side loops to powers of 2 to stay portable.

@masterleinad (Contributor)

We will discuss this, but in general, contributions are very welcome.

@masterleinad (Contributor)

@Char-Aznable We decided that we are interested in making non-power-of-2 team sizes work; see #4146. Any help in implementing that is very welcome!

@Char-Aznable (Contributor)

Great! I'll take a look at the code and see what I can do
