TeamScan for CUDA, Pthreads, OpenMPTarget, HIP #3536

Merged (22 commits) on Nov 19, 2020

Conversation

@jrmadsen (Contributor)

  • implemented team-level parallel scan in CUDA

@crtrott (Member) commented Oct 28, 2020

Implementation sketch:
Threaded:

parallel_scan(..., N, ...) {
  value_type my_sum = 0;
  // Pass 1: each thread accumulates its partial sum over its chunk (no output yet).
  for(int i = N/team_size*team_rank; ... ) {
     f(i, my_sum, false);
  }
  // Exclusive scan of the per-thread partials gives each thread its offset.
  offset = team.team_scan(my_sum);
  my_sum = 0;
  // Pass 2: replay with the offset applied, this time writing final values.
  for(int i = N/team_size*team_rank; ... ) {
     f(i, my_sum + offset, true);
  }
}

Cuda:

parallel_scan(..., N, ...) {
  value_type my_sum;
  offset = 0;
  // Process the range one team-sized chunk at a time.
  for(int chunk = 0; chunk < N/team_size; ++chunk) {
    my_sum = 0;
    f(chunk*team_size + team_rank, my_sum, false);
    // Exclusive scan within the team for this chunk.
    local_offset = team.team_scan(my_sum);
    f(chunk*team_size + team_rank, my_sum + local_offset + offset, true);
    // The last thread knows the chunk total; carry it into the next chunk.
    if(team_rank() == team_size - 1)
      offset += ...
    team.team_broadcast(offset, team_size - 1);
  }
}
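
For context, here is a minimal usage sketch of the team-level scan interface added in this PR, using the TeamThreadRange plus three-argument lambda form quoted later in this thread; the view name, sizes, and kernel label are illustrative, not taken from the PR.

#include <Kokkos_Core.hpp>

// Exclusive prefix sum of each row of a 2D view, one team per row (sketch).
void row_prefix_sums(Kokkos::View<double**> data) {
  const int rows = static_cast<int>(data.extent(0));
  const int cols = static_cast<int>(data.extent(1));
  using policy_t = Kokkos::TeamPolicy<>;
  Kokkos::parallel_for(
      "row_prefix_sums", policy_t(rows, Kokkos::AUTO),
      KOKKOS_LAMBDA(const policy_t::member_type& team) {
        const int row = team.league_rank();
        Kokkos::parallel_scan(
            Kokkos::TeamThreadRange(team, cols),
            [&](const int i, double& update, const bool is_final) {
              const double val = data(row, i);
              if (is_final) data(row, i) = update;  // exclusive scan result
              update += val;
            });
      });
}

Note that, as discussed further down in this thread, the implementation merged here requires a power-of-two team size (and, per the unit tests, work item count) on CUDA.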

@jrmadsen (Contributor, author) commented Nov 2, 2020

@crtrott I ended up with a slightly different implementation than the one you recommended. This appears to work as long as the team size is a power of 2; otherwise team_scan throws an error.

@dalg24 (Member) commented Nov 4, 2020

Retest this please

@jrmadsen changed the title from "WIP: TeamScan for CUDA" to "WIP: TeamScan for CUDA, Pthreads, OpenMPTarget, HIP" on Nov 4, 2020
@crtrott (Member) left a review comment:

We met today and discussed correctness issues. Jonathan is working on this.

- Added TestTeamScan.hpp
- Renamed team_scan test to team_reduction_scan in TeamReductionScan due to naming conflict
@jrmadsen changed the title from "WIP: TeamScan for CUDA, Pthreads, OpenMPTarget, HIP" to "TeamScan for CUDA, Pthreads, OpenMPTarget, HIP" on Nov 11, 2020
@jrmadsen (Contributor, author)

@dalg24 @masterleinad Do either of y'all know why the Jenkins build failed? I searched the Jenkins log for every instance of "error" and "fail", and essentially the only failure was "script exited with error code 2"; the build and tests appear to be fine.

@masterleinad (Contributor)

/var/jenkins/workspace/Kokkos/core/unit_test/TestTeamScan.hpp:66:16: error: unused variable 'teamSize' [clang-diagnostic-unused-variable]
          auto teamSize   = team.team_size();
               ^
/var/jenkins/workspace/Kokkos/core/unit_test/TestTeamScan.hpp:87:5: note: in instantiation of member function 'Test::TestTeamScan<Kokkos::Serial, short>::operator()' requested here
    (*this)(M, N, a_d, a_r);
    ^
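
A common way to resolve this kind of clang-tidy diagnostic, if the variable is kept rather than deleted, is to mark it as used. A minimal sketch, not necessarily the change that was actually committed:

// Keep the variable but mark it as used, so template instantiations that
// never reference it still pass the unused-variable check.
template <typename TeamMember>
void query_team_size(const TeamMember& team) {
  auto teamSize = team.team_size();
  (void)teamSize;
}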

@masterleinad (Contributor)

/var/jenkins/workspace/Kokkos/install/include/Cuda/Kokkos_Cuda_Parallel.hpp(680): error: calling a constexpr __host__ function("operator()") from a __device__ function("exec_team") is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this.
          detected during:
            instantiation of "std::enable_if<std::is_same<TagType, void>::value, void>::type Kokkos::Impl::ParallelFor<FunctorType, Kokkos::TeamPolicy<Properties...>, Kokkos::Cuda>::exec_team<TagType>(const Kokkos::Impl::ParallelFor<FunctorType, Kokkos::TeamPolicy<Properties...>, Kokkos::Cuda>::Member &) const [with FunctorType=lambda [](const Kokkos::Impl::CudaTeamMember &)->void, Properties=<Kokkos::Cuda>, TagType=void]" 
(731): here
            instantiation of "void Kokkos::Impl::ParallelFor<FunctorType, Kokkos::TeamPolicy<Properties...>, Kokkos::Cuda>::operator()() const [with FunctorType=lambda [](const Kokkos::Impl::CudaTeamMember &)->void, Properties=<Kokkos::Cuda>]" 
/var/jenkins/workspace/Kokkos/install/include/Cuda/Kokkos_Cuda_KernelLaunch.hpp(121): here
            instantiation of "void Kokkos::Impl::cuda_parallel_launch_local_memory(DriverType) [with DriverType=Kokkos::Impl::ParallelFor<lambda [](const Kokkos::Impl::CudaTeamMember &)->void, Kokkos::TeamPolicy<Kokkos::Cuda>, Kokkos::Cuda>]" 
/var/jenkins/workspace/Kokkos/install/include/Cuda/Kokkos_Cuda_KernelLaunch.hpp(319): here
            instantiation of "std::decay_t<decltype((<expression>))> Kokkos::Impl::CudaParallelLaunchKernelFunc<DriverType, Kokkos::LaunchBounds<0U, 0U>, Kokkos::Impl::Experimental::CudaLaunchMechanism::LocalMemory>::get_kernel_func() [with DriverType=Kokkos::Impl::ParallelFor<lambda [](const Kokkos::Impl::CudaTeamMember &)->void, Kokkos::TeamPolicy<Kokkos::Cuda>, Kokkos::Cuda>]" 
/var/jenkins/workspace/Kokkos/install/include/Cuda/Kokkos_Cuda_KernelLaunch.hpp(646): here
            instantiation of "cudaFuncAttributes Kokkos::Impl::CudaParallelLaunchImpl<DriverType, Kokkos::LaunchBounds<MaxThreadsPerBlock, MinBlocksPerSM>, LaunchMechanism>::get_cuda_func_attributes() [with DriverType=Kokkos::Impl::ParallelFor<lambda [](const Kokkos::Impl::CudaTeamMember &)->void, Kokkos::TeamPolicy<Kokkos::Cuda>, Kokkos::Cuda>, MaxThreadsPerBlock=0U, MinBlocksPerSM=0U, LaunchMechanism=Kokkos::Impl::Experimental::CudaLaunchMechanism::LocalMemory]" 
(764): here
            instantiation of "Kokkos::Impl::ParallelFor<FunctorType, Kokkos::TeamPolicy<Properties...>, Kokkos::Cuda>::ParallelFor(const FunctorType &, const Kokkos::Impl::ParallelFor<FunctorType, Kokkos::TeamPolicy<Properties...>, Kokkos::Cuda>::Policy &) [with FunctorType=lambda [](const Kokkos::Impl::CudaTeamMember &)->void, Properties=<Kokkos::Cuda>]" 
/var/jenkins/workspace/Kokkos/install/include/Kokkos_Parallel.hpp(168): here
            instantiation of "void Kokkos::parallel_for(const ExecPolicy &, const FunctorType &, const std::__cxx11::string &, std::enable_if<Kokkos::is_execution_policy<ExecPolicy>::value, void>::type *) [with ExecPolicy=Kokkos::TeamPolicy<Kokkos::Cuda>, FunctorType=lambda [](const Kokkos::Impl::CudaTeamMember &)->void]" 
/var/jenkins/workspace/Kokkos/core/unit_test/TestTeamScan.hpp(80): here
            instantiation of "void Test::TestTeamScan<Device, DataType>::operator()(int32_t, int32_t, Test::TestTeamScan<Device, DataType>::view_type, Test::TestTeamScan<Device, DataType>::view_type) const [with Device=Kokkos::Cuda, DataType=int16_t]" 
/var/jenkins/workspace/Kokkos/core/unit_test/TestTeamScan.hpp(87): here
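
The trace shows a plain host lambda of type [](const Kokkos::Impl::CudaTeamMember&)->void being instantiated inside the CUDA ParallelFor. The usual remedy for this class of error is to mark the lambda with KOKKOS_LAMBDA so its operator() is callable from device code; a sketch of the pattern, assuming that was the cause here:

#include <Kokkos_Core.hpp>

// A plain [=](...) lambda is host-only under nvcc, which triggers the
// "calling a constexpr __host__ function from a __device__ function" error
// above. KOKKOS_LAMBDA annotates the capture so it can run in device code.
void launch_team_kernel(int league_size) {
  using policy_t = Kokkos::TeamPolicy<>;
  Kokkos::parallel_for(
      "launch_team_kernel", policy_t(league_size, Kokkos::AUTO),
      KOKKOS_LAMBDA(const policy_t::member_type& team) {
        (void)team;  // per-team work would go here
      });
}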

@masterleinad (Contributor)

/var/jenkins/workspace/Kokkos/core/unit_test/TestTeamScan.hpp:125:57: error: typedef 'using TEST_POLICY = class Kokkos::TeamPolicy<Kokkos::OpenMP>' locally defined but not used [-Werror=unused-local-typedefs]
   using TEST_POLICY = Kokkos::TeamPolicy<TEST_EXECSPACE>;
                                                         ^

@jrmadsen (Contributor, author)

@crtrott @dalg24 So everything is passing here except for some SYCL stuff, which I didn't touch; it looks like the new tests instantiate something that is incomplete in the SYCL backend. How do we proceed here?

@crtrott (Member) left a review comment:

Aside from adding the test to the SYCL exclude list, this looks good!

@Rombur (Member) left a review comment:

The HIP backend looks good.

@dalg24 (Member) left a review comment:

Please clarify the tolerance used for floating-point numbers.

(Resolved review threads on core/unit_test/TestTeamScan.hpp and core/src/Cuda/Kokkos_Cuda_Team.hpp; comments outdated.)
Co-authored-by: Damien L-G <dalg24+github@gmail.com>
@jrmadsen requested a review from dalg24 on November 18, 2020
@crtrott (Member) commented Nov 18, 2020

Retest this please.

@crtrott (Member) commented Nov 19, 2020

So, interesting question: should we just merge? Windows is likely failing because of the MSVC stuff, Jenkins is the AMD node, and Travis is one timeout ...

@crtrott (Member) commented Nov 19, 2020

Also: Jonathan, do you want to rewrite history, or should I squash-commit?

@Char-Aznable (Contributor) commented Jul 6, 2021

Hi @jrmadsen @crtrott, I have code using something like parallel_scan(TeamThreadRange(team, n), [&](const int i, double& update, const bool isFinal) { ... }); with n = 200, and it aborts at the line

if (BlockSizeMask & blockDim.y) {
  Kokkos::abort("Cuda::cuda_intra_block_scan requires power-of-two blockDim");
}

and

(cuda-gdb) p blockDim.y
$4 = 224
(cuda-gdb) p blockDim.y - 1
$5 = 223

Any idea what went wrong here? Does the code in this PR require the item count n to be the same as the number of threads in the team? It seems to be calling team_member.team_scan(), judging from the stack trace.

@Char-Aznable (Contributor)

Judging from the unit test cases, I guess this implementation only works if the work item count passed to TeamThreadRange is a power of 2?

@jrmadsen deleted the team-parallel-scan branch on July 6, 2021
@jrmadsen (Contributor, author) commented Jul 6, 2021

Judging from the unit test cases, I guess this implementation only works if the work item count passed to TeamThreadRange is a power of 2?

Yes
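
Based on the abort message quoted above (the CUDA intra-block scan requires a power-of-two blockDim), one interim workaround is to request an explicit power-of-two team size in the TeamPolicy. A minimal sketch; the helper below is hypothetical and not part of Kokkos:

#include <Kokkos_Core.hpp>

// Round a positive team size down to a power of two so the CUDA intra-block
// scan precondition (power-of-two blockDim.y) is satisfied. Hypothetical helper.
inline int round_down_to_pow2(int n) {
  int p = 1;
  while (2 * p <= n) p *= 2;
  return p;
}

// Usage sketch: pick an explicit power-of-two team size for the TeamPolicy.
inline Kokkos::TeamPolicy<> make_scan_policy(int league_size,
                                             int desired_team_size) {
  return Kokkos::TeamPolicy<>(league_size,
                              round_down_to_pow2(desired_team_size));
}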

@Char-Aznable (Contributor)

Would it be difficult to support non-power-of-2 work counts? I can help implement it, at least for the CUDA backend. I have a lot of small loops with work counts of a few hundred, and it would be a big performance hit if I were forced to use power-of-2 loops, especially on the host, because I would need to convert all the host-side loops to powers of 2 to stay portable.

@masterleinad (Contributor)

We will discuss this, but in general, contributions are very welcome.

@masterleinad (Contributor)

@Char-Aznable We decided that we are interested in making non-power-of-2 team sizes work; see #4146. Any help in implementing that is very welcome!

@Char-Aznable (Contributor)

Great! I'll take a look at the code and see what I can do
