Unique token improvement #4741

Merged: 12 commits, Feb 12, 2022

@crtrott (Member) commented Feb 1, 2022:

Align interface with other Kokkos interfaces which default the execution space.

This also replaces the bitset with plain int arrays used as locks for UniqueToken in HIP and CUDA.
As a consequence, atomic conflicts are vastly reduced, and by default you get much better aligned indices within a warp, improving data access patterns in common use cases.
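
As a rough illustration of the lock-array idea, here is a plain C++ sketch using `std::atomic` on the host. It is not the CUDA/HIP code from this PR: the class name `LockArrayToken` and the `hint` parameter are made up for this example, and the probing strategy is an assumption.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration only, NOT the Kokkos implementation.
// One int-sized lock per token: 0 = free, 1 = taken.
class LockArrayToken {
 public:
  explicit LockArrayToken(std::size_t n) : locks_(n) {
    assert(n > 0);  // the modulo below needs at least one slot
    for (auto& l : locks_) l.store(0, std::memory_order_relaxed);
  }

  std::size_t size() const { return locks_.size(); }

  // Start probing at `hint` (for example a thread id) so that neighbouring
  // threads tend to claim neighbouring slots. Spins until a slot is free.
  int acquire(std::size_t hint) {
    std::size_t idx = hint % locks_.size();
    for (;;) {
      int expected = 0;
      if (locks_[idx].compare_exchange_strong(expected, 1,
                                              std::memory_order_acquire)) {
        return static_cast<int>(idx);
      }
      idx = (idx + 1) % locks_.size();  // slot taken, try the next one
    }
  }

  // Mark the slot as free again.
  void release(int idx) {
    locks_[static_cast<std::size_t>(idx)].store(0, std::memory_order_release);
  }

 private:
  std::vector<std::atomic<int>> locks_;
};
```

Because each token has its own word, two threads only contend when they probe the same slot, whereas with a packed bitset many tokens share one word and every update to that word can conflict.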

Here is a performance example:

#include <Kokkos_Core.hpp>
#include <cstdio>
#include <cstdlib>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
     // N: number of work items, R: updates per work item
     int N = (argc>1) ? std::atoi(argv[1]) : 30000;
     int R = (argc>2) ? std::atoi(argv[2]) : 1;

     Kokkos::Experimental::UniqueToken<> token;
     Kokkos::View<double*> a("A",token.size());

     // f1: index the array through an acquired token
     auto f1 = KOKKOS_LAMBDA(int) {
       int idx = token.acquire();
       for(int r=0; r<R; r++)
         a(idx) += a(idx)+1;
       Kokkos::memory_fence();
       token.release(idx);
     };
     // f2: baseline, index the array directly (no token)
     auto f2 = KOKKOS_LAMBDA(int i) {
       int idx = i % a.extent(0);
       for(int r=0; r<R; r++)
         a(idx) += a(idx)+1;
       Kokkos::memory_fence();
     };

     // Warm-up runs, not timed
     Kokkos::parallel_for(N, f1);
     Kokkos::parallel_for(N, f2);
     Kokkos::fence();

     Kokkos::Timer timer;
     Kokkos::parallel_for(N, f1);
     Kokkos::fence();
     double time1 = timer.seconds();
     timer.reset();
     Kokkos::parallel_for(N, f2);
     Kokkos::fence();
     double time2 = timer.seconds();

     // Print both times in microseconds: token kernel, then baseline
     std::printf("%lf %lf\n",time1*1.e6, time2*1.e6);
  }
  Kokkos::finalize();
}

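Prior to the change (times in microseconds; first column is the UniqueToken kernel, second column is the baseline without a token):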
A100:

bash-4.2$ ./test.cuda 30000 1
1823.098000 16.880000
bash-4.2$ ./test.cuda 30000000 1
152508.359000 207.732000

MI100:

// MI100
bash-4.2$ ./test.host 30000 1
1585.862000 25.080000
bash-4.2$ ./test.host 30000000 1
495600.250000 357.516000

Post change:
A100:

bash-4.2$ ./test.cuda 30000 1
17.940000 14.940000
bash-4.2$ ./test.cuda 30000000 1
836.698000 370.924000

MI100:

bash-4.2$ ./test.host 30000 1
26.680000 24.060000
bash-4.2$ ./test.host 30000000 1
958.009000 353.497000
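
To put the numbers above in perspective, the speedup of the UniqueToken kernel can be computed as time before divided by time after. The small helper below is only a convenience sketch (the names `speedup`, `Row`, `kRows`, and `print_speedups` are made up here); the timings are the first-column values copied from the runs above. It works out to roughly 100x and 180x on the A100, and roughly 60x and 520x on the MI100.

```cpp
#include <cassert>
#include <cstdio>

// Speedup of the UniqueToken kernel = time before / time after.
constexpr double speedup(double before_us, double after_us) {
  return before_us / after_us;
}

// First-column timings (microseconds) from the runs reported above.
struct Row {
  const char* label;
  double before_us;
  double after_us;
};

constexpr Row kRows[] = {
    {"A100  N=30000", 1823.098, 17.940},
    {"A100  N=30000000", 152508.359, 836.698},
    {"MI100 N=30000", 1585.862, 26.680},
    {"MI100 N=30000000", 495600.250, 958.009},
};

// Print one line per configuration with the computed speedup.
void print_speedups() {
  for (const Row& r : kRows) {
    std::printf("%-18s %7.1fx\n", r.label, speedup(r.before_us, r.after_us));
  }
}
```

After the change the token kernel is within about 1.2x of the baseline at N=30000 on both GPUs, and within about 2.3x to 2.7x at N=30000000.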

@masterleinad (Contributor) left a comment:

Is there a good reason to not also do that for SYCL?

@crtrott force-pushed the unique-token-improvement branch 3 times, most recently from bf3b861 to e96ca20 (February 2, 2022 03:38)
@@ -45,9 +45,12 @@
#ifndef KOKKOS_HIP_UNIQUE_TOKEN_HPP
#define KOKKOS_HIP_UNIQUE_TOKEN_HPP

#include <impl/Kokkos_ConcurrentBitset.hpp>
#include <Kokkos_Macros.hpp>
#ifdef KOKKOS_ENABLE_HIP
Member:

This is not needed. If HIP is not enabled, CMake will ignore the file.

@crtrott (Member, Author):

removed

@@ -59,8 +59,8 @@ enum class UniqueTokenScope : int { Instance, Global };
///
/// This object should behave like a ref-counted object, so that when the last
/// instance is destroy resources are free if needed
-template <typename ExecutionSpace,
-          UniqueTokenScope = UniqueTokenScope::Instance>
+template <typename ExecutionSpace = Kokkos::DefaultExecutionSpace,
Member:

Why did you change this? It is not in the description of the PR.

@crtrott (Member, Author):

It is now ...

Everything else in Kokkos defaults the execution space. I think this should too.
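
For readers unfamiliar with the pattern, the effect of defaulting the execution space can be shown with stand-in types. Everything prefixed with `Mock` below is hypothetical and exists only for this sketch; only the `UniqueTokenScope` enum line is taken from the diff above.

```cpp
#include <cassert>
#include <type_traits>

// Stand-ins for illustration only; the real types live in Kokkos.
struct MockSerial {};
struct MockCuda {};
// Assumption for this sketch: a build whose default execution space is CUDA.
using MockDefaultExecutionSpace = MockCuda;

enum class UniqueTokenScope : int { Instance, Global };

// With a defaulted first parameter, MockUniqueToken<> is a valid type.
template <typename ExecutionSpace = MockDefaultExecutionSpace,
          UniqueTokenScope Scope = UniqueTokenScope::Instance>
struct MockUniqueToken {
  using execution_space = ExecutionSpace;
  static constexpr UniqueTokenScope scope = Scope;
};

// Empty argument list picks both defaults.
static_assert(
    std::is_same<MockUniqueToken<>::execution_space, MockCuda>::value,
    "empty argument list selects the default execution space");
// Spelling the space explicitly still works as before.
static_assert(
    std::is_same<MockUniqueToken<MockSerial>::execution_space,
                 MockSerial>::value,
    "explicit execution space is unchanged");
```

This is what allows the performance example in the PR description to write `Kokkos::Experimental::UniqueToken<> token;` without naming a backend.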

@crtrott force-pushed the unique-token-improvement branch 3 times, most recently from 77c4f85 to 4cd03d6 (February 2, 2022 18:41)
@crtrott added the "Blocks Promotion" label (Overview issue for release-blocking bugs) Feb 2, 2022
@ajpowelsnl added this to "In progress" in Kokkos Release 3.6 via automation Feb 7, 2022
@ajpowelsnl (Contributor) commented:

Hi @crtrott and @dalg24, a couple of questions about this issue: 1) Will this item be in Kokkos 3.6? 2) If so, is it still blocked?

@ajpowelsnl added this to "In progress" in Kokkos Release 3.7 -- 2022 Target Date via automation Feb 7, 2022
@ajpowelsnl removed this from "In progress" in Kokkos Release 3.6 Feb 7, 2022
@ajpowelsnl removed the "Blocks Promotion" label Feb 7, 2022
@crtrott added the "Blocks Promotion" label Feb 7, 2022
@crtrott (Member, Author) commented Feb 8, 2022:

@dalg24 @masterleinad finally passed!!

Comment on lines 161 to 167
UniqueToken(size_type max_size) {
m_locks = Kokkos::View<uint32_t*, Kokkos::CudaSpace>(
"Kokkos::UniqueToken::m_locks", max_size);
}
UniqueToken(size_type max_size, execution_space const& exec) {
m_locks = Kokkos::View<uint32_t*, Kokkos::CudaSpace>(
Kokkos::view_alloc(exec, "Kokkos::UniqueToken::m_locks"), max_size);
Contributor:

I slightly prefer to define and call protected constructors here, to avoid allocating memory for m_locks twice, like in https://github.com/kokkos/kokkos/pull/4748/files.

@crtrott (Member, Author):

What do you mean by allocating twice?
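
One possible reading of the reviewer's concern, not confirmed anywhere in this thread, is that a base-class constructor allocates a default-sized lock array and the derived constructor then replaces it with a second allocation. The sketch below shows that pattern and the protected-constructor alternative using mock types; the class names, the default size of 128, and the allocation counter are all assumptions made for illustration and do not reflect the actual Kokkos classes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a device allocation; counts buffers created.
static int g_allocations = 0;

struct MockLocks {
  std::vector<unsigned> data;
  MockLocks() = default;  // empty handle, no allocation
  explicit MockLocks(std::size_t n) : data(n) { ++g_allocations; }
};

// Pattern A: the base default constructor allocates, then the derived
// constructor body replaces the array, so two allocations happen.
struct GlobalTokenA {
  MockLocks m_locks;
  GlobalTokenA() : m_locks(128) {}  // default-sized lock array (assumed size)
};
struct InstanceTokenA : GlobalTokenA {
  explicit InstanceTokenA(std::size_t max_size) {  // implicitly runs GlobalTokenA()
    m_locks = MockLocks(max_size);                 // second allocation
  }
};

// Pattern B: a protected base constructor takes the size, and the derived
// constructor forwards to it, so only one allocation happens.
struct GlobalTokenB {
  MockLocks m_locks;
  GlobalTokenB() : m_locks(128) {}

 protected:
  explicit GlobalTokenB(std::size_t n) : m_locks(n) {}
};
struct InstanceTokenB : GlobalTokenB {
  explicit InstanceTokenB(std::size_t max_size) : GlobalTokenB(max_size) {}
};
```

Whether the constructors quoted above actually sit in such a base/derived arrangement is not visible in this excerpt, so this only shows the general C++ pattern the suggestion refers to.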

Co-authored-by: Damien L-G <dalg24+github@gmail.com>
@masterleinad (Contributor) commented:

It looks like m_scratchConcurrentBitset is unused now.

Uses static lock array for global variant, and locally owned
lock array for instance variant.
@masterleinad (Contributor) commented:

We should still be able to get rid of the now unused m_scratchConcurrentBitset.

@dalg24 (Member) commented Feb 10, 2022:

Retest this please

@dalg24 (Member) commented Feb 10, 2022:

4: [ RUN      ] openmptarget.unique_token_global
4: /var/jenkins/workspace/Kokkos/core/unit_test/TestUniqueToken.hpp:151: Failure
4: Expected equality of these values:
4:   sum
4:     Which is: 9999980
4:   int64_t(N) * R
4:     Which is: 10000000
4: [  FAILED  ] openmptarget.unique_token_global (280 ms)

@crtrott (Member, Author) commented Feb 10, 2022:

I don't get how either of these guys is failing (OpenMPTarget and OpenMP for NVHPC). I didn't really touch that stuff at all, did I?

@dalg24 (Member) commented Feb 10, 2022:

Retest this please

@masterleinad (Contributor) commented:

We have seen that OpenMPTarget test failure occasionally.

@masterleinad added this to "In progress" in Kokkos Release 3.6 via automation Feb 10, 2022
@masterleinad (Contributor) left a comment:

Why should we keep m_scratchConcurrentBitset?

@crtrott (Member, Author) commented Feb 11, 2022:

Another PR failing UniqueToken in OpenMP here: https://cloud.cees.ornl.gov/jenkins-ci/blue/organizations/jenkins/Kokkos/detail/Kokkos/8028/pipeline/52

4: [ RUN      ] openmptarget.unique_token_global
4: /var/jenkins/workspace/Kokkos/core/unit_test/TestUniqueToken.hpp:151: Failure
4: Expected equality of these values:
4:   sum
4:     Which is: 9999990
4:   int64_t(N) * R
4:     Which is: 10000000
4: [  FAILED  ] openmptarget.unique_token_global (261 ms)

@dalg24 (Member) commented Feb 12, 2022:

Failure is unrelated (Clang crashing when compiling Kokkos::complex unit test, sigh)

@dalg24 merged commit 9897502 into kokkos:develop Feb 12, 2022
Kokkos Release 3.6 automation moved this from "In progress" to "Done" Feb 12, 2022
@dalg24 removed the "Blocks Promotion" label Feb 12, 2022
@crtrott deleted the unique-token-improvement branch February 12, 2022 04:39
@ajpowelsnl added the "InDevelop" label (Enhancement, fix, etc. has been merged into the develop branch) Feb 15, 2022