Fix multi stream scratch #3269

crtrott · 2020-08-11T01:16:24Z

This fixes #3246 by making team scratch allocations a per-Cuda-instance property instead of a global one.
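A minimal host-only sketch of the idea, not the actual Kokkos implementation: each instance owns its scratch pointer and current size, so growing the scratch for one instance cannot invalidate the buffer another instance is using. The class `InstanceSketch` and the methods `resize_team_scratch` and `team_scratch_size` are hypothetical names; the member names are borrowed from the diff in this PR, and `std::realloc` stands in for the device allocator.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical stand-in for a per-stream execution space instance.
// Host realloc/free replace the device allocator for illustration only.
class InstanceSketch {
 public:
  InstanceSketch() = default;
  InstanceSketch(const InstanceSketch&) = delete;
  InstanceSketch& operator=(const InstanceSketch&) = delete;
  ~InstanceSketch() { std::free(m_team_scratch_ptr); }

  // Return a scratch buffer of at least `bytes`; grow only when needed.
  void* resize_team_scratch(std::int64_t bytes) {
    if (bytes > m_team_scratch_current_size) {
      void* p = std::realloc(m_team_scratch_ptr,
                             static_cast<std::size_t>(bytes));
      if (p == nullptr) return nullptr;  // the old buffer stays valid
      m_team_scratch_ptr          = p;
      m_team_scratch_current_size = bytes;
    }
    return m_team_scratch_ptr;
  }

  std::int64_t team_scratch_size() const {
    return m_team_scratch_current_size;
  }

 private:
  // Per-instance state: nothing here is shared between instances.
  std::int64_t m_team_scratch_current_size = 0;
  void* m_team_scratch_ptr                 = nullptr;
};
```

With a single global buffer, a larger request from one stream would reallocate memory that a kernel on another stream might still be using; keeping the state per instance avoids that by construction.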

Reproduces issue kokkos#3246

dalg24

Looks good.
These are minor comments.
Also, I would prefer if you drop the Kokkos:: namespace for kokkos_malloc and kokkos_free.

core/src/Cuda/Kokkos_Cuda_Instance.cpp

dalg24 · 2020-08-11T03:29:40Z

core/src/Cuda/Kokkos_Cuda_Instance.hpp

+  mutable int64_t m_team_scratch_current_size;
+  mutable void* m_team_scratch_ptr;


Why did you declare them mutable?

Cuda::impl_internal_space_instance returns a pointer to non-const*. So I don't think you need it in principle.

kokkos/core/src/Kokkos_Cuda.hpp

Lines 254 to 256 in c01580a

inline Impl::CudaInternal* impl_internal_space_instance() const {
  return m_space_instance;
}

*Whether it should is another question 🙄

Just talked to Damien: this is an interesting problem, but maybe a bigger one? We can either get rid of the mutable and leave it a non-const pointer, or leave the mutable and make this a const pointer. What is the best pattern here? @nliber

Maybe we leave both of them for now and come back to this later.
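For reference, the two options under discussion could look roughly like the sketch below. All type names (`InternalA`, `HandleA`, `InternalB`, `HandleB`) and the `grow` method are hypothetical and only mirror the shape of the real classes; this is an illustration of the trade-off, not a proposal for the actual code.

```cpp
#include <cstdint>

// Option A: keep `mutable` and hand out a pointer-to-const.
// Const member functions may still update the cached scratch size.
struct InternalA {
  mutable std::int64_t m_team_scratch_current_size = 0;
  void grow(std::int64_t bytes) const {
    if (bytes > m_team_scratch_current_size)
      m_team_scratch_current_size = bytes;
  }
};
struct HandleA {
  InternalA* m_space_instance;
  const InternalA* impl_internal_space_instance() const {
    return m_space_instance;
  }
};

// Option B: drop `mutable` and hand out a pointer-to-non-const.
// Mutation is visible in the type of the returned pointer.
struct InternalB {
  std::int64_t m_team_scratch_current_size = 0;
  void grow(std::int64_t bytes) {
    if (bytes > m_team_scratch_current_size)
      m_team_scratch_current_size = bytes;
  }
};
struct HandleB {
  InternalB* m_space_instance;
  InternalB* impl_internal_space_instance() const {
    return m_space_instance;
  }
};
```

Option A hides the mutation behind a const interface, while Option B makes it explicit to callers; having both `mutable` members and a non-const pointer, as in the current diff, is redundant but harmless.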

core/unit_test/cuda/TestCuda_TeamScratch.cpp

CI does not pass

core/unit_test/cuda/TestCuda_TeamScratch.cpp

Rombur

The CUDA 11 tester and clang-tidy still fail. The other CUDA builds are passing.

Rombur · 2020-09-08T13:15:04Z

core/unit_test/cuda/TestCuda_TeamScratch.cpp

+    cudaStreamCreate(&stream[i]);
+    cuda[i] = Kokkos::Cuda(stream[i]);
+  }
+  // Test that growing scratch size in subsequent calls doesn' crash things


Suggested change

// Test that growing scratch size in subsequent calls doesn' crash things

// Test that growing scratch size in subsequent calls doesn't crash things

Does somebody else want to unroll all the changes and redo the commits to add the t? I don't have time right now. If nobody else feels the urgent need I will ignore this :-)

Also fix situation with UVM as default memory space for CUDA. Forgot for the realloc to explicitly use CudaSpace to match the alloc.

dalg24 · 2020-09-09T11:11:22Z

CUDA-11.0-NVCC-C++17-RDC build fails with

terminate called after throwing an instance of 'std::runtime_error'
  what():  Kokkos::Impl::ParallelFor< Cuda > requested too large team size.

Apparently the CUDA function first sets all the members to zero before writing to them. In a multithreaded environment where each thread calls the same kernel, that can lead to a race.

crtrott · 2020-09-16T06:03:45Z

OK, found the race and fixed it. It's in the caching of the CUDA functor attributes. A comment (and commit message) is added to explain the race and the rationale for the fix.

DavidPoliakoff

Address Bruno's comment on the typo in your comment and it's good.

dalg24 · 2020-09-16T15:28:06Z

core/src/Cuda/Kokkos_Cuda_KernelLaunch.hpp

-    static bool attr_set = false;
-    if (!attr_set) {
+    // Race condition inside of cudaFuncGetAttributes if the same address is
+    // given requires using a local variable as input instead of a static Rely


Suggested change

// given requires using a local variable as input instead of a static Rely

// given requires using a local variable as input instead of a static. Rely

dalg24 · 2020-09-16T16:03:56Z

core/src/Cuda/Kokkos_Cuda_KernelLaunch.hpp

-    static cudaFuncAttributes attr;
-    static bool attr_set = false;
-    if (!attr_set) {
+    // Race condition inside of cudaFuncGetAttributes if the same address is


Should we mark it as a workaround for CUDA 11? Is that something we are going to report upstream?

I don't think it's in principle a workaround just for CUDA 11. There is no implementation guarantee for the function; it's just bad API design to take pointers to stuff instead of just returning the struct.
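A host-only sketch of the pattern discussed in this thread, under stated assumptions: `FuncAttributes` and `query_attributes` are hypothetical stand-ins for `cudaFuncAttributes` and a C-style query that clears its output before filling it, and the values 1024 and 32 are made up. The query writes into a local variable, and the result is cached through a function-local static, whose initialization is guaranteed to be thread-safe since C++11.

```cpp
#include <cstring>

// Hypothetical stand-in for cudaFuncAttributes.
struct FuncAttributes {
  int maxThreadsPerBlock;
  int numRegs;
};

// Hypothetical stand-in for a C-style query that zeroes the output struct
// before filling it. If several threads passed the address of the same
// static object, one thread could read zeros while another is mid-update.
void query_attributes(FuncAttributes* out) {
  std::memset(out, 0, sizeof(*out));
  out->maxThreadsPerBlock = 1024;  // illustrative value
  out->numRegs            = 32;    // illustrative value
}

const FuncAttributes& cached_attributes() {
  // Only one thread runs the initializer; others block until it completes.
  // The query itself only ever sees the thread-local temporary `tmp`.
  static const FuncAttributes attr = [] {
    FuncAttributes tmp;
    query_attributes(&tmp);
    return tmp;
  }();
  return attr;
}
```

Returning the struct by value from the lambda is effectively the "just return the struct" interface mentioned above, wrapped around a pointer-taking call.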


crtrott added 2 commits August 10, 2020 19:51

CUDA Add Test for MultiStream TeamScratch Issue

5bf7625

Reproduces issue kokkos#3246

CUDA fix multiple stream scratch Level 1 conflict

94d5c39

Fixes issue kokkos#3246

crtrott force-pushed the fix-multi-stream-scratch branch from 33ac6dd to 94d5c39 Compare August 11, 2020 02:51

dalg24 reviewed Aug 11, 2020

dalg24 previously approved these changes Sep 4, 2020

dalg24 reviewed Sep 5, 2020

crtrott force-pushed the fix-multi-stream-scratch branch from 12c25d1 to 538e771 Compare September 7, 2020 22:45

Rombur reviewed Sep 8, 2020

Cuda Stream scratch: Change Test to use functor instead of lambda

9206d4b

Also fix situation with UVM as default memory space for CUDA. Forgot for the realloc to explicitly use CudaSpace to match the alloc.

crtrott force-pushed the fix-multi-stream-scratch branch from 538e771 to 9206d4b Compare September 9, 2020 05:14

ndellingwood mentioned this pull request Sep 9, 2020

Candidates for cherry-picking into 3.2.1 #3334

Closed

CUDA: Fix thread unsafe setting of functor attributes.

e3350ee

Apparently the CUDA function first sets all the members to zero before writing to them. In a multithreaded environment where each thread calls the same kernel, that can lead to a race.

DavidPoliakoff approved these changes Sep 16, 2020

dalg24 reviewed Sep 16, 2020

dalg24 merged commit d13b4e7 into kokkos:develop Sep 18, 2020

ndellingwood added a commit to ndellingwood/kokkos that referenced this pull request Sep 22, 2020

Manual changes for the realloc to explicitly use CudaSpace to match a…

b3fa633

…lloc From kokkos PR kokkos#3269 commit: 9206d4b Modified file: core/src/Cuda/Kokkos_Cuda_Instance.cpp

crtrott mentioned this pull request Oct 8, 2020

Cuda Streams - some notes #3321

Open

masterleinad mentioned this pull request Oct 28, 2021

Scratch Allocations in Level 1 will conflict when using multiple streams #3246

Closed

PhilMiller mentioned this pull request Oct 6, 2022

Fix multi-stream team scratch space definition for HIP #3398

Merged

crtrott deleted the fix-multi-stream-scratch branch December 19, 2022 17:17
