
Fixup cache CUDA fallback execution space instance used by DualView::sync #3856

Merged

Conversation

@dalg24 (Member) commented Mar 17, 2021

Fixup for #3822
Creating a new execution space instance every time is inefficient. I considered changing get_cuda_space(...) to get_cuda_stream(...), but it turns out Kokkos::Impl::cuda_prefetch_pointer also requires the device ID, so the CUDA execution space instance seems like the right abstraction to pass.

I am proposing to cache the CUDA execution space instance constructed on the deep_copy stream.

void cuda_prefetch_pointer(const Cuda &space, const void *ptr, size_t bytes,
                           bool to_device) {
  if ((ptr == nullptr) || (bytes == 0)) return;
  cudaPointerAttributes attr;
  CUDA_SAFE_CALL(cudaPointerGetAttributes(&attr, ptr));
  // I measured this and it turns out prefetching towards the host slows
  // DualView syncs down. Probably because the latency is not too bad in the
  // first place for the pull down. If we want to change that, provide
  // cudaCpuDeviceId as the device if to_device is false.
#if CUDA_VERSION < 10000
  bool is_managed = attr.isManaged;
#else
  bool is_managed = attr.type == cudaMemoryTypeManaged;
#endif
  if (to_device && is_managed &&
      space.cuda_device_prop().concurrentManagedAccess) {
    CUDA_SAFE_CALL(cudaMemPrefetchAsync(ptr, bytes, space.cuda_device(),
                                        space.cuda_stream()));
  }
}
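
For reference, a minimal sketch of what the proposed caching could look like (the helper name get_cuda_space and this exact form are illustrative, taken from the discussion below rather than the merged code):

namespace Kokkos {
namespace Impl {
// Construct the Cuda execution space instance on the existing deep-copy
// stream exactly once; subsequent calls return the cached instance.
Cuda const &get_cuda_space() {
  static Cuda space(cuda_get_deep_copy_stream());
  return space;
}
}  // namespace Impl
}  // namespace Kokkos

As the thread below shows, this naive form runs into destruction-order problems at program exit.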

@crtrott (Member) commented Mar 17, 2021

If we do this, I think we should keep it at the same point as the special stream, i.e. keep it in core proper.

@dalg24 (Member, Author) commented Mar 17, 2021

terminate called after throwing an instance of 'std::runtime_error'
  what():  Kokkos allocation "InternalScratchFlags" is being deallocated after Kokkos::finalize was called

@crtrott (Member) commented Mar 17, 2021

Making it part of core proper would also probably make it easier to free it in the right spot.

@DavidPoliakoff (Contributor) left a comment:

I like how you handled this, that's really slick.

@@ -720,6 +730,7 @@ void CudaInternal::finalize() {
   if (this == &singleton()) {
     cudaFreeHost(constantMemHostStaging);
     cudaEventDestroy(constantMemReusable);
+    cudaStreamDestroy(cuda_get_deep_copy_stream());
@dalg24 (Member, Author):

Note that now we pay for a stream creation even if we never transfer data between host and device. I don't think that's a real issue, just making sure you see it.

Member:

where is the destruction of the Cuda instance?

@dalg24 (Member, Author):

At the exit of the program
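
That is, a sketch of the ordering in question (illustrative only; get_cuda_space stands for the cached-instance helper discussed in this thread):

#include <Kokkos_Core.hpp>

// A function-local static is destroyed during static teardown, i.e. after
// main() returns, and therefore after Kokkos::finalize() has already run.
int main() {
  Kokkos::initialize();
  Kokkos::Impl::get_cuda_space();  // first call constructs the static instance
  Kokkos::finalize();
  return 0;
}  // the cached Cuda instance is destroyed here, after finalize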

Contributor:

Can't we just call finalize on the static execution space instance we create?

@dalg24 (Member, Author) commented Mar 30, 2021:

No, because this would attempt to decrement tracked memory counters after Kokkos::finalize():

@dalg24 (Member, Author):

SharedAllocationRecord<void, void>* SharedAllocationRecord<
    void, void>::decrement(SharedAllocationRecord<void, void>* arg_record) {
  const int old_count = Kokkos::atomic_fetch_sub(&arg_record->m_count, 1);
  if (old_count == 1) {
    if (!Kokkos::is_initialized()) {
      std::stringstream ss;
      ss << "Kokkos allocation \"";
      ss << arg_record->get_label();
      ss << "\" is being deallocated after Kokkos::finalize was called\n";
      auto s = ss.str();
      Kokkos::Impl::throw_runtime_exception(s);
    }

Contributor:

Are you talking about

if (old_count == 1) {
  if (!Kokkos::is_initialized()) {
    std::stringstream ss;
    ss << "Kokkos allocation \"";
    ss << arg_record->get_label();
    ss << "\" is being deallocated after Kokkos::finalize was called\n";
    auto s = ss.str();
    Kokkos::Impl::throw_runtime_exception(s);
  }

? Decrementing the counter itself should not be a problem, should it?

We set g_is_initialized after calling finalize_all_spaces, so we should not run into this assertion when we also finalize the static execution space used in DualView here. Or am I missing something?
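
A simplified sketch of the ordering being described (names approximate; this is not the actual Kokkos source):

// Spaces are torn down before the initialized flag is flipped, so finalizing
// the cached instance from within finalize_all_spaces() would still see
// Kokkos::is_initialized() == true and avoid the assertion above.
void finalize() {
  Impl::finalize_all_spaces();  // cached DualView instance would go here
  g_is_initialized = false;     // only now does is_initialized() return false
}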

@crtrott (Member) left a comment:

Did I just overlook the destruction of the static Cuda instance? Also, why do we need that extra step with the special constructor etc.?


Kokkos::Cuda const &
Kokkos::Impl::cuda_get_execution_space_with_deep_copy_stream() {
  static Cuda space(Kokkos::Cuda::DeepCopyTag{});
Member:

why this instead of just

static Cuda space(cuda_get_deep_copy_stream())

@dalg24 (Member, Author):

That's what I had initially done in 6a9b3f4, but it gave #3856 (comment).

But now that I look at it again, I agree that the current version is quite convoluted.


@dalg24 (Member, Author) commented Mar 24, 2021

> Did I just overlook the destruction of the static Cuda instance? Also, why do we need that extra step with the special constructor etc.?

OK, I looked into this again. The issue with the constructor that takes a stream is the custom deleter, passed to the HostSharedPtr pointer-to-implementation, that calls CudaInternal::finalize():

Cuda::Cuda(cudaStream_t stream)
    : m_space_instance(new Impl::CudaInternal, [](Impl::CudaInternal *ptr) {
        // custom deleter runs at destruction; for a function-local static
        // that means after Kokkos::finalize() has already been called
        ptr->finalize();
        delete ptr;
      }) {

#3856 (comment)
The static object is destructed at the exit of the program after Kokkos::finalize() has been called.

@dalg24 force-pushed the fixup_dualview_reuse_execspace_instance branch from 8d8a0d7 to f6f3f14 on March 25, 2021 at 18:01.
@dalg24 (Member, Author) commented Mar 25, 2021

@crtrott I implemented what we discussed over the phone, that is, an early return in CudaInternal::finalize() if the singleton has already been finalized, which denotes that Kokkos::finalize() has been invoked.

Comment on lines 666 to 667
// skip if Kokkos::finalize() has already been called
if (!singleton().was_finalized) return;
Contributor:

Does this mean that we are leaking memory in this case (for m_scratchFlags, m_scratchSpace, m_scratchConcurrentBitset, etc.)?

@dalg24 (Member, Author):

Yes

Contributor:

This also results in

$ ./core/unit_test/KokkosCore_UnitTest_CudaInterOpInit 
========= CUDA-MEMCHECK
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from cuda
[ RUN      ] cuda.raw_cuda_interop
[       OK ] cuda.raw_cuda_interop (278 ms)
[----------] 1 test from cuda (278 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (278 ms total)
[  PASSED  ] 1 test.
Kokkos::Cuda ERROR: Failed to call Kokkos::Cuda::finalize()

@dalg24 (Member, Author):

You are right. Not sure why the CI is passing.

@dalg24 (Member, Author):

if (m_stream || m_scratchSpace || m_scratchFlags || m_scratchUnified ||
    m_scratchConcurrentBitset) {
  std::cerr << "Kokkos::Cuda ERROR: Failed to call Kokkos::Cuda::finalize()"
            << std::endl;
}

Contributor:

I don't see any errors when omitting the skip in finalize, and I think that is fine since we are not using the stream for the calls to cudaFree.

@dalg24 (Member, Author):

Oh, I see: we just print to std::cerr and do not actually throw or abort.

@dalg24 (Member, Author):

> I don't see any errors when omitting the skip in finalize, and I think that is fine since we are not using the stream for the calls to cudaFree.

See #3856 (comment)

I am not sure how to go about this. I can revert 980deb8 to see if we get that error again.

@masterleinad (Contributor) previously approved these changes Mar 30, 2021 and left a comment:

Looks good to me.

@masterleinad dismissed their stale review March 30, 2021 22:13:

Apparently, this is still failing for a few tests and needs some more TLC.

@masterleinad (Contributor) commented:

diff --git a/core/src/Cuda/Kokkos_Cuda_Instance.cpp b/core/src/Cuda/Kokkos_Cuda_Instance.cpp
index c1dc5f276..bcba78d9c 100644
--- a/core/src/Cuda/Kokkos_Cuda_Instance.cpp
+++ b/core/src/Cuda/Kokkos_Cuda_Instance.cpp
@@ -662,9 +662,11 @@ void *CudaInternal::resize_team_scratch_space(std::int64_t bytes,
 
 //----------------------------------------------------------------------------
 
+const Kokkos::Cuda& get_cuda_space();
+
 void CudaInternal::finalize() {
-  // skip if Kokkos::finalize() has already been called
-  if (!singleton().was_finalized) return;
+  // skip if finalize() has already been called
+  if(was_finalized) return;
 
   was_finalized = true;
   if (nullptr != m_scratchSpace || nullptr != m_scratchFlags) {
@@ -709,6 +711,7 @@ void CudaInternal::finalize() {
   if (this == &singleton()) {
     cudaFreeHost(constantMemHostStaging);
     cudaEventDestroy(constantMemReusable);
+    Kokkos::Impl::get_cuda_space().impl_internal_space_instance()->finalize();
     cudaStreamDestroy(cuda_get_deep_copy_stream());
   }
 }

seems to work for me.

@dalg24 (Member, Author) left a comment:

I am not able to approve but Daniel's changes are fine by me.

@crtrott merged commit 677e9fd into kokkos:develop on Apr 7, 2021.
@dalg24 deleted the fixup_dualview_reuse_execspace_instance branch on May 24, 2021 at 21:39.