
Fixup cache CUDA fallback execution space instance used by DualView::sync #3856

Merged

Conversation

@dalg24 (Member) commented Mar 17, 2021

Fixup for #3822
Creating a new execution space instance every time is inefficient. I considered changing get_cuda_space(...) to get_cuda_stream(...), but it turns out Kokkos::Impl::cuda_prefetch_pointer also requires the device ID, so the CUDA execution space instance seems like the right abstraction to pass.

I am proposing to cache the CUDA execution space instance constructed on the deep_copy stream.

void cuda_prefetch_pointer(const Cuda &space, const void *ptr, size_t bytes,
                           bool to_device) {
  if ((ptr == nullptr) || (bytes == 0)) return;
  cudaPointerAttributes attr;
  CUDA_SAFE_CALL(cudaPointerGetAttributes(&attr, ptr));
  // I measured this and it turns out prefetching towards the host slows
  // DualView syncs down. Probably because the latency is not too bad in the
  // first place for the pull down. If we want to change that, provide
  // cudaCpuDeviceId as the device if to_device is false.
#if CUDA_VERSION < 10000
  bool is_managed = attr.isManaged;
#else
  bool is_managed = attr.type == cudaMemoryTypeManaged;
#endif
  if (to_device && is_managed &&
      space.cuda_device_prop().concurrentManagedAccess) {
    CUDA_SAFE_CALL(cudaMemPrefetchAsync(ptr, bytes, space.cuda_device(),
                                        space.cuda_stream()));
  }
}
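
For reference, a minimal sketch of what the proposed caching could look like (the helper name get_cuda_space and this exact form are illustrative, taken from the discussion below rather than the merged code):

namespace Kokkos {
namespace Impl {
// Construct the Cuda execution space instance on the existing deep-copy
// stream exactly once; subsequent calls return the cached instance.
Cuda const &get_cuda_space() {
  static Cuda space(cuda_get_deep_copy_stream());
  return space;
}
}  // namespace Impl
}  // namespace Kokkos

As the thread below shows, this naive form runs into destruction-order problems at program exit.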

@crtrott (Member) commented Mar 17, 2021

If we do this, I think we should keep it at the same point as the special stream, i.e. keep it in core proper.

@dalg24 (Member, Author) commented Mar 17, 2021

terminate called after throwing an instance of 'std::runtime_error'
  what():  Kokkos allocation "InternalScratchFlags" is being deallocated after Kokkos::finalize was called

@crtrott (Member) commented Mar 17, 2021

Making it part of core proper would also probably make it easier to free it in the right spot.

@DavidPoliakoff (Contributor) left a comment:

I like how you handled this, that's really slick.

@@ -720,6 +730,7 @@ void CudaInternal::finalize() {
   if (this == &singleton()) {
     cudaFreeHost(constantMemHostStaging);
     cudaEventDestroy(constantMemReusable);
+    cudaStreamDestroy(cuda_get_deep_copy_stream());
@dalg24 (Member, Author):

Note that now we pay for a stream creation even if we never transfer data between host and device. I don't think that's a real issue, just making sure you see it.

Member:

where is the destruction of the Cuda instance?

@dalg24 (Member, Author):

At the exit of the program
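
That is, a sketch of the ordering in question (illustrative only; get_cuda_space stands for the cached-instance helper discussed in this thread):

#include <Kokkos_Core.hpp>

// A function-local static is destroyed during static teardown, i.e. after
// main() returns, and therefore after Kokkos::finalize() has already run.
int main() {
  Kokkos::initialize();
  Kokkos::Impl::get_cuda_space();  // first call constructs the static instance
  Kokkos::finalize();
  return 0;
}  // the cached Cuda instance is destroyed here, after finalize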

Contributor:

Can't we just call finalize on the static execution space instance we create?

@dalg24 (Member, Author) commented Mar 30, 2021:

No, because this would attempt to decrement tracked memory counters after Kokkos::finalize():

@dalg24 (Member, Author):

SharedAllocationRecord<void, void>* SharedAllocationRecord<
    void, void>::decrement(SharedAllocationRecord<void, void>* arg_record) {
  const int old_count = Kokkos::atomic_fetch_sub(&arg_record->m_count, 1);
  if (old_count == 1) {
    if (!Kokkos::is_initialized()) {
      std::stringstream ss;
      ss << "Kokkos allocation \"";
      ss << arg_record->get_label();
      ss << "\" is being deallocated after Kokkos::finalize was called\n";
      auto s = ss.str();
      Kokkos::Impl::throw_runtime_exception(s);
    }

Contributor:

Are you talking about

if (old_count == 1) {
  if (!Kokkos::is_initialized()) {
    std::stringstream ss;
    ss << "Kokkos allocation \"";
    ss << arg_record->get_label();
    ss << "\" is being deallocated after Kokkos::finalize was called\n";
    auto s = ss.str();
    Kokkos::Impl::throw_runtime_exception(s);
  }

? Decrementing the counter itself should not be a problem, should it?

We set g_is_initialized after calling finalize_all_spaces, so we should not run into this assertion when we also finalize the static execution space used in DualView here. Or am I missing something?
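
A simplified sketch of the ordering being described (names approximate; this is not the actual Kokkos source):

// Spaces are torn down before the initialized flag is flipped, so finalizing
// the cached instance from within finalize_all_spaces() would still see
// Kokkos::is_initialized() == true and avoid the assertion above.
void finalize() {
  Impl::finalize_all_spaces();  // cached DualView instance would go here
  g_is_initialized = false;     // only now does is_initialized() return false
}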

@crtrott (Member) left a comment:

Did I just overlook the destruction of the static Cuda instance? Also, why do we need that extra step with the special constructor etc.?


Kokkos::Cuda const &
Kokkos::Impl::cuda_get_execution_space_with_deep_copy_stream() {
  static Cuda space(Kokkos::Cuda::DeepCopyTag{});
Member:

why this instead of just

static Cuda space(cuda_get_deep_copy_stream())

@dalg24 (Member, Author):

That's what I had initially done in 6a9b3f4, but it gave #3856 (comment).

But now that I look at it again, I agree that the current version is quite convoluted.


@dalg24 (Member, Author) commented Mar 24, 2021

> Did I just overlook the destruction of the static Cuda instance? Also, why do we need that extra step with the special constructor etc.?

OK, I looked into this again. The issue with the constructor that takes a stream is the custom deleter, passed to the HostSharedPtr pointer-to-implementation, that calls CudaInternal::finalize():

Cuda::Cuda(cudaStream_t stream)
    : m_space_instance(new Impl::CudaInternal, [](Impl::CudaInternal *ptr) {
        // custom deleter runs at destruction; for a function-local static
        // that means after Kokkos::finalize() has already been called
        ptr->finalize();
        delete ptr;
      }) {

#3856 (comment)
The static object is destructed at the exit of the program after Kokkos::finalize() has been called.

@dalg24 force-pushed the fixup_dualview_reuse_execspace_instance branch from 8d8a0d7 to f6f3f14 on March 25, 2021 at 18:01.
@dalg24 (Member, Author) commented Mar 25, 2021

@crtrott I implemented what we discussed over the phone, that is, an early return in CudaInternal::finalize() if the singleton has already been finalized, which denotes that Kokkos::finalize() has been invoked.

Comment on lines 666 to 667
// skip if Kokkos::finalize() has already been called
if (!singleton().was_finalized) return;
Contributor:

Does this mean that we are leaking memory in this case (for m_scratchFlags, m_scratchSpace, m_scratchConcurrentBitset, etc.)?

@dalg24 (Member, Author):

Yes

Contributor:

This also results in

$ ./core/unit_test/KokkosCore_UnitTest_CudaInterOpInit 
========= CUDA-MEMCHECK
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from cuda
[ RUN      ] cuda.raw_cuda_interop
[       OK ] cuda.raw_cuda_interop (278 ms)
[----------] 1 test from cuda (278 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (278 ms total)
[  PASSED  ] 1 test.
Kokkos::Cuda ERROR: Failed to call Kokkos::Cuda::finalize()

@dalg24 (Member, Author):

You are right. Not sure why the CI is passing.

@dalg24 (Member, Author):

if (m_stream || m_scratchSpace || m_scratchFlags || m_scratchUnified ||
    m_scratchConcurrentBitset) {
  std::cerr << "Kokkos::Cuda ERROR: Failed to call Kokkos::Cuda::finalize()"
            << std::endl;
}

Contributor:

I don't see any errors when omitting the skip in finalize, and I think that is fine since we are not using the stream for the calls to cudaFree.

@dalg24 (Member, Author):

Oh, I see: we just print to std::cerr and do not actually throw or abort.

@dalg24 (Member, Author):

> I don't see any errors when omitting the skip in finalize, and I think that is fine since we are not using the stream for the calls to cudaFree.

See #3856 (comment)

I am not sure how to go about this. I can revert 980deb8 to see if we get that error again.

@masterleinad (Contributor) previously approved these changes Mar 30, 2021 and left a comment:

Looks good to me.

@masterleinad dismissed their stale review March 30, 2021 22:13:

Apparently, this is still failing for a few tests and needs some more TLC.

@masterleinad (Contributor) commented:

diff --git a/core/src/Cuda/Kokkos_Cuda_Instance.cpp b/core/src/Cuda/Kokkos_Cuda_Instance.cpp
index c1dc5f276..bcba78d9c 100644
--- a/core/src/Cuda/Kokkos_Cuda_Instance.cpp
+++ b/core/src/Cuda/Kokkos_Cuda_Instance.cpp
@@ -662,9 +662,11 @@ void *CudaInternal::resize_team_scratch_space(std::int64_t bytes,
 
 //----------------------------------------------------------------------------
 
+const Kokkos::Cuda& get_cuda_space();
+
 void CudaInternal::finalize() {
-  // skip if Kokkos::finalize() has already been called
-  if (!singleton().was_finalized) return;
+  // skip if finalize() has already been called
+  if(was_finalized) return;
 
   was_finalized = true;
   if (nullptr != m_scratchSpace || nullptr != m_scratchFlags) {
@@ -709,6 +711,7 @@ void CudaInternal::finalize() {
   if (this == &singleton()) {
     cudaFreeHost(constantMemHostStaging);
     cudaEventDestroy(constantMemReusable);
+    Kokkos::Impl::get_cuda_space().impl_internal_space_instance()->finalize();
     cudaStreamDestroy(cuda_get_deep_copy_stream());
   }
 }

seems to work for me.

@dalg24 (Member, Author) left a comment:

I am not able to approve but Daniel's changes are fine by me.

@crtrott merged commit 677e9fd into kokkos:develop on Apr 7, 2021.
@dalg24 deleted the fixup_dualview_reuse_execspace_instance branch on May 24, 2021 at 21:39.