Fixup cache CUDA fallback execution space instance used by DualView::sync #3856
Conversation
If we do this I think we should keep it at the same point as the special stream: i.e. keep it in core proper.

Making it part of core proper would also probably make it easier to free it in the right spot.
I like how you handled this, that's really slick.
@@ -720,6 +730,7 @@ void CudaInternal::finalize() {
   if (this == &singleton()) {
     cudaFreeHost(constantMemHostStaging);
     cudaEventDestroy(constantMemReusable);
+    cudaStreamDestroy(cuda_get_deep_copy_stream());
Note that now we pay for a stream create even if we never transferred data between host and device. I don't think that's a real issue, just making sure you see it.
where is the destruction of the Cuda instance?
At the exit of the program
Can't we just call finalize on the static execution space instance we create?
No because this would attempt to decrement tracked memory counters after Kokkos::finalize
kokkos/core/src/impl/Kokkos_SharedAlloc.cpp
Lines 234 to 246 in 0760c20
SharedAllocationRecord<void, void>* SharedAllocationRecord<
    void, void>::decrement(SharedAllocationRecord<void, void>* arg_record) {
  const int old_count = Kokkos::atomic_fetch_sub(&arg_record->m_count, 1);
  if (old_count == 1) {
    if (!Kokkos::is_initialized()) {
      std::stringstream ss;
      ss << "Kokkos allocation \"";
      ss << arg_record->get_label();
      ss << "\" is being deallocated after Kokkos::finalize was called\n";
      auto s = ss.str();
      Kokkos::Impl::throw_runtime_exception(s);
    }
Are you talking about
kokkos/core/src/impl/Kokkos_SharedAlloc.cpp
Lines 238 to 246 in 1fb0c28
if (old_count == 1) {
  if (!Kokkos::is_initialized()) {
    std::stringstream ss;
    ss << "Kokkos allocation \"";
    ss << arg_record->get_label();
    ss << "\" is being deallocated after Kokkos::finalize was called\n";
    auto s = ss.str();
    Kokkos::Impl::throw_runtime_exception(s);
  }
We set g_is_initialized after calling finalize_all_spaces, so we should not run into this assertion when we also finalize the static execution space used in DualView here, or am I missing something?
Did I just overlook the destruction of the static Cuda instance? Also why do we need that extra step with the special constructor etc?
core/src/Cuda/Kokkos_CudaSpace.cpp
Outdated
Kokkos::Cuda const &
Kokkos::Impl::cuda_get_execution_space_with_deep_copy_stream() {
  static Cuda space(Kokkos::Cuda::DeepCopyTag{});
why this instead of just
static Cuda space(cuda_get_deep_copy_stream())
That's what I had initially done in 6a9b3f4, but it gave #3856 (comment)
But now that I look at it again, I agree that the current version is quite convoluted.
@@ -720,6 +730,7 @@ void CudaInternal::finalize() {
   if (this == &singleton()) {
     cudaFreeHost(constantMemHostStaging);
     cudaEventDestroy(constantMemReusable);
+    cudaStreamDestroy(cuda_get_deep_copy_stream());
where is the destruction of the Cuda instance?
OK I looked into this again. The issue with the constructor that takes a stream is the custom deleter passed here:
kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp
Lines 817 to 821 in 9d613c3
#3856 (comment)
8d8a0d7 to f6f3f14 (Compare)
@crtrott I implemented what we discussed over the phone, that is, an early return in CudaInternal::finalize():
// skip if Kokkos::finalize() has already been called
if (!singleton().was_finalized) return;
Does this mean that we are leaking memory in this case (for m_scratchFlags, m_scratchSpace, m_scratchConcurrentBitset, etc.)?
Yes
This also results in
$ ./core/unit_test/KokkosCore_UnitTest_CudaInterOpInit
========= CUDA-MEMCHECK
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from cuda
[ RUN ] cuda.raw_cuda_interop
[ OK ] cuda.raw_cuda_interop (278 ms)
[----------] 1 test from cuda (278 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (278 ms total)
[ PASSED ] 1 test.
Kokkos::Cuda ERROR: Failed to call Kokkos::Cuda::finalize()
You are right. Not sure why the CI is passing.
kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp
Lines 274 to 278 in f6f3f14
if (m_stream || m_scratchSpace || m_scratchFlags || m_scratchUnified ||
    m_scratchConcurrentBitset) {
  std::cerr << "Kokkos::Cuda ERROR: Failed to call Kokkos::Cuda::finalize()"
            << std::endl;
}
I don't see any errors when omitting the skip in finalize, and I think that is fine since we are not using the stream for calls to cudaFree.
Oh I see, we just print to std::cerr and do not actually throw or abort.
I don't see any errors when omitting the skip in finalize, and think that is fine since we are not using the stream for calls to cudaFree.
See #3856 (comment)
I am not sure how to go about this. I can revert 980deb8 to see if we get that error again.
This reverts commit 980deb8.
Looks good to me.
Apparently, this is still failing for a few tests and needs some more TLC.
diff --git a/core/src/Cuda/Kokkos_Cuda_Instance.cpp b/core/src/Cuda/Kokkos_Cuda_Instance.cpp
index c1dc5f276..bcba78d9c 100644
--- a/core/src/Cuda/Kokkos_Cuda_Instance.cpp
+++ b/core/src/Cuda/Kokkos_Cuda_Instance.cpp
@@ -662,9 +662,11 @@ void *CudaInternal::resize_team_scratch_space(std::int64_t bytes,
//----------------------------------------------------------------------------
+const Kokkos::Cuda& get_cuda_space();
+
void CudaInternal::finalize() {
- // skip if Kokkos::finalize() has already been called
- if (!singleton().was_finalized) return;
+ // skip if finalize() has already been called
+ if(was_finalized) return;
was_finalized = true;
if (nullptr != m_scratchSpace || nullptr != m_scratchFlags) {
@@ -709,6 +711,7 @@ void CudaInternal::finalize() {
if (this == &singleton()) {
cudaFreeHost(constantMemHostStaging);
cudaEventDestroy(constantMemReusable);
+ Kokkos::Impl::get_cuda_space().impl_internal_space_instance()->finalize();
cudaStreamDestroy(cuda_get_deep_copy_stream());
}
}
seems to work for me.
I am not able to approve but Daniel's changes are fine by me.
Fixup for #3822
Creating a new execution space instance every time is inefficient. I considered changing get_cuda_space(...) to get_cuda_stream(...), but it turns out Kokkos::Impl::cuda_prefetch_pointer also requires the device ID, so the CUDA execution space instance seems like the right abstraction to pass. I am proposing to cache the CUDA execution space instance constructed on the deep_copy stream.
kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp
Lines 572 to 591 in 018721f