-
Notifications
You must be signed in to change notification settings - Fork 407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix CUDA stream memory leak #3170
Fix CUDA stream memory leak #3170
Conversation
This doesn't compile with UVM and the stream test fails for the other configurations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer not to have std::shared_ptr
in something this fundamental. One option is we could use a Kokkos::View<Impl::CudaInternal, Kokkos::HostSpace>
as a first step, and then conditionally do the finalize in the destructor of Impl::CudaInternal
or something like that.
core/src/Kokkos_Cuda.hpp
Outdated
} | ||
uint32_t impl_instance_id() const noexcept { return 0; } | ||
|
||
private: | ||
Impl::CudaInternal* m_space_instance; | ||
std::shared_ptr<Impl::CudaInternal> m_space_instance; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This adds overhead everywhere in every application, even if they don't use streams. Also, anyone I'm not sure we're comfortable with that. Also, any downstream users who are (probably by accident, but still) copying execution space instances on the device will now get SEGFAULT
s in code that previously worked fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why would it segfault? You mean If they try to assign it inside a kernel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alternatively I would just reference count it explicitly (i.e. add an int, atomic increment it in copy and decrement it destructor (though that makes it non-trivial copyable)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, if they try to copy. std::shared_ptr
has a nontrivial copy constructor that CUDA would try to invoke on the device. Since a lot of our users ignore warnings, it would cause a crash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(though that makes it non-trivial copyable)
It's non-trivially copyable anyway. std::shared_ptr
isn't trivially copyable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we should add a desul::scoped_shared_ptr<AtomicScope>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked, but have nothing to add. But I am trying to start the review process.
141b6f3
to
f081d2e
Compare
The |
Kokkos::atomic_sub(m_counter, 1); | ||
if (*m_counter <= 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need an atomic_sub_fetch or something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs sub_fetch and then check whether its zero.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int count = atomic_sub_fetch(m_counter,1); if(count==0) …
f081d2e
to
4902ac7
Compare
This seems to pass CI finally and I don't see any memory leaks in |
cmake/kokkos_enable_devices.cmake
Outdated
@@ -74,7 +74,7 @@ SET(ClangOpenMPFlag -fopenmp=libomp) | |||
ENDIF() | |||
|
|||
COMPILER_SPECIFIC_FLAGS( | |||
Clang ${ClangOpenMPFlag} | |||
Clang ${ClangOpenMPFlag} -Wno-openmp-mapping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should go away after rebasing.
#ifndef KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_CUDA | ||
if (m_counter == nullptr) return; | ||
int const count = Kokkos::atomic_fetch_sub(m_counter, 1); | ||
if (count <= 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should consider throwing if count is 0, since that would indicate a reference counting failure. We definitely shouldn't delete for anything other than 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can avoid throwing here. If we arrive here and the counter is less than 1 someone else will cleanup and there is nothing left to do.
if (count <= 1) { | ||
delete m_counter; | ||
m_counter = nullptr; | ||
if (m_use_stream) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this check (and the variable m_use_stream) should be unnecessary. Either it was the singleton, and hence m_counter == nullptr and the code returned early, or it was constructed from a stream and needs to call finalize.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree. I should have revisited after deciding how to handle the singleton case.
@@ -193,7 +193,7 @@ TEST(cuda, raw_cuda_streams) { | |||
CUDA_SAFE_CALL(cudaDeviceSynchronize()); | |||
cudaStreamDestroy(stream); | |||
|
|||
int* h_p = new int[100]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah does that resolve a leak?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we never freed the memory allocated here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost good to go. We should throw if the counter is ever returning zero (because it indicates a double free somewhere, and we don't need the m_use_stream, since m_counter==null implies m_use_stream==false and vice versa.
5d77000
to
a3954a2
Compare
@crtrott I addressed your comments. |
Fixes #3167.