
Fix CUDA stream memory leak #3170

Merged · 5 commits · Jul 11, 2020

Conversation

masterleinad
Contributor

Fixes #3167.

dalg24 previously approved these changes Jul 8, 2020
@Rombur
Member

Rombur commented Jul 8, 2020

This doesn't compile with UVM and the stream test fails for the other configurations.

@crtrott added the Blocks Promotion label (overview issue for release-blocking bugs) on Jul 8, 2020

@dhollman left a comment


I'd prefer not to have std::shared_ptr in something this fundamental. One option would be to use a Kokkos::View<Impl::CudaInternal, Kokkos::HostSpace> as a first step, and then conditionally do the finalize in the destructor of Impl::CudaInternal, or something like that.
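A rough sketch of that direction (the flag name m_manage_stream is made up for illustration, it is not from the patch):

// Let a host-space View own the internal state; Kokkos::View is already
// reference counted on the host and its copy constructor is __host__ __device__,
// so device-side copies do not touch the count.
Kokkos::View<Impl::CudaInternal, Kokkos::HostSpace> m_space_instance;

// The conditional cleanup would then live in the element's destructor:
Impl::CudaInternal::~CudaInternal() {
  if (m_manage_stream) finalize();  // hypothetical flag: only stream-constructed instances finalize
}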

}
uint32_t impl_instance_id() const noexcept { return 0; }

private:
Impl::CudaInternal* m_space_instance;
std::shared_ptr<Impl::CudaInternal> m_space_instance;

This adds overhead everywhere in every application, even if they don't use streams; I'm not sure we're comfortable with that. Also, any downstream users who are (probably by accident, but still) copying execution space instances on the device will now get SEGFAULTs in code that previously worked fine.

Member


Why would it segfault? You mean if they try to assign it inside a kernel?

Member


Alternatively, I would just reference count it explicitly, i.e. add an int, atomically increment it in the copy constructor and decrement it in the destructor (though that makes it non-trivially copyable).
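A minimal sketch of that explicit counting (member names follow the rest of this PR; this is not the merged code):

Cuda(const Cuda& other)
    : m_space_instance(other.m_space_instance), m_counter(other.m_counter) {
  if (m_counter) Kokkos::atomic_increment(m_counter);  // one more owner
}

~Cuda() {
  // last owner cleans up; the default/singleton instance carries no counter
  if (m_counter && Kokkos::atomic_fetch_sub(m_counter, 1) == 1) {
    m_space_instance->finalize();
    delete m_space_instance;
    delete m_counter;
  }
}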


no, if they try to copy. std::shared_ptr has a nontrivial copy constructor that CUDA would try to invoke on the device. Since a lot of our users ignore warnings, it would cause a crash.
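For illustration, the kind of downstream code this refers to (the functor here is made up for the example):

struct Functor {
  Kokkos::Cuda exec;  // with the change above this member now contains a std::shared_ptr
  KOKKOS_FUNCTION void operator()(int) const {
    Kokkos::Cuda copy = exec;  // device-side copy invokes shared_ptr's host-only copy constructor
  }
};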


(though that makes it non-trivially copyable)

It's non-trivially copyable anyway. std::shared_ptr isn't trivially copyable.

Member


maybe we should add a desul::scoped_shared_ptr<AtomicScope>

@calewis requested review from calewis and removed the review request for calewis on July 8, 2020 at 18:56
@calewis self-assigned this on Jul 8, 2020
@dalg24 dismissed their stale review on July 8, 2020 at 19:06

CI does not pass

@calewis removed their assignment on Jul 8, 2020

@calewis left a comment


I looked but have nothing to add; I am just trying to get the review process started.

@masterleinad
Contributor Author

The shared_ptr implementation was failing because we seem to copy instances of this class on the device in the tasking framework (which is only enabled if Kokkos is compiled with relocatable device code) and the destructor would invoke a host function.
I couldn't get it to work with Kokkos::View since I was not able to figure out the correct combination of includes and forward declarations.
Hence, I went with @crtrott's suggestion to basically reimplement the relevant parts of shared_ptr in the class (a poor man's version). In particular, it is important that all the special member functions are actually __host__ __device__ functions for the tests to pass. I disabled any reference counting on the device, though, so as not to mess with the pointers; see the sketch below.
If that solution is acceptable, I am happy to encapsulate it a little better.
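Roughly, what that looks like for the copy constructor (a sketch of the approach, not the exact patch; the destructor follows the same pattern and is quoted further down):

KOKKOS_FUNCTION Cuda(const Cuda& other)
    : m_space_instance(other.m_space_instance), m_counter(other.m_counter) {
#ifndef KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_CUDA
  // counting only happens on the host; device-side copies just copy the raw pointers
  if (m_counter) Kokkos::atomic_add(m_counter, 1);
#endif
}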

Comment on lines 841 to 842
Kokkos::atomic_sub(m_counter, 1);
if (*m_counter <= 0) {
Member


You need an atomic_sub_fetch or something

Member


This needs sub_fetch and then a check whether it's zero.

Member


int count = atomic_sub_fetch(m_counter,1); if(count==0) …
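For reference, sub_fetch returns the new value while fetch_sub returns the previous one, so the same check can be written with the existing Kokkos::atomic_fetch_sub, which is what a later revision of this PR does:

int const count = Kokkos::atomic_fetch_sub(m_counter, 1);  // value before the decrement
if (count == 1) { /* this was the last reference */ }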

@masterleinad
Contributor Author

masterleinad commented Jul 10, 2020

This seems to pass CI finally and I don't see any memory leaks in core/unit_test/KokkosCore_UnitTest_CudaInterOpStreams anymore.

@@ -74,7 +74,7 @@ SET(ClangOpenMPFlag -fopenmp=libomp)
ENDIF()

COMPILER_SPECIFIC_FLAGS(
Clang ${ClangOpenMPFlag}
Clang ${ClangOpenMPFlag} -Wno-openmp-mapping
Contributor Author


This should go away after rebasing.

#ifndef KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_CUDA
if (m_counter == nullptr) return;
int const count = Kokkos::atomic_fetch_sub(m_counter, 1);
if (count <= 1) {
Member


we should consider throwing if count is 0, since that would indicate a reference counting failure. We definitely shouldn't delete for anything other than 1.

Contributor Author


I think we can avoid throwing here. If we arrive here and the counter is less than 1, someone else will clean up and there is nothing left to do.

if (count <= 1) {
delete m_counter;
m_counter = nullptr;
if (m_use_stream) {
Member


this check (and the variable m_use_stream) should be unnecessary. Either it was the singleton, and hence m_counter == nullptr and the code returned early, or it was constructed from a stream and needs to call finalize.

Contributor Author


Yes, I agree. I should have revisited this after deciding how to handle the singleton case.
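Put together, the destructor body then reduces to something like the following (a sketch combining the snippets above, not necessarily the exact merged code):

#ifndef KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_CUDA
  if (m_counter == nullptr) return;  // the singleton never owns a stream
  int const count = Kokkos::atomic_fetch_sub(m_counter, 1);
  if (count == 1) {  // last reference to a stream-constructed instance
    delete m_counter;
    m_space_instance->finalize();
    delete m_space_instance;
  }
#endif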

@@ -193,7 +193,7 @@ TEST(cuda, raw_cuda_streams) {
CUDA_SAFE_CALL(cudaDeviceSynchronize());
cudaStreamDestroy(stream);

int* h_p = new int[100];
Member


ah does that resolve a leak?

Contributor Author


Yes, we never freed the memory allocated here.
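The hunk above only shows the allocation line; one way to close such a leak is a matching delete[] at the end of the test (a sketch, the actual patch may phrase it differently):

int* h_p = new int[100];
// ... copy device results back into h_p and check them ...
delete[] h_p;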

Member

@crtrott left a comment


Almost good to go. We should throw if the counter ever returns zero (because it indicates a double free somewhere), and we don't need m_use_stream, since m_counter == nullptr implies m_use_stream == false and vice versa.

@masterleinad
Contributor Author

@crtrott I addressed your comments.

Labels: Blocks Promotion (overview issue for release-blocking bugs)
6 participants