
Fix multi stream scratch #3269

Merged: 4 commits merged into kokkos:develop from fix-multi-stream-scratch on Sep 18, 2020

Conversation

@crtrott (Member) commented Aug 11, 2020

This fixes #3246 by making team scratch allocations a per-Cuda-instance property instead of a global one.
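For context, a minimal sketch of the scenario this change targets (illustrative only, not the PR's test; run_two_streams and launch_with_scratch are made-up names): two Cuda execution space instances on separate streams, each launching a team kernel with its own per-team scratch request, where each instance now owns and grows its own scratch buffer.

#include <Kokkos_Core.hpp>
#include <cuda_runtime.h>

// Unmanaged view living in the Cuda execution space's scratch memory.
using scratch_view_t =
    Kokkos::View<double*, Kokkos::Cuda::scratch_memory_space,
                 Kokkos::MemoryTraits<Kokkos::Unmanaged>>;

void launch_with_scratch(const Kokkos::Cuda& exec, int n_per_team) {
  using policy_t = Kokkos::TeamPolicy<Kokkos::Cuda>;
  const size_t bytes = scratch_view_t::shmem_size(n_per_team);
  Kokkos::parallel_for(
      policy_t(exec, 64, Kokkos::AUTO)
          .set_scratch_size(1, Kokkos::PerTeam(bytes)),
      KOKKOS_LAMBDA(const policy_t::member_type& team) {
        // Each team gets a level-1 (global-memory backed) scratch view.
        scratch_view_t scr(team.team_scratch(1), n_per_team);
        if (team.team_rank() == 0) scr(0) = 1.0;
      });
}

void run_two_streams() {
  cudaStream_t stream0, stream1;
  cudaStreamCreate(&stream0);
  cudaStreamCreate(&stream1);
  {
    Kokkos::Cuda cuda0(stream0), cuda1(stream1);
    // With per-instance scratch these launches can be issued concurrently,
    // and each instance grows its own buffer if a later launch asks for more.
    launch_with_scratch(cuda0, 512);
    launch_with_scratch(cuda1, 1024);
    Kokkos::fence();
  }
  cudaStreamDestroy(stream0);
  cudaStreamDestroy(stream1);
}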

@dalg24 (Member) left a comment

Looks good.
These are minor comments.
Also, I would prefer that you drop the Kokkos:: namespace qualifier for kokkos_malloc and kokkos_free.

core/src/Cuda/Kokkos_Cuda_Instance.cpp (outdated, resolved)
Comment on lines +123 to +124
mutable int64_t m_team_scratch_current_size;
mutable void* m_team_scratch_ptr;

Member:

Why did you declare them mutable?

Cuda::impl_internal_space_instance returns a pointer to non-const*, so I don't think you need it in principle:

inline Impl::CudaInternal* impl_internal_space_instance() const {
  return m_space_instance;
}

*Whether it should is another question 🙄

@crtrott (Member, Author):

Just talking to Damien: this is an interesting problem, but maybe a bigger one? We can either get rid of the mutable and leave it a non-const pointer, or leave the mutable and make this a const pointer. What is the best pattern here? @nliber
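For illustration, a minimal sketch of the two alternatives being discussed (shortened names, not the actual CudaInternal code):

#include <cstdint>

// Option A: no mutable; the const accessor returns a pointer to non-const
// internals, so callers mutate the cached scratch state through that pointer.
struct CudaInternalA {
  std::int64_t m_team_scratch_current_size = 0;
  void* m_team_scratch_ptr = nullptr;
};
class CudaA {
  CudaInternalA* m_space_instance = nullptr;
 public:
  CudaInternalA* impl_internal_space_instance() const { return m_space_instance; }
};

// Option B: keep mutable; the accessor returns a pointer to const internals,
// and const member functions may still update the mutable cache.
struct CudaInternalB {
  mutable std::int64_t m_team_scratch_current_size = 0;
  mutable void* m_team_scratch_ptr = nullptr;
  void resize_team_scratch(std::int64_t bytes) const {
    if (bytes > m_team_scratch_current_size) m_team_scratch_current_size = bytes;
  }
};
class CudaB {
  CudaInternalB* m_space_instance = nullptr;
 public:
  const CudaInternalB* impl_internal_space_instance() const { return m_space_instance; }
};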

@crtrott (Member, Author):

Maybe we leave both of them for now and come back to this later.

core/unit_test/cuda/TestCuda_TeamScratch.cpp (resolved)
@dalg24 previously approved these changes Sep 4, 2020
@dalg24 dismissed their stale review September 5, 2020 03:14:

CI does not pass

@Rombur (Member) left a comment

The CUDA 11 tester and clang-tidy still fail. The other CUDA builds are passing.

cudaStreamCreate(&stream[i]);
cuda[i] = Kokkos::Cuda(stream[i]);
}
// Test that growing scratch size in subsequent calls doesn' crash things

Member:

Suggested change:
- // Test that growing scratch size in subsequent calls doesn' crash things
+ // Test that growing scratch size in subsequent calls doesn't crash things

@crtrott (Member, Author):

Does somebody else want to unroll all the changes and redo the commits to add the t? I don't have time right now. If nobody else feels the urgent need, I will ignore this :-)

Commits pushed:
Also fix situation with UVM as default memory space for CUDA
Forgot for the realloc to explicitly use CudaSpace to match the alloc
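A sketch of the point in that commit message (illustrative names, not the actual CudaInternal code): the realloc has to name CudaSpace explicitly so it matches the space of the original allocation, even in builds where the default CUDA memory space is CudaUVMSpace.

#include <Kokkos_Core.hpp>
#include <cstdint>

// Grow a team scratch buffer; alloc and realloc both name CudaSpace so the
// two calls always operate on the same memory space.
void* grow_team_scratch(void* ptr, std::int64_t& current_size,
                        std::int64_t requested_size) {
  if (requested_size > current_size) {
    current_size = requested_size;
    ptr = (ptr == nullptr)
              ? Kokkos::kokkos_malloc<Kokkos::CudaSpace>("TeamScratch",
                                                         current_size)
              : Kokkos::kokkos_realloc<Kokkos::CudaSpace>(ptr, current_size);
  }
  return ptr;
}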
@dalg24 (Member) commented Sep 9, 2020

CUDA-11.0-NVCC-C++17-RDC build fails with

terminate called after throwing an instance of 'std::runtime_error'
  what():  Kokkos::Impl::ParallelFor< Cuda > requested too large team size.

Commit pushed:
Apparently the cuda function first sets all the members to zero before writing to them. In a multi-threaded environment where each thread calls the same kernel, that can lead to a race.
@crtrott (Member, Author) commented Sep 16, 2020

OK, found the race and fixed it. It's in the caching of the cuda functor attributes. A comment (and commit message) is added to explain the race and the rationale for the fix.
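To show the shape of the fix, a sketch only (it uses std::call_once and a hypothetical helper name, whereas the actual code uses a static bool flag as quoted below): query into a function-local cudaFuncAttributes so concurrent callers never hand cudaFuncGetAttributes the same output address, then publish the cached result once.

#include <cuda_runtime.h>
#include <mutex>

// Hypothetical helper, one instantiation per kernel type: fill a local
// cudaFuncAttributes (no shared output pointer, hence no race inside
// cudaFuncGetAttributes), then copy it into the static cache exactly once.
template <class DriverType>
cudaFuncAttributes cached_func_attributes(const void* kernel_func) {
  static cudaFuncAttributes cached;
  static std::once_flag filled;
  std::call_once(filled, [&] {
    cudaFuncAttributes local;  // local output buffer
    cudaFuncGetAttributes(&local, kernel_func);
    cached = local;            // publish only after it is fully written
  });
  return cached;
}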

@DavidPoliakoff (Contributor) left a comment

Address Bruno's comment on the typo in your comment and it's good.

static bool attr_set = false;
if (!attr_set) {
// Race condition inside of cudaFuncGetAttributes if the same address is
// given requires using a local variable as input instead of a static Rely

Member:

Suggested change:
- // given requires using a local variable as input instead of a static Rely
+ // given requires using a local variable as input instead of a static. Rely

static cudaFuncAttributes attr;
static bool attr_set = false;
if (!attr_set) {
// Race condition inside of cudaFuncGetAttributes if the same address is

Member:

Should we mark it as a workaround for CUDA 11? Is that something we are going to report upstream?

@crtrott (Member, Author):

I don't think it's in principle a workaround just for CUDA 11. There is no implementation guarantee for the function; it's just bad API design to take pointers to stuff instead of just returning the struct.
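As an aside, the return-the-struct shape argued for above is easy to layer on top (illustrative wrapper, not part of CUDA or Kokkos):

#include <cuda_runtime.h>
#include <stdexcept>

// Returning by value means every caller gets its own cudaFuncAttributes on
// the stack, so there is no shared output pointer to race on.
inline cudaFuncAttributes func_attributes_by_value(const void* kernel_func) {
  cudaFuncAttributes attr;
  if (cudaFuncGetAttributes(&attr, kernel_func) != cudaSuccess)
    throw std::runtime_error("cudaFuncGetAttributes failed");
  return attr;
}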

@dalg24 merged commit d13b4e7 into kokkos:develop Sep 18, 2020
ndellingwood added a commit to ndellingwood/kokkos that referenced this pull request Sep 22, 2020
…lloc

From kokkos PR kokkos#3269 commit: 9206d4b

Modified file:
core/src/Cuda/Kokkos_Cuda_Instance.cpp
@crtrott deleted the fix-multi-stream-scratch branch December 19, 2022 17:17