-
Notifications
You must be signed in to change notification settings - Fork 407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix multi stream scratch #3269
Fix multi stream scratch #3269
Conversation
Reproduces issue kokkos#3246
33ac6dd
to
94d5c39
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
These are minor comments.
Also, I would prefer if you drop the Kokkos::
namespace for kokkos_malloc
and kokkos_free
.
mutable int64_t m_team_scratch_current_size; | ||
mutable void* m_team_scratch_ptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you declare them mutable?
Cuda:: impl_internal_space_instance
returns a pointer to non-const*. So I don't think you need it in principle.
kokkos/core/src/Kokkos_Cuda.hpp
Lines 254 to 256 in c01580a
inline Impl::CudaInternal* impl_internal_space_instance() const { | |
return m_space_instance; | |
} |
*Whether it should is another question 🙄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just talking to Damien: this is interesting problem but maybe a bigger one? We can either get rid of the mutable and leave it a non-const pointer or leave the mutable and make this a const pointer. What is the best pattern here? @nliber
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we leave both of them for now and come back to this later.
12c25d1
to
538e771
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CUDA 11 tester and clang-tidy still fail. The other CUDA builds are passing.
cudaStreamCreate(&stream[i]); | ||
cuda[i] = Kokkos::Cuda(stream[i]); | ||
} | ||
// Test that growing scratch size in subsequent calls doesn' crash things |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Test that growing scratch size in subsequent calls doesn' crash things | |
// Test that growing scratch size in subsequent calls doesn't crash things |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does somebody else want to unroll all the changes and redo the commits to add the t? I don't have time right now. If nobody else feels the urgent need I will ignore this :-)
Also fix situation with UVM as default memory space for CUDA Forgot for the realloc to explicitly use CudaSpace to match the alloc
538e771
to
9206d4b
Compare
CUDA-11.0-NVCC-C++17-RDC build fails with
|
Apparently the cuda function first sets all the members to zero before then writing to them. In a multi threaded environment where each thread calls the same kernel that can lead to a race.
OK found the race and fixed it. Its in the caching of the cuda functor attributes. Comment (and commit message) is added to explain the race and the rational for the fix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Address Bruno's comment on your type in your comment and it's good
static bool attr_set = false; | ||
if (!attr_set) { | ||
// Race condition inside of cudaFuncGetAttributes if the same address is | ||
// given requires using a local variable as input instead of a static Rely |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// given requires using a local variable as input instead of a static Rely | |
// given requires using a local variable as input instead of a static. Rely |
static cudaFuncAttributes attr; | ||
static bool attr_set = false; | ||
if (!attr_set) { | ||
// Race condition inside of cudaFuncGetAttributes if the same address is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we mark it as a workaround for CUDA 11? Is that something we are going to report upstream?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think its in principle a workaround just for CUDA 11. There is no implementation guarantee for the function, its just bad API design to take pointers to stuff instead of just returning the struct.
…lloc From kokkos PR kokkos#3269 commit: 9206d4b Modified file: core/src/Cuda/Kokkos_Cuda_Instance.cpp
This fixes #3246 by making team scratch allocations a per Cuda instance property instead of global.