-
Notifications
You must be signed in to change notification settings - Fork 407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HIP] Lock access to scratch memory when using Teams #3916
Conversation
@@ -433,6 +433,9 @@ class ParallelFor<FunctorType, Kokkos::TeamPolicy<Properties...>, | |||
int m_shmem_size; | |||
void* m_scratch_ptr[2]; | |||
int m_scratch_size[2]; | |||
// Only let one ParallelFor/Reduce modify the team scratch memory. The | |||
// constructor acquires the mutex which is released in the destructor. | |||
std::unique_lock<std::mutex> m_scratch_lock; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::lock_guard
is non-copyable. This means you implicitly deleted the copy constructor and copy assignment. Was it intentional? Did you make sure it plays well with the kernel launching?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means you implicitly deleted the copy constructor and copy assignment.
True at the same time if you use the copy constructor currently, you are copying pointers without copying their data... Also I don't see what's your use case for that. Also I am using std::unique_lock
instead of std::lock_guard
. Unlike std::lock_guard
, std::unique_lock
is movable.
Did you make sure it plays well with the kernel launching?
I am not sure what you mean by that but all the tests pass on Tulip and on the CI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed it was std::unique_lock
and not std::lock_guard
. I can't remember how we pass the driver around in the kernel launching. This is not a trick question I was just curious if you considered it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks OK to me! It would be good if we finally have some tests for the cases we try to cover here, though.
There is potentially a problem if multiple threads launch a
parallel_for
or aparallel_reduce
kernel on the same stream and useTeams
. Theparallel_for
and theparallel_reduce
may tried to reallocate the scratch memory and being used somewhere else. The current PR uses a mutex to ensure that only oneTeam
parallel_for
orparallel_reduce
is running for a given instance. I am open to suggestion, if someone has a better solution.Note that CUDA has the same problem.