Fix race condition in HIPParallelLaunch #5008
Merged
I encountered a race condition within the Kokkos HIP execution space whilst doing some performance measurements in another application on an MI100.

While the method `get_next_driver` has a mutex and a lock guard for concurrent access, the same concurrency protections do not seem to hold for its return value `d_driver` in `HIPParallelLaunch`. Hence, it is possible that one thread gets its `d_driver`, but before it can use it in the kernel invocation, another thread hits the `m_maxDriverCycles` limit in `get_next_driver` and calls fence (and the first thread only performs its kernel invocation after this). At this point it becomes a race: if enough kernels are scheduled on this HIP instance, we might overwrite the `d_driver` of the first thread while it's still in use.

Note, I think it is also possible for something similar to happen with the second branch (`driverTypeSize > m_maxDriverTypeSize`) in `get_next_driver` and another thread's return value in `HIPParallelLaunch`.

In my own use case, I was seeing occasional crashes/hangs starting when 16 threads shared one HIP execution space (each thread launching multiple asynchronous kernels in rapid succession). At 64 threads and one execution space it happened consistently enough to debug. Admittedly, the entire scenario is a bit of an edge case: we usually use more execution space instances and only tested it with a single one for some benchmark tests. Still, it should be fixed, hence this PR!
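A minimal sketch of the race pattern, assuming a ring of device-side driver buffers protected by a single mutex (the names `LaunchMechanismSketch`, `launch_kernel`, etc. are illustrative and do not match the actual Kokkos internals): the callee locks only while handing out the next driver slot, so the returned pointer is already unprotected by the time the caller uses it.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

struct LaunchMechanismSketch {
  std::mutex m_mutex;
  std::vector<char> m_buffer;         // stands in for the device-side driver storage
  std::size_t m_cycle = 0;
  std::size_t m_maxDriverCycles = 8;

  char* get_next_driver(std::size_t driver_size) {
    std::lock_guard<std::mutex> const lock(m_mutex);
    if (++m_cycle > m_maxDriverCycles) {
      // In the real code this is where the execution space fences and the
      // driver buffers start being reused.
      m_cycle = 0;
    }
    if (m_buffer.size() < driver_size) m_buffer.resize(driver_size);
    return m_buffer.data();           // the lock is released as soon as this returns
  }
};

// Caller, analogous to HIPParallelLaunch:
//   char* d_driver = mechanism.get_next_driver(sizeof(DriverType));
//   // <-- another thread can enter get_next_driver here, hit the cycle
//   //     limit, fence, and start overwriting the buffers
//   launch_kernel(d_driver);  // may now read a recycled/overwritten driver
```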
This PR fixes the issue by moving the lock guard from `get_next_driver` into `HIPParallelLaunch`, thus also protecting the return value until the kernel is launched. At least for me, it seems to resolve the issue entirely!
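The shape of the fix, sketched under the same assumptions as above (again, the names are illustrative rather than the actual Kokkos code): the lock guard is taken by the caller instead of inside `get_next_driver`, so it stays held from the moment the driver slot is handed out until the kernel launch has been enqueued.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

struct LaunchMechanismFixedSketch {
  std::mutex m_mutex;                 // still owned here, but now locked by the caller
  std::vector<char> m_buffer;
  std::size_t m_cycle = 0;
  std::size_t m_maxDriverCycles = 8;

  // Precondition: the caller already holds m_mutex.
  char* get_next_driver(std::size_t driver_size) {
    if (++m_cycle > m_maxDriverCycles) {
      // fence and start reusing the driver buffers, as before
      m_cycle = 0;
    }
    if (m_buffer.size() < driver_size) m_buffer.resize(driver_size);
    return m_buffer.data();
  }
};

// Caller, analogous to HIPParallelLaunch:
//   {
//     std::lock_guard<std::mutex> const lock(mechanism.m_mutex);
//     char* d_driver = mechanism.get_next_driver(sizeof(DriverType));
//     launch_kernel(d_driver);  // still under the lock
//   }  // only now can another thread cycle/fence and reuse the buffers
```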