
Fix racing condition in HIPParallelLaunch #5008

Merged
merged 1 commit into kokkos:develop on May 10, 2022
Conversation

@G-071 (Contributor) commented May 10, 2022

I encountered a race condition within the Kokkos HIP execution space whilst doing some performance measurements in another application on an MI100.

While the method

char *HIPInternal::get_next_driver(size_t driverTypeSize) const {

has a mutex and a lock guard for concurrent access, the same concurrency protection does not extend to its return value d_driver in HIPParallelLaunch. Hence, it is possible that one thread obtains its d_driver, but before it can use it in the kernel invocation, another thread hits the m_maxDriverCycles limit in get_next_driver and calls fence (and the first thread only performs its kernel invocation after that). At this point it becomes a race: if enough kernels are scheduled on this HIP instance, we might overwrite the first thread's d_driver while it is still in use.

Note that I think something similar can also happen with the second branch (driverTypeSize > m_maxDriverTypeSize) in get_next_driver and another thread's return value in HIPParallelLaunch.
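
To make the interleaving concrete, here is a minimal, self-contained sketch of the racy pattern. It is not the actual Kokkos implementation; FakeHIPInternal, m_mutexWorkArray, m_driverWorkArray, and m_cycleId are made-up stand-ins, while get_next_driver, d_driver, driverTypeSize, and m_maxDriverCycles mirror the names quoted above. The point is that the lock only covers the call to get_next_driver itself, not the lifetime of the pointer it returns:

```cpp
// Minimal sketch of the racy pattern -- simplified, not the real Kokkos code.
// The mutex protects get_next_driver() internally, but the returned pointer
// is used only *after* the lock has been released.
#include <cstddef>
#include <mutex>
#include <vector>

struct FakeHIPInternal {
  std::mutex m_mutexWorkArray;
  std::vector<char> m_driverWorkArray =
      std::vector<char>(4 * 256);  // 4 recycled slots of 256 bytes each
  int m_cycleId         = 0;
  int m_maxDriverCycles = 4;

  char *get_next_driver(std::size_t driverTypeSize) {
    std::lock_guard<std::mutex> lock(m_mutexWorkArray);  // covers only this call
    if (m_cycleId >= m_maxDriverCycles) {
      // the real code fences the device here, then starts reusing old slots
      m_cycleId = 0;
    }
    return m_driverWorkArray.data() + (m_cycleId++) * driverTypeSize;
  }
};

// Interleaving that triggers the race:
//   thread A: d_driver = get_next_driver(256);    // handed slot 0
//   thread B: calls get_next_driver(256) until it hits m_maxDriverCycles,
//             fences, and is handed slot 0 again   // A's slot is recycled
//   thread A: launches its kernel from d_driver    // data may be clobbered
```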

In my own use case, I was seeing occasional crashes/hangs starting when 16 threads shared one HIP execution space (each thread launching multiple asynchronous kernels in rapid succession). At 64 threads and one execution space it happened consistently enough to debug. Admittedly, the entire scenario is a bit of an edge case: we usually use more execution space instances and only tested with a single one for some benchmarks. Still, it should be fixed, hence this PR!

This PR fixes the issue by moving the lock guard from get_next_driver into HIPParallelLaunch, thus also protecting the return value until the kernel has been launched. At least for me, this seems to resolve the issue entirely!
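
For illustration, here is a sketch of what the change amounts to, under the same simplified assumptions as above (FakeHIPInternalFixed, fake_hip_parallel_launch, and launch_kernel are hypothetical names, not the real Kokkos API): the lock_guard is removed from inside get_next_driver and is instead taken by the launcher, which holds it until the kernel launch has consumed d_driver.

```cpp
// Minimal sketch of the fix -- simplified, not the real Kokkos code.
#include <cstddef>
#include <cstring>
#include <mutex>
#include <vector>

// Hypothetical stand-in for the asynchronous HIP launch reading from d_driver.
void launch_kernel(char * /*d_driver*/) {}

struct FakeHIPInternalFixed {
  std::mutex m_mutexWorkArray;
  std::vector<char> m_driverWorkArray = std::vector<char>(4 * 256);
  int m_cycleId         = 0;
  int m_maxDriverCycles = 4;

  // No internal lock_guard anymore: the caller holds m_mutexWorkArray for as
  // long as it needs the returned pointer.
  char *get_next_driver(std::size_t driverTypeSize) {
    if (m_cycleId >= m_maxDriverCycles) {
      // the real code fences the device here before reusing old slots
      m_cycleId = 0;
    }
    return m_driverWorkArray.data() + (m_cycleId++) * driverTypeSize;
  }
};

template <class DriverType>
void fake_hip_parallel_launch(FakeHIPInternalFixed &hip_instance,
                              const DriverType &driver) {
  // The lock now lives at the call site and stays held until the kernel
  // launch has consumed d_driver, so no other thread can recycle the slot.
  std::lock_guard<std::mutex> lock(hip_instance.m_mutexWorkArray);
  char *d_driver = hip_instance.get_next_driver(sizeof(DriverType));
  std::memcpy(d_driver, &driver, sizeof(DriverType));
  launch_kernel(d_driver);
}
```

Since the launch itself is asynchronous, holding the mutex across it should only serialize the host-side enqueue of concurrent launches on the same instance.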

This change is necessary as we not only need the mutex for concurrency
control within hip_instance->get_next_driver(...), but also for its
return value! Otherwise, threads can overwrite the current thread's
return value (d_driver) before the kernel actually gets launched!
@dalg24-jenkins (Collaborator)

Can one of the admins verify this patch?

@G-071 changed the title from "Move workarray lockguard to HIPParallelLaunch" to "Fix racing condition in HIPParallelLaunch" on May 10, 2022
@masterleinad (Contributor) left a comment

Makes sense to me. Would you also have a test case that demonstrates the problem for us to add to the test suite?

@dalg24 (Member) commented May 10, 2022

OK to test

@dalg24 dalg24 requested a review from Rombur May 10, 2022 11:39
@dalg24 dalg24 added the Bug Broken / incorrect code; it could be Kokkos' responsibility, or others’ (e.g., Trilinos) label May 10, 2022
@dalg24 dalg24 merged commit ba0caee into kokkos:develop May 10, 2022
@G-071 (Contributor, Author) commented May 10, 2022

> Makes sense to me. Would you also have a test case that demonstrates the problem for us to add to the test suite?

Unfortunately, I do not have a small test case handy right now! I encountered (and debugged) this bug using our entire simulation code running a smallish scenario that fits on one MI100 node.

Labels: Bug, Patch Release