
Fix racing condition in HIPParallelLaunch #5008

Merged
merged 1 commit into kokkos:develop on May 10, 2022
Conversation

@G-071 (Contributor) commented May 10, 2022

I encountered a race condition within the Kokkos HIP execution space whilst doing some performance measurements in another application on an MI100.

While the method

char *HIPInternal::get_next_driver(size_t driverTypeSize) const {

has a mutex and a lock guard for concurrent access, the same concurrency protection does not extend to its return value d_driver in HIPParallelLaunch. Hence, it is possible that one thread obtains its d_driver, but before it can use it in the kernel invocation, another thread hits the m_maxDriverCycles limit in get_next_driver and calls fence (and the first thread only performs its kernel invocation after that). At this point it becomes a race: if enough kernels are scheduled on this HIP instance, we might overwrite the first thread's d_driver while it is still in use.

Note that I think something similar can also happen with the second branch (driverTypeSize > m_maxDriverTypeSize) in get_next_driver and another thread's return value in HIPParallelLaunch.
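
To make the interleaving concrete, here is a minimal, self-contained sketch of the racy pattern. It is not the actual Kokkos implementation; FakeHIPInternal, m_mutexWorkArray, m_driverWorkArray, and m_cycleId are made-up stand-ins, while get_next_driver, d_driver, driverTypeSize, and m_maxDriverCycles mirror the names quoted above. The point is that the lock only covers the call to get_next_driver itself, not the lifetime of the pointer it returns:

```cpp
// Minimal sketch of the racy pattern -- simplified, not the real Kokkos code.
// The mutex protects get_next_driver() internally, but the returned pointer
// is used only *after* the lock has been released.
#include <cstddef>
#include <mutex>
#include <vector>

struct FakeHIPInternal {
  std::mutex m_mutexWorkArray;
  std::vector<char> m_driverWorkArray =
      std::vector<char>(4 * 256);  // 4 recycled slots of 256 bytes each
  int m_cycleId         = 0;
  int m_maxDriverCycles = 4;

  char *get_next_driver(std::size_t driverTypeSize) {
    std::lock_guard<std::mutex> lock(m_mutexWorkArray);  // covers only this call
    if (m_cycleId >= m_maxDriverCycles) {
      // the real code fences the device here, then starts reusing old slots
      m_cycleId = 0;
    }
    return m_driverWorkArray.data() + (m_cycleId++) * driverTypeSize;
  }
};

// Interleaving that triggers the race:
//   thread A: d_driver = get_next_driver(256);    // handed slot 0
//   thread B: calls get_next_driver(256) until it hits m_maxDriverCycles,
//             fences, and is handed slot 0 again   // A's slot is recycled
//   thread A: launches its kernel from d_driver    // data may be clobbered
```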

In my own use case, I was seeing occasional crashes/hangs starting when 16 threads shared one HIP execution space (each thread launching multiple asynchronous kernels in rapid succession). At 64 threads and one execution space it happened consistently enough to debug. Admittedly, the entire scenario is a bit of an edge case: we usually use more execution space instances and only tested with a single one for some benchmarks. Still, it should be fixed, hence this PR!

This PR fixes the issue by moving the lock guard from get_next_driver into HIPParallelLaunch, thus also protecting the return value until the kernel has been launched. At least for me, this seems to resolve the issue entirely!
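
For illustration, here is a sketch of what the change amounts to, under the same simplified assumptions as above (FakeHIPInternalFixed, fake_hip_parallel_launch, and launch_kernel are hypothetical names, not the real Kokkos API): the lock_guard is removed from inside get_next_driver and is instead taken by the launcher, which holds it until the kernel launch has consumed d_driver.

```cpp
// Minimal sketch of the fix -- simplified, not the real Kokkos code.
#include <cstddef>
#include <cstring>
#include <mutex>
#include <vector>

// Hypothetical stand-in for the asynchronous HIP launch reading from d_driver.
void launch_kernel(char * /*d_driver*/) {}

struct FakeHIPInternalFixed {
  std::mutex m_mutexWorkArray;
  std::vector<char> m_driverWorkArray = std::vector<char>(4 * 256);
  int m_cycleId         = 0;
  int m_maxDriverCycles = 4;

  // No internal lock_guard anymore: the caller holds m_mutexWorkArray for as
  // long as it needs the returned pointer.
  char *get_next_driver(std::size_t driverTypeSize) {
    if (m_cycleId >= m_maxDriverCycles) {
      // the real code fences the device here before reusing old slots
      m_cycleId = 0;
    }
    return m_driverWorkArray.data() + (m_cycleId++) * driverTypeSize;
  }
};

template <class DriverType>
void fake_hip_parallel_launch(FakeHIPInternalFixed &hip_instance,
                              const DriverType &driver) {
  // The lock now lives at the call site and stays held until the kernel
  // launch has consumed d_driver, so no other thread can recycle the slot.
  std::lock_guard<std::mutex> lock(hip_instance.m_mutexWorkArray);
  char *d_driver = hip_instance.get_next_driver(sizeof(DriverType));
  std::memcpy(d_driver, &driver, sizeof(DriverType));
  launch_kernel(d_driver);
}
```

Since the launch itself is asynchronous, holding the mutex across it should only serialize the host-side enqueue of concurrent launches on the same instance.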

This change is necessary as we not only need the mutex for concurrency
control within hip_instance->get_next_driver(...), but also for its
return value! Otherwise, threads can overwrite the current thread's
return value (d_driver) before the kernel actually gets launched!
@dalg24-jenkins (Collaborator)

Can one of the admins verify this patch?

@G-071 changed the title from "Move workarray lockguard to HIPParallelLaunch" to "Fix racing condition in HIPParallelLaunch" on May 10, 2022
@masterleinad (Contributor) left a comment

Makes sense to me. Would you also have a test case that demonstrates the problem for us to add to the test suite?

@dalg24 (Member) commented May 10, 2022

OK to test

@dalg24 dalg24 requested a review from Rombur May 10, 2022 11:39
@dalg24 dalg24 added the Bug Broken / incorrect code; it could be Kokkos' responsibility, or others’ (e.g., Trilinos) label May 10, 2022
@dalg24 dalg24 merged commit ba0caee into kokkos:develop May 10, 2022
@G-071 (Contributor, Author) commented May 10, 2022

> Makes sense to me. Would you also have a test case that demonstrates the problem for us to add to the test suite?

Unfortunately, I do not have a small test case handy right now! I encountered (and debugged) this bug using our entire simulation code running a smallish scenario that fits on one MI100 node.

Labels: Bug, Patch Release