[runtime] Concurrency issues in the HIP/CUDA stream/graph drivers #16903

Open
sogartar opened this issue Mar 26, 2024 · 3 comments
Labels
bug 🐞 Something isn't working

Comments

@sogartar
Contributor

What happened?

There are outstanding concurrency issues in the

HIP stream
HIP graph
CUDA stream
CUDA graph

IREE drivers.

Some of the driver conformance tests are failing intermittently. I have some logs that I can share later.

I have caught this ASan error on the CUDA stream driver
asan2.log
It is on a slightly modified version that has some prints inserted and has collective operations disabled. These modifications are most likely not material to the issue.
I had to run the test for 1 hour on a 4 GPU system to produce the error.
Unfortunately I don't know which process exactly caused the error, since I was running multiple modules on multiple GPUs without separating the outputs.
This is the script used to run the tests in parallel.

All of the drivers share the same synchronization logic and it is likely that these concurrency issues are shared.

I have also noticed, when collecting traces for HIP graph using this unmerged PR, that some traces don't get recorded.

Steps to reproduce your issue

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

What component(s) does this issue relate to?

Runtime

Version information

No response

Additional context

No response

@sogartar
Contributor Author

On 571b205a49578d7db39985341ca4a8049f306ceb

iree/experimental/hip/cts/hip_stream_semaphore_submission_test

sometimes fails.
test.log

It probably fails on main as well; I just noticed it there first.

@sogartar
Contributor Author

sogartar commented Apr 9, 2024

I ran

CTS/semaphore_submission_test.WaitAnyHostAndDeviceSemaphoresAndHostSignals/hip

on 0c18971,

with GCC's thread sanitizer, where libstdc++ and the HIP runtime were built with TSan as well.
I had to add some TSan annotations both in IREE and in HIP, since TSan does not support std::atomic_thread_fence.
I also disabled the slim mutex in IREE in favor of pthreads, as Linux futexes are also not supported by TSan.
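
For readers unfamiliar with the fence limitation, here is a minimal sketch of the pattern involved. It is not the annotation that was added to IREE or HIP; `Mailbox`, `publish_with_fence`, `publish_with_release_store`, `consume` and `roundtrip` are hypothetical names used only for illustration. The fence-based publication is correct under the C++ memory model, but TSan does not model the fence and may report the payload access as a race. Putting the release ordering on the atomic store itself is sufficient for this particular pattern and is something TSan does understand.

```cpp
#include <atomic>
#include <thread>

// Hypothetical stand-in for data handed from one thread to another.
struct Mailbox {
  int payload = 0;
  std::atomic<bool> ready{false};
};

// Publication through a standalone fence. Valid C++, but TSan does not
// track std::atomic_thread_fence, so it can flag `payload` as racy.
void publish_with_fence(Mailbox& m, int v) {
  m.payload = v;
  std::atomic_thread_fence(std::memory_order_release);
  m.ready.store(true, std::memory_order_relaxed);
}

// Same publication with the release ordering attached to the store.
// TSan sees the release/acquire pair on `ready` and derives the
// happens-before edge for `payload`.
void publish_with_release_store(Mailbox& m, int v) {
  m.payload = v;
  m.ready.store(true, std::memory_order_release);
}

// Consumer side: the acquire load pairs with either publisher above.
int consume(Mailbox& m) {
  while (!m.ready.load(std::memory_order_acquire)) {
    std::this_thread::yield();
  }
  return m.payload;
}

// Publishes from a second thread and reads the value back on this one.
int roundtrip(int v) {
  Mailbox m;
  std::thread producer([&] { publish_with_release_store(m, v); });
  int got = consume(m);
  producer.join();
  return got;
}
```

Where the fence cannot be rewritten this way (for example inside a third-party runtime), explicit sanitizer annotations are the other option, which is the route described above.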

Here are two logs. One from a test failure and one from a success.

fail.log
success.log

The interesting data race involves the timepoints and appears in both logs. I think what is going on is that we are recycling a timepoint from the pool. Nevertheless, there should still be a happens-before relation between the timepoint read and write. This may be a false positive, but it may be that we are missing some synchronization between the two events.
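
To make the expected happens-before relation concrete, here is a minimal sketch of a recycling pool, assuming a mutex-protected free list. `Timepoint`, `TimepointPool` and `recycle_once` are hypothetical stand-ins and not IREE's actual types or pool implementation. The point is only that when both the return to the pool and the next acquisition go through the same lock, the previous owner's last write is ordered before the next owner's first access, so a race report on a recycled object would mean either a false positive or a path that bypasses this synchronization.

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a pooled timepoint object.
struct Timepoint {
  int value = 0;
};

// Hypothetical fixed-size pool. Every acquire/release takes the same
// mutex, which is what creates the happens-before edge between owners.
class TimepointPool {
 public:
  explicit TimepointPool(std::size_t n) : storage_(n) {
    for (auto& t : storage_) free_.push_back(&t);
  }

  // Returns nullptr when the pool is empty.
  Timepoint* acquire() {
    std::lock_guard<std::mutex> lock(mu_);
    if (free_.empty()) return nullptr;
    Timepoint* t = free_.back();
    free_.pop_back();
    return t;
  }

  void release(Timepoint* t) {
    std::lock_guard<std::mutex> lock(mu_);
    free_.push_back(t);
  }

 private:
  std::mutex mu_;
  std::vector<Timepoint> storage_;  // never resized, so pointers stay valid
  std::vector<Timepoint*> free_;
};

// The calling thread owns the only slot, writes to it and releases it
// while a second thread waits to recycle it.
int recycle_once() {
  TimepointPool pool(1);
  Timepoint* mine = pool.acquire();
  int seen = -1;
  std::thread next_owner([&] {
    Timepoint* t = nullptr;
    while ((t = pool.acquire()) == nullptr) std::this_thread::yield();
    seen = t->value;  // ordered after the write below via the pool mutex
    t->value = 0;
    pool.release(t);
  });
  mine->value = 7;     // last write by the previous owner
  pool.release(mine);  // unlock here pairs with the lock in acquire()
  next_owner.join();
  return seen;
}
```

If any access to the recycled object happens outside this lock-protected handoff (for example from a callback that still holds a stale pointer), the ordering is lost and the report would be a real race.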

In the case of a success the test hangs somewhere in the TSan code.
In the failure case there are some data races reported in the destruction of HIP. I don't know if we should pay much attention to those.

@sogartar
Contributor Author

#17025 fixes one issue.
Even with this change I noticed that cuda_graph_semaphore_submission_test fails intermittently with a timeout.
