Some of the driver conformance tests are failing intermittently. I have some logs that I can share later.
I caught this ASan error on the CUDA stream driver: asan2.log
It occurred on a slightly modified version that has some prints inserted and collective operations disabled. These modifications are most likely not material to the issue.
I had to run the test for 1 hour on a 4-GPU system to produce the error.
Unfortunately, I don't know exactly which process caused the error, since I was running multiple modules on multiple GPUs without separating the outputs. This is the script used to run the tests in parallel.
All of the drivers share the same synchronization logic, so these concurrency issues are likely shared as well.
I have also noticed, when collecting traces for HIP graph using this unmerged PR, that some traces are not recorded.
Steps to reproduce your issue
What component(s) does this issue relate to?
Runtime
Version information
No response
Additional context
No response
with GCC's thread sanitizer, where libstdc++ and the HIP runtime were built with TSan as well.
I had to add some TSan annotations both in IREE and in HIP, since TSan does not support std::atomic_thread_fence.
I also disabled the slim mutex in IREE in favor of pthreads, as Linux futexes are also not supported by TSan.
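For readers unfamiliar with this kind of annotation, here is a minimal sketch of how a fence-based publish/consume pair could be annotated for TSan. It is not the actual change made in IREE or HIP. The `SKETCH_*` macros, `Message`, `publish`, `consume` and `run_fence_demo` are hypothetical names introduced for illustration; `__tsan_release` and `__tsan_acquire` come from `<sanitizer/tsan_interface.h>`. Outside of a TSan build the macros compile to no-ops.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Detect ThreadSanitizer: GCC defines __SANITIZE_THREAD__, Clang uses __has_feature.
#if defined(__SANITIZE_THREAD__)
#define SKETCH_TSAN 1
#elif defined(__has_feature)
#if __has_feature(thread_sanitizer)
#define SKETCH_TSAN 1
#endif
#endif

#if defined(SKETCH_TSAN)
#include <sanitizer/tsan_interface.h>
// Tell TSan about the ordering that the fences provide but TSan cannot see.
#define SKETCH_ANNOTATE_RELEASE(addr) __tsan_release((void*)(addr))
#define SKETCH_ANNOTATE_ACQUIRE(addr) __tsan_acquire((void*)(addr))
#else
#define SKETCH_ANNOTATE_RELEASE(addr) ((void)(addr))
#define SKETCH_ANNOTATE_ACQUIRE(addr) ((void)(addr))
#endif

// Hypothetical message: plain payload guarded by an atomic flag.
struct Message {
  int payload = 0;
  std::atomic<bool> ready{false};
};

// Writer: plain write, release fence, then a relaxed store of the flag.
void publish(Message& m, int value) {
  m.payload = value;
  SKETCH_ANNOTATE_RELEASE(&m.ready);
  std::atomic_thread_fence(std::memory_order_release);
  m.ready.store(true, std::memory_order_relaxed);
}

// Reader: relaxed load of the flag, acquire fence, then a plain read.
int consume(Message& m) {
  while (!m.ready.load(std::memory_order_relaxed)) std::this_thread::yield();
  std::atomic_thread_fence(std::memory_order_acquire);
  SKETCH_ANNOTATE_ACQUIRE(&m.ready);
  return m.payload;
}

// Runs one writer and one reader and returns the value the reader observed.
int run_fence_demo() {
  Message m;
  int seen = 0;
  std::thread consumer([&] { seen = consume(m); });
  publish(m, 42);
  consumer.join();
  return seen;
}
```

The fences already give correct ordering under the C++ memory model; the annotations only make that ordering visible to the tool, so a wrongly placed annotation can hide a real race.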
Here are two logs: one from a test failure and one from a success.
The interesting data race is the one on the timepoints, which appears in both cases. I think what is happening is that we are recycling a timepoint from the pool. Nevertheless, there should still be a happens-before relation between the timepoint read and write. This may be a false positive, but we may also be missing some synchronization between the two events.
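To make the expected happens-before relation concrete, here is a minimal sketch of a recycled object handed between two threads through a mutex-guarded pool. It is an illustration under stated assumptions, not IREE's timepoint pool: `Timepoint`, `TimepointPool` and `run_pool_demo` are hypothetical names. The previous user's write is ordered before the next user's read because both `release` and `acquire` go through the same mutex.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a pooled timepoint; value is plain, non-atomic data.
struct Timepoint {
  int value = 0;
};

// Hypothetical fixed-size pool; the mutex orders release() before a later acquire().
class TimepointPool {
 public:
  explicit TimepointPool(std::size_t n) : storage_(n) {
    for (auto& t : storage_) free_.push_back(&t);
  }
  // Returns a free timepoint, or nullptr if the pool is exhausted.
  Timepoint* acquire() {
    std::lock_guard<std::mutex> lock(mu_);
    if (free_.empty()) return nullptr;
    Timepoint* t = free_.back();
    free_.pop_back();
    return t;
  }
  // Returns a timepoint to the pool; writes made before this call become
  // visible to whichever thread acquires the same timepoint next.
  void release(Timepoint* t) {
    std::lock_guard<std::mutex> lock(mu_);
    free_.push_back(t);
  }

 private:
  std::mutex mu_;
  std::vector<Timepoint> storage_;
  std::vector<Timepoint*> free_;
};

// The first user holds the only timepoint, writes to it, then recycles it.
// The second user waits for it and reads the value written by the first.
int run_pool_demo() {
  TimepointPool pool(1);
  Timepoint* first = pool.acquire();
  int observed = -1;
  std::thread second_user([&] {
    Timepoint* t = nullptr;
    while ((t = pool.acquire()) == nullptr) std::this_thread::yield();
    observed = t->value;  // ordered after the first user's write via the mutex
    t->value = 0;
    pool.release(t);
  });
  first->value = 7;
  pool.release(first);
  second_user.join();
  return observed;
}
```

If the hand-off instead went through a path without such a lock or a release/acquire pair, or through synchronization the tool cannot see, TSan would report the read and write as racing, which is consistent with either explanation given above.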
In the case of a success the test hangs somewhere in the TSan code.
In the failure case there are some data races reported in the destruction of HIP. I don't know if we should pay much attention to those.
What happened?
There are outstanding concurrency issues in the HIP stream, HIP graph, CUDA stream and CUDA graph IREE drivers.