dag_submitted_ops: Manage node lifetime by asynchronously waiting instead of event queries #761
Conversation
For best effect, we should probably turn the …
I think you should see a benefit both with and without #770 due to the overall lower submission latency. For full benefit, you probably need #771, otherwise you might still have event creation calls in the task submission path (depending on your DAG). Ideally, with #771 and this PR you should have …
Makes sense, I'll report back when I have some results.
With all currently open MRs (#761, #770, #771) together, I see a ~26% improvement in end-to-end runtime on a smaller test (adh_dodec) for which the runtime overheads are more significant. Although some multi-rank tests are failing on my machine, the single-rank performance improvement looks promising (baseline with develop: 128.967 ns/day). The failing tests: …
Segfaults are presumably due to #771; that one is known to still be unstable.
@sbalint98 Do you still observe overheads? If so, from where?
One of the main issues I see is that there is still some kernel launch overhead for hipSYCL. If I look at the traces, I see that for every kernel launch in the case of hipSYCL there is a …

EDIT: Maybe the way hipSYCL launches kernels is the problem? Is there anything that prevents replacing this with …?
HIP will have …

It is not possible to call …

In my testing, I consistently see only negligible latency in addition to …

Can you maybe use a profiler like VTune to create a flame graph or similar to gather some more insight into where exactly this overhead is coming from?

EDIT: We could try to invoke …
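For reference, a minimal stand-alone micro-benchmark for measuring host-side kernel launch latency could look roughly like the sketch below. The empty kernel, stream setup, and iteration count are illustrative assumptions, not code from this discussion.

```cpp
// Minimal sketch of a host-side launch-latency micro-benchmark (illustrative,
// not from this PR). Compile with hipcc.
#include <hip/hip_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void empty_kernel() {}

int main() {
  hipStream_t stream;
  (void)hipStreamCreate(&stream);

  // Warm up so one-time costs (module load, first launch) are excluded.
  empty_kernel<<<dim3(1), dim3(1), 0, stream>>>();
  (void)hipStreamSynchronize(stream);

  constexpr int iterations = 10000;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i)
    empty_kernel<<<dim3(1), dim3(1), 0, stream>>>();
  auto end = std::chrono::steady_clock::now();
  (void)hipStreamSynchronize(stream);

  double ns = std::chrono::duration<double, std::nano>(end - start).count();
  std::printf("average host-side launch latency: %.1f ns\n", ns / iterations);

  (void)hipStreamDestroy(stream);
  return 0;
}
```

Comparing such a number against the per-launch cost visible in the hipSYCL traces would help separate HIP runtime overhead from overhead added by the SYCL runtime itself.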
Okay, it seems that …

This is new; in earlier ROCm versions it would just invoke …
Here are some benchmarks with GROMACS data on an EPYC 7742 + MI100 (many cores, so there should be no contention on resources), using ROCm 4.5.2. The drop in improvement in the 6-12k input range is quite peculiar, as I'd expect a more gradual roll-off as the time per iteration becomes increasingly dominated by GPU work. This also shows up in the relative performance vs native HIP (code whose device-kernel performance is roughly on par, so differences in total performance are mostly due to the runtime).
@pszi1ard Thanks for this feedback! So, is the interpretation correct that we have solved the performance issue for small problem sizes, but there is still an issue for medium problem sizes?
@sbalint98 has reported that …
Compared to HIP, we frequently see a lack of concurrency between the nbnxm and pme kernels in the problematic regime, even though they run in different streams. We suspect resource usage inside the kernels to be the problem; @sbalint98 to investigate register usage.
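To make the concurrency question concrete, here is a stripped-down sketch of two kernels submitted to independent HIP streams. The placeholder kernels stand in for the actual nbnxm/pme kernels; whether they really overlap on the device has to be checked in a trace (e.g. with rocprof), not from the host-side API.

```cpp
// Sketch of two kernels on independent streams (placeholder kernels, not the
// GROMACS ones). High register use or large block counts in either kernel can
// exhaust CU resources and serialize execution even though the streams are
// independent.
#include <hip/hip_runtime.h>

__global__ void kernel_a(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

__global__ void kernel_b(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = data[i] * 0.5f - 1.0f;
}

int main() {
  constexpr int n = 1 << 20;
  float *a = nullptr, *b = nullptr;
  (void)hipMalloc(&a, n * sizeof(float));
  (void)hipMalloc(&b, n * sizeof(float));

  hipStream_t s1, s2;
  (void)hipStreamCreate(&s1);
  (void)hipStreamCreate(&s2);

  // Submitted to different streams; overlap on the device is only visible in a trace.
  kernel_a<<<dim3((n + 255) / 256), dim3(256), 0, s1>>>(a, n);
  kernel_b<<<dim3((n + 255) / 256), dim3(256), 0, s2>>>(b, n);

  (void)hipDeviceSynchronize();
  (void)hipFree(a);
  (void)hipFree(b);
  (void)hipStreamDestroy(s1);
  (void)hipStreamDestroy(s2);
  return 0;
}
```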
Yes, but I think there is some underlying HIP runtime overhead (or perhaps something related to the lack of overlap?) that limits the iteration rate with smaller inputs, which might be masking potential differences. Here's a three-way comparison of hipSYCL vs native HIP (on MI100) vs CUDA (2080 Ti) using two different inputs: "pme" has two paths in the task graph with the possibility for overlap, while "rf" has a single path, so all major tasks (with the exception of some smaller auxiliary kernels) need to be executed sequentially.
@sbalint98 I assume this is the same experimental StreamHPC HIP code you referred to before?
We need to know when nodes have completed in order to be able to perform e.g. buffer memory management and not deallocate buffers while they are still in use.
Previously this was done by querying event state prior to submitting new tasks. This can be a bottleneck when many small tasks are submitted.
This PR changes this by instead waiting on all nodes from a DAG batch in an asynchronous worker thread. Once the wait is complete, we can release the nodes.
This should theoretically allow us to circumvent all of the event queries.
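As a rough illustration of the idea (not the actual hipSYCL runtime code; `dag_node`, `wait()`, and the batch queue below are hypothetical stand-ins), a worker thread can drain a queue of submitted batches and block on each node, so the submission path itself never has to query event state:

```cpp
// Illustrative sketch of the batch-release idea; dag_node, wait() and the
// batch queue are hypothetical stand-ins, not hipSYCL's actual types.
#include <condition_variable>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct dag_node {
  void wait() { /* block until the backend signals completion of this node */ }
};
using dag_node_ptr = std::shared_ptr<dag_node>;
using node_batch = std::vector<dag_node_ptr>;

class submitted_ops_waiter {
public:
  submitted_ops_waiter() : worker_{[this] { run(); }} {}

  ~submitted_ops_waiter() {
    { std::lock_guard<std::mutex> lk{mutex_}; stop_ = true; }
    cv_.notify_one();
    worker_.join();
  }

  // Called once per DAG batch at submission time; never queries event state.
  void register_batch(node_batch batch) {
    { std::lock_guard<std::mutex> lk{mutex_}; batches_.push(std::move(batch)); }
    cv_.notify_one();
  }

private:
  void run() {
    for (;;) {
      node_batch batch;
      {
        std::unique_lock<std::mutex> lk{mutex_};
        cv_.wait(lk, [this] { return stop_ || !batches_.empty(); });
        if (stop_ && batches_.empty()) return;
        batch = std::move(batches_.front());
        batches_.pop();
      }
      // Block (off the submission path) until every node in the batch has
      // completed, then drop the references so the nodes and any buffers they
      // keep alive can be released.
      for (auto& node : batch) node->wait();
      batch.clear();
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<node_batch> batches_;
  bool stop_ = false;
  std::thread worker_;
};
```

The point of the design is that the submission path only enqueues a batch and notifies the worker; all blocking waits are absorbed by the worker thread, and node lifetime simply extends until its asynchronous wait returns.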
However, it is currently still unclear how this interacts with coarse-grained events (#754), where waiting on an event might map to a `cudaStreamSynchronize()` or `hipStreamSynchronize()` call. Also, the practical performance impact needs to be investigated; therefore this is still a draft PR.

@al42and @pszi1ard This can also be interesting for you, as it cuts out any event queries completely from the submission path.
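To illustrate the concern (a stand-alone HIP sketch, not hipSYCL code): with a per-operation event, the waiter blocks only until that specific operation completes, whereas a coarse-grained wait that maps to a stream synchronize also blocks on everything queued behind it.

```cpp
// Fine-grained event wait vs coarse-grained stream synchronize (illustrative).
#include <hip/hip_runtime.h>

__global__ void op_a() {}
__global__ void op_b() {}

int main() {
  hipStream_t stream;
  (void)hipStreamCreate(&stream);

  hipEvent_t after_a;
  (void)hipEventCreate(&after_a);

  op_a<<<dim3(1), dim3(1), 0, stream>>>();
  (void)hipEventRecord(after_a, stream);   // fine-grained marker after op_a
  op_b<<<dim3(1), dim3(1), 0, stream>>>();

  // Fine-grained: returns as soon as op_a has completed; op_b may still be running.
  (void)hipEventSynchronize(after_a);

  // Coarse-grained: waits for everything in the stream, including op_b.
  (void)hipStreamSynchronize(stream);

  (void)hipEventDestroy(after_a);
  (void)hipStreamDestroy(stream);
  return 0;
}
```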