
dag_submitted_ops: Manage node lifetime by asynchronously waiting instead of event queries #761

Merged: 5 commits from feature/async-submitted-ops-management into develop on Jul 27, 2022

Conversation

@illuhad (Collaborator) commented Jul 5, 2022

We need to know when nodes have completed in order to perform e.g. buffer memory management and not deallocate buffers while they are still in use.
Previously this was done by querying event state prior to submitting new tasks, which can become a bottleneck when many small tasks are submitted.
This PR instead waits on all nodes from a DAG batch in an asynchronous worker thread; once the wait completes, the nodes can be released.
In theory, this allows us to avoid event queries entirely.

However, it is currently unclear how this interacts with coarse-grained events (#754), where waiting on an event might map to a cudaStreamSynchronize() or hipStreamSynchronize() call. The practical performance impact also needs to be investigated, so this is still a draft PR.

@al42and @pszi1ard This may also be interesting for you, as it removes event queries from the submission path entirely.
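
For illustration, here is a minimal sketch of the lifetime-management idea in plain C++. The type and function names below are hypothetical stand-ins, not the actual hipSYCL runtime API:

```cpp
#include <memory>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a hipSYCL runtime DAG node.
struct dag_node {
  void wait()    { /* block until the node's backend event has completed */ }
  void release() { /* allow memory management to reclaim associated buffers */ }
};

// Instead of polling event state before every new submission, hand the whole
// DAG batch to a worker thread that blocks until all nodes have finished and
// only then releases them.
void manage_batch_lifetime(std::vector<std::shared_ptr<dag_node>> batch) {
  std::thread worker{[batch = std::move(batch)] {
    for (const auto& node : batch)
      node->wait();     // asynchronous with respect to the submission path
    for (const auto& node : batch)
      node->release();  // safe now: nothing can still be using the buffers
  }};
  worker.detach();      // the submission path returns without any event query
}
```

(The sketch spawns a thread per batch for brevity; as described above, the PR performs the wait in an asynchronous worker thread. Only the ordering of wait and release is the point here.)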

@illuhad (Collaborator, Author) commented Jul 14, 2022

For best effect, we should probably turn the dag_node requirements list into something that stores weak_ptr instead of shared_ptr; otherwise the benefits of the event pool may not kick in, and this PR might not have a big impact for long DAG chains.
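
A rough sketch of what that could look like, again with hypothetical member names rather than the real hipSYCL runtime code:

```cpp
#include <memory>
#include <vector>

struct dag_node {
  // Storing weak_ptr means a node no longer keeps its requirements alive:
  // once a predecessor has been released by the asynchronous waiter, long
  // DAG chains do not pin the whole history of nodes (and their events).
  std::vector<std::weak_ptr<dag_node>> requirements;

  void synchronize_with_requirements() {
    for (const auto& weak_req : requirements) {
      if (auto req = weak_req.lock()) {
        // Requirement is still alive, i.e. not yet known to be complete;
        // we may still need to synchronize with its event here.
      } else {
        // Expired: the node was already released by the async waiter,
        // so no event query or synchronization is needed at all.
      }
    }
  }
};
```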

@pszi1ard commented Jul 14, 2022

Thanks @illuhad, I guess it would make the most sense to test this both with and without #770?

@illuhad (Collaborator, Author) commented Jul 14, 2022

I think you should see a benefit both with and without #770 due to the overall lower submission latency. For the full benefit you probably need #771; otherwise you might still have event creation calls in the task submission path (depending on your DAG).

Ideally, with #771 and this PR, hipLaunchKernel and cudaLaunchKernel should dominate kernel submission latency, with no other CUDA or HIP API calls for most submissions (except hip/cudaEventRecord if you are not using coarse-grained events, but event recording seems to be fairly lightweight).

@pszi1ard commented:
Makes sense, I'll report back when I have some results.

@sbalint98 (Collaborator) commented:
With all currently open PRs (#761, #770, #771) together, I see a ~26% improvement in end-to-end runtime on a smaller test (adh_dodec), for which the runtime overheads are more significant. Although some multi-rank tests are failing on my machine, the single-rank performance improvement looks promising:

baseline (develop): 128.967 ns/day
with #761, #770, #771: 162.925 ns/day

The failing tests:

         65 - MdrunTestsTwoRanks (SEGFAULT)
         74 - MdrunMpi2RankPmeTests (SEGFAULT)
         76 - MdrunMpiCoordinationBasicTestsTwoRanks (SEGFAULT)
         78 - MdrunMpiCoordinationCouplingTestsTwoRanks (SEGFAULT)
         80 - MdrunMpiCoordinationConstraintsTestsTwoRanks (SEGFAULT)

@illuhad (Collaborator, Author) commented Jul 18, 2022

The segfaults are presumably due to #771; that one is known to still be unstable.

@illuhad (Collaborator, Author) commented Jul 18, 2022

@sbalint98 Do you still observe overheads? If so, from where?

@sbalint98 (Collaborator) commented Jul 20, 2022

One of the main issues I see is that there is still some kernel launch overhead for hipSYCL. If I look at the traces, for every kernel launch in the hipSYCL case there is a hipLaunchKernel, a __hipPushCallConfiguration, and a __hipPopCallConfiguration; I don't see that in the HIP case. In one instance there are 52 us between two hipLaunchKernel calls, while in the HIP case there are only 12 us. Unfortunately, this seeps into the end-to-end performance, since in this particular example the kernels are pretty short: between 50 us and 500 us.

EDIT: Maybe the way hipSYCL launches kernels is the problem? Is there anything that prevents replacing this with hipLaunchKernelGGL?

@illuhad (Collaborator, Author) commented Jul 20, 2022

HIP will have __hipPush/PopCallConfiguration() too. The only difference is that there they are inserted by the compiler, while we do that manually. The <<<>>> syntax is translated into __hipPushCallConfiguration() followed by hipLaunchKernel(). I don't know why those calls don't show up in the traces for HIP; maybe they get inlined.
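
For reference, a simplified sketch of what Clang's HIP lowering of the <<<>>> syntax roughly amounts to (the real compiler-generated stub differs in details):

```cpp
#include <hip/hip_runtime.h>

__global__ void my_kernel(int* data) { data[threadIdx.x] = 1; }

// Roughly what `my_kernel<<<grid, block, 0, stream>>>(ptr)` turns into:
void launch_my_kernel(int* ptr, dim3 grid, dim3 block, hipStream_t stream) {
  // 1. The call site pushes the launch configuration...
  __hipPushCallConfiguration(grid, block, /*sharedMem=*/0, stream);

  // 2. ...and the compiler-generated host stub pops it again and forwards
  //    everything to hipLaunchKernel.
  dim3 grid_dim, block_dim;
  size_t shared_mem;
  hipStream_t launch_stream;
  __hipPopCallConfiguration(&grid_dim, &block_dim, &shared_mem, &launch_stream);

  void* args[] = {&ptr};
  hipLaunchKernel(reinterpret_cast<const void*>(&my_kernel),
                  grid_dim, block_dim, args, shared_mem, launch_stream);
}
```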

It is not possible to call hipLaunchKernelGGL because hipSYCL kernels cannot be marked __global__ since they have to invoke functions not marked as __device__. But again, it's very hard to imagine a performance difference coming from there.

In my testing, I consistently see only negligible latency on top of the cudaLaunchKernel/hipLaunchKernel invocations. Perhaps your GPU has vastly lower kernel submission latency, so that additional overheads become noticeable.

Can you maybe use a profiler like VTune to create a flame graph or similar to gather more insight into where exactly this overhead is coming from?

EDIT: We could try to invoke hipLaunchKernel directly (it takes the function address of the kernel as an argument). Not sure if it will work with hipSYCL kernels, but it might (we need to double-check that we don't rely on having the kernel in the call graph). This would bypass clang-generated kernel launches completely.

@illuhad (Collaborator, Author) commented Jul 20, 2022

Okay, it seems that hipLaunchKernelGGL either uses kernel<<<>>>() syntax and relies on the compiler-generated kernel launch stub (as we do), or invokes hipLaunchKernel directly, depending on a preprocessor definition:
https://github.com/ROCm-Developer-Tools/hipamd/blob/0d7eebeccaa4afd58b8ebddd5c6304e0eef0b422/include/hip/amd_detail/amd_hip_runtime.h#L234

This is new; in earlier ROCm versions it would just invoke kernel<<<>>>(). I don't know which of the paths is enabled by default, but I wonder if there is a performance difference between the kernel<<<>>>() syntax and invoking hipLaunchKernel directly.
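
If we wanted to experiment with the direct path, a hedged sketch of calling hipLaunchKernel with the kernel's function address, bypassing the <<<>>> stub and the push/pop configuration pair, might look like this (whether this works for hipSYCL kernels is exactly what would need to be verified):

```cpp
#include <hip/hip_runtime.h>

__global__ void my_kernel(float* out, float value) {
  out[blockIdx.x * blockDim.x + threadIdx.x] = value;
}

hipError_t launch_directly(float* d_out, float value, hipStream_t stream) {
  dim3 grid{64}, block{256};
  void* args[] = {&d_out, &value};

  // hipLaunchKernel takes the (host-side) function address of the kernel,
  // plus the launch configuration and a pointer array of kernel arguments.
  return hipLaunchKernel(reinterpret_cast<const void*>(&my_kernel),
                         grid, block, args, /*sharedMemBytes=*/0, stream);
}
```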

@pszi1ard commented:
Here are some benchmarks with GROMACS data on an EPYC 7742 + MI100 (many cores, so there should be no contention for resources), using ROCm 4.5.2.

[benchmark chart: improvement vs. input size]

The drop in improvement at 6-12k inputs is quite peculiar, as I'd expect a more gradual roll-off as the time per iteration becomes increasingly dominated by GPU work. This also shows up in the relative performance vs native HIP (code whose device-kernel performance is roughly on par, so the difference in total performance is mostly due to the runtime).

[benchmark chart: performance relative to native HIP]

@illuhad (Collaborator, Author) commented Jul 22, 2022

@pszi1ard Thanks for this feedback! So, is the interpretation correct that we have solved the performance issue for small problem sizes, but there is still an issue for medium problem sizes?
I wonder if the number of cached DAG nodes plays into this (HIPSYCL_RT_MAX_CACHED_NODES); maybe the cache is too small as the problem gets larger...

@illuhad marked this pull request as ready for review July 26, 2022 01:45
@illuhad (Collaborator, Author) commented Jul 26, 2022

@sbalint98 has reported that HIPSYCL_RT_MAX_CACHED_NODES has no effect on performance. I currently do not know what is going on in the medium problem size regime.

@illuhad (Collaborator, Author) commented Jul 26, 2022

Compared to HIP, we frequently see a lack of concurrency between the nbnxm kernel and the pme kernel in the problematic regime, even though they run in different streams. We suspect resource usage inside the kernels to be the problem; @sbalint98 will investigate register usage.

@illuhad merged commit e934e2e into develop on Jul 27, 2022
@illuhad deleted the feature/async-submitted-ops-management branch July 27, 2022 13:14
@illuhad (Collaborator, Author) commented Aug 2, 2022

@pszi1ard @al42and We found that the SYCL kernels still use more resources; presumably this is why the kernels do not run concurrently. We could also try adding hipSYCL_priority properties to the SYCL queues in GROMACS, or experiment with CU masks to get HIP to overlap the kernels.
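
For reference, a minimal sketch of what opting into queue priorities could look like on the GROMACS side; the exact property spelling and header path depend on the hipSYCL version, and the priority value below is purely illustrative:

```cpp
#include <sycl/sycl.hpp> // header path may differ depending on the hipSYCL version

int main() {
  // Hypothetical sketch: give one of the two queues a non-default priority
  // via the hipSYCL_priority queue property so the HIP backend has a chance
  // to overlap the pme and nbnxm kernels.
  sycl::queue pme_queue{
      sycl::property_list{sycl::property::queue::hipSYCL_priority{1}}};
  sycl::queue nbnxm_queue{}; // default priority

  // ... submit pme kernels to pme_queue and nbnxm kernels to nbnxm_queue ...
  return 0;
}
```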

@pszi1ard commented Aug 3, 2022

> @pszi1ard Thanks for this feedback! So, is the interpretation correct that we have solved the performance issue for small problem sizes, but there is still an issue for medium problem sizes?

Yes, but I think there is some underlying HIP runtime overhead (or perhaps something related to the lack of overlap?) that limits the iteration rate with smaller inputs, which might be masking potential differences.

Here's a three-way comparison of hipSYCL vs native HIP (on an MI100) vs CUDA (on a 2080 Ti) using two different inputs: "pme" has two paths in the task graph with the possibility of overlap, while "rf" has a single path, so all major tasks (with the exception of some smaller auxiliary kernels) need to be executed sequentially.

[benchmark chart: hipSYCL vs native HIP vs CUDA for the "pme" and "rf" inputs]

> @pszi1ard @al42and We found that the SYCL kernels still use more resources; presumably this is why the kernels do not run concurrently. We could also try adding hipSYCL_priority properties to the SYCL queues in GROMACS, or experiment with CU masks to get HIP to overlap the kernels.

@sbalint98 I assume this is the same experimental StreamHPC HIP code that you referred to before?
Without priorities I do not expect kernel overlap, other than kernel tails. What kind of resources limit kernel-tail overlap? I thought registers are per CU, so as soon as a CU drains, another kernel's waves can get scheduled.
