
dag_submitted_ops: Manage node lifetime by asynchronously waiting instead of event queries #761

Merged: 5 commits from feature/async-submitted-ops-management into develop on Jul 27, 2022

Conversation

@illuhad (Collaborator) commented Jul 5, 2022

We need to know when nodes have completed in order to perform e.g. buffer memory management and not deallocate buffers while they are still in use.
Previously this was done by querying event state prior to submitting new tasks, which can become a bottleneck when many small tasks are submitted.
This PR instead waits on all nodes from a DAG batch in an asynchronous worker thread; once the wait completes, the nodes can be released.
In theory, this allows us to avoid event queries entirely.

However, it is currently unclear how this interacts with coarse-grained events (#754), where waiting on an event might map to a cudaStreamSynchronize() or hipStreamSynchronize() call. The practical performance impact also needs to be investigated, so this is still a draft PR.

@al42and @pszi1ard This may also be interesting for you, as it removes event queries from the submission path entirely.
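
For illustration, here is a minimal sketch of the lifetime-management idea in plain C++. The type and function names below are hypothetical stand-ins, not the actual hipSYCL runtime API:

```cpp
#include <memory>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a hipSYCL runtime DAG node.
struct dag_node {
  void wait()    { /* block until the node's backend event has completed */ }
  void release() { /* allow memory management to reclaim associated buffers */ }
};

// Instead of polling event state before every new submission, hand the whole
// DAG batch to a worker thread that blocks until all nodes have finished and
// only then releases them.
void manage_batch_lifetime(std::vector<std::shared_ptr<dag_node>> batch) {
  std::thread worker{[batch = std::move(batch)] {
    for (const auto& node : batch)
      node->wait();     // asynchronous with respect to the submission path
    for (const auto& node : batch)
      node->release();  // safe now: nothing can still be using the buffers
  }};
  worker.detach();      // the submission path returns without any event query
}
```

(The sketch spawns a thread per batch for brevity; as described above, the PR performs the wait in an asynchronous worker thread. Only the ordering of wait and release is the point here.)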

@illuhad (Collaborator, Author) commented Jul 14, 2022

For best effect, we should probably turn the dag_node requirements list into something that stores weak_ptr instead of shared_ptr; otherwise the benefits of the event pool may not kick in, and this PR might not have a big impact for long DAG chains.
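
A rough sketch of what that could look like, again with hypothetical member names rather than the real hipSYCL runtime code:

```cpp
#include <memory>
#include <vector>

struct dag_node {
  // Storing weak_ptr means a node no longer keeps its requirements alive:
  // once a predecessor has been released by the asynchronous waiter, long
  // DAG chains do not pin the whole history of nodes (and their events).
  std::vector<std::weak_ptr<dag_node>> requirements;

  void synchronize_with_requirements() {
    for (const auto& weak_req : requirements) {
      if (auto req = weak_req.lock()) {
        // Requirement is still alive, i.e. not yet known to be complete;
        // we may still need to synchronize with its event here.
      } else {
        // Expired: the node was already released by the async waiter,
        // so no event query or synchronization is needed at all.
      }
    }
  }
};
```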

@pszi1ard commented Jul 14, 2022

Thanks @illuhad, I guess it would make the most sense to test this both with and without #770?

@illuhad (Collaborator, Author) commented Jul 14, 2022

I think you should see a benefit both with and without #770 due to the overall lower submission latency. For the full benefit you probably need #771; otherwise you might still have event creation calls in the task submission path (depending on your DAG).

Ideally, with #771 and this PR, hipLaunchKernel and cudaLaunchKernel should dominate kernel submission latency, with no other CUDA or HIP API calls for most submissions (except hip/cudaEventRecord if you are not using coarse-grained events, but event recording seems to be fairly lightweight).

@pszi1ard commented:
Makes sense, I'll report back when I have some results.

@sbalint98 (Collaborator) commented:
With all currently open PRs (#761, #770, #771) together, I see a ~26% improvement in end-to-end runtime on a smaller test (adh_dodec), for which the runtime overheads are more significant. Although some multi-rank tests are failing on my machine, the single-rank performance improvement looks promising:

baseline (develop): 128.967 ns/day
with #761, #770, #771: 162.925 ns/day

The failing tests:

         65 - MdrunTestsTwoRanks (SEGFAULT)
         74 - MdrunMpi2RankPmeTests (SEGFAULT)
         76 - MdrunMpiCoordinationBasicTestsTwoRanks (SEGFAULT)
         78 - MdrunMpiCoordinationCouplingTestsTwoRanks (SEGFAULT)
         80 - MdrunMpiCoordinationConstraintsTestsTwoRanks (SEGFAULT)

@illuhad (Collaborator, Author) commented Jul 18, 2022

The segfaults are presumably due to #771; that one is known to still be unstable.

@illuhad (Collaborator, Author) commented Jul 18, 2022

@sbalint98 Do you still observe overheads? If so, from where?

@sbalint98 (Collaborator) commented Jul 20, 2022

One of the main issues I see is that there is still some kernel launch overhead for hipSYCL. If I look at the traces, for every kernel launch in the hipSYCL case there is a hipLaunchKernel, a __hipPushCallConfiguration, and a __hipPopCallConfiguration; I don't see that in the HIP case. In one instance there are 52 us between two hipLaunchKernel calls, while in the HIP case there are only 12 us. Unfortunately, this seeps into the end-to-end performance, since in this particular example the kernels are pretty short: between 50 us and 500 us.

EDIT: Maybe the way hipSYCL launches kernels is the problem? Is there anything that prevents replacing this with hipLaunchKernelGGL?

@illuhad (Collaborator, Author) commented Jul 20, 2022

HIP will have __hipPush/PopCallConfiguration() too. The only difference is that there they are inserted by the compiler, while we do that manually. The <<<>>> syntax is translated into __hipPushCallConfiguration() followed by hipLaunchKernel(). I don't know why those calls don't show up in the traces for HIP; maybe they get inlined.
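
For reference, a simplified sketch of what Clang's HIP lowering of the <<<>>> syntax roughly amounts to (the real compiler-generated stub differs in details):

```cpp
#include <hip/hip_runtime.h>

__global__ void my_kernel(int* data) { data[threadIdx.x] = 1; }

// Roughly what `my_kernel<<<grid, block, 0, stream>>>(ptr)` turns into:
void launch_my_kernel(int* ptr, dim3 grid, dim3 block, hipStream_t stream) {
  // 1. The call site pushes the launch configuration...
  __hipPushCallConfiguration(grid, block, /*sharedMem=*/0, stream);

  // 2. ...and the compiler-generated host stub pops it again and forwards
  //    everything to hipLaunchKernel.
  dim3 grid_dim, block_dim;
  size_t shared_mem;
  hipStream_t launch_stream;
  __hipPopCallConfiguration(&grid_dim, &block_dim, &shared_mem, &launch_stream);

  void* args[] = {&ptr};
  hipLaunchKernel(reinterpret_cast<const void*>(&my_kernel),
                  grid_dim, block_dim, args, shared_mem, launch_stream);
}
```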

It is not possible to call hipLaunchKernelGGL because hipSYCL kernels cannot be marked __global__ since they have to invoke functions not marked as __device__. But again, it's very hard to imagine a performance difference coming from there.

In my testing, I consistently see only negligible latency on top of the cudaLaunchKernel/hipLaunchKernel invocations. Perhaps your GPU has vastly lower kernel submission latency, so that additional overheads become noticeable.

Can you maybe use a profiler like VTune to create a flame graph or similar to gather more insight into where exactly this overhead is coming from?

EDIT: We could try to invoke hipLaunchKernel directly (it takes the function address of the kernel as an argument). Not sure if it will work with hipSYCL kernels, but it might (we need to double-check that we don't rely on having the kernel in the call graph). This would bypass clang-generated kernel launches completely.

@illuhad (Collaborator, Author) commented Jul 20, 2022

Okay, it seems that hipLaunchKernelGGL either uses kernel<<<>>>() syntax and relies on the compiler-generated kernel launch stub (as we do), or invokes hipLaunchKernel directly, depending on a preprocessor definition:
https://github.com/ROCm-Developer-Tools/hipamd/blob/0d7eebeccaa4afd58b8ebddd5c6304e0eef0b422/include/hip/amd_detail/amd_hip_runtime.h#L234

This is new; in earlier ROCm versions it would just invoke kernel<<<>>>(). I don't know which of the paths is enabled by default, but I wonder if there is a performance difference between the kernel<<<>>>() syntax and invoking hipLaunchKernel directly.
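
If we wanted to experiment with the direct path, a hedged sketch of calling hipLaunchKernel with the kernel's function address, bypassing the <<<>>> stub and the push/pop configuration pair, might look like this (whether this works for hipSYCL kernels is exactly what would need to be verified):

```cpp
#include <hip/hip_runtime.h>

__global__ void my_kernel(float* out, float value) {
  out[blockIdx.x * blockDim.x + threadIdx.x] = value;
}

hipError_t launch_directly(float* d_out, float value, hipStream_t stream) {
  dim3 grid{64}, block{256};
  void* args[] = {&d_out, &value};

  // hipLaunchKernel takes the (host-side) function address of the kernel,
  // plus the launch configuration and a pointer array of kernel arguments.
  return hipLaunchKernel(reinterpret_cast<const void*>(&my_kernel),
                         grid, block, args, /*sharedMemBytes=*/0, stream);
}
```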

@pszi1ard commented:
Here are some benchmarks with GROMACS data on an EPYC 7742 + MI100 (many cores, so there should be no contention for resources), using ROCm 4.5.2.

[benchmark chart: improvement vs. input size]

The drop in improvement at 6-12k inputs is quite peculiar, as I'd expect a more gradual roll-off as the time per iteration becomes increasingly dominated by GPU work. This also shows up in the relative performance vs native HIP (code whose device-kernel performance is roughly on par, so the difference in total performance is mostly due to the runtime).

[benchmark chart: performance relative to native HIP]

@illuhad (Collaborator, Author) commented Jul 22, 2022

@pszi1ard Thanks for this feedback! So, is the interpretation correct that we have solved the performance issue for small problem sizes, but there is still an issue for medium problem sizes?
I wonder if the number of cached DAG nodes plays into this (HIPSYCL_RT_MAX_CACHED_NODES); maybe the cache is too small as the problem gets larger...

@illuhad marked this pull request as ready for review July 26, 2022 01:45
@illuhad (Collaborator, Author) commented Jul 26, 2022

@sbalint98 has reported that HIPSYCL_RT_MAX_CACHED_NODES has no effect on performance. I currently do not know what is going on in the medium problem size regime.

@illuhad (Collaborator, Author) commented Jul 26, 2022

Compared to HIP, we frequently see a lack of concurrency between the nbnxm kernel and the pme kernel in the problematic regime, even though they run in different streams. We suspect resource usage inside the kernels to be the problem; @sbalint98 will investigate register usage.

@illuhad merged commit e934e2e into develop on Jul 27, 2022
@illuhad deleted the feature/async-submitted-ops-management branch July 27, 2022 13:14
@illuhad (Collaborator, Author) commented Aug 2, 2022

@pszi1ard @al42and We found that the SYCL kernels still use more resources; presumably this is why the kernels do not run concurrently. We could also try adding hipSYCL_priority properties to the SYCL queues in GROMACS, or experiment with CU masks to get HIP to overlap the kernels.
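
For reference, a minimal sketch of what opting into queue priorities could look like on the GROMACS side; the exact property spelling and header path depend on the hipSYCL version, and the priority value below is purely illustrative:

```cpp
#include <sycl/sycl.hpp> // header path may differ depending on the hipSYCL version

int main() {
  // Hypothetical sketch: give one of the two queues a non-default priority
  // via the hipSYCL_priority queue property so the HIP backend has a chance
  // to overlap the pme and nbnxm kernels.
  sycl::queue pme_queue{
      sycl::property_list{sycl::property::queue::hipSYCL_priority{1}}};
  sycl::queue nbnxm_queue{}; // default priority

  // ... submit pme kernels to pme_queue and nbnxm kernels to nbnxm_queue ...
  return 0;
}
```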

@pszi1ard commented Aug 3, 2022

> @pszi1ard Thanks for this feedback! So, is the interpretation correct that we have solved the performance issue for small problem sizes, but there is still an issue for medium problem sizes?

Yes, but I think there is some underlying HIP runtime overhead (or perhaps something related to the lack of overlap?) that limits the iteration rate with smaller inputs, which might be masking potential differences.

Here's a three-way comparison of hipSYCL vs native HIP (on an MI100) vs CUDA (on a 2080 Ti) using two different inputs: "pme" has two paths in the task graph with the possibility of overlap, while "rf" has a single path, so all major tasks (with the exception of some smaller auxiliary kernels) need to be executed sequentially.

[benchmark chart: hipSYCL vs native HIP vs CUDA for the "pme" and "rf" inputs]

> @pszi1ard @al42and We found that the SYCL kernels still use more resources; presumably this is why the kernels do not run concurrently. We could also try adding hipSYCL_priority properties to the SYCL queues in GROMACS, or experiment with CU masks to get HIP to overlap the kernels.

@sbalint98 I assume this is the same experimental StreamHPC HIP code that you referred to before?
Without priorities I do not expect kernel overlap, other than kernel tails. What kind of resources limit kernel-tail overlap? I thought registers are per CU, so as soon as a CU drains, another kernel's waves can get scheduled.
