Add support for HIP explicit multipass #790
Merged
This PR adds support for the `hip.explicit-multipass` compilation flow in the HIP backend. In hipSYCL explicit multipass flows, hipSYCL takes control of multipass compilation, kernel embedding, kernel caching, and low-level module management. It also uses low-level kernel launch mechanisms instead of relying on clang-generated kernel launch stubs.

The motivation for this PR is the discussion regarding the latency of `hipLaunchKernel` itself in PR #761 (CC @sbalint98). On the CUDA side, explicit multipass is known to substantially outperform the CUDA runtime API in terms of kernel launch latency, presumably because our kernel cache is better than whatever the CUDA runtime does. Because of this, I wanted to see whether a similar speedup could be achieved on the HIP side.

With a quick test on my APU, I unfortunately do not yet see evidence of a difference in kernel launch latency between the new HIP explicit multipass flow and the old integrated multipass flow. But to draw proper conclusions, we will have to try again with a proper discrete GPU and look at profiler timelines, which I have not done yet.
It might also be the case that on HIP there is no such difference, because the HIP API ingests compiled binaries, not an IR like PTX on the CUDA side, and therefore the kernel cache performance is potentially less important.
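For context, the low-level module path on the HIP side looks roughly like the following sketch (error handling omitted; the embedded symbol and kernel name are hypothetical, not hipSYCL's actual identifiers). The key difference to CUDA is that `hipModuleLoadData` expects an already compiled code object, whereas the CUDA driver can JIT-compile PTX at this point:

```
#include <hip/hip_runtime.h>

// Rough sketch of the low-level launch path used by explicit multipass.
// `embedded_code_object` stands in for a device binary embedded at compile
// time (name is hypothetical). Error handling omitted for brevity.
extern const char embedded_code_object[];

void launch(void** kernel_args, hipStream_t stream) {
  hipModule_t module;
  hipFunction_t kernel;
  // HIP ingests a fully compiled code object here, not IR like PTX,
  // so there is no JIT step whose cost a kernel cache could hide.
  hipModuleLoadData(&module, embedded_code_object);
  hipModuleGetFunction(&kernel, module, "my_kernel"); // hypothetical name
  hipModuleLaunchKernel(kernel,
                        /*gridDim*/ 64, 1, 1,
                        /*blockDim*/ 256, 1, 1,
                        /*sharedMemBytes*/ 0, stream,
                        kernel_args, /*extra*/ nullptr);
}
```

In practice the loaded module and extracted function would of course be cached rather than reloaded per launch; the sketch only shows the API surface involved.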
HIP explicit multipass is only supported on clang 13+.
TODO: We still need to enforce that limitation; currently, with earlier clang versions things simply will not work, and there is no clear error message.