-
Notifications
You must be signed in to change notification settings - Fork 730
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[libclc][CUDA] Optimize async_work_group_copy for cuda backend #5611
Conversation
…kend case. Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
It is not yet used by the runtime so I have removed it for the time being.
#include __CLC_BODY | ||
#undef __CLC_GENTYPE_MANGLED | ||
#undef __CLC_GENTYPE | ||
|
||
#define __CLC_GENTYPE float8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity: why are only some of these types moved?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the ones which had moved were all cases that could use the "optimized" async copy route in the cuda backend (for > sm_80) because they had 4,8, or 16 byte alignment.
if (__clc_nvvm_reflect_arch() >= 800) { | ||
__nvvm_cp_async_wait_all(); | ||
} | ||
__spirv_ControlBarrier(scope, scope, SequentiallyConsistent); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think SequentiallyConsistent
is the right thing for this PR, because the SYCL 2020 specification is unclear about what the synchronization entails here. But I wanted to call this out as something for us to remember for #4950; the extension should either say what the memory order associated with the synchronization is supposed to be, or we should consider exposing it as part of the API.
@JackAKirk, please, take a look.
|
/verify with intel/llvm-test-suite#856 |
Thanks. This should be fixed now, I missed that prefetch.cl also uses the gentypes.inc. I had to add a dedicated gentypes.inc for the async copy in the cuda backend. |
@@ -71,6 +71,8 @@ relational/isfinite.cl | |||
relational/isinf.cl | |||
relational/isnan.cl | |||
synchronization/barrier.cl | |||
async/async_work_group_strided_copy.cl | |||
async/wait_group_events.cl |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think libclc/ptx-nvidiacl/include/clc/async/gentype.inc must be added here too.
Otherwise, libclc build rules won't re-build the library when libclc/ptx-nvidiacl/include/clc/async/gentype.inc changes.
Please, add all files from include directory.
I would also check that *.inc, *.h files added for CUDA/HIP backends are included in corresponding SOURCES files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I'll have to try to understand this build system better. There is e.g. ptx-nvidiacl/libspirv/math/fma.inc that is not added to sources, although it is included in a directory with the corresponding source file, fma.cl. Should I add all such .inc files to SOURCES?
The libclc/generic/include/clc/async/gentype.inc is also not added to the generic/libspirv/SOURCES file. Is this a mistake or is it only necessary to add libclc/ptx-nvidiacl/include/clc/async/gentype.inc to ptx-nvidiacl/libspirv/SOURCES?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be wrong as I'm not an expert in libclc - it's not used for Intel targets.
I think we can conduct an experiment to check if dependencies are properly handled by the build system.
Could you build libclc, add an empty space to fma.inc, do incremental build for libclc again and check that only fma.cl is built, please?
I'm not sure if clang is able to communicate dependencies to the build system automatically based on preprocessor output for OpenCL language mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK sure sounds good. I'll check this out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So CMake is setting CMAKE_CONFIGURE_DEPENDS for all files listed in the SOURCES file, then all source files listed in SOURCES are built incrementally. However if you add a .inc file to SOURCES that .inc file is not built incrementally: if you also change the .cl source file that includes the .inc file both the .cl file and the .inc file are built incrementally, irrespective of whether the .inc file was included in SOURCES or not.
I don't have an immediate fix to make .inc files build incrementally without also changing the .cl file that they are included from, but I could make an issue for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong opinion, but comments like this: #5429 (comment) makes me think that there might be some problems with moving libclc functionality from *.cl files to *.inc or *.h.
I think it mostly impacts developers building libclc incrementally and I hope it doesn't impact our CI, where we use ccache.
I don't have an immediate fix to make .inc files build incrementally without also changing the .cl file that they are included from, but I could make an issue for this?
I'll leave it for @AerialMantis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm merging the PR as the issue we discuss in this conversion is not blocking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's worth creating a ticket for this, there are quite a few places where .inc
and .h
files are used and this is a bit of a pain point for working with libclc. I'm not a CMake expert but I believe I have seen this problem solved before, perhaps by also adding the .inc
and .h
files with target_include_directories
, though that may require more complex parsing of the SOURCES
files to distinguish those from .cl
files.
This makes use of the sm_80 DMA feature.
The test coverage has been increased in intel/llvm-test-suite#856, to cover all cases which have been optimized in the cuda backend case.
Signed-off-by: jack.kirk jack.kirk@codeplay.com