
[libclc][CUDA] Optimize async_work_group_copy for cuda backend #5611

Merged 7 commits on Feb 24, 2022

Conversation

@JackAKirk (Contributor) commented Feb 18, 2022

This makes use of the sm_80 DMA feature.

Test coverage has been increased in intel/llvm-test-suite#856 to cover all of the cases that are optimized in the cuda backend.

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>

It is not yet used by the runtime so I have removed it for the time being.
@Pennycook previously approved these changes Feb 18, 2022
#include __CLC_BODY
#undef __CLC_GENTYPE_MANGLED
#undef __CLC_GENTYPE

#define __CLC_GENTYPE float8
Contributor:

Out of curiosity: why are only some of these types moved?

Contributor (Author):

So the ones that had moved were all cases that could use the "optimized" async copy route in the cuda backend (for sm_80 and newer), because they have 4-, 8-, or 16-byte alignment.

if (__clc_nvvm_reflect_arch() >= 800) {
__nvvm_cp_async_wait_all();
}
__spirv_ControlBarrier(scope, scope, SequentiallyConsistent);
Contributor:

I think SequentiallyConsistent is the right thing for this PR, because the SYCL 2020 specification is unclear about what the synchronization entails here. But I wanted to call this out as something for us to remember for #4950; the extension should either say what the memory order associated with the synchronization is supposed to be, or we should consider exposing it as part of the API.

@bader bader changed the title Optimized libclc implementation of async_work_group_copy for cuda bac… [libclc] Optimize async_work_group_copy for cuda backend Feb 19, 2022
@bader bader changed the title [libclc] Optimize async_work_group_copy for cuda backend [libclc][CUDA] Optimize async_work_group_copy for cuda backend Feb 19, 2022
@bader commented Feb 21, 2022

@JackAKirk, please, take a look.

Failed Tests (1):
  LIBCLC :: ocl/prefetch.cl

@bader commented Feb 21, 2022

/verify with intel/llvm-test-suite#856

@JackAKirk (Contributor, Author):

> @JackAKirk, please, take a look.
>
> Failed Tests (1):
>   LIBCLC :: ocl/prefetch.cl

Thanks. This should be fixed now, I missed that prefetch.cl also uses the gentypes.inc. I had to add a dedicated gentypes.inc for the async copy in the cuda backend.

@bader bader requested a review from Pennycook February 21, 2022 18:02
@@ -71,6 +71,8 @@ relational/isfinite.cl
relational/isinf.cl
relational/isnan.cl
synchronization/barrier.cl
async/async_work_group_strided_copy.cl
async/wait_group_events.cl
Contributor:

I think libclc/ptx-nvidiacl/include/clc/async/gentype.inc must be added here too.
Otherwise, libclc build rules won't re-build the library when libclc/ptx-nvidiacl/include/clc/async/gentype.inc changes.

Please, add all files from include directory.

I would also check that *.inc, *.h files added for CUDA/HIP backends are included in corresponding SOURCES files.

Contributor (Author):

OK I'll have to try to understand this build system better. There is e.g. ptx-nvidiacl/libspirv/math/fma.inc that is not added to sources, although it is included in a directory with the corresponding source file, fma.cl. Should I add all such .inc files to SOURCES?
The libclc/generic/include/clc/async/gentype.inc is also not added to the generic/libspirv/SOURCES file. Is this a mistake or is it only necessary to add libclc/ptx-nvidiacl/include/clc/async/gentype.inc to ptx-nvidiacl/libspirv/SOURCES?

Contributor:

I might be wrong as I'm not an expert in libclc - it's not used for Intel targets.
I think we can conduct an experiment to check if dependencies are properly handled by the build system.
Could you build libclc, add an empty space to fma.inc, do an incremental libclc build again, and check that only fma.cl is rebuilt, please?
I'm not sure if clang is able to communicate dependencies to the build system automatically based on preprocessor output for OpenCL language mode.

Contributor (Author):

OK sure sounds good. I'll check this out.

Contributor (Author):

So CMake sets CMAKE_CONFIGURE_DEPENDS for all files listed in the SOURCES file, and all source files listed there are built incrementally. However, adding a .inc file to SOURCES does not make it trigger incremental rebuilds on its own: a changed .inc file is only picked up when the .cl source file that includes it also changes, and that happens irrespective of whether the .inc file is listed in SOURCES.

I don't have an immediate fix to make .inc files build incrementally without also changing the .cl file that they are included from, but I could make an issue for this?

Contributor:

I don't have a strong opinion, but comments like #5429 (comment) make me think that there might be some problems with moving libclc functionality from *.cl files into *.inc or *.h files.
I think it mostly impacts developers building libclc incrementally, and I hope it doesn't impact our CI, where we use ccache.

> I don't have an immediate fix to make .inc files build incrementally without also changing the .cl file that they are included from, but I could make an issue for this?

I'll leave it for @AerialMantis.

Contributor:

I'm merging the PR, as the issue we discuss in this conversation is not blocking.

Contributor:

I think it's worth creating a ticket for this; there are quite a few places where .inc and .h files are used, and this is a bit of a pain point for working with libclc. I'm not a CMake expert, but I believe I have seen this problem solved before, perhaps by also adding the .inc and .h files with target_include_directories, though that may require more complex parsing of the SOURCES files to distinguish them from .cl files.
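For reference, a common CMake pattern for this kind of problem (a sketch only, with illustrative file and variable names rather than the actual libclc build rules): when a .cl file is compiled by a custom command, its .inc/.h includes are not tracked automatically, but listing them explicitly in DEPENDS makes edits to them re-trigger the compilation on their own:

```cmake
# Sketch only; target, variable and file names are hypothetical.
add_custom_command(
  OUTPUT  async_work_group_copy.bc
  COMMAND ${LLVM_CLANG} -c -emit-llvm -x cl
          ${CMAKE_CURRENT_SOURCE_DIR}/async/async_work_group_copy.cl
          -o async_work_group_copy.bc
  DEPENDS async/async_work_group_copy.cl
          # Explicit header-style dependency: touching the .inc file
          # alone now rebuilds the .cl file that includes it.
          include/clc/async/gentype.inc)
```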

@bader bader merged commit ab1e594 into intel:sycl Feb 24, 2022