
[libclc][CUDA] Optimize async_work_group_copy for cuda backend #5611

Merged 7 commits on Feb 24, 2022

Conversation

@JackAKirk (Contributor) commented Feb 18, 2022

This makes use of the sm_80 DMA feature.

Test coverage has been increased in intel/llvm-test-suite#856 to cover all of the cases that are optimized in the cuda backend.

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>

It is not yet used by the runtime so I have removed it for the time being.
@Pennycook previously approved these changes Feb 18, 2022
#include __CLC_BODY
#undef __CLC_GENTYPE_MANGLED
#undef __CLC_GENTYPE

#define __CLC_GENTYPE float8
Contributor:

Out of curiosity: why are only some of these types moved?

Contributor (Author):

So the ones that had moved were all cases that could use the "optimized" async copy route in the cuda backend (for sm_80 and newer), because they have 4-, 8-, or 16-byte alignment.

if (__clc_nvvm_reflect_arch() >= 800) {
__nvvm_cp_async_wait_all();
}
__spirv_ControlBarrier(scope, scope, SequentiallyConsistent);
Contributor:

I think SequentiallyConsistent is the right thing for this PR, because the SYCL 2020 specification is unclear about what the synchronization entails here. But I wanted to call this out as something for us to remember for #4950; the extension should either say what the memory order associated with the synchronization is supposed to be, or we should consider exposing it as part of the API.

@bader bader changed the title Optimized libclc implementation of async_work_group_copy for cuda bac… [libclc] Optimize async_work_group_copy for cuda backend Feb 19, 2022
@bader bader changed the title [libclc] Optimize async_work_group_copy for cuda backend [libclc][CUDA] Optimize async_work_group_copy for cuda backend Feb 19, 2022
@bader commented Feb 21, 2022

@JackAKirk, please, take a look.

Failed Tests (1):
  LIBCLC :: ocl/prefetch.cl

@bader commented Feb 21, 2022

/verify with intel/llvm-test-suite#856

@JackAKirk (Contributor, Author):

> @JackAKirk, please, take a look.
>
> Failed Tests (1):
>   LIBCLC :: ocl/prefetch.cl

Thanks. This should be fixed now, I missed that prefetch.cl also uses the gentypes.inc. I had to add a dedicated gentypes.inc for the async copy in the cuda backend.

@bader bader requested a review from Pennycook February 21, 2022 18:02
@@ -71,6 +71,8 @@ relational/isfinite.cl
relational/isinf.cl
relational/isnan.cl
synchronization/barrier.cl
async/async_work_group_strided_copy.cl
async/wait_group_events.cl
Contributor:

I think libclc/ptx-nvidiacl/include/clc/async/gentype.inc must be added here too.
Otherwise, libclc build rules won't re-build the library when libclc/ptx-nvidiacl/include/clc/async/gentype.inc changes.

Please, add all files from include directory.

I would also check that *.inc, *.h files added for CUDA/HIP backends are included in corresponding SOURCES files.

Contributor (Author):

OK I'll have to try to understand this build system better. There is e.g. ptx-nvidiacl/libspirv/math/fma.inc that is not added to sources, although it is included in a directory with the corresponding source file, fma.cl. Should I add all such .inc files to SOURCES?
The libclc/generic/include/clc/async/gentype.inc is also not added to the generic/libspirv/SOURCES file. Is this a mistake or is it only necessary to add libclc/ptx-nvidiacl/include/clc/async/gentype.inc to ptx-nvidiacl/libspirv/SOURCES?

Contributor:

I might be wrong as I'm not an expert in libclc - it's not used for Intel targets.
I think we can conduct an experiment to check if dependencies are properly handled by the build system.
Could you build libclc, add an empty space to fma.inc, do an incremental libclc build again, and check that only fma.cl is rebuilt, please?
I'm not sure if clang is able to communicate dependencies to the build system automatically based on preprocessor output for OpenCL language mode.

Contributor (Author):

OK sure sounds good. I'll check this out.

Contributor (Author):

So CMake sets CMAKE_CONFIGURE_DEPENDS for all files listed in the SOURCES file, and all source files listed there are built incrementally. However, adding a .inc file to SOURCES does not make it trigger incremental rebuilds on its own: a changed .inc file is only picked up when the .cl source file that includes it also changes, and that happens irrespective of whether the .inc file is listed in SOURCES.

I don't have an immediate fix to make .inc files build incrementally without also changing the .cl file that they are included from, but I could make an issue for this?

Contributor:

I don't have a strong opinion, but comments like #5429 (comment) make me think that there might be some problems with moving libclc functionality from *.cl files into *.inc or *.h files.
I think it mostly impacts developers building libclc incrementally, and I hope it doesn't impact our CI, where we use ccache.

> I don't have an immediate fix to make .inc files build incrementally without also changing the .cl file that they are included from, but I could make an issue for this?

I'll leave it for @AerialMantis.

Contributor:

I'm merging the PR, as the issue we discuss in this conversation is not blocking.

Contributor:

I think it's worth creating a ticket for this; there are quite a few places where .inc and .h files are used, and this is a bit of a pain point for working with libclc. I'm not a CMake expert, but I believe I have seen this problem solved before, perhaps by also adding the .inc and .h files with target_include_directories, though that may require more complex parsing of the SOURCES files to distinguish them from .cl files.
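For reference, a common CMake pattern for this kind of problem (a sketch only, with illustrative file and variable names rather than the actual libclc build rules): when a .cl file is compiled by a custom command, its .inc/.h includes are not tracked automatically, but listing them explicitly in DEPENDS makes edits to them re-trigger the compilation on their own:

```cmake
# Sketch only; target, variable and file names are hypothetical.
add_custom_command(
  OUTPUT  async_work_group_copy.bc
  COMMAND ${LLVM_CLANG} -c -emit-llvm -x cl
          ${CMAKE_CURRENT_SOURCE_DIR}/async/async_work_group_copy.cl
          -o async_work_group_copy.bc
  DEPENDS async/async_work_group_copy.cl
          # Explicit header-style dependency: touching the .inc file
          # alone now rebuilds the .cl file that includes it.
          include/clc/async/gentype.inc)
```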

@bader bader merged commit ab1e594 into intel:sycl Feb 24, 2022