[SYCL] Fix marray math function impls #6038

JackAKirk · 2022-04-21T17:33:19Z

This PR aims to fix issue #5991 and provide efficient, working marray math function implementations for all backends.

marray math function support is currently switched on for {n} ({n} defined in #5991), but the implementations are broken and untested. There is also very limited test coverage for sycl::vec cases. The SYCL 2020 specification states that the set {N} ({N} defined in #5991) should be supported for marray math function cases.

All SYCL 2020 math, native math, and half_precision math functions now have marray support when the function's arguments are of type genfloat and have the same argument type for all arguments.

Tests: intel/llvm-test-suite#1002

Signed-off-by: jack.kirk jack.kirk@codeplay.com

JackAKirk · 2022-04-27T15:54:10Z

/verify with intel/llvm-test-suite#1002

including sycl:: math/native/half_precision/experimental cases. removed marray from "floating_list" Signed-off-by: jack.kirk <jack.kirk@codeplay.com>

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>

JackAKirk · 2022-05-11T16:03:27Z

I've added scalar_vector_* lists in this PR that omit marray types, so that math functions can distinguish the marray implementations I added.
The type lists including marrays, used in e.g. is_genfloat, are used in the has_known_identity trait class described in section 4.9.2. of the SYCL 2020 spec. The current marray lists include marrays of size from the set {n} (defined/discussed in #5991) which limits the spans used in array reductions to the set {n}. If we have array reductions then 4.9.2 does not state that they should be limited to the set {n}, although it does not specify what the admissible set of spans are.

I think that it makes more sense to allow array reductions with any span (or at least a larger range than {n}) which would mean updating the marray type lists.

@aobolensk @steffenlarsen what do you think?

JackAKirk · 2022-05-11T16:30:19Z

/verify with intel/llvm-test-suite#1002

JackAKirk · 2022-05-13T09:48:48Z

/verify with intel/llvm-test-suite#1002

FYI I don't have access to see the failures from this. The tests are passing locally for cuda.

sycl/include/CL/sycl/builtins.hpp

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

aelovikov-intel

Can we extend existing tests to capture the new sizes?

steffenlarsen · 2022-05-17T18:30:06Z

I've added scalar_vector_* lists in this PR that omit marray types, so that math functions can distinguish the marray implementations I added. The type lists including marrays, used in e.g. is_genfloat, are used in the has_known_identity trait class described in section 4.9.2. of the SYCL 2020 spec. The current marray lists include marrays of size from the set {n} (defined/discussed in #5991) which limits the spans used in array reductions to the set {n}. If we have array reductions then 4.9.2 does not state that they should be limited to the set {n}, although it does not specify what the admissible set of spans are.

I think that it makes more sense to allow array reductions with any span (or at least a larger range than {n}) which would mean updating the marray type lists.

@aobolensk @steffenlarsen what do you think?

I agree, genfloat is currently too restrictive on marray and we should loosen it. If we did, would this patch be obsolete or would these separate definitions still be required?

aelovikov-intel · 2022-05-17T18:15:52Z

sycl/include/CL/sycl/builtins.hpp

+      std::enable_if_t<detail::is_sgenfloat<T>::value, sycl::marray<T, N>>     \
+      NAME(sycl::marray<T, N> x) __NOEXC {                                     \
+    sycl::marray<T, N> res;                                                    \
+    auto x_vec2 = reinterpret_cast<sycl::vec<T, 2> const *>(&x);               \


I think type punning is UB in C++.

Yes this usage of reinterpret_cast is UB in C++. I've now switched to using std::memcpy instead which leads to identical asm at the default Opt level.
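As a rough illustration of the change described above (not the code from this PR), the sketch below copies two adjacent array elements into a small vector-like struct with std::memcpy instead of reading them through a reinterpret_cast pointer. `Vec2` and `load_pair` are hypothetical names, and std::array stands in for sycl::marray.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for sycl::vec<float, 2>; not the real SYCL type.
struct alignas(8) Vec2 {
  float lo;
  float hi;
};

// Copy elements [start, start + 2) of the array into a Vec2 object.
// std::memcpy is well defined here because both types are trivially copyable,
// whereas reading through reinterpret_cast<const Vec2 *>(&x[start]) would
// break the strict aliasing rules and may also violate Vec2's alignment.
template <std::size_t N>
Vec2 load_pair(const std::array<float, N> &x, std::size_t start) {
  assert(start + 2 <= N && "pair must lie inside the array");
  Vec2 v;
  std::memcpy(&v, &x[start], sizeof(Vec2));
  return v;
}
```

Whether the copy is optimized away depends on the compiler and target, as the later alignment discussion in this thread shows.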

sycl/include/CL/sycl/builtins.hpp

aelovikov-intel · 2022-05-17T19:14:53Z

sycl/include/CL/sycl/builtins.hpp

+
+#undef __SYCL_MATH_FUNCTION_OVERLOAD
+
+#define __SYCL_MATH_FUNCTION_2_OVERLOAD(NAME)                                  \


Would something like this https://godbolt.org/z/adez76fTd be possible here to avoid duplicating this code for all 3 cases?

Yes I think that something along that line would be possible, although I'm not sure that it would necessarily be an improvement, particularly for the current implementation.

Having seen (almost?) the same for loop nine times in this PR, I'd argue it will be. Other reviewers, am I really the only one?

There are a couple of issues with applying the approach suggested in the simple example https://godbolt.org/z/adez76fTd. Firstly, the two __invoke_* function calls that are made in each marray math function implementation require an explicit template parameter that is different in each call, i.e. vec<T, 2> and T. This means that a direct adaptation of https://godbolt.org/z/adez76fTd that called the __invoke_* functions would require two lambdas, Callable Fvec and Callable F, corresponding to the vec<T, 2> and T cases.
Also, with this approach the use of macros to avoid duplicating function declaration lines becomes less attractive, because it would require passing the lambdas to the macros, although I imagine that you probably meant to remove the macros completely.
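For readers following the discussion, here is a hedged sketch of what a shared helper taking two callables could look like. It is not the code from this PR or from the godbolt link: `Pair` is a hypothetical stand-in for vec<T, 2>, std::array stands in for marray, and `apply_pairwise` is an invented name. The pairwise callable plays the role of the vec<T, 2> invocation and the scalar callable handles the trailing element when N is odd.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for sycl::vec<T, 2>.
template <typename T> struct Pair {
  T v[2];
};

// Apply f_vec to consecutive pairs of elements and f to a trailing odd
// element. Data moves in and out of Pair via std::memcpy to avoid type punning.
template <typename T, std::size_t N, typename FVec, typename F>
std::array<T, N> apply_pairwise(const std::array<T, N> &x, FVec f_vec, F f) {
  std::array<T, N> res{};
  for (std::size_t i = 0; i < N / 2; ++i) {
    Pair<T> in;
    std::memcpy(&in, &x[i * 2], sizeof(Pair<T>));
    Pair<T> out = f_vec(in);
    std::memcpy(&res[i * 2], &out, sizeof(Pair<T>));
  }
  if (N % 2)
    res[N - 1] = f(x[N - 1]); // scalar path for the last element
  return res;
}
```

The need to pass both callables at every call site is the extra complexity raised in the comment above.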

I brought this issue up with the team and everyone agreed that although it would be possible to work around these issues the resultant implementation would be more complex and the size of the code would be similar to the current implementation.

I have applied your change to to_vec2 that simplifies it.

Thanks

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

JackAKirk · 2022-05-18T16:33:49Z

Can we extend existing tests to capture the new sizes?

I did not find any existing tests for marray math builtins: this makes sense since the existing implementation was broken because the implementation that was written for scalars/vectors cannot be used for marray cases.

JackAKirk · 2022-05-18T16:35:17Z

I've added scalar_vector_* lists in this PR that omit marray types, so that math functions can distinguish the marray implementations I added. The type lists including marrays, used in e.g. is_genfloat, are used in the has_known_identity trait class described in section 4.9.2. of the SYCL 2020 spec. The current marray lists include marrays of size from the set {n} (defined/discussed in #5991) which limits the spans used in array reductions to the set {n}. If we have array reductions then 4.9.2 does not state that they should be limited to the set {n}, although it does not specify what the admissible set of spans are.
I think that it makes more sense to allow array reductions with any span (or at least a larger range than {n}) which would mean updating the marray type lists.
@aobolensk @steffenlarsen what do you think?

I agree, genfloat is currently too restrictive on marray and we should loosen it. If we did, would this patch be obsolete or would these separate definitions still be required?

Loosening genfloat marray restrictions would not make this patch obsolete because the scalar/vector implementations of these math functions cannot be used for marray cases.
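To make the separation concrete, here is a minimal sketch in standard C++ of how a scalar-only trait can keep a scalar overload and an array overload apart. The names `is_scalar_float` and `my_exp2` are hypothetical, std::array stands in for marray, and none of this is the actual trait machinery in builtins.hpp.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <type_traits>

// Hypothetical trait playing the role of a type list that omits array types.
template <typename T> struct is_scalar_float : std::is_floating_point<T> {};

// Scalar overload: participates only for plain floating-point types.
template <typename T>
std::enable_if_t<is_scalar_float<T>::value, T> my_exp2(T x) {
  return std::exp2(x);
}

// Array overload: accepts any N and forwards each element to the scalar one.
template <typename T, std::size_t N>
std::enable_if_t<is_scalar_float<T>::value, std::array<T, N>>
my_exp2(std::array<T, N> x) {
  std::array<T, N> res{};
  for (std::size_t i = 0; i < N; ++i)
    res[i] = my_exp2(x[i]);
  return res;
}
```

Because the trait rejects array types, the scalar overload drops out of overload resolution for array arguments, so the array overload needs its own implementation regardless of which sizes the broader type lists allow.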

keryell · 2022-05-18T23:27:27Z

sycl/include/sycl/ext/oneapi/experimental/builtins.hpp

+  for (size_t i = 0; i < N / 2; i++) {
+    auto partial_res = __sycl_std::__invoke_exp2<sycl::vec<half, 2>>(
+        sycl::detail::to_vec(x, i * 2));
+    std::memcpy(&res[i * 2], &partial_res, sizeof(vec<half, 2>));


Why not just having conversion operators between all these types to avoid atrocities?

Do you mean an explicit conversion operator allowing something like

marray<T, N> x; marray<vec<T, 2>, N/2> x_vec2 = x;

or similar.

Or an implicit conversion operator allowing x[i * 2] -> vec<T, 2> etc?

I think that the main issue (beyond the question of whether this would make the code more readable or less confusing) is that this would have to break SYCL spec definitions of marray/vec/etc? Perhaps I am missing something or misunderstanding what you mean?

Thanks

marray is supposed to be more general than vec.
What is the long term plan with vec? Will marray replace vec?
@Pennycook, can you comment on that?

So this marray implementation supports any marray size N where N is in the set of size_t: so it is more general than vec in this sense. I think that it makes sense to map the marray implementation onto the vec one like e.g. I have done here by using vec2 (although it would be a fair argument to suggest another impl that makes use of larger vec sizes, as we've discussed in another thread). In theory this should also allow for vectorized register loads, although for some reason in the CUDA backend when we cast from marray to vec the vectorized loads that we see when using the standard vec implementation are not used: This is something we have as a TODO to improve on.

There was a good presentation by your colleagues at syclcon.org this year that talks about it: Untangling Modern Parallel Programming Models https://www.youtube.com/watch?v=6FbW6zVYkxk&list=PL46sP9LM8GsyHAxj1k7MbWrv5f5SlMpIF&index=27

I think that it makes sense to map the marray implementation onto the vec one like e.g. I have done here by using vec2 (although it would be a fair argument to suggest another impl that makes use of larger vec sizes, as we've discussed in another thread). In theory this should also allow for vectorized register loads, although for some reason in the CUDA backend when we cast from marray to vec the vectorized loads that we see when using the standard vec implementation are not used: This is something we have as a TODO to improve on.

We've discussed this a bit offline.
The reason why loads and stores to/from marrays are not being vectorized is that LLVM's load-store-vectorizer pass has a strict requirement on alignment: the pointers have to be aligned to at least what the resulting vector would require (or the target must allow misaligned operations). The alignment requirement can easily be met by changing the default alignment of marray to that of the "previous" vector, i.e. marray<T, 15> would be decorated with __attribute__((aligned(8 * sizeof(T)))), making sure the alignment is a power of 2.
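A minimal sketch of that alignment rule, assuming a hypothetical `AlignedArray` stand-in rather than the real sycl::marray, and assuming sizeof(T) is itself a power of 2 so that the computed alignment is valid. `prev_pow2` is an invented helper name, and the standard alignas specifier is used in place of the GCC-style attribute.

```cpp
#include <cstddef>

// Largest power of two that is less than or equal to n (for n >= 1).
constexpr std::size_t prev_pow2(std::size_t n) {
  std::size_t p = 1;
  while (p * 2 <= n)
    p *= 2;
  return p;
}

// Hypothetical marray-like container aligned to the "previous" vector size.
// Assumes sizeof(T) is a power of two, so the product is a valid alignment.
template <typename T, std::size_t N>
struct alignas(prev_pow2(N) * sizeof(T)) AlignedArray {
  T data[N];
};

// 15 elements round down to an 8-element vector, 16 elements stay at 16.
static_assert(alignof(AlignedArray<float, 15>) == 8 * sizeof(float),
              "15 floats should be aligned like an 8-float vector");
static_assert(alignof(AlignedArray<float, 16>) == 16 * sizeof(float),
              "16 floats should be aligned like a 16-float vector");
```

One side effect worth noting is that the object size is padded up to a multiple of the alignment, so a 15-float array in this sketch occupies the space of 16 floats.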

WRT the behind-the-scenes conversion of marray elements to vectors (to_vec2): it seems wrong, and it is unlikely to bring performance benefits, as it will always result in temporary storage and extra load and store instructions. Perhaps we could use the same approach as CUTLASS does: given that the alignment is set correctly, we could do a bit of type punning, see: https://github.com/NVIDIA/cutlass/blob/e7a61c761a4bfb387b61c03cdbcd19ab300726b7/include/cutlass/functional.h#L1444

I had a go at the above in here: JackAKirk#1

As a side note, it feels like a compiler optimization is being missed somewhere here: we should be able to use the same logic as the vectorizer pass to gather those scalar intrinsics and convert them to their vector equivalents, which would make this code a lot cleaner.

@aelovikov-intel @keryell

What do you think about @jchlanda 's suggestion above?

I don't think it becomes less UB.

it seems wrong, it is unlikely to bring performance benefits, as it will always result in temporary storage and extra loads, stores instructions.

Why wouldn't the compiler optimize this? This is a standard C++ idiom for the compiler - https://en.cppreference.com/w/cpp/numeric/bit_cast

Why wouldn't the compiler optimize this? This is a standard C++ idiom for the compiler - https://en.cppreference.com/w/cpp/numeric/bit_cast

The compiler cannot optimize it away because of the difference in alignment between (the elements of) marray and the temporary vec2 variable, as marray follows the std::array alignment rules. For the same reason the compiler was unable to generate vector loads and stores directly to/from marray; this is enforced in the PTX spec:

By default, vector variables are aligned to a multiple of their overall size (vector length times base-type size), to enable vector load and store instructions which require addresses aligned to a multiple of the access size.

This patch changes it though, and marray satisfies the alignment requirement of the "previous" vector.

I don't think it becomes less UB.

You are right. I reverted the type-punning casts; with the alignment fix, LLVM is clever enough to optimize away the vec2 variable and the call to memcpy.

I don't think it becomes less UB.

You are right. I reverted the type-punning casts; with the alignment fix, LLVM is clever enough to optimize away the vec2 variable and the call to memcpy.

Sounds like we have a winner. I can merge JackAKirk#1 into this PR if everyone is happy now?

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

JackAKirk · 2022-10-19T17:47:42Z

/verify with intel/llvm-test-suite#1002

steffenlarsen

LGTM! Would still like to have the comment from https://github.com/intel/llvm/pull/6038/files#r973197193.

JackAKirk · 2022-10-21T10:50:48Z

LGTM! Would still like to have the comment from https://github.com/intel/llvm/pull/6038/files#r973197193.

I did add comments on the scalar functions:

llvm/sycl/include/sycl/ext/oneapi/experimental/builtins.hpp

Line 144 in 4824d63

// genfloath exp2 (genfloath x)

and

llvm/sycl/include/sycl/ext/oneapi/experimental/builtins.hpp

Line 97 in 4824d63

// backends we revert to the sycl::tanh impl.

I can also add the same comment here

llvm/sycl/include/sycl/ext/oneapi/experimental/builtins.hpp

Line 118 in 4824d63

template <typename T, size_t N>

?

Or perhaps I also misunderstood the comment you meant to add?

Thanks

steffenlarsen · 2022-10-21T10:56:20Z

I think the comments you mentioned are good, but there seem to be more functions using native functions for NVPTX but not for other targets, like the place where the aforementioned comment is. May not be clear that there is a link between the other definitions using native for NVPTX only and the new ones doing the same.

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

JackAKirk · 2022-10-21T14:07:22Z

I think the comments you mentioned are good, but there seem to be more functions using native functions for NVPTX but not for other targets, like the place where the aforementioned comment is. May not be clear that there is a link between the other definitions using native for NVPTX only and the new ones doing the same.

I see what you mean. I've added the comments in the two other places now.

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

JackAKirk · 2022-11-17T09:34:18Z

Hi @keryell
Do you have any more reviews for this?
Thanks

JackAKirk · 2022-11-17T15:41:42Z

Hi @keryell Do you have any more reviews for this? Thanks

Also we want this in before we finish up the marray complex extension: #6550.

JackAKirk · 2022-12-01T17:08:13Z

Hi @keryell Do you have any more reviews for this? Thanks

Also we want this in before we finish up the marray complex extension: #6550.

@bader Do you want to ask someone else to review this perhaps? There are now new conflicts, which I can fix. But if this PR stays open it is inevitable that there will be future merge conflicts to deal with.

bader · 2022-12-01T17:14:13Z

@bader Do you want to ask someone else to review this perhaps?

No. I requested @keryell to review to make sure that his previous comments are addressed.
If you think that you have received all the needed approvals, I suggest we merge it and address further comments in follow-up commits.

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

keryell

Thanks.
I recently noticed KhronosGroup/SYCL-Docs#320 by the way, so the marray are not yet a drop-in replacement for vec.

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

JackAKirk · 2022-12-02T09:54:27Z

/verify with intel/llvm-test-suite#1002

JackAKirk · 2022-12-02T14:54:24Z

/verify with intel/llvm-test-suite#1002

JackAKirk · 2022-12-02T17:14:56Z

@bader Do you want to ask someone else to review this perhaps?

No. I requested @keryell to review to make sure that his previous comments are addressed. If you think that you received all needed approves, I suggest we merge it and address further comments in follow-up commits.

OK. I'm now happy for this PR and intel/llvm-test-suite#1002 to be merged. I've merged the latest sycl branch here, resolved the conflicts, and fixed one test that was failing as a result. It is now passing the llvm-test-suite run using intel/llvm-test-suite#1002. The AMD test failures are, I think, unrelated and are also seen in other PRs.

Thanks.

Tests for marray/vec SYCL math functions from: intel/llvm#6038 Signed-off-by: jack.kirk <jack.kirk@codeplay.com>

Tests for marray/vec SYCL math functions from: intel#6038 Signed-off-by: jack.kirk <jack.kirk@codeplay.com>

JackAKirk requested a review from a team as a code owner April 21, 2022 17:33

JackAKirk requested a review from v-klochkov April 21, 2022 17:33

JackAKirk marked this pull request as draft April 21, 2022 17:33

JackAKirk removed the request for review from v-klochkov April 21, 2022 17:34

This was referenced Apr 21, 2022

[SYCL] Tests for vec/marray math intel/llvm-test-suite#1002

Merged

[SYCL] Add tests for native math extension intel/llvm-test-suite#895

Merged

JackAKirk force-pushed the fp16x2_marray branch from 7b797f2 to 6633586 Compare April 26, 2022 11:12

JackAKirk marked this pull request as ready for review April 26, 2022 11:12

Working marray math impls

8c8ab7e

including sycl:: math/native/half_precision/experimental cases. removed marray from "floating_list" Signed-off-by: jack.kirk <jack.kirk@codeplay.com>

JackAKirk force-pushed the fp16x2_marray branch from 76c4efd to 8c8ab7e Compare May 9, 2022 13:02

JackAKirk added 4 commits May 11, 2022 15:36

introduced scalar_vector lists not including marray.

477d079

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>

Merge branch 'sycl' into fp16x2_marray

63e7f44

format

a0e5bac

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>

format

a60c15c

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>

steffenlarsen reviewed May 16, 2022

sycl/include/CL/sycl/builtins.hpp

used is_sgenfloat where possible.

18df26a

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

steffenlarsen requested review from steffenlarsen and aelovikov-intel May 17, 2022 17:59

aelovikov-intel reviewed May 17, 2022

reinterpret_cast usage -> std::memcpy.

2056258

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

keryell reviewed May 18, 2022

to_vec -> to_vec2 naming

8fc29a8

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

steffenlarsen approved these changes Oct 21, 2022

Added non-native fallback comments for marray cases.

a409474

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

JackAKirk mentioned this pull request Oct 21, 2022

Incomplete SYCL 2020 math functions marray support #5991

Closed

bader requested a review from keryell October 23, 2022 17:43

c++14 -> c++17

e0cbd72

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

JackAKirk added 4 commits December 1, 2022 19:36

Merge branch 'sycl' into fp16x2_marray

a0759ad

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

Update host error msg, switch to sycl::bfloat16.

a0cdf60

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

format and used bfloat16 without fully qualified name.

1d88f2d

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

Remove duplicated bfloat16 math functs.

8772526

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

keryell approved these changes Dec 1, 2022

JackAKirk added 2 commits December 2, 2022 09:26

Format.

db896f2

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

format

201ae24

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

bader merged commit 73a992b into intel:sycl Dec 2, 2022

bader pushed a commit to intel/llvm-test-suite that referenced this pull request Dec 2, 2022

[SYCL] Tests for vec/marray math (#1002)

c873dbc

Tests for marray/vec SYCL math functions from: intel/llvm#6038 Signed-off-by: jack.kirk <jack.kirk@codeplay.com>

JackAKirk mentioned this pull request Dec 9, 2022

[SYCL] Add bfloat16 utils based on libdevice bfloat16 support. #7503

Closed

aelovikov-intel pushed a commit to aelovikov-intel/llvm that referenced this pull request Mar 27, 2023

[SYCL] Tests for vec/marray math (intel/llvm-test-suite#1002)

9d505d3

Tests for marray/vec SYCL math functions from: intel#6038 Signed-off-by: jack.kirk <jack.kirk@codeplay.com>

JackAKirk mentioned this pull request May 10, 2023

[SYCL] Set marray alignment enabling vectorized loads #9395

Closed

