Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL] Fix marray math function impls #6038

Merged
merged 40 commits into from
Dec 2, 2022
Merged

Conversation

JackAKirk
Copy link
Contributor

@JackAKirk JackAKirk commented Apr 21, 2022

This PR aims to fix issue : #5991 and provide efficient working marray math function implementations for all backends.

marray math function support is currently switched on for {n} ({n} defined in #5991) but the implementations are currently broken and untested. There is also very limited test coverage for sycl::vec cases. The sycl 2020 specification states that the set {N} ({N} defined in #5991) should be supported for marray math function cases.

All SYCL 2020 math, native math, and half_precision math functions now have marray support when the function's arguments are of type genfloat and have the same argument type for all arguments.

Tests: intel/llvm-test-suite#1002

Signed-off-by: jack.kirk jack.kirk@codeplay.com

@JackAKirk
Copy link
Contributor Author

/verify with intel/llvm-test-suite#1002

including sycl:: math/native/half_precision/experimental cases.
removed marray from "floating_list"

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
@JackAKirk
Copy link
Contributor Author

I've added scalar_vector_* lists in this PR that omit marray types, so that math functions can distinguish the marray implementations I added.
The type lists including marrays, used in e.g. is_genfloat, are used in the has_known_identity trait class described in section 4.9.2. of the SYCL 2020 spec. The current marray lists include marrays of size from the set {n} (defined/discussed in #5991) which limits the spans used in array reductions to the set {n}. If we have array reductions then 4.9.2 does not state that they should be limited to the set {n}, although it does not specify what the admissible set of spans are.

I think that it makes more sense to allow array reductions with any span (or at least a larger range than {n}) which would mean updating the marray type lists.

@aobolensk @steffenlarsen what do you think?

@JackAKirk
Copy link
Contributor Author

/verify with intel/llvm-test-suite#1002

@JackAKirk
Copy link
Contributor Author

/verify with intel/llvm-test-suite#1002

FYI I don't have access to see the failures from this. The tests are passing locally for cuda.

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Copy link
Contributor

@aelovikov-intel aelovikov-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we extend existing tests to capture the new sizes?

@steffenlarsen
Copy link
Contributor

I've added scalar_vector_* lists in this PR that omit marray types, so that math functions can distinguish the marray implementations I added. The type lists including marrays, used in e.g. is_genfloat, are used in the has_known_identity trait class described in section 4.9.2. of the SYCL 2020 spec. The current marray lists include marrays of size from the set {n} (defined/discussed in #5991) which limits the spans used in array reductions to the set {n}. If we have array reductions then 4.9.2 does not state that they should be limited to the set {n}, although it does not specify what the admissible set of spans are.

I think that it makes more sense to allow array reductions with any span (or at least a larger range than {n}) which would mean updating the marray type lists.

@aobolensk @steffenlarsen what do you think?

I agree, genfloat is currently too restrictive on marray and we should loosen it. If we did, would this patch be obsolete or would these separate definitions still be required?

std::enable_if_t<detail::is_sgenfloat<T>::value, sycl::marray<T, N>> \
NAME(sycl::marray<T, N> x) __NOEXC { \
sycl::marray<T, N> res; \
auto x_vec2 = reinterpret_cast<sycl::vec<T, 2> const *>(&x); \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think type punning is UB in C++.

Copy link
Contributor Author

@JackAKirk JackAKirk May 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this usage of reinterpret_cast is UB in C++. I've now switched to using std::memcpy instead which leads to identical asm at the default Opt level.

sycl/include/CL/sycl/builtins.hpp Outdated Show resolved Hide resolved

#undef __SYCL_MATH_FUNCTION_OVERLOAD

#define __SYCL_MATH_FUNCTION_2_OVERLOAD(NAME) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would something like this https://godbolt.org/z/adez76fTd be possible here to avoid duplicating this code for all 3 cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think that something along that line would be possible, although I'm not sure that it would necessarily be an improvement, particularly for the current implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having seen (almost?) the same for loop nine times in this PR, I'd argue it will be. Other reviewers, am I really the only one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a couple of issues with applying the suggested approach in the simple example https://godbolt.org/z/adez76fTd. Firstly the two __invoke_* function calls that are made in each marray math function implementation require the explicit provision of a template parameter that is different in each of the calls, i.e. vec<T, 2> and T. This means that an immediate adaption of https://godbolt.org/z/adez76fTd that directly called the __invoke_* functions would required the provision of two lambdas, Callable Fvec and Callable F corresponding to the vec<T, 2> and T cases.
Also with this approach the usage of macros to prevent duplicating function declaration lines becomes less attractive because this would require passing the lambdas to the macros, although I imagine that you probably meant to remove the macros completely.

I brought this issue up with the team and everyone agreed that although it would be possible to work around these issues the resultant implementation would be more complex and the size of the code would be similar to the current implementation.

I have applied your change to to_vec2 that simplifies it.

Thanks

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@JackAKirk
Copy link
Contributor Author

Can we extend existing tests to capture the new sizes?

I did not find any existing tests for marray math builtins: this makes sense since the existing implementation was broken because the implementation that was written for scalars/vectors cannot be used for marray cases.

@JackAKirk
Copy link
Contributor Author

I've added scalar_vector_* lists in this PR that omit marray types, so that math functions can distinguish the marray implementations I added. The type lists including marrays, used in e.g. is_genfloat, are used in the has_known_identity trait class described in section 4.9.2. of the SYCL 2020 spec. The current marray lists include marrays of size from the set {n} (defined/discussed in #5991) which limits the spans used in array reductions to the set {n}. If we have array reductions then 4.9.2 does not state that they should be limited to the set {n}, although it does not specify what the admissible set of spans are.
I think that it makes more sense to allow array reductions with any span (or at least a larger range than {n}) which would mean updating the marray type lists.
@aobolensk @steffenlarsen what do you think?

I agree, genfloat is currently too restrictive on marray and we should loosen it. If we did, would this patch be obsolete or would these separate definitions still be required?

Loosening genfloat marray restrictions would not make this patch obsolete because the scalar/vector implementations of these math functions cannot be used for marray cases.

for (size_t i = 0; i < N / 2; i++) {
auto partial_res = __sycl_std::__invoke_exp2<sycl::vec<half, 2>>(
sycl::detail::to_vec(x, i * 2));
std::memcpy(&res[i * 2], &partial_res, sizeof(vec<half, 2>));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just having conversion operators between all these types to avoid atrocities?

Copy link
Contributor Author

@JackAKirk JackAKirk May 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean an explicit conversion operator allowing something like

marray<T, N> x;
marray<vec<T, 2>, N/2> x_vec2 = x;

or similar.

Or an implicit conversion operator allowing x[i * 2] -> vec<T, 2> etc?

I think that the main issue (beyond the question of whether this would make the code more readable or less confusing) is that this would have to break SYCL spec definitions of marray/vec/etc? Perhaps I am missing something or misunderstanding what you mean?

Thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

marray is supposed to be more general than vec.
What is the long term plan with vec? Will marray replace vec?
@Pennycook, can you comment on that?

Copy link
Contributor Author

@JackAKirk JackAKirk Jun 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this marray implementation supports any marray size N where N is in the set of size_t: so it is more general than vec in this sense. I think that it makes sense to map the marray implementation onto the vec one like e.g. I have done here by using vec2 (although it would be a fair argument to suggest another impl that makes use of larger vec sizes, as we've discussed in another thread). In theory this should also allow for vectorized register loads, although for some reason in the CUDA backend when we cast from marray to vec the vectorized loads that we see when using the standard vec implementation are not used: This is something we have as a TODO to improve on.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a good presentation by your colleagues at syclcon.org this year that talks about it: Untangling Modern Parallel Programming Models https://www.youtube.com/watch?v=6FbW6zVYkxk&list=PL46sP9LM8GsyHAxj1k7MbWrv5f5SlMpIF&index=27

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that it makes sense to map the marray implementation onto the vec one like e.g. I have done here by using vec2 (although it would be a fair argument to suggest another impl that makes use of larger vec sizes, as we've discussed in another thread). In theory this should also allow for vectorized register loads, although for some reason in the CUDA backend when we cast from marray to vec the vectorized loads that we see when using the standard vec implementation are not used: This is something we have as a TODO to improve on.

We've discussed this a bit offline.
The reason why loads and stores to/from marrays are not being vectorized is that LLVM's load-store-vectorizer pass has a strict requirement on the alignment; the pointers have to be aligned to at least what the resulting vector would require (or the target must allow misaligned operations). The alignment requirement can be easily achieved by changing the default alignment of marray to the "previous" vector (i.e.: marray<15, T> would be decorated with __attribute__((aligned(8 * sizeof(T))), making sure it's a power of 2 number).

WRT behind the scenes conversion of marray elements to vectors (to_vec2), it seems wrong, it is unlikely to bring performance benefits, as it will always result in temporary storage and extra loads, stores instructions. Perhaps we could use the same approach as cutlas does, harnessing the fact that alignment is set correctly, we could do a bit of type punning see: https://github.com/NVIDIA/cutlass/blob/e7a61c761a4bfb387b61c03cdbcd19ab300726b7/include/cutlass/functional.h#L1444

I had a go at the above in here: JackAKirk#1

As a side note, it feels like somewhere here there is a compiler optimization missed, we should be able to use the same logic as the vectorizer pass and gather those scalar intrinsic, converting them to vector equivalents, making this code a lot cleaner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aelovikov-intel @keryell

What do you think about @jchlanda 's suggestion above?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it becomes less UB.

it seems wrong, it is unlikely to bring performance benefits, as it will always result in temporary storage and extra loads, stores instructions.

Why wouldn't the compiler optimize this? This is a standard C++ idiom for the compiler - https://en.cppreference.com/w/cpp/numeric/bit_cast

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why wouldn't the compiler optimize this? This is a standard C++ idiom for the compiler - https://en.cppreference.com/w/cpp/numeric/bit_cast

The compiler mustn't optimize it away because of the difference in the alignment of (elements of) marray and the temporary vec2 variable, as marray follows the std::array alignment rules. For the same reason compiler was unable to generate vector loads and stores directly to/from marray, this is enforced in PTX spec:

By default, vector variables are aligned to a multiple of their overall size (vector length times base-type size), to enable vector load and store instructions which require addresses aligned to a multiple of the access size.

This patch changes it though, and marray satisfies the alignment requirement of the "previous" vector.

I don't think it becomes less UB.

You are right, I reverted the type punning casts, with alignment fix LLVM is clever enough to optimize vec2 variable and a call to memcpy away.

Copy link
Contributor Author

@JackAKirk JackAKirk Jul 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it becomes less UB.

You are right, I reverted the type punning casts, with alignment fix LLVM is clever enough to optimize vec2 variable and a call to memcpy away.

Sounds like we have a winner. I can merge JackAKirk#1 into this PR if everyone is happy now?

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@JackAKirk
Copy link
Contributor Author

/verify with intel/llvm-test-suite#1002

Copy link
Contributor

@steffenlarsen steffenlarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Would still like to have the comment from https://github.com/intel/llvm/pull/6038/files#r973197193.

@JackAKirk
Copy link
Contributor Author

JackAKirk commented Oct 21, 2022

LGTM! Would still like to have the comment from https://github.com/intel/llvm/pull/6038/files#r973197193.

I did add comments on the scalar functions:

// genfloath exp2 (genfloath x)
and
// backends we revert to the sycl::tanh impl.

I can also the same comment here

template <typename T, size_t N>
?

Or perhaps I also misunderstood the comment you meant to add?

Thanks

@steffenlarsen
Copy link
Contributor

I think the comments you mentioned are good, but there seem to be more functions using native functions for NVPTX but not for other targets, like the place where the aforementioned comment is. May not be clear that there is a link between the other definitions using native for NVPTX only and the new ones doing the same.

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@JackAKirk
Copy link
Contributor Author

JackAKirk commented Oct 21, 2022

I think the comments you mentioned are good, but there seem to be more functions using native functions for NVPTX but not for other targets, like the place where the aforementioned comment is. May not be clear that there is a link between the other definitions using native for NVPTX only and the new ones doing the same.

I see what you mean. I've added the comments in the two other places now.

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@JackAKirk
Copy link
Contributor Author

Hi @keryell
Do you have any more reviews for this?
Thanks

@JackAKirk
Copy link
Contributor Author

Hi @keryell Do you have any more reviews for this? Thanks

Also we want this in before we finish up the marray complex extension: #6550.

@JackAKirk
Copy link
Contributor Author

JackAKirk commented Dec 1, 2022

Hi @keryell Do you have any more reviews for this? Thanks

Also we want this in before we finish up the marray complex extension: #6550.

@bader Do you want to ask someone else to review this perhaps? There are now new conflicts, which I can fix. But if this PR stays open it is inevitable that there will be future merge conflicts to deal with.

@bader
Copy link
Contributor

bader commented Dec 1, 2022

@bader Do you want to ask someone else to review this perhaps?

No. I requested @keryell to review to make sure that his previous comments are addressed.
If you think that you received all needed approves, I suggest we merge it and address further comments in follow-up commits.

JackAKirk added 4 commits December 1, 2022 19:36
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Copy link
Contributor

@keryell keryell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.
I recently noticed KhronosGroup/SYCL-Docs#320 by the way, so the marray are not yet a drop-in replacement for vec.

JackAKirk added 2 commits December 2, 2022 09:26
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@JackAKirk
Copy link
Contributor Author

/verify with intel/llvm-test-suite#1002

1 similar comment
@JackAKirk
Copy link
Contributor Author

/verify with intel/llvm-test-suite#1002

@JackAKirk
Copy link
Contributor Author

@bader Do you want to ask someone else to review this perhaps?

No. I requested @keryell to review to make sure that his previous comments are addressed. If you think that you received all needed approves, I suggest we merge it and address further comments in follow-up commits.

OK. I'm now happy for this PR and intel/llvm-test-suite#1002 to be merged now. I've merged the latest sycl branch here, resolved conflicts and fixed one test that was resultantly failing. It is now passing the llvm-test-suite run using intel/llvm-test-suite#1002. The amd test failures are I think unrelated and also seen in other PRs.

Thanks.

@bader bader merged commit 73a992b into intel:sycl Dec 2, 2022
bader pushed a commit to intel/llvm-test-suite that referenced this pull request Dec 2, 2022
Tests for marray/vec SYCL math functions from: intel/llvm#6038

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
aelovikov-intel pushed a commit to aelovikov-intel/llvm that referenced this pull request Mar 27, 2023
Tests for marray/vec SYCL math functions from: intel#6038

Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants