-
Notifications
You must be signed in to change notification settings - Fork 738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL] Fix marray math function impls #6038
Conversation
7b797f2
to
6633586
Compare
/verify with intel/llvm-test-suite#1002 |
including sycl:: math/native/half_precision/experimental cases. removed marray from "floating_list" Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
I've added scalar_vector_* lists in this PR that omit marray types, so that math functions can distinguish the marray implementations I added. I think that it makes more sense to allow array reductions with any span (or at least a larger range than {n}) which would mean updating the marray type lists. @aobolensk @steffenlarsen what do you think? |
/verify with intel/llvm-test-suite#1002 |
FYI I don't have access to see the failures from this. The tests are passing locally for cuda. |
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we extend existing tests to capture the new sizes?
I agree, |
sycl/include/CL/sycl/builtins.hpp
Outdated
std::enable_if_t<detail::is_sgenfloat<T>::value, sycl::marray<T, N>> \ | ||
NAME(sycl::marray<T, N> x) __NOEXC { \ | ||
sycl::marray<T, N> res; \ | ||
auto x_vec2 = reinterpret_cast<sycl::vec<T, 2> const *>(&x); \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think type punning is UB in C++.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this usage of reinterpret_cast
is UB in C++. I've now switched to using std::memcpy
instead which leads to identical asm at the default Opt level.
sycl/include/CL/sycl/builtins.hpp
Outdated
|
||
#undef __SYCL_MATH_FUNCTION_OVERLOAD | ||
|
||
#define __SYCL_MATH_FUNCTION_2_OVERLOAD(NAME) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would something like this https://godbolt.org/z/adez76fTd be possible here to avoid duplicating this code for all 3 cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think that something along that line would be possible, although I'm not sure that it would necessarily be an improvement, particularly for the current implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having seen (almost?) the same for
loop nine times in this PR, I'd argue it will be. Other reviewers, am I really the only one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a couple of issues with applying the suggested approach in the simple example https://godbolt.org/z/adez76fTd. Firstly the two __invoke_*
function calls that are made in each marray
math function implementation require the explicit provision of a template parameter that is different in each of the calls, i.e. vec<T, 2>
and T
. This means that an immediate adaption of https://godbolt.org/z/adez76fTd that directly called the __invoke_*
functions would required the provision of two lambdas, Callable Fvec
and Callable F
corresponding to the vec<T, 2>
and T
cases.
Also with this approach the usage of macros to prevent duplicating function declaration lines becomes less attractive because this would require passing the lambdas to the macros, although I imagine that you probably meant to remove the macros completely.
I brought this issue up with the team and everyone agreed that although it would be possible to work around these issues the resultant implementation would be more complex and the size of the code would be similar to the current implementation.
I have applied your change to to_vec2
that simplifies it.
Thanks
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
I did not find any existing tests for marray math builtins: this makes sense since the existing implementation was broken because the implementation that was written for scalars/vectors cannot be used for marray cases. |
Loosening |
for (size_t i = 0; i < N / 2; i++) { | ||
auto partial_res = __sycl_std::__invoke_exp2<sycl::vec<half, 2>>( | ||
sycl::detail::to_vec(x, i * 2)); | ||
std::memcpy(&res[i * 2], &partial_res, sizeof(vec<half, 2>)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just having conversion operators between all these types to avoid atrocities?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean an explicit conversion operator allowing something like
marray<T, N> x;
marray<vec<T, 2>, N/2> x_vec2 = x;
or similar.
Or an implicit conversion operator allowing x[i * 2]
-> vec<T, 2>
etc?
I think that the main issue (beyond the question of whether this would make the code more readable or less confusing) is that this would have to break SYCL spec definitions of marray/vec/etc? Perhaps I am missing something or misunderstanding what you mean?
Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
marray is supposed to be more general than vec.
What is the long term plan with vec? Will marray replace vec?
@Pennycook, can you comment on that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this marray implementation supports any marray size N where N is in the set of size_t: so it is more general than vec in this sense. I think that it makes sense to map the marray implementation onto the vec one like e.g. I have done here by using vec2 (although it would be a fair argument to suggest another impl that makes use of larger vec sizes, as we've discussed in another thread). In theory this should also allow for vectorized register loads, although for some reason in the CUDA backend when we cast from marray to vec the vectorized loads that we see when using the standard vec implementation are not used: This is something we have as a TODO to improve on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a good presentation by your colleagues at syclcon.org this year that talks about it: Untangling Modern Parallel Programming Models https://www.youtube.com/watch?v=6FbW6zVYkxk&list=PL46sP9LM8GsyHAxj1k7MbWrv5f5SlMpIF&index=27
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that it makes sense to map the marray implementation onto the vec one like e.g. I have done here by using vec2 (although it would be a fair argument to suggest another impl that makes use of larger vec sizes, as we've discussed in another thread). In theory this should also allow for vectorized register loads, although for some reason in the CUDA backend when we cast from marray to vec the vectorized loads that we see when using the standard vec implementation are not used: This is something we have as a TODO to improve on.
We've discussed this a bit offline.
The reason why loads and stores to/from marray
s are not being vectorized is that LLVM's load-store-vectorizer
pass has a strict requirement on the alignment; the pointers have to be aligned to at least what the resulting vector would require (or the target must allow misaligned operations). The alignment requirement can be easily achieved by changing the default alignment of marray
to the "previous" vector (i.e.: marray<15, T>
would be decorated with __attribute__((aligned(8 * sizeof(T)))
, making sure it's a power of 2 number).
WRT behind the scenes conversion of marray
elements to vectors (to_vec2
), it seems wrong, it is unlikely to bring performance benefits, as it will always result in temporary storage and extra loads, stores instructions. Perhaps we could use the same approach as cutlas does, harnessing the fact that alignment is set correctly, we could do a bit of type punning see: https://github.com/NVIDIA/cutlass/blob/e7a61c761a4bfb387b61c03cdbcd19ab300726b7/include/cutlass/functional.h#L1444
I had a go at the above in here: JackAKirk#1
As a side note, it feels like somewhere here there is a compiler optimization missed, we should be able to use the same logic as the vectorizer pass and gather those scalar intrinsic, converting them to vector equivalents, making this code a lot cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about @jchlanda 's suggestion above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it becomes less UB.
it seems wrong, it is unlikely to bring performance benefits, as it will always result in temporary storage and extra loads, stores instructions.
Why wouldn't the compiler optimize this? This is a standard C++ idiom for the compiler - https://en.cppreference.com/w/cpp/numeric/bit_cast
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why wouldn't the compiler optimize this? This is a standard C++ idiom for the compiler - https://en.cppreference.com/w/cpp/numeric/bit_cast
The compiler mustn't optimize it away because of the difference in the alignment of (elements of) marray
and the temporary vec2
variable, as marray
follows the std::array
alignment rules. For the same reason compiler was unable to generate vector loads and stores directly to/from marray
, this is enforced in PTX spec:
By default, vector variables are aligned to a multiple of their overall size (vector length times base-type size), to enable vector load and store instructions which require addresses aligned to a multiple of the access size.
This patch changes it though, and marray
satisfies the alignment requirement of the "previous" vector.
I don't think it becomes less UB.
You are right, I reverted the type punning casts, with alignment fix LLVM is clever enough to optimize vec2
variable and a call to memcpy
away.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it becomes less UB.
You are right, I reverted the type punning casts, with alignment fix LLVM is clever enough to optimize
vec2
variable and a call tomemcpy
away.
Sounds like we have a winner. I can merge JackAKirk#1 into this PR if everyone is happy now?
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
/verify with intel/llvm-test-suite#1002 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Would still like to have the comment from https://github.com/intel/llvm/pull/6038/files#r973197193.
I did add comments on the scalar functions:
I can also the same comment here
Or perhaps I also misunderstood the comment you meant to add? Thanks |
I think the comments you mentioned are good, but there seem to be more functions using native functions for NVPTX but not for other targets, like the place where the aforementioned comment is. May not be clear that there is a link between the other definitions using native for NVPTX only and the new ones doing the same. |
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
I see what you mean. I've added the comments in the two other places now. |
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Hi @keryell |
@bader Do you want to ask someone else to review this perhaps? There are now new conflicts, which I can fix. But if this PR stays open it is inevitable that there will be future merge conflicts to deal with. |
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
I recently noticed KhronosGroup/SYCL-Docs#320 by the way, so the marray
are not yet a drop-in replacement for vec
.
/verify with intel/llvm-test-suite#1002 |
1 similar comment
/verify with intel/llvm-test-suite#1002 |
OK. I'm now happy for this PR and intel/llvm-test-suite#1002 to be merged now. I've merged the latest sycl branch here, resolved conflicts and fixed one test that was resultantly failing. It is now passing the llvm-test-suite run using intel/llvm-test-suite#1002. The amd test failures are I think unrelated and also seen in other PRs. Thanks. |
Tests for marray/vec SYCL math functions from: intel/llvm#6038 Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
Tests for marray/vec SYCL math functions from: intel#6038 Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
This PR aims to fix issue : #5991 and provide efficient working marray math function implementations for all backends.
marray math function support is currently switched on for {n} ({n} defined in #5991) but the implementations are currently broken and untested. There is also very limited test coverage for sycl::vec cases. The sycl 2020 specification states that the set {N} ({N} defined in #5991) should be supported for marray math function cases.
All SYCL 2020 math, native math, and half_precision math functions now have marray support when the function's arguments are of type
genfloat
and have the same argument type for all arguments.Tests: intel/llvm-test-suite#1002
Signed-off-by: jack.kirk jack.kirk@codeplay.com