-
Notifications
You must be signed in to change notification settings - Fork 10.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AArch64][compiler-rt] Add memcpy, memset, memmove, memchr builtins. #77496
[AArch64][compiler-rt] Add memcpy, memset, memmove, memchr builtins. #77496
Conversation
✅ With the latest revision this PR passed the C/C++ code formatter. |
…lementation RT builtins. Add naive implementation of memcpy, memset, memmove, memchr for SME targets. Patch co-authored by David Sherwood <david.sherwood@arm.com>
1845737
to
2db7921
Compare
// restrict keywords here. This also matches the definition of the libc memcpy | ||
// according to the man page. | ||
void *__arm_sc_memcpy(void *__restrict__ dest, const void *__restrict__ src, | ||
size_t n) __arm_streaming_compatible __arm_preserves_za { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The compiler building compiler-rt may not support __arm_streaming_compatible
. You'll want to check for the SME attributes when configuring the project with cmake.
I'd also suggest removing __arm_preserves_za
in anticipation of this ACLE PR landing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, not sure what extra required for __arm_streaming_compatible
avaliblity check, We have already check that in compiler-rt/lib/builtins/CMakeLists.txt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
#include <time.h> | ||
|
||
#define N 1024 | ||
#define NREPS 1234 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this test is testing more than just functionality (i.e. it seems to be testing performance). This can be seen from e.g. NREPS and get_time_diff
. I think you'll want to make this purely a functional test and try to test out specific edge cases for each of the routines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -7,6 +7,7 @@ config.llvm_obj_root = "@LLVM_BINARY_DIR@" | |||
config.llvm_tools_dir = lit_config.substitute("@LLVM_TOOLS_DIR@") | |||
config.compiler_rt_src_root = "@COMPILER_RT_SOURCE_DIR@" | |||
config.compiler_rt_libdir = lit_config.substitute("@COMPILER_RT_RESOLVED_LIBRARY_OUTPUT_DIR@") | |||
config.sme = "@COMPILER_RT_HAS_ASM_SME@" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please call this aarch64.sme
, just to avoid any future name clashes with other kinds of 'sme'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
int main() { | ||
struct timespec tv[2]; | ||
|
||
init(); | ||
|
||
// Test correctness of memcpy | ||
for (int i = 0; i < 67; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a test that goes from 0
to 67
. Is 67
a special number that was chosen for a reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
} | ||
} | ||
|
||
void reinit_dst(int n) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's worth making this a .cpp file so that you can have a class with a constructor to do the init
and have a sum
method, rather than having to call functions like reinit_dst
. We can reduce the sizes of the arrays a bit more to reduce runtime of the test, which makes the impact of reinitialising the entire array negligible.
In general I'd like to see a more structured way to test these routines, where we can see that some of the edge-cases are being tested. At the moment it looks like you've stripped the timing functionality from this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -561,7 +561,7 @@ set(aarch64_SOURCES | |||
) | |||
|
|||
if(COMPILER_RT_HAS_ASM_SME AND (COMPILER_RT_HAS_AUXV OR COMPILER_RT_BAREMETAL_BUILD)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
COMPILER_RT_HAS_ASM_SME tests if the compiler supports SME assembly, but not if the compiler supports the C/C++ level attributes. You'll need to add another check to the function which sets this variable and probably rename the variable to COMPILER_RT_HAS_AARCH64_SME
to make it clear the compiler supports more than just the ASM. (and also to add "aarch64" to the name to avoid future name clashes with other meanings of 'sme')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -41,6 +41,12 @@ asm(\".arch armv9-a+sme\"); | |||
asm(\"smstart\"); | |||
") | |||
|
|||
builtin_check_c_compiler_source(COMPILER_RT_HAS_AARCH64_SME |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you combine this with COMPILER_RT_HAS_ASM_SME
and just have a single COMPILER_RT_HAS_AARCH64_SME
? That way we don't need to have two variables given that you're AND'ing the result later on anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
// WARNING: When building the scalar versions of these functions you need to | ||
// use the compiler flag "-mllvm -disable-loop-idiom-all" to prevent clang | ||
// from recognising a loop idiom and planting calls to memcpy! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a comment here that seems important, but doesn't seem to be addressed by your patch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
// Test correctness of memcpy | ||
void test_memcpy() { | ||
for (int i = 0; i < 8; i++) { | ||
int t[2]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you use an array, rather than two separate variables?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -0,0 +1,108 @@ | |||
// REQUIRES: linux, aarch64-target-arch, aarch64-sme-available |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it require linux?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
void *__arm_sc_memchr(const void *, int, size_t); | ||
} | ||
|
||
class MemoryArea { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just an idea (no need to copy this literally), but I think you'd get more value from using a class like this if you'd do something like this:
template <unsigned N>
class Buffer {
public:
uint8_t ptr[N];
Buffer(int Stride = 0) {
for (unsigned I=0; I < N; I+= Stride) {
Src[I] = I;
}
}
// to more quickly check that the result buffer matches a specific buffer.
Buffer(std::initializer_list<...> S) { }
bool assert_equal(const Buffer &Other) {
// check here that ptr == Other.ptr
}
};
Then you can do things like:
{ // test 1
Memory<8> Src(1), Dst;
__arm_sc_memcpy(Dst.ptr, Src.ptr, 8);
Dst.assert_equal(Src);
Dst.assert_equal({0, 1, 2, 3, 4, 5, 6, 7});
}
{ // test 2
Memory<8> SrcDst(1);
__arm_sc_memcpy(SrcDst.ptr + 1, SrcDst.ptr, 7);
SrcDst.assert_equal({0, 0, 1, 2, 3, 4, 5, 6});
}
You could also compare it to the result from a regular memcpy
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
for (int i = 0; i < 8; i++) { | ||
int t[2]; | ||
if (!__arm_sc_memset(dst, src[i], i)) | ||
abort(); | ||
t[0] = sum_and_reset_dst(dst, N, i); | ||
__arm_sc_memset(dst, src[i], i); | ||
t[1] = sum_and_reset_dst(dst, N, i); | ||
if (t[0] != t[1]) | ||
abort(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed this offline, but please write some more specific tests that target the corner cases of the routines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for rewriting the tests, this is looking much better!
" | ||
asm(\".arch armv9-a+sme\"); | ||
asm(\"smstart\"); | ||
void foo(int a) __arm_streaming_compatible { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
void foo(int a) __arm_streaming_compatible { | |
void foo(void) __arm_streaming_compatible { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
message(STATUS "AArch64 SME ABI routines enabled") | ||
if("${CMAKE_CXX_COMPILER_ID}" STREQUAL "Clang") | ||
set_source_files_properties(aarch64/sme-libc-routines.c PROPERTIES COMPILE_FLAGS "-mllvm -disable-loop-idiom-all") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that you can use the option -fno-builtin
to avoid builtins being used in the implementation of these functions. This flag is supported by both GNU and Clang, which is better than passing some internal LLVM flag.
It does mean you'll need to add: AND COMPILER_RT_HAS_FNO_BUILTIN_FLAG
to the condition on line 563 though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
if (Stride != 0) { | ||
for (unsigned I = 0, Elem = 0; I < N; I++, Elem += Stride) { | ||
ptr[I] = Elem; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
if (Stride != 0) { | |
for (unsigned I = 0, Elem = 0; I < N; I++, Elem += Stride) { | |
ptr[I] = Elem; | |
} | |
} | |
if (Stride == 0) | |
return; | |
for (unsigned I = 0; I < N; ++I) | |
ptr[I] = I * Stride; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
public: | ||
uint8_t ptr[N]; | ||
|
||
Memory(int Stride = 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
Memory(int Stride = 0) { | |
Memory(unsigned Stride = 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
} | ||
|
||
void assert_equal(const Memory &Other) { | ||
for (unsigned I = 0; I < N; I++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can use memcmp here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
} | ||
} | ||
|
||
void assert_equal(const Memory &Other) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you also need to verify that this.N == Other.N ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
if (!Elem) | ||
abort(); | ||
Src.assert_elemt_equal_at(Elem - Src.ptr, *Elem); | ||
assert(__arm_sc_memchr(Src.ptr, E, 8) == memchr(Src.ptr, E, 8)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would this assert is sufficient, i.e.
for (unsigned I=0; I<8; ++I)
assert(__arm_sc_memchr(Src.ptr, Src.ptr[I]) == memchr(Src.ptr, Src.ptr[I]));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
// Testing memset with different pointer offset. | ||
{ | ||
for (unsigned I = 0; I < 16; I++) { | ||
Memory<16> Array(2); | ||
if (!__arm_sc_memset(Array.ptr + I, I, 16 - I)) | ||
abort(); | ||
|
||
uint8_t OrigElem = 0; | ||
for (unsigned J = 0; J < 16; J++) { | ||
if (I == 0) { | ||
Array.assert_elemt_equal_at(J, 0); | ||
} else if (J < I) { | ||
Array.assert_elemt_equal_at(J, OrigElem); | ||
} else { | ||
Array.assert_elemt_equal_at(J, (uint8_t)I); | ||
} | ||
OrigElem += 2; | ||
} | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this testing anything that the above two tests aren't?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
return; | ||
for (unsigned I = 0, Elem = 0; I < N; I++) { | ||
ptr[I] = I * Stride; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drop braces around simple single statement
lower_case
variable naming
assert(N == S.size()); | ||
auto It = S.begin(); | ||
for (unsigned I = 0; I < N; I++) { | ||
assert(ptr[I] == *It++); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drop braces around simple single statement
compiler-rt/lib/builtins uses lower_case
variable naming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
// TODO: Improve performance by copying larger chunks in reverse, or by | ||
// using SVE. | ||
while (n > 0) { | ||
n--; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
void *__arm_sc_memset(void *dest, int c, size_t n) __arm_streaming_compatible { | ||
unsigned char *destp = (unsigned char *)dest; | ||
unsigned char c8 = (unsigned char)c; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delete blank line after variable declaration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
unsigned char c8 = (unsigned char)c; | ||
|
||
for (size_t i = 0; i < n; i++) { | ||
destp[i] = c8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drop braces, ++i
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
size_t n) __arm_streaming_compatible { | ||
const unsigned char *srcp = (const unsigned char *)src; | ||
unsigned char c8 = (unsigned char)c; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delete blank line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
for (size_t i = 0; i < n; i++) { | ||
if (srcp[i] == c8) | ||
return &srcp[i]; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drop braces
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
This can use more description why these functions need to go into compiler-rt. I assume that a future Clang will emit these calls (https://arm-software.github.io/acle/main/acle.html#streaming-compatible-versions-of-standard-routines).
The title has been wrapped by GitHub. It will be more readable by updating the title not to be wrapped. |
unsigned char *destp = (unsigned char *)dest; | ||
const unsigned char *srcp = (const unsigned char *)src; | ||
// If src and dest are identical there is nothing to do! | ||
if ((destp == srcp) || (n == 0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder whether the two conditions (destp == srcp) || (n == 0)
are useful.
A lot of libc implementations don't check destp == srcp
. Some don't check n==0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I think we can safely remove those now. Thanks for finding this.
or length parameter to move is zero.
@dtemirbulatov your change seems to cause some issues when trying to run the lit tests on some bots: For example: https://lab.llvm.org/buildbot/#/builders/247/builds/13502
Can you take a look? |
Do you need to add a Also, you do |
…iltins. (#77496)" This reverts commit 3ab8d2a. This change is causing issues running lit tests on many bots including: - https://lab.llvm.org/buildbot/#/builders/197/builds/12119 - https://lab.llvm.org/buildbot/#/builders/184/builds/9792 - https://lab.llvm.org/buildbot/#/builders/93/builds/18455 - https://lab.llvm.org/buildbot/#/builders/231/builds/19858 - https://lab.llvm.org/buildbot/#/builders/121/builds/38426 - https://lab.llvm.org/buildbot/#/builders/230/builds/23990 - https://lab.llvm.org/buildbot/#/builders/57/builds/32391 - https://lab.llvm.org/buildbot/#/builders/247/builds/13502 - https://lab.llvm.org/buildbot/#/builders/275/builds/3601 - https://lab.llvm.org/buildbot/#/builders/269/builds/4211 - https://lab.llvm.org/buildbot/#/builders/18/builds/14161 - https://lab.llvm.org/buildbot/#/builders/19/builds/23893 - https://lab.llvm.org/buildbot/#/builders/37/builds/30295 - https://lab.llvm.org/buildbot/#/builders/77/builds/33979
@dtemirbulatov sorry, I reverted your change as it is preventing many bots from running tests. |
When this relands, it'd be cool if you could put a change like 9a03d94 in the reland as well. |
…77496) Add naive implementation of memcpy, memset, memmove, memchr for SME targets. Co-authored-by: David Sherwood <david.sherwood@arm.com>
As mentioned above this seems to be breaking on the buildbots, e.g. https://lab.llvm.org/buildbot/#/builders/275/builds/3646 |
Commited solution for the last failure with 5f47687 |
The above commit hasn't fully fixed the issue - the error has now changed from the original:
To the new error:
|
NB, I've taken a shot at fixing this with 5176df5 , I think using pythonize_bool is adding "_PYBOOL" to the end of the variable name that gets defined, like the nearby config flags (COMPILER_RT_TEST_STANDALONE_BUILD_LIBS for example gets _PYBOOL on the end). |
Looks like that cleared the bots -- @dtemirbulatov you should be able to make the same fix to the line you deleted in |
This is a follow-up to 3112578 -- it looks like passing the added cmake flag to pythonize_bool also appends "_PYBOOL" to the flag name, which this lit config file is missing. This trips up builds such as: https://lab.llvm.org/buildbot/#/builders/275/builds/3661 https://lab.llvm.org/buildbot/#/builders/184/builds/9811 Where COMPILER_RT_HAS_AARCH64_SME ends up expanding to nothing.
Do we have an idea of the expected performance differences here in the presence of SME? |
@@ -0,0 +1,87 @@ | |||
#include <stdlib.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've proposed #79454 to switch this out for stddef.h
which is provided by the compiler resource headers rather than the libc (since this is inside builtins).
This PR adds the The reason to have streaming-compatible variants of these routines is because regular memcpy/memset/etc routines are not streaming-compatible. In practice this means they might be implemented using NEON instructions and so the compiler has to ensure that every call to it will result in a |
I dont usually follow compiler-rt commits, so didnt see this. memcpy and friends can be used a lot - performance for them can be critical and a lot of time has been spent over the years trying to optimize them. Is the idea that these are auto-vectorized? And so might use the SME engine for the copy? I'm not sure that is happening in the current version I have, and single byte memcpys sounds pretty slow. And that's before other considerations like alignment and systems wanting platform specific routines. I would add Wilco (I'm not sure what his username is), he has done a lot of work on these kinds of routines and it would be good to get his input. |
The point of this implementation is that it doesn't use NEON instructions, as those are not valid in Streaming-SVE mode. We'll definitely want to optimise these routines by using (streaming-compatible) SVE instructions or by using the MOPS instructions, if available. This PR is merely a starting point to guarantee functional correctness. It is also possible to use regular memcpy routines, but those will incur a |
I had forgotten that vectorization was disabled in streaming functions. That would make sense why we don't see SVE instructions so far. It sounds like there is plenty of work yet to go. Hopefully there will be a lot of overlap with mops instructions, which would help simplify things a lot. |
Add naive implementation of memcpy, memset, memmove, memchr for SME targets.
Co-authored-by: David Sherwood david.sherwood@arm.com