Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AArch64][compiler-rt] Add memcpy, memset, memmove, memchr builtins. #77496

Merged
merged 7 commits into from
Jan 22, 2024

Conversation

dtemirbulatov
Copy link
Contributor

@dtemirbulatov dtemirbulatov commented Jan 9, 2024

Add naive implementation of memcpy, memset, memmove, memchr for SME targets.
Co-authored-by: David Sherwood david.sherwood@arm.com

Copy link

github-actions bot commented Jan 9, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

…lementation RT builtins.

Add naive implementation of memcpy, memset, memmove, memchr for SME targets.
Patch co-authored by David Sherwood <david.sherwood@arm.com>
// restrict keywords here. This also matches the definition of the libc memcpy
// according to the man page.
void *__arm_sc_memcpy(void *__restrict__ dest, const void *__restrict__ src,
size_t n) __arm_streaming_compatible __arm_preserves_za {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compiler building compiler-rt may not support __arm_streaming_compatible. You'll want to check for the SME attributes when configuring the project with cmake.

I'd also suggest removing __arm_preserves_za in anticipation of this ACLE PR landing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, not sure what extra required for __arm_streaming_compatible avaliblity check, We have already check that in compiler-rt/lib/builtins/CMakeLists.txt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

#include <time.h>

#define N 1024
#define NREPS 1234
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this test is testing more than just functionality (i.e. it seems to be testing performance). This can be seen from e.g. NREPS and get_time_diff. I think you'll want to make this purely a functional test and try to test out specific edge cases for each of the routines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -7,6 +7,7 @@ config.llvm_obj_root = "@LLVM_BINARY_DIR@"
config.llvm_tools_dir = lit_config.substitute("@LLVM_TOOLS_DIR@")
config.compiler_rt_src_root = "@COMPILER_RT_SOURCE_DIR@"
config.compiler_rt_libdir = lit_config.substitute("@COMPILER_RT_RESOLVED_LIBRARY_OUTPUT_DIR@")
config.sme = "@COMPILER_RT_HAS_ASM_SME@"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please call this aarch64.sme, just to avoid any future name clashes with other kinds of 'sme'.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

int main() {
struct timespec tv[2];

init();

// Test correctness of memcpy
for (int i = 0; i < 67; i++) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a test that goes from 0 to 67. Is 67a special number that was chosen for a reason?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}
}

void reinit_dst(int n) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth making this a .cpp file so that you can have a class with a constructor to do the init and have a sum method, rather than having to call functions like reinit_dst. We can reduce the sizes of the arrays a bit more to reduce runtime of the test, which makes the impact of reinitialising the entire array negligible.

In general I'd like to see a more structured way to test these routines, where we can see that some of the edge-cases are being tested. At the moment it looks like you've stripped the timing functionality from this file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -561,7 +561,7 @@ set(aarch64_SOURCES
)

if(COMPILER_RT_HAS_ASM_SME AND (COMPILER_RT_HAS_AUXV OR COMPILER_RT_BAREMETAL_BUILD))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

COMPILER_RT_HAS_ASM_SME tests if the compiler supports SME assembly, but not if the compiler supports the C/C++ level attributes. You'll need to add another check to the function which sets this variable and probably rename the variable to COMPILER_RT_HAS_AARCH64_SME to make it clear the compiler supports more than just the ASM. (and also to add "aarch64" to the name to avoid future name clashes with other meanings of 'sme')

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -41,6 +41,12 @@ asm(\".arch armv9-a+sme\");
asm(\"smstart\");
")

builtin_check_c_compiler_source(COMPILER_RT_HAS_AARCH64_SME
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you combine this with COMPILER_RT_HAS_ASM_SME and just have a single COMPILER_RT_HAS_AARCH64_SME ? That way we don't need to have two variables given that you're AND'ing the result later on anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines +3 to +5
// WARNING: When building the scalar versions of these functions you need to
// use the compiler flag "-mllvm -disable-loop-idiom-all" to prevent clang
// from recognising a loop idiom and planting calls to memcpy!
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a comment here that seems important, but doesn't seem to be addressed by your patch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// Test correctness of memcpy
void test_memcpy() {
for (int i = 0; i < 8; i++) {
int t[2];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you use an array, rather than two separate variables?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -0,0 +1,108 @@
// REQUIRES: linux, aarch64-target-arch, aarch64-sme-available
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it require linux?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

void *__arm_sc_memchr(const void *, int, size_t);
}

class MemoryArea {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an idea (no need to copy this literally), but I think you'd get more value from using a class like this if you'd do something like this:

template <unsigned N>
class Buffer {
public:
  uint8_t ptr[N];

  Buffer(int Stride = 0) {
    for (unsigned I=0; I < N; I+= Stride) {
      Src[I] = I;
    }
  }

  // to more quickly check that the result buffer matches a specific buffer.
  Buffer(std::initializer_list<...> S) { }

  bool assert_equal(const Buffer &Other) {
    // check here that ptr == Other.ptr
  }
};

Then you can do things like:

{ // test 1
  Memory<8> Src(1), Dst;
  __arm_sc_memcpy(Dst.ptr, Src.ptr, 8);
  Dst.assert_equal(Src);
  Dst.assert_equal({0, 1, 2, 3, 4, 5, 6, 7});
}
{ // test 2
  Memory<8> SrcDst(1);
  __arm_sc_memcpy(SrcDst.ptr + 1, SrcDst.ptr, 7);
  SrcDst.assert_equal({0, 0, 1, 2, 3, 4, 5, 6});
}

You could also compare it to the result from a regular memcpy.

Copy link
Contributor Author

@dtemirbulatov dtemirbulatov Jan 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 56 to 65
for (int i = 0; i < 8; i++) {
int t[2];
if (!__arm_sc_memset(dst, src[i], i))
abort();
t[0] = sum_and_reset_dst(dst, N, i);
__arm_sc_memset(dst, src[i], i);
t[1] = sum_and_reset_dst(dst, N, i);
if (t[0] != t[1])
abort();
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed this offline, but please write some more specific tests that target the corner cases of the routines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Collaborator

@sdesmalen-arm sdesmalen-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for rewriting the tests, this is looking much better!

"
asm(\".arch armv9-a+sme\");
asm(\"smstart\");
void foo(int a) __arm_streaming_compatible {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
void foo(int a) __arm_streaming_compatible {
void foo(void) __arm_streaming_compatible {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

message(STATUS "AArch64 SME ABI routines enabled")
if("${CMAKE_CXX_COMPILER_ID}" STREQUAL "Clang")
set_source_files_properties(aarch64/sme-libc-routines.c PROPERTIES COMPILE_FLAGS "-mllvm -disable-loop-idiom-all")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that you can use the option -fno-builtin to avoid builtins being used in the implementation of these functions. This flag is supported by both GNU and Clang, which is better than passing some internal LLVM flag.

It does mean you'll need to add: AND COMPILER_RT_HAS_FNO_BUILTIN_FLAG to the condition on line 563 though.

Copy link
Contributor Author

@dtemirbulatov dtemirbulatov Jan 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 22 to 26
if (Stride != 0) {
for (unsigned I = 0, Elem = 0; I < N; I++, Elem += Stride) {
ptr[I] = Elem;
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
if (Stride != 0) {
for (unsigned I = 0, Elem = 0; I < N; I++, Elem += Stride) {
ptr[I] = Elem;
}
}
if (Stride == 0)
return;
for (unsigned I = 0; I < N; ++I)
ptr[I] = I * Stride;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

public:
uint8_t ptr[N];

Memory(int Stride = 0) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
Memory(int Stride = 0) {
Memory(unsigned Stride = 0) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

void assert_equal(const Memory &Other) {
for (unsigned I = 0; I < N; I++) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use memcmp here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}
}

void assert_equal(const Memory &Other) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you also need to verify that this.N == Other.N ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

if (!Elem)
abort();
Src.assert_elemt_equal_at(Elem - Src.ptr, *Elem);
assert(__arm_sc_memchr(Src.ptr, E, 8) == memchr(Src.ptr, E, 8));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would this assert is sufficient, i.e.

  for (unsigned I=0; I<8; ++I)
    assert(__arm_sc_memchr(Src.ptr, Src.ptr[I]) == memchr(Src.ptr, Src.ptr[I]));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 98 to 118
// Testing memset with different pointer offset.
{
for (unsigned I = 0; I < 16; I++) {
Memory<16> Array(2);
if (!__arm_sc_memset(Array.ptr + I, I, 16 - I))
abort();

uint8_t OrigElem = 0;
for (unsigned J = 0; J < 16; J++) {
if (I == 0) {
Array.assert_elemt_equal_at(J, 0);
} else if (J < I) {
Array.assert_elemt_equal_at(J, OrigElem);
} else {
Array.assert_elemt_equal_at(J, (uint8_t)I);
}
OrigElem += 2;
}
}
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this testing anything that the above two tests aren't?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Collaborator

@sdesmalen-arm sdesmalen-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@MaskRay
Copy link
Member

MaskRay commented Jan 22, 2024

return;
for (unsigned I = 0, Elem = 0; I < N; I++) {
ptr[I] = I * Stride;
}
Copy link
Member

@MaskRay MaskRay Jan 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop braces around simple single statement

lower_case variable naming

assert(N == S.size());
auto It = S.begin();
for (unsigned I = 0; I < N; I++) {
assert(ptr[I] == *It++);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop braces around simple single statement

compiler-rt/lib/builtins uses lower_case variable naming.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// TODO: Improve performance by copying larger chunks in reverse, or by
// using SVE.
while (n > 0) {
n--;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

void *__arm_sc_memset(void *dest, int c, size_t n) __arm_streaming_compatible {
unsigned char *destp = (unsigned char *)dest;
unsigned char c8 = (unsigned char)c;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete blank line after variable declaration

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

unsigned char c8 = (unsigned char)c;

for (size_t i = 0; i < n; i++) {
destp[i] = c8;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop braces, ++i

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

size_t n) __arm_streaming_compatible {
const unsigned char *srcp = (const unsigned char *)src;
unsigned char c8 = (unsigned char)c;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete blank line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

for (size_t i = 0; i < n; i++) {
if (srcp[i] == c8)
return &srcp[i];
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop braces

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@MaskRay
Copy link
Member

MaskRay commented Jan 22, 2024

Add naive implementation of memcpy, memset, memmove, memchr for SME targets.

This can use more description why these functions need to go into compiler-rt. I assume that a future Clang will emit these calls (https://arm-software.github.io/acle/main/acle.html#streaming-compatible-versions-of-standard-routines).

…lementation RT builtins.

The title has been wrapped by GitHub. It will be more readable by updating the title not to be wrapped.

unsigned char *destp = (unsigned char *)dest;
const unsigned char *srcp = (const unsigned char *)src;
// If src and dest are identical there is nothing to do!
if ((destp == srcp) || (n == 0))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether the two conditions (destp == srcp) || (n == 0) are useful.

A lot of libc implementations don't check destp == srcp. Some don't check n==0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I think we can safely remove those now. Thanks for finding this.

@dtemirbulatov dtemirbulatov changed the title [AArch64][compiler-rt] Add memcpy, memset, memmove, memchr simple imp… [AArch64][compiler-rt] Add memcpy, memset, memmove, memchr builtins. Jan 22, 2024
@dtemirbulatov dtemirbulatov merged commit 3ab8d2a into llvm:main Jan 22, 2024
4 checks passed
@dtemirbulatov dtemirbulatov deleted the compiler-rt-sme-libc-routines branch January 22, 2024 23:40
@dyung
Copy link
Collaborator

dyung commented Jan 23, 2024

@dtemirbulatov your change seems to cause some issues when trying to run the lit tests on some bots:

For example: https://lab.llvm.org/buildbot/#/builders/247/builds/13502

llvm-lit: /home/buildbot/buildbot-root/llvm-clang-x86_64-gcc-ubuntu/llvm-project/llvm/utils/lit/lit/TestingConfig.py:152: fatal: unable to parse config file '/home/buildbot/buildbot-root/llvm-clang-x86_64-gcc-ubuntu/build/runtimes/runtimes-bins/compiler-rt/test/lit.common.configured', traceback: Traceback (most recent call last):
  File "/home/buildbot/buildbot-root/llvm-clang-x86_64-gcc-ubuntu/llvm-project/llvm/utils/lit/lit/TestingConfig.py", line 140, in load_from_path
    exec(compile(data, path, "exec"), cfg_globals, None)
  File "/home/buildbot/buildbot-root/llvm-clang-x86_64-gcc-ubuntu/build/runtimes/runtimes-bins/compiler-rt/test/lit.common.configured", line 66, in <module>
    set_default("aarch64_sme", )
TypeError: set_default() missing 1 required positional argument: 'value'

Can you take a look?

@nico
Copy link
Contributor

nico commented Jan 23, 2024

Do you need to add a pythonize_bool() for COMPILER_RT_HAS_AARCH64_SME to compiler-rt/test/CMakeLists.txt?

Also, you do set_default("aarch64_sme", @COMPILER_RT_HAS_AARCH64_SME@) in lit.common.configured.in but config.aarch64_sme = "@COMPILER_RT_HAS_AARCH64_SME@" (as a string) in lit.common.unit.configured.in. Presumably the latter shouldn't have quotes either?

@dyung
Copy link
Collaborator

dyung commented Jan 23, 2024

@dtemirbulatov sorry, I reverted your change as it is preventing many bots from running tests.

@nico
Copy link
Contributor

nico commented Jan 23, 2024

When this relands, it'd be cool if you could put a change like 9a03d94 in the reland as well.

dtemirbulatov added a commit that referenced this pull request Jan 23, 2024
…77496)

Add naive implementation of memcpy, memset, memmove, memchr for SME
targets.
Co-authored-by: David Sherwood <david.sherwood@arm.com>
@SLTozer
Copy link
Contributor

SLTozer commented Jan 23, 2024

As mentioned above this seems to be breaking on the buildbots, e.g. https://lab.llvm.org/buildbot/#/builders/275/builds/3646

@dtemirbulatov
Copy link
Contributor Author

Commited solution for the last failure with 5f47687

@SLTozer
Copy link
Contributor

SLTozer commented Jan 23, 2024

The above commit hasn't fully fixed the issue - the error has now changed from the original:

llvm-lit: /home/buildbot/buildbot-root/debug-iterators/llvm-project/llvm/utils/lit/lit/TestingConfig.py:152: fatal: unable to parse config file '/home/buildbot/buildbot-root/debug-iterators/build/runtimes/runtimes-bins/compiler-rt/unittests/lit.common.unit.configured', traceback: Traceback (most recent call last):
  File "/home/buildbot/buildbot-root/debug-iterators/llvm-project/llvm/utils/lit/lit/TestingConfig.py", line 140, in load_from_path
    exec(compile(data, path, "exec"), cfg_globals, None)
  File "/home/buildbot/buildbot-root/debug-iterators/build/runtimes/runtimes-bins/compiler-rt/unittests/lit.common.unit.configured", line 23
    config.aarch64_sme =
                        ^
SyntaxError: invalid syntax

To the new error:

llvm-lit: /home/buildbot/buildbot-root/debug-iterators/llvm-project/llvm/utils/lit/lit/TestingConfig.py:152: fatal: unable to parse config file '/home/buildbot/buildbot-root/debug-iterators/build/runtimes/runtimes-bins/compiler-rt/test/lit.common.configured', traceback: Traceback (most recent call last):
  File "/home/buildbot/buildbot-root/debug-iterators/llvm-project/llvm/utils/lit/lit/TestingConfig.py", line 140, in load_from_path
    exec(compile(data, path, "exec"), cfg_globals, None)
  File "/home/buildbot/buildbot-root/debug-iterators/build/runtimes/runtimes-bins/compiler-rt/test/lit.common.configured", line 66, in <module>
    set_default("aarch64_sme", )
TypeError: set_default() missing 1 required positional argument: 'value'

@jmorse
Copy link
Member

jmorse commented Jan 23, 2024

NB, I've taken a shot at fixing this with 5176df5 , I think using pythonize_bool is adding "_PYBOOL" to the end of the variable name that gets defined, like the nearby config flags (COMPILER_RT_TEST_STANDALONE_BUILD_LIBS for example gets _PYBOOL on the end).

@jmorse
Copy link
Member

jmorse commented Jan 23, 2024

Looks like that cleared the bots -- @dtemirbulatov you should be able to make the same fix to the line you deleted in lit.common.unit.configured.in to restore coverage.

jmorse referenced this pull request Jan 23, 2024
This is a follow-up to 3112578 -- it looks like passing the added
cmake flag to pythonize_bool also appends "_PYBOOL" to the flag name, which
this lit config file is missing. This trips up builds such as:

  https://lab.llvm.org/buildbot/#/builders/275/builds/3661
  https://lab.llvm.org/buildbot/#/builders/184/builds/9811

Where COMPILER_RT_HAS_AARCH64_SME ends up expanding to nothing.
@echristo
Copy link
Contributor

Add naive implementation of memcpy, memset, memmove, memchr for SME targets. Co-authored-by: David Sherwood david.sherwood@arm.com

Do we have an idea of the expected performance differences here in the presence of SME?

@@ -0,0 +1,87 @@
#include <stdlib.h>
Copy link
Contributor

@peterwaller-arm peterwaller-arm Jan 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've proposed #79454 to switch this out for stddef.h which is provided by the compiler resource headers rather than the libc (since this is inside builtins).

@sdesmalen-arm
Copy link
Collaborator

Add naive implementation of memcpy, memset, memmove, memchr for SME targets.

Do we have an idea of the expected performance differences here in the presence of SME?

This PR adds the __arm_sc_ (prefixed) streaming-compatible variants to be compliant with the SME ACLE and are written with a focus on functional correctness as opposed to performance. They're more of a starting point right now, but we definitely want to optimise these routines going forward.

The reason to have streaming-compatible variants of these routines is because regular memcpy/memset/etc routines are not streaming-compatible. In practice this means they might be implemented using NEON instructions and so the compiler has to ensure that every call to it will result in a smstop sm -> bl memcpy -> smstart sm sequence when the caller is executed in Streaming-SVE mode. By adding the __arm_streaming_compatible attribute to these functions, the compiler will ensure the implementation generates no incompatible (e.g. NEON) instructions and also ensures that calls to functions with a streaming-compatible interface (e.g. __arm_sc_memcpy) don't result in toggling the streaming mode around the call.

@davemgreen
Copy link
Collaborator

I dont usually follow compiler-rt commits, so didnt see this. memcpy and friends can be used a lot - performance for them can be critical and a lot of time has been spent over the years trying to optimize them.

Is the idea that these are auto-vectorized? And so might use the SME engine for the copy? I'm not sure that is happening in the current version I have, and single byte memcpys sounds pretty slow. And that's before other considerations like alignment and systems wanting platform specific routines. I would add Wilco (I'm not sure what his username is), he has done a lot of work on these kinds of routines and it would be good to get his input.

@sdesmalen-arm
Copy link
Collaborator

The point of this implementation is that it doesn't use NEON instructions, as those are not valid in Streaming-SVE mode. We'll definitely want to optimise these routines by using (streaming-compatible) SVE instructions or by using the MOPS instructions, if available. This PR is merely a starting point to guarantee functional correctness.

It is also possible to use regular memcpy routines, but those will incur a smstop/smstart around the call-site when called from a streaming function.

@davemgreen
Copy link
Collaborator

I had forgotten that vectorization was disabled in streaming functions. That would make sense why we don't see SVE instructions so far. It sounds like there is plenty of work yet to go.

Hopefully there will be a lot of overlap with mops instructions, which would help simplify things a lot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet