MAINT: CPUs that support unaligned access. #18065

Qiyu8 · 2020-12-24T02:12:38Z

Currently two aligned strategy is defined in the code base, NPY_CPU_HAVE_UNALIGNED_ACCESS points out X86 and AMD64 supports unaligned access, while NPY_STRONG_ALIGNMENT points out ARMV7 needs aligned access, This PR try to integrate two macros into one.

seberg · 2020-12-24T03:33:18Z

numpy/core/include/numpy/npy_cpu.h

 #define NPY_CPU_HAVE_UNALIGNED_ACCESS 0
+#else
+#define NPY_CPU_HAVE_UNALIGNED_ACCESS 1


Thanks. Does this reject SPARC? I am certain sparc does not support unaligned access, I have no clue about the others.

EDIT: I also seem to remember that some old PowerPCs support unaligned access but it was very slow, so that it was not a good idea to use it?

As far as I can see, CPUs that support unaligned access is a lot less. so I will use allow-list here, only X86/AMD64/ARMV8 support unaligned access.

Archtecture support of unaligned access side effects

X86 yes small performance loss

AMD64 yes small performance loss

Sparc no Exception or incorrect result

old PowerPC(s390) yes big performance loss

modern PowerPC no Exception or incorrect result

<=ARMV7 no Exception or incorrect result

ARMV8 yes small performance loss

MIPS no Exception or incorrect result

IA64 no warning

PA-RISC no Exception or incorrect result

ALPHA no Exception or incorrect result

I see it more as a compiler policy more than a CPU feature. IMHO NPY_STRONG_ALIGNMENT is more robust since the reason behind this #def is to
avoid bus errors. in fact, armv7 supports unaligned memory access limited to the natural alignment of data types which gives you the ability to work around the misaligned access but of course that will affect performance.

I am certain sparc does not support unaligned access,

sorry for misleading, we can just enable it for NPYV archs for now. somethings like that

#if NPY_CPU_X86 || NPY_CPU_AMD64 || defined(__aarch64__) || defined(__powerpc64__) #define NPY_STRONG_ALIGNMENT 0 #endif #ifndef NPY_STRONG_ALIGNMENT #define NPY_STRONG_ALIGNMENT 1 #endif

EDIT: I also seem to remember that some old PowerPCs support unaligned access but it was very slow, so that it was not a good idea to use it?

If you mean 64-bit unaligned access, well yeah but that also the same issue in any CPUs with 32-bit mode. two-word load, and two-word store.

@Qiyu8, didn't see your last comment, could you provide a reference for this table? and how compilers deal with it?

This table is made by myself, all of the data is collected from stackoverflow/officer website, I found that the unaligned access information(especially the compiler strategy) is not straightforward , some of the data is deduced from Linux, which may be the most general implementation of unaligned access. I basically agree with your macro defined above. but I read from IBM that modern powerpc don't support unaligned access, should we remove __powerpc64__?

I found that the unaligned access information(especially the compiler strategy) is not straightforward

because there's a difference between loading memory on bare-metal and under os(userspace). the kernel can handle CPU exception internally without raising errors and also some CPUs allowing control register for trap unaligned access but of course, that will affect performance.

I basically agree with your macro defined above. but I read from IBM that modern powerpc don't support unaligned access, should we remove powerpc64?

64-bit mode of Power supports 64-bit unaligned memory access so you should keep it, also as far as I know all PowerPC chips support 32-bit unaligned memory access and the compiler usually handle 64-bit via two-word routine to avoid the plenty of letting the kernel handle it.

You lost me a bit, is the name nicer if it is always used with a !, or should it be STRONG_ALIGNMENT_REQUIRED?There is also this comment in the current code, which makes things even more confusing for me:

/* * x86 platform works with unaligned access but the compiler is allowed to * assume all data is aligned to its size by the C standard. This means it can * vectorize instructions peeling only by the size of the type, if the data is * not aligned to this size one ends up with data not correctly aligned for SSE * instructions (16 byte). * So this flag can only be enabled if autovectorization is disabled. */

but maybe that is part of the reason for the "STRONG_ALIGNMENT". Anyway, I am happy with any solution you can agree on, the only request I would have is adding a comment above the macro pointing out what the "strong alignment" means and if you can maybe move/duplicate part of the above comment also?

The alignment is a complicated topic, I'll do my best to explain each situation clearly.

You lost me a bit, is the name nicer if it is always used with a !, or should it be STRONG_ALIGNMENT_REQUIRED?

Depend on the context however ALIGNMENT_REQUIRED or STRONG_ALIGNMENT or STRONG_ALIGNMENT_REQUIRED gives the same meanining.

There is also this comment in the current code, which makes things even more confusing for me:

This is comment is no longer accurate, modern x86 chips don't produce performance regression for unaligned memory access and same goes for aarch64 and Power8/9, and modern compilers will autovectorize it anyway.
NPY_USE_UNALIGNED_ACCESS already disabled for both cases, we can later replace the current unrolled loops with universal intrinsics and leave this comment as-is for now.

@seiko2plus can you explain the last comment again, I could not quite follow. The comment says that x86 supports unaligned access, but the C89 (what about C99) standard allows the compiler to assume aligned access (I am not sure it matters whether new compilers make use of that).
Or does "not strong alignment" mean that alignment is not required even when the compiler uses auto-vectorization?

The thing I would be scared about most is if the code would work fine with the default compiler flags, but potentially break if someone has the silly idea of compiling NumPy with -O3 to make it faster (or similar flags).

seiko2plus · 2021-01-02T07:35:39Z

numpy/core/include/numpy/npy_cpu.h

+
+NOTE: in x86 platform this flag can only be enabled if autovectorization is disabled.
+*/
+#if defined(NPY_CPU_X86) || defined(NPY_CPU_AMD64) || defined(__aarch64__) || defined(__powerpc64__)


@Qiyu8, Its should be a simple comment not a case study, and the current one not accurate enough. the comment can be simple as following:

// Except for the following architectures, memory access is limited to the natural // alignment of data types otherwise it may lead to bus error or performance regression.

the new #def NPY_STRONG_ALIGNMENT_REQUIRED is too long, three words is enough NPY_ALIGNMENT_REQUIRED or NPY_STRONG_ALIGNMENT

Sorry, I tried to shut down a bit over the break. I agree with Sayed, we don't need a list of how things can go wrong or how specific CPUs behave (it is interesting though!).

The more important thing is to write a short comment about where it is safe to do unaligned access (i.e. if you put a if (!NPY_STRONG_ALIGNMENT_REQUIRED) {code} what are you allowed to do in code? Are there remaining constraints about unaligned access?

EDIT: You could include the URL to this PR so that someone interested can find it!

I will choose the name NPY_ALIGNMENT_REQUIRED, which represents only aligned memory is allowed to access.

Thanks. About that auto vectorization note in the comment? I do not have to be worried about it?

NPY_USE_UNALIGNED_ACCESS already disabled no matter auto-vectorization is enabled or not, I will reconstruct lowlevel_strided_loops.c.src by using universal intrinsics so that NPY_USE_UNALIGNED_ACCESS can enabled when -O3 is specifiled, but that's not the point of this PR.

I realize that it is disabled, my question was just to make sure that the comment doesn't somehow apply to the new code as well, since I am not quite sure how it would differ from the one in lowlevel_strided_loops.c.src.

Btw. don't let it stop you, but that file will need at least two times bigger maintenance (one already open, the other is fixing most function signatures). That should be pretty orthogonal to anything you do though, making sure casts are vectorized seems definitely worth it.

Anyway, I am happy with this, feel free to merge. I just still don't quite understand the autovectorization comment unless it is specific to using NPY_GCC_UNROLL_LOOPS.

mattip · 2021-01-05T19:20:14Z

Thanks @Qiyu8

seberg · 2021-01-05T19:39:33Z

So the potential issue with the compiler assuming alignment only applies with the loop unrolling macro?

EDIT: Or is it that it is a performance issue with auto vectorization? Which probably does not matter elsewhere.

Only ARMV7 needs aligned access.

81a276c

github-actions bot added the 03 - Maintenance label Dec 24, 2020

seberg reviewed Dec 24, 2020

View reviewed changes

Qiyu8 changed the title ~~MAINT:Only ARMV7 needs aligned access.~~ MAINT: Only ARMV7 needs aligned access. Dec 25, 2020

Qiyu8 changed the title ~~MAINT: Only ARMV7 needs aligned access.~~ MAINT: CPUs that support unaligned access. Dec 25, 2020

Qiyu8 added 2 commits December 25, 2020 15:57

add CPUs that support unaligned access.

6ae52be

add comments demonstrate the common scenoirs of unaligned access.

1646c43

Qiyu8 requested a review from seberg December 30, 2020 08:40

seiko2plus reviewed Jan 2, 2021

View reviewed changes

seiko2plus mentioned this pull request Jan 2, 2021

BUG: Fixes for building on Cygwin. #16246

Closed

simplify the comment about aligned access.

f60217f

mattip merged commit da887a6 into numpy:master Jan 5, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MAINT: CPUs that support unaligned access. #18065

MAINT: CPUs that support unaligned access. #18065

Qiyu8 commented Dec 24, 2020

seberg Dec 24, 2020 •

edited

Qiyu8 Dec 25, 2020

seiko2plus Dec 25, 2020 •

edited

seiko2plus Dec 25, 2020

Qiyu8 Dec 25, 2020

seiko2plus Dec 25, 2020

seberg Dec 25, 2020

Qiyu8 Dec 28, 2020

seiko2plus Jan 2, 2021

seberg Jan 3, 2021 •

edited

seiko2plus Jan 2, 2021

seberg Jan 3, 2021 •

edited

Qiyu8 Jan 5, 2021

seberg Jan 5, 2021

Qiyu8 Jan 5, 2021

seberg Jan 5, 2021

seberg Jan 5, 2021

mattip commented Jan 5, 2021

seberg commented Jan 5, 2021 •

edited

Archtecture	support of unaligned access	side effects
X86	yes	small performance loss
AMD64	yes	small performance loss
Sparc	no	Exception or incorrect result
old PowerPC(s390)	yes	big performance loss
modern PowerPC	no	Exception or incorrect result
<=ARMV7	no	Exception or incorrect result
ARMV8	yes	small performance loss
MIPS	no	Exception or incorrect result
IA64	no	warning
PA-RISC	no	Exception or incorrect result
ALPHA	no	Exception or incorrect result

MAINT: CPUs that support unaligned access. #18065

MAINT: CPUs that support unaligned access. #18065

Conversation

Qiyu8 commented Dec 24, 2020

seberg Dec 24, 2020 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

seiko2plus Dec 25, 2020 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

seberg Jan 3, 2021 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

seberg Jan 3, 2021 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mattip commented Jan 5, 2021

seberg commented Jan 5, 2021 • edited

seberg Dec 24, 2020 •

edited

seiko2plus Dec 25, 2020 •

edited

seberg Jan 3, 2021 •

edited

seberg Jan 3, 2021 •

edited

seberg commented Jan 5, 2021 •

edited