-
-
Notifications
You must be signed in to change notification settings - Fork 9.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MAINT: CPUs that support unaligned access. #18065
Conversation
numpy/core/include/numpy/npy_cpu.h
Outdated
#define NPY_CPU_HAVE_UNALIGNED_ACCESS 0 | ||
#else | ||
#define NPY_CPU_HAVE_UNALIGNED_ACCESS 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Does this reject SPARC? I am certain sparc does not support unaligned access, I have no clue about the others.
EDIT: I also seem to remember that some old PowerPCs support unaligned access but it was very slow, so that it was not a good idea to use it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can see, CPUs that support unaligned access is a lot less. so I will use allow-list here, only X86/AMD64/ARMV8 support unaligned access.
Archtecture | support of unaligned access | side effects |
---|---|---|
X86 | yes | small performance loss |
AMD64 | yes | small performance loss |
Sparc | no | Exception or incorrect result |
old PowerPC(s390) | yes | big performance loss |
modern PowerPC | no | Exception or incorrect result |
<=ARMV7 | no | Exception or incorrect result |
ARMV8 | yes | small performance loss |
MIPS | no | Exception or incorrect result |
IA64 | no | warning |
PA-RISC | no | Exception or incorrect result |
ALPHA | no | Exception or incorrect result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see it more as a compiler policy more than a CPU feature. IMHO NPY_STRONG_ALIGNMENT
is more robust since the reason behind this #def is to
avoid bus errors. in fact, armv7 supports unaligned memory access limited to the natural alignment of data types which gives you the ability to work around the misaligned access but of course that will affect performance.
I am certain sparc does not support unaligned access,
sorry for misleading, we can just enable it for NPYV archs for now. somethings like that
#if NPY_CPU_X86 || NPY_CPU_AMD64 || defined(__aarch64__) || defined(__powerpc64__)
#define NPY_STRONG_ALIGNMENT 0
#endif
#ifndef NPY_STRONG_ALIGNMENT
#define NPY_STRONG_ALIGNMENT 1
#endif
EDIT: I also seem to remember that some old PowerPCs support unaligned access but it was very slow, so that it was not a good idea to use it?
If you mean 64-bit unaligned access, well yeah but that also the same issue in any CPUs with 32-bit mode. two-word load, and two-word store.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Qiyu8, didn't see your last comment, could you provide a reference for this table? and how compilers deal with it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This table is made by myself, all of the data is collected from stackoverflow/officer website, I found that the unaligned access information(especially the compiler strategy) is not straightforward , some of the data is deduced from Linux, which may be the most general implementation of unaligned access. I basically agree with your macro defined above. but I read from IBM that modern powerpc don't support unaligned access, should we remove __powerpc64__
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found that the unaligned access information(especially the compiler strategy) is not straightforward
because there's a difference between loading memory on bare-metal and under os(userspace). the kernel can handle CPU exception internally without raising errors and also some CPUs allowing control register for trap unaligned access but of course, that will affect performance.
I basically agree with your macro defined above. but I read from IBM that modern powerpc don't support unaligned access, should we remove powerpc64?
64-bit mode of Power supports 64-bit unaligned memory access so you should keep it, also as far as I know all PowerPC chips support 32-bit unaligned memory access and the compiler usually handle 64-bit via two-word routine to avoid the plenty of letting the kernel handle it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You lost me a bit, is the name nicer if it is always used with a !
, or should it be STRONG_ALIGNMENT_REQUIRED
?There is also this comment in the current code, which makes things even more confusing for me:
/*
* x86 platform works with unaligned access but the compiler is allowed to
* assume all data is aligned to its size by the C standard. This means it can
* vectorize instructions peeling only by the size of the type, if the data is
* not aligned to this size one ends up with data not correctly aligned for SSE
* instructions (16 byte).
* So this flag can only be enabled if autovectorization is disabled.
*/
but maybe that is part of the reason for the "STRONG_ALIGNMENT". Anyway, I am happy with any solution you can agree on, the only request I would have is adding a comment above the macro pointing out what the "strong alignment" means and if you can maybe move/duplicate part of the above comment also?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alignment is a complicated topic, I'll do my best to explain each situation clearly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You lost me a bit, is the name nicer if it is always used with a !, or should it be STRONG_ALIGNMENT_REQUIRED?
Depend on the context however ALIGNMENT_REQUIRED
or STRONG_ALIGNMENT
or STRONG_ALIGNMENT_REQUIRED
gives the same meanining.
There is also this comment in the current code, which makes things even more confusing for me:
This is comment is no longer accurate, modern x86 chips don't produce performance regression for unaligned memory access and same goes for aarch64 and Power8/9, and modern compilers will autovectorize it anyway.
NPY_USE_UNALIGNED_ACCESS
already disabled for both cases, we can later replace the current unrolled loops with universal intrinsics and leave this comment as-is for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@seiko2plus can you explain the last comment again, I could not quite follow. The comment says that x86
supports unaligned access, but the C89 (what about C99) standard allows the compiler to assume aligned access (I am not sure it matters whether new compilers make use of that).
Or does "not strong alignment" mean that alignment is not required even when the compiler uses auto-vectorization?
The thing I would be scared about most is if the code would work fine with the default compiler flags, but potentially break if someone has the silly idea of compiling NumPy with -O3
to make it faster (or similar flags).
|
||
NOTE: in x86 platform this flag can only be enabled if autovectorization is disabled. | ||
*/ | ||
#if defined(NPY_CPU_X86) || defined(NPY_CPU_AMD64) || defined(__aarch64__) || defined(__powerpc64__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Qiyu8, Its should be a simple comment not a case study, and the current one not accurate enough. the comment can be simple as following:
// Except for the following architectures, memory access is limited to the natural
// alignment of data types otherwise it may lead to bus error or performance regression.
the new #def NPY_STRONG_ALIGNMENT_REQUIRED
is too long, three words is enough NPY_ALIGNMENT_REQUIRED
or NPY_STRONG_ALIGNMENT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I tried to shut down a bit over the break. I agree with Sayed, we don't need a list of how things can go wrong or how specific CPUs behave (it is interesting though!).
The more important thing is to write a short comment about where it is safe to do unaligned access (i.e. if you put a if (!NPY_STRONG_ALIGNMENT_REQUIRED) {code}
what are you allowed to do in code
? Are there remaining constraints about unaligned access?
EDIT: You could include the URL to this PR so that someone interested can find it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will choose the name NPY_ALIGNMENT_REQUIRED
, which represents only aligned memory is allowed to access.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. About that auto vectorization note in the comment? I do not have to be worried about it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NPY_USE_UNALIGNED_ACCESS
already disabled no matter auto-vectorization is enabled or not, I will reconstruct lowlevel_strided_loops.c.src by using universal intrinsics so that NPY_USE_UNALIGNED_ACCESS
can enabled when -O3 is specifiled, but that's not the point of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize that it is disabled, my question was just to make sure that the comment doesn't somehow apply to the new code as well, since I am not quite sure how it would differ from the one in lowlevel_strided_loops.c.src
.
Btw. don't let it stop you, but that file will need at least two times bigger maintenance (one already open, the other is fixing most function signatures). That should be pretty orthogonal to anything you do though, making sure casts are vectorized seems definitely worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyway, I am happy with this, feel free to merge. I just still don't quite understand the autovectorization comment unless it is specific to using NPY_GCC_UNROLL_LOOPS
.
Thanks @Qiyu8 |
So the potential issue with the compiler assuming alignment only applies with the loop unrolling macro? EDIT: Or is it that it is a performance issue with auto vectorization? Which probably does not matter elsewhere. |
Currently two aligned strategy is defined in the code base,
NPY_CPU_HAVE_UNALIGNED_ACCESS
points outX86
andAMD64
supports unaligned access, whileNPY_STRONG_ALIGNMENT
points out ARMV7 needs aligned access, This PR try to integrate two macros into one.