Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MAINT: CPUs that support unaligned access. #18065

Merged
merged 4 commits into from
Jan 5, 2021
Merged

Conversation

Qiyu8
Copy link
Member

@Qiyu8 Qiyu8 commented Dec 24, 2020

Currently two aligned strategy is defined in the code base, NPY_CPU_HAVE_UNALIGNED_ACCESS points out X86 and AMD64 supports unaligned access, while NPY_STRONG_ALIGNMENT points out ARMV7 needs aligned access, This PR try to integrate two macros into one.

#define NPY_CPU_HAVE_UNALIGNED_ACCESS 0
#else
#define NPY_CPU_HAVE_UNALIGNED_ACCESS 1
Copy link
Member

@seberg seberg Dec 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Does this reject SPARC? I am certain sparc does not support unaligned access, I have no clue about the others.

EDIT: I also seem to remember that some old PowerPCs support unaligned access but it was very slow, so that it was not a good idea to use it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can see, CPUs that support unaligned access is a lot less. so I will use allow-list here, only X86/AMD64/ARMV8 support unaligned access.

Archtecture support of unaligned access side effects
X86 yes small performance loss
AMD64 yes small performance loss
Sparc no Exception or incorrect result
old PowerPC(s390) yes big performance loss
modern PowerPC no Exception or incorrect result
<=ARMV7 no Exception or incorrect result
ARMV8 yes small performance loss
MIPS no Exception or incorrect result
IA64 no warning
PA-RISC no Exception or incorrect result
ALPHA no Exception or incorrect result

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see it more as a compiler policy more than a CPU feature. IMHO NPY_STRONG_ALIGNMENT is more robust since the reason behind this #def is to
avoid bus errors. in fact, armv7 supports unaligned memory access limited to the natural alignment of data types which gives you the ability to work around the misaligned access but of course that will affect performance.

I am certain sparc does not support unaligned access,

sorry for misleading, we can just enable it for NPYV archs for now. somethings like that

#if NPY_CPU_X86 || NPY_CPU_AMD64 || defined(__aarch64__) || defined(__powerpc64__)
    #define NPY_STRONG_ALIGNMENT 0
#endif
#ifndef NPY_STRONG_ALIGNMENT
    #define NPY_STRONG_ALIGNMENT 1
#endif

EDIT: I also seem to remember that some old PowerPCs support unaligned access but it was very slow, so that it was not a good idea to use it?

If you mean 64-bit unaligned access, well yeah but that also the same issue in any CPUs with 32-bit mode. two-word load, and two-word store.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Qiyu8, didn't see your last comment, could you provide a reference for this table? and how compilers deal with it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This table is made by myself, all of the data is collected from stackoverflow/officer website, I found that the unaligned access information(especially the compiler strategy) is not straightforward , some of the data is deduced from Linux, which may be the most general implementation of unaligned access. I basically agree with your macro defined above. but I read from IBM that modern powerpc don't support unaligned access, should we remove __powerpc64__?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found that the unaligned access information(especially the compiler strategy) is not straightforward

because there's a difference between loading memory on bare-metal and under os(userspace). the kernel can handle CPU exception internally without raising errors and also some CPUs allowing control register for trap unaligned access but of course, that will affect performance.

I basically agree with your macro defined above. but I read from IBM that modern powerpc don't support unaligned access, should we remove powerpc64?

64-bit mode of Power supports 64-bit unaligned memory access so you should keep it, also as far as I know all PowerPC chips support 32-bit unaligned memory access and the compiler usually handle 64-bit via two-word routine to avoid the plenty of letting the kernel handle it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You lost me a bit, is the name nicer if it is always used with a !, or should it be STRONG_ALIGNMENT_REQUIRED?There is also this comment in the current code, which makes things even more confusing for me:

/*
 * x86 platform works with unaligned access but the compiler is allowed to
 * assume all data is aligned to its size by the C standard. This means it can
 * vectorize instructions peeling only by the size of the type, if the data is
 * not aligned to this size one ends up with data not correctly aligned for SSE
 * instructions (16 byte).
 * So this flag can only be enabled if autovectorization is disabled.
 */

but maybe that is part of the reason for the "STRONG_ALIGNMENT". Anyway, I am happy with any solution you can agree on, the only request I would have is adding a comment above the macro pointing out what the "strong alignment" means and if you can maybe move/duplicate part of the above comment also?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alignment is a complicated topic, I'll do my best to explain each situation clearly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You lost me a bit, is the name nicer if it is always used with a !, or should it be STRONG_ALIGNMENT_REQUIRED?

Depend on the context however ALIGNMENT_REQUIRED or STRONG_ALIGNMENT or STRONG_ALIGNMENT_REQUIRED gives the same meanining.

There is also this comment in the current code, which makes things even more confusing for me:

This is comment is no longer accurate, modern x86 chips don't produce performance regression for unaligned memory access and same goes for aarch64 and Power8/9, and modern compilers will autovectorize it anyway.
NPY_USE_UNALIGNED_ACCESS already disabled for both cases, we can later replace the current unrolled loops with universal intrinsics and leave this comment as-is for now.

Copy link
Member

@seberg seberg Jan 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seiko2plus can you explain the last comment again, I could not quite follow. The comment says that x86 supports unaligned access, but the C89 (what about C99) standard allows the compiler to assume aligned access (I am not sure it matters whether new compilers make use of that).
Or does "not strong alignment" mean that alignment is not required even when the compiler uses auto-vectorization?

The thing I would be scared about most is if the code would work fine with the default compiler flags, but potentially break if someone has the silly idea of compiling NumPy with -O3 to make it faster (or similar flags).

@Qiyu8 Qiyu8 changed the title MAINT:Only ARMV7 needs aligned access. MAINT: Only ARMV7 needs aligned access. Dec 25, 2020
@Qiyu8 Qiyu8 changed the title MAINT: Only ARMV7 needs aligned access. MAINT: CPUs that support unaligned access. Dec 25, 2020
@Qiyu8 Qiyu8 requested a review from seberg December 30, 2020 08:40

NOTE: in x86 platform this flag can only be enabled if autovectorization is disabled.
*/
#if defined(NPY_CPU_X86) || defined(NPY_CPU_AMD64) || defined(__aarch64__) || defined(__powerpc64__)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Qiyu8, Its should be a simple comment not a case study, and the current one not accurate enough. the comment can be simple as following:

// Except for the following architectures, memory access is limited to the natural
// alignment of data types otherwise it may lead to bus error or performance regression.

the new #def NPY_STRONG_ALIGNMENT_REQUIRED is too long, three words is enough NPY_ALIGNMENT_REQUIRED or NPY_STRONG_ALIGNMENT

Copy link
Member

@seberg seberg Jan 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I tried to shut down a bit over the break. I agree with Sayed, we don't need a list of how things can go wrong or how specific CPUs behave (it is interesting though!).

The more important thing is to write a short comment about where it is safe to do unaligned access (i.e. if you put a if (!NPY_STRONG_ALIGNMENT_REQUIRED) {code} what are you allowed to do in code? Are there remaining constraints about unaligned access?

EDIT: You could include the URL to this PR so that someone interested can find it!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will choose the name NPY_ALIGNMENT_REQUIRED, which represents only aligned memory is allowed to access.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. About that auto vectorization note in the comment? I do not have to be worried about it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NPY_USE_UNALIGNED_ACCESS already disabled no matter auto-vectorization is enabled or not, I will reconstruct lowlevel_strided_loops.c.src by using universal intrinsics so that NPY_USE_UNALIGNED_ACCESS can enabled when -O3 is specifiled, but that's not the point of this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize that it is disabled, my question was just to make sure that the comment doesn't somehow apply to the new code as well, since I am not quite sure how it would differ from the one in lowlevel_strided_loops.c.src.

Btw. don't let it stop you, but that file will need at least two times bigger maintenance (one already open, the other is fixing most function signatures). That should be pretty orthogonal to anything you do though, making sure casts are vectorized seems definitely worth it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, I am happy with this, feel free to merge. I just still don't quite understand the autovectorization comment unless it is specific to using NPY_GCC_UNROLL_LOOPS.

@mattip mattip merged commit da887a6 into numpy:master Jan 5, 2021
@mattip
Copy link
Member

mattip commented Jan 5, 2021

Thanks @Qiyu8

@seberg
Copy link
Member

seberg commented Jan 5, 2021

So the potential issue with the compiler assuming alignment only applies with the loop unrolling macro?

EDIT: Or is it that it is a performance issue with auto vectorization? Which probably does not matter elsewhere.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants