
Conversation

@saran-t commented Feb 24, 2022

Fixes #21116

@seberg (Member) commented Feb 24, 2022

Thanks! We could remove the alignment-required flag if we make this change, I think? I have to admit I am a bit curious whether this is worthwhile and whether it might slow things down a bit. Or is this actually incorrect, and not just the sanitizer being very strict?

If you are up for it, I would be interested in whether the bench_io benchmarks pick this up: python3 runtests.py --bench-compare main --bench bench_io should work. (I think they use this code for some of the copyto benchmarks.)

@saran-t (Author) commented Feb 24, 2022

I'm pretty busy right now but can probably pick this up sometime in the next couple of weeks. I wouldn't be surprised if this actually makes things a bit faster on largeish arrays. As things stand, if the array isn't uint-aligned then you would be repeatedly performing unaligned reads in the coarse-grained loop, and these may or may not come with a performance penalty (see e.g. https://stackoverflow.com/questions/70964604).

We can also replace the modulo calculation with a switch on sizeof(unsigned int) and bit shifts, assuming that sizeof(unsigned int) is a power of two (which is a safe assumption on any platform). Since the switch condition is a compile-time constant, there shouldn't be any runtime branching. If we do that then I would not expect the change to have any noticeable negative performance impact (famous last words 🙂 ).

Edit: another thing to think about.

  • If the array size is < sizeof(unsigned int) then this change doesn't actually increase the number of comparisons at all.
  • If the array is already aligned then this change is a no-op, other than the modulo calculation upfront (which as mentioned, can be further optimised).
  • If the array is unaligned but its size isn't a multiple of sizeof(unsigned int), then half the time we also don't increase the number of comparisons.
  • If the array is large enough then the cost of the initial loop will be comparatively insubstantial.

@seberg (Member) commented Feb 25, 2022

> repeatedly performing unaligned reads

Well, we are also interested in random arrays (i.e. we should aim to do well for short loops) – OTOH short loops have a lot of other overheads, so it may not matter there. The % will be optimized away, no problem.
I ran a quick test locally, and annoyingly it is the case that should not notice this change at all (i.e. where this loop just keeps going) that slows down, while the case that should bail out right away and mostly use the other branch is much faster...

Possibly just random noise, or the additional code makes gcc make different optimization choices. The more "mixed" cases seem to barely notice, though.
I may need to time it more manually if I want results that can be trusted...

@saran-t (Author) commented Feb 25, 2022

Thanks! In the meantime, I found a reference for a real situation where this UB results in an actual crash: https://blog.quarkslab.com/unaligned-accesses-in-cc-what-why-and-solutions-to-do-it-properly.html

@seberg (Member) commented Feb 25, 2022

@saran-t we test on platforms that do not support unaligned access. This code path is only taken when that define evaluates to 0, which is only the case on selected CPUs/platforms, so we know that this is OK. Unless the compiler goes one step further and somehow optimizes this for SIMD instructions that have higher alignment requirements.

@saran-t (Author) commented Feb 25, 2022

Ah, fair enough. I thought NPY_ALIGNMENT_REQUIRED was a property of the array but on second reading that's clearly a global macro.

@charris charris added the 09 - Backport-Candidate PRs tagged should be backported label Mar 3, 2022
@seberg (Member) commented Mar 7, 2022

Considering that my timings were a bit mixed, I am somewhat tempted to close this. At least unless we see it as a speed enhancement on CPUs that do not support unaligned access.

Or is that sanitizer warning problematic? (since I don't think the sanitizer is right to worry here.)

@saran-t (Author) commented Mar 7, 2022

The issue with the sanitizer firing is that it means we cannot use UBSan to detect bugs in our own code that depends on NumPy. An alternative "fix" would be to add __attribute__((no_sanitize("alignment"))) to the function, guarded behind #ifdef __clang__.

@charris charris closed this Apr 6, 2022
@charris charris reopened this Apr 6, 2022
@charris (Member) commented May 5, 2022

What is the status of this?

@seberg (Member) commented May 6, 2022

Let me make a PR adding the attribute on clang for now, so we can backport that.

I do think there may be value in this (or similar changes) in terms of optimizing this code. But right now, as far as I understand, the code is correct; and clang is only half correct (in the sense that what it flags affects some platforms, but is OK on the one where the code actually runs).

And since I am not sure this won't slow down certain common code paths, I think any optimization here should come with benchmarks and a bit more care.

@seberg seberg closed this May 6, 2022
seberg added a commit to seberg/numpy that referenced this pull request May 6, 2022
Clang's sanitizer reports unaligned access here; the report is correct,
but the access is intentional.  It may well be that the code would be
better off avoiding this unaligned access (and rather vectorizing harder).

But, this is a bit of a tricky choice, since we have to optimize for
different use-cases (in particular very short scans may be interesting).

So changing this would best be done together with some more careful
benchmarks.

See also numpygh-21117, which introduced manual loop unrolling to avoid the
unaligned access.

Closes numpygh-21116
seberg added a commit to seberg/numpy that referenced this pull request May 6, 2022
@charris charris removed the 09 - Backport-Candidate PRs tagged should be backported label May 6, 2022
charris pushed a commit to charris/numpy that referenced this pull request May 9, 2022
Linked issue: BUG: npy_memchr has misaligned memory access