Android and ARM regressions in master 1.2 #2024
Thanks for the nice overview. I suspected the std.json |
I looked into the murmurhash segfault; it looks like the issue we had before with the unaligned exception table on ARM, ldc-developers/druntime#51. The last test checks the digest when run on unaligned data and segfaults when it reads the first unaligned 128-bit chunk:

ldr r2, [r0]
ldr r3, [r0, #4]
ldr r1, [r0, #8]
ldr r0, [r0, #12]

That works fine because

ldm r7, {r2, r3}
ldr r0, [r7, #8]
ldr r1, [r7, #12]

I'm guessing this has to do with the

I'm unsure what to do about this; the whole point of the test seems to be to check that the digest works with unaligned data.

Update: I miscounted before and implied that the 128-bit |
I'd say it's invalid code. D pointers/arrays are guaranteed to be aligned. |
@klickverbot, just so we're clear, the API of

One interesting thing I just noticed is that if I extract the unaligned test for 32-bit optimized Murmurhash with 128-bit

Are you saying that users of Murmurhash should know never to pass unaligned |
ubyte arrays should never be unaligned as the ubyte alignment is 1. |
On ARM, it seems to depend on the optimization level, ie at
I don't know what's going on on |
I just skimmed through the code, no test performed. The way the test is set up (separately hashing 2 1KB blocks),

So I guess the code should be updated for non-misalignment-friendly architectures by checking if

Edit: I don't know if an additional LDC intrinsic for unaligned loads (emitting a |
What @kinke and I were getting at is that I presume you are using "aligned" to specifically mean word-size aligned. This is in general not enough, as some instructions – e.g. aligned SIMD loads even on x86 – might require larger alignments. In this specific case, word-size alignment might of course be enough to avoid the crash. The upshot is that the implementation that just casts |
Ah, I thought you were saying it was the reason for this failure, got it now.
Yes, this is why I clarified earlier that "I don't think it's unreasonable for the ubyte slice passed in to be sometimes unaligned to word boundaries like this," but maybe I should have emphasized that more.
I mentioned that cast as the likely issue earlier, but what is the kosher D way to write such code and deal with alignment issues? I don't really deal with these casting and alignment issues much, so any pointer to other code that does this right would be helpful. |
On ARM misalignment in murmurhash, I recalled it was discussed in the original PR, and @jpf91 made suggestions that perhaps weren't implemented correctly. See dlang/phobos#3916. The discussion on ARM alignment there was good.

On std.random, it seems to be a stress case for ARM codegen by LLVM, maybe because individual functions' stack usage is huge after inlining. When I played with watchOS (somewhere in the attic now), there were extra pushes and pops around calls when a function's stack usage was huge. The pops were not handled in the exception landing pad, and the eventual function return went to the wrong address. Maybe this is happening?

Maybe I'll have some time to catch up on LDC this weekend. |
I don't know of any better way than doing it manually, i.e. reading off |
The kosher way is to always keep alignment in mind when casting pointers (or accessing fields in packed structs etc.), but hardly anyone does as long as no errors pop up.

A quickfix would be guarding the Element-chunk loop via

And yet another variant I already suggested would be introducing a new LDC intrinsic for arbitrary unaligned loads (afaik, we only have such a thing for unaligned SIMD vector loads - edit: here) and using that in the unaligned case. I guess that takes care of the shifting and or'ing for architectures not supporting unaligned loads directly.

Edit: Looking at the code again, |
Here's a highly experimental hack. No idea about the performance impact and whether it actually works on ARM etc.; I just made sure the tests still work when using the unaligned code path with x86. Amusing side note: the code in question containing the cast from ubyte[] to Element[] is actually @safe and doesn't even produce a warning. ;) |
I haven't looked at your changes in any detail yet, but you will definitely want to keep the upstream people in the loop to make things easier. I also imagine there might be other (C) MurmurHash3 implementations for ARM dealing with the problem already. In particular, it might make sense to keep the current code for archs where unaligned loads are reasonably fast (x86). |
Sorry for the late reply and thx for putting me in the loop. The

One thing I don't like about D's cast is that it sometimes behaves like reinterpret_cast and sometimes like static_cast, but you don't really know whether the operation is cheap or not. I was under the impression that the slice cast would handle the misalignment depending on the platform, but it seems it doesn't. |
Yes, it certainly would.
This can't really work for rather fundamental reasons – after casting the slice, it is as good an |
AFAIR it already happens when data size mismatches. e.g. https://dpaste.dzfl.pl/4514bcc4eea1 |
In essence, the 'problem' we are faced with is how to iteratively copy over chunks of 16 bytes in an efficient way. So any ARM memcpy() implementation could serve as a base. We can't use it directly, as we don't want a call and choose-best-method logic for each chunk.
My hack does. I basically implemented the first more elaborate variant I suggested further up. It would still be interesting to see if, for instance, copying over 4 aligned ints is faster than copying 2 unaligned longs on X86 in case the start address is a multiple of 4 (but not 8). |
I'm pretty sure that streaming unaligned reads on modern x86 is free. For the first read the whole cache line is read, for the subsequent reads the prefetcher will load the cache lines in advance and hide the latency. This needs to be tested obviously. |
Thanks for all the info, @smolt and everyone else. I will read up more and get back. I just tried cross-compiling the stdlib tests using ldc 1.2 with llvm 4.0, getting a bunch of modules segfaulting in the tests, have to look into that too. |
I spent some time tracking down the segfaults when using llvm 4.0: there may be an llvm regression in using ARM SIMD, ie NEON, instructions to speed up loads (yes, yet another alignment issue). For many functions that have array arguments, the ARM codegen uses the

There appears to have been a recent optimization in llvm 4.0, which is only applied at

000106f0 <_D3std9algorithm8mutation16__unittestL191_2FNfZv>:
106f0: e92d4070 push {r4, r5, r6, lr}
106f4: e24dd020 sub sp, sp, #32
106f8: e59f01e0 ldr r0, [pc, #480] ; 108e0 <_D3std9algorithm8mutation16__unittestL191_2FNfZv+0x1f0>
--- snip irrelevant instructions ---
10708: e08f0000 add r0, pc, r0
1070c: e58d401c str r4, [sp, #28]
10710: f4600aef vld1.64 {d16-d17}, [r0 :128]
--- skip to end of this test function ---
108e0: 00035a30 andeq r5, r3, r0, lsr sl
--- skip almost near end of executable ---
Disassembly of section .rodata:
00046140 <.arrayliteral>:
46140: 00000004 andeq r0, r0, r4
46144: 00000005 andeq r0, r0, r5
46148: 00000006 andeq r0, r0, r6
4614c: 00000007 andeq r0, r0, r7
46150: 00000001 andeq r0, r0, r1
46154: 00000002 andeq r0, r0, r2
46158: 00000003 andeq r0, r0, r3

With llvm 4.0 at

000120d8 <_D3std9algorithm8mutation16__unittestL191_2FNfZv>:
120d8: e92d4bf0 push {r4, r5, r6, r7, r8, r9, fp, lr}
120dc: e24dd028 sub sp, sp, #40 ; 0x28
--- snip irrelevant instructions ---
120ec: e28f0f41 add r0, pc, #260 ; 0x104
120f0: f4600aef vld1.64 {d16-d17}, [r0 :128]
--- skip to right after this test function ---
000121f8 <.arrayliteral>:
121f8: 00000004 andeq r0, r0, r4
121fc: 00000005 andeq r0, r0, r5
12200: 00000006 andeq r0, r0, r6
12204: 00000007 andeq r0, r0, r7
12208: 00000001 andeq r0, r0, r1
1220c: 00000002 andeq r0, r0, r2
12210: 00000003 andeq r0, r0, r3

This always segfaults at the

I can see how someone missed this if they only tested with functions that happened to end on word boundaries. Out of all the tests in the stdlib, only 14 modules have some tests that fail because of this.

What do you guys think, llvm bug or is the alignment supposed to be set elsewhere? |
Check the IR for the function/global in question. If there is no mismatch in alignment, I'd say it's an LLVM bug. |
Here are the relevant definitions from the IR at
There is no alignment specified for the function or constant in the IR at any optimization level, which seems to imply llvm should be setting the right alignment. I notice some bitcasts to

@jmolloy, if I use the llvm flag to turn the new constant pools optimization off, |
Yep looks like an LLVM bug. The constant has no explicit alignment (and should thus default to at least 4 due to its type - quoting LLVM docs: |
I just tried @kinke's patch for murmurhash3, it fixes the problem on ARM. @jpf91 suggested copying each chunk into a union (his benchmarks showed no slowdown on x64) and this patch does something similar, but @gchatelet didn't bother with that originally, wonder why? @kinke, I think @klickverbot was suggesting not copying on x64, as you're currently doing for all arches. Either way, do you want to submit that as a PR for phobos, so we get this in upstream? You may want to modify the unaligned test to loop through several different misalignments, so you hit all your code paths. |
I did bother :-) When I tested this last year, performance was ok with LDC but quite catastrophic with DMD (I don't remember about GDC, but it should be tested as well). Performance should remain as good as possible for all compilers, so please check before submitting. |
Cleaning up this hack and going through the probably tedious process of getting it merged upstream is very low on my priorities list, so I'd very much favor someone else taking care of this. |
@gchatelet, mind doing it? You know the code the best. |
@joakim-noah: Is there a bugs.llvm.org entry for the constant alignment issue already? |
No, want to submit it? I haven't accessed my account there in awhile. |
With #2148, there are indeed a lot of new, fine-grained command line options ( |
I just tried commenting out |
Comparing the IR generated for this extracted function, it's almost identical:
Here are the two IR dumps taken from the last pass that successfully dumps IR when running |
After a quick glance, the IR seems identical, the only difference being the stripped names for master for some reason. So I'm still thinking it has to do with a wrong CPU (generic). When running with |
Yep, you got it. ltsmaster shows |
Alright, based on https://github.com/llvm-mirror/llvm/blob/master/include/llvm/Support/ARMTargetParser.def, there's no default CPU for architecture |
So you think just specifying the target triple for ldc isn't enough, we have to do something like the NDK clang does, ie |
I don't think that triple does anything different, and my reading of the ARM Target Parser is that there's no default CPU for armv7/armv7-a. Update: I just tried compiling the init function extracted above with every cpu listed under

I've been trying to avoid the complexity of the variety of ARM configurations so far, and ldc setting cortex-a8 by default has allowed me to do that, but I guess we need to address that now. I'll start using |
There's definitely a default CPU for |
Not anymore, I was looking at the master version you linked earlier and I see now that it's not the default in 5.0 rc2 either. Given the variety of ARM CPUs, I'm fine with either way of specifying the features to target. |
Ah thanks, what an unholy mess. ;) |
While doing that, please also include |
I actually like it, it's a sign of the much greater competition with ARM chips, though it comes with some effort on the developer end too. As for setting the |
And the end user. I recall some superfluous troubles wrt. finding the right prebuilt video codec for MX Player for my phone a while ago... |
I don't see that last |
Oh nice, let's just hope it's not just 'fixed' by accident, popping up in other scenarios again. It may make sense to default to |
Does it fix the issue on armhf too? I'm not sure it makes sense to enforce |
That's ARMv6, Cortex-A8 is ARMv7. I haven't run the tests in the ARM emulator in ages, and may not run the emulator ever again, given the few dozen ARM downloads. |
Yeah, given that AArch64 is taking over and doesn't have this problem, don't think we should worry about this any more. |
I recently tried building master both natively and as a linux/x64 cross-compiler and ran into these issues with the tests. I will look into these more and submit patches for the phobos bugs upstream, would like some help with the LDC ones.
LDC issues
std.digest.murmurhash segfaults with a Bus error on its last test, when it tries to compute the unalignedHash with any llvm optimizations. This is reproducible going back to ltsmaster: all of 0.17.3, 1.1.1, and 1.2 (each built against llvm 3.9.1) pass this test without optimizations but segfault with -O1 or higher optimizations. This module was added with 1.2, so obviously it wasn't tested with those older D versions before, but it seems to have uncovered some ARM codegen issue.

std.random segfaults with this error when compiled with -O2 or higher:

Fatal error in EH code: _Unwind_RaiseException failed with reason code: 9
Aborted

A gdb backtrace shows that it fails in the massive last test block. ldc 0.17.3 can't build the 1.2 version of this module's tests because of a static assert, while 1.1.1 has no problem with it, even with all optimizations. This looks like the ARM codegen errors we were getting before, @smolt?
Phobos issues
std.file

on Android, it passes if I change the last digit from a 7 to 0.

std.json

that needed to be relaxed for platforms with 64-bit reals. Adding Android to those two version blocks gets the tests to pass again.

File.reopen

to std.stdio, which can't change modes on Android:

std/stdio.d(658): Cannot reopen file in mode 'w' (Bad address)
I used to build for Android/x86 from upstream and catch these stdlib issues early, but haven't done that in awhile, especially since my last x86 device died more than a year ago. I may look into installing Android/x86 in a VPS and run the upstream tests there.
The good news is that @kinke's 1.2 changes to somewhat allow cross-compiling reals work fine, ie all the same druntime/phobos tests pass when cross-compiled from linux/x64, so applying #1317 isn't needed anymore. Also, the full dmd-testsuite still passes natively on Android/ARM. 💃