Android and ARM regressions in master 1.2 #2024

Closed
joakim-noah opened this issue Mar 6, 2017 · 90 comments

joakim-noah commented Mar 6, 2017

I recently tried building master both natively and as a linux/x64 cross-compiler and ran into the following test failures. I will look into them further and submit patches for the Phobos bugs upstream; I'd like some help with the LDC ones.

LDC issues

  • std.digest.murmurhash segfaults with a Bus error on its last test, when it tries to compute the unalignedHash with any llvm optimizations. This is reproducible going back to ltsmaster: all of 0.17.3, 1.1.1, and 1.2 (each built against llvm 3.9.1) pass this test without optimizations but segfault with -O1 or higher. This module was added with 1.2, so obviously it wasn't tested with those older D versions before, but it seems to have uncovered some ARM codegen issue.
  • std.random segfaults with this error when compiled with -O2 or higher:
    Fatal error in EH code: _Unwind_RaiseException failed with reason code: 9 Aborted
    A gdb backtrace shows that it fails in the massive last test block. ldc 0.17.3 can't build the 1.2 version of this module's tests because of a static assert, while 1.1.1 has no problem with it, even with all optimizations. This looks like the ARM codegen errors we were getting before, @smolt?

Phobos issues

I used to build for Android/x86 from upstream and catch these stdlib issues early, but haven't done that in a while, especially since my last x86 device died more than a year ago. I may look into installing Android/x86 in a VPS and running the upstream tests there.

The good news is that @kinke's 1.2 changes to partially support cross-compiling reals work fine, i.e. all the same druntime/phobos tests pass when cross-compiled from linux/x64, so applying #1317 is no longer needed. Also, the full dmd-testsuite still passes natively on Android/ARM. 💃

kinke commented Mar 6, 2017

Thanks for the nice overview.

I suspected the std.json double roundtrip test would fail for 64-bit reals in general, but wasn't sure, so thanks for confirming. The relaxed tests should probably be upstreamed then (for GDC etc.).

joakim-noah commented Mar 12, 2017

I looked into the murmurhash segfault; it looks like the issue we had before with the unaligned exception table on ARM, ldc-developers/druntime#51. The last test checks the digest when run on unaligned data and segfaults when it reads the first unaligned 128-bit Element in this loop. With unoptimized codegen, i.e. -O0, it reads the first block in the foreach loop with the following ARM assembly (r0 holds the address of the first unaligned block):

ldr     r2, [r0]
ldr     r3, [r0, #4]
ldr     r1, [r0, #8]
ldr     r0, [r0, #12]

That works fine because ldr doesn't care whether r0 is aligned. However, when any optimizations are enabled, i.e. -O1, the first two loads are combined into an ldm, which does require an aligned address (r7 now holds the first unaligned address):

ldm     r7, {r2, r3}
ldr     r0, [r7, #8]
ldr     r1, [r7, #12]

I'm guessing this has to do with the cast in the foreach: the optimizer just assumes it's dealing with a properly aligned Element, i.e. the alignment information has been lost.

I'm unsure what to do about this; the whole point of the test seems to be to check that the digest works with unaligned data.

Update: I miscounted before and implied that the 128-bit Element is loaded into two 32-bit registers; I've updated the assembly to show it's actually four. That doesn't change the substance of the problem, but it may have been confusing.
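
For reference, the cast in question boils down to the following pattern (a rough sketch, not the exact Phobos code; Element stands in for the 128-bit block type):

alias Element = uint[4]; // 128-bit block with a 4-byte scalar alignment requirement

void putBlocks(const(ubyte)[] data)
{
    // The slice cast reinterprets the bytes as Element blocks and thereby
    // promises Element alignment, which data.ptr may not actually have.
    // At -O1 and above the optimizer trusts that promise and may emit ldm.
    foreach (ref block; cast(const(Element)[]) data[0 .. $ - ($ % Element.sizeof)])
    {
        // ... mix the block into the hash state ...
    }
}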

@dnadlinger

I'd say it's invalid code. D pointers/arrays are guaranteed to be aligned.

joakim-noah commented Mar 13, 2017

@klickverbot, just so we're clear, the API of digest.put is to take a ubyte array, and since anyone can always pass an arbitrary ubyte slice to put, I don't think it's unreasonable for the ubyte slice passed in to be sometimes unaligned to word boundaries like this. Murmurhash's put grabs those ubytes in 16-byte chunks, so it is prone to this issue, which is why @gchatelet checks for it with this test.

One interesting thing I just noticed: if I extract the unaligned test for 32-bit optimized MurmurHash with 128-bit Elements into a standalone file, it also segfaults because of the same optimization, even though it didn't in the unittest for some reason. So even the 32-bit version shows the same problem.

Are you saying that users of Murmurhash should know never to pass unaligned ubyte arrays to it? Or that something else is "invalid code?"

kinke commented Mar 13, 2017

ubyte arrays should never be unaligned as the ubyte alignment is 1. After a quick glance, MurmurHash3.put() seems buggy: it first fills up its internal buffer (e.g., 128-bit buffer, 11 bytes already used => read the first 5 data bytes) and then accesses the remaining data (starting at offset 5) in chunks by casting it to an Element (potentially uint[4] or ulong[2]) slice. And that's where the required alignment (4/8 bytes) is likely to be violated for data.ptr + 5. I.e., it depends not just on the alignment of data.ptr, but also on the state of the hasher's internal buffer.

@joakim-noah

ubyte arrays should never be unaligned as the ubyte alignment is 1.

On ARM, it seems to depend on the optimization level, i.e. at -O0 the ARM codegen will sometimes allocate ubyte arrays that are not word-aligned. In any case, a slice into a ubyte array can obviously be unaligned, which is what this test depends on. Given the way the test is set up, surely you agree that at least one of data[0 .. $-1] or data[1 .. $], the two slices passed to MurmurHash, must have its first ubyte at an unaligned address? After that, per the analysis quoted below, that same address plus 16 is what goes into the loop.
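
Concretely, here's a minimal sketch of that pigeonhole argument:

void main()
{
    ubyte[1024] data;
    auto a = data[0 .. $ - 1];
    auto b = data[1 .. $];
    // The two start addresses differ by exactly one byte, so they can
    // never both be word-aligned; at least one of the two put() calls
    // in the test must therefore read from an unaligned address.
    assert(!(cast(size_t) a.ptr % 4 == 0 && cast(size_t) b.ptr % 4 == 0));
}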

After a quick glance, MurmurHash3.put() seems buggy: it first fills up its internal buffer (e.g., 128-bit buffer, 11 bytes already used => read the first 5 data bytes) and then accesses the remaining data (starting at offset 5) in chunks by casting it to an Element (potentially uint[4] or ulong[2]) slice. And that's where the required alignment (4/8 bytes) is likely to be violated for data.ptr + 5. I.e., it depends not just on the alignment of data.ptr, but also on the state of the hasher's internal buffer.

I don't know what's going on on x86 or whatever arch you're testing, but on ARM with the given tests, the chunks always start at data.ptr + 16, because bufferSize is 0 and so bufferLeeway always jumps ahead by exactly one Element, i.e. 16 bytes. The state of the internal buffer seems to play no role for this bug on ARM.

kinke commented Mar 13, 2017

I don't know what's going on on x86 or whatever arch you're testing, but on ARM with the given tests, the chunks always start at data.ptr + 16, because bufferSize is 0 and so bufferLeeway always jumps ahead by exactly one Element, i.e. 16 bytes. The state of the internal buffer seems to play no role for this bug on ARM.

I just skimmed through the code, no testing performed. The way the test is set up (separately hashing two 1KB blocks), bufferSize is indeed always 0, but it isn't in the general case, and the Phobos code doesn't handle the resulting misalignments at all. It's not the reason for this particular test failure on ARM, but it's another way unaligned memory accesses can arise.
x86 allows misaligned memory accesses by default, although they may cost performance.

So I guess the code should be updated for misalignment-unfriendly architectures by checking ((cast(size_t) (data.ptr + bufferLeeway)) & (Element.sizeof - 1)) == 0 and only then using the chunked approach (or trying smaller chunks).

Edit: I don't know whether an additional LDC intrinsic for unaligned loads (emitting a load T, T* ptr, align 1 in IR) would work (edit: it obviously should, as -O0 works) or whether it'd allow for more efficient code.

dnadlinger commented Mar 14, 2017

ubyte arrays should never be unaligned as the ubyte alignment is 1

On ARM, it seems to depend on the optimization level

What @kinke and I were getting at is that ubyte arrays (or pointers) can by definition never be misaligned, because their alignment constraint is only 1 byte.

I presume you are using "aligned" to specifically mean word-size aligned. This is in general not enough, as some instructions – e.g. aligned SIMD loads even on x86 – might require larger alignments. In this specific case, word-size alignment might of course be enough to avoid the crash.

The upshot is that the implementation that just casts ubyte[] to Element[] is invalid.

@joakim-noah

It's not the reason for this particular test failure on ARM, but it's another way unaligned memory accesses can arise.

Ah, I thought you were saying it was the reason for this failure; got it now.

I presume you are using "aligned" to specifically mean word-size aligned.

Yes, this is why I clarified earlier that "I don't think it's unreasonable for the ubyte slice passed in to be sometimes unaligned to word boundaries like this," but maybe I should have emphasized that more.

The upshot is that the implementation that just casts ubyte[] to Element[] is invalid.

I mentioned that cast as the likely issue earlier, but what is the kosher D way to write such code and deal with alignment issues? I don't really deal with these casting and alignment issues much, so any pointer to other code that does this right would be helpful.

smolt commented Mar 14, 2017

On the ARM misalignment in murmurhash, I recalled it was discussed in the original PR, where @jpf91 made suggestions that perhaps weren't implemented correctly. See dlang/phobos#3916. The discussion on ARM alignment there was good.

As for std.random, it seems to be a stress case for LLVM's ARM codegen, maybe because individual functions use huge amounts of stack after inlining. When I played with watchOS (the device is somewhere in the attic now), there were extra pushes and pops around calls when a function's stack usage was huge. The pops were not handled in the exception landing pad, so the eventual function return went to the wrong address. Maybe that's what is happening here?

Maybe I'll have some time to catch up on LDC this weekend.

@dnadlinger

but what is the kosher D way to write such code and deal with alignment issues?

I don't know of any better way than doing it manually, i.e. reading off data[0 .. Element.sizeof] and then casting the value to Element. You could also do it the C way and piece together the 32-bit integer from the appropriately shifted 8-bit values. Hm, thinking about it, I believe there is a read function in std.bitmanip for endian-aware conversion that you might be able to reuse for this.
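
For illustration, both approaches in a rough sketch (the endianness below is an arbitrary choice for the example, not necessarily what MurmurHash wants):

import std.bitmanip : read;
import std.system : Endian;

// The C way: piece the word together from shifted byte values;
// no alignment assumption on b.ptr whatsoever.
uint wordAt(const(ubyte)[] b)
{
    return b[0] | (uint(b[1]) << 8) | (uint(b[2]) << 16) | (uint(b[3]) << 24);
}

// Via std.bitmanip.read, which consumes bytes from the front of the
// slice and is likewise alignment-agnostic.
uint nextWord(ref const(ubyte)[] b)
{
    return b.read!(uint, Endian.littleEndian)();
}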

kinke commented Mar 14, 2017

but what is the kosher D way to write such code and deal with alignment issues?

The kosher way is to always keep alignment in mind when casting pointers (or accessing fields in packed structs etc.), but hardly anyone does as long as no errors pop up.

A quickfix would be guarding the Element-chunk loop via if (((cast(size_t) (data.ptr + bufferLeeway)) & (Element.sizeof - 1)) == 0) (just checking whether the start address is a multiple of Element.sizeof, which must be a power of 2), as I already suggested, and falling back to copying the bytes in the unaligned case. A more elaborate variant would check whether the alignment of data.ptr + bufferLeeway is suitable for 64-bit longs and then use Element = ulong, otherwise check whether it's at least a multiple of 4 for Element = uint, maybe even another thingy for 16-bit shorts, and only then fall back to copying the bytes.

And yet another variant I already suggested would be introducing a new LDC intrinsic for arbitrary unaligned loads (afaik, we only have such a thing for unaligned SIMD vector loads - edit: here) and using it in the unaligned case. I guess that would take care of the shifting and or'ing for architectures that don't support unaligned loads directly.

Edit: Looking at the code again, Element is actually the full 128-bit type/chunk (ulong[2] / uint[4]). What I meant by Element above is a scalar element, i.e., ulong / uint, defining the alignment requirements.
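
Putting those suggestions together, a rough sketch of the guarded fast path with a copying fallback (consumeChunk is a hypothetical stand-in for the per-chunk hashing work; per the edit above, the relevant alignment requirement is the scalar one, i.e. Element.alignof):

import core.stdc.string : memcpy;

alias Element = uint[4]; // full 128-bit chunk; Element.alignof == uint.alignof == 4

void putChunks(const(ubyte)[] rest, scope void delegate(ref const Element) consumeChunk)
{
    if ((cast(size_t) rest.ptr & (Element.alignof - 1)) == 0)
    {
        // Fast path: the start address satisfies the scalar alignment,
        // so reinterpreting the bytes as Element chunks is fine.
        foreach (ref chunk; cast(const(Element)[]) rest[0 .. $ - ($ % Element.sizeof)])
            consumeChunk(chunk);
    }
    else
    {
        // Fallback: copy each chunk into a properly aligned local first.
        for (; rest.length >= Element.sizeof; rest = rest[Element.sizeof .. $])
        {
            Element tmp = void;
            memcpy(&tmp, rest.ptr, Element.sizeof);
            consumeChunk(tmp);
        }
    }
}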

kinke commented Mar 14, 2017

Here's a highly experimental hack. No idea about the performance impact and whether it actually works on ARM etc.; I just made sure the tests still work when using the unaligned code path with x86.

Amusing side note: the code in question containing the cast from ubyte[] to Element[] is actually @safe and doesn't even produce a warning. ;)

dnadlinger commented Mar 15, 2017

I haven't looked at your changes in any detail yet, but you will definitely want to keep the upstream people in the loop to make things easier.

I also imagine there might be other (C) MurmurHash3 implementations for ARM dealing with the problem already. In particular, it might make sense to keep the current code for archs where unaligned loads are reasonably fast (x86).

@gchatelet

Sorry for the late reply, and thanks for putting me in the loop.
I remember the murmurhash PR took quite a long time to get in, and I couldn't test it on ARM or any platform with bad/unsupported unaligned loads, so I just went with what I had - adding some tests to catch the failure when it happens.
I wanted to create a version identifier for platforms with poorly supported unaligned reads, but there didn't seem to be one I could test against back then (not sure one exists nowadays either). I recommend creating one; this problem will keep happening for other hashes as well.

The put(ubyte[]) interface is already quite slow compared to the plain Element interface; bit shifting/masking on x86 would really hurt performance.

One thing I don't like about D's cast is that it sometimes behaves like reinterpret_cast and sometimes like static_cast, and you don't really know whether the operation is cheap. I was under the impression that the slice cast would handle misalignment depending on the platform, but it seems it doesn't.

dnadlinger commented Mar 15, 2017

The put(ubyte[]) interface is already quite slow compared to the plain Element interface; bit shifting/masking on x86 would really hurt performance.

Yes, it certainly would.

I was under the impression that the slice cast would handle misalignment depending on the platform, but it seems it doesn't.

This can't really work for rather fundamental reasons – after casting the slice, it is as good an Element[] value as any other slice is, and there is no way to remember the lack of alignment (consider for example passing it on as a function parameter, or returning it from a function). The only possible recourse after detecting an alignment mismatch at runtime would be to throw an error – which, thinking about it, might be a helpful functionality in debug mode.
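
Something along these lines, say (a hypothetical helper, not an existing druntime facility):

// Checked variant of the slice cast: same reinterpretation, but asserts
// the alignment contract in debug builds instead of silently producing a
// misaligned slice.
T[] alignedCast(T, S)(S[] src)
{
    debug assert(cast(size_t) src.ptr % T.alignof == 0,
        "misaligned cast to " ~ T.stringof ~ "[]");
    return cast(T[]) src;
}

unittest
{
    uint[8] backing;
    auto bytes = cast(ubyte[]) backing[];
    auto words = alignedCast!uint(bytes); // fine, 4-byte aligned
    // alignedCast!uint(bytes[1 .. 17]);  // would trip the debug assert
}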

gchatelet commented Mar 15, 2017

The only possible recourse after detecting an alignment mismatch at runtime would be to throw an error – which, thinking about it, might be a helpful functionality in debug mode.

AFAIR that already happens when the data size mismatches, e.g. https://dpaste.dzfl.pl/4514bcc4eea1
It would totally make sense to throw on misalignment if it's an issue for the platform.

kinke commented Mar 15, 2017

I also imagine there might be other (C) MurmurHash3 implementations for ARM dealing with the problem already.

In essence, the 'problem' we are faced with is how to iteratively copy over chunks of 16 bytes in an efficient way. So any ARM memcpy() implementation could serve as a base. We can't use it directly, as we don't want a call and choose-best-method logic for each chunk.

In particular, it might make sense to keep the current code for archs where unaligned loads are reasonably fast (x86).

My hack does. I basically implemented the first more elaborate variant I suggested further up. It would still be interesting to see if, for instance, copying over 4 aligned ints is faster than copying 2 unaligned longs on X86 in case the start address is a multiple of 4 (but not 8).

@gchatelet

It would still be interesting to see if, for instance, copying over 4 aligned ints is faster than copying 2 unaligned longs on X86 in case the start address is a multiple of 4 (but not 8).

I'm pretty sure streaming unaligned reads on modern x86 are free. For the first read the whole cache line is fetched; for subsequent reads the prefetcher will load the cache lines in advance and hide the latency. Obviously this needs to be tested.

@joakim-noah

Thanks for all the info, @smolt and everyone else. I will read up more and get back.

I just tried cross-compiling the stdlib tests using ldc 1.2 with llvm 4.0 and got a bunch of modules segfaulting in the tests; I'll have to look into that too.

joakim-noah commented Mar 18, 2017

I spent some time tracking down the segfaults when using llvm 4.0: there may be an llvm regression in using ARM SIMD, i.e. NEON, instructions to speed up loads (yes, yet another alignment issue). For many functions that have array arguments, the ARM codegen uses the vld1.64 and vst1.64 instructions to quickly load small arrays onto the stack, in particular for array literals it seems.

There appears to be a recent optimization in llvm 4.0, only applied at -O1 or higher, that places array literals right after the function using them, rather than somewhere near the end of the object file. With llvm 3.9.1 at any optimization level, and with llvm 4.0 at -O0, the ARM assembly for the affected functions always looks something like this (showing the assembly for this test):

000106f0 <_D3std9algorithm8mutation16__unittestL191_2FNfZv>:
   106f0:       e92d4070        push    {r4, r5, r6, lr}
   106f4:       e24dd020        sub     sp, sp, #32
   106f8:       e59f01e0        ldr     r0, [pc, #480]  ; 108e0 <_D3std9algorithm8mutation16__unittestL191_2FNfZv+0x1f0>
--- snip irrelevant instructions ---
   10708:       e08f0000        add     r0, pc, r0
   1070c:       e58d401c        str     r4, [sp, #28]
   10710:       f4600aef        vld1.64 {d16-d17}, [r0 :128]
--- skip to end of this test function ---
   108e0:       00035a30        andeq   r5, r3, r0, lsr sl

--- skip almost near end of executable ---

Disassembly of section .rodata:
00046140 <.arrayliteral>:
   46140:       00000004        andeq   r0, r0, r4
   46144:       00000005        andeq   r0, r0, r5
   46148:       00000006        andeq   r0, r0, r6
   4614c:       00000007        andeq   r0, r0, r7
   46150:       00000001        andeq   r0, r0, r1
   46154:       00000002        andeq   r0, r0, r2
   46158:       00000003        andeq   r0, r0, r3

With llvm 4.0 at -O1, it turns into this:

000120d8 <_D3std9algorithm8mutation16__unittestL191_2FNfZv>:
   120d8:       e92d4bf0        push    {r4, r5, r6, r7, r8, r9, fp, lr}
   120dc:       e24dd028        sub     sp, sp, #40     ; 0x28
--- snip irrelevant instructions ---
   120ec:       e28f0f41        add     r0, pc, #260    ; 0x104
   120f0:       f4600aef        vld1.64 {d16-d17}, [r0 :128]
--- skip to right after this test function ---

000121f8 <.arrayliteral>:
   121f8:       00000004        andeq   r0, r0, r4
   121fc:       00000005        andeq   r0, r0, r5
   12200:       00000006        andeq   r0, r0, r6               
   12204:       00000007        andeq   r0, r0, r7          
   12208:       00000001        andeq   r0, r0, r1              
   1220c:       00000002        andeq   r0, r0, r2               
   12210:       00000003        andeq   r0, r0, r3

This always segfaults at the vld1.64 instruction because the starting address of the arrayliteral data is not 128-bit aligned, as the :128 annotation requires. What's interesting is that the same function compiled at -O2 starts working again, apparently because the more optimized function now happens to end where the arrayliteral data lands on a suitable boundary.

I can see how someone could miss this if they only tested with functions that happened to end on suitably aligned boundaries. Out of all the tests in the stdlib, only 14 modules have tests that fail because of this. What do you guys think: llvm bug, or is the alignment supposed to be set elsewhere?

dnadlinger commented Mar 18, 2017

What do you guys think, llvm bug or is the alignment supposed to be set elsewhere?

Check the IR for the function/global in question. If there is no mismatch in alignment, I'd say it's an LLVM bug.

@joakim-noah

Here are the relevant definitions from the IR at -O1, when the optimization is applied:

@.arrayliteral = internal unnamed_addr constant [7 x i32] [i32 4, i32 5, i32 6, i32 7, i32 1, i32 2, i32 3] ; [#uses = 1]
define void @_D3std9algorithm8mutation16__unittestL191_2FNfZv() local_unnamed_addr #0 comdat {
%list = alloca %"std.container.slist.SList!int.SList", align 4 ; [#uses = 5, size/byte = 4]                                           
%arrayliteral = alloca [7 x i32], align 4       ; [#uses = 2, size/byte = 28]                                                         
%r2 = alloca %"std.container.slist.SList!int.SList.Range", align 4 ; [#uses = 2, size/byte = 4]                                     
%1 = bitcast %"std.container.slist.SList!int.SList"* %list to i32* ; [#uses = 1]                                                     
store i32 0, i32* %1, align 4                                      
%2 = bitcast [7 x i32]* %arrayliteral to i8*    ; [#uses = 1]
call void @llvm.memcpy.p0i8.p0i8.i32(i8* nonnull %2, i8* bitcast ([7 x i32]* @.arrayliteral to i8*), i32 28, i32 4, i1 false)         
%3 = getelementptr inbounds [7 x i32], [7 x i32]* %arrayliteral, i32 0, i32 0 ; [#uses = 1, type = i32*]                             
%4 = insertvalue { i32, i32* } { i32 7, i32* undef }, i32* %3, 1 ; [#uses = 1]                                                        
%5 = call %"std.container.slist.SList!int.SList"* (%"std.container.slist.SList!int.SList"*, { i32, i32* }, ...)
@_D3std9container5slist12__T5SListTiZ5SList13__T6__ctorTiZ6__ctorMFNaNbNcNfAiXS3std9container5slist12__T5SListTiZ5SList(%"std.container.slist.SList!int.SList"* nonnull returned %list, { i32, i32* } %4) #0 ; [#uses = 0]
--- more IR ---

There is no alignment specified for the function or constant in the IR at any optimization level, which seems to imply llvm should be choosing the right alignment itself. I notice some bitcasts to i8* in the function; I'm not sure whether those affect how the constant gets aligned.

@jmolloy, if I use the llvm flag to turn the new constant pools optimization off, --arm-promote-constant=false, the segfaults go away. Is it possible this is another alignment issue, stemming from a bad interaction with the NEON load optimization (detailed two comments above)?

kinke commented Mar 21, 2017

Yep, looks like an LLVM bug. The constant has no explicit alignment (and should thus default to at least 4 due to its type - quoting the LLVM docs: If not present, or if the alignment is set to zero, the alignment of the global is set by the target to whatever it feels convenient.). The local %arrayliteral has an alignment of 4, and the memcpy() intrinsic gets an alignment of 4 (4th arg) as well.

joakim-noah commented Mar 22, 2017

I just tried @kinke's patch for murmurhash3; it fixes the problem on ARM. @jpf91 suggested copying each chunk into a union (his benchmarks showed no slowdown on x64), and this patch does something similar, but @gchatelet didn't bother with that originally - I wonder why?

@kinke, I think @klickverbot was suggesting not copying on x64, as you're currently doing for all arches. Either way, do you want to submit that as a PR for phobos, so we get this fixed upstream? You may want to modify the unaligned test to loop through several different misalignments, so you hit all your code paths.

@gchatelet

I did bother :-) When I tested this last year, performance was OK with LDC but quite catastrophic with DMD (I don't remember about GDC, but it should be tested as well).

Performance should remain as good as possible for all compilers, so please check before submitting.

kinke commented Mar 22, 2017

Cleaning up this hack and going through the probably tedious process of getting it merged upstream is very low on my priority list, so I'd very much favor someone else taking care of this.

@joakim-noah

@gchatelet, mind doing it? You know the code the best.

@dnadlinger

@joakim-noah: Is there a bugs.llvm.org entry for the constant alignment issue already?

@joakim-noah

No, want to submit it? I haven't accessed my account there in a while.

kinke commented Aug 16, 2017

With #2148, there are indeed a lot of new, fine-grained command line options (-enable-no-infs-fp-math etc.), but right now, these function attributes should be simple duplicates of the global settings (and I don't think you can customize all of them via cmdline right now). Johan added these not too long ago, apparently required for LTO: https://github.com/ldc-developers/ldc/blob/master/gen/functions.cpp#L423-L464. I would have thought that omitting these extra attributes wouldn't change a thing without LTO.

kinke commented Aug 16, 2017

["target-cpu"="generic" seems a bit suspicious to me, I would have expected a concrete ARM[v7] CPU. It's x86-64 for some .ll I have lying around. See https://github.com/ldc-developers/ldc/blob/master/driver/targetmachine.cpp#L149-L188; I recently let LLVM handle this, ltsmaster doesn't have that.]

@joakim-noah

I just tried commenting out applyTargetMachineAttributes, so that 1.4 applies the same function attributes as lts, and ldc still errors out. Very strange; I'm not sure what else is different at this point. I'll look further.

@joakim-noah

Comparing the IR generated for this extracted function, it's almost identical:

extern(C):
void _d_array_init_double(double* a, size_t n, double v)
{
    auto p = a;
    auto end = a+n;
    while (p !is end)
        *p++ = v;
}

Here are the two IR dumps taken from the last pass that successfully dumps IR when running with -print-before-all, for lts and 1.4 master. I'm not sure exactly why it tries to bitcast to the address of %vector.body, but it works if the variable is given a name by the loop vectorization pass rather than a number? Let me know if any of you have an idea.

kinke commented Aug 23, 2017

After a quick glance, the IR seems identical, the only difference being the stripped names for master for some reason. So I'm still thinking it has to do with a wrong CPU (generic). When running with -vv, the first line should show something like Targeting 'x86_64-pc-windows-msvc' (CPU 'x86-64' with features '+cx16'); the CPU would be the interesting bit to see if master really chooses generic over a concrete one.
If specifying the CPU used by ltsmaster via -mcpu for master works, we know where the problem is.

@joakim-noah

Yep, you got it. ltsmaster shows Targeting 'armv7-none-linux-android' (CPU 'cortex-a8' with features '') while master shows Targeting 'armv7-none-linux-android' (CPU 'generic' with features ''). Passing -mcpu=cortex-a8 to ldc 1.4 gets that single function to compile with the loop vectorization applied, and all the druntime tests compile and pass again.

kinke commented Aug 23, 2017

Alright, based on https://github.com/llvm-mirror/llvm/blob/master/include/llvm/Support/ARMTargetParser.def, there's no default CPU for architecture armv7. Almost certainly because it's too generic; cortex-a8 is the default CPU for armv7-a, cortex-r4 for armv7-r, cortex-m3 for armv7-m, cortex-m4 for armv7e-m, and swift for armv7s. I guess the main difference is their FPU features. So specifying a more precise triple should be enough.
[And the clang command line also includes -march=armv7-a.]

@joakim-noah

So you think just specifying the target triple for ldc isn't enough, and we have to do something like the NDK clang does, i.e. -target armv7-none-linux-androideabi -march=armv7-a -mfloat-abi=softfp -mfpu=vfpv3-d16 -mthumb? Until ldc 1.4, all I had to set was the triple and it would infer cortex-a8 (just verified with ldc 1.3). Is this so other ARM variations can be supported too?

kinke commented Aug 24, 2017

We previously defaulted manually to cortex-a8 for armv7; LLVM doesn't. So if something like armv7-a-none-linux-androideabi exists, that should work; otherwise, specifying an additional -march might. I don't think we support -mfpu, but it might come with #2148.

joakim-noah commented Aug 24, 2017

I don't think that triple does anything different, and my reading of the ARM Target Parser is that there's no default CPU for armv7/armv7-a. -march, when passed to ldc, doesn't accept armv7 or armv7-a, only basic options like arm or thumb, so it looks like -mcpu and -mattr are the way to go.

Update: I just tried compiling the init function extracted above with every CPU listed under ARMV7A, i.e. cortex-a5 through krait: all build except for cortex-a17, which I guess has some different features. Also, I just ran the Phobos tests, which I built overnight with -mcpu=cortex-a8, and everything works as normal again.

I've been trying to avoid the complexity of the variety of ARM configurations so far, and ldc setting cortex-a8 by default has allowed me to do that, but I guess we need to address that now. I'll start using -mcpu=cortex-a8 and document that for ldc Android users.
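
For Android/ARM that amounts to something like the following (a hedged example; the triple is the one from the -vv output above):

ldc2 -mtriple=armv7-none-linux-android -mcpu=cortex-a8 hello.d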

kinke commented Aug 24, 2017

There's definitely a default CPU for armv7-a, the cortex-a8 (see https://github.com/llvm-mirror/llvm/blob/release_40/include/llvm/Support/ARMTargetParser.def#L206; the 4th macro arg, true, indicates whether it is the default CPU for the architecture). The question is simply how to specify that architecture. I hoped it'd be possible via the triple directly, but if armv7-a-... doesn't work (and something like armv7a-... doesn't either), we should be looking into making -march work, as it apparently does for clang. Having to specify a full CPU is way too specific IMO. [I'm not a fan of the heterogeneous ARM architectures at all.]

@joakim-noah

Not anymore - I was looking at the master version you linked earlier, and I see now that it's not the default in 5.0 rc2 either. Given the variety of ARM CPUs, I'm fine with either way of specifying the features to target.

kinke commented Aug 24, 2017

Ah thanks, what an unholy mess. ;)

kinke commented Aug 24, 2017

I'll start using -mcpu=cortex-a8 and document that for ldc Android users.

While doing that, please also include TARGET_SYSTEM="Android;Linux;UNIX" in the ldc-build-runtime command line; I think it was missing in the forum post. We already had guys ending up targeting armvN-...-windows-msvc because they only used -march=armvN on a Windows host... ;)
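
I.e., something like this for the full runtime cross-build (a hedged example assembled from the flags discussed in this thread; check the wiki for the exact ldc-build-runtime syntax):

ldc-build-runtime --dFlags="-mtriple=armv7-none-linux-android;-mcpu=cortex-a8" TARGET_SYSTEM="Android;Linux;UNIX"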

@joakim-noah

Ah thanks, what an unholy mess. ;)

I actually like it; it's a sign of the much greater competition among ARM chips, though it comes with some effort on the developer end too.

As for setting TARGET_SYSTEM, I guess I missed that since I was only cross-compiling from linux. I'll move the full command to the wiki, and hopefully my PR to just set the OS and arch will make things much easier.

kinke commented Aug 25, 2017

though it comes with some effort on the developer end too.

And the end user. I recall some needless trouble finding the right prebuilt video codec for MX Player for my phone a while ago...

@joakim-noah

I haven't seen that last std.random issue with LLVM 7 lately, so I tried the reduced test case above; it still segfaults. However, it goes away if I specify the CPU features with -mcpu=cortex-a8, which is recommended anyway because of the loop vectorization issues listed above, so it shouldn't hit anyone anymore.

kinke commented Oct 29, 2018

Oh nice; let's just hope it's not 'fixed' by accident, only to pop up in other scenarios again.

It may make sense to default to cortex-a8 for suitable triples/architectures instead of hoping that people will read the release notes/wiki pages.

@joakim-noah

Does it fix the issue on armhf too? I'm not sure it makes sense to enforce cortex-a8 as the default, given the variety of ARM CPUs out there.

kinke commented Oct 30, 2018

That's ARMv6; Cortex-A8 is ARMv7. I haven't run the tests in the ARM emulator in ages, and may never run the emulator again, given the few dozen ARM downloads.

@joakim-noah

Yeah, given that AArch64 is taking over and doesn't have this problem, I don't think we should worry about this anymore.
