Android and ARM regressions in master 1.2 #2024

Closed
joakim-noah opened this issue Mar 6, 2017 · 90 comments

joakim-noah commented Mar 6, 2017

I recently tried building master both natively and as a linux/x64 cross-compiler and ran into the following test failures. I will look into them further and submit patches for the Phobos bugs upstream; I'd like some help with the LDC ones.

LDC issues

  • std.digest.murmurhash segfaults with a Bus error on its last test, when it tries to compute the unalignedHash with any llvm optimizations. This is reproducible going back to ltsmaster: all of 0.17.3, 1.1.1, and 1.2 (each built against llvm 3.9.1) pass this test without optimizations but segfault with -O1 or higher. This module was added with 1.2, so obviously it wasn't tested with those older D versions before, but it seems to have uncovered some ARM codegen issue.
  • std.random segfaults with this error when compiled with -O2 or higher:
    Fatal error in EH code: _Unwind_RaiseException failed with reason code: 9 Aborted
    A gdb backtrace shows that it fails in the massive last test block. ldc 0.17.3 can't build the 1.2 version of this module's tests because of a static assert, while 1.1.1 has no problem with it, even with all optimizations. This looks like the ARM codegen errors we were getting before, @smolt?

Phobos issues

I used to build for Android/x86 from upstream and catch these stdlib issues early, but haven't done that in a while, especially since my last x86 device died more than a year ago. I may look into installing Android/x86 in a VPS and running the upstream tests there.

The good news is that @kinke's 1.2 changes to partially support cross-compiling reals work fine, i.e. all the same druntime/phobos tests pass when cross-compiled from linux/x64, so applying #1317 is no longer needed. Also, the full dmd-testsuite still passes natively on Android/ARM. 💃

kinke commented Mar 6, 2017

Thanks for the nice overview.

I suspected the std.json double roundtrip test would fail for 64-bit reals in general, but wasn't sure, so thanks for confirming. The relaxed tests should probably be upstreamed then (for GDC etc.).

joakim-noah commented Mar 12, 2017

I looked into the murmurhash segfault; it looks like the issue we had before with the unaligned exception table on ARM, ldc-developers/druntime#51. The last test checks the digest when run on unaligned data and segfaults when it reads the first unaligned 128-bit Element in this loop. With unoptimized codegen, i.e. -O0, it reads the first block in the foreach loop with the following ARM assembly (r0 holds the address of the first unaligned block):

ldr     r2, [r0]
ldr     r3, [r0, #4]
ldr     r1, [r0, #8]
ldr     r0, [r0, #12]

That works fine because ldr doesn't care whether r0 is aligned. However, when any optimizations are enabled, i.e. -O1, the first two loads are combined into an ldm, which does require an aligned address (r7 now holds the first unaligned address):

ldm     r7, {r2, r3}
ldr     r0, [r7, #8]
ldr     r1, [r7, #12]

I'm guessing this has to do with the cast in the foreach: the optimizer just assumes it's dealing with a properly aligned Element, i.e. the alignment information has been lost.

I'm unsure what to do about this; the whole point of the test seems to be to check that the digest works with unaligned data.

Update: I miscounted before and implied that the 128-bit Element is loaded into two 32-bit registers; I've updated the assembly to show it's actually four. That doesn't change the substance of the problem, but it may have been confusing.
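
For reference, the cast in question boils down to the following pattern (a rough sketch, not the exact Phobos code; Element stands in for the 128-bit block type):

alias Element = uint[4]; // 128-bit block with a 4-byte scalar alignment requirement

void putBlocks(const(ubyte)[] data)
{
    // The slice cast reinterprets the bytes as Element blocks and thereby
    // promises Element alignment, which data.ptr may not actually have.
    // At -O1 and above the optimizer trusts that promise and may emit ldm.
    foreach (ref block; cast(const(Element)[]) data[0 .. $ - ($ % Element.sizeof)])
    {
        // ... mix the block into the hash state ...
    }
}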

@dnadlinger

I'd say it's invalid code. D pointers/arrays are guaranteed to be aligned.

joakim-noah commented Mar 13, 2017

@klickverbot, just so we're clear, the API of digest.put is to take a ubyte array, and since anyone can always pass an arbitrary ubyte slice to put, I don't think it's unreasonable for the ubyte slice passed in to be sometimes unaligned to word boundaries like this. Murmurhash's put grabs those ubytes in 16-byte chunks, so it is prone to this issue, which is why @gchatelet checks for it with this test.

One interesting thing I just noticed: if I extract the unaligned test for 32-bit optimized MurmurHash with 128-bit Elements into a standalone file, it also segfaults because of the same optimization, even though it didn't in the unittest for some reason. So even the 32-bit version shows the same problem.

Are you saying that users of Murmurhash should know never to pass unaligned ubyte arrays to it? Or that something else is "invalid code?"

kinke commented Mar 13, 2017

ubyte arrays should never be unaligned as the ubyte alignment is 1. After a quick glance, MurmurHash3.put() seems buggy: it first fills up its internal buffer (e.g., 128-bit buffer, 11 bytes already used => read the first 5 data bytes) and then accesses the remaining data (starting at offset 5) in chunks by casting it to an Element (potentially uint[4] or ulong[2]) slice. And that's where the required alignment (4/8 bytes) is likely to be violated for data.ptr + 5. I.e., it depends not just on the alignment of data.ptr, but also on the state of the hasher's internal buffer.

@joakim-noah

ubyte arrays should never be unaligned as the ubyte alignment is 1.

On ARM, it seems to depend on the optimization level, i.e. at -O0 the ARM codegen will sometimes allocate ubyte arrays that are not word-aligned. In any case, a slice into a ubyte array can obviously be unaligned, which is what this test depends on. Given the way the test is set up, surely you agree that at least one of data[0 .. $-1] or data[1 .. $], the two slices passed to MurmurHash, must have its first ubyte at an unaligned address? After that, per the analysis quoted below, that same address plus 16 is what goes into the loop.
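
Concretely, here's a minimal sketch of that pigeonhole argument:

void main()
{
    ubyte[1024] data;
    auto a = data[0 .. $ - 1];
    auto b = data[1 .. $];
    // The two start addresses differ by exactly one byte, so they can
    // never both be word-aligned; at least one of the two put() calls
    // in the test must therefore read from an unaligned address.
    assert(!(cast(size_t) a.ptr % 4 == 0 && cast(size_t) b.ptr % 4 == 0));
}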

After a quick glance, MurmurHash3.put() seems buggy: it first fills up its internal buffer (e.g., 128-bit buffer, 11 bytes already used => read the first 5 data bytes) and then accesses the remaining data (starting at offset 5) in chunks by casting it to an Element (potentially uint[4] or ulong[2]) slice. And that's where the required alignment (4/8 bytes) is likely to be violated for data.ptr + 5. I.e., it depends not just on the alignment of data.ptr, but also on the state of the hasher's internal buffer.

I don't know what's going on on x86 or whatever arch you're testing, but on ARM with the given tests, the chunks always start at data.ptr + 16, because bufferSize is 0 and so bufferLeeway always jumps ahead by exactly one Element, i.e. 16 bytes. The state of the internal buffer seems to play no role for this bug on ARM.

kinke commented Mar 13, 2017

I don't know what's going on on x86 or whatever arch you're testing, but on ARM with the given tests, the chunks always start at data.ptr + 16, because bufferSize is 0 and so bufferLeeway always jumps ahead by exactly one Element, i.e. 16 bytes. The state of the internal buffer seems to play no role for this bug on ARM.

I just skimmed through the code, no testing performed. The way the test is set up (separately hashing two 1KB blocks), bufferSize is indeed always 0, but it isn't in the general case, and the Phobos code doesn't handle the resulting misalignments at all. It's not the reason for this particular test failure on ARM, but it's another way unaligned memory accesses can arise.
x86 allows misaligned memory accesses by default, although they may cost performance.

So I guess the code should be updated for misalignment-unfriendly architectures by checking ((cast(size_t) (data.ptr + bufferLeeway)) & (Element.sizeof - 1)) == 0 and only then using the chunked approach (or trying smaller chunks).

Edit: I don't know whether an additional LDC intrinsic for unaligned loads (emitting a load T, T* ptr, align 1 in IR) would work (edit: it obviously should, as -O0 works) or whether it'd allow for more efficient code.

dnadlinger commented Mar 14, 2017

ubyte arrays should never be unaligned as the ubyte alignment is 1

On ARM, it seems to depend on the optimization level

What @kinke and I were getting at is that ubyte arrays (or pointers) can by definition never be misaligned, because their alignment constraint is only 1 byte.

I presume you are using "aligned" to specifically mean word-size aligned. This is in general not enough, as some instructions – e.g. aligned SIMD loads even on x86 – might require larger alignments. In this specific case, word-size alignment might of course be enough to avoid the crash.

The upshot is that the implementation that just casts ubyte[] to Element[] is invalid.

@joakim-noah

It's not the reason for this particular test failure on ARM, but it's another way unaligned memory accesses can arise.

Ah, I thought you were saying it was the reason for this failure; got it now.

I presume you are using "aligned" to specifically mean word-size aligned.

Yes, this is why I clarified earlier that "I don't think it's unreasonable for the ubyte slice passed in to be sometimes unaligned to word boundaries like this," but maybe I should have emphasized that more.

The upshot is that the implementation that just casts ubyte[] to Element[] is invalid.

I mentioned that cast as the likely issue earlier, but what is the kosher D way to write such code and deal with alignment issues? I don't really deal with these casting and alignment issues much, so any pointer to other code that does this right would be helpful.

smolt commented Mar 14, 2017

On the ARM misalignment in murmurhash, I recalled it was discussed in the original PR, where @jpf91 made suggestions that perhaps weren't implemented correctly. See dlang/phobos#3916. The discussion on ARM alignment there was good.

As for std.random, it seems to be a stress case for LLVM's ARM codegen, maybe because individual functions use huge amounts of stack after inlining. When I played with watchOS (the device is somewhere in the attic now), there were extra pushes and pops around calls when a function's stack usage was huge. The pops were not handled in the exception landing pad, so the eventual function return went to the wrong address. Maybe that's what is happening here?

Maybe I'll have some time to catch up on LDC this weekend.

@dnadlinger

but what is the kosher D way to write such code and deal with alignment issues?

I don't know of any better way than doing it manually, i.e. reading off data[0 .. Element.sizeof] and then casting the value to Element. You could also do it the C way and piece together the 32-bit integer from the appropriately shifted 8-bit values. Hm, thinking about it, I believe there is a read function in std.bitmanip for endian-aware conversion that you might be able to reuse for this.
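
For illustration, both approaches in a rough sketch (the endianness below is an arbitrary choice for the example, not necessarily what MurmurHash wants):

import std.bitmanip : read;
import std.system : Endian;

// The C way: piece the word together from shifted byte values;
// no alignment assumption on b.ptr whatsoever.
uint wordAt(const(ubyte)[] b)
{
    return b[0] | (uint(b[1]) << 8) | (uint(b[2]) << 16) | (uint(b[3]) << 24);
}

// Via std.bitmanip.read, which consumes bytes from the front of the
// slice and is likewise alignment-agnostic.
uint nextWord(ref const(ubyte)[] b)
{
    return b.read!(uint, Endian.littleEndian)();
}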

kinke commented Mar 14, 2017

but what is the kosher D way to write such code and deal with alignment issues?

The kosher way is to always keep alignment in mind when casting pointers (or accessing fields in packed structs etc.), but hardly anyone does as long as no errors pop up.

A quickfix would be guarding the Element-chunk loop via if (((cast(size_t) (data.ptr + bufferLeeway)) & (Element.sizeof - 1)) == 0) (just checking whether the start address is a multiple of Element.sizeof, which must be a power of 2), as I already suggested, and falling back to copying the bytes in the unaligned case. A more elaborate variant would check whether the alignment of data.ptr + bufferLeeway is suitable for 64-bit longs and then use Element = ulong, otherwise check whether it's at least a multiple of 4 for Element = uint, maybe even another thingy for 16-bit shorts, and only then fall back to copying the bytes.

And yet another variant I already suggested would be introducing a new LDC intrinsic for arbitrary unaligned loads (afaik, we only have such a thing for unaligned SIMD vector loads - edit: here) and using it in the unaligned case. I guess that would take care of the shifting and or'ing for architectures that don't support unaligned loads directly.

Edit: Looking at the code again, Element is actually the full 128-bit type/chunk (ulong[2] / uint[4]). What I meant by Element above is a scalar element, i.e., ulong / uint, defining the alignment requirements.
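
Putting those suggestions together, a rough sketch of the guarded fast path with a copying fallback (consumeChunk is a hypothetical stand-in for the per-chunk hashing work; per the edit above, the relevant alignment requirement is the scalar one, i.e. Element.alignof):

import core.stdc.string : memcpy;

alias Element = uint[4]; // full 128-bit chunk; Element.alignof == uint.alignof == 4

void putChunks(const(ubyte)[] rest, scope void delegate(ref const Element) consumeChunk)
{
    if ((cast(size_t) rest.ptr & (Element.alignof - 1)) == 0)
    {
        // Fast path: the start address satisfies the scalar alignment,
        // so reinterpreting the bytes as Element chunks is fine.
        foreach (ref chunk; cast(const(Element)[]) rest[0 .. $ - ($ % Element.sizeof)])
            consumeChunk(chunk);
    }
    else
    {
        // Fallback: copy each chunk into a properly aligned local first.
        for (; rest.length >= Element.sizeof; rest = rest[Element.sizeof .. $])
        {
            Element tmp = void;
            memcpy(&tmp, rest.ptr, Element.sizeof);
            consumeChunk(tmp);
        }
    }
}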

kinke commented Mar 14, 2017

Here's a highly experimental hack. No idea about the performance impact and whether it actually works on ARM etc.; I just made sure the tests still work when using the unaligned code path with x86.

Amusing side note: the code in question containing the cast from ubyte[] to Element[] is actually @safe and doesn't even produce a warning. ;)

dnadlinger commented Mar 15, 2017

I haven't looked at your changes in any detail yet, but you will definitely want to keep the upstream people in the loop to make things easier.

I also imagine there might be other (C) MurmurHash3 implementations for ARM dealing with the problem already. In particular, it might make sense to keep the current code for archs where unaligned loads are reasonably fast (x86).

@gchatelet

Sorry for the late reply, and thanks for putting me in the loop.
I remember the murmurhash PR took quite a long time to get in, and I couldn't test it on ARM or any platform with bad/unsupported unaligned loads, so I just went with what I had - adding some tests to catch the failure when it happens.
I wanted to create a version identifier for platforms with poorly supported unaligned reads, but there didn't seem to be one I could test against back then (not sure one exists nowadays either). I recommend creating one; this problem will keep happening for other hashes as well.

The put(ubyte[]) interface is already quite slow compared to the plain Element interface; bit shifting/masking on x86 would really hurt performance.

One thing I don't like about D's cast is that it sometimes behaves like reinterpret_cast and sometimes like static_cast, and you don't really know whether the operation is cheap. I was under the impression that the slice cast would handle misalignment depending on the platform, but it seems it doesn't.

dnadlinger commented Mar 15, 2017

The put(ubyte[]) interface is already quite slow compared to the plain Element interface; bit shifting/masking on x86 would really hurt performance.

Yes, it certainly would.

I was under the impression that the slice cast would handle misalignment depending on the platform, but it seems it doesn't.

This can't really work for rather fundamental reasons – after casting the slice, it is as good an Element[] value as any other slice is, and there is no way to remember the lack of alignment (consider for example passing it on as a function parameter, or returning it from a function). The only possible recourse after detecting an alignment mismatch at runtime would be to throw an error – which, thinking about it, might be a helpful functionality in debug mode.
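
Something along these lines, say (a hypothetical helper, not an existing druntime facility):

// Checked variant of the slice cast: same reinterpretation, but asserts
// the alignment contract in debug builds instead of silently producing a
// misaligned slice.
T[] alignedCast(T, S)(S[] src)
{
    debug assert(cast(size_t) src.ptr % T.alignof == 0,
        "misaligned cast to " ~ T.stringof ~ "[]");
    return cast(T[]) src;
}

unittest
{
    uint[8] backing;
    auto bytes = cast(ubyte[]) backing[];
    auto words = alignedCast!uint(bytes); // fine, 4-byte aligned
    // alignedCast!uint(bytes[1 .. 17]);  // would trip the debug assert
}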

gchatelet commented Mar 15, 2017

The only possible recourse after detecting an alignment mismatch at runtime would be to throw an error – which, thinking about it, might be a helpful functionality in debug mode.

AFAIR that already happens when the data size mismatches, e.g. https://dpaste.dzfl.pl/4514bcc4eea1
It would totally make sense to throw on misalignment if it's an issue for the platform.

kinke commented Mar 15, 2017

I also imagine there might be other (C) MurmurHash3 implementations for ARM dealing with the problem already.

In essence, the 'problem' we are faced with is how to iteratively copy over chunks of 16 bytes in an efficient way. So any ARM memcpy() implementation could serve as a base. We can't use it directly, as we don't want a call and choose-best-method logic for each chunk.

In particular, it might make sense to keep the current code for archs where unaligned loads are reasonably fast (x86).

My hack does. I basically implemented the first more elaborate variant I suggested further up. It would still be interesting to see if, for instance, copying over 4 aligned ints is faster than copying 2 unaligned longs on X86 in case the start address is a multiple of 4 (but not 8).

@gchatelet

It would still be interesting to see if, for instance, copying over 4 aligned ints is faster than copying 2 unaligned longs on X86 in case the start address is a multiple of 4 (but not 8).

I'm pretty sure streaming unaligned reads on modern x86 are free. For the first read the whole cache line is fetched; for subsequent reads the prefetcher will load the cache lines in advance and hide the latency. Obviously this needs to be tested.

@joakim-noah

Thanks for all the info, @smolt and everyone else. I will read up more and get back.

I just tried cross-compiling the stdlib tests using ldc 1.2 with llvm 4.0 and got a bunch of modules segfaulting in the tests; I'll have to look into that too.

joakim-noah commented Mar 18, 2017

I spent some time tracking down the segfaults when using llvm 4.0: there may be an llvm regression in using ARM SIMD, i.e. NEON, instructions to speed up loads (yes, yet another alignment issue). For many functions that have array arguments, the ARM codegen uses the vld1.64 and vst1.64 instructions to quickly load small arrays onto the stack, in particular for array literals it seems.

There appears to be a recent optimization in llvm 4.0, only applied at -O1 or higher, that places array literals right after the function using them, rather than somewhere near the end of the object file. With llvm 3.9.1 at any optimization level, and with llvm 4.0 at -O0, the ARM assembly for the affected functions always looks something like this (showing the assembly for this test):

000106f0 <_D3std9algorithm8mutation16__unittestL191_2FNfZv>:
   106f0:       e92d4070        push    {r4, r5, r6, lr}
   106f4:       e24dd020        sub     sp, sp, #32
   106f8:       e59f01e0        ldr     r0, [pc, #480]  ; 108e0 <_D3std9algorithm8mutation16__unittestL191_2FNfZv+0x1f0>
--- snip irrelevant instructions ---
   10708:       e08f0000        add     r0, pc, r0
   1070c:       e58d401c        str     r4, [sp, #28]
   10710:       f4600aef        vld1.64 {d16-d17}, [r0 :128]
--- skip to end of this test function ---
   108e0:       00035a30        andeq   r5, r3, r0, lsr sl

--- skip almost near end of executable ---

Disassembly of section .rodata:
00046140 <.arrayliteral>:
   46140:       00000004        andeq   r0, r0, r4
   46144:       00000005        andeq   r0, r0, r5
   46148:       00000006        andeq   r0, r0, r6
   4614c:       00000007        andeq   r0, r0, r7
   46150:       00000001        andeq   r0, r0, r1
   46154:       00000002        andeq   r0, r0, r2
   46158:       00000003        andeq   r0, r0, r3

With llvm 4.0 at -O1, it turns into this:

000120d8 <_D3std9algorithm8mutation16__unittestL191_2FNfZv>:
   120d8:       e92d4bf0        push    {r4, r5, r6, r7, r8, r9, fp, lr}
   120dc:       e24dd028        sub     sp, sp, #40     ; 0x28
--- snip irrelevant instructions ---
   120ec:       e28f0f41        add     r0, pc, #260    ; 0x104
   120f0:       f4600aef        vld1.64 {d16-d17}, [r0 :128]
--- skip to right after this test function ---

000121f8 <.arrayliteral>:
   121f8:       00000004        andeq   r0, r0, r4
   121fc:       00000005        andeq   r0, r0, r5
   12200:       00000006        andeq   r0, r0, r6               
   12204:       00000007        andeq   r0, r0, r7          
   12208:       00000001        andeq   r0, r0, r1              
   1220c:       00000002        andeq   r0, r0, r2               
   12210:       00000003        andeq   r0, r0, r3

This always segfaults at the vld1.64 instruction because the starting address of the arrayliteral data is not 128-bit aligned, as the :128 annotation requires. What's interesting is that the same function compiled at -O2 starts working again, apparently because the more optimized function now happens to end where the arrayliteral data lands on a suitable boundary.

I can see how someone could miss this if they only tested with functions that happened to end on suitably aligned boundaries. Out of all the tests in the stdlib, only 14 modules have tests that fail because of this. What do you guys think: llvm bug, or is the alignment supposed to be set elsewhere?

dnadlinger commented Mar 18, 2017

What do you guys think, llvm bug or is the alignment supposed to be set elsewhere?

Check the IR for the function/global in question. If there is no mismatch in alignment, I'd say it's an LLVM bug.

@joakim-noah

Here are the relevant definitions from the IR at -O1, when the optimization is applied:

@.arrayliteral = internal unnamed_addr constant [7 x i32] [i32 4, i32 5, i32 6, i32 7, i32 1, i32 2, i32 3] ; [#uses = 1]
define void @_D3std9algorithm8mutation16__unittestL191_2FNfZv() local_unnamed_addr #0 comdat {
%list = alloca %"std.container.slist.SList!int.SList", align 4 ; [#uses = 5, size/byte = 4]                                           
%arrayliteral = alloca [7 x i32], align 4       ; [#uses = 2, size/byte = 28]                                                         
%r2 = alloca %"std.container.slist.SList!int.SList.Range", align 4 ; [#uses = 2, size/byte = 4]                                     
%1 = bitcast %"std.container.slist.SList!int.SList"* %list to i32* ; [#uses = 1]                                                     
store i32 0, i32* %1, align 4                                      
%2 = bitcast [7 x i32]* %arrayliteral to i8*    ; [#uses = 1]
call void @llvm.memcpy.p0i8.p0i8.i32(i8* nonnull %2, i8* bitcast ([7 x i32]* @.arrayliteral to i8*), i32 28, i32 4, i1 false)         
%3 = getelementptr inbounds [7 x i32], [7 x i32]* %arrayliteral, i32 0, i32 0 ; [#uses = 1, type = i32*]                             
%4 = insertvalue { i32, i32* } { i32 7, i32* undef }, i32* %3, 1 ; [#uses = 1]                                                        
%5 = call %"std.container.slist.SList!int.SList"* (%"std.container.slist.SList!int.SList"*, { i32, i32* }, ...)
@_D3std9container5slist12__T5SListTiZ5SList13__T6__ctorTiZ6__ctorMFNaNbNcNfAiXS3std9container5slist12__T5SListTiZ5SList(%"std.container.slist.SList!int.SList"* nonnull returned %list, { i32, i32* } %4) #0 ; [#uses = 0]
--- more IR ---

There is no alignment specified for the function or constant in the IR at any optimization level, which seems to imply llvm should be choosing the right alignment itself. I notice some bitcasts to i8* in the function; I'm not sure whether those affect how the constant gets aligned.

@jmolloy, if I use the llvm flag to turn the new constant pools optimization off, --arm-promote-constant=false, the segfaults go away. Is it possible this is another alignment issue, stemming from a bad interaction with the NEON load optimization (detailed two comments above)?

kinke commented Mar 21, 2017

Yep, looks like an LLVM bug. The constant has no explicit alignment (and should thus default to at least 4 due to its type - quoting the LLVM docs: If not present, or if the alignment is set to zero, the alignment of the global is set by the target to whatever it feels convenient.). The local %arrayliteral has an alignment of 4, and the memcpy() intrinsic gets an alignment of 4 (4th arg) as well.

joakim-noah commented Mar 22, 2017

I just tried @kinke's patch for murmurhash3; it fixes the problem on ARM. @jpf91 suggested copying each chunk into a union (his benchmarks showed no slowdown on x64), and this patch does something similar, but @gchatelet didn't bother with that originally - I wonder why?

@kinke, I think @klickverbot was suggesting not copying on x64, as you're currently doing for all arches. Either way, do you want to submit that as a PR for phobos, so we get this fixed upstream? You may want to modify the unaligned test to loop through several different misalignments, so you hit all your code paths.

@gchatelet

I did bother :-) When I tested this last year, performance was OK with LDC but quite catastrophic with DMD (I don't remember about GDC, but it should be tested as well).

Performance should remain as good as possible for all compilers, so please check before submitting.

kinke commented Mar 22, 2017

Cleaning up this hack and going through the probably tedious process of getting it merged upstream is very low on my priority list, so I'd very much favor someone else taking care of this.

@joakim-noah

@gchatelet, mind doing it? You know the code the best.

@dnadlinger

@joakim-noah: Is there a bugs.llvm.org entry for the constant alignment issue already?

@joakim-noah

No, want to submit it? I haven't accessed my account there in a while.

kinke commented Aug 16, 2017

With #2148, there are indeed a lot of new, fine-grained command line options (-enable-no-infs-fp-math etc.), but right now, these function attributes should be simple duplicates of the global settings (and I don't think you can customize all of them via cmdline right now). Johan added these not too long ago, apparently required for LTO: https://github.com/ldc-developers/ldc/blob/master/gen/functions.cpp#L423-L464. I would have thought that omitting these extra attributes wouldn't change a thing without LTO.

kinke commented Aug 16, 2017

["target-cpu"="generic" seems a bit suspicious to me, I would have expected a concrete ARM[v7] CPU. It's x86-64 for some .ll I have lying around. See https://github.com/ldc-developers/ldc/blob/master/driver/targetmachine.cpp#L149-L188; I recently let LLVM handle this, ltsmaster doesn't have that.]

@joakim-noah

I just tried commenting out applyTargetMachineAttributes, so that 1.4 applies the same function attributes as lts, and ldc still errors out. Very strange; I'm not sure what else is different at this point. I'll look further.

@joakim-noah

Comparing the IR generated for this extracted function, it's almost identical:

extern(C):
void _d_array_init_double(double* a, size_t n, double v)
{
    auto p = a;
    auto end = a+n;
    while (p !is end)
        *p++ = v;
}

Here are the two IR dumps taken from the last pass that successfully dumps IR when running with -print-before-all, for lts and 1.4 master. I'm not sure exactly why it tries to bitcast to the address of %vector.body, but it works if the variable is given a name by the loop vectorization pass rather than a number? Let me know if any of you have an idea.

kinke commented Aug 23, 2017

After a quick glance, the IR seems identical, the only difference being the stripped names for master for some reason. So I'm still thinking it has to do with a wrong CPU (generic). When running with -vv, the first line should show something like Targeting 'x86_64-pc-windows-msvc' (CPU 'x86-64' with features '+cx16'); the CPU would be the interesting bit to see if master really chooses generic over a concrete one.
If specifying the CPU used by ltsmaster via -mcpu for master works, we know where the problem is.

@joakim-noah

Yep, you got it. ltsmaster shows Targeting 'armv7-none-linux-android' (CPU 'cortex-a8' with features '') while master shows Targeting 'armv7-none-linux-android' (CPU 'generic' with features ''). Passing -mcpu=cortex-a8 to ldc 1.4 gets that single function to compile with the loop vectorization applied, and all the druntime tests compile and pass again.

kinke commented Aug 23, 2017

Alright, based on https://github.com/llvm-mirror/llvm/blob/master/include/llvm/Support/ARMTargetParser.def, there's no default CPU for architecture armv7. Almost certainly because it's too generic; cortex-a8 is the default CPU for armv7-a, cortex-r4 for armv7-r, cortex-m3 for armv7-m, cortex-m4 for armv7e-m, and swift for armv7s. I guess the main difference is their FPU features. So specifying a more precise triple should be enough.
[And the clang command line also includes -march=armv7-a.]

@joakim-noah

So you think just specifying the target triple for ldc isn't enough, and we have to do something like the NDK clang does, i.e. -target armv7-none-linux-androideabi -march=armv7-a -mfloat-abi=softfp -mfpu=vfpv3-d16 -mthumb? Until ldc 1.4, all I had to set was the triple and it would infer cortex-a8 (just verified with ldc 1.3). Is this so other ARM variations can be supported too?

kinke commented Aug 24, 2017

We previously defaulted manually to cortex-a8 for armv7; LLVM doesn't. So if something like armv7-a-none-linux-androideabi exists, that should work; otherwise, specifying an additional -march might. I don't think we support -mfpu, but it might come with #2148.

joakim-noah commented Aug 24, 2017

I don't think that triple does anything different, and my reading of the ARM Target Parser is that there's no default CPU for armv7/armv7-a. -march, when passed to ldc, doesn't accept armv7 or armv7-a, only basic options like arm or thumb, so it looks like -mcpu and -mattr are the way to go.

Update: I just tried compiling the init function extracted above with every CPU listed under ARMV7A, i.e. cortex-a5 through krait: all build except for cortex-a17, which I guess has some different features. Also, I just ran the Phobos tests, which I built overnight with -mcpu=cortex-a8, and everything works as normal again.

I've been trying to avoid the complexity of the variety of ARM configurations so far, and ldc setting cortex-a8 by default has allowed me to do that, but I guess we need to address that now. I'll start using -mcpu=cortex-a8 and document that for ldc Android users.
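
For Android/ARM that amounts to something like the following (a hedged example; the triple is the one from the -vv output above):

ldc2 -mtriple=armv7-none-linux-android -mcpu=cortex-a8 hello.d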

kinke commented Aug 24, 2017

There's definitely a default CPU for armv7-a, the cortex-a8 (see https://github.com/llvm-mirror/llvm/blob/release_40/include/llvm/Support/ARMTargetParser.def#L206; the 4th macro arg, true, indicates whether it is the default CPU for the architecture). The question is simply how to specify that architecture. I hoped it'd be possible via the triple directly, but if armv7-a-... doesn't work (and something like armv7a-... doesn't either), we should be looking into making -march work, as it apparently does for clang. Having to specify a full CPU is way too specific IMO. [I'm not a fan of the heterogeneous ARM architectures at all.]

@joakim-noah

Not anymore - I was looking at the master version you linked earlier, and I see now that it's not the default in 5.0 rc2 either. Given the variety of ARM CPUs, I'm fine with either way of specifying the features to target.

kinke commented Aug 24, 2017

Ah thanks, what an unholy mess. ;)

kinke commented Aug 24, 2017

I'll start using -mcpu=cortex-a8 and document that for ldc Android users.

While doing that, please also include TARGET_SYSTEM="Android;Linux;UNIX" in the ldc-build-runtime command line; I think it was missing in the forum post. We already had guys ending up targeting armvN-...-windows-msvc because they only used -march=armvN on a Windows host... ;)
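
I.e., something like this for the full runtime cross-build (a hedged example assembled from the flags discussed in this thread; check the wiki for the exact ldc-build-runtime syntax):

ldc-build-runtime --dFlags="-mtriple=armv7-none-linux-android;-mcpu=cortex-a8" TARGET_SYSTEM="Android;Linux;UNIX"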

@joakim-noah

Ah thanks, what an unholy mess. ;)

I actually like it; it's a sign of the much greater competition among ARM chips, though it comes with some effort on the developer end too.

As for setting TARGET_SYSTEM, I guess I missed that since I was only cross-compiling from linux. I'll move the full command to the wiki, and hopefully my PR to just set the OS and arch will make things much easier.

kinke commented Aug 25, 2017

though it comes with some effort on the developer end too.

And the end user. I recall some needless trouble finding the right prebuilt video codec for MX Player for my phone a while ago...

@joakim-noah

I haven't seen that last std.random issue with LLVM 7 lately, so I tried the reduced test case above; it still segfaults. However, it goes away if I specify the CPU features with -mcpu=cortex-a8, which is recommended anyway because of the loop vectorization issues listed above, so it shouldn't hit anyone anymore.

kinke commented Oct 29, 2018

Oh nice; let's just hope it's not 'fixed' by accident, only to pop up in other scenarios again.

It may make sense to default to cortex-a8 for suitable triples/architectures instead of hoping that people will read the release notes/wiki pages.

@joakim-noah

Does it fix the issue on armhf too? I'm not sure it makes sense to enforce cortex-a8 as the default, given the variety of ARM CPUs out there.

kinke commented Oct 30, 2018

That's ARMv6; Cortex-A8 is ARMv7. I haven't run the tests in the ARM emulator in ages, and may never run the emulator again, given the few dozen ARM downloads.

@joakim-noah

Yeah, given that AArch64 is taking over and doesn't have this problem, I don't think we should worry about this anymore.
