Bus error on armhf during t/moar/02-qast-references.t test #431

Closed
dod38fr opened this issue Mar 19, 2018 · 21 comments

Comments

@dod38fr

dod38fr commented Mar 19, 2018

Hello

nqp tests always fail with bus error on armhf:

Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
Core was generated by `/usr/bin/moar nqp.moarvm t/moar/02-qast-references.t'.
Program terminated with signal SIGBUS, Bus error.
#0  set_num (tc=0x1db1758, st=0x1de51e0, root=<optimized out>, data=0xb5081185, value=34) at src/6model/reprs/P6num.c:54
54      src/6model/reprs/P6num.c: No such file or directory.
[Current thread is 1 (Thread 0xb6fa0e40 (LWP 1692))]
(gdb) bt
#0  set_num (tc=0x1db1758, st=0x1de51e0, root=<optimized out>, data=0xb5081185, value=34) at src/6model/reprs/P6num.c:54
#1  0xb6bf85b4 in MVM_box_num (tc=0x1db1758, value=34, type=<optimized out>, dst=0x1e57d00) at src/core/coerce.c:424
#2  0xb6be1b6c in MVM_interp_run (tc=tc@entry=0x1db1758, initial_invoke=0x1e57d18, invoke_data=0x40) at src/core/interp.c:2148
#3  0xb6c60e22 in MVM_vm_run_file (instance=<optimized out>, filename=<optimized out>) at src/moar.c:407
#4  0x004a5c80 in main (argc=3, argv=0xbec7c684) at src/main.c:256

See also https://buildd.debian.org/status/fetch.php?pkg=nqp&arch=armhf&ver=2018.02%2Bdfsg-1&stamp=1521332696&raw=0

All the best

@dod38fr
Author

dod38fr commented Mar 19, 2018

Note: build is fine on armel arch which is an arch quite close to armhf

@AlexDaniel
Member

Has it always been like that? Can it be bisected?

@dod38fr
Author

dod38fr commented Mar 19, 2018

Unfortunately, I cannot say if it ever worked: nqp has not been built on armhf since it was switched from parrot to moarvm back in 2015

@dod38fr
Author

dod38fr commented Mar 19, 2018

I've tried in a buster chroot on armhf: the test also fails with nqp and moar 2017.06:

Core was generated by `/usr/bin/moar --libpath=/usr/share/nqp/lib /usr/share/nqp/lib/nqp.moarvm t/moar'.
Program terminated with signal SIGBUS, Bus error.
#0  0xb6c6dc42 in set_num (tc=0x1e4e5b8, st=0x1e796c8, root=<optimized out>, data=0xb6710ac1, value=34) at src/6model/reprs/P6num.c:52
52      src/6model/reprs/P6num.c: No such file or directory.
(gdb) bt
#0  0xb6c6dc42 in set_num (tc=0x1e4e5b8, st=0x1e796c8, root=<optimized out>, data=0xb6710ac1, value=34) at src/6model/reprs/P6num.c:52
#1  0xb6c4de04 in MVM_box_num (tc=0x1e4e5b8, value=34, type=<optimized out>, dst=0x1f25068) at src/core/coerce.c:427
#2  0xb6c3f282 in MVM_interp_run (tc=tc@entry=0x1e4e5b8, initial_invoke=0x1f25080, invoke_data=0x40) at src/core/interp.c:2085
#3  0xb6cabfd6 in MVM_vm_run_file (instance=0x1e4e150, filename=0xbe80e7d6 "/usr/share/nqp/lib/nqp.moarvm") at src/moar.c:318
#4  0x0042cd98 in main (argc=4, argv=0xbe80e674) at src/main.c:246

@dod38fr
Author

dod38fr commented Mar 19, 2018

Same problem with nqp and moar 2016.12:

Program terminated with signal SIGBUS, Bus error.
#0  0xb6d7dc22 in set_num (tc=0x199e458, st=0x19bfc80, root=<optimized out>, data=0xb67ddba9, value=34) at src/6model/reprs/P6num.c:52
52      src/6model/reprs/P6num.c: No such file or directory.
(gdb) bt
#0  0xb6d7dc22 in set_num (tc=0x199e458, st=0x19bfc80, root=<optimized out>, data=0xb67ddba9, value=34) at src/6model/reprs/P6num.c:52
#1  0xb6d5d690 in MVM_box_num (tc=0x199e458, value=34, type=<optimized out>, dst=0x1aa6460) at src/core/coerce.c:438
#2  0xb6d4c4bc in MVM_interp_run (tc=tc@entry=0x199e458, initial_invoke=0x1aa6478, invoke_data=0x40) at src/core/interp.c:2042
#3  0xb6db973c in MVM_vm_run_file (instance=0x199e008, filename=0xbed39801 "/usr/share/nqp/lib/nqp.moarvm") at src/moar.c:309
#4  0x00010c70 in main (argc=4, argv=0xbed396a4) at src/main.c:192
(gdb) quit

I can't go back further.

@AlexDaniel
Member

It's OK. I was just wondering if it's a very recent regression or not. Thank you for your help!

@dod38fr
Author

dod38fr commented Mar 22, 2018

Unfortunately, this issue blocks the transition of nqp from Debian unstable to Debian testing.

And yesterday, rakudo and nqp were removed from Debian/testing because of the build issues on mips and armhf :-(

@robertlemmen
Contributor

could this be related? MoarVM/MoarVM#762

@ugexe
Contributor

ugexe commented Mar 23, 2018

A guess would be https://github.com/MoarVM/MoarVM/blob/f2937f594f86060f23983f77bc91ae71715281e5/build/probe.pm#L235 is incorrect in how it configures unaligned access

@dod38fr
Author

dod38fr commented Mar 23, 2018

Build of moarvm on armhf shows:

    probing whether your compiler thinks that it is gcc  YES
    probing how your compiler does static inline ....... static __inline__
    your CPU can read unaligned values for only int32
    probing the size of pointers ....................... 4
JIT isn't supported on platforms with 4 byte pointers.

@robertlemmen
Contributor

some playing with this (on armhf machine): I got nqp ga62cef7 and moar gd322741. I can reduce the problematic test case to this:

class E {
    has int8 $!var;  #  <-- switch to int and it no longer blows up
}

my $e := E.new;
say("mark");
ok(1 == 1, "huh?"); #  <-- this is where it blows up

which results in:

Program received signal SIGBUS, Bus error.
0xb6be8ae6 in set_num () from //home/robertle/MoarVM/../target/lib/libmoar.so
(gdb) bt
#0  0xb6be8ae6 in set_num () from //home/robertle/MoarVM/../target/lib/libmoar.so
#1  0xb6bbf31c in MVM_box_num () from //home/robertle/MoarVM/../target/lib/libmoar.so
#2  0x0018ef10 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

I also fudged build/probe.pm to think it's an "other" arch:

 your CPU can't read unaligned values for any of int32 int64 num64

but this does not appear to affect the behavior

@robertlemmen
Contributor

with debug symbols:

Program received signal SIGBUS, Bus error.
0xb6be8faa in set_num (tc=0x22620, st=0x55218, root=<optimized out>, data=0xb65cebad, value=1)
    at src/6model/reprs/P6num.c:52
52              default: ((MVMP6numBody *)data)->value.n64 = value; break;
(gdb) bt
#0  0xb6be8faa in set_num (tc=0x22620, st=0x55218, root=<optimized out>, data=0xb65cebad, value=1)
    at src/6model/reprs/P6num.c:52
#1  0xb6bbf5bc in MVM_box_num (tc=0x22620, value=1, type=<optimized out>, dst=0xc7280)
    at src/core/coerce.c:416
#2  0xb6ba688e in MVM_interp_run (tc=tc@entry=0x22620, initial_invoke=0xb6fb98dc, invoke_data=0x40)
    at src/core/interp.c:2148
#3  0xb6c92884 in MVM_vm_run_file (instance=<optimized out>, filename=<optimized out>) at src/moar.c:407
#4  0x00010ce8 in main (argc=3, argv=0xbefff664) at src/main.c:299

If I understand it correctly, what happens is that the bump allocator in the nursery just hands out blocks of memory without any alignment. Doubles do need to be aligned on armhf, but in this case an earlier allocation was for an odd number of bytes, so the block holding the double in question ends up non-aligned...
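To make that failure mode concrete, here is a minimal sketch of a bump allocator that never rounds sizes. It is not MoarVM code: `ToyNursery`, `toy_bump_alloc` and `toy_is_aligned8` are made-up names for illustration, and the 64-byte buffer is an arbitrary choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a nursery: an 8-byte aligned buffer and a
 * cursor that only ever moves forward. */
typedef struct {
    _Alignas(8) unsigned char space[64];
    size_t used;
} ToyNursery;

/* Hand out the next `size` bytes with no rounding at all. */
static void *toy_bump_alloc(ToyNursery *n, size_t size) {
    assert(size <= sizeof n->space - n->used);
    void *p = n->space + n->used;
    n->used += size;
    return p;
}

/* Non-zero when `p` sits on an 8-byte boundary. */
static int toy_is_aligned8(const void *p) {
    return ((uintptr_t)p & 7u) == 0;
}
```

Allocating 13 bytes and then 8 bytes from a fresh `ToyNursery` puts the second block at offset 13, so `toy_is_aligned8` returns 0 for it. That matches the backtraces in this thread, where every `data` pointer passed to `set_num` is an odd address (0xb5081185, 0xb6710ac1, 0xb67ddba9, 0xb65cebad).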

@robertlemmen
Contributor

robertlemmen commented Mar 25, 2018

this fixes the problem for me on armhf, obviously wasting memory:

--- a/src/gc/allocation.c
+++ b/src/gc/allocation.c
@@ -19,6 +19,10 @@ void * MVM_gc_allocate_nursery(MVMThreadContext *tc, size_t size) {
 
     /* Guard against 0-byte allocation. */
     if (size > 0) {
+#if !defined(MVM_CAN_UNALIGNED_INT64) || !defined(MVM_CAN_UNALIGNED_NUM64)
+        /* Round up size to next multiple of 8, to ensure alignment. */
+        size = (size + 7) & ~7;
+#endif
         /* Do a GC run if this allocation won't fit in what we have
          * left in the nursery. Note this is a loop to handle a
          * pathological case: all the objects in the nursery are very

the change is without effect on at least x86_64, due to the auto-detected CAN_UNALIGNED macros
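For anyone unfamiliar with the `(size + 7) & ~7` idiom in the patch, this is a standalone sketch of the same arithmetic. The helper name `round_up_8` is hypothetical; the explicit `size_t` cast on the mask only makes its width obvious and does not change the result compared to the plain `~7` used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round `size` up to the next multiple of 8. Adding 7 pushes any value that
 * is not already a multiple of 8 past the next boundary; clearing the low
 * three bits then snaps it back down onto that boundary. */
static size_t round_up_8(size_t size) {
    assert(size <= SIZE_MAX - 7);   /* guard against wrap-around */
    return (size + 7) & ~(size_t)7;
}
```

So a 13-byte request becomes 16, a 9-byte request becomes 16, and sizes that are already multiples of 8 (0, 8, 24, ...) are left unchanged. The cost is at most 7 bytes of padding per allocation.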

@dod38fr
Author

dod38fr commented Mar 26, 2018

Unfortunately, this does not fix the bus error occurring with:

/home/dod/moarvm-2018.02+dfsg/moar nqp.moarvm t/moar/02-qast-references.t

@robertlemmen
Contributor

even worse than that: the modified moar now fails to build NQP, it just takes forever. I can't understand why; if anyone knows this better than I do, I would be very interested to hear it. perhaps the analysis is still useful...

an alternative route would be to make the path through to the allocation able to pass on alignment requirements, and then adhere to them in the allocator. but that seems quite structural, and given the weird effects that I get from the change I made I clearly do not understand some of the interactions...

@robertlemmen
Contributor

dude I am learning a lot about GC here! updated patch:

diff --git a/src/gc/allocation.c b/src/gc/allocation.c
index 31b5807..7c1e0a1 100644
--- a/src/gc/allocation.c
+++ b/src/gc/allocation.c
@@ -19,6 +19,10 @@ void * MVM_gc_allocate_nursery(MVMThreadContext *tc, size_t size) {
 
     /* Guard against 0-byte allocation. */
     if (size > 0) {
+#if !defined(MVM_CAN_UNALIGNED_INT64) || !defined(MVM_CAN_UNALIGNED_NUM64)
+        /* Round up size to next multiple of 8, to ensure alignment. */
+        size = (size + 7) & ~7;
+#endif
         /* Do a GC run if this allocation won't fit in what we have
          * left in the nursery. Note this is a loop to handle a
          * pathological case: all the objects in the nursery are very
diff --git a/src/gc/collect.c b/src/gc/collect.c
index 8227b11..41989ab 100644
--- a/src/gc/collect.c
+++ b/src/gc/collect.c
@@ -310,7 +310,12 @@ static void process_worklist(MVMThreadContext *tc, MVMGCWorklist *worklist, Work
                 /* No, so it will live in the nursery for another GC
                  * iteration. Allocate space in the nursery. */
                 new_addr = (MVMCollectable *)tc->nursery_alloc;
+#if !defined(MVM_CAN_UNALIGNED_INT64) || !defined(MVM_CAN_UNALIGNED_NUM64)
+                /* Round up size to next multiple of 8, see MVM_gc_allocate_nursery */
+                tc->nursery_alloc = (char *)tc->nursery_alloc + ((item->size + 7) & ~7);
+#else
                 tc->nursery_alloc = (char *)tc->nursery_alloc + item->size;
+#endif
                 GCDEBUG_LOG(tc, MVM_GC_DEBUG_COLLECT, "Thread %d run %d : copying an object %p (reprid %d) of size %d to tospace %p\n",
                     item, REPR(item)->ID, item->size, new_addr);
 
@@ -615,7 +620,12 @@ void MVM_gc_collect_free_nursery_uncopied(MVMThreadContext *tc, void *limit) {
         }
 
         /* Go to the next item. */
+#if !defined(MVM_CAN_UNALIGNED_INT64) || !defined(MVM_CAN_UNALIGNED_NUM64)
+        /* Round up size to next multiple of 8, see MVM_gc_allocate_nursery */
+        scan = (char *)scan + ((item->size + 7) & ~7);
+#else
         scan = (char *)scan + item->size;
+#endif
     }
 }

This seems to solve the problem we had previously, and should not have any effect on x86_64. Whether it is the right thing to do, I don't know. A wild idea for an alternative would be to make MVMP6numBody bigger and change get_num and set_num to read within this space on an aligned boundary...
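For comparison, a different technique from the one floated above, and not something proposed in this thread: going through `memcpy` removes the alignment requirement on the access itself, because `memcpy` accepts any byte address. The sketch below is generic C, not a patch against `P6num.c`; `store_num_unaligned` and `load_num_unaligned` are hypothetical names.

```c
#include <assert.h>
#include <string.h>

/* Store a double at `data`, which may sit at any byte offset. */
static void store_num_unaligned(void *data, double value) {
    assert(data != NULL);
    memcpy(data, &value, sizeof value);
}

/* Read a double back from `data`, again without assuming alignment. */
static double load_num_unaligned(const void *data) {
    double value;
    assert(data != NULL);
    memcpy(&value, data, sizeof value);
    return value;
}
```

Writing 34.0 to `buf + 1` of a plain byte buffer and reading it back this way works regardless of where `buf` starts. The trade-off is that every access site has to be converted, whereas aligning in the allocator fixes all of them in one place.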

@robertlemmen
Contributor

and more generally: shouldn't allocations in the nursery adhere to MVMStorageSpec.align?
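A rough sketch of what "the allocator honours a per-allocation alignment" could look like, assuming the alignment is a power of two and is passed in by the caller. Nothing here is MoarVM API: `ToyArena` and `toy_alloc_aligned` are invented for illustration, and how the value would be derived from a storage spec is left open.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bump arena over a caller-supplied region. */
typedef struct {
    unsigned char *base;   /* start of the region */
    size_t capacity;       /* bytes in the region */
    size_t used;           /* bytes handed out so far, padding included */
} ToyArena;

/* Return `size` bytes whose address is a multiple of `align` (which must be
 * a power of two), or NULL when the region cannot fit the request. */
static void *toy_alloc_aligned(ToyArena *a, size_t size, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    uintptr_t cur = (uintptr_t)a->base + a->used;
    uintptr_t aligned = (cur + (align - 1)) & ~(uintptr_t)(align - 1);
    size_t padding = (size_t)(aligned - cur);
    size_t left = a->capacity - a->used;
    if (padding > left || size > left - padding)
        return NULL;
    a->used += padding + size;
    return (void *)aligned;
}
```

With this shape, a 13-byte allocation with alignment 1 followed by an 8-byte allocation with alignment 8 gets padding inserted before the second block. The catch is that the gap between two objects is no longer a function of the first object's size alone, so anything that walks the region linearly needs another way to find the next object.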

@niner
Contributor

niner commented Apr 19, 2018

Can you please test this patch? It does the same but the changes to the program text are a bit less intrusive and it's a bit more generic. The smallest possible allocation is 8 bytes + a pointer size, so I guess we'd not lose all that much there. And aligned pointers ought to be good for performance, so we may even want to do that in general.

diff --git a/src/gc/allocation.h b/src/gc/allocation.h
index c2092007d..c65e3f797 100644
--- a/src/gc/allocation.h
+++ b/src/gc/allocation.h
@@ -1,3 +1,8 @@
+#if !defined(MVM_CAN_UNALIGNED_INT64) || !defined(MVM_CAN_UNALIGNED_NUM64)
+#define MVM_ALIGN_SIZE(size) MVM_ALIGN_SECTION(size)
+#else
+#define MVM_ALIGN_SIZE(size) (size)
+#endif
 void * MVM_gc_allocate_nursery(MVMThreadContext *tc, size_t size);
 void * MVM_gc_allocate_zeroed(MVMThreadContext *tc, size_t size);
 MVMSTable * MVM_gc_allocate_stable(MVMThreadContext *tc, const MVMREPROps *repr, MVMObject *how);
@@ -10,5 +15,5 @@ void MVM_gc_allocate_gen2_default_clear(MVMThreadContext *tc);
 MVM_STATIC_INLINE void * MVM_gc_allocate(MVMThreadContext *tc, size_t size) {
     return tc->allocate_in_gen2
         ? MVM_gc_gen2_allocate_zeroed(tc->gen2, size)
-        : MVM_gc_allocate_nursery(tc, size);
+        : MVM_gc_allocate_nursery(tc, MVM_ALIGN_SIZE(size));
 }
diff --git a/src/gc/collect.c b/src/gc/collect.c
index cccba59bb..38f4ecec3 100644
--- a/src/gc/collect.c
+++ b/src/gc/collect.c
@@ -310,7 +310,7 @@ static void process_worklist(MVMThreadContext *tc, MVMGCWorklist *worklist, Work
                 /* No, so it will live in the nursery for another GC
                  * iteration. Allocate space in the nursery. */
                 new_addr = (MVMCollectable *)tc->nursery_alloc;
-                tc->nursery_alloc = (char *)tc->nursery_alloc + item->size;
+                tc->nursery_alloc = (char *)tc->nursery_alloc + MVM_ALIGN_SIZE(item->size);
                 GCDEBUG_LOG(tc, MVM_GC_DEBUG_COLLECT, "Thread %d run %d : copying an object %p (reprid %d) of size %d to tospace %p\n",
                     item, REPR(item)->ID, item->size, new_addr);
 
@@ -615,7 +615,7 @@ void MVM_gc_collect_free_nursery_uncopied(MVMThreadContext *tc, void *limit) {
         }
 
         /* Go to the next item. */
-        scan = (char *)scan + item->size;
+        scan = (char *)scan + MVM_ALIGN_SIZE(item->size);
     }
 }
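Both versions of the patch touch the collector as well as the allocator. A plausible reading, sketched below with invented names, is that any code stepping through the nursery item by item has to advance by the same rounded size the allocator used. `ToySpace`, `toy_alloc_item`, `toy_count_items` and `TOY_ALIGN_SIZE` are hypothetical, and the size header is a bare `size_t` rather than a real object header.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counterpart of an align-size macro: round up to 8. */
#define TOY_ALIGN_SIZE(size) (((size) + 7) & ~(size_t)7)

typedef struct {
    unsigned char space[256];
    size_t used;
} ToySpace;

/* Allocate an item whose first bytes record its own unrounded size, and
 * advance the cursor by the rounded size. Returns the item's offset. */
static size_t toy_alloc_item(ToySpace *s, size_t size) {
    size_t offset = s->used;
    assert(size >= sizeof size);
    assert(TOY_ALIGN_SIZE(size) <= sizeof s->space - s->used);
    memcpy(s->space + offset, &size, sizeof size);
    s->used += TOY_ALIGN_SIZE(size);
    return offset;
}

/* Walk the space from the start, stepping by the same rounded size the
 * allocator used, and count the items reached. */
static size_t toy_count_items(const ToySpace *s) {
    size_t count = 0, offset = 0;
    while (offset < s->used) {
        size_t size;
        memcpy(&size, s->space + offset, sizeof size);
        offset += TOY_ALIGN_SIZE(size);
        count++;
    }
    return count;
}
```

After allocating items of 13, 24 and 9 bytes the cursor sits at 56 (16 + 24 + 16) and the walk finds exactly three items. If the walk stepped by the unrounded size instead, it would land in the padding after the first item and read garbage as a header. The thread does not confirm it, but a mismatch of that kind would be consistent with the first allocator-only patch misbehaving.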

@AlexDaniel
Member

ping @dod38fr

@robertlemmen
Contributor

I tried that patch instead of the earlier one on an armhf box, and it does work and pass the rakudo test suite a couple of times, as well as the golfed test above. I have been unable to reproduce the problem without the patch however...

@dod38fr
Author

dod38fr commented Jun 20, 2018

Builds on Debian are fine now on all architectures.
I'm closing this bug.

Thanks for the help
