Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

s390x segfault/abort on 5.1 and trunk when running Effect test #12486

Closed
jmid opened this issue Aug 17, 2023 · 11 comments
Closed

s390x segfault/abort on 5.1 and trunk when running Effect test #12486

jmid opened this issue Aug 17, 2023 · 11 comments

Comments

@jmid
Copy link
Contributor

jmid commented Aug 17, 2023

Today we observed another segfault while running multicoretests.
This one is triggered on s390x using OCaml from the 5.1+trunk and 5.2+trunk opam packages (these follow the 5.1 and trunk branches AFAIU). The test in question does not involve parallelism, but is triggered while stress testing Effects.

A reproducible branch is available here: https://github.com/ocaml-multicore/multicoretests/commits/s390x-crash-repro
The test crashes 10/10 on both 5.1 and trunk so the issue seems deterministic.
The file consists of 4 positive tests and 4 negative ones and the crash consistently occurs in the 2nd negative test:

### OCaml runtime: debug mode ###
random seed: 260017153
generated error  fail  pass / total     time test name

[ ]     0     0     0     0 / 20000     0.0s Lin DSL ref int test with Effect
[ ]     0     0     0     0 / 20000     0.0s Lin DSL ref int test with Effect (generating)
[✓] 20000     0     0 20000 / 20000     0.5s Lin DSL ref int test with Effect

[ ]     0     0     0     0 / 20000     0.0s Lin DSL ref int64 test with Effect
[✓] 20000     0     0 20000 / 20000     0.5s Lin DSL ref int64 test with Effect

[ ]     0     0     0     0 / 20000     0.0s Lin DSL CList int test with Effect
[✓] 20000     0     0 20000 / 20000     0.6s Lin DSL CList int test with Effect

[ ]     0     0     0     0 / 20000     0.0s Lin DSL CList int64 test with Effect
[✓] 20000     0     0 20000 / 20000     0.7s Lin DSL CList int64 test with Effect

[ ]     0     0     0     0 / 20000     0.0s negative Lin DSL ref int test with Effect
[✓]     1     0     1     0 / 20000     0.0s negative Lin DSL ref int test with Effect

[ ]     0     0     0     0 / 20000     0.0s negative Lin DSL ref int64 test with Effect[00] file runtime/fiber.c; line 250 ### Assertion failed: d
File "src/neg_tests/dune", line 17, characters 7-27:
17 |  (name lin_tests_dsl_effect)
            ^^^^^^^^^^^^^^^^^^^^
(cd _build/default/src/neg_tests && ./lin_tests_dsl_effect.exe --verbose)
Command got signal ABRT.

The negative Lin DSL ref int64 test with Effect is just expected to raise an Unhandled exception and report it through the test runner, like the previous negative Lin DSL ref int test with Effect.

Above I've run the test under the debug runtime, where it consistently aborts while failing this assertion:

CAMLassert(d);

This could indicate an issue related to s390x frame descriptors.

If I comment out the first 4 tests the crash no longer happens, so I suspect these need to run to bring the heap to a particular shape. Commenting out the last two tests had the same effect.

To recreate the issue:

  • clone and checkout the above repo branch
  • opam install dune qcheck-core
  • dune build @ci -j1 --no-buffer
@dustanddreams
Copy link
Contributor

The (rare) failure occurs in caml_perform, at the end, when caml_raise_unhandled_effect gets invoked.

Why the stack gets corrupted from there is not clear yet.

@xavierleroy
Copy link
Contributor

I was able to reproduce the crash on the system-Z machine we use for the core OCaml CI, but it occurs quite rarely. I didn't get it to crash with dune build @ci -j1 --no-buffer. Running lin_tests_dsl_effect.exe directly or under gdb, I see the crash about once every 10 runs. Is there nondeterminism in this test? Or does this come from address space randomization in the OS? At any rate, I will not be able to debug this issue from this repro case, it's above what I can do. Someone will have to either debug from this repro or provide a simpler, deterministic repro.

@jmid
Copy link
Contributor Author

jmid commented Aug 30, 2023

Is there nondeterminism in this test?

No, for a fixed seed this should be completely deterministic.
In ocaml-multicore/multicoretests@d137914 I fixed the seed to one that would crash the test on both 5.1 and trunk consistently through the CI.

Without that commit, the test can be invoked as lin_tests_dsl_effect.exe -v (self-initializing) and lin_tests_dsl_effect.exe -v -s 123456 (passed seed) instead.
I usually run the former until finding a seed that consistently causes a crash and then pass that one for determinism.
The above commit mimicked that routine for a CI-accessed system.

I know @dustanddreams have been looking into this crash.
Have you found a seed that crashes it consistently with local machine access Miod?

@dustanddreams
Copy link
Contributor

I have the same difficulty as @xavierleroy to get the test to crash, it can take sometimes a dozen of runs to trigger the assertion.

My current analysis notes follow.

We hit an assertion in fiber.c!scan_stack_frames.

Relevant part of the function are:

  char * sp;
  uintnat retaddr;
  frame_descr * d;

  sp = (char*)stack->sp;

next_chunk:
  if (sp == (char*)Stack_high(stack)) return;
  sp = First_frame(sp);
  retaddr = Saved_return_address(sp);

  while(1) {
    d = caml_find_frame_descr(fds, retaddr);
    CAMLassert(d);

On entry, we have

(gdb) p stack
$6 = (struct stack_info *) 0x2aa0035f410
(gdb) p *stack
$7 = {sp = 0x2aa00360050, exception_ptr = 0x2aa00360068,
  handler = 0x2aa003604c0, cache_bucket = -1, size = 4774451407313060418,
  magic = 42, id = 0}
(gdb) x/20lx (stack->sp - 64)
0x2aa00360010:  0x000000000000005f      0x000002aa00193a1a
0x2aa00360020:  0x0000000000000439      0x000003ffedc02608
0x2aa00360030:  0x000002aa002e3d90      0x000002aa0019e3bc
0x2aa00360040:  0x000003ffedc00c28      0x000002aa000dd676
0x2aa00360050:  0x000003ffedbff090      0x1ab0dd35db74ef95
0x2aa00360060:  0x000002aa000e8d88      0x000002aa00360158
0x2aa00360070:  0x000002aa000e8dbe      0x000003ffedc67630
0x2aa00360080:  0x000003ffedbff090      0x000002aa000e91a2
0x2aa00360090:  0x000003ffedc674d0      0x000002aa000dfbec
0x2aa003600a0:  0x000002aa000e2364      0x000003ffedc67498

On s390x, we have

#define Saved_return_address(sp) *((intnat *)((sp) - 8))
#define First_frame(sp) ((sp) + 8)

so local variable sp is initialized with 0x2aa00360050, then with
First_frame(sp) which will be 0x2aa00360050+8 == 0x2aa00360058.

Then retaddr will be *(0x2aa00360058 - 8) = 0x000003ffedbff090. However
addresses in 0x3ff.####.#### are (C) stack addresses and definitely not code
addresses (those would be 0x2aa.####.####).

The usable backtrace in gdb is:

#4  0x000002aa00217172 in caml_failed_assert (expr=0x2aa0023ab10 "d",
    file_os=0x2aa0023aa34 "runtime/fiber.c", line=250) at runtime/misc.c:57
#5  0x000002aa001f06b8 in scan_stack_frames (f=0x2aa002143f0 <oldify_one>,
    fflags=SCANNING_ONLY_YOUNG_VALUES, fdata=0x3ffffffe7a0,
    stack=0x2aa0035f410, gc_regs=0x2aa0032df30) at runtime/fiber.c:250
#6  0x000002aa001f086c in caml_scan_stack (f=0x2aa002143f0 <oldify_one>,
    fflags=SCANNING_ONLY_YOUNG_VALUES, fdata=0x3ffffffe7a0,
    stack=0x2aa0035f410, gc_regs=0x2aa0032df30) at runtime/fiber.c:281
#7  0x000002aa00219e76 in caml_do_local_roots (f=0x2aa002143f0 <oldify_one>,
    fflags=SCANNING_ONLY_YOUNG_VALUES, fdata=0x3ffffffe7a0,
    local_roots=0x3ffffffefe8, current_stack=0x2aa0035f410,
    v_gc_regs=0x2aa0032df30) at runtime/roots.c:69
#8  0x000002aa00215fb6 in caml_empty_minor_heap_promote (domain=0x2aa0031af20,
    participating_count=1, participating=0x2aa0030c2d8 <stw_request+64>)
    at runtime/minor_gc.c:624
#9  0x000002aa002164a8 in caml_stw_empty_minor_heap_no_major_slice (
    domain=0x2aa0031af20, unused=0x0, participating_count=1,
    participating=0x2aa0030c2d8 <stw_request+64>) at runtime/minor_gc.c:704
#10 0x000002aa0021669c in caml_stw_empty_minor_heap (domain=0x2aa0031af20,
    unused=0x0, participating_count=1,
    participating=0x2aa0030c2d8 <stw_request+64>) at runtime/minor_gc.c:747
#11 0x000002aa001e9e00 in caml_try_run_on_all_domains_with_spin_work (sync=1,
    handler=0x2aa00216650 <caml_stw_empty_minor_heap>, data=0x0,
    leader_setup=0x2aa002162d8 <caml_empty_minor_heap_setup>,
    enter_spin_callback=0x2aa00216200 <caml_do_opportunistic_major_slice>,
    enter_spin_data=0x0) at runtime/domain.c:1549
#12 0x000002aa002167b0 in caml_try_stw_empty_minor_heap_on_all_domains ()
    at runtime/minor_gc.c:777
#13 0x000002aa00216852 in caml_empty_minor_heaps_once ()
    at runtime/minor_gc.c:798
#14 0x000002aa001ea5ea in caml_poll_gc_work () at runtime/domain.c:1694
#15 0x000002aa001ea72e in caml_handle_gc_interrupt () at runtime/domain.c:1730
#16 0x000002aa00216942 in caml_alloc_small_dispatch (dom_st=0x2aa0031af20,
    wosize=2, flags=1, nallocs=1, encoded_alloc_lens=0x0)
    at runtime/minor_gc.c:822
#17 0x000002aa001d64ae in caml_alloc_small (wosize=2, tag=0)
    at runtime/alloc.c:155
#18 0x000002aa001f1dc2 in caml_make_unhandled_effect_exn (effect=4398012037832)
    at runtime/fiber.c:698
#19 0x000002aa001f1e6a in caml_raise_unhandled_effect (effect=4398012037832)
    at runtime/fiber.c:706
#20 <signal handler called>

Frame #20 is not a signal handler but the lack of a correctly constructed
frame causes gdb's heuristics to believe it is a signal delivery trampoline.
The few saved registers available which matter are:

(gdb) fr 20
#20 <signal handler called>
gdb) info reg r8 r9 r15 pc
r8             0x2aa0035f410       2929171231760
r9             0x2aa0035f410       2929171231760
r15            0x3fffffff0d8       4398046507224
pc             0x2aa0022f2ac       0x2aa0022f2ac <caml_c_call+60>

They show that we come from caml_c_call, and the only place leading to
caml_c_call into caml_raise_unhandled_effect (frame 19) is caml_perform.

At this point, r15 (the stack pointer) points to a Caml stack. The previous C
stack pointer has been saved at offset Cstack_sp from it:

* struct c_stack_link */
#define Cstack_stack           160
#define Cstack_sp              168
#define Cstack_prev            176
#define SIZEOF_C_STACK_LINK    184

(gdb) x/lx $r15+168
0x3fffffff180:  0x000002aa00360050

And in that C stack, the return address, which we need to resume the backtrace,
has been saved at offset zero.

(gdb) x/lx 0x000002aa00360050
0x2aa00360050:  0x000003ffedbff090

This address is wrong, as it doesn't point to any code segment but to the C
stack.

This hints that the memory corruption has occurred between the call to
caml_c_call and the time the assertion fires.

If we put a breakpoint in caml_c_call immediately on return of the call, and
rerun the program, upon first breakpoint hit we see something like:

Breakpoint 7, <signal handler called>
(gdb) p $pc
$13 = (void (*)()) 0x2aa0022f2ac <caml_c_call+60>
(gdb) x/lx $r15+168
0x3fffffff180:  0x000002aa0032bb58
(gdb) x/lx 0x000002aa0032bb58
0x2aa0032bb58:  0x000002aa0014394e

where address 0x000002aa0014394e is a valid return point:

(gdb) x/i 0x000002aa0014394e
   0x2aa0014394e <camlStdlib.entry+686>:        lgrl    %r3,0x2aa0027c768

Indeed:

(gdb) x/2i 0x000002aa0014394e-6
   0x2aa00143948 <camlStdlib.entry+680>:
    brasl       %r14,0x2aa0022f270 <caml_c_call>
   0x2aa0014394e <camlStdlib.entry+686>:        lgrl    %r3,0x2aa0027c768

So we need to figure out what causes the return address to be overridden with
as incorrect value.

Instrumenting caml_scan_stack and scan_stack_frames to check for a change
in *(current_stack->sp) shows that no modification occurs.

This suggests that the corruption happens earlier.

Relying upon the fact that the code addresses will not change between executions
as the test binary is not PIE, we can add crude instrumentation to
caml_raise_unhandled_effect as this:

CAMLexport void caml_raise_unhandled_effect (value effect)
{
  struct stack_info *current_stack = Caml_state->current_stack;
  long ra = *(long *)(current_stack)->sp;
  if (ra != 1 && (ra < 0x2aa00000000L || ra >= 0x3ff00000000L))
    fprintf(stderr, "%s: bogus sp %lx\n", __func__, ra);
  else
    fprintf(stderr, "%s: good sp %lx\n", __func__, ra);
  caml_raise(caml_make_unhandled_effect_exn(effect));
}

Note the value of 1 is incorrect as a return address, but is accepted as good
here in order to be able to put a breakpoint on the "bogus sp" line and not
hit it with that value, due to the results of this check (see below).

When running the test program, we first get this:

[ ]     0     0     0     0 / 20000     0.0s negative Lin DSL ref int test with Effectcaml_raise_unhandled_effect: good sp 1
caml_raise_unhandled_effect: good sp 1
caml_raise_unhandled_effect: bogus sp fffffffffffffffb
caml_raise_unhandled_effect: bogus sp 2759ef863e86d591
caml_raise_unhandled_effect: good sp 1
caml_raise_unhandled_effect: good sp 1
caml_raise_unhandled_effect: bogus sp fffffffffffffffb
caml_raise_unhandled_effect: bogus sp 2759ef863e86d591
caml_raise_unhandled_effect: bogus sp 2759ef863e86d593
caml_raise_unhandled_effect: good sp 1
caml_raise_unhandled_effect: good sp 1
caml_raise_unhandled_effect: bogus sp fffffffffffffffb
caml_raise_unhandled_effect: bogus sp 2759ef863e86d591
caml_raise_unhandled_effect: bogus sp 2759ef863e86d595
caml_raise_unhandled_effect: good sp 1

which suggests corruption is happenning early, but somehow gets undone.
A bit later we see more unexpected values, but that particular test chunk
completes with a large sequence of

caml_raise_unhandled_effect: good sp 1

Then the next starts, and immediately reports incorrect values:

[ ]     0     0     0     0 / 20000     0.0s negative Lin DSL ref int64
test with Effectcaml_raise_unhandled_effect: bogus sp 3ff97b95e28
caml_raise_unhandled_effect: bogus sp 3ff97b95e28
caml_raise_unhandled_effect: bogus sp 3ff97b97110
caml_raise_unhandled_effect: bogus sp 3ff97b95e28
caml_raise_unhandled_effect: bogus sp 3ff97b97110
caml_raise_unhandled_effect: bogus sp 3ff97b95e28
[...]
caml_raise_unhandled_effect: bogus sp 3ff97a70218
caml_raise_unhandled_effect: bogus sp 3ff97a70218
caml_raise_unhandled_effect: bogus sp 3ff97a70218

Most of the times, despite these incorrect values, the test will succeed because
the garbage collector will not get invoked. It is only when it runs, that
the assertion gets a chance to fire. However, as exposed by these crude
printf, the memory corruption or incorrect initialization always occurs.

The next step is to move the checks deeper, in SWITCH_OCAML_STACKS,
caml_raise_exception and caml_runstack.

@dustanddreams
Copy link
Contributor

I have instrumented caml_c_call to call abort if the return address on the stack is nonsensical (where, for the time being, nonsensical means "the topmost 32 bits differ from 0x2aa).
Relevant part of the code are:

FUNCTION(G(caml_c_call))
CFI_STARTPROC
        CFI_SIGNAL_FRAME
        lay     %r15, -8(%r15)
        CFI_ADJUST(8)
        stg     %r14, 0(%r15)
        CFI_OFFSET(14, -168)
LBL(caml_c_call):
    /* Arguments:
        C arguments         : %r2, %r3, %r4, %r5, %r6
        C function          : ADDITIONAL_ARG */
    /* Switch from OCaml to C */
        SWITCH_OCAML_TO_C
        ...

and

#define SWITCH_OCAML_TO_C                                  \
    /* Fill in Caml_state->current_stack->sp */            \
        lg      TMP,  Caml_state(current_stack);           \
        stg     %r15, Stack_sp(TMP);                       \
    /* Fill in Caml_state->c_stack */                      \
        lg      TMP2, Caml_state(c_stack);                 \
        stg     TMP,  Cstack_stack(TMP2);                  \
        stg     %r15, Cstack_sp(TMP2);                     \
    /* Switch to C stack */                                \
        lgr     %r15, TMP2;                                \
        SWITCH_OCAML_TO_C_CFI

With some instrumentation, here is what the beginning of the function looks in gdb:

   0x2aa18f2f352 <caml_c_call>:         lay     %r15,-8(%r15)
   0x2aa18f2f358 <caml_c_call+6>:       stg     %r14,0(%r15)
   0x2aa18f2f35e <caml_c_call+12>:      lg      %r1,40(%r10)
   0x2aa18f2f364 <caml_c_call+18>:      stg     %r15,0(%r1)
   0x2aa18f2f36a <caml_c_call+24>:      lg      %r12,64(%r10)
   0x2aa18f2f370 <caml_c_call+30>:      stg     %r1,160(%r12)
   0x2aa18f2f376 <caml_c_call+36>:      stg     %r15,168(%r12)
   0x2aa18f2f37c <caml_c_call+42>:      lg      %r12,0(%r15)
   0x2aa18f2f382 <caml_c_call+48>:      srlg    %r12,%r12,32
   0x2aa18f2f388 <caml_c_call+54>:      lghi    %r1,682
   0x2aa18f2f38c <caml_c_call+58>:      cgrje       %r1,%r12,0x2aa18f2f3a6 <caml_c_call+84>
   0x2aa18f2f392 <caml_c_call+64>:      lg      %r12,0(%r15)
   0x2aa18f2f398 <caml_c_call+70>:      la      %r11,16
   0x2aa18f2f39c <caml_c_call+74>:      la      %r7,0(%r14)
   0x2aa18f2f3a0 <caml_c_call+78>:      jg      0x2aa18f2ef58 <diediedie>
   0x2aa18f2f3a6 <caml_c_call+84>:      lg      %r12,64(%r10)
   0x2aa18f2f3ac <caml_c_call+90>:      lgr     %r15,%r12
   0x2aa18f2f3b0 <caml_c_call+94>:      srlg    %r12,%r15,32
   0x2aa18f2f3b6 <caml_c_call+100>:     lghi    %r1,1023
   0x2aa18f2f3ba <caml_c_call+104>:     cgrje       %r1,%r12,0x2aa18f2f3ca <caml_c_call+120>
   0x2aa18f2f3c0 <caml_c_call+110>:     la      %r12,32
   0x2aa18f2f3c4 <caml_c_call+114>:     jg      0x2aa18f2ef58 <diediedie>
   0x2aa18f2f3ca <caml_c_call+120>:     lg      %r1,40(%r10)
   ...

Let's walk it.

   lay     %r15,-8(%r15)
   stg     %r14,0(%r15)

This is the beginning of caml_c_call. We make room for 8 bytes on stack, and save the return address (held in register r14 to it). At this point, r14 is obviously valid, and its value is 0x2aa18ddd402, which is camlDune__exe__Lin_tests_dsl_effect.code_begin+90, which corresponds to line 13 of src/neg_tests/lin_tests_dsl_effect.ml:

    let add r i = let old = !r in Lin_effect.yield (); set r (old+i)

Then we enter SWITCH_OCAML_TO_C:

   lg      %r1,40(%r10)
   stg     %r15,0(%r1)

This assigns Caml_state->current_stack to r1 and saves the current stack pointer (r15) into Caml_state->current_stack->sp.

   lg      %r12,64(%r10)
   stg     %r1,160(%r12)

This is the assignment Caml_state->c_stack = Caml_state->current_stack.

   stg     %r15,168(%r12)

This is the assignment Caml_state->c_stack->stack = Caml_state->current_stack->sp.

   lg      %r12,0(%r15)
   srlg    %r12,%r12,32

This reloads the return address saved to the stack at the beginning of caml_c_call into register r12, then shifts it to the right by 32 bits in order to only keep the upper part of the register. Since the return address is 0x2aa18ddd402, the value of r12 should be 0x2aa at this point.

   lghi    %r1,682
   cgrje       %r1,%r12,caml_c_call+84

We load 0x2aa (682 in decimal) into r1, and compare registers r1 and r12. If their values are equal, we skip the next four instructions, which are:

   lg      %r12,0(%r15)
   la      %r11,16
   la      %r7,0(%r14)
   jg      0x2aa18f2ef58 <diediedie>
caml_c_call+84:

They reload the return address into r12 again, put a marker value in r11, reload the return address also in r7, but from register r14 rather than from the stack, and unconditionally jump to diediedie, which simply calls abort.

The rest of the changes to SWITCH_OCAML_TO_C restore the previously clobbered r11 and r12 and continues as usual, and is not shown here.

What is interesting with this instrumentation, is that now:

  • the test fails (i.e. ends up in abort) every time.
  • from the core dump, we can see that, at abort invocation time, the value of r7 is correct, while the value of r12 is wrong (its value is 1).

This means that, between the time the return address (r14) has been saved on the stack, and the time it gets read back, something overwrites the stack.

Since this test is not multi-threaded (and gdb indeed reports only one thread), there are no function calls between the stack write and the stack read, and strace reports no signal delivery, the cause of the stack corruption is not clear. This could be a hardware cache issue where the wrong cache line contents are written back to memory.

Another simple experiment I just did was to change the beginning of caml_c_call as such:

FUNCTION(G(caml_c_call))
CFI_STARTPROC
        CFI_SIGNAL_FRAME
        lay     %r15, -48(%r15)
        CFI_ADJUST(8)
        stg     %r14, 0(%r15)
        stg     %r14, 8(%r15)
        stg     %r14, 16(%r15)
        stg     %r14, 24(%r15)
        stg     %r14, 32(%r15)
        stg     %r14, 40(%r15)
        la      %r15, 40(%r15)

In other words, the return address is saved 6 times on the stack rather than once. Of course since the stack pointer is then restored, these extra saves will get overridden, but by the time the check in SWITCH_OCAML_TO_C triggers, this part of the stack should have been left untouched, and with an extra adjustment of r15 before invoking abort, they will not get overwritten by abort either.

After placing a breakpoint on the jump to diediedie, when it gets hit, I see that r14 and r7 have the correct value of 0x2aa000dd402, but the saved value on the stack is wrong:

(gdb) x/2lx $r15
0x2aa00360148:  0x00000000      0x00000001

Moreover, if I look at the other copies of the return address on the stack, only one of them is correct:

(gdb) x/12lx $r15-40
0x2aa00360120:  0x000003ff      0xedca5488      0x000002aa      0x000e27ea
0x2aa00360130:  0x000002aa      0x000dd2f2      0x000003ff      0xedca5488
0x2aa00360140:  0x000002aa      0x000dd402      0x00000000      0x00000001

So there is definitely either other code being invoked on this stack by some unknown (to me) mechanism, or wrong/stale cache data being returned. In the latter case, there is obviously nothing we can do on the hump side to prevent this misbehaviour.

Changing slightly the experiment with this:

FUNCTION(G(caml_c_call))
CFI_STARTPROC
        CFI_SIGNAL_FRAME
        lay     %r15, -48(%r15)
        CFI_ADJUST(8)
        lghi TMP, 0x42
        stg     TMP, 0(%r15)
        stg     TMP, 8(%r15)
        stg     TMP, 16(%r15)
        stg     TMP, 24(%r15)
        stg     TMP, 32(%r15)
        stg     %r14, 40(%r15)
        la      %r15, 40(%r15)

would write five times 0x42 below the saved return address to the stack, but the stack shows up as:

(gdb) x/12lx $r15-40
0x2aa00360120:  0x000003ff      0xedca0140      0x000002aa      0x000e27ea
0x2aa00360130:  0x000002aa      0x000dd2f2      0x000003ff      0xedca0140
0x2aa00360140:  0x000002aa      0x000dd402      0x00000000      0x00000001

which shows a similar pattern as above, with none of the writes being visible. Again, this is either because other code has been invoked between these writes and SWITCH_OCAML_TO_C, or the behaviour of the particular cache line the stack currently uses is broken.

@gasche
Copy link
Member

gasche commented Aug 31, 2023

I'm not working on this bug, but I would like to interject to thank @dustanddreams for the very helpful description of his approach to debugging this kind of bug. As someone with no experience debugging low-level code, there is a lot I don't know about how to approach this, and I am sure that this dump of practitioner knowledge will prove useful to me in the future.

@dustanddreams
Copy link
Contributor

Tinkering further with caml_c_call, and checking the contents of the stack upon every useful instruction, I now have:

FUNCTION(G(caml_c_call))
CFI_STARTPROC
        CFI_SIGNAL_FRAME
        lay     %r15, -8(%r15)
        CFI_ADJUST(8)
        stg     %r14, 0(%r15)
  lg TMP, 0(%r15)
  cgrje TMP, %r14, 1f
  brasl %r14, diediedie
1:
        CFI_OFFSET(14, -168)
LBL(caml_c_call):
    /* Arguments:
        C arguments         : %r2, %r3, %r4, %r5, %r6
        C function          : ADDITIONAL_ARG */
    /* Switch from OCaml to C */
        SWITCH_OCAML_TO_C
        ....

and

#define SWITCH_OCAML_TO_C                                  \
    /* Fill in Caml_state->current_stack->sp */            \
        lg      TMP,  Caml_state(current_stack);           \
  lg TMP2, 0(%r15); \
  cgrje TMP2, %r14, 1f; \
  brasl %r14, diediedie; \
1: \
        stg     %r15, Stack_sp(TMP);                       \
  lg TMP2, 0(%r15); \
  cgrje TMP2, %r14, 1f; \
  brasl %r14, diediedie; \
1: \
    /* Fill in Caml_state->c_stack */                      \
        lg      TMP2, Caml_state(c_stack);                 \
        stg     TMP,  Cstack_stack(TMP2);                  \
  ...

Running the binary under gdb with a breakpoint on every call to diediedie, the first breakpoint hit is the first one in SWITCH_OCAML_TO_C.

Execution has thus been:

   0x2aa0022f34e <caml_c_call>:         lay     %r15,-8(%r15)
   0x2aa0022f354 <caml_c_call+6>:       stg     %r14,0(%r15)
   (first stack check omitted, for it was successful)
   0x2aa0022f36c <caml_c_call+30>:      lg      %r1,40(%r10)
   (second stack check triggers)

In other words, the first assembly statement of SWITCH_OCAML_TO_C, which is a load, has caused the stack to be corrupted.

If we look at the values, we have:

  • Caml_state(current_stack), in r10, is 0x2aa0031af20, so we perform a load at address 0x2aa0031af48, of value 0x2aa0035f410.
  • r15, our stack, is 0x2aa00360148.

Notice that the low 9 bits of r15 and the load address are the same. If this causes a cache memory misbehaviour, there could be similarities between memory contents on both addresses, but this is not the case:

(gdb) x/20lx 0x2aa0031af48
0x2aa0031af48:  0x000002aa      0x0035f410      0x000002aa      0x003604a0
0x2aa0031af58:  0x00000000      0x00000000      0x000003ff      0xfffff0d8
0x2aa0031af68:  0x000002aa      0x0032b920      0x000002aa      0x0032df20
0x2aa0031af78:  0x000002aa      0x0032df30      0x000002aa      0x0031b2f0
0x2aa0031af88:  0x000002aa      0x0031b810      0x00000000      0x00000000
(gdb) x/20lx 0x2aa00360148
0x2aa00360148:  0x00000000      0x00000001      0x28d4597b      0x56e94ef7
0x2aa00360158:  0x000003ff      0xedc9ea20      0x000002aa      0x000e8db8
0x2aa00360168:  0x000002aa      0x00360288      0x000002aa      0x000e8dee
0x2aa00360178:  0x000003ff      0xedca1998      0x000003ff      0xedc9ea20
0x2aa00360188:  0x000002aa      0x000e91d2      0x000003ff      0xedca1838

Interestingly, since all SWITCH_OCAML_TO_C uses immediately follow a save of the return address to the stack, adding another (redundant) save immediately after that first statement of SWITCH_OCAML_TO_C causes the stack checks to no longer trigger. However, after a few runs, the assertion in fiber.c is hit again.

@jmid
Copy link
Contributor Author

jmid commented Aug 31, 2023

As I've shared with @dustanddreams elsewhere, I've observed this same test crashing and aborting with the same assertion failure on both 5.1.0~rc2 and trunk on a regular Linux machine with the frame pointer option enabled:
https://github.com/ocaml-multicore/multicoretests/actions/runs/6029955947/job/16360710343#step:18:1257

At this point I don't understand whether this is the same bug surfacing - or whether these are two independent bugs with the same symptoms.

When run with a fixed seed like -s 34037619 this is still a deterministic OCaml test program.
However it crashes in every 5th run or so for me locally, similarly to the above s390x observations.
Others may have more luck with a different seed:

$ _build/default/src/neg_tests/lin_tests_dsl_effect.exe -v -s 34037619
### OCaml runtime: debug mode ###
random seed: 34037619
generated error  fail  pass / total     time test name
[✓] 20000     0     0 20000 / 20000     0.7s Lin DSL ref int test with Effect
[✓] 20000     0     0 20000 / 20000     0.8s Lin DSL ref int64 test with Effect
[✓] 20000     0     0 20000 / 20000     0.8s Lin DSL CList int test with Effect
[✓] 20000     0     0 20000 / 20000     1.0s Lin DSL CList int64 test with Effect
[✓]     2     0     1     1 / 20000     0.0s negative Lin DSL ref int test with Effect
[ ]     0     0     0     0 / 20000     0.0s negative Lin DSL ref int64 test with Effect[00] file runtime/fiber.c; line 250 ### Assertion failed: d
Aborted (core dumped)

To recreate locally:

git clone git@github.com:ocaml-multicore/multicoretests.git
cd multicoretests
opam switch create 5.1.0~rc2+fp --packages=ocaml-variants.5.1.0~rc2+options,ocaml-option-fp
opam install qcheck-core
dune build src/neg_tests/lin_tests_dsl_effect.exe
_build/default/src/neg_tests/lin_tests_dsl_effect.exe -v -s 34037619

@xavierleroy
Copy link
Contributor

Impressive debugging job! Just one remark, in case you were not aware:

this part of the stack should have been left untouched, and with an extra adjustment of r15 before invoking abort, they will not get overwritten by abort either.

Every C function, incl. abort(), is allowed to write in the 160 bytes ABOVE the stack pointer on entry to the function. So, make sure you decrement the stack pointer by 160 before jumping to abort.

Normally, the 160 bytes at the bottom of the C stack are reserved at all times, so that, after switching to the C stack (of course), we can call any C function. But maybe something is wrong here too.

dustanddreams added a commit to dustanddreams/ocaml that referenced this issue Sep 4, 2023
Delivery of Effect.Unhandled, Effect.Continuation_already_resumed and
Pervasives.array_bound_error exceptions (through caml_raise_unhandled_effect,
caml_raise_continuation_already_resumed and caml_array_bound_error_asm
respectively) should not assume anything about the stack they run on and thus
should not bypass the first few instructions of caml_c_call.

This only affects amd64 and s390x, all other platforms were already correct.
@dustanddreams
Copy link
Contributor

When things do not appear to make sense, this means that my understanding of how things behave is wrong.

The experiments above have been exposing a complete change in stack contents between two points in caml_c_call. What I forgot is that, between these two points, there is a public label, and some other routines jump into the middle of caml_c_call.

These jumps are easy to identify, since each of them sets ADDITIONAL_ARG (r7) to a specific code address, prior to jumping into caml_c_call.

The new instrumention is this:

FUNCTION(G(caml_c_call))
CFI_STARTPROC
        CFI_SIGNAL_FRAME
        lay     %r15, -8(%r15)
        CFI_ADJUST(8)
        stg     %r14, 0(%r15)
        CFI_OFFSET(14, -168)
LBL(caml_c_call):
    /* Arguments:
        C arguments         : %r2, %r3, %r4, %r5, %r6
        C function          : ADDITIONAL_ARG */
    /* Switch from OCaml to C */
#if 1
  lg TMP2, 0(%r15)
  srlg TMP2, TMP2, 32
  lghi TMP, 0x2aa
  cgrje TMP2, TMP, 1f
  nop
1:
#endif
        SWITCH_OCAML_TO_C

with a breakpoint put on the nop. Running the test again, this breakpoint is hit with r7 always pointing to caml_raise_unhandled_effect.

We can confirm that this is the only code path leading to this incorrect stack contents by using a conditional breakpoint:

(gdb) x/i $r7
   0x2aa001f1e48 <caml_raise_unhandled_effect>: stmg    %r11,%r15,88(%r15)
(gdb) info b                                                          [287/9145]
Num     Type           Disp Enb Address            What
2       breakpoint     keep y   0x000002aa0022f292 <caml_c_call+34>                     breakpoint already hit 87 times
(gdb) d 2
(gdb) b *0x000002aa0022f292 if $r7 != 0x2aa001f1e48
Breakpoint 3 at 0x2aa0022f292
(gdb) info b
Num     Type           Disp Enb Address            What
3       breakpoint     keep y   0x000002aa0022f292 <caml_c_call+34>
        stop only if $r7 != 0x2aa001f1e48
(gdb) c

in which case execution of the test completes without hitting the breakpoint, and further runs never trigger it.

Another test with a breakpoint put at the end of caml_perform when loading r7 shows that this breakpoint is hit as many times as the one in the stack check in caml_c_call, which means that - at least in this test program - all uses of caml_raise_unhandled_effect in caml_perform are done using an incorrect stack.

But is this really the problem here?

If we look at the use of caml_c_call from assembly code on all non-bytecode native:

  • arm64, power and riscv branch to the beginning of caml_c_call.
  • amd64 and s390x branch within caml_c_call after the return address has been saved on stack.

This hints that the stack does not need to be an existing stack on which a valid return address has been saved - it may be a brand new stack, and in any case it needs to be treated as such, by not assuming any of its contents.

Changing the s390x code to align with the other non-amd64 platform seems to make the problem disappear. Running the test in a loop for more than one hour does trigger the original CAMLassert in fiber.c anymore, while it used to trigger in no more than a few minutes without that change.

Whether amd64 also needs a similar change is left as an exercize to the reader.

dustanddreams added a commit to dustanddreams/ocaml that referenced this issue Sep 4, 2023
Delivery of Effect.Unhandled (through caml_raise_unhandled_effect)
should not assume anything about the stack it runs on and thus
should not bypass the first few instructions of caml_c_call.

This only affects amd64 and s390x, all other platforms were already correct.
dustanddreams added a commit to dustanddreams/ocaml that referenced this issue Sep 4, 2023
Delivery of Effect.Unhandled (through caml_raise_unhandled_effect)
should not assume anything about the stack it runs on and thus
should not bypass the first few instructions of caml_c_call.
dustanddreams added a commit to dustanddreams/ocaml that referenced this issue Sep 4, 2023
Delivery of Effect.Unhandled and Effect.Continuation_already_resumed
(through caml_raise_unhandled_effect and caml_raise_continuation_already_resumed
 respectively) should not assume anything about the stack it runs on and thus
should not bypass the first few instructions of caml_c_call.
dustanddreams added a commit to dustanddreams/ocaml that referenced this issue Sep 4, 2023
Delivery of Effect.Unhandled and Effect.Continuation_already_resumed
(through caml_raise_unhandled_effect and caml_raise_continuation_already_resumed
 respectively) should not assume anything about the stack it runs on and thus
should not bypass the first few instructions of caml_c_call.
dustanddreams added a commit to dustanddreams/ocaml that referenced this issue Sep 4, 2023
Delivery of Effect.Unhandled and Effect.Continuation_already_resumed
(through caml_raise_unhandled_effect and caml_raise_continuation_already_resumed
 respectively) should not assume anything about the stack it runs on and thus
should not bypass the first few instructions of caml_c_call.
dustanddreams added a commit to dustanddreams/ocaml that referenced this issue Sep 4, 2023
Delivery of Effect.Unhandled and Effect.Continuation_already_resumed
(through caml_raise_unhandled_effect and caml_raise_continuation_already_resumed
 respectively) should not assume anything about the stack it runs on and thus
should not bypass the first few instructions of caml_c_call.
gasche added a commit that referenced this issue Sep 4, 2023
Fix delivery of effect-related exceptions (#12486)
Octachron pushed a commit that referenced this issue Sep 6, 2023
Fix delivery of effect-related exceptions (#12486)

(cherry picked from commit ae28598)
eutro pushed a commit to eutro/ocaml that referenced this issue Sep 17, 2023
Delivery of Effect.Unhandled and Effect.Continuation_already_resumed
(through caml_raise_unhandled_effect and caml_raise_continuation_already_resumed
 respectively) should not assume anything about the stack it runs on and thus
should not bypass the first few instructions of caml_c_call.
gasche added a commit that referenced this issue Sep 20, 2023
Fix delivery of effect-related exceptions, take 2 (#12486)
@jmid
Copy link
Contributor Author

jmid commented Oct 4, 2023

The s390x and fp crashes mentioned here have been fixed in #12530 and #12535, so I'll close this.

@jmid jmid closed this as completed Oct 4, 2023
dra27 pushed a commit to dra27/ocaml that referenced this issue Dec 6, 2023
Fix delivery of effect-related exceptions, take 2 (ocaml#12486)

(cherry picked from commit 01f737a)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants