Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ppc64 backend segfault on trunk #12482

Closed
jmid opened this issue Aug 16, 2023 · 5 comments
Closed

ppc64 backend segfault on trunk #12482

jmid opened this issue Aug 16, 2023 · 5 comments

Comments

@jmid
Copy link
Contributor

jmid commented Aug 16, 2023

As part of multicoretests we are observing segfaults in code produced by the newly restored ppc64 backend.
The test triggering it is property-based test of arrays against a model.
Contrary to previous torture tests, this is a sequential, single-domain test causing a crash.
A branch is available here: https://github.com/ocaml-multicore/multicoretests/tree/ppc64-crash-repro

Here's an example output: https://ocaml-multicoretests.ci.dev:8100/job/2023-08-16/132015-ci-ocluster-build-8fbc23#L343

random seed: 257065471
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Array test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Array test sequential (generating)File "src/array/dune", line 4, characters 7-16:
4 |  (name stm_tests)
           ^^^^^^^^^
(cd _build/default/src/array && ./stm_tests.exe --verbose)
Command got signal SEGV.

To recreate:

  • clone the above repo branch
  • opam install dune qcheck-core
  • dune build @ci -j1 --no-buffer

I don't have direct access to a ppc-machine ATM, so the above has been produced via CI-golf... 🤓
I may get access to one next week.

Context

  • The test runs with a hard-coded seed and crashes consistently (deterministically) in the CI on the first generated input.
    Below I include the long counter example produced with this seed - a 283-element cmd list!
    (The output was produced by adding a fork to avoid the crash taking down the testing process).
    The other sizes we have observed crashes on are also around 283 or bigger (the smallest was 278 elements).
    I suspect the particular commands and their order is less significant and instead serve to build up a certain heap structure.
  • While chasing this I've observed occasional messages of: double free or corruption (out), free(): invalid pointer, and Fatal error: allocation failure during minor GC which may indicate memory corruption
  • We have observed similar crashes in tests of Bytes and Array.Floatarray and even an infinite loop. The observations have been tracked in [ocaml5-issue] Crashes and hangs on ppc64 trunk/5.2 ocaml-multicore/multicoretests#380.
Long counter example
random seed: 257065471
generated error  fail  pass / total     time test name

[ ]     0     0     0     0 / 10000     0.0s STM Array test sequential
[ ]     0     0     0     0 / 10000     0.0s STM Array test sequential (generating)
[✗]     1     0     1     0 / 10000    10.8s STM Array test sequential

--- Failure --------------------------------------------------------------------

Test STM Array test sequential failed (5 shrink steps):

 Mem 'm'
 Fill (6, 4, '|')
 Sub (9, 8)
 To_list
 Fill (1, 0, 'R')
 Get 4
 Set (15, ')')
 Length
 Fill (7, 15, '$')
 Fill (8, 7, 'H')
 Length
 Copy
 To_seq
 Mem 'b'
 Get 5
 Copy
 Sort
 Sort
 Fill (6, 3, 'G')
 To_seq
 Fill (6, 6, 'w')
 To_seq
 Copy
 Fill (4, 5, 'u')
 Fill (1, 71, 'U')
 Length
 Fill (2, 4, 'U')
 Fill (23, 9, 'j')
 To_list
 Set (50, 'z')
 To_list
 Sort
 Sub (11, 12)
 Get 8
 Set (9, 'B')
 Mem '4'
 Mem 'n'
 Sub (4, 9)
 Length
 Fill (2, 14, 'C')
 Sort
 Length
 Fill (9, 2, '=')
 To_list
 Sub (13, 9)
 Set (9, 'A')
 To_list
 To_seq
 Get 6
 Copy
 Copy
 To_list
 Set (1, '\'')
 Sub (6, 10)
 Fill (6, 5, 'b')
 Fill (14, 50, '|')
 Sort
 Get 13
 Length
 Get 7
 Mem 'd'
 Mem '`'
 Sub (8, 13)
 To_seq
 To_list
 Copy
 Get 3
 To_seq
 Sort
 To_list
 Fill (7, 4, 'K')
 Mem '<'
 To_list
 Fill (8, 4, '/')
 Fill (2, 11, 'M')
 Fill (3, 12, 'E')
 Sort
 To_list
 Sort
 Mem '<'
 Fill (6, 1, ',')
 Set (4, '+')
 To_seq
 Set (2, 'H')
 To_seq
 Copy
 Get 7
 Sub (6, 3)
 Get 1
 Copy
 Get 12
 Fill (12, 7, '>')
 To_seq
 Set (2, 'c')
 Fill (6, 4, '&')
 Set (1, 'B')
 Mem 'x'
 Length
 Mem '%'
 Fill (4, 5, 'Q')
 Sub (65, 8)
 Copy
 Length
 Sort
 Get 0
 To_seq
 Mem '>'
 Sub (6, 0)
 To_seq
 Fill (5, 8, '4')
 To_list
 Get 8
 Get 53
 Fill (11, 2, ',')
 Copy
 Fill (12, 0, 's')
 Sub (4, 0)
 Set (29, 'i')
 Length
 Fill (1, 3, 'E')
 Sub (99, 8)
 Mem ' '
 Sort
 Sort
 Get 2
 Copy
 Sub (5, 6)
 Sort
 Sort
 To_seq
 Get 0
 To_list
 Sort
 Copy
 Mem 's'
 Length
 Get 3
 To_seq
 Length
 To_seq
 Get 1
 To_seq
 To_seq
 Get 1
 Mem '3'
 Fill (11, 1, 'o')
 To_seq
 Mem ':'
 Copy
 Sub (15, 14)
 To_seq
 Fill (41, 2, 'h')
 Mem '#'
 Set (3, 's')
 Get 4
 Get 6
 Sort
 To_seq
 Sub (6, 5)
 Sub (5, 0)
 Sub (6, 11)
 To_list
 To_seq
 Copy
 Fill (5, 7, 'Y')
 Fill (15, 9, '_')
 Copy
 Sub (8, 6)
 Fill (11, 76, '>')
 Sub (9, 1)
 Get 0
 To_list
 Length
 Mem 'r'
 Sub (9, 3)
 Get 5
 Copy
 Fill (1, 6, 'N')
 To_list
 Get 4
 To_list
 Sort
 Length
 Set (15, '6')
 To_seq
 To_seq
 Copy
 To_list
 To_seq
 Sub (5, 4)
 To_list
 Get 14
 Get 7
 Fill (11, 1, 'u')
 Mem 'E'
 Set (74, 'T')
 Sub (0, 7)
 Sub (9, 13)
 To_seq
 To_list
 Fill (6, 3, '6')
 To_seq
 Fill (9, 2, ':')
 Mem '}'
 Mem '9'
 To_seq
 Get 9
 To_seq
 Sort
 To_seq
 Mem 'z'
 Copy
 Copy
 Length
 Length
 Mem 'I'
 Fill (81, 22, 'V')
 Sub (98, 0)
 Mem '4'
 Get 12
 Get 6
 To_list
 Sub (13, 4)
 Copy
 Sort
 Copy
 Sub (4, 5)
 Sub (2, 1)
 To_seq
 Copy
 Sub (1, 0)
 Mem '2'
 Mem 'S'
 Mem 'X'
 Get 8
 To_seq
 Length
 Sort
 Mem '1'
 Length
 Set (1, '#')
 Sub (2, 6)
 Sort
 Sort
 Length
 Fill (10, 1, 'H')
 Fill (63, 1, 'x')
 Get 78
 Mem '7'
 Set (8, '6')
 Fill (1, 6, 'U')
 Get 14
 Mem '+'
 To_list
 Fill (13, 0, 'b')
 Length
 Copy
 Copy
 Mem 'D'
 Fill (78, 8, 'k')
 To_list
 Get 9
 Sort
 Fill (81, 7, '.')
 To_list
 Sub (15, 5)
 Set (2, 'L')
 Sort
 To_list
 Copy
 Mem 'v'
 Mem '4'
 Get 9
 Fill (8, 7, 'w')
 Mem 'E'
 Fill (0, 4, 'u')
 To_seq
 Set (2, '\\')
 Fill (75, 12, 'H')
 Get 19
 To_seq
 To_seq
 Fill (2, 11, 'g')
 Length
@jmid
Copy link
Contributor Author

jmid commented Aug 17, 2023

A run with the debug runtime yields the following malloc error:

### OCaml runtime: debug mode ###
random seed: 257065471
generated error fail pass / total     time test name
malloc(): mismatching next->prev_size (unsorted)

[ ]    0    0    0    0 / 1000     0.0s STM Array test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Array test sequential (generating)File "src/array/dune", line 4, characters 7-16:
4 |  (name stm_tests)
           ^^^^^^^^^
(cd _build/default/src/array && ./stm_tests.exe --verbose)
Command got signal ABRT.

@dustanddreams
Copy link
Contributor

The failing cases on POWER seem to all involve array operations.

The POWER backend is "special" in that it uses hardware instructions which trap in order to perform array bounds checks. If the index is within bounds, execution continues, otherwise SIGTRAP is delivered. A POWER-specific signal handler in runtime/signals_nat.c is used to handle this, and raise the appropriate exception.

The presence of this signal handler unfortunately makes debugging more difficult, as the debugger also uses SIGTRAP for its own purposes, and under Linux, the kernel does not pass enough information to userland to be able to easily tell bounds check traps from debugger's ptrace traps.

If we look at the signal handler, the interesting parts are:

void caml_sigtrap_handler(int signo, siginfo_t * info, void * context)
{
   ...
  /* Recover the values of interesting registers. */
  ucontext_t * ctx = context;
  uint64_t ctx_pc = ctx->uc_mcontext.gp_regs[32];
  uint64_t ctx_sp = ctx->uc_mcontext.gp_regs[1];
   ...
  /* Save address of trap as the return address in the standard stack frame
     location, so that it will be recorded in the stack backtrace. */
  ((uint64_t *) ctx_sp)[2] = ctx_pc;
   ...
  /* Raise the exception */
  caml_array_bound_error_asm();
}

In particular, this line

  ((uint64_t *) ctx_sp)[2] = ctx_pc;

assumes that the usual stack layout, with the lower four words are the reserved area where the third word (thus index 2) is the saved value of the "Link Register", i.e. the return address.

Adding simple printf calls in the signal handler to display the values of ctx_pc, ctx_sp and the value of the word being overwritten shows this:

$ ./_build/default/src/array/stm_tests.exe
### OCaml runtime: debug mode ###
random seed: 257065471
generated error fail pass / total     time test name
[ ]    0    0    0    0 / 1000     0.0s STM Array test sequential (generating)
pc c7321f8a998 sp c7358d05610
update ra c7321f89100 -> c7321f8a998
pc c7321f8a908 sp c7358d03390
update ra c7321f89100 -> c7321f8a908
pc c7321f8a998 sp c7358d031b0
update ra c7321f89100 -> c7321f8a998
pc c7321f8a998 sp c7358d01410
update ra 0 -> c7321f8a998
pc f058567a908 sp f05a7a3f490
update ra 9d -> f058567a908
corrupted size vs. prev_size
Aborted (core dumped)

First, this shows the signal handler is indeed triggered, Second, it also shows that the would be saved link register on the stack has a completely wrong value when the signal fires for the fourth time.

This might not be the cause of the memory corruption being seen, but this is definitely something getting in the way and which needs to be fixed.

Adding a crude "kill me if the stack looks odd" line to the signal handler:

if (((uint64_t *)ctx_sp)[2] < 0x1000) *((volatile int *)0) = 0x42;

allows us to get a core dump before hopefully too much memory corruption happens:

./_build/default/src/array/stm_tests.exe
### OCaml runtime: debug mode ###
random seed: 257065471
generated error fail pass / total     time test name
[ ]    0    0    0    0 / 1000     0.0s STM Array test sequential (generating)
pc 30c8afa998 sp 30daee5610
update ra 30c8af9100 -> 30c8afa998
pc 30c8afa908 sp 30daee3390
update ra 30c8af9100 -> 30c8afa908
pc 30c8afa998 sp 30daee31b0
update ra 30c8af9100 -> 30c8afa998
pc 30c8afa998 sp 30daee1410
update ra 0 -> 30c8afa998
Segmentation fault (core dumped)

From the core dump, we can gather more information.

First, the pc obtained by the signal handler indeed matches a conditional trap instruction.

(gdb) x/i 0x30c8afa998
   0x30c8afa998 <camlDune__exe__Stm_tests.fun_1689+56>: tdlle   r8,r5

This is confirmed by the backtrace information:

(gdb) bt 4
#0  0x00000030c8c46f2c in caml_sigtrap_handler (signo=5, info=0x30daee1190,
    context=0x30daee0410) at runtime/signals_nat.c:114
#1  <signal handler called>
#2  camlDune__exe__Stm_tests.fun_1689 () at src/array/stm_tests.ml:86
#3  0x00000030c8b037e8 in camlUtil.protect_981 () at lib/util.ml:75
(More stack frames follow...)

where line 86 of stm_tests is:

    | Set (i,c)    -> Res (result unit exn, protect (Array.set a i) c)

where Array.set requires a bounds check.

Now if we disassemble the whole camlDune__exe__Stm_tests.fun_1689 function:

(gdb) x/20i 0x30c8afa960
   0x30c8afa960 <camlDune__exe__Stm_tests.fun_1689>:    addis   r2,r12,27
   0x30c8afa964 <camlDune__exe__Stm_tests.fun_1689+4>:  addi    r2,r2,-12896
   0x30c8afa968 <camlDune__exe__Stm_tests.fun_1689+8>:  addi    r1,r1,-32
   0x30c8afa96c <camlDune__exe__Stm_tests.fun_1689+12>: mflr    r0
   0x30c8afa970 <camlDune__exe__Stm_tests.fun_1689+16>: std     r0,48(r1)
   0x30c8afa974 <camlDune__exe__Stm_tests.fun_1689+20>: std     r2,40(r1)
   0x30c8afa978 <camlDune__exe__Stm_tests.fun_1689+24>: ld      r11,48(r1)
   0x30c8afa97c <camlDune__exe__Stm_tests.fun_1689+28>: mtlr    r11
   0x30c8afa980 <camlDune__exe__Stm_tests.fun_1689+32>: ld      r6,16(r4)
   0x30c8afa984 <camlDune__exe__Stm_tests.fun_1689+36>: ld      r5,24(r4)
   0x30c8afa988 <camlDune__exe__Stm_tests.fun_1689+40>: ld      r7,-8(r6)
   0x30c8afa98c <camlDune__exe__Stm_tests.fun_1689+44>: sldi    r9,r5,2
   0x30c8afa990 <camlDune__exe__Stm_tests.fun_1689+48>: add     r10,r6,r9
   0x30c8afa994 <camlDune__exe__Stm_tests.fun_1689+52>: srdi    r8,r7,9
   0x30c8afa998 <camlDune__exe__Stm_tests.fun_1689+56>: tdlle   r8,r5
   0x30c8afa99c <camlDune__exe__Stm_tests.fun_1689+60>: lwsync
   0x30c8afa9a0 <camlDune__exe__Stm_tests.fun_1689+64>: std     r3,-4(r10)
   0x30c8afa9a4 <camlDune__exe__Stm_tests.fun_1689+68>: li      r3,1
   0x30c8afa9a8 <camlDune__exe__Stm_tests.fun_1689+72>: addi    r1,r1,32
   0x30c8afa9ac <camlDune__exe__Stm_tests.fun_1689+76>: blr

it follows the following logic:

   # are two instructions to load $r2 (TOC pointer), as required by ABI
   addis   r2,r12,27
   addi    r2,r2,-12896
   # allocate 32 bytes on the stack (r1 being the stack pointer)
   addi    r1,r1,-32
   # get the value of the link register (return address) in r0
   mflr    r0
   # save it on stack at offset 48
   std     r0,48(r1)
   # save TOC register on stack at offset 40
   std     r2,40(r1)
   # reload link register value
   ld      r11,48(r1)
   # write it back to link register
   mtlr    r11
   # get various closure parameters
   ld      r6,16(r4)
   ld      r5,24(r4)
   ld      r7,-8(r6)
   sldi    r9,r5,2
   add     r10,r6,r9
   srdi    r8,r7,9
   # perform the bounds check
   tdlle   r8,r5
   lwsync
   # the actual Array.set operation
   std     r3,-4(r10)
   # prepare to return success
   li      r3,1
   # restore stack pointer
   addi    r1,r1,32
   # branch to the address stored in the link register
   blr

The mandatory (ABI required) register saves are correctly performed in the caller frame, hence at offsets 32 + 8 and 32 + 16 since the stack has been decreased by 32 bytes.

However, the signal handler expects to find the link register value at offset 16 from the stack.

Now, that function is a leaf function, so it does not need to adjust the stack, but the native code generator does not care about leaf functions and will always perform a stack adjustment.

Therefore the signal handler makes wrong assumptions about the contents of the stack. A reasonably cheap yet correct way would be to reload the value of the link register into a temporary register (r11 or r12) before every conditional trap instruction, and have the signal handler use it rather than assuming it can use the current value of r1.

(Note that this will not be enough to solve the issue at the moment - a quick'n'dirty change to always add 32 to the stack pointer in the signal handler no longer displays nonsensical values, but does not prevent the memory corruption in libc from occurring...)

dustanddreams added a commit to dustanddreams/ocaml that referenced this issue Sep 8, 2023
The signal handler updates the stack prior to raising the array_bound_error
exception, but assumes this information is available at the bottom of the
stack, regardless of actual stack allocation for local variables.

Fix this by passing the appropriate offset in a temporary register prior
to issueing conditional trap instructions. The signal handler can then
recompute the proper addresses.
@xavierleroy
Copy link
Contributor

Nice detective work, as always!

I agree that the line ((uint64_t *) ctx_sp)[2] = ctx_pc; in the signal handler is wrong, in that it assumes an empty stack frame when the bounds check fail. I'm more surprised about the bizarre values of ctx_sp and ctx_pc coming from the signal context, but this could be a corollary of the stack being corrupted by the line above.

For context: in OCaml 4, the "last return address into OCaml code" that serves as key to the descriptor for the first frame during a stack walk is stored in the global domain state, field last_return_address. So, the POWER signal handler just sets it to ctx_pc. In OCaml 5, the last return address is expected to be in the OCaml stack, at a fixed offset from the stack pointer. That's hard to achieve in the context of this POWER signal handler.

I'll look at #12539 soon, but I'm afraid this is the death toll for trap-based bounds checking. It's probably time to get back to the standard compare & branch implementation.

@dustanddreams
Copy link
Contributor

I'll look at #12539 soon, but I'm afraid this is the death toll for trap-based bounds checking. It's probably time to get back to the standard compare & branch implementation.

For what it's worth, I have done a quick conversion to the usual compare & branch logic, and the multicore tests no longer fail, so this hints that the stack arithmetic fix is not good enough.

I'll clean this up and send another PR with the backend conversion, and then you can decide which direction to choose.

dustanddreams added a commit to dustanddreams/ocaml that referenced this issue Sep 8, 2023
The signal handler updates the stack prior to raising the array_bound_error
exception, but assumes this information is available at the bottom of the
stack, regardless of actual stack allocation for local variables.

Fix this by passing the appropriate offset in a temporary register prior
to issueing conditional trap instructions. The signal handler can then
recompute the proper addresses.
@jmid
Copy link
Contributor Author

jmid commented Oct 4, 2023

Fixed in #12540, so I'll close this.

@jmid jmid closed this as completed Oct 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants