ppc64 backend segfault on trunk #12482
A run with the debug runtime yields the following:
The failing cases on POWER all seem to involve array operations. The POWER backend is "special" in that it uses hardware trap instructions to perform array bounds checks: if the index is within bounds, execution continues; otherwise the instruction traps and a signal handler raises the bounds-check exception. The presence of this signal handler unfortunately makes debugging more difficult, as the debugger relies on the same trap mechanism for its breakpoints. If we look at the signal handler, the interesting parts are:
In particular, this line assumes the usual stack layout, in which the lowest four words are the reserved area and the third word (thus index 2) holds the saved link register. Adding some simple tracing to the handler shows two things.
First, it confirms the signal handler is indeed triggered. Second, it shows that the would-be saved link register on the stack has a completely wrong value when the signal fires for the fourth time. This might not be the cause of the memory corruption being seen, but it is definitely something getting in the way that needs to be fixed. Adding a crude "kill me if the stack looks odd" check to the signal handler:
allows us to get a core dump, hopefully before too much memory corruption happens:
From the core dump, we can gather more information. First, the
This is confirmed by the backtrace information:
where line 86 of
Now if we disassemble the whole function, it follows this logic:
The mandatory (ABI-required) register saves are correctly performed in the caller frame, hence at the expected offsets. However, the signal handler expects to find the link register value at a fixed offset from the stack pointer. Now, that function is a leaf function, so it does not need to adjust the stack, but the native code generator does not treat leaf functions specially and always performs a stack adjustment. Therefore the signal handler makes wrong assumptions about the contents of the stack. A reasonably cheap yet correct fix would be to reload the value of the link register into a temporary register before the trap. (Note that this will not be enough to solve the issue at the moment: a quick'n'dirty change along these lines did not make all failures disappear.)
The signal handler updates the stack prior to raising the array_bound_error exception, but assumes this information is available at the bottom of the stack, regardless of actual stack allocation for local variables. Fix this by passing the appropriate offset in a temporary register prior to issuing conditional trap instructions. The signal handler can then recompute the proper addresses.
Nice detective work, as always! I agree that the line in question is wrong. For context: in OCaml 4, the "last return address into OCaml code" that serves as the key to the descriptor for the first frame during a stack walk is stored in a field of the global domain state. I'll look at #12539 soon, but I'm afraid this is the death knell for trap-based bounds checking. It's probably time to go back to the standard compare & branch implementation.
For what it's worth, I have done a quick conversion to the usual compare & branch logic, and the multicore tests no longer fail, which hints that the stack arithmetic fix is not good enough on its own. I'll clean this up and send another PR with the backend conversion, and then you can decide which direction to choose.
Fixed in #12540, so I'll close this.
As part of multicoretests we are observing segfaults in code produced by the newly restored ppc64 backend.
The test triggering it is a property-based test of `array`s against a model. Contrary to previous torture tests, this is a sequential, single-domain test causing a crash.
A branch is available here: https://github.com/ocaml-multicore/multicoretests/tree/ppc64-crash-repro
Here's an example output: https://ocaml-multicoretests.ci.dev:8100/job/2023-08-16/132015-ci-ocluster-build-8fbc23#L343
To recreate:
```
opam install dune qcheck-core
dune build @ci -j1 --no-buffer
```
I don't have direct access to a ppc-machine ATM, so the above has been produced via CI-golf... 🤓
I may get access to one next week.
Context
Below I include the long counterexample produced with this seed: a 283-element `cmd` list! (The output was produced by adding a `fork` to avoid the crash taking down the testing process.) The other sizes we have observed crashes on are also around 283 or bigger (the smallest was 278 elements).
I suspect the particular commands and their order are less significant and instead serve to build up a certain heap structure.
We have seen several failure modes: `double free or corruption (out)`, `free(): invalid pointer`, and `Fatal error: allocation failure during minor GC`, which may indicate memory corruption. We have also observed crashes involving `Bytes` and `Array.Floatarray`, and even an infinite loop. The observations have been tracked in [ocaml5-issue] Crashes and hangs on ppc64 trunk/5.2 ocaml-multicore/multicoretests#380.

Long counter example