s390x segfault/abort on 5.1 and trunk when running Effect test #12486
The (rare) failure occurs in […]. Why the stack gets corrupted from there is not clear yet. |
I was able to reproduce the crash on the system-Z machine we use for the core OCaml CI, but it occurs quite rarely. I didn't get it to crash with […]. |
No, for a fixed seed this should be completely deterministic. Without that commit, the test can be invoked as […]. I know @dustanddreams has been looking into this crash. |
I have the same difficulty as @xavierleroy in getting the test to crash; it can sometimes take a dozen runs to trigger the assertion. My current analysis notes follow. We hit an assertion in fiber.c!scan_stack_frames. The relevant parts of the function are: […]
On entry, we have […].
On s390x, we have […],
so the local variable sp is initialized with […]. Then retaddr will be […]. The usable backtrace in gdb is: […]
Frame #20 is not a signal handler, but the lack of a correctly constructed […].
They show that we come from […]. At this point, […].
And in that C stack, the return address, which we need to resume the backtrace, is […].
This address is wrong, as it doesn't point to any code segment but to the C stack. This hints that the memory corruption has occurred between the call to […] and […]. If we put a breakpoint in […],
where address […].
Indeed: […]
So we need to figure out what causes the return address to be overwritten with […]. Instrumenting […] suggests that the corruption happens earlier. Relying upon the fact that code addresses will not change between executions, […].
Note that the value of 1 is incorrect as a return address, but is accepted as good […]. When running the test program, we first get this: […]
which suggests the corruption is happening early, but somehow gets undone.
Then the next […] starts, and immediately reports incorrect values: […]
Most of the time, despite these incorrect values, the test will succeed because […]. The next step is to move the checks deeper, in […]. |
I have instrumented […]
and […].
With this instrumentation, here is what the beginning of the function looks like in gdb: […]
Let's walk through it.
This is the beginning of […]:
Then we enter […]:
This assigns […]:
This is the assignment […]:
This is the assignment […]:
This reloads the return address saved to the stack at the beginning of […]:
We load […]:
These reload the return address into […]. The rest of the changes to […]. What is interesting with this instrumentation is that now: […]
This means that, between the time the return address […]. Since this test is not multi-threaded (and gdb indeed reports only one thread), there are no function calls between the stack write and the stack read, and […]. Another simple experiment I just did was to change the beginning of […] to: […]
In other words, the return address is saved 6 times on the stack rather than once. Of course, since the stack pointer is then restored, these extra saves will get overwritten, but by the time the check in […] runs, […]. After placing a breakpoint on the jump to […]: […]
Moreover, if I look at the other copies of the return address on the stack, only one of them is correct:
So there is definitely either other code being invoked on this stack by some unknown (to me) mechanism, or wrong/stale cache data being returned. In the latter case, there is obviously nothing we can do on the OCaml side to prevent this misbehaviour. Slightly changing the experiment to this: […]
would write […] five times,
which shows a similar pattern to the one above, with none of the writes being visible. Again, this is either because other code has been invoked between these writes and […], or because […]. |
I'm not working on this bug, but I would like to interject to thank @dustanddreams for the very helpful description of his approach to debugging this kind of bug. As someone with no experience debugging low-level code, I find there is a lot I don't know about how to approach it, and I am sure that this dump of practitioner knowledge will prove useful to me in the future. |
Tinkering further with […]
and […]:
Running the binary under gdb with a breakpoint on every call to […]. Execution has thus been: […]
In other words, the first assembly statement of […]. If we look at the values, we have: […]
Notice that the low 9 bits of […].
Interestingly, since all […]. |
As I've shared with @dustanddreams elsewhere, I've observed this same test crashing and aborting with the same assertion failure on both […] and […]. At this point I don't understand whether this is the same bug surfacing, or whether these are two independent bugs with the same symptoms. When run with a fixed seed like […], […].
To recreate locally:
|
Impressive debugging job! Just one remark, in case you were not aware:
Every C function, incl. […]. Normally, the 160 bytes at the bottom of the C stack are reserved at all times so that, after switching to the C stack (of course), we can call any C function. But maybe something is wrong here too. |
Delivery of Effect.Unhandled, Effect.Continuation_already_resumed and Pervasives.array_bound_error exceptions (through caml_raise_unhandled_effect, caml_raise_continuation_already_resumed and caml_array_bound_error_asm respectively) should not assume anything about the stack they run on and thus should not bypass the first few instructions of caml_c_call. This only affects amd64 and s390x, all other platforms were already correct.
When things do not appear to make sense, it means that my understanding of how things behave is wrong. The experiments above have exposed a complete change in stack contents between two points in […]. These jumps are easy to identify, since each of them sets […]. The new instrumentation is this: […]
with a breakpoint put on the nop. Running the test again, this breakpoint is hit with […]. We can confirm that this is the only code path leading to these incorrect stack contents by using a conditional breakpoint: […]
in which case execution of the test completes without hitting the breakpoint, and further runs never trigger it. Another test with a breakpoint put at the end of […]. But is this really the problem here? If we look at the use of […]:
This hints that the stack does not need to be an existing stack on which a valid return address has been saved: it may be a brand new stack, and in any case it needs to be treated as such, by not assuming anything about its contents. Changing the s390x code to align with the other non-amd64 platforms seems to make the problem disappear. Running the test in a loop for more than one hour does trigger the original […]. Whether amd64 also needs a similar change is left as an exercise to the reader. |
Delivery of Effect.Unhandled (through caml_raise_unhandled_effect) should not assume anything about the stack it runs on and thus should not bypass the first few instructions of caml_c_call. This only affects amd64 and s390x, all other platforms were already correct.
Delivery of Effect.Unhandled (through caml_raise_unhandled_effect) should not assume anything about the stack it runs on and thus should not bypass the first few instructions of caml_c_call.
Delivery of Effect.Unhandled and Effect.Continuation_already_resumed (through caml_raise_unhandled_effect and caml_raise_continuation_already_resumed respectively) should not assume anything about the stack it runs on and thus should not bypass the first few instructions of caml_c_call.
Fix delivery of effect-related exceptions (#12486)
Delivery of Effect.Unhandled and Effect.Continuation_already_resumed (through caml_raise_unhandled_effect and caml_raise_continuation_already_resumed respectively) should not assume anything about the stack it runs on and thus should not bypass the first few instructions of caml_c_call.
Fix delivery of effect-related exceptions, take 2 (#12486)
Fix delivery of effect-related exceptions, take 2 (ocaml#12486) (cherry picked from commit 01f737a)
Today we observed another segfault while running `multicoretests`. This one is triggered on s390x using OCaml from the `5.1+trunk` and `5.2+trunk` opam packages (these follow the `5.1` and `trunk` branches AFAIU). The test in question does not involve parallelism, but is triggered while stress-testing `Effect`s. A reproducible branch is available here: https://github.com/ocaml-multicore/multicoretests/commits/s390x-crash-repro
The test crashes 10/10 on both `5.1` and `trunk`, so the issue seems deterministic. The file consists of 4 positive tests and 4 negative ones, and the crash consistently occurs in the 2nd negative test:
The `negative Lin DSL ref int64 test with Effect` is just expected to raise an `Unhandled` exception and report it through the test runner, like the previous `negative Lin DSL ref int test with Effect`. Above I've run the test under the debug runtime, where it consistently aborts while failing this assertion:
`ocaml/runtime/fiber.c`, line 250 at commit be72b7b
This could indicate an issue related to s390x frame descriptors.
If I comment out the first 4 tests the crash no longer happens, so I suspect these need to run to bring the heap to a particular shape. Commenting out the last two tests had the same effect.
To recreate the issue:

```
opam install dune qcheck-core
dune build @ci -j1 --no-buffer
```