Fix delivery of effect-related exceptions, take 2 (#12486) #12535

Merged
merged 1 commit into from Sep 20, 2023

Conversation

dustanddreams
Contributor

This is similar to #12530, but on amd64, in order to repair operation when the runtime is built with --enable-frame-pointers.

Tail calls to caml_c_call are adjusted to always have ENTER_FUNCTION as well as TSan instrumentation, so that the stack layout matches what is expected.

@jmid
Contributor

jmid commented Sep 6, 2023

FYI, elsewhere with @dustanddreams we've refined the reproducer initially mentioned in #12486 (comment)

Running with a reduced heap is sufficient to segfault reliably (5/5 runs on my local machine):

git clone git@github.com:ocaml-multicore/multicoretests.git
cd multicoretests
opam switch create 5.1.0~rc2+fp --packages=ocaml-variants.5.1.0~rc2+options,ocaml-option-fp
opam install qcheck-core
dune build src/neg_tests/lin_tests_dsl_effect.exe
OCAMLRUNPARAM=s32 _build/default/src/neg_tests/lin_tests_dsl_effect.exe -v -s 34037619

(you don't even need the seed -s 34037619 as it crashes consistently with any seed when run with the small minor heap)
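As a side note on the `s32` setting: the `s` parameter of OCAMLRUNPARAM requests a minor heap size in words. A minimal sketch, assuming only the standard `Gc` module, for printing the size the runtime actually applied (the runtime may round a very small request up, so the reported value can differ from 32). The helper name `minor_heap_words` is hypothetical.

```ocaml
(* Hypothetical helper: report the minor heap size currently in use,
   in words, as exposed by the standard Gc module. *)
let minor_heap_words () = (Gc.get ()).Gc.minor_heap_size

let () =
  Printf.printf "minor heap: %d words\n" (minor_heap_words ())
```

Running this once normally and once under `OCAMLRUNPARAM=s32` shows whether the reduced setting took effect.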

@jmid
Contributor

jmid commented Sep 6, 2023

I've built a local opam switch with frame pointers enabled

ocamlopt -config | grep with_frame_pointers
with_frame_pointers: true

and can confirm that 10/10 runs of both OCAMLRUNPARAM=s32 _build/default/src/neg_tests/lin_tests_dsl_effect.exe -v -s 34037619 and OCAMLRUNPARAM=s32 _build/default/src/neg_tests/lin_tests_dsl_effect.exe no longer crash with this fix.

@gasche
Member

gasche commented Sep 6, 2023

@jmid can you also test with tsan enabled?

@jmid
Contributor

jmid commented Sep 6, 2023

$ opam install . --inplace-build ocaml-option-tsan
...
#=== ERROR while compiling ocaml-variants.5.2.0+trunk =========================#
# context     2.1.2 | linux/x86_64 |  | pinned(git+file:///home/jmi/software/ocaml-06-09-2023-PR12535#amd64_fp_fix#21fa63fa)
# path        ~/software/ocaml-06-09-2023-PR12535
# command     ~/.opam/opam-init/hooks/sandbox.sh build make -j15
# exit-code   2
# env-file    ~/.opam/log/ocaml-variants-2741389-f8847f.env
# output-file ~/.opam/log/ocaml-variants-2741389-f8847f.out
### output ###
# [...]
# make[2]: *** [Makefile:994: runtime/domain.n.o] Error 1
# make[2]: *** [Makefile:994: runtime/custom.n.o] Error 1
# gcc: error: unrecognized command-line option ‘--param=tsan-distinguish-volatile=1’
# make[2]: *** [Makefile:994: runtime/fiber.n.o] Error 1
# gcc: error: unrecognized command-line option ‘--param=tsan-distinguish-volatile=1’
# gcc: error: unrecognized command-line option ‘--param=tsan-distinguish-volatile=1’
# make[2]: *** [Makefile:994: runtime/dynlink.n.o] Error 1
# make[2]: *** [Makefile:994: runtime/extern.n.o] Error 1
# make[2]: Leaving directory '/home/jmi/software/ocaml-06-09-2023-PR12535'
# make[1]: *** [Makefile:306: opt.opt] Error 2
# make[1]: Leaving directory '/home/jmi/software/ocaml-06-09-2023-PR12535'
# make: *** [Makefile:381: world.opt] Error 2

This is with the following gcc:

gcc (Ubuntu 10.5.0-1ubuntu1~20.04) 10.5.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

@OlivierNicole @fabbing - I am horribly so-last-year just using my distro's gcc 😅 😬

@dustanddreams
Contributor Author

@jmid can you also test with tsan enabled?

I gave it a try but both trunk and this PR's branch fail:

$ OCAMLRUNPARAM=s32 _build/default/src/neg_tests/lin_tests_dsl_effect.exe -v -s 34037619
### OCaml runtime: debug mode ###
random seed: 34037619
generated error  fail  pass / total     time test name
[ ]     0     0     0     0 / 20000     0.0s Lin DSL ref int test with Effect (generating)ThreadSanitizer:DEADLYSIGNAL
==699670==ERROR: ThreadSanitizer: stack-overflow on address 0x7ffea698bff8 (pc 0x7f6db1bbb9ee bp 0xe000000f2a66 sp 0x7ffea698c000 T699670)
ThreadSanitizer:DEADLYSIGNAL
ThreadSanitizer: nested bug in the same thread, aborting.

runtime/amd64.S Outdated
@@ -1177,7 +1178,10 @@ CFI_STARTPROC
         UPDATE_BASE_POINTER(%rcx)
         SWITCH_OCAML_STACKS
         jmp     *(%rbx)
-2:      TSAN_ENTER_FUNCTION(0)
+2:      ENTER_FUNCTION
+        TSAN_SAVE_CALLER_REGS
Contributor

The lack of TSAN_SAVE/RESTORE_CALLER_REGS here was deliberate for performance reasons: we found out that saving too many unnecessary registers would hurt performance with TSan.

Since what we need to do here is almost exactly what the caml_c_call prologue does, and contrary to what I said over there, I suggest we do as you did in #12530 and jmp to the caml_c_call prologue through the GCALL label instead.
There will be a small performance cost, though hopefully this path is rarely taken, and it has the big benefit of unifying the amd64 code with s390x.

@@ -1178,12 +1178,8 @@ CFI_STARTPROC
         UPDATE_BASE_POINTER(%rcx)
         SWITCH_OCAML_STACKS
         jmp     *(%rbx)
-2:      ENTER_FUNCTION
-        TSAN_SAVE_CALLER_REGS
-        TSAN_ENTER_FUNCTION(0)
-        TSAN_RESTORE_CALLER_REGS
-        LEA_VAR(caml_raise_continuation_already_resumed, %rax)
-        jmp LBL(caml_c_call)
+2:      LEA_VAR(caml_raise_continuation_already_resumed, %rax)
+        jmp GCALL(caml_c_call)

Contributor Author

Note that there used to be TSAN_ENTER_FUNCTION(0) in this code path, I only added ENTER_FUNCTION to set up the stack frame correctly and TSAN_{SAVE,RESTORE}_CALLER_REGS to wrap the TSAN_ENTER_FUNCTION call.

However I notice that caml_ml_array_bound_error also uses TSAN_ENTER_FUNCTION without the SAVE/RESTORE dance.

I am pushing a new version of this code which mimics caml_ml_array_bound_error in this case.

@jmid
Contributor

jmid commented Sep 7, 2023

If we want to add a regression test case I've cooked up the following, which crashes my local 5.1.0~rc+fp switch consistently. It should also be able to double as a test for the s390x issue.

open Effect

type _ t += Yield : unit t

let rec burn l =
  if List.hd l > 12 then ()
  else
    burn (l @ l |> List.map (fun x -> x + 1))

let foo l =
  burn l;
  perform Yield

let bar i = foo [i]

let _ =
  for _ = 1 to 10_000 do
    try bar 8
    with Unhandled _ -> ()
  done

$ ocamlopt crash.ml
$ OCAMLRUNPARAM=s32 ./a.out 
Segmentation fault (core dumped)
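
The test case above exercises the Effect.Unhandled path only. The other exception this PR touches, Effect.Continuation_already_resumed, could be exercised along the same lines. The following is a sketch, not part of the PR: the names `resume_twice` and `count_already_resumed` are hypothetical, and it assumes that resuming a one-shot continuation a second time from inside a deep handler reaches the same runtime path.

```ocaml
open Effect.Deep

type _ Effect.t += Yield : unit Effect.t

(* Same allocation-heavy helper as in the test case above, so that
   minor collections happen while the fiber is live. *)
let rec burn l =
  if List.hd l > 12 then ()
  else burn (l @ l |> List.map (fun x -> x + 1))

(* Hypothetical helper: handle Yield by resuming the continuation twice.
   The first resume runs the computation to completion; the second one
   is expected to raise Effect.Continuation_already_resumed. *)
let resume_twice () =
  match_with
    (fun () -> burn [8]; Effect.perform Yield)
    ()
    { retc = (fun () -> ());
      exnc = raise;
      effc = (fun (type a) (eff : a Effect.t) ->
        match eff with
        | Yield ->
          Some (fun (k : (a, unit) continuation) ->
            continue k ();
            continue k ())
        | _ -> None) }

(* Hypothetical helper: count how many of n attempts raised the exception. *)
let count_already_resumed n =
  let hits = ref 0 in
  for _ = 1 to n do
    try resume_twice ()
    with Effect.Continuation_already_resumed -> incr hits
  done;
  !hits

let () = Printf.printf "%d\n" (count_already_resumed 10_000)
```

As with the test case above, running the resulting binary under `OCAMLRUNPARAM=s32` should make a broken stack layout more likely to show up.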

Member

@gasche gasche left a comment

I believe that the change makes the code strictly more correct, so I am approving. (It only makes a difference with frame pointers enabled.)

I think it would be useful if the TSan people could leave an explicit trace in the code of the register-saving decisions: it would have made the work on this PR easier, and it may show up again in the future. (It will also be useful documentation if someone less knowledgeable tries to port the TSan runtime support to another backend.)

@gasche
Member

gasche commented Sep 8, 2023

@dustanddreams could you add the testcase proposed by @jmid (it's a good testcase because it runs instantly) to the testsuite, for example as a new program testsuite/tests/frame-pointers/unhandled_effect.ml? Ideally the file would have a comment linking to the related GitHub issues.

(To run a new test file you can just run ./ocamltest/ocamltest testsuite/../<test_file>.)

@dustanddreams
Contributor Author

@dustanddreams could you add the testcase proposed by @jmid (it's a good testcase because it runs instantly) to the testsuite, for example as a new program testsuite/tests/frame-pointers/unhandled_effect.ml? Ideally the file would have a comment linking to the related GitHub issues.

Sure, let me know if the file looks good to you.

@fabbing
Contributor

fabbing commented Sep 8, 2023

We've been able to reproduce the crash on lin_tests_dsl_effect both with TSan only and with TSan + fp, by running only a subset of 2 tests.
The two configurations seem to crash in different ways; we've started investigating the issue.

@@ -0,0 +1,37 @@
(* TEST
frame_pointers;
Member

Do we really want to restrict the test to only setups with frame pointers enabled? The current fix is for a bug that only occurs with frame pointers, but I assume that the test might have caught the non-frame-pointers-specific s390x issue, and could be helpful in the test harness of new backends.

Member

(To clarify: my gut feeling is that we do not want the frame_pointers condition here, and I am waiting to see what @dustanddreams thinks before merging, with or without the condition.)

Contributor Author

Yes, you're right; although the code was only incorrect with frame pointers enabled, this was not the case on s390x. But then, should it really belong to the tests/frame-pointers directory?

Member

But then, should it really belong to the tests/frame-pointers directory?

Honestly I don't really care, please choose the test directory that you prefer.

Contributor Author

Moved to effects, then.

@OlivierNicole
Contributor

I think it would be useful if the TSan people could leave an explicit trace in the code of the register-saving decisions: it would have made the work on this PR easier, and it may show up again in the future. (It will also be useful documentation if someone less knowledgeable tries to port the TSan runtime support to another backend.)

I agree; I intend to do it in our fix PR.

@fabbing
Contributor

fabbing commented Sep 11, 2023

We may have found one of the issues, the one involving the combination of TSan and frame-pointers: the code didn't always use the macro (which we added ourselves!) to account for possible frame-pointers.
A fix PR should come very soon.

This repairs operation when the compiler is built with --enable-frame-pointers
and either Effect.Unhandled or Effect.Continuation_already_resumed needs to
be raised.

@Octachron
Member

Since this change affects the fp and TSan modes, it seems better to me to cherry-pick it to 5.1 (if there are no objections from reviewers) in anticipation of the upcoming 5.1 release.

@OlivierNicole
Contributor

That makes sense. Another fix for TSan + fp is #12561.

@fabbing
Contributor

fabbing commented Dec 6, 2023

TSan won't be part of new 5.1.x, will it?
In that case the TSAN_ENTER_FUNCTION should be dropped as it only affects TSan, but the ENTER_FUNCTIONs are worth adding as they supply instructions that were missing in pure frame-pointer mode.

@OlivierNicole
Contributor

Ah, yes, it had slipped my mind that TSan won’t be in 5.1.1. What @fabbing said.

dra27 pushed a commit to dra27/ocaml that referenced this pull request Dec 6, 2023
Fix delivery of effect-related exceptions, take 2 (ocaml#12486)

(cherry picked from commit 01f737a)

6 participants