
Possible corruption bug in caml_master_lock in 4.14.1 on macOS #12636

Closed

zbaylin opened this issue Oct 5, 2023 · 19 comments

@zbaylin

zbaylin commented Oct 5, 2023

A couple of days ago, I upgraded from macOS 13 to macOS 14 Sonoma on my M1 MacBook Pro. After doing so, I noticed that my LSP server would seemingly randomly stop working and spin, causing 100% CPU usage.

I opened an issue (ocaml/ocaml-lsp#1194) thinking it was maybe a bug in the LSP server itself, but upon further investigation it appears that it might be a bug in the runtime.

Unfortunately, I haven't been able to find an easily reproducible example for this, but I'll continue to try to do so and outline my investigation here.


Once the LSP froze, I attached to it with lldb to see if I could glean some more information that way. Immediately after doing so, a breakpoint was hit (even though I hadn't set any):

* thread #1, stop reason = EXC_BREAKPOINT (code=1, subcode=0x188124220)
    frame #0: 0x0000000188124220 libsystem_platform.dylib`_os_unfair_lock_corruption_abort + 52
libsystem_platform.dylib`:
->  0x188124220 <+52>: brk    #0x1
    0x188124224 <+56>: stp    x20, x21, [sp, #-0x10]!
    0x188124228 <+60>: adrp   x20, 0
    0x18812422c <+64>: add    x20, x20, #0x95d          ; "BUG IN CLIENT OF LIBPLATFORM: os_unfair_lock is corrupt"

The backtrace at this point looks like the following:

* thread #1, stop reason = EXC_BREAKPOINT (code=1, subcode=0x188124220)
  * frame #0: 0x0000000188124220 libsystem_platform.dylib`_os_unfair_lock_corruption_abort + 52
    frame #1: 0x000000018811f788 libsystem_platform.dylib`_os_unfair_lock_lock_slow + 332
    frame #2: 0x00000001880ee3f0 libsystem_pthread.dylib`pthread_mutex_destroy + 64
    frame #3: 0x00000001008144fc ocamllsp`caml_thread_reinitialize [inlined] st_mutex_destroy(m=0x00006000016dc040) at st_posix.h:228:8 [opt]
    frame #4: 0x00000001008144f4 ocamllsp`caml_thread_reinitialize at st_stubs.c:425:7 [opt]
    frame #5: 0x00000001880f6f7c libsystem_pthread.dylib`_pthread_atfork_child_handlers + 76
    frame #6: 0x0000000187faeb90 libsystem_c.dylib`fork + 112
    frame #7: 0x000000010081323c ocamllsp`spawn_unix(v_env=<unavailable>, v_cwd=<unavailable>, v_prog=<unavailable>, v_argv=<unavailable>, v_stdin=<unavailable>, v_stdout=<unavailable>, v_stderr=<unavailable>, v_use_vfork=<unavailable>, v_setpgid=1) at spawn_stubs.c:439:43 [opt]
    frame #8: 0x0000000100849fc0 ocamllsp`caml_c_call + 28
    frame #9: 0x0000000100661e38 ocamllsp`camlSpawn__spawn_inner_997 + 152
    frame #10: 0x00000001003ba90c ocamllsp`camlOcaml_lsp_server__run_in_directory_5864 + 524
    frame #11: 0x000000010044e728 ocamllsp`camlMerlin_kernel__Pparse__apply_rewriter_786 + 296
    frame #12: 0x000000010044ec28 ocamllsp`camlMerlin_kernel__Pparse__rewrite_957 + 144
    frame #13: 0x000000010044eca4 ocamllsp`camlMerlin_kernel__Pparse__apply_rewriters_str_inner_1399 + 68
    frame #14: 0x000000010044ed74 ocamllsp`camlMerlin_kernel__Pparse__apply_rewriters_1089 + 52
    frame #15: 0x0000000100464564 ocamllsp`camlMerlin_kernel__Mppx__fun_1203 + 60
    frame #16: 0x0000000100464428 ocamllsp`camlMerlin_kernel__Mppx__code_begin + 136
    frame #17: 0x000000010044f81c ocamllsp`camlMerlin_kernel__Phase_cache__apply_inner_528 + 852
    frame #18: 0x0000000100465ca0 ocamllsp`camlMerlin_kernel__Mpipeline__fun_1778 + 152
    frame #19: 0x000000010064bdd0 ocamllsp`camlMerlin_utils__Misc__try_finally_inner_3715 + 48
    frame #20: 0x00000001007b993c ocamllsp`camlCamlinternalLazy__force_lazy_block_362 + 140
    frame #21: 0x000000010046482c ocamllsp`camlMerlin_kernel__Mpipeline__fun_1613 + 172
    frame #22: 0x00000001007b993c ocamllsp`camlCamlinternalLazy__force_lazy_block_362 + 140
    frame #23: 0x0000000100465d44 ocamllsp`camlMerlin_kernel__Mpipeline__fun_1791 + 60
    frame #24: 0x00000001007b993c ocamllsp`camlCamlinternalLazy__force_lazy_block_362 + 140
    frame #25: 0x000000010046482c ocamllsp`camlMerlin_kernel__Mpipeline__fun_1613 + 172
    frame #26: 0x00000001007b993c ocamllsp`camlCamlinternalLazy__force_lazy_block_362 + 140
    frame #27: 0x000000010042acb0 ocamllsp`camlQuery_commands__dispatch_2992 + 7320
    frame #28: 0x0000000100646d24 ocamllsp`camlMerlin_utils__Std__let_ref_3289 + 60
    frame #29: 0x000000010064bdd0 ocamllsp`camlMerlin_utils__Misc__try_finally_inner_3715 + 48
    frame #30: 0x00000001007f8048 ocamllsp`camlStdlib__Fun__protect_317 + 96
    frame #31: 0x000000010045acd8 ocamllsp`camlMerlin_kernel__Mocaml__with_state_438 + 80
    frame #32: 0x000000010037dd44 ocamllsp`camlOcaml_lsp_server__Document__fun_4117 + 116
    frame #33: 0x00000001007a1d18 ocamllsp`camlStdune__Exn_with_backtrace__try_with_422 + 40
    frame #34: 0x00000001006645c8 ocamllsp`camlLev_fiber__do_no_raise_1549 + 24
    frame #35: 0x000000010066fd38 ocamllsp`camlLev_fiber_util__Worker__run_322 + 88
    frame #36: 0x0000000100671354 ocamllsp`camlThread__fun_850 + 44
    frame #37: 0x000000010084a030 ocamllsp`caml_start_program + 104
    frame #38: 0x000000010083ffa0 ocamllsp`caml_callback_exn(closure=<unavailable>, arg=1) at callback.c:111:10 [opt]
    frame #39: 0x0000000100814768 ocamllsp`caml_thread_start(arg=<unavailable>) at st_stubs.c:549:5 [opt]
    frame #40: 0x00000001880f3034 libsystem_pthread.dylib`_pthread_start + 136

This seems to correspond to code in libplatform in the version of Darwin used by macOS 14: https://github.com/apple/darwin-libplatform/blob/215b09856ab5765b7462a91be7076183076600df/src/os/lock.c#L136

Dumping the register state here is also a bit informative:

General Purpose Registers:
        x0 = 0x00000000000007fa
        x1 = 0x0000000000000000
        x2 = 0x00000000000007fa
        x3 = 0x0000000000000000
        x4 = 0x00000d5200000000
        x5 = 0x0000040300000001
        x6 = 0x0000000000002000
        x7 = 0x0000000000000000
        x8 = 0x00000000000007fa
        x9 = 0x00000000000005fa
       x10 = 0x000000000003fe00
       x11 = 0x0000000100e33a38  ocamllsp`caml_master_lock + 24
       x12 = 0x0000000100e33a38  ocamllsp`caml_master_lock + 24
       x13 = 0x00000000ffff8128
       x14 = 0x0000000000000010
       x15 = 0x00000000ffff7dff
       x16 = 0x0000000000000203
       x17 = 0x00000001e85e8e58
       x18 = 0x0000000000000000
       x19 = 0x0000000000000303
       x20 = 0x0000000000050000
       x21 = 0x00006000016dc048
       x22 = 0x0000000001050002
       x23 = 0x00000000000007fa
       x24 = 0x0000000000000000
       x25 = 0x00000000ffffffff
       x26 = 0x0000000000000303
       x27 = 0x0000000000000000
       x28 = 0x000000016fc367b0
        fp = 0x000000016fc36610
        lr = 0x000000018811f788  libsystem_platform.dylib`_os_unfair_lock_lock_slow + 332
        sp = 0x000000016fc365d0
        pc = 0x0000000188124220  libsystem_platform.dylib`_os_unfair_lock_corruption_abort + 52
      cpsr = 0x80001000

It's unclear to me what specifically x11 and x12 are being used for here, but they both contain pointers into caml_master_lock, and libplatform seems to believe that the lock inside it is corrupt.

The backtrace also mentions caml_thread_reinitialize, which appears to call st_mutex_destroy on the mutex of every open channel:

for (chan = caml_all_opened_channels;
     chan != NULL;
     chan = chan->next) {
  if (chan->mutex != NULL) {
    st_mutex_destroy(chan->mutex);
    chan->mutex = NULL;
  }
}

Unfortunately, that's where my investigation has ended for now -- I can't figure out whether the lock is actually corrupt, let alone where the corruption happens if it is.

If anyone has any pointers for where I should look next, please let me know. I'll work on making a reproducible example.
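
To make the pattern in the backtrace concrete, here is a minimal standalone C sketch of what seems to be going on. It is not code from the OCaml runtime or ocaml-lsp, and the names are made up; it just distills "one thread holds a mutex, another thread forks, and an atfork child handler destroys the still-locked mutex", which POSIX leaves undefined and which macOS's libpthread can apparently turn into the abort above:

/* Illustrative sketch only; not code from the OCaml runtime or ocaml-lsp. */
#include <pthread.h>
#include <sys/wait.h>
#include <unistd.h>

static pthread_mutex_t chan_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Runs in the child after fork(), the way caml_thread_reinitialize is run
   via pthread_atfork in 4.14: it destroys a mutex that may still be locked. */
static void child_handler(void)
{
  /* Undefined behaviour if chan_mutex is locked at fork time; this is the
     kind of call that can end in _os_unfair_lock_corruption_abort. */
  pthread_mutex_destroy(&chan_mutex);
}

static void *locker(void *arg)
{
  (void)arg;
  pthread_mutex_lock(&chan_mutex);   /* hold the lock across the fork */
  sleep(2);
  pthread_mutex_unlock(&chan_mutex);
  return NULL;
}

int main(void)
{
  pthread_t t;
  pthread_atfork(NULL, NULL, child_handler);
  pthread_create(&t, NULL, locker, NULL);
  usleep(100 * 1000);                /* give the locker time to take the mutex */

  pid_t pid = fork();                /* the child runs child_handler first */
  if (pid == 0)
    _exit(0);                        /* a real child would exec here */
  waitpid(pid, NULL, 0);
  pthread_join(t, NULL);
  return 0;
}

Whether this particular toy program actually aborts will depend on the platform and mutex implementation; it is only meant to show the shape of the failure.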

@gasche
Member

gasche commented Oct 5, 2023

Your stack trace suggests that the process you are using mixes threads and fork(). That is asking for trouble.

The fork() call comes from a spawn_unix function that is not provided by the OCaml runtime, but by C stubs in a third-party package:

https://github.com/janestreet/spawn/blob/faf3d71f3af9127b09a33f101413150b61215193/src/spawn_stubs.c#L377

This may or may not be the source of the bug you are observing, but I would say that the post-fork resource logic in that C code is likely to be involved. My intuition is thus that the bug should first be reported to the authors of the spawn library, and/or that a simplified reproduction case not involving this code should be found before treating this as a runtime bug.

@gasche
Member

gasche commented Oct 5, 2023

Note: maybe ocaml-lsp should consider using the spawn functions provided by the OCaml runtime, which have used posix_spawn instead of fork() since OCaml 4.12, rather than relying on the spawn package:

https://github.com/ocaml/ocaml-lsp/blob/672ca1bfbb816ff6b5f07a3f3abdecb66f982d8c/ocaml-lsp-server/src/ocaml_lsp_server.ml#L900

It may be easier to rewrite ocaml-lsp to not use janestreet/spawn, and then try to observe whether the bug is gone or still there, than to diagnose an issue (or absence of an issue) in janestreet/spawn.
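
For concreteness, here is a rough C-level sketch of what "posix_spawn instead of fork()" means. This is not the runtime's actual implementation (see #9573 and the unix_spawn stub for that); the point is only that no user-level fork() happens, so pthread_atfork handlers such as the one calling caml_thread_reinitialize are not run (at least with the macOS and glibc implementations):

/* Illustrative sketch only; the OCaml runtime's real code differs. */
#include <spawn.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

int main(void)
{
  pid_t pid;
  char *argv[] = { "echo", "hello from the child", NULL };

  /* The child is created and exec'ed in one operation; pthread_atfork
     handlers are not expected to run in between, so the parent's thread
     machinery is left untouched. */
  int rc = posix_spawnp(&pid, "echo", NULL /* file actions */,
                        NULL /* attributes */, argv, environ);
  if (rc != 0) {
    fprintf(stderr, "posix_spawnp failed with error %d\n", rc);
    return 1;
  }
  waitpid(pid, NULL, 0);
  return 0;
}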

@zbaylin
Author

zbaylin commented Oct 5, 2023

Thanks @gasche -- the combination of fork and threads also raised some red flags with me, but after looking at the code for a bit it seemed safe: a) this is a pretty standard fork -> exec scheme, and b) there is very little (if any) interaction between the child process and the runtime, which I assume would be the source of any bugs here. Nevertheless, I do think it's worth replacing spawn with something else in ocaml-lsp to see whether that fixes the issue.

Would you prefer that I close this for now until I'm able to gather more evidence?

@zbaylin
Author

zbaylin commented Oct 5, 2023

posix_spawn instead of fork() (since OCaml 4.12)

I'm also not quite sure what you mean by this -- are you suggesting that ocaml-lsp use Unix.create_process, for example? It does seem like that implementation uses posix_spawn:

CAMLprim value unix_spawn(value executable, /* string */

I assume the "since OCaml 4.12" comment refers to #9573?

@gasche
Member

gasche commented Oct 5, 2023

My understanding:

  • currently ocaml-lsp uses the janestreet/spawn third-party library, which implements spawn on top of fork()
  • it could use functions from the Unix module directly, which (since OCaml 4.12) use posix_spawn instead of fork()

I agree with you that the code of janestreet/spawn seems okay (but it is hard to tell). It may be that there is a bug in the runtime (the trace suggests a combination of threads, fork() and channel mutexes), and that avoiding fork() would merely work around the issue -- if it works at all.

@xavierleroy
Contributor

Yes, this looks like a bug in OCaml 4.14. The mutexes that protect I/O channels are being destroyed in the child process after fork() even though they may still be locked, which is an error according to the POSIX standard and seems to be turned into a fatal error by the pthreads library. Interestingly, this part of the code was rewritten in OCaml 5 (by @Engil if I remember correctly) and looks much safer. We should look into backporting the OCaml 5 solution to 4.14.

@xavierleroy xavierleroy added the bug label Oct 6, 2023
@abbysmal
Contributor

abbysmal commented Oct 9, 2023

@xavierleroy are you looking into implementing this?

I do remember the struggles with getting dune and the after-fork() semantics right on Multicore, and I may have forgotten to backport a few natural improvements (without rewriting the whole thing).

I can give implementing this a shot if you are not actively working on it, or I can offer a review otherwise.

@abbysmal
Contributor

abbysmal commented Oct 9, 2023

I spent some time investigating the issue today.

First thing: I could not reproduce the issue on my machine (MacBook Air, M2 chip, macOS Sonoma, running the LSP on 4.14.1 within VS Code).
I did try to write an individual test case for it (following Xavier's suspicion that the I/O channel mutexes are involved), but to no avail yet.
Are there more specific directions on how VS Code is used to trigger the bug?

Second thing:
I recognise that the most recent developments in systhreads were made by @gadmm.
The most recent version of caml_thread_reinitialize does not handle IO mutexes:

  /* FIXME: The same should be done (or not) for IO mutexes (as in the
     pre-multicore world). However there they might be locked by
     someone else, and we are probably in an inconsistent state. */

This is in contrast with what is done in 4.14, where all such mutexes are destroyed.

Now, destroying these potentially locked mutexes is a programming error (and undefined behaviour); the question is whether ignoring them is any better.

I will keep trying to get a reproduction; it seems odd to me that this has only been reported once.

@xavierleroy
Contributor

@xavierleroy are you looking into implementing this?

I can try to backport some of the OCaml 5 improvements, maybe later this week, maybe the week after.

It's nice of you to try to reproduce the issue.

Now, destroying these potentially locked mutexes is a programming error (and undefined behaviour); the question is whether ignoring them is any better.

Ignoring means that in the child process, some I/O channels may remain locked forever, causing a deadlock if the child process tries to use them. I don't have a good mental picture of how likely this is to happen.
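
A small illustrative sketch of that deadlock (not runtime code; the channel mutex is just a plain pthread mutex here):

/* Illustrative sketch only. After fork(), only the forking thread exists in
   the child, so a mutex that another thread held at fork time can never be
   unlocked there unless it is destroyed or reinitialized. */
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t chan_mutex = PTHREAD_MUTEX_INITIALIZER;

static void *locker(void *arg)
{
  (void)arg;
  pthread_mutex_lock(&chan_mutex);   /* still held when the fork happens */
  pause();
  return NULL;
}

int main(void)
{
  pthread_t t;
  pthread_create(&t, NULL, locker, NULL);
  usleep(100 * 1000);

  if (fork() == 0) {
    /* If the child then tries to use the "channel", it blocks forever:
       nobody in the child can ever release chan_mutex. */
    pthread_mutex_lock(&chan_mutex);
    _exit(0);                        /* never reached */
  }
  return 0;                          /* demo only: the blocked child must be killed by hand */
}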

@gadmm
Contributor

gadmm commented Oct 9, 2023

There might be issues with the OCaml 5 implementation as well (e.g. caml_reset_domain_lock). Cross-ref: #12399, in case someone figures out the best course of action.

@zbaylin
Author

zbaylin commented Oct 9, 2023

Are there more specific directions on how VS Code is used to trigger the bug?

I was actually using Neovim, but the client shouldn't matter for a repro. Unfortunately, I've only been able to trigger this using our relatively large proprietary code base, and not with any open-source library I've tried. I've spent a lot of time this past week trying to do so, to no avail. I'll keep trying and will update here if something changes.

If, in the meantime, you want me to collect more information about the crash I can reproduce (either from a debugger or elsewhere), I'd be more than happy to provide it.

@xavierleroy
Contributor

Also: another big improvement in OCaml 5 is that it no longer uses pthread_atfork, so the "reset thread machinery" function is called only if forking happens in OCaml (with Unix.fork), but not if it happens in C, as is the case in OCaml-LSP (see the trace above). In other words, just removing the use of pthread_atfork should fix the present issue, even if we keep destroying mutexes haphazardly.
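
Roughly, the shape of that change at the C level. The names below are placeholders and not the identifiers used by the actual patch; the point is that the reset function becomes a hook that only the runtime's own fork primitive invokes:

/* Placeholder names; not the identifiers used in the real runtime patch. */
#include <stddef.h>
#include <unistd.h>

/* Instead of registering a handler with pthread_atfork(), the threads
   library installs a function pointer that only the runtime knows about. */
void (*caml_atfork_hook_example)(void) = NULL;

/* Called from the OCaml-level fork primitive (Unix.fork) and only there. */
pid_t runtime_fork_example(void)
{
  pid_t pid = fork();
  if (pid == 0 && caml_atfork_hook_example != NULL)
    caml_atfork_hook_example();      /* reset the thread machinery in the child */
  return pid;
}

/* A fork() performed directly by C code (as in spawn_stubs.c) no longer
   triggers any reset, because nothing is registered with pthread_atfork(). */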

xavierleroy pushed a commit to xavierleroy/ocaml that referenced this issue Oct 10, 2023
This commit introduces an atfork hook in the runtime, as a follow-up to the issues encountered in MPR ocaml#455 and MPR ocaml#471.

As a result, the usage of pthread_atfork is also removed, opting to rely in both cases on manually calling the runtime hook when required.  This way, fork() calls from C code no longer reset the OCaml thread machinery.  This should help with issues such as ocaml#12636.

(cherry picked from commit c6d00ee)
(and adapted to 4.14)

Co-authored-by: xavier.leroy@college-de-france.fr
xavierleroy added a commit to xavierleroy/ocaml that referenced this issue Oct 10, 2023
Some of these mutexes may be locked, so this is undefined behavior,
and can cause fatal errors in the C thread library (ocaml#12636).

Leaving IO channel mutexes unchanged is what OCaml 5 does.
@xavierleroy
Contributor

Some plausible fixes at #12646. But as long as we cannot reproduce the corruption, it's hard to tell.

@zbaylin
Author

zbaylin commented Oct 11, 2023

I was able to find somewhat of a reproducible case, but I'm not sure if others will be able to reproduce it. I've been working on a small new project here: https://github.com/skolemlabs/diff/tree/initial and noticed that the same issue was present. The only somewhat reliable way I've been able to reproduce this is by spamming the file with edits (i.e. line deletions, etc.) that cause errors. After a while, the LSP diagnostics stop updating and the "crash" presents itself. If anyone is still trying to replicate this, give that a try and see if it works.

Let me know if there's anything else I can do to help.

@voodoos
Contributor

voodoos commented Oct 17, 2023

@zbaylin I tried to reproduce this, to no avail, using macOS Sonoma on an M1 Pro, OCaml 4.14.1, and ocaml-lsp 0.26.1.
I was not able to get ocaml-lsp to crash by spamming edits in the various files of the project.

@abbysmal
Contributor

@zbaylin by any chance, does the patchset at #12646 improve your situation?

xavierleroy added a commit to xavierleroy/ocaml that referenced this issue Oct 17, 2023
…stroying them

Some of these mutexes may be locked, so both destroying and
reinitializing are undefined behaviors.  However, destroying can cause
fatal errors in the C pthread library (ocaml#12636), while reinitializing
should not.
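
In other words, the child overwrites each possibly-locked channel mutex with a fresh one instead of destroying it. An illustrative sketch with placeholder types (the real patch in #12646 operates on the runtime's actual channel structure):

/* Placeholder types; not the actual patch from #12646. */
#include <pthread.h>
#include <stddef.h>

struct channel_example {             /* stand-in for the runtime's channel */
  pthread_mutex_t *mutex;
  struct channel_example *next;
};

void reset_channel_mutexes(struct channel_example *all_channels)
{
  for (struct channel_example *chan = all_channels;
       chan != NULL;
       chan = chan->next) {
    if (chan->mutex != NULL) {
      /* Reinitialize the possibly-locked mutex in place instead of calling
         pthread_mutex_destroy() on it.  Per the commit message this is
         still undefined behaviour if the mutex was locked, but it does not
         trip the fatal error in the pthread library, and the channel stays
         usable in the child. */
      pthread_mutex_init(chan->mutex, NULL);
    }
  }
}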
@zbaylin
Author

zbaylin commented Oct 18, 2023

@Engil I'm working on doing that now -- will report back when I'm able to test it.

@zbaylin
Author

zbaylin commented Oct 18, 2023

From my preliminary usage, it seems like using the branch at #12646 fixes the lock-up/crash. I'll keep using it and report back if I see anything different.

xavierleroy added a commit to xavierleroy/ocaml that referenced this issue Oct 26, 2023
…stroying them

Some of these mutexes may be locked, so both destroying and
reinitializing are undefined behaviors.  However, destroying can cause
fatal errors in the C pthread library (ocaml#12636), while reinitializing
should not.

Fixes: ocaml#12636
xavierleroy pushed a commit to xavierleroy/ocaml that referenced this issue Oct 26, 2023
This commit introduces an atfork hook in the runtime, as a follow-up to the issues encountered in MPR ocaml#455 and MPR ocaml#471.

As a result, the usage of pthread_atfork is also removed, opting to rely in both cases on manually calling the runtime hook when required.  This way, fork() calls from C code no longer reset the OCaml thread machinery.  This should help with issues such as ocaml#12636.

(cherry picked from commit c6d00ee)
(and adapted to 4.14)

Co-authored-by: xavier.leroy@college-de-france.fr
xavierleroy added a commit to xavierleroy/ocaml that referenced this issue Oct 26, 2023
…stroying them

Some of these mutexes may be locked, so both destroying and
reinitializing are undefined behaviors.  However, destroying can cause
fatal errors in the C pthread library (ocaml#12636), while reinitializing
should not.

Fixes: ocaml#12636
@xavierleroy xavierleroy added this to the 4.14.2 milestone Oct 26, 2023
@xavierleroy
Contributor

Should be fixed by #12646. Please comment if the problem remains in 4.14.2.
