Possible corruption bug in caml_master_lock in 4.14.1 on macOS #12636
Comments
Your stack trace seems to suggest that the process you are using is mixing both threads and The This may or may not be the source of the bug you are observing, but I would say that the post-forking resource logic in that C code is likely to be involved in the bug. My intuition is thus that the bug should first be reported to the authors of the |
Note: maybe ocaml-lsp should consider using the implementation of It may be easier to rewrite ocaml-lsp to not use janestreet/spawn, and then try to observe whether the bug is gone or still there, than to diagnose an issue (or absence of an issue) in janestreet/spawn. |
Thanks @gasche -- the combination of
It seems to me that a) this is a pretty standard Would you prefer that I close this for now until I'm able to gather more evidence? |
I'm also not quite sure what you mean by this -- are you suggesting that ocaml-lsp use Line 31 in 49bff4c
I assume the "since OCaml 4.12" comment refers to #9573? |
My understanding:
I agree with you that the code of janestreet/spawn seems okay (but it is hard to tell). It may be that there is a bug in the runtime (the trace suggests a combination of threads, |
Yes, this looks like a bug in OCaml 4.14. The mutexes that protect I/O channels are being destroyed in the child process after fork() even though they can be locked, which is an error according to the POSIX standard and seems to be turned into a fatal error by the pthreads library. Interestingly, this part of the code was rewritten in OCaml 5 (by @Engil if I remember correctly) and looks much safer. We should look into backporting the OCaml 5 solution to 4.14. |
@xavierleroy are you looking into implementing this? I do remember the struggles with getting I can give a shot at implementing this, if you are not actively working on it, or can offer review otherwise. |
I spent some time investigating the issue today. First thing: I could not get a reproduction of the issue on my machine (MacBook Air, M2 chip, macOS Sonoma, running LSP on 4.14.1 within VS Code). Second thing:
This is in contrast with what is done in 4.14, which does destroy all such mutexes. Now, destroying these potentially locked mutexes is a programming error (and is UB); the question is whether ignoring them is better or not. I will keep trying to get a reproduction; it seems odd to me that this was only reported once. |
I can try to backport some of the OCaml 5 improvements, maybe later this week, maybe the week after. It's nice of you to try to reproduce the issue.
Ignoring means that in the child process, some I/O channel may remain locked forever, causing a deadlock if the child process tries to use them. I don't have a good mental picture of how likely this is to happen. |
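To make the "locked forever" case concrete, here is a minimal C sketch, not taken from the OCaml runtime. It assumes a single hypothetical mutex (`chan_mutex`) standing in for an I/O channel lock and a hypothetical helper thread (`hold_lock`) that holds it while the main thread forks. The child inherits the mutex in its locked state, but the thread that could unlock it does not exist there. Calling `pthread_mutex_trylock` in a forked child of a threaded process is not formally async-signal-safe, so treat this as an illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a channel mutex; not the runtime's real data. */
static pthread_mutex_t chan_mutex = PTHREAD_MUTEX_INITIALIZER;
static atomic_int locked = 0;   /* set once the helper holds the mutex */
static atomic_int release = 0;  /* set to let the helper unlock and exit */

/* Helper thread: holds the mutex while the main thread forks. */
static void *hold_lock(void *arg) {
  (void)arg;
  int rc = pthread_mutex_lock(&chan_mutex);
  assert(rc == 0);
  atomic_store(&locked, 1);
  while (!atomic_load(&release)) sched_yield();
  pthread_mutex_unlock(&chan_mutex);
  return NULL;
}

/* Forks while another thread holds chan_mutex. Returns what
   pthread_mutex_trylock reports in the child (0 or an errno value),
   or -1 if thread creation, fork, or wait failed. */
int trylock_result_in_child(void) {
  pthread_t t;
  int status = 0;
  atomic_store(&locked, 0);
  atomic_store(&release, 0);
  if (pthread_create(&t, NULL, hold_lock, NULL) != 0) return -1;
  while (!atomic_load(&locked)) sched_yield();

  pid_t pid = fork();
  if (pid == 0) {
    /* Only the forking thread exists here; the lock holder is gone,
       so nobody in this process will ever unlock chan_mutex. */
    _exit(pthread_mutex_trylock(&chan_mutex));
  }
  atomic_store(&release, 1);
  pthread_join(t, NULL);
  if (pid < 0 || waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) return -1;
  return WEXITSTATUS(status);
}
```

On the systems I would expect this to run on, `trylock_result_in_child()` reports `EBUSY`; a blocking `pthread_mutex_lock` in the child would instead hang, which is the deadlock described above.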
There might be issues with the OCaml 5 implementation as well (e.g. |
I was actually using neovim, but the client shouldn't matter for a repro. Unfortunately I've only been able to trigger this using our relatively large proprietary code base and not on any open source library I've tried. I've spent a lot of time this past week trying to do so to no avail. I'll keep trying and will update here if something changes. If, in the meantime, you want me to collect some more information about the crash I can reproduce (either from a debugger or elsewhere) I'd be more than happy to provide it. |
Also: another big improvement in OCaml 5 is that it no longer uses |
This commit introduces an atfork hook in the runtime, as a follow-up to the issue encountered in MPR ocaml#455 and MPR ocaml#471. As a result, the usage of pthread_atfork is also removed, opting to rely in both cases on manually calling the runtime hook when required. This way, fork() calls from C code no longer reset the OCaml thread machinery. This should help with issues such as ocaml#12636. (cherry picked from commit c6d00ee) (and adapted to 4.14) Co-authored-by: xavier.leroy@college-de-france.fr
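The difference between a handler registered with pthread_atfork and a hook that the runtime calls itself can be sketched in C as follows. This is only an illustration of the idea in the commit message, under assumed names: `runtime_atfork_hook`, `runtime_fork`, and `reset_thread_machinery` are hypothetical and are not the identifiers used in the OCaml runtime. The point is that the hook runs only when the fork goes through the runtime's own wrapper, so a plain `fork()` issued by unrelated C code leaves the thread machinery untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical hook slot; the real runtime's names may differ. */
static void (*runtime_atfork_hook)(void) = NULL;
static int hook_calls = 0;

/* Stand-in for the code that resets thread state in the child. */
static void reset_thread_machinery(void) { hook_calls++; }

/* Fork wrapper used by the runtime: the only place the hook runs. */
pid_t runtime_fork(void) {
  pid_t pid = fork();
  if (pid == 0 && runtime_atfork_hook != NULL) runtime_atfork_hook();
  return pid;
}

/* Forks using do_fork and returns how many times the hook ran in the
   child, or -1 if fork or wait failed. */
int hook_calls_seen_by_child(pid_t (*do_fork)(void)) {
  int status = 0;
  assert(do_fork != NULL);
  runtime_atfork_hook = reset_thread_machinery;
  hook_calls = 0;
  pid_t pid = do_fork();
  if (pid == 0) _exit(hook_calls);  /* child reports the count */
  if (pid < 0 || waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) return -1;
  return WEXITSTATUS(status);
}
```

Calling `hook_calls_seen_by_child(runtime_fork)` exercises the wrapper, while `hook_calls_seen_by_child(fork)` models a fork made directly from C code, where the hook never runs.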
Some of these mutexes may be locked, so this is undefined behavior, and can cause fatal errors in the C thread library (ocaml#12636). Leaving IO channel mutexes unchanged is what OCaml 5 does.
Some plausible fixes at #12646. But as long as we cannot reproduce the corruption, it's hard to tell. |
I was able to find a somewhat reproducible case, but I'm not sure whether others will be able to reproduce it. I've been working on a small new project here https://github.com/skolemlabs/diff/tree/initial and noticed that the same issue was present. The only somewhat reliable way I've been able to reproduce this is by spamming the file with edits (e.g. line deletions) that cause errors. After a while, the LSP diagnostics don't change and the "crash" presents itself. If anyone is still trying to replicate this, try that and see if it works. Let me know if there's anything else I can do to help. |
@zbaylin I tried to reproduce to no avail, using macOS Sonoma on an M1 Pro, OCaml 4.14.1, ocaml-lsp 0.26.1. |
…stroying them Some of these mutexes may be locked, so both destroying and reinitializing are undefined behaviors. However, destroying can cause fatal errors in the C pthread library (ocaml#12636), while reinitializing should not.
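A rough C sketch of the "reinitialize instead of destroy" approach described in this commit message, under stated assumptions: the array `chan_mutex`, the helper thread `hold_lock`, and `reinit_channel_mutexes` are hypothetical stand-ins, not the code in st_stubs.c. As the message says, reinitializing a mutex that may be locked is still undefined behavior under POSIX, so this relies on how common pthread implementations happen to behave rather than on any guarantee.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NCHAN 3
/* Hypothetical stand-ins for per-channel mutexes. */
static pthread_mutex_t chan_mutex[NCHAN] = {
  PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
  PTHREAD_MUTEX_INITIALIZER
};
static atomic_int locked = 0, release = 0;

/* Helper thread: holds one channel mutex while the main thread forks. */
static void *hold_lock(void *arg) {
  (void)arg;
  int rc = pthread_mutex_lock(&chan_mutex[1]);
  assert(rc == 0);
  atomic_store(&locked, 1);
  while (!atomic_load(&release)) sched_yield();
  pthread_mutex_unlock(&chan_mutex[1]);
  return NULL;
}

/* Child-side cleanup: reinitialize every channel mutex, never destroy. */
static void reinit_channel_mutexes(void) {
  for (int i = 0; i < NCHAN; i++) pthread_mutex_init(&chan_mutex[i], NULL);
}

/* Returns how many channel mutexes the child could lock after
   reinitializing them, or -1 if thread creation, fork, or wait failed. */
int lockable_in_child_after_reinit(void) {
  pthread_t t;
  int status = 0;
  atomic_store(&locked, 0);
  atomic_store(&release, 0);
  if (pthread_create(&t, NULL, hold_lock, NULL) != 0) return -1;
  while (!atomic_load(&locked)) sched_yield();

  pid_t pid = fork();
  if (pid == 0) {
    int ok = 0;
    reinit_channel_mutexes();
    for (int i = 0; i < NCHAN; i++)
      if (pthread_mutex_trylock(&chan_mutex[i]) == 0) ok++;
    _exit(ok);
  }
  atomic_store(&release, 1);
  pthread_join(t, NULL);
  if (pid < 0 || waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) return -1;
  return WEXITSTATUS(status);
}
```

If the child can take all `NCHAN` locks after the reinitialization, including the one that was held at fork time, it can keep using its channels without the deadlock that simply ignoring the mutexes would risk.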
@Engil working on doing that now -- will report back when I'm able to test it. |
From my preliminary usage, it seems like using the branch at #12646 fixes the lock/crash/etc. I'll keep using it and report back if I see anything different. |
…stroying them Some of these mutexes may be locked, so both destroying and reinitializing are undefined behaviors. However, destroying can cause fatal errors in the C pthread library (ocaml#12636), while reinitializing should not. Fixes: ocaml#12636
Should be fixed by #12646. Please comment if the problem remains in 4.14.2. |
A couple of days ago, I upgraded from macOS 13 to macOS 14 Sonoma on my M1 MacBook Pro. After doing so, I noticed that my LSP server would seemingly randomly stop working and would spin, causing 100% CPU usage.
I opened an issue (ocaml/ocaml-lsp#1194) thinking it was maybe a bug in the LSP server itself, but upon further investigation it appears that it might be a bug in the runtime.
Unfortunately, I haven't been able to find an easily reproducible example for this, but I'll continue to try to do so and outline my investigation here.
Once the LSP froze, I attached to it with lldb to see if I could glean some more info that way. Immediately after doing so, a breakpoint is hit (even though I hadn't set any):
The backtrace at this point looks like the following:
This seems to refer to code similar to this found in libplatform in the version of Darwin used by macOS 14: https://github.com/apple/darwin-libplatform/blob/215b09856ab5765b7462a91be7076183076600df/src/os/lock.c#L136
Dumping the register state here is also a bit informative:
It's unclear to me what specifically x11 and x12 are being used for here, but they both contain pointers inside caml_master_lock, and libplatform seems to believe the lock inside it is corrupt.
The backtrace also makes mention of caml_thread_reinitialize, which appears to be calling st_mutex_destroy over all open channels:
ocaml/otherlibs/systhreads/st_stubs.c
Lines 421 to 428 in 49bff4c
Unfortunately that's sort of where my investigation has ended -- I can't seem to figure out if the lock is actually even corrupt or not, let alone where that's happening if so.
If anyone has any pointers for where I should look next, please let me know. I'll work on making a reproducible example.