Skip to content

Writing custom events causes GC deadlocks #12897

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
talex5 opened this issue Jan 15, 2024 · 4 comments · Fixed by #12900
Closed

Writing custom events causes GC deadlocks #12897

talex5 opened this issue Jan 15, 2024 · 4 comments · Fixed by #12900

Comments

@talex5
Copy link
Contributor

talex5 commented Jan 15, 2024

On OCaml 5.1.1, this program hangs very quickly for me:

let unit =
  let encode _buf () = Gc.minor (); 0 in
  let decode _buf _len = () in
  Runtime_events.Type.register ~encode ~decode

type Runtime_events.User.tag += My_event
let my_event = Runtime_events.User.register "event" My_event unit

let run () =
  for i = 1 to 1_000_000 do
    Printf.printf "%d: %d\n%!" (Domain.self () :> int) i;
    Runtime_events.User.write my_event ();
  done

let () =
  Runtime_events.start ();
  let other = Domain.spawn run in
  run ();
  Domain.join other

The output is typically:

1: 1                                   
0: 1
1: 2
[hangs]

This is a simplified version of what happens when tracing Eio's HTTP benchmark. The Gc.minor call in encode doesn't seem to be necessary (caml_runtime_events_user_write calls alloc_and_clear_stack_parent, which allocates), but triggering the bug without it takes longer.

I think what is happening is:

  1. Domain 0 takes write_buffer_lock, then decides to do a GC while holding it. It waits for domain 1 to be ready.
  2. Domain 1 tries to take write_buffer_lock

(I'm not sure why the C code is managing the (single) write buffer. It would probably be safer to let the OCaml code deal with that and just call the events code with the buffer already filled in.)

/cc @sadiqj @OlivierNicole

@gasche
Copy link
Member

gasche commented Jan 15, 2024

I wondered how other locking code avoids this issue. The way that runtime/io.c manages its per-channel locks is: first try_lock the buffer, and if that fails release the runtime lock (that is, enter a blocking section) and lock again. While the main thread has released the runtime lock, the backup thread will service GC requests, so this deadlock situation will not happen.
(There is also special logic to also unlock the channel if an exception is raised instead of normal return, that is actually implemented in caml_raise. There is a tight coupling between the ability to lock safely within the runtime and the rest of the runtime.)

@gasche
Copy link
Member

gasche commented Jan 15, 2024

I think that we should document somewhere in the runtime code that locks cannot be safely used across OCaml callbacks due to this type of issue. I looked at the other usage of caml_plat_lock in the runtime and they seem safe -- most of them are for short-lived sections inside the runtime itself.

@gasche
Copy link
Member

gasche commented Jan 15, 2024

One way to avoid the issue, while remaining on the C side, would be to use caml_ml_mutex_lock from runtime/lock.c, which has the same try-or-enter-blocking-section logic as channel mutexes. (But then we would need to ensure that the critical section does not raise, which I am not sure is the case due to its use of caml_alloc_string)

@gasche
Copy link
Member

gasche commented Jan 15, 2024

@talex5: I propose a fix in #12900. Would you by chance be able to test it?

The fix does not move the whole logic to the OCaml side as you suggested (it would be more invasive and I was too lazy to do it), only the handling of the write buffer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants