Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix assertion race condition in #11800 #12707

Merged
merged 2 commits into from Oct 31, 2023
Merged

Conversation

jmid
Copy link
Contributor

@jmid jmid commented Oct 31, 2023

This PR fixes the race condition reported in #11800
We have also encountered it twice in multicoretests in ocaml-multicore/multicoretests#386 and ocaml-multicore/multicoretests#402.

After a fair amount of head scratching, I realized that performing two consecutive atomic reads in the backup-thread state machine assertion

di->backup_thread_msg == BT_INIT || di->backup_thread_msg == BT_TERMINATE

is a bad idea to execute in parallel with backup_thread_func that performs a state machine transition from BT_TERMINATE to BT_INIT

atomic_store_release(&di->backup_thread_msg, BT_INIT);

The reason is that there is a microscopic chance

  • that di->backup_thread_msg is first BT_TERMINATE causing di->backup_thread_msg == BT_INIT to be false,
  • then the field is updated to BT_INIT in parallel by backup_thread_func, and
  • finally di->backup_thread_msg == BT_TERMINATE is also false!

This also explains why the bug has been close to impossible to reproduce.
Overall, it thus amounts to a buggy assertion and the fix is straightforward.

To confirm the above hypothesis, I first inserted a strategic thrd_sleep in the assertion:

@@ -1034,6 +1034,13 @@ static void* backup_thread_func(void* v)
   return 0;
 }
 
+static int sleep_false()
+{
+  /* sleep for 50 ms */
+  thrd_sleep(&(struct timespec){.tv_sec=0, .tv_nsec=050000000}, NULL);
+  return 0;
+}
+
 static void install_backup_thread (dom_internal* di)
 {
   int err;
@@ -1043,7 +1050,8 @@ static void install_backup_thread (dom_internal* di)
 
   if (di->backup_thread_running == 0) {
     CAMLassert (di->backup_thread_msg == BT_INIT || /* Using fresh domain */
-            di->backup_thread_msg == BT_TERMINATE); /* Reusing domain */
+               sleep_false () || /* Delay window */
+               di->backup_thread_msg == BT_TERMINATE); /* Reusing domain */
 
     while (atomic_load_acquire(&di->backup_thread_msg) != BT_INIT) {
       /* Give a chance for backup thread on this domain to terminate */

I then found that I could reliably trigger the bug on 5/5 runs on a torture test from multicoretests performing lots of
spawns and joins when run with a reduced minor heap:

$ OCAMLRUNPARAM="s=4096" _build/default/src/domain/domain_joingraph.exe -v
### OCaml runtime: debug mode ###
random seed: 513884652
generated error fail pass / total     time test name
[✓]  100    0    0  100 /  100    15.6s Domain.spawn/join - tak work
[ ]  107    0    0  107 /  500    57.6s Domain.spawn/join - atomic[14] file runtime/domain.c; line 1052 ### Assertion failed: di->backup_thread_msg == BT_INIT || sleep_false () || di->backup_thread_msg == BT_TERMINATE
Aborted (core dumped)

I then inserted the patch with the thrd_sleep still in place, and reran the above test to confirm that 0 out of 10 consecutive runs would now trigger the assertion failure.

@gasche
Copy link
Member

gasche commented Oct 31, 2023

@jmid are you planning to write a Changes entry? The fix is simple but the hunt was not so easy. Your choice -- I would put it in the "Bug fixes" section.

@jmid
Copy link
Contributor Author

jmid commented Oct 31, 2023

Thanks Gabriel - also for your notes in #11800 and #11801 🙏
I've now added a Changes entry.

@gasche gasche merged commit 8cc728b into ocaml:trunk Oct 31, 2023
9 of 10 checks passed
@jmid jmid deleted the domain-assert-failure branch October 31, 2023 15:25
@gasche
Copy link
Member

gasche commented Oct 31, 2023

I focused on the explanation and the technical details, and I forgot to say that I found the hunt impressive. This was indeed a fairly tricky issue to diagnose!

@dra27
Copy link
Member

dra27 commented Nov 21, 2023

@Octachron - @jmid is continuously testing the released compiler, so there is use in this being back-ported to 5.1.1. May I do that?

@dra27 dra27 added this to the 5.1.1 milestone Nov 21, 2023
@Octachron
Copy link
Member

I am fine with cherry-picking to 5.1.

dra27 pushed a commit to dra27/ocaml that referenced this pull request Dec 6, 2023
Fix assertion race condition in ocaml#11800

(cherry picked from commit 8cc728b)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants