Refactor caml_c_thread_(un)register to fix various bugs #12834

gadmm · 2023-12-15T18:49:39Z

Another systhread rabbit-hole: caml_c_thread_(un)register. In order to make it better understandable, I refactored it in order to share code with thread_new, thread_start and thread_stop. Differences between thread_new+thread_start and c_thread_register are subtle and come down to when is the GC available to allocate the thread descriptor. ~~We could share even more code if we change a few details, I can share some idea if someone has the time to implement it.~~

I think several bugs are fixed by this refactor:

Fix TSan-reported race in tests/parallel/test_c_thread_register.ml #12773 : Missing restore (via @Johan511 & Fixes save_runtime_state logic in caml_thread_yield #12830)
Fix calling c_thread_register without having initialized systhreads
Fix invalid raising of exception and leak if tick thread is not created
Fix missing memprof_delete_threads
Fix situation where the main thread tries to stop before the registered C thread (?)
Fix missing signal stack

gadmm · 2023-12-15T18:52:42Z

Effort has been made to let this PR be reviewed commit-per-commit.

gadmm · 2023-12-17T14:51:41Z

Another discrepancy: in case of an uncaught exception, caml_fatal_uncaught_exception is called and the program stops. This is in contrast with regular systhreads: in case of an uncaught exception, the thread stops and an error is output on stderr, and Thread.set_uncaught_exception_handler is obeyed.

dustanddreams

Good work! This looks more readable to me, and should help preventing introducing new bugs or inconsistencies in the future.

I can confirm I can no longer reproduce the sporadic failure encountered in parallel/test_c_thread_register.ml on this branch.

gasche · 2023-12-18T14:31:46Z

@dustanddreams I'm thankful for your review (I was worried of ending up being the reviewer by lack of any other volunteer), but I am surprised that you managed to do a full review of this tricky code without asking any question / for any clarification or with no particular improvement to suggest. Is the code really that nice, or are you adverse to nitpicking?

dustanddreams · 2023-12-18T14:35:56Z

Is the code really that nice, or are you adverse to nitpicking?

The individual commits were simple enough and self-contained to be able to assess the impact of each of them.

gadmm · 2023-12-18T14:59:30Z

Thanks for the review.

It is indeed tricky code if we consider that we cannot rely on the current state of the code, so a global review could be useful in addition of a commit-per-commit review. Some of the bugs look big-if-true (e.g. missing altsigstack) so a helpful method @gasche could be for each bug listed, to confirm that there was a bug, and to convince oneself that it is fixed correctly in this PR. This would make a cross-review that would usefully increase confidence in this code.

gasche · 2023-12-21T05:46:18Z

otherlibs/systhreads/st_stubs.c

-  thread_lock_release(Dom_c_threads);
+  /* Release the domain lock the regular way. Note: we cannot receive
+     an exception here. */
+  caml_enter_blocking_section_no_pending();


At this point there are many ways offered to "enter blocking section" or "release the runtime lock", including this one or thread_lock_release but also caml_thread_enter_blocking_section. I'm a bit lost as to which does what. Can you explain why caml_enter_blocking_section(_no_pending) is the right one to use here, and maybe include this explanation in the comment on lines 709-710?

One thing that is confusing to me is that "thread lock" and "domain lock" are two notions that are not orthogonal but overlapping. In particular, if I understand correctly, thread_lock_release eventually calls caml_release_domain_lock (among other things).

Reading this code, I have no clear idea whether one of the following holds:

in the success path, the stuff that occurs before caml_enter_blocking_section sums up to the equivalent of what thread_lock_release would have done, so we end up with something that is equivalent to a well-bracketed acquisition and release of the thread lock, or

in the success path, the code is doing less than what thread_lock_release would have done, because the point is to keep ownership of something after the function returns

(My bet would be that (1) holds, that those caml_c_{register,unregister} functions are expected to be called from foreign threads that are not "active" in any respect with regard to the OCaml runtime (or equivalently, that can run concurrently with the runtime or OCaml mutators), and preserve this property on exit.)

Indeed the two functions must be intimate of the details of the relationship the thread lock and the domain lock. The goal is to construct and destruct an "OCaml" thread from scratch which is why it deals with the details of acquiring/releasing the lock. Out of the four locking operations, though, two should be straightforward. If the C thread is correctly attached to the domain, then one can call caml_enter_blocking_section at the end of having attached it, and caml_leave_blocking_section at the beginning of detaching it. Indeed the point is to have a thread with which caml_enter/leave_blocking_section are functional. I think the current comment meant to explain this is clear enough, but this might be more clear when reading the code directly rather than from the diff.

gasche · 2023-12-21T09:38:09Z

otherlibs/systhreads/st_stubs.c

+  caml_thread_t th = thread_alloc_and_add();
+  /* If it fails, we release the lock and return an error. */
+  if (th == NULL) goto out_err;
+  thread_init_current(th);


I am missing something here. My reading of the code suggest that thread_alloc_and_add returns thread info block with an uninitialized runtime state (all fields are 0/NULL), and then thread_init_current calls restore_runtime_state(th) which "restores" this unitialized runtime state into Caml_state, resulting in a domain in an incorrect state.

I would naively expect a new thread to somehow get a valid Caml_state (this could be done by caml_init_domain_self for example), and then start by saving this valid runtime state in the thread info block.

(Ah: maybe it is fine if the domain is in an invalid state, because we are giving up control so immediately another thread will be activated and restore its own (valid) state?)

This is a bit counter-intuitive, but the thread info is initialized with a valid runtime state indeed. I added a comment to clarify this in a new commit.

In reviewing this fact, I have found that I do not understand the initialization of th->trap_barrier_off which is in discrepancy with that in domain_create. This needs to be reviewed by someone knowledgeable before merging. (See the temporary TODO comment.)

(All this initialization between domain_create and caml_thread_new_info could be further factored, but I won't be doing this here.)

Sorry for piling up the commits, but upon the review you prompted, I fixed another issue with error handling in caml_thread_new_info, whereby the NULL return value of allocated memprof state was ignored.

gasche · 2023-12-21T09:45:51Z

This PR introduces a detach_from_runtime abstraction that I like, and in particular it makes the final code of unregister very simple. On the other hand, the current register implementation is not symmetric, it remains messy. Could we maybe introduce an attach_to_runtime function that would serve a symmetric purpose?

jmid · 2023-12-21T10:51:47Z

FYI, I've given this PR a spin on multicoretests as a stress test. It triggered

an instance of Crashes and locks under Windows MingW #12230 under MinGW bytecode
an instance of the fatal error from Do not fail too hard in domain_thread_func #12855 under MinGW native code

~~(none of which are related to this PR)~~ (actually I had a small hope of the PR fixing the former)
and otherwise held up nicely!

gadmm · 2023-12-21T18:04:41Z

Thanks everyone, I might get back to this in January! Have nice holidays.

gadmm · 2024-01-02T23:15:32Z

This PR introduces a detach_from_runtime abstraction that I like, and in particular it makes the final code of unregister very simple. On the other hand, the current register implementation is not symmetric, it remains messy. Could we maybe introduce an attach_to_runtime function that would serve a symmetric purpose?

It is factored as much as possible. (Well, I could move create_tick_thread into thread_alloc_and_add, but this would change the error message in the cases where creating the tick thread fails. It's up for discussion.)

There is an inextricable difference between the two ways to attach a new thread. It is essential in one that the thread descriptor is allocated before creating the thread, while in the other, it must be allocated while attaching the thread. The details of when the GC is available are tricky.

gadmm · 2024-01-07T20:14:51Z

This PR is a bugfix candidate for 5.2. Do you think this could arrive in time, or should I target 5.3 instead?

gasche

I read the code, made some suggestions, and haven't found any important issue. I am happy to approve this bugfix PR and move towards merging in 5.2 -- I will wait a bit in case @gadmm or others want to discuss further changes.

gasche · 2024-01-08T16:17:36Z

otherlibs/systhreads/st_stubs.c

@@ -298,12 +303,11 @@ static caml_thread_t caml_thread_new_info(void)

 #ifndef NATIVE_CODE
  th->trap_sp_off = 1;
-  th->trap_barrier_off = 2;
+  th->trap_barrier_off = 2; /* TODO: 0? trap_barrier_block? */
  th->external_raise = NULL;


@dustanddreams would you by chance be able to dive into this trap_barrier_off business and give an opinion on what would be best here?

For now I think that keeping the current behavior unchanged with the TODO question is okay, so we could merge the PR without clarification if useful to make progress, but it would of course be nicer to answer the question.

cc also @fabbing who presented an impressive detailed account of how exception handling works, hence they might be familiar with this already.

(The question is why is there a discrepancy—the only one—with domain_create here.)

I don't know why there is such a discrepancy here. I think caml_thread_new_info should definitely initialize trap_barrier_block to -1 here, and I also think setting trap_barrier_off to 0 rather than 2 in domain_create would make sense, unless this was done on purpose to prevent barriers from firing too early in a domain's life (in which case a comment explaining that rationale would have been nice).

Thanks a lot for your analysis.

I see that these fields were not there in OCaml 4. They are new, including this initialization code.

So it is worth fixing quickly. A fix would need a second person to review (not me). Hence I recommend fixing it separately and leave the TODO in the code (and create a new issue about it/add it to the systhread issue).

So it is worth fixing quickly. A fix would need a second person to review (not me). Hence I recommend fixing it separately and leave the TODO in the code (and create a new issue about it/add it to the systhread issue).

There should be two TODO then, one in caml_thread_new_info regarding the non-initialization of trap_barrier_block and one in domain_create regarding the initialization of trap_barrier_off.

I also think trap_barrier_block should be initialized to -1.
But, if my understanding is correct, trap_barrier_off should stay initialized to 2, to avoid triggering a trap when reaching the default exception handler (as guessed from https://github.com/gadmm/ocaml/blob/f00e977cc5779fa73c42a0db5bcb112de7f16237/runtime/debugger.c#L248). Also this constant is expected in this computation: https://github.com/gadmm/ocaml/blob/f00e977cc5779fa73c42a0db5bcb112de7f16237/runtime/interp.c#L157

@dustanddreams As far I can see, trap_barrier_off is already initialized to 0 inside domain_create. Did you mean something different?

Thanks @fabbing and @dustanddreams, I'll simply keep the TODO as is for now.

This issue is now recorded as part of #12399 with a link to the present discussion

Fix situation where the main thread tries to stop before the registered C thread (?)

gadmm · 2024-01-12T12:51:41Z

I just rebased.

Refactor caml_c_thread_(un)register to fix various bugs (cherry picked from commit 5348fa9)

gadmm mentioned this pull request Dec 15, 2023

Fixes save_runtime_state logic in caml_thread_yield #12830

Closed

gadmm force-pushed the refactor_thread_register branch from be8c8f3 to b3cd839 Compare December 15, 2023 20:43

gadmm changed the title ~~Refactor thread register to fix various bugs~~ Refactor caml_c_thread_(un)register to fix various bugs Dec 15, 2023

gadmm force-pushed the refactor_thread_register branch 2 times, most recently from 01a9c6e to a6682e6 Compare December 15, 2023 23:00

xavierleroy added bug runtime-system threads labels Dec 16, 2023

dustanddreams approved these changes Dec 18, 2023

View reviewed changes

gasche reviewed Dec 21, 2023

View reviewed changes

gadmm force-pushed the refactor_thread_register branch from 6225568 to f00e977 Compare January 2, 2024 23:08

gasche approved these changes Jan 8, 2024

View reviewed changes

gadmm added 7 commits January 12, 2024 13:42

Protect against registering c thread before systhreads init

7adb48a

Clarify effect of calling c_thread_register after c_thread_unregister

2dd5627

Refactor and fix thread initialization

5d7cdde

Refactor and fix tick thread creation

4d38247

Refactor and fix c_thread_unregister

ead8f58

Fix situation where the main thread tries to stop before the registered C thread (?)

Fix missing memprof_delete_threads

5d44566

Fix missing signal stack for registered C threads

75c7be1

gadmm added 4 commits January 12, 2024 13:50

[minor] Cosmetic clean-up

47ffd38

Clarify role of caml_thread_new_info

f1bcb43

Fix error handling in caml_thread_new_info

d0c1721

Changes

97c2ee7

gadmm force-pushed the refactor_thread_register branch from f00e977 to 97c2ee7 Compare January 12, 2024 12:51

gadmm mentioned this pull request Jan 12, 2024

Remaining issues in the OCaml 5 systhread implementation #12399

Open

5 tasks

gasche merged commit 5348fa9 into ocaml:trunk Jan 12, 2024
9 checks passed

gasche added a commit that referenced this pull request Jan 12, 2024

Merge pull request #12834 from gadmm/refactor_thread_register

34fbd7d

Refactor caml_c_thread_(un)register to fix various bugs (cherry picked from commit 5348fa9)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor caml_c_thread_(un)register to fix various bugs #12834

Refactor caml_c_thread_(un)register to fix various bugs #12834

gadmm commented Dec 15, 2023 •

edited

gadmm commented Dec 15, 2023

gadmm commented Dec 17, 2023

dustanddreams left a comment

gasche commented Dec 18, 2023

dustanddreams commented Dec 18, 2023

gadmm commented Dec 18, 2023

gasche Dec 21, 2023

gasche Dec 21, 2023

gadmm Jan 2, 2024 •

edited

gasche Dec 21, 2023

gasche Dec 21, 2023

gadmm Jan 2, 2024

gasche commented Dec 21, 2023

jmid commented Dec 21, 2023 •

edited

gadmm commented Dec 21, 2023

gadmm commented Jan 2, 2024

gadmm commented Jan 7, 2024 •

edited

gasche left a comment

gasche Jan 8, 2024

gadmm Jan 9, 2024

gadmm Jan 10, 2024

dustanddreams Jan 10, 2024

gadmm Jan 10, 2024

dustanddreams Jan 11, 2024

fabbing Jan 11, 2024 •

edited

gadmm Jan 12, 2024

gadmm Jan 12, 2024

gadmm Jan 12, 2024

gadmm commented Jan 12, 2024

Refactor caml_c_thread_(un)register to fix various bugs #12834

Refactor caml_c_thread_(un)register to fix various bugs #12834

Conversation

gadmm commented Dec 15, 2023 • edited

gadmm commented Dec 15, 2023

gadmm commented Dec 17, 2023

dustanddreams left a comment

Choose a reason for hiding this comment

gasche commented Dec 18, 2023

dustanddreams commented Dec 18, 2023

gadmm commented Dec 18, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gadmm Jan 2, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gasche commented Dec 21, 2023

jmid commented Dec 21, 2023 • edited

gadmm commented Dec 21, 2023

gadmm commented Jan 2, 2024

gadmm commented Jan 7, 2024 • edited

gasche left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fabbing Jan 11, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gadmm commented Jan 12, 2024

gadmm commented Dec 15, 2023 •

edited

gadmm Jan 2, 2024 •

edited

jmid commented Dec 21, 2023 •

edited

gadmm commented Jan 7, 2024 •

edited

fabbing Jan 11, 2024 •

edited