-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor caml_c_thread_(un)register to fix various bugs #12834
Conversation
Effort has been made to let this PR be reviewed commit-per-commit. |
be8c8f3
to
b3cd839
Compare
01a9c6e
to
a6682e6
Compare
Another discrepancy: in case of an uncaught exception, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work! This looks more readable to me, and should help preventing introducing new bugs or inconsistencies in the future.
I can confirm I can no longer reproduce the sporadic failure encountered in parallel/test_c_thread_register.ml
on this branch.
@dustanddreams I'm thankful for your review (I was worried of ending up being the reviewer by lack of any other volunteer), but I am surprised that you managed to do a full review of this tricky code without asking any question / for any clarification or with no particular improvement to suggest. Is the code really that nice, or are you adverse to nitpicking? |
The individual commits were simple enough and self-contained to be able to assess the impact of each of them. |
Thanks for the review. It is indeed tricky code if we consider that we cannot rely on the current state of the code, so a global review could be useful in addition of a commit-per-commit review. Some of the bugs look big-if-true (e.g. missing altsigstack) so a helpful method @gasche could be for each bug listed, to confirm that there was a bug, and to convince oneself that it is fixed correctly in this PR. This would make a cross-review that would usefully increase confidence in this code. |
thread_lock_release(Dom_c_threads); | ||
/* Release the domain lock the regular way. Note: we cannot receive | ||
an exception here. */ | ||
caml_enter_blocking_section_no_pending(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point there are many ways offered to "enter blocking section" or "release the runtime lock", including this one or thread_lock_release
but also caml_thread_enter_blocking_section
. I'm a bit lost as to which does what. Can you explain why caml_enter_blocking_section(_no_pending)
is the right one to use here, and maybe include this explanation in the comment on lines 709-710?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing that is confusing to me is that "thread lock" and "domain lock" are two notions that are not orthogonal but overlapping. In particular, if I understand correctly, thread_lock_release
eventually calls caml_release_domain_lock
(among other things).
Reading this code, I have no clear idea whether one of the following holds:
- in the success path, the stuff that occurs before
caml_enter_blocking_section
sums up to the equivalent of whatthread_lock_release
would have done, so we end up with something that is equivalent to a well-bracketed acquisition and release of the thread lock, or - in the success path, the code is doing less than what
thread_lock_release
would have done, because the point is to keep ownership of something after the function returns
(My bet would be that (1) holds, that those caml_c_{register,unregister}
functions are expected to be called from foreign threads that are not "active" in any respect with regard to the OCaml runtime (or equivalently, that can run concurrently with the runtime or OCaml mutators), and preserve this property on exit.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed the two functions must be intimate of the details of the relationship the thread lock and the domain lock. The goal is to construct and destruct an "OCaml" thread from scratch which is why it deals with the details of acquiring/releasing the lock. Out of the four locking operations, though, two should be straightforward. If the C thread is correctly attached to the domain, then one can call caml_enter_blocking_section at the end of having attached it, and caml_leave_blocking_section at the beginning of detaching it. Indeed the point is to have a thread with which caml_enter/leave_blocking_section are functional. I think the current comment meant to explain this is clear enough, but this might be more clear when reading the code directly rather than from the diff.
caml_thread_t th = thread_alloc_and_add(); | ||
/* If it fails, we release the lock and return an error. */ | ||
if (th == NULL) goto out_err; | ||
thread_init_current(th); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am missing something here. My reading of the code suggest that thread_alloc_and_add
returns thread info block with an uninitialized runtime state (all fields are 0/NULL), and then thread_init_current
calls restore_runtime_state(th)
which "restores" this unitialized runtime state into Caml_state
, resulting in a domain in an incorrect state.
I would naively expect a new thread to somehow get a valid Caml_state
(this could be done by caml_init_domain_self
for example), and then start by saving this valid runtime state in the thread info block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Ah: maybe it is fine if the domain is in an invalid state, because we are giving up control so immediately another thread will be activated and restore its own (valid) state?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit counter-intuitive, but the thread info is initialized with a valid runtime state indeed. I added a comment to clarify this in a new commit.
In reviewing this fact, I have found that I do not understand the initialization of th->trap_barrier_off
which is in discrepancy with that in domain_create
. This needs to be reviewed by someone knowledgeable before merging. (See the temporary TODO
comment.)
(All this initialization between domain_create and caml_thread_new_info could be further factored, but I won't be doing this here.)
Sorry for piling up the commits, but upon the review you prompted, I fixed another issue with error handling in caml_thread_new_info
, whereby the NULL return value of allocated memprof state was ignored.
This PR introduces a |
FYI, I've given this PR a spin on
|
Thanks everyone, I might get back to this in January! Have nice holidays. |
6225568
to
f00e977
Compare
It is factored as much as possible. (Well, I could move There is an inextricable difference between the two ways to attach a new thread. It is essential in one that the thread descriptor is allocated before creating the thread, while in the other, it must be allocated while attaching the thread. The details of when the GC is available are tricky. |
This PR is a bugfix candidate for 5.2. Do you think this could arrive in time, or should I target 5.3 instead? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read the code, made some suggestions, and haven't found any important issue. I am happy to approve this bugfix PR and move towards merging in 5.2 -- I will wait a bit in case @gadmm or others want to discuss further changes.
@@ -298,12 +303,11 @@ static caml_thread_t caml_thread_new_info(void) | |||
|
|||
#ifndef NATIVE_CODE | |||
th->trap_sp_off = 1; | |||
th->trap_barrier_off = 2; | |||
th->trap_barrier_off = 2; /* TODO: 0? trap_barrier_block? */ | |||
th->external_raise = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dustanddreams would you by chance be able to dive into this trap_barrier_off
business and give an opinion on what would be best here?
For now I think that keeping the current behavior unchanged with the TODO question is okay, so we could merge the PR without clarification if useful to make progress, but it would of course be nicer to answer the question.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc also @fabbing who presented an impressive detailed account of how exception handling works, hence they might be familiar with this already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(The question is why is there a discrepancy—the only one—with domain_create
here.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know why there is such a discrepancy here. I think caml_thread_new_info
should definitely initialize trap_barrier_block
to -1
here, and I also think setting trap_barrier_off
to 0
rather than 2
in domain_create
would make sense, unless this was done on purpose to prevent barriers from firing too early in a domain's life (in which case a comment explaining that rationale would have been nice).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for your analysis.
I see that these fields were not there in OCaml 4. They are new, including this initialization code.
So it is worth fixing quickly. A fix would need a second person to review (not me). Hence I recommend fixing it separately and leave the TODO in the code (and create a new issue about it/add it to the systhread issue).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it is worth fixing quickly. A fix would need a second person to review (not me). Hence I recommend fixing it separately and leave the TODO in the code (and create a new issue about it/add it to the systhread issue).
There should be two TODO then, one in caml_thread_new_info
regarding the non-initialization of trap_barrier_block
and one in domain_create
regarding the initialization of trap_barrier_off
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think trap_barrier_block
should be initialized to -1
.
But, if my understanding is correct, trap_barrier_off
should stay initialized to 2
, to avoid triggering a trap when reaching the default exception handler (as guessed from https://github.com/gadmm/ocaml/blob/f00e977cc5779fa73c42a0db5bcb112de7f16237/runtime/debugger.c#L248). Also this constant is expected in this computation: https://github.com/gadmm/ocaml/blob/f00e977cc5779fa73c42a0db5bcb112de7f16237/runtime/interp.c#L157
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dustanddreams As far I can see, trap_barrier_off
is already initialized to 0 inside domain_create
. Did you mean something different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @fabbing and @dustanddreams, I'll simply keep the TODO as is for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This issue is now recorded as part of #12399 with a link to the present discussion
Fix situation where the main thread tries to stop before the registered C thread (?)
f00e977
to
97c2ee7
Compare
I just rebased. |
Refactor caml_c_thread_(un)register to fix various bugs (cherry picked from commit 5348fa9)
Another systhread rabbit-hole:
caml_c_thread_(un)register
. In order to make it better understandable, I refactored it in order to share code with thread_new, thread_start and thread_stop. Differences between thread_new+thread_start and c_thread_register are subtle and come down to when is the GC available to allocate the thread descriptor.We could share even more code if we change a few details, I can share some idea if someone has the time to implement it.I think several bugs are fixed by this refactor:
save_runtime_state
logic incaml_thread_yield
#12830)memprof_delete_thread
s