Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplifications and fixes to multicore systhreads implementation (3/3) #11386

Merged
merged 9 commits into from
Jul 18, 2023

Conversation

gadmm
Copy link
Contributor

@gadmm gadmm commented Jul 1, 2022

Follow-up to #11271, #11385 and #11393 (cc @Engil, @kayceesrk).

Targets 5.1 (does not seem to fix important bugs). Details in the individual commit logs. Can you please have a look at the commit that removes dead code, in case this was unintended and hides a bug? I also added a FIXME regarding a behaviour from old systhreads that is not yet implemented in multicore (see comments below).

This PR lives on top of #11393, so only the 5 topmost commits belong to it.

No change entry needed.

@gadmm
Copy link
Contributor Author

gadmm commented Jul 3, 2022

I have rebased on top of #11393, so keep in mind that only the 5 topmost commits belong to this PR.

@@ -417,6 +437,10 @@ CAMLprim value caml_thread_initialize(value unit)
CAMLreturn(Val_unit);
}

/* Cleanup the thread machinery when the runtime is shut down. Joining the tick
thread take 25ms on average / 50ms in the worst case, so we don't do it on
program exit. (FIXME) */
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I restored this comment from old systhreads. However, this behaviour does not seem implemented yet in multicore.

@gadmm gadmm force-pushed the systhread_simpl_and_fixes3 branch 2 times, most recently from b2d65d7 to e5a8f1e Compare July 6, 2022 17:23
@gadmm
Copy link
Contributor Author

gadmm commented Jul 6, 2022

I rebased on top of #11385 and added some new minor fixes and clarifications. It still only contains less important fixes targeting 5.1. For clarity I convert it as a draft, we should review it when earlier PRs have been merged.

I added some more FIXMEs that I think are worth looking into.

(of course a round of precheck is welcome)

@gasche
Copy link
Member

gasche commented Jul 10, 2022

@gadmm
Copy link
Contributor Author

gadmm commented Jan 27, 2023

This should be rebased and made ready for review. While adapting to the new way systhread is initialized on every thread, I stumbled into a rabbit hole regarding error handling inside Domain.spawn and found other bugs with signals. I am not yet sure how to go about it, I just wanted to post an update.

@gadmm gadmm force-pushed the systhread_simpl_and_fixes3 branch from e5a8f1e to 9b109bf Compare March 6, 2023 15:13
@gadmm
Copy link
Contributor Author

gadmm commented Mar 6, 2023

I have rebased on top of trunk. This could do with a round of review. Instructions for reviewers:

  • This is best reviewed commit-per-commit, with the help of additional detail in each commit log.
  • If someone knows systhread well, please review that commit "Remove dead code" does not reveal a regression in functionality compared to OCaml 4 (code that became dead for wrong reasons).
  • A few FIXMEs are added, coming from my review of the OCaml 5 systhread implem. and are about possible regressions wrt. OCaml 4. I do not intend or expect to have to fix these FIXMEs in this PR.
    • Some FIXME concern possible changes/regressions wrt. the "best effort" to support fork on a single domain.
    • One FIXME is about the delay for joining the tick threads on shutdown (25ms on average). On OCaml 4 there was no wait. When reviewing the code, it appeared to me that a solution could be to interrupt the sleep system call in the tick thread using the self-pipe trick with pselect (edit: or simply using pthread_cond_timedwait).
  • One FIXME is added to avoid delaying this PR further: the FIXME in domain_thread_func is about proper error handling inside domain initialization. Domain initialization was moved from OCaml to C code at the end of 5.0 development, which has created new issues. Errors during domain initialization should be converted into exceptions; currently this probably crashes or deadlocks. As said in my previous comment, fixing this has revealed to be a bit of a rabbit hole, it is doable but is better done separately in order to make progress on this PR. However, reviewers are free to disagree and refuse the addition of a FIXME comment in the code (in which case we can discuss solutions here).

@gadmm gadmm marked this pull request as ready for review March 6, 2023 15:50
@gasche gasche self-assigned this Mar 6, 2023
@gasche
Copy link
Member

gasche commented Mar 6, 2023

Do we have a volunteer to review this PR? I have assigned myself as "person in charge of making sure we don't forget about it completely", but I would be happy to let someone else do the review.

@kayceesrk
Copy link
Contributor

kayceesrk commented May 27, 2023

I'm starting to review this PR, but I have a couple of difficulties in the way this PR and others like #11307 are organised which is making it harder for me to review the PR(s).

My main difficulty is that the PRs suggest many improvements that seem independent but are organised as several commits, each of which has to be read one after the other. The details of the improvements are in the commit messages. This has several downsides that make it harder to review such PRs.

  • On Github, the discussions on several unrelated improvements get mixed together in the same PR thread which makes it harder to follow along.
  • If some of the early improvements are OK, but the follow-ups need changes, the whole PR gets blocked. As a reviewer, I am not able to checkpoint my approval on individual commits. I'm forced to read the entire commit chain in order to see how these interact with each other.
  • IINM, this is also different from how OCaml compiler development operates today, which is a PR for every improvement that's independent.

Personally, I really find it hard to review PRs commit by commit.

May I recommend @gadmm to consider having separate PRs for each improvement so that it makes reviewing easier? Small improvements will go much faster with a PR per commit. If they are inter-related, please mention that it builds on top of another PR. Also, possibly consider marking the PRs that are targets in a dependence chain as drafts so that reviewer time is not wasted on reviewing PRs which may change based on whether the source PRs require changes. Reviewer time is precious and limited, and it would help if the PRs are organised to be reviewer friendly.

Copy link
Contributor

@kayceesrk kayceesrk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first commit looks good to me. I've left some comments for optional improvements.

out_err2:
pthread_mutex_destroy(&m->lock);
out_err:
return rc;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No change necessary, but noting that since we don't know which operation resulted in the non-0 return value, all we can conclude is that 0 is success and non-0 is an error. In particular, one cannot usefully inspect the error number.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can simplify this by thinking about which errors would be useful to ultimately signal to the mutator -- there can only really be Out_of_memory or another exception to signal that a thread cannot be created (to distinguish between memory exhaustion and other system limits like thread limits).

In this function, the only way the pthread_mutex_init or pthread_mutex_cond functions can fail is with a memory or resource error (EAGAIN or ENOMEM), since the functions aren't called in a way that could return EINVAL. So just returning a negative error code of -1 to indicate an OOM condition would be simpler than propagating the results of one of the mutex calls failing, I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is true that you do not know which one failed, but e.g. in case of ENOMEM this does not really matter.

The docs also mention EBUSY and the function is indeed used to reinitialize a mutex after fork, so since what is done for fork+threads is not really standards-compliant, it is best to keep the explicit error message for debugging. After spending an hour trying to figure out the best course of action here, I am thinking we might be overthinking it.

Raising an exception à la sync_check_error (which correctly maps ENOMEM to Out_of_memory) is doable, after fixing the handling of errors during domain creation.

m->init = 0; /* force initialization */
st_masterlock_init(m);
m->init = 0; /* force reinitialization */
if (0 != st_masterlock_init(m))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 0 != st_masterlock_init(m) better than st_masterlock_init(m) != 0?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The house style in the runtime is mostly the latter, with only a few sprinklings of the (0 == ...) style that I can find in runtime/sys.c

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed. Comments about style are welcome, to help fix the inconsistency of the code base.

st_masterlock_init(m);
m->init = 0; /* force reinitialization */
if (0 != st_masterlock_init(m))
caml_fatal_error("Unix.fork: failed to reinitialize master lock");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see that the proposed change is better than what was earlier. Is there no better recovery here than a caml_fatal_error?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is even caml_fatal_error safe here, given the earlier giant "only execute async-signal-safe operations" comment in the diff hunk above? If we've failed to do something as simple as allocate a few mutexes at this point in the fork, just exiting is probably the only option.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed if it fails, there is not much that can be done at that point. In case there is something to improve to the "best effort" to support fork+threads, it is still better to attempt writing an error message. Note that when trying to support fork+threads, one is already deep in unchartered territory (with some hope that the global lock ensures enough consistency to let you ignore the "only async-signal-safe" requirement). Of course, the whole enterprise is doomed to failure if there are other C threads and one is interrupted in the middle of malloc, say.

@@ -410,18 +411,20 @@ static void caml_thread_domain_stop_hook(void) {
};
}

static void caml_thread_domain_initialize_hook(void)
/* Returns 0 if initialization failed */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This semantics is the opposite of the semantics of other thread operations like pthread_mutex_init and pthread_cond_init where 0 is success and non-zero is an error. Wondered whether there was a reason to do it this way?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with @kayceesrk. It's the opposite of the convention established in st_masterlock_init in this diff.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reasoning was that the other ones are internal, while this one is exported in some sense (my view was probably influenced by hacking heavily the runtime hooks in the past).

The new version raises an exception instead. This is mostly useful when initializing threads on the first domain, which was already prepared to receive an exception.

When fixing domain_thread_func to handle errors, this will need to be changed to returning an encoded exception.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I feel that my comment was not very clear: I am backtracking a bit on returning an error code and I keep the exception-raising behaviour of caml_stat_alloc. I add a check to raise an exception on faulty mutex initialization. This lets me improve the behaviour when initializing threads on the first domain, without fixing or making worse the behaviour on initialization for other domains. This last fix will have to come in another PR (as mentioned already).

runtime/domain.c Show resolved Hide resolved
@@ -488,6 +494,10 @@ CAMLprim value caml_thread_initialize(value unit)
return Val_unit;
}

/* Cleanup the thread machinery when the runtime is shut down. Joining the tick
thread take 25ms on average / 50ms in the worst case, so we don't do it on
program exit. (FIXME: not implemented in OCaml 5 yet) */
Copy link
Contributor

@kayceesrk kayceesrk May 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, the fix is to not join the systhread since it might be sleeping? I noticed that the commit messages confirms this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note, as I mentioned in #11386 (comment), that not joining is more problematic for multicore, and I suggested the use of pthread_cond_timedwait with a condition variable used to request continuing or shutting down as a possible solution.

| _ -> print_endline "NOK"
end;
Domain.join d
let res = match Unix.fork () with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is much better. Thanks!

@kayceesrk
Copy link
Contributor

Except for this one comment on the second commit (142beb8#r1207685036), the
other changes look fine to me.

@@ -470,7 +488,8 @@ CAMLprim value caml_thread_initialize(value unit)
st_tls_newkey(&caml_thread_key);

/* First initialise the systhread chain on this domain */
caml_thread_domain_initialize_hook();
if (!caml_thread_domain_initialize_hook())
caml_failwith("caml_thread_initialize: Thread initialization failed.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be really nice to distinguish between an OOM (Out_of_memory) and a thread allocation fail (Failure) here, if possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wish granted. Note that properly dealing with those exceptions for domains ≥ 1 will be more work, as mentioned.

otherlibs/unix/fork.c Outdated Show resolved Hide resolved
process, the pid of the child process for the parent process. It
fails if the OCaml process is multi-core (any domain has been
spawned). In addition, if any threads have been spawned then the
child process might be in a corrupt state.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't entirely clear on how other threads have been spawned, given the invariant earlier. Do you mean system threads independently spawned of the OCaml runtime?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think this sentence refers to pthreads that would have been created by C code or other foreign code in the current process.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also systhreads, which is what I meant originally. C/FFI programmers already know not to mix fork & threads. I clarified the comment.

@gasche
Copy link
Member

gasche commented Jun 15, 2023

@gadmm it looks like the present PR has been reviewed three weeks ago, and would be mergeable in trunk after you rebase and take review comments into account. Are you available to do this in the next few weeks? If not, would you prefer if a maintainer took care of the rebase?

@kayceesrk kayceesrk self-assigned this Jun 16, 2023
@gadmm
Copy link
Contributor Author

gadmm commented Jun 16, 2023

As I understand it, this is not scheduled for 5.1 so there is no rush.

@kayceesrk
Copy link
Contributor

As I understand it, this is not scheduled for 5.1 so there is no rush.

It would be useful to get this towards a merge soon as the code has already gone out of sync with trunk and needs to be rebased. Merging it faster would mean that fixing conflicting changes would be the responsibility of folks who propose newer PRs ;-)

@gadmm
Copy link
Contributor Author

gadmm commented Jun 27, 2023

Thanks for the gentle pressure, I am well aware and this has been at the top of my pile. As you can guess my own time is constrained, and I also received many solicitations from the OCaml github concurrently. I have been prioritizing 5.1. (Even though the release schedule is kept private for some odd reason, I have some sense that 5.1 is the priority at this time.)

Still on the topic of what we could do better to make sure that aging PRs end up getting merged, would monthly reminders be received positively? Would they help?

@gadmm gadmm force-pushed the systhread_simpl_and_fixes3 branch from ef90b1c to ae46a4a Compare July 9, 2023 15:54
@gadmm
Copy link
Contributor Author

gadmm commented Jul 9, 2023

The "fixup" comments are for reviewing my fixes. They will be squashed later.

@kayceesrk
Copy link
Contributor

Except for this one comment on the second commit (142beb8#r1207685036), the
other changes look fine to me.

Force push lost my comment ?! :-( I don't know what my comment was, but I am assuming that it was fixed. If @gadmm can confirm, I think the PR is good to go.

@gadmm
Copy link
Contributor Author

gadmm commented Jul 17, 2023

The comment is still there right above (#11386 (comment)), along with my reply, but it looks like your pointer was invalidated. (I got the still-valid link by doing "copy link" directly from the PR discussion, which seems to give a different URL than yours.)

Let me know about the last point, and then I will clean-up the PR.

@kayceesrk
Copy link
Contributor

Let me know about the last point, and then I will clean-up the PR.

LGTM. Please also add a changes entry. I'll aim to merge this right after that.

gadmm and others added 6 commits July 18, 2023 18:08
This revealed further issues with the new way domains are initialized,
hence the addition of a FIXME.
In this code path, if Thread_lock(Dom_c_threads) is well-defined, then
Active_thread is non-null since both variables are initialized at the
same moment.

As with the previous one, this should be looked at to check if this
does not hide a more serious bug.
The tick thread no longer uses signals.
Take advantage of the fact that the domain_state is passed as an argument
…have been forgotten

Add a FIXME for a feature that is missing from OCaml 5 (avoid joining
on the tick thread at shutdown).
We change the order of spawning of the tick thread so that in case of
(very improbable) failure, we do not end up with both an exception and
a thread being spawned.
@gadmm gadmm force-pushed the systhread_simpl_and_fixes3 branch from ae46a4a to 96833f9 Compare July 18, 2023 16:29
@gadmm gadmm force-pushed the systhread_simpl_and_fixes3 branch from 96833f9 to 2569640 Compare July 18, 2023 16:40
@gadmm
Copy link
Contributor Author

gadmm commented Jul 18, 2023

Rebased & added a Changes entry (mostly to credit the work done by reviewers, as no change entry was needed for these minor changes & fixes IMO).

Edit: as already mentioned in a different discussion, the general advice on the organisation of this PR, which mentions various points that were already applied, and did not seem related to the delay in starting the review, ended up discussed in private.

@kayceesrk kayceesrk merged commit 90dae66 into ocaml:trunk Jul 18, 2023
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants