Allow maximum number of domains to be specified as an OCAMLRUNPARAM parameter #13272
gasche merged 2 commits into ocaml:trunk
Conversation
|
Shouldn't the user-configurable value be bounds-checked to be within reasonable limits (i.e. 1 <= value <= a few thousand)? |
That sounds reasonable. Would 4096 be a good upper limit for a start? |
|
I believe that this is a step in the right direction -- hardcoding a fixed maximum of domains is bound to create issues sooner or later. I tried to use github search to find code out there that relies on the current limit to the number of domains. It's hard because there is a lot of noise from copies of the compiler codebase. I found the following:
Boxroot is easily fixed by reading the new global instead, I believe. (This code runs inside the OCaml process, so it sees the same OCAMLRUNPARAM value.) For multicoretests this is less obvious, because it runs as a separate process, so it cannot access the "real" value; but in fact this is just fine as long as the test runner controls the OCaml process and can fix the max_domains value there. For runtime_events consumers we should be in the clear, as the runtime_events codebase had the excellent insight of passing the

To summarize: I think that we are making this change early enough that there is not a lot of sensitivity on this value in the ecosystem, and we should be fine compatibility-wise. |
I think so. Or maybe `2 x sysconf(_SC_NPROCESSORS_CONF)` on Unix-like platforms. But that's probably too much bikeshed already. |
|
KC Sivaramakrishnan (2024/07/01 02:54 -0700):
> Shouldn't the user-configurable value be bounds checked to be within reasonable limits (i.e. 1 <= user value < no more than a few k)?
That sounds reasonable. Would 4096 be a good upper limit for a start?
I have no idea / experience but are there values that would just make no
sense in terms of cache sizes or so?
|
One issue with this is that it would break on the current default for most machines, so for example if I run the multicoretest runner with `OCAMLRUNPARAM="domains=128"` I get a hard failure. @kayceesrk why are we using one-letter names for OCAMLRUNPARAM variables, is this enforced by the specification or implementation? Otherwise I wouldn't mind `domains` or `max_domains` instead of `d`. @damiendoligez may have something to say on that one? |
|
Gabriel Scherer (2024/07/01 02:59 -0700):
> Or maybe `2 x sysconf(_SC_NPROCESSORS_CONF)` on Unix-like platforms. But that's probably too much bikeshed already.
One issue with this is that it would break on the current default for most machines, so for example if I run the multicoretest runner with `OCAMLRUNPARAM="domains=128"` I get a hard failure.
@kayceesrk why are we using one-letter name for OCAMLRUNPARAM
variables, is this enforced by the specification or implementation?
Otherwise I wouldn't mind `domains` or `max_domains` instead of `d`.
@damiendoligez may have something to say on that one?
|
Thanks for writing this out. I did want to mention this in the PR message. There was nothing to be done here since @sadiqj has already done the necessary work. Thanks Sadiq! |
Thanks for the summary. To fix the failing Windows builds, @dra27 mentioned that this may need a bootstrap because of the flexdll bootstrap. I'll do that next. |
Force-pushed from e541e05 to 05ae3b0
```diff
- * [participating_domains, Max_domains) are all those domains free to be used
+ * This structure is protected by all_domains_lock. [0, participating_domains)
+ * are all the domains taking part in STW sections. [participating_domains,
+ * caml_params->max_domains) are all those domains free to be used.
```
|
Nit: I would rather keep the [foo, bar) intervals on a single line each for readability, which requires doing some manual line-breaking here.
```c
  if (stw_domains.domains == NULL)
    caml_fatal_error("Failed to allocate stw_domains.domains");
```
|
```c
  caml_init_gc_stats(max_domains);
```
|
I think this call should probably move to caml_init_gc, as it initializes a runtime service that is independent from the domains machinery. (I have no idea what caml_init_signal_handling is doing at the end of the current function, and the same comment would apply -- or maybe even move it to caml_start_* -- but oh well.)
|
Still seeing Windows CI failures after the bootstrap. @dra27, do you see other possible causes for the failures that we're seeing? |
| " table"); | ||
|
|
||
| tick_thread_stop = caml_stat_alloc_noexc(caml_params->max_domains * | ||
| sizeof(atomic_uintnat)); |
|
As an alternative, perhaps simpler solution, these two variables implementing some domain-local state could be moved into dedicated Caml_state fields. It should not be a problem if people not using systhreads allocate two fields but do not use them.
|
It seemed appropriate to move it to the thread_table in st_stubs.c. Done that in 263be36.
|
I have mixed feelings about this suggestion (for and against). This does not matter as I think you came to a satisfying resolution already, but let me elaborate in case it can help for future discussions.
- I think that systhreads are there to stay, and that we should not hesitate to integrate them more in the runtime to make their implementation better, simpler, more correct.
- On the other hand, I like that we still have code lying around in the runtime that behaves sort of like a third-party consumer of the runtime interfaces, instead of being hardcoded in every place. The ability for third-party libraries to have domain-local state as an array is useful, and keeping this aspect of the systhreads implementation is useful as a canary to check that we preserve this use-case. (@gadmm points out that the present change incurs a performance regression, which may matter for our current boxroot prototype; I don't know if it does matter for systhreads).
To conclude: if there are strong benefits to moving things into Caml_state, let's do it. But if we can do without, it gives us useful information about what third-party libraries can do, and there is value in that as well.
|
(This reasoning somehow applies to gc_stats as well.)
|
That's a roundabout way of saying that the runtime could implement a proper DLS for itself because it would also benefit libraries.
|
Yep, "roundabout" is my middle name.
```diff
   collection, major cycle (which potentially includes compaction),
   all of these events could happen during domain termination. */
-static struct gc_stats sampled_gc_stats[Max_domains];
+static struct gc_stats* sampled_gc_stats;
```
|
Out of curiosity, is there a reason for not having this as a field in the domain state variable?
|
There are a few more failure cases to handle if sampled_gc_stats were moved to Caml_state. In particular, Caml_state is created on demand when domains are created. If gc_stats* were a field in Caml_state, even with the guarantee that stats are only sampled in stop-the-world sections, we would need to check whether gc_stats is NULL (since newly created domains may not participate in stw sections).
I tried to make this change, but the changes for handling failure cases got messy. Perhaps one solution is to have sampled_gc_stats in dom_internal struct. But that struct is internal to domain.c.
Something to come back to later.
|
ok, sounds like a good enough reason
It's not that easy, and the change is not ideal for us. As previously discussed with @gasche, simply removing the limitation of Max_domains is not enough. This PR presents an intermediate solution that does not remove Max_domains entirely, but makes it chosen at runtime. This changes the implementation of DLS as follows: |
So this is not ideal for us, as the fast path of boxroot was carefully optimised (and we really gave advance notice about this usage of Max_domains). Two possible solutions are not too complex to implement:
|
|
Thinking further, there might be a third solution to our problem of an efficient DLS, via caching inside an (efficient) implicit TLS variable. In this case, it might not matter very much if DLS is made a bit less efficient. I will think about it further and let you know if there are any issues in the end with the current PR. |
|
Regarding this usage: it is still useful to have a limit on the number of domains to implement the backing DLS. So it is a different story than if users started taking very large values for max_domains (currently limited to 4096), or than if max_domains was removed entirely. |
Indeed. We have also documented this fact here: https://ocaml.org/manual/5.2/parallelism.html#s:par_systhread_interaction.
We will have plenty of reasons to celebrate if someone usefully takes advantage of 4096 CPU cores in a single process. :-) Thanks for the approval. I'm debugging the Windows failures now. |
Force-pushed from 804b5b8 to 685cb07
|
@kayceesrk would it also be possible to have a Sandmark run for this PR? I don't expect any performance difference, because I don't think we rely intensively on domain-indexed arrays in the runtime, but @gadmm's comments made me curious about potential performance impact. |
|
I don't think it is necessary to benchmark the change. We're talking about microarchitectural effects here, and the code changed by this PR is not optimised that much. |
```c
  thread_table = caml_stat_calloc_noexc(caml_params->max_domains,
                                        sizeof(struct caml_thread_table));
  if (thread_table == NULL)
    caml_fatal_error("caml_thread_initialize: failed to allocate thread"
```
|
At first I was concerned that the previous failure point in this function uses caml_failwith, but it can only occur in a specific scenario where it makes sense to raise an exception, while these added allocation failures should be considered fatal (hence caml_fatal_error).
Eventually (in a later PR) this routine should be allowed to return a caml_result, but that's more work (and orthogonal to the problem being solved here).
|
Actually, I think these fatal errors should be Out_of_memory exceptions instead, since the caml_failwith above shows that it is admissible to fail with uninitialized thread machinery.
Now this raises the question of whether the caml_failwith above is sound. It is sound if we are sure that an uncaught exception during module initialization stops the program (otherwise we need hardening by checking the variable threads_initialized at the entry points of the Thread module). Unfortunately I don't know enough about module initialization to answer this question.
|
Is it a good idea to continue program execution when caml_stat_calloc_noexc fails to allocate memory for thread_table? Given that the underlying calloc has failed, the program is very likely doomed to fail soon in a non-recoverable way. Better to hard fail (fatal error) as soon as we anticipate this.
Like @gadmm, I also have the question of whether caml_failwith is sound. In particular, the exception that may be raised by this function is not handled at the call site (ocaml/otherlibs/systhreads/thread.ml, lines 76 to 79 at 6ad156a), and will lead to module initialisation failing with an uncaught exception?
|
My reasoning here is that the Out_of_memory will halt the program quickly anyway. Indeed, otherwise there is also an issue right above with caml_failwith. This can provide better debugging. Since this is a large allocation (e.g. can be above glibc MMAP_THRESHOLD), this does not mean that the remaining mallocs are going to fail.
But I agree that the main concern is the Failure above, which happens if another module spawns domains during initialization. Do we know if, when module initialization fails:
- no functions from this module can be called?
- the program stops?
|
The semantics I expect if the evaluation of a module raises an exception is that the whole program fails with that exception, without executing the code of later modules in link order. (The mental model of ocamlc a.ml b.ml c.ml is that the produced program performs the sequential evaluation of modules A, B, C.)
However, taking a step back: I think this part of the runtime is an unfinished design, not something set in stone that we should think deeply about. We should admit that the runtime model exposed to our users is a small number of domains with arbitrarily many non-parallel threads per domain, and move whatever parts of the thread machinery into the runtime/ logic that let us do this in a simpler, more robust way. It's "just" that no one around is actively working on this.
|
Since this is a large allocation (e.g. can be above glibc MMAP_THRESHOLD), this does not mean that the remaining mallocs are going to fail.
Good point.
We should admit that the runtime model exposed to our users is a small number of domains with arbitrarily many non-parallel threads per domain, and move whatever parts of the thread machinery into the runtime/ logic that let us do this in a simpler, more robust way.
Agree with this sentiment. It may help us move towards a cleaner runtime, avoiding the various hooks.
I suspect that there will be a bit of design work that will be needed as the introduction of systhreads may add overheads in programs (due to the hooks). An easy first step may be to run Sandmark benchmarks with and without linking systhreads to see whether the overheads are significant.
|
There is one last failure on MSVC 64-bits: https://github.com/ocaml/ocaml/actions/runs/9774768473/job/26994732983#step:13:1862. I'm struggling to set up a development environment to debug this. |
|
CI runs are successful now. It looks like not zero-initialising what used to be statically allocated arrays was causing the failures. |
…rameter. Needs Bootstrap.
Force-pushed from 24af4d7 to 2437e39
|
@gadmm makes the point that this introduces conflicts with other in-flight PRs (#12410, #12964, #13026). I still believe that "merge whichever is ready first" is a good principle, so I'm happy to merge this one, but I think we should also take this remark as a warning sign that we have a backlog of runtime PRs to handle, which we should not let fester for too long. I haven't been available to do much OCaml hacking in the last few weeks (and this will remain the case for another couple of weeks). Clearly for #12410 the ball is in my court and I will try to have a look. |
The parsing is done in C by a simple loop. We'd need a careful redesign (and more code) to switch to arbitrary identifiers. |
This PR introduces a new OCAMLRUNPARAM parameter `d` that permits the user to set the maximum number of domains. This idea was suggested by @shindere at #13256 (comment). Fixes #13256.