Skip to content

Allow maximum number of domains to be specified as a OCAMLRUNPARAM parameter#13272

Merged
gasche merged 2 commits intoocaml:trunkfrom
kayceesrk:max_domains_ocamlrunparam
Jul 4, 2024
Merged

Allow maximum number of domains to be specified as a OCAMLRUNPARAM parameter#13272
gasche merged 2 commits intoocaml:trunkfrom
kayceesrk:max_domains_ocamlrunparam

Conversation

@kayceesrk
Copy link
Copy Markdown
Contributor

This PR introduces a new OCAMLRUNPARAM parameter d that permits the user to set the maximum number of domains. This idea was suggested by @shindere at #13256 (comment).

Fixes #13256.

@ghost
Copy link
Copy Markdown

ghost commented Jul 1, 2024

Shouldn't the user-configurable value be bounds checked to be within reasonable limits (i.e. 1 <= user value < no more than a few k)?

@kayceesrk
Copy link
Copy Markdown
Contributor Author

Shouldn't the user-configurable value be bounds checked to be within reasonable limits (i.e. 1 <= user value < no more than a few k)?

That sounds reasonable. Would 4096 be a good upper limit for a start?

@gasche
Copy link
Copy Markdown
Member

gasche commented Jul 1, 2024

I believe that this is a step in the right direction -- hardcoding a fixed maximum of domains is bound to create issues sooner or later.

I tried to use github search to find code out there that relies on the current limit to the number of domains. It's hard because there is a lot of noise from copies of the compiler codebase. I found the following:

Boxroot is easily fixed by reading the new global instead, I believe. (This code runs inside the OCaml process, so it sees the same OCAMLRUNPARAM value). For multicoretests this is less obvious because it runs as a separate process, so it cannot access the "real" value, but in fact this is just fine as long as the test runner controls the OCaml process and can fix the max_domains value there.

For runtime_events consumers we should be in the clear, as the runtime_events codebase had the excellent insight of passing the max_domains value in the metadata header from the start: it is already possible to write consumers that are robust to different values of this.

To summarize: I think that we are making this change early enough that there is not a lot of sensitivity on this value in the ecosystem, and we should be fine compatibility-wise.

@ghost
Copy link
Copy Markdown

ghost commented Jul 1, 2024

That sounds reasonable. Would 4096 be a good upper limit for a start?

I think so.

Or maybe 2 x sysconf(_SC_NPROCESSORS_CONF) on Unix-like platforms. But that's probably too much bikeshed already.

@shindere
Copy link
Copy Markdown
Contributor

shindere commented Jul 1, 2024 via email

@gasche
Copy link
Copy Markdown
Member

gasche commented Jul 1, 2024

Or maybe 2 x sysconf(_SC_NPROCESSORS_CONF) on Unix-like platforms. But that's probably too much bikeshed already.

One issue with this is that it would break on the current default for most machines, so for example if I run the multicoretest runner with OCAMLRUNPARAM="domains=128" I get a hard failure.

@kayceesrk why are we using one-letter name for OCAMLRUNPARAM variables, is this enforced by the specification or implementation? Otherwise I wouldn't mind domains or max_domains instead of d.

@shindere
Copy link
Copy Markdown
Contributor

shindere commented Jul 1, 2024 via email

@kayceesrk
Copy link
Copy Markdown
Contributor Author

For runtime_events consumers we should be in the clear, as the runtime_events codebase had the excellent insight of passing the max_domains value in the metadata header from the start: it is already possible to write consumers that are robust to different values of this.

Thanks for writing this out. I did want to mention this in the PR message. There was nothing to be done here since @sadiqj has already done the necessary work. Thanks Sadiq!

@kayceesrk
Copy link
Copy Markdown
Contributor Author

To summarize: I think that we are making this change early enough that there is not a lot of sensitivity on this value in the ecosystem, and we should be fine compatibility-wise.

Thanks for the summary.

To fix the failing windows builds, @dra27 mentioned that this may need a bootstrap because of flexdll bootstrap. I'll do that next.

Comment thread manual/src/cmds/runtime.etex Outdated
@kayceesrk kayceesrk force-pushed the max_domains_ocamlrunparam branch from e541e05 to 05ae3b0 Compare July 1, 2024 12:56
Comment thread runtime/caml/domain.h Outdated
Comment thread runtime/domain.c Outdated
* [participating_domains, Max_domains) are all those domains free to be used
* This structure is protected by all_domains_lock. [0, participating_domains)
* are all the domains taking part in STW sections. [participating_domains,
* caml_params->max_domains) are all those domains free to be used.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I would rather keep the [foo, bar) intervals on a single line each for readability, which requires doing some manual line-breaking here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 263be36.

Comment thread otherlibs/systhreads/st_stubs.c Outdated
Comment thread runtime/domain.c Outdated
if (stw_domains.domains == NULL)
caml_fatal_error("Failed to allocate stw_domains.domains");

caml_init_gc_stats(max_domains);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this call should probably move to caml_init_gc, as it initializes a runtime service that is independent from the domains machinery. (I have no idea what caml_init_signal_handling is doing at the end of the current function, and the same comment would apply -- or maybe even move it to caml_start_* -- but oh well.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 263be36.

Comment thread runtime/gc_stats.c
Comment thread runtime/domain.c
Comment thread runtime/domain.c Outdated
@kayceesrk
Copy link
Copy Markdown
Contributor Author

Still seeing Windows CI failures after the bootstrap. @dra27 do you see other possibilities for failures that we see?

Comment thread otherlibs/systhreads/st_stubs.c Outdated
" table");

tick_thread_stop = caml_stat_alloc_noexc(caml_params->max_domains *
sizeof(atomic_uintnat));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an alternative, perhaps simpler solution, these two variables implementing some domain-local state could be moved into dedicated Caml_state fields. It should not be a problem if people not using systhreads allocate two fields but do not use them.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seemed appropriate to move it to the thread_table in st_stubs.c. Done that in 263be36.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have mixed feelings about this suggestion (for and against). This does not matter as I think you came to a satisfying resolution already, but let me elaborate in case it can help for future discussions.

  1. I think that systhreads are there to stay, and that we should not hesitate to integrate them more in the runtime to make their implementation better, simpler, more correct.
  2. On the other hand, I like that we still have code lying around in the runtime that behaves sort of like a third-party consumer of the runtime interfaces, instead of being hardcoded in every place. The ability for third-party libraries to have domain-local state as an array is useful, and keeping this aspect of the systhreads implementation is useful as a canary to check that we preserve this use-case. (@gadmm points out that the present change incurs a performance regression, which may matter for our current boxroot prototype; I don't know if it does matter for systhreads).

To conclude: if there are strong benefits to moving things into Caml_state, let's do it. But if we can do without, it gives us useful information about what third-party libraries can do, and there is value in that as well.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This reasoning somehow applies to gc_stats as well.)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a roundabout way of saying that the runtime could implement a proper DLS for itself because it would also benefit libraries.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, "roundabout" is my middle name.

Comment thread runtime/gc_stats.c
collection, major cycle (which potentially includes compaction),
all of these events could happen during domain termination. */
static struct gc_stats sampled_gc_stats[Max_domains];
static struct gc_stats* sampled_gc_stats;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, is there a reason for not having this as a field in the domain state variable?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few more failure cases to handle if sampled_gc_stats were moved to Caml_state. In particular, Caml_state is created on demand when domains are created. If gc_stats* were a field in Caml_state, even with the guarantee that stats are only sampled in stop-the-world sections, we need to check for whether gc_stats is not NULL (since newly created domains may not participate in stw sections).

I tried to make this change, but the changes for handling failure cases got messy. Perhaps one solution is to have sampled_gc_stats in dom_internal struct. But that struct is internal to domain.c.

Something to come back to later.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, sounds like a good enough reason

@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Jul 1, 2024

Boxroot is easily fixed by reading the new global instead, I believe. (This code runs inside the OCaml process, so it sees the same OCAMLRUNPARAM value).

It's not that easy, and the change is not ideal for us. Max_domain being known at compilation-time means that the array is allocated statically. This is used to implement an efficient domain-local state, which is relied-on for the fast path of boxroot. This usage of a fixed Max_domain has previously been discussed as we introduced the timing hook caml_domain_terminated_hook for this purpose (#11209).

As previously discussed with @gasche, simply removing the limitation of Max_domains would remove an application without providing an alternative solution. Yet alternative implementations of DLS for C code would require (simple, but) more intrusive modifications to the runtime, and therefore need coordination with upstream prior to removal.

This PR presents an intermediate solution that does not remove Max_domains entirely, but makes it chosen at runtime. This changes the implementation of DLS as follows:

  • One additional test for null when loading the DLS field.
  • One additional indirection with a data depency on the previous test.
  • This becomes a memory-inefficient way to implement a DLS when Max_domains is large (same issue for the present PR actually).

So this is not ideal for us, as the fast path of boxroot was carefully optimised (and we really gave advance notice about this usage of Max_domains).

Two possible solutions are not too complex to implement:

  • More efficient DLS could be offered natively. For instance, another domain state variable that works like Caml_state could be introduced, with slots allocated dynamically like an implicit TLS). This also ought to be useful for the runtime, e.g. the two uses of DLS inside systhreads (except if you accept my alternative suggestion to reserve a Caml_state slot for them).
  • A quick solution could be to please add one field for us (boxroot) inside the domain state. (A quick solution indeed although I don't like the idea of boxroot being priviledged compared to other libraries.)

@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Jul 1, 2024

Thinking further, there might be a third solution to our problem of an efficient DLS, via caching inside an (efficient) implicit TLS variable. In this case, it might not matter very much if DLS is made a bit less efficient. I will think about it further and let you know if there are any issue in the end with the current PR.

Copy link
Copy Markdown
Contributor

@gadmm gadmm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me.

@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Jul 2, 2024

Regarding the usage of Max_domains to implement a DLS, I am now pretty confident that we can get better performance by caching a pointer to DLS using implicit TLS, than any other possible implementation of DLS. Indeed, implicit TLS is backed by linker tricks that are otherwise inaccessible. (This makes the assumption that a system thread cannot change domains, but this assumption is probably deeply entrenched everywhere in the runtime implementation anyway.)

It is still useful to have a limit number of domains to implement the backing DLS, though. So it is a different story than if users started taking very large values for max_domains (currently limited to 4096) or if max_domains was removed entirely.

@kayceesrk
Copy link
Copy Markdown
Contributor Author

kayceesrk commented Jul 3, 2024

(This makes the assumption that a system thread cannot change domains, but this assumption is probably deeply entrenched everywhere in the runtime implementation anyway.)

Indeed. We have also documented this fact here: https://ocaml.org/manual/5.2/parallelism.html#s:par_systhread_interaction.

So it is a different story than if users started taking very large values for max_domains (currently limited to 4096)

We will have plenty of reasons to celebrate if someone usefully takes advantage of 4096 CPU cores in a single process. :-)

Thanks for the approval. I'm debugging the Windows failures now.

@kayceesrk kayceesrk force-pushed the max_domains_ocamlrunparam branch from 804b5b8 to 685cb07 Compare July 3, 2024 07:47
@gasche
Copy link
Copy Markdown
Member

gasche commented Jul 3, 2024

@kayceesrk would it also be possible to have a Sandmark run for this PR? I don't expect any performance difference because I don't think we rely intensively on domain-indexed arrays in the runtime, but @gadmm comments made me curious about potential performance impact.

@gadmm
Copy link
Copy Markdown
Contributor

gadmm commented Jul 3, 2024

I don't think it is necessary to benchmark the change. We're talking about microarchitectural effects here, and the code changed by this PR is not optimised that much.

Comment thread runtime/caml/startup_aux.h Outdated
thread_table = caml_stat_calloc_noexc(caml_params->max_domains,
sizeof(struct caml_thread_table));
if (thread_table == NULL)
caml_fatal_error("caml_thread_initialize: failed to allocate thread"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first I was concerned that the previous failure point in this function uses caml_failwith, but it can only occur in a specific cinematic where it makes sense to raise an exception, while these added allocation failures should be considered fatal (hence caml_fatal_error).

Eventually (in a later PR) this routine should be allowed to return a caml_result, but that's more work (and orthogonal to the problem being solved here).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proper error handling in this function is actually part of #12410.

(Incidentally I noticed that this PR interferes with many others: #12410, #12964, #13026.)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think these fatal errors should be Out_of_memory exceptions instead, since the caml_failwith above shows that it is admissible to fail with uninitialized thread machinery.

Now this raises the question of whether the caml_failwith above is sound. It is sound if we are sure that an uncaught exception during module initialization stops the program (otherwise we need hardening by checking the variable threads_initialized at the entry points of the Thread module). Unfortunately I don't know enough about module initialization to answer this question.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a good idea to continue program execution when caml_stat_calloc_noexc fails to allocate memory for thread_table? Given that the underlying calloc has failed, the program is very likely doomed to fail soon in a non-recoverable way. Better to hard fail (fatal error) as soon as we anticipate this.

Like @gadmm, I also have the question of whether caml_failwith is sound. In particular, the exception that may be raised by this function is not handled at the call site,

let () =
thread_initialize ();
(* Called back in [caml_shutdown], when the last domain exits. *)
Callback.register "Thread.at_shutdown" thread_cleanup

and will lead to module initialisation failing with an uncaught exception?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reasoning here is that the Out_of_memory will halt the program quickly anyway. Indeed, otherwise there is also an issue right above with caml_failwith. This can provide better debugging. Since this is a large allocation (e.g. can be above glibc MMAP_THRESHOLD), this does not mean that the remaining mallocs are going to fail.

But I agree that the main concern is the Failure above, which happens if another module spawns domains during initialization. Do we know if, when module initialization fails:

  • no functions from this module can be called?
  • the program stops?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The semantics I expect if the evaluation of a module raises an exception is that the whole program fails with that exception, without executing the code of later modules in link order. (The mental model of ocamlc a.ml b.ml c.ml is that the produced program performs the sequential evaluation of modules A, B, C.)

However, taking a step back: I think this part of the runtime is an unfinished design, not something set in stone that we should think deeply about. We should admit that the runtime model exposed to our users is a small number of domains with arbitrarily many non-parallel threads per domain, and move whatever parts of the thread machinery into the runtime/ logic that let us do this in a simpler, more robust way. It's "just" that no one around is actively working on this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a large allocation (e.g. can be above glibc MMAP_THRESHOLD), this does not mean that the remaining mallocs are going to fail.

Good point.

We should admit that the runtime model exposed to our users is a small number of domains with arbitrarily many non-parallel threads per domain, and move whatever parts of the thread machinery into the runtime/ logic that let us do this in a simpler, more robust way.

Agree with this sentiment. It may help us move towards a cleaner runtime, avoiding the various hooks.

I suspect that there will be a bit of design work that will be needed as the introduction of systhreads may add overheads in programs (due to the hooks). An easy first step may be to run Sandmark benchmarks with and without linking systhreads to see whether the overheads are significant.

Comment thread runtime/domain.c Outdated
@ghost ghost mentioned this pull request Jul 3, 2024
@kayceesrk
Copy link
Copy Markdown
Contributor Author

There is one last failure on MSVC 64-bits: https://github.com/ocaml/ocaml/actions/runs/9774768473/job/26994732983#step:13:1862. I'm struggling to set up a development environment to debug this.

@kayceesrk
Copy link
Copy Markdown
Contributor Author

CI runs are successful now. It looks like not zero initialising what used to be statically allocated arrays was causing failures.

@kayceesrk kayceesrk force-pushed the max_domains_ocamlrunparam branch from 24af4d7 to 2437e39 Compare July 4, 2024 05:37
@kayceesrk
Copy link
Copy Markdown
Contributor Author

CI is green now. We've had approvals from @gadmm and @dustanddreams. @gasche are we ok to merge?

@gasche
Copy link
Copy Markdown
Member

gasche commented Jul 4, 2024

@gadmm makes the point that this introduces conflicts with other in-flight PRs ( #12410, #12964, #13026 ). I still believe that "merge whichever is ready first" is a good principle, so I'm happy to merge this one, but I think we should also take this remark as a warning sign that we have a backlog of runtime PRs to handle, not to let fester for too long. I haven't been available to do much OCaml hacking in the last few weeks (and this will remain the case for another couple weeks). Clearly for #12410 the ball is in my camp and I will try to have a look.

@gasche gasche merged commit 162d000 into ocaml:trunk Jul 4, 2024
@damiendoligez
Copy link
Copy Markdown
Member

@gasche

why are we using one-letter name for OCAMLRUNPARAM variables, is this enforced by the specification or implementation? Otherwise I wouldn't mind domains or max_domains instead of d.

The parsing is done in C by a simple loop. We'd need a careful redesign (and more code) to switch to arbitrary identifiers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Raise the maximum number of domains

6 participants