
Multicore OCaml #10831

Merged: 4,103 commits, Jan 10, 2022

Conversation

@kayceesrk (Contributor) commented on Dec 21, 2021:

This PR adds support for shared-memory parallelism through domains and for direct-style concurrency through effect handlers (without syntactic support). It aims to preserve backwards compatibility in terms of language features, the C API, and the performance of single-threaded code.
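As a quick illustration only (this sketch is not part of the PR text), the two features can be exercised through the new Domain and Effect standard-library modules: Domain.spawn and Domain.join run computations in parallel, and Effect.Deep.match_with installs an effect handler without any dedicated syntax.

    (* Minimal sketch of domains and effect handlers as exposed by the
       new Domain and Effect modules; illustrative only. *)

    let fib n =
      let rec go n = if n < 2 then n else go (n - 1) + go (n - 2) in
      go n

    (* Parallelism: run a computation on another domain and join it. *)
    let () =
      let d = Domain.spawn (fun () -> fib 35) in
      let here = fib 35 in
      Printf.printf "fib twice, in parallel: %d %d\n" here (Domain.join d)

    (* Concurrency: perform an effect and handle it in the caller,
       with no dedicated syntax. *)
    open Effect
    open Effect.Deep

    type _ Effect.t += Ask : int Effect.t

    let () =
      let result =
        match_with (fun () -> perform Ask + 1) ()
          { retc = (fun v -> v);
            exnc = raise;
            effc = (fun (type a) (eff : a Effect.t) ->
              match eff with
              | Ask -> Some (fun (k : (a, _) continuation) -> continue k 41)
              | _ -> None) }
      in
      Printf.printf "handled effect, result = %d\n" result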

For users

If you want to learn more about Multicore OCaml, please have a look at the multicore wiki for papers, talks, tutorials and blog posts.

If you are interested in using Multicore OCaml, we suggest having a look at the following work-in-progress libraries:

  1. domainslib -- a library for nested task parallelism (a usage sketch follows this list)
  2. eio -- asynchronous IO in direct style
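For a flavour of the first library, here is a rough, hypothetical usage sketch of domainslib's Task API; the names (Task.setup_pool, Task.run, Task.async, Task.await, Task.teardown_pool) and their exact signatures are assumed from the library's documentation and may differ between versions.

    (* Hypothetical sketch of fork/join task parallelism with domainslib.
       Check the README of the version you install for the exact API. *)
    let fib n =
      let rec go n = if n < 2 then n else go (n - 1) + go (n - 2) in
      go n

    let () =
      let pool = Domainslib.Task.setup_pool ~num_domains:3 () in
      let sum =
        Domainslib.Task.run pool (fun () ->
          let a = Domainslib.Task.async pool (fun () -> fib 35) in
          let b = Domainslib.Task.async pool (fun () -> fib 36) in
          Domainslib.Task.await pool a + Domainslib.Task.await pool b)
      in
      Domainslib.Task.teardown_pool pool;
      Printf.printf "fib 35 + fib 36 = %d\n" sum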

Here is a snapshot of the multicore scalability results on parallel benchmarks from sandmark, run on a 2-processor AMD EPYC 7551 server with 64 cores in total:

The number in parentheses next to each benchmark name is the time (in seconds) taken by the sequential baseline of the corresponding benchmark.

Review process

This PR comes on the back of asynchronous code reviews by the OCaml core developers, followed by a week-long synchronous code review session. A summary of the discussions is available in the November multicore monthly.

As mentioned in the November multicore monthly, this PR constitutes the minimum viable product (MVP) for Multicore OCaml. The "pre-MVP" tasks have been completed, and the "post-MVP for 5.00" tasks are yet to be done. The aim is that these tasks, and the ones that follow, can be completed in the ocaml/ocaml GitHub repository through the usual OCaml PR review process, rather than in the Multicore OCaml repository.

Given that the PR is quite large, if you spot major breakages or functionality gaps, we suggest opening separate issues on the ocaml/ocaml GitHub repository. This will help keep the discussion threads readable. Feel free to comment on minor issues (typos, formatting edits, etc.) directly in this PR, and we shall be happy to fix those.

What's in the box

The only supported backend is amd64 on Linux and macOS; arm64 is in the works. The following features are not implemented:

  • statmemprof
  • compilation with frame pointers
  • ocamldebug
  • AFL support
  • flambda

Acknowledgements

Multicore OCaml has been a long-running project. We would like to thank all those who have helped find issues, debug issues, review code, and contribute code along the way.

-- The Multicore OCaml team

Engil and others added 30 commits on December 9, 2021, including:

  • Skip unsupported and incompatible tests
  • Systhreads WG3 comments
  • Comment on caml_domain_spawn also calling in install_backup_thread
  • Rename / hide some global variables
  • Signals changes from sync review and WG3
@xavierleroy (Contributor) commented:

just pushed a multicore branch to shindere/ocaml. At the head of this branch is a commit that removes the mention of naked pointers from Inria's CI script. Could this be cherry-picked to this PR, please?

Done (just after the merge).

@gadmm (Contributor) left a comment:

  • Some of these comments point out bugs.
  • There is a question about a discrepancy between the implementation of non-atomic writes according to the PLDI paper and the actual implementation of the write barrier.
  • Many comments are about improvements that were lost at some point during rebase/merges regarding the polling behaviour. It is easier to get multicore up to speed with the recent changes separately (and help you with it), but I have to use this PR to point out the issues.

// Give the GC a chance to run, and run memprof callbacks
return caml_process_pending_actions_with_root (result);
/* Give the GC a chance to run */
return caml_check_urgent_gc (result);
@gadmm (Contributor), Jan 10, 2022:

(For context, caml_check_urgent_gc is nowadays supposed not to call any asynchronous callback, to avoid raising asynchronous exceptions; this will be the topic of another review comment.)

I do not like losing memprof-specific code and comments, which were the work of several error-prone PRs (or do you have a strategy for both authors and reviewers to make sure that they will be reintroduced when the time comes?). I find it better to keep the memprof code and comments until it is upgraded to multicore, at the cost of commenting out some of it (not needed here).

Idem further below in the file.

@@ -445,7 +480,7 @@ static value caml_array_gather(intnat num_arrays,
/* Many caml_initialize in a row can create a lot of old-to-young
refs. Give the minor GC a chance to run if it needs to.
Run memprof callbacks for the major allocation. */
res = caml_process_pending_actions_with_root (res);
res = caml_check_urgent_gc(res);
Contributor:

Thanks for keeping the memprof comment here; please also keep the polling behaviour for consistency of the comment with the trunk semantics of process_pending vs. check_urgent.

/* Beware: the allocations below may cause finalizers to be run, and another
backtrace---possibly of a different length---to be stashed (for example
if the finalizer raises then catches an exception). We choose to ignore
any such finalizer backtraces and return the original one. */
@gadmm (Contributor), Jan 10, 2022:

By the current semantics of caml_alloc, this should not happen. Instead, caml_alloc should be fixed to delay the call to finalisers, as in trunk (this looks simple to do). Then one can simplify the code below it, as was done in trunk some time ago.

@@ -27,23 +27,35 @@
extern "C" {
#endif

/* It is guaranteed that these allocation functions will not trigger
any OCaml callback such as finalizers or signal handlers. */
Contributor:

As I mentioned above, this recent guarantee should be kept. It must have been accidentally lost during the rebase on top of 4.12, since I do not see any difficulty in keeping it in multicore.

DOMAIN_STATE(volatile uintnat, young_limit)
/* Minor heap limit. Typically young_limit == young_start, but this field is set
* by other domains to signal this domain by causing a spurious allocation
* failure. */
Contributor:

Shouldn't this be atomic instead of volatile?

Contributor:

This is now fixed.

needed. Therefore, we set [caml_something_to_do] again in order
to force reexamination of callbacks. */
caml_set_action_pending();
return exn;
Contributor:

This function is necessary if you want to correctly implement caml_process_pending_actions*.

if (!mem)
printf("Trimming failed\n");
else if (mem != (void*)aligned_start)
printf("Hang on a sec - it's allocated a different block?!\n");
Contributor:

Taking inspiration from https://github.com/golang/go/blob/c27a3592aec5f46ae18f7fd3d9ba18e69e2f1dcb/src/runtime/malloc.go#L813-L828, this seems to be on the right track, but it seems that you need to do it in a loop to take care of the possibility of a race.
I'd avoid printf to report a runtime error, and the string could be a bit more meaningful. In any case, make sure that in case of error it returns 0 (you can also call caml_fatal_error for situations you know cannot happen according to the specification).
Idem about printf in two more occurrences below in the same file.

Contributor:

(cc @dra27)

Contributor:

See #10908 (which, for what it's worth, would have appeared 5 days ago with an issue 😉)

runtime/memory.c Outdated
Using release stores for all writes also ensures publication safety
for newly-allocated objects, and isn't necessary for initialising
writes. The cost is free on x86, but requires a fence in
caml_modify on weakly-ordered architectures (ARM, Power).
@gadmm (Contributor), Jan 10, 2022:

There is a difference between this description and what is described in the PLDI 2018 paper, isn't there? I.e. the paper has no release store for writes (on armv8, str rather than stlr, and no full dmb barrier).

runtime/memory.c Outdated
/* See Note [MM] above */
atomic_thread_fence(memory_order_acquire);
atomic_store_explicit(&Op_atomic_val((value)fp)[0], val,
memory_order_release);
Contributor:

Following my comment above based on the PLDI paper, here I expect memory_order_relaxed instead of memory_order_release.

runtime/interp.c Outdated
}
/* Fall through CHECK_SIGNALS */

/* Signal handling */

Instruct(CHECK_SIGNALS): /* accu not preserved */
if (caml_something_to_do) goto process_actions;
if (Caml_check_gc_interrupt(domain_state) ||
caml_check_for_pending_signals())
Contributor:

In connection with the polling behaviour changes, this can be made more efficient again. (Idem for POPTRAP.)

@silene commented on Jan 11, 2022:

The function caml_process_pending_actions_exn seems to have disappeared. Is that on purpose?

@gadmm (Contributor) commented on Jan 11, 2022:

@silene This is ocaml-multicore/ocaml-multicore#791. A lot of the things that make caml_process_pending_actions_exn possible are currently missing; my review above tracks what is missing.

@sadiqj (Contributor) commented on Jan 11, 2022:

A lot of things making caml_process_pending_actions_exn possible are currently missing, my review above tracks what is missing.

I think it's worth pulling this out into a separate issue on ocaml/ocaml where we can work out exactly what needs to happen. There are some parts (like delaying finalisers and returning exceptions rather than raising them) which actually require a fair bit of work.

@gadmm (Contributor) commented on Jan 11, 2022:

I agree; just rebasing correctly would have been very hard for someone not familiar with the code, and working on this part in trunk is the best way to be able to help you.

runtime/memory.c Outdated
} else {
/* See Note [MM] above */
atomic_thread_fence(memory_order_acquire);
ret = atomic_exchange(Op_atomic_val(ref), v);
@gadmm (Contributor), Jan 11, 2022:

Here's a question about the implementation of the memory model again. For context, currently, Atomic.store and Atomic.exchange result directly in a call to caml_atomic_exchange.
Here, if I follow the paper naively, I expect instead something like:

    ret = atomic_exchange(Op_atomic_val(ref), v);
    atomic_thread_fence(memory_order_release);

On armv8 (clang) the original code results in:
dmb (ish)ld; L: ldaxr w2, [x0]; stlxr w3, w1, [x0]; cbnz w3, L (notice the missing dmb st)
whereas the above one results in:
L: ldaxr w2, [x0]; stlxr w3, w1, [x0]; cbnz w3, .L17; dmb (ish)
which is closer to the paper (it does not seem to be possible to emit the dmb st barrier with standard C atomics; the closest is to use a full barrier).

Of course at the moment armv8 is not supported (but I assume that the armv8 prototype is currently implemented with these functions). However this runs deeper:

  • The implementation currently seems correct on x86 when these functions are called from OCaml. However one could imagine that when they are called from C code, a compiler might try to inline them and be too smart. How do you ensure that these functions implement the OCaml memory model on x86 when called from C? (Edit: the answer is that currently caml_atomic_exchange is not (yet) accessible from C code; only caml_modify, which seems to implement a stronger operation than the OCaml memory model. Maybe this question will be relevant later.)
  • Maybe the memory model has evolved between the first prototype and the paper, and the documentation at the top of the file was not updated. In that case, should the comment be updated (more than I initially thought)?

Naively I would think that for the OCaml-C FFI you are eventually going to implement a conservative approximation of the OCaml memory model in C, whereas for native OCaml code you will eventually emit the exact assembly code you need (plus a C call to the write barrier). Sorry if there is a simple answer that I am missing!

Edit: the questions above will be relevant eventually, but for now (x86-only) it looks correct, even if misleading.
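As background, here is a minimal sketch of the user-facing Stdlib.Atomic operations being discussed (illustrative only, not part of the review; Atomic.set is assumed to be the standard-library spelling of the store operation mentioned above).

    (* Illustrative only: Atomic.exchange and Atomic.set exercised across
       two domains.  Which value the exchange observes (0 or 2) and the
       final contents of the cell (1 or 2) depend on the interleaving. *)
    let () =
      let cell = Atomic.make 0 in
      let d = Domain.spawn (fun () -> Atomic.exchange cell 1) in
      Atomic.set cell 2;
      let observed = Domain.join d in
      Printf.printf "exchange observed %d, final value %d\n"
        observed (Atomic.get cell)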

sabine pushed a commit to sabine/ocaml that referenced this pull request on Feb 1, 2022:
… rebase
Mentioned in ocaml#10831 review comment on runtime/io.c:552.

@gadmm mentioned this pull request on Feb 1, 2022.

xavierleroy pushed a commit that referenced this pull request on Feb 1, 2022:
…se (#10975)
Mentioned in #10831 review comment on runtime/io.c:552.
Co-authored-by: sabine <sabine@tarides.com>
moyodiallo added a commit to moyodiallo/ocaml-multicore-ci that referenced this pull request on Mar 1, 2022:

ocaml-multicore/multicore was already merged into ocaml/ocaml (ocaml/ocaml#10831) and is available with 5.00.0+trunk. Now it makes sense to remove the ocaml-multicore repo and use version 5.00 of OCaml directly.