New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multicore OCaml #10831
Multicore OCaml #10831
Conversation
…o/skip-unsupported-tests Skip unsupported and incompatible tests
…threads_wg3_comments Systhreads WG3 comments
…ment_on_bt_install Comment on caml_domain_spawn also calling in install_backup_thread
…nitize-exports Rename / hide some global variables
…ert_testsuite_summarize
…gnals_coalesce Signals changes from sync review and WG3
Done (just after the merge). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Some of these comments point out bugs.
- There is a question about a discrepancy between the implementation of non-atomic writes according to the PLDI paper and the actual implementation of the write barrier.
- Many comments are about improvements that were lost at some point during rebase/merges regarding the polling behaviour. It is easier to get multicore up to speed with the recent changes separately (and help you with it), but I have to use this PR to point out the issues.
| // Give the GC a chance to run, and run memprof callbacks | ||
| return caml_process_pending_actions_with_root (result); | ||
| /* Give the GC a chance to run */ | ||
| return caml_check_urgent_gc (result); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(For context caml_check_urgent_gc is nowadays supposed to not to call any asynchronous callback to avoid raising asynchronous exceptions, which will be the topic of another review comment.)
I do not like losing memprof-specific code and comments which were the work of several error-prone PRs (or do you have a strategy for both authors and reviewers to make sure that they will be reintroduced when the time comes?). I find it better keep the memprof code and comments until it is upgraded to multicore, at the cost of commenting-out some of it (not needed here).
Idem further below in the file.
| @@ -445,7 +480,7 @@ static value caml_array_gather(intnat num_arrays, | |||
| /* Many caml_initialize in a row can create a lot of old-to-young | |||
| refs. Give the minor GC a chance to run if it needs to. | |||
| Run memprof callbacks for the major allocation. */ | |||
| res = caml_process_pending_actions_with_root (res); | |||
| res = caml_check_urgent_gc(res); | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for keeping the memprof comment here; please also keep the polling behaviour for consistency of the comment with the trunk semantics of process_pending vs. check_urgent.
| /* Beware: the allocations below may cause finalizers to be run, and another | ||
| backtrace---possibly of a different length---to be stashed (for example | ||
| if the finalizer raises then catches an exception). We choose to ignore | ||
| any such finalizer backtraces and return the original one. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By current semantics of caml_alloc, this should not happen. Instead, caml_alloc should be fixed to delay the call to finalisers like in trunk (this looks simple to do). Then one can simplify the code below it as was done in trunk some time ago.
| @@ -27,23 +27,35 @@ | |||
| extern "C" { | |||
| #endif | |||
|
|
|||
| /* It is guaranteed that these allocation functions will not trigger | |||
| any OCaml callback such as finalizers or signal handlers. */ | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned above, this recent guarantee should be kept. It must have been accidentally lost during the rebase on top of 4.12, since I do not see any difficulty in keeping it in multicore.
| DOMAIN_STATE(volatile uintnat, young_limit) | ||
| /* Minor heap limit. Typically young_limit == young_start, but this field is set | ||
| * by other domains to signal this domain by causing a spurious allocation | ||
| * failure. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be atomic instead of volatile?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now fixed.
| needed. Therefore, we set [caml_something_to_do] again in order | ||
| to force reexamination of callbacks. */ | ||
| caml_set_action_pending(); | ||
| return exn; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is necessary if you want to correctly implement caml_process_pending_actions*
| if (!mem) | ||
| printf("Trimming failed\n"); | ||
| else if (mem != (void*)aligned_start) | ||
| printf("Hang on a sec - it's allocated a different block?!\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taking inspiration from https://github.com/golang/go/blob/c27a3592aec5f46ae18f7fd3d9ba18e69e2f1dcb/src/runtime/malloc.go#L813-L828 this seems to be on the right track, but it seems that you need to do it in a loop to take care of the possibility of a race.
I'd avoid printf to report a runtime error, and the string could be a bit more meaningful. In any case make sure that in case of error it returns 0 (you can also call caml_fatal_error for situations you know cannot happen according to specification).
Idem about printf in two more occurrences below in the same file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(cc @dra27)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #10908 (which, for what it's worth, would have appeared 5 days ago with an issue
runtime/memory.c
Outdated
| Using release stores for all writes also ensures publication safety | ||
| for newly-allocated objects, and isn't necessary for initialising | ||
| writes. The cost is free on x86, but requires a fence in | ||
| caml_modify on weakly-ordered architectures (ARM, Power). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a difference between this description and what is described in the PLDI 2018 paper, isn't it? I.e. no release store for writes (i.e. in armv8 str rather than stlr, no full dmb barrier).
runtime/memory.c
Outdated
| /* See Note [MM] above */ | ||
| atomic_thread_fence(memory_order_acquire); | ||
| atomic_store_explicit(&Op_atomic_val((value)fp)[0], val, | ||
| memory_order_release); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following my comment above based on the PLDI paper here I expect memory_order_relaxed instead of memory_order_release
runtime/interp.c
Outdated
| } | ||
| /* Fall through CHECK_SIGNALS */ | ||
|
|
||
| /* Signal handling */ | ||
|
|
||
| Instruct(CHECK_SIGNALS): /* accu not preserved */ | ||
| if (caml_something_to_do) goto process_actions; | ||
| if (Caml_check_gc_interrupt(domain_state) || | ||
| caml_check_for_pending_signals()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In link with the polling behaviour changes, this can be made more efficient again. (Idem for POPTRAP.)
|
Function |
|
@silene This is ocaml-multicore/ocaml-multicore#791. A lot of things making |
I think it's worth us pulling this in to a separate issue on ocaml/ocaml where we can work out exactly what needs to happen. There are some parts (like delaying finalisers and returning exceptions rather than raising) which actually require a fair bit of work to make happen. |
|
I agree, just rebasing correctly alone would have been very hard for someone not familiar with the code, and working on this part in trunk is the best way to be able to help you. |
runtime/memory.c
Outdated
| } else { | ||
| /* See Note [MM] above */ | ||
| atomic_thread_fence(memory_order_acquire); | ||
| ret = atomic_exchange(Op_atomic_val(ref), v); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's a question about the implementation of the memory model again. For context, currently, Atomic.store and Atomic.exchange result directly in a call to caml_atomic_exchange.
Here if I follow naively the paper I expect instead something like:
ret = atomic_exchange(Op_atomic_val(ref), v);
atomic_thread_fence(memory_order_release);On armv8 (clang) the original code results in:
dmb (ish)ld; L: ldaxr w2, [x0]; stlxr w3, w1, [x0]; cbnz w3, L (notice the missing dmb st)
whereas the above one results in:
L: ldaxr w2, [x0]; stlxr w3, w1, [x0]; cbnz w3, .L17; dmb (ish)
which is closer to the paper (it does not seem to be possible to emit the dmb st barrier with standard C atomics; the closest is to use a full barrier).
Of course at the moment armv8 is not supported (but I assume that the armv8 prototype is currently implemented with these functions). However this runs deeper:
- The implementation currently seems correct on x86 when these functions are called from OCaml. However one could imagine that when they are called from C code, a compiler might try to inline them and be too smart. How do you ensure that these functions implement the OCaml memory model on x86 when called from C? (Edit: the answer is that currently
caml_atomic_exchangeis not (yet) accessible from C code; onlycaml_modify, which seems to implement a stronger operation than the OCaml memory model. Maybe this question will be relevant later.) - Maybe the memory model has evolved between the first prototype and the paper and the documentation at the top of the file was not updated. In that case should comment it be updated (more that I initially thought)?
Naively I would think that for the OCaml-C FFI you are eventually going to implement a conservative approximation of the OCaml memory model in C, whereas for native OCaml code you will eventually emit the exact assembly code you need (plus a C call to the write barrier). Sorry if there is a simple answer that I am missing!
Edit: the questions above will be relevant eventually, but for now (x86-only) it looks correct, even if misleading.
… rebase Mentioned in ocaml#10831 review comment on runtime/io.c:552.
ocaml-multicore/multicore was already merge in ocaml/ocaml (ocaml/ocaml#10831) and avalaible with 5.00.0+trunk. Now it makes sense to remove ocaml-multicore repo and use directly the version 5.00 of OCaml.
This PR adds support for shared-memory parallelism through domains and direct-style concurrency through effect handlers (without syntactic support). It intends to have backwards compatibility in terms of language features, C API, and also the performance of single-threaded code.
For users
If you want to learn more about Multicore OCaml, please have a look at the multicore wiki for papers, talks, tutorials and blog posts.
If you are interested in using Multicore OCaml, we suggest having a look at the following work-in-progress libraries:
Here is a snapshot of the multicore scalability results on parallel benchmarks from sandmark on a 2 processor AMD EPYC 7551 server with 64 cores in total:
The number in the parenthesis next to the benchmark name is the time (in seconds) taken by the sequential baseline of the corresponding benchmark.
Review process
This PR comes at the back of asynchronous code reviews by the OCaml core developers followed by a week-long synchronous code review session. The summary of the discussions is available in the November multicore monthly.
As mentioned in the November multicore monthly, this PR constitutes the minimum viable product (MVP) for Multicore OCaml. The "pre-MVP" tasks have been completed, and the "post-MVP for 5.00" tasks are yet to be done. The aim is that these tasks, and the ones that follow, can be completed on ocaml/ocaml Github repo, through the usual OCaml PR review process rather than on the Multicore OCaml repo.
Given that the PR is quite large, if you spot major breakages and functionality gaps, we suggest that you make separate issues on ocaml/ocaml Github repo. This will help keep the discussion threads readable. Feel free to comment on minor issues (typos, formatting edits, etc) directly in this PR, and we shall be happy to fix those.
What's in the box
The only supported backend is
amd64on Linux and macOS;arm64is in the works. The following features are not implemented:Acknowledgements
Multicore OCaml has been a long running project. We would like to thank all those who have helped find issues, debugged issues, reviewed code and contributed code along the way.
-- The Multicore OCaml team