New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCaml multicore memory model and C (runtime, FFI, VM) #10992
Comments
Thank you for the concrete example. There's a typo: More surprisingly, your analysis seems to ignore the "acquire barrier; store release" operations performed by |
My layman's understanding is this: for synchronization to happen you need an order on the writer side and an order on the reader side. If you lack either then there is no order in total. Here the problematic behaviour is with relaxed memory models in hardware that can reorder reads (e.g. out-of-order and speculative reads on Arm), or clever compilers that try to leverage branch prediction by speculating on values. Both behaviours are allowed by the C11 memory model. So what is missing is an order on the reader side. As I wrote, Now as you can guess it would be very hard for a chip to load the value at address Compilers can speculate on values by rewriting Now compilers gave up implementing |
Thanks for the additional explanations. I was too focused on the writing side but now I see the reading side is the concern. If it can help, we could introduce and mandate the use of a I'm more afraid of crazy hardware than of crazy compilers. OCaml already depends on various UBs to be compiled in a sensible manner, e.g. overflows in signed integer arithmetic, and to mitigate the risks we can pass options to the compiler (e.g. Finally, I didn't see how making |
Thanks for the excellent writeup. For the record, my ideal option for |
|
I personally would like to see |
The Additional note (sorry for this, skip please if not interested). Something has been puzzling me for long and I have no answer: what happens when one mixes atomic and pain accesses to the same location? In other words, I am not sure that casting non atomic values to atomic ones and using atomic primitive to replace a plain load (something like |
I think @maranget's question on casting to atomics (currently used in multicore) is on-topic for this issue. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4013.html provides some useful context (leaning towards "admitted off-standard practice"). |
Possible solutionWhat is going to happen is that there is going to be legacy code where mutable state is safe by virtue of not being shared between domains. If we admit this, then one gets the best of all worlds: legacy code remains fast and correct even in a strict C11 sense ("please use this library only on one domain"), and this legacy code can nevertheless evolve progressively as needed to lift the restriction (and remain correct in a strict C11 sense). Understand that by one domain, I do not mean a restriction to single-core programs, but instead that a multi-core program should make sure all uses of the library happen on the same domain (which less strict and would still make many libraries useful). We can discuss various ways to enforce this constraint. Note that this is going to solve initially more thread-unsafety issues for code ported from OCaml 4 than the one discussed here (e.g. races that are not C data races). With this in mind, you want to introduce a macro
edit: |
Just to make sure we're on the same page: there are two very common such situation, namely 1- using |
Yes we are, I should have clarified that I was talking about loads of mutable fields. |
Excellent! Then I'm in favor of introducing |
I have to backtrack on what I said. As far as publication safety is concerned, if you want to stay within the C11 standard, you need a My latest proposition does not address the following issue:
My proposition instead addresses the backwards-compatibility part by proposing a plausible evolution path for legacy code if one wants to stay within the C11 standard. |
IIUC you need the If so, what would be the difference between Since data-race-free programs have SC semantics, do we only recommend the use of |
To be clear, you first have to answer three questions regarding C code called from OCaml:
AFAIU it is not possible to answer A at all three questions. If the answer includes 1A or 2A, then you need to load fields with an acquire load. So I see three solutions:
In terms of backwards-compatibility and code evolution from OCaml 4, the solutions are ranked BBA > ABB > AAB. In addition, ABB immediately makes legacy code "correct". In terms of C11 compliance it is ranked AAB > ABB > BBA. Indeed ABB relies on atomic casts which is not strictly C11 as Luc mentions, but this is simpler than assuming that a read from a volatile cast can head a dependency chain (BBA). In terms of performance, this is BBA > AAB > ABB. But note that while ABB sounds horrendous in terms of performance (SC load/stores), as far as performance on armv8 is concerned AAB is not much better than ABB (IIUC acq-rel atomics are compiled the same as SC atomics). If we completely ignore the C11 spec, I am not entirely confident that one can stick a volatile cast to the
Yes, orthogonally to what I was proposing, you do not need a distinction between mutable and immutable loads. I briefly confused myself with some earlier messages after having left this issue aside for a while. It is not obvious, though, that the difference does not matter. One could think: what if we only made the loads of mutable fields be acquire loads? Would that not be enough, and simpler to handle? The problem is that the load of a mutable field could be done on the OCaml side before being passed to a C function. Then you end up relying implicitly on dependency ordering on the C-side, and so this solution is not better than BBA "volatile cast". Or you could add an acquire fence at every OCaml to C call, to address this issue! Then this might make a 4th option. If you use a volatile cast in It shares some pros and some cons of AAB and BBA. If the dependency chains stuff is found too worrisome, then BBB is the closest good alternative to BBA.
This is the gist of my proposal regarding the handling of backwards-compatibility and code evolution. Existing code runs on a single domain, so there are currently no data-race issues for OCaml 4 code running on OCaml 5. However, it is not enough to describe what makes a whole program correct. We are talking about libraries that programmers will assemble to form whole programs, and existing programs that will progressively evolve to take advantage of multicore. Programs will live in a state where some part of the program subscribe to one memory model and another part to the other one. And memory models have an infectious nature (if the C part has catch-fire semantics, then OCaml has too, since the OCaml code can introduce races in the C part). My proposal is to make programming language sense of this idea by suggesting essentially to import the notion of single-threaded shared mutable state which exists in Rust ( |
Here is a summary of a discussion I had today with @kayceesrk. ProposalRegarding the
The existence of macros similar to Make
|
There is one point I'd like to understand better: why have an acquire semantics for the |
Consider the thread value t2(value x)
{
value y = Field(x, 0); // plain read
if (Is_some(y)) {
value z = Field(y, 0); // plain read
printf("%d:%d", Int_val(Field(z, 0)), Int_val(Field(z, 1))); // plain reads
}
return Val_unit;
} If the use of |
You mean duplicating value t2(value x)
{
if (Is_some(Field(x,0))) {
value z = Field(Field(x,0), 0); // plain read
printf("%d:%d", Int_val(Field(z, 0)), Int_val(Field(z, 1))); // plain reads
}
return Val_unit;
} I do not see how this duplication can do any harm w.r.t. to the hardware model (as the same memory location is read twice). I am not even sure that this duplication will occur if I am still not convinced that having acquire semantics is needed here, as long as the compiler does not destroy the dependency from y to z and from z to |
The duplication of loads is problematic because a parallel thread could have written As you say, the important thing after having a relaxed atomic load is that the compiler preserves the dependencies, to correctly observe the initializing writes after allocation. However, the C memory model does not give any such guarantee. The acquire loads and fences solve the problem, albeit in a heavy-handed way. To leverage the dependency ordering, you allude to doing something like in the Linux kernel and expect the compiler to preserve dependencies. Kernel developers rely on strict coding rules and attention given to what particular implementation of compilers expected to compile the kernel are doing. There is the "Linux kernel memory model" which is an informal model written in documentation notes, mailing list messages and allusions in standards committee proposals, which establishes a set of beliefs on how one can rely on dependency ordering in some compilers currently (gcc and clang). We did not think realistic in the long term to go the same way: in practice because the two projects are very different (notably because for OCaml the model is intended for users too), and in theory because this is at odds with the efforts of having a formal memory model. There is another option we did not discuss with KC (probably because it is way beyond my expertise), but which should be contemplated seriously (with its pros and cons): trusting that A |
If I may sum it up. We all agree on having a dedicated
I am in favor of 1., mostly because I think it is sensible to assume that no compiler will destroy "actual" dependencies. Option 3 or 3.5 is safer however and the cost may be low. Additionally, I assume that we are discussing the C side of accessing OCaml non-atomic references, or more generally mutable fields fields. Accessing OCaml atomics from C is a different story, I guess one should in that case use dedicated primitives. I am not aware of a discussion on this point. |
`Field(v,i)` is now `volatile`-qualified, to better handle races with concurrent updates in C or in OCaml. (More complete discussion at #10992.) `&Field(v,i)` now has type `volatile value *`. Add `volatile` qualifiers to `caml_modify`, `caml_initialize`, and in internal parts of the runtime system.
`Field(v,i)` is now `volatile`-qualified, to better handle races with concurrent updates in C or in OCaml. (More complete discussion at #10992.) `&Field(v,i)` now has type `volatile value *`. Add `volatile` qualifiers to `caml_modify`, `caml_initialize`, and in internal parts of the runtime system. (cherry picked from commit 0e09dc4)
During the last months, the question of how to treat concurrent accesses from the runtime and from FFI functions has re-emerged several times, as witnessed by the number of links to this thread. Can we agree on a policy? I don’t have any authority in the field of memory models, but as a compiler dweller who is frequently confronted to the problem, I am interested in having a clear set of rules to refer to. My impression is that state of this discussion has been well summarized in another thread 1:
In addition, we should create a Footnotes |
I agree that we should try to summarize the discussions in a form that newcomers can understand. On a ping from @OlivierNicole I asked @gadmm for more details, and I wrote the summary below under this supervision. This extends and replaces the summary I wrote earlier, quoted by Olivier above. Documenting the current state
Looking at the futureWe do not wish to deprecate the Guillaume Munch-Maccagnoni has considered two approaches to avoid the uneasy situation with
|
I wonder if the "documenting the current state" part should be placed somewhere more permanent. In memory.h? Somewhere in the user manual? I am thinking of sending a PR to include it as a .h comment somewhere -- suggestions on where to place it are welcome. |
One place that would seem logical to me would be to put it after the |
The "documenting the current state" part is now included as a comment in memory.c as suggested by @OlivierNicole. Thanks for the feedback! |
Even simple OCaml programs depend on bits written in C, via the runtime or the FFI, and all bytecode OCaml programs are run by an interpreter written in C. C11 has so-called “catch-fire” semantics for data races, which means that the program can (even in practice) crash or do worse things in the presence of a data race. The OCaml multicore memory model (proposed in the PLDI'18 paper) on the other hand aims to guarantee that even racy programs have well-defined behaviour.
The backwards-compatibility guarantees of multicore motivates (ideally) retrofitting this property onto existing programs, including in terms of performance preservation, which might prove challenging. If the local DRF property is (reasonably) not required of all C code in existence (which is initially going to be legacy code inherited from OCaml 4), then we can weaken the requirement as follows:
In addition, the backwards-compatibility requirement is strict for single-threaded programs, but this leaves some leeway to ask programmers that they audit C code for memory-model compliance before turning on multiple domains. In this case, there should be a realistic path to making existing code compliant. It is not clear yet that such a relaxation of backwards-compatibility claims is necessary, but it is an option.
(Disclaimer that usually discussions about memory models fly over my head, but I made the effort of finding good references for what is below and I asked for expert advice who confirmed the diagnosis. All errors in the document below are mine.)
(I believe that this issue can interest @stedolan, @kayceesrk, @maranget.)
The
Field
macroIn this issue I will focus on the
Field
macro fromcaml/mlvalues.h
which is used to read and initialize blocks in the OCaml heap, and is part of the documented FFI. It is currently implemented as a plain load/store:This can notably be subject to load/store tearing (even on aligned word-sized pointers) and invented load/stores (see Who's afraid of a big bad optimizing compiler?).
Example
As an example, the following is a well-defined racy OCaml program:
Thread 1 and 2 can also be implemented in C which looks like code that one can imagine seeing in real-world programs:
According to the C memory model this is UB because 1)
Field
does a plain read on a value with a conflicting concurrent access, and 2) even ifField
did a relaxed atomic read, there is no guarantee that t2 sees the initializing writes of y, it might be looking at uninitialized memory and dereferencing garbage. The code does not do anything fancy: any implementation of a simple pattern-matching can be subject to these issues. (The variabley
would be declared withCAMLlocal1
if the documentation was followed closely, but this does not fundamentally change the problem and the simplification is realistic of what to expect of expert code.)For example, for 1) the compiler could rewrite t2 by inventing reads as:
which could happen in practice if this code is placed in a context of high register pressure. Now the second read is not guaranteed to give the same value as the first one in a racy OCaml program, so one might be dereferencing whatever. To avoid this, the read should at least be a
relaxed
atomic read.For 2), the current implementation of the OCaml memory model (adapted to the new STW collector design) intends to rely on dependency ordering being enforced in hardware (e.g. address dependency in Arm) and the compiler not doing anything crazy things that breaks dependencies.
Dependency ordering in C
memory_order_consume
.relaxed
ordering, see http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2020/p2055r0.pdf §2.6.7 (see also §A).Some background for people not familiar with
memory_order_consume
: it is an ordering for atomic reads stronger thanrelaxed
but much cheaper thanacquire
on relaxed memory models, meant to rely on chips' ordering of data dependencies, inspired by the Linux kernel's RCU synchronisation mechanism which is very cheap for the reader. In practice though it was found impossible to implement as intended and is compiled like an acquire load by all compilers. Only DEC Alpha did not respect data dependency due to cache incoherence, in theory it is also threatened by any form of value prediction, more below. This is why in practice “memory_order_relaxed
may be used to head dependency chains”.The Linux kernel is a good source of info regarding the use of data dependency in practice in C. They avoid
consume
atomics at some cost:volatile
reads, expected to be treated similarly to relaxed atomic reads on word-sized values on the trusted compilers (AFAIU they could also use relaxed atomics, but use volatile for a mix of code legacy, backwards-compatibility and inadequacy of the C11 memory model)At least, Linux shows precedent about the need to go off-road regarding the strict confines of C11 well-defined behaviour. (The situation is eloquently described here.)
2 imperfect solutions
Therefore for the
Field
macro I see two (imperfect) options:Field
avolatile
cast (((volatile value *)(x)) [i]
):-fwrapv
regularizes OCaml's assumptions on signed overflows).Field
anatomic
cast (((_Atomic value *)(x)) [i]
):Field
become SC load/storesconsume
ordering (that is, anacquire
in practice, with permanent impact on performance), with some optimisations possible if immediateAnother option is to not change anything and inherit the “catch-fire” semantics of C11 when there is a race where one victim is the runtime or FFI code. But since this already invokes UB, it is unclear that there are any advantages to this compared to making
Field
a volatile cast.Remaining questions
Field
macro and similar macros in the same file as described above.Field
to store an immediate when it is known that what is overwritten is an immediate?)Field
macro is found in the C parts of the stdlib. AssumingField
is fixed, do the latter conform to the OCaml memory model?Field
macro is found in the interpreter. AssumingField
is fixed, does the latter implement the OCaml memory model?The text was updated successfully, but these errors were encountered: