OCaml multicore memory model and C (runtime, FFI, VM) #10992

gadmm opened this issue Feb 5, 2022 · 26 comments

Comments

@gadmm
Contributor

gadmm commented Feb 5, 2022

Even simple OCaml programs depend on bits written in C, via the runtime or the FFI, and all bytecode OCaml programs are run by an interpreter written in C. C11 has so-called “catch-fire” semantics for data races, which means that the program can (even in practice) crash or do worse things in the presence of a data race. The OCaml multicore memory model (proposed in the PLDI'18 paper) on the other hand aims to guarantee that even racy programs have well-defined behaviour.

  • To prevent the C code from threatening OCaml's local DRF property, the C code interacting with OCaml should implement the OCaml memory model.

The backwards-compatibility guarantees of multicore motivate (ideally) retrofitting this property onto existing programs, including in terms of performance preservation, which might prove challenging. If the local DRF property is (reasonably) not required of all C code in existence (which is initially going to be legacy code inherited from OCaml 4), then we can weaken the requirement as follows:

  • To prevent the catch-fire semantics of C11 from threatening OCaml's memory safety guarantees, the C code interacting with OCaml should take the OCaml memory model and racy OCaml programs into account.

In addition, the backwards-compatibility requirement is strict for single-threaded programs, but this leaves some leeway to ask programmers to audit C code for memory-model compliance before turning on multiple domains. In this case, there should be a realistic path to making existing code compliant. It is not clear yet that such a relaxation of backwards-compatibility claims is necessary, but it is an option.

(Disclaimer: discussions about memory models usually fly over my head, but I made the effort of finding good references for what is below and I asked experts for advice, who confirmed the diagnosis. All errors in the document below are mine.)

(I believe that this issue can interest @stedolan, @kayceesrk, @maranget.)


The Field macro

In this issue I will focus on the Field macro from caml/mlvalues.h which is used to read and initialize blocks in the OCaml heap, and is part of the documented FFI. It is currently implemented as a plain load/store:

#define Field(x, i) (((value *)(x)) [i])           /* Also an l-value. */

This can notably be subject to load/store tearing (even on aligned word-sized pointers) and invented load/stores (see Who's afraid of a big bad optimizing compiler?).

Example

As an example, the following is a well-defined racy OCaml program:

let t1 x = x := Some (1,1) (* thread 1 *)
let t2 x = match !x with None -> () | Some (n,m) -> Printf.printf "%d:%d" n m (* thread 2 *)
let _ =
  let x = ref None in
  {{ t1 x || t2 x }}

Threads 1 and 2 can also be implemented in C, in a way that looks like code one can imagine seeing in real-world programs:

value t1(value x)
{
  CAMLparam1(x);
  CAMLlocal2(y,z);
  y = caml_alloc_small(1, Tag_some); // plain stores
  z = caml_alloc_small(2, 0); // plain stores
  Field(z, 0) = Val_int(0); // plain store
  Field(z, 1) = Val_int(0); // plain store
  Field(y, 0) = z; // plain store
  Store_field(x, 0, y); // release atomic store via caml_modify
  CAMLreturn(Val_unit);
}

value t2(value x)
{
  value y = Field(x, 0); // plain read
  if (Is_some(y)) {
    value z = Field(y, 0); // plain read
    printf("%d:%d", Int_val(Field(z, 0)), Int_val(Field(z, 1))); // plain reads
  }
  return Val_unit;
}

According to the C memory model this is UB because 1) Field does a plain read on a value with a conflicting concurrent access, and 2) even if Field did a relaxed atomic read, there is no guarantee that t2 sees the initializing writes of y; it might be looking at uninitialized memory and dereferencing garbage. The code does not do anything fancy: any implementation of a simple pattern-matching can be subject to these issues. (The variable y would be declared with CAMLlocal1 if the documentation were followed closely, but this does not fundamentally change the problem, and the simplification is representative of what to expect of expert code.)

For example, for 1) the compiler could rewrite t2 by inventing reads as:

  if (Is_some(Field(x, 0))) {
    z = Field(Field(x, 0), 0);
    printf("%d:%d", Int_val(Field(z, 0)), Int_val(Field(z, 1)));
  }

which could happen in practice if this code is placed in a context of high register pressure. Now the second read is not guaranteed to give the same value as the first one in a racy OCaml program, so one might be dereferencing whatever. To avoid this, the read should at least be a relaxed atomic read.
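For illustration, here is a minimal sketch of t2 with the first read made a relaxed atomic load (hypothetical code, not a proposal of the final API; the cast-to-_Atomic pattern is itself discussed further below):

#include <stdio.h>
#include <stdatomic.h>
#include <caml/mlvalues.h>

value t2_relaxed(value x) /* hypothetical variant of t2 above */
{
  /* A relaxed atomic load is not subject to tearing or invented reads,
     so the value dereferenced below is the value that was tested. */
  value y = atomic_load_explicit((_Atomic value *)&Field(x, 0),
                                 memory_order_relaxed);
  if (Is_some(y)) {
    value z = Field(y, 0); /* still relies on dependency ordering, see 2) below */
    printf("%d:%d", Int_val(Field(z, 0)), Int_val(Field(z, 1)));
  }
  return Val_unit;
}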

For 2), the current implementation of the OCaml memory model (adapted to the new STW collector design) intends to rely on dependency ordering being enforced in hardware (e.g. address dependency in Arm) and on the compiler not doing crazy things that break dependencies.

Dependency ordering in C

  • In theory the C version is going to be UB if reads are not at least memory_order_consume.
  • In practice it falls within a de facto use of the relaxed ordering, see http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2020/p2055r0.pdf §2.6.7 (see also §A).

    “[if] the developers are willing to live within strict coding standards [12], memory_order_relaxed may be used to head dependency chains. […] Note well that this design pattern is outside of the current standard.”

Some background for people not familiar with memory_order_consume: it is an ordering for atomic reads that is stronger than relaxed but much cheaper than acquire on relaxed memory models. It is meant to rely on chips' ordering of data dependencies, and was inspired by the Linux kernel's RCU synchronisation mechanism, which is very cheap for the reader. In practice, though, it was found impossible to implement as intended and is compiled like an acquire load by all compilers. Only DEC Alpha did not respect data dependencies, due to cache incoherence; in theory dependency ordering is also threatened by any form of value prediction (more below). This is why in practice “memory_order_relaxed may be used to head dependency chains”.
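For readers who want to see the orderings side by side, a small illustrative sketch (not OCaml-specific) of reading a pointer published by another thread:

#include <stdatomic.h>

extern _Atomic(int *) shared; /* published elsewhere with a release store */

/* consume: meant to be nearly free by relying on hardware dependency
   ordering, but compiled like an acquire load by today's compilers */
int read_consume(void) { return *atomic_load_explicit(&shared, memory_order_consume); }

/* acquire: always correct per C11, costs a barrier on weakly ordered hardware */
int read_acquire(void) { return *atomic_load_explicit(&shared, memory_order_acquire); }

/* relaxed: the de facto "head of a dependency chain" pattern described above;
   outside the standard, relies on the compiler preserving the address dependency */
int read_relaxed(void) { return *atomic_load_explicit(&shared, memory_order_relaxed); }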

The Linux kernel is a good source of info regarding the use of data dependency in practice in C. They avoid consume atomics at some cost:

  • place trust in the C compilers that Linux is supposed to be compiled with (gcc, clang, icc?) to do nothing crazy
  • use volatile reads, expected to be treated similarly to relaxed atomic reads on word-sized values on the trusted compilers (AFAIU they could also use relaxed atomics, but use volatile for a mix of code legacy, backwards-compatibility and inadequacy of the C11 memory model)
  • expect programmers to follow a strict coding discipline, cf. the link above. In practice for OCaml+C code not all of these rules are necessary, and the remaining ones are hard to violate by accident (essentially: not doing anything that would let the compiler break dependencies by speculating on the value of pointers).

At the least, Linux sets a precedent for the need to go off-road from the strict confines of C11 well-defined behaviour. (The situation is eloquently described here.)
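For concreteness, the kernel-style volatile read mentioned above boils down to something like the following sketch (simplified; the real READ_ONCE macro has more machinery, the name here is invented, and typeof is a GNU extension):

/* Simplified READ_ONCE-style helper: a volatile access that the compiler may
   not tear, duplicate, or invent; dependency ordering is then trusted to the
   hardware and to the compiler not speculating on the loaded value. */
#define READ_ONCE_ISH(x) (*(const volatile typeof(x) *)&(x))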

2 imperfect solutions

Therefore for the Field macro I see two (imperfect) options:

  1. Make Field a volatile cast (((volatile value *)(x)) [i]):
    • the OCaml runtime & FFI do not follow the C11 memory model
    • volatile reads are forever assumed to mimic relaxed load/stores on word-sized aligned pointers, and to head dependency chains: no exotic/futuristic ISAs even for bytecode (or hope that a compiler switch makes it well-defined in the future, like -fwrapv regularizes OCaml's assumptions on signed overflows).
    • preserves performance (hopefully)
    • clarify rules similar to the RCU “coding standards” linked above (based on the relevant subset). A careful read is reassuring in that only out-of-the-ordinary code can have bad consequences, but this still needs to be documented.
    • not sufficient for legacy code to automatically enforce the OCaml memory model?
    • backwards-compatible?
  2. Make Field an atomic cast (((_Atomic value *)(x)) [i]):
    • well-defined behaviour, read/writes using Field become SC load/stores
    • definite performance impact: the idea is that code does not break but users are encouraged to progressively migrate towards newly-introduced more efficient primitives
    • in order to stay within the C11 memory model, a new primitive to read a field would still require at least consume ordering (that is, an acquire in practice, with permanent impact on performance), with some optimisations possible if immediate
    • sufficient for legacy code to automatically enforce the OCaml memory model?
    • backwards-compatible in a different manner?

Another option is to not change anything and inherit the “catch-fire” semantics of C11 when there is a race where one victim is the runtime or FFI code. But since this already invokes UB, it is unclear that there are any advantages to this compared to making Field a volatile cast.


Remaining questions

  • Decide a solution to fix the Field macro and similar macros in the same file as described above.
  • Is any undocumented rule used by realistic, professional code with OCaml 4 but invalid for the OCaml memory model? (For instance, use of Field to store an immediate when it is known that what is overwritten is an immediate?)
  • Are there other parts of the documented FFI that require adaptation to make it realistic to conform to the OCaml memory model? Is the documentation up-to-date in this respect?
  • The Field macro is found in the C parts of the stdlib. Assuming Field is fixed, do the latter conform to the OCaml memory model?
  • The Field macro is found in the interpreter. Assuming Field is fixed, does the latter implement the OCaml memory model?
@xavierleroy
Contributor

xavierleroy commented Feb 6, 2022

Thank you for the concrete example. There's a typo: Field(y, 0) = y; should be = z;.

More surprisingly, your analysis seems to ignore the "acquire barrier; store release" operations performed by caml_modify, a.k.a. Store_field. Of course, if Store_field were a plain write, "there is no guarantee that t2 sees the initializing writes of y, it might be looking at uninitialized memory and dereferencing garbage". But it is not a plain write. So, what's really going on in this example?

@gadmm
Contributor Author

gadmm commented Feb 6, 2022

My layman's understanding is this: for synchronization to happen you need an order on the writer side and an order on the reader side. If you lack either, then there is no order overall. Here the problematic behaviour is with relaxed memory models in hardware that can reorder reads (e.g. out-of-order and speculative reads on Arm), or clever compilers that try to leverage branch prediction by speculating on values. Both behaviours are allowed by the C11 memory model. So what is missing is an order on the reader side. As I wrote, Store_field does a release atomic store. But you can have the strongest barrier you want in caml_modify: it is not going to address the issue.

Now as you can guess it would be very hard for a chip to load the value at address *p before loading the value at address p (!). So it is very cheap for Arm to guarantee that this does not happen (address ordering). It is possible to imagine though how hardware could give up this guarantee and require a fence, if the cache itself is not ordered like on Alpha, or if the hardware is able to predict values (only an academic research topic at the moment). The C11 memory model is either very in the past or very in the future, and allows such things. The only way to leverage Arm's address ordering according to the C11 standard is to use an order stronger or equal to memory_order_consume.

Compilers can speculate on values by rewriting E[p] as if (p == A) then E[A] else E[p], which lets branch prediction kick in. It is very hard for a compiler to speculate on pointer values; this is the gist of the RCU coding guidelines, which give a few examples where this can happen (essentially the compiler can take guesses according to what you do with the pointer afterwards). Most examples do not apply to normal usage of the OCaml FFI, and I find it rather reassuring. The guidelines advise turning off feedback-driven optimizers, which lets you imagine very creative scenarios: for instance, an optimizer observes a program running and decides that it is good to speculate on the equality between two pointer values (for instance if a single rare allocation is the only allocation in a certain size class for the whole duration of the program, which de facto gives it a fixed address compared to some internal allocator data). Of course this is very unlikely to happen. But it is allowed by the C11 memory model. To force compilers to preserve data dependencies, C11 gives you memory_order_consume.

Now compilers gave up implementing memory_order_consume and simply replace it with memory_order_acquire. Hence even putting backwards-compatibility constraints aside, you are left with a choice between efficiency and standards-compliance.

@xavierleroy
Contributor

Thanks for the additional explanations. I was too focused on the writing side but now I see the reading side is the concern.

If it can help, we could introduce and mandate the use of a Load_mutable_field macro, symmetrical to Store_field, that loads with appropriate memory ordering (acquire?), and is not an l-value. Since most OCaml data structures are immutable, especially those accessed from C stubs, I'd like to keep the existing Field definition unchanged for initializing and reading from immutable blocks.
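One possible shape for such a macro, as a sketch only (the name comes from this comment; the implementation is one of several possibilities discussed later in the thread):

#include <stdatomic.h>
#include <caml/mlvalues.h>

/* Acquire load of a (mutable) field; an r-value, not an l-value. */
#define Load_mutable_field(x, i) \
  atomic_load_explicit(&((_Atomic value *)(x))[i], memory_order_acquire)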

I'm more afraid of crazy hardware than of crazy compilers. OCaml already depends on various UBs to be compiled in a sensible manner, e.g. overflows in signed integer arithmetic, and to mitigate the risks we can pass options to the compiler (e.g. -fwrapv currently) and/or limit the optimization level (e.g. -O2 currently, and I wouldn't go any higher). So, if crazy compiler optimizations are the only concern, it should be manageable.

Finally, I didn't see how making Field a volatile memory access would prevent the "let's guess the address that will be accessed" optimizations.

@stedolan
Contributor

stedolan commented Feb 6, 2022

Thanks for the excellent writeup. For the record, my ideal option for Field would be some sort of (_Atomic(relaxed) value*) cast that results in an lvalue that does relaxed-atomic loads and stores. Sadly no such cast exists in C, so I'd lean towards option (1) (volatile cast) as the closest alternative. (However, even this changes the type of (&Field(...)), which will have its own backwards compatibility concerns)
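To make the last point concrete, a tiny illustrative sketch of how the type of &Field(...) changes under a volatile cast (the function and variable names are invented):

#include <caml/mlvalues.h>

void address_of_field(value x)
{
  /* With the current Field, &Field(x, 0) has type value *.
     If Field becomes a volatile cast, it has type volatile value *, and
     existing code of the second form needs a cast (discarding the qualifier). */
  volatile value *p = &Field(x, 0);
  value *q = (value *) p;
  (void) q;
}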

@gadmm
Contributor Author

gadmm commented Feb 6, 2022

  • Adding a Load_mutable_field and deprecating certain uses of Field is reminiscent (at a smaller scale) of the kind of changes that were needed by the previous concurrent collector design (it, too, could have simply preserved the behaviour for programs running on a single domain, IIUC). It does not seem obvious at all to me that one can look at C code and identify all the places where Field should be changed (nothing distinguishes a mutable load from an immutable load in current code). Perhaps Stephen and KC have more experience to share regarding the difficulty of such changes?
  • A performance (and backwards-compatibility, in light of Stephen's remark) evaluation of making Field a volatile cast seems needed to decide whether or not the change has such a big impact. It seems that various reasons could force you to keep the macro as it is, so empirical evidence is required.
  • The rationale for assuming that volatile reads head dependency chains is: the C committee knows that consume is broken and that relaxed is used for this purpose in practice (cf. p2055r0); moreover you are in good company in assuming that volatile is a proxy for relaxed in this usage. These are current, documented de facto uses of C, and the first of them still has no alternative for some crucial uses. If I follow correctly, the committee are more the kind of people to evolve the standard to serve existing usage than the converse, and (C) compiler writers try to avoid breaking code.

@gasche
Member

gasche commented Feb 7, 2022

I personally would like to see Load_mutable_field and Load_immutable_field macros on the C side, that need not be lvalues. I could use them to express my intent / my understanding of the value type being manipulated, and I would feel confident that this is robust to future changes in the runtime.
(I don't know if it's realistic to deprecate Field and expect those new macros to replace existing FFI code. But having them in addition to the current approach makes it easier to write forward-looking code, and it will make it easier to make some more opinionated changes to the FFI in N years if we want to or have to.)

@maranget
Contributor

maranget commented Feb 7, 2022

The Load_mutable_field/Store_field suggestion reminds me of Linux kernel practice: deprecating the all-purpose macro for sensitive pointers, in favor of having different macros for storing to and loading from sensitive pointers. Those macros are based upon casts to volatile pointers. Notice that they did so for reasons different from ours, and that they manage to make the load-side macros not be l-values (some details). So it is possible to have volatile casts and force r-value usage.
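As a sketch of the r-value point (macro name invented): in C a cast expression is not an l-value, so a volatile-based load macro can be made read-only simply by casting its result:

#include <caml/mlvalues.h>

/* Volatile load that cannot be used as the target of an assignment. */
#define Load_field_rvalue(x, i) ((value)(((volatile value *)(x))[i]))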

Additional note (sorry for this, please skip if not interested). Something has been puzzling me for long and I have no answer: what happens when one mixes atomic and plain accesses to the same location? In other words, I am not sure that casting non-atomic values to atomic ones and using atomic primitives to replace a plain load (something like atomic_load_explicit((_Atomic(value) *)p, memory_order_relaxed), if I am not mistaken) is well defined according to the standard.

@gadmm
Contributor Author

gadmm commented Feb 8, 2022

I think @maranget's question on casting to atomics (currently used in multicore) is on-topic for this issue. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4013.html provides some useful context (leaning towards "admitted off-standard practice").

@gadmm
Contributor Author

gadmm commented Mar 26, 2022

Possible solution

What is going to happen is that there is going to be legacy code where mutable state is safe by virtue of not being shared between domains. If we admit this, then one gets the best of all worlds: legacy code remains fast and correct even in a strict C11 sense ("please use this library only on one domain"), and this legacy code can nevertheless evolve progressively as needed to lift the restriction (and remain correct in a strict C11 sense).

Understand that by one domain, I do not mean a restriction to single-core programs, but instead that a multi-core program should make sure all uses of the library happen on the same domain (which is less strict and would still make many libraries useful). We can discuss various ways to enforce this constraint.

Note that initially this is going to solve more thread-unsafety issues for code ported from OCaml 4 than the one discussed here (e.g. races that are not C data races).

With this in mind, you want to introduce a macro Load_field that performs an acquire load, but:

  1. you can either say that newly-introduced loads must use it, or that there are situations where it still makes sense to use Field (e.g. performance in single-domain scenarios, given how expensive acquire-release is on Arm)
  2. you can make Field perform volatile accesses, but this is not necessary for safety. You would want to do it to guard against crashes in practice due to misuse, but the first argument for memory safety remains based in the C11 standard.

edit: Load_field instead of Load_mutable_field.

@xavierleroy
Contributor

there are situations where it still makes sense to use Field

Just to make sure we're on the same page: there are two very common such situations, namely 1- using Field to access an immutable field and 2- initializing such a field after caml_alloc_small. But for mutable fields, I'm all in favor of introducing and using Load_mutable_field.

@gadmm
Contributor Author

gadmm commented Mar 26, 2022

Just to make sure we're on the same page

Yes we are, I should have clarified that I was talking about loads of mutable fields.

@xavierleroy
Contributor

Excellent! Then I'm in favor of introducing Load_mutable_field, documenting it, and using it within the core system (at first).

@gadmm
Contributor Author

gadmm commented Mar 28, 2022

I have to backtrack on what I said. As far as publication safety is concerned, if you want to stay within the C11 standard, you need a Load_immutable_field that performs an acquire atomic load (let's put memory_order_consume aside which introduces complications on its usage in theory and is useless in practice).

My latest proposition does not address the following issue:

even putting backwards-compatibility constraints aside, you are left with a choice between efficiency and standards-compliance.

My proposition instead addresses the backwards-compatibility part by proposing a plausible evolution path for legacy code if one wants to stay within the C11 standard.

@kayceesrk
Contributor

@gadmm

IIUC you need the acquire load on the immutable fields because the initialising stores are plain stores and publication safety is thanks to the release store in caml_modify. Is that right?

If so, what would be the difference between Load_mutable_field and Load_immutable_field? We just need a single Load_field_atomic which does an acquire load, don't we?

Since data-race-free programs have SC semantics, do we recommend the use of Load_field_atomic only when the access is known to be racy?

@gadmm
Contributor Author

gadmm commented Apr 20, 2022

IIUC you need the acquire load on the immutable fields because the initialising stores are plain stores and publication safety is thanks to the release store in caml_modify. Is that right?

To be clear, you first have to answer three questions regarding C code called from OCaml:

  1. Do you want to enforce the local DRF property (A), or do you want to only avoid the catch-fire semantics (B)?
  2. Do you want to remain strictly within the C11 spec (A), or do you accept de facto uses which the committee knows about, such as found in the Linux kernel (B)?
  3. Does backwards-compatibility include preserving performance (A), or are you fine with a "C-OCaml" memory model which is stronger than it needs to be (B)?

AFAIU it is not possible to answer A at all three questions.

If the answer includes 1A or 2A, then you need to load fields with an acquire load.
If the answer includes 1B and 2B, then you can make them volatile accesses instead.

So I see three solutions:

  • Option AAB (1A,2A,3B): introduce a Load_field macro with acquire semantics.
  • Option ABB (1A,2B,3B): make the Field macro an _Atomic cast, and introduce a macro for a relaxed store that programmers can use to replace field initialization.
  • Option BBA (1B,2B,3A): make the Field macro a volatile cast.

In terms of backwards-compatibility and code evolution from OCaml 4, the solutions are ranked BBA > ABB > AAB. In addition, ABB immediately makes legacy code "correct".

In terms of C11 compliance the ranking is AAB > ABB > BBA. Indeed ABB relies on atomic casts, which are not strictly C11 as Luc mentions, but this is simpler than assuming that a read from a volatile cast can head a dependency chain (BBA).

In terms of performance, this is BBA > AAB > ABB. But note that while ABB sounds horrendous in terms of performance (SC load/stores), as far as performance on armv8 is concerned AAB is not much better than ABB (IIUC acq-rel atomics are compiled the same as SC atomics).

If we completely ignore the C11 spec, I am not entirely confident that one can stick a volatile cast into the Field macro and declare that all legacy code is fine. My "proposal" above, independently of whether AAB or BBA is chosen, is meant to rationalize the existence of legacy code that otherwise would not follow any rules (ABB does not need this).

If so, what would be the difference between Load_mutable_field and Load_immutable_field? We just need a single Load_field_atomic which does an acquire load, don't we?

Yes, orthogonally to what I was proposing, you do not need a distinction between mutable and immutable loads. I briefly confused myself with some earlier messages after having left this issue aside for a while.

It is not obvious, though, that the difference does not matter. One could think: what if we only made the loads of mutable fields be acquire loads? Would that not be enough, and simpler to handle? The problem is that the load of a mutable field could be done on the OCaml side before being passed to a C function. Then you end up relying implicitly on dependency ordering on the C-side, and so this solution is not better than BBA "volatile cast".

Or you could add an acquire fence at every OCaml to C call, to address this issue! Then this might make a 4th option. If you use a volatile cast in Field as a proxy for relaxed load/stores, then you would only need a Load_mutable_field macro for publication safety (instead of relying on assumptions about dependency chains). Let's call it BBB. This idea just came to me while answering your question, so I did not have time to fully explore it.

It shares some pros and some cons of AAB and BBA. If the dependency chains stuff is found too worrisome, then BBB is the closest good alternative to BBA.
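A rough sketch of what the BBB idea could look like on the C side (purely illustrative; in the idea above the fence would be tied to the OCaml-to-C call boundary by the compiler or runtime rather than written by hand, and the stub name is invented):

#include <stdatomic.h>
#include <caml/mlvalues.h>

value bbb_stub(value v) /* hypothetical C stub receiving an OCaml value */
{
  /* Acquire fence at entry: the intent described above is that loads of
     mutable fields done on the OCaml side before the call then behave like
     acquire loads, so the reads below can see the initializing writes made
     before the corresponding release store. */
  atomic_thread_fence(memory_order_acquire);
  return Field(v, 0);
}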

Since data-race-free programs have SC semantics, do we recommend the use of Load_field_atomic only when the access is known to be racy?

This is the gist of my proposal regarding the handling of backwards-compatibility and code evolution. Existing code runs on a single domain, so there are currently no data-race issues for OCaml 4 code running on OCaml 5.

However, it is not enough to describe what makes a whole program correct. We are talking about libraries that programmers will assemble to form whole programs, and existing programs that will progressively evolve to take advantage of multicore. Programs will live in a state where some parts of the program subscribe to one memory model and other parts to the other one. And memory models have an infectious nature (if the C part has catch-fire semantics, then OCaml does too, since the OCaml code can introduce races in the C part).

My proposal is to make programming-language sense of this idea by suggesting essentially to import the notion of single-threaded shared mutable state which exists in Rust (Cell). Whenever adaptation is needed to make C functions subscribe to the OCaml memory model (models AAB, BBA and BBB), legacy code will remain correct in the meanwhile by virtue of this notion; and in the longer term it justifies the use of more efficient relaxed loads instead of acquire loads in some situations (in the case of AAB and BBB).

@gadmm
Contributor Author

gadmm commented Apr 28, 2022

Here is a summary of a discussion I had today with @kayceesrk.

Proposal

Regarding the Field macro, we propose a three-layer solution:

  1. Make Field a volatile cast, to avoid crashes in practice following from incorrect usage of the macro.
  2. Introduce and document a Load_field macro, to advocate a style that conforms to both the C11 and OCaml memory models.
  3. Allow the use of the Field macro in the absence of races, for backwards-compatibility.

The existence of macros similar to Field, which would require a similar treatment, remains to be audited.

Make Field a volatile cast

For the reasons explained above, we believe that making Field a volatile cast will be enough in practice to avoid crashes following from situations like the initial example. We also believe that the impact on performance is negligible (though this remains to be confirmed by benchmarks).

However, this relies on a fragile pile of assumptions:

  • that one can use volatile store/loads as a proxy for relaxed store/loads,
  • that relaxed store/loads can head dependency chains in the absence of code that would let the compiler speculate on values,
  • that such code that would let the compiler speculate on values is too exotic to appear inside legacy C-OCaml code,
  • that this will remain true through the evolution of compilers and hardware.

This does not fix specialised user code (e.g. if they copy-pasted the Field macro into their own version). In addition, this causes issues if someone had a project to use a certified C compiler with OCaml, because this relies on largely unwritten and unspecified behaviour unlikely to be part of what is certified.

Therefore, we see this as a stop-gap measure rather than a solution.

Introduce and document a Load_field macro

The end goal is that programmers use a Load_field macro which does an acquire load. By adding an acquire fence when going from OCaml to C, we actually only need the Load_field macro when reading from mutable fields.

To avoid cognitive overload, we will present a simple usage and an "expert" usage:

  • When in doubt, use Load_field instead of an r-value Field (see the sketch after this list).
  • Expert users can use Field instead of Load_field in some circumstances, for instance when the field is immutable or of an immediate type.
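A sketch of this usage split (Load_field as proposed here, not an existing runtime API; the stub and its argument are invented for illustration):

#include <caml/mlvalues.h>

value example_stub(value cell) /* hypothetical: a ref cell holding a pair */
{
  value pair = Load_field(cell, 0); /* default: the field is mutable          */
  return Field(pair, 0);            /* expert: immutable field of the pair    */
}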

We will explain that this is necessary in order to subscribe to both the C and OCaml memory models. I think that this gives the local DRF property under the assumption that the Field macro is not used improperly.

We do not propose to deprecate the Field macro, as we see such a change as unrealistic.

Allow the use of the Field macro in the absence of races

Experience from the concurrent collector showed that changing the C API to introduce a read barrier in existing programs is a lot of work. For the sake of backwards-compatibility, users would only have to introduce the Load_field macro progressively as they make their program multi-core. In the absence of races, for instance in programs that run on a single domain, it is fine to keep using Field.

This will need to be explicitly documented, to give a status to programs ported from OCaml 4. Since "no data race" is a whole-program (non-compositional) property, we instead plan to explicitly list programming patterns that are allowed. For instance:

  • Whenever some mutable data is never shared between threads
  • Whenever some mutable data is protected by a mutex

Threats

  • As mentioned by Stephen, making Field a volatile cast will change the type of &Field(), which might cause issues for backwards-compatibility. This should be checked.

  • Giving mutable fields acquire-release semantics is extremely costly on non-x86 hardware (see the SRA model in the PLDI paper). What remains to be seen is whether such performance-sensitive code is found on the C side of OCaml programs, and if so, whether this cost can be alleviated by other means. It should be easy to run a worst-case benchmark.

  • There is a risk that users will keep using the Field macro and never fully transition to Load_field, under the observation that it "just works" due to the volatile cast and is more performant. The fact that the "volatile cast" solution is an "almost" solution might become a drawback.

  • Idem for the "no data race" criterion, if users feel compelled to rely on it. Enforcing the programming patterns requires a Rust-like type system which OCaml does not have, so we are back to a catch-fire semantics (mitigated by the volatile cast).

@maranget
Contributor

There is one point I'd like to understand better: why have an acquire semantics for the Load_field macro?

@kayceesrk
Contributor

@maranget

Consider the thread t2 in the original example #10992 (comment), reproduced below:

value t2(value x)
{
  value y = Field(x, 0); // plain read
  if (Is_some(y)) {
    value z = Field(y, 0); // plain read
    printf("%d:%d", Int_val(Field(z, 0)), Int_val(Field(z, 1))); // plain reads
  }
  return Val_unit;
}

If the use of the Field macro in t2 were replaced by Load_field, then the ordering between the first two loads is preserved thanks to the acquire semantics. The compiler no longer does the problematic optimisation (duplicating the first load).

@maranget
Contributor

You mean duplicating value y=Field(x,0) as follows:

value t2(value x)
{
  if (Is_some(Field(x,0))) {
    value z = Field(Field(x,0), 0); // plain read
    printf("%d:%d", Int_val(Field(z, 0)), Int_val(Field(z, 1))); // plain reads
  }
  return Val_unit;
}

I do not see how this duplication can do any harm w.r.t. the hardware model (as the same memory location is read twice). I am not even sure that this duplication will occur if Field(x,0) above includes a cast to volatile. Perhaps some additional compiler transformation can do real harm, but I fail to see it.

I am still not convinced that having acquire semantics is needed here, as long as the compiler does not destroy the dependency from y to z and from z to Field(z,0) and Field(z,1). (Considering, as we agreed earlier in this thread, that we need some release ordering on the writing side).

@gadmm
Contributor Author

gadmm commented Jun 5, 2022

The duplication of loads is problematic because a parallel thread could have written Val_none into Field(x,0) before the second read. However, a relaxed atomic load (or volatile load) is enough to avoid the duplication.

As you say, the important thing after having a relaxed atomic load is that the compiler preserves the dependencies, to correctly observe the initializing writes after allocation. However, the C memory model does not give any such guarantee. The acquire loads and fences solve the problem, albeit in a heavy-handed way.

To leverage dependency ordering, you allude to doing something like in the Linux kernel and expecting the compiler to preserve dependencies. Kernel developers rely on strict coding rules and on attention paid to what the particular compilers expected to compile the kernel are doing. There is the "Linux kernel memory model", an informal model written in documentation notes, mailing list messages and allusions in standards committee proposals, which establishes a set of beliefs on how one can rely on dependency ordering in some compilers currently (gcc and clang). We did not think it realistic in the long term to go the same way: in practice because the two projects are very different (notably because for OCaml the model is intended for users too), and in theory because this is at odds with the effort of having a formal memory model.

There is another option we did not discuss with KC (probably because it is way beyond my expertise), but which should be contemplated seriously (with its pros and cons): trusting that volatile is enough currently, as I initially proposed, and then regularizing the situation by making something like the LKMM formal (roughly, "obvious" dependencies are respected except if the code lets the compiler speculate on values). A thought experiment is to see if it is plausible that there may one day be compiler flags to enforce the assumptions (in the manner of -fwrapv and -fno-strict-aliasing). However, while I can imagine a flag to give volatile the semantics of relaxed atomic loads, the dependency business is trickier (first because it is hard to properly define). We only care about actual semantic dependencies and not artificial corner cases of the consume ordering, but so did all the people who have tried to come up with a practical alternative to consume, and the current proposals to fix the consume ordering look nothing like something that could readily apply to existing code by simply adding a specifier in a macro. This makes me suspect a lot of unexpected difficulties. On the other hand, it is baffling that it is so hard to formalise how obvious dependencies in the source translate to dependencies in the target, so I would be very happy to see it solved.

A Load_field that does an acquire load, with a simple usage and an "expert" usage, is simple to explain and implement for OCaml. My main fear is that users will prefer gambling on volatile anyway, because it does not require them to take action to migrate code and it does not cost performance on Arm.

@maranget
Contributor

maranget commented Jun 10, 2022

If I may sum it up: we all agree on having a dedicated Load_field. But there are three options to implement it:

  1. Cast to volatile.
  2. Cast to atomic, relaxed access.
  3. Cast to atomic acquire access.
    (There is a 3.5 option, no cast, add an acquire fence after the load, hoping for reasonable semantics, i.e. semantically equivalent to 3.)

I am in favor of 1., mostly because I think it is sensible to assume that no compiler will destroy "actual" dependencies. Options 3 and 3.5 are safer, however, and the cost may be low.
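For concreteness, hypothetical macro bodies for the three options (names invented for this sketch):

#include <stdatomic.h>
#include <caml/mlvalues.h>

/* 1. volatile cast (an l-value unless wrapped in a cast, cf. above) */
#define Load_field_v(x, i)   (((volatile value *)(x))[i])

/* 2. relaxed atomic access */
#define Load_field_rlx(x, i) \
  atomic_load_explicit(&((_Atomic value *)(x))[i], memory_order_relaxed)

/* 3. acquire atomic access (3.5 would be a plain/relaxed load followed by
   atomic_thread_fence(memory_order_acquire)) */
#define Load_field_acq(x, i) \
  atomic_load_explicit(&((_Atomic value *)(x))[i], memory_order_acquire)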

Additionally, I assume that we are discussing the C side of accessing OCaml non-atomic references, or more generally mutable fields. Accessing OCaml atomics from C is a different story; I guess one should in that case use dedicated primitives. I am not aware of a discussion on this point.

xavierleroy pushed a commit that referenced this issue Jul 11, 2022
`Field(v,i)` is now `volatile`-qualified, to better handle races with concurrent updates in C or in OCaml.
(More complete discussion at #10992.)

`&Field(v,i)` now has type `volatile value *`. Add `volatile` qualifiers to `caml_modify`, `caml_initialize`,
and in internal parts of the runtime system.
xavierleroy pushed a commit that referenced this issue Jul 11, 2022
`Field(v,i)` is now `volatile`-qualified, to better handle races with concurrent updates in C or in OCaml.
(More complete discussion at #10992.)

`&Field(v,i)` now has type `volatile value *`. Add `volatile` qualifiers to `caml_modify`, `caml_initialize`,
and in internal parts of the runtime system.

(cherry picked from commit 0e09dc4)
@OlivierNicole
Contributor

During the last months, the question of how to treat concurrent accesses from the runtime and from FFI functions has re-emerged several times, as witnessed by the number of links to this thread. Can we agree on a policy? I don’t have any authority in the field of memory models, but as a compiler dweller who is frequently confronted with the problem, I am interested in having a clear set of rules to refer to.

My impression is that the state of this discussion has been well summarized in another thread [1]:

  • for runtime data structures that are only used from C code, we should use C11 atomics
  • but for the OCaml heap and any other data that is accessed both from the C runtime and from the OCaml mutator, we currently use (volatile *) following the Linux model:
    • using C11 atomics does not provide any correctness guarantees in presence of races coming from OCaml accesses, we need to reason on the assembly level anyway
    • using relaxed may in general not be correct (we typically don't want to have the reads hoisted out of loops, etc.)
    • using consume or acquire or sequential may be too expensive

In addition, we should create a Load_field macro for reading fields, implemented as a cast to volatile * [2]. We should encourage its use in new code (and possibly one day deprecate the use of Field for concurrent reading?).

Footnotes

  1. https://github.com/ocaml/ocaml/pull/12030#discussion_r1179173212

  2. https://github.com/ocaml/ocaml/issues/10992#issuecomment-1152124829

@damiendoligez damiendoligez modified the milestones: 5.1, long-term Jul 12, 2023
@gasche
Member

gasche commented Jul 25, 2023

I agree that we should try to summarize the discussions in a form that newcomers can understand. On a ping from @OlivierNicole I asked @gadmm for more details, and I wrote the summary below under his supervision. This extends and replaces the summary I wrote earlier, quoted by Olivier above.

Documenting the current state

  • For runtime data structures that are only used from C code, we should use C11 atomics.
  • But for the OCaml heap and any other data that is accessed both from the C runtime and from the OCaml mutator, we currently use (volatile *) following the Linux model:
    1. Using C11 atomics does not provide any correctness guarantees in presence of races coming from OCaml accesses, we need to reason on the assembly level anyway
    2. Using consume or acquire or sequential may be too expensive (for a general use in the Field macro)
    3. Using relaxed and volatile * may be too weak in general, as our C code assumes a dependency ordering (reading fields after seeing a constructor).
    4. But in practice many usage patterns of volatile * (and possibly relaxed) are safe with C compilers. The dangerous patterns are unlikely to be met in real-life OCaml FFI code. We currently use a (volatile *) cast in Field for this reason.
  • This choice does not constitute a proper memory model for mixed OCaml/C programs. To be used safely, it should come with a set of guidelines on C programming patterns to avoid (and compilers, optimizers, compiler options to avoid...), similar to the Linux document https://www.kernel.org/doc/Documentation/RCU/rcu_dereference.txt on RCU dereference. We do not currently have such a document.

Looking at the future

We do not wish to deprecate the Field macro due to the backward-compatibility impact.

Guillaume Munch-Maccagnoni has considered two approaches to avoid the uneasy situation with volatile*:

  1. With Stephen Dolan: refining the C memory model to provide an intermediate ordering between relaxed and consume that would suffice for OCaml value dependencies, with a way to ask C compilers to use this ordering for volatile*. This obviously involves a lot of work outside the OCaml compiler, and no one is currently actively pursuing this.

  2. With KC: introduce a Load_mutable_field to use on all mutable reads, combined with an extra barrier on FFI entry points (when values coming from OCaml mutable reads are sent to FFI code). This would be more correct in theory, but one can doubt that users would actually update their code if they are told that the volatile* issue is a theoretical concern. No one is currently actively pursuing this approach.

@gasche
Member

gasche commented Jul 25, 2023

I wonder if the "documenting the current state" part should be placed somewhere more permanent. In memory.h? Somewhere in the user manual? I am thinking of sending a PR to include it as a .h comment somewhere -- suggestions on where to place it are welcome.

@OlivierNicole
Contributor

One place that would seem logical to me would be to put it after the Note [MM]: Enforcing the memory model. comment in memory.c.

@gasche
Member

gasche commented Aug 6, 2023

The "documenting the current state" part is now included as a comment in memory.c as suggested by @OlivierNicole. Thanks for the feedback!
