
Arm64 multicore support #10972

Merged (6 commits), Feb 21, 2022

Conversation

@ctk21 (Contributor) commented Jan 31, 2022

This PR implements the assembler, proc, and emit changes needed to get ARM64 working. I have tested it using the compiler testsuite on macOS/M1 and on Linux/Graviton2.

What this PR intends to do:

  • implements the separation of the OCaml stack from the C stack for ARM64
  • implements dynamic stack checks for ARM64
  • implements fibers and their primitives for ARM64
  • emits instruction sequences for Iload and Istore that implement the OCaml memory model on ARMv8

What this PR doesn't do:

  • I have not implemented the DWARF gymnastics to handle stack unwinding. I want the assembler to be settled first; we might also decide that on ARM64 the frame pointer is the right approach here.
  • I have not spent quality time plumbing through the memory model requirements; I've limited the exercise to getting the assembler up and the testsuite going. My thinking was to handle the memory model in a follow-up.

I'm keen to get feedback; the assembler is very much in a "get it to run" state. There may be beneficial refactorings or ARM64 idioms (or <gasp> bugs not exercised by our testsuite) that other eyes will see.

[edited to reflect the PR now incorporates changes in emit for the memory model]

@kit-ty-kate (Member)

My daily driver is arm64, so this PR makes me very happy ^_^
So far it compiles just fine and the REPL works well. I'll daily-drive this branch and report any issues I have here.

@xavierleroy (Contributor)

@EduardoRFS , you need to look into this...

@EduardoRFS (Contributor) commented Jan 31, 2022

@ctk21 I didn't know you were working on this; thank you so much. I was late to the party, but this PR is quite close to what I had, so we can just follow along here. I will be reviewing it in the near future.

Regarding:

I have not implemented the DWARF gymnastics to handle stack unwinding

Yeah, I was having a hard time with this. Is it documented somewhere how all the magic works?

@kayceesrk (Contributor) commented Feb 1, 2022

We need PR #10943 for correct compilation of the memory model. Then:

  1. For non-atomic stores that are assignments, we need to emit dmb ld; str.
  2. For atomic loads, we need to emit dmb ld; ldar.

See Table 5 of https://kcsrk.info/papers/pldi18-memory.pdf. Atomic stores are compiled as external calls, and hence nothing needs to be done there.
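
For orientation, here is a minimal, self-contained OCaml sketch of that mapping. The variant type, function name, and register choices below are invented for illustration; this is not the PR's actual emit.mlp code, just the instruction sequences from the list above:

(* Illustration only: the type and registers are made up; the point is the
   per-access instruction sequence described in the comment above. *)
type access =
  | Nonatomic_assignment_store   (* e.g. r := v on an already-published ref *)
  | Atomic_load                  (* e.g. Atomic.get *)

let instruction_sequence = function
  | Nonatomic_assignment_store -> [ "dmb ld"; "str x1, [x0]" ]
  | Atomic_load                -> [ "dmb ld"; "ldar x1, [x0]" ]

let () =
  List.iter print_endline (instruction_sequence Atomic_load)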


@kayceesrk (Contributor)

@EduardoRFS

Yeah, I was having a hard time with this. Is it documented somewhere how all the magic works?

There's no documentation on this currently except for the comments in runtime/amd64.S attached to the four .cfi_escape sequences. Once the dust settles down on this PR, I'll document this in the code.

@xavierleroy (Contributor)

For non-atomic stores, we need to emit dmb ld; str.

Except initializing stores Istore(chunk, addr, false), for which no barrier is needed, right?

Most stores are initializing stores, so it makes a big difference in code size.

@kayceesrk (Contributor)

Except initializing stores Istore(chunk, addr, false), for which no barrier is needed, right?
Most stores are initializing stores, so it makes a big difference in code size.

Indeed. The barrier is not needed for initialising stores.
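
To illustrate the distinction (an added example, not from the thread): a store performed while a block is being constructed is an initializing store and keeps a plain str, whereas a later mutation of an already-published block is an assignment and needs the extra ordering (via the barrier, or via caml_modify when a pointer is stored):

type t = { mutable name : string }

(* The store of "a" happens while the record is being built:
   an initializing store, so a plain str suffices. *)
let make () = { name = "a" }

(* This store mutates an already-published block: a non-initializing
   assignment, which needs the additional ordering (here via caml_modify,
   since a pointer is stored). *)
let rename r = r.name <- "b"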

@gadmm (Contributor) commented Feb 1, 2022

For non-initializing non-atomic store did you not decide for dmb ld; stlr for publication safety? (cf. release store in caml_modify).
In addition for atomic stores I had the following question #10831 (comment) (see godbolt for comparing to the paper).

@kayceesrk (Contributor)

For non-initializing non-atomic store did you not decide for dmb ld; stlr for publication safety? (cf. release store in caml_modify).

Yes, I believe so. We need a release store for the publication safety of initialising stores.

In addition for atomic stores I had the following question #10831 (comment) (see godbolt for comparing to the paper).

I don't know enough about the translation to the C11 memory model.

Perhaps @stedolan can say more.

@kayceesrk (Contributor) commented Feb 3, 2022

#10943 is merged now. Thanks for merging, @gasche.

While we are reviewing Arm64, I'd like to see the codegen side of the memory model discussed and implemented in this PR as well. The codegen changes are small and, if I'm not mistaken, are contained to the emission of appropriate instructions for non-atomic assignment (but not initialising) Istore and atomic Iload in arm64/emit.mlp. Let me know if you disagree and think we should do this as a separate PR.

It would also be great if we can have eyes on runtime/arm64.S where the majority of the changes are. The way to review this code is to compare that against runtime/amd64.S.

@kayceesrk (Contributor) commented Feb 3, 2022

@gadmm

For non-initializing non-atomic store did you not decide for dmb ld; stlr for publication safety? (cf. release store in caml_modify).

Yes, I believe so. We need a release store for the publication safety of initialising stores.

I don't think stlr is actually necessary for non-atomic assignments in the arm64 codegen (i.e., arm64/emit.mlp); dmb ld; str is sufficient. Here is the reasoning. Any publishing store takes a newly allocated object local to a domain and makes it available via a reference visible to other domains. For example,

let r1 = ref [||]                                                                   
let t1 () = r1 := [| "Hello, world" |] (* thread 1 *)                                              
let t2 () = print_endline (!r1.(0)) (* thread 2 *)                                                 

It would be bad (crash!) if t2 sees the new array, but does not see the initialising write to the first field. But observe that the store to r1 goes through caml_modify, which is emitted in the translation to cmm.

ocaml/asmcomp/cmm_helpers.ml, lines 2212 to 2225 at ebb9f0d:

let setfield n ptr init arg1 arg2 dbg =
  match assignment_kind ptr init with
  | Caml_modify ->
      return_unit dbg
        (Cop(Cextcall("caml_modify", typ_void, [], false),
             [field_address arg1 n dbg; arg2],
             dbg))
  | Caml_initialize ->
      return_unit dbg
        (Cop(Cextcall("caml_initialize", typ_void, [], false),
             [field_address arg1 n dbg; arg2],
             dbg))
  | Simple ->
      return_unit dbg (set_field arg1 n arg2 init dbg)

caml_modify has the release store necessary to ensure publication safety.

@gadmm (Contributor) commented Feb 3, 2022

If you know that all potentially-publishing stores of blocks necessarily go through caml_modify, then I agree that the fence is already present whenever you need it (it also makes sense that this is what already happens in the compiler). In particular, when you statically know that you are replacing an immediate with an immediate, you need neither the fence nor the barrier.
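
As a small added illustration of that last case: an int-to-int update of a mutable field cannot publish a newly allocated block, so it takes the Simple path of the setfield excerpt quoted above rather than calling caml_modify:

type counter = { mutable n : int }

(* The field holds an immediate (an int) before and after the store, so no
   block can be published through it; the store compiles via the Simple case
   of setfield, with no call to caml_modify. *)
let bump c = c.n <- c.n + 1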

@gasche (Member) commented Feb 3, 2022

Sorry for forcing people to work more on this, but I'd like to ask for an explanation of "publication safety" that non-experts like me can follow. (I think this is interesting in particular because arm64 is the first non-TSO backend to be merged, and the discussions here may inform other backends in the future.)

caml_modify first performs an "acquire fence" and then a "release store/write" to the modified location. There is a comment in memory.c that explains why this guarantees memory safety, but I cannot tell whether I agree with what it says, nor with what's going on here.

In my naive understanding of memory models, release stores synchronize with acquire loads (when the load sees the result of the store), and this guarantees that any writes (atomic or not) seen by the release store will also be seen after the acquire load. (The C memory model also has "consume" loads, with weaker guarantees.) But in the example from @kayceesrk above, the read of the reference r1 is not an atomic load (acquire or otherwise); it is a non-atomic load !r1 (in !r1.(0)). So what guarantees do we actually have that the reader of !r1 will also see the initializing write in !r1.(0)?

(Is this related to the fence? I'm not fully clear on what these fences are doing, but my current understanding is that they ensure that if a thread sees the write to r1, then it also sees the writes from other threads that the writer of r1 had seen. So this would not come into play in this example, but it would if r1 had been initialized by a read from a mutable location first written by a third thread.)

@gasche (Member) commented Feb 3, 2022

(My question comes from a remark made by @gadmm to me which made me realize that I don't understand why store-release without load-acquire (on mutable fields) works. The message from @gadmm suggests that he himself reached enlightenment, but I'm afraid I'm still lagging behind.)

@gadmm (Contributor) commented Feb 3, 2022

Our messages crossed; I have guessed that this relies on Arm's dependency ordering (cf. the consume ordering you mentioned, which was never implemented as intended in C compilers). But this opens up more questions, for which I am preparing an issue.

@kayceesrk (Contributor) commented Feb 4, 2022

In my naive understanding of memory models, release stores synchronize with acquire loads (when the load sees the result of the store), and this guarantees that any writes (atomic or not) seen by the release store will also be seen after the acquire load. (The C memory model also has "consume" loads, with weaker guarantees.) But in the example from @kayceesrk above, the read of the reference r1 is not an atomic load (acquire or otherwise); it is a non-atomic load !r1 (in !r1.(0)). So what guarantees do we actually have that the reader of !r1 will also see the initializing write in !r1.(0)?

The key point here is that there is an address dependency between the two loads on t2. The result of the first load is an address which the second load reads from. This is sufficient to ensure that the loads are not reordered [1,2].

Here, for comparison, is a broken message-passing example constructed using only non-atomic locations:

let msg = ref 0 and flag = ref None
let t1 () = 
  msg := 1; 
  flag := Some 42
let t2 () = 
  let rf = !flag in
  let rm = !msg in
  assert (not (rf = Some 42 && rm = 0)) (* may fail *)

The write to flag in t1 goes through caml_modify, and the release store ensures that the write to msg is ordered before the write to flag. But on t2 there is not enough ordering between the loads. Hence, on t2, the read of msg can be reordered before the read of flag, and so the assertion may fail.

[1] Limits on ordering, https://developer.arm.com/documentation/102376/0100/Normal-memory
[2] Fig 6 https://kcsrk.info/papers/pldi18-memory.pdf, addr (address dependency) is included in dob (dependency-ordered before)
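
For contrast (an added sketch, not part of the original comment), making flag an Atomic.t restores the ordering: the atomic store/load pair provides the release/acquire synchronisation that the plain reference lacks, so the assertion can no longer fail.

let msg = ref 0 and flag = Atomic.make None

let t1 () =
  msg := 1;
  Atomic.set flag (Some 42)      (* atomic store: publishes the write to msg *)

let t2 () =
  let rf = Atomic.get flag in    (* atomic load: synchronises with t1's store *)
  let rm = !msg in
  assert (not (rf = Some 42 && rm = 0))  (* cannot fail *)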

@gasche (Member) commented Feb 4, 2022

I must admit that I'm not familiar with the axiomatic memory model from the paper (the operational model is simple and nice, but it doesn't say anything here as we are mixing atomic and non-atomic accesses on the same location); I haven't had the brain surgery yet that makes people comfortable with axiomatic memory models.

If I understand correctly, your reply basically means that on Arm, non-atomic loads behave like "consume" loads when they synchronize with "release" stores? (I mentioned that compiling all (non-atomic) reads of mutable locations to acquire loads would be safe; in fact I guess that having non-atomic reads of mutable locations act as "consume" loads suffices for this problem of block value initialization, as all potentially-problematic loads are dependent on this load.)

@kayceesrk (Contributor)

If I understand correctly, your reply basically means that on Arm, non-atomic loads behave like "consume" loads when they synchronize with "release" stores?

I believe so. In fact, I now notice that the behaviour in this example is exactly the "publisher-subscriber situations with pointer-mediated publication" described in [1]. But the document also says that compilers promote consume to acquire, and it discourages the use of consume ordering. Hence, I think it may be better to reason about the mixed-mode behaviours in terms of the hardware model rather than C++11.

[1] https://en.cppreference.com/w/cpp/atomic/memory_order#Release-Consume_ordering

@xavierleroy (Contributor) commented Feb 4, 2022

This conversation has been going above my head for a while already, but:

I must admit that I'm not familiar with the axiomatic memory model from the paper (the operational model is simple and nice, but it doesn't say anything here as we are mixing atomic and non-atomic accesses on the same location)

The discussion is all about OCaml nonatomic loads and stores, as far as I can see. (Actually, static typing prevents mixing atomic and nonatomic accesses on the same location, again as far as I can see.) So, the models from the paper should tell you which behaviors are allowed. Then, all you need is @maranget-level understanding of the ARMv8 memory model to check that the proposed implementation doesn't allow more behaviors than that.

@maranget (Contributor) commented Feb 4, 2022

This conversation has been going above my head for a while already, but:

I must admit that I'm not familiar with the axiomatic memory model from the paper (the operational model is simple and nice, but it doesn't say anything here as we are mixing atomic and non-atomic accesses on the same location)

The discussion is all about OCaml nonatomic loads and stores, as far as I can see. (Actually, static typing prevents mixing atomic and nonatomic accesses on the same location, again as far as I can see.) So, the models from the paper should tell you which behaviors are allowed. Then, all you need is @maranget-level understanding of the ARMv8 memory model to check that the proposed implementation doesn't allow more behaviors than that.

I understand the question as mixing the OCaml and C11 models. As far as I understand, in this specific scenario one wishes for ordering from one read to another, when the second read's address depends on the first read's value, a.k.a. an address dependency. The difficulty originates from coding the read-to-read sequence in C. Armv8 leaves address dependencies alone, i.e. no fence is needed here, nor a load-acquire for performing the first read. But in C?

  1. Are we sure that the (C) reads are compiled to loads, i.e. one load instruction per read, with that load instruction performing the so-called atomic load, i.e. exactly one 'atomic' memory read occurs? Here, a cast to volatile plus aligned accesses should be enough. The cast may be necessary if, for instance, the access is in a loop and one relies on the read value changing in order to exit the loop. Perhaps in some situations there is no read at all? (See 2.)
  2. Are we sure that the compiler cannot (for instance) guess the value of the second read and thereby destroy the dependency? If it can, then since Armv8 does not order read-to-read... To avoid destroying dependencies, C11 provides the "Consume" memory order for the first read (*). However, this is poorly implemented, if not ignored. Moreover, it means a cast to atomic (when the pointer is not to an atomic type), whose semantics still puzzles me. In general, I do not believe compilers destroy dependencies of the kind int *p = *q; int r = *p; or int i = *q; int r = t[i]. Some compilers do destroy the dependency in int i = *q; int r = t[i-i];, oh well. Maybe I am naive.

In conclusion, the cast to volatile on pointers to aligned values does not incur much penalty, if any. But, well, this is not C11!

(*) Consume is there to enforce address dependency as an ordering, for DEC Alpha, which does not provide it by default.

@xavierleroy (Contributor)

Again, this goes above my head, but: there is only one piece of C code involved in the discussion as far as I can see, namely caml_modify and its famous "acquire fence; release store" sequence. (Plus its famous 100-line explanatory comment.) The rest is ocamlopt-generated AArch64 code that implements non-atomic loads and non-atomic non-initializing stores.

@kayceesrk (Contributor)

I agree with Xavier that the only C code here is in caml_modify. The scenario that Luc mentions is possible if the second thread were implemented in C using the C API. Perhaps it would be better to discuss that in a separate issue.

@gadmm (Contributor) commented Feb 5, 2022

The confusion is partly my fault: I had a draft pointing out two issues with this dependency ordering, including the one involving C code. I was proposing that the Field macro (and co.) be changed to use either a volatile cast (yes, like in the Linux kernel) or an _Atomic cast. Not being an expert in memory models, and because of the implications (given that none of the solutions is perfect), I was asking Luc for some clarification in private before sending it; I think (?) this is the context that was missing from @maranget's comment. This is on-topic for the Arm implementation (the release store has to synchronise with somebody, which may be C code) but not specific to it, so I was going to open a separate issue about it. I'll do that just now.

@xavierleroy (Contributor) commented Feb 5, 2022

Thank you for the bits of missing context. I'd like to recenter the memory model discussion on this PR.

If I understand correctly, here is the proposed implementation of the Multicore OCaml memory model on AArch64:

code sequence       operation
ldr                 all non-atomic reads
str                 initializing writes
dmb ishld; str      non-initializing non-atomic writes of integers (no publication of new objects can take place)
dmb ishld; stlr     non-initializing non-atomic writes of pointers (with possible publication; this is caml_modify)

Note: the last line is the code produced by GCC 11.2 for the "acquire fence; release store" sequence in caml_modify.

Is everyone comfortable with this implementation? Especially the use of a plain str instead of a release stlr on line 3 of the table (non-initializing non-atomic writes of integers).

Please try to give clear answers without hidden context.

[ Edited to use dmb ishld consistently. ]

@xavierleroy (Contributor) left a review

I read the assembly code (runtime/arm64.S) up to the Fibers part. The code is actually quite easy to read, in part because of judicious use of GAS macros. Congratulations!

I did not spot any issues in this part of runtime/arm64.S, just suggestions for clarifications and minor simplifications in proc.ml and emit.mlp.

Review thread on runtime/arm64.S:
1: RESTORE_ALL_REGS
/* Free stack space and return to caller */
ldp x29, x30, [sp], 16
add sp, sp, 16 /* pop argument */
Review comment (Contributor):
I assume this SP adjustment is needed so that the backtrace is properly recorded.

Reply from @ctk21 (author):
Yes, we want the stack to be well formed when we call caml_raise_exn in case a backtrace is needed.

Comment on lines 258 to 260:

  (* x20-x28, d8-d15 preserved *)
  Array.of_list (List.map phys_reg
-   [0;1;2;3;4;5;6;7;8;9;10;11;12;13;14;15;
+   [0;1;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;
Review comment (Contributor):
With a minor change to arm64.S, we might be able to avoid destroying x19, in which case this change can be reverted.

Reply from @ctk21 (author):
Using the frame pointer to avoid destroying x19 is something I think can work, by changing the emit for Iextcall in the case of non-allocating C calls.

Let me try that.

Follow-up from @ctk21 (author):
e7b3839 implements Iextcall for non-allocating C calls using the frame pointer register to hold the OCaml stack.

If we end up having frame pointer support on the OCaml frames in arm64, then I believe we could leave x29 alone and then restore the OCaml stack using x29 and env.stack_offset. So I think this doesn't back us into a corner there.

Overall I like this change; it gives back x19 to user code.

@xavierleroy added this to the 5.0.0 milestone on Feb 15, 2022
Review thread on runtime/memory.c:

@@ -234,7 +234,9 @@ CAMLexport int caml_atomic_cas_field (
   } else {
     /* need a real CAS */
     atomic_value* p = &Op_atomic_val(obj)[field];
-    if (atomic_compare_exchange_strong(p, &oldval, newval)) {
+    int cas_ret = atomic_compare_exchange_strong(p, &oldval, newval);
+    atomic_thread_fence(memory_order_release); /* generates `dmb ish` */
Review comment (Contributor):
Again limiting my review to the memory model part, here is a minor nit. These C functions are not specific to Arm, and a lot more is needed to explain the hybrid scheme they implement now, so these comments are out of context. It is important to explain why these fences are there, but perhaps it is better to have it as part of the upcoming memory model documentation?

Reply (Contributor):

Would you be satisfied if the comment said:

generates dmb ish on Arm64

Of course, the comment by itself is not illustrative unless accompanied by memory model documentation, but it will be correct. Once the documentation is available, we can update the comment to point to it.

Reviewer reply:

Whatever works for you; it was just a suggestion. It's fine to clarify it later.

@kayceesrk (Contributor) left a review

I have reviewed the arm64.S file, and it looks correct to me. I am also fairly confident about the coverage of the tests in testsuite/tests/effects and the effect handlers tests in testsuite/tests/callback. They were written to comprehensively test each of the different transitions between C and OCaml, and fibers. We have also reviewed the memory model aspects of the Arm64 compilation quite thoroughly here. Based on these, I am approving this PR.

The DWARF stack unwinding work should be done in a separate PR as the original PR message says. Separately, there is also work to document the low-level implementation details of fibers.

@xavierleroy I'm wondering what else we need to get this PR across the line. Once the PR is at a stage where it doesn't need more work, @ctk21 has promised to make the commit history sensible and add a CHANGES entry.

@xavierleroy (Contributor)

I agree this PR is in good shape and can be merged. I'll be happy if @ctk21 can clean up the history a bit.

@EduardoRFS (Contributor) left a review

I was not able to do a proper review of the memory-model-related changes, so I made a superficial review of arm64.S, and it looks okay. There are a lot of small improvements to make (mostly shaving instructions away), but in general it looks sound and quite similar to amd64.S and to what I was writing.

Thank you so much @ctk21

@ctk21 (Contributor, author) commented Feb 21, 2022

Many thanks for the reviews.
I will now clean up the history and get this PR ready for merging.

ctk21 and others added 6 commits on February 21, 2022 at 10:15:
…loc Iextcall; s/max_stack_size/max_frame_size/ in {arm64/amd64}/emit.mlp; refactor out preproc_stack_check to emitaux.ml to share between {arm64,amd64}/emit.mlp; handle Ladjust_trap_depth in preproc_stack_check; exhaustive match in preproc_stack_check
…ly handle ALLOC_PTR & TRAP_PTR in {SAVE,RESTORE}_ALL_REGS

ARM64: remove dead commented out code in caml_c_call
Make sure atomic loads use the trivial addressing mode; the `ldar`
instruction does not support fancy addressing.

ARM64 assembler generation

- Non-initializing stores can use "str"
  and need a barrier only at Word_int and Word_val types.
  (For smaller or bigger types, the guarantees of the Multicore OCaml memory
   model do not apply anyway.)

- Atomic loads are performed only at Word_int and Word_val types,
  and do not need a barrier in addition to the "ldar" instruction.

Add casts to ensure proper typing of printf format

Add atomic_thread_fence(memory_order_release) to ensure 'dmb ish' for
caml_atomic_exchange, caml_atomic_cas_field, caml_atomic_cas and
caml_atomic_fetch_add

Co-authored-by: Xavier Leroy <xavierleroy@users.noreply.github.com>
remove unnecessary zero of backtrace_pos in emit for Raise_regular;
fixes for comments and minor tweaks;
CFI_OFFSET fixes;
use clearer ldr with Cstack_prev in caml_start_program;
utilize the frame pointer to hold the OCaml stack over non-allocating Iextcall returning x19 as a user saved register

Co-authored-by: Xavier Leroy <xavierleroy@users.noreply.github.com>
@ctk21 (Contributor, author) commented Feb 21, 2022

I have updated the Changes entry and cleaned up the commits (if you squint, it might represent the PR history).
If a single commit is desirable, let me know.

@xavierleroy (Contributor) left a review

Looks ready for merging! Thanks a lot @ctk21 and all the reviewers.
