New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update the POWER port for OCaml 5 #12276
Conversation
e49f9b8
to
6cfa87c
Compare
To simplify reviewing, I removed the experiment with frame pointer support from this PR, as it is definitely not ready. |
11124db
to
62e51c6
Compare
I've been working on and off on the Power64 support also and would be happy to review this PR. |
@tmcgilchrist In case you need a native host, I've installed opam on the ppc64 orithea host which you should have a shell on, and (as a convenient CLI for any compiler PR) you can do |
I've taken this PR for a quick spin on my OCaml 5 monorepos, and it compiles a large amount of OCaml code (including mirage-crypto/tls) and serves pages just fine. I'll run some benchmarks but no regressions spotted so far. |
Keep only PPC 64 bits little-endian.
Definition taken from the Linux kernel sources, arch/powerpc/include/asm/vdso/processor.h
On the compiler side: - Generate POWER barriers for atomic loads and noninitializing stores Noninitializing stores: as described in the PLDI 2018 memory model paper. Atomic loads: as generated by GCC for SC atomic_load. - Update code generated for external calls - Add support for Icompf instruction - Add support for stack overflow checking - Revised exception handling: - Trap frames are 2 words (16 bytes) large, this is enough - Linkage is in first word of trap frame (needed for stack reallocation) - Trap pointer points to trap frame, not to bottom of stack frame (needed for stack reallocation) - Implemented caml_reraise_exn and moved zero-ing of backtrace_pos to caml_raise_exn. On the runtime side: - Extensive rewrite of runtime/power.S, based mostly on runtime/arm64.S - Added POWER definitions to <caml/stack.h> - Added POWER definition of `c_stack_link` to `<caml/fiber.h>` - Handle SIGTRAP signals and turn them into out-of-bounds exceptions
62e51c6
to
05d82f9
Compare
Because a trap frame size of 32 was still used, while it's 16 now.
These branches are taken/not taken when an allocation runs out of space or when polling is positive, which is unlikely.
I upstreamed the build fixes (which seem to be due to a gcc 11 compiler bug on powerpc64 issuing spurious warnings) to mirage-crypto, and now the effects-based webserver is running fine on ppc64le with this PR: https://orithia.ocamllabs.io. Just running stress tests on it, then shifting to multi-domain testing, but no issues encountered so far. Bulk builds also show no unexpected regressions against OCaml 4 ppc64 so far (but are chugging along more slowly). |
Big TOC has been the default for a while, since many packages overflow the TOC when compiled in small TOC mode.
/* The purpose of this function is to simulate a [caml_c_call] | ||
to [caml_array_bound_error_asm]. */ | ||
|
||
caml_domain_state * dom_st = Caml_state; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, but:
- is it known that the signal handler will only run for OCaml code? I wonder if it is necessary to check that the pc points to OCaml code prior to accessing
Caml_state
. - For C11 conformance, one can access TLS in a signal handler, but the thread-local variable has to be
_Atomic
, which it is not, currently. One also has to consider the fact that glibc does not conform to C11 and can allocate in corner cases (that are probably irrelevant here);as for MacOS one might want to reimplement the more efficient implementation of(edit: irrelevant for this target!). Could you side-step these worries by retrieving the domain state pointer fromCaml_state
that uses explicit TLS one dayctx
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmcgilchrist if you mean where the bound checks are emitted, then on power it's via instructions that emit traps. See the Icheckbound functions in power/emit.mlp; e.g.
| Lop(Iintop_imm(Icheckbound, n)) ->
if !Clflags.debug then
record_frame env Reg.Set.empty (Dbg_other i.dbg);
` tdllei {emit_reg i.arg.(0)}, {emit_int n}\n`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmcgilchrist It makes sense to adapt #10409 to SIGTRAP on Power indeed (as part of a new PR and if someone volunteers).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you side-step these worries by retrieving the domain state pointer from ctx instead?
Yes, sure, see 0b12d86 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it known that the signal handler will only run for OCaml code? I wonder if it is necessary to check that the pc points to OCaml code prior to accessing Caml_state.
And what shall we do if it doesn't? Abort the program?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For SEGV you used to restore the signal handler to its default disposition and restart at the faulting instruction (and thus abort the program in a debuggable way). I imagine that you did not do the same here for some reason. I imagine that traps increment the program counter like on x86 but decrementing the pc does not sound like an obstacle. Am I missing anything?
I also noticed that caml_find_code_fragment_by_pc
is essentially signal-safe because the call to caml_code_fragment_cleanup
in cycle_all_domains_callback
can be omitted (in fact, does nothing) in native, as no code fragment gets removed.
I agree with your point that none of these issues are blocking, since this is old code being ported to multicore.
]} | ||
*) | ||
let (bitnum, negated) = emit_float_comp cmp i.arg in | ||
emit_extract_crbit bitnum negated i.res.(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The memory model is implemented by the following sequences (compared to RISC-V and ARMv8):
OCaml Operation | POWER operation | ARMv8 operation | RISC-V operation |
---|---|---|---|
Atomic load | sync; lwz; cmpw; bne; isync | dmb ishld; ldar | fence iorw, iorw; ld; fence iorw, iorw |
Atomic store | lwsync; stw | L: ldaxr; stlxr; cbnz L; dmb st | fence iorw, ow; amoswap.d.aq |
Nonatomic load | lwz | ldr | ld |
Nonatomic store | stw | dmb ishld; str | sd |
These look right to me and match the instructions I had worked out.
It uses these instructions:
- sync (hwsync) - heavy weight sync is a full memory barrier that prevents any reordering
- lwsync - lightweight sync is a memory barrier that prevents reordering (Load-Load, Load-Store, Store-Store)
- isync - instruction sync, refetches any instructions that might have been fetched prior to this instruction
I'm running this through the ocaml-multicore/multicoretests to validate further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ocaml-multicore/multicoretests didn't find any new issues (that aren't already reported elsewhere). I've compiled various EIO/multicore projects and no problems found.
I'd like to setup https://check.ci.ocaml.org to run against CI's POWER pool and see what it finds. @avsm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non-initialising (non-atomic) stores for Power are incorrect in the table. The PR does the right thing and emits: lwsync; stw
, which is the same as what the PLDI 2018 paper suggests. See sec 8.2 in the paper: https://kcsrk.info/papers/pldi18-memory.pdf. In the code, see https://github.com/ocaml/ocaml/pull/12276/files#diff-f99d939a29505721a1d19a6aed606eac2d97eff24a8936102ab32e0794a3a97aR695.
@@ -16,19 +16,11 @@ | |||
|
|||
(* Specific operations for the PowerPC processor *) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically this should be Specific operations for the Power processor
. PowerPC is name for the 32bit family, which is being dropped and IBM changed naming from POWER to Power with Power10.
A note about the use of traps in this port, to make sure it doesn't monopolize the reviewing. It's a design choice that goes back to the original POWER port of OCaml circa 1996. The nice thing about it is that bounds checking is performed in 1 instruction, instead of 2+ with the usual approach, so code is smaller and runs fast in the "within bounds" case. The not so nice thing is that a signal is generated in the "out of bounds" case, and turning this signal into an OCaml exception is slow and runs into all the vagaries of Unix signal handling. In 25+ years of OCaml on POWER/PowerPC, nobody has complained about these limitations, though. I chose to keep the 1996 approach in this PR to minimize the amount of new code, but we could also change the code generated for bounds checks to not use traps. I would not do this in this PR and not before this PR is accepted, though. |
I think using hardware traps should remain the choice for now, until emerging new kernel->userspace signalling mechanisms land like Linux uintr which is way better than either signals or eventfd. So a slow/safe implementation using signals for now seems just fine... Certainly, I'm finding myself just using this PR day-to-day now to write OCaml code on an fast Raptor Talos II machine and quite enjoying the experience :-) |
I’ll volunteer and have a go at updating #10409
…On Wed, 7 Jun 2023 at 10:36 pm, Guillaume Munch-Maccagnoni < ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In runtime/signals_nat.c
<#12276 (comment)>:
> @@ -85,3 +85,37 @@ void caml_garbage_collection(void)
nallocs, alloc_len);
}
}
+
+/* Trap handling for the POWER architecture. Convert the trap into
+ an out-of-bounds exception. */
+
+#if defined(TARGET_power)
+
+extern void caml_array_bound_error_asm(void);
+
+void caml_sigtrap_handler(int signo, siginfo_t * info, void * context)
+{
+ /* The trap occurs in ocamlopt-generated code. */
+ /* The purpose of this function is to simulate a [caml_c_call]
+ to [caml_array_bound_error_asm]. */
+
+ caml_domain_state * dom_st = Caml_state;
@tmcgilchrist <https://github.com/tmcgilchrist> It makes sense to adapt
#10409 <#10409> to SIGTRAP on Power
indeed (as part of a new PR and if someone volunteers).
—
Reply to this email directly, view it on GitHub
<#12276 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AABJXOK6675MFHBM5CKABO3XKBYUBANCNFSM6AAAAAAYUA66DA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
The code looks ok. I didn't spot any gotchas. As a sanity check, I built and ran the testsuite under Docker for Mac which uses qemu for emulating ppc64le. I received the following failures:
Running just one of the tests (
The error seems unrelated to the PR and looks like a qemu issue? https://gitlab.com/qemu-project/qemu/-/issues/588 I'm yet to build the compiler on Power hardware. @xavierleroy while it hasn't been explicitly mentioned in the thread, I assume the testsuite passes there. |
Yes, I'm afraid QEMU doesn't fully implement trap instructions. And, yes, I ran the testsuite (many times) on real hardware -- the same POWER9 machine at Cambridge that is part of the Jenkins CI. I'm sure @avsm can provide you with an account on this machine. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks good to me. Glad to have Power back!
Thanks for the confirmation. I am happy for this one to be merged. |
I do regret adding the more exotic architectures to the default binfmt_misc in Docker for Desktop; the number of bug reports that have come through from qemu being an incomplete emulator are very high. For completeness alongside the existing POWER9 results, I also ran the testsuite in this PR on an older physical POWER8 machine (PowerNV 8348-21C), with no failures.
So thumbs up from me on this PR. (KC: I'll contact you separately about a POWER9 account; you've already got an account on the machines) |
Thanks all for the reviews and tests. It's now merged! |
The issue linked was closed 11 months ago. If it still happens, it would be great if you open a new issue, otherwise this might be never fixed. |
This big PR restores native-code support for the POWER/PowerPC 64 bits little endian architecture and adds support for all the OCaml 5 niceties.
The other two POWER/PowerPC variants (32 bits and 64 bits big endian) remain bytecode-only, as they have pretty much disappeared from the server world and are now only found in embedded systems that are unlikely to run OCaml code.
The PR is best reviewed commit-by-commit. Most of the work is in commit 05d82f9 :
On the compiler side:
Noninitializing stores: as described in the PLDI 2018 memory model paper.
Atomic loads: as generated by GCC for SC atomic_load.
- Trap frames are 2 words (16 bytes) instead of 4, this is large enough
- Linkage is in first word of trap frame (needed for stack reallocation)
- Trap pointer points to trap frame, not to bottom of stack frame (needed for stack reallocation)
- Implemented caml_reraise_exn and moved zero-ing of backtrace_pos to caml_raise_exn.
On the runtime side:
c_stack_link
to<caml/fiber.h>
The first 4 commits are ground work.
aa4a811 is from #12275.(now merged).The last commit is an unfinished experiment into supporting frame pointers, which seem necessary to get decent backtraces under GDB. (CFI information seems ignored by default?) It needs more work.