Naked pointers and the bytecode interpreter #9680
Conversation
(Mentioning @ejgallego because this discussion applies to the Coq VM interpreter too.)
Multicore actually uses 1a (tagging) for code pointers, and 1b (offsets) for trap handlers. The difference is that the multicore branch maintains trap handlers as offsets all the time (so push/pop does no conversion), whereas this patch keeps absolute references in the trap handler globals.

For the record, I think what's proposed here is the better design. In hindsight, keeping the globals as offsets made the diff larger without much benefit. Using the code fragments table will have some performance impact in multicore (whatever synchronisation is needed will have to do well under contention, since all domains start a major GC cycle simultaneously), but I don't imagine it will be too slow, especially as this only affects bytecode.
Thanks for the heads-up @xavierleroy, I'll ping @coq/vm-native-maintainers so they are in the loop. We're going to add 4.12 to the CI so we can try to track upstream development (cc: coq/coq#12518).
A note about the performance impact of
A concern with the approach proposed in this PR is that it makes it much more difficult to implement control operators by copying part of the bytecode interpreter stack and saving it in a heap block, as Oleg Kiselyov's library does.

When naked pointers are supported, or when code pointers in the stack are encapsulated, the heap blocks in question can safely be traversed by the GC. This is not the case with the approach proposed in this PR, so a more complicated copying strategy is needed.
To encourage fair competition, I submitted the alternate approach (of encapsulating code pointers in the stack) as #9687. Ideally, some brave souls would read both and perhaps run benchmarks...
Batch-compiled bytecode executables.
Yes, there was only one fragment in my tests. No, there is already some overhead in calling
You can get that at the toplevel REPL. But log2 100 is less than 7...
The patch of `callback.c` is actually a bug fix. It should be inserted earlier in the history, to make sure that the intermediate commits are, to the best of our knowledge, correct (for easier bisecting).

Perhaps you could add a comment in `caml_oldify_local_roots` to document the fact that `caml_oldify_one` can be called with code pointers (or naked pointers in general) without fear.

About the change in `interp.c`, I am somewhat confused. On the one hand you say that calling `caml_find_code_fragment_by_pc` for every single (non-integer) stack entry at each major collection is negligible, but on the other hand you say that calling it once in debug mode for each closure allocation is too costly. This seems contradictory to me.
```diff
@@ -96,6 +92,8 @@ CAMLexport value caml_callbackN_exn(value closure, int narg, value args[])
   local_callback_code[4] = POP;
   local_callback_code[5] = 1;
   local_callback_code[6] = STOP;
+  /* Not registering the code fragment, as code fragment management
+     would need to be revised thoroughly for an hypothetical JIT */
```
I don't know the history of the `LOCAL_CALLBACK_BYTECODE` flag, but it seems to have been completely dead code since its insertion in 2004.

Frankly, keeping pieces of untested dead code for hypothetical clients that are very likely never to exist, and for which we do not even know that they would need this trick, is, IMO, not good software engineering practice. It is even worse with this patch, since we know for sure that the code is bogus (the code fragment is not registered).
The history is that circa 2004 Basile Starynkevitch was experimenting with a JIT compiler for OCaml bytecode, and left some hooks in the code. I agree these hooks are outdated and should be removed, but in another PR.
Which is still seven times larger. Do you want me to do benchmarks with large pieces of code executed in the toplevel and recursive functions? We would be in the worst case: a large stack and a large number of code fragments.
That would be very nice. We could then compare with #9687.
I stand corrected for the trap handlers. Concerning 1a (tagging of code pointers), @jhjourdan noticed a serious problem (see #9687): an integer value on the stack can look like a (tagged) code pointer, messing up the recording of the call stack. I'm afraid we'll have to abandon this idea and use approach 2b like in this PR.
Between two start-of-major-GC stack scans, you can have a million closure allocations (think CPS, Lwt, etc.), so the overhead of assertions adds up. I care about the speed of the debug runtime, as it should be able to run possibly big programs.
I did that. Here is the benchmark I ran:
With trunk, it gives:

With this branch (merged with trunk to work around #9681), I get:

That is a 12% difference. Not negligible. Sure, this is a micro-benchmark designed to hit the worst case (essentially, the worst case happens when there are a lot of code fragments and a large stack containing a lot of pointers to the heap, which happens to have a size of the order of magnitude of the major heap).
Thanks for the example and for the testing. +12% for a worst-case example is not that bad... I played a bit with your example:
Why the smaller lists? A long time ago, I worried about stack scanning time in ocamlopt (with the frame descriptors and all that), and before trying to speed it up, I made some measurements on a few OCaml programs that use a lot of recursive functions (Coq, the OCaml compilers, etc.). On average, the stack was quite small: typically 100 call frames and fewer than 1000 roots. Sometimes it can get much deeper than that, of course, but not for very long, hence the probability of scanning a huge stack is low.
At any rate, quoting Mrs Thatcher, I'm afraid there is no (reasonable) alternative to the approach followed in this PR. So, can we proceed with the reviewing?
If we did not care about development time, I think the solution could actually be to use splay trees, which can adapt to the expected, very unbalanced distribution of pointers in the stack. Perhaps with some engineering, the implementation could even be shared with free lists, who knows... But yes, I agree that nobody would have time for that, given that your current implementation gives good performance in usual cases and not-too-bad performance on made-up examples. And this is not even mentioning the fact that concurrency and splay trees do not play well together.
I think that most toplevel workflows (which include HOL Light in interactive mode and the killer jitted ocamlnat of the future) will spend most of their heavy computation in functions defined in libraries, rather than in functions defined at the toplevel -- which is typically used for exploratory usage of an API. If I understand correctly, this corresponds to a best case within the worst case, as the "main" code fragment can be located in one step within the skip list, right?
No. The skiplist is sorted by increasing code address, so there is no reason the "main" code fragment will be at the head. The splay tree mentioned by @jhjourdan could perform better in this case.
Generally speaking: I'm not excluding future work on better data structures for the table of code fragments. What I'd rather not do, now or in the foreseeable future, is add metadata to bytecode (e.g. frame descriptors) to support GC scanning and backtrace construction.
@jhjourdan, you left a review requesting changes, but didn't say which ones. Would you please clarify?
As I said in my review:
Sure, these are not critical changes, and the PR could be merged as is, but they are still easy improvements. Don't take my "changes requested" flag too seriously. If you prefer, I can revert it.
So that the return address pushed on the bytecode interpreter stack is correctly recognized as a code pointer.
Rather than storing a pointer to the previous frame in the Trap_link field of the current frame, store the distance (pointer difference) between the current frame and the previous frame, tagged as an OCaml integer. Using a tagged integer instead of a raw pointer means fewer problems later with strict no-naked-pointers support. Using a distance rather than an absolute address simplifies the code that resizes the stack.
In no-naked-pointers mode, recognize and skip code pointers present in the stack of the bytecode interpreter. This is needed only for the scan at the beginning of a major GC cycle, not for the scan done at every minor GC.
Earlier, no-naked-pointers mode was effective only in native code.
…ge table `!Is_in_value_area(pc)` is always false if we turn the page table off. A better check would be `caml_find_code_fragment_by_pc(pc) != NULL`, but I feel this is too costly even for the debug mode of the interpreter.
@jhjourdan: thank you for the clarification, and apologies for my confusion; I was looking at the wrong place in this discussion... I just pushed the two suggested changes. A reason why "requested changes" makes me nervous is that it would block merging had we activated the "mandatory review" feature of GitHub. We haven't yet, but that's something we might do. That's why I use "requested changes" only when there's a big problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
@xavierleroy : understood. I will try to change my convention, then.
We need to make progress. I'm merging, based on @jhjourdan's review and the lack of alternative approaches that work.
PS. The lack of a Changes entry is intentional. I plan to have a single Changes entry for the various pull requests that improve support for no-naked-pointers mode. |
Does anyone know what should be changed in
This is the next installment in the "make no-naked-pointers mode work in preparation for Multicore OCaml" series. The hero of today's episode is the bytecode interpreter.
The stack of the bytecode interpreter is an interesting mix of
The latter two are "naked" pointers outside the heap, so in no-naked-pointers mode something must be done so that the [edited] major GC never sees them.
There are at least 4 solutions:

Don't store naked pointers inside the bytecode interpreter stack:
- 1a. Set the low bit to 1 so that the (aligned) pointer looks like an integer.
- 1b. Store an offset (a relative address) instead of a pointer (an absolute address), encoding the offset as an OCaml tagged integer.

Skip over the naked pointers when scanning the stack during a [edit] major GC:
- 2a. Use metadata (like the frame descriptors in native code) to know where the values are and where the other pointers are in the stack.
- 2b. Use tests such as `caml_find_code_fragment_by_pc` to identify code pointers, and comparisons with the bounds of the stack to identify pointers within the stack.

Multicore OCaml currently uses 1a (encapsulated pointers) for code pointers and [edited] 1b (offsets) for pointers within the stack, see https://github.com/ocaml-multicore/ocaml-multicore/blob/parallel_minor_gc/runtime/interp.c.
It's a fine solution [edit: modulo a problem with imprecise stack backtraces] but a fairly big patch. In this PR I explore another possibility: approach 2b.
The diff is quite small, but suffices to make no-naked-pointers mode work in bytecode. (Before this PR, it was active only in native code.)
A note about performance: `caml_find_code_fragment_by_pc` is not cheap, even with the nifty skiplist-based implementation; however, the slowdown on `caml_do_local_roots` appears very small -- I was not able to measure it yet. This is because the call stack is rarely very deep, and `caml_do_local_roots` is called only at the beginning of a major GC cycle, i.e. quite rarely. The scanning function that is called at every minor GC, `caml_oldify_local_roots`, does not need to worry about code pointers, because it excludes everything that is not a pointer in the minor heap.

For the record: I also have an implementation of approach 1a if it is deemed preferable. [edit: see #9687] But it is a bigger change.