
Naked pointers and the bytecode interpreter #9680

Merged
merged 5 commits into ocaml:trunk from xavierleroy:bytecode-nnp on Jun 21, 2020

Conversation

xavierleroy
Contributor

@xavierleroy xavierleroy commented Jun 14, 2020

This is the next installment in the "make no-naked-pointers mode work in preparation for Multicore OCaml" series. The hero of today's episode is the bytecode interpreter.

The stack of the bytecode interpreter is an interesting mix of

  • OCaml values (tagged integers or heap pointers), for values of local variables and for intermediate evaluation results;
  • code pointers, for return addresses into calling functions;
  • pointers within the stack, for chaining of exception handlers.

The latter two are "naked" pointers outside the heap, so in no-naked-pointers mode something must be done so that the [edited] major GC never sees them.

There are at least 4 solutions:

  1. Don't store naked pointers inside the bytecode interpreter stack
    1a. Set the low bit to 1 so that the (aligned) pointer looks like an integer
    1b. Store an offset (a relative address) instead of a pointer (an absolute address), encoding the offset as an OCaml tagged integer.

  2. Skip over the naked pointers when scanning the stack during a [edit] major GC
    2a. Use metadata (like the frame descriptors in native-code) to know where the values are and where the other pointers are in the stack
    2b. Use tests such as caml_find_code_fragment_by_pc to identify code pointers, and comparisons with the bounds of the stack to identify pointers within the stack.

Multicore OCaml currently uses 1a (encapsulated pointers) for code pointers and [edited] 1b (offsets) for pointers within the stack, see https://github.com/ocaml-multicore/ocaml-multicore/blob/parallel_minor_gc/runtime/interp.c.

It's a fine solution [edit: modulo a problem with imprecise stack backtraces] but a fairly big patch. In this PR I explore another possibility:

  • Approach 1b for chaining exception handlers: the pointer to the next handler is replaced by an offset. This has the additional benefit that the chaining of handlers no longer needs updating when the bytecode stack is reallocated. (Oleg Kiselyov told me so a long time ago; I should have listened.)
  • Approach 2b to identify and exclude code pointers during [edited] major GC stack scanning.

The diff is quite small, but suffices to make no-naked-pointers mode work in bytecode. (Before this PR, it was active only in native code.)

A note about performance: caml_find_code_fragment_by_pc is not cheap, even with the nifty skiplist-based implementation; however, the slowdown on caml_do_local_roots appears very small -- I was not able to measure it yet. This is because the call stack is rarely very deep, and caml_do_local_roots is called only at the beginning of a major GC cycle, i.e. quite rarely. The scanning function that is called at every minor GC, caml_oldify_local_roots, does not need to worry about code pointers, because it excludes everything that is not a pointer in the minor heap.

For the record: I also have an implementation of approach 1a if it is deemed preferable. [edit: see #9687] But it is a bigger change.

@xavierleroy
Contributor Author

(Mentioning @ejgallego because this discussion applies to the Coq VM interpreter too.)

@stedolan
Contributor

Multicore actually uses 1a (tagging) for code pointers, and 1b (offsets) for trap handlers. The difference is that the multicore branch maintains trap handlers as offsets all the time (and so push/pop does no conversion), whereas this patch keeps absolute references in the trap handler globals (e.g. Caml_state->trapsp), converting to offsets when pushed/popped from the stack.

For the record, I think what's proposed here is a better design. In hindsight, keeping the globals as offsets made the diff larger without much benefit. Using the code fragments table will have some performance impact in multicore (whatever synchronisation is needed will have to do well under contention, as all domains start a major GC cycle simultaneously), but I don't imagine it will be too slow, especially as this only affects bytecode.

@ejgallego

Thanks for the heads-up @xavierleroy, I'll ping @coq/vm-native-maintainers so they are in the loop.

We're going to add 4.12 to the CI so we can try to track upstream development. cc: coq/coq#12518

@jhjourdan
Contributor

A note about the performance impact of caml_find_code_fragment_by_pc: what kind of measurement have you done? I would expect, indeed, that the impact is rather small if very few modules are loaded (i.e., only one module in the most common case), since the overhead of searching the skiplist is very small when the skiplist is almost empty. The story might be different with, e.g., a few hundred modules loaded (which seems large, but not completely crazy either).

@xavierleroy
Contributor Author

A concern with the approach proposed in this PR is that it makes it much more difficult to implement control operators by copying part of the bytecode interpreter stack and saving it in a heap block, like Oleg K's delimcc library does.

When naked pointers are supported or when code pointers in the stack are encapsulated, the heap blocks in question can safely be traversed by the GC. This is not the case with the approach proposed in this PR, so a more complicated copying strategy is needed, like delimcc does for native-code stacks, if I remember correctly.

@xavierleroy
Contributor Author

To encourage fair competition, I submitted the alternate approach (of encapsulating code pointers in the stack) as #9687. Ideally, some brave souls would read both and perhaps run benchmarks...

@xavierleroy
Contributor Author

A note about the performance impact of caml_find_code_fragment_by_pc: what kind of measurement have you done?

Batch-compiled bytecode executables.

I would expect, indeed, that the impact is rather small if very few modules are loaded (i.e., only one module in the most common case), since the overhead of searching the skiplist is very small when the skiplist is almost empty.

Yes, there was only one fragment in my tests. But there is already some overhead in calling caml_find_code_fragment_by_pc and then calling the skiplist function.

The story might be different with, e.g., a few hundred modules loaded (which seems large, but not completely crazy either).

You can get that at the toplevel REPL. But log2 100 is less than 7...

@jhjourdan jhjourdan left a comment (Contributor)

The patch to callback.c is actually a bug fix. It should be inserted earlier in the history, to make sure that the intermediate commits are, to the best of our knowledge, correct (for easier bisecting).

Perhaps you could add a comment in caml_oldify_local_roots to document the fact that caml_oldify_one can be called with code pointers (or naked pointers in general) without fear.

About the change in interp.c, I am somewhat confused. On the one hand you say that calling caml_find_code_fragment_by_pc for every single (non-integer) stack entry at each major collection is negligible, but on the other hand you say that calling it once in debug mode for each closure allocation is too costly. This seems contradictory to me.

@@ -96,6 +92,8 @@ CAMLexport value caml_callbackN_exn(value closure, int narg, value args[])
local_callback_code[4] = POP;
local_callback_code[5] = 1;
local_callback_code[6] = STOP;
/* Not registering the code fragment, as code fragment management
would need to be revised thoroughly for a hypothetical JIT */
Contributor

I don't know the history of the LOCAL_CALLBACK_BYTECODE flag, but it seems to be completely dead code since its insertion in 2004.

Frankly, keeping pieces of untested dead code around for hypothetical clients that are very likely to never exist, and for which we do not even know that they would need this trick, is, IMO, not good software engineering practice. It is even worse with this patch, since we know for sure that the code is bogus (the code fragment is not registered).

Contributor Author

The history is that circa 2004 Basile Starynkevitch was experimenting with a JIT compiler for OCaml bytecode, and left some hooks in the code. I agree these hooks are outdated and should be removed, but in another PR.

@jhjourdan
Contributor

You can get that at the toplevel REPL. But log2 100 is less than 7...

Which is still seven times larger than log2 2 == 1.

Do you want me to do benchmarks with large pieces of code executed in the toplevel and recursive functions? We would be in the worst case: a large stack and a large number of code fragments.

@xavierleroy
Contributor Author

Do you want me to do benchmarks with large pieces of code executed in the toplevel and recursive functions? We would be in the worst case: a large stack and a large number of code fragments.

That would be very nice. We could then compare with #9687.

@xavierleroy
Contributor Author

Multicore actually uses 1a (tagging) for code pointers, and 1b (offsets) for trap handlers.

I stand corrected for the trap handlers.

Concerning 1a (tagging of code pointers), @jhjourdan noticed a serious problem (see #9687): an integer value on the stack can look like a (tagged) code pointer, messing up the recording of the call stack. I'm afraid we'll have to abandon this idea and use approach 2b like in this PR.

@xavierleroy
Contributor Author

About the change in interp.c, I am somewhat confused. On the one hand you say that calling caml_find_code_fragment_by_pc for every single (non-integer) stack entry at each major collection is negligible, but on the other hand you say that calling it once in debug mode for each closure allocation is too costly. This seems contradictory to me.

Between two start-of-major-GC stack scans, you can have a million closure allocations (think CPS, Lwt, etc.), so the overhead of the assertions adds up. I care about the speed of the debug runtime, as it should be able to run possibly big programs.

@jhjourdan
Contributor

Do you want me to do benchmarks with large pieces of code executed in the toplevel and recursive functions? We would be in the worst case: a large stack and a large number of code fragments.

That would be very nice. We could then compare with #9687.

I did that. Here is the benchmark I ran:

#load "unix.cma";;

(fun _ -> ());; (* repeated 1000 times *)

let rec allocate_list = function
  | 0 -> []
  | n ->
     let l = allocate_list (n-1) in
     n :: l

let () =
  let n = 30 in
  let sum = ref 0. in
  let sqsum = ref 0. in
  for j = 0 to n-1 do
    let t1 = Unix.gettimeofday () in
    for i = 0 to 30 do
      ignore (allocate_list 100000)
    done;
    let t2 = Unix.gettimeofday () in
    let dt = t2-.t1 in
    Printf.fprintf stderr "%f\n%!" dt;
    sum := !sum +. dt;
    sqsum := !sqsum +. dt*.dt;
  done;
  let mean = !sum /. float n in
  let sqmean = !sqsum /. float n in
  let stddev = sqrt (sqmean -. mean *. mean) in
  Printf.fprintf stderr "MEAN : %f STDDEV %f STDERR %f\n%!"
    mean stddev (stddev /. sqrt (float n));;

let () =
  Gc.print_stat stderr;;

With trunk, it gives:
MEAN : 0.178121 STDDEV 0.004059 STDERR 0.000741

With this branch (merged with trunk to work around #9681), I get:
MEAN : 0.199789 STDDEV 0.007673 STDERR 0.001401

Which is a 12% difference. Not negligible.

Sure, this is a micro-benchmark designed to hit the worst case (essentially, the worst case happens when there are many code fragments and a large stack containing many pointers to the heap, the stack being of the same order of magnitude as the major heap).

@xavierleroy
Contributor Author

Thanks for the example and for the testing. +12% for a worst-case example is not that bad...

I played a bit with your example:

  • Compiled with ocamlc (-> only two code fragments), I see a 1% slowdown.
  • Running under the toplevel (-> a lot of code fragments), but reducing the size of lists to 1000, I see a 4% slowdown.

Why the smaller lists? A long time ago I worried about stack scanning time in ocamlopt (with the frame descriptors and all that), and before trying to speed it up, I made some measurements on a few OCaml programs that use a lot of recursive functions (Coq, the OCaml compilers, etc.). On average, the stack was quite small: typically 100 call frames and fewer than 1000 roots. Sometimes it can get much deeper than that, of course, but not for very long, hence the probability of scanning a huge stack is low.

@xavierleroy
Contributor Author

At any rate, quoting Mrs Thatcher, I'm afraid there is no (reasonable) alternative to the approach followed in this PR. So, can we proceed with the reviewing?

@jhjourdan
Contributor

If we did not have to worry about development time, I think the solution could actually be to use splay trees, which can adapt to the expected, very unbalanced distribution of pointers in the stack. Perhaps with some engineering, the implementation could even be shared with the free list, who knows...

But yes, I agree that nobody would have time for that, given that your current implementation gives good performance in the usual cases and not-too-bad performance on made-up examples. And that is not even mentioning the fact that concurrency and splay trees do not play well together.

@gasche
Member

gasche commented Jun 17, 2020

I think that most toplevel workflows (which include HOL-light in interactive mode and the killer jitted-ocamlnat of the future) will spend most of their heavy computation in functions defined in libraries, rather than just defined in the toplevel section -- which is typically used for exploratory usage of an API. If I understand correctly, this corresponds to a best-case within the worst-case, as the "main" code fragment can be located in 1 step within the skip list, right?

@xavierleroy
Contributor Author

this corresponds to a best-case within the worst-case, as the "main" code fragment can be located in 1 step within the skip list, right?

No. The skiplist is sorted by increasing code address, so there is no reason the "main" code fragment would be at the head. The splay tree mentioned by @jhjourdan could perform better in this case.

@xavierleroy
Contributor Author

Generally speaking: I'm not excluding future work on better data structures for the table of code fragments. What I'd rather not do now nor in the foreseeable future is to add metadata to bytecode (e.g. frame descriptors) to support GC scanning and backtrace construction.

@xavierleroy
Contributor Author

@jhjourdan you left a review requesting changes, but didn't say which ones. Would you please clarify?

@jhjourdan
Contributor

@jhjourdan you left a review requesting changes, but didn't say which ones. Would you please clarify?

As I said in my review:

The patch to callback.c is actually a bug fix. It should be inserted earlier in the history, to make sure that the intermediate commits are, to the best of our knowledge, correct (for easier bisecting).

Perhaps you could add a comment in caml_oldify_local_roots to document the fact that caml_oldify_one can be called with code pointers (or naked pointers in general) without fear.

Sure, these are not critical changes, and the PR could be merged as is, but they are still easy improvements. Don't take my "changes requested" flag too seriously. If you prefer, I can revert it.

So that the return address pushed on the bytecode interpreter stack
is correctly recognized as a code pointer.
Rather than storing a pointer to the previous frame in the Trap_link
field of the current frame, store the distance (pointer difference)
between the current frame and the previous frame, tagged as an OCaml
integer.

Using a tagged integer instead of a raw pointer means fewer problems
later with strict no-naked-pointers support.

Using a distance rather than an absolute address simplifies
the code that resizes the stack.
In no-naked-pointers mode, recognize and skip code pointers present in
the stack of the bytecode interpreter.  This is needed only for
the scan at the beginning of a major GC cycle, not for the scan done
at every minor GC.
Earlier, no-naked-pointers mode was effective only in native code.
…ge table

`!Is_in_value_area(pc)` is always false if we turn the page table off.
A better check would be `caml_find_code_fragment_by_pc(pc) != NULL`,
but I feel this is too costly even for the debug mode of the interpreter.
@xavierleroy
Contributor Author

@jhjourdan: thank you for the clarification, and apologies for my confusion, I was looking at the wrong place in this discussion...

I just pushed the two suggested changes.

A reason why "requested changes" reviews make me nervous is that they would block merging if we had activated GitHub's "mandatory review" feature. We haven't yet, but that's something we might do. That's why I use "requested changes" only when there's a big problem.

@jhjourdan jhjourdan left a comment (Contributor)

Looks good to me!

@xavierleroy : understood. I will try to change my convention, then.

@xavierleroy
Contributor Author

We need to make progress. I'm merging based on @jhjourdan's review and the lack of alternative approaches that work.

@xavierleroy xavierleroy merged commit 5f45428 into ocaml:trunk Jun 21, 2020
@xavierleroy xavierleroy deleted the bytecode-nnp branch June 21, 2020 08:29
@xavierleroy
Contributor Author

PS. The lack of a Changes entry is intentional. I plan to have a single Changes entry for the various pull requests that improve support for no-naked-pointers mode.

@kit-ty-kate
Member

Does anyone know what should be changed in delimcc for it to work after this?
Is it just a simple translation such as ... ?:

  • Trap_link(p) = rvalue; --> Trap_link_offset(p) = rvalue;
  • lvalue = Trap_link(p); --> lvalue = p + Long_val(Trap_link_offset(p));
