
Naked pointers and the bytecode interpreter #9680

Merged
merged 5 commits into ocaml:trunk from xavierleroy:bytecode-nnp on Jun 21, 2020

Conversation

xavierleroy
Contributor

@xavierleroy xavierleroy commented Jun 14, 2020

This is the next installment in the "make no-naked-pointers mode work in preparation for Multicore OCaml" series. The hero of today's episode is the bytecode interpreter.

The stack of the bytecode interpreter is an interesting mix of

  • OCaml values (tagged integers or heap pointers), for values of local variables and for intermediate evaluation results;
  • code pointers, for return addresses into calling functions;
  • pointers within the stack, for chaining of exception handlers.

The latter two are "naked" pointers outside the heap, so in no-naked-pointers mode something must be done so that the [edited] major GC never sees them.

There are at least 4 solutions:

  1. Don't store naked pointers inside the bytecode interpreter stack
    1a. Set the low bit to 1 so that the (aligned) pointer looks like an integer
    1b. Store an offset (a relative address) instead of a pointer (an absolute address), encoding the offset as an OCaml tagged integer.

  2. Skip over the naked pointers when scanning the stack during a [edit] major GC
    2a. Use metadata (like the frame descriptors in native-code) to know where the values are and where the other pointers are in the stack
    2b. Use tests such as caml_find_code_fragment_by_pc to identify code pointers, and comparisons with the bounds of the stack to identify pointers within the stack.

Multicore OCaml currently uses 1a (encapsulated pointers) for code pointers and [edited] 1b (offsets) for pointers within the stack, see https://github.com/ocaml-multicore/ocaml-multicore/blob/parallel_minor_gc/runtime/interp.c.

It's a fine solution [edit: modulo a problem with imprecise stack backtraces] but a fairly big patch. In this PR I explore another possibility:

  • Approach 1b for chaining exception handlers: the pointer to the next handler is replaced by an offset. This has the additional benefit that the chaining of handlers no longer needs updating when the bytecode stack is reallocated. (Oleg Kiselyov told me so a long time ago; I should have listened.)
  • Approach 2b to identify and exclude code pointers during [edited] major GC stack scanning.

The diff is quite small, but suffices to make no-naked-pointers mode work in bytecode. (Before this PR, it was active only in native code.)

A note about performance: caml_find_code_fragment_by_pc is not cheap, even with the nifty skiplist-based implementation; however, the slowdown on caml_do_local_roots appears very small -- I was not able to measure it yet. This is because the call stack is rarely very deep, and caml_do_local_roots is called only at the beginning of a major GC cycle, i.e. quite rarely. The scanning function that is called at every minor GC, caml_oldify_local_roots, does not need to worry about code pointers, because it excludes everything that is not a pointer in the minor heap.

For the record: I also have an implementation of approach 1a if it is deemed preferable. [edit: see #9687] But it is a bigger change.

@xavierleroy
Contributor Author

(Mentioning @ejgallego because this discussion applies to the Coq VM interpreter too.)

@stedolan
Contributor

Multicore actually uses 1a (tagging) for code pointers, and 1b (offsets) for trap handlers. The difference is that the multicore branch maintains trap handlers as offsets all the time (and so push/pop does no conversion), whereas this patch keeps absolute references in the trap handler globals (e.g. Caml_state->trapsp), converting to offsets when pushed/popped from the stack.

For the record, I think what's proposed here is a better design. In hindsight, keeping the globals as offsets made the diff larger without much benefit. Using the code fragments table will have some performance impact in multicore (whatever synchronisation is needed will have to do well under contention, as all domains start a major GC cycle simultaneously), but I don't imagine it will be too slow, especially as this only affects bytecode.

@ejgallego

Thanks for the heads-up @xavierleroy, I'll ping @coq/vm-native-maintainers so they are in the loop.

We're going to add 4.12 to the CI so we can try to track upstream development. cc: coq/coq#12518

@jhjourdan
Contributor

A note about the performance impact of caml_find_code_fragment_by_pc: what kind of measurement have you done? I would expect, indeed, that the impact is rather small if very few modules are loaded (i.e., only one module in the most common case), since the overhead of searching the skiplist is very small when the skiplist is almost empty. The story might be different with, e.g., a few hundred modules loaded (which seems large, but not completely crazy either).

@xavierleroy
Contributor Author

A concern with the approach proposed in this PR is that it makes it much more difficult to implement control operators by copying part of the bytecode interpreter stack and saving it in a heap block, like Oleg K's delimcc library does.

When naked pointers are supported or when code pointers in the stack are encapsulated, the heap blocks in question can safely be traversed by the GC. This is not the case with the approach proposed in this PR, so a more complicated copying strategy is needed, like delimcc does for native-code stacks, if I remember correctly.

@xavierleroy
Contributor Author

To encourage fair competition, I submitted the alternate approach (of encapsulating code pointers in the stack) as #9687. Ideally, some brave souls would read both and perhaps run benchmarks...

@xavierleroy
Contributor Author

A note about the performance impact of caml_find_code_fragment_by_pc: what kind of measurement have you done?

Batch-compiled bytecode executables.

I would expect, indeed, that the impact is rather small if very few modules are loaded (i.e., only one module in the most common case), since the overhead of searching the skiplist is very small when the skiplist is almost empty.

Yes, there was only one fragment in my tests. But there is already some overhead in calling caml_find_code_fragment_by_pc and then calling the skiplist function.

The story might be different with, e.g., a few hundred modules loaded (which seems large, but not completely crazy either).

You can get that at the toplevel REPL. But log2 100 is less than 7...

@jhjourdan jhjourdan left a comment (Contributor)

The patch to callback.c is actually a bug fix. It should be inserted earlier in the history, to make sure that the intermediate commits are, to the best of our knowledge, correct (for easier bisecting).

Perhaps you could add a comment in caml_oldify_local_roots to document the fact that caml_oldify_one can be called with code pointers (or naked pointers in general) without fear.

About the change in interp.c, I am somewhat confused. On the one hand you say that calling caml_find_code_fragment_by_pc for every single (non-integer) stack entry at each major collection is negligible, but on the other hand you say that calling it once in debug mode for each closure allocation is too costly. This seems contradictory to me.

@@ -96,6 +92,8 @@ CAMLexport value caml_callbackN_exn(value closure, int narg, value args[])
local_callback_code[4] = POP;
local_callback_code[5] = 1;
local_callback_code[6] = STOP;
/* Not registering the code fragment, as code fragment management
would need to be revised thoroughly for a hypothetical JIT */
Contributor

I don't know the history of the LOCAL_CALLBACK_BYTECODE flag, but it seems to be completely dead code since its insertion in 2004.

Frankly, keeping pieces of untested dead code around for hypothetical clients that are very likely to never exist, and for which we do not even know that they would need this trick, is, IMO, not good software engineering practice. It is even worse with this patch, since we know for sure that the code is bogus (the code fragment is not registered).

Contributor Author

The history is that circa 2004 Basile Starynkevitch was experimenting with a JIT compiler for OCaml bytecode, and left some hooks in the code. I agree these hooks are outdated and should be removed, but in another PR.

@jhjourdan
Contributor

You can get that at the toplevel REPL. But log2 100 is less than 7...

Which is still seven times larger than log2 2 == 1.

Do you want me to do benchmarks with large pieces of code executed in the toplevel and recursive functions? We would be in the worst case: a large stack and a large number of code fragments.

@xavierleroy
Contributor Author

Do you want me to do benchmarks with large pieces of code executed in the toplevel and recursive functions? We would be in the worst case: a large stack and a large number of code fragments.

That would be very nice. We could then compare with #9687.

@xavierleroy
Contributor Author

Multicore actually uses 1a (tagging) for code pointers, and 1b (offsets) for trap handlers.

I stand corrected for the trap handlers.

Concerning 1a (tagging of code pointers), @jhjourdan noticed a serious problem (see #9687): an integer value on the stack can look like a (tagged) code pointer, messing up the recording of the call stack. I'm afraid we'll have to abandon this idea and use approach 2b like in this PR.

@xavierleroy
Contributor Author

About the change in interp.c, I am somewhat confused. On the one hand you say that calling caml_find_code_fragment_by_pc for every single (non-integer) stack entry at each major collection is negligible, but on the other hand you say that calling it once in debug mode for each closure allocation is too costly. This seems contradictory to me.

Between two start-of-major-GC stack scans, you can have a million closure allocations (think CPS, Lwt, etc.), so the overhead of the assertions adds up. I care about the speed of the debug runtime, as it should be able to run possibly big programs.

@jhjourdan
Contributor

Do you want me to do benchmarks with large pieces of code executed in the toplevel and recursive functions? We would be in the worst case: a large stack and a large number of code fragments.

That would be very nice. We could then compare with #9687.

I did that. Here is the benchmark I ran:

#load "unix.cma";;

(fun _ -> ());; (* repeated 1000 times *)

let rec allocate_list = function
  | 0 -> []
  | n ->
     let l = allocate_list (n-1) in
     n :: l

let () =
  let n = 30 in
  let sum = ref 0. in
  let sqsum = ref 0. in
  for j = 0 to n-1 do
    let t1 = Unix.gettimeofday () in
    for i = 0 to 30 do
      ignore (allocate_list 100000)
    done;
    let t2 = Unix.gettimeofday () in
    let dt = t2-.t1 in
    Printf.fprintf stderr "%f\n%!" dt;
    sum := !sum +. dt;
    sqsum := !sqsum +. dt*.dt;
  done;
  let mean = !sum /. float n in
  let sqmean = !sqsum /. float n in
  let stddev = sqrt (sqmean -. mean *. mean) in
  Printf.fprintf stderr "MEAN : %f STDDEV %f STDERR %f\n%!"
    mean stddev (stddev /. sqrt (float n));;

let () =
  Gc.print_stat stderr;;

With trunk, it gives:
MEAN : 0.178121 STDDEV 0.004059 STDERR 0.000741

With this branch (merged with trunk to work around #9681), I get:
MEAN : 0.199789 STDDEV 0.007673 STDERR 0.001401

Which is a 12% difference. Not negligible.

Sure, this is a micro-benchmark designed to hit the worst case (essentially, the worst case happens when there are many code fragments and a large stack containing many pointers to the heap, the stack being of the same order of magnitude as the major heap).

@xavierleroy
Contributor Author

Thanks for the example and for the testing. +12% for a worst-case example is not that bad...

I played a bit with your example:

  • Compiled with ocamlc (-> only two code fragments), I see a 1% slowdown.
  • Running under the toplevel (-> a lot of code fragments), but reducing the size of lists to 1000, I see a 4% slowdown.

Why the smaller lists? A long time ago I worried about stack scanning time in ocamlopt (with the frame descriptors and all that), and before trying to speed it up, I made some measurements on a few OCaml programs that use a lot of recursive functions (Coq, the OCaml compilers, etc.). On average, the stack was quite small: typically 100 call frames and fewer than 1000 roots. Sometimes it can get much deeper than that, of course, but not for very long, hence the probability of scanning a huge stack is low.

@xavierleroy
Contributor Author

At any rate, quoting Mrs Thatcher, I'm afraid there is no (reasonable) alternative to the approach followed in this PR. So, can we proceed with the reviewing?

@jhjourdan
Contributor

If we did not have to worry about development time, I think the solution could actually be to use splay trees, which can adapt to the expected, very unbalanced distribution of pointers in the stack. Perhaps with some engineering, the implementation could even be shared with the free list, who knows...

But yes, I agree that nobody would have time for that, given that your current implementation gives good performance in the usual cases and not-too-bad performance on made-up examples. And that is not even mentioning the fact that concurrency and splay trees do not play well together.

@gasche
Member

gasche commented Jun 17, 2020

I think that most toplevel workflows (which include HOL-light in interactive mode and the killer jitted-ocamlnat of the future) will spend most of their heavy computation in functions defined in libraries, rather than just defined in the toplevel section -- which is typically used for exploratory usage of an API. If I understand correctly, this corresponds to a best-case within the worst-case, as the "main" code fragment can be located in 1 step within the skip list, right?

@xavierleroy
Contributor Author

this corresponds to a best-case within the worst-case, as the "main" code fragment can be located in 1 step within the skip list, right?

No. The skiplist is sorted by increasing code address, so there is no reason the "main" code fragment would be at the head. The splay tree mentioned by @jhjourdan could perform better in this case.

@xavierleroy
Contributor Author

Generally speaking: I'm not excluding future work on better data structures for the table of code fragments. What I'd rather not do now nor in the foreseeable future is to add metadata to bytecode (e.g. frame descriptors) to support GC scanning and backtrace construction.

@xavierleroy
Contributor Author

@jhjourdan you left a review requesting changes, but didn't say which ones. Would you please clarify?

@jhjourdan
Contributor

@jhjourdan you left a review requesting changes, but didn't say which ones. Would you please clarify?

As I said in my review:

The patch to callback.c is actually a bug fix. It should be inserted earlier in the history, to make sure that the intermediate commits are, to the best of our knowledge, correct (for easier bisecting).

Perhaps you could add a comment in caml_oldify_local_roots to document the fact that caml_oldify_one can be called with code pointers (or naked pointers in general) without fear.

Sure, these are not critical changes, and the PR could be merged as is, but they are still easy improvements. Don't take my "changes requested" flag too seriously. If you prefer, I can revert it.

So that the return address pushed on the bytecode interpreter stack
is correctly recognized as a code pointer.
Rather than storing a pointer to the previous frame in the Trap_link
field of the current frame, store the distance (pointer difference)
between the current frame and the previous frame, tagged as an OCaml
integer.

Using a tagged integer instead of a raw pointer means fewer problems
later with strict no-naked-pointers support.

Using a distance rather than an absolute address simplifies
the code that resizes the stack.
In no-naked-pointers mode, recognize and skip code pointers present in
the stack of the bytecode interpreter.  This is needed only for
the scan at the beginning of a major GC cycle, not for the scan done
at every minor GC.
Earlier, no-naked-pointers mode was effective only in native code.
…ge table

`!Is_in_value_area(pc)` is always false if we turn the page table off.
A better check would be `caml_find_code_fragment_by_pc(pc) != NULL`,
but I feel this is too costly even for the debug mode of the interpreter.
@xavierleroy
Contributor Author

@jhjourdan: thank you for the clarification, and apologies for my confusion, I was looking at the wrong place in this discussion...

I just pushed the two suggested changes.

A reason why "requested changes" reviews make me nervous is that they would block merging if we had activated GitHub's "mandatory review" feature. We haven't yet, but that's something we might do. That's why I use "requested changes" only when there's a big problem.

@jhjourdan jhjourdan left a comment (Contributor)

Looks good to me!

@xavierleroy : understood. I will try to change my convention, then.

@xavierleroy
Contributor Author

We need to make progress. I'm merging based on @jhjourdan's review and the lack of alternative approaches that work.

@xavierleroy xavierleroy merged commit 5f45428 into ocaml:trunk Jun 21, 2020
@xavierleroy xavierleroy deleted the bytecode-nnp branch June 21, 2020 08:29
@xavierleroy
Contributor Author

PS. The lack of a Changes entry is intentional. I plan to have a single Changes entry for the various pull requests that improve support for no-naked-pointers mode.

@kit-ty-kate
Member

Does anyone know what should be changed in delimcc for it to work after this?
Is it just a simple translation such as ... ?:

  • Trap_link(p) = rvalue; --> Trap_link_offset(p) = rvalue;
  • lvalue = Trap_link(p); --> lvalue = p + Long_val(Trap_link_offset(p));
