8247937: Specialize downcall binding recipes using MethodHandle combinators #212
This PR adds specialization of downcall binding recipes using MethodHandles. The main addition to the code is the ProgrammableInvoker::specialize method, but there are also some refactorings, which I will explain:
The current code has a single invoke method that both interprets the binding recipe and invokes the native function. I've split this into two parts: invokeInterpBindings, which interprets the binding recipe and collects the resulting low-level values into an Object[], and invokeMoves, which takes only low-level values and performs the actual native call.
This Object[] is then passed to invokeMoves.
This split is done so that we can replace the invokeInterpBindings call with a specialized MethodHandle chain based on the binding recipe instead, which is built on top of a MethodHandle to invokeMoves.
This takes care of the binding recipe specialization. invokeMoves can later be replaced during C2 compilation with code that passes these low-level values into registers directly (but that is for another patch).
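To make the shape of this split concrete, here is a minimal, hypothetical sketch. The method names mirror the PR (invokeInterpBindings, invokeMoves), but the signatures and the toy "sum" body are illustrative only; the real methods in ProgrammableInvoker deal with registers and native memory.

```java
// Hypothetical, heavily simplified shape of the invoke split.
public class SplitSketch {
    // Stage 1: interpret the recipe, turning high-level arguments into an
    // array of low-level (primitive) values, then hand off to stage 2.
    static Object invokeInterpBindings(Object[] highLevelArgs) {
        Object[] lowLevel = new Object[highLevelArgs.length];
        for (int i = 0; i < highLevelArgs.length; i++) {
            // a real recipe would run a list of Binding ops per argument
            lowLevel[i] = ((Number) highLevelArgs[i]).longValue();
        }
        return invokeMoves(lowLevel);
    }

    // Stage 2: consume only low-level values; this is the part that a
    // specialized MethodHandle chain can target directly, and that C2
    // could later intrinsify.
    static Object invokeMoves(Object[] lowLevelValues) {
        long acc = 0;
        for (Object v : lowLevelValues) acc += (Long) v;
        return acc;
    }
}
```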
BindingInterpreter no longer writes values to the intermediate buffer directly; instead of passing functors that obtain a pointer to write a low-level value to, it now takes functors that handle the reading or writing themselves (see StoreFunc and LoadFunc).
Since the read/write methods in BindingInterpreter are now called from multiple places, I've moved them to SharedUtils.
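A hypothetical sketch of what such read/write functors could look like; the names StoreFunc and LoadFunc come from the PR, but the parameter shapes here (an integer storage index standing in for a VM storage slot, and a long array standing in for register storage) are assumptions for illustration.

```java
// Illustrative functor interfaces along the lines of StoreFunc/LoadFunc.
public class FunctorSketch {
    @FunctionalInterface
    interface StoreFunc { void store(int storage, Class<?> type, Object value); }

    @FunctionalInterface
    interface LoadFunc { Object load(int storage, Class<?> type); }

    // Example backing: a plain long array standing in for register storage.
    static StoreFunc storeInto(long[] regs) {
        return (storage, type, value) -> regs[storage] = ((Number) value).longValue();
    }

    static LoadFunc loadFrom(long[] regs) {
        return (storage, type) -> regs[storage];
    }
}
```

The point of the refactoring is that the interpreter no longer needs to know where values live; the caller supplies the read/write behavior.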
The process of specializing the binding recipe is as follows:
We first calculate a low-level method type (the 'intrinsicType') based on the MOVE bindings of a particular recipe. This gives us a method type that takes only primitives as arguments, representing the values to be copied into the various CPU registers before calling the native target function. We then get a low-level MethodHandle that calls invokeMoves and adapt it to the intrinsic type. On top of that we build the specialized binding recipe method handle chain.
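A sketch of the intrinsic-type computation, under the assumption that only the carrier type of each MOVE binding matters; the Move record here is a stand-in for the PR's Binding hierarchy, not its real API.

```java
import java.lang.invoke.MethodType;
import java.util.List;

public class IntrinsicTypeSketch {
    // Stand-in for a MOVE binding: only its carrier type is needed here.
    record Move(Class<?> carrier) {}

    // Build a low-level MethodType from the MOVE bindings of each argument's
    // recipe: one (typically primitive) parameter per MOVE.
    static MethodType intrinsicType(Class<?> returnCarrier, List<List<Move>> argRecipes) {
        MethodType mt = MethodType.methodType(returnCarrier);
        for (List<Move> recipe : argRecipes) {
            for (Move m : recipe) {
                mt = mt.appendParameterTypes(m.carrier());
            }
        }
        return mt;
    }
}
```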
For each argument, we iterate through the bindings in reverse, and for each binding we insert a new filter MH onto the handle we already have. At the end we end up with a handle that features the high-level arguments of our native function (MemorySegment, MemoryAddress, etc.).
The same is done for the return bindings; the return value is repeatedly filtered according to the bindings.
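The filtering mechanism itself can be shown with plain MethodHandles combinators. This sketch uses String↔long conversions as stand-ins for the real binding filters (e.g. MemoryAddress → long), and a toy low-level target in place of the adapted invokeMoves handle; only the use of filterArguments and filterReturnValue mirrors the approach described above.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class FilterChainSketch {
    // Low-level target standing in for the adapted invokeMoves handle:
    // it takes only primitives.
    static long lowLevel(long a, long b) { return a + b; }

    // Argument filter standing in for an unbox-style binding recipe.
    static long unbox(String s) { return Long.parseLong(s); }

    // Return filter standing in for a box-style return binding recipe.
    static String box(long v) { return Long.toString(v); }

    static MethodHandle specialized() throws Exception {
        MethodHandles.Lookup l = MethodHandles.lookup();
        MethodHandle target = l.findStatic(FilterChainSketch.class, "lowLevel",
                MethodType.methodType(long.class, long.class, long.class));
        MethodHandle unbox = l.findStatic(FilterChainSketch.class, "unbox",
                MethodType.methodType(long.class, String.class));
        MethodHandle box = l.findStatic(FilterChainSketch.class, "box",
                MethodType.methodType(String.class, long.class));
        // Per-argument filters: the handle now takes the high-level types.
        MethodHandle withArgs = MethodHandles.filterArguments(target, 0, unbox, unbox);
        // Return filter: the return value is adapted back to the high-level type.
        return MethodHandles.filterReturnValue(withArgs, box);
    }

    static String call(String a, String b) {
        try {
            return (String) specialized().invokeExact(a, b);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }
}
```

Because each binding contributes one filter, stacking them in reverse recipe order yields the full high-level-to-low-level adaptation as a single composed handle.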
This currently doesn't work for multiple return values, in which case invokeMoves will return an Object[] instead of just a plain Object. This case is not specialized with method handles, but is instead handled solely by invokeInterpBindings. (We detect this in getBoundMethodHandle.)
In case the bindings need to do allocation, a NativeAllocationScope is created and passed in as the first argument, which is then forwarded to the various filter MH using mergeArguments (which is a wrapper for MethodHandles::permuteArguments that merges 2 arguments given their indices). The allocation and cleanup of the NativeAllocationScope are handled with a tryFinally combinator.
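A sketch of the two combinators mentioned above. mergeArguments is modeled on the PR's description of a wrapper around MethodHandles::permuteArguments (the real signature may differ), and the tryFinally demo shows the allocate-then-clean-up pattern with toy long-typed handles in place of a real NativeAllocationScope.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class CombinatorSketch {
    // Collapse the parameter at destIndex into the one at sourceIndex, so a
    // single incoming value (e.g. the allocation scope) feeds both positions.
    static MethodHandle mergeArguments(MethodHandle mh, int sourceIndex, int destIndex) {
        MethodType oldType = mh.type();
        MethodType newType = oldType.dropParameterTypes(destIndex, destIndex + 1);
        int[] reorder = new int[oldType.parameterCount()];
        for (int i = 0, j = 0; i < reorder.length; i++) {
            if (i == destIndex) {
                reorder[i] = sourceIndex < destIndex ? sourceIndex : sourceIndex - 1;
            } else {
                reorder[i] = j++;
            }
        }
        return MethodHandles.permuteArguments(mh, newType, reorder);
    }

    // Demo target: after merging, both parameters receive the same argument.
    static String both(String a, String b) { return a + ":" + b; }

    static String demoMerge(String arg) {
        try {
            MethodHandle both = MethodHandles.lookup().findStatic(CombinatorSketch.class,
                    "both", MethodType.methodType(String.class, String.class, String.class));
            MethodHandle merged = mergeArguments(both, 0, 1);
            return (String) merged.invokeExact(arg);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    // tryFinally demo: cleanup always runs after the target, which is how
    // the allocation scope gets closed.
    static long body(long x) { return x * 2; }
    static long cleanup(Throwable t, long result, long x) { return result; }

    static long demoTryFinally(long x) {
        try {
            MethodHandles.Lookup l = MethodHandles.lookup();
            MethodHandle body = l.findStatic(CombinatorSketch.class, "body",
                    MethodType.methodType(long.class, long.class));
            MethodHandle cleanup = l.findStatic(CombinatorSketch.class, "cleanup",
                    MethodType.methodType(long.class, Throwable.class, long.class, long.class));
            MethodHandle guarded = MethodHandles.tryFinally(body, cleanup);
            return (long) guarded.invokeExact(x);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }
}
```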
I've added a system property to disable the MethodHandle specialization so that the performance difference can be measured. The speedup I've seen on the CallOverhead benchmark between this and the current code is about 1.5x (but with the C2 intrinsics this gets us on par with JNI). The difference between specialization turned off and on is a little bigger than that, since the split of the 'invoke' method adds a bit of overhead as well, but it's still an overall gain compared to what we currently have.
Finally, I've also fixed a minor problem with TestUpcall where the expected and actual values were reversed when calling asserts.
If I've missed anything in the explanation, please feel free to ask!
mcimadamore left a comment
Overall looks good, I've left a couple of minor editorial comments.
The architectural approach, while sound, seems a bit risky. We're essentially adding yet another way to interpret bindings, and one that is less scrutable (since interpreting happens implicitly, via a MH chain), which will add cost in terms of maintenance going forward. The speedup seems good, but at the same time I can't help thinking that the basic binding interpreter has not been optimized much, so we don't really know how much of that performance gap is really due to the fact that interpreting bindings using a chain of MHs is faster (I can see advantages in not allocating the binding array, but other than that things look less obvious).
As a code organization strategy, I also wonder if it would pay off to bring more API to bindings, e.g. so that each binding could support a box/unbox/specialize triad of methods - rather than having switches scattered in several places?
@JornVernee This change now passes all automated pre-integration checks.
I've applied your suggestion of putting the code for interpreting, verifying and specializing the Bindings in methods on Binding itself, instead of scattered throughout the switch statements. This removes a bit of code as well, since there is no more need for casting, and fields are directly accessible. There was one place where this didn't completely work out; in the specialization we need to update the argument index (insertPos) when we encounter a MOVE binding, so the code currently has to explicitly check for MOVE bindings.
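To illustrate the refactoring described above (behavior as methods on Binding rather than scattered switches), here is a hypothetical miniature: the Dup and Widen bindings and the stack-based run loop are invented stand-ins, not the PR's real Binding API.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

public class BindingSketch {
    // Each binding carries its own interpretation step, so the interpreter
    // needs no switch over binding kinds.
    interface Binding {
        void interpret(Deque<Object> stack);
    }

    // Duplicate the top-of-stack value (in the spirit of a DUP binding).
    record Dup() implements Binding {
        public void interpret(Deque<Object> stack) { stack.push(stack.peek()); }
    }

    // Convert the top of stack (widening an Integer to a Long here, standing
    // in for a real conversion binding).
    record Widen() implements Binding {
        public void interpret(Deque<Object> stack) {
            stack.push(((Number) stack.pop()).longValue());
        }
    }

    // A thin interpreter loop: run a recipe over a single argument.
    static Object run(Object arg, List<Binding> recipe) {
        Deque<Object> stack = new ArrayDeque<>();
        stack.push(arg);
        for (Binding b : recipe) b.interpret(stack);
        return stack.pop();
    }
}
```

With this shape, adding a new binding kind means adding one class, not editing every switch; the MOVE case mentioned above is the one spot where the interpreter still needs to know the binding kind, to advance the argument index.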
I've also added a few more CallOverhead benchmarks, with a more diverse argument set. The results show that, especially once you start adding more arguments or the arguments become more complex (see the MemoryAddress case), the MH specialization starts to make more of a difference.
mcimadamore left a comment
Code looks really good - I like how moving stuff to bindings turned out.
Binding interpreter is now very thin, we might consider later if we wanna remove it and add maybe some static helpers to Binding. But that's something that can wait.