Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8275646: Implement optimized upcall stubs on AArch64 #610

Closed
wants to merge 2 commits into from

Conversation

nick-arm
Copy link
Member

@nick-arm nick-arm commented Nov 5, 2021

This is a fairly direct port of the X86 code. The changes to
frame_aarch64 and foreign_globals_aarch64 are identical: perhaps
ForeignGlobals::parse_call_regs_impl() could be moved to shared code?

The X86 version has a call to reinit_heapbase() before the return from
the stub. I think this is redundant because the heap base register will
be clobbered immediately by the native code?

I hit a really weird bug in release builds where the first few
instructions in the code buffer were overwritten by the fields of
OptimizedEntryBlob. I think we need to pass sizeof(OptimizedEntryBlob)
instead of sizeof(BufferBlob) as the header_size argument to the
RuntimeBlob constructor (none of the other subclasses of BufferBlob have
extra fields). I added a header_size argument to BufferBlob's
constructor to thread this through.

I removed the calls to change the W^X state in on_entry/on_exit calls:
in the on_entry case the stub must already be executable because we
called into the VM from there, and for on_exit we need the code to be
executable not writable otherwise we'll get a SIGBUS as soon as we
return to the stub. The newly added stack tests in TestUpcall hit
JDK-8275584 on MacOS/AArch64 so I've problem-listed that for now.

JMH results from org.openjdk.bench.jdk.incubator.foreign.Upcalls before:

Benchmark                Mode  Cnt     Score    Error  Units
Upcalls.jni_args10       avgt   30   450.417 ?  4.755  ns/op
Upcalls.jni_args5        avgt   30   245.898 ?  3.171  ns/op
Upcalls.jni_blank        avgt   30   195.606 ?  5.459  ns/op
Upcalls.jni_identity     avgt   30   369.788 ? 15.165  ns/op
Upcalls.panama_args10    avgt   30  1253.189 ? 62.261  ns/op
Upcalls.panama_args5     avgt   30   927.101 ? 35.369  ns/op
Upcalls.panama_blank     avgt   30   637.708 ? 11.353  ns/op
Upcalls.panama_identity  avgt   30   697.109 ?  9.971  ns/op

After:

Benchmark                Mode  Cnt    Score    Error  Units
Upcalls.jni_args10       avgt   30  455.304 ? 21.838  ns/op
Upcalls.jni_args5        avgt   30  247.279 ?  2.513  ns/op
Upcalls.jni_blank        avgt   30  194.113 ?  4.317  ns/op
Upcalls.jni_identity     avgt   30  366.145 ?  4.912  ns/op
Upcalls.panama_args10    avgt   30  236.337 ? 11.072  ns/op
Upcalls.panama_args5     avgt   30  223.858 ? 12.345  ns/op
Upcalls.panama_blank     avgt   30  203.631 ?  8.840  ns/op
Upcalls.panama_identity  avgt   30  208.783 ?  9.914  ns/op

Tested tier1 and jdk_foreign on Linux/AArch64 and MacOS/AArch64.


Progress

  • Change must not contain extraneous whitespace
  • Change must be properly reviewed

Issue

  • JDK-8275646: Implement optimized upcall stubs on AArch64

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/panama-foreign pull/610/head:pull/610
$ git checkout pull/610

Update a local copy of the PR:
$ git checkout pull/610
$ git pull https://git.openjdk.java.net/panama-foreign pull/610/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 610

View PR using the GUI difftool:
$ git pr show -t 610

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/panama-foreign/pull/610.diff

This is a fairly direct port of the X86 code.  The changes to
frame_aarch64 and foreign_globals_aarch64 are identical: perhaps
ForeignGlobals::parse_call_regs_impl() could be moved to shared code?

The X86 version has a call to reinit_heapbase() before the return from
the stub.  I think this is redundant because the heap base register will
be clobbered immediately by the native code?

I hit a really weird bug in release builds where the first few
instructions in the code buffer were overwritten by the fields of
OptimizedEntryBlob.  I think we need to pass sizeof(OptimizedEntryBlob)
instead of sizeof(BufferBlob) as the header_size argument to the
RuntimeBlob constructor (none of the other subclasses of BufferBlob have
extra fields).  I added a header_size argument to BufferBlob's
constructor to thread this through.

I removed the calls to change the W^X state in on_entry/on_exit calls:
in the on_entry case the stub must already be executable because we
called into the VM from there, and for on_exit we need the code to be
executable not writable otherwise we'll get a SIGBUS as soon as we
return to the stub.  The newly added stack tests in TestUpcall hit
JDK-8275584 so I've problem-listed that for now.

JMH results from org.openjdk.bench.jdk.incubator.foreign.Upcalls before:

Benchmark                Mode  Cnt     Score    Error  Units
Upcalls.jni_args10       avgt   30   450.417 ?  4.755  ns/op
Upcalls.jni_args5        avgt   30   245.898 ?  3.171  ns/op
Upcalls.jni_blank        avgt   30   195.606 ?  5.459  ns/op
Upcalls.jni_identity     avgt   30   369.788 ? 15.165  ns/op
Upcalls.panama_args10    avgt   30  1253.189 ? 62.261  ns/op
Upcalls.panama_args5     avgt   30   927.101 ? 35.369  ns/op
Upcalls.panama_blank     avgt   30   637.708 ? 11.353  ns/op
Upcalls.panama_identity  avgt   30   697.109 ?  9.971  ns/op

After:

Benchmark                Mode  Cnt    Score    Error  Units
Upcalls.jni_args10       avgt   30  455.304 ? 21.838  ns/op
Upcalls.jni_args5        avgt   30  247.279 ?  2.513  ns/op
Upcalls.jni_blank        avgt   30  194.113 ?  4.317  ns/op
Upcalls.jni_identity     avgt   30  366.145 ?  4.912  ns/op
Upcalls.panama_args10    avgt   30  236.337 ? 11.072  ns/op
Upcalls.panama_args5     avgt   30  223.858 ? 12.345  ns/op
Upcalls.panama_blank     avgt   30  203.631 ?  8.840  ns/op
Upcalls.panama_identity  avgt   30  208.783 ?  9.914  ns/op

Tested tier1 and jdk_foreign on Linux/AArch64 and MacOS/AArch64.
@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Nov 5, 2021

👋 Welcome back ngasson! A progress list of the required criteria for merging this PR into foreign-jextract will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@nick-arm nick-arm changed the base branch from foreign-jextract to foreign-memaccess+abi Nov 5, 2021
@openjdk openjdk bot added the rfr label Nov 5, 2021
@mlbridge
Copy link

@mlbridge mlbridge bot commented Nov 5, 2021

Webrevs

@theRealAph
Copy link
Contributor

@theRealAph theRealAph commented Nov 5, 2021

The AArch64 code parts look reasonable enough, but parts of them seem to repeat some logic already present in MacroAssembler. I guess as it's desirable to stay close to the x86 port that's OK.
I'm interested to know where all that 250ns is going. Did you look at -prof perfasm?

Copy link
Member

@JornVernee JornVernee left a comment

I removed the calls to change the W^X state in on_entry/on_exit calls:
in the on_entry case the stub must already be executable because we
called into the VM from there, and for on_exit we need the code to be
executable not writable otherwise we'll get a SIGBUS as soon as we
return to the stub.

Thanks for cleaning this up as well. That code was taken from JavaCallWrapper code. Do you happen to know which use-case those calls were supposed to address? (I'm assuming things still work on MacOS/AArch64 without them).


W.R.T. to JDK-8275584, I was wondering how JNI handles this, but it looks like it doesn't:

#ifdef __APPLE__
          // Less-than word types are stored one after another.
          // The code is unable to handle this so bailout.
          return -1;
#endif

(in SharedRuntime_aarch64).

Maybe for now we should add a similar early bailout in Java code on MacOS/AArch64, until we have a chance to implement this.


WRT to performance, I see somewhat comparable numbers (but on a different scale), on my x64 machine:

Benchmark                Mode  Cnt    Score   Error  Units
Upcalls.jni_args10       avgt   30  148.993 � 1.492  ns/op
Upcalls.jni_args5        avgt   30   79.690 � 0.753  ns/op
Upcalls.jni_blank        avgt   30   55.017 � 1.067  ns/op
Upcalls.jni_identity     avgt   30  120.529 � 1.265  ns/op
Upcalls.panama_args10    avgt   30   55.065 � 1.037  ns/op
Upcalls.panama_args5     avgt   30   53.792 � 2.640  ns/op
Upcalls.panama_blank     avgt   30   47.916 � 1.764  ns/op
Upcalls.panama_identity  avgt   30   49.680 � 1.604  ns/op

ForeignGlobals::parse_call_regs_impl() could be moved to shared code?

Ok, feel free to move it to shared code if you want.

src/hotspot/cpu/aarch64/universalUpcallHandler_aarch64.cpp Outdated Show resolved Hide resolved
src/hotspot/share/code/codeBlob.cpp Show resolved Hide resolved
src/hotspot/cpu/aarch64/foreign_globals_aarch64.cpp Outdated Show resolved Hide resolved
assert(entry->is_static(), "static only");
// Fill in the signature array, for the calling-convention call.
const int total_out_args = entry->size_of_parameters();
assert(total_out_args > 0, "receiver arg");

BasicType* out_sig_bt = NEW_RESOURCE_ARRAY(BasicType, total_out_args);
BasicType ret_type;
{
int i = 0;
SignatureStream ss(entry->signature());
for (; !ss.at_return_type(); ss.next()) {
out_sig_bt[i++] = ss.type(); // Collect remaining bits of signature
if (ss.type() == T_LONG || ss.type() == T_DOUBLE)
out_sig_bt[i++] = T_VOID; // Longs & doubles take 2 Java slots
}
assert(i == total_out_args, "");
ret_type = ss.type();
}
// skip receiver
BasicType* in_sig_bt = out_sig_bt + 1;
int total_in_args = total_out_args - 1;
Copy link
Member

@JornVernee JornVernee Nov 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section could also be moved to shared code as well (I did that in the pending patch as well: https://github.com/openjdk/panama-foreign/pull/608/files#r741389756 but it would be fine to do it here instead of you want)

Copy link
Member Author

@nick-arm nick-arm Nov 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'll leave it as it is here to minimise changes - you'll need to rebase that patch anyway? I made the other two suggested clean-ups.

Copy link
Member

@JornVernee JornVernee Nov 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, that's fine.

@openjdk
Copy link

@openjdk openjdk bot commented Nov 5, 2021

@nick-arm This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8275646: Implement optimized upcall stubs on AArch64

Reviewed-by: jvernee

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 96 new commits pushed to the foreign-memaccess+abi branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the foreign-memaccess+abi branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready label Nov 5, 2021
@nick-arm
Copy link
Member Author

@nick-arm nick-arm commented Nov 8, 2021

I'm interested to know where all that 250ns is going. Did you look at -prof perfasm?

I should have tested on a more modern machine. Here's the results from an N1 server:

Before:

Benchmark                Mode  Cnt    Score    Error  Units
Upcalls.jni_args10       avgt   30  168.016 ±  0.571  ns/op
Upcalls.jni_args5        avgt   30   98.907 ±  0.833  ns/op
Upcalls.jni_blank        avgt   30   78.007 ±  0.138  ns/op
Upcalls.jni_identity     avgt   30  150.933 ±  0.854  ns/op
Upcalls.panama_args10    avgt   30  600.304 ± 35.944  ns/op
Upcalls.panama_args5     avgt   30  447.607 ± 24.090  ns/op
Upcalls.panama_blank     avgt   30  238.953 ± 12.512  ns/op
Upcalls.panama_identity  avgt   30  314.388 ± 31.383  ns/op

After:

Benchmark                Mode  Cnt    Score    Error  Units
Upcalls.jni_args10       avgt   30  168.528 ±  0.658  ns/op
Upcalls.jni_args5        avgt   30   98.595 ±  0.628  ns/op
Upcalls.jni_blank        avgt   30   78.420 ±  0.376  ns/op
Upcalls.jni_identity     avgt   30  154.403 ±  2.090  ns/op
Upcalls.panama_args10    avgt   30   86.066 ±  4.202  ns/op
Upcalls.panama_args5     avgt   30   78.094 ±  3.718  ns/op
Upcalls.panama_blank     avgt   30   68.683 ±  2.107  ns/op
Upcalls.panama_identity  avgt   30   76.841 ± 11.340  ns/op

Which is closer to the x86 results. The top two functions are the on_entry and on_exit VM calls:

 20.44%           libjvm.so  ProgrammableUpcallHandler::on_entry (228 bytes)
 12.30%           libjvm.so  ProgrammableUpcallHandler::on_exit (88 bytes)
  9.20%        runtime stub  StubRoutines::atomic entry points (148 bytes)
  8.65%         c2, level 4  org.openjdk.bench.jdk.incubator.foreign.jmh_generated.Upcalls_panama_args5_jmhTest::panama_args5_avgt_jmhStub, version 1386 (100 bytes)
  4.64%           libjvm.so  ThreadShadow::clear_pending_exception (32 bytes)
  4.07%         c2, level 4  java.lang.invoke.LambdaForm$MH.0x0000000800d12c00::invoke, version 1369 (212 bytes)
  3.94%         c2, level 4  java.lang.invoke.LambdaForm$MH.0x0000000800d12c00::invoke, version 1369 (268 bytes)

That code was taken from JavaCallWrapper code. Do you happen to know which use-case those calls were supposed to address? (I'm assuming things still work on MacOS/AArch64 without them).

I think the flow for JavaCallWrapper is slightly different. It's doing:

Native -> JavaCallWrapper -> Java (need X) -> ~JavaCallWrapper -> Native (need W?)

Whereas here we're doing:

Native -> Stub (need X) -> on_entry -> Java (need X) -> on_exit -> Stub (need X) -> Native

I'm not sure why ~JavaCallWrapper needs to set the W mode though.

@theRealAph
Copy link
Contributor

@theRealAph theRealAph commented Nov 8, 2021

Tried on MacOS, this is very weird:

....[Hottest Methods (after inlining)]..............................................................
 25.52%               runtime stub  StubRoutines::updateBytesAdler32 
 19.98%                c2, level 4  java.lang.invoke.LambdaForm$MH.0x00000008001b9000::invoke, version 1269 
 13.64%               libjvm.dylib  ProgrammableUpcallHandler::on_entry(OptimizedEntryBlob::FrameData*) 
  9.54%                c2, level 4  java.lang.invoke.LambdaForm$MH.0x00000008001b8000::invoke, version 1260 
  5.74%              libdyld.dylib  tlv_get_addr 
  5.35%               libjvm.dylib  JNIHandleBlock::allocate_block(Thread*, AllocFailStrategy::AllocFailEnum) 
  4.50%                c2, level 4  org.openjdk.bench.jdk.incubator.foreign.jmh_generated.Upcalls_panama_args10_jmhTest::panama_args10_avgt_jmhStub, version 1299 

Adler32 is used by zlib, right? So something is repeatedly scanning compressed jarfiles, or something.

@theRealAph
Copy link
Contributor

@theRealAph theRealAph commented Nov 8, 2021

I removed the calls to change the W^X state in on_entry/on_exit calls:
in the on_entry case the stub must already be executable because we
called into the VM from there, and for on_exit we need the code to be
executable not writable otherwise we'll get a SIGBUS as soon as we
return to the stub.

Thanks for cleaning this up as well. That code was taken from JavaCallWrapper code. Do you happen to know which use-case those calls were supposed to address? (I'm assuming things still work on MacOS/AArch64 without them).

This is in case native calls cause JIT patching.

Copy link
Member

@JornVernee JornVernee left a comment

LGTM. Feel free to integrate like this.

@nick-arm
Copy link
Member Author

@nick-arm nick-arm commented Nov 9, 2021

Tried on MacOS, this is very weird:

....[Hottest Methods (after inlining)]..............................................................
 25.52%               runtime stub  StubRoutines::updateBytesAdler32 

I think it's a bug in JMH's parsing of the -XX:+PrintStubCode output: it's expecting these to all be prefixed with StubRoutines:: but many aren't including the optimized_upcall_stub added by this patch. StubRoutines::updateBytesAdler32 is the last one printed with that prefix on MacOS, and likewise StubRoutines::atomic entry points is the last one printed on Linux (allegedly the third hottest method above). So I think it's been misattributed. I'll see if I can fix it...

@nick-arm
Copy link
Member Author

@nick-arm nick-arm commented Nov 9, 2021

/integrate

@openjdk
Copy link

@openjdk openjdk bot commented Nov 9, 2021

Going to push as commit 9e9ba2e.
Since your change was applied there have been 96 commits pushed to the foreign-memaccess+abi branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot closed this Nov 9, 2021
@openjdk openjdk bot added integrated and removed ready rfr labels Nov 9, 2021
@openjdk
Copy link

@openjdk openjdk bot commented Nov 9, 2021

@nick-arm Pushed as commit 9e9ba2e.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@theRealAph
Copy link
Contributor

@theRealAph theRealAph commented Nov 9, 2021

Tried on MacOS, this is very weird:

....[Hottest Methods (after inlining)]..............................................................
 25.52%               runtime stub  StubRoutines::updateBytesAdler32 

I think it's a bug in JMH's parsing of the -XX:+PrintStubCode output: it's expecting these to all be prefixed with StubRoutines:: but many aren't including the optimized_upcall_stub added by this patch. StubRoutines::updateBytesAdler32 is the last one printed with that prefix on MacOS, and likewise StubRoutines::atomic entry points is the last one printed on Linux (allegedly the third hottest method above). So I think it's been misattributed. I'll see if I can fix it...

That makes perfect sense. I'm making rather a fuss about this because I think I know how long a direct call should make_, and indeed does make in C++. It could be that this really is the best we can do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
integrated
3 participants