
8260355: AArch64: deoptimization stub should save vector registers #2279

Closed
wants to merge 4 commits

Conversation

@nick-arm
Member

@nick-arm nick-arm commented Jan 28, 2021

This is an AArch64 port of the fix for JDK-8256056 "Deoptimization stub
doesn't save vector registers on x86". The problem is that a vector
produced by the Vector API may be stored in a register when the deopt
blob is called. Because the deopt blob only stores the lower half of
vector registers, the full vector object cannot be rematerialized during
deoptimization. So the following will crash on AArch64 with current JDK:

  make test TEST="jdk/incubator/vector" \
    JTREG="VM_OPTIONS=-XX:+DeoptimizeALot -XX:DeoptimizeALotInterval=0"

The fix is to store the full vector registers by passing
save_vectors=true to save_live_registers() in the deopt blob. Because
save_live_registers() places the integer registers above the floating
registers in the stack frame, RegisterSaver::r0_offset_in_bytes() needs
to calculate the SP offset based on whether full vectors were saved, and
whether those vectors were NEON or SVE, rather than using a static
offset as it does currently.

The change to VectorSupport::allocate_vector_payload_helper() is
required because we only store the lowest VMReg slot in the oop map.
However unlike x86 the vector registers are always saved in a contiguous
region of memory, so we can calculate the address of each vector element
as an offset from the address of the first slot. X86 handles this in
RegisterMap::pd_location() but that won't work on AArch64 because with
SVE there isn't a unique VMReg corresponding to each four-byte physical
slot in the vector (there are always exactly eight logical VMRegs
regardless of the actual vector length).

Tested hotspot_all_no_apps and jdk_core.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8260355: AArch64: deoptimization stub should save vector registers

Reviewers

Download

$ git fetch https://git.openjdk.java.net/jdk pull/2279/head:pull/2279
$ git checkout pull/2279

@bridgekeeper

@bridgekeeper bridgekeeper bot commented Jan 28, 2021

👋 Welcome back ngasson! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr label Jan 28, 2021
@openjdk

@openjdk openjdk bot commented Jan 28, 2021

@nick-arm The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot label Jan 28, 2021
@mlbridge

@mlbridge mlbridge bot commented Jan 28, 2021

@mlbridge

@mlbridge mlbridge bot commented Jan 28, 2021

Mailing list message from Andrew Haley on hotspot-dev:

On 1/28/21 8:31 AM, Nick Gasson wrote:

> The fix is to store the full vector registers by passing
> save_vectors=true to save_live_registers() in the deopt blob. Because
> save_live_registers() places the integer registers above the floating
> registers in the stack frame, RegisterSaver::r0_offset_in_bytes() needs
> to calculate the SP offset based on whether full vectors were saved, and
> whether those vectors were NEON or SVE, rather than using a static
> offset as it does currently.

It seems to me that save_vectors is only set here:

bool save_vectors = COMPILER2_OR_JVMCI != 0;

which means that save_vectors is a static property of a build, not something
that can vary. Therefore, there is nothing to be gained by passing save_vectors
around. Could we simply have a save_vectors_on_deopt() function?

Also, I'm wondering how much all of this complexity gains us for the sake
of configurations without C2. I'd be very tempted to just leave a hole in
the stack for these systems. How much stack would that cost?

--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

@@ -135,10 +135,18 @@ Handle VectorSupport::allocate_vector_payload_helper(InstanceKlass* ik, frame* f
VMReg vreg = VMRegImpl::as_VMReg(location.register_number());

for (int i = 0; i < num_elem; i++) {
int vslot = (i * elem_size) / VMRegImpl::stack_slot_size;
int off = (i * elem_size) % VMRegImpl::stack_slot_size;
bool contiguous = X86_ONLY(false) NOT_X86(true);

@iwanowww

iwanowww Jan 28, 2021

I don't like this change. It's not x86-specific, but SVE-specific code. What is broken here is VMReg::next() doesn't work properly for VecA registers. And, as a result, it makes RegisterMap::location(VMReg) unusable as well.

So, a proper fix should address that instead. If there's no way to save VMReg::next() and RegisterMap::location(VMReg), then new cross-platform API should be introduced and VectorSupport::allocate_vector_payload_helper() migrated to it.

@nsjian

nsjian Jan 28, 2021
Contributor

For Arm NEON (and PPC) we don't set VMReg::next() in the oopmap either, and their vector slots are contiguous, so is that x86-specific? That said, NEON could also generate a correct full oopmap since it has a fixed vector size. For SVE, I have no idea how to implement proper VMReg::next() support, so Nick's solution looks good to me. Regarding introducing a new cross-platform API, which API do you mean? If we could have some better API, that would be perfect. Currently, allocate_vector_payload_helper() is the only vector-related caller of RegisterMap::location() that I can see.

@iwanowww

iwanowww Jan 28, 2021

Probably, x86 is unique in using a non-contiguous representation for vector values, but that doesn't make the code in question x86-specific. AArch64 is the only user of VecA, and VecA is the only register type with a mismatch in size between its in-memory and RegMask representations. So, I conclude it is AArch64/SVE-specific.

On x86 RegisterMap isn't fully populated for vector registers either, but there's RegisterMap::pd_location() to cover that.

Regarding new API, I mean the alternative to VMReg::next()/RegisterMap::location(VMReg) which is able to handle VecA case well. As Nick pointed out earlier, the problem with VecA is that there's no VMReg representation for all the slots which comprise the register value.

Either enhancing VMReg::next(int) to produce special values for the VecA case or introducing RegisterMap::location(VMReg base_reg, int slot) would be a better way to handle the problem.

@nick-arm

nick-arm Feb 1, 2021
Author Member

@iwanowww please take a look at the latest set of changes and let me know what you think. There's now a RegisterMap::location(VMReg base_reg, int slot) method as you suggest. That in turn uses a new method VMReg::is_expressible(int slot_delta) which is true if offsetting a VMReg by slot_delta slots gives another valid VMReg which is also a slot of the same physical register (i.e. reg->next(slot_delta) is valid). We can use this to fall back to pd_location if a slot of a vector is not expressible as a VMReg (i.e. for SVE). Unfortunately it touches a lot of files but that seems unavoidable.

@mlbridge

@mlbridge mlbridge bot commented Jan 29, 2021

Mailing list message from Nick Gasson on hotspot-dev:

On 01/28/21 17:48 pm, Andrew Haley wrote:

> It seems to me that save_vectors is only set here:
>
>   bool save_vectors = COMPILER2_OR_JVMCI != 0;
>
> which means that save_vectors is a static property of a build, not something
> that can vary. Therefore, there is nothing to be gained by passing save_vectors
> around. Could we simply have a save_vectors_on_deopt() function?

RegisterSaver is also used by generate_resolve_blob (which never saves
vectors) and generate_handler_blob (which saves vectors when poll_type
== POLL_AT_VECTOR_LOOP). How about we make RegisterSaver a class you can
instantiate, and pass whether to save vectors or not to the constructor?
That would look like:

RegisterSaver reg_save(COMPILER2_OR_JVMCI != 0 /* save_vectors */);
map = reg_save.save_live_registers(masm, 0, &frame_size_in_words);
...
reg_save.restore_live_registers(masm);

Which avoids passing save_vectors around everywhere.

> Also, I'm wondering how much all of this complexity gains us for the sake
> of configurations without C2. I'd be very tempted to just leave a hole in
> the stack for these systems. How much stack would that cost?

For NEON the difference is 768 bytes vs 512, but SVE could be a lot
more.

83 // FIXME -- this is used by C1
84 class RegisterSaver {

Do you remember what this is referring to? That it's duplicating
save_live_registers() in c1_Runtime_aarch64.cpp?

--
Thanks,
Nick

@mlbridge

@mlbridge mlbridge bot commented Jan 29, 2021

Mailing list message from Andrew Haley on hotspot-dev:

On 1/29/21 7:53 AM, Nick Gasson wrote:

> On 01/28/21 17:48 pm, Andrew Haley wrote:
>
>> It seems to me that save_vectors is only set here:
>>
>>   bool save_vectors = COMPILER2_OR_JVMCI != 0;
>>
>> which means that save_vectors is a static property of a build, not something
>> that can vary. Therefore, there is nothing to be gained by passing save_vectors
>> around. Could we simply have a save_vectors_on_deopt() function?
>
> RegisterSaver is also used by generate_resolve_blob (which never saves
> vectors) and generate_handler_blob (which saves vectors when poll_type
> == POLL_AT_VECTOR_LOOP). How about we make RegisterSaver a class you can
> instantiate, and pass whether to save vectors or not to the constructor?
> That would look like:
>
>   RegisterSaver reg_save(COMPILER2_OR_JVMCI != 0 /* save_vectors */);
>   map = reg_save.save_live_registers(masm, 0, &frame_size_in_words);
>   ...
>   reg_save.restore_live_registers(masm);
>
> Which avoids passing save_vectors around everywhere.

That sounds like a great improvement.

>> Also, I'm wondering how much all of this complexity gains us for the sake
>> of configurations without C2. I'd be very tempted to just leave a hole in
>> the stack for these systems. How much stack would that cost?
>
> For NEON the difference is 768 bytes vs 512, but SVE could be a lot
> more.

OK, so it probably wouldn't be worth doing on NEON. But A64FX vectors are 64
bytes each, right? So 2kbytes to save, in which case we need this variable-size
register-save area.

>   83 // FIXME -- this is used by C1
>   84 class RegisterSaver {
>
> Do you remember what this is referring to? That it's duplicating
> save_live_registers() in c1_Runtime_aarch64.cpp?

Probably, yes.

--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

@iwanowww

@iwanowww iwanowww commented Feb 1, 2021

Much better, thanks.

I suggest the following changes:

  • please, leave original location(VMReg reg)/pd_location(VMReg reg) intact and introduce (VMReg reg, uint slot_idx) overloads;

  • it's fair for location(VMReg reg, uint slot_idx) to require reg to always be a base register:

address RegisterMap::location(VMReg base_reg, uint slot_idx) const {
  if (slot_idx > 0) {
    return pd_location(base_reg, slot_idx);
  } else {
    return location(base_reg);
  }
}
  • on all platforms except AArch64 define pd_location(VMReg base_reg, int slot_idx) as:
address RegisterMap::pd_location(VMReg base_reg, int slot_idx) const {
  return location(base_reg->next(slot_idx));
}
  • on AArch64 check for vector register case and special-case it to:
address RegisterMap::pd_location(VMReg base_reg, int slot_idx) const {
  if (base_reg->is_FloatRegister()) {
    assert((base_reg->value() - ConcreteRegisterImpl::max_gpr) % FloatRegisterImpl::max_slots_per_register == 0, "not a base register");
    intptr_t offset_in_bytes = slot_idx * VMRegImpl::stack_slot_size;
    address base_location = location(base_reg);
    if (base_location != NULL) {
      return base_location + offset_in_bytes;
    }
  }
  return location(base_reg->next(slot_idx));
}

Or, as an alternative (since all the registers are stored contiguously on AArch64 anyway):

address RegisterMap::pd_location(VMReg base_reg, int slot_idx) const {
  intptr_t offset_in_bytes = slot_idx * VMRegImpl::stack_slot_size;
  address base_location = location(base_reg);
  if (base_location != NULL) {
    return base_location + offset_in_bytes;
  }
  return NULL;
}
  • keep the assert in VMReg::next(int) if you wish, but refactor it into an out-of-bounds check instead. Personally, I'd prefer to see it as a separate enhancement.
@nick-arm
Member Author

@nick-arm nick-arm commented Feb 2, 2021

Much better, thanks.

I suggest the following changes:

I've changed it as suggested. This way seems much simpler, thanks.


@iwanowww iwanowww left a comment

RegisterMap-related changes look good.

@openjdk

@openjdk openjdk bot commented Feb 2, 2021

@nick-arm This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8260355: AArch64: deoptimization stub should save vector registers

Reviewed-by: vlivanov, aph

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 107 new commits pushed to the master branch:

  • d0a8f2f: 8260593: javac can skip a temporary local variable when pattern matching over a local variable
  • deb0544: 8261251: Shenandoah: Use object size for full GC humongous compaction
  • d45343e: 8260899: ARM32: SyncOnValueBasedClassTest fails with assert(is_valid()) failed: invalid register
  • 9d59dec: 8248876: LoadObject with bad base address created for exec file on linux
  • aa5bc6e: 8258953: AArch64: move NEON instructions to aarch64_neon.ad
  • c5ff454: 8250989: Consolidate buffer allocation code for CDS static/dynamic dumping
  • 0e18634: 8261270: MakeMethodNotCompilableTest fails with -XX:TieredStopAtLevel={1,2,3}
  • 7a2db85: 8261022: Fix incorrect result of Math.abs() with char type
  • 2c3a86f: 8261280: Remove THREAD argument from compute_loader_lock_object
  • 74d40ab: 8261200: Some code in the ICC_Profile may not close file streams properly
  • ... and 97 more: https://git.openjdk.java.net/jdk/compare/ab727f0a584ffc12c3eb57ea0d6a01188224f9fc...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready label Feb 2, 2021
@nick-arm
Member Author

@nick-arm nick-arm commented Feb 3, 2021

@theRealAph are the sharedRuntime_aarch64.cpp changes ok?

@mlbridge

@mlbridge mlbridge bot commented Feb 3, 2021

Mailing list message from Andrew Haley on hotspot-dev:

On 2/3/21 7:02 AM, Nick Gasson wrote:

> On Tue, 2 Feb 2021 11:08:46 GMT, Vladimir Ivanov <vlivanov at openjdk.org> wrote:
>
>> Nick Gasson has updated the pull request incrementally with one additional
>> commit since the last revision:
>>
>>   Review comments
>>
>> `RegisterMap`-related changes look good.
>
> @theRealAph are the sharedRuntime_aarch64.cpp changes ok?

I guess so, but the code changes are so complex and delicate it's extremely
hard to tell.

What have you done about stress testing? I guess we need some code that's
repeatedly deoptimized and recompiled millions of times, with continuous
testing. I guess that in order to make sure nothing has regressed, a
bootstrap with DeoptimizeALot would help gain some confidence.

--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

@mlbridge

@mlbridge mlbridge bot commented Feb 4, 2021

Mailing list message from Nick Gasson on hotspot-dev:

On 02/03/21 17:36 pm, Andrew Haley wrote:

>> @theRealAph are the sharedRuntime_aarch64.cpp changes ok?
>
> I guess so, but the code changes are so complex and delicate it's extremely
> hard to tell.
>
> What have you done about stress testing? I guess we need some code that's
> repeatedly deoptimized and recompiled millions of times, with continuous
> testing. I guess that in order to make sure nothing has regressed, a
> bootstrap with DeoptimizeALot would help gain some confidence.

I tried make bootcycle-images as you suggest with -XX:+DeoptimizeALot
added to JAVA_FLAGS and JAVA_FLAGS_BIG in bootcycle-spec.gmk.in (not
sure if there's a better way to do that...).

I've also previously run the tier1 and java/incubator/vector/* tests
with -XX:+DeoptimizeALot. My experience of modifying that code is that
DeoptimizeALot fails pretty quickly if you get something wrong.

--
Thanks,
Nick

@mlbridge

@mlbridge mlbridge bot commented Feb 4, 2021

Mailing list message from Andrew Haley on hotspot-dev:

On 2/4/21 7:21 AM, Nick Gasson wrote:

> On 02/03/21 17:36 pm, Andrew Haley wrote:
>
>>> @theRealAph are the sharedRuntime_aarch64.cpp changes ok?
>>
>> I guess so, but the code changes are so complex and delicate it's extremely
>> hard to tell.
>>
>> What have you done about stress testing? I guess we need some code that's
>> repeatedly deoptimized and recompiled millions of times, with continuous
>> testing. I guess that in order to make sure nothing has regressed, a
>> bootstrap with DeoptimizeALot would help gain some confidence.
>
> I tried make bootcycle-images as you suggest with -XX:+DeoptimizeALot
> added to JAVA_FLAGS and JAVA_FLAGS_BIG in bootcycle-spec.gmk.in (not
> sure if there's a better way to do that...).
>
> I've also previously run the tier1 and java/incubator/vector/* tests
> with -XX:+DeoptimizeALot. My experience of modifying that code is that
> DeoptimizeALot fails pretty quickly if you get something wrong.

Yeah. The problem here is that safepoints with live vectors aren't so
common, so it's hard to test, I get it. Maybe this will have to do.

--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

@mlbridge

@mlbridge mlbridge bot commented Feb 4, 2021

Mailing list message from Nick Gasson on hotspot-dev:

On 02/04/21 16:18 pm, Andrew Haley wrote:

> Yeah. The problem here is that safepoints with live vectors aren't so
> common, so it's hard to test, I get it. Maybe this will have to do.

You can test that situation quite readily with:

make test TEST="jdk/incubator/vector" \
JTREG="VM_OPTIONS=-XX:+DeoptimizeALot -XX:DeoptimizeALotInterval=0"

which will segfault with the current JDK. I guess the difficulty is showing
it hasn't regressed something else.

--
Thanks,
Nick

@mlbridge

@mlbridge mlbridge bot commented Feb 4, 2021

Mailing list message from Vladimir Ivanov on hotspot-dev:

>> I've also previously run the tier1 and java/incubator/vector/* tests
>> with -XX:+DeoptimizeALot. My experience of modifying that code is that
>> DeoptimizeALot fails pretty quickly if you get something wrong.
>
> Yeah. The problem here is that safepoints with live vectors aren't so
> common, so it's hard to test, I get it. Maybe this will have to do.

FTR jdk/java/incubator/vector tests w/ -XX:+DeoptimizeALot are very good
at verifying that in-register vector values are properly preserved:
vectors (as exposed by Vector API) are routinely kept in registers
across safepoints and during deoptimization they are rematerialized into
full-blown Vector instances, so the tests fail quickly on broken vector
values.

Best regards,
Vladimir Ivanov

@mlbridge

@mlbridge mlbridge bot commented Feb 5, 2021

Mailing list message from Andrew Haley on hotspot-dev:

On 2/4/21 10:03 AM, Vladimir Ivanov wrote:

>>> I've also previously run the tier1 and java/incubator/vector/* tests
>>> with -XX:+DeoptimizeALot. My experience of modifying that code is that
>>> DeoptimizeALot fails pretty quickly if you get something wrong.
>>
>> Yeah. The problem here is that safepoints with live vectors aren't so
>> common, so it's hard to test, I get it. Maybe this will have to do.
>
> FTR jdk/java/incubator/vector tests w/ -XX:+DeoptimizeALot are very good
> at verifying that in-register vector values are properly preserved:
> vectors (as exposed by Vector API) are routinely kept in registers
> across safepoints and during deoptimization they are rematerialized into
> full-blown Vector instances, so the tests fail quickly on broken vector
> values.

Great, thanks.

--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

@nick-arm
Member Author

@nick-arm nick-arm commented Feb 8, 2021

@theRealAph Is this one ok to push now?

@nick-arm
Member Author

@nick-arm nick-arm commented Feb 9, 2021

/integrate

@openjdk openjdk bot closed this Feb 9, 2021
@openjdk openjdk bot added integrated and removed ready rfr labels Feb 9, 2021
@openjdk

@openjdk openjdk bot commented Feb 9, 2021

@nick-arm Since your change was applied there have been 123 commits pushed to the master branch:

  • 5d8204b: 8261368: The new TestNullSetColor test is placed in the wrong group
  • f03e839: 8261127: Cleanup THREAD/TRAPS/CHECK usage in CDS code
  • 7451962: 8129776: The optimized Stream returned from Files.lines should unmap the mapped byte buffer (if created) when closed
  • ad525bc: 8261281: Linking jdk.jpackage fails for linux aarch32 builds after 8254702
  • 2fd8ed0: 8240632: Note differences between IEEE 754-2019 math lib special cases and java.lang.Math
  • ace8f94: 8195744: Avoid calling ClassLoader.checkPackageAccess if security manager is not installed
  • ab65d53: 8261261: The version extra fields needs to be overridable in jib-profiles.js
  • 20d7713: 8261334: NMT: tuning statistic shows incorrect hash distribution
  • 92c6e6d: 8261254: Initialize charset mapping data lazily
  • 351d788: 8259074: regex benchmarks and tests
  • ... and 113 more: https://git.openjdk.java.net/jdk/compare/ab727f0a584ffc12c3eb57ea0d6a01188224f9fc...master

Your commit was automatically rebased without conflicts.

Pushed as commit 5183d8a.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
