Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8285790: AArch64: Merge C2 NEON and SVE matching rules #9346

Closed
wants to merge 10 commits into from

Conversation

shqking
Copy link
Contributor

@shqking shqking commented Jul 1, 2022

MOTIVATION

This is a big refactoring patch of merging rules in aarch64_sve.ad and
aarch64_neon.ad. The motivation can also be found at [1].

Currently AArch64 has aarch64_sve.ad and aarch64_neon.ad to support SVE
and NEON codegen respectively. 1) For SVE rules we use vReg operand to
match VecA for an arbitrary length of vector type, when SVE is enabled;
2) For NEON rules we use vecX/vecD operands to match VecX/VecD for
128-bit/64-bit vectors, when SVE is not enabled.

This separation looked clean at the time of introducing SVE support.
However, there are two main drawbacks now.

Drawback-1: NEON (Advanced SIMD) is the mandatory feature on AArch64 and
SVE vector registers share the lower 128 bits with NEON registers. For
some cases, even when SVE is enabled, we still prefer to match NEON
rules and emit NEON instructions.

Drawback-2: With more and more vector rules added to support VectorAPI,
there are lots of rules in both two ad files with different predication
conditions, e.g., different values of UseSVE or vector type/size.

Examples can be found in [1]. These two drawbacks make the code less
maintainable and increase the libjvm.so code size.

KEY UPDATES

In this patch, we mainly do two things, using generic vReg to match all
NEON/SVE vector registers and merging NEON/SVE matching rules.

  • Update-1: Use generic vReg to match all NEON/SVE vector registers

Two different approaches were considered, and we prefer to use generic
vector solution but keep VecA operand for all >128-bit vectors. See the
last slide in [1]. All the changes lie in the AArch64 backend.

  1. Some helpers are updated in aarch64.ad to enable generic vector on
    AArch64. See vector_ideal_reg(), pd_specialize_generic_vector_operand(),
    is_reg2reg_move() and is_generic_vector().

  2. Operand vecA is created to match VecA register, and vReg is updated
    to match VecA/D/X registers dynamically.

With the introduction of generic vReg, difference in register types
between NEON rules and SVE rules can be eliminated, which makes it easy
to merge these rules.

  • Update-2: Try to merge existing rules

As updated in GensrcAdlc.gmk, new ad file, aarch64_vector.ad, is
introduced to hold the grouped and merged matching rules.

  1. Similar rules with difference in vector type/size can be merged into
    new rules, where different types and vector sizes are handled in the
    codegen part, e.g., vadd(). This resolves Drawback-2.

  2. In most cases, we tend to emit NEON instructions for 128-bit vector
    operations on SVE platforms, e.g., vadd(). This resolves Drawback-1.

It's important to note that there are some exceptions.

Exception-1: For some rules, there are no direct NEON instructions, but
exists simple SVE implementation due to newly added SVE ISA. Such rules
include vloadconB, vmulL_neon, vminL_neon, vmaxL_neon,
reduce_addF_le128b (4F case), reduce_and/or/xor_neon, reduce_minL_neon,
reduce_maxL_neon, vcvtLtoF_neon, vcvtDtoI_neon, rearrange_HS_neon.

Exception-2: Vector mask generation and operation rules are different
because vector mask is stored in different types of registers between
NEON and SVE, e.g., vmaskcmp_neon and vmask_truecount_neon rules.

Exception-3: Shift right related rules are different because vector
shift right instructions differ a bit between NEON and SVE.

For these exceptions, we emit NEON or SVE code simply based on UseSVE
options.

MINOR UPDATES and CODE REFACTORING

Since we've touched all lines of code during merging rules, we further
do more minor updates and refactoring.

  • Reduce regmask bits

Stack slot alignment is handled specially for scalable vector, which
will firstly align to SlotsPerVecA, and then align to the real vector
length. We should guarantee SlotsPerVecA is no bigger than the real
vector length. Otherwise, unused stack space would be allocated.

In AArch64 SVE, the vector length can be 128 to 2048 bits. However,
SlotsPerVecA is currently set as 8, i.e. 8 * 32 = 256 bits. As a result,
on a 128-bit SVE platform, the stack slot is aligned to 256 bits,
leaving the half 128 bits unused. In this patch, we reduce SlotsPerVecA
from 8 to 4.

See the updates in register_aarch64.hpp, regmask.hpp and aarch64.ad
(chunk1 and vectora_reg).

  • Refactor NEON/SVE vector op support check.

Merge NEON and SVE vector supported check into one single function. To
be consistent, SVE default size supported check now is relaxed from no
less than 64 bits to the same condition as NEON's min_vector_size(),
i.e. 4 bytes and 4/2 booleans are also supported. This should be fine,
as we assume at least we will emit NEON code for those small vectors,
with unified rules.

  • Some notes for new rules
  1. Since new rules are unique and it makes no sense to set different
    "ins_cost", we turn to use the default cost.

  2. By default we use "pipe_slow" for matching rules in aarch64_vector.ad
    now. Hence, many SIMD pipeline classes at aarch64.ad become unused and
    can be removed.

  3. Suffixes '_le128b/_gt128b' and '_neon/_sve' are appended in the
    matching rule names if needed.
    a) 'le128b' means the vector length is less than or equal to 128 bits.
    This rule can be matched on both NEON and 128-bit SVE.
    b) 'gt128b' means the vector length is greater than 128 bits. This rule
    can only be matched on SVE.
    c) 'neon' means this rule can only be matched on NEON, i.e. the
    generated instruction is not better than those in 128-bit SVE.
    d) 'sve' means this rule is only matched on SVE for all possible vector
    length, i.e. not limited to gt128b.

Note-1: m4 file is not introduced because many duplications are highly
reduced now.
Note-2: We guess the code review for this big patch would probably take
some time and we may need to merge latest code from master branch from
time to time. We prefer to keep aarch64_neon/sve.ad and the
corresponding m4 files for easy comparison and review. Of course, they
will be finally removed after some solid reviews before integration.
Note-3: Several other minor refactorings are done in this patch, but we
cannot list all of them in the commit message. We have reviewed and
tested the rules carefully to guarantee the quality.

TESTING

  1. Cross compilations on arm32/s390/pps/riscv passed.
  2. tier1~3 jtreg passed on both x64 and aarch64 machines.
  3. vector tests: all the test cases under the following directories can
    pass on both NEON and SVE systems with max vector length 16/32/64 bytes.
  "test/hotspot/jtreg/compiler/vectorapi/"
  "test/jdk/jdk/incubator/vector/"
  "test/hotspot/jtreg/compiler/vectorization/"
  1. Performance evaluation: we choose vector micro-benchmarks from
    panama-vector:vectorIntrinsics [2] to evaluate the performance of this
    patch. We've tested *MaxVectorTests.java cases on one 128-bit SVE
    platform and one NEON platform, and didn't see any visiable regression
    with NEON and SVE. We will continue to verify more cases on other
    platforms with NEON and different SVE vector sizes.

BENEFITS

The number of matching rules is reduced to ~ 42%.
before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753
after : 313 (aarch64_vector.ad)

Code size for libjvm.so (release build) on aarch64 is reduced to ~ 96%.
before: 25246528 B (commit 7905788)
after : 24208776 B (nearly 1 MB reduction)

[1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf
[2] https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation

Co-Developed-by: Ningsheng Jian Ningsheng.Jian@arm.com
Co-Developed-by: Eric Liu eric.c.liu@arm.com


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8285790: AArch64: Merge C2 NEON and SVE matching rules

Reviewers

Contributors

  • Ningsheng Jian <njian@openjdk.org>
  • Eric Liu <eliu@openjdk.org>

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/9346/head:pull/9346
$ git checkout pull/9346

Update a local copy of the PR:
$ git checkout pull/9346
$ git pull https://git.openjdk.org/jdk pull/9346/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 9346

View PR using the GUI difftool:
$ git pr show -t 9346

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/9346.diff

MOTIVATION

This is a big refactoring patch of merging rules in aarch64_sve.ad and
aarch64_neon.ad. The motivation can also be found at [1].

Currently AArch64 has aarch64_sve.ad and aarch64_neon.ad to support SVE
and NEON codegen respectively. 1) For SVE rules we use vReg operand to
match VecA for an arbitrary length of vector type, when SVE is enabled;
2) For NEON rules we use vecX/vecD operands to match VecX/VecD for
128-bit/64-bit vectors, when SVE is not enabled.

This separation looked clean at the time of introducing SVE support.
However, there are two main drawbacks now.

Drawback-1: NEON (Advanced SIMD) is the mandatory feature on AArch64 and
SVE vector registers share the lower 128 bits with NEON registers. For
some cases, even when SVE is enabled, we still prefer to match NEON
rules and emit NEON instructions.

Drawback-2: With more and more vector rules added to support VectorAPI,
there are lots of rules in both two ad files with different predication
conditions, e.g., different values of UseSVE or vector type/size.

Examples can be found in [1]. These two drawbacks make the code less
maintainable and increase the libjvm.so code size.

KEY UPDATES

In this patch, we mainly do two things, using generic vReg to match all
NEON/SVE vector registers and merging NEON/SVE matching rules.

Update-1: Use generic vReg to match all NEON/SVE vector registers

Two different approaches were considered, and we prefer to use generic
vector solution but keep VecA operand for all >128-bit vectors. See the
last slide in [1]. All the changes lie in the AArch64 backend.

1) Some helpers are updated in aarch64.ad to enable generic vector on
AArch64. See vector_ideal_reg(), pd_specialize_generic_vector_operand(),
is_reg2reg_move() and is_generic_vector().

2) Operand vecA is created to match VecA register, and vReg is updated
to match VecA/D/X registers dynamically.

With the introduction of generic vReg, difference in register types
between NEON rules and SVE rules can be eliminated, which makes it easy
to merge these rules.

Update-2: Try to merge existing rules

As updated in GensrcAdlc.gmk, new ad file, aarch64_vector.ad, is
introduced to hold the grouped and merged matching rules.

1) Similar rules with difference in vector type/size can be merged into
new rules, where different types and vector sizes are handled in the
codegen part, e.g., vadd(). This resolves Drawback-2.

2) In most cases, we tend to emit NEON instructions for 128-bit vector
operations on SVE platforms, e.g., vadd(). This resolves Drawback-1.

It's important to note that there are some exceptions.

Exception-1: For some rules, there are no direct NEON instructions, but
exists simple SVE implementation due to newly added SVE ISA. Such rules
include vloadconB, vmulL_neon, vminL_neon, vmaxL_neon,
reduce_addF_le128b (4F case), reduce_and/or/xor_neon, reduce_minL_neon,
reduce_maxL_neon, vcvtLtoF_neon, vcvtDtoI_neon, rearrange_HS_neon.

Exception-2: Vector mask generation and operation rules are different
because vector mask is stored in different types of registers between
NEON and SVE, e.g., vmaskcmp_neon and vmask_truecount_neon rules.

Exception-3: Shift right related rules are different because vector
shift right instructions differ a bit between NEON and SVE.

For these exceptions, we emit NEON or SVE code simply based on UseSVE
options.

MINOR UPDATES and CODE REFACTORING

Since we've touched all lines of code during merging rules, we further
do more minor updates and refactoring.

1. Reduce regmask bits

Stack slot alignment is handled specially for scalable vector, which
will firstly align to SlotsPerVecA, and then align to the real vector
length. We should guarantee SlotsPerVecA is no bigger than the real
vector length. Otherwise, unused stack space would be allocated.

In AArch64 SVE, the vector length can be 128 to 2048 bits. However,
SlotsPerVecA is currently set as 8, i.e. 8 * 32 = 256 bits. As a result,
on a 128-bit SVE platform, the stack slot is aligned to 256 bits,
leaving the half 128 bits unused. In this patch, we reduce SlotsPerVecA
from 8 to 4.

See the updates in register_aarch64.hpp, regmask.hpp and aarch64.ad
(chunk1 and vectora_reg).

2. Refactor NEON/SVE vector op support check.

Merge NEON and SVE vector supported check into one single function. To
be consistent, SVE default size supported check now is relaxed from no
less than 64 bits to the same condition as NEON's min_vector_size(),
i.e. 4 bytes and 4/2 booleans are also supported. This should be fine,
as we assume at least we will emit NEON code for those small vectors,
with unified rules.

3. Some notes for new rules

1) Since new rules are unique and it makes no sense to set different
"ins_cost", we turn to use the default cost.

2) By default we use "pipe_slow" for matching rules in aarch64_vector.ad
now. Hence, many SIMD pipeline classes at aarch64.ad become unused and
can be removed.

3) Suffixes '_le128b/_gt128b' and '_neon/_sve' are appended in the
matching rule names if needed.
a) 'le128b' means the vector length is less than or equal to 128 bits.
This rule can be matched on both NEON and 128-bit SVE.
b) 'gt128b' means the vector length is greater than 128 bits. This rule
can only be matched on SVE.
c) 'neon' means this rule can only be matched on NEON, i.e. the
generated instruction is not better than those in 128-bit SVE.
d) 'sve' means this rule is only matched on SVE for all possible vector
length, i.e. not limited to gt128b.

Note-1: m4 file is not introduced because many duplications are highly
reduced now.
Note-2: We guess the code review for this big patch would probably take
some time and we may need to merge latest code from master branch from
time to time. We prefer to keep aarch64_neon/sve.ad and the
corresponding m4 files for easy comparison and review. Of course, they
will be finally removed after some solid reviews before integration.
Note-3: Several other minor refactorings are done in this patch, but we
cannot list all of them in the commit message. We have reviewed and
tested the rules carefully to guarantee the quality.

TESTING

1) Cross compilations on arm32/s390/pps/riscv passed.
2) tier1~3 jtreg passed on both x64 and aarch64 machines.
3) vector tests: all the test cases under the following directories can
pass on both NEON and SVE systems with max vector length 16/32/64 bytes.

  "test/hotspot/jtreg/compiler/vectorapi/"
  "test/jdk/jdk/incubator/vector/"
  "test/hotspot/jtreg/compiler/vectorization/"

4) Performance evaluation: we choose vector micro-benchmarks from
panama-vector:vectorIntrinsics [2] to evaluate the performance of this
patch. We've tested *MaxVectorTests.java cases on one 128-bit SVE
platform and one NEON platform, and didn't see any visiable regression
with NEON and SVE. We will continue to verify more cases on other
platforms with NEON and different SVE vector sizes.

BENEFITS

The number of matching rules is reduced to ~ 42%.
  before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753
  after : 313(aarch64_vector.ad)

Code size for libjvm.so (release build) on aarch64 is reduced to ~ 96%.
  before: 25246528 B (commit 7905788)
  after : 24208776 B (nearly 1 MB reduction)

[1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf
[2] https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation

Co-Developed-by: Ningsheng Jian <Ningsheng.Jian@arm.com>
Co-Developed-by: Eric Liu <eric.c.liu@arm.com>
@bridgekeeper
Copy link

bridgekeeper bot commented Jul 1, 2022

👋 Welcome back haosun! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@shqking
Copy link
Contributor Author

shqking commented Jul 1, 2022

/contributor add nsjian
/contributor add theRealELiu

@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 1, 2022
@openjdk
Copy link

openjdk bot commented Jul 1, 2022

@shqking nsjian was not found in the census.

Syntax: /contributor (add|remove) [@user | openjdk-user | Full Name <email@address>]. For example:

  • /contributor add @openjdk-bot
  • /contributor add duke
  • /contributor add J. Duke <duke@openjdk.org>

User names can only be used for users in the census associated with this repository. For other contributors you need to supply the full name and email address.

@openjdk
Copy link

openjdk bot commented Jul 1, 2022

@shqking theRealELiu was not found in the census.

Syntax: /contributor (add|remove) [@user | openjdk-user | Full Name <email@address>]. For example:

  • /contributor add @openjdk-bot
  • /contributor add duke
  • /contributor add J. Duke <duke@openjdk.org>

User names can only be used for users in the census associated with this repository. For other contributors you need to supply the full name and email address.

@openjdk
Copy link

openjdk bot commented Jul 1, 2022

@shqking The following labels will be automatically applied to this pull request:

  • build
  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added build build-dev@openjdk.org hotspot-compiler hotspot-compiler-dev@openjdk.org labels Jul 1, 2022
@mlbridge
Copy link

mlbridge bot commented Jul 1, 2022

Webrevs

@shqking
Copy link
Contributor Author

shqking commented Jul 1, 2022

/contributor add njian
/contributor add eliu

@openjdk
Copy link

openjdk bot commented Jul 1, 2022

@shqking
Contributor Ningsheng Jian <njian@openjdk.org> successfully added.

@openjdk
Copy link

openjdk bot commented Jul 1, 2022

@shqking
Contributor Eric Liu <eliu@openjdk.org> successfully added.

@theRealAph
Copy link
Contributor

theRealAph commented Jul 1, 2022 via email

@shqking
Copy link
Contributor Author

shqking commented Jul 4, 2022

We got one GHA test failure on linux-x86/tier1, with the following error log. See the link.
Error: Unable to find an artifact with the name: bundles-linux-x86

I suppose it's not related to our patch. Besides, we have tested tier1~3 on linux x64/aarch64 locally.

@shqking
Copy link
Contributor Author

shqking commented Jul 4, 2022

@theRealAph Thanks for your comment.
Yes. There are still duplicate code. I can easily list several ones, such as the reduce-and/or/xor, vector shift ops and several reg with imm rules. We're open to keep m4 file.

But I would suggest that we may put our attention firstly on 1) our implementation on generic vector registers and 2) the merged rules (in particular those we share the codegen for NEON only platform and 128-bit vector ops on SVE platform).
After that we may discuss whether to use m4 file and how to implement it if needed.

@theRealAph
Copy link
Contributor

@theRealAph Thanks for your comment. Yes. There are still duplicate code. I can easily list several ones, such as the reduce-and/or/xor, vector shift ops and several reg with imm rules. We're open to keep m4 file.

But I would suggest that we may put our attention firstly on 1) our implementation on generic vector registers and 2) the merged rules (in particular those we share the codegen for NEON only platform and 128-bit vector ops on SVE platform). After that we may discuss whether to use m4 file and how to implement it if needed.

We can do both: there's no sense in which one excludes the other, and we have time.

However, just putting aside for a moment the lack of useful abstraction mechanisms, I note that there's a lot of code like this:

    if (length_in_bytes <= 16) {
      // ... Neon
    } else {
      assert(UseSVE > 0, "must be sve");
      // ... SVE
    }

which is to say, there's an implicit assumption that if an operation can be done with Neon it will be, and SVE will only be used if not. What is the justification for that assumption?

@shqking
Copy link
Contributor Author

shqking commented Jul 5, 2022

However, just putting aside for a moment the lack of useful abstraction mechanisms, I note that there's a lot of code like this:

    if (length_in_bytes <= 16) {
      // ... Neon
    } else {
      assert(UseSVE > 0, "must be sve");
      // ... SVE
    }

which is to say, there's an implicit assumption that if an operation can be done with Neon it will be, and SVE will only be used if not. What is the justification for that assumption?

Not exactly.
It's only for common 64/128-bit unpredicated vector operations, when NEON have equivalent instructions as SVE.

Recall the Drawback-1 and Update-2 (part 2) in the commit message.

Besides the code pattern you mentioned, there are many pairs of rules with "_le128b" and "_gt128b" suffixes, e.g., vmulI_le128b() and vmulI_gt128b(). We use two rules mainly because different numbers of arguments are used. Otherwise, we tend to put them into one rule, which is your mentioned pattern, e.g., vadd().

The main reason we conduct this change lies in that from Neoverse V1 and N2 optimization guides, if the size fit, common NEON instructions are no slower than equivalent SVE instructions in latency and throughput.

Note-1: In current aarch64_sve.ad file, there are already several rules under this rule, e.g., loadV16_vreg(), vroundFtoI(), insertI_le128bits(). There is an ongoing patch as well in link. This patch makes them more clear.
Note-2: As we mentioned in the part 4 in TESTING section, we ran JMH testing on one SVE machine and didn't observe regression and we will do more measurement on different systems.

@magicus
Copy link
Member

magicus commented Jul 5, 2022

/label -build

@openjdk openjdk bot removed the build build-dev@openjdk.org label Jul 5, 2022
@openjdk
Copy link

openjdk bot commented Jul 5, 2022

@magicus
The build label was successfully removed.

Add VM_Version flag use_neon_for_vector() to control whether to generate
NEON instructions for 128-bit vector operations.

Currently only vector length is checked inside and it returns true for
existing SVE cores. More specific things might be checked in the near
future, e.g., the basic data type or SVE CPU model.

Besides, new macro assembler helpers neon_vector_extend/narrow() are
introduced to make the code clean.

Note: AddReductionVF/D rules are updated so that SVE instructions are
generated for 64/128-bit vector operations, because floating point
reduction add instructions are supported directly in SVE.
Add the corresponding M4 file
@shqking
Copy link
Contributor Author

shqking commented Jul 22, 2022

Hi @theRealAph , three commits are uploaded. Could you help take a look at them when you have spare time? Thanks.

commit-1: merge the master branch as of 7th-July. Of course, more merges would be done during the follow-up review process.

commit-2: add one VM_Version flag to control whether to generate NEON instructions for 64/128-bit vector operations on SVE.

commit-3: add the m4 file. We tried to make as many abstractions as possible in the m4 file.
Before m4 is introduced, aarch64_vector.ad file is ~5k LOC. And now with this commit, we use ~4k LOC aarch64_vector_ad.m4 file, i.e. only ~20% reduction.
I personally think the reduction is not that big, compared to the reductions between aarch64_neon/sve_ad.m4 and aarch64_neon/sve.ad.

dnl
dnl REDUCE_BITWISE_OP_NEON($1, $2 $3 $4 )
dnl REDUCE_BITWISE_OP_NEON(insn_name, is_long, type, op_name)
define(`REDUCE_BITWIESE_OP_NEON', `
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

REDUCE_BITWIESE_OP_NEON doesn't look right,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out! Updated here and other similar macros.

//

dnl Generate the warning
// This file is automatically generated by running "m4 aarch64_vector_ad.m4". Do not edit!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had to put a message like this one around every method in aarch64.ad or people edited them by hand. Maybe that won't be a problem in this file because it's an entire file, not parts of one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Different from aarch64.ad file, the scope of this message is the whole file, not some sections inside the file. Hence, I think current version is fine.

case Op_MulReductionVF:
case Op_MulReductionVI:
case Op_MulReductionVL:
// No multiply reduction instructions, and we emit scalar
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is a little unclear. Is this what you actually mean?

"No vector multiply reduction instructions, but we do emit scalar instructions for 64/128-bit vectors"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. You're right. Updated in the new revision. Thanks.

// DOUBLE
EXTRACT_FP(D, fmovd, 2, D, 3)

// ------------------------------ Vector mask loat/store -----------------------
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be "load/store"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's a typo. Updated in the new revision.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Aug 3, 2022
@nsjian
Copy link

nsjian commented Aug 4, 2022

@nsjian Hi Ningsheng! Thanks for the summary of testing status. I am happy with the current patch and tests.

I agree that both your TODOs need to be followed up before integration.

I would also ask that you obtain approval from @theRealAph before integrating.

Thank you for the review, Andrew!

@theRealAph
Copy link
Contributor

I just pulled this patch and src/hotspot/cpu/aarch64/aarch64_neon.ad etc. seem still to be there.

@nsjian
Copy link

nsjian commented Aug 4, 2022

I just pulled this patch and src/hotspot/cpu/aarch64/aarch64_neon.ad etc. seem still to be there.

Yes, @shqking left those files there during the review just for easy compare and rebase/merge. In his PR comment:

Note-2: We guess the code review for this big patch would probably take
some time and we may need to merge latest code from master branch from
time to time. We prefer to keep aarch64_neon/sve.ad and the
corresponding m4 files for easy comparison and review. Of course, they
will be finally removed after some solid reviews before integration.

Thanks for the reminder! It seems that it's time to have a new merge and remove those files, as we've got some solid reviews.

@adinn
Copy link
Contributor

adinn commented Aug 4, 2022

@shqking @nsjian I talked to the @theRealAph and he wanted to see if there were any test results from Oracle's test suite before reviewing.

Also, @theRealAph noticed that the patch foes not delete the old sve/neon.ad files. That needs to be included before it is integrated.

@nsjian
Copy link

nsjian commented Aug 4, 2022

@shqking @nsjian I talked to the @theRealAph and he wanted to see if there were any test results from Oracle's test suite before reviewing.

Also, @theRealAph noticed that the patch foes not delete the old sve/neon.ad files. That needs to be included before it is integrated.

OK, we will do a rebase and remove the files first. Thanks!

@nsjian
Copy link

nsjian commented Aug 5, 2022

@shqking @nsjian I talked to the @theRealAph and he wanted to see if there were any test results from Oracle's test suite before reviewing.
Also, @theRealAph noticed that the patch foes not delete the old sve/neon.ad files. That needs to be included before it is integrated.

OK, we will do a rebase and remove the files first. Thanks!

I have deleted those old ad/m4 files and done a merge.

Can someone from Oracle please help to run tests in Oracle CI? Perhaps @TobiHartmann or @vnkozlov ? Thanks!

@TobiHartmann
Copy link
Member

I submitted testing and will report back once it finished.

@TobiHartmann
Copy link
Member

All tests passed.

@nsjian
Copy link

nsjian commented Aug 10, 2022

All tests passed.

Thank you, Tobias!

@nsjian
Copy link

nsjian commented Aug 10, 2022

Thanks for the review @theRealAph. We will run performance tests on more hardware before integration.

Resolve the conflict in file c2_MacroAssembler_aarch64.hpp.
Copy link

@XiaohongGong XiaohongGong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@shqking
Copy link
Contributor Author

shqking commented Aug 17, 2022

c) JMH testing
We choose vector micro-benchmarks from panama-vector:vectorIntrinsics [1] to evaluate the performance of this
patch. We've tested *MaxVectorTests.java cases on one 128-bit SVE platform and one NEON platform, and didn't
see any visible regression with NEON and SVE.

TODOs, before integration:
a). If possible, we will ask Oracle engineers to help run tests with this patch which may include different tests.
b). We will continue to conduct more performance testings with NEON and different SVE vector sizes on different hardware.

[1] https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation

We further ran more performance testing with the latest code:

  1. *MaxVectorTests.java cases on one 128-bit SVE platform (with UseSVE=0 and UseSVE=1)
  2. *MaxVectorTests.java cases on one NEON platform (with MaxVectorSize=8 and MaxVectorSize=16)
  3. *128VectorTests.java and *256VectorTests.java on one 256-bit SVE platform (with UseSVE=0 and UseSVE=1)

The test result is that we didn't see obvious performance regressions.

Thanks @TobiHartmann for his testing in Oracle CI.
I suppose this refactoring patch is ready for integration now.
/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Aug 17, 2022
@openjdk
Copy link

openjdk bot commented Aug 17, 2022

@shqking
Your change (at version e695cf3) is now ready to be sponsored by a Committer.

@nsjian
Copy link

nsjian commented Aug 17, 2022

/sponsor

@openjdk
Copy link

openjdk bot commented Aug 17, 2022

Going to push as commit 0cc66ae.
Since your change was applied there have been 55 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Aug 17, 2022
@openjdk openjdk bot closed this Aug 17, 2022
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Aug 17, 2022
@openjdk
Copy link

openjdk bot commented Aug 17, 2022

@nsjian @shqking Pushed as commit 0cc66ae.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@shqking shqking deleted the 8285790-merge-rules branch August 17, 2022 05:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-compiler hotspot-compiler-dev@openjdk.org integrated Pull request has been integrated
7 participants