Skip to content

Conversation

@vamsi-parasa
Copy link
Contributor

@vamsi-parasa vamsi-parasa commented Feb 6, 2025

The goal of this PR is to generate x86 code using Intel Advanced Performance Extensions (APX) instructions which doubles the number of general-purpose registers, from 16 to 32. Intel APX adds nondestructive destination (NDD) and no flags (NF) flavor for the scalar instructions through EVEX encoding.

For more information about APX, see https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-performance-extensions-apx.html.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8349582: APX NDD code generation for OpenJDK (Sub-task - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/23501/head:pull/23501
$ git checkout pull/23501

Update a local copy of the PR:
$ git checkout pull/23501
$ git pull https://git.openjdk.org/jdk.git pull/23501/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 23501

View PR using the GUI difftool:
$ git pr show -t 23501

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/23501.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Feb 6, 2025

👋 Welcome back sparasa! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Feb 6, 2025

@vamsi-parasa This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8349582: APX NDD code generation for OpenJDK

Reviewed-by: epeter, jbhateja, sviswanathan

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 110 new commits pushed to the master branch:

  • 7d1fe0e: 8339543: [vectorapi] laneHelper and withLaneHelper should be ForceInline
  • ddc4318: 8352628: Refine Grapheme test
  • d8c2f59: 8352064: AIX: now also able to build static-jdk image with a statically linked launcher
  • 5709f79: 8352648: JFR: 'jfr query' should not be available in product builds
  • 02a4ce2: 8352147: G1: TestEagerReclaimHumongousRegionsClearMarkBits test takes very long
  • de58009: 8351468: C2: array fill optimization assigns wrong type to intrinsic call
  • a875733: 8352486: [ubsan] compilationMemoryStatistic.cpp:659:21: runtime error: index 64 out of bounds for type const struct unnamed struct
  • 5591f8a: 8351515: C2 incorrectly removes double negation for double and float
  • 56a4ffa: 8352597: [IR Framework] test bug: TestNotCompilable.java fails on product build
  • e23e0f8: 8352591: Missing UnlockDiagnosticVMOptions in VerifyGraphEdgesWithDeadCodeCheckFromSafepoints test
  • ... and 100 more: https://git.openjdk.org/jdk/compare/37ec796255ae857588a5c7e0d572407dd81cbec9...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@jatin-bhateja, @sviswa7, @eme64) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot added the rfr Pull request is ready for review label Feb 6, 2025
@openjdk
Copy link

openjdk bot commented Feb 6, 2025

@vamsi-parasa The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Feb 6, 2025
@mlbridge
Copy link

mlbridge bot commented Feb 6, 2025

Copy link
Member

@jatin-bhateja jatin-bhateja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial comments.

format %{ "eaddq $dst, $src1, $src2\t# long ndd" %}
ins_encode %{
__ eaddq($dst$$Register, $src1$$Register, $src2$$Register, false);
%}
Copy link
Member

@jatin-bhateja jatin-bhateja Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your current scheme favors the emission of NDD instruction on APX-enabled targets, even if destination and source registers belong legacy GPR set. We should extend the assembler layer to demote EEVEX to REX encoding if dst matches with source operands.

FYI, the processor backend also uses Move Elimination to prevent the dispatch of GPR to GPR moves to execution ports.

This can be used to further break NDD patterns with all different legacy register operands, GPR to GPR moves should consume no more than 3 bytes encoding 1 byte for REX prefix, 1 for opcode, and 1 for ModRM byte, still less than 4 byte prefix.

By biasing the allocation of definition towards non-intersecting source operands, we can enhance the chances of EEVEX to REX demotions.

Comment on lines 7342 to 7359
predicate(UseAPX);
match(Set dst (AddL src1 src2));
effect(KILL cr);

format %{ "eaddq $dst, $src1, $src2\t# long ndd" %}
ins_encode %{
__ eaddq($dst$$Register, $src1$$Register, $src2$$constant, false);
%}
ins_pipe( ialu_reg );
%}

instruct addL_rReg_mem_imm_ndd(rRegL dst, memory src1, immL32 src2, rFlagsReg cr)
%{
predicate(UseAPX);
match(Set dst (AddL (LoadL src1) src2));
effect(KILL cr);

format %{ "eaddq $dst, $src1, $src2\t# long ndd" %}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you kindly share the impact of adding all these new patterns on libjvm.so size ?

I am curious to know the amount of ADLC generated code for all these new patterns.

Comment on lines 1550 to 1555
int reg = ra_->get_encode(this);
if (reg > 15) {
return (offset < 0x80) ? 6 : 9; // REX2
} else {
return (offset < 0x80) ? 5 : 8; // REX
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move this out to a separate patch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please create the PR as you identified and fixed this issue? I will remove this code block once your updated BoxLock node fix is integrated into the mainline.

match(Set dst (CountTrailingZerosL (LoadL src)));
effect(KILL cr);

ins_cost(175);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please let me know why you're using ins_cost here and not in above pattern?

"eshrl $dst, $mem, markWord::klass_shift_at_offset\t# compressed klass ptr, shifted (ndd)\n\t"
%}
ins_encode %{
__ eshrl($dst$$Register, $mem$$Address, markWord::klass_shift_at_offset, false);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change could be done as part of loadNKlassCompactHeaders instruct itself as there is no additional register needed.
Something like below:
if (UseAPX_ {
__ eshrl($dst$$Register, $mem$$Address, markWord::klass_shift_at_offset, false);
} else {
__ movl($dst$$Register, $mem$$Address);
__ shrl($dst$$Register, markWord::klass_shift_at_offset);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see this suggestion incorporated in the updated code.

Comment on lines 5563 to 5573
instruct countLeadingZerosI_nf(rRegI dst, rRegI src) %{
predicate(UseAPX && UseCountLeadingZerosInstruction);
match(Set dst (CountLeadingZerosI src));

format %{ "elzcntl $dst, $src\t# count leading zeros (int nf)" %}
ins_encode %{
__ elzcntl($dst$$Register, $src$$Register, true);
%}
ins_pipe(ialu_reg);
%}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This instruct could be removed as this is already an unary operation with separate destination, Likewise other unary operator instructs could also be removed where the destination is already separate from source.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ndd instructions epopcnt, elzcnt and etzcnt were removed as they have a separate destination. Please see the update code.

@@ -5551,6 +5583,17 @@ instruct countLeadingZerosI_mem(rRegI dst, memory src, rFlagsReg cr) %{
ins_pipe(ialu_reg_mem);
%}

instruct countLeadingZerosI_mem_nf(rRegI dst, memory src) %{
predicate(UseAPX && UseCountLeadingZerosInstruction);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This instruct could be removed as this is already an unary operation with separate destination,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see this suggestion incorporated in the updated code.

# additional tests with rax as destination
if RegOp in [RegRegImmNddInstruction]:
test_reg1 = 'rax'
test_reg2 = random.choice(test_regs)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to generate another test_reg2, we could use the same from above. Thereby, we will not modify the existing tests and only have the new test instructions added in asmtest.out.h.

ins_pipe(ialu_reg);
%}


Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nit pick, unnecessary extra blank lines :).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see the additional blank lines removed in the updated code.

%}


instruct cmovI_regUCF2_ne(cmpOpUCF2 cop, rFlagsRegUCF cr, rRegI dst, rRegI src) %{
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cmovI_regUCF2_ne, cmovl_regUCF2_eq, cmovP_regUCF2_ne, cmovP_regUCF2_eq, cmovL_regUCF2_ne, cmovL_regUCF2_eq instructs could also use the ecmovl() instructions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see the added NDD versions for cmovI_regUCF2_ne, cmovl_regUCF2_eq, cmovP_regUCF2_ne, cmovP_regUCF2_eq, cmovL_regUCF2_ne, cmovL_regUCF2_eq.

%{
predicate(UseAPX);
match(Set dst (AddI src1 src2));
effect(KILL cr);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also bring in the corresponding flag(PD::...); line from instruct addI_rReg in this and other rules where applicable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see the updated code with flags(PD::...)

ins_pipe(ialu_reg);
%}

instruct negI_rReg_ndd(rRegI src, rRegI dst, immI_0 zero, rFlagsReg cr)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nit pick in many of the new negI/negL instructs, we usually list the dst first in instruct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see the order of dst and src fixed.

%}

// Arithmetic Shift Right by variable
instruct sarI_rReg_CL_ndd(rRegI dst, rRegI src, rcx_RegI shift, rFlagsReg cr)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new instructs sarI_rReg_CL_ndd, shrI_rReg_CL_ndd, salL_rReg_CL_ndd, sarL_rReg_CL_ndd, shrL_rReg_CL_ndd could be removed and the original !bmi2 versions could be kept. We dont need to optimize with APX instructions for non bmi2 platforms.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see the updated code which removed NDD support for non bmi2 platforms.

ins_pipe(ialu_reg_mem);
%}

instruct andL_rReg_rReg_mem_ndd(rRegL dst, rRegL src1, memory src2, rFlagsReg cr)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks to me that we only need one of andL_rReg_rReg_mem_ndd & andL_rReg_mem_rReg_ndd as and is a commutative operator. Likewise for other commutative operators like or, add, xor, mul, etc.

// Or Instructions
// Or Register with Register
instruct orI_rReg(rRegI dst, rRegI src, rFlagsReg cr)
instruct orI_rReg_imm_rReg_ndd(rRegI dst, immI src1, rRegI src2, rFlagsReg cr)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks to me that we only need one of orI_rReg_rReg_imm_ndd or orI_rReg_imm_rReg_ndd as orI is a commutative operator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After doing a quick test, it was noticed that both the rules (orI_rReg_rReg_imm and orI_rReg_imm_rReg) are needed to generate NDD instruction depending on the position of the immediate value.

Comment on lines 9742 to 9744
instruct rorI_immI8_legacy(rRegI dst, immI8 shift, rFlagsReg cr)
%{
predicate(!VM_Version::supports_bmi2() && n->bottom_type()->basic_type() == T_INT);
predicate(!UseAPX && !VM_Version::supports_bmi2() && n->bottom_type()->basic_type() == T_INT);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need to change anything for non bmi2 platforms. The original predicate can be kept as is. This applies to all rorI, rolI, rorL, rolL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see the updated code removing rorI, rolI, rorL, rolL for non bmi2 platforms.

Copy link

@sviswa7 sviswa7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Mar 12, 2025
@vamsi-parasa
Copy link
Contributor Author

Hi Vladimir (@vnkozlov),

Could you please review this PR?

Thanks,
Vamsi

Copy link
Contributor

@eme64 eme64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a quick look over this. It's a bit hard to review for me, because it is basically about specific APX instructions. We probably have to heavily rely on testing. But APX hardware is not yet available, right?

How can be best test this? Is there any way to emulate, maybe using SDE? What testing did you run for this?

predicate(UseAPX);
match(Set dst (CMoveI (Binary cop cr) (Binary src1 src2)));

ins_cost(200); // XXX
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the XXX stand for?


// Conditional move
instruct cmovI_rReg_rReg_mem_ndd(rRegI dst, cmpOp cop, rFlagsReg cr, rRegI src1, memory src2)
%{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this bracket usually not on the previous line?

@eme64
Copy link
Contributor

eme64 commented Mar 13, 2025

@vamsi-parasa I tried to launch testing, but my script fails because of some merge issue. Would you mind merging from master?

@eme64
Copy link
Contributor

eme64 commented Mar 24, 2025

@vamsi-parasa Testing looks good / no related test failures.

registers_mapping = {
# skip rax, rsi, rdi, rsp, rbp as they have special encodings
# 'rax': {64: 'rax', 32: 'eax', 16: 'ax', 8: 'al'},
'rax': {64: 'rax', 32: 'eax', 16: 'ax', 8: 'al'},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should probably update the copyright year, right?

Also: is this script ever run in testing? I didn't know we had python files in the repository 😅

Copy link
Contributor

@eme64 eme64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot really review he content, nor can I really test the APX features.
But I also cannot directly see any issues with it.

Therefore, I'll approve.
Once the hardware is available, we will probably discover some issues and have to fix them then.

Thanks @vamsi-parasa for the work!

@vamsi-parasa
Copy link
Contributor Author

@vamsi-parasa Testing looks good / no related test failures.

Thank you very much Emanuel! (@eme64) :)

@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Mar 24, 2025
@vamsi-parasa
Copy link
Contributor Author

I cannot really review he content, nor can I really test the APX features. But I also cannot directly see any issues with it.

Therefore, I'll approve. Once the hardware is available, we will probably discover some issues and have to fix them then.

Thanks @vamsi-parasa for the work!

Thank You Emanuel(@eme64) for doing the review, testing and approving the PR!

@vamsi-parasa
Copy link
Contributor Author

/integrate

@openjdk
Copy link

openjdk bot commented Mar 24, 2025

@vamsi-parasa This pull request has not yet been marked as ready for integration.

Co-authored-by: Emanuel Peter <emanuel.peter@oracle.com>
Copy link
Contributor

@eme64 eme64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-approved :)

@vamsi-parasa
Copy link
Contributor Author

Re-approved :)

Thank you, Emanuel! :)

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Mar 24, 2025
@sviswa7
Copy link

sviswa7 commented Mar 24, 2025

/sponsor

@openjdk
Copy link

openjdk bot commented Mar 24, 2025

@sviswa7 The change author (@vamsi-parasa) must issue an integrate command before the integration can be sponsored.

@vamsi-parasa
Copy link
Contributor Author

/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Mar 24, 2025
@openjdk
Copy link

openjdk bot commented Mar 24, 2025

@vamsi-parasa
Your change (at version 06c52ce) is now ready to be sponsored by a Committer.

@sviswa7
Copy link

sviswa7 commented Mar 24, 2025

/sponsor

@openjdk
Copy link

openjdk bot commented Mar 24, 2025

Going to push as commit c87e1be.
Since your change was applied there have been 110 commits pushed to the master branch:

  • 7d1fe0e: 8339543: [vectorapi] laneHelper and withLaneHelper should be ForceInline
  • ddc4318: 8352628: Refine Grapheme test
  • d8c2f59: 8352064: AIX: now also able to build static-jdk image with a statically linked launcher
  • 5709f79: 8352648: JFR: 'jfr query' should not be available in product builds
  • 02a4ce2: 8352147: G1: TestEagerReclaimHumongousRegionsClearMarkBits test takes very long
  • de58009: 8351468: C2: array fill optimization assigns wrong type to intrinsic call
  • a875733: 8352486: [ubsan] compilationMemoryStatistic.cpp:659:21: runtime error: index 64 out of bounds for type const struct unnamed struct
  • 5591f8a: 8351515: C2 incorrectly removes double negation for double and float
  • 56a4ffa: 8352597: [IR Framework] test bug: TestNotCompilable.java fails on product build
  • e23e0f8: 8352591: Missing UnlockDiagnosticVMOptions in VerifyGraphEdgesWithDeadCodeCheckFromSafepoints test
  • ... and 100 more: https://git.openjdk.org/jdk/compare/37ec796255ae857588a5c7e0d572407dd81cbec9...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Mar 24, 2025
@openjdk openjdk bot closed this Mar 24, 2025
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Mar 24, 2025
@openjdk
Copy link

openjdk bot commented Mar 24, 2025

@sviswa7 @vamsi-parasa Pushed as commit c87e1be.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@TobiHartmann
Copy link
Member

This caused a regression: JDK-8356281 @vamsi-parasa Could you please have a look?

@vamsi-parasa
Copy link
Contributor Author

This caused a regression: JDK-8356281 @vamsi-parasa Could you please have a look?

Hi Tobias, thank you for letting us know. I have been working on a fix for this issue over the last few days. The root cause has been figured out and it will be addressed soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

hotspot-compiler hotspot-compiler-dev@openjdk.org integrated Pull request has been integrated

Development

Successfully merging this pull request may close these issues.

5 participants