Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8323497: On x64, use 32-bit immediate moves for narrow klass base if possible #17340

Conversation

tstuefe
Copy link
Member

@tstuefe tstuefe commented Jan 10, 2024

On x64, we always use the long form of mov immediate to load the klass base into a register. If the klass base fits into 32 bits, we could use the short form and save four instruction bytes.

Before: mov uses 10 instruction bytes:

   35  ;; decode_klass_not_null
   36   0x00007f8b089e51c4:   movabs $0x82000000,%r11
   37   0x00007f8b089e51ce:   add    %r11,%r10

Now: mov uses 6 instruction bytes:

   35  ;; decode_klass_not_null
   36   0x00007fbe609e51c4:   mov    $0x82000000,%r11d
   37   0x00007fbe609e51ca:   add    %r11,%r10

Note that this optimization does not depend on zero-based addressing, and therefore we change class space reservation: we now always look in low-address regions first.


Tests: tier1 (GHA), tier 2 on x64 linux


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed (2 reviews required, with at least 1 Reviewer, 1 Author)

Issue

  • JDK-8323497: On x64, use 32-bit immediate moves for narrow klass base if possible (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/17340/head:pull/17340
$ git checkout pull/17340

Update a local copy of the PR:
$ git checkout pull/17340
$ git pull https://git.openjdk.org/jdk.git pull/17340/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 17340

View PR using the GUI difftool:
$ git pr show -t 17340

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/17340.diff

Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Jan 10, 2024

👋 Welcome back stuefe! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Jan 10, 2024

@tstuefe The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Jan 10, 2024
@@ -13364,6 +13364,16 @@ void Assembler::mov64(Register dst, int64_t imm64, relocInfo::relocType rtype, i
emit_data64(imm64, rtype, format);
}

#ifdef _LP64
void Assembler::mov32_or_64(Register dst, int64_t imm) {
if ((uint64_t)imm < nth_bit(32)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drive-by comments:
a) macro-assembler stuff like this should be in macroAssembler;
b) there is is_simm32(imm) for checks like these;
c) I did JDK-8319406 recently, maybe you could just use that?

Copy link
Member

@shipilev shipilev Jan 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[deleted]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

God you are fast.

movptr looks suitable. Why did you name it "ptr" if its not for pointers?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't changed the name of the method. movptr means "move something of the width of the machine pointer", like everywhere else in assembler code. That fits your case, I think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, movptr only works its magic if the input is smaller than signed int max (2gb), and it needs one byte more than my variant.

base < 2g:

   35  ;; decode_klass_not_null
   36   0x00007efdcc9e51c4:   mov    $0x27000000,%r11                                                                                                                                                                                                                               
   37   0x00007efdcc9e51cb:   add    %r11,%r10

base > 2g:

   35  ;; decode_klass_not_null
   36   0x00007ff7a89e51c4:   movabs $0x82000000,%r11
   37   0x00007ff7a89e51ce:   add    %r11,%r10          

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shipilev It is just the code representation of the matching, the corresponding nodes are loadConUL32, loadConL32 and loadConL and the matcher uses node costs to sort out the priority.

instruct loadConL(rRegL dst, immL src)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, subtle. Thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the new movptr variant, things look good:

< 2g

 ;; decode_klass_not_null
  0x00007fcb608681c4:   mov    $0x18000000,%r11d
0x00007fcb608681c4:   41 bb 00 00 00 18 
  0x00007fcb608681ca:   add    %r11,%r10
0x00007fcb608681ca:   4d 03 d3 

< 4g

 ;; decode_klass_not_null
  0x00007f382c8681c4:   mov    $0x85000000,%r11d
0x00007f382c8681c4:   41 bb 00 00 00 85 
  0x00007f382c8681ca:   add    %r11,%r10
0x00007f382c8681ca:   4d 03 d3 

4g

 ;; decode_klass_not_null
  0x00007fd8908681c4:   movabs $0x7fd7b6000000,%r11
0x00007fd8908681c4:   49 bb 00 00 00 b6 d7 7f 00 00 
  0x00007fd8908681ce:   add    %r11,%r10
0x00007fd8908681ce:   4d 03 d3 

I assume if the 32-bit mov variants were to use non-Rxx registers, the REX prefix could also be omitted? Giving us another byte back.

I'll run tests with the new variant.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, I think it would be useful for PrintAssembly to also optionally print the raw bytes in addition to the decoded instruction. It certainly helps me understanding stuff. Do you agree? If yes, I'll prepare a patch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty marginal on this. If I ever have needed to see bytes, I've used the debugger.

@tstuefe
Copy link
Member Author

tstuefe commented Jan 29, 2024

Still waiting for https://git.openjdk.org/jdk/pull/17343

@tstuefe tstuefe force-pushed the use-32bit-immediate-moves-on-x64-for-klass-encoding-base branch from d222361 to 2159990 Compare January 30, 2024 11:46
@tstuefe tstuefe marked this pull request as ready for review January 30, 2024 11:47
@openjdk openjdk bot added the rfr Pull request is ready for review label Jan 30, 2024
@mlbridge
Copy link

mlbridge bot commented Jan 30, 2024

Webrevs

@tstuefe
Copy link
Member Author

tstuefe commented Feb 20, 2024

@shipilev , @merykitty ? Could you please review?

Copy link
Member

@merykitty merykitty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have taken a look and left some comments. Thanks.

@@ -40,9 +41,12 @@ char* CompressedKlassPointers::reserve_address_space_for_compressed_classes(size
if (result == nullptr) {
result = reserve_address_space_for_zerobased_encoding(size, aslr);
}
} // end: low-address reservation
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be:

if (result == nullptr) {
    // If we cannot use zero-based encoding (when CDS is enabled), optimizing for an
    ...
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. Why?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the point is to reflow this into the series of ifs:

char* result = nullptr;
if (optimize_for_zero_base) {
   result = ...
}

if (result == nullptr) {
   result = ...
}

Probably good for future maintenance if we want to add more cases in this block.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay. But that's wrong, since for the optimize_for_zero_base case, we would search the low address regions twice.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need to whole space to lie between 0 and 4G then, it seems you only want to have the base being < 4G, then the top of the klass space can be upto size + 4G

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right, good catch.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And yes, you are correct that this only applies when optimize_for_zero_base is false. Since in the other case we have already tried a much larger space up to $2^{32 + LogKlassAlignmentInBytes}$.

@@ -5591,7 +5592,8 @@ void MacroAssembler::decode_klass_not_null(Register r, Register tmp) {
shlq(r, LogKlassAlignmentInBytes);
}
if (CompressedKlassPointers::base() != nullptr) {
mov64(tmp, (int64_t)CompressedKlassPointers::base());
// Uses 32-bit mov if base is small enough
movptr(tmp, (intptr_t)CompressedKlassPointers::base());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can reserve the base in the low 2G then this could be optimized further to addq(r, CompressedKlassPointers::base()) and if LogKlassAlignmentInBytes is 8 (as asserted in decode_and_move_klass_not_null below), we can shorten the whole sequence into leaq(r, Address(noreg, r, Address::times_8, CompressedKlassPointers::base())

Copy link
Member Author

@tstuefe tstuefe Feb 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! But could we do this in a follow-up RFE? I don't have any more time to spend on this atm, and I would like to get this PR at least integrated. It has been sitting in review for 7 weeks now.

I opened a follow-up RFE: https://bugs.openjdk.org/browse/JDK-8326383

(about the leaq, we may move to a different encoding scheme with a different shift value with Lilliput, so I am hesitant to put in too much work that relies on a 8-byte shift)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with that.

@@ -5611,7 +5613,8 @@ void MacroAssembler::decode_and_move_klass_not_null(Register dst, Register src)
movl(dst, src);
} else {
if (CompressedKlassPointers::base() != nullptr) {
mov64(dst, (int64_t)CompressedKlassPointers::base());
// Uses 32-bit mov if base is small enough
movptr(dst, (intptr_t)CompressedKlassPointers::base());
} else {
xorq(dst, dst);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can also be 1 instruction when the base is smaller than $2^{31}$.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my remark above.

Copy link
Member

@shipilev shipilev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks reasonable to me, but it keeps me wondering if any code anywhere relies on compressed oops decoding path to take the exact amount of bytes. Could you please run make test TEST=all on at least one arch before integrating?

/reviewers 2

@@ -40,9 +41,12 @@ char* CompressedKlassPointers::reserve_address_space_for_compressed_classes(size
if (result == nullptr) {
result = reserve_address_space_for_zerobased_encoding(size, aslr);
}
} // end: low-address reservation
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the point is to reflow this into the series of ifs:

char* result = nullptr;
if (optimize_for_zero_base) {
   result = ...
}

if (result == nullptr) {
   result = ...
}

Probably good for future maintenance if we want to add more cases in this block.

@openjdk
Copy link

openjdk bot commented Feb 21, 2024

@tstuefe This change is no longer ready for integration - check the PR body for details.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Feb 21, 2024
@openjdk
Copy link

openjdk bot commented Feb 21, 2024

@shipilev
The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 1 Reviewer, 1 Author).

@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Feb 21, 2024
@bridgekeeper
Copy link

bridgekeeper bot commented Mar 20, 2024

@tstuefe This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@openjdk openjdk bot changed the title JDK-8323497: On x64, use 32-bit immediate moves for narrow klass base if possible 8323497: On x64, use 32-bit immediate moves for narrow klass base if possible Apr 10, 2024
@bridgekeeper
Copy link

bridgekeeper bot commented Apr 17, 2024

@tstuefe This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.

@bridgekeeper bridgekeeper bot closed this Apr 17, 2024
@shipilev
Copy link
Member

@tstuefe, do you want to restart this? Code density is important :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot hotspot-dev@openjdk.org rfr Pull request is ready for review
Development

Successfully merging this pull request may close these issues.

4 participants