-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8323497: On x64, use 32-bit immediate moves for narrow klass base if possible #17340
8323497: On x64, use 32-bit immediate moves for narrow klass base if possible #17340
Conversation
👋 Welcome back stuefe! A progress list of the required criteria for merging this PR into |
@@ -13364,6 +13364,16 @@ void Assembler::mov64(Register dst, int64_t imm64, relocInfo::relocType rtype, i | |||
emit_data64(imm64, rtype, format); | |||
} | |||
|
|||
#ifdef _LP64 | |||
void Assembler::mov32_or_64(Register dst, int64_t imm) { | |||
if ((uint64_t)imm < nth_bit(32)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drive-by comments:
a) macro-assembler stuff like this should be in macroAssembler;
b) there is is_simm32(imm)
for checks like these;
c) I did JDK-8319406 recently, maybe you could just use that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[deleted]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
God you are fast.
movptr looks suitable. Why did you name it "ptr" if its not for pointers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't changed the name of the method. movptr
means "move something of the width of the machine pointer", like everywhere else in assembler code. That fits your case, I think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, movptr only works its magic if the input is smaller than signed int max (2gb), and it needs one byte more than my variant.
base < 2g:
35 ;; decode_klass_not_null
36 0x00007efdcc9e51c4: mov $0x27000000,%r11
37 0x00007efdcc9e51cb: add %r11,%r10
base > 2g:
35 ;; decode_klass_not_null
36 0x00007ff7a89e51c4: movabs $0x82000000,%r11
37 0x00007ff7a89e51ce: add %r11,%r10
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shipilev It is just the code representation of the matching, the corresponding nodes are loadConUL32
, loadConL32
and loadConL
and the matcher uses node costs to sort out the priority.
jdk/src/hotspot/cpu/x86/x86_64.ad
Line 4807 in c1282b5
instruct loadConL(rRegL dst, immL src) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, subtle. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the new movptr variant, things look good:
< 2g
;; decode_klass_not_null
0x00007fcb608681c4: mov $0x18000000,%r11d
0x00007fcb608681c4: 41 bb 00 00 00 18
0x00007fcb608681ca: add %r11,%r10
0x00007fcb608681ca: 4d 03 d3
< 4g
;; decode_klass_not_null
0x00007f382c8681c4: mov $0x85000000,%r11d
0x00007f382c8681c4: 41 bb 00 00 00 85
0x00007f382c8681ca: add %r11,%r10
0x00007f382c8681ca: 4d 03 d3
4g
;; decode_klass_not_null
0x00007fd8908681c4: movabs $0x7fd7b6000000,%r11
0x00007fd8908681c4: 49 bb 00 00 00 b6 d7 7f 00 00
0x00007fd8908681ce: add %r11,%r10
0x00007fd8908681ce: 4d 03 d3
I assume if the 32-bit mov variants were to use non-Rxx registers, the REX prefix could also be omitted? Giving us another byte back.
I'll run tests with the new variant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, I think it would be useful for PrintAssembly to also optionally print the raw bytes in addition to the decoded instruction. It certainly helps me understanding stuff. Do you agree? If yes, I'll prepare a patch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty marginal on this. If I ever have needed to see bytes, I've used the debugger.
c54df67
to
369831e
Compare
Still waiting for https://git.openjdk.org/jdk/pull/17343 |
d222361
to
2159990
Compare
Webrevs
|
…or-klass-encoding-base
…or-klass-encoding-base
@shipilev , @merykitty ? Could you please review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have taken a look and left some comments. Thanks.
@@ -40,9 +41,12 @@ char* CompressedKlassPointers::reserve_address_space_for_compressed_classes(size | |||
if (result == nullptr) { | |||
result = reserve_address_space_for_zerobased_encoding(size, aslr); | |||
} | |||
} // end: low-address reservation | |||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be:
if (result == nullptr) {
// If we cannot use zero-based encoding (when CDS is enabled), optimizing for an
...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the point is to reflow this into the series of ifs:
char* result = nullptr;
if (optimize_for_zero_base) {
result = ...
}
if (result == nullptr) {
result = ...
}
Probably good for future maintenance if we want to add more cases in this block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah okay. But that's wrong, since for the optimize_for_zero_base case, we would search the low address regions twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need to whole space to lie between 0 and 4G then, it seems you only want to have the base being < 4G, then the top of the klass space can be upto size + 4G
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right, good catch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And yes, you are correct that this only applies when optimize_for_zero_base
is false
. Since in the other case we have already tried a much larger space up to
@@ -5591,7 +5592,8 @@ void MacroAssembler::decode_klass_not_null(Register r, Register tmp) { | |||
shlq(r, LogKlassAlignmentInBytes); | |||
} | |||
if (CompressedKlassPointers::base() != nullptr) { | |||
mov64(tmp, (int64_t)CompressedKlassPointers::base()); | |||
// Uses 32-bit mov if base is small enough | |||
movptr(tmp, (intptr_t)CompressedKlassPointers::base()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can reserve the base in the low 2G then this could be optimized further to addq(r, CompressedKlassPointers::base())
and if LogKlassAlignmentInBytes
is 8 (as asserted in decode_and_move_klass_not_null
below), we can shorten the whole sequence into leaq(r, Address(noreg, r, Address::times_8, CompressedKlassPointers::base())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! But could we do this in a follow-up RFE? I don't have any more time to spend on this atm, and I would like to get this PR at least integrated. It has been sitting in review for 7 weeks now.
I opened a follow-up RFE: https://bugs.openjdk.org/browse/JDK-8326383
(about the leaq, we may move to a different encoding scheme with a different shift value with Lilliput, so I am hesitant to put in too much work that relies on a 8-byte shift)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with that.
@@ -5611,7 +5613,8 @@ void MacroAssembler::decode_and_move_klass_not_null(Register dst, Register src) | |||
movl(dst, src); | |||
} else { | |||
if (CompressedKlassPointers::base() != nullptr) { | |||
mov64(dst, (int64_t)CompressedKlassPointers::base()); | |||
// Uses 32-bit mov if base is small enough | |||
movptr(dst, (intptr_t)CompressedKlassPointers::base()); | |||
} else { | |||
xorq(dst, dst); | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can also be 1 instruction when the base is smaller than
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my remark above.
…or-klass-encoding-base
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks reasonable to me, but it keeps me wondering if any code anywhere relies on compressed oops decoding path to take the exact amount of bytes. Could you please run make test TEST=all
on at least one arch before integrating?
/reviewers 2
@@ -40,9 +41,12 @@ char* CompressedKlassPointers::reserve_address_space_for_compressed_classes(size | |||
if (result == nullptr) { | |||
result = reserve_address_space_for_zerobased_encoding(size, aslr); | |||
} | |||
} // end: low-address reservation | |||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the point is to reflow this into the series of ifs:
char* result = nullptr;
if (optimize_for_zero_base) {
result = ...
}
if (result == nullptr) {
result = ...
}
Probably good for future maintenance if we want to add more cases in this block.
@tstuefe This change is no longer ready for integration - check the PR body for details. |
@tstuefe This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration! |
@tstuefe This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the |
@tstuefe, do you want to restart this? Code density is important :) |
On x64, we always use the long form of mov immediate to load the klass base into a register. If the klass base fits into 32 bits, we could use the short form and save four instruction bytes.
Before: mov uses 10 instruction bytes:
Now: mov uses 6 instruction bytes:
Note that this optimization does not depend on zero-based addressing, and therefore we change class space reservation: we now always look in low-address regions first.
Tests: tier1 (GHA), tier 2 on x64 linux
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/17340/head:pull/17340
$ git checkout pull/17340
Update a local copy of the PR:
$ git checkout pull/17340
$ git pull https://git.openjdk.org/jdk.git pull/17340/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 17340
View PR using the GUI difftool:
$ git pr show -t 17340
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/17340.diff
Webrev
Link to Webrev Comment