New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8280872: Reorder code cache segments to improve code density #7517
Conversation
|
@bulasevich The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command. |
8ab9342
to
7822f2f
Compare
@bulasevich this pull request can not be integrated into git checkout codecache_segments_order
git fetch https://git.openjdk.java.net/jdk master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push |
Webrevs
|
increase far jump threshold: sideof(codecache)=128M -> sizeof(nonprofiled+nonmethod)=128M
fc8686d
to
8f1a8c9
Compare
address tpc1 = trampoline_call(has_neg); | ||
if (tpc1 == NULL) { | ||
DEBUG_ONLY(reset_labels(STUB_LONG, SET_RESULT, DONE)); | ||
postcond(pc() == badAddress); | ||
return NULL; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe replacing trampoline_call
by far_call
should be a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I remove this part of the change.
@@ -390,7 +390,8 @@ void MacroAssembler::far_call(Address entry, CodeBuffer *cbuf, Register tmp) { | |||
assert(ReservedCodeCacheSize < 4*G, "branch out of range"); | |||
assert(CodeCache::find_blob(entry.target()) != NULL, | |||
"destination of far call not found in code cache"); | |||
if (far_branches()) { | |||
assert(CodeCache::is_non_nmethod(entry.target()), "must be a call to the code stub"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This restricts far calls to be calls of non-nmethod code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. In fact the function is used for non-method code calls only. I put an assert here to be check this fact for future code updates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why we should restrict uses of far_call
to calls of non-nmethod code. Could you please explain this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to avoid the untested code paths.
You are right. Let me change it to work the same way with far_jump impl: if (target_needs_far_branch)..
@@ -52,7 +52,7 @@ void InlineCacheBuffer::assemble_ic_buffer_code(address code_begin, void* cached | |||
address start = __ pc(); | |||
Label l; | |||
__ ldr(rscratch2, l); | |||
__ far_jump(ExternalAddress(entry_point)); | |||
__ far_jump(ExternalAddress(entry_point), NULL, rscratch1, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This complicates assemble_ic_buffer_code
. You need to know far_jump
implementation, especially the generation of NOPs. I understand why we need those NOPs.
Do we have calls of non-nmethod code here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there are entry points from both non_method and method segments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest the following code:
int inst_count = __ far_jump(ExternalAddress(entry_point));
// IC stub code size is not expected to vary depending on target address.
// We use NOPs to make the size equal to ic_stub_code_size.
for (int i = 3 + inst_count; i < ic_stub_code_size(); ++i) {
__ nop();
}
…dd trampoline_needs_far_jump func
@@ -52,7 +52,7 @@ void InlineCacheBuffer::assemble_ic_buffer_code(address code_begin, void* cached | |||
address start = __ pc(); | |||
Label l; | |||
__ ldr(rscratch2, l); | |||
__ far_jump(ExternalAddress(entry_point)); | |||
__ far_jump(ExternalAddress(entry_point), NULL, rscratch1, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest the following code:
int inst_count = __ far_jump(ExternalAddress(entry_point));
// IC stub code size is not expected to vary depending on target address.
// We use NOPs to make the size equal to ic_stub_code_size.
for (int i = 3 + inst_count; i < ic_stub_code_size(); ++i) {
__ nop();
}
Can we have tests for this? |
9cb0354
to
9650abc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Boris,
Thanks for doing this change. In general it looks good! I only have minor comments and questions which I've added inline.
Also, can you please update the Summary of the JBS issue to match that of the PR? I think the Summary of the PR is more adequate because the change also contains some shared changes.
And please update the description of the PR and replace "It changes nothing for any platform besides AARCH" with something like "Currently only the aarch64 backend is adapted to make use of these changes". Because the segments are actually re-orded for all platforms.
Thank you and best regards,
Volker
assert(ReservedCodeCacheSize < 4*G, "branch out of range"); | ||
assert(CodeCache::find_blob(entry.target()) != NULL, | ||
"destination of far call not found in code cache"); | ||
if (far_branches()) { | ||
address start = pc(); | ||
if (target_needs_far_branch(entry.target())) { | ||
uint64_t offset; | ||
// We can use ADRP here because we know that the total size of | ||
// the code cache cannot exceed 2Gb. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not directly related to your change, but what's correct here:
- the comment which says "code cache can't exceed 2gb"
- the assertion above which asserts
ReservedCodeCacheSize < 4*G
Maybe you can fix this while you're on it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ADRP limit is 4GB - it is checked by assert. The comment reminds us that CODE_CACHE_SIZE_LIMIT (defined in globalDefinitions.hpp) is 2G which is Ok for us. let me update comment a little:
- // the code cache cannot exceed 2Gb (ADRP limit is 4GB)
__ far_jump(ExternalAddress(entry_point)); | ||
__ align(wordSize); | ||
int jump_code_size = __ far_jump(ExternalAddress(entry_point)); | ||
// IC stub code size is not expected to vary depending on target address. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the new code still align cached_value
on a wordSize
boundary as this was ensured before by align(wordsize)
? I think that's only true if code_begin
is guaranteed to start at a wordSize
boundary because far_jump
is either one or three instructions (plus one ldr
instruction). If yes, please add a comment explaining that. Otherwise explain why the alignment isn't necessary anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out. I think alignment is important because of data load penalty issues and MT issues.
Actually the stub is aligned on the CodeEntryAlignment and stub_size is aligned, so this address is also aligned.
I removed align(wordSize) because the stub_size is constant and there is no room for additional alignments within a stub. Let me add an assert here to check the alignment.
return true; | ||
} | ||
// codecache size: 128M..240M | ||
return !CodeCache::is_non_nmethod(addr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to further refine this to also catch calls from C1 to C1 and C2 to C2 which obviously wouldn't need a far call as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe they should be our next steps to guarantee we don't generate redundant code for such cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. But I see a little problem here. CodeCache is aksed for a room in profiled segment for C1 methods, but it places method to the non_profiled segment when the profiled segment is full (and same whay C2 method can be placed in profiled segment). So we can not know in advance where will be the final place for the generated method.
…hen the assert, check alignment, move comments, segments order: profiled - non_method - non_profiled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, looks good now.
@bulasevich This change now passes all automated pre-integration checks. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 6 new commits pushed to the
Please see this link for an up-to-date comparison between the source branch of this pull request and the
|
__ bind(l); | ||
assert((uintptr_t)__ pc() % wordSize == 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this leads to a compilation failure in the debug build. Can you please check before submitting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion. Let me test it before approval.
* @requires os.arch=="aarch64" | ||
* @requires vm.compiler2.enabled | ||
* | ||
* @run driver compiler.c2.TestFarJump |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Package name compiler.c2.aarch64
is different from compiler.c2.TestFarJump
.
Got testing failure:
java.lang.ClassNotFoundException: compiler.c2.TestFarJump
at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After I fixed test name I got next failure:
[Exception Handler]
0x0000fffe40140490: b3d5 9bd2 | 4201 80d2 | a4d5 9bd2 | a5d5 9bd2
----------System.err:(11/622)----------
java.lang.RuntimeException: ADRP instruction is expected on far jump
at compiler.c2.aarch64.TestFarJump.runVM(TestFarJump.java:112)
at compiler.c2.aarch64.TestFarJump.main(TestFarJump.java:126)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for checking that. My bad, last minute change have broken the test. I think the issue you observe is a debug build issue. I checked the functionality with branch_range=128M on debug build, and then reverted the fix:
// The maximum range of a branch is fixed for the AArch64
// architecture. In debug mode we shrink it in order to test
// trampolines, but not so small that branches in the interpreter
// are out of range.
- static const uint64_t branch_range = NOT_DEBUG(128 * M) DEBUG_ONLY(2 * M);
+ static const uint64_t branch_range = 128 * M; // test only
In current change I disable the test for the debug build (@requires vm.debug == false
)
int jump_code_size = __ far_jump(ExternalAddress(entry_point)); | ||
// IC stub code size is not expected to vary depending on target address. | ||
// We use NOPs to make the ldr+far_jump+int64 size equal to ic_stub_code_size. | ||
for (int i = jump_code_size; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need more explanation for loop's boundaries arithmetic.
The comment includes ldr
but you start from jump instruction size. Is it part of 3*instruction_size
? What are 3
instructions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean (ldr + int64 cached_value) = 4 + 8 = 12 = 3*instruction_size.
I change the code to make it clear now:
// We use NOPs to make the [ldr + far_jump + nops + int64] stub size equal to ic_stub_code_size.
for (int size = NativeInstruction::instruction_size + jump_code_size + 8;
size < ic_stub_code_size(); size += NativeInstruction::instruction_size) {
__ nop();
}
My performance testing shows that there is regression in some startup benchmarks on aarch64. |
Is there a chance to share what you are using as "startup benchmarks"? I think they would be definitely useful for others as well if they don't contain proprietary code. |
I would defer this question to @ericcaspole and @cl4es. I don't know how these benchmarks run and which one are open. But I agree that we can share how we run open startup benchmarks with community if we did not do it already. |
@ericcaspole investigated performance results and "okayed" these changes. @bulasevich, you still need to address issues with |
We have two frameworks to measure startup performance. The older one is cross platform and measures only wall clock time. Since about 4 years ago we created a new linux only framework using "perf stat" to measure the wall clock time, cycles and instructions. It includes a few small hello world type variations, some web framework startup, and a simple gui app. |
Thanks @ericcaspole , @vnkozlov ! @bulasevich was out of office this week. He'll address the remaining issues once he's back. |
Thanks @ericcaspole , @vnkozlov, @simonis !
Thank you! I checked a simple app startup time. It does not show any change for me.
I have done it. Also I did another round of testing with jtreg. |
* | ||
* @requires vm.flagless | ||
* @requires os.arch=="aarch64" | ||
* @requires vm.debug == false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why you disabled test with debug VM?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a debug build the functionality does not work as expected. On AARCH64 branch range is +/-128MB. This is reflected by branch_range constant in assembler_aarch64.hpp. The issue is that for debug build the branch_range constant is intentionally modified to test trampolines, and my test fails because branches are encoded into three instructions starting at 2MB, not 128MB:
// The maximum range of a branch is fixed for the AArch64
// architecture. In debug mode we shrink it in order to test
// trampolines, but not so small that branches in the interpreter
// are out of range.
static const uint64_t branch_range = NOT_DEBUG(128 * M) DEBUG_ONLY(2 * M);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
/integrate |
Going to push as commit e524107.
Your commit was automatically rebased without conflicts. |
@bulasevich Pushed as commit e524107. |
Currently the codecache segment order is [non-nmethod, non-profiled, profiled]. With this change we move the non-nmethod segment between two code segments. Currently only the aarch64 backend is adapted to make use of these changes.
In AARCH the offset limit for a branch instruction is 128MB. The bigger jumps are encoded with three instructions. Most of far branches are jumps into the non-nmethod blobs. With the non-nmethod segment in between code segments the jump distance from method to the stub becomes shorter. The result is a 4% reduction in generated code size for the CodeCache range from 128MB to 240MB.
As a side effect, the performance of some tests is slightly improved:
ArraysFill.testCharFill 10 thrpt 15 170235.720 -> 178477.212 ops/ms
Testing: jdk/hotspot jtreg and microbenchmarks on AMD and AARCH
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/7517/head:pull/7517
$ git checkout pull/7517
Update a local copy of the PR:
$ git checkout pull/7517
$ git pull https://git.openjdk.java.net/jdk pull/7517/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 7517
View PR using the GUI difftool:
$ git pr show -t 7517
Using diff file
Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/7517.diff