Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8280872: Reorder code cache segments to improve code density #7517

Closed
wants to merge 10 commits into from

Conversation

bulasevich
Copy link
Contributor

@bulasevich bulasevich commented Feb 17, 2022

Currently the codecache segment order is [non-nmethod, non-profiled, profiled]. With this change we move the non-nmethod segment between two code segments. Currently only the aarch64 backend is adapted to make use of these changes.

In AARCH the offset limit for a branch instruction is 128MB. The bigger jumps are encoded with three instructions. Most of far branches are jumps into the non-nmethod blobs. With the non-nmethod segment in between code segments the jump distance from method to the stub becomes shorter. The result is a 4% reduction in generated code size for the CodeCache range from 128MB to 240MB.

As a side effect, the performance of some tests is slightly improved:
ArraysFill.testCharFill 10 thrpt 15 170235.720 -> 178477.212 ops/ms

Testing: jdk/hotspot jtreg and microbenchmarks on AMD and AARCH


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8280872: Reorder code cache segments to improve code density

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/7517/head:pull/7517
$ git checkout pull/7517

Update a local copy of the PR:
$ git checkout pull/7517
$ git pull https://git.openjdk.java.net/jdk pull/7517/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 7517

View PR using the GUI difftool:
$ git pr show -t 7517

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/7517.diff

@bridgekeeper
Copy link

bridgekeeper bot commented Feb 17, 2022

👋 Welcome back bulasevich! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Feb 17, 2022

@bulasevich The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Feb 17, 2022
@bulasevich bulasevich force-pushed the codecache_segments_order branch 3 times, most recently from 8ab9342 to 7822f2f Compare February 22, 2022 09:59
@openjdk
Copy link

openjdk bot commented Feb 22, 2022

@bulasevich this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout codecache_segments_order
git fetch https://git.openjdk.java.net/jdk master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Feb 22, 2022
@bulasevich bulasevich marked this pull request as ready for review February 23, 2022 19:59
@openjdk openjdk bot added the rfr Pull request is ready for review label Feb 23, 2022
@mlbridge
Copy link

mlbridge bot commented Feb 23, 2022

@openjdk openjdk bot removed the merge-conflict Pull request has merge conflict with target branch label Feb 23, 2022
src/hotspot/share/code/codeCache.cpp Outdated Show resolved Hide resolved
src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp Outdated Show resolved Hide resolved
address tpc1 = trampoline_call(has_neg);
if (tpc1 == NULL) {
DEBUG_ONLY(reset_labels(STUB_LONG, SET_RESULT, DONE));
postcond(pc() == badAddress);
return NULL;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe replacing trampoline_call by far_call should be a separate PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I remove this part of the change.

@@ -390,7 +390,8 @@ void MacroAssembler::far_call(Address entry, CodeBuffer *cbuf, Register tmp) {
assert(ReservedCodeCacheSize < 4*G, "branch out of range");
assert(CodeCache::find_blob(entry.target()) != NULL,
"destination of far call not found in code cache");
if (far_branches()) {
assert(CodeCache::is_non_nmethod(entry.target()), "must be a call to the code stub");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This restricts far calls to be calls of non-nmethod code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. In fact the function is used for non-method code calls only. I put an assert here to be check this fact for future code updates.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we should restrict uses of far_call to calls of non-nmethod code. Could you please explain this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to avoid the untested code paths.
You are right. Let me change it to work the same way with far_jump impl: if (target_needs_far_branch)..

@@ -52,7 +52,7 @@ void InlineCacheBuffer::assemble_ic_buffer_code(address code_begin, void* cached
address start = __ pc();
Label l;
__ ldr(rscratch2, l);
__ far_jump(ExternalAddress(entry_point));
__ far_jump(ExternalAddress(entry_point), NULL, rscratch1, true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This complicates assemble_ic_buffer_code. You need to know far_jump implementation, especially the generation of NOPs. I understand why we need those NOPs.
Do we have calls of non-nmethod code here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there are entry points from both non_method and method segments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest the following code:

int inst_count = __ far_jump(ExternalAddress(entry_point));
// IC stub code size is not expected to vary depending on target address.
// We use NOPs to make the size equal to ic_stub_code_size.
for (int i = 3 + inst_count; i < ic_stub_code_size(); ++i) {
  __ nop();
}

src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp Outdated Show resolved Hide resolved
@@ -52,7 +52,7 @@ void InlineCacheBuffer::assemble_ic_buffer_code(address code_begin, void* cached
address start = __ pc();
Label l;
__ ldr(rscratch2, l);
__ far_jump(ExternalAddress(entry_point));
__ far_jump(ExternalAddress(entry_point), NULL, rscratch1, true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest the following code:

int inst_count = __ far_jump(ExternalAddress(entry_point));
// IC stub code size is not expected to vary depending on target address.
// We use NOPs to make the size equal to ic_stub_code_size.
for (int i = 3 + inst_count; i < ic_stub_code_size(); ++i) {
  __ nop();
}

@eastig
Copy link
Member

eastig commented Mar 10, 2022

Can we have tests for this?

@openjdk openjdk bot removed the rfr Pull request is ready for review label Mar 15, 2022
@openjdk openjdk bot added the rfr Pull request is ready for review label Mar 16, 2022
Copy link
Member

@eastig eastig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Member

@simonis simonis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Boris,

Thanks for doing this change. In general it looks good! I only have minor comments and questions which I've added inline.

Also, can you please update the Summary of the JBS issue to match that of the PR? I think the Summary of the PR is more adequate because the change also contains some shared changes.

And please update the description of the PR and replace "It changes nothing for any platform besides AARCH" with something like "Currently only the aarch64 backend is adapted to make use of these changes". Because the segments are actually re-orded for all platforms.

Thank you and best regards,
Volker

assert(ReservedCodeCacheSize < 4*G, "branch out of range");
assert(CodeCache::find_blob(entry.target()) != NULL,
"destination of far call not found in code cache");
if (far_branches()) {
address start = pc();
if (target_needs_far_branch(entry.target())) {
uint64_t offset;
// We can use ADRP here because we know that the total size of
// the code cache cannot exceed 2Gb.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not directly related to your change, but what's correct here:

  • the comment which says "code cache can't exceed 2gb"
  • the assertion above which asserts ReservedCodeCacheSize < 4*G

Maybe you can fix this while you're on it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ADRP limit is 4GB - it is checked by assert. The comment reminds us that CODE_CACHE_SIZE_LIMIT (defined in globalDefinitions.hpp) is 2G which is Ok for us. let me update comment a little:

  • // the code cache cannot exceed 2Gb (ADRP limit is 4GB)

__ far_jump(ExternalAddress(entry_point));
__ align(wordSize);
int jump_code_size = __ far_jump(ExternalAddress(entry_point));
// IC stub code size is not expected to vary depending on target address.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the new code still align cached_value on a wordSize boundary as this was ensured before by align(wordsize)? I think that's only true if code_begin is guaranteed to start at a wordSize boundary because far_jump is either one or three instructions (plus one ldr instruction). If yes, please add a comment explaining that. Otherwise explain why the alignment isn't necessary anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out. I think alignment is important because of data load penalty issues and MT issues.
Actually the stub is aligned on the CodeEntryAlignment and stub_size is aligned, so this address is also aligned.
I removed align(wordSize) because the stub_size is constant and there is no room for additional alignments within a stub. Let me add an assert here to check the alignment.

src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp Outdated Show resolved Hide resolved
src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp Outdated Show resolved Hide resolved
src/hotspot/cpu/aarch64/aarch64.ad Show resolved Hide resolved
return true;
}
// codecache size: 128M..240M
return !CodeCache::is_non_nmethod(addr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to further refine this to also catch calls from C1 to C1 and C2 to C2 which obviously wouldn't need a far call as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe they should be our next steps to guarantee we don't generate redundant code for such cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. But I see a little problem here. CodeCache is aksed for a room in profiled segment for C1 methods, but it places method to the non_profiled segment when the profiled segment is full (and same whay C2 method can be placed in profiled segment). So we can not know in advance where will be the final place for the generated method.

src/hotspot/share/code/codeCache.cpp Show resolved Hide resolved
ayoudkev and others added 2 commits March 24, 2022 13:20
…hen the assert, check alignment, move comments, segments order: profiled - non_method - non_profiled
Copy link
Member

@eastig eastig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Member

@simonis simonis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good now.

@openjdk
Copy link

openjdk bot commented Mar 28, 2022

@bulasevich This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8280872: Reorder code cache segments to improve code density

Reviewed-by: simonis, kvn

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 6 new commits pushed to the master branch:

  • e9e3aa7: 8283730: Improve discussion of modeling of packages and modules
  • 20acea4: 8284362: Remove the "unsupported API" warning from jdk.javadoc
  • ff1daa8: 8284380: ProblemList jdk/jshell/HighlightUITest.java on more platforms
  • 526e734: 8282819: Deprecate Locale class constructors
  • 648efd7: 8284275: Remove unused sun.nio.fs.Reflect
  • f20ed33: 8284278: Diagnostic Fix Updated.

Please see this link for an up-to-date comparison between the source branch of this pull request and the master branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Mar 28, 2022
__ bind(l);
assert((uintptr_t)__ pc() % wordSize == 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this leads to a compilation failure in the debug build. Can you please check before submitting?

Copy link
Contributor

@vnkozlov vnkozlov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. Let me test it before approval.

* @requires os.arch=="aarch64"
* @requires vm.compiler2.enabled
*
* @run driver compiler.c2.TestFarJump
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Package name compiler.c2.aarch64 is different from compiler.c2.TestFarJump.
Got testing failure:

java.lang.ClassNotFoundException: compiler.c2.TestFarJump
at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After I fixed test name I got next failure:

[Exception Handler]
  0x0000fffe40140490: b3d5 9bd2 | 4201 80d2 | a4d5 9bd2 | a5d5 9bd2 

----------System.err:(11/622)----------
java.lang.RuntimeException: ADRP instruction is expected on far jump
	at compiler.c2.aarch64.TestFarJump.runVM(TestFarJump.java:112)
	at compiler.c2.aarch64.TestFarJump.main(TestFarJump.java:126)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for checking that. My bad, last minute change have broken the test. I think the issue you observe is a debug build issue. I checked the functionality with branch_range=128M on debug build, and then reverted the fix:

  // The maximum range of a branch is fixed for the AArch64
  // architecture.  In debug mode we shrink it in order to test
  // trampolines, but not so small that branches in the interpreter
  // are out of range.
- static const uint64_t branch_range = NOT_DEBUG(128 * M) DEBUG_ONLY(2 * M);
+ static const uint64_t branch_range = 128 * M; // test only

In current change I disable the test for the debug build (@requires vm.debug == false)

int jump_code_size = __ far_jump(ExternalAddress(entry_point));
// IC stub code size is not expected to vary depending on target address.
// We use NOPs to make the ldr+far_jump+int64 size equal to ic_stub_code_size.
for (int i = jump_code_size;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need more explanation for loop's boundaries arithmetic.
The comment includes ldr but you start from jump instruction size. Is it part of 3*instruction_size? What are 3 instructions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean (ldr + int64 cached_value) = 4 + 8 = 12 = 3*instruction_size.
I change the code to make it clear now:

  // We use NOPs to make the [ldr + far_jump + nops + int64] stub size equal to ic_stub_code_size.
  for (int size = NativeInstruction::instruction_size + jump_code_size + 8;
           size < ic_stub_code_size(); size += NativeInstruction::instruction_size) {
     __ nop();
  }

@vnkozlov
Copy link
Contributor

My performance testing shows that there is regression in some startup benchmarks on aarch64.

@simonis
Copy link
Member

simonis commented Mar 29, 2022

My performance testing shows that there is regression in some startup benchmarks on aarch64.

Is there a chance to share what you are using as "startup benchmarks"? I think they would be definitely useful for others as well if they don't contain proprietary code.

@vnkozlov
Copy link
Contributor

My performance testing shows that there is regression in some startup benchmarks on aarch64.

Is there a chance to share what you are using as "startup benchmarks"? I think they would be definitely useful for others as well if they don't contain proprietary code.

I would defer this question to @ericcaspole and @cl4es. I don't know how these benchmarks run and which one are open.

But I agree that we can share how we run open startup benchmarks with community if we did not do it already.

@vnkozlov
Copy link
Contributor

@ericcaspole investigated performance results and "okayed" these changes.

@bulasevich, you still need to address issues with icBuffer_aarch64.cpp and TestFarJump.java.

@ericcaspole
Copy link

We have two frameworks to measure startup performance. The older one is cross platform and measures only wall clock time. Since about 4 years ago we created a new linux only framework using "perf stat" to measure the wall clock time, cycles and instructions. It includes a few small hello world type variations, some web framework startup, and a simple gui app.

@simonis
Copy link
Member

simonis commented Apr 1, 2022

Thanks @ericcaspole , @vnkozlov !

@bulasevich was out of office this week. He'll address the remaining issues once he's back.

@bulasevich
Copy link
Contributor Author

Thanks @ericcaspole , @vnkozlov, @simonis !

@ericcaspole investigated performance results and "okayed" these changes.

Thank you! I checked a simple app startup time. It does not show any change for me.

@bulasevich, you still need to address issues with icBuffer_aarch64.cpp and TestFarJump.java.

I have done it. Also I did another round of testing with jtreg.
Thanks!

*
* @requires vm.flagless
* @requires os.arch=="aarch64"
* @requires vm.debug == false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why you disabled test with debug VM?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a debug build the functionality does not work as expected. On AARCH64 branch range is +/-128MB. This is reflected by branch_range constant in assembler_aarch64.hpp. The issue is that for debug build the branch_range constant is intentionally modified to test trampolines, and my test fails because branches are encoded into three instructions starting at 2MB, not 128MB:

  // The maximum range of a branch is fixed for the AArch64
  // architecture.  In debug mode we shrink it in order to test
  // trampolines, but not so small that branches in the interpreter
  // are out of range.
  static const uint64_t branch_range = NOT_DEBUG(128 * M) DEBUG_ONLY(2 * M);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@bulasevich
Copy link
Contributor Author

/integrate

@openjdk
Copy link

openjdk bot commented Apr 6, 2022

Going to push as commit e524107.
Since your change was applied there have been 13 commits pushed to the master branch:

  • e18414a: 8284014: Menu items with submenus in JPopupMEnu are not spoken on macOS
  • b56df28: 8283935: Parallel: Crash during pretouch after large pages allocation failure
  • 0a67d68: 8284294: Create an automated regression test for RFE 4138746
  • 955d61d: 8284369: TestFailedAllocationBadGraph fails with -XX:TieredStopAtLevel < 4
  • 4ffe96a: 8282506: Clean up remaining references to TwoStacksPlain*SocketImpl
  • 741be46: 8183390: Fix and re-enable post loop vectorization
  • 500f9a5: 8283396: Null pointer dereference in loopnode.cpp:2851
  • e9e3aa7: 8283730: Improve discussion of modeling of packages and modules
  • 20acea4: 8284362: Remove the "unsupported API" warning from jdk.javadoc
  • ff1daa8: 8284380: ProblemList jdk/jshell/HighlightUITest.java on more platforms
  • ... and 3 more: https://git.openjdk.java.net/jdk/compare/632825c6d2933c8f1e63569199413ecaa74b6740...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Apr 6, 2022
@openjdk openjdk bot closed this Apr 6, 2022
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Apr 6, 2022
@openjdk
Copy link

openjdk bot commented Apr 6, 2022

@bulasevich Pushed as commit e524107.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot hotspot-dev@openjdk.org integrated Pull request has been integrated
6 participants