-
Notifications
You must be signed in to change notification settings - Fork 6.2k
8370947: Mitigate Neoverse-N1 erratum 1542419 negative impact on GCs and JIT performance #28328
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
👋 Welcome back eastigeevich! A progress list of the required criteria for merging this PR into |
|
❗ This change is not yet ready to be integrated. |
|
/contributor add xmas92 |
|
@eastig Syntax:
User names can only be used for users in the census associated with this repository. For other contributors you need to supply the full name and email address. |
|
/contributor add aboldtch |
|
@eastig |
|
Hi @fisk @theRealAph @xmas92 @shipilev I created this draft PR based on @xmas92 work master...xmas92:jdk:deferred_icache_invalidation Alex wrote about his implementation in JDK-8370947:
I see his changes touch other backends. I tried to minimize changes and to avoid them in other backends. This PR does not cover all cases in ZGC at the moment. It can be done as soon as we agree with a proper way to fix. I'd like to hear your opinion which way we should choose:
|
|
@eastig this pull request can not be integrated into git checkout JDK-8370947
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push |
|
/issue JDK-8370947 |
|
@eastig This issue is referenced in the PR title - it will now be updated. |
Webrevs
|
|
@fisk, @xmas92 I found one place where I am not sure: jdk/src/hotspot/share/gc/z/zHeapIterator.cpp Line 364 in b9ee954
For |
xmas92
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the implementation is fine. We can always extend it later if we find that other platforms or hardware needs this sort of treatment.
My knowledge and experience with arm hardware implementation specifics are rather lacking. So I cannot comment on the validity of the assertions made here w.r.t. only invalidating the first instruction in the nmethod etc.
Hopefully some of our resident arm experts can chime in.
/reviewers 2 reviewer
|
@fisk, @xmas92 |
| // the performance impact due to this workaround." | ||
| // | ||
| // As the address for icache invalidation is not relevant, we use the nmethod's code start address. | ||
| ICache::invalidate_word(_nm->code_begin()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than call ICache::invalidate_word(), I believe we should explicitly execute the instructions in the workaround.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot execute tlbi vae3is here because it requires EL3. We are at EL0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or you mean IC IVAU?`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I replaced the call of ICache::invalidate_word() with:
asm volatile("dsb ish \n"
"ic ivau, xzr \n"
"isb \n"
: : : "memory");
The code executed in ICache::invalidate_word() when all checks are done:
dsb ish
ic ivau
dsb ish
isb
I use xzr in ic ivau because an address in it does not matter. The instruction is trapped and ignored.
I think we don't need the second dsb because we will have dsb sy in the trap handler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if we want to jeapordize the correctness of the JVM code based on the exact instructions that are currently used to mitigate this issue in the kernel. Eliding the trailing dsb ish because we know the kernel mitigation runs it, seems unnecessarily fragile to me; if the kernel comes up with some smarter and cheaper way of mitigating this in the future, using some other magic incantation, then I don't want to have a correctness issue because of that implicit assumption.
Is it noticeably expensive to run the trailing dsb again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we need dsb if we use ic, according to the Arm manual. They are redundant if we have hardware instruction cache coherence enable. On one side we know that the hardware icache coherence is working and ic is ignored. On another side, we check the hardware icache coherence is disabled and we should follow Arm ARM.
I don't expect that having dsb has noticeable performance impact. I haven't seen any.
I agree with prioritizing correctness.
|
I think we'll also want a workaround for |
Also we need to fix Should we do it in this PR or in separate PRs? |
Please, let's handle it all here. |
xmas92
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some style comments.
|
@xmas92
I'll check SpecJVM as well. |
|
@theRealAph @fisk From my experience of implementing spin pauses, we use Without deferred icache invalidation (baseline): With deferred icache invalidation: Improvements:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @copilot-pull-request-reviewer[bot], thanks for making a comment in an OpenJDK project!
All comments and discussions in the OpenJDK Community must be made available under the OpenJDK Terms of Use. If you already are an OpenJDK Author, Committer or Reviewer, please click here to open a new issue so that we can record that fact. Please Use "Add GitHub user copilot-pull-request-reviewer[bot]" for the summary.
If you are not an OpenJDK Author, Committer or Reviewer, simply check the box below to accept the OpenJDK Terms of Use for your comments.
- I agree to the OpenJDK Terms of Use for all comments I make in a project in the OpenJDK GitHub organization.
Your comment will be automatically restored once you have accepted the OpenJDK Terms of Use.
|
@eastig |
|
@theRealAph @fisk @shipilev
If |
|
/cc hotspot-gc |
|
@eastig |
shipilev
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting work! I was able to look through it very briefly:
| @Warmup(iterations = 0) | ||
| @Measurement(iterations = 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what is the intent here. Maybe you wanted @BenchmarkMode(OneShot) instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current algorithm:
- Create an object used in Java methods.
- Run the methods in the interpreter.
- Compile the methods.
- Make the object garbage collectable.
- Run GC (we measure this).
There are not many things to warm-up. And setting up everything for multiple iterations of GC runs might be expensive. Instead we use forks.
IMO, Yes it is @BenchmarkMode(OneShot).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, but first GC would likely be slower, because it would have more real work to do. So you probably want OneShot with the default number of iterations. It will warmup by doing a few GCs, and then do a few other GCs for measurement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have Thread.sleep(1000) in setupCodeCache() to let everything to settle down. I use it because I saw high variance in GC times. With it variance became OK.
Maybe I should use System.gc() instead of Thread.sleep.
So you probably want OneShot with the default number of iterations.
Will I need to recreate an object and to rerun Java methods before each iteration? The first iteration will collect garbage object fields. So following iterations running GC will do nothing. Or will they patch nmethods again?
Arm Neoverse N1 erratum 1542419: "The core might fetch a stale instruction from memory which violates the ordering of instruction fetches". It is fixed in Neoverse N1 r4p1.
Neoverse-N1 implementations mitigate erratum 1542419 with a workaround:
tlbi vae3is, xzrdsb sytlbi vae3is, xzrinvalidates translations for all address spaces (global for address). It waits for all memory accesses using in-scope old translation information to complete before it is considered complete.As this workaround has significant overhead, Arm Neoverse N1 (MP050) Software Developer Errata Notice version 29.0 suggests:
"Since one TLB inner-shareable invalidation is enough to avoid this erratum, the number of injected TLB invalidations should be minimized in the trap handler to mitigate the performance impact due to this workaround."
This PR introduces a mechanism to defer instruction cache (ICache) invalidation for AArch64 to address the Arm Neoverse N1 erratum 1542419, which causes significant performance overhead if ICache invalidation is performed too frequently. The implementation includes detection of affected Neoverse N1 CPUs and automatic enabling of the workaround for relevant Neoverse N1 revisions.
Changes include:
NeoverseN1Errata1542419to enable or disable the workaround for the erratum. The flag is automatically enabled for Neoverse N1 CPUs prior to r4p1, as detected during VM initialization.ICacheInvalidationContextclass to manage deferred ICache invalidation, with platform-specific logic for AArch64. This context is used to batch ICache invalidations, reducing performance impact. As the address for icache invalidation is not relevant, we use the nmethod's code start address.ICacheInvalidationContexton platforms where the workaround is not needed, ensuring portability and minimal impact on other architectures.ZBarrierSetAssembler,ZNMethod,RelocIterator, and related code) to accept adefer_icache_invalidationparameter, allowing ICache invalidation to be deferred and later performed in bulk.Benchmarking results: Neoverse-N1 r3p1 (Graviton 2)
Progress
Issue
Reviewers
Contributors
<aboldtch@openjdk.org>Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28328/head:pull/28328$ git checkout pull/28328Update a local copy of the PR:
$ git checkout pull/28328$ git pull https://git.openjdk.org/jdk.git pull/28328/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 28328View PR using the GUI difftool:
$ git pr show -t 28328Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28328.diff
Using Webrev
Link to Webrev Comment