8072070: Improve interpreter stack banging #7247

Closed

shipilev
Member

@shipilev shipilev commented Jan 27, 2022

This is an old issue; I submitted the first RFE about this back in 2015. It shows up every time I benchmark interpreter-only code. Most recently, it showed up in my work to make the java.lang.invoke infrastructure work reasonably fast when cold, which includes lots of interpreter paths.

The underlying problem is that template interpreters rebang the entire shadow zone on every method entry. This takes tens of instructions and blows out the TLB by accessing tens of pages (on some implementations, I reckon, almost the entire L1 TLB!). I think we can make it universally better for all template interpreters by introducing safe limit / growth watermarks for thread stacks, so that we bang only when needed. It also drops the need for special-casing native_call, because we might as well bang the entire shadow zone in the native case too.
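
The safe limit / growth watermark idea can be sketched outside of HotSpot as a small C++ model. This is only an illustration of the description above, not the code in the patch: the struct, its field names, and the counters are all hypothetical, and real interpreter code would emit stores into the stack rather than count them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative model only: names and layout are assumptions, not the patch's code.
struct StackBangModel {
    std::uintptr_t growth_watermark;  // lowest SP whose shadow zone has been banged so far
    std::uintptr_t safe_limit;        // at or below this SP, bang on every entry
    std::size_t    page_size;         // bytes per page
    std::size_t    shadow_pages;      // number of pages in the shadow zone
    std::size_t    pages_touched;     // stand-in for the actual stores into the stack
    std::uintptr_t lowest_touched;    // lowest address the model has "touched"
};

// Called on (simulated) method entry with the current stack pointer.
// Stacks grow toward lower addresses. Returns true if the shadow zone was banged.
bool bang_on_entry(StackBangModel& t, std::uintptr_t sp) {
    assert(t.page_size > 0 && t.shadow_pages > 0);

    // Fast path: an earlier, deeper entry already banged everything this frame needs.
    if (sp > t.growth_watermark) {
        return false;
    }

    // Slow path: touch every shadow page below SP.
    for (std::size_t p = 1; p <= t.shadow_pages; p++) {
        std::uintptr_t addr = sp - p * t.page_size;  // real code would store here
        t.pages_touched++;
        if (addr < t.lowest_touched) {
            t.lowest_touched = addr;
        }
    }

    // Move the watermark only while SP is above the safe limit, so that entries
    // close to the guard zone keep banging on every call.
    if (sp > t.safe_limit) {
        t.growth_watermark = sp;
    }
    return true;
}
```

In this model, a method entered at a shallower stack depth than a previously banged one takes the fast path, while entries at or below the safe limit keep banging on every call.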

This patch makes a pilot change for x86, without touching other architectures. Other architectures can follow this example later. This is why the native_call argument persists, even though it is no longer used in the x86 case. There is also a new test group that I found useful when debugging on Windows; that group is going to go away before integration.

I tried to capture the current mechanics of stack banging in stackOverflow.hpp, hoping the change becomes more obvious, and so that arch-specific template interpreter codes could just reference it without copy-pasting it around.

I think it is fairly complete, and so would like to solicit more feedback and testing here.

Point runs on SPECjvm2008 with -Xint show huge improvements on half of the tests, without any regressions:

 compiler.compiler: +77%
 compiler.sunflow: +69%
 compress: +166%
 crypto.rsa: +15%
 crypto.signverify: +70%
 mpegaudio: +8%
 serial: +50%
 sunflow: +57%
 xml.transform: +61%
 xml.validation: +43%

My new java.lang.invoke benchmarks improve a lot as well:

Benchmark              Mode  Cnt    Score    Error  Units

# Mainline
MHInvoke.methodHandle  avgt    5  799.671 ± 9.087  ns/op
MHInvoke.plain         avgt    5  261.947 ± 1.421  ns/op
VHGet.plain            avgt    5  231.372 ± 3.044  ns/op
VHGet.varHandle        avgt    5  924.880 ± 6.026  ns/op

# This WIP
MHInvoke.methodHandle  avgt    5  240.456 ± 3.931  ns/op
MHInvoke.plain         avgt    5   70.851 ± 1.986  ns/op
VHGet.plain            avgt    5   52.506 ± 3.768  ns/op
VHGet.varHandle        avgt    5  335.785 ± 4.398  ns/op

It also palpably improves startup even on small HelloWorld, even when compilers are present:

$ perf stat -r 5000 build/baseline/bin/java -Xms128m -Xmx128m Hello > /dev/null

 Performance counter stats for 'build/baseline/bin/java -Xms128m -Xmx128m Hello' (5000 runs):

             22.06 msec task-clock                #    1.030 CPUs utilized            ( +-  0.04% )
                96      context-switches          #    4.353 K/sec                    ( +-  0.07% )
                 7      cpu-migrations            #  333.181 /sec                     ( +-  0.32% )
             2,437      page-faults               #  110.469 K/sec                    ( +-  0.00% )
        78,763,038      cycles                    #    3.571 GHz                      ( +-  0.05% )  (77.30%)
         2,107,182      stalled-cycles-frontend   #    2.68% frontend cycles idle     ( +-  0.41% )  (77.40%)
         2,235,371      stalled-cycles-backend    #    2.84% backend cycles idle      ( +-  1.05% )  (71.39%)
        67,296,528      instructions              #    0.85  insn per cycle         
                                                  #    0.03  stalled cycles per insn  ( +-  0.03% )  (89.79%)
        12,483,022      branches                  #  565.911 M/sec                    ( +-  0.01% )  (99.73%)
           384,412      branch-misses             #    3.08% of all branches          ( +-  0.07% )  (85.91%)

         0.0214224 +- 0.0000875 seconds time elapsed  ( +-  0.41% )

$ perf stat -r 5000 build/interp-bang/bin/java -Xms128m -Xmx128m Hello > /dev/null

 Performance counter stats for 'build/interp-bang/bin/java -Xms128m -Xmx128m Hello' (5000 runs):

             21.78 msec task-clock                #    1.031 CPUs utilized            ( +-  0.05% )
                98      context-switches          #    4.519 K/sec                    ( +-  0.07% )
                 7      cpu-migrations            #  339.292 /sec                     ( +-  0.31% )
             2,434      page-faults               #  111.755 K/sec                    ( +-  0.00% )
        77,746,317      cycles                    #    3.569 GHz                      ( +-  0.05% )  (76.94%)
         2,143,121      stalled-cycles-frontend   #    2.76% frontend cycles idle     ( +-  0.45% )  (76.03%)
         2,059,440      stalled-cycles-backend    #    2.65% backend cycles idle      ( +-  1.11% )  (71.82%)
        66,742,892      instructions              #    0.86  insn per cycle         
                                                  #    0.03  stalled cycles per insn  ( +-  0.03% )  (91.40%)
        12,494,797      branches                  #  573.634 M/sec                    ( +-  0.01% )  (99.80%)
           386,145      branch-misses             #    3.09% of all branches          ( +-  0.08% )  (85.56%)

         0.0211278 +- 0.0000877 seconds time elapsed  ( +-  0.42% )

Additional testing:

  • Linux x86_64 fastdebug, tier1
  • Linux x86_64 fastdebug, tier2
  • Linux x86_64 fastdebug, tier3
  • Linux x86_64 fastdebug, tier4
  • Linux x86_32 fastdebug, tier1
  • Linux x86_32 fastdebug, tier2
  • Linux x86_32 fastdebug, tier3

Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/7247/head:pull/7247
$ git checkout pull/7247

Update a local copy of the PR:
$ git checkout pull/7247
$ git pull https://git.openjdk.java.net/jdk pull/7247/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 7247

View PR using the GUI difftool:
$ git pr show -t 7247

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/7247.diff

@bridgekeeper

bridgekeeper bot commented Jan 27, 2022

👋 Welcome back shade! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Jan 27, 2022

@shipilev The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Jan 27, 2022
@shipilev shipilev changed the title from "8072070: Improve interpreter shadow zone banging" to "8072070: Improve interpreter stack banging" Feb 4, 2022
@shipilev shipilev force-pushed the JDK-8072070-interp-shadow-zone-bang branch from e369b6e to b1ed28f Compare February 4, 2022 10:17
@shipilev shipilev marked this pull request as ready for review February 4, 2022 11:16
@openjdk openjdk bot added the rfr Pull request is ready for review label Feb 4, 2022
@mlbridge

mlbridge bot commented Feb 4, 2022

Webrevs

@TheRealMDoerr
Contributor

Hi Aleksey, thanks for working on the stack banging code. I have wanted to do so for a long time, but haven't managed to yet. The results are impressive!

A quick question. Why can't we just use something like the following on Linux?

  __ cmpptr(rsp, Address(r15_thread, JavaThread::stack_overflow_limit_offset()));
  __ jump_cc(Assembler::belowEqual, ExternalAddress(Interpreter::_throw_StackOverflowError_entry));

Is banging the shadow area strictly required on Linux?
It could be that it is needed on some OSes.

@shipilev
Member Author

shipilev commented Feb 4, 2022

> A quick question. Why can't we just use something like the following on Linux?
>
>   __ cmpptr(rsp, Address(r15_thread, JavaThread::stack_overflow_limit_offset()));
>   __ jump_cc(Assembler::belowEqual, ExternalAddress(Interpreter::_throw_StackOverflowError_entry));
>
> Is banging the shadow area strictly required on Linux? It could be that it is needed on some OSes.

(There is a large comment in stackOverflow.hpp -- do you see blind spots there?)

AFAIU, the only OS that needs to bang page by page to commit stacks is Windows; got some funky GHA failures without it.

My early patches for Linux were something like what you proposed. But the deeper I got into this, the more I realized it is safer to keep banging in order to cooperate with the rest of the stack overflow machinery. For example, I am not at all sure that throwing the SOE when below stack_overflow_limit works well with reserved zone handling. It was probably okay when we only had the yellow+red zones.

Notice that the watermark code effectively bangs (the most useful parts of) the stack once. What I think it achieves is making OS-specific or call-flavor-specific optimization questions moot. We can do everything to handle the worst case, have a single path taken in all configurations (which simplifies development/testing), and pay peanuts in performance penalties for it.
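
To illustrate the "bangs once" point, here is a rough C++ simulation that counts page touches for a sequence of method entries, with and without a watermark. Everything in it (the function names, the entry pattern, the page count of 20) is made up for illustration, and it ignores the safe limit near the guard zone for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not HotSpot code: counts simulated page touches for a
// sequence of stack pointers observed at method entry (stack grows down).
std::size_t count_touches(const std::vector<std::uintptr_t>& entry_sps,
                          std::uintptr_t stack_base,
                          std::size_t shadow_pages,
                          bool use_watermark) {
    assert(shadow_pages > 0);
    std::uintptr_t watermark = stack_base;  // nothing has been banged yet
    std::size_t touches = 0;
    for (std::uintptr_t sp : entry_sps) {
        if (use_watermark && sp > watermark) {
            continue;  // an earlier, deeper entry already banged this range
        }
        touches += shadow_pages;  // one touch per shadow page below sp
        watermark = sp;           // only consulted when use_watermark is true
    }
    return touches;
}

// Builds an illustrative call pattern: three entries that each go deeper,
// followed by `repeats` entries at shallower depths.
std::vector<std::uintptr_t> sample_entries(std::size_t repeats) {
    std::vector<std::uintptr_t> sps = {0xFF000, 0xFE000, 0xFD000};
    for (std::size_t i = 0; i < repeats; i++) {
        sps.push_back((i % 2 == 0) ? 0xFE000 : 0xFF000);
    }
    return sps;
}
```

For the sample pattern with 1000 shallow re-entries, the model touches 60 pages with the watermark versus 20,060 when rebanging on every entry.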

@TheRealMDoerr
Contributor

I think it would be interesting to figure out if we can let the Linux kernel do all the stack management work for us and avoid stack banging, protected pages, etc. inside HotSpot completely. But that may be beyond the scope of your PR. (Windows is a different story.) I hope that I can find time to figure it out at some point.

Member

@navyxliu navyxliu left a comment

Since this PR touches stackOverflow.hpp, could you also take a look at this?
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/runtime/stackOverflow.cpp#L66

We actually get the page size from the OS. Why do we need alignment = 4K?

@shipilev
Member Author

shipilev commented Feb 5, 2022

> Since this PR touches stackOverflow.hpp, could you also take a look at this? https://github.com/openjdk/jdk/blob/master/src/hotspot/share/runtime/stackOverflow.cpp#L66
>
> We actually get the page size from the OS. Why do we need alignment = 4K?

Look here:

// We need to adapt the configured number of stack protection pages given
// in 4K pages to the actual os page size. We must do this before setting
// up minimal stack sizes etc. in os::init_2().
size_t alignment = 4*K;
-- the StackYellowPages, StackRedPages, StackShadowPages are defined in units of 4K pages. It should probably be called unit, not alignment. I'd like to avoid scope creep for this PR, so that's for another day.
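
As a rough illustration of the unit conversion described in that code comment, the sketch below scales a page count given in 4K units to bytes and rounds it up to the OS page size. The helper names are hypothetical, the rounding-up step is an assumption about how the adaptation is done, and the count of 20 used in the note below is just an example value.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t K = 1024;
constexpr std::size_t kUnit = 4 * K;  // the flags are expressed in 4K units

// Rounds value up to a multiple of alignment; alignment must be a power of two.
constexpr std::size_t align_up(std::size_t value, std::size_t alignment) {
    return (value + alignment - 1) & ~(alignment - 1);
}

// Hypothetical helper: converts a zone size given in 4K units to bytes that
// are a whole number of OS pages (assumed behavior, see the note above).
std::size_t zone_size_bytes(std::size_t pages_in_4k_units, std::size_t os_page_size) {
    // The OS page size is expected to be a non-zero power of two.
    assert(os_page_size != 0 && (os_page_size & (os_page_size - 1)) == 0);
    return align_up(pages_in_4k_units * kUnit, os_page_size);
}
```

With these assumptions, 20 units come to 80 KB on a system with 4K pages and are rounded up to 128 KB on a system with 64K pages.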

@shipilev
Member Author

shipilev commented Feb 7, 2022

> > Since this PR touches stackOverflow.hpp, could you also take a look at this? https://github.com/openjdk/jdk/blob/master/src/hotspot/share/runtime/stackOverflow.cpp#L66
> > We actually get the page size from the OS. Why do we need alignment = 4K?
>
> Look here:
>
> // We need to adapt the configured number of stack protection pages given
> // in 4K pages to the actual os page size. We must do this before setting
> // up minimal stack sizes etc. in os::init_2().
> size_t alignment = 4*K;
>
> -- the StackYellowPages, StackRedPages, StackShadowPages are defined in units of 4K pages. It should probably be called unit, not alignment. I'd like to avoid scope creep for this PR, so that's for another day.

Done in #7362.

@DamonFool
Member

Hi @shipilev ,

Did you test the perf improvement based on the latest JDK?
I tried to test SPECjvm2008's compiler.compiler with JDK 19, but it failed with

  Benchmark:   compiler.compiler
  Run mode:    timed run
  Test type:   multi
  Threads:     8
  Warmup:      120s
  Iterations:  1
  Run length:  240s
Error in setup of Benchmark.
spec.harness.StopBenchmarkException: Error invoking bmSetupBenchmarkMethod
        at spec.harness.ProgramRunner.invokeBmSetupBenchmark(ProgramRunner.java:185)
        at spec.harness.ProgramRunner.runBenchmark(ProgramRunner.java:301)
        at spec.harness.ProgramRunner.run(ProgramRunner.java:98)
Caused by: java.lang.reflect.InvocationTargetException
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:119)
        at java.base/java.lang.reflect.Method.invoke(Method.java:577)
        at spec.harness.ProgramRunner.invokeBmSetupBenchmark(ProgramRunner.java:183)
        ... 2 more
Caused by: java.lang.NoClassDefFoundError: com/sun/tools/javac/util/JavacFileManager
        at java.base/java.lang.ClassLoader.defineClass1(Native Method)
        at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1013)
        at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
        at java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
        at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
        at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
        at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
        at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
        at spec.benchmarks.compiler.MainBase.preSetupBenchmark(MainBase.java:38)
        at spec.benchmarks.compiler.compiler.Main.setupBenchmark(Main.java:38)
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
        ... 4 more
Caused by: java.lang.ClassNotFoundException: com.sun.tools.javac.util.JavacFileManager
        at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
        at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
        ... 16 more

Warmup (120s) begins: Mon Feb 07 20:35:40 CST 2022
Warmup (120s) ends:   Mon Feb 07 20:35:40 CST 2022
Warmup (120s) result:  **NOT VALID**

Am I missing something?
Thanks.

Contributor

@coleenp coleenp left a comment

This looks like a nice optimization to me. I initially thought that calling the new limit of where we can elide stack banging "watermark" had something to do with the GC stack watermark code but "watermark" is sort of the best word for this. If we had another descriptive word that might be better, but I can't think of anything.
Thank you for fixing this! We didn't have tests showing the motivation ourselves so sorry that we ignored it for so long.

src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp Outdated
@openjdk

openjdk bot commented Feb 7, 2022

@shipilev This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8072070: Improve interpreter stack banging

Reviewed-by: xliu, coleenp, mdoerr

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 75 new commits pushed to the master branch:

  • 8441d51: 8281419: The source data for the color conversion can be discarded
  • a037b3c: 8281460: Let ObjectMonitor have its own NMT category
  • 65831eb: 8281318: Improve jfr/event/allocation tests reliability
  • eee6a56: 8281522: Rename ADLC classes which have the same name as hotspot variants
  • 84868e3: 8281275: Upgrading from 8 to 11 no longer accepts '/' as filepath separator in gc paths
  • 58c2bd3: 8281536: JFR: Improve jdk.jfr.ContentType documentation
  • 83b6e4b: 8281294: [vectorapi] FIRST_NONZERO reduction operation throws IllegalArgumentExcept on zero vectors
  • 039313d: 8054449: Incompatible type in example code in TreePath
  • 3ce1c5b: 8280832: Update usage docs for NonblockingQueue
  • d442328: 8281262: Windows builds in different directories are not fully reproducible
  • ... and 65 more: https://git.openjdk.java.net/jdk/compare/63a00a0df24b154ef459936dbd69bcd2f0626235...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Feb 7, 2022
@shipilev
Member Author

shipilev commented Feb 7, 2022

> I tried to test SPECjvm2008's compiler.compiler with jdk19.

I think some benchmarks in the currently public SPECjvm2008 do not work with JDK 19 due to missing dependencies. I have a hacky version that is able to work with a modern JDK. I used that to estimate performance on JDK mainline.

@shipilev
Member Author

shipilev commented Feb 7, 2022

Fiddly code, documentation update for the core subsystem, so:

/reviewers 3

@openjdk

openjdk bot commented Feb 7, 2022

@shipilev
The number of required reviews for this PR is now set to 3 (with at least 1 of role reviewers).

@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Feb 7, 2022
Member

@navyxliu navyxliu left a comment

LGTM! I am not a reviewer. We still need reviewers to approve this.

Contributor

@TheRealMDoerr TheRealMDoerr left a comment

LGTM. And a step in the right direction, IMHO. We should check the code on other platforms, too (separately is OK).

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Feb 9, 2022
@shipilev
Member Author

All right, thanks for the reviews! Last call for comments. I am planning to integrate it later today.

@theRealAph
Contributor

> All right, thanks for the reviews! Last call for comments. I am planning to integrate it later today.

x86-32 has some weird stack handling, particularly when using the invocation interface. I guess we assume our regression tests will catch breakage there.

@shipilev
Member Author

> > All right, thanks for the reviews! Last call for comments. I am planning to integrate it later today.
>
> x86-32 has some weird stack handling, particularly when using the invocation interface. I guess we assume our regression tests will catch breakage there.

As you can see in "Additional testing", I ran tier{1,2,3} on x86_32 without problems. It is hard to tell how this patch would break x86_32, though: it would still bang the same way when close to the guard zone.

@shipilev
Member Author

I am sure nothing bad is going to happen if I integrate this on Friday!

/integrate

@openjdk

openjdk bot commented Feb 11, 2022

Going to push as commit 3a13425.
Since your change was applied there have been 75 commits pushed to the master branch:

  • 8441d51: 8281419: The source data for the color conversion can be discarded
  • a037b3c: 8281460: Let ObjectMonitor have its own NMT category
  • 65831eb: 8281318: Improve jfr/event/allocation tests reliability
  • eee6a56: 8281522: Rename ADLC classes which have the same name as hotspot variants
  • 84868e3: 8281275: Upgrading from 8 to 11 no longer accepts '/' as filepath separator in gc paths
  • 58c2bd3: 8281536: JFR: Improve jdk.jfr.ContentType documentation
  • 83b6e4b: 8281294: [vectorapi] FIRST_NONZERO reduction operation throws IllegalArgumentExcept on zero vectors
  • 039313d: 8054449: Incompatible type in example code in TreePath
  • 3ce1c5b: 8280832: Update usage docs for NonblockingQueue
  • d442328: 8281262: Windows builds in different directories are not fully reproducible
  • ... and 65 more: https://git.openjdk.java.net/jdk/compare/63a00a0df24b154ef459936dbd69bcd2f0626235...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Feb 11, 2022
@openjdk openjdk bot closed this Feb 11, 2022
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Feb 11, 2022
@openjdk

openjdk bot commented Feb 11, 2022

@shipilev Pushed as commit 3a13425.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@adinn
Contributor

adinn commented Feb 11, 2022

ship it and be damned :-)

@TheRealMDoerr
Contributor

Seems like Power is not affected by this TLB / cache bottleneck. We use 64K pages and typically 2 store instructions for banging. On the other hand, I think it's a good thing to avoid touching any storage which we don't need. So, we could rework the PPC64 implementation, too (optionally). Or wait until more experiments have been made.

@shipilev
Member Author

> Seems like Power is not affected by this TLB / cache bottleneck. We use 64K pages and typically 2 store instructions for banging. On the other hand, I think it's a good thing to avoid touching any storage which we don't need. So, we could rework the PPC64 implementation, too (optionally). Or wait until more experiments have been made.

Yes, larger VM pages mean fewer addresses to touch. OTOH, in my related experiments with removing the stack banging on compiled entry altogether, we seem to gain single-digit percent improvements, even though we only touch one location far away.
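
The effect of page size on the number of stores can be sketched as follows. This is an illustrative model, not the PPC64 or x86 interpreter code: the function name is hypothetical, and the 128 KB shadow zone used in the note below is an assumed size chosen only to make the arithmetic concrete.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: offsets below the stack pointer that a full bang of the
// shadow zone would touch, one store per page.
std::vector<std::size_t> bang_offsets(std::size_t shadow_zone_bytes,
                                      std::size_t page_size) {
    // The model expects the zone to be a whole number of pages.
    assert(page_size > 0 && shadow_zone_bytes % page_size == 0);
    std::vector<std::size_t> offsets;
    const std::size_t n_pages = shadow_zone_bytes / page_size;
    for (std::size_t p = 1; p <= n_pages; p++) {
        offsets.push_back(p * page_size);  // one touched address per page
    }
    return offsets;
}
```

Under that assumption, a 128 KB zone needs 2 stores per full bang with 64K pages, and 32 stores with 4K pages.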

Anyhow, I think a good plan is to wait and see if this x86 pilot change runs into any interesting problems, before translating it to other architectures.

@shipilev shipilev deleted the JDK-8072070-interp-shadow-zone-bang branch March 7, 2022 10:59
Labels
hotspot hotspot-dev@openjdk.org integrated Pull request has been integrated
8 participants