
8276662: Scalability bottleneck in SymbolTable::lookup_common() #6400

Closed
wants to merge 5 commits into master

Conversation


@dwhite-intel dwhite-intel commented Nov 15, 2021

Symbol table lookup had an optimization added when the symbol table was split into a shared table (for CDS?) and a local table. The optimization tries to track which table successfully found a symbol, so it can try that table first the next time.

Symbol table lookup is used in many JVM operations, including classloading, serialization, and reflection.

At startup, most symbols come from the shared table, but over time lookups hit a mix of local and shared symbols (e.g. user classes still have java.lang.String fields or subclass java.lang.Object), resulting in multiple threads fighting over the value of this global variable.

With enough threads and cores, this can result in "true sharing" cache line contention.

This fix solves the scalability issue by using a thread-local variable to track which table to search first instead of a static global. The "symbol table success history" also makes more sense logically as a thread local state than as a global state.
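For context, a rough free-standing sketch of the pattern (function names, signatures, and stub bodies here are stand-ins, not the exact HotSpot code): each thread remembers which table satisfied its last lookup and probes that table first next time, so no cache line is written by many threads on the lookup fast path.

struct Symbol;  // opaque for this sketch

// Stand-ins for the real table probes; they return the symbol or nullptr.
static Symbol* lookup_shared(const char* /*name*/, int /*len*/, unsigned /*hash*/)  { return nullptr; }
static Symbol* lookup_dynamic(const char* /*name*/, int /*len*/, unsigned /*hash*/) { return nullptr; }

// Per-thread preference instead of the old static global.
static thread_local bool _lookup_shared_first = false;

static Symbol* lookup_common(const char* name, int len, unsigned hash) {
  Symbol* sym;
  if (_lookup_shared_first) {
    sym = lookup_shared(name, len, hash);
    if (sym == nullptr) {
      _lookup_shared_first = false;   // miss: prefer the local table next time
      sym = lookup_dynamic(name, len, hash);
    }
  } else {
    sym = lookup_dynamic(name, len, hash);
    if (sym == nullptr) {
      _lookup_shared_first = true;    // miss: prefer the shared table next time
      sym = lookup_shared(name, len, hash);
    }
  }
  return sym;
}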

Other options would also solve the scaling problem, but may change the behavior that we're trying to optimize, or add more overhead or complexity than warranted, such as:

  • Statically preferring the shared or local table
  • Checking the shared table first "early on", then switching to checking the local table first once enough local symbols have been added.
  • Using a NUMA-aware set of N variables distributed over M threads.

Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8276662: Scalability bottleneck in SymbolTable::lookup_common()

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/6400/head:pull/6400
$ git checkout pull/6400

Update a local copy of the PR:
$ git checkout pull/6400
$ git pull https://git.openjdk.java.net/jdk pull/6400/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 6400

View PR using the GUI difftool:
$ git pr show -t 6400

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/6400.diff


@bridgekeeper bridgekeeper bot commented Nov 15, 2021

👋 Welcome back drwhite! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@dwhite-intel changed the title from JDK-8276662 to 8276662: Scalability bottleneck in SymbolTable::lookup_common() Nov 15, 2021

@openjdk openjdk bot commented Nov 15, 2021

@dwhite-intel The following label will be automatically applied to this pull request:

  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-runtime label Nov 15, 2021
@dwhite-intel dwhite-intel marked this pull request as ready for review Nov 16, 2021
@openjdk openjdk bot added the rfr label Nov 16, 2021

@mlbridge mlbridge bot commented Nov 16, 2021

Webrevs


@shipilev shipilev commented Nov 16, 2021

I have been looking at this code in JDK-8260718 too. The switching heuristic needs some work and some experimentation. I think for this sharing fix, slapping THREAD_LOCAL on _lookup_shared_first would be enough? That would sidestep any questions about changes to the switching heuristic. We already have THREAD_LOCAL defined, so the fix becomes (crudely, untested):

diff --git a/src/hotspot/share/classfile/symbolTable.cpp b/src/hotspot/share/classfile/symbolTable.cpp
index fa966fc7b22..edf48a2c2c1 100644
--- a/src/hotspot/share/classfile/symbolTable.cpp
+++ b/src/hotspot/share/classfile/symbolTable.cpp
@@ -93,3 +93,3 @@ static volatile bool   _has_items_to_clean = false;
 static volatile bool _alt_hash = false;
-static volatile bool _lookup_shared_first = false;
+static THREAD_LOCAL bool _lookup_shared_first = false;


@shipilev shipilev commented Nov 16, 2021

Woot, I think it works and even the simple "Hello World" startup seems to respond:

$ perf stat -r 1000 build/server-baseline/bin/java -Xms128m -Xmx128m Hello > /dev/null

 Performance counter stats for 'build/server-baseline/bin/java -Xms128m -Xmx128m Hello' (1000 runs):

             29.62 msec task-clock                #    1.165 CPUs utilized            ( +-  0.11% )
                93      context-switches          #    0.003 M/sec                    ( +-  0.14% )
                 0      cpu-migrations            #    0.010 K/sec                    ( +-  6.36% )
             2,608      page-faults               #    0.088 M/sec                    ( +-  0.01% )
        83,596,348      cycles                    #    2.822 GHz                      ( +-  0.22% )  (46.41%)
        10,204,346      stalled-cycles-frontend   #   12.21% frontend cycles idle     ( +-  0.40% )  (43.73%)
        28,164,763      stalled-cycles-backend    #   33.69% backend cycles idle      ( +-  0.36% )  (45.36%)
        63,232,465      instructions              #    0.76  insn per cycle         
                                                  #    0.45  stalled cycles per insn  ( +-  0.10% )  (53.59%)
        12,144,831      branches                  #  409.979 M/sec                    ( +-  0.11% )  (56.27%)
           448,789      branch-misses             #    3.70% of all branches          ( +-  0.25% )  (54.64%)

         0.0254298 +- 0.0000362 seconds time elapsed  ( +-  0.14% )

$ perf stat -r 1000 build/linux-x86_64-server-release/images/jdk/bin/java -Xms128m -Xmx128m Hello > /dev/null

 Performance counter stats for 'build/linux-x86_64-server-release/images/jdk/bin/java -Xms128m -Xmx128m Hello' (1000 runs):

             29.22 msec task-clock                #    1.166 CPUs utilized            ( +-  0.11% )
                93      context-switches          #    0.003 M/sec                    ( +-  0.14% )
                 0      cpu-migrations            #    0.009 K/sec                    ( +-  6.53% )
             2,605      page-faults               #    0.089 M/sec                    ( +-  0.01% )
        81,731,944      cycles                    #    2.797 GHz                      ( +-  0.24% )  (46.56%)
        10,163,520      stalled-cycles-frontend   #   12.44% frontend cycles idle     ( +-  0.40% )  (43.32%)
        26,563,080      stalled-cycles-backend    #   32.50% backend cycles idle      ( +-  0.36% )  (44.87%)
        63,660,354      instructions              #    0.78  insn per cycle         
                                                  #    0.42  stalled cycles per insn  ( +-  0.09% )  (53.44%)
        12,049,198      branches                  #  412.400 M/sec                    ( +-  0.12% )  (56.68%)
           447,857      branch-misses             #    3.72% of all branches          ( +-  0.22% )  (55.13%)

         0.0250630 +- 0.0000353 seconds time elapsed  ( +-  0.14% )


@dwhite-intel dwhite-intel commented Nov 16, 2021

Yes, the thread-local should fix the scalability issue (I only tested on a small machine). And logically this performance hint makes more sense per-thread, not globally.

On the other hand, HotSpot doesn't throw C++ thread-locals around everywhere; "_thread" is pervasive enough to make it worthwhile. If there are concerns about using too many thread-locals, I wasn't sure whether it would be worth burning one for this purpose.

If we want to use THREAD_LOCAL, we may need this:

#ifdef USE_LIBRARY_BASED_TLS_ONLY
static volatile bool _lookup_shared_first = false;
#else
static THREAD_LOCAL bool _lookup_shared_first = false;
#endif

(Not that I know what USE_LIBRARY_BASED_TLS_ONLY is really for :-)


@shipilev shipilev commented Nov 16, 2021

(Not that I know what USE_LIBRARY_BASED_TLS_ONLY is really for :-)

Some history: JDK-8259372. Yes, it looks like generic code needs to be wrapped with USE_LIBRARY_BASED_TLS_ONLY for the time being, as is done in thread.cpp if we look at existing code.


@cl4es cl4es commented Nov 16, 2021

I agree the hint provided by _lookup_shared_first makes more sense as thread-local state than as global state, and that this particular flag seems a worthwhile THREAD_LOCAL candidate even with some potential for footprint overhead.

cl4es approved these changes Nov 19, 2021

@cl4es cl4es left a comment

LGTM, although I think you need a review from a CDS maintainer. @iklam?


@openjdk openjdk bot commented Nov 19, 2021

@dwhite-intel This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8276662: Scalability bottleneck in SymbolTable::lookup_common()

Reviewed-by: redestad, dholmes, iklam, shade

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 1985 new commits pushed to the master branch:

  • e47cc81: 8275386: Change nested classes in jdk.jlink to static nested classes
  • f609b8f: 8274946: Cleanup unnecessary calls to Throwable.initCause() in java.rmi
  • 36b59f3: 8274333: Redundant null comparison after Pattern.split
  • 6677554: 8274949: Use String.contains() instead of String.indexOf() in java.base
  • 09e8c8c: 8277342: vmTestbase/nsk/stress/strace/strace004.java fails with SIGSEGV in InstanceKlass::jni_id_for
  • 976c2bb: 8277212: GC accidentally cleans valid megamorphic vtable inline caches
  • 03f8c0f: 8275887: jarsigner prints invalid digest/signature algorithm warnings if keysize is weak/disabled
  • 936f7ff: 8276150: Quarantined jpackage apps are labeled as "damaged"
  • a022796: 8275745: Reproducible copyright headers
  • b1a1bf4: 8277427: Update jib-profiles.js to use JMH 1.33 devkit
  • ... and 1975 more: https://git.openjdk.java.net/jdk/compare/8c8422e0f8886d9bbfca29fd228368f88bf46f2c...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready label Nov 19, 2021

@dholmes-ora dholmes-ora left a comment

Based on the discussions this seems fine.

Thanks,
David

src/hotspot/share/classfile/symbolTable.cpp:
static volatile bool _lookup_shared_first = false;
#else
static THREAD_LOCAL bool _lookup_shared_first = false;
@dholmes-ora dholmes-ora Nov 19, 2021

Can you add an explanatory comment as to why this is a thread-local please. Just a couple of lines. Thanks.
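One possible shape for such a comment (wording is illustrative only, not the text that was committed):

// Heuristic: remember, per thread, which table satisfied the last lookup so
// that table can be probed first on the next lookup. Kept thread-local rather
// than global to avoid cache-line contention between threads on this hot path
// (JDK-8276662). Falls back to a plain global on platforms restricted to
// library-based TLS.
#ifdef USE_LIBRARY_BASED_TLS_ONLY
static volatile bool _lookup_shared_first = false;
#else
static THREAD_LOCAL bool _lookup_shared_first = false;
#endif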


@iklam iklam commented Nov 19, 2021

@dwhite-intel do you have any performance results that you can share? Also, the PR description needs to be updated to reflect the final version.


@iklam iklam commented Nov 19, 2021

I did a quick check of the latest patch. Small start-up runs like this are susceptible to effects of dynamic frequency scaling due to CPU heat, so I interleaved the runs just to be safe:

for i in 1 2 3 4 5; do
    for v in old new; do
        echo $v.$i.txt
        perf stat -o $v.$i.txt -r 500 ./jdk_$v/bin/java \
             -Xmx128m -Xshare:on -version 2> /dev/null
    done
done


old.1.txt:          0.044077 +- 0.000425 seconds time elapsed  ( +-  0.96% )
old.2.txt:          0.043140 +- 0.000385 seconds time elapsed  ( +-  0.89% )
old.3.txt:          0.044393 +- 0.000415 seconds time elapsed  ( +-  0.93% )
old.4.txt:          0.043779 +- 0.000407 seconds time elapsed  ( +-  0.93% )
old.5.txt:          0.043221 +- 0.000390 seconds time elapsed  ( +-  0.90% )
geomean = 0.043719

new.1.txt:          0.043791 +- 0.000405 seconds time elapsed  ( +-  0.93% )
new.2.txt:          0.042716 +- 0.000364 seconds time elapsed  ( +-  0.85% )
new.3.txt:          0.043203 +- 0.000382 seconds time elapsed  ( +-  0.88% )
new.4.txt:          0.043437 +- 0.000398 seconds time elapsed  ( +-  0.92% )
new.5.txt:          0.043012 +- 0.000382 seconds time elapsed  ( +-  0.89% )
geomean = 0.043230

So at least on my machine (a 10-year-old dual-socket Xeon/Sandy Bridge) the patch doesn't seem to introduce any regression.

iklam approved these changes Nov 19, 2021

@dwhite-intel dwhite-intel commented Nov 19, 2021

@iklam, we've seen 5-15% improvement in serialization benchmarks on CascadeLake/IceLake systems, and almost 2x improvement on multi-socket systems with more cores.


@dwhite-intel dwhite-intel commented Nov 19, 2021

Thanks for the reviews everyone!
I believe all review comments have been responded to. If there's anything else, please let me know. Otherwise I plan on integrating by EOD.


@dwhite-intel dwhite-intel commented Nov 20, 2021

/integrate


@openjdk openjdk bot commented Nov 20, 2021

Going to push as commit 1d7cef3.
Since your change was applied there have been 1989 commits pushed to the master branch:

  • c79a485: 8277494: [BACKOUT] JDK-8276150 Quarantined jpackage apps are labeled as "damaged"
  • 2ab43ec: 8273544: Increase test coverage for snippets
  • 2d4af22: 8277370: configure script cannot distinguish WSL version
  • a3406a1: 8277092: TestMetaspaceAllocationMT2.java#ndebug-default fails with "RuntimeException: Committed seems high: NNNN expected at most MMMM"
  • e47cc81: 8275386: Change nested classes in jdk.jlink to static nested classes
  • f609b8f: 8274946: Cleanup unnecessary calls to Throwable.initCause() in java.rmi
  • 36b59f3: 8274333: Redundant null comparison after Pattern.split
  • 6677554: 8274949: Use String.contains() instead of String.indexOf() in java.base
  • 09e8c8c: 8277342: vmTestbase/nsk/stress/strace/strace004.java fails with SIGSEGV in InstanceKlass::jni_id_for
  • 976c2bb: 8277212: GC accidentally cleans valid megamorphic vtable inline caches
  • ... and 1979 more: https://git.openjdk.java.net/jdk/compare/8c8422e0f8886d9bbfca29fd228368f88bf46f2c...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot closed this Nov 20, 2021
@openjdk openjdk bot added integrated and removed ready rfr labels Nov 20, 2021

@openjdk openjdk bot commented Nov 20, 2021

@dwhite-intel Pushed as commit 1d7cef3.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Labels: hotspot-runtime, integrated
5 participants