8276618: Pad cacheline for Thread::_rcu_counter #6246
Conversation
@Hamlin-Li The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
/cc hotspot-gc
@Hamlin-Li
Hi Hamlin,
This seems reasonable to me. However, whenever we add padding to optimise the placement of one field, I always wonder whether that same padding has de-optimised the placement of other fields. I think we need to see a broader run of benchmarks here, and across more than just x86_64.
I will see if I can assist on the benchmark front.
Thanks,
David
@@ -250,10 +251,10 @@ class Thread: public ThreadShadow {

  // Support for GlobalCounter
 private:
pre-existing nit: this private is not needed; nor is the public at line 260.
Thanks a lot David, it will be very helpful.
Thanks a lot Thomas. :)
Hi, we tried to reproduce your numbers internally, but failed to do so. Differences seem to be within noise. We tried with specjbb2015 multi-jvm on a fairly large machine (152 threads; it was what has been "on hand") and multiple runs of our internal benchmarks and specjbb2015 composite runs on various (not-so-large but still fairly big sized) machines. Could you post or send more details about your configuration? The other concern that has been brought up internally is that this increases the size of Thread by ~40%, from 624 to 872 bytes; do you think there is a way to save some memory by reorganizing the fields so that the counter is on a separate cache line "naturally"? Thanks,
Thanks Thomas for the feedback. I have updated the summary of this PR with more configuration and environment info, and I have also added a 2nd round of runs. Although the improvement is not as large as in the previous round (10%), we still got about a 3~4% improvement, and the data seems more stable than in the 1st round of runs.
Sure, if the current change is proven to bring some performance benefit, let me do some more research in this direction.
@Hamlin-Li This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
Need further investigation.
@Hamlin-Li This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
@Hamlin-Li This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the
Currently, Thread::_rcu_counter is not padded to its own cache line; it should be beneficial to do so.
The initial specjbb test shows about a 10.5% improvement in critical and a 0.7% improvement in max in specjbb2015.
========= test result (1st round) ==========
rcu base
45096 38980
41741 41468
42349 41053
44485 42030
47103 39915
43864 36004
==== average ====
44106.33333 39908.33333
==== improvement ====
10.5%
========= test result (2nd round) ==========
The second round of runs includes 3 configurations (gc data & rcu, rcu, base; the three columns in the table):
Although the improvement is not as large as in the previous round (10%), we still got about a 3~4% improvement.
gc data & rcu rcu base
41284 41860 37099
42296 42166 44692
42810 43423 41801
43492 45603 40274
43808 40641 39627
43029 40242 39793
42543 41662 41544
43420 42702 37991
44212 43354 40319
42692 43442 45264
44773 44577 44213
40835 41870 42008
44282 44167 42527
==== average ====
43036.61538 42746.84615 41319.38462
==== improvement ====
gc data + rcu / base: 4.156%
rcu / base: 3.45%
========= configuration and environment ==========
specjbb arguments:
GROUP_COUNT=4
TI_JVM_COUNT=1
SPEC_OPTS_C="-Dspecjbb.group.count=$GROUP_COUNT -Dspecjbb.txi.pergroup.count=$TI_JVM_COUNT"
SPEC_OPTS_TI=""
SPEC_OPTS_BE=""
JAVA_OPTS_C="-server -Xms2g -Xmx2g -XX:+UseParallelGC"
JAVA_OPTS_TI="-server -Xms2g -Xmx2g -XX:+UseParallelGC"
JAVA_OPTS_BE="-server -XX:+UseG1GC -Xms32g -Xmx32g"
MODE_ARGS_C="-ikv"
MODE_ARGS_TI="-ikv"
MODE_ARGS_BE="-ikv"
NUM_OF_RUNS=1
HW:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 224
On-line CPU(s) list: 0-223
Thread(s) per core: 2
Core(s) per socket: 28
Socket(s): 4
NUMA node(s): 4
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8176M CPU @ 2.10GHz
Stepping: 4
CPU MHz: 1001.925
CPU max MHz: 2101.0000
CPU min MHz: 1000.0000
BogoMIPS: 4200.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 39424K
NUMA node0 CPU(s): 0-27,112-139
NUMA node1 CPU(s): 28-55,140-167
NUMA node2 CPU(s): 56-83,168-195
NUMA node3 CPU(s): 84-111,196-223
              total        used        free      shared  buff/cache   available
Mem:           3.0T        3.8G        2.9T         18M         25G        2.9T
Swap:           99G          0B         99G
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/6246/head:pull/6246
$ git checkout pull/6246
Update a local copy of the PR:
$ git checkout pull/6246
$ git pull https://git.openjdk.java.net/jdk pull/6246/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 6246
View PR using the GUI difftool:
$ git pr show -t 6246
Using diff file
Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/6246.diff