Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JDK-8277178: Reduce the priority of data dependent nodes when OptoScheduling enabled #6407

Closed
wants to merge 5 commits into from

Conversation

sunny868
Copy link

@sunny868 sunny868 commented Nov 16, 2021

when doing gcm/lcm, We should not only consider the height of nodes(latency), but also consider whether there is data dependency between nodes. When there is data dependency between two nodes and the delay of the previous node is large, another node without data dependency can be considered inserting between the two nodes. For example:
for java code


    public static final double fval = 2.00;
    public static double[] A = new double[N];
    public static int[] B = new int[N];

    public static void testP(){
	for (int i=0; i

when use -XX:+OptoScheduling in aarch64, the sequence is


190     B15: #	out( B15 B16 ) <- in( B14 B15 ) Loop( B15-B15 inner main of N118 strip mined) Freq: 9.9999e+11
190     sxtw  R13, R15	# i2l
194 +   add R14, R17, R13, LShiftL #3	# ptr
198     ldrd  V16, [R14, #16]	# double
19c +   fmuld   V18, V16, V17
1a0 +   faddd   V16, V18, V16
1a4     strd  V16, [R14, #16]	# double
1a8 +   add R13, R0, R13, LShiftL #2	# ptr
1ac +   ldrw  R1, [R13, #16]	# int
1b0 +   addw  R14, R1, R1
1b4 +   addw R1, R14, #2
1b8 +   addw R15, R15, #1
1bc     strw  R1, [R13, #16]	# int
1c0 +   cmpw  R15, R12
1c4     blt B15 	// counted loop end  P=1.000000 C=40960.000000

Then a more efficient sequence should be:


190     B15: #	out( B15 B16 ) <- in( B14 B15 ) Loop( B15-B15 inner main of N118 strip mined) Freq: 9.9999e+11
190     sxtw  R13, R14	# i2l
194     add R15, R17, R13, LShiftL #3	# ptr
198     add R13, R0, R13, LShiftL #2	# ptr
19c     ldrd  V16, [R15, #16]	# double
1a0     ldrw  R2, [R13, #16]	# int
1a4     fmuld   V18, V16, V17
1a8     addw  R1, R2, R2
1ac     faddd   V16, V18, V16
1b0     strd  V16, [R15, #16]	# double
1b4     addw R1, R1, #2
1b8     strw  R1, [R13, #16]	# int
1bc     addw R14, R14, #1
1c0     cmpw  R14, R12
1c4     blt B15 	// counted loop end  P=1.000000 C=40960.000000

This problem also exists in MIPS architecture. This is a patch to fix this problem. Please help review it.
Thanks


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8277178: Reduce the priority of data dependent nodes when OptoScheduling enabled

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/6407/head:pull/6407
$ git checkout pull/6407

Update a local copy of the PR:
$ git checkout pull/6407
$ git pull https://git.openjdk.java.net/jdk pull/6407/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 6407

View PR using the GUI difftool:
$ git pr show -t 6407

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/6407.diff

@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Nov 16, 2021

👋 Welcome back sunny868! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr label Nov 16, 2021
@openjdk
Copy link

@openjdk openjdk bot commented Nov 16, 2021

@sunny868 The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler label Nov 16, 2021
@mlbridge
Copy link

@mlbridge mlbridge bot commented Nov 16, 2021

@sunny868 sunny868 changed the title JDKJDK-8277178: Reduce the priority of data dependent nodes when OptoScheduling enabled JDK-8277178: Reduce the priority of data dependent nodes when OptoScheduling enabled Nov 16, 2021
@TobiHartmann
Copy link
Member

@TobiHartmann TobiHartmann commented Nov 30, 2021

I'm not too familiar with this code but I gave it a quick run through our performance testing and all results look good except for the MonteCarlo benchmark from SPECjvm2008 with G1 which shows a 1% regression. It could just be run-to-run variance but better double check.

Also, I think it would be good to add a JMH benchmark for this.

@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Dec 28, 2021

@sunny868 This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

Copy link
Contributor

@vnkozlov vnkozlov left a comment

@sunny868 Please, explain changes in output.cpp.

@sunny868
Copy link
Author

@sunny868 sunny868 commented Jan 6, 2022

I'm very sorry, I can't receive the mailing list recently, so I don't see your feedback in time.
Under MIPS architecture, this patch can improve the execution performance of some use cases. But I didn't retest under other architectures. Later, I will confirm whether it will cause performance regression to other platforms.

@sunny868
Copy link
Author

@sunny868 sunny868 commented Jan 6, 2022

@sunny868 Please, explain changes in output.cpp.

I found that after phase ScheduleAndBundle(), the instruction sequence (good in lcm.cpp) will be adjusted again, and the adjusted sequence is data dependent. Maybe it's more reasonable to cancel the use of ScheduleAndBundle(). I'm not sure about this problem.

@vnkozlov
Copy link
Contributor

@vnkozlov vnkozlov commented Jan 6, 2022

In addition to aarch64 C2's scheduler is used on some x86 platforms so you can not just remove part of it without proving that it helps on them too:
https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/vm_version_x86.cpp#L1576

I would suggest to investigate ScheduleAndBundle() and fix it too.

@openjdk openjdk bot removed the rfr label Jan 13, 2022
@openjdk openjdk bot added the rfr label Jan 13, 2022
@sunny868
Copy link
Author

@sunny868 sunny868 commented Jan 13, 2022

I'm not too familiar with this code but I gave it a quick run through our performance testing and all results look good except for the MonteCarlo benchmark from SPECjvm2008 with G1 which shows a 1% regression. It could just be run-to-run variance but better double check.

Also, I think it would be good to add a JMH benchmark for this.

I have add a new benmark test file (InstructionScheduling.java). I had make a test for LoongArch and MIPS64 architecture , and can significantly observe the performance improvement(about 30%). However, the performance improvement on aarch64 architecture is not obvious, and the test jitter is very large.

I don't have Silvermont or Centerton x86 machine, so can you make a performance testing again for this patch? Thank you very much.

@@ -0,0 +1,44 @@
/*
* Copyright (c) 2022, Loongson Technology. All rights reserved.
Copy link
Member

@DamonFool DamonFool Jan 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sunny868 ,

We already have the copyright header for Loongson like the following.

./hotspot/jtreg/compiler/print/PrintCompileQueue.java: * Copyright (c) 2019, Loongson Technology Co. Ltd. All rights reserved.
./hotspot/jtreg/compiler/compilercontrol/CompilationModeHighOnlyTest.java: * Copyright (c) 2019, Loongson Technology Co. Ltd. All rights reserved.
./hotspot/jtreg/compiler/profiling/TestProfileCounterOverflow.java: * Copyright (c) 2019, Loongson Technology Co. Ltd. All rights reserved.

I'm not sure whether it had been changed.

Copy link
Author

@sunny868 sunny868 Jan 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sunny868 ,

We already have the copyright header for Loongson like the following.

./hotspot/jtreg/compiler/print/PrintCompileQueue.java: * Copyright (c) 2019, Loongson Technology Co. Ltd. All rights reserved.
./hotspot/jtreg/compiler/compilercontrol/CompilationModeHighOnlyTest.java: * Copyright (c) 2019, Loongson Technology Co. Ltd. All rights reserved.
./hotspot/jtreg/compiler/profiling/TestProfileCounterOverflow.java: * Copyright (c) 2019, Loongson Technology Co. Ltd. All rights reserved.

I'm not sure whether it had been changed.

typo

@sunny868
Copy link
Author

@sunny868 sunny868 commented Jan 13, 2022

In addition to aarch64 C2's scheduler is used on some x86 platforms so you can not just remove part of it without proving that it helps on them too: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/vm_version_x86.cpp#L1576

I would suggest to investigate ScheduleAndBundle() and fix it too.

From many test results for InstructionScheduling.java on LoongArch and aarch64, ScheduleAndBundle() has no obvious impact on performance, so I undo the change to output.cpp.

for (int i=0; i<N; i++) {
D[i] += D[i] * fval;
D[i] += D[i] / fval;
I[i] += I[i]*2;
Copy link
Member

@DamonFool DamonFool Jan 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why no space before and after * ?


@Benchmark
public void testMethod(){
for (int i=0; i<N; i++) {
Copy link
Member

@DamonFool DamonFool Jan 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest adding spaces for = and <.
Thanks.

for (int i = 0; i < N; i++) {
D[i] += D[i] * fval;
D[i] += D[i] / fval;
I[i] += I[i] * 2;
Copy link
Member

@theaoqi theaoqi Jan 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we define a variable for 2 like the one for 2.00?

@vnkozlov
Copy link
Contributor

@vnkozlov vnkozlov commented Jan 14, 2022

My tier1-3 testing passed clean (x86 and Aarch64).
But performance variations on x86 is significant. I am rerunning it again to confirm it. If it is confirmed (may be next day) can you consider apply changes only when OptoScheduling is on:

   bool block_size_threshold_ok = (recalc_pressure_nodes != NULL) && (block->number_of_nodes() > 10);
+  Node* last_n = C->do_scheduling() ? block->get_node(--sched_slot) : nullptr;

Also please check nullptr instead of implicit null check:

     if (last_n != nullptr) {

@vnkozlov
Copy link
Contributor

@vnkozlov vnkozlov commented Jan 14, 2022

In general, inserting instructions between load value and its use can lead to spilling on stack if registers pressure is high. We saw such cases before. We should be careful here because your current changes affect all platforms.

@sunny868
Copy link
Author

@sunny868 sunny868 commented Jan 17, 2022

In general, inserting instructions between load value and its use can lead to spilling on stack if registers pressure is high. We saw such cases before. We should be careful here because your current changes affect all platforms.

Thanks, I will further confirm whether this patch will have an impact on register pressure.

@vnkozlov
Copy link
Contributor

@vnkozlov vnkozlov commented Jan 19, 2022

After rerun I see less variations on x86 (or different benchmarks were affected).
On other hand I see more variations on Aarch64.

@sunny868
Copy link
Author

@sunny868 sunny868 commented Jan 20, 2022

On other hand I see more variations on Aarch64

regression?
Theoretically, instruction reordering may prolongs live range and lead to spilling on stack, But I can't find the spill code caused by this patch at present. This patch only makes local instructions reorder in the block, and it is estimated that it has little effect on the register pressure. I'm not quite sure.

@sunny868
Copy link
Author

@sunny868 sunny868 commented Jan 20, 2022

I also done a test for InstructionScheduling.java on Linux/aarch64 with args -XX:-UseSuperWord. And found regression with -XX:+OptoScheduling than with -XX:-OptoScheduling

arg Score 1 Score 2 Score 3 Units
-XX:+OptoScheduling 22129.754 22440.702 23233.260 ops/s
-XX:-OptoScheduling 26337.224 25831.751 26008.293 ops/s

Maybe the pipeline configuration in the aarch64.ad file is unreasonable.

@vnkozlov
Copy link
Contributor

@vnkozlov vnkozlov commented Jan 20, 2022

I asked our performance experts and they observed before variations I saw on Aarch64. It seems they are not caused by your changes.

You did not answered by suggestion about added C->do_scheduling() check.

@sunny868
Copy link
Author

@sunny868 sunny868 commented Jan 21, 2022

My tier1-3 testing passed clean (x86 and Aarch64). But performance variations on x86 is significant. I am rerunning it again to confirm it. If it is confirmed (may be next day) can you consider apply changes only when OptoScheduling is on:

   bool block_size_threshold_ok = (recalc_pressure_nodes != NULL) && (block->number_of_nodes() > 10);
+  Node* last_n = C->do_scheduling() ? block->get_node(--sched_slot) : nullptr;

Also please check nullptr instead of implicit null check:

     if (last_n != nullptr) {

Yes, you are right. MachNode latency in lcm will be 0 when OptoScheduling is false. so add Node* last_n = C->do_scheduling() ? block->get_node(--sched_slot) : nullptr; is reasonable.

@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Feb 18, 2022

@sunny868 This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Mar 18, 2022

@sunny868 This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.

@bridgekeeper bridgekeeper bot closed this Mar 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-compiler rfr
5 participants