
8286823: Default to UseAVX=2 on all Skylake/Cascade Lake CPUs #8731

Closed
wants to merge 1 commit into from

Conversation


@olivergillespie olivergillespie commented May 16, 2022

The current code already does this for 'older' Skylake processors,
namely those with _stepping < 5. My testing indicates this is a
problem for later processors in this family too, so I have removed the
max stepping condition.

The original exclusion was added in https://bugs.openjdk.java.net/browse/JDK-8221092.

A general description of the overall issue is given at
https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Downclocking.

According to https://en.wikichip.org/wiki/intel/microarchitectures/cascade_lake#CPUID,
stepping values 5..7 indicate Cascade Lake. I have tested on a CPU with stepping=7,
and I see CPU frequency reduction from 3.5GHz down to 2.7GHz (~23%) when using
-XX:UseAVX=3, along with a corresponding performance reduction.

I first saw this issue in a real production workload, where the main AVX3 instructions
being executed were those generated for various flavours of disjoint_arraycopy.

I can reproduce a similar effect using SPECjvm2008's xml.transform benchmark.

```
java --add-opens=java.xml/com.sun.org.apache.xerces.internal.parsers=ALL-UNNAMED \
     --add-opens=java.xml/com.sun.org.apache.xerces.internal.util=ALL-UNNAMED \
     -jar SPECjvm2008.jar -ikv -ict xml.transform
```

Before the change, or with -XX:UseAVX=3:

```
Valid run!
Score on xml.transform: 776.00 ops/m
```

After the change, or with -XX:UseAVX=2:

```
Valid run!
Score on xml.transform: 894.07 ops/m
```

So, a 15% improvement in this benchmark. It's possible some benchmarks will be negatively
affected by this change, but I contend it is still the right move: the difference in this
benchmark is stark; the use of AVX3 instructions can affect *all* processes/code on the
host due to the downclocking; and the effect is very hard to root-cause - for example,
CPU profiles look very similar before and after, since all code is slowed equally.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed (1 review required, with at least 1 reviewer)

Issue

  • JDK-8286823: Default to UseAVX=2 on all Skylake/Cascade Lake CPUs

Reviewing

Using git

Checkout this PR locally:

```
$ git fetch https://git.openjdk.java.net/jdk pull/8731/head:pull/8731
$ git checkout pull/8731
```

Update a local copy of the PR:

```
$ git checkout pull/8731
$ git pull https://git.openjdk.java.net/jdk pull/8731/head
```

Using Skara CLI tools

Checkout this PR locally:

```
$ git pr checkout 8731
```

View PR using the GUI difftool:

```
$ git pr show -t 8731
```

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/8731.diff

bridgekeeper bot commented May 16, 2022

👋 Welcome back olivergillespie! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label May 16, 2022
openjdk bot commented May 16, 2022

@olivergillespie The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label May 16, 2022
mlbridge bot commented May 16, 2022

Webrevs

```diff
-  if (use_avx_limit > 2 && is_intel_skylake() && _stepping < 5) {
+  // Don't use AVX-512 on Skylake (or the related Cascade Lake) CPUs unless explicitly
+  // requested - these instructions can cause performance issues on these processors.
+  if (use_avx_limit > 2 && is_intel_skylake()) {
```

Maybe is_intel_skylake needs to be changed to is_cpu_model_intel_skylake? It will make clear that all CPUs based on Skylake model are excluded.


I agree it's not necessarily a perfect name, but I haven't changed its behaviour, so I figured I could avoid making my change any bigger than necessary. The intention of this particular usage is clarified in my comment. It's used in another place too, where it's evidently understood to include Cascade Lake, since the comment mentions Ice Lake (Cascade Lake's successor).


eastig commented May 17, 2022

@olivergillespie, how did you test your changes?

@olivergillespie

My testing was simply running the SPECjvm2008 xml.transform benchmark before and after as shared in the description.


eastig commented May 17, 2022

You need to run at least gtest and tier1. The PR did not start them automatically.

@olivergillespie

> You need to run at least gtest and tier1. The PR did not start them automatically.

Thanks, I have now run tier1 and gtest successfully with my change.

@vnkozlov vnkozlov left a comment


Seems reasonable. I have to run testing before approval.

sviswa7 commented May 17, 2022

We restricted this to Skylake only as Cascade Lake has improvements in this regard on the platform.
There are AVX512 optimizations in the JVM that are beneficial on Cascade Lake.


eastig commented May 17, 2022

Hi Sandhya,
Oli's main point is:

> use of AVX3 instructions can affect all processes/code on the host due to the
> downclocking, and the fact that this effect is very hard to root-cause, for example
> CPU profiles look very similar before and after since all code is equally slowed.

Oli mentioned disjoint_arraycopy. I found #6512. The PR mentions 'old' and 'latest' CPUs but it does not specify them. Was the PR tested on Cascade Lake? Also SPECjvm2008 was mentioned in the PR but without details. Microbenchmarks show single-thread improvements. Do you have any improvements in multi-threaded benchmarks, like SPECjbb, DaCapo or Renaissance? Maybe disjoint_arraycopy should not use AVX3 on Cascade Lake? Or AVX3Threshold should not be set to 0 for Cascade Lake?

sviswa7 commented May 17, 2022

@eastig AVX3Threshold is not set to 0 for CascadeLake.

@vnkozlov

I briefly ran SPECjvm2008 on Cascade Lake and I see a big regression in crypto.aes with AVX2.

@vnkozlov

@olivergillespie, do you have results for other SPECjvm2008 benchmarks?

@olivergillespie

Thanks for the comments, all. I will run all SPECjvm2008 benchmarks before and after the change for us to look at - I'll share the results later today.

In the application that uncovered this issue for me, minor use of AVX3 instructions (about 3-4% of the overall CPU usage, all in various flavours of disjoint_arraycopy) on my Cascade Lake processor downclocks the whole machine by ~15% - that means all threads/cores/processes slowed by 15% across the board.

The local speedup thanks to AVX3 has to be weighed against the global downclocking overhead, and I contend that most real-world applications will see far more overhead than benefit from AVX3 on Cascade Lake.

@olivergillespie

Below are my complete SPECjvm2008 results, running on an AWS EC2 m4.4xl host (CPU details also shared below), with warmup time of 120s and 1 iteration of 240s per benchmark. My results are somewhat noisy, in part due to running on virtualized hardware.

There are a range of both regressions and improvements after my change, roughly equal in count and magnitudes. Do keep in mind that any regressions (where UseAVX=2 is slower) are local to that operation, but improvements (where UseAVX=2 is faster) can often be felt by the whole machine - avoiding 15% downclocking is worth a lot more than 15% speedup in one code path. On this basis, I think UseAVX=2 is the right default for this hardware.

@vnkozlov - I don't see a significant regression in crypto.aes on my runs, could you please share more info about your test and hardware?

```
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
stepping        : 7
microcode       : 0x5003103
```

Note for the Change column: a positive % means the UseAVX=2 run is faster/better (higher ops/m) compared to the UseAVX=3 baseline/default behaviour. Negative means it was slower.

| Benchmark | UseAVX=3 (ops/m) | UseAVX=2 (ops/m) | Change |
|---|---:|---:|---:|
| startup.helloworld | 217 | 190 | -12% |
| startup.compiler.compiler | 218 | 205 | -6% |
| startup.compiler.sunflow | 218 | 221 | +1% |
| startup.compress | 42 | 42 | +1% |
| startup.crypto.aes | 18 | 18 | -1% |
| startup.crypto.rsa | 99 | 94 | -5% |
| startup.crypto.signverify | 80 | 77 | -5% |
| startup.mpegaudio | 29 | 28 | -4% |
| startup.scimark.fft | 67 | 69 | +3% |
| startup.scimark.lu | 90 | 75 | -17% |
| startup.scimark.monte_carlo | 17 | 17 | +1% |
| startup.scimark.sor | 32 | 35 | +9% |
| startup.scimark.sparse | 42 | 42 | +0% |
| startup.serial | 31 | 32 | +2% |
| startup.sunflow | 37 | 34 | -9% |
| startup.xml.transform | 35 | 35 | -1% |
| startup.xml.validation | 47 | 50 | +7% |
| compress | 576 | 568 | -1% |
| crypto.aes | 202 | 200 | -1% |
| crypto.rsa | 4065 | 4049 | -0% |
| crypto.signverify | 2033 | 1964 | -3% |
| derby | 1391 | 1411 | +1% |
| mpegaudio | 324 | 363 | +12% |
| scimark.fft.large | 274 | 271 | -1% |
| scimark.lu.large | 73 | 73 | -0% |
| scimark.sor.large | 160 | 154 | -4% |
| scimark.sparse.large | 155 | 129 | -17% |
| scimark.fft.small | 1340 | 1421 | +6% |
| scimark.lu.small | 1967 | 2467 | +25% |
| scimark.sor.small | 700 | 679 | -3% |
| scimark.sparse.small | 544 | 492 | -10% |
| scimark.monte_carlo | 729 | 700 | -4% |
| serial | 477 | 466 | -2% |
| sunflow | 208 | 219 | +5% |
| xml.transform | 778 | 894 | +15% |
| xml.validation | 1610 | 1926 | +20% |

@vnkozlov

I got the regression when I did not load my system - I ran SPECjvm2008 with -bt 8. I will rerun with full load. My system is slightly different: Xeon Platinum 8260L CPU @ 2.40GHz.


simonis commented May 18, 2022

@olivergillespie I think these benchmarks should be done on bare metal machines. As you have correctly observed, usage of AVX3 can impact the whole CPU. How can you make sure that other applications running in parallel to yours on the same virtualized hardware don't influence your results by using AVX3 themselves?

Also, I think you should run any sub-benchmark in isolation to make sure that the CPU has recovered after potential down-clocking because of a previous sub-benchmark which heavily used AVX3.

It might also be helpful to run a tool like i7z in parallel to your benchmarks which displays the clock speed of all CPUs (there might be better tools but this should at least give you an idea about what's going on). Ideally, you should have an additional column for every result which displays the average CPU speed during the benchmark run.

@olivergillespie

I just ran into this again on a different machine, on this one it downclocks from 3.6GHz to 3.1GHz.

```
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
stepping        : 7
```

@vnkozlov I tried with 1, 2, 4, 8 and 16 benchmark threads, all showed no regression on crypto.aes. Did you manage to re-run it?

sviswa7 commented May 19, 2022

From what I understand, only the core which is executing 512-bit vector instructions will observe this lower frequency, not the entire processor. It is doing double the work per clock during that time, so overall we should come out OK.
@olivergillespie As Volker mentioned, the SPECjvm2008 data that you shared was all over the place. You would agree that we cannot globally change something based on that. In addition, SPECjvm2008 has a lot of run-to-run variation in general, for various reasons (e.g. data locality if the threads are spread across multiple sockets). Please take that into account as well.

@olivergillespie

Thanks for the comments.

> From what I understand, only the core which is executing 512 bit vector instructions will observe this lower frequency and not the entire processor.

Yes, but in a typical multithreaded java service (say a web app), many threads end up using these instructions via common operations like StringBuilder, plus the threads are not pinned to cores. This in practice can mean all cores occasionally hit AVX3 instructions. The slowdown on these processors does not only last while the instruction is executing, it persists far beyond that, and so even a few short uses of AVX3 can end up perma-throttling all CPUs.

From https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/

> Downclocking, when it happens, is per core and for a short time after you have used particular instructions (e.g., ~2ms).

2 milliseconds is a huge amount of time to run at the lower frequency. Each core only needs to hit these instructions once every 2 milliseconds to have permanent throttling, and that's what I see on my applications.

More info: https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/

I'm not an expert on the mechanics of this; I only know that I observe this same behaviour in every real-world JDK 17 application I've looked at that runs on Cascade Lake (four so far): namely, a 15% global slowdown with very little AVX3 usage. The AVX3 speedup definitely doesn't make up for it, because when I disable AVX3 my application's performance improves significantly (reduced latency, increased throughput).

Do you know what measurements were used to justify the original exclusion for model 85 stepping < 5? I could re-run those tests to compare with my hardware. Is there any other benchmark or measurement you'd like to see which might justify this change?

Maybe we can look at the change this way: are model 85 steppings 5, 6 and 7 affected by the same issue as the earlier steppings, which are already excluded from AVX3 by default? It was evidently decided that the issue was severe enough for those CPUs to have AVX3 disabled by default; I merely find that three later steppings also suffer severely from this issue and should receive the same treatment.

@vnkozlov

> @vnkozlov I tried with 1, 2, 4, 8 and 16 benchmark threads, all showed no regression on crypto.aes. Did you manage to re-run it?

I reran it but did not get this regression, so it is indeed unstable. I think @simonis is right that we need to run each sub-benchmark separately. I am currently running with different -bt values and the results are jumping around. I am on a bare-metal machine.

I will try to run with increased iterations and separate sub-benchmark runs.

Based on your comment, you are using JDK 17. Which particular version do you have?
I am running with the latest JDK 19, which may also affect the difference.

@vnkozlov

I reproduced my regression (and your improvement) running on 1/4 of the total threads on the machine:

[chart: SPECjvm2008, AVX2 vs AVX-512, quarter load]

But when I use all threads (loading the system), the picture is different (except mpegaudio):

[chart: SPECjvm2008, AVX2 vs AVX-512, full load]

@vnkozlov

I think I have spent enough time on this already. Performance tracking is a "rabbit hole" :(

As I said before, the results are mixed compared to running on Skylake, where the results were all positive.
Even running on more recent Intel CPUs I see some mixed results.

Yes, AVX-512 is not for all applications, and I agree with your observations. Even so, I can't support this change.
For such cases we have an official solution: the ability to choose the instruction set with the UseAVX flag.

@olivergillespie

Thanks for running the tests and sharing the results!

> Based on your comment, you are using JDK 17. Which particular version do you have?

My tests were run on the latest JDK19 tip build, but my real applications where I have observed the problems are using 17.0.3.6.

I agree that the results are mixed (though the biggest changes are in favour of AVX2). But given that we know for sure that AVX-512 affects other processes and other code, mixed results are definitely not a good enough reason to enable AVX-512 on these CPUs. I don't believe we would ever decide to enable AVX-512 based on these results, hence my suggested change: don't enable AVX-512 on these CPUs. There is no benefit on average to having AVX-512 enabled on these models, yet it comes with huge downsides, so I think the risk profile leans heavily in favour of not enabling it by default for these models.

@simonis

simonis commented May 20, 2022

@vnkozlov I think I agree with @olivergillespie. Looking at the graphs you've posted, AVX2 seems superior to AVX3 even if only 1/4 of the threads are used, and clearly so if the machine is fully loaded. As @olivergillespie said, it would be hard to justify enabling AVX3 on these CPUs today, given these results.

I would argue we should disable it by default and, as you said, let the few use cases which benefit from it (like AES on non-loaded machines) enable it manually with -XX:UseAVX=3.

@vnkozlov

I did standalone runs of the sub-benchmarks several times to get the best results. Some of them show two sets of results, fast and slow; crypto.aes and crypto.rsa are among them. I suspect it is caused by differences in code generation (slightly different profiling and inlining) and has nothing to do with the AVX setting. I did see (thermal? turbo boost?) effects if I ran benchmarks without a pause between them, so I kept the system idle for about 10 minutes between runs.

I did not re-run benchmarks if difference was < 3%.

We need to look at what is going on with mpegaudio and scimark.lu.small, and also test with Parallel GC.

[chart: SPECjvm2008, AVX2 vs AVX-512, quarter load, separate sub-benchmark runs]

@vnkozlov

These results seem promising, but we should not base our decision only on this.
We need some time to analyze these results and look at other benchmarks.
Thank you for your patience.

@olivergillespie

Thanks for all the help so far. Is there anything I can help with?

@vnkozlov

> Thanks for all the help so far. Is there anything I can help with?

Thank you for offering help. We just need time to run different benchmarks we use for performance testing.

sviswa7 commented Jun 1, 2022

@olivergillespie In our analysis of mpegaudio, we found that the problem was due to the auto-vectorizer kicking in for small array initialization. Auto-vectorization does not take AVX3Threshold into consideration. I have submitted a simple PR, #8877, to address your concerns - please take a look. The PR limits the auto-vectorizer to 256-bit (32-byte) vectors on Cascade Lake.

@olivergillespie

Thanks @sviswa7! I don't have the expertise to evaluate your auto-vectorizer change from a theoretical standpoint. I'd like to test it in my real applications to see whether the issue you found in mpegaudio is the same issue I see in a more complex app, but unfortunately I won't be able to do that quickly (it's not easy to swap out JDK versions in those apps).

I'm still of the opinion that AVX3 on Cascade Lake in the JVM seems more dangerous than it's worth across the board, due to the documented inherent downclocking behaviour of that architecture. Various other SPECjvm benchmarks consistently downclock by >15%, so even if they don't show a direct performance regression with AVX3 (the improved AVX3 performance roughly balances the downclocking), they have a negative impact on other code running on the host.

It's possible your tighter change covers most of the cases in practice, which would be great, but I think this broader change still makes sense conceptually.


vnkozlov commented Jun 7, 2022

@olivergillespie you can test your application with the -XX:MaxVectorSize=32 product flag to see the effects of the #8877 changes.
The main concern about changing the AVX default is the reduction in the available instruction set, which could otherwise be useful.

bridgekeeper bot commented Jul 5, 2022

@olivergillespie This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper bot commented Aug 3, 2022

@olivergillespie This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.

@bridgekeeper bridgekeeper bot closed this Aug 3, 2022
Labels
hotspot hotspot-dev@openjdk.org rfr Pull request is ready for review
5 participants