
8286823: Default to UseAVX=2 on all Skylake/Cascade Lake CPUs #8731

Closed
wants to merge 1 commit into from

Conversation


@olivergillespie olivergillespie commented May 16, 2022

The current code already does this for 'older' Skylake processors,
namely those with _stepping < 5. My testing indicates this is a
problem for later processors in this family too, so I have removed the
max stepping condition.

The original exclusion was added in https://bugs.openjdk.java.net/browse/JDK-8221092.

A general description of the overall issue is given at
https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Downclocking.

According to https://en.wikichip.org/wiki/intel/microarchitectures/cascade_lake#CPUID,
stepping values 5..7 indicate Cascade Lake. I have tested on a CPU with stepping=7,
and I see CPU frequency reduction from 3.5GHz down to 2.7GHz (~23%) when using
-XX:UseAVX=3, along with a corresponding performance reduction.

I first saw this issue in a real production workload, where the main AVX3 instructions
being executed were those generated for various flavours of disjoint_arraycopy.

I can reproduce a similar effect using SPECjvm2008's xml.transform benchmark.

```
java --add-opens=java.xml/com.sun.org.apache.xerces.internal.parsers=ALL-UNNAMED \
     --add-opens=java.xml/com.sun.org.apache.xerces.internal.util=ALL-UNNAMED \
     -jar SPECjvm2008.jar -ikv -ict xml.transform
```

Before the change, or with -XX:UseAVX=3:

```
Valid run!
Score on xml.transform: 776.00 ops/m
```

After the change, or with -XX:UseAVX=2:

```
Valid run!
Score on xml.transform: 894.07 ops/m
```

So, a 15% improvement in this benchmark. It's possible some benchmarks will be negatively
affected by this change, but I contend it is still the right move: the difference in this
benchmark is stark; the use of AVX3 instructions can affect *all* processes/code on the
host due to the downclocking; and the effect is very hard to root-cause - for example,
CPU profiles look very similar before and after, since all code is slowed equally.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed (1 review required, with at least 1 reviewer)

Issue

  • JDK-8286823: Default to UseAVX=2 on all Skylake/Cascade Lake CPUs

Reviewing

Using git

Checkout this PR locally:

```
$ git fetch https://git.openjdk.java.net/jdk pull/8731/head:pull/8731
$ git checkout pull/8731
```

Update a local copy of the PR:

```
$ git checkout pull/8731
$ git pull https://git.openjdk.java.net/jdk pull/8731/head
```

Using Skara CLI tools

Checkout this PR locally:

```
$ git pr checkout 8731
```

View PR using the GUI difftool:

```
$ git pr show -t 8731
```

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/8731.diff

bridgekeeper bot commented May 16, 2022

👋 Welcome back olivergillespie! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label May 16, 2022
openjdk bot commented May 16, 2022

@olivergillespie The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label May 16, 2022
mlbridge bot commented May 16, 2022

Webrevs

```diff
-  if (use_avx_limit > 2 && is_intel_skylake() && _stepping < 5) {
+  // Don't use AVX-512 on Skylake (or the related Cascade Lake) CPUs unless explicitly
+  // requested - these instructions can cause performance issues on these processors.
+  if (use_avx_limit > 2 && is_intel_skylake()) {
```

Maybe is_intel_skylake needs to be changed to is_cpu_model_intel_skylake? It will make clear that all CPUs based on Skylake model are excluded.


I agree it's not necessarily a perfect name, but I haven't changed its behaviour, so I figured I could avoid making my change any bigger than necessary. The intention of this particular usage is clarified in my comment. It's used in another place too, where it's evidently understood to include Cascade Lake, since the comment mentions Ice Lake (Cascade Lake's successor).


eastig commented May 17, 2022

@olivergillespie, how did you test your changes?

@olivergillespie

My testing was simply running the SPECjvm2008 xml.transform benchmark before and after as shared in the description.


eastig commented May 17, 2022

You need to run at least gtest and tier1. The PR did not start them automatically.

@olivergillespie

> You need to run at least gtest and tier1. The PR did not start them automatically.

Thanks, I have now run tier1 and gtest successfully with my change.

@vnkozlov vnkozlov left a comment


Seems reasonable. I have to run testing before approval.

sviswa7 commented May 17, 2022

We restricted this to Skylake only as Cascade Lake has improvements in this regard on the platform.
There are AVX512 optimizations in the JVM that are beneficial on Cascade Lake.


eastig commented May 17, 2022

Hi Sandhya,
Oli's main point is:

> use of AVX3 instructions can affect all processes/code on the host due to the
> downclocking, and the fact that this effect is very hard to root-cause, for example
> CPU profiles look very similar before and after since all code is equally slowed.

Oli mentioned disjoint_arraycopy. I found #6512. The PR mentions 'old' and 'latest' CPUs but it does not specify them. Was the PR tested on Cascade Lake? Also SPECjvm2008 was mentioned in the PR but without details. Microbenchmarks show single-thread improvements. Do you have any improvements in multi-threaded benchmarks, like SPECjbb, DaCapo or Renaissance? Maybe disjoint_arraycopy should not use AVX3 on Cascade Lake? Or AVX3Threshold should not be set to 0 for Cascade Lake?

sviswa7 commented May 17, 2022

@eastig AVX3Threshold is not set to 0 for CascadeLake.

@vnkozlov

I briefly ran SPECjvm2008 on Cascade Lake and I see a big regression in crypto.aes with AVX2.

@vnkozlov

@olivergillespie, do you have results for other SPECjvm2008 benchmarks?

@olivergillespie

Thanks for the comments, all. I will run all SPECjvm2008 benchmarks before and after the change for us to look at - I'll share the results later today.

In the application that uncovered this issue for me, minor use of AVX3 instructions (about 3-4% of the overall CPU usage, all in various flavours of disjoint_arraycopy) on my Cascade Lake processor downclocks the whole machine by ~15% - that means all threads/cores/processes slowed by 15% across the board.

The local speedup thanks to AVX3 has to be weighed against the global downclocking overhead, and I contend that most real-world applications will see far more overhead than benefit from AVX3 on Cascade Lake.

@olivergillespie

Below are my complete SPECjvm2008 results, running on an AWS EC2 m4.4xl host (CPU details also shared below), with warmup time of 120s and 1 iteration of 240s per benchmark. My results are somewhat noisy, in part due to running on virtualized hardware.

There are a range of both regressions and improvements after my change, roughly equal in count and magnitudes. Do keep in mind that any regressions (where UseAVX=2 is slower) are local to that operation, but improvements (where UseAVX=2 is faster) can often be felt by the whole machine - avoiding 15% downclocking is worth a lot more than 15% speedup in one code path. On this basis, I think UseAVX=2 is the right default for this hardware.

@vnkozlov - I don't see a significant regression in crypto.aes on my runs, could you please share more info about your test and hardware?

```
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
stepping        : 7
microcode       : 0x5003103
```

Note for the Change column: a positive % means the UseAVX=2 run is faster/better (higher ops/m) compared to the UseAVX=3 baseline/default behaviour. Negative means it was slower.

| Benchmark | UseAVX=3 (ops/m) | UseAVX=2 (ops/m) | Change |
|---|---:|---:|---:|
| startup.helloworld | 217 | 190 | -12% |
| startup.compiler.compiler | 218 | 205 | -6% |
| startup.compiler.sunflow | 218 | 221 | +1% |
| startup.compress | 42 | 42 | +1% |
| startup.crypto.aes | 18 | 18 | -1% |
| startup.crypto.rsa | 99 | 94 | -5% |
| startup.crypto.signverify | 80 | 77 | -5% |
| startup.mpegaudio | 29 | 28 | -4% |
| startup.scimark.fft | 67 | 69 | +3% |
| startup.scimark.lu | 90 | 75 | -17% |
| startup.scimark.monte_carlo | 17 | 17 | +1% |
| startup.scimark.sor | 32 | 35 | +9% |
| startup.scimark.sparse | 42 | 42 | +0% |
| startup.serial | 31 | 32 | +2% |
| startup.sunflow | 37 | 34 | -9% |
| startup.xml.transform | 35 | 35 | -1% |
| startup.xml.validation | 47 | 50 | +7% |
| compress | 576 | 568 | -1% |
| crypto.aes | 202 | 200 | -1% |
| crypto.rsa | 4065 | 4049 | -0% |
| crypto.signverify | 2033 | 1964 | -3% |
| derby | 1391 | 1411 | +1% |
| mpegaudio | 324 | 363 | +12% |
| scimark.fft.large | 274 | 271 | -1% |
| scimark.lu.large | 73 | 73 | -0% |
| scimark.sor.large | 160 | 154 | -4% |
| scimark.sparse.large | 155 | 129 | -17% |
| scimark.fft.small | 1340 | 1421 | +6% |
| scimark.lu.small | 1967 | 2467 | +25% |
| scimark.sor.small | 700 | 679 | -3% |
| scimark.sparse.small | 544 | 492 | -10% |
| scimark.monte_carlo | 729 | 700 | -4% |
| serial | 477 | 466 | -2% |
| sunflow | 208 | 219 | +5% |
| xml.transform | 778 | 894 | +15% |
| xml.validation | 1610 | 1926 | +20% |

@vnkozlov

I got the regression when I did not load my system - I ran SPECjvm2008 with -bt 8. I will rerun with full load. My system is slightly different: Xeon Platinum 8260L CPU @ 2.40GHz.


simonis commented May 18, 2022

@olivergillespie I think these benchmarks should be done on bare metal machines. As you have correctly observed, usage of AVX3 can impact the whole CPU. How can you make sure that other applications running in parallel to yours on the same virtualized hardware don't influence your results by using AVX3 themselves?

Also, I think you should run any sub-benchmark in isolation to make sure that the CPU has recovered after potential down-clocking because of a previous sub-benchmark which heavily used AVX3.

It might also be helpful to run a tool like i7z in parallel to your benchmarks which displays the clock speed of all CPUs (there might be better tools but this should at least give you an idea about what's going on). Ideally, you should have an additional column for every result which displays the average CPU speed during the benchmark run.

@olivergillespie

I just ran into this again on a different machine, on this one it downclocks from 3.6GHz to 3.1GHz.

```
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
stepping        : 7
```

@vnkozlov I tried with 1, 2, 4, 8 and 16 benchmark threads, all showed no regression on crypto.aes. Did you manage to re-run it?

sviswa7 commented May 19, 2022

From what I understand, only the core which is executing 512-bit vector instructions will observe this lower frequency, not the entire processor. It is doing double the work per clock during that time, so overall we should come out OK.
@olivergillespie As Volker mentioned, the SPECjvm2008 data that you shared was all over the place. You would agree that we cannot globally change something based on that. In addition, SPECjvm2008 has a lot of run-to-run variation in general, for various reasons (e.g. data locality if the threads are spread across multiple sockets). Please take that into account as well.

@olivergillespie

Thanks for the comments.

> From what I understand, only the core which is executing 512 bit vector instructions will observe this lower frequency and not the entire processor.

Yes, but in a typical multithreaded java service (say a web app), many threads end up using these instructions via common operations like StringBuilder, plus the threads are not pinned to cores. This in practice can mean all cores occasionally hit AVX3 instructions. The slowdown on these processors does not only last while the instruction is executing, it persists far beyond that, and so even a few short uses of AVX3 can end up perma-throttling all CPUs.

From https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/

> Downclocking, when it happens, is per core and for a short time after you have used particular instructions (e.g., ~2ms).

2 milliseconds is a huge amount of time to run at the lower frequency. Each core only needs to hit these instructions once every 2 milliseconds to have permanent throttling, and that's what I see on my applications.

More info: https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/

I'm not an expert on the mechanics of this; I only know that I observe this same behaviour in every real-world JDK 17 application I've looked at that runs on Cascade Lake (four so far): namely, a 15% global slowdown with very little AVX3 usage. The AVX3 speedup definitely doesn't make up for it, because when I disable AVX3 my application's performance improves significantly (reduced latency, increased throughput).

Do you know what measurements were used to justify the original exclusion for model 85 stepping < 5? I could re-run those tests to compare with my hardware. Is there any other benchmark or measurement you'd like to see which might justify this change?

Maybe we can look at the change this way: are model 85 steppings 5, 6 and 7 affected by the same issue as the earlier steppings, which are already excluded from AVX3 by default? It was evidently decided that the issue was severe enough for those CPUs to have AVX3 disabled by default; I merely find that three later steppings also suffer severely from this issue and should receive the same treatment.

@vnkozlov

> @vnkozlov I tried with 1, 2, 4, 8 and 16 benchmark threads, all showed no regression on crypto.aes. Did you manage to re-run it?

I reran it but did not get this regression, so it is indeed unstable. I think @simonis is right that we need to run each sub-benchmark separately. I am currently running with different -bt values and the results are jumping around. I am on a bare-metal machine.

I will try to run with increased iterations and separate sub-benchmark runs.

Based on your comment, you are using JDK 17. Which particular version do you have?
I am running with the latest JDK 19, which may also affect the difference.

@vnkozlov

I reproduced my regression (and your improvement) running on 1/4 of the total threads on the machine:

[chart: SPECjvm2008, AVX2 vs AVX-512, quarter load]

But when I use all threads (loading the system), the picture is different (except mpegaudio):

[chart: SPECjvm2008, AVX2 vs AVX-512, full load]

@vnkozlov

I think I have spent enough time on this already. Performance tracking is a "rabbit hole" :(

As I said before, the results are mixed compared to running on Skylake, where the results were all positive.
Even running on more recent Intel CPUs I see some mixed results.

Yes, AVX-512 is not for all applications, and I agree with your observations. Even so, I can't support this change.
For such cases we have an official solution: the ability to choose the instruction set with the UseAVX flag.

@olivergillespie

Thanks for running the tests and sharing the results!

> Based on your comment, you are using JDK 17. Which particular version do you have?

My tests were run on the latest JDK19 tip build, but my real applications where I have observed the problems are using 17.0.3.6.

I agree that the results are mixed (though the biggest changes are in favour of AVX2). But given that we know for sure that AVX-512 affects other processes and other code, mixed results are definitely not a good enough reason to enable AVX-512 on these CPUs. I don't believe we would ever decide to enable AVX-512 based on these results, hence my suggested change: don't enable AVX-512 on these CPUs. There is no benefit on average to having AVX-512 enabled on these models, yet it comes with huge downsides, so I think the risk profile leans heavily in favour of not enabling it by default for these models.

@simonis

simonis commented May 20, 2022

@vnkozlov I think I agree with @olivergillespie. Looking at the graphs you've posted, AVX2 seems superior to AVX3 even if only 1/4 of the threads are used, and clearly so if the machine is fully loaded. As @olivergillespie said, it would be hard to justify enabling AVX3 on these CPUs today, given these results.

I would argue we should disable it by default and, as you said, let the few use cases which benefit from it (like AES on non-loaded machines) enable it manually with -XX:UseAVX=3.

@vnkozlov

I did standalone runs of the sub-benchmarks several times to get the best results. Some of them show two sets of results, fast and slow; crypto.aes and crypto.rsa are among them. I suspect it is caused by differences in code generation (slightly different profiling and inlining) and has nothing to do with the AVX setting. I did see (thermal? turbo boost?) effects if I ran benchmarks without a pause between them, so I kept the system idle for about 10 minutes between runs.

I did not re-run benchmarks if difference was < 3%.

We need to look at what is going on with mpegaudio and scimark.lu.small, and also test with Parallel GC.

[chart: SPECjvm2008, AVX2 vs AVX-512, quarter load, separate sub-benchmark runs]

@vnkozlov

These results seem promising, but we should not base our decision only on this.
We need some time to analyze these results and look at other benchmarks.
Thank you for your patience.

@olivergillespie

Thanks for all the help so far. Is there anything I can help with?

@vnkozlov

> Thanks for all the help so far. Is there anything I can help with?

Thank you for offering help. We just need time to run different benchmarks we use for performance testing.

sviswa7 commented Jun 1, 2022

@olivergillespie In our analysis of mpegaudio, we found that the problem was due to the auto-vectorizer kicking in for small array initialization. Auto-vectorization does not take AVX3Threshold into consideration. I have submitted a simple PR, #8877, to address your concerns - please take a look. The PR limits the auto-vectorizer to 256-bit (32-byte) vectors on Cascade Lake.

@olivergillespie

Thanks @sviswa7! I don't have the expertise to evaluate your auto-vectorizer change from a theoretical standpoint. I'd like to test it in my real applications to see whether the issue you found in mpegaudio is the same issue I see in a more complex app, but unfortunately I won't be able to do that quickly (it's not easy to swap out JDK versions in those apps).

I'm still of the opinion that AVX3 on Cascade Lake in the JVM seems more dangerous than it's worth across the board, due to the documented inherent downclocking behaviour of that architecture. Various other SPECjvm benchmarks consistently downclock by >15%, so even if they don't show a direct performance regression with AVX3 (the improved AVX3 performance roughly balances the downclocking), they have a negative impact on other code running on the host.

It's possible your tighter change covers most of the cases in practice, which would be great, but I think this broader change still makes sense conceptually.


vnkozlov commented Jun 7, 2022

@olivergillespie you can test your application with the -XX:MaxVectorSize=32 product flag to see the effects of the #8877 changes.
The main concern about changing the AVX default is the reduction in the available instruction set, which could otherwise be useful.

bridgekeeper bot commented Jul 5, 2022

@olivergillespie This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper bot commented Aug 3, 2022

@olivergillespie This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.

@bridgekeeper bridgekeeper bot closed this Aug 3, 2022
Labels
hotspot hotspot-dev@openjdk.org rfr Pull request is ready for review
5 participants