8286823: Default to UseAVX=2 on all Skylake/Cascade Lake CPUs #8731
Conversation
The current code already does this for 'older' Skylake processors, namely those with `_stepping < 5`. My testing indicates this is a problem for later processors in this family too, so I have removed the max stepping condition.

The original exclusion was added in https://bugs.openjdk.java.net/browse/JDK-8221092. A general description of the overall issue is given at https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Downclocking.

According to https://en.wikichip.org/wiki/intel/microarchitectures/cascade_lake#CPUID, stepping values 5..7 indicate Cascade Lake. I have tested on a CPU with stepping=7, and I see a CPU frequency reduction from 3.1GHz down to 2.7GHz (~13%) when using `-XX:UseAVX=3`, along with a corresponding performance reduction.

I first saw this issue in a real production workload, where the main AVX3 instructions being executed were those generated for various flavours of disjoint_arraycopy. I can reproduce a similar effect using SPECjvm2008's xml.transform benchmark.

```
java --add-opens=java.xml/com.sun.org.apache.xerces.internal.parsers=ALL-UNNAMED \
     --add-opens=java.xml/com.sun.org.apache.xerces.internal.util=ALL-UNNAMED \
     -jar SPECjvm2008.jar -ikv -ict xml.transform
```

Before the change, or with `-XX:UseAVX=3`:

```
Valid run!
Score on xml.transform: 776.00 ops/m
```

After the change, or with `-XX:UseAVX=2`:

```
Valid run!
Score on xml.transform: 894.07 ops/m
```

So, a 15% improvement in this benchmark. It's possible some benchmarks will be negatively affected by this change, but I contend that this is still the right move given the stark difference in this benchmark, combined with the fact that use of AVX3 instructions can affect *all* processes/code on the host due to the downclocking, and the fact that this effect is very hard to root-cause: for example, CPU profiles look very similar before and after, since all code is equally slowed.
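For anyone who wants to reproduce the frequency effect outside SPECjvm2008: a sketch (not code from this PR) of a minimal harness that hammers the intrinsified `System.arraycopy` path, the same `disjoint_arraycopy` stubs mentioned above. Run it once with `-XX:UseAVX=3` and once with `-XX:UseAVX=2` while watching the machine's reported clock speed.

```java
// Sketch: exercise the intrinsic arraycopy stubs (which may be generated with
// AVX-512 instructions under -XX:UseAVX=3 on these CPUs). Compare wall time
// and observed CPU frequency across the two UseAVX settings.
public class ArrayCopyLoop {
    static long copyLoop(int iterations, int length) {
        int[] src = new int[length];
        int[] dst = new int[length];
        for (int i = 0; i < length; i++) {
            src[i] = i;
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            System.arraycopy(src, 0, dst, 0, length); // hits the disjoint_arraycopy stub
        }
        long elapsed = System.nanoTime() - start;
        // Sanity-check that the copy actually happened.
        if (dst[length - 1] != length - 1) {
            throw new AssertionError("copy failed");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        long nanos = copyLoop(100_000, 4096);
        System.out.println("copied OK in " + nanos + " ns");
    }
}
```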
👋 Welcome back olivergillespie! A progress list of the required criteria for merging this PR into
@olivergillespie The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the `/label` pull request command.
```diff
-  if (use_avx_limit > 2 && is_intel_skylake() && _stepping < 5) {
+  // Don't use AVX-512 on Skylake (or the related Cascade Lake) CPUs unless explicitly
+  // requested - these instructions can cause performance issues on these processors.
+  if (use_avx_limit > 2 && is_intel_skylake()) {
```
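The behavioural change in the diff can be modelled as a small predicate. This is a simplified sketch, not HotSpot's actual C++ code; the parameter names (`isIntelSkylake`, `stepping`, `useAvxLimit`) merely mirror the variables in the diff. Before the change, Skylake-family CPUs with stepping >= 5 (i.e. Cascade Lake) still defaulted to AVX-512; afterwards the whole family defaults to AVX2 unless `UseAVX` is set explicitly.

```java
// Simplified model of the default-AVX-level decision in the diff above.
// Not the real HotSpot code: just the before/after behaviour of the guard.
public class AvxDefault {
    // Before the PR: only Skylake steppings < 5 were capped at AVX2.
    static int defaultAvxBefore(boolean isIntelSkylake, int stepping, int useAvxLimit) {
        if (useAvxLimit > 2 && isIntelSkylake && stepping < 5) {
            return 2;
        }
        return useAvxLimit;
    }

    // After the PR: every Skylake-family CPU (including Cascade Lake,
    // steppings 5..7) is capped at AVX2 by default.
    static int defaultAvxAfter(boolean isIntelSkylake, int stepping, int useAvxLimit) {
        if (useAvxLimit > 2 && isIntelSkylake) {
            return 2;
        }
        return useAvxLimit;
    }

    public static void main(String[] args) {
        // Cascade Lake (stepping 7) used to get AVX-512 by default...
        System.out.println(defaultAvxBefore(true, 7, 3));
        // ...and now defaults to AVX2.
        System.out.println(defaultAvxAfter(true, 7, 3));
    }
}
```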
Maybe `is_intel_skylake` needs to be changed to `is_cpu_model_intel_skylake`? It would make it clear that all CPUs based on the Skylake model are excluded.
I agree it's not necessarily a perfect name, but I haven't changed its behaviour, so I figured I could avoid making my change any bigger than necessary. The intention of this particular usage is clarified in my comment. It's used in another place too, where it's evidently understood to include Cascade Lake, since that comment mentions Ice Lake (the Cascade Lake successor).
@olivergillespie, how did you test your changes?
My testing was simply running the SPECjvm2008
You need to run gtest and tier1 at least. The PR did not start them automatically.
Thanks, I have now run tier1 and gtest successfully with my change.
Seems reasonable. I have to run testing before approval.
We restricted this to Skylake only as Cascade Lake has improvements in this regard on the platform.
Hi Sandhya,
Oli mentioned
@eastig AVX3Threshold is not set to 0 for Cascade Lake.
I ran specjvm2008 very briefly on Cascade Lake and I see a big regression in crypto.aes with AVX2.
@olivergillespie, do you have results for other SPECjvm2008 benchmarks?
Thanks for the comments, all. I will run all SPECjvm2008 benchmarks before and after the change for us to look at - I'll share the results later today.

In the application that uncovered this issue for me, minor use of AVX3 instructions (about 3-4% of the overall CPU usage, all in various flavours of disjoint_arraycopy) on my Cascade Lake processor downclocks the whole machine by ~15% - that means all threads/cores/processes slowed by 15% across the board. The local speedup thanks to AVX3 has to be weighed against the global downclocking overhead, and I contend that most real-world applications will see far more overhead than benefit from AVX3 on Cascade Lake.
Below are my complete SPECjvm2008 results, running on an AWS EC2 m4.4xl host (CPU details also shared below), with a warmup time of 120s and 1 iteration of 240s per benchmark. My results are somewhat noisy, in part due to running on virtualized hardware.

There are a range of both regressions and improvements after my change, roughly equal in count and magnitude. Do keep in mind that any regressions (where UseAVX=2 is slower) are local to that operation, but improvements (where UseAVX=2 is faster) can often be felt by the whole machine - avoiding 15% downclocking is worth a lot more than a 15% speedup in one code path. On this basis, I think UseAVX=2 is the right default for this hardware.

@vnkozlov - I don't see a significant regression in crypto.aes in my runs; could you please share more info about your test and hardware?
Note for the
I got the regression when I did not load my system - I ran spec with
@olivergillespie I think these benchmarks should be done on bare-metal machines. As you have correctly observed, usage of AVX3 can impact the whole CPU. How can you make sure that other applications running in parallel to yours on the same virtualized hardware don't influence your results by using AVX3 themselves?

Also, I think you should run each sub-benchmark in isolation to make sure that the CPU has recovered from potential down-clocking caused by a previous sub-benchmark which heavily used AVX3. It might also be helpful to run a tool like i7z in parallel to your benchmarks which displays the clock speed of all CPUs (there might be better tools, but this should at least give you an idea of what's going on). Ideally, you should have an additional column for every result which displays the average CPU speed during the benchmark run.
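One low-tech way to produce the suggested average-frequency column on Linux is to sample the `cpu MHz` lines from `/proc/cpuinfo` while the benchmark runs. A sketch (my own, not from this PR; assumes the usual Linux `cpuinfo` field names):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Sketch: sample per-core clock speeds from /proc/cpuinfo so a benchmark
// harness can log an average-frequency column next to each result.
public class CpuMhzSampler {
    // Parse every "cpu MHz : 1234.567" line out of cpuinfo-formatted text.
    static List<Double> parseMhz(String cpuinfo) {
        List<Double> result = new ArrayList<>();
        for (String line : cpuinfo.split("\n")) {
            if (line.startsWith("cpu MHz")) {
                result.add(Double.parseDouble(line.substring(line.indexOf(':') + 1).trim()));
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        String text = Files.readString(Path.of("/proc/cpuinfo"));
        List<Double> mhz = parseMhz(text);
        double avg = mhz.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        System.out.printf("cores=%d avg=%.0f MHz%n", mhz.size(), avg);
    }
}
```

Calling `parseMhz` periodically from a background thread during each sub-benchmark, and averaging, would give the per-result frequency column @simonis describes.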
I just ran into this again on a different machine; on this one it downclocks from 3.6GHz to 3.1GHz.
@vnkozlov I tried with 1, 2, 4, 8 and 16 benchmark threads; all showed no regression on crypto.aes. Did you manage to re-run it?
From what I understand, only the core which is executing 512-bit vector instructions will observe this lower frequency, not the entire processor. It is doing double the work per clock during that time, so overall we should come out OK.
Thanks for the comments.
Yes, but in a typical multithreaded Java service (say a web app), many threads end up using these instructions via common operations like StringBuilder, plus the threads are not pinned to cores. In practice this can mean all cores occasionally hit AVX3 instructions. The slowdown on these processors does not only last while the instruction is executing; it persists far beyond that, so even a few short uses of AVX3 can end up perma-throttling all CPUs. From https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/:
2 milliseconds is a huge amount of time to run at the lower frequency. Each core only needs to hit these instructions once every 2 milliseconds to be throttled permanently, and that's what I see in my applications. More info: https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/

I'm not an expert on the mechanics of this; I only observe this same behaviour in every real-world JDK 17 application I've looked at which runs on Cascade Lake (four so far), namely a ~15% global slowdown with very little AVX3 usage. The AVX3 speedup definitely doesn't make up for it, because when I disable AVX3 my application's performance improves significantly (reduced latency, increased throughput).

Do you know what measurements were used to justify the original exception for model 85 stepping < 5? I could re-run those tests to compare on my hardware. Is there any other benchmark or measurement you'd like to see which might justify this change?

Maybe we can look at the change in this way: are model 85 steppings 5, 6 and 7 affected by the same issue as the earlier steppings, which are already excluded from AVX3 by default? It was evidently decided that the issue was severe enough for those CPUs to have AVX3 disabled by default; I merely find that three later steppings also suffer severely from this issue and should receive the same treatment.
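To make the "once every 2 milliseconds" point concrete, a back-of-envelope duty-cycle calculation (my own sketch; the ~2 ms persistence figure comes from the linked posts, not from any CPU spec):

```java
// Back-of-envelope: fraction of time a core runs at the lowered frequency,
// given how often it executes an AVX-512 instruction and how long the
// throttle persists afterwards (~2 ms per the linked articles).
public class ThrottleDutyCycle {
    static double throttledFraction(double hitsPerSecond, double persistMillis) {
        double fraction = hitsPerSecond * (persistMillis / 1000.0);
        return Math.min(fraction, 1.0); // can't be throttled more than 100% of the time
    }

    public static void main(String[] args) {
        // 500 hits/s with 2 ms persistence -> throttled 100% of the time.
        System.out.println(throttledFraction(500, 2.0));
        // Even 50 hits/s keeps the core throttled 10% of the time.
        System.out.println(throttledFraction(50, 2.0));
    }
}
```

So a code path reached a mere 500 times per second per core - trivially easy for something as common as `StringBuilder` growth or arraycopy - is enough for permanent throttling under these assumptions.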
I reran it but did not get this regression, so it is indeed non-stable. I think @simonis is right that we need to run each sub-benchmark separately. I am currently running with different I will try to run with increased iterations and separate sub-benchmarks. Based on your comment, you are using JDK 17. Which particular version do you have?
I think I have spent enough time on this already. Performance tracking is a "rabbit hole" :( As I said before, results are mixed compared to running on Skylake, where results were all positive. Yes, AVX-512 is not for all applications. I agree with your observations. Even so, I can't support this change.
Thanks for running the tests and sharing the results!
My tests were run on the latest JDK 19 tip build, but my real applications where I have observed the problems are using 17.0.3.6. I agree that results are mixed (though the biggest changes are in favour of AVX2). Given that we know for sure that AVX512 affects other processes and other code, mixed results are definitely not a good enough reason to enable AVX512 on these CPUs. I don't believe we would ever decide to enable AVX512 based on these results, hence my suggested change: don't enable AVX512 on these CPUs. There is no benefit on average to having AVX512 enabled on these models, yet it comes with huge downsides, so I think the risk profile leans heavily in favour of not enabling it by default for these models.
@vnkozlov I think I agree with @olivergillespie. Looking at the graphs you've posted, AVX2 seems superior to AVX3 even if only 1/4 of the threads are used, and clearly so if the machine is fully loaded. As @olivergillespie said, it would be hard to justify enabling AVX3 on these CPUs today, given these results. I would argue we should disable it by default and, as you said, let the few use cases which benefit from it, like AES on non-loaded machines, enable it manually with `-XX:UseAVX=3`.
I did standalone runs of sub-benchmarks several times to get the best results. Some of them show two sets of results: fast and slow. I did not re-run benchmarks if the difference was < 3%. We need to look at what is going on with
These results seem promising, but we should not base our decision only on this.
Thanks for all the help so far. Is there anything I can help with?
Thank you for offering help. We just need time to run the different benchmarks we use for performance testing.
@olivergillespie In our analysis of mpegaudio, we found that the problem was due to the auto-vectorizer kicking in for small array initialization. The auto-vectorization does not take AVX3Threshold into consideration. I have submitted a simple PR to address your concerns: #8877. Please take a look. The PR limits the auto-vectorizer to 256-bit (32-byte) vectors for Cascade Lake.
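For context, the kind of pattern described above - small array initialization that C2's superword pass can turn into wide vector stores - looks like an ordinary fill loop. This is purely illustrative (whether and at what width it vectorizes depends on the JIT, CPU, and flags such as `UseAVX`):

```java
// Illustrative: an init loop of the shape the HotSpot auto-vectorizer can
// compile into wide (up to 512-bit with AVX-512 enabled) vector stores.
public class InitLoop {
    static int[] fill(int n, int value) {
        int[] a = new int[n];
        for (int i = 0; i < n; i++) {
            a[i] = value; // candidate for vectorized stores after JIT compilation
        }
        return a;
    }

    public static void main(String[] args) {
        int[] a = fill(64, 7);
        System.out.println(a[0] + " " + a[63]);
    }
}
```

Limiting the vectorizer to 256-bit vectors, as #8877 does for Cascade Lake, means loops like this are still vectorized, just without the 512-bit forms that trigger the downclocking.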
Thanks @sviswa7! I don't have the expertise to evaluate your auto-vectorizer change from a theoretical standpoint. I'd like to test it in my real applications to see whether the issue you found in mpegaudio is the same one I see in a more complex app, but unfortunately I won't be able to do that quickly (it's not easy to swap out JDK versions in those apps).

I'm still of the opinion that AVX3 on Cascade Lake in the JVM is more dangerous than it's worth across the board, due to the documented inherent downclocking behaviour of that architecture. Various other SPECjvm benchmarks consistently downclock by >15%, so even if they don't show a direct performance regression with AVX3 (the improved AVX3 performance roughly balances the downclocking), they have a negative impact on other code running on the host. It's possible your tighter change covers most of the cases in practice, which would be great, but I think this broader change still makes sense conceptually.
@olivergillespie you can test your application with
@olivergillespie This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
@olivergillespie This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the