-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmd/compile: jump tables perform much worse on AMD zen3? #53331
Comments
I cannot reproduce on a Ryzen 3600 CPU (zen2) 3200mhz cl16 ram:
It would be nice to confirm with an other zen3 cpu to know if that a problem for the complete generation. |
More Zen 2 results (that can't reproduce the problem): goos: linux goarch: amd64 cpu: AMD Ryzen 9 3950X 16-Core Processor name old time/op new time/op delta Same-32 19.9µs ± 4% 19.9µs ± 2% ~ (p=1.000 n=10+10) Ordered-32 34.1µs ± 4% 27.1µs ± 3% -20.68% (p=0.000 n=10+10) Random-32 127µs ± 1% 92µs ± 1% -27.19% (p=0.000 n=10+10) |
More for Details
|
AMD Ryzen 7 5800X goos: linux goarch: amd64 cpu: AMD Ryzen 7 5800X 8-Core Processor name old time/op new time/op delta Same-16 17.2µs ± 0% 15.1µs ± 1% -12.53% Ordered-16 25.9µs ± 3% 73.2µs ± 1% +182.54% Random-16 35.9µs ± 1% 71.0µs ± 2% +97.93% |
Thanks for results so far. Here's a summary for AMD chips:
|
AMD Ryzen 3900X (Windows, 64GB 2666MHZ ECC):
( AMD EPYC 7502P (Debian, 512GB 2933MHZ ECC):
|
Zen 2 arch AMD Ryzen 5 3400G with Radeon Vega Graphics
|
I can confirm that mobile zen3 is also affected;
|
Possibly this effect could be related to similar problems we saw with CMOV generation. Branches, if you occasionally guess right, can let loads happen in parallel that would otherwise be serialized. See the discussion starting at #26306 (comment) |
Also possibly the loss in performance has something to do with Spectre mitigation? |
Worth noting that I got my slow-down numbers above on Linux with |
More than happy to help benchmark attempts at solving the regression, as I imagine few in the compiler team have zen3 machines. |
It's possible that this report on reddit stems from this issue. |
Possibly unrelated, OP reported it's more likely a NUMA issue, as disabling the second CCD makes it go faster: https://www.reddit.com/r/golang/comments/y4w74o/comment/isjxqby/ |
Zen4 is also affected. I use 1.18.10 and 1.21.3.
|
On my AMD Ryzen 5900x, I noticed a drop in performance after jump tables were added (1ba96d8).
The CL warns that microbenchmarks will favor the older binary search approach, while jump tables will be better utilized in larger benchmarks since it uses fewer hardware resources (branch predictors, etc.). To measure, this I tried to craft a micro-benchmark that would exercise edge cases of either approach. This is what I came up with:
Essentially,
KindString
is just likereflect.Kind.String
but uses a giant switch statement.The
Same
,Ordered
, andRandom
benchmarks keep callingKindString
with either the same argument, sequentially ordered arguments (i.e., easily predictable), or completely random arguments. I would expect jump tables to perform better onRandom
over the binary search approach since binary search would continually run log2(27) layers of branch prediction failures.I ran the benchmarks using a Go toolchain built at 1ba96d8 and also the commit right before it.
On at least 3 CPUs that I have access to, I see promising results:
Here, jump tables do better than binary search on all cases. Great!
Here, jump tables does worse on
Ordered
, but has much better performance onRandom
. Somewhat expected.Here, this is a Digital Ocean droplet, which is probably an Intel Xeon processor of some kind.
Here, we see jump tables do better than binary search on all cases. Great again!
However, things turn for the worse on an AMD Ryzen 5900X:
Here, the jump tables are dramatically worse on all cases. I'm surprised that it's worse by 2x on the
Random
benchmark, which is where I expect jump tables to outshine binary search.I understand that performance improvements are really hard to quantify since what works well for one CPU design (e.g., Intel) may not be optimal for another CPU design (e.g., AMD). However, these results are much more drastic than I was expecting. I don't know whether this issue is just with the Ryzen 5900X or all AMD chips. I also, don't know if this is going to result in any changes to the implementation of jump tables, but I think this is a surprising result worth reporting.
\cc @randall77 @cherrymui
The text was updated successfully, but these errors were encountered: