
cmd/compile: jump tables perform much worse on AMD zen3? #53331

Open
dsnet opened this issue Jun 11, 2022 · 10 comments
Labels
NeedsInvestigation Performance

Comments

@dsnet (Member) commented Jun 11, 2022

On my AMD Ryzen 5900x, I noticed a drop in performance after jump tables were added (1ba96d8).


The CL warns that microbenchmarks will favor the older binary search approach, while jump tables will do better in larger benchmarks since they use fewer hardware resources (branch predictors, etc.). To measure this, I tried to craft a micro-benchmark that would exercise the edge cases of either approach. This is what I came up with:

func BenchmarkSame(b *testing.B) {
	for i := 0; i < b.N; i++ {
		for _, k := range sourcesSame {
			sink = KindString(k)
		}
	}
}
func BenchmarkOrdered(b *testing.B) {
	for i := 0; i < b.N; i++ {
		for _, k := range sourcesOrdered {
			sink = KindString(k)
		}
	}
}
func BenchmarkRandom(b *testing.B) {
	for i := 0; i < b.N; i++ {
		for _, k := range sourcesRandom {
			sink = KindString(k)
		}
	}
}

Essentially, KindString is just like reflect.Kind.String, but implemented as one giant switch statement.
The Same, Ordered, and Random benchmarks call KindString with the same argument every time, with sequentially ordered arguments (i.e., easily predictable), or with completely random arguments, respectively. I would expect jump tables to beat the binary search approach on Random, since binary search would repeatedly incur up to log2(27) layers of branch mispredictions.

I ran the benchmarks using a Go toolchain built at 1ba96d8 and also the commit right before it.

On at least 3 CPUs that I have access to, I see promising results:

cpu: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
benchmark              old ns/op     new ns/op     delta
BenchmarkSame-8        38859         31988         -17.68%
BenchmarkOrdered-8     41096         40540         -1.35%
BenchmarkRandom-8      154696        108772        -29.69%

Here, jump tables do better than binary search on all cases. Great!

cpu: Intel(R) Celeron(R) CPU J3355 @ 2.00GHz
benchmark              old ns/op     new ns/op     delta
BenchmarkSame-2        47493         44109         -7.13%
BenchmarkOrdered-2     81576         140271        +71.95%
BenchmarkRandom-2      240434        138600        -42.35%

Here, jump tables do worse on Ordered but much better on Random. Somewhat expected.

cpu: DO-Regular
benchmark            old ns/op     new ns/op     delta
BenchmarkSame        32203         23410         -27.30%
BenchmarkOrdered     35731         27531         -22.95%
BenchmarkRandom      179656        125754        -30.00%

This is a Digital Ocean droplet, probably an Intel Xeon processor of some kind.
Again, jump tables do better than binary search in all cases. Great again!

However, things turn for the worse on an AMD Ryzen 5900X:

cpu: AMD Ryzen 9 5900X 12-Core Processor            
benchmark               old ns/op     new ns/op     delta
BenchmarkSame-24        19514         61886         +217.14%
BenchmarkOrdered-24     22926         75008         +227.17%
BenchmarkRandom-24      38740         72986         +88.40%

Here, the jump tables are dramatically worse on all cases. I'm surprised that it's worse by 2x on the Random benchmark, which is where I expect jump tables to outshine binary search.


I understand that performance improvements are hard to quantify, since what works well for one CPU design (e.g., Intel) may not be optimal for another (e.g., AMD). However, these results are much more drastic than I was expecting. I don't know whether this issue affects just the Ryzen 5900X or all AMD chips. I also don't know if this will result in any changes to the implementation of jump tables, but I think it is a surprising result worth reporting.

\cc @randall77 @cherrymui

@Jorropo (Contributor) commented Jun 11, 2022

> I don't know whether this issue is just with the Ryzen 5900X or all AMD chips.

I cannot reproduce this on a Ryzen 3600 CPU (Zen 2) with 3200 MHz CL16 RAM:

benchmark               old ns/op     new ns/op     delta
BenchmarkSame-12        19689         20289         +3.05%
BenchmarkOrdered-12     32540         27448         -15.65%
BenchmarkRandom-12      139527        93838         -32.75%

It would be nice to confirm with another Zen 3 CPU, to know whether this is a problem for the whole generation.


The core of your benchmark is a tight loop: three non-pipelinable loads back to back (the last one being an instruction fetch). That looks like a worst-case scenario for jump tables. It would perform better if k were resolved earlier (e.g., by adding more work to the loop body), which should happen in most real-world code.

Your memory speed might play a role here, since the benchmark should be bottlenecked by memory latency (although I would hope everything finds its way into L1/L2, I suspect on your CPU it doesn't).

@dominikh (Member) commented Jun 11, 2022

More Zen 2 results (that can't reproduce the problem):

goos: linux
goarch: amd64
cpu: AMD Ryzen 9 3950X 16-Core Processor
name        old time/op  new time/op  delta
Same-32     19.9µs ± 4%  19.9µs ± 2%     ~     (p=1.000 n=10+10)
Ordered-32  34.1µs ± 4%  27.1µs ± 3%  -20.68%  (p=0.000 n=10+10)
Random-32    127µs ± 1%    92µs ± 1%  -27.19%  (p=0.000 n=10+10)

@florianl (Contributor) commented Jun 11, 2022

More for AMD Ryzen 7 PRO 4750U

goos: linux
goarch: amd64
pkg: jumptables
cpu: AMD Ryzen 7 PRO 4750U with Radeon Graphics
benchmark               old ns/op     new ns/op     delta
BenchmarkSame-16        20669         20676         +0.03%
BenchmarkOrdered-16     36472         28692         -21.33%
BenchmarkRandom-16      131109        97300         -25.79%

@hhoughgg commented Jun 11, 2022

AMD Ryzen 7 5800X

goos: linux
goarch: amd64
cpu: AMD Ryzen 7 5800X 8-Core Processor
name        old time/op  new time/op  delta
Same-16     17.2µs ± 0%  15.1µs ± 1%   -12.53%  
Ordered-16  25.9µs ± 3%  73.2µs ± 1%  +182.54% 
Random-16   35.9µs ± 1%  71.0µs ± 2%   +97.93%

@dsnet (Member, Author) commented Jun 11, 2022

Thanks for the results so far. Here's a summary for AMD chips:

| CPU | Microarch | Same Δ | Ordered Δ | Random Δ | Source |
|-----|-----------|--------|-----------|----------|--------|
| Ryzen 5 3400G | Zen 2 | +4% | ~ | -24% | @soypat |
| Ryzen 5 3600 | Zen 2 | +3% | -16% | -33% | @Jorropo |
| Ryzen 7 PRO 4750U | Zen 2 | ~ | -21% | -26% | @florianl |
| Ryzen 9 3900X | Zen 2 | -11% | -15% | -32% | @inkeliz |
| Ryzen 9 3950X | Zen 2 | ~ | -21% | -27% | @dominikh |
| EPYC 7502P | Zen 2 | -22% | -3% | -37% | @inkeliz |
| Ryzen 7 5800X | Zen 3 | -13% | +183% | +98% | @hhoughgg |
| Ryzen 7 PRO 5850U | Zen 3 | ~ | +182% | +109% | @mvdan |
| Ryzen 9 5900X | Zen 3 | +217% | +227% | +88% | @dsnet |

@inkeliz commented Jun 11, 2022

AMD Ryzen 3900X (Windows, 64GB 2666MHZ ECC):

name        old time/op  new time/op  delta
Same-24     20.9µs ± 1%  18.6µs ± 1%  -11.02%  (p=0.000 n=10+10)
Ordered-24  30.0µs ± 2%  25.6µs ± 0%  -14.75%  (p=0.000 n=10+10)
Random-24    130µs ± 2%    88µs ± 0%  -32.26%  (p=0.000 n=10+10)

(old = go1.18.2; new = go1.19-55590f3a2b)


AMD EPYC 7502P (Debian, 512GB 2933MHZ ECC):

name        old time/op  new time/op  delta
Same-64     27.2µs ± 0%  21.2µs ± 0%  -22.12%  (p=0.000 n=10+10)
Ordered-64  37.8µs ± 2%  36.6µs ± 0%   -3.20%  (p=0.000 n=10+9)
Random-64    170µs ± 1%   108µs ± 1%  -36.66%  (p=0.000 n=9+10)

@dsnet dsnet changed the title cmd/compile: jump tables perform much worse on AMD? cmd/compile: jump tables perform much worse on AMD zen3? Jun 11, 2022
@soypat commented Jun 12, 2022

Zen 2 arch AMD Ryzen 5 3400G with Radeon Vega Graphics

$ go version
go version go1.18.2 linux/amd64
$ go test -bench=. -count=100 . > old.txt
$ go1.19beta1 test -bench=. -count=35 . > new.txt
$ benchstat old.txt new.txt
name       old time/op  new time/op  delta
Same-8     23.3µs ± 5%  24.2µs ± 3%   +4.16%  (p=0.000 n=91+35)
Ordered-8  29.8µs ± 4%  29.8µs ± 3%     ~     (p=0.517 n=97+34)
Random-8    132µs ± 5%   100µs ± 3%  -24.34%  (p=0.000 n=95+33)

@mvdan (Member) commented Jun 12, 2022

I can confirm that mobile Zen 3 is also affected: an AMD Ryzen 7 PRO 5850U with Radeon Graphics on linux/amd64, idle and plugged in so it benchmarks at ~25W, showing the parent commit versus the change:

name        old time/op  new time/op  delta
Same-16     18.3µs ± 2%  18.2µs ± 1%      ~     (p=0.394 n=6+6)
Ordered-16  27.6µs ± 2%  77.8µs ± 1%  +181.89%  (p=0.002 n=6+6)
Random-16   37.4µs ± 0%  78.1µs ± 1%  +108.62%  (p=0.010 n=4+6)

@randall77 (Contributor) commented Jun 13, 2022

This effect could possibly be related to similar problems we saw with CMOV generation: branches, if you occasionally guess right, can let loads happen in parallel that would otherwise be serialized. See the discussion starting at #26306 (comment).
I'm not convinced it is the same effect, but maybe. It would be worth examining the benchmark to see whether that effect is possible, and if so, trying prefetches to see if that reduces it.

@randall77 (Contributor) commented Jun 13, 2022

Also, possibly the loss in performance has something to do with Spectre mitigations?

@thanm thanm added the NeedsInvestigation label Jun 14, 2022
10 participants