cmd/compile: understand Go vs assembly rc4 results #27184

Open
bradfitz opened this Issue Aug 23, 2018 · 8 comments


bradfitz commented Aug 23, 2018

Tracking bug so somebody (@randall77?) looks into why we got totally opposite performance numbers on different Intel CPUs when we deleted the crypto/rc4 code's assembly in "favor" of the Go version in 30eda67 (https://golang.org/cl/130397):

name       old time/op   new time/op   delta
RC4_128-4    256ns ± 0%    196ns ± 0%  -23.78%  (p=0.029 n=4+4)
RC4_1K-4    2.38µs ± 0%   1.54µs ± 0%  -35.22%  (p=0.029 n=4+4)
RC4_8K-4    19.4µs ± 1%   12.0µs ± 0%  -38.35%  (p=0.029 n=4+4)

name       old speed     new speed     delta
RC4_128-4  498MB/s ± 0%  654MB/s ± 0%  +31.12%  (p=0.029 n=4+4)
RC4_1K-4   431MB/s ± 0%  665MB/s ± 0%  +54.34%  (p=0.029 n=4+4)
RC4_8K-4   418MB/s ± 1%  677MB/s ± 0%  +62.18%  (p=0.029 n=4+4)

vendor_id	: GenuineIntel
cpu family	: 6
model		: 142
model name	: Intel(R) Core(TM) i5-7Y54 CPU @ 1.20GHz
stepping	: 9
microcode	: 0x84
cpu MHz		: 800.036
cache size	: 4096 KB

name       old time/op   new time/op   delta
RC4_128-4    235ns ± 1%    431ns ± 0%  +83.00%  (p=0.000 n=10+10)
RC4_1K-4    1.74µs ± 0%   3.41µs ± 0%  +96.74%  (p=0.000 n=10+10)
RC4_8K-4    13.6µs ± 1%   26.8µs ± 0%  +97.58%   (p=0.000 n=10+9)

name       old speed     new speed     delta
RC4_128-4  543MB/s ± 0%  297MB/s ± 1%  -45.29%  (p=0.000 n=10+10)
RC4_1K-4   590MB/s ± 0%  300MB/s ± 0%  -49.16%  (p=0.000 n=10+10)
RC4_8K-4   596MB/s ± 1%  302MB/s ± 0%  -49.39%   (p=0.000 n=10+9)

vendor_id       : GenuineIntel
cpu family      : 6
model           : 63
model name      : Intel(R) Xeon(R) CPU @ 2.30GHz
stepping        : 0
microcode       : 0x1
cpu MHz         : 2300.000
cache size      : 46080 KB

Super mysterious, so we might want to understand it enough to decide whether we care and whether there's something the compiler might do better for more CPUs.

Maybe the benchmarks or benchstat are wrong? But then that'd be its own interesting bug.
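For anyone wanting to reproduce on other CPUs: the numbers above come from the stdlib's crypto/rc4 benchmarks. A standalone sketch in the same shape (buffer sizes assumed to match the stdlib's RC4_128/RC4_1K/RC4_8K cases) would be roughly:

```go
package main

import (
	"crypto/rc4"
	"fmt"
	"testing"
)

// benchRC4 mirrors the shape of crypto/rc4's benchmarks: encrypt a
// fixed-size buffer in place and let SetBytes report MB/s.
func benchRC4(size int) func(b *testing.B) {
	return func(b *testing.B) {
		c, _ := rc4.NewCipher([]byte{1, 2, 3, 4, 5, 6, 7, 8})
		buf := make([]byte, size)
		b.SetBytes(int64(size))
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			c.XORKeyStream(buf, buf)
		}
	}
}

func main() {
	// Sizes assumed to match the RC4_128 / RC4_1K / RC4_8K benchmarks above.
	for _, size := range []int{128, 1024, 8096} {
		fmt.Printf("RC4_%d:%s\n", size, testing.Benchmark(benchRC4(size)))
	}
}
```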

/cc @josharian @aead @FiloSottile

@bradfitz bradfitz added this to the Go1.12 milestone Aug 23, 2018

aead commented Aug 23, 2018

Again, on linux/amd64 I see the following results for 1.10, 1.11rc2, and tip.

name       old time/op   new time/op   delta
RC4_128-4    262ns ± 1%    204ns ± 0%  -22.48%  (p=0.029 n=4+4)
RC4_1K-4    2.46µs ± 0%   1.62µs ± 0%  -34.16%  (p=0.029 n=4+4)
RC4_8K-4    19.7µs ± 0%   12.7µs ± 0%  -35.71%  (p=0.029 n=4+4)

name       old speed     new speed     delta
RC4_128-4  487MB/s ± 1%  627MB/s ± 0%  +28.71%  (p=0.029 n=4+4)
RC4_1K-4   416MB/s ± 0%  631MB/s ± 0%  +51.91%  (p=0.029 n=4+4)
RC4_8K-4   410MB/s ± 0%  638MB/s ± 0%  +55.55%  (p=0.029 n=4+4)

vendor_id	: GenuineIntel
cpu family	: 6
model		: 78
model name	: Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz
stepping	: 3
microcode	: 0xc2
cpu MHz		: 546.356
cache size	: 4096 KB

Taking the MacBook Pro (Intel(R) Core(TM) i5-6267U CPU @ 2.90GHz) stats from CL-102255 into account, it seems like performance is reduced only on server chips (Xeon)...
Further, the assembly generated by the compiler for the generic implementation does not use any SIMD (SSE/SSE2) instructions, while the hand-written amd64 code does...

FiloSottile commented Aug 24, 2018

It was suggested to me in private that the assembly might be penalized by a slowdown in accessing xmm registers.

the latency of moving things between general-purpose and xmm registers seems to have increased with Skylake, so those pinsrw instructions are probably more costly, and they are in the critical path

Also see the replies to
https://twitter.com/FiloSottile/status/1031994832460210176

However, I still find it confusing that the compiled code, which should not be doing anything special, is faster on the 1.2GHz CPU than on the 2.3GHz one.

uluyol commented Aug 24, 2018

The 1.2 GHz processor can turbo to 3.2 GHz.


magical commented Aug 24, 2018

Just to add another data point... I tried this on my work machine and got similar numbers to Brad's.

name       old time/op   new time/op   delta
RC4_128-8    183ns ± 1%    335ns ± 1%  +82.55%  (p=0.000 n=16+19)
RC4_1K-8    1.37µs ± 4%   2.65µs ± 1%  +94.01%  (p=0.000 n=20+19)
RC4_8K-8    10.6µs ± 1%   20.9µs ± 1%  +97.31%  (p=0.000 n=18+19)

name       old speed     new speed     delta
RC4_128-8  696MB/s ± 2%  382MB/s ± 1%  -45.09%  (p=0.000 n=17+19)
RC4_1K-8   748MB/s ± 4%  386MB/s ± 1%  -48.45%  (p=0.000 n=20+19)
RC4_8K-8   766MB/s ± 1%  388MB/s ± 1%  -49.32%  (p=0.000 n=18+19)

vendor_id	: GenuineIntel
cpu family	: 6
model		: 63
model name	: Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz
stepping	: 2
microcode	: 54
cpu MHz		: 1200.000
cache size	: 10240 KB
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm cqm_llc cqm_occup_llc
TocarIP commented Aug 24, 2018

Looks like there are at least two issues:

  1. The asm version is slower on Skylake than on Haswell and has lower IPC (instructions per cycle): 2.14 vs 1.47.
  2. More importantly, the pure Go version has significantly higher IPC on Skylake (and is faster): 3.37 vs 1.84.
TocarIP commented Sep 7, 2018

I think I have a hypothesis explaining the code running much faster on Skylake. Perf shows 1,466,664,692 resource_stalls.rs events, which means that for 1.4×10^9 cycles the reservation station couldn't accept uops. Skylake has 50% higher reservation station capacity, which should allow it to interleave more iterations.

However, I don't think this explanation provides actionable advice (having fewer instructions and shorter dependency chains is already a goal).

randall77 commented Dec 12, 2018

Punting to unplanned, too late for anything major in 1.12.

@randall77 randall77 modified the milestones: Go1.12, Unplanned Dec 12, 2018
