Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: unnecessary padding in stack frames #42385

Open
randall77 opened this issue Nov 5, 2020 · 9 comments
Open

cmd/compile: unnecessary padding in stack frames #42385

randall77 opened this issue Nov 5, 2020 · 9 comments

Comments

@randall77
Copy link
Contributor

@randall77 randall77 commented Nov 5, 2020

In cmd/compile/internal/gc/pgen.go:185, we do:

		if thearch.LinkArch.InFamily(sys.MIPS, sys.MIPS64, sys.ARM, sys.ARM64, sys.PPC64, sys.S390X) {
			s.stksize = Rnd(s.stksize, int64(Widthptr))
		}

This code originated in commit 2ac375b (then called cmd/gc/pgen.c:474, committed by Luuk van Dijk).

		if(thechar == '5')
			stksize = rnd(stksize, widthptr);

There's no indication why this code was added. I suspect it is not necessary (maybe it was, but is no longer). This issue is to investigate why we do this and whether we can remove it.

Reported on go-nuts by Eric from arm.

For a simple repro, compile

package main

func g()
func f(p, q *int16) {
	x := *p
	y := *q
	g()
	*p = y
	*q = x
}

The autotmps should be 2 bytes apart, not 8. They are 2 bytes apart on amd64, but not arm64.

@erifan
Copy link
Contributor

@erifan erifan commented Nov 5, 2020

Thanks @randall77 , I'm preparing a CL for this issue. I think it's worth fixing. For other architectures, I am not very clear, I will leave them as they are.

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Nov 5, 2020

I remember that long time ago I tried to remove that padding, but it made many benchmarks slower (which I didn't actually investigate or understand), so I didn't proceed.

@mengzhuo
Copy link
Contributor

@mengzhuo mengzhuo commented Nov 5, 2020

I will check this CL on mips64

@erifan
Copy link
Contributor

@erifan erifan commented Nov 5, 2020

I remember that long time ago I tried to remove that padding, but it made many benchmarks slower (which I didn't actually investigate or understand), so I didn't proceed.

Okay~, I will test all benchmarks on more linux/arm64 machines to see the current situation. But because there are many benchmarks that have great fluctuations, it is difficult to say. Intuitively I think a smaller stack might be better, and gcc does the same. I tested the performance of the compiler on multiple linux/arm64 machines, and there was basically no change.

@gopherbot
Copy link

@gopherbot gopherbot commented Nov 6, 2020

Change https://golang.org/cl/267999 mentions this issue: cmd/compile: adjust stack slot alignment requirements on arm64

@erifan
Copy link
Contributor

@erifan erifan commented Nov 6, 2020

Hi, use the above CL, I have got the test results on linux/arm64.

$ go version
go version devel +3ef8562c9c Thu Nov 5 02:48:05 2020 +0000 linux/arm64

First of all go1, basically there is no change.

name                      old time/op    new time/op    delta
BinaryTree17-64              2.80s ± 2%     2.81s ± 2%    ~     (p=0.421 n=5+5)
Fannkuch11-64                2.72s ± 0%     2.72s ± 1%    ~     (p=0.730 n=4+5)
FmtFprintfEmpty-64          48.8ns ± 1%    49.1ns ± 1%    ~     (p=0.230 n=5+5)
FmtFprintfString-64         86.7ns ± 1%    86.8ns ± 1%    ~     (p=0.460 n=5+5)
FmtFprintfInt-64            95.9ns ± 0%    96.1ns ± 0%    ~     (p=0.183 n=5+5)
FmtFprintfIntInt-64          158ns ± 0%     158ns ± 0%    ~     (p=0.333 n=5+4)
FmtFprintfPrefixedInt-64     191ns ± 0%     190ns ± 0%    ~     (p=0.079 n=4+5)
FmtFprintfFloat-64           222ns ± 0%     223ns ± 0%  +0.45%  (p=0.016 n=4+5)
FmtManyArgs-64               601ns ± 0%     606ns ± 0%  +0.93%  (p=0.008 n=5+5)
GobDecode-64                6.81ms ± 0%    6.80ms ± 1%    ~     (p=0.690 n=5+5)
GobEncode-64                5.25ms ± 0%    5.34ms ± 0%  +1.57%  (p=0.008 n=5+5)
Gzip-64                      256ms ± 0%     254ms ± 0%  -0.74%  (p=0.008 n=5+5)
Gunzip-64                   39.0ms ± 0%    39.1ms ± 0%  +0.36%  (p=0.008 n=5+5)
HTTPClientServer-64         50.2µs ± 2%    51.0µs ± 2%    ~     (p=0.421 n=5+5)
JSONEncode-64               11.1ms ± 1%    11.0ms ± 0%  -1.11%  (p=0.016 n=5+4)
JSONDecode-64               48.4ms ± 0%    48.5ms ± 0%    ~     (p=0.095 n=5+5)
Mandelbrot200-64            3.43ms ± 0%    3.43ms ± 0%  -0.06%  (p=0.008 n=5+5)
GoParse-64                  3.33ms ± 0%    3.32ms ± 0%  -0.40%  (p=0.029 n=4+4)
RegexpMatchEasy0_32-64      79.6ns ± 0%    80.0ns ± 0%  +0.50%  (p=0.008 n=5+5)
RegexpMatchEasy0_1K-64       232ns ± 1%     233ns ± 1%    ~     (p=0.190 n=5+5)
RegexpMatchEasy1_32-64      75.0ns ± 0%    75.2ns ± 1%    ~     (p=0.841 n=5+5)
RegexpMatchEasy1_1K-64       406ns ± 0%     415ns ± 1%  +2.17%  (p=0.008 n=5+5)
RegexpMatchMedium_32-64     6.84ns ± 0%    6.85ns ± 0%    ~     (p=0.365 n=5+5)
RegexpMatchMedium_1K-64     37.6µs ± 0%    37.5µs ± 0%    ~     (p=0.310 n=5+5)
RegexpMatchHard_32-64       2.05µs ± 0%    2.06µs ± 0%  +0.28%  (p=0.008 n=5+5)
RegexpMatchHard_1K-64       62.6µs ± 0%    62.2µs ± 1%  -0.64%  (p=0.016 n=5+5)
Revcomp-64                   481ms ± 0%     480ms ± 0%  -0.26%  (p=0.008 n=5+5)
Template-64                 59.9ms ± 1%    60.4ms ± 1%  +0.85%  (p=0.032 n=5+5)
TimeParse-64                 329ns ± 0%     331ns ± 1%  +0.61%  (p=0.016 n=4+5)
TimeFormat-64                380ns ± 0%     384ns ± 1%  +1.11%  (p=0.008 n=5+5)

name                      old speed      new speed      delta
GobDecode-64               113MB/s ± 0%   113MB/s ± 1%    ~     (p=0.690 n=5+5)
GobEncode-64               146MB/s ± 0%   144MB/s ± 0%  -1.55%  (p=0.008 n=5+5)
Gzip-64                   75.9MB/s ± 0%  76.5MB/s ± 0%  +0.74%  (p=0.008 n=5+5)
Gunzip-64                  498MB/s ± 0%   496MB/s ± 0%  -0.36%  (p=0.008 n=5+5)
JSONEncode-64              175MB/s ± 1%   177MB/s ± 0%  +1.12%  (p=0.016 n=5+4)
JSONDecode-64             40.1MB/s ± 0%  40.0MB/s ± 0%    ~     (p=0.111 n=5+5)
GoParse-64                17.4MB/s ± 0%  17.5MB/s ± 0%  +0.42%  (p=0.029 n=4+4)
RegexpMatchEasy0_32-64     402MB/s ± 0%   400MB/s ± 0%  -0.47%  (p=0.008 n=5+5)
RegexpMatchEasy0_1K-64    4.41GB/s ± 1%  4.39GB/s ± 0%    ~     (p=0.095 n=5+5)
RegexpMatchEasy1_32-64     427MB/s ± 0%   426MB/s ± 1%    ~     (p=0.690 n=5+5)
RegexpMatchEasy1_1K-64    2.52GB/s ± 0%  2.47GB/s ± 1%  -2.08%  (p=0.008 n=5+5)
RegexpMatchMedium_32-64    146MB/s ± 0%   146MB/s ± 0%    ~     (p=0.167 n=5+5)
RegexpMatchMedium_1K-64   27.3MB/s ± 0%  27.3MB/s ± 0%    ~     (p=0.333 n=5+5)
RegexpMatchHard_32-64     15.6MB/s ± 0%  15.5MB/s ± 0%  -0.26%  (p=0.016 n=5+5)
RegexpMatchHard_1K-64     16.4MB/s ± 0%  16.5MB/s ± 1%  +0.65%  (p=0.016 n=5+5)
Revcomp-64                 528MB/s ± 0%   529MB/s ± 0%  +0.26%  (p=0.008 n=5+5)
Template-64               32.4MB/s ± 1%  32.1MB/s ± 1%  -0.85%  (p=0.032 n=5+5)

Then I ran all the benchmarks in the standard library and processed the results as follows:

  1. Remove items that fluctuate within 3%
  2. Test some benchmarks with large fluctuations separately again, several false items were removed. But I didn't double check all of them, only a few cases with very big changes.
    And the result is as follow:
name                                                       old time/op      new time/op      delta
pkg:bufio goos:linux goarch:arm64
ReaderWriteToOptimal-64                                         632ns ± 3%       601ns ± 0%   -4.87%  (p=0.008 n=5+5)
pkg:bytes goos:linux goarch:arm64
IndexRune/4M-64                                                 170µs ± 0%       161µs ± 0%   -5.09%  (p=0.008 n=5+5)
IndexRuneASCII/4M-64                                            170µs ± 0%       161µs ± 0%   -5.04%  (p=0.008 n=5+5)
Equal/4M-64                                                     232µs ± 0%       213µs ± 1%   -8.13%  (p=0.008 n=5+5)
IndexEasy/4M-64                                                 170µs ± 0%       161µs ± 0%   -5.09%  (p=0.008 n=5+5)
CountEasy/4M-64                                                 170µs ± 0%       161µs ± 0%   -5.11%  (p=0.008 n=5+5)
pkg:crypto/ecdsa goos:linux goarch:arm64
SignP384-64                                                    2.04ms ± 2%      1.95ms ± 4%   -4.32%  (p=0.008 n=5+5)
pkg:encoding/gob goos:linux goarch:arm64
EndToEndPipe-64                                                 263ns ± 2%       287ns ± 1%   +8.89%  (p=0.008 n=5+5)
DecodeComplex128Slice-64                                       14.1µs ± 3%      15.3µs ± 4%   +8.18%  (p=0.008 n=5+5)
DecodeInt32Slice-64                                            9.37µs ± 2%     10.10µs ± 5%   +7.79%  (p=0.008 n=5+5)
pkg:encoding/json goos:linux goarch:arm64
Issue10335-64                                                   125ns ± 1%       130ns ± 1%   +3.67%  (p=0.008 n=5+5)
Issue34127-64                                                  21.1ns ± 0%      24.1ns ± 2%  +14.41%  (p=0.008 n=5+5)
TypeFieldsCache/MissTypes100-64                                 327µs ± 1%       315µs ± 1%   -3.72%  (p=0.008 n=5+5)
TypeFieldsCache/MissTypes10000-64                              23.4ms ± 1%      24.2ms ± 0%   +3.20%  (p=0.008 n=5+5)
pkg:expvar goos:linux goarch:arm64
StringSet-64                                                   48.8ns ± 2%      51.4ns ± 2%   +5.33%  (p=0.008 n=5+5)
MapSet-64                                                      1.53µs ± 6%      1.65µs ± 0%   +7.83%  (p=0.016 n=5+4)
MapSetString-64                                                1.52µs ± 1%      1.63µs ± 1%   +7.13%  (p=0.008 n=5+5)
pkg:html goos:linux goarch:arm64
Escape-64                                                      12.1µs ± 0%      12.7µs ± 0%   +4.56%  (p=0.008 n=5+5)
pkg:html/template goos:linux goarch:arm64
URLEscaper-64                                                  3.03µs ± 0%      2.79µs ± 0%   -7.86%  (p=0.008 n=5+5)
URLNormalizer-64                                               2.24µs ± 0%      2.07µs ± 0%   -7.55%  (p=0.008 n=5+5)
SrcsetFilter-64                                                 929ns ± 0%       887ns ± 0%   -4.52%  (p=0.016 n=4+5)
pkg:index/suffixarray goos:linux goarch:arm64
New/text=rand/size=10M/bits=32-64                               919ms ± 3%       957ms ± 2%   +4.13%  (p=0.008 n=5+5)
New/text=rand/size=10M/bits=64-64                               1.11s ± 2%       1.16s ± 1%   +4.95%  (p=0.008 n=5+5)
New/text=rand/size=50M/bits=32-64                               7.38s ± 1%       7.63s ± 2%   +3.41%  (p=0.008 n=5+5)
pkg:math goos:linux goarch:arm64
Nextafter64-64                                                 4.03ns ± 1%      4.40ns ± 0%   +9.24%  (p=0.008 n=5+5)
Remainder-64                                                   35.6ns ± 0%      37.7ns ± 0%   +5.90%  (p=0.008 n=5+5)
SqrtGoLatency-64                                               49.2ns ± 0%      50.8ns ± 0%   +3.25%  (p=0.008 n=5+5)
pkg:math/big goos:linux goarch:arm64
SubVW/100000-64                                                27.6µs ± 2%      28.4µs ± 1%   +3.06%  (p=0.008 n=5+5)
SubVWext/100000-64                                             46.1µs ± 1%      47.9µs ± 2%   +4.08%  (p=0.008 n=5+5)
pkg:math/rand goos:linux goarch:arm64
Read64-64                                                       127ns ± 0%       140ns ± 0%  +10.24%  (p=0.008 n=5+5)
Read1000-64                                                    1.74µs ± 0%      1.96µs ± 0%  +13.02%  (p=0.008 n=5+5)
pkg:mime goos:linux goarch:arm64
TypeByExtension/.unused-64                                     1.81ns ± 0%      1.63ns ± 0%   -9.94%  (p=0.008 n=5+5)
ExtensionsByType/text/html;_charset=utf-8-64                    330ns ± 2%       313ns ± 2%   -5.27%  (p=0.008 n=5+5)
ExtensionsByType/application/octet-stream-64                    128ns ± 4%       123ns ± 1%   -4.06%  (p=0.008 n=5+5)
pkg:mime/quotedprintable goos:linux goarch:arm64
Writer-64                                                      9.33µs ± 0%      9.64µs ± 0%   +3.35%  (p=0.008 n=5+5)
pkg:net goos:linux goarch:arm64
InterfacesAndMulticastAddrs-64                                 62.3µs ± 0%      64.4µs ± 0%   +3.44%  (p=0.008 n=5+5)
Splice/tcp-to-tcp/1024-64                                      3.66µs ± 2%      3.54µs ± 2%   -3.31%  (p=0.016 n=5+5)
pkg:net/http goos:linux goarch:arm64
CookieString-64                                                 832ns ± 0%       862ns ± 0%   +3.56%  (p=0.008 n=5+5)
HeaderWriteSubset-64                                            688ns ± 0%       659ns ± 0%   -4.18%  (p=0.016 n=4+5)
FileAndServer_1KB/NoTLS-64                                     70.9µs ± 4%      76.8µs ± 1%   +8.25%  (p=0.008 n=5+5)
FileAndServer_1KB/TLS-64                                       82.7µs ± 1%      85.8µs ± 1%   +3.74%  (p=0.008 n=5+5)
ClientServerParallel4-64                                        135µs ± 4%       147µs ± 3%   +9.08%  (p=0.008 n=5+5)
CloseNotifier-64                                               86.5µs ± 1%      89.2µs ± 1%   +3.19%  (p=0.008 n=5+5)
pkg:os goos:linux goarch:arm64
Expand/multiple-64                                              197ns ± 0%       204ns ± 0%   +3.55%  (p=0.008 n=5+5)
pkg:reflect goos:linux goarch:arm64
CallArgCopy/size=256-64                                        4.16ns ± 4%      3.83ns ± 0%   -7.98%  (p=0.008 n=5+5)
CallArgCopy/size=1024-64                                       5.18ns ± 0%      4.84ns ± 0%   -6.53%  (p=0.000 n=5+4)
CallArgCopy/size=4096-64                                       8.91ns ± 0%      9.44ns ± 0%   +5.98%  (p=0.029 n=4+4)
CallArgCopy/size=65536-64                                       110ns ± 0%       106ns ± 0%   -3.64%  (p=0.000 n=5+4)
FieldByName2-64                                                 523ns ± 3%       478ns ± 1%   -8.57%  (p=0.008 n=5+5)
pkg:runtime goos:linux goarch:arm64
ChanProdCons0-64                                               1.95µs ± 0%      1.81µs ± 1%   -7.19%  (p=0.008 n=5+5)
ChanProdConsWork0-64                                           2.23µs ± 1%      2.12µs ± 1%   -4.65%  (p=0.008 n=5+5)
ChanSem-64                                                      419ns ± 3%       458ns ± 9%   +9.36%  (p=0.008 n=5+5)
SetTypePtr-64                                                  4.25ns ± 0%      4.41ns ± 0%   +3.67%  (p=0.016 n=4+5)
Allocation-64                                                  12.2µs ± 1%      12.7µs ± 1%   +4.12%  (p=0.008 n=5+5)
WriteBarrier-64                                                5.15ns ± 1%      5.71ns ± 2%  +10.87%  (p=0.008 n=5+5)
MSpanCountAlloc/bits=128-64                                    7.36ns ± 2%      7.01ns ± 0%   -4.76%  (p=0.000 n=5+4)
EqEfaceConcrete-64                                             0.80ns ± 0%      0.50ns ± 0%  -37.95%  (p=0.008 n=5+5)
NeEfaceConcrete-64                                             0.80ns ± 0%      0.50ns ± 0%  -37.88%  (p=0.000 n=5+4)
pkg:runtime/internal/atomic goos:linux goarch:arm64
And8Parallel-64                                                 299ns ± 9%       274ns ± 5%   -8.48%  (p=0.008 n=5+5)
Xadd64-64                                                      17.4ns ± 7%      15.1ns ± 3%  -13.42%  (p=0.016 n=5+4)
pkg:runtime/trace goos:linux goarch:arm64
NewTask-64                                                     69.4ns ± 2%      72.6ns ± 2%   +4.67%  (p=0.008 n=5+5)
pkg:strconv goos:linux goarch:arm64
Atof64Decimal-64                                               47.9ns ± 1%      46.4ns ± 2%   -3.05%  (p=0.008 n=5+5)
Atof64RandomBits-64                                             173ns ± 0%       194ns ± 0%  +11.88%  (p=0.016 n=5+4)
AppendUintVarlen/12345678901234567-64                          53.3ns ± 0%      51.6ns ± 0%   -3.19%  (p=0.029 n=4+4)
AppendUintVarlen/123456789012345678-64                         53.6ns ± 0%      51.8ns ± 0%   -3.28%  (p=0.000 n=4+5)
AppendUintVarlen/1234567890123456789-64                        57.5ns ± 0%      55.4ns ± 0%   -3.65%  (p=0.016 n=5+4)
AppendUintVarlen/12345678901234567890-64                       57.6ns ± 0%      55.2ns ± 0%   -4.10%  (p=0.008 n=5+5)
pkg:strings goos:linux goarch:arm64
ToLower/longStrinGwitHmixofsmaLLandcAps-64                      201ns ± 2%       194ns ± 1%   -3.48%  (p=0.016 n=5+5)
pkg:sync goos:linux goarch:arm64
LoadMostlyHits/*sync_test.RWMutexMap-64                        88.7ns ± 2%      84.6ns ± 3%   -4.62%  (p=0.008 n=5+5)
LoadOrStoreBalanced/*sync.Map-64                                621ns ± 2%       652ns ± 1%   +4.93%  (p=0.008 n=5+5)
LoadOrStoreUnique/*sync_test.RWMutexMap-64                      949ns ± 6%      1007ns ± 5%   +6.05%  (p=0.032 n=5+5)
LoadOrStoreUnique/*sync.Map-64                                 1.13µs ± 2%      1.21µs ± 2%   +6.76%  (p=0.008 n=5+5)
LoadAndDeleteBalanced/*sync_test.RWMutexMap-64                  354ns ± 2%       410ns ± 9%  +15.58%  (p=0.008 n=5+5)
Range/*sync_test.RWMutexMap-64                                 67.9µs ± 4%      70.9µs ± 1%   +4.33%  (p=0.032 n=5+5)
Mutex-64                                                        111ns ± 9%       140ns ± 5%  +25.54%  (p=0.008 n=5+5)
MutexWork-64                                                    150ns ± 1%       154ns ± 1%   +3.21%  (p=0.008 n=5+5)
// checked: fluctuation is too big
MutexWorkSlack-64                                               318ns ±23%       200ns ± 5%  -37.27%  (p=0.008 n=5+5)
MutexNoSpin-64                                                  846ns ± 1%       805ns ± 1%   -4.85%  (p=0.008 n=5+5)
// checked: fluctuation is too big
MutexSpin-64                                                   3.40µs ± 0%      2.08µs ±16%  -39.03%  (p=0.008 n=5+5)
SemaSyntNonblock-64                                             610ns ± 8%       743ns ± 6%  +21.74%  (p=0.008 n=5+5)
RWMutexWrite100-64                                              136ns ± 6%       117ns ± 5%  -14.26%  (p=0.008 n=5+5)
RWMutexWrite10-64                                               117ns ±21%        83ns ± 7%  -29.09%  (p=0.008 n=5+5)
WaitGroupAddDone-64                                            75.5ns ± 4%      79.9ns ± 7%   +5.88%  (p=0.040 n=5+5)
WaitGroupAddDoneWork-64                                         109ns ± 2%       122ns ± 3%  +12.09%  (p=0.008 n=5+5)
// checked: fluctuation is too big
WaitGroupWait-64                                               0.09ns ±36%      0.11ns ± 0%  +25.23%  (p=0.016 n=5+4)
WaitGroupWaitWork-64                                           0.80ns ± 7%      0.86ns ± 0%   +7.33%  (p=0.016 n=5+4)
WaitGroupActuallyWait-64                                       34.0ns ± 0%      32.2ns ± 1%   -5.35%  (p=0.008 n=5+5)
pkg:time goos:linux goarch:arm64
SimultaneousAfterFunc-64                                        249µs ± 2%       272µs ± 2%   +8.94%  (p=0.008 n=5+5)
StartStop-64                                                   44.3µs ± 5%      40.4µs ± 4%   -8.92%  (p=0.008 n=5+5)
Sleep-64                                                        272µs ± 1%       286µs ± 1%   +5.09%  (p=0.008 n=5+5)
TickerResetNaive-64                                            72.0µs ± 7%      64.4µs ± 6%  -10.63%  (p=0.016 n=5+5)
pkg:unicode/utf16 goos:linux goarch:arm64
EncodeRune-64                                                  6.00ns ± 0%      4.40ns ± 0%  -26.67%  (p=0.008 n=5+5)
pkg:unicode/utf8 goos:linux goarch:arm64
RuneCountTenJapaneseChars-64                                   82.8ns ± 0%      76.0ns ± 0%   -8.21%  (p=0.000 n=5+4)
RuneCountInStringTenJapaneseChars-64                           76.4ns ± 0%      82.8ns ± 0%   +8.38%  (p=0.016 n=5+4)
// checked 3: results are stable, caused by cache miss
ValidTenASCIIChars-64                                          8.00ns ± 0%      8.80ns ± 0%  +10.00%  (p=0.016 n=5+4)
ValidTenJapaneseChars-64                                       56.3ns ± 0%      59.6ns ± 0%   +5.86%  (p=0.008 n=5+5)
ValidStringTenASCIIChars-64                                    7.20ns ± 0%      9.20ns ± 0%  +27.78%  (p=0.016 n=5+4)


name                                                       old speed        new speed        delta
pkg:bytes goos:linux goarch:arm64
IndexRune/4M-64                                              24.7GB/s ± 0%    26.0GB/s ± 0%   +5.36%  (p=0.008 n=5+5)
IndexRuneASCII/4M-64                                         24.7GB/s ± 0%    26.0GB/s ± 0%   +5.31%  (p=0.008 n=5+5)
Equal/4M-64                                                  18.1GB/s ± 0%    19.7GB/s ± 1%   +8.86%  (p=0.008 n=5+5)
IndexEasy/4M-64                                              24.7GB/s ± 0%    26.0GB/s ± 0%   +5.36%  (p=0.008 n=5+5)
CountEasy/4M-64                                              24.7GB/s ± 0%    26.0GB/s ± 0%   +5.39%  (p=0.008 n=5+5)
pkg:index/suffixarray goos:linux goarch:arm64
New/text=rand/size=10M/bits=32-64                            10.9MB/s ± 3%    10.4MB/s ± 2%   -3.95%  (p=0.008 n=5+5)
New/text=rand/size=10M/bits=64-64                            9.04MB/s ± 2%    8.62MB/s ± 1%   -4.73%  (p=0.008 n=5+5)
New/text=rand/size=50M/bits=32-64                            6.77MB/s ± 1%    6.55MB/s ± 2%   -3.28%  (p=0.008 n=5+5)
pkg:math/big goos:linux goarch:arm64
SubVWext/100000-64                                           17.4GB/s ± 1%    16.7GB/s ± 2%   -3.91%  (p=0.008 n=5+5)
pkg:net goos:linux goarch:arm64
Splice/tcp-to-tcp/1024-64                                     280MB/s ± 2%     289MB/s ± 2%   +3.40%  (p=0.016 n=5+5)
pkg:net/http goos:linux goarch:arm64
FileAndServer_1KB/NoTLS-64                                   14.5MB/s ± 4%    13.3MB/s ± 1%   -7.69%  (p=0.008 n=5+5)
FileAndServer_1KB/TLS-64                                     12.4MB/s ± 1%    11.9MB/s ± 1%   -3.60%  (p=0.008 n=5+5)
pkg:reflect goos:linux goarch:arm64
CallArgCopy/size=256-64                                      61.6GB/s ± 4%    66.9GB/s ± 0%   +8.57%  (p=0.008 n=5+5)
CallArgCopy/size=1024-64                                      198GB/s ± 0%     211GB/s ± 0%   +6.94%  (p=0.008 n=5+5)
CallArgCopy/size=4096-64                                      460GB/s ± 0%     434GB/s ± 0%   -5.71%  (p=0.016 n=4+5)
CallArgCopy/size=65536-64                                     596GB/s ± 0%     616GB/s ± 0%   +3.29%  (p=0.008 n=5+5)
pkg:runtime goos:linux goarch:arm64
SetTypePtr-64                                                1.88GB/s ± 0%    1.82GB/s ± 0%   -3.52%  (p=0.008 n=5+5)

First of all, we can see that there are not many cases with large performance changes relative to the total number. Then through the above results, we can see that there are both obvious improvements and also obvious declines. Seems the overall improvement is more obvious ? Then we analyzed several stable regression cases, one of which is utf8.ValidStringTenASCIIChars.
ValidStringTenASCIIChars-64 7.20ns ± 0% 9.20ns ± 0% +27.78% (p=0.016 n=5+4)
After analysis, we found that the regression is caused by the increase of cache miss. This makes sense, because the layout of the stack has changed. And I think most of the above performance changes related to this change are basically caused by this reason. The change of the stack layout may lead to an increase in cache miss in some cases and a decrease in cache miss in some cases, and of course it may not change. If this is the case, I think we should make this change on linux/arm64. After all, the stack size is reduced and the program is easier to understand.

@randall77
Copy link
Contributor Author

@randall77 randall77 commented Nov 6, 2020

I agree. This seems in the absence of any data to be a good change. We want to see the data just to make sure that there aren't any unexpected regressions. The fact that it randomly makes some code worse and some code better is unfortunate but not a blocker. Only consistently worse would block this change.

Rant: Why is there so much variation here? This seems like a very minor change. I would expect the stack to be in L1 cache approximately always. Especially ValidStringTenASCIIChars, it is a leaf function and uses (on amd64) 32 bytes of stack. The only effect of packing stack frames better should be to occasionally need one less stack growth. It's not like you can get false sharing on a stack frame. Maybe L1 is only 1-way associative on arm64 and the stack and string it is operating on happen to conflict? I hate modern hardware.

@erifan
Copy link
Contributor

@erifan erifan commented Nov 9, 2020

Rant: Why is there so much variation here?

Without this change, there's also so much variation because many benchmarks themselves have relatively large fluctuations, especially benchmarks in sync package.

This seems like a very minor change. I would expect the stack to be in L1 cache approximately always. Especially ValidStringTenASCIIChars, it is a leaf function and uses (on amd64) 32 bytes of stack. The only effect of packing stack frames better should be to occasionally need one less stack growth. It's not like you can get false sharing on a stack frame. Maybe L1 is only 1-way associative on arm64 and the stack and string it is operating on happen to conflict? I hate modern hardware.

The test machine has 4-way associative L1 cache. There are only about 5 cache misses related to BenchmarkValidStringTenASCIIChars, the increased cache-miss mainly occurs in runtime and kernel related functions, such as __wake_up_common_lock, arch_local_irq_restore, runtime.greyobject, runtime.retake etc. I haven't studied Go's testing framework in depth, and I am also confused about the impact of these functions on testing. But the fact is that we can indeed influence this benchmark by adding some irrelevant and useless code, and we can see that the cache miss and performance changes are consistent, because there is no change in the code, so the cache miss seems to be the only reasonable explanation. There are some hardware-related details involved here, and I don't really understand it either. 😢

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
6 participants
You can’t perform that action at this time.