Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile,runtime: introduce growbyteslice? #49480

Open
quasilyte opened this issue Nov 9, 2021 · 22 comments
Open

cmd/compile,runtime: introduce growbyteslice? #49480

quasilyte opened this issue Nov 9, 2021 · 22 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@quasilyte
Copy link
Contributor

quasilyte commented Nov 9, 2021

Everything below applies to the latest Go master branch compiled from sources

When append() is compiled, this call is emited:

    growslice(elemtype, slice, capacity)

While building go tool, I found out that ~24% calls to growslice have first argument of []byte.
16.5% is for []string. Every other combination is less frequent and is not as interesting.

Here is a top20 results for go tool (total=10416):

   2486 growslice byte
   1721 growslice string
    334 growslice ir.Node
    312 growslice ssa.Edge
    183 growslice *Value
    150 growslice posetUndo
    140 growslice int
    118 growslice ID
    114 growslice *ir.Name
    104 growslice module.Version
    104 growslice loader.Sym
     98 growslice *Block
     86 growslice FileError
     78 growslice errorDesc
     77 growslice rune
     76 growslice ast.Node
     75 growslice *Regexp
     70 growslice *types.Type
     66 growslice Inst
     61 growslice Expr

Most other cases are in 1-5 range.

If we add a specialized wrapper growbyteslice like this one:

func growbyteslice(old slice, cap int) slice {
	return growslice(uint8Type, old, cap)
}

...then we can avoid passing the first argument for []byte.

Let's see how it affects an isolated case:

func f(xs []byte) []byte {
	xs = append(xs, 1)
	return xs
}
Attribute Before After
Func code size 126 110 (-16)
Func frame size 72 64 (-8)

Asm listings are below, for completeness.

Why this can be a worthwhile thing even if we add an extra call in slice growing path?
I believe that slice growing is not a hot path. It we're about to make allocations and
that would dominate the execution time. If we're following the happy path when newlen < cap,
then smaller code could help to be more code cache-friendly.

Now to some more real-world cases:

	// jx JSON encoder
	.text           -3424  561331   557907        -0.61%
	// go tool
	.text           -11488  5231050   5219562        -0.22%
	// hello world http server app
	.text           -7200  2117578  2110378       -0.34%
	// hugo
	.text           -61824  24407578  24345754       -0.25%

Different apps yield varying results, but the .text segment
size of the binary typically goes down by 0.22%-0.61%.

The main question: is adding a new runtime function growbyteslice too much for this case?

From one point of view, more functions -> bigger runtime code. At the same time, this is a simple
wrapper that decreases the code size more often than not without apparent performance issues.

All tests from all.bash seem to be passing for my prototype. I could send the CL is
this idea looks reasonable.


Before growbyteslice:

"".f STEXT size=126 args=0x18 locals=0x48 funcid=0x0 align=0x0
	0x0000 00000 (hello.go:3)	TEXT	"".f(SB), ABIInternal, $72-24
	0x0000 00000 (hello.go:3)	CMPQ	SP, 16(R14)
	0x0004 00004 (hello.go:3)	PCDATA	$0, $-2
	0x0004 00004 (hello.go:3)	JLS	89
	0x0006 00006 (hello.go:3)	PCDATA	$0, $-1
	0x0006 00006 (hello.go:3)	SUBQ	$72, SP
	0x000a 00010 (hello.go:3)	MOVQ	BP, 64(SP)
	0x000f 00015 (hello.go:3)	LEAQ	64(SP), BP
	0x0014 00020 (hello.go:3)	MOVQ	AX, "".xs+80(FP)
	0x0019 00025 (hello.go:3)	FUNCDATA	$0, gclocals·1a65e721a2ccc325b382662e7ffee780(SB)
	0x0019 00025 (hello.go:3)	FUNCDATA	$1, gclocals·69c1753bd5f81501d95132d08af04464(SB)
	0x0019 00025 (hello.go:3)	FUNCDATA	$5, "".f.arginfo1(SB)
	0x0019 00025 (hello.go:3)	FUNCDATA	$6, "".f.argliveinfo(SB)
	0x0019 00025 (hello.go:3)	PCDATA	$3, $1
	0x0019 00025 (hello.go:4)	LEAQ	1(BX), SI
	0x001d 00029 (hello.go:4)	NOP
	0x0020 00032 (hello.go:4)	CMPQ	CX, SI
	0x0023 00035 (hello.go:4)	JCC	72
	0x0025 00037 (hello.go:4)	MOVQ	BX, "".xs+88(SP)
	0x002a 00042 (hello.go:4)	PCDATA	$3, $2
	0x002a 00042 (hello.go:4)	MOVQ	CX, DI
	0x002d 00045 (hello.go:4)	MOVQ	BX, CX
	0x0030 00048 (hello.go:4)	MOVQ	AX, BX
	0x0033 00051 (hello.go:4)	LEAQ	type.uint8(SB), AX
	0x003a 00058 (hello.go:4)	PCDATA	$1, $0
	0x003a 00058 (hello.go:4)	CALL	runtime.growslice(SB)
	0x003f 00063 (hello.go:4)	LEAQ	1(BX), SI
	0x0043 00067 (hello.go:4)	MOVQ	"".xs+88(SP), BX
	0x0048 00072 (hello.go:4)	PCDATA	$1, $-1
	0x0048 00072 (hello.go:4)	PCDATA	$3, $1
	0x0048 00072 (hello.go:4)	MOVB	$1, (AX)(BX*1)
	0x004c 00076 (hello.go:5)	MOVQ	SI, BX
	0x004f 00079 (hello.go:5)	MOVQ	64(SP), BP
	0x0054 00084 (hello.go:5)	ADDQ	$72, SP
	0x0058 00088 (hello.go:5)	RET
	0x0059 00089 (hello.go:5)	NOP
	0x0059 00089 (hello.go:3)	PCDATA	$1, $-1
	0x0059 00089 (hello.go:3)	PCDATA	$0, $-2
	0x0059 00089 (hello.go:3)	MOVQ	AX, 8(SP)
	0x005e 00094 (hello.go:3)	MOVQ	BX, 16(SP)
	0x0063 00099 (hello.go:3)	MOVQ	CX, 24(SP)
	0x0068 00104 (hello.go:3)	CALL	runtime.morestack_noctxt(SB)
	0x006d 00109 (hello.go:3)	MOVQ	8(SP), AX
	0x0072 00114 (hello.go:3)	MOVQ	16(SP), BX
	0x0077 00119 (hello.go:3)	MOVQ	24(SP), CX
	0x007c 00124 (hello.go:3)	PCDATA	$0, $-1
	0x007c 00124 (hello.go:3)	JMP	0

After:

"".f STEXT size=110 args=0x18 locals=0x40 funcid=0x0 align=0x0
	0x0000 00000 (hello.go:3)	TEXT	"".f(SB), ABIInternal, $64-24
	0x0000 00000 (hello.go:3)	CMPQ	SP, 16(R14)
	0x0004 00004 (hello.go:3)	PCDATA	$0, $-2
	0x0004 00004 (hello.go:3)	JLS	73
	0x0006 00006 (hello.go:3)	PCDATA	$0, $-1
	0x0006 00006 (hello.go:3)	SUBQ	$64, SP
	0x000a 00010 (hello.go:3)	MOVQ	BP, 56(SP)
	0x000f 00015 (hello.go:3)	LEAQ	56(SP), BP
	0x0014 00020 (hello.go:3)	MOVQ	AX, "".xs+72(FP)
	0x0019 00025 (hello.go:3)	FUNCDATA	$0, gclocals·1a65e721a2ccc325b382662e7ffee780(SB)
	0x0019 00025 (hello.go:3)	FUNCDATA	$1, gclocals·69c1753bd5f81501d95132d08af04464(SB)
	0x0019 00025 (hello.go:3)	FUNCDATA	$5, "".f.arginfo1(SB)
	0x0019 00025 (hello.go:3)	FUNCDATA	$6, "".f.argliveinfo(SB)
	0x0019 00025 (hello.go:3)	PCDATA	$3, $1
	0x0019 00025 (hello.go:4)	LEAQ	1(BX), DI
	0x001d 00029 (hello.go:4)	NOP
	0x0020 00032 (hello.go:4)	CMPQ	CX, DI
	0x0023 00035 (hello.go:4)	JCC	56
	0x0025 00037 (hello.go:4)	MOVQ	BX, "".xs+80(SP)
	0x002a 00042 (hello.go:4)	PCDATA	$3, $2
	0x002a 00042 (hello.go:4)	PCDATA	$1, $0
	0x002a 00042 (hello.go:4)	CALL	runtime.growbyteslice(SB)
	0x002f 00047 (hello.go:4)	LEAQ	1(BX), DI
	0x0033 00051 (hello.go:4)	MOVQ	"".xs+80(SP), BX
	0x0038 00056 (hello.go:4)	PCDATA	$1, $-1
	0x0038 00056 (hello.go:4)	PCDATA	$3, $1
	0x0038 00056 (hello.go:4)	MOVB	$1, (AX)(BX*1)
	0x003c 00060 (hello.go:5)	MOVQ	DI, BX
	0x003f 00063 (hello.go:5)	MOVQ	56(SP), BP
	0x0044 00068 (hello.go:5)	ADDQ	$64, SP
	0x0048 00072 (hello.go:5)	RET
	0x0049 00073 (hello.go:5)	NOP
	0x0049 00073 (hello.go:3)	PCDATA	$1, $-1
	0x0049 00073 (hello.go:3)	PCDATA	$0, $-2
	0x0049 00073 (hello.go:3)	MOVQ	AX, 8(SP)
	0x004e 00078 (hello.go:3)	MOVQ	BX, 16(SP)
	0x0053 00083 (hello.go:3)	MOVQ	CX, 24(SP)
	0x0058 00088 (hello.go:3)	CALL	runtime.morestack_noctxt(SB)
	0x005d 00093 (hello.go:3)	MOVQ	8(SP), AX
	0x0062 00098 (hello.go:3)	MOVQ	16(SP), BX
	0x0067 00103 (hello.go:3)	MOVQ	24(SP), CX
	0x006c 00108 (hello.go:3)	PCDATA	$0, $-1
	0x006c 00108 (hello.go:3)	JMP	0

Update: makebyteslice

As pointed in the comments, it could be also worthwhile to see whether a specialized makeslice can be useful.

Top makeslice usages found during ./make.bash (total=4487):

   1194 makeslice byte (size=1) 64bit=false
    500 makeslice Word (size=8) 64bit=false
    422 makeslice *types.Field (size=8) 64bit=false
    246 makeslice string (size=16) 64bit=false
    177 makeslice uint32 (size=4) 64bit=false
    110 makeslice int (size=8) 64bit=false
     97 makeslice int32 (size=4) 64bit=false
     66 makeslice big.Word (size=8) 64bit=false
     60 makeslice ir.Node (size=16) 64bit=false
     58 makeslice obj.Reloc (size=24) 64bit=false
     56 makeslice bool (size=1) 64bit=false
     54 makeslice *types.Type (size=8) 64bit=false
     42 makeslice int64 (size=8) 64bit=false
     37 makeslice Type (size=16) 64bit=false
     34 makeslice module.Version (size=32) 64bit=false
     33 makeslice hcode (size=4) 64bit=false
     31 makeslice *Value (size=8) 64bit=false
     31 makeslice []byte (size=24) 64bit=false
     28 makeslice *Var (size=8) 64bit=false
     26 makeslice float64 (size=8) 64bit=false

All 1-byte makeslice usages found during ./make.bash (note that uint8 should be added to byte here):

   1194 makeslice byte (size=1) 64bit=false
     56 makeslice bool (size=1) 64bit=false
     25 makeslice uint8 (size=1) 64bit=false
      6 makeslice int8 (size=1) 64bit=false
      4 makeslice whiteSpace (size=1) 64bit=false
      3 makeslice level (size=1) 64bit=false
      2 makeslice sym.SymKind (size=1) 64bit=false
      2 makeslice markKind (size=1) 64bit=false
      1 makeslice RuneRole (size=1) 64bit=false

So, ~26% of makeslice calls have a byte type argument, most of the 1-byte sizes are byte/uint8 directly (named byte-sized types are a rarity).

More analysis

A detailed bincmp result for a simple web server is presented below (under a spoiler).

One may notice that for some symbols, the size has actually increased.

image

The reason for that, I assume, is that sometimes relevant arguments happen to be in the matching registers already. That could play in both directions though. It should be treated as a random noise.

Detailed bincmp output (A LOT OF TEXT)
```
.text           -4352  2117514  2113162       -0.21%

symbol name delta old new
bufio.(*Scanner).Scan -16 2319 2303 -0.69%
bytes.makeSlice -19 210 191 -9.05%
crypto/cipher.newCBC -13 370 357 -3.51%
crypto/elliptic.Marshal -21 383 362 -5.48%
crypto/hmac.New -13 709 696 -1.83%
crypto/md5.(*digest).MarshalBinary 2 527 529 0.38%
crypto/rsa.VerifyPKCS1v15 -11 1494 1483 -0.74%
crypto/rsa.VerifyPSS -14 855 841 -1.64%
crypto/sha1.(*digest).MarshalBinary 11 591 602 1.86%
crypto/sha256.(*digest).MarshalBinary 13 858 871 1.52%
crypto/sha512.(*digest).MarshalBinary 13 1050 1063 1.24%
crypto/tls.(*Conn).makeClientHello -26 3647 3621 -0.71%
crypto/tls.(*certificateMsg).marshal 1 805 806 0.12%
crypto/tls.(*certificateRequestMsg).marshal -12 1157 1145 -1.04%
crypto/tls.(*certificateRequestMsg).unmarshal -8 1423 1415 -0.56%
crypto/tls.(*cipherSuiteTLS13).expandLabel -1 1279 1278 -0.08%
crypto/tls.(*cipherSuiteTLS13).extract -10 357 347 -2.80%
crypto/tls.(*clientHandshakeState).doFullHandshake -96 5010 4914 -1.92%
crypto/tls.(*ecdheKeyAgreement).processServerKeyExchange 7 2245 2252 0.31%
crypto/tls.(*nistParameters).SharedKey -5 300 295 -1.67%
crypto/tls.ekmFromMasterSecret.func1 -27 1381 1354 -1.96%
crypto/tls.finishedHash.Sum -5 170 165 -2.94%
crypto/tls.finishedHash.clientSum -6 284 278 -2.11%
crypto/tls.finishedHash.serverSum -6 284 278 -2.11%
crypto/tls.keysFromMasterSecret -2 1129 1127 -0.18%
crypto/tls.masterFromPreMasterSecret -15 567 552 -2.65%
crypto/tls.md5SHA1Hash -5 466 461 -1.07%
crypto/tls.prf10 -5 636 631 -0.79%
crypto/tls.rsaKeyAgreement.generateClientKeyExchange -11 697 686 -1.58%
crypto/x509.parseRFC2821Mailbox -38 1413 1375 -2.69%
crypto/x509/pkix.RDNSequence.String -10 1554 1544 -0.64%
encoding/asn1.BitString.RightAlign -7 286 279 -2.45%
encoding/asn1.MarshalWithParams -7 517 510 -1.35%
encoding/asn1.appendTagAndLength -111 846 735 -13.12%
encoding/asn1.appendTimeCommon -93 2181 2088 -4.26%
encoding/asn1.appendUTCTime -13 716 703 -1.82%
encoding/asn1.makeGeneralizedTime -12 185 173 -6.49%
encoding/asn1.makeUTCTime -12 185 173 -6.49%
encoding/asn1.oidEncoder.Encode -12 680 668 -1.76%
encoding/asn1.parseBigInt -4 617 613 -0.65%
encoding/asn1.setEncoder.Encode -14 715 701 -1.96%
encoding/pem.Decode -3 2673 2670 -0.11%
encoding/pem.removeSpacesAndTabs -11 249 238 -4.42%
fmt.(*buffer).writeRune -20 441 421 -4.54%
fmt.(*fmt).fmtFloat -32 2137 2105 -1.50%
fmt.(*fmt).fmtSbx -64 1413 1349 -4.53%
fmt.(*pp).badVerb -55 1258 1203 -4.37%
fmt.(*pp).catchPanic -10 1332 1322 -0.75%
fmt.(*pp).doPrintf -64 4101 4037 -1.56%
fmt.(*pp).doPrintln -27 421 394 -6.41%
fmt.(*pp).fmtBytes -38 1747 1709 -2.18%
fmt.(*pp).fmtComplex -17 485 468 -3.51%
fmt.(*pp).fmtPointer -30 1145 1115 -2.62%
fmt.(*pp).printValue -204 7719 7515 -2.64%
fmt.(*pp).unknownType -22 693 671 -3.17%
io.copyBuffer -17 790 773 -2.15%
io.glob..func1 -13 138 125 -9.42%
log.(*Logger).Output -23 1029 1006 -2.24%
log.(*Logger).formatHeader -211 5477 5266 -3.85%
math/big.nat.itoa -30 1413 1383 -2.12%
mime.FormatMediaType -96 6119 6023 -1.57%
mime.consumeValue -12 807 795 -1.49%
mime.percentHexUnescape -23 756 733 -3.04%
net.(*IPMask).String -20 339 319 -5.90%
net.(*IPNet).String -32 837 805 -3.82%
net.IP.Mask -6 427 421 -1.41%
net.IP.String -27 2873 2846 -0.94%
net.IPMask.String -12 293 281 -4.10%
net.ParseCIDR -21 1065 1044 -1.97%
net.init -31 3823 3792 -0.81%
net.newRequest -32 618 586 -5.18%
net.open -2 231 229 -0.87%
net.parseIPv6 -32 1261 1229 -2.54%
net.readFull -16 424 408 -3.77%
net/http.(*http2Framer).WriteContinuation -21 554 533 -3.79%
net/http.(*http2Framer).WriteDataPadded -32 1068 1036 -3.00%
net/http.(*http2Framer).WriteGoAway -44 806 762 -5.46%
net/http.(*http2Framer).WriteHeaders -48 1311 1263 -3.66%
net/http.(*http2Framer).WritePing -22 340 318 -6.47%
net/http.(*http2Framer).WritePushPromise -32 1170 1138 -2.74%
net/http.(*http2Framer).WriteRSTStream -13 389 376 -3.34%
net/http.(*http2Framer).WriteSettings -12 599 587 -2.00%
net/http.(*http2Framer).WriteSettingsAck -4 186 182 -2.15%
net/http.(*http2Framer).WriteWindowUpdate -3 424 421 -0.71%
net/http.(*http2Framer).endWrite -8 293 285 -2.73%
net/http.(*http2bufferedWriter).Flush -12 347 335 -3.46%
net/http.(*http2bufferedWriter).Write -11 401 390 -2.74%
net/http.(*http2serverConn).readPreface.func1 -14 362 348 -3.87%
net/http.(*http2serverConn).runHandler.func1 -7 538 531 -1.30%
net/http.appendTime -7 1740 1733 -0.40%
net/http.glob..func1 -3 80 77 -3.75%
net/http.glob..func10 -10 218 208 -4.59%
net/http.glob..func15 -13 138 125 -9.42%
net/http.glob..func2 -3 80 77 -3.75%
net/http.glob..func3 -3 80 77 -3.75%
net/http.glob..func4 -3 80 77 -3.75%
net/http.glob..func5 -3 80 77 -3.75%
net/http.glob..func7 -13 138 125 -9.42%
net/http.glob..func8 -3 142 139 -2.11%
net/http.hexEscapeNonASCII -15 404 389 -3.71%
net/http.http2NewFramer.func2 -16 165 149 -9.70%
net/http.init -32 12490 12458 -0.26%
net/http.newBufioReader -32 788 756 -4.06%
net/http.newBufioWriterSize -10 583 573 -1.72%
net/http.putBufioWriter -3 274 271 -1.09%
net/textproto.(*Reader).readContinuedLineSlice -32 1509 1477 -2.12%
net/url.(*URL).String -39 3258 3219 -1.20%
net/url.EscapeError.Error -3 187 184 -1.60%
net/url.InvalidHostError.Error -8 205 197 -3.90%
net/url.unescape -11 2480 2469 -0.44%
os.ReadFile -42 869 827 -4.83%
os.Readlink -17 423 406 -4.02%
os.glob..func1 -13 138 125 -9.42%
path.Clean -9 2133 2124 -0.42%
path/filepath.Clean -63 2837 2774 -2.22%
reflect.addTypeBits -25 1127 1102 -2.22%
reflect.newAbiDesc -42 2298 2256 -1.83%
runtime.boundsError.Error -32 1543 1511 -2.07%
runtime.findfunctab -21 10340 10319 -0.20%
runtime.growbyteslice 56 56
runtime.makebyteslice 47 47
strconv.(*NumError).Error -3 442 439 -0.68%
strconv.appendEscapedRune -109 2075 1966 -5.25%
strconv.appendQuotedRuneWith -31 310 279 -10.00%
strconv.appendQuotedWith -72 1125 1053 -6.40%
strconv.fmtB -44 421 377 -10.45%
strconv.fmtE -138 1367 1229 -10.10%
strconv.fmtF -103 853 750 -12.08%
strconv.fmtX -173 1962 1789 -8.82%
strconv.formatDigits -16 599 583 -2.67%
strconv.unquote -37 2578 2541 -1.44%
strings.(*Builder).WriteRune -16 549 533 -2.91%
strings.(*byteStringReplacer).Replace 19 730 749 2.60%
strings.(*genericReplacer).Replace -13 217 204 -5.99%
strings.Join -19 1403 1384 -1.35%
strings.Map -23 1532 1509 -1.50%
strings.ToLower -30 628 598 -4.78%
syscall.NetlinkRIB -17 2181 2164 -0.78%
syscall.newNetlinkRouteRequest -9 302 293 -2.98%
time.Time.AppendFormat -96 9173 9077 -1.05%
time.Time.Format -7 312 305 -2.24%
time.Time.GoString -54 2397 2343 -2.25%
time.Time.MarshalBinary -14 586 572 -2.39%
time.Time.String -16 477 461 -3.35%
time.appendInt -32 645 613 -4.96%
time.formatNano -54 549 495 -9.84%
time.loadTzinfoFromZip -11 3273 3262 -0.34%
time.quote -36 1292 1256 -2.79%
vendor/golang.org/x/crypto/cry...byte.(*Builder).addLengthPrefixed 1 798 799 0.13%
vendor/golang.org/x/crypto/cryptobyte.(*Builder).flushChild 31 1520 1551 2.04%
vendor/golang.org/x/crypto/cry...yptobyte.(*String).readASN1BigInt 8 492 500 1.63%
vendor/golang.org/x/crypto/hkdf.Extract -11 280 269 -3.93%
vendor/golang.org/x/net/dns/dnsmessage.(*Name).pack -32 1395 1363 -2.29%
vendor/golang.org/x/net/dns/dn...smessage.(*Name).unpackCompressed -16 1085 1069 -1.47%
vendor/golang.org/x/net/dns/dnsmessage.(*Question).pack -27 436 409 -6.19%
vendor/golang.org/x/net/http2/hpack.(*Encoder).WriteField -59 2072 2013 -2.85%
vendor/golang.org/x/net/http2/hpack.AppendHuffmanString -21 653 632 -3.22%
vendor/golang.org/x/net/http2/hpack.appendHpackString -155 1099 944 -14.10%
vendor/golang.org/x/net/http2/hpack.appendIndexedName -45 677 632 -6.65%
vendor/golang.org/x/net/http2/hpack.appendNewName -30 313 283 -9.58%
vendor/golang.org/x/net/idna.encode -57 2046 1989 -2.79%
vendor/golang.org/x/text/unicode/norm.Form.Bytes -32 696 664 -4.60%
vendor/golang.org/x/text/unicode/norm.Form.String -32 665 633 -4.81%
vendor/golang.org/x/text/unicode/norm.appendQuick 14 710 724 1.97%
total -4209 204162 199953 -2.06%


    </pre>
</details>
@randall77
Copy link
Contributor

Seems reasonable. I wonder if there are other byte-specific cases for other builtins (maybe make or copy?).

growslice has a switch on et.size that is known if the element type is known to be a byte. I wonder if we can take advantage of that somehow. Without duplicating too much code, hopefully.

@ernado
Copy link
Contributor

ernado commented Nov 9, 2021

Encoders (at least those who encode to buffer) will greatly benefit from this optimization, because they are in general something like that:

func (e *Encoder) EncodeT(v T) {
    e.buf = append(e.buf, start)
    // ...
    e.buf = append(e.buf, value...)
    e.buf = append(e.buf, end)
}

Upd: example

@martisch
Copy link
Contributor

martisch commented Nov 9, 2021

Not sure how much performance/size win this still is, we can generally partition growslice into 3 use cases with reduced code path sizes and case switches internally:

  • growslice with no pointers in element: where the first arg is just the element size and then avoid the type load and just need a constant mov to set the argument (with usually small immediate).
  • growslice with pointers: arguments as before but can assume its only handling elements with pointers
  • growslice for 0 sized elements

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/363095 mentions this issue: cmd/compile,runtime: add makebyteslice and growbyteslice

@CAFxX
Copy link
Contributor

CAFxX commented Nov 10, 2021

growslice has a switch on et.size that is known if the element type is known to be a byte. I wonder if we can take advantage of that somehow. Without duplicating too much code, hopefully.

Only tangentially related to the specific issue but, since make was mentioned,mallocgc would also be a prime candidate for specialization based on the type of object allocated. Since object size is often known at compile-time, it could be specialized for allocation size (potentially precomputing the size class for small allocations) and on whether the object is needzero or noscan.

@martisch
Copy link
Contributor

I dont think mallogc is a good candidate since AFAIK except for make+copy the compiler does not generate calls for it and thereby the amount of space to be saved is limited mostly by direct runtime calls. The compiler generated code has newobject calls which already only takes a pointer so at most we can make it take zero arguments e.g. newobjectpointer and newobject1, newobject2 .... Im not sure when we are going to hit a net loss where we trade a bit of binary size for reduced icache misses.

func newobject(typ *_type) unsafe.Pointer {

@quasilyte
Copy link
Contributor Author

quasilyte commented Nov 11, 2021

Initially, I only thought about growbyteslice as that call is:

  1. Quite frequently inserted into the code (append calls for byte slices)
  2. It's only called when we need to grow the slice, which is expected to be a cold path (that's my understanding)
  3. It looked like the change was trivial, a simple wrapper that will forward a call

This part of the change should be relatively safe and there should be no performance regressions.

I did one extra step that Keith suggested and added a makebyteslice as it turned out to be somewhat interesting too, but it will lead to an extra function call in the make([]byte, ) path (makebyteslice -> makeslice instead of makeslice directly).

I tried to avoid other ideas that can be related or useful, but they require a separate investigation each, and the results can be less obvious (more tradeoff-like). I wonder if these ideas should be dug upon and reported as a separate issue (anyone is welcome to do that, I don't think that I'll go too deep in this one).

In my tests, growbyteslice (append) contributes more than 70% of the size decrease, so I can drop the makebyteslice part if it looks not good enough.

@cagedmantis cagedmantis added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 11, 2021
@cagedmantis cagedmantis added this to the Backlog milestone Nov 11, 2021
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 7, 2022
@mknyszek mknyszek moved this to Triage Backlog in Go Compiler / Runtime Jul 15, 2022
@egonelbre
Copy link
Contributor

egonelbre commented May 9, 2023

It seems like I rediscovered this idea independently and did a quick experiment: https://go-review.googlesource.com/c/go/+/493836

goos: windows
goarch: amd64
pkg: runtime
cpu: AMD Ryzen Threadripper 2950X 16-Core Processor
                                 │ append_before.txt~ │         append_after.txt~          │
                                 │       sec/op       │   sec/op     vs base               │
Append-32                                12.320n ± 2%   8.272n ± 1%  -32.86% (p=0.002 n=6)
AppendSlice/1Bytes-32                     2.238n ± 3%   2.237n ± 1%        ~ (p=1.000 n=6)
AppendSlice/4Bytes-32                     1.994n ± 3%   1.995n ± 1%        ~ (p=1.000 n=6)
AppendSlice/7Bytes-32                     2.350n ± 5%   2.223n ± 2%   -5.43% (p=0.002 n=6)
AppendSlice/8Bytes-32                     2.098n ± 4%   2.133n ± 1%   +1.67% (p=0.002 n=6)
AppendSlice/15Bytes-32                    2.369n ± 2%   2.349n ± 1%        ~ (p=0.589 n=6)
AppendSlice/16Bytes-32                    2.028n ± 9%   2.244n ± 4%  +10.63% (p=0.002 n=6)
AppendSlice/32Bytes-32                    2.243n ± 8%   2.240n ± 1%        ~ (p=0.554 n=6)
AppendSliceLarge/1024Bytes-32             383.6n ± 5%   334.0n ± 3%  -12.92% (p=0.002 n=6)
AppendSliceLarge/4096Bytes-32             1.493µ ± 1%   1.268µ ± 4%  -15.08% (p=0.002 n=6)
AppendSliceLarge/16384Bytes-32            4.865µ ± 2%   4.266µ ± 1%  -12.31% (p=0.002 n=6)
AppendSliceLarge/65536Bytes-32            14.41µ ± 3%   13.23µ ± 4%   -8.17% (p=0.002 n=6)
AppendSliceLarge/262144Bytes-32           52.77µ ± 4%   50.45µ ± 2%   -4.39% (p=0.026 n=6)
AppendSliceLarge/1048576Bytes-32          170.7µ ± 8%   158.8µ ± 3%   -6.95% (p=0.002 n=6)
geomean                                   85.16n        79.39n        -6.77%

Note: that implementation doesn't omit the element type, so there might be some additional win by omitting it.

Adding similar optimization for all the

switch {
cases. I would guess []byte, []int, []*T and "power of two" would be the important ones.

@egonelbre
Copy link
Contributor

Unfortunately encoding/json Encode benchmarks seem slower...

                    │ encode_before.txt~ │          encode_after.txt~          │
                    │       sec/op       │    sec/op     vs base               │
CodeEncoder-32              351.3µ ±  1%   386.5µ ±  5%  +10.01% (p=0.002 n=6)
CodeEncoderError-32         393.1µ ±  4%   393.0µ ±  4%        ~ (p=0.937 n=6)
EncodeMarshaler-32          26.91n ± 16%   26.20n ± 19%        ~ (p=0.485 n=6)
EncoderEncode-32            15.95n ± 40%   15.42n ± 42%        ~ (p=0.485 n=6)
geomean                     2.775µ         2.799µ         +0.86%

Either way, seems promising, but definitely needs some analysis why things get slower.

@egonelbre
Copy link
Contributor

@quasilyte btw, what did you use to create the initial top 20?

@quasilyte
Copy link
Contributor Author

@egonelbre IIRC, it was a bincmp tool executed on a go binary.
It can show a per-symbol size diff between the two Go binaries.
If bin1 function f size is different from the same symbol size in bin2, it prints that difference.

@egonelbre
Copy link
Contributor

@quasilyte I didn't mean the difference between the binaries, but more of the output from:

   1194 makeslice byte (size=1) 64bit=false
    500 makeslice Word (size=8) 64bit=false
    422 makeslice *types.Field (size=8) 64bit=false
    246 makeslice string (size=16) 64bit=false
    177 makeslice uint32 (size=4) 64bit=false
    110 makeslice int (size=8) 64bit=false
    ...

Or does bincmp do that as well?

@quasilyte
Copy link
Contributor Author

quasilyte commented May 18, 2023

I used a patched Go compiler to collect these metrics; there is no tool that would give you this output.
I think I was using a make.bash to run a test (although it could be just a compilation of cmd/compile, I don't remember, it was more than a year ago).
Every time a compiler was about to generate a makeslice call, I increased a relevant counter (depending on the type of the slice being allocated) to dump the final stats when it finished.

If you run the same test now, you won't get the same numbers due to the Go source code changes. Nonetheless, I would guess that it wouldn't affect the distribution by a whole lot, a []byte is probably the #1 still.
To find a good #2, different code bases should be taken into account, even if []string looks like an intuitive candidate.

@josharian
Copy link
Contributor

Every time a compiler was about to generate a makeslice call, I increased a relevant counter

@egonelbre this is a really common tactic for stuff like this.

Even easier than maintaining counters is to print every instance to stdout and use a tool like https://github.com/josharian/pct to accumulate them. (See the README there for another compiler contributor re-creating a similar tool. This is clearly a local optimum. Maybe even a global one. :P)

@quasilyte
Copy link
Contributor Author

Even easier than maintaining counters is to print every instance to stdout

This is actually the exact thing I was doing, but I was too embarrassed to admit that. :D

@egonelbre
Copy link
Contributor

Anyways some rough stats from different things:

compiler:
 38.29%   937 growslice uint8 size=1
 23.29%   570 growslice string size=16 hasptr
  2.45%    60 growslice module.Version size=32 hasptr
  1.80%    44 growslice zip.FileError size=32 hasptr
  1.59%    39 growslice *load.Package size=8 hasptr
  1.47%    36 growslice *work.Action size=8 hasptr
  1.31%    32 growslice int 8
  1.14%    28 growslice int32 4

storj non pointer, grouped by size:
 84.79%  4575 growslice size=1
  6.76%   365 growslice size=8
  3.58%   193 growslice size=4

k3s:
 20.25%  6089 growslice uint8 size=1
 18.54%  5575 growslice string size=16 hasptr
  9.53%  2865 growslice *field.Error size=8 hasptr
  7.49%  2252 growslice error size=16 hasptr
  4.21%  1265 growslice *yaml.Node size=8 hasptr
  1.31%   395 growslice interface {} size=16 hasptr
  1.24%   374 growslice int size=8

k3s non pointer:
 82.85%  6089 growslice uint8 size=1
  5.09%   374 growslice int size=8
  2.27%   167 growslice int32 size=4
  1.05%    77 growslice int64 size=8
  0.64%    47 growslice uint64 size=8
  0.49%    36 growslice yaml.yaml_parser_state_t size=8
  0.44%    32 growslice yaml.yaml_emitter_state_t size=8
  0.42%    31 growslice uint16 size=2
  0.37%    27 growslice zstd.seq size=16
  0.34%    25 growslice uint32 size=4

Roughly speaking, it seems that handling []string, []byte, []uint32 and []uint64 should cover most of the cases.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/495878 mentions this issue: runtime,cmd/compile: specialize growslice for []byte

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/496141 mentions this issue: runtime,cmd/compile: specialize growslice for []string

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/495876 mentions this issue: runtime: introduce nextslicecap

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/495877 mentions this issue: compile/internal/walk: add walkGrowslice

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/497318 mentions this issue: runtime,cmd/compile: specialize growslice for 4 and 8 byte types

@egonelbre
Copy link
Contributor

Ok, created the necessary changes for them... the benchmark results: https://gist.github.com/egonelbre/34efef3f5c4aef2035396a2f0a90cc96

The benefit from []byte, []string seems significant. []x32 and []x64 is harder to justify.

I still need to dig into the benchmarks, because from my initial testing it seems the append related tests are sensitive to code layout on my processor.

gopherbot pushed a commit that referenced this issue Sep 4, 2023
This allows to reuse the slice cap computation across
specialized growslice funcs.

Updates #49480

Change-Id: Ie075d9c3075659ea14c11d51a9cd4ed46aa0e961
Reviewed-on: https://go-review.googlesource.com/c/go/+/495876
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Run-TryBot: Egon Elbre <egonelbre@gmail.com>
Reviewed-by: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Ian Lance Taylor <iant@golang.org>
gopherbot pushed a commit that referenced this issue Sep 5, 2023
Move growslice generation to a separate func so that specialization
logic can be shared.

Updates #49480

Change-Id: I9ea5bb898753622d2d767546a46b4db6410dc725
Reviewed-on: https://go-review.googlesource.com/c/go/+/495877
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
Status: Triage Backlog
Development

No branches or pull requests

9 participants