Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile,runtime: introduce growbyteslice? #49480

Open
quasilyte opened this issue Nov 9, 2021 · 7 comments
Open

cmd/compile,runtime: introduce growbyteslice? #49480

quasilyte opened this issue Nov 9, 2021 · 7 comments

Comments

@quasilyte
Copy link
Contributor

@quasilyte quasilyte commented Nov 9, 2021

Everything below applies to the latest Go master branch compiled from sources

When append() is compiled, this call is emited:

    growslice(elemtype, slice, capacity)

While building go tool, I found out that ~24% calls to growslice have first argument of []byte.
16.5% is for []string. Every other combination is less frequent and is not as interesting.

Here is a top20 results for go tool (total=10416):

   2486 growslice byte
   1721 growslice string
    334 growslice ir.Node
    312 growslice ssa.Edge
    183 growslice *Value
    150 growslice posetUndo
    140 growslice int
    118 growslice ID
    114 growslice *ir.Name
    104 growslice module.Version
    104 growslice loader.Sym
     98 growslice *Block
     86 growslice FileError
     78 growslice errorDesc
     77 growslice rune
     76 growslice ast.Node
     75 growslice *Regexp
     70 growslice *types.Type
     66 growslice Inst
     61 growslice Expr

Most other cases are in 1-5 range.

If we add a specialized wrapper growbyteslice like this one:

func growbyteslice(old slice, cap int) slice {
	return growslice(uint8Type, old, cap)
}

...then we can avoid passing the first argument for []byte.

Let's see how it affects an isolated case:

func f(xs []byte) []byte {
	xs = append(xs, 1)
	return xs
}
Attribute Before After
Func code size 126 110 (-16)
Func frame size 72 64 (-8)

Asm listings are below, for completeness.

Why this can be a worthwhile thing even if we add an extra call in slice growing path?
I believe that slice growing is not a hot path. It we're about to make allocations and
that would dominate the execution time. If we're following the happy path when newlen < cap,
then smaller code could help to be more code cache-friendly.

Now to some more real-world cases:

	// jx JSON encoder
	.text           -3424  561331   557907        -0.61%
	// go tool
	.text           -11488  5231050   5219562        -0.22%
	// hello world http server app
	.text           -7200  2117578  2110378       -0.34%
	// hugo
	.text           -61824  24407578  24345754       -0.25%

Different apps yield varying results, but the .text segment
size of the binary typically goes down by 0.22%-0.61%.

The main question: is adding a new runtime function growbyteslice too much for this case?

From one point of view, more functions -> bigger runtime code. At the same time, this is a simple
wrapper that decreases the code size more often than not without apparent performance issues.

All tests from all.bash seem to be passing for my prototype. I could send the CL is
this idea looks reasonable.


Before growbyteslice:

"".f STEXT size=126 args=0x18 locals=0x48 funcid=0x0 align=0x0
	0x0000 00000 (hello.go:3)	TEXT	"".f(SB), ABIInternal, $72-24
	0x0000 00000 (hello.go:3)	CMPQ	SP, 16(R14)
	0x0004 00004 (hello.go:3)	PCDATA	$0, $-2
	0x0004 00004 (hello.go:3)	JLS	89
	0x0006 00006 (hello.go:3)	PCDATA	$0, $-1
	0x0006 00006 (hello.go:3)	SUBQ	$72, SP
	0x000a 00010 (hello.go:3)	MOVQ	BP, 64(SP)
	0x000f 00015 (hello.go:3)	LEAQ	64(SP), BP
	0x0014 00020 (hello.go:3)	MOVQ	AX, "".xs+80(FP)
	0x0019 00025 (hello.go:3)	FUNCDATA	$0, gclocals·1a65e721a2ccc325b382662e7ffee780(SB)
	0x0019 00025 (hello.go:3)	FUNCDATA	$1, gclocals·69c1753bd5f81501d95132d08af04464(SB)
	0x0019 00025 (hello.go:3)	FUNCDATA	$5, "".f.arginfo1(SB)
	0x0019 00025 (hello.go:3)	FUNCDATA	$6, "".f.argliveinfo(SB)
	0x0019 00025 (hello.go:3)	PCDATA	$3, $1
	0x0019 00025 (hello.go:4)	LEAQ	1(BX), SI
	0x001d 00029 (hello.go:4)	NOP
	0x0020 00032 (hello.go:4)	CMPQ	CX, SI
	0x0023 00035 (hello.go:4)	JCC	72
	0x0025 00037 (hello.go:4)	MOVQ	BX, "".xs+88(SP)
	0x002a 00042 (hello.go:4)	PCDATA	$3, $2
	0x002a 00042 (hello.go:4)	MOVQ	CX, DI
	0x002d 00045 (hello.go:4)	MOVQ	BX, CX
	0x0030 00048 (hello.go:4)	MOVQ	AX, BX
	0x0033 00051 (hello.go:4)	LEAQ	type.uint8(SB), AX
	0x003a 00058 (hello.go:4)	PCDATA	$1, $0
	0x003a 00058 (hello.go:4)	CALL	runtime.growslice(SB)
	0x003f 00063 (hello.go:4)	LEAQ	1(BX), SI
	0x0043 00067 (hello.go:4)	MOVQ	"".xs+88(SP), BX
	0x0048 00072 (hello.go:4)	PCDATA	$1, $-1
	0x0048 00072 (hello.go:4)	PCDATA	$3, $1
	0x0048 00072 (hello.go:4)	MOVB	$1, (AX)(BX*1)
	0x004c 00076 (hello.go:5)	MOVQ	SI, BX
	0x004f 00079 (hello.go:5)	MOVQ	64(SP), BP
	0x0054 00084 (hello.go:5)	ADDQ	$72, SP
	0x0058 00088 (hello.go:5)	RET
	0x0059 00089 (hello.go:5)	NOP
	0x0059 00089 (hello.go:3)	PCDATA	$1, $-1
	0x0059 00089 (hello.go:3)	PCDATA	$0, $-2
	0x0059 00089 (hello.go:3)	MOVQ	AX, 8(SP)
	0x005e 00094 (hello.go:3)	MOVQ	BX, 16(SP)
	0x0063 00099 (hello.go:3)	MOVQ	CX, 24(SP)
	0x0068 00104 (hello.go:3)	CALL	runtime.morestack_noctxt(SB)
	0x006d 00109 (hello.go:3)	MOVQ	8(SP), AX
	0x0072 00114 (hello.go:3)	MOVQ	16(SP), BX
	0x0077 00119 (hello.go:3)	MOVQ	24(SP), CX
	0x007c 00124 (hello.go:3)	PCDATA	$0, $-1
	0x007c 00124 (hello.go:3)	JMP	0

After:

"".f STEXT size=110 args=0x18 locals=0x40 funcid=0x0 align=0x0
	0x0000 00000 (hello.go:3)	TEXT	"".f(SB), ABIInternal, $64-24
	0x0000 00000 (hello.go:3)	CMPQ	SP, 16(R14)
	0x0004 00004 (hello.go:3)	PCDATA	$0, $-2
	0x0004 00004 (hello.go:3)	JLS	73
	0x0006 00006 (hello.go:3)	PCDATA	$0, $-1
	0x0006 00006 (hello.go:3)	SUBQ	$64, SP
	0x000a 00010 (hello.go:3)	MOVQ	BP, 56(SP)
	0x000f 00015 (hello.go:3)	LEAQ	56(SP), BP
	0x0014 00020 (hello.go:3)	MOVQ	AX, "".xs+72(FP)
	0x0019 00025 (hello.go:3)	FUNCDATA	$0, gclocals·1a65e721a2ccc325b382662e7ffee780(SB)
	0x0019 00025 (hello.go:3)	FUNCDATA	$1, gclocals·69c1753bd5f81501d95132d08af04464(SB)
	0x0019 00025 (hello.go:3)	FUNCDATA	$5, "".f.arginfo1(SB)
	0x0019 00025 (hello.go:3)	FUNCDATA	$6, "".f.argliveinfo(SB)
	0x0019 00025 (hello.go:3)	PCDATA	$3, $1
	0x0019 00025 (hello.go:4)	LEAQ	1(BX), DI
	0x001d 00029 (hello.go:4)	NOP
	0x0020 00032 (hello.go:4)	CMPQ	CX, DI
	0x0023 00035 (hello.go:4)	JCC	56
	0x0025 00037 (hello.go:4)	MOVQ	BX, "".xs+80(SP)
	0x002a 00042 (hello.go:4)	PCDATA	$3, $2
	0x002a 00042 (hello.go:4)	PCDATA	$1, $0
	0x002a 00042 (hello.go:4)	CALL	runtime.growbyteslice(SB)
	0x002f 00047 (hello.go:4)	LEAQ	1(BX), DI
	0x0033 00051 (hello.go:4)	MOVQ	"".xs+80(SP), BX
	0x0038 00056 (hello.go:4)	PCDATA	$1, $-1
	0x0038 00056 (hello.go:4)	PCDATA	$3, $1
	0x0038 00056 (hello.go:4)	MOVB	$1, (AX)(BX*1)
	0x003c 00060 (hello.go:5)	MOVQ	DI, BX
	0x003f 00063 (hello.go:5)	MOVQ	56(SP), BP
	0x0044 00068 (hello.go:5)	ADDQ	$64, SP
	0x0048 00072 (hello.go:5)	RET
	0x0049 00073 (hello.go:5)	NOP
	0x0049 00073 (hello.go:3)	PCDATA	$1, $-1
	0x0049 00073 (hello.go:3)	PCDATA	$0, $-2
	0x0049 00073 (hello.go:3)	MOVQ	AX, 8(SP)
	0x004e 00078 (hello.go:3)	MOVQ	BX, 16(SP)
	0x0053 00083 (hello.go:3)	MOVQ	CX, 24(SP)
	0x0058 00088 (hello.go:3)	CALL	runtime.morestack_noctxt(SB)
	0x005d 00093 (hello.go:3)	MOVQ	8(SP), AX
	0x0062 00098 (hello.go:3)	MOVQ	16(SP), BX
	0x0067 00103 (hello.go:3)	MOVQ	24(SP), CX
	0x006c 00108 (hello.go:3)	PCDATA	$0, $-1
	0x006c 00108 (hello.go:3)	JMP	0

Update: makebyteslice

As pointed in the comments, it could be also worthwhile to see whether a specialized makeslice can be useful.

Top makeslice usages found during ./make.bash (total=4487):

   1194 makeslice byte (size=1) 64bit=false
    500 makeslice Word (size=8) 64bit=false
    422 makeslice *types.Field (size=8) 64bit=false
    246 makeslice string (size=16) 64bit=false
    177 makeslice uint32 (size=4) 64bit=false
    110 makeslice int (size=8) 64bit=false
     97 makeslice int32 (size=4) 64bit=false
     66 makeslice big.Word (size=8) 64bit=false
     60 makeslice ir.Node (size=16) 64bit=false
     58 makeslice obj.Reloc (size=24) 64bit=false
     56 makeslice bool (size=1) 64bit=false
     54 makeslice *types.Type (size=8) 64bit=false
     42 makeslice int64 (size=8) 64bit=false
     37 makeslice Type (size=16) 64bit=false
     34 makeslice module.Version (size=32) 64bit=false
     33 makeslice hcode (size=4) 64bit=false
     31 makeslice *Value (size=8) 64bit=false
     31 makeslice []byte (size=24) 64bit=false
     28 makeslice *Var (size=8) 64bit=false
     26 makeslice float64 (size=8) 64bit=false

All 1-byte makeslice usages found during ./make.bash (note that uint8 should be added to byte here):

   1194 makeslice byte (size=1) 64bit=false
     56 makeslice bool (size=1) 64bit=false
     25 makeslice uint8 (size=1) 64bit=false
      6 makeslice int8 (size=1) 64bit=false
      4 makeslice whiteSpace (size=1) 64bit=false
      3 makeslice level (size=1) 64bit=false
      2 makeslice sym.SymKind (size=1) 64bit=false
      2 makeslice markKind (size=1) 64bit=false
      1 makeslice RuneRole (size=1) 64bit=false

So, ~26% of makeslice calls have a byte type argument, most of the 1-byte sizes are byte/uint8 directly (named byte-sized types are a rarity).

More analysis

A detailed bincmp result for a simple web server is presented below (under a spoiler).

One may notice that for some symbols, the size has actually increased.

image

The reason for that, I assume, is that sometimes relevant arguments happen to be in the matching registers already. That could play in both directions though. It should be treated as a random noise.

Detailed bincmp output (A LOT OF TEXT)
```
.text           -4352  2117514  2113162       -0.21%

symbol name delta old new
bufio.(*Scanner).Scan -16 2319 2303 -0.69%
bytes.makeSlice -19 210 191 -9.05%
crypto/cipher.newCBC -13 370 357 -3.51%
crypto/elliptic.Marshal -21 383 362 -5.48%
crypto/hmac.New -13 709 696 -1.83%
crypto/md5.(*digest).MarshalBinary 2 527 529 0.38%
crypto/rsa.VerifyPKCS1v15 -11 1494 1483 -0.74%
crypto/rsa.VerifyPSS -14 855 841 -1.64%
crypto/sha1.(*digest).MarshalBinary 11 591 602 1.86%
crypto/sha256.(*digest).MarshalBinary 13 858 871 1.52%
crypto/sha512.(*digest).MarshalBinary 13 1050 1063 1.24%
crypto/tls.(*Conn).makeClientHello -26 3647 3621 -0.71%
crypto/tls.(*certificateMsg).marshal 1 805 806 0.12%
crypto/tls.(*certificateRequestMsg).marshal -12 1157 1145 -1.04%
crypto/tls.(*certificateRequestMsg).unmarshal -8 1423 1415 -0.56%
crypto/tls.(*cipherSuiteTLS13).expandLabel -1 1279 1278 -0.08%
crypto/tls.(*cipherSuiteTLS13).extract -10 357 347 -2.80%
crypto/tls.(*clientHandshakeState).doFullHandshake -96 5010 4914 -1.92%
crypto/tls.(*ecdheKeyAgreement).processServerKeyExchange 7 2245 2252 0.31%
crypto/tls.(*nistParameters).SharedKey -5 300 295 -1.67%
crypto/tls.ekmFromMasterSecret.func1 -27 1381 1354 -1.96%
crypto/tls.finishedHash.Sum -5 170 165 -2.94%
crypto/tls.finishedHash.clientSum -6 284 278 -2.11%
crypto/tls.finishedHash.serverSum -6 284 278 -2.11%
crypto/tls.keysFromMasterSecret -2 1129 1127 -0.18%
crypto/tls.masterFromPreMasterSecret -15 567 552 -2.65%
crypto/tls.md5SHA1Hash -5 466 461 -1.07%
crypto/tls.prf10 -5 636 631 -0.79%
crypto/tls.rsaKeyAgreement.generateClientKeyExchange -11 697 686 -1.58%
crypto/x509.parseRFC2821Mailbox -38 1413 1375 -2.69%
crypto/x509/pkix.RDNSequence.String -10 1554 1544 -0.64%
encoding/asn1.BitString.RightAlign -7 286 279 -2.45%
encoding/asn1.MarshalWithParams -7 517 510 -1.35%
encoding/asn1.appendTagAndLength -111 846 735 -13.12%
encoding/asn1.appendTimeCommon -93 2181 2088 -4.26%
encoding/asn1.appendUTCTime -13 716 703 -1.82%
encoding/asn1.makeGeneralizedTime -12 185 173 -6.49%
encoding/asn1.makeUTCTime -12 185 173 -6.49%
encoding/asn1.oidEncoder.Encode -12 680 668 -1.76%
encoding/asn1.parseBigInt -4 617 613 -0.65%
encoding/asn1.setEncoder.Encode -14 715 701 -1.96%
encoding/pem.Decode -3 2673 2670 -0.11%
encoding/pem.removeSpacesAndTabs -11 249 238 -4.42%
fmt.(*buffer).writeRune -20 441 421 -4.54%
fmt.(*fmt).fmtFloat -32 2137 2105 -1.50%
fmt.(*fmt).fmtSbx -64 1413 1349 -4.53%
fmt.(*pp).badVerb -55 1258 1203 -4.37%
fmt.(*pp).catchPanic -10 1332 1322 -0.75%
fmt.(*pp).doPrintf -64 4101 4037 -1.56%
fmt.(*pp).doPrintln -27 421 394 -6.41%
fmt.(*pp).fmtBytes -38 1747 1709 -2.18%
fmt.(*pp).fmtComplex -17 485 468 -3.51%
fmt.(*pp).fmtPointer -30 1145 1115 -2.62%
fmt.(*pp).printValue -204 7719 7515 -2.64%
fmt.(*pp).unknownType -22 693 671 -3.17%
io.copyBuffer -17 790 773 -2.15%
io.glob..func1 -13 138 125 -9.42%
log.(*Logger).Output -23 1029 1006 -2.24%
log.(*Logger).formatHeader -211 5477 5266 -3.85%
math/big.nat.itoa -30 1413 1383 -2.12%
mime.FormatMediaType -96 6119 6023 -1.57%
mime.consumeValue -12 807 795 -1.49%
mime.percentHexUnescape -23 756 733 -3.04%
net.(*IPMask).String -20 339 319 -5.90%
net.(*IPNet).String -32 837 805 -3.82%
net.IP.Mask -6 427 421 -1.41%
net.IP.String -27 2873 2846 -0.94%
net.IPMask.String -12 293 281 -4.10%
net.ParseCIDR -21 1065 1044 -1.97%
net.init -31 3823 3792 -0.81%
net.newRequest -32 618 586 -5.18%
net.open -2 231 229 -0.87%
net.parseIPv6 -32 1261 1229 -2.54%
net.readFull -16 424 408 -3.77%
net/http.(*http2Framer).WriteContinuation -21 554 533 -3.79%
net/http.(*http2Framer).WriteDataPadded -32 1068 1036 -3.00%
net/http.(*http2Framer).WriteGoAway -44 806 762 -5.46%
net/http.(*http2Framer).WriteHeaders -48 1311 1263 -3.66%
net/http.(*http2Framer).WritePing -22 340 318 -6.47%
net/http.(*http2Framer).WritePushPromise -32 1170 1138 -2.74%
net/http.(*http2Framer).WriteRSTStream -13 389 376 -3.34%
net/http.(*http2Framer).WriteSettings -12 599 587 -2.00%
net/http.(*http2Framer).WriteSettingsAck -4 186 182 -2.15%
net/http.(*http2Framer).WriteWindowUpdate -3 424 421 -0.71%
net/http.(*http2Framer).endWrite -8 293 285 -2.73%
net/http.(*http2bufferedWriter).Flush -12 347 335 -3.46%
net/http.(*http2bufferedWriter).Write -11 401 390 -2.74%
net/http.(*http2serverConn).readPreface.func1 -14 362 348 -3.87%
net/http.(*http2serverConn).runHandler.func1 -7 538 531 -1.30%
net/http.appendTime -7 1740 1733 -0.40%
net/http.glob..func1 -3 80 77 -3.75%
net/http.glob..func10 -10 218 208 -4.59%
net/http.glob..func15 -13 138 125 -9.42%
net/http.glob..func2 -3 80 77 -3.75%
net/http.glob..func3 -3 80 77 -3.75%
net/http.glob..func4 -3 80 77 -3.75%
net/http.glob..func5 -3 80 77 -3.75%
net/http.glob..func7 -13 138 125 -9.42%
net/http.glob..func8 -3 142 139 -2.11%
net/http.hexEscapeNonASCII -15 404 389 -3.71%
net/http.http2NewFramer.func2 -16 165 149 -9.70%
net/http.init -32 12490 12458 -0.26%
net/http.newBufioReader -32 788 756 -4.06%
net/http.newBufioWriterSize -10 583 573 -1.72%
net/http.putBufioWriter -3 274 271 -1.09%
net/textproto.(*Reader).readContinuedLineSlice -32 1509 1477 -2.12%
net/url.(*URL).String -39 3258 3219 -1.20%
net/url.EscapeError.Error -3 187 184 -1.60%
net/url.InvalidHostError.Error -8 205 197 -3.90%
net/url.unescape -11 2480 2469 -0.44%
os.ReadFile -42 869 827 -4.83%
os.Readlink -17 423 406 -4.02%
os.glob..func1 -13 138 125 -9.42%
path.Clean -9 2133 2124 -0.42%
path/filepath.Clean -63 2837 2774 -2.22%
reflect.addTypeBits -25 1127 1102 -2.22%
reflect.newAbiDesc -42 2298 2256 -1.83%
runtime.boundsError.Error -32 1543 1511 -2.07%
runtime.findfunctab -21 10340 10319 -0.20%
runtime.growbyteslice 56 56
runtime.makebyteslice 47 47
strconv.(*NumError).Error -3 442 439 -0.68%
strconv.appendEscapedRune -109 2075 1966 -5.25%
strconv.appendQuotedRuneWith -31 310 279 -10.00%
strconv.appendQuotedWith -72 1125 1053 -6.40%
strconv.fmtB -44 421 377 -10.45%
strconv.fmtE -138 1367 1229 -10.10%
strconv.fmtF -103 853 750 -12.08%
strconv.fmtX -173 1962 1789 -8.82%
strconv.formatDigits -16 599 583 -2.67%
strconv.unquote -37 2578 2541 -1.44%
strings.(*Builder).WriteRune -16 549 533 -2.91%
strings.(*byteStringReplacer).Replace 19 730 749 2.60%
strings.(*genericReplacer).Replace -13 217 204 -5.99%
strings.Join -19 1403 1384 -1.35%
strings.Map -23 1532 1509 -1.50%
strings.ToLower -30 628 598 -4.78%
syscall.NetlinkRIB -17 2181 2164 -0.78%
syscall.newNetlinkRouteRequest -9 302 293 -2.98%
time.Time.AppendFormat -96 9173 9077 -1.05%
time.Time.Format -7 312 305 -2.24%
time.Time.GoString -54 2397 2343 -2.25%
time.Time.MarshalBinary -14 586 572 -2.39%
time.Time.String -16 477 461 -3.35%
time.appendInt -32 645 613 -4.96%
time.formatNano -54 549 495 -9.84%
time.loadTzinfoFromZip -11 3273 3262 -0.34%
time.quote -36 1292 1256 -2.79%
vendor/golang.org/x/crypto/cry...byte.(*Builder).addLengthPrefixed 1 798 799 0.13%
vendor/golang.org/x/crypto/cryptobyte.(*Builder).flushChild 31 1520 1551 2.04%
vendor/golang.org/x/crypto/cry...yptobyte.(*String).readASN1BigInt 8 492 500 1.63%
vendor/golang.org/x/crypto/hkdf.Extract -11 280 269 -3.93%
vendor/golang.org/x/net/dns/dnsmessage.(*Name).pack -32 1395 1363 -2.29%
vendor/golang.org/x/net/dns/dn...smessage.(*Name).unpackCompressed -16 1085 1069 -1.47%
vendor/golang.org/x/net/dns/dnsmessage.(*Question).pack -27 436 409 -6.19%
vendor/golang.org/x/net/http2/hpack.(*Encoder).WriteField -59 2072 2013 -2.85%
vendor/golang.org/x/net/http2/hpack.AppendHuffmanString -21 653 632 -3.22%
vendor/golang.org/x/net/http2/hpack.appendHpackString -155 1099 944 -14.10%
vendor/golang.org/x/net/http2/hpack.appendIndexedName -45 677 632 -6.65%
vendor/golang.org/x/net/http2/hpack.appendNewName -30 313 283 -9.58%
vendor/golang.org/x/net/idna.encode -57 2046 1989 -2.79%
vendor/golang.org/x/text/unicode/norm.Form.Bytes -32 696 664 -4.60%
vendor/golang.org/x/text/unicode/norm.Form.String -32 665 633 -4.81%
vendor/golang.org/x/text/unicode/norm.appendQuick 14 710 724 1.97%
total -4209 204162 199953 -2.06%


    </pre>
</details>
@randall77
Copy link
Contributor

@randall77 randall77 commented Nov 9, 2021

Seems reasonable. I wonder if there are other byte-specific cases for other builtins (maybe make or copy?).

growslice has a switch on et.size that is known if the element type is known to be a byte. I wonder if we can take advantage of that somehow. Without duplicating too much code, hopefully.

Loading

@ernado
Copy link
Contributor

@ernado ernado commented Nov 9, 2021

Encoders (at least those who encode to buffer) will greatly benefit from this optimization, because they are in general something like that:

func (e *Encoder) EncodeT(v T) {
    e.buf = append(e.buf, start)
    // ...
    e.buf = append(e.buf, value...)
    e.buf = append(e.buf, end)
}

Upd: example

Loading

@martisch
Copy link
Contributor

@martisch martisch commented Nov 9, 2021

Not sure how much performance/size win this still is, we can generally partition growslice into 3 use cases with reduced code path sizes and case switches internally:

  • growslice with no pointers in element: where the first arg is just the element size and then avoid the type load and just need a constant mov to set the argument (with usually small immediate).
  • growslice with pointers: arguments as before but can assume its only handling elements with pointers
  • growslice for 0 sized elements

Loading

@gopherbot
Copy link

@gopherbot gopherbot commented Nov 10, 2021

Change https://golang.org/cl/363095 mentions this issue: cmd/compile,runtime: add makebyteslice and growbyteslice

Loading

@CAFxX
Copy link
Contributor

@CAFxX CAFxX commented Nov 10, 2021

growslice has a switch on et.size that is known if the element type is known to be a byte. I wonder if we can take advantage of that somehow. Without duplicating too much code, hopefully.

Only tangentially related to the specific issue but, since make was mentioned,mallocgc would also be a prime candidate for specialization based on the type of object allocated. Since object size is often known at compile-time, it could be specialized for allocation size (potentially precomputing the size class for small allocations) and on whether the object is needzero or noscan.

Loading

@martisch
Copy link
Contributor

@martisch martisch commented Nov 11, 2021

I dont think mallogc is a good candidate since AFAIK except for make+copy the compiler does not generate calls for it and thereby the amount of space to be saved is limited mostly by direct runtime calls. The compiler generated code has newobject calls which already only takes a pointer so at most we can make it take zero arguments e.g. newobjectpointer and newobject1, newobject2 .... Im not sure when we are going to hit a net loss where we trade a bit of binary size for reduced icache misses.

func newobject(typ *_type) unsafe.Pointer {

Loading

@quasilyte
Copy link
Contributor Author

@quasilyte quasilyte commented Nov 11, 2021

Initially, I only thought about growbyteslice as that call is:

  1. Quite frequently inserted into the code (append calls for byte slices)
  2. It's only called when we need to grow the slice, which is expected to be a cold path (that's my understanding)
  3. It looked like the change was trivial, a simple wrapper that will forward a call

This part of the change should be relatively safe and there should be no performance regressions.

I did one extra step that Keith suggested and added a makebyteslice as it turned out to be somewhat interesting too, but it will lead to an extra function call in the make([]byte, ) path (makebyteslice -> makeslice instead of makeslice directly).

I tried to avoid other ideas that can be related or useful, but they require a separate investigation each, and the results can be less obvious (more tradeoff-like). I wonder if these ideas should be dug upon and reported as a separate issue (anyone is welcome to do that, I don't think that I'll go too deep in this one).

In my tests, growbyteslice (append) contributes more than 70% of the size decrease, so I can drop the makebyteslice part if it looks not good enough.

Loading

@cagedmantis cagedmantis added this to the Backlog milestone Nov 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
7 participants