New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: needlessly early return value loads in runtime.scanobject #19195

Open
josharian opened this Issue Feb 20, 2017 · 3 comments

Comments

Projects
None yet
5 participants
@josharian
Contributor

josharian commented Feb 20, 2017

runtime.scanobject contains this bit of code:

// Mark the object.
if obj, hbits, span, objIndex := heapBitsForObject(obj, b, i); obj != 0 {
	greyobject(obj, b, i, hbits, span, gcw, objIndex)
}

This compiles to (excerpt):

	0x024b 00587 (/Users/josh/go/tip/src/runtime/mgcmark.go:1180)	MOVQ	R11, (SP)
	0x024f 00591 (/Users/josh/go/tip/src/runtime/mgcmark.go:1180)	MOVQ	BX, 8(SP)
	0x0254 00596 (/Users/josh/go/tip/src/runtime/mgcmark.go:1180)	MOVQ	R8, 16(SP)
	0x0259 00601 (/Users/josh/go/tip/src/runtime/mgcmark.go:1180)	PCDATA	$0, $3
	0x0259 00601 (/Users/josh/go/tip/src/runtime/mgcmark.go:1180)	CALL	"".heapBitsForObject(SB)
	0x025e 00606 (/Users/josh/go/tip/src/runtime/mgcmark.go:1180)	MOVL	40(SP), AX
	0x0262 00610 (/Users/josh/go/tip/src/runtime/mgcmark.go:1180)	MOVQ	24(SP), CX
	0x0267 00615 (/Users/josh/go/tip/src/runtime/mgcmark.go:1180)	MOVQ	48(SP), DX
	0x026c 00620 (/Users/josh/go/tip/src/runtime/mgcmark.go:1180)	MOVQ	56(SP), BX
	0x0271 00625 (/Users/josh/go/tip/src/runtime/mgcmark.go:1180)	MOVQ	32(SP), SI
	0x0276 00630 (/Users/josh/go/tip/src/runtime/mgcmark.go:1180)	TESTQ	CX, CX
	0x0279 00633 (/Users/josh/go/tip/src/runtime/mgcmark.go:1180)	JNE	$0, 683
	0x027b 00635 (/Users/josh/go/tip/src/runtime/mgcmark.go:1178)	MOVQ	"".arena_used+96(SP), AX
	0x0280 00640 (/Users/josh/go/tip/src/runtime/mgcmark.go:1160)	MOVL	"".h.shift+68(SP), CX
	0x0284 00644 (/Users/josh/go/tip/src/runtime/mgcmark.go:1178)	MOVQ	""..autotmp_2642+112(SP), DX
	0x0289 00649 (/Users/josh/go/tip/src/runtime/mgcmark.go:1174)	MOVQ	"".b+168(FP), BX
	0x0291 00657 (/Users/josh/go/tip/src/runtime/mgcmark.go:1160)	MOVQ	"".h.bitp+128(SP), SI
	0x0299 00665 (/Users/josh/go/tip/src/runtime/mgcmark.go:1153)	MOVQ	"".n+80(SP), DI
	0x029e 00670 (/Users/josh/go/tip/src/runtime/mgcmark.go:1153)	MOVQ	"".i+88(SP), R8
	0x02a3 00675 (/Users/josh/go/tip/src/runtime/mgcmark.go:1160)	MOVL	CX, R9
	0x02a6 00678 (/Users/josh/go/tip/src/runtime/mgcmark.go:1178)	JMP	548
	0x02ab 00683 (/Users/josh/go/tip/src/runtime/mgcmark.go:1181)	MOVQ	CX, (SP)
	0x02af 00687 (/Users/josh/go/tip/src/runtime/mgcmark.go:1181)	MOVQ	"".b+168(FP), CX
	0x02b7 00695 (/Users/josh/go/tip/src/runtime/mgcmark.go:1181)	MOVQ	CX, 8(SP)
	0x02bc 00700 (/Users/josh/go/tip/src/runtime/mgcmark.go:1181)	MOVQ	"".i+88(SP), DI
	0x02c1 00705 (/Users/josh/go/tip/src/runtime/mgcmark.go:1181)	MOVQ	DI, 16(SP)
	0x02c6 00710 (/Users/josh/go/tip/src/runtime/mgcmark.go:1181)	MOVQ	SI, 24(SP)
	0x02cb 00715 (/Users/josh/go/tip/src/runtime/mgcmark.go:1181)	MOVL	AX, 32(SP)
	0x02cf 00719 (/Users/josh/go/tip/src/runtime/mgcmark.go:1181)	MOVQ	DX, 40(SP)
	0x02d4 00724 (/Users/josh/go/tip/src/runtime/mgcmark.go:1181)	MOVQ	"".gcw+176(FP), AX
	0x02dc 00732 (/Users/josh/go/tip/src/runtime/mgcmark.go:1181)	MOVQ	AX, 48(SP)
	0x02e1 00737 (/Users/josh/go/tip/src/runtime/mgcmark.go:1181)	MOVQ	BX, 56(SP)
	0x02e6 00742 (/Users/josh/go/tip/src/runtime/mgcmark.go:1181)	PCDATA	$0, $3
	0x02e6 00742 (/Users/josh/go/tip/src/runtime/mgcmark.go:1181)	CALL	"".greyobject(SB)
	0x02eb 00747 (/Users/josh/go/tip/src/runtime/mgcmark.go:1180)	JMP	635

The loads at instructions 0x025e, 0x0267, 0x026c, and 0x0271 are too early. They're loading return values from heapBitsForObject, but if the call to greyobject isn't necessary (if the jump at 0x0279 isn't taken), their values are unnecessary and will be overwritten. This is easy enough to see in the original code as well.

The loads are scheduled where they are due to memory ordering. But maybe there's something we can do to improve the situation, perhaps using the knowledge that stack slots and function return values are always disjoint in memory?

cc @randall77 @dr2chase @cherrymui

@josharian josharian added this to the Go1.9 milestone Feb 20, 2017

@gopherbot

This comment has been minimized.

gopherbot commented Feb 21, 2017

CL https://golang.org/cl/37300 mentions this issue.

josharian added a commit to josharian/go that referenced this issue Mar 1, 2017

cmd/compile: tighten some values with memory args
The tighten pass moves values closer
to the blocks in which they are used.
Prior to this CL, values that took a memory arg
were never moved, because that could cause
multiple memory values to be live across blocks.

However, there are in fact cases in which such a
value can safely be moved across blocks.
The most common such case is loading the return
values from a function call when only one
control flow path actually uses those return values.
In that case, we were unnecessarily eagerly loading
all return values from the stack.

This CL delays the decision about whether a value
with a memory arg can be moved until we know
where we'd like to move it to.
Once we know the target block, we can check whether
moving the value will cause problems.
This CL uses a simple check: If an existing value
in the target block already uses the same memory arg,
and the value cannot fault,
then it must be ok to add another use of it.

This is enough to improve the situation in scanobject.

Before:

	0x025d 00605 (mgcmark.go:1180)	CALL	"".heapBitsForObject(SB)
	0x0262 00610 (mgcmark.go:1180)	MOVL	40(SP), AX
	0x0266 00614 (mgcmark.go:1180)	MOVQ	24(SP), CX
	0x026b 00619 (mgcmark.go:1180)	MOVQ	48(SP), DX
	0x0270 00624 (mgcmark.go:1180)	MOVQ	56(SP), BX
	0x0275 00629 (mgcmark.go:1180)	MOVQ	32(SP), SI
	0x027a 00634 (mgcmark.go:1180)	TESTQ	CX, CX
	0x027d 00637 (mgcmark.go:1180)	JNE	$0, 687
	0x027f 00639 (mgcmark.go:1178)	MOVQ	"".arena_used+96(SP), AX
	0x0284 00644 (mgcmark.go:1160)	MOVL	"".h.shift+68(SP), CX
	0x0288 00648 (mgcmark.go:1178)	MOVQ	""..autotmp_2642+112(SP), DX
	0x028d 00653 (mgcmark.go:1174)	MOVQ	"".b+168(FP), BX
	0x0295 00661 (mgcmark.go:1160)	MOVQ	"".h.bitp+128(SP), SI
	0x029d 00669 (mgcmark.go:1153)	MOVQ	"".n+80(SP), DI
	0x02a2 00674 (mgcmark.go:1153)	MOVQ	"".i+88(SP), R8
	0x02a7 00679 (mgcmark.go:1160)	MOVL	CX, R9
	0x02aa 00682 (mgcmark.go:1178)	JMP	552
	0x02af 00687 (mgcmark.go:1181)	MOVQ	CX, (SP)
	0x02b3 00691 (mgcmark.go:1181)	MOVQ	"".b+168(FP), CX
	0x02bb 00699 (mgcmark.go:1181)	MOVQ	CX, 8(SP)
	0x02c0 00704 (mgcmark.go:1181)	MOVQ	"".i+88(SP), DI
	0x02c5 00709 (mgcmark.go:1181)	MOVQ	DI, 16(SP)
	0x02ca 00714 (mgcmark.go:1181)	MOVQ	SI, 24(SP)
	0x02cf 00719 (mgcmark.go:1181)	MOVL	AX, 32(SP)
	0x02d3 00723 (mgcmark.go:1181)	MOVQ	DX, 40(SP)
	0x02d8 00728 (mgcmark.go:1181)	MOVQ	"".gcw+176(FP), AX
	0x02e0 00736 (mgcmark.go:1181)	MOVQ	AX, 48(SP)
	0x02e5 00741 (mgcmark.go:1181)	MOVQ	BX, 56(SP)
	0x02ea 00746 (mgcmark.go:1181)	PCDATA	$0, $3
	0x02ea 00746 (mgcmark.go:1181)	CALL	"".greyobject(SB)

After:

	0x025d 00605 (mgcmark.go:1180)	CALL	"".heapBitsForObject(SB)
	0x0262 00610 (mgcmark.go:1180)	MOVQ	24(SP), AX
	0x0267 00615 (mgcmark.go:1180)	TESTQ	AX, AX
	0x026a 00618 (mgcmark.go:1180)	JNE	$0, 665
	0x026c 00620 (mgcmark.go:1178)	MOVQ	"".arena_used+96(SP), AX
	0x0271 00625 (mgcmark.go:1160)	MOVL	"".h.shift+68(SP), CX
	0x0275 00629 (mgcmark.go:1178)	MOVQ	""..autotmp_2642+112(SP), DX
	0x027a 00634 (mgcmark.go:1174)	MOVQ	"".b+168(FP), BX
	0x0282 00642 (mgcmark.go:1160)	MOVQ	"".h.bitp+128(SP), SI
	0x028a 00650 (mgcmark.go:1153)	MOVQ	"".n+80(SP), DI
	0x028f 00655 (mgcmark.go:1153)	MOVQ	"".i+88(SP), R8
	0x0294 00660 (mgcmark.go:1160)	MOVL	CX, R9
	0x0297 00663 (mgcmark.go:1178)	JMP	552
	0x0299 00665 (mgcmark.go:1180)	MOVL	40(SP), CX
	0x029d 00669 (mgcmark.go:1180)	MOVQ	48(SP), DX
	0x02a2 00674 (mgcmark.go:1180)	MOVQ	56(SP), BX
	0x02a7 00679 (mgcmark.go:1180)	MOVQ	32(SP), SI
	0x02ac 00684 (mgcmark.go:1181)	MOVQ	AX, (SP)
	0x02b0 00688 (mgcmark.go:1181)	MOVQ	"".b+168(FP), AX
	0x02b8 00696 (mgcmark.go:1181)	MOVQ	AX, 8(SP)
	0x02bd 00701 (mgcmark.go:1181)	MOVQ	"".i+88(SP), DI
	0x02c2 00706 (mgcmark.go:1181)	MOVQ	DI, 16(SP)
	0x02c7 00711 (mgcmark.go:1181)	MOVQ	SI, 24(SP)
	0x02cc 00716 (mgcmark.go:1181)	MOVL	CX, 32(SP)
	0x02d0 00720 (mgcmark.go:1181)	MOVQ	DX, 40(SP)
	0x02d5 00725 (mgcmark.go:1181)	MOVQ	"".gcw+176(FP), CX
	0x02dd 00733 (mgcmark.go:1181)	MOVQ	CX, 48(SP)
	0x02e2 00738 (mgcmark.go:1181)	MOVQ	BX, 56(SP)
	0x02e7 00743 (mgcmark.go:1181)	PCDATA	$0, $3
	0x02e7 00743 (mgcmark.go:1181)	CALL	"".greyobject(SB)

Observe that after this CL, the return values from
heapBitsForObject are only loaded in a single control flow path.

Updates golang#19195

The optimization appears to pay for itself, compile-time-wise:

name       old time/op      new time/op      delta
Template        199ms ± 4%       198ms ± 3%  -0.71%  (p=0.036 n=49+45)
Unicode        90.0ms ± 5%      89.3ms ± 5%  -0.76%  (p=0.022 n=50+49)
GoTypes         558ms ± 4%       560ms ± 4%    ~     (p=0.204 n=49+48)
Compiler        2.50s ± 3%       2.51s ± 3%    ~     (p=0.087 n=50+49)
SSA             3.97s ± 3%       3.98s ± 3%    ~     (p=0.843 n=49+50)
Flate           126ms ± 3%       126ms ± 4%    ~     (p=0.134 n=48+48)
GoParser        148ms ± 3%       149ms ± 5%    ~     (p=0.128 n=46+50)
Reflect         365ms ± 4%       366ms ± 4%    ~     (p=0.237 n=50+49)
Tar             110ms ± 4%       109ms ± 5%  -1.22%  (p=0.006 n=50+47)
XML             206ms ± 5%       207ms ± 3%    ~     (p=0.145 n=49+50)

compress/lzw benchmarks:

name           old time/op    new time/op    delta
Decoder/1e4-8    85.0µs ± 3%    85.6µs ± 2%  +0.74%  (p=0.000 n=36+36)
Decoder/1e5-8     811µs ± 2%     810µs ± 2%    ~     (p=0.134 n=37+37)
Decoder/1e6-8    8.00ms ± 2%    7.99ms ± 1%    ~     (p=0.648 n=39+39)
Encoder/1e4-8     174µs ± 1%     170µs ± 1%  -2.41%  (p=0.000 n=38+37)
Encoder/1e5-8    1.57ms ± 2%    1.54ms ± 2%  -2.21%  (p=0.000 n=37+37)
Encoder/1e6-8    15.7ms ± 1%    15.4ms ± 2%  -2.10%  (p=0.000 n=38+39)

Change-Id: I486e26f096e55e6bb1cc794026664b1bc46a9391
@philhofer

This comment has been minimized.

Contributor

philhofer commented Mar 16, 2017

I've been doing some work on this front. Branch is here: https://github.com/philhofer/go/tree/store-forward

On that branch, that particular bit of code compiles to:

b54	29841 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1183)	JCS	$1, 29828
v235	29842 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1158)	MOVQ	R8, "".i-64(SP)
v224	29843 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1165)	MOVL	CX, ""..autotmp_2885-84(SP)
v365	29844 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1165)	MOVQ	SI, "".h.bitp-24(SP)
v270	29845 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1158)	MOVQ	DI, "".n-72(SP)
v286	29846 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1185)	MOVQ	R11, (SP)
v288	29847 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1185)	MOVQ	BX, 8(SP)
v291	29848 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1185)	MOVQ	R8, 16(SP)
v292	29849 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1185)	CALL	"".heapBitsForObject(SB)
v294	29850 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1185)	MOVQ	24(SP), AX
v146	29851 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1185)	TESTQ	AX, AX
b52	29852 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1185)	JNE	$0, 29862
v335	29853 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1183)	MOVQ	"".arena_start-48(SP), AX
v326	29854 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1165)	MOVL	""..autotmp_2885-84(SP), CX
v395	29855 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1183)	MOVQ	"".arena_used-56(SP), DX
v390	29856 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1179)	MOVQ	"".b(SP), BX
v328	29857 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1165)	MOVQ	"".h.bitp-24(SP), SI
v55	29858 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1158)	MOVQ	"".n-72(SP), DI
v208	29859 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1158)	MOVQ	"".i-64(SP), R8
v222	29860 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1165)	MOVL	CX, R9
b58	29861 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1183)	JMP	29828
v304	29862 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1186)	MOVQ	AX, (SP)
v408	29863 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1186)	MOVQ	"".b(SP), AX
v306	29864 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1186)	MOVQ	AX, 8(SP)
v424	29865 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1186)	MOVQ	"".i-64(SP), CX
v308	29866 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1186)	MOVQ	CX, 16(SP)
v321	29867 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1185)	MOVQ	32(SP), DX
v205	29868 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1186)	MOVQ	DX, 24(SP)
v392	29869 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1185)	MOVL	40(SP), DX
v311	29870 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1186)	MOVL	DX, 32(SP)
v298	29871 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1185)	MOVQ	48(SP), DX
v314	29872 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1186)	MOVQ	DX, 40(SP)
v237	29873 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1186)	MOVQ	"".gcw+8(SP), DX
v317	29874 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1186)	MOVQ	DX, 48(SP)
v300	29875 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1185)	MOVQ	56(SP), BX
v319	29876 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1186)	MOVQ	BX, 56(SP)
v320	29877 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1186)	CALL	"".greyobject(SB)
b57	29878 (/Users/philiphofer/go-tip/src/runtime/mgcmark.go:1185)	JMP	29853

Alias analysis plus tighten plus a load-sinking pass is clever enough to push the loads down until they absolutely need to be loaded. In this particular example the shuffling within the destination basic block doesn't help much, but it appears to help most go programs. Geomean improvement in go1bench is about 6% right now.

philhofer added a commit to philhofer/go that referenced this issue Mar 19, 2017

cmd/compile/internal/ssa: tighten non-faulting loads
This change introduces alias analysis into the ssa
backend, and uses it to tighten some loads with their uses.

Since stack loads are non-faulting, tighten is now often able
to postpone loading function return values until their uses.
Given code like

    res, ok := fn()
    if !ok {
        return
    }
    // use res

The 'res' return value won't be loaded until after examining 'ok.'

Fixed golang#19195

Change-Id: Ibe35216e1bc3b0ff937cee9b3bac64f40f087b61

philhofer added a commit to philhofer/go that referenced this issue Mar 20, 2017

cmd/compile/internal/ssa: tighten non-faulting loads
This change introduces alias analysis into the ssa
backend, and uses it to tighten some loads with their uses.

Since stack loads are non-faulting, tighten is now often able
to postpone loading function return values until their uses.
Given code like

    res, ok := fn()
    if !ok {
        return
    }
    // use res

The 'res' return value won't be loaded until after examining 'ok.'

Fixed golang#19195

Benchmarks on linux/arm:
name                     old time/op    new time/op     delta
BinaryTree17-4              31.1s ± 0%      30.9s ± 0%   -0.39%  (p=0.001 n=9+8)
Fannkuch11-4                13.9s ± 0%      14.1s ± 0%   +0.90%  (p=0.000 n=8+9)
FmtFprintfEmpty-4           666ns ± 5%      674ns ± 4%     ~     (p=0.424 n=10+10)
FmtFprintfString-4         1.09µs ± 4%     1.12µs ± 4%   +2.62%  (p=0.027 n=10+10)
FmtFprintfInt-4            1.22µs ± 3%     1.16µs ± 3%   -4.37%  (p=0.000 n=10+10)
FmtFprintfIntInt-4         1.74µs ± 2%     1.64µs ± 3%   -6.00%  (p=0.000 n=10+10)
FmtFprintfPrefixedInt-4    1.75µs ± 1%     1.69µs ± 3%   -3.80%  (p=0.000 n=9+10)
FmtFprintfFloat-4          3.27µs ± 1%     3.41µs ± 1%   +4.33%  (p=0.000 n=10+9)
FmtManyArgs-4              6.49µs ± 1%     6.23µs ± 2%   -4.10%  (p=0.000 n=10+10)
GobDecode-4                87.1ms ± 1%     75.5ms ± 1%  -13.37%  (p=0.000 n=10+10)
GobEncode-4                69.4ms ± 1%     69.7ms ± 1%     ~     (p=0.190 n=10+10)
Gzip-4                      3.56s ± 1%      3.57s ± 1%     ~     (p=0.053 n=10+9)
Gunzip-4                    446ms ± 2%      442ms ± 2%     ~     (p=0.123 n=10+10)
HTTPClientServer-4         1.51ms ± 1%     1.55ms ± 3%   +2.26%  (p=0.001 n=8+9)
JSONEncode-4                191ms ± 1%      175ms ± 2%   -8.33%  (p=0.000 n=10+10)
JSONDecode-4                798ms ± 1%      835ms ± 1%   +4.65%  (p=0.000 n=10+10)
Mandelbrot200-4            33.6ms ± 0%     33.6ms ± 0%     ~     (p=0.068 n=8+10)
GoParse-4                  42.4ms ± 1%     42.5ms ± 1%     ~     (p=0.190 n=10+10)
RegexpMatchEasy0_32-4       829ns ± 1%      853ns ± 1%   +2.98%  (p=0.000 n=9+8)
RegexpMatchEasy0_1K-4      4.04µs ± 1%     4.03µs ± 1%     ~     (p=0.986 n=10+10)
RegexpMatchEasy1_32-4       889ns ± 2%      900ns ± 5%     ~     (p=0.566 n=10+10)
RegexpMatchEasy1_1K-4      6.01µs ± 2%     6.15µs ± 2%   +2.29%  (p=0.000 n=9+9)
RegexpMatchMedium_32-4     1.35µs ± 3%     1.39µs ± 4%   +2.26%  (p=0.018 n=9+10)
RegexpMatchMedium_1K-4      357µs ± 9%      352µs ± 2%     ~     (p=0.968 n=10+9)
RegexpMatchHard_32-4       22.2µs ± 6%     22.6µs ± 6%     ~     (p=0.161 n=9+9)
RegexpMatchHard_1K-4        652µs ± 4%      664µs ± 4%   +1.91%  (p=0.028 n=9+10)
Revcomp-4                  51.4ms ± 1%     51.3ms ± 2%     ~     (p=0.353 n=10+10)
Template-4                  1.17s ± 2%      1.06s ± 2%   -9.39%  (p=0.000 n=10+10)
TimeParse-4                4.44µs ± 1%     4.46µs ± 1%   +0.50%  (p=0.003 n=9+10)
TimeFormat-4               9.30µs ± 1%     9.33µs ± 1%     ~     (p=0.197 n=10+10)
[Geo mean]                  557µs           553µs        -0.82%

Change-Id: Ibe35216e1bc3b0ff937cee9b3bac64f40f087b61

philhofer added a commit to philhofer/go that referenced this issue Mar 21, 2017

cmd/compile/internal/ssa: tighten non-faulting loads
This change introduces alias analysis into the ssa
backend, and uses it to tighten some loads with their uses.

Since stack loads are non-faulting, tighten is now often able
to postpone loading function return values until their uses.
Given code like

    res, ok := fn()
    if !ok {
        return
    }
    // use res

The 'res' return value won't be loaded until after examining 'ok.'

Fixed golang#19195

Benchmarks on linux/arm:
name                     old time/op    new time/op     delta
BinaryTree17-4              31.1s ± 0%      30.9s ± 0%   -0.39%  (p=0.001 n=9+8)
Fannkuch11-4                13.9s ± 0%      14.1s ± 0%   +0.90%  (p=0.000 n=8+9)
FmtFprintfEmpty-4           666ns ± 5%      674ns ± 4%     ~     (p=0.424 n=10+10)
FmtFprintfString-4         1.09µs ± 4%     1.12µs ± 4%   +2.62%  (p=0.027 n=10+10)
FmtFprintfInt-4            1.22µs ± 3%     1.16µs ± 3%   -4.37%  (p=0.000 n=10+10)
FmtFprintfIntInt-4         1.74µs ± 2%     1.64µs ± 3%   -6.00%  (p=0.000 n=10+10)
FmtFprintfPrefixedInt-4    1.75µs ± 1%     1.69µs ± 3%   -3.80%  (p=0.000 n=9+10)
FmtFprintfFloat-4          3.27µs ± 1%     3.41µs ± 1%   +4.33%  (p=0.000 n=10+9)
FmtManyArgs-4              6.49µs ± 1%     6.23µs ± 2%   -4.10%  (p=0.000 n=10+10)
GobDecode-4                87.1ms ± 1%     75.5ms ± 1%  -13.37%  (p=0.000 n=10+10)
GobEncode-4                69.4ms ± 1%     69.7ms ± 1%     ~     (p=0.190 n=10+10)
Gzip-4                      3.56s ± 1%      3.57s ± 1%     ~     (p=0.053 n=10+9)
Gunzip-4                    446ms ± 2%      442ms ± 2%     ~     (p=0.123 n=10+10)
HTTPClientServer-4         1.51ms ± 1%     1.55ms ± 3%   +2.26%  (p=0.001 n=8+9)
JSONEncode-4                191ms ± 1%      175ms ± 2%   -8.33%  (p=0.000 n=10+10)
JSONDecode-4                798ms ± 1%      835ms ± 1%   +4.65%  (p=0.000 n=10+10)
Mandelbrot200-4            33.6ms ± 0%     33.6ms ± 0%     ~     (p=0.068 n=8+10)
GoParse-4                  42.4ms ± 1%     42.5ms ± 1%     ~     (p=0.190 n=10+10)
RegexpMatchEasy0_32-4       829ns ± 1%      853ns ± 1%   +2.98%  (p=0.000 n=9+8)
RegexpMatchEasy0_1K-4      4.04µs ± 1%     4.03µs ± 1%     ~     (p=0.986 n=10+10)
RegexpMatchEasy1_32-4       889ns ± 2%      900ns ± 5%     ~     (p=0.566 n=10+10)
RegexpMatchEasy1_1K-4      6.01µs ± 2%     6.15µs ± 2%   +2.29%  (p=0.000 n=9+9)
RegexpMatchMedium_32-4     1.35µs ± 3%     1.39µs ± 4%   +2.26%  (p=0.018 n=9+10)
RegexpMatchMedium_1K-4      357µs ± 9%      352µs ± 2%     ~     (p=0.968 n=10+9)
RegexpMatchHard_32-4       22.2µs ± 6%     22.6µs ± 6%     ~     (p=0.161 n=9+9)
RegexpMatchHard_1K-4        652µs ± 4%      664µs ± 4%   +1.91%  (p=0.028 n=9+10)
Revcomp-4                  51.4ms ± 1%     51.3ms ± 2%     ~     (p=0.353 n=10+10)
Template-4                  1.17s ± 2%      1.06s ± 2%   -9.39%  (p=0.000 n=10+10)
TimeParse-4                4.44µs ± 1%     4.46µs ± 1%   +0.50%  (p=0.003 n=9+10)
TimeFormat-4               9.30µs ± 1%     9.33µs ± 1%     ~     (p=0.197 n=10+10)
[Geo mean]                  557µs           553µs        -0.82%

Change-Id: Ibe35216e1bc3b0ff937cee9b3bac64f40f087b61

philhofer added a commit to philhofer/go that referenced this issue Mar 21, 2017

cmd/compile/internal/ssa: tighten non-faulting loads
This change introduces alias analysis into the ssa
backend, and uses it to tighten some loads with their uses.

Since stack loads are non-faulting, tighten is now often able
to postpone loading function return values until their uses.
Given code like

    res, ok := fn()
    if !ok {
        return
    }
    // use res

The 'res' return value won't be loaded until after examining 'ok.'

Fixed golang#19195

Benchmarks on linux/arm:
name                     old time/op    new time/op     delta
BinaryTree17-4              31.1s ± 0%      30.9s ± 0%   -0.39%  (p=0.001 n=9+8)
Fannkuch11-4                13.9s ± 0%      14.1s ± 0%   +0.90%  (p=0.000 n=8+9)
FmtFprintfEmpty-4           666ns ± 5%      674ns ± 4%     ~     (p=0.424 n=10+10)
FmtFprintfString-4         1.09µs ± 4%     1.12µs ± 4%   +2.62%  (p=0.027 n=10+10)
FmtFprintfInt-4            1.22µs ± 3%     1.16µs ± 3%   -4.37%  (p=0.000 n=10+10)
FmtFprintfIntInt-4         1.74µs ± 2%     1.64µs ± 3%   -6.00%  (p=0.000 n=10+10)
FmtFprintfPrefixedInt-4    1.75µs ± 1%     1.69µs ± 3%   -3.80%  (p=0.000 n=9+10)
FmtFprintfFloat-4          3.27µs ± 1%     3.41µs ± 1%   +4.33%  (p=0.000 n=10+9)
FmtManyArgs-4              6.49µs ± 1%     6.23µs ± 2%   -4.10%  (p=0.000 n=10+10)
GobDecode-4                87.1ms ± 1%     75.5ms ± 1%  -13.37%  (p=0.000 n=10+10)
GobEncode-4                69.4ms ± 1%     69.7ms ± 1%     ~     (p=0.190 n=10+10)
Gzip-4                      3.56s ± 1%      3.57s ± 1%     ~     (p=0.053 n=10+9)
Gunzip-4                    446ms ± 2%      442ms ± 2%     ~     (p=0.123 n=10+10)
HTTPClientServer-4         1.51ms ± 1%     1.55ms ± 3%   +2.26%  (p=0.001 n=8+9)
JSONEncode-4                191ms ± 1%      175ms ± 2%   -8.33%  (p=0.000 n=10+10)
JSONDecode-4                798ms ± 1%      835ms ± 1%   +4.65%  (p=0.000 n=10+10)
Mandelbrot200-4            33.6ms ± 0%     33.6ms ± 0%     ~     (p=0.068 n=8+10)
GoParse-4                  42.4ms ± 1%     42.5ms ± 1%     ~     (p=0.190 n=10+10)
RegexpMatchEasy0_32-4       829ns ± 1%      853ns ± 1%   +2.98%  (p=0.000 n=9+8)
RegexpMatchEasy0_1K-4      4.04µs ± 1%     4.03µs ± 1%     ~     (p=0.986 n=10+10)
RegexpMatchEasy1_32-4       889ns ± 2%      900ns ± 5%     ~     (p=0.566 n=10+10)
RegexpMatchEasy1_1K-4      6.01µs ± 2%     6.15µs ± 2%   +2.29%  (p=0.000 n=9+9)
RegexpMatchMedium_32-4     1.35µs ± 3%     1.39µs ± 4%   +2.26%  (p=0.018 n=9+10)
RegexpMatchMedium_1K-4      357µs ± 9%      352µs ± 2%     ~     (p=0.968 n=10+9)
RegexpMatchHard_32-4       22.2µs ± 6%     22.6µs ± 6%     ~     (p=0.161 n=9+9)
RegexpMatchHard_1K-4        652µs ± 4%      664µs ± 4%   +1.91%  (p=0.028 n=9+10)
Revcomp-4                  51.4ms ± 1%     51.3ms ± 2%     ~     (p=0.353 n=10+10)
Template-4                  1.17s ± 2%      1.06s ± 2%   -9.39%  (p=0.000 n=10+10)
TimeParse-4                4.44µs ± 1%     4.46µs ± 1%   +0.50%  (p=0.003 n=9+10)
TimeFormat-4               9.30µs ± 1%     9.33µs ± 1%     ~     (p=0.197 n=10+10)
[Geo mean]                  557µs           553µs        -0.82%

Change-Id: Ibe35216e1bc3b0ff937cee9b3bac64f40f087b61
@gopherbot

This comment has been minimized.

gopherbot commented Mar 22, 2017

CL https://golang.org/cl/38448 mentions this issue.

philhofer added a commit to philhofer/go that referenced this issue Mar 23, 2017

cmd/compile/internal/ssa: tighten loads
This change introduces alias analysis into the ssa
backend, and uses it to tighten some loads with their uses.

Given code like

    res, ok := fn()
    if !ok {
        return
    }
    // use res

The 'res' return value won't be loaded until after examining 'ok.'

Alias analysis also enables tightening of loads from
constant memory (strings) and certain autotmp stack slots.

A follow-up CL will introduce alias analysis to dead-store
elimination. Another will introduce load elimination.

Fixed golang#19195

Benchmarks on linux/arm:
name                     old time/op    new time/op     delta
BinaryTree17-4              31.1s ± 0%      30.9s ± 0%   -0.39%  (p=0.001 n=9+8)
Fannkuch11-4                13.9s ± 0%      14.1s ± 0%   +0.90%  (p=0.000 n=8+9)
FmtFprintfEmpty-4           666ns ± 5%      674ns ± 4%     ~     (p=0.424 n=10+10)
FmtFprintfString-4         1.09µs ± 4%     1.12µs ± 4%   +2.62%  (p=0.027 n=10+10)
FmtFprintfInt-4            1.22µs ± 3%     1.16µs ± 3%   -4.37%  (p=0.000 n=10+10)
FmtFprintfIntInt-4         1.74µs ± 2%     1.64µs ± 3%   -6.00%  (p=0.000 n=10+10)
FmtFprintfPrefixedInt-4    1.75µs ± 1%     1.69µs ± 3%   -3.80%  (p=0.000 n=9+10)
FmtFprintfFloat-4          3.27µs ± 1%     3.41µs ± 1%   +4.33%  (p=0.000 n=10+9)
FmtManyArgs-4              6.49µs ± 1%     6.23µs ± 2%   -4.10%  (p=0.000 n=10+10)
GobDecode-4                87.1ms ± 1%     75.5ms ± 1%  -13.37%  (p=0.000 n=10+10)
GobEncode-4                69.4ms ± 1%     69.7ms ± 1%     ~     (p=0.190 n=10+10)
Gzip-4                      3.56s ± 1%      3.57s ± 1%     ~     (p=0.053 n=10+9)
Gunzip-4                    446ms ± 2%      442ms ± 2%     ~     (p=0.123 n=10+10)
HTTPClientServer-4         1.51ms ± 1%     1.55ms ± 3%   +2.26%  (p=0.001 n=8+9)
JSONEncode-4                191ms ± 1%      175ms ± 2%   -8.33%  (p=0.000 n=10+10)
JSONDecode-4                798ms ± 1%      835ms ± 1%   +4.65%  (p=0.000 n=10+10)
Mandelbrot200-4            33.6ms ± 0%     33.6ms ± 0%     ~     (p=0.068 n=8+10)
GoParse-4                  42.4ms ± 1%     42.5ms ± 1%     ~     (p=0.190 n=10+10)
RegexpMatchEasy0_32-4       829ns ± 1%      853ns ± 1%   +2.98%  (p=0.000 n=9+8)
RegexpMatchEasy0_1K-4      4.04µs ± 1%     4.03µs ± 1%     ~     (p=0.986 n=10+10)
RegexpMatchEasy1_32-4       889ns ± 2%      900ns ± 5%     ~     (p=0.566 n=10+10)
RegexpMatchEasy1_1K-4      6.01µs ± 2%     6.15µs ± 2%   +2.29%  (p=0.000 n=9+9)
RegexpMatchMedium_32-4     1.35µs ± 3%     1.39µs ± 4%   +2.26%  (p=0.018 n=9+10)
RegexpMatchMedium_1K-4      357µs ± 9%      352µs ± 2%     ~     (p=0.968 n=10+9)
RegexpMatchHard_32-4       22.2µs ± 6%     22.6µs ± 6%     ~     (p=0.161 n=9+9)
RegexpMatchHard_1K-4        652µs ± 4%      664µs ± 4%   +1.91%  (p=0.028 n=9+10)
Revcomp-4                  51.4ms ± 1%     51.3ms ± 2%     ~     (p=0.353 n=10+10)
Template-4                  1.17s ± 2%      1.06s ± 2%   -9.39%  (p=0.000 n=10+10)
TimeParse-4                4.44µs ± 1%     4.46µs ± 1%   +0.50%  (p=0.003 n=9+10)
TimeFormat-4               9.30µs ± 1%     9.33µs ± 1%     ~     (p=0.197 n=10+10)
[Geo mean]                  557µs           553µs        -0.82%

Change-Id: Ibe35216e1bc3b0ff937cee9b3bac64f40f087b61

philhofer added a commit to philhofer/go that referenced this issue Mar 23, 2017

cmd/compile/internal/ssa: tighten loads
This change introduces alias analysis into the ssa
backend, and uses it to tighten some loads with their uses.

Given code like

    res, ok := fn()
    if !ok {
        return
    }
    // use res

The 'res' return value won't be loaded until after examining 'ok.'

Alias analysis also enables tightening of loads from
constant memory (strings) and certain autotmp stack slots.

A follow-up CL will introduce alias analysis to dead-store
elimination. Another will introduce load elimination.

Fixed golang#19195

Benchmarks on linux/arm:
name                     old time/op    new time/op     delta
BinaryTree17-4              31.1s ± 0%      30.9s ± 0%   -0.39%  (p=0.001 n=9+8)
Fannkuch11-4                13.9s ± 0%      14.1s ± 0%   +0.90%  (p=0.000 n=8+9)
FmtFprintfEmpty-4           666ns ± 5%      674ns ± 4%     ~     (p=0.424 n=10+10)
FmtFprintfString-4         1.09µs ± 4%     1.12µs ± 4%   +2.62%  (p=0.027 n=10+10)
FmtFprintfInt-4            1.22µs ± 3%     1.16µs ± 3%   -4.37%  (p=0.000 n=10+10)
FmtFprintfIntInt-4         1.74µs ± 2%     1.64µs ± 3%   -6.00%  (p=0.000 n=10+10)
FmtFprintfPrefixedInt-4    1.75µs ± 1%     1.69µs ± 3%   -3.80%  (p=0.000 n=9+10)
FmtFprintfFloat-4          3.27µs ± 1%     3.41µs ± 1%   +4.33%  (p=0.000 n=10+9)
FmtManyArgs-4              6.49µs ± 1%     6.23µs ± 2%   -4.10%  (p=0.000 n=10+10)
GobDecode-4                87.1ms ± 1%     75.5ms ± 1%  -13.37%  (p=0.000 n=10+10)
GobEncode-4                69.4ms ± 1%     69.7ms ± 1%     ~     (p=0.190 n=10+10)
Gzip-4                      3.56s ± 1%      3.57s ± 1%     ~     (p=0.053 n=10+9)
Gunzip-4                    446ms ± 2%      442ms ± 2%     ~     (p=0.123 n=10+10)
HTTPClientServer-4         1.51ms ± 1%     1.55ms ± 3%   +2.26%  (p=0.001 n=8+9)
JSONEncode-4                191ms ± 1%      175ms ± 2%   -8.33%  (p=0.000 n=10+10)
JSONDecode-4                798ms ± 1%      835ms ± 1%   +4.65%  (p=0.000 n=10+10)
Mandelbrot200-4            33.6ms ± 0%     33.6ms ± 0%     ~     (p=0.068 n=8+10)
GoParse-4                  42.4ms ± 1%     42.5ms ± 1%     ~     (p=0.190 n=10+10)
RegexpMatchEasy0_32-4       829ns ± 1%      853ns ± 1%   +2.98%  (p=0.000 n=9+8)
RegexpMatchEasy0_1K-4      4.04µs ± 1%     4.03µs ± 1%     ~     (p=0.986 n=10+10)
RegexpMatchEasy1_32-4       889ns ± 2%      900ns ± 5%     ~     (p=0.566 n=10+10)
RegexpMatchEasy1_1K-4      6.01µs ± 2%     6.15µs ± 2%   +2.29%  (p=0.000 n=9+9)
RegexpMatchMedium_32-4     1.35µs ± 3%     1.39µs ± 4%   +2.26%  (p=0.018 n=9+10)
RegexpMatchMedium_1K-4      357µs ± 9%      352µs ± 2%     ~     (p=0.968 n=10+9)
RegexpMatchHard_32-4       22.2µs ± 6%     22.6µs ± 6%     ~     (p=0.161 n=9+9)
RegexpMatchHard_1K-4        652µs ± 4%      664µs ± 4%   +1.91%  (p=0.028 n=9+10)
Revcomp-4                  51.4ms ± 1%     51.3ms ± 2%     ~     (p=0.353 n=10+10)
Template-4                  1.17s ± 2%      1.06s ± 2%   -9.39%  (p=0.000 n=10+10)
TimeParse-4                4.44µs ± 1%     4.46µs ± 1%   +0.50%  (p=0.003 n=9+10)
TimeFormat-4               9.30µs ± 1%     9.33µs ± 1%     ~     (p=0.197 n=10+10)
[Geo mean]                  557µs           553µs        -0.82%

Change-Id: Ibe35216e1bc3b0ff937cee9b3bac64f40f087b61

philhofer added a commit to philhofer/go that referenced this issue Mar 23, 2017

cmd/compile/internal/ssa: tighten loads
This change introduces alias analysis into the ssa
backend, and uses it to tighten some loads with their uses.

Given code like

    res, ok := fn()
    if !ok {
        return
    }
    // use res

The 'res' return value won't be loaded until after examining 'ok.'

Alias analysis also enables tightening of loads from
constant memory (strings) and certain autotmp stack slots.

A follow-up CL will introduce alias analysis to dead-store
elimination. Another will introduce load elimination.

Fixed golang#19195

Benchmarks on linux/arm:
name                     old time/op    new time/op     delta
BinaryTree17-4              31.1s ± 0%      30.9s ± 0%   -0.39%  (p=0.001 n=9+8)
Fannkuch11-4                13.9s ± 0%      14.1s ± 0%   +0.90%  (p=0.000 n=8+9)
FmtFprintfEmpty-4           666ns ± 5%      674ns ± 4%     ~     (p=0.424 n=10+10)
FmtFprintfString-4         1.09µs ± 4%     1.12µs ± 4%   +2.62%  (p=0.027 n=10+10)
FmtFprintfInt-4            1.22µs ± 3%     1.16µs ± 3%   -4.37%  (p=0.000 n=10+10)
FmtFprintfIntInt-4         1.74µs ± 2%     1.64µs ± 3%   -6.00%  (p=0.000 n=10+10)
FmtFprintfPrefixedInt-4    1.75µs ± 1%     1.69µs ± 3%   -3.80%  (p=0.000 n=9+10)
FmtFprintfFloat-4          3.27µs ± 1%     3.41µs ± 1%   +4.33%  (p=0.000 n=10+9)
FmtManyArgs-4              6.49µs ± 1%     6.23µs ± 2%   -4.10%  (p=0.000 n=10+10)
GobDecode-4                87.1ms ± 1%     75.5ms ± 1%  -13.37%  (p=0.000 n=10+10)
GobEncode-4                69.4ms ± 1%     69.7ms ± 1%     ~     (p=0.190 n=10+10)
Gzip-4                      3.56s ± 1%      3.57s ± 1%     ~     (p=0.053 n=10+9)
Gunzip-4                    446ms ± 2%      442ms ± 2%     ~     (p=0.123 n=10+10)
HTTPClientServer-4         1.51ms ± 1%     1.55ms ± 3%   +2.26%  (p=0.001 n=8+9)
JSONEncode-4                191ms ± 1%      175ms ± 2%   -8.33%  (p=0.000 n=10+10)
JSONDecode-4                798ms ± 1%      835ms ± 1%   +4.65%  (p=0.000 n=10+10)
Mandelbrot200-4            33.6ms ± 0%     33.6ms ± 0%     ~     (p=0.068 n=8+10)
GoParse-4                  42.4ms ± 1%     42.5ms ± 1%     ~     (p=0.190 n=10+10)
RegexpMatchEasy0_32-4       829ns ± 1%      853ns ± 1%   +2.98%  (p=0.000 n=9+8)
RegexpMatchEasy0_1K-4      4.04µs ± 1%     4.03µs ± 1%     ~     (p=0.986 n=10+10)
RegexpMatchEasy1_32-4       889ns ± 2%      900ns ± 5%     ~     (p=0.566 n=10+10)
RegexpMatchEasy1_1K-4      6.01µs ± 2%     6.15µs ± 2%   +2.29%  (p=0.000 n=9+9)
RegexpMatchMedium_32-4     1.35µs ± 3%     1.39µs ± 4%   +2.26%  (p=0.018 n=9+10)
RegexpMatchMedium_1K-4      357µs ± 9%      352µs ± 2%     ~     (p=0.968 n=10+9)
RegexpMatchHard_32-4       22.2µs ± 6%     22.6µs ± 6%     ~     (p=0.161 n=9+9)
RegexpMatchHard_1K-4        652µs ± 4%      664µs ± 4%   +1.91%  (p=0.028 n=9+10)
Revcomp-4                  51.4ms ± 1%     51.3ms ± 2%     ~     (p=0.353 n=10+10)
Template-4                  1.17s ± 2%      1.06s ± 2%   -9.39%  (p=0.000 n=10+10)
TimeParse-4                4.44µs ± 1%     4.46µs ± 1%   +0.50%  (p=0.003 n=9+10)
TimeFormat-4               9.30µs ± 1%     9.33µs ± 1%     ~     (p=0.197 n=10+10)
[Geo mean]                  557µs           553µs        -0.82%

Change-Id: Ibe35216e1bc3b0ff937cee9b3bac64f40f087b61

@randall77 randall77 modified the milestones: Go1.10, Go1.9 Jun 6, 2017

philhofer added a commit to philhofer/go that referenced this issue Aug 23, 2017

cmd/compile/internal/ssa: tighten loads
This change introduces alias analysis into the ssa
backend, and uses it to tighten some loads with their uses.

Given code like

    res, ok := fn()
    if !ok {
        return
    }
    // use res

The 'res' return value won't be loaded until after examining 'ok.'

Alias analysis also enables tightening of loads from
constant memory (strings) and certain autotmp stack slots.

A follow-up CL will introduce alias analysis to dead-store
elimination. Another will introduce load elimination.

Fixed golang#19195

Benchmarks on linux/arm:
name                     old time/op    new time/op     delta
BinaryTree17-4              31.1s ± 0%      30.9s ± 0%   -0.39%  (p=0.001 n=9+8)
Fannkuch11-4                13.9s ± 0%      14.1s ± 0%   +0.90%  (p=0.000 n=8+9)
FmtFprintfEmpty-4           666ns ± 5%      674ns ± 4%     ~     (p=0.424 n=10+10)
FmtFprintfString-4         1.09µs ± 4%     1.12µs ± 4%   +2.62%  (p=0.027 n=10+10)
FmtFprintfInt-4            1.22µs ± 3%     1.16µs ± 3%   -4.37%  (p=0.000 n=10+10)
FmtFprintfIntInt-4         1.74µs ± 2%     1.64µs ± 3%   -6.00%  (p=0.000 n=10+10)
FmtFprintfPrefixedInt-4    1.75µs ± 1%     1.69µs ± 3%   -3.80%  (p=0.000 n=9+10)
FmtFprintfFloat-4          3.27µs ± 1%     3.41µs ± 1%   +4.33%  (p=0.000 n=10+9)
FmtManyArgs-4              6.49µs ± 1%     6.23µs ± 2%   -4.10%  (p=0.000 n=10+10)
GobDecode-4                87.1ms ± 1%     75.5ms ± 1%  -13.37%  (p=0.000 n=10+10)
GobEncode-4                69.4ms ± 1%     69.7ms ± 1%     ~     (p=0.190 n=10+10)
Gzip-4                      3.56s ± 1%      3.57s ± 1%     ~     (p=0.053 n=10+9)
Gunzip-4                    446ms ± 2%      442ms ± 2%     ~     (p=0.123 n=10+10)
HTTPClientServer-4         1.51ms ± 1%     1.55ms ± 3%   +2.26%  (p=0.001 n=8+9)
JSONEncode-4                191ms ± 1%      175ms ± 2%   -8.33%  (p=0.000 n=10+10)
JSONDecode-4                798ms ± 1%      835ms ± 1%   +4.65%  (p=0.000 n=10+10)
Mandelbrot200-4            33.6ms ± 0%     33.6ms ± 0%     ~     (p=0.068 n=8+10)
GoParse-4                  42.4ms ± 1%     42.5ms ± 1%     ~     (p=0.190 n=10+10)
RegexpMatchEasy0_32-4       829ns ± 1%      853ns ± 1%   +2.98%  (p=0.000 n=9+8)
RegexpMatchEasy0_1K-4      4.04µs ± 1%     4.03µs ± 1%     ~     (p=0.986 n=10+10)
RegexpMatchEasy1_32-4       889ns ± 2%      900ns ± 5%     ~     (p=0.566 n=10+10)
RegexpMatchEasy1_1K-4      6.01µs ± 2%     6.15µs ± 2%   +2.29%  (p=0.000 n=9+9)
RegexpMatchMedium_32-4     1.35µs ± 3%     1.39µs ± 4%   +2.26%  (p=0.018 n=9+10)
RegexpMatchMedium_1K-4      357µs ± 9%      352µs ± 2%     ~     (p=0.968 n=10+9)
RegexpMatchHard_32-4       22.2µs ± 6%     22.6µs ± 6%     ~     (p=0.161 n=9+9)
RegexpMatchHard_1K-4        652µs ± 4%      664µs ± 4%   +1.91%  (p=0.028 n=9+10)
Revcomp-4                  51.4ms ± 1%     51.3ms ± 2%     ~     (p=0.353 n=10+10)
Template-4                  1.17s ± 2%      1.06s ± 2%   -9.39%  (p=0.000 n=10+10)
TimeParse-4                4.44µs ± 1%     4.46µs ± 1%   +0.50%  (p=0.003 n=9+10)
TimeFormat-4               9.30µs ± 1%     9.33µs ± 1%     ~     (p=0.197 n=10+10)
[Geo mean]                  557µs           553µs        -0.82%

Change-Id: Ibe35216e1bc3b0ff937cee9b3bac64f40f087b61

@bradfitz bradfitz modified the milestones: Go1.10, Go1.11 Nov 28, 2017

josharian added a commit to josharian/go that referenced this issue May 8, 2018

cmd/compile: tighten some values with memory args
The tighten pass moves values closer
to the blocks in which they are used.
Prior to this CL, values that took a memory arg
were never moved, because that could cause
multiple memory values to be live across blocks.

However, there are in fact cases in which such a
value can safely be moved across blocks.
The most common such case is loading the return
values from a function call when only one
control flow path actually uses those return values.
In that case, we were unnecessarily eagerly loading
all return values from the stack.

This CL delays the decision about whether a value
with a memory arg can be moved until we know
where we'd like to move it to.
Once we know the target block, we can check whether
moving the value will cause problems.
This CL uses a simple check: If an existing value
in the target block already uses the same memory arg,
and the value cannot fault,
then it must be ok to add another use of it.

This is enough to improve the situation in scanobject.

Before:

	0x025d 00605 (mgcmark.go:1180)	CALL	"".heapBitsForObject(SB)
	0x0262 00610 (mgcmark.go:1180)	MOVL	40(SP), AX
	0x0266 00614 (mgcmark.go:1180)	MOVQ	24(SP), CX
	0x026b 00619 (mgcmark.go:1180)	MOVQ	48(SP), DX
	0x0270 00624 (mgcmark.go:1180)	MOVQ	56(SP), BX
	0x0275 00629 (mgcmark.go:1180)	MOVQ	32(SP), SI
	0x027a 00634 (mgcmark.go:1180)	TESTQ	CX, CX
	0x027d 00637 (mgcmark.go:1180)	JNE	$0, 687
	0x027f 00639 (mgcmark.go:1178)	MOVQ	"".arena_used+96(SP), AX
	0x0284 00644 (mgcmark.go:1160)	MOVL	"".h.shift+68(SP), CX
	0x0288 00648 (mgcmark.go:1178)	MOVQ	""..autotmp_2642+112(SP), DX
	0x028d 00653 (mgcmark.go:1174)	MOVQ	"".b+168(FP), BX
	0x0295 00661 (mgcmark.go:1160)	MOVQ	"".h.bitp+128(SP), SI
	0x029d 00669 (mgcmark.go:1153)	MOVQ	"".n+80(SP), DI
	0x02a2 00674 (mgcmark.go:1153)	MOVQ	"".i+88(SP), R8
	0x02a7 00679 (mgcmark.go:1160)	MOVL	CX, R9
	0x02aa 00682 (mgcmark.go:1178)	JMP	552
	0x02af 00687 (mgcmark.go:1181)	MOVQ	CX, (SP)
	0x02b3 00691 (mgcmark.go:1181)	MOVQ	"".b+168(FP), CX
	0x02bb 00699 (mgcmark.go:1181)	MOVQ	CX, 8(SP)
	0x02c0 00704 (mgcmark.go:1181)	MOVQ	"".i+88(SP), DI
	0x02c5 00709 (mgcmark.go:1181)	MOVQ	DI, 16(SP)
	0x02ca 00714 (mgcmark.go:1181)	MOVQ	SI, 24(SP)
	0x02cf 00719 (mgcmark.go:1181)	MOVL	AX, 32(SP)
	0x02d3 00723 (mgcmark.go:1181)	MOVQ	DX, 40(SP)
	0x02d8 00728 (mgcmark.go:1181)	MOVQ	"".gcw+176(FP), AX
	0x02e0 00736 (mgcmark.go:1181)	MOVQ	AX, 48(SP)
	0x02e5 00741 (mgcmark.go:1181)	MOVQ	BX, 56(SP)
	0x02ea 00746 (mgcmark.go:1181)	PCDATA	$0, $3
	0x02ea 00746 (mgcmark.go:1181)	CALL	"".greyobject(SB)

After:

	0x025d 00605 (mgcmark.go:1180)	CALL	"".heapBitsForObject(SB)
	0x0262 00610 (mgcmark.go:1180)	MOVQ	24(SP), AX
	0x0267 00615 (mgcmark.go:1180)	TESTQ	AX, AX
	0x026a 00618 (mgcmark.go:1180)	JNE	$0, 665
	0x026c 00620 (mgcmark.go:1178)	MOVQ	"".arena_used+96(SP), AX
	0x0271 00625 (mgcmark.go:1160)	MOVL	"".h.shift+68(SP), CX
	0x0275 00629 (mgcmark.go:1178)	MOVQ	""..autotmp_2642+112(SP), DX
	0x027a 00634 (mgcmark.go:1174)	MOVQ	"".b+168(FP), BX
	0x0282 00642 (mgcmark.go:1160)	MOVQ	"".h.bitp+128(SP), SI
	0x028a 00650 (mgcmark.go:1153)	MOVQ	"".n+80(SP), DI
	0x028f 00655 (mgcmark.go:1153)	MOVQ	"".i+88(SP), R8
	0x0294 00660 (mgcmark.go:1160)	MOVL	CX, R9
	0x0297 00663 (mgcmark.go:1178)	JMP	552
	0x0299 00665 (mgcmark.go:1180)	MOVL	40(SP), CX
	0x029d 00669 (mgcmark.go:1180)	MOVQ	48(SP), DX
	0x02a2 00674 (mgcmark.go:1180)	MOVQ	56(SP), BX
	0x02a7 00679 (mgcmark.go:1180)	MOVQ	32(SP), SI
	0x02ac 00684 (mgcmark.go:1181)	MOVQ	AX, (SP)
	0x02b0 00688 (mgcmark.go:1181)	MOVQ	"".b+168(FP), AX
	0x02b8 00696 (mgcmark.go:1181)	MOVQ	AX, 8(SP)
	0x02bd 00701 (mgcmark.go:1181)	MOVQ	"".i+88(SP), DI
	0x02c2 00706 (mgcmark.go:1181)	MOVQ	DI, 16(SP)
	0x02c7 00711 (mgcmark.go:1181)	MOVQ	SI, 24(SP)
	0x02cc 00716 (mgcmark.go:1181)	MOVL	CX, 32(SP)
	0x02d0 00720 (mgcmark.go:1181)	MOVQ	DX, 40(SP)
	0x02d5 00725 (mgcmark.go:1181)	MOVQ	"".gcw+176(FP), CX
	0x02dd 00733 (mgcmark.go:1181)	MOVQ	CX, 48(SP)
	0x02e2 00738 (mgcmark.go:1181)	MOVQ	BX, 56(SP)
	0x02e7 00743 (mgcmark.go:1181)	PCDATA	$0, $3
	0x02e7 00743 (mgcmark.go:1181)	CALL	"".greyobject(SB)

Observe that after this CL, the return values from
heapBitsForObject are only loaded in a single control flow path.

Updates golang#19195

The optimization appears to pay for itself, compile-time-wise:

name       old time/op      new time/op      delta
Template        199ms ± 4%       198ms ± 3%  -0.71%  (p=0.036 n=49+45)
Unicode        90.0ms ± 5%      89.3ms ± 5%  -0.76%  (p=0.022 n=50+49)
GoTypes         558ms ± 4%       560ms ± 4%    ~     (p=0.204 n=49+48)
Compiler        2.50s ± 3%       2.51s ± 3%    ~     (p=0.087 n=50+49)
SSA             3.97s ± 3%       3.98s ± 3%    ~     (p=0.843 n=49+50)
Flate           126ms ± 3%       126ms ± 4%    ~     (p=0.134 n=48+48)
GoParser        148ms ± 3%       149ms ± 5%    ~     (p=0.128 n=46+50)
Reflect         365ms ± 4%       366ms ± 4%    ~     (p=0.237 n=50+49)
Tar             110ms ± 4%       109ms ± 5%  -1.22%  (p=0.006 n=50+47)
XML             206ms ± 5%       207ms ± 3%    ~     (p=0.145 n=49+50)

compress/lzw benchmarks:

name           old time/op    new time/op    delta
Decoder/1e4-8    85.0µs ± 3%    85.6µs ± 2%  +0.74%  (p=0.000 n=36+36)
Decoder/1e5-8     811µs ± 2%     810µs ± 2%    ~     (p=0.134 n=37+37)
Decoder/1e6-8    8.00ms ± 2%    7.99ms ± 1%    ~     (p=0.648 n=39+39)
Encoder/1e4-8     174µs ± 1%     170µs ± 1%  -2.41%  (p=0.000 n=38+37)
Encoder/1e5-8    1.57ms ± 2%    1.54ms ± 2%  -2.21%  (p=0.000 n=37+37)
Encoder/1e6-8    15.7ms ± 1%    15.4ms ± 2%  -2.10%  (p=0.000 n=38+39)

Change-Id: I486e26f096e55e6bb1cc794026664b1bc46a9391

@bradfitz bradfitz modified the milestones: Go1.11, Unplanned May 18, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment