New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/gc: stack split check does not follow static branch prediction rules of modern CPUs #10587

Closed
davecheney opened this Issue Apr 27, 2015 · 7 comments

Comments

Projects
None yet
4 participants
@davecheney
Contributor

davecheney commented Apr 27, 2015

Abstract:

On Mon, Apr 27, 2015 at 4:20 PM, Dave Cheney <da...@cheney.net> wrote: 
> Possibly, that is how the JVM does it, but the stack check prolog is also 
> used for preemption (sort of like a safe point), and in general usage the 
> stack check is predictable by the cpu, so effectively free. 

In fact current stack split check is against static branch prediction 
rules of modern CPUs (i.e. "forward branch is not taken"). 

Full discussion: https://groups.google.com/d/msg/golang-nuts/DbmqfDlAR0U/elFowQ_z_UsJ

@davecheney davecheney added this to the Go1.5 milestone Apr 27, 2015

@ianlancetaylor

This comment has been minimized.

Show comment
Hide comment
@ianlancetaylor

ianlancetaylor Apr 27, 2015

Contributor

In other words, instead of

    hello.go:3      0x400c00        64488b0c25f8ffffff      FS MOVQ FS:0xfffffff8, CX
    hello.go:3      0x400c09        483b6110                CMPQ 0x10(CX), SP
    hello.go:3      0x400c0d        7707                    JA 0x400c16
    hello.go:3      0x400c0f        e8fcc70400              CALL runtime.morestack_noctxt(SB)
    hello.go:3      0x400c14        ebea                    JMP main.main(SB)
    hello.go:3      0x400c16        4883ec78                SUBQ $0x78, SP

we should generate

MOVQ FS:0xfffffff8, CX
CMPQ 0x10(CX), SP
JBE label
SUBQ $0x78, SP
...
RET
label:
CALL runtime.morestack_noctxt(SB)
JMP main.main(SB)
Contributor

ianlancetaylor commented Apr 27, 2015

In other words, instead of

    hello.go:3      0x400c00        64488b0c25f8ffffff      FS MOVQ FS:0xfffffff8, CX
    hello.go:3      0x400c09        483b6110                CMPQ 0x10(CX), SP
    hello.go:3      0x400c0d        7707                    JA 0x400c16
    hello.go:3      0x400c0f        e8fcc70400              CALL runtime.morestack_noctxt(SB)
    hello.go:3      0x400c14        ebea                    JMP main.main(SB)
    hello.go:3      0x400c16        4883ec78                SUBQ $0x78, SP

we should generate

MOVQ FS:0xfffffff8, CX
CMPQ 0x10(CX), SP
JBE label
SUBQ $0x78, SP
...
RET
label:
CALL runtime.morestack_noctxt(SB)
JMP main.main(SB)
@josharian

This comment has been minimized.

Show comment
Hide comment
@josharian
Contributor

josharian commented May 27, 2015

@gopherbot

This comment has been minimized.

Show comment
Hide comment
@gopherbot

gopherbot commented May 27, 2015

CL https://golang.org/cl/10367 mentions this issue.

@gopherbot

This comment has been minimized.

Show comment
Hide comment
@gopherbot

gopherbot commented May 29, 2015

CL https://golang.org/cl/10496 mentions this issue.

@josharian

This comment has been minimized.

Show comment
Hide comment
@josharian

josharian May 29, 2015

Contributor

This doesn't appear to apply to arm. The arm prologue doesn't have a forward branch. Instead, it uses conditional instructions. My crude attempt at optimizing the arm prologue by focusing on the fast path was a failure. See CL 10497. Other suggestions welcomed.

Contributor

josharian commented May 29, 2015

This doesn't appear to apply to arm. The arm prologue doesn't have a forward branch. Instead, it uses conditional instructions. My crude attempt at optimizing the arm prologue by focusing on the fast path was a failure. See CL 10497. Other suggestions welcomed.

@gopherbot

This comment has been minimized.

Show comment
Hide comment
@gopherbot

gopherbot commented May 29, 2015

CL https://golang.org/cl/10500 mentions this issue.

@gopherbot

This comment has been minimized.

Show comment
Hide comment
@gopherbot

gopherbot commented May 29, 2015

CL https://golang.org/cl/10497 mentions this issue.

josharian added a commit that referenced this issue Jun 4, 2015

cmd/link: make stkcheck more flexible
stkcheck is flow-insensitive: It processes calls in PC order.
Since morestack was always the first call in a function,
it was a safe, conservative approximation to simply adjust stack
space as we went, recognizing morestack when it showed up.

Subsequent CLS will rearrange the function prologue;
morestack may no longer be the first call in a function.

Introducing flow-sensitivity to stkcheck would allow this,
and possibly allow a smaller stackguard.
It is also a high risk change and possibly expensive.

Instead, assume that all calls to morestack occur as
part of the function prologue, no matter where they
are located in the program text.

Updates #10587.

Change-Id: I4dcdd4256a980fc4bc433a68a10989ff57f7034f
Reviewed-on: https://go-review.googlesource.com/10496
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>

josharian added a commit that referenced this issue Jun 4, 2015

cmd/internal/obj/x86: make function prologue more predictable
Static branch prediction guesses that forward branches aren't taken.
Since stacks are rarely grown, make the forward branch mean grow.

Sample disassembly for

func f() {
	_ = [128]byte{}
}

Before:

TEXT main.f(SB) x.go
	x.go:3	0x2000	65488b0c25a0080000	GS MOVQ GS:0x8a0, CX
	x.go:3	0x2009	483b6110		CMPQ 0x10(CX), SP
	x.go:3	0x200d	7707			JA 0x2016
	x.go:3	0x200f	e88c410400		CALL runtime.morestack_noctxt(SB)
	x.go:3	0x2014	ebea			JMP main.f(SB)
	x.go:3	0x2016	4881ec80000000		SUBQ $0x80, SP
	x.go:4	0x201d	488d3c24		LEAQ 0(SP), DI
	x.go:4	0x2021	31c0			XORL AX, AX
	x.go:4	0x2023	e8cc640400		CALL 0x484f4
	x.go:5	0x2028	4881c480000000		ADDQ $0x80, SP
	x.go:5	0x202f	c3			RET

After:

TEXT main.f(SB) x.go
	x.go:3	0x2000	65488b0c25a0080000	GS MOVQ GS:0x8a0, CX
	x.go:3	0x2009	483b6110		CMPQ 0x10(CX), SP
	x.go:3	0x200d	761a			JBE 0x2029
	x.go:3	0x200f	4881ec80000000		SUBQ $0x80, SP
	x.go:4	0x2016	488d3c24		LEAQ 0(SP), DI
	x.go:4	0x201a	31c0			XORL AX, AX
	x.go:4	0x201c	e813740400		CALL 0x49434
	x.go:5	0x2021	4881c480000000		ADDQ $0x80, SP
	x.go:5	0x2028	c3			RET
	x.go:3	0x2029	e8224f0400		CALL runtime.morestack_noctxt(SB)
	x.go:3	0x202e	ebd0			JMP main.f(SB)

Updates #10587.

Sample benchmarks on a 2.8 GHz Intel Core i7:

package sort

name            old mean              new mean              delta
SearchWrappers   134ns × (0.99,1.01)   132ns × (0.99,1.01)  -1.73% (p=0.000 n=15+14)
SortString1K     215µs × (0.99,1.01)   213µs × (0.99,1.01)  -0.61% (p=0.020 n=14+15)
StableString1K   311µs × (0.99,1.02)   309µs × (0.99,1.02)    ~    (p=0.077 n=14+15)
SortInt1K        103µs × (0.99,1.02)   100µs × (0.98,1.01)  -3.34% (p=0.000 n=15+15)
StableInt1K      102µs × (0.99,1.01)    98µs × (0.97,1.04)  -3.53% (p=0.000 n=15+15)
SortInt64K      10.1ms × (0.98,1.02)   9.7ms × (0.99,1.01)  -3.86% (p=0.000 n=14+15)
StableInt64K    8.70ms × (0.99,1.01)  8.44ms × (0.99,1.03)  -2.93% (p=0.000 n=14+15)
Sort1e2         51.2µs × (1.00,1.01)  48.9µs × (0.99,1.02)  -4.48% (p=0.000 n=13+15)
Stable1e2        100µs × (0.99,1.02)    99µs × (0.99,1.01)  -1.15% (p=0.000 n=14+13)
Sort1e4         11.1ms × (0.99,1.02)  10.4ms × (0.99,1.01)  -6.02% (p=0.000 n=15+14)
Stable1e4       30.6ms × (0.99,1.01)  30.3ms × (0.99,1.02)  -1.02% (p=0.001 n=15+14)
Sort1e6          1.75s × (0.99,1.02)   1.66s × (0.98,1.03)  -4.95% (p=0.000 n=14+15)
Stable1e6        6.31s × (0.99,1.01)   6.26s × (0.99,1.01)  -0.79% (p=0.002 n=15+15)

package regexp

name                          old mean              new mean              delta
Literal                        131ns × (0.99,1.01)   130ns × (0.99,1.03)  -1.07% (p=0.004 n=14+15)
NotLiteral                    2.13µs × (0.99,1.01)  2.01µs × (0.99,1.03)  -5.71% (p=0.000 n=14+14)
MatchClass                    3.15µs × (0.99,1.01)  3.04µs × (0.99,1.02)  -3.40% (p=0.000 n=15+15)
MatchClass_InRange            2.92µs × (0.99,1.01)  2.77µs × (0.99,1.02)  -5.05% (p=0.000 n=13+15)
ReplaceAll                    2.17µs × (0.99,1.02)  2.06µs × (0.99,1.01)  -5.19% (p=0.000 n=15+13)
AnchoredLiteralShortNonMatch   116ns × (0.99,1.02)   113ns × (0.99,1.01)  -2.75% (p=0.000 n=15+14)
AnchoredLiteralLongNonMatch    125ns × (0.99,1.01)   127ns × (0.98,1.02)  +1.49% (p=0.000 n=15+15)
AnchoredShortMatch             178ns × (0.99,1.02)   175ns × (0.99,1.01)  -1.62% (p=0.000 n=15+13)
AnchoredLongMatch              328ns × (0.99,1.00)   341ns × (0.99,1.01)  +3.73% (p=0.000 n=12+15)
OnePassShortA                  773ns × (0.99,1.02)   752ns × (0.99,1.01)  -2.78% (p=0.000 n=15+13)
NotOnePassShortA               794ns × (0.99,1.03)   780ns × (0.99,1.02)  -1.75% (p=0.001 n=15+15)
OnePassShortB                  608ns × (0.99,1.01)   591ns × (0.99,1.02)  -2.86% (p=0.000 n=15+14)
NotOnePassShortB               576ns × (0.99,1.01)   571ns × (0.99,1.02)  -0.74% (p=0.035 n=15+15)
OnePassLongPrefix              131ns × (0.99,1.02)   130ns × (0.99,1.02)  -1.32% (p=0.003 n=15+15)
OnePassLongNotPrefix           503ns × (0.99,1.02)   481ns × (0.99,1.01)  -4.34% (p=0.000 n=15+13)
MatchEasy0_32                  102ns × (0.98,1.01)   101ns × (0.99,1.02)    ~    (p=0.907 n=15+14)
MatchEasy0_1K                  617ns × (0.99,1.02)   634ns × (0.98,1.02)  +2.77% (p=0.000 n=15+15)
MatchEasy0_32K                10.9µs × (0.99,1.01)  11.1µs × (0.99,1.01)  +1.59% (p=0.000 n=15+15)
MatchEasy0_1M                  406µs × (0.99,1.02)   410µs × (0.99,1.02)  +1.01% (p=0.000 n=14+15)
MatchEasy0_32M                13.4ms × (0.99,1.01)  13.7ms × (0.99,1.02)  +1.64% (p=0.000 n=12+15)
MatchEasy1_32                 83.7ns × (0.98,1.02)  83.0ns × (0.98,1.02)    ~    (p=0.190 n=15+15)
MatchEasy1_1K                 1.46µs × (0.99,1.02)  1.39µs × (0.99,1.02)  -4.83% (p=0.000 n=15+15)
MatchEasy1_32K                49.4µs × (0.99,1.01)  49.4µs × (0.99,1.01)    ~    (p=0.205 n=15+15)
MatchEasy1_1M                 1.72ms × (0.99,1.02)  1.75ms × (0.99,1.01)  +1.34% (p=0.000 n=15+15)
MatchEasy1_32M                55.5ms × (0.99,1.01)  56.1ms × (0.99,1.02)  +1.10% (p=0.002 n=15+15)
MatchMedium_32                1.37µs × (0.99,1.04)  1.33µs × (0.99,1.01)  -2.87% (p=0.000 n=15+15)
MatchMedium_1K                41.1µs × (0.99,1.02)  40.4µs × (0.99,1.02)  -1.59% (p=0.000 n=15+15)
MatchMedium_32K               1.71ms × (0.99,1.01)  1.75ms × (0.99,1.02)  +2.36% (p=0.000 n=14+15)
MatchMedium_1M                54.5ms × (0.99,1.01)  56.1ms × (0.99,1.01)  +2.94% (p=0.000 n=13+15)
MatchMedium_32M                1.75s × (0.99,1.01)   1.80s × (0.99,1.01)  +2.77% (p=0.000 n=15+15)
MatchHard_32                  2.12µs × (0.99,1.02)  2.06µs × (0.99,1.01)  -2.60% (p=0.000 n=15+14)
MatchHard_1K                  64.4µs × (0.98,1.02)  62.2µs × (0.99,1.01)  -3.33% (p=0.000 n=15+15)
MatchHard_32K                 2.74ms × (0.99,1.01)  2.75ms × (0.99,1.01)    ~    (p=0.310 n=15+14)
MatchHard_1M                  87.1ms × (0.99,1.02)  88.2ms × (0.99,1.01)  +1.36% (p=0.000 n=14+15)
MatchHard_32M                  2.79s × (0.99,1.02)   2.83s × (0.99,1.02)  +1.26% (p=0.004 n=15+14)

go1 benchmarks

name                   old time/op    new time/op    delta
BinaryTree17              3.34s ± 3%     3.28s ± 2%  -1.86%  (p=0.000 n=67+66)
Fannkuch11                2.50s ± 1%     2.51s ± 1%  +0.24%  (p=0.016 n=63+66)
FmtFprintfEmpty          50.3ns ± 1%    50.2ns ± 2%  -0.30%  (p=0.001 n=62+67)
FmtFprintfString          178ns ± 1%     166ns ± 1%  -7.10%  (p=0.000 n=62+59)
FmtFprintfInt             168ns ± 1%     161ns ± 2%  -4.41%  (p=0.000 n=66+64)
FmtFprintfIntInt          292ns ± 1%     282ns ± 2%  -3.55%  (p=0.000 n=62+60)
FmtFprintfPrefixedInt     245ns ± 2%     239ns ± 2%  -2.24%  (p=0.000 n=66+65)
FmtFprintfFloat           338ns ± 2%     326ns ± 1%  -3.42%  (p=0.000 n=64+59)
FmtManyArgs              1.14µs ± 1%    1.10µs ± 2%  -3.55%  (p=0.000 n=62+62)
GobDecode                8.88ms ± 2%    8.74ms ± 1%  -1.55%  (p=0.000 n=66+62)
GobEncode                6.84ms ± 2%    6.61ms ± 2%  -3.32%  (p=0.000 n=61+67)
Gzip                      356ms ± 2%     352ms ± 2%  -1.07%  (p=0.000 n=67+66)
Gunzip                   90.6ms ± 2%    89.8ms ± 1%  -0.83%  (p=0.000 n=65+64)
HTTPClientServer         82.6µs ± 2%    82.5µs ± 2%    ~     (p=0.832 n=65+63)
JSONEncode               17.5ms ± 2%    16.8ms ± 2%  -3.77%  (p=0.000 n=63+63)
JSONDecode               63.3ms ± 2%    59.0ms ± 2%  -6.85%  (p=0.000 n=64+63)
Mandelbrot200            3.85ms ± 1%    3.85ms ± 1%    ~     (p=0.127 n=65+62)
GoParse                  3.75ms ± 2%    3.66ms ± 2%  -2.39%  (p=0.000 n=66+64)
RegexpMatchEasy0_32       100ns ± 2%     100ns ± 1%  -0.65%  (p=0.000 n=62+64)
RegexpMatchEasy0_1K       342ns ± 1%     341ns ± 1%  -0.43%  (p=0.000 n=65+64)
RegexpMatchEasy1_32      82.8ns ± 2%    82.8ns ± 2%    ~     (p=0.977 n=63+64)
RegexpMatchEasy1_1K       511ns ± 2%     506ns ± 2%  -1.01%  (p=0.000 n=63+64)
RegexpMatchMedium_32      139ns ± 1%     134ns ± 3%  -3.27%  (p=0.000 n=59+60)
RegexpMatchMedium_1K     41.8µs ± 2%    40.5µs ± 2%  -3.05%  (p=0.000 n=62+64)
RegexpMatchHard_32       2.13µs ± 1%    2.09µs ± 1%  -2.22%  (p=0.000 n=60+65)
RegexpMatchHard_1K       64.4µs ± 3%    62.8µs ± 2%  -2.58%  (p=0.000 n=65+59)
Revcomp                   531ms ± 2%     529ms ± 1%  -0.28%  (p=0.022 n=61+61)
Template                 73.2ms ± 1%    73.1ms ± 1%    ~     (p=0.794 n=66+63)
TimeParse                 369ns ± 1%     352ns ± 1%  -4.68%  (p=0.000 n=65+66)
TimeFormat                374ns ± 2%     348ns ± 2%  -7.01%  (p=0.000 n=66+64)

Change-Id: Ib190b5bb48a3e9087711d9e3383621d3103dd342
Reviewed-on: https://go-review.googlesource.com/10367
Reviewed-by: Russ Cox <rsc@golang.org>

josharian added a commit that referenced this issue Jun 4, 2015

runtime, cmd/internal/obj/arm: improve arm function prologue
When stack growth is not needed, as it usually is not,
execute only a single conditional branch
rather than three conditional instructions.
This adds 4 bytes to every function,
but might speed up execution in the common case.

Sample disassembly for

func f() {
	_ = [128]byte{}
}

Before:

TEXT main.f(SB) x.go
	x.go:3	0x2000	e59a1008	MOVW 0x8(R10), R1
	x.go:3	0x2004	e59fb028	MOVW 0x28(R15), R11
	x.go:3	0x2008	e08d200b	ADD R11, R13, R2
	x.go:3	0x200c	e1520001	CMP R1, R2
	x.go:3	0x2010	91a0300e	MOVW.LS R14, R3
	x.go:3	0x2014	9b0118a9	BL.LS runtime.morestack_noctxt(SB)
	x.go:3	0x2018	9afffff8	B.LS main.f(SB)
	x.go:3	0x201c	e52de084	MOVW.W R14, -0x84(R13)
	x.go:4	0x2020	e28d1004	ADD $4, R13, R1
	x.go:4	0x2024	e3a00000	MOVW $0, R0
	x.go:4	0x2028	eb012255	BL 0x4a984
	x.go:5	0x202c	e49df084	RET #132
	x.go:5	0x2030	eafffffe	B 0x2030
	x.go:5	0x2034	ffffff7c	?

After:

TEXT main.f(SB) x.go
	x.go:3	0x2000	e59a1008	MOVW 0x8(R10), R1
	x.go:3	0x2004	e59fb02c	MOVW 0x2c(R15), R11
	x.go:3	0x2008	e08d200b	ADD R11, R13, R2
	x.go:3	0x200c	e1520001	CMP R1, R2
	x.go:3	0x2010	9a000004	B.LS 0x2028
	x.go:3	0x2014	e52de084	MOVW.W R14, -0x84(R13)
	x.go:4	0x2018	e28d1004	ADD $4, R13, R1
	x.go:4	0x201c	e3a00000	MOVW $0, R0
	x.go:4	0x2020	eb0124dc	BL 0x4b398
	x.go:5	0x2024	e49df084	RET #132
	x.go:5	0x2028	e1a0300e	MOVW R14, R3
	x.go:5	0x202c	eb011b0d	BL runtime.morestack_noctxt(SB)
	x.go:5	0x2030	eafffff2	B main.f(SB)
	x.go:5	0x2034	eafffffe	B 0x2034
	x.go:5	0x2038	ffffff7c	?

Updates #10587.

package sort benchmarks on an iPhone 6:

name            old time/op  new time/op  delta
SortString1K     569µs ± 0%   565µs ± 1%  -0.75%  (p=0.000 n=23+24)
StableString1K   872µs ± 1%   870µs ± 1%  -0.16%  (p=0.009 n=23+24)
SortInt1K        317µs ± 2%   316µs ± 2%    ~     (p=0.410 n=26+26)
StableInt1K      343µs ± 1%   339µs ± 1%  -1.07%  (p=0.000 n=22+23)
SortInt64K      30.0ms ± 1%  30.0ms ± 1%    ~     (p=0.091 n=25+24)
StableInt64K    30.2ms ± 0%  30.0ms ± 0%  -0.69%  (p=0.000 n=22+22)
Sort1e2          147µs ± 1%   146µs ± 0%  -0.48%  (p=0.000 n=25+24)
Stable1e2        290µs ± 1%   286µs ± 1%  -1.30%  (p=0.000 n=23+24)
Sort1e4         29.5ms ± 2%  29.7ms ± 1%  +0.71%  (p=0.000 n=23+23)
Stable1e4       88.7ms ± 4%  88.6ms ± 8%  -0.07%  (p=0.022 n=26+26)
Sort1e6          4.81s ± 7%   4.83s ± 7%    ~     (p=0.192 n=26+26)
Stable1e6        18.3s ± 1%   18.1s ± 1%  -0.76%  (p=0.000 n=25+23)
SearchWrappers   318ns ± 1%   344ns ± 1%  +8.14%  (p=0.000 n=23+26)

package sort benchmarks on a first generation rpi:

name            old time/op  new time/op  delta
SearchWrappers  4.13µs ± 0%  3.95µs ± 0%   -4.42%  (p=0.000 n=15+13)
SortString1K    5.81ms ± 1%  5.82ms ± 2%     ~     (p=0.400 n=14+15)
StableString1K  9.69ms ± 1%  9.73ms ± 0%     ~     (p=0.121 n=15+11)
SortInt1K       3.30ms ± 2%  3.66ms ±19%  +10.82%  (p=0.000 n=15+14)
StableInt1K     5.97ms ±15%  4.17ms ± 8%  -30.05%  (p=0.000 n=15+15)
SortInt64K       319ms ± 1%   295ms ± 1%   -7.65%  (p=0.000 n=15+15)
StableInt64K     343ms ± 0%   332ms ± 0%   -3.26%  (p=0.000 n=12+13)
Sort1e2         3.36ms ± 2%  3.22ms ± 4%   -4.10%  (p=0.000 n=15+15)
Stable1e2       6.74ms ± 1%  6.43ms ± 2%   -4.67%  (p=0.000 n=15+15)
Sort1e4          247ms ± 1%   247ms ± 1%     ~     (p=0.331 n=15+14)
Stable1e4        864ms ± 0%   820ms ± 0%   -5.15%  (p=0.000 n=14+15)
Sort1e6          41.2s ± 0%   41.2s ± 0%   +0.15%  (p=0.000 n=13+14)
Stable1e6         192s ± 0%    182s ± 0%   -5.07%  (p=0.000 n=14+14)

Change-Id: I8a9db77e1d4ea1956575895893bc9d04bd81204b
Reviewed-on: https://go-review.googlesource.com/10497
Reviewed-by: Russ Cox <rsc@golang.org>

josharian added a commit that referenced this issue Jun 4, 2015

cmd/internal/obj/arm64: make function prologue more predictable
Static branch prediction guesses that forward branches aren't taken.
Since stacks are rarely grown, make the forward branch mean grow.

While we're here, remove the debug-only instruction
saving the frame size in the temp register.

Sample disassembly for

func f() {
	_ = [128]byte{}
}

Before:

0x4008248         ldr        x1, [x28, #0x10]
0x400824c         sub        x2, sp, #0x90
0x4008250         cmp        x2, x1
0x4008254         b.hi       0x4008268
0x4008258         mov        x3, x30
0x400825c         movz       x27, #0x90
0x4008260         bl         runtime.morestack_noctxt
0x4008264         b          main.f
0x4008268         sub        sp, sp, #0x90
0x400826c         add        x16, sp, #0x10
0x4008270         str        xzr, [x16]
0x4008274         str        xzr, [x16, #0x8]
0x4008278         str        xzr, [x16, #0x10]
0x400827c         str        xzr, [x16, #0x18]
0x4008280         str        xzr, [x16, #0x20]
0x4008284         str        xzr, [x16, #0x28]
0x4008288         str        xzr, [x16, #0x30]
0x400828c         str        xzr, [x16, #0x38]
0x4008290         str        xzr, [x16, #0x40]
0x4008294         str        xzr, [x16, #0x48]
0x4008298         str        xzr, [x16, #0x50]
0x400829c         str        xzr, [x16, #0x58]
0x40082a0         str        xzr, [x16, #0x60]
0x40082a4         str        xzr, [x16, #0x68]
0x40082a8         str        xzr, [x16, #0x70]
0x40082ac         str        xzr, [x16, #0x78]
0x40082b0         add        sp, sp, #0x90
0x40082b4         ret

After:

0x4004bc8         ldr        x1, [x28, #0x10]
0x4004bcc         sub        x2, sp, #0x90
0x4004bd0         cmp        x2, x1
0x4004bd4         b.ls       0x4004c28
0x4004bd8         sub        sp, sp, #0x90
0x4004bdc         add        x16, sp, #0x10
0x4004be0         str        xzr, [x16]
0x4004be4         str        xzr, [x16, #0x8]
0x4004be8         str        xzr, [x16, #0x10]
0x4004bec         str        xzr, [x16, #0x18]
0x4004bf0         str        xzr, [x16, #0x20]
0x4004bf4         str        xzr, [x16, #0x28]
0x4004bf8         str        xzr, [x16, #0x30]
0x4004bfc         str        xzr, [x16, #0x38]
0x4004c00         str        xzr, [x16, #0x40]
0x4004c04         str        xzr, [x16, #0x48]
0x4004c08         str        xzr, [x16, #0x50]
0x4004c0c         str        xzr, [x16, #0x58]
0x4004c10         str        xzr, [x16, #0x60]
0x4004c14         str        xzr, [x16, #0x68]
0x4004c18         str        xzr, [x16, #0x70]
0x4004c1c         str        xzr, [x16, #0x78]
0x4004c20         add        sp, sp, #0x90
0x4004c24         ret
0x4004c28         mov        x3, x30
0x4004c2c         bl         runtime.morestack_noctxt
0x4004c30         b          main.f

Updates #10587.

Package sort benchmarks using an iPhone 6:

name            old time/op  new time/op  delta
SearchWrappers   355ns ± 1%   328ns ± 1%  -7.57%  (p=0.000 n=25+19)
SortString1K     580µs ± 1%   577µs ± 1%  -0.48%  (p=0.000 n=25+25)
StableString1K  1.04ms ± 0%  1.04ms ± 0%    ~     (p=0.851 n=24+25)
SortInt1K        251µs ± 1%   247µs ± 1%  -1.52%  (p=0.000 n=23+25)
StableInt1K      267µs ± 2%   261µs ± 2%  -2.02%  (p=0.000 n=25+25)
SortInt64K      23.8ms ± 1%  23.6ms ± 0%  -0.97%  (p=0.000 n=25+23)
StableInt64K    22.8ms ± 0%  22.4ms ± 1%  -1.76%  (p=0.000 n=24+25)
Sort1e2          123µs ± 1%   124µs ± 1%    ~     (p=0.256 n=23+23)
Stable1e2        248µs ± 1%   247µs ± 1%  -0.69%  (p=0.000 n=23+25)
Sort1e4         24.3ms ± 2%  24.6ms ± 5%  +1.36%  (p=0.017 n=22+25)
Stable1e4       77.2ms ± 6%  76.2ms ± 5%  -1.36%  (p=0.020 n=25+25)
Sort1e6          3.95s ± 8%   3.95s ± 8%    ~     (p=0.863 n=25+25)
Stable1e6        15.7s ± 1%   15.5s ± 1%  -1.11%  (p=0.000 n=22+23)

Change-Id: I377b3817af2ed27ddeecf24edef97fad91fc1afc
Reviewed-on: https://go-review.googlesource.com/10500
Reviewed-by: Russ Cox <rsc@golang.org>
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
Reviewed-by: Dave Cheney <dave@cheney.net>

@josharian josharian closed this Jun 6, 2015

@golang golang locked and limited conversation to collaborators Jun 25, 2016

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.