proposal: runtime: non-cooperative goroutine preemption #24543

Open
aclements opened this Issue Mar 26, 2018 · 37 comments

@aclements
Member

aclements commented Mar 26, 2018

I propose that we solve #10958 (preemption of tight loops) using non-cooperative preemption techniques. I have a detailed design proposal, which I will post shortly. This issue will track this specific implementation approach, as opposed to the general problem.

Edit: Design doc

Currently, Go uses compiler-inserted cooperative preemption points in function prologues. The majority of the time, this is good enough to allow Go developers to ignore preemption and focus on writing clear parallel code, but it has sharp edges that we've seen degrade the developer experience time and time again. When it goes wrong, it goes spectacularly wrong, leading to mysterious system-wide latency issues (#17831, #19241) and sometimes complete freezes (#543, #12553, #13546, #14561, #15442, #17174, #20793, #21053). And because this is a language implementation issue that exists outside of Go's language semantics, these failures are surprising and very difficult to debug.

@dr2chase has put significant effort into prototyping cooperative preemption points in loops, which is one way to solve this problem. However, even sophisticated approaches to this led to unacceptable slow-downs in tight loops (where slow-downs are generally least acceptable).

I propose that the Go implementation switch to non-cooperative preemption using stack and register maps at (essentially) every instruction. This would allow goroutines to be preempted without explicit preemption checks. This approach will solve the problem of delayed preemption with zero run-time overhead and have side benefits for debugger function calls (#21678).
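
As an illustration of the failure mode described above, the following is a minimal sketch of a tight loop with no function calls, and therefore no prologue preemption points. The name `spinThenStop` and the delay value are assumptions made for this example, not runtime APIs. With a single P and only cooperative preemption, the main goroutine would be expected never to run again and the program to hang; with non-cooperative preemption it should finish.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

// spinThenStop starts a goroutine that spins in a loop containing no function
// calls (the atomic load is normally compiled inline), lets it run for delayMs
// milliseconds on the only P, then asks it to stop. It returns the number of
// loop iterations the spinner completed.
func spinThenStop(delayMs int) uint64 {
	prev := runtime.GOMAXPROCS(1) // a single P, so the spinner can starve main
	defer runtime.GOMAXPROCS(prev)

	var stop int32
	result := make(chan uint64, 1)
	go func() {
		var n uint64
		for atomic.LoadInt32(&stop) == 0 {
			n++ // no calls here, so no cooperative preemption point
		}
		result <- n
	}()

	// While main sleeps, the spinner owns the P. Main only gets the P back
	// if the runtime can preempt the spinner in the middle of its loop.
	time.Sleep(time.Duration(delayMs) * time.Millisecond)
	atomic.StoreInt32(&stop, 1)
	return <-result
}

func main() {
	fmt.Println("spinner iterations before stop:", spinThenStop(10))
}
```

The iteration count varies from run to run; the point of the sketch is only whether `main` ever regains the processor.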

I've already prototyped significant components of this solution, including constructing register maps and recording stack and register maps at every instruction, and so far the results are quite promising.

/cc @drchase @RLH @randall77 @minux

@aclements aclements added this to the Go1.12 milestone Mar 26, 2018

@aclements aclements self-assigned this Mar 26, 2018

@gopherbot gopherbot added the Proposal label Mar 26, 2018

@gopherbot

gopherbot commented Mar 26, 2018

Change https://golang.org/cl/102600 mentions this issue: design: add 24543-non-cooperative-preemption

@gopherbot

gopherbot commented Mar 26, 2018

Change https://golang.org/cl/102603 mentions this issue: cmd/compile: detect simple inductive facts in prove

@gopherbot

gopherbot commented Mar 26, 2018

Change https://golang.org/cl/102604 mentions this issue: cmd/compile: don't produce a past-the-end pointer in range loops

@aclements

Member

aclements commented Mar 27, 2018

Forwarding some questions from @hyangah on the CL:

Is code in cgo (or outside Go) considered a non-safe point?

All of cgo is currently considered a safe-point (one of the reasons it's relatively expensive to enter and exit cgo) and this won't change.

Or will the runtime be careful not to send the signal to threads that may be in cgo land?

I don't think the runtime can avoid sending signals to threads that may be in cgo without expensive synchronization on common paths, but I don't think it matters. When it enters the runtime signal handler it can recognize that it was in cgo and do the appropriate thing (which will probably be to just ignore it, or maybe queue up an action like stack scanning).

Should users or cgo code avoid using the signal?

It should be okay if cgo code uses the signal, as long as it's correctly chained. I'm hoping to use POSIX real-time signals on systems where they're available, so the runtime will attempt to find one that's unused (which is usually all of them anyway), though that isn't an option on Darwin.

And a question from @randall77 (which I answered on the CL, but should have answered here):

Will we stop using the current preemption technique (the dummy large stack bound) altogether, or will the non-coop preemption just be a backstop?

There's really no cost to the current technique and we'll continue to rely on it in the runtime for the foreseeable future, so my current plan is to leave it in. However, we could be much more aggressive about removing stack bounds checks (for example if we can prove that a whole call tree will fit in the nosplit zone).
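
For readers unfamiliar with the existing technique, here is a toy model of how a "dummy large stack bound" can piggyback preemption on the stack check in a function prologue. The type `g`, its fields, the constant value, and the function names are all assumptions for this sketch; the real runtime's data structures and constants differ.

```go
package main

import "fmt"

// stackPreempt stands in for the dummy large stack bound: a value above any
// real stack pointer. The exact value here is an assumption for the sketch.
const stackPreempt = ^uintptr(0)

// g is a toy goroutine descriptor with only the fields the sketch needs.
type g struct {
	stackLo    uintptr // real lower bound of the stack
	stackGuard uintptr // bound compared against SP in every prologue
}

// requestPreempt poisons the guard so the next prologue check fails.
func (gp *g) requestPreempt() { gp.stackGuard = stackPreempt }

// prologue models the single comparison compiled into a function prologue.
// The fast path costs the same whether or not preemption was requested.
func prologue(gp *g, sp uintptr) string {
	if sp > gp.stackGuard {
		return "run"
	}
	// Slow path: decide whether this is a real overflow or a preemption.
	if gp.stackGuard == stackPreempt {
		gp.stackGuard = gp.stackLo // restore the real bound
		return "preempt"
	}
	return "grow stack"
}

func main() {
	gp := &g{stackLo: 0x1000, stackGuard: 0x1000}
	fmt.Println(prologue(gp, 0x2000)) // ordinary call
	gp.requestPreempt()
	fmt.Println(prologue(gp, 0x2000)) // same call after a preemption request
	fmt.Println(prologue(gp, 0x0800)) // genuine stack overflow
}
```

The model also shows why the technique cannot help a loop that makes no calls: the check only runs when a prologue runs.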

@TocarIP

Contributor

TocarIP commented Mar 27, 2018

So it is still possible to make a goroutine non-preemptible with something like:
sha256.Sum256(make([]byte, 1000000000))
where the inner loop is written in asm?

@aclements

Member

aclements commented Mar 27, 2018

Yes, that would still make a goroutine non-preemptible. However, with some extra annotations in the assembly to indicate registers containing pointers, it will become preemptible without any extra work or run-time overhead to reach an explicit safe-point. In the case of sha256.Sum256 these annotations would probably be trivial since it will never construct a pointer that isn't shadowed by the arguments (so it can claim there are no pointers in registers).

I'll add a paragraph to the design doc about this.

@komuw

komuw commented Mar 28, 2018

Will the design doc be posted here?

@aclements

Member

aclements commented Mar 28, 2018

gopherbot pushed a commit to golang/proposal that referenced this issue Mar 28, 2018

design: add 24543-non-cooperative-preemption
For golang/go#24543.

Change-Id: Iba313a963aafcd93521bb9e006cb32d1f242301b
Reviewed-on: https://go-review.googlesource.com/102600
Reviewed-by: Rick Hudson <rlh@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@aclements

Member

aclements commented Mar 28, 2018

The doc is now submitted: Proposal: Non-cooperative goroutine preemption

@mtstickney

mtstickney commented Mar 30, 2018

Disclaimer: I'm not a platform expert, or an expert on language implementations, or involved with Go aside from having written a few toy programs in it. That said:

There's a (potentially) fatal flaw here: GetThreadContext doesn't actually work on Windows (see here for details). There are several lisp implementations that have exhibited crashes on that platform because they tried to use GetThreadContext/SetThreadContext to implement preemptive signals on Windows.

As some old notes for SBCL point out, Windows has no working version of preemptive signals without loading a kernel driver, which is generally prohibitive for applications.

@JamesBielby

JamesBielby commented Mar 31, 2018

I think the example code to avoid creating a past-the-end pointer has a problem if the slice has a capacity of 0. You need to declare _p after the first if statement.

@creker

creker commented Mar 31, 2018

@mtstickney That looks to be true, but we can look at how other implementations handle the same problem. CoreCLR describes the same issue: it needs to preempt threads for GC and mentions the same bugs with wrong thread contexts. It also describes how it solves the problem without ditching SuspendThread altogether, by using redirection.

I'm not an expert in this kind of stuff so I'm sorry if this has nothing to do with solving the problem here.

@mtstickney

mtstickney commented Mar 31, 2018

@creker Nor me, so we're in the same boat there. I hadn't seen the CoreCLR reference before, but that's the same idea as the lisp approach: SuspendThread, retrieve the current register set with GetThreadContext, change IP to point to the signal code to be run, ResumeThread, then when the handler is finished restore the original registers with SetThreadContext.

The trick is capturing the original register set: you can either do it with an OS primitive (GetThreadContext, which is buggy), or roll your own code for it. If you do the latter, you're at risk for getting a bogus set of registers because your register-collecting code is in user-mode, and could be preempted by a kernel APC.

It looks like on some Windows versions, some of the time, you can detect and avoid the race conditions with GetThreadContext (see this post, particularly the comments concerning CONTEXT_EXCEPTION_REQUEST). The CoreCLR code seems to make some attempts to work around the race condition, although I don't know if it's suitable here.

@aclements

Member

aclements commented Mar 31, 2018

Thanks for the pointers about GetThreadContext! That's really interesting and good to know, but I think it's actually not a problem.

For GC preemption, we can always resume the same goroutine on the same thread after preemption, so there's no need to call SetThreadContext to hijack the thread. We just need to observe its state; not run something else on that thread. Furthermore, my understanding is that GetThreadContext doesn't reliably return all registers if the thread is in a syscall, but in this case there won't be any live pointers in registers anyway (any pointer arguments to the syscall are shadowed on the Go wrapper's stack). Hence, we only need to retrieve the PC and SP in this case. Even this may not matter, since we currently treat a syscall as a giant GC safe-point, so we already save the information we need on the way in to the syscall.

For scheduler preemption, things are a bit more complicated, but I think still okay. In this case we would need to call SetThreadContext to hijack the thread, but we would only do this to threads at Go safe-points, meaning we'd never preempt something in a syscall. Today, if a goroutine has been in a syscall for too long, we don't hijack the thread, we simply flag that it should block upon returning from the syscall and schedule the next goroutine on a different thread (creating a new one or going to a pool). We would keep using that mechanism for rescheduling goroutines that are in system calls.

@gopherbot

gopherbot commented Apr 20, 2018

Change https://golang.org/cl/108497 mentions this issue: cmd/compile: teach Haspointer about TSSA and TTUPLE

@gopherbot

gopherbot commented Apr 20, 2018

Change https://golang.org/cl/108496 mentions this issue: cmd/compile: don't lower OpConvert

@gopherbot

gopherbot commented Apr 20, 2018

Change https://golang.org/cl/108498 mentions this issue: cmd/compile: don't compact liveness maps in place

gopherbot pushed a commit that referenced this issue Apr 20, 2018

cmd/compile: don't lower OpConvert
Currently, each architecture lowers OpConvert to an arch-specific
OpXXXconvert. This is silly because OpConvert means the same thing on
all architectures and is logically a no-op that exists only to keep
track of conversions to and from unsafe.Pointer. Furthermore, lowering
it makes it harder to recognize in other analyses, particularly
liveness analysis.

This CL eliminates the lowering of OpConvert, leaving it as the
generic op until code generation time.

The main complexity here is that we still need to register-allocate
OpConvert operations. Currently, each arch's lowered OpConvert
specifies all GP registers in its register mask. Ideally, OpConvert
wouldn't affect value homing at all, and we could just copy the home
of OpConvert's source, but this can potentially home an OpConvert in a
LocalSlot, which neither regalloc nor stackalloc expect. Rather than
try to disentangle this assumption from regalloc and stackalloc, we
continue to register-allocate OpConvert, but teach regalloc that
OpConvert can be allocated to any allocatable GP register.

For #24543.

Change-Id: I795a6aee5fd94d4444a7bafac3838a400c9f7bb6
Reviewed-on: https://go-review.googlesource.com/108496
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>

gopherbot pushed a commit that referenced this issue Apr 20, 2018

cmd/compile: teach Haspointer about TSSA and TTUPLE
These will appear when tracking live pointers in registers, so we need
to know whether they have pointers.

For #24543.

Change-Id: I2edccee39ca989473db4b3e7875ff166808ac141
Reviewed-on: https://go-review.googlesource.com/108497
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: David Chase <drchase@google.com>

gopherbot pushed a commit that referenced this issue Apr 23, 2018

cmd/compile: don't compact liveness maps in place
Currently Liveness.compact rewrites the Liveness.livevars slice in
place. However, we're about to add register maps, which we'll want to
track in livevars, but compact independently from the stack maps.
Hence, this CL modifies Liveness.compact to consume Liveness.livevars
and produce a new slice of deduplicated stack maps. This is somewhat
clearer anyway because it avoids potential confusion over how
Liveness.livevars is indexed.
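
The consume-and-produce shape described above can be sketched roughly as follows. This is a simplified model, not the compiler's code: `bitmap`, `compact`, and the returned index slice are hypothetical stand-ins for the real bit vector type and bookkeeping.

```go
package main

import "fmt"

// bitmap is a stand-in for the compiler's bit vector type (an assumption for
// this sketch): one bit per tracked stack slot.
type bitmap uint64

// compact reads the per-safe-point liveness bitmaps without modifying them and
// returns a new slice of deduplicated maps, plus, for each safe point, the
// index of its map in that new slice.
func compact(livevars []bitmap) (maps []bitmap, index []int) {
	seen := make(map[bitmap]int)
	index = make([]int, len(livevars))
	for i, lv := range livevars {
		j, ok := seen[lv]
		if !ok {
			j = len(maps)
			seen[lv] = j
			maps = append(maps, lv)
		}
		index[i] = j
	}
	return maps, index
}

func main() {
	livevars := []bitmap{1, 3, 1, 2}
	maps, index := compact(livevars)
	fmt.Println(maps, index) // livevars itself is left untouched
}
```

Because the input slice is only read, a second, independently compacted table (such as register maps) could be derived from the same per-point data.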

Passes toolstash -cmp.

For #24543.

Change-Id: I7093fbc71143f8a29e677aa30c96e501f953ca2b
Reviewed-on: https://go-review.googlesource.com/108498
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
@gopherbot

gopherbot commented Apr 25, 2018

Change https://golang.org/cl/109351 mentions this issue: cmd/compile: dense numbering for GP registers

@gopherbot

gopherbot commented May 11, 2018

Change https://golang.org/cl/110176 mentions this issue: cmd/compile: abstract bvec sets

@gopherbot

gopherbot commented May 11, 2018

Change https://golang.org/cl/110175 mentions this issue: cmd/compile: single pass over Blocks in Liveness.epilogue

@gopherbot

gopherbot commented May 11, 2018

Change https://golang.org/cl/110179 mentions this issue: cmd/compile: reuse liveness structures

@gopherbot

gopherbot commented May 11, 2018

Change https://golang.org/cl/110178 mentions this issue: cmd/compile: make LivenessMap dense

gopherbot pushed a commit that referenced this issue May 22, 2018

cmd/compile: detect OFORUNTIL inductive facts in prove
Currently, we compile range loops into for loops with the obvious
initialization and update of the index variable. In this form, the
prove pass can see that the body is dominated by an i < len condition,
and findIndVar can detect that i is an induction variable and that
0 <= i < len.

GOEXPERIMENT=preemptibleloops compiles range loops to OFORUNTIL and
we're preparing to unconditionally switch to a variation of this for
 #24543. OFORUNTIL moves the increment and condition *after* the body,
which makes the bounds on the index variable much less obvious. With
OFORUNTIL, proving anything about the index variable requires
understanding the phi that joins the index values at the top of the
loop body block.

This interferes with both prove's ability to see that i < len (this is
true on both paths that enter the body, but from two different
conditional checks) and with findIndVar's ability to detect the
induction pattern.

Fix this by teaching prove to detect that the index in the pattern
constructed by OFORUNTIL is an induction variable and add both bounds
to the facts table. Currently this is done separately from findIndVar
because it depends on prove's factsTable, while findIndVar runs before
visiting blocks and building the factsTable.

Without any GOEXPERIMENT, this has no effect on std or cmd. However,
with GOEXPERIMENT=preemptibleloops, this change becomes necessary to
prove 90 conditions in std and cmd.

Change-Id: Ic025d669f81b53426309da5a6e8010e5ccaf4f49
Reviewed-on: https://go-review.googlesource.com/102603
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>

gopherbot pushed a commit that referenced this issue May 22, 2018

cmd/compile: don't produce a past-the-end pointer in range loops
Currently, range loops over slices and arrays are compiled roughly
like:

for i, x := range s { b }
  ⇓
for i, _n, _p := 0, len(s), &s[0]; i < _n; i, _p = i+1, _p + unsafe.Sizeof(s[0]) { b }
  ⇓
i, _n, _p := 0, len(s), &s[0]
goto cond
body:
{ b }
i, _p = i+1, _p + unsafe.Sizeof(s[0])
cond:
if i < _n { goto body } else { goto end }
end:

The problem with this lowering is that _p may temporarily point past
the end of the allocation the moment before the loop terminates. Right
now this isn't a problem because there's never a safe-point during
this brief moment.

We're about to introduce safe-points everywhere, so this bad pointer
is going to be a problem. We could mark the increment as an unsafe
block, but this inhibits reordering opportunities and could result in
infrequent safe-points if the body is short.

Instead, this CL fixes this by changing how we compile range loops to
never produce this past-the-end pointer. It changes the lowering to
roughly:

i, _n, _p := 0, len(s), &s[0]
if i < _n { goto body } else { goto end }
top:
_p += unsafe.Sizeof(s[0])
body:
{ b }
i++
if i < _n { goto top } else { goto end }
end:

Notably, the increment is split into two parts: we increment the index
before checking the condition, but increment the pointer only *after*
the condition check has succeeded.
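
The difference between the two lowerings can be checked with a small simulation. This sketch models _p as a byte offset from the start of the backing array rather than a real pointer, and the function names are assumptions for illustration. It records the largest offset _p ever holds: any offset of n*elemSize or more would be past the end.

```go
package main

import "fmt"

// oldLowering mirrors the original control flow: index and pointer are both
// advanced before the condition is checked.
func oldLowering(n, elemSize int) (maxOff, visits int) {
	i, p := 0, 0
	for i < n { // cond
		visits++ // body
		i, p = i+1, p+elemSize
		if p > maxOff {
			maxOff = p
		}
	}
	return maxOff, visits
}

// newLowering mirrors the revised control flow: the index is advanced before
// the check, the pointer only after the check has succeeded.
func newLowering(n, elemSize int) (maxOff, visits int) {
	i, p := 0, 0
	if i < n {
		goto body
	} else {
		goto end
	}
top:
	p += elemSize // late increment
	if p > maxOff {
		maxOff = p
	}
body:
	visits++
	i++
	if i < n {
		goto top
	} else {
		goto end
	}
end:
	return maxOff, visits
}

func main() {
	fmt.Println(oldLowering(4, 8)) // max offset reaches 4*8, one past the last element
	fmt.Println(newLowering(4, 8)) // max offset stops at 3*8, the last element
}
```

Both versions visit the body the same number of times; only the highest value of the simulated pointer differs.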

The implementation builds on the OFORUNTIL construct that was
introduced during the loop preemption experiments, since OFORUNTIL
places the increment and condition after the loop body. To support the
extra "late increment" step, we further define OFORUNTIL's "List"
field to contain the late increment statements. This makes all of this
a relatively small change.

This depends on the improvements to the prove pass in CL 102603. With
the current lowering, bounds-check elimination knows that i < _n in
the body because the body block is dominated by the cond block. In the
new lowering, deriving this fact requires detecting that i < _n on
*both* paths into body and hence is true in body. CL 102603 made prove
able to detect this.

The code size effect of this is minimal. The cmd/go binary on
linux/amd64 increases by 0.17%. Performance-wise, this actually
appears to be a net win, though it's mostly noise:

name                      old time/op    new time/op    delta
BinaryTree17-12              2.80s ± 0%     2.61s ± 1%  -6.88%  (p=0.000 n=20+18)
Fannkuch11-12                2.41s ± 0%     2.42s ± 0%  +0.05%  (p=0.005 n=20+20)
FmtFprintfEmpty-12          41.6ns ± 5%    41.4ns ± 6%    ~     (p=0.765 n=20+19)
FmtFprintfString-12         69.4ns ± 3%    69.3ns ± 1%    ~     (p=0.084 n=19+17)
FmtFprintfInt-12            76.1ns ± 1%    77.3ns ± 1%  +1.57%  (p=0.000 n=19+19)
FmtFprintfIntInt-12          122ns ± 2%     123ns ± 3%  +0.95%  (p=0.015 n=20+20)
FmtFprintfPrefixedInt-12     153ns ± 2%     151ns ± 3%  -1.27%  (p=0.013 n=20+20)
FmtFprintfFloat-12           215ns ± 0%     216ns ± 0%  +0.47%  (p=0.000 n=20+16)
FmtManyArgs-12               486ns ± 1%     498ns ± 0%  +2.40%  (p=0.000 n=20+17)
GobDecode-12                6.43ms ± 0%    6.50ms ± 0%  +1.08%  (p=0.000 n=18+19)
GobEncode-12                5.43ms ± 1%    5.47ms ± 0%  +0.76%  (p=0.000 n=20+20)
Gzip-12                      218ms ± 1%     218ms ± 1%    ~     (p=0.883 n=20+20)
Gunzip-12                   38.8ms ± 0%    38.9ms ± 0%    ~     (p=0.644 n=19+19)
HTTPClientServer-12         76.2µs ± 1%    76.4µs ± 2%    ~     (p=0.218 n=20+20)
JSONEncode-12               12.2ms ± 0%    12.3ms ± 1%  +0.45%  (p=0.000 n=19+19)
JSONDecode-12               54.2ms ± 1%    53.3ms ± 0%  -1.67%  (p=0.000 n=20+20)
Mandelbrot200-12            3.71ms ± 0%    3.71ms ± 0%    ~     (p=0.143 n=19+20)
GoParse-12                  3.22ms ± 0%    3.19ms ± 1%  -0.72%  (p=0.000 n=20+20)
RegexpMatchEasy0_32-12      76.7ns ± 1%    75.8ns ± 1%  -1.19%  (p=0.000 n=20+17)
RegexpMatchEasy0_1K-12       245ns ± 1%     243ns ± 0%  -0.72%  (p=0.000 n=18+17)
RegexpMatchEasy1_32-12      71.9ns ± 0%    71.7ns ± 1%  -0.39%  (p=0.006 n=12+18)
RegexpMatchEasy1_1K-12       358ns ± 1%     354ns ± 1%  -1.13%  (p=0.000 n=20+19)
RegexpMatchMedium_32-12      105ns ± 2%     105ns ± 1%  -0.63%  (p=0.007 n=19+20)
RegexpMatchMedium_1K-12     31.9µs ± 1%    31.9µs ± 1%    ~     (p=1.000 n=17+17)
RegexpMatchHard_32-12       1.51µs ± 1%    1.52µs ± 2%  +0.46%  (p=0.042 n=18+18)
RegexpMatchHard_1K-12       45.3µs ± 1%    45.5µs ± 2%  +0.44%  (p=0.029 n=18+19)
Revcomp-12                   388ms ± 1%     385ms ± 0%  -0.57%  (p=0.000 n=19+18)
Template-12                 63.0ms ± 1%    63.3ms ± 0%  +0.50%  (p=0.000 n=19+20)
TimeParse-12                 309ns ± 1%     307ns ± 0%  -0.62%  (p=0.000 n=20+20)
TimeFormat-12                328ns ± 0%     333ns ± 0%  +1.35%  (p=0.000 n=19+19)
[Geo mean]                  47.0µs         46.9µs       -0.20%

(https://perf.golang.org/search?q=upload:20180326.1)

For #10958.
For #24543.

Change-Id: Icbd52e711fdbe7938a1fea3e6baca1104b53ac3a
Reviewed-on: https://go-review.googlesource.com/102604
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: David Chase <drchase@google.com>

gopherbot pushed a commit that referenced this issue May 22, 2018

cmd/internal/obj: consolidate emitting entry stack map
The obj package needs to emit the PCDATA to select the entry stack map
before calling morestack. Currently this is copied for every
architecture. Since we're about to change how this works, consolidate
all of these copies into a single helper function.

For #24543.

Change-Id: Ia92d94de78f8e23fd06dba747c43e03e5989f67b
Reviewed-on: https://go-review.googlesource.com/109346
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>

gopherbot pushed a commit that referenced this issue May 22, 2018

cmd/compile: introduce LivenessMap and LivenessIndex
Currently liveness only produces a stack map index at each safe point,
so the information is summarized in a map[*ssa.Value]int. We're about
to have both a stack map index and a register map index, so replace
the int with a LivenessIndex type we can extend, and replace the map
with a LivenessMap that we can also change more easily in the future.

This also gives us an easy hook for defining the value that means "not
a safe point".
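
As a rough picture of the shape this describes, the sketch below replaces a bare int with a small struct and wraps the map in its own type. The field names, the sentinel value, and the use of an integer ID as the key are assumptions for the example, not the compiler's actual definitions.

```go
package main

import "fmt"

// LivenessIndex holds the map indexes for one program point. Keeping it a
// struct leaves room for the register map index alongside the stack map index.
type LivenessIndex struct {
	stackMapIndex int
	regMapIndex   int
}

// livenessInvalid is a hypothetical sentinel meaning "not a safe point".
var livenessInvalid = LivenessIndex{stackMapIndex: -1, regMapIndex: -1}

// Valid reports whether the index refers to a real safe point.
func (i LivenessIndex) Valid() bool { return i.stackMapIndex >= 0 }

// LivenessMap hides the underlying representation so it can change later.
// Here it is keyed by a value ID; the real key type may differ.
type LivenessMap struct {
	m map[int]LivenessIndex
}

// Set records the index for a value ID.
func (lm *LivenessMap) Set(id int, idx LivenessIndex) {
	if lm.m == nil {
		lm.m = make(map[int]LivenessIndex)
	}
	lm.m[id] = idx
}

// Get returns the index for a value ID, or the sentinel if none was recorded.
func (lm LivenessMap) Get(id int) LivenessIndex {
	if idx, ok := lm.m[id]; ok {
		return idx
	}
	return livenessInvalid
}

func main() {
	var lm LivenessMap
	lm.Set(7, LivenessIndex{stackMapIndex: 2, regMapIndex: 0})
	fmt.Println(lm.Get(7).Valid(), lm.Get(8).Valid())
}
```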

Passes toolstash -cmp.

For #24543.

Change-Id: Ic4c069839635efed4fd0f603899b80f8be3b56ec
Reviewed-on: https://go-review.googlesource.com/109347
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>

gopherbot pushed a commit that referenced this issue May 22, 2018

cmd/compile: enable stack maps everywhere except unsafe points
This modifies issafepoint in liveness analysis to report almost every
operation as a safe point. There are four things we don't mark as
safe-points:

1. Runtime code (other than at calls).

2. go:nosplit functions (other than at calls).

3. Instructions between the load of the write barrier-enabled flag and
   the write.

4. Instructions leading up to a uintptr -> unsafe.Pointer conversion.
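
To illustrate item 4, the following sketch shows the kind of code involved. While the address exists only as a uintptr, the garbage collector does not see it as a pointer, which is why the instructions that compute it cannot be safe points. The type `pair` and the function `secondField` are made up for this example.

```go
package main

import (
	"fmt"
	"unsafe"
)

// pair is an arbitrary struct used to demonstrate pointer arithmetic.
type pair struct{ a, b int64 }

// secondField reads p.b through pointer arithmetic. Between the conversion to
// uintptr and the conversion back to unsafe.Pointer, the address is just an
// integer; both conversions are kept in a single expression, as the unsafe
// package documentation requires.
func secondField(p *pair) int64 {
	q := (*int64)(unsafe.Pointer(uintptr(unsafe.Pointer(p)) + unsafe.Offsetof(p.b)))
	return *q
}

func main() {
	fmt.Println(secondField(&pair{a: 1, b: 2}))
}
```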

We'll optimize this in later CLs:

name        old time/op       new time/op       delta
Template          185ms ± 2%        190ms ± 2%   +2.95%  (p=0.000 n=10+10)
Unicode          96.3ms ± 3%       96.4ms ± 1%     ~     (p=0.905 n=10+9)
GoTypes           658ms ± 0%        669ms ± 1%   +1.72%  (p=0.000 n=10+9)
Compiler          3.14s ± 1%        3.18s ± 1%   +1.56%  (p=0.000 n=9+10)
SSA               7.41s ± 2%        7.59s ± 1%   +2.48%  (p=0.000 n=9+10)
Flate             126ms ± 1%        128ms ± 1%   +2.08%  (p=0.000 n=10+10)
GoParser          153ms ± 1%        157ms ± 2%   +2.38%  (p=0.000 n=10+10)
Reflect           437ms ± 1%        442ms ± 1%   +0.98%  (p=0.001 n=10+10)
Tar               178ms ± 1%        179ms ± 1%   +0.67%  (p=0.035 n=10+9)
XML               223ms ± 1%        229ms ± 1%   +2.58%  (p=0.000 n=10+10)
[Geo mean]        394ms             401ms        +1.75%

No effect on binary size because we're not yet emitting these extra
safe points.

For #24543.

Change-Id: I16a1eebb9183cad7cef9d53c0fd21a973cad6859
Reviewed-on: https://go-review.googlesource.com/109348
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>

gopherbot pushed a commit that referenced this issue May 22, 2018

cmd/compile: output stack map index everywhere it changes
Currently, the code generator only considers outputting stack map
indexes at CALL instructions. Raise this into the code generator loop
itself so that changes in the stack map index at any instruction emit
a PCDATA Prog before the actual instruction.
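
The emission rule can be sketched as a loop that tracks the last index written and inserts a marker only when it changes. The `inst` type, the marker text, and the initial value of -1 are assumptions for this illustration and do not reflect the code generator's real data structures.

```go
package main

import "fmt"

// inst is a toy instruction annotated with the stack map index that applies
// at that instruction.
type inst struct {
	op       string
	stackMap int
}

// emit walks the instructions in order and writes a PCDATA marker before any
// instruction whose stack map index differs from the last one emitted.
func emit(insts []inst) []string {
	var out []string
	last := -1 // assumed starting state: no index emitted yet
	for _, in := range insts {
		if in.stackMap != last {
			out = append(out, fmt.Sprintf("PCDATA stackmap=%d", in.stackMap))
			last = in.stackMap
		}
		out = append(out, in.op)
	}
	return out
}

func main() {
	prog := []inst{{"MOVQ", 0}, {"ADDQ", 0}, {"CALL", 1}, {"MOVQ", 1}, {"RET", 0}}
	for _, line := range emit(prog) {
		fmt.Println(line)
	}
}
```

Runs of instructions that share an index cost one marker, which is why the binary size effect depends on how often liveness changes rather than on the instruction count.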

We'll optimize this in later CLs:

name        old time/op       new time/op       delta
Template          190ms ± 2%        191ms ± 2%    ~     (p=0.529 n=10+10)
Unicode          96.4ms ± 1%       98.5ms ± 3%  +2.18%  (p=0.001 n=9+10)
GoTypes           669ms ± 1%        673ms ± 1%  +0.62%  (p=0.004 n=9+9)
Compiler          3.18s ± 1%        3.22s ± 1%  +1.06%  (p=0.000 n=10+9)
SSA               7.59s ± 1%        7.64s ± 1%  +0.66%  (p=0.023 n=10+10)
Flate             128ms ± 1%        130ms ± 2%  +1.07%  (p=0.043 n=10+10)
GoParser          157ms ± 2%        158ms ± 3%    ~     (p=0.123 n=10+10)
Reflect           442ms ± 1%        445ms ± 1%  +0.73%  (p=0.017 n=10+9)
Tar               179ms ± 1%        180ms ± 1%  +0.58%  (p=0.019 n=9+9)
XML               229ms ± 1%        232ms ± 2%  +1.27%  (p=0.009 n=10+10)
[Geo mean]        401ms             405ms       +0.94%

name        old exe-bytes     new exe-bytes     delta
HelloSize         1.46M ± 0%        1.47M ± 0%  +0.84%  (p=0.000 n=10+10)
[Geo mean]        1.46M             1.47M       +0.84%

For #24543.

Change-Id: I4bfe45b767c9d9db47308a27763b303fa75bfa54
Reviewed-on: https://go-review.googlesource.com/109350
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@gopherbot

gopherbot commented May 22, 2018

Change https://golang.org/cl/114076 mentions this issue: cmd/compile: fix unsafe-point analysis with -N

gopherbot pushed a commit that referenced this issue May 22, 2018

cmd/compile: fix unsafe-point analysis with -N
Compiling without optimizations (-N) can result in write barrier
blocks that have been optimized away but not actually pruned from the
block set. Fix unsafe-point analysis to recognize and ignore these.

For #24543.

Change-Id: I2ca86fb1a0346214ec71d7d6c17b6a121857b01d
Reviewed-on: https://go-review.googlesource.com/114076
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: David Chase <drchase@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>

gopherbot pushed a commit that referenced this issue May 22, 2018

cmd/compile: dense numbering for GP registers
For register maps, we need a dense numbering of registers that may
contain pointers of interest to the garbage collector. Add this to
Register and compute it from the GP register set.
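
One way to picture a dense numbering is as a table computed from a bit mask of general-purpose registers. In this sketch the mask value, the register count, and the use of -1 for "not a GP register" are assumptions for the example.

```go
package main

import "fmt"

// denseNumbers assigns consecutive indexes 0, 1, 2, ... to the registers whose
// bits are set in gpMask, in increasing register order. Registers not in the
// mask get -1. numRegs must be at most 64.
func denseNumbers(gpMask uint64, numRegs int) []int {
	dense := make([]int, numRegs)
	next := 0
	for r := 0; r < numRegs; r++ {
		if gpMask&(1<<uint(r)) != 0 {
			dense[r] = next
			next++
		} else {
			dense[r] = -1
		}
	}
	return dense
}

func main() {
	// Hypothetical machine with 8 registers, of which 0, 2, 3, and 5 may
	// hold pointers.
	fmt.Println(denseNumbers(0x2D, 8))
}
```

With such a table, a register map needs only as many bits as there are pointer-capable registers, rather than one bit per machine register.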

For #24543.

Change-Id: If6f0521effca5eca4d17895468b1fc52d67e0f32
Reviewed-on: https://go-review.googlesource.com/109351
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>

gopherbot pushed a commit that referenced this issue May 22, 2018

cmd/compile: compute register liveness maps
This extends the liveness analysis to track registers containing live
pointers. We do this by tracking bitmaps for live pointer registers
in parallel with bitmaps for stack variables.

This does not yet do anything with these liveness maps, though they do
appear in the debug output for -live=2.

We'll optimize this in later CLs:

name        old time/op       new time/op       delta
Template          193ms ± 5%        195ms ± 2%    ~     (p=0.050 n=9+9)
Unicode          97.7ms ± 2%       98.4ms ± 2%    ~     (p=0.315 n=9+10)
GoTypes           674ms ± 2%        685ms ± 1%  +1.72%  (p=0.001 n=9+9)
Compiler          3.21s ± 1%        3.28s ± 1%  +2.28%  (p=0.000 n=10+9)
SSA               7.70s ± 1%        7.79s ± 1%  +1.07%  (p=0.015 n=10+10)
Flate             130ms ± 3%        133ms ± 2%  +2.19%  (p=0.003 n=10+10)
GoParser          159ms ± 3%        161ms ± 2%  +1.51%  (p=0.019 n=10+10)
Reflect           444ms ± 1%        450ms ± 1%  +1.43%  (p=0.000 n=9+10)
Tar               181ms ± 2%        183ms ± 2%  +1.45%  (p=0.010 n=10+9)
XML               230ms ± 1%        234ms ± 1%  +1.56%  (p=0.000 n=8+9)
[Geo mean]        405ms             411ms       +1.48%

No effect on binary size because we're not yet emitting the register
maps.

For #24543.

Change-Id: Ieb022f0aea89c0ea9a6f035195bce2f0e67dbae4
Reviewed-on: https://go-review.googlesource.com/109352
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>

gopherbot pushed a commit that referenced this issue May 22, 2018

cmd/compile, cmd/internal/obj: record register maps in binary
This adds FUNCDATA and PCDATA that records the register maps much like
the existing live arguments maps and live locals maps. The register
map is indexed independently from the argument and locals maps since
changes in register liveness tend not to correlate with changes to
argument and local liveness.

This is the final CL toward adding safe-points everywhere. The
following CLs will optimize liveness analysis to bring down the cost.
The effect of this CL is:

name        old time/op       new time/op       delta
Template          195ms ± 2%        197ms ± 1%    ~     (p=0.136 n=9+9)
Unicode          98.4ms ± 2%       99.7ms ± 1%  +1.39%  (p=0.004 n=10+10)
GoTypes           685ms ± 1%        700ms ± 1%  +2.06%  (p=0.000 n=9+9)
Compiler          3.28s ± 1%        3.34s ± 0%  +1.71%  (p=0.000 n=9+8)
SSA               7.79s ± 1%        7.91s ± 1%  +1.55%  (p=0.000 n=10+9)
Flate             133ms ± 2%        133ms ± 2%    ~     (p=0.190 n=10+10)
GoParser          161ms ± 2%        164ms ± 3%  +1.83%  (p=0.015 n=10+10)
Reflect           450ms ± 1%        457ms ± 1%  +1.62%  (p=0.000 n=10+10)
Tar               183ms ± 2%        185ms ± 1%  +0.91%  (p=0.008 n=9+10)
XML               234ms ± 1%        238ms ± 1%  +1.60%  (p=0.000 n=9+9)
[Geo mean]        411ms             417ms       +1.40%

name        old exe-bytes     new exe-bytes     delta
HelloSize         1.47M ± 0%        1.51M ± 0%  +2.79%  (p=0.000 n=10+10)

Compared to just before "cmd/internal/obj: consolidate emitting entry
stack map", the cumulative effect of adding stack maps everywhere and
register maps is:

name        old time/op       new time/op       delta
Template          185ms ± 2%        197ms ± 1%   +6.42%  (p=0.000 n=10+9)
Unicode          96.3ms ± 3%       99.7ms ± 1%   +3.60%  (p=0.000 n=10+10)
GoTypes           658ms ± 0%        700ms ± 1%   +6.37%  (p=0.000 n=10+9)
Compiler          3.14s ± 1%        3.34s ± 0%   +6.53%  (p=0.000 n=9+8)
SSA               7.41s ± 2%        7.91s ± 1%   +6.71%  (p=0.000 n=9+9)
Flate             126ms ± 1%        133ms ± 2%   +6.15%  (p=0.000 n=10+10)
GoParser          153ms ± 1%        164ms ± 3%   +6.89%  (p=0.000 n=10+10)
Reflect           437ms ± 1%        457ms ± 1%   +4.59%  (p=0.000 n=10+10)
Tar               178ms ± 1%        185ms ± 1%   +4.18%  (p=0.000 n=10+10)
XML               223ms ± 1%        238ms ± 1%   +6.39%  (p=0.000 n=10+9)
[Geo mean]        394ms             417ms        +5.78%

name        old alloc/op      new alloc/op      delta
Template         34.5MB ± 0%       38.0MB ± 0%  +10.19%  (p=0.000 n=10+10)
Unicode          29.3MB ± 0%       30.3MB ± 0%   +3.56%  (p=0.000 n=8+9)
GoTypes           113MB ± 0%        125MB ± 0%  +10.89%  (p=0.000 n=10+10)
Compiler          510MB ± 0%        575MB ± 0%  +12.79%  (p=0.000 n=10+10)
SSA              1.46GB ± 0%       1.64GB ± 0%  +12.40%  (p=0.000 n=10+10)
Flate            23.9MB ± 0%       25.9MB ± 0%   +8.56%  (p=0.000 n=10+10)
GoParser         28.0MB ± 0%       30.8MB ± 0%  +10.08%  (p=0.000 n=10+10)
Reflect          77.6MB ± 0%       84.3MB ± 0%   +8.63%  (p=0.000 n=10+10)
Tar              34.1MB ± 0%       37.0MB ± 0%   +8.44%  (p=0.000 n=10+10)
XML              42.7MB ± 0%       47.2MB ± 0%  +10.75%  (p=0.000 n=10+10)
[Geo mean]       76.0MB            83.3MB        +9.60%

name        old allocs/op     new allocs/op     delta
Template           321k ± 0%         337k ± 0%   +4.98%  (p=0.000 n=10+10)
Unicode            337k ± 0%         340k ± 0%   +1.04%  (p=0.000 n=10+9)
GoTypes           1.13M ± 0%        1.18M ± 0%   +4.85%  (p=0.000 n=10+10)
Compiler          4.67M ± 0%        4.96M ± 0%   +6.25%  (p=0.000 n=10+10)
SSA               11.7M ± 0%        12.3M ± 0%   +5.69%  (p=0.000 n=10+10)
Flate              216k ± 0%         226k ± 0%   +4.52%  (p=0.000 n=10+9)
GoParser           271k ± 0%         283k ± 0%   +4.52%  (p=0.000 n=10+10)
Reflect            927k ± 0%         972k ± 0%   +4.78%  (p=0.000 n=10+10)
Tar                318k ± 0%         333k ± 0%   +4.56%  (p=0.000 n=10+10)
XML                376k ± 0%         395k ± 0%   +5.04%  (p=0.000 n=10+10)
[Geo mean]         730k              764k        +4.61%

name        old exe-bytes     new exe-bytes     delta
HelloSize         1.46M ± 0%        1.51M ± 0%   +3.66%  (p=0.000 n=10+10)

For #24543.

Change-Id: I91e003dc64151916b384274884bf02a2d6862547
Reviewed-on: https://go-review.googlesource.com/109353
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>

gopherbot pushed a commit that referenced this issue May 22, 2018

cmd/compile: single pass over Blocks in Liveness.epilogue
Currently Liveness.epilogue makes three passes over the Blocks, but
there's no need to do this. Combine them into a single pass. This
eliminates the need for blockEffects.lastbitmapindex, but, more
importantly, will let us incrementally compact the liveness bitmaps
and significantly reduce allocations in Liveness.epilogue.

Passes toolstash -cmp.

Updates #24543.

Change-Id: I27802bcd00d23aa122a7ec16cdfd739ae12dd7aa
Reviewed-on: https://go-review.googlesource.com/110175
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: David Chase <drchase@google.com>

gopherbot pushed a commit that referenced this issue May 22, 2018

cmd/compile: abstract bvec sets
This moves the bvec hash table logic out of Liveness.compact and into
a bvecSet type. Furthermore, the bvecSet type has the ability to grow
dynamically, which the current implementation doesn't. In addition to
making the code cleaner, this will make it possible to incrementally
compact liveness bitmaps.
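
As an illustration of the idea (not the compiler's actual code), a growable hash set of bit vectors can be sketched as below. The names `bitvec`, `vecSet`, `add`, and `grow` are hypothetical stand-ins, and the open-addressing scheme and FNV-1a hash are assumptions of this sketch.

```go
package main

import "fmt"

// bitvec is a hypothetical stand-in for the compiler's bvec type:
// a bit vector stored as 32-bit words.
type bitvec []uint32

// equal reports whether a and b hold the same words.
func equal(a, b bitvec) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// hash computes FNV-1a over the bytes of b.
func hash(b bitvec) uint32 {
	h := uint32(2166136261)
	for _, w := range b {
		for i := 0; i < 4; i++ {
			h ^= w & 0xff
			h *= 16777619
			w >>= 8
		}
	}
	return h
}

// vecSet maps each distinct bit vector to a dense index. slots is an
// open-addressed hash table of indexes into uniq; -1 marks an empty slot.
type vecSet struct {
	slots []int
	uniq  []bitvec
}

func newVecSet() *vecSet {
	s := &vecSet{slots: make([]int, 8)}
	for i := range s.slots {
		s.slots[i] = -1
	}
	return s
}

// grow doubles the table and rehashes every known vector.
func (s *vecSet) grow() {
	slots := make([]int, 2*len(s.slots))
	for i := range slots {
		slots[i] = -1
	}
	for i, bv := range s.uniq {
		h := int(hash(bv) % uint32(len(slots)))
		for slots[h] >= 0 {
			h = (h + 1) % len(slots)
		}
		slots[h] = i
	}
	s.slots = slots
}

// add returns the index of bv, inserting it if it has not been seen.
// The table is kept at most half full, so probing always terminates.
func (s *vecSet) add(bv bitvec) int {
	if 2*len(s.uniq) >= len(s.slots) {
		s.grow()
	}
	h := int(hash(bv) % uint32(len(s.slots)))
	for {
		i := s.slots[h]
		if i < 0 {
			s.slots[h] = len(s.uniq)
			s.uniq = append(s.uniq, bv)
			return len(s.uniq) - 1
		}
		if equal(s.uniq[i], bv) {
			return i
		}
		h = (h + 1) % len(s.slots)
	}
}

func main() {
	s := newVecSet()
	fmt.Println(s.add(bitvec{1}), s.add(bitvec{2}), s.add(bitvec{1})) // prints: 0 1 0
}
```

Because the set can grow on demand, callers can add vectors one block at a time without knowing the final count up front.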

Passes toolstash -cmp

Updates #24543.

Change-Id: I46c53e504494206061a1f790ae4a02d768a65681
Reviewed-on: https://go-review.googlesource.com/110176
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>

gopherbot pushed a commit that referenced this issue May 22, 2018

cmd/compile: incrementally compact liveness maps
The per-Value slice of liveness maps is currently one of the largest
sources of allocation in the compiler. On cmd/compile/internal/ssa,
it's 5% of overall allocation, or 75MB in total. Enabling liveness
maps everywhere significantly increased this allocation footprint,
which in turn slowed down the compiler.

Improve this by compacting the liveness maps after every block is
processed. There are typically very few distinct liveness maps, so
compacting the maps after every block, rather than at the end of the
function, can significantly reduce these allocations.
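
A minimal sketch of per-block compaction, under the assumption that each liveness map can be treated as a byte slice and deduplicated through an ordinary Go map. `interner` and `compactPerBlock` are hypothetical names for this illustration and do not correspond to the compiler's code.

```go
package main

import "fmt"

// interner deduplicates bitmaps, handing out one small index per
// distinct bitmap (a simplified stand-in for a bit vector set).
type interner struct {
	index map[string]int
	uniq  []string
}

func (in *interner) intern(bm []byte) int {
	if i, ok := in.index[string(bm)]; ok {
		return i
	}
	i := len(in.uniq)
	in.uniq = append(in.uniq, string(bm))
	in.index[in.uniq[i]] = i
	return i
}

// compactPerBlock replaces every bitmap with its index as soon as the
// enclosing block is finished. Only one block's worth of uncompacted
// bitmaps (pending) is held at a time, and that buffer is reused.
func compactPerBlock(blocks [][][]byte) (indexes []int, uniq []string) {
	in := &interner{index: make(map[string]int)}
	var pending [][]byte
	for _, block := range blocks {
		pending = pending[:0]
		pending = append(pending, block...) // stand-in for computing this block's maps
		for _, bm := range pending {
			indexes = append(indexes, in.intern(bm))
		}
	}
	return indexes, in.uniq
}

func main() {
	blocks := [][][]byte{
		{{0x01}, {0x03}, {0x01}},
		{{0x03}, {0x02}},
	}
	indexes, uniq := compactPerBlock(blocks)
	fmt.Println(indexes, len(uniq)) // prints: [0 1 0 1 2] 3
}
```

With few distinct maps, the long-lived state is just one integer per value plus the small set of unique bitmaps.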

Passes toolstash -cmp.

name        old time/op       new time/op       delta
Template          198ms ± 2%        196ms ± 1%  -1.11%  (p=0.008 n=9+10)
Unicode           100ms ± 1%         99ms ± 1%  -0.94%  (p=0.015 n=8+9)
GoTypes           703ms ± 2%        695ms ± 1%  -1.15%  (p=0.000 n=10+10)
Compiler          3.38s ± 3%        3.33s ± 0%  -1.66%  (p=0.000 n=10+9)
SSA               7.96s ± 1%        7.93s ± 1%    ~     (p=0.113 n=9+10)
Flate             134ms ± 1%        132ms ± 1%  -1.30%  (p=0.000 n=8+10)
GoParser          165ms ± 2%        163ms ± 1%  -1.32%  (p=0.013 n=9+10)
Reflect           462ms ± 2%        459ms ± 0%  -0.65%  (p=0.036 n=9+8)
Tar               188ms ± 2%        186ms ± 1%    ~     (p=0.173 n=8+10)
XML               243ms ± 7%        239ms ± 1%    ~     (p=0.684 n=10+10)
[Geo mean]        421ms             416ms       -1.10%

name        old alloc/op      new alloc/op      delta
Template         38.0MB ± 0%       36.5MB ± 0%  -3.98%  (p=0.000 n=10+10)
Unicode          30.3MB ± 0%       29.6MB ± 0%  -2.21%  (p=0.000 n=10+10)
GoTypes           125MB ± 0%        120MB ± 0%  -4.51%  (p=0.000 n=10+9)
Compiler          575MB ± 0%        546MB ± 0%  -5.06%  (p=0.000 n=10+10)
SSA              1.64GB ± 0%       1.55GB ± 0%  -4.97%  (p=0.000 n=10+10)
Flate            25.9MB ± 0%       25.0MB ± 0%  -3.41%  (p=0.000 n=10+10)
GoParser         30.7MB ± 0%       29.5MB ± 0%  -3.97%  (p=0.000 n=10+10)
Reflect          84.1MB ± 0%       81.9MB ± 0%  -2.64%  (p=0.000 n=10+10)
Tar              37.0MB ± 0%       35.8MB ± 0%  -3.27%  (p=0.000 n=10+9)
XML              47.2MB ± 0%       45.0MB ± 0%  -4.57%  (p=0.000 n=10+10)
[Geo mean]       83.2MB            79.9MB       -3.86%

name        old allocs/op     new allocs/op     delta
Template           337k ± 0%         337k ± 0%  -0.06%  (p=0.000 n=10+10)
Unicode            340k ± 0%         340k ± 0%  -0.01%  (p=0.014 n=10+10)
GoTypes           1.18M ± 0%        1.18M ± 0%  -0.04%  (p=0.000 n=10+10)
Compiler          4.97M ± 0%        4.97M ± 0%  -0.03%  (p=0.000 n=10+10)
SSA               12.3M ± 0%        12.3M ± 0%  -0.01%  (p=0.000 n=10+10)
Flate              226k ± 0%         225k ± 0%  -0.09%  (p=0.000 n=10+10)
GoParser           283k ± 0%         283k ± 0%  -0.06%  (p=0.000 n=10+9)
Reflect            972k ± 0%         971k ± 0%  -0.04%  (p=0.000 n=10+8)
Tar                333k ± 0%         332k ± 0%  -0.05%  (p=0.000 n=10+9)
XML                395k ± 0%         395k ± 0%  -0.04%  (p=0.000 n=10+10)
[Geo mean]         764k              764k       -0.04%

Updates #24543.

Change-Id: I6fdc46e4ddb6a8eea95d38242345205eb8397f0b
Reviewed-on: https://go-review.googlesource.com/110177
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>

gopherbot pushed a commit that referenced this issue May 22, 2018

cmd/compile: make LivenessMap dense
Currently liveness information is kept in a map keyed by *ssa.Value.
This made sense when liveness information was sparse, but now we have
liveness for nearly every ssa.Value. There's a fair amount of memory
and CPU overhead to this map now.

This CL replaces this map with a slice indexed by value ID.
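
The difference between the two representations can be sketched as follows. This is an illustration under assumed names: `value`, `livenessIndex`, `denseMap`, and `noIndex` are hypothetical and only mimic the shape of the change described above.

```go
package main

import "fmt"

// value is a hypothetical stand-in for *ssa.Value; only the ID matters here.
type value struct{ ID int }

// livenessIndex is a hypothetical pair of map indexes for one value.
type livenessIndex struct{ stackMap, regMap int }

// noIndex marks a value that has no liveness information.
var noIndex = livenessIndex{-1, -1}

// denseMap stores one entry per value ID in a flat slice, so a lookup is
// a bounds check and an index instead of a hash of a pointer.
type denseMap struct{ byID []livenessIndex }

func newDenseMap(numValues int) *denseMap {
	m := &denseMap{byID: make([]livenessIndex, numValues)}
	for i := range m.byID {
		m.byID[i] = noIndex
	}
	return m
}

func (m *denseMap) set(v *value, idx livenessIndex) { m.byID[v.ID] = idx }

func (m *denseMap) get(v *value) livenessIndex {
	if v.ID < len(m.byID) {
		return m.byID[v.ID]
	}
	return noIndex
}

func main() {
	vals := []*value{{ID: 0}, {ID: 1}, {ID: 2}}

	// Sparse form: a hash map keyed by pointer.
	sparse := map[*value]livenessIndex{vals[1]: {stackMap: 4, regMap: 0}}

	// Dense form: a slice indexed by value ID.
	dense := newDenseMap(len(vals))
	dense.set(vals[1], livenessIndex{stackMap: 4, regMap: 0})

	fmt.Println(sparse[vals[1]] == dense.get(vals[1])) // prints: true
	fmt.Println(dense.get(vals[2]) == noIndex)         // prints: true
}
```

The dense form pays for one slot per value ID whether or not it is used, which is a good trade only once nearly every value has an entry.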

Passes toolstash -cmp.

name        old time/op       new time/op       delta
Template          197ms ± 1%        194ms ± 1%  -1.60%  (p=0.000 n=9+10)
Unicode           100ms ± 2%         99ms ± 1%  -1.31%  (p=0.012 n=8+10)
GoTypes           695ms ± 1%        689ms ± 0%  -0.94%  (p=0.000 n=10+10)
Compiler          3.34s ± 2%        3.29s ± 1%  -1.26%  (p=0.000 n=10+9)
SSA               8.08s ± 0%        8.02s ± 2%  -0.70%  (p=0.034 n=8+10)
Flate             133ms ± 1%        131ms ± 1%  -1.04%  (p=0.006 n=10+9)
GoParser          163ms ± 1%        162ms ± 1%  -0.79%  (p=0.034 n=8+10)
Reflect           459ms ± 1%        454ms ± 0%  -1.06%  (p=0.000 n=10+8)
Tar               186ms ± 1%        185ms ± 1%  -0.87%  (p=0.003 n=9+9)
XML               238ms ± 1%        235ms ± 1%  -1.01%  (p=0.004 n=8+9)
[Geo mean]        418ms             414ms       -1.06%

name        old alloc/op      new alloc/op      delta
Template         36.4MB ± 0%       35.6MB ± 0%  -2.29%  (p=0.000 n=9+10)
Unicode          29.7MB ± 0%       29.5MB ± 0%  -0.68%  (p=0.000 n=10+10)
GoTypes           119MB ± 0%        117MB ± 0%  -2.30%  (p=0.000 n=9+9)
Compiler          546MB ± 0%        532MB ± 0%  -2.47%  (p=0.000 n=10+10)
SSA              1.59GB ± 0%       1.55GB ± 0%  -2.41%  (p=0.000 n=10+10)
Flate            24.9MB ± 0%       24.5MB ± 0%  -1.77%  (p=0.000 n=8+10)
GoParser         29.5MB ± 0%       28.7MB ± 0%  -2.60%  (p=0.000 n=9+10)
Reflect          81.7MB ± 0%       80.5MB ± 0%  -1.49%  (p=0.000 n=10+10)
Tar              35.7MB ± 0%       35.1MB ± 0%  -1.64%  (p=0.000 n=10+10)
XML              45.0MB ± 0%       43.7MB ± 0%  -2.76%  (p=0.000 n=9+10)
[Geo mean]       80.1MB            78.4MB       -2.04%

name        old allocs/op     new allocs/op     delta
Template           336k ± 0%         335k ± 0%  -0.31%  (p=0.000 n=9+10)
Unicode            339k ± 0%         339k ± 0%  -0.05%  (p=0.000 n=10+10)
GoTypes           1.18M ± 0%        1.18M ± 0%  -0.26%  (p=0.000 n=10+10)
Compiler          4.96M ± 0%        4.94M ± 0%  -0.24%  (p=0.000 n=10+10)
SSA               12.6M ± 0%        12.5M ± 0%  -0.30%  (p=0.000 n=10+10)
Flate              224k ± 0%         223k ± 0%  -0.30%  (p=0.000 n=10+10)
GoParser           282k ± 0%         281k ± 0%  -0.32%  (p=0.000 n=10+10)
Reflect            965k ± 0%         963k ± 0%  -0.27%  (p=0.000 n=9+10)
Tar                331k ± 0%         330k ± 0%  -0.27%  (p=0.000 n=10+10)
XML                393k ± 0%         392k ± 0%  -0.26%  (p=0.000 n=10+10)
[Geo mean]         763k              761k       -0.26%

Updates #24543.

Change-Id: I4cfd2461510d3c026a262760bca225dc37482341
Reviewed-on: https://go-review.googlesource.com/110178
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>

gopherbot pushed a commit that referenced this issue May 22, 2018

cmd/compile: reuse liveness structures
Currently liveness analysis is a significant source of allocations in
the compiler. This CL mitigates this by moving the main sources of
allocation to the ssa.Cache, allowing them to be reused between
different liveness runs.
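
The general pattern of keeping scratch storage in a long-lived cache can be sketched like this. The `cache` type and its `scratch` method are hypothetical names for illustration, not the ssa.Cache API.

```go
package main

import "fmt"

// cache is a hypothetical long-lived object that outlives any single
// analysis run and owns the scratch storage those runs need.
type cache struct{ words []uint32 }

// scratch returns a zeroed slice of n words, reusing the backing array
// from earlier runs when it is large enough and allocating otherwise.
func (c *cache) scratch(n int) []uint32 {
	if cap(c.words) < n {
		c.words = make([]uint32, n)
	}
	s := c.words[:n]
	for i := range s {
		s[i] = 0 // clear state left behind by the previous run
	}
	return s
}

func main() {
	c := &cache{}

	first := c.scratch(8) // allocates
	first[0] = 7

	second := c.scratch(4) // reuses the same backing array, zeroed
	fmt.Println(&first[0] == &second[0], second[0]) // prints: true 0
}
```

The caller must not hold on to a slice from one run while starting the next, since both would alias the same memory.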

Passes toolstash -cmp.

name        old time/op       new time/op       delta
Template          194ms ± 1%        193ms ± 1%    ~     (p=0.156 n=10+9)
Unicode          99.1ms ± 1%       99.3ms ± 2%    ~     (p=0.853 n=10+10)
GoTypes           689ms ± 0%        687ms ± 0%  -0.27%  (p=0.022 n=10+9)
Compiler          3.29s ± 1%        3.30s ± 1%    ~     (p=0.489 n=9+9)
SSA               8.02s ± 2%        7.97s ± 1%  -0.71%  (p=0.011 n=10+10)
Flate             131ms ± 1%        130ms ± 1%  -0.59%  (p=0.043 n=9+10)
GoParser          162ms ± 1%        160ms ± 1%  -1.53%  (p=0.000 n=10+10)
Reflect           454ms ± 0%        454ms ± 0%    ~     (p=0.959 n=8+8)
Tar               185ms ± 1%        185ms ± 2%    ~     (p=0.905 n=9+10)
XML               235ms ± 1%        232ms ± 1%  -1.15%  (p=0.001 n=9+10)
[Geo mean]        414ms             412ms       -0.39%

name        old alloc/op      new alloc/op      delta
Template         35.6MB ± 0%       34.2MB ± 0%  -3.75%  (p=0.000 n=10+10)
Unicode          29.5MB ± 0%       29.4MB ± 0%  -0.26%  (p=0.000 n=10+9)
GoTypes           117MB ± 0%        112MB ± 0%  -3.78%  (p=0.000 n=9+10)
Compiler          532MB ± 0%        512MB ± 0%  -3.80%  (p=0.000 n=10+10)
SSA              1.55GB ± 0%       1.48GB ± 0%  -4.82%  (p=0.000 n=10+10)
Flate            24.5MB ± 0%       23.6MB ± 0%  -3.61%  (p=0.000 n=10+9)
GoParser         28.7MB ± 0%       27.7MB ± 0%  -3.43%  (p=0.000 n=10+10)
Reflect          80.5MB ± 0%       78.1MB ± 0%  -2.96%  (p=0.000 n=10+10)
Tar              35.1MB ± 0%       33.9MB ± 0%  -3.49%  (p=0.000 n=10+10)
XML              43.7MB ± 0%       42.4MB ± 0%  -3.05%  (p=0.000 n=10+10)
[Geo mean]       78.4MB            75.8MB       -3.30%

name        old allocs/op     new allocs/op     delta
Template           335k ± 0%         335k ± 0%  -0.12%  (p=0.000 n=10+10)
Unicode            339k ± 0%         339k ± 0%  -0.01%  (p=0.001 n=10+10)
GoTypes           1.18M ± 0%        1.17M ± 0%  -0.12%  (p=0.000 n=10+10)
Compiler          4.94M ± 0%        4.94M ± 0%  -0.06%  (p=0.000 n=10+10)
SSA               12.5M ± 0%        12.5M ± 0%  -0.07%  (p=0.000 n=10+10)
Flate              223k ± 0%         223k ± 0%  -0.11%  (p=0.000 n=10+10)
GoParser           281k ± 0%         281k ± 0%  -0.08%  (p=0.000 n=10+10)
Reflect            963k ± 0%         960k ± 0%  -0.23%  (p=0.000 n=10+9)
Tar                330k ± 0%         330k ± 0%  -0.12%  (p=0.000 n=10+10)
XML                392k ± 0%         392k ± 0%  -0.08%  (p=0.000 n=10+10)
[Geo mean]         761k              760k       -0.10%

Compared to just before "cmd/internal/obj: consolidate emitting entry
stack map", the cumulative effect of adding stack maps everywhere and
register maps, plus these optimizations, is:

name        old time/op       new time/op       delta
Template          186ms ± 1%        194ms ± 1%  +4.41%  (p=0.000 n=9+10)
Unicode          96.5ms ± 1%       99.1ms ± 1%  +2.76%  (p=0.000 n=9+10)
GoTypes           659ms ± 1%        689ms ± 0%  +4.56%  (p=0.000 n=9+10)
Compiler          3.14s ± 2%        3.29s ± 1%  +4.95%  (p=0.000 n=9+9)
SSA               7.68s ± 3%        8.02s ± 2%  +4.41%  (p=0.000 n=10+10)
Flate             126ms ± 0%        131ms ± 1%  +4.14%  (p=0.000 n=10+9)
GoParser          153ms ± 1%        162ms ± 1%  +5.90%  (p=0.000 n=10+10)
Reflect           436ms ± 1%        454ms ± 0%  +4.14%  (p=0.000 n=10+8)
Tar               177ms ± 1%        185ms ± 1%  +4.28%  (p=0.000 n=8+9)
XML               224ms ± 1%        235ms ± 1%  +5.23%  (p=0.000 n=10+9)
[Geo mean]        396ms             414ms       +4.47%

name        old alloc/op      new alloc/op      delta
Template         34.5MB ± 0%       35.6MB ± 0%  +3.24%  (p=0.000 n=10+10)
Unicode          29.3MB ± 0%       29.5MB ± 0%  +0.51%  (p=0.000 n=9+10)
GoTypes           113MB ± 0%        117MB ± 0%  +3.31%  (p=0.000 n=8+9)
Compiler          509MB ± 0%        532MB ± 0%  +4.46%  (p=0.000 n=10+10)
SSA              1.49GB ± 0%       1.55GB ± 0%  +4.10%  (p=0.000 n=10+10)
Flate            23.8MB ± 0%       24.5MB ± 0%  +2.92%  (p=0.000 n=10+10)
GoParser         27.9MB ± 0%       28.7MB ± 0%  +2.88%  (p=0.000 n=10+10)
Reflect          77.4MB ± 0%       80.5MB ± 0%  +4.01%  (p=0.000 n=10+10)
Tar              34.1MB ± 0%       35.1MB ± 0%  +3.12%  (p=0.000 n=10+10)
XML              42.6MB ± 0%       43.7MB ± 0%  +2.65%  (p=0.000 n=10+10)
[Geo mean]       76.1MB            78.4MB       +3.11%

name        old allocs/op     new allocs/op     delta
Template           320k ± 0%         335k ± 0%  +4.60%  (p=0.000 n=10+10)
Unicode            336k ± 0%         339k ± 0%  +0.96%  (p=0.000 n=9+10)
GoTypes           1.12M ± 0%        1.18M ± 0%  +4.55%  (p=0.000 n=10+10)
Compiler          4.66M ± 0%        4.94M ± 0%  +6.18%  (p=0.000 n=10+10)
SSA               11.9M ± 0%        12.5M ± 0%  +5.37%  (p=0.000 n=10+10)
Flate              214k ± 0%         223k ± 0%  +4.15%  (p=0.000 n=9+10)
GoParser           270k ± 0%         281k ± 0%  +4.15%  (p=0.000 n=10+10)
Reflect            921k ± 0%         963k ± 0%  +4.49%  (p=0.000 n=10+10)
Tar                317k ± 0%         330k ± 0%  +4.25%  (p=0.000 n=10+10)
XML                375k ± 0%         392k ± 0%  +4.75%  (p=0.000 n=10+10)
[Geo mean]         729k              761k       +4.34%

Updates #24543.

Change-Id: Ia951fdb3c17ae1c156e1d05fc42e69caba33c91a
Reviewed-on: https://go-review.googlesource.com/110179
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>

wsc1 commented Sep 30, 2018

Hi all, a question about this design. First, I've only read it at a high level and haven't followed the use cases where cooperative preemption is problematic.

But the big picture question I have is: what about code which depends on cooperative preemption?
I don't think we can know the extent of it.

But I can give an example, from a project I am working on, where non-cooperative preemption might be problematic. The context is communication with an OS/C thread via atomics rather than cgo, where timing is very sensitive: missed real-time deadlines render the output useless (audio I/O).

If some code currently controls preemption by spinning and calling runtime.Gosched(), then it seems to me this proposal will break that code, because it will introduce preemption and hence delay the very thing that is programmed not to be preempted.
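
For readers unfamiliar with the pattern, a minimal sketch of spinning on an atomic flag with explicit yields might look like the following. It is an illustration only, not code from the project mentioned above; `waitForFlag` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// waitForFlag spins until *flag becomes nonzero. The only point at which
// this loop explicitly offers to give up the processor is the call to
// runtime.Gosched. It returns the number of times it yielded.
func waitForFlag(flag *int32) int {
	yields := 0
	for atomic.LoadInt32(flag) == 0 {
		yields++
		runtime.Gosched()
	}
	return yields
}

func main() {
	var flag int32
	done := make(chan int)

	go func() { done <- waitForFlag(&flag) }()

	// In the scenario described above the flag would be set by another
	// thread through shared memory; here a goroutine-visible store stands in.
	atomic.StoreInt32(&flag, 1)

	fmt.Println("yielded", <-done, "times before seeing the flag")
}
```

The number of yields depends on scheduling and varies from run to run.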

Again, there is no way for us to know how much such code there is, and without an assessment of that, it seems to me this proposal risks entering a game of whack-a-mole w.r.t. Go scheduling, where you solve one problem and another pops up as a result.

Please don't take away programmer control of pre-emption.

Last question: what good could runtime.Gosched() possibly serve with per-instruction pre-emption?

Again, sorry I don't know the details of this proposal, but that might be the case with a lot of other code that uses runtime.Gosched() under the assumption of cooperative pre-emption.

networkimprov commented Oct 25, 2018

@wsc1 can you provide a code example which breaks without cooperative scheduling?

wsc1 commented Oct 25, 2018

@networkimprov please see #10958 for the status and rationale of the problem, and for the problems related to testing for it via the cloud.

Contributor

crvv commented Oct 26, 2018

@wsc1
No one said the test program should run via cloud.

wsc1 commented Oct 26, 2018

> @wsc1
> No one said the test program should run via cloud.

@crvv testing for audio latency violations is hard. Results aren't expected to agree except on the same hardware under the same operating conditions, or under OS simulation, and the scheduler's random seed can come into play. Re-creating those conditions to placate the interests of preemptive scheduling is not my job, although I'm happy to help along the way. No one said what hardware or OS operating conditions to use, either.

There are also myriad reasons, documented there by audio processing professionals, why preemption in a real-time audio processing loop would cause problems. The Go runtime uses locks, and memory management can lock the system. Both are widely accepted causes of glitches in real-time audio, because they can take longer than the wall-clock time allocated to a low-latency application. It is also widely accepted that the glitches "will eventually" happen, meaning that it is very hard to create a test that triggers them on demand, because that depends on the state of the whole system (OS, hardware, Go runtime).

I don't find it credible to quote out of context, against the grain of best practice in the field. You could also give a reason why you want a test on the GitHub issue tracker. Do you believe that the worst-case wall-clock time of preemption work inserted into a critical segment of real-time audio processing code shouldn't cause problems? Why?

To me, the burden of proof lies there. On my end, I'll continue to provide what info and alternatives I can to help inform the discussion.
