runtime: tight loops should be preemptible #10958

Open
aclements opened this Issue May 26, 2015 · 40 comments

@aclements
Member

Currently goroutines are only preemptible at function call points. Hence, it's possible to write a tight loop (e.g., a numerical kernel or a spin on an atomic) with no calls or allocation that arbitrarily delays preemption. This can result in arbitrarily long pause times as the GC waits for all goroutines to stop.
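
For concreteness, here is a minimal hypothetical program in that spirit (a spin on an atomic with no calls or allocation); on a toolchain without preemption points in loops it never reaches the final print, because the spinning goroutine can never be stopped once it owns the only P:

    package main

    import (
        "fmt"
        "runtime"
        "sync/atomic"
    )

    var done uint32

    func main() {
        runtime.GOMAXPROCS(1)
        go func() {
            // Tight loop: the atomic load is an intrinsic, so there are no
            // function calls and hence no preemption points.
            for atomic.LoadUint32(&done) == 0 {
            }
        }()
        runtime.Gosched() // hand the only P to the spinning goroutine
        atomic.StoreUint32(&done, 1)
        fmt.Println("unreachable without loop preemption")
    }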

In unusual situations, this can even lead to deadlock when trying to stop the world. For example, the runtime's TestGoroutineParallelism tries to prevent GC during the test to avoid deadlock. It runs several goroutines in tight loops that communicate through a shared atomic variable. If the coordinator that starts these is paused part way through, it will deadlock.

One possible fix is to insert preemption points on control flow loops that otherwise have no preemption points. Depending on the cost of this check, it may also require unrolling such loops to amortize the overhead of the check.

This has been a longstanding issue, so I don't think it's necessary to fix it for 1.5. It can cause the 1.5 GC to miss its 10ms STW goal, but code like numerical kernels is probably not latency-sensitive. And as far as I can tell, causing a deadlock like the runtime test can do requires looking for trouble.

@RLH

@aclements aclements added this to the Go1.6 milestone May 26, 2015
@minux
Member
minux commented May 26, 2015
@randall77
Contributor

The problem with interrupts is that the goroutine can be at an arbitrary address. The interrupted address is probably at a place where we have no data about where the pointers are. We'd either have to record a lot more frame layout/type/liveness information (including registers), or we'd have to simulate the goroutine forward to a safepoint. Both are challenging.

@minux
Member
minux commented May 26, 2015
@ianlancetaylor
Contributor

It seems to me that we can preempt at a write barrier. So the only loops we are talking about are those that make no writes to the heap and make no function calls. If we think in terms of adding

    var b uint8
  loop:
    b++
    if b == 0 {
        preemptCheck()
    }

then the normal path through the loop will have two extra instructions (add/beq) where the add may be to either a register or a memory location, depending on overall register pressure. This will be measurable in tight loops but for most cases shouldn't be too bad.

@randall77
Contributor

No pointer writes to the heap.

But maybe this is enough? If we can preempt at the first write barrier, then even if we let the mutator run it can't modify anything that the GC cares about. So maybe we just set the write barrier enabled bit and let goroutines run. Once a goroutine sees the write barrier bit (how do we force that?) it can't modify any memory the GC cares about. So it is safe to start the GC without waiting for every goroutine to stop.

@aclements
Member

The problem of inserting artificial preemption points is reduced
numerical performance.

True. That's why I suggested unrolling loops, which AFAIK is a standard solution to this problem.

However, I think the check can be done in just two instructions, even without adding Ian's loop counter. Just CMP the preempt flag and branch if it's set. That branch will almost never be hit, so it will be highly predictable, and the preempt flag should be in the L1, so the check may in fact be extremely cheap.
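
Written out by hand, the shape of that check might look something like the sketch below; preemptRequested and preemptPoint are made-up stand-ins for whatever per-goroutine flag and safe-point mechanism the runtime would actually use:

    var preemptRequested bool // hypothetical flag the runtime would set

    func preemptPoint() {} // hypothetical: spill registers, yield at a safe point, reload

    func sum(a []float64) (s float64) {
        for _, x := range a {
            s += x
            // Inserted on the back-edge: one load+compare of the flag and a
            // branch that is almost never taken.
            if preemptRequested {
                preemptPoint()
            }
        }
        return s
    }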

No pointer writes to the heap. But maybe this is enough?

This certainly reduces the set of programs that have this problem, but I don't think it actually helps with either of the examples I gave, since numerical kernels probably won't have heap pointer writes, and the runtime test that can deadlock the GC certainly has no pointer writes.

@RLH
Contributor
RLH commented May 27, 2015

This is the reference for doing a GC safepoint at every instruction. It is not something we want to do for x86, much less for the various other architectures we need to support.
http://dl.acm.org/citation.cfm?id=301652

Most systems I know of do what Austin suggests: unroll the loop and insert a check. I have no numbers, but 8 unrolls is what I recall. Not only is the branch highly predictable, but the check does not introduce any dependencies, making it close to free on an out-of-order machine. There have been other approaches, such as code plugging and predicated branches, but HW has moved on. I had not seen Ian's suggestion.


@aclements aclements modified the milestone: Go1.6Early, Go1.6 Jun 30, 2015
@kingluo
kingluo commented Sep 9, 2015

@aclements,

Currently goroutines are only preemptible at function call points.

To be precise, the preemption check seems to happen only in newstack()?

For example,

package main

import (
    "fmt"
    "runtime"
    "time"
)

func test() {
    a := 100
    for i := 1; i < 1000; i++ {
        a = i*100/i + a
    }
}

func main() {
    runtime.GOMAXPROCS(1)
    go func() {
        for {
            test()
        }
    }()
    time.Sleep(100 * time.Millisecond)
    fmt.Println("hello world")
}

test() is not inlined, and the infinite loop calls test(), but it is not preempted at those calls because morestack() --> newstack() is never involved, so "hello world" is never printed.

@randall77
Contributor

test() is not inlined, but since it is a leaf it is promoted to nosplit, so it doesn't have the preemption check any more. Fixing that sounds much easier than the rest of this bug. Maybe we could forbid nosplit promotion for functions called from loops?

@griesemer
Contributor

For reference, the same problem appeared in the HotSpot JVM. I remember two approaches:

  1. Around 1997/1998, Rene Schmidt (http://jaoo.dk/aarhus2007/speaker/Rene+W.+Schmidt) implemented the following mechanism: Threads running a tight loop w/o function calls would receive a signal to temporarily suspend them. The runtime then dynamically generated a partial "copy" of the loop instructions the thread was running, from the current PC to the (unconditional) backward branch, except that the backward branch was replaced with a call into the runtime (leading to proper suspension at a safe point). The thread was then restarted with the pc modified such that it would run the newly generated piece of code. That code would run to the end of the loop body and then suspend itself at which point the code copy was discarded and the pc (return address) adjusted to continue with the original code (after the safe point).

This mechanism was ingenious but also insanely complicated. It was eventually abandoned in favor of:

  2. A simple test and branch (2 additional instructions) at the end of any loop body that didn't contain any calls. As far as I recall, we didn't do any form of loop unrolling, and the performance implications were not significant in the overall picture of a larger application.

My vote would be for 2) to start, in loops where @ianlancetaylor's suggestion doesn't work. I suspect that for all but the smallest long-running inner loops (where unrolling might make sense independently), the performance penalty is acceptable.

If this is not good enough, and we don't want to pay the code size cost of unrolling a loop multiple times, here's another idea based on 1): Instead of the backward branch check as in 2) plus unrolling, keep the original loop, and generate (at compile time) a 2nd version of the loop body that ends in a runtime call to suspend itself instead of the backward branch. The code size cost is about the cost of having the loop unrolled twice. When the goroutine needs to run to a safe point, use a signal to temporarily suspend the goroutine, switch the pc to the corresponding pc in the copy of the loop body, continue with execution there and have the code suspend itself at a safe point. There's no dynamic code generation involved, generating the extra loop body is trivial at compile-time, the extra amount of code is less than with loop unrolling, and there's only a little bit of runtime work to modify the pc. The regular code would run at full speed if no garbage collection is needed.

@nightlyone
Contributor

Another idea, similar to @ianlancetaylor's, is to estimate the cost (in ns) of the loop body and only check for required suspension every N iterations of that loop.

Once the compiler can unroll/unpeel, that check-every-N logic can be either inlined after the unrolled body, if unrolling is beneficial, or kept when unrolling makes no sense for that loop body.

Such logic also reads better when using debuggers.

@griesemer
Contributor

@nightlyone The check-every-N seems more complex than just having two extra instructions (compare and branch). And if unrolling is done, that compare-and-branch is already needed only every N loop iterations (where N is the number of times the loop was unrolled).

@nightlyone
Contributor

@griesemer I'm not sure how the test will avoid atomic loads; that's why I suggested the check-every-N.

So the pseudo-go-code before unrolling would look like this:

loop:   
        // loop-body        

        if counter > N {          
                counter = 0                       

                if need_stop = atomic.LoadBool(&runtime.getg().need_stop); need_stop {
                        runtime.Gosched()                                    
                }                                                               
        }                                                                                                                       
        counter++                                                                                                           
        goto loop

after unrolling it would look like

loop:   
        // loop-body0
        // loop-body1
        // ...
        // loop-bodyM

        if counter > N/M {          
                counter = 0                       

                if need_stop = atomic.LoadBool(&runtime.getg().need_stop); need_stop {
                        runtime.Gosched()                                    
                }                                                               
        }                                                                                                                       
        counter++                                                                                                           
        goto loop

So the code inserted would be a constant overhead, but still run every N loops.

@aclements
Member

The load doesn't have to be atomic. The current preemption check in the function prologue isn't atomic.

One tricky bit with the compare and branch is that the actual preemption point needs to have no live registers. Presumably, for performance reasons, we want the loop to be able to keep things in registers, so the code we branch to on a preempt has to flush any live registers before the preempt and reload them after the preempt. I don't think this will be particularly hard, but it is something to consider, since it might affect which stage of the compiler is responsible for generating this code.

@griesemer
Contributor

@aclements Indeed. Which is why perhaps switching the pc to a 2nd version of the loop body that ends in a safe point might not be much more complex and permit the loop to run at full speed in the normal case.

@aclements
Member

@aclements Indeed. Which is why perhaps switching the pc to a 2nd version of the loop body that ends in a safe point might not be much more complex and permit the loop to run at full speed in the normal case.

From the runtime's perspective, I think this would be more complicated because stealing a signal is a logistical pain and we'd have to deal with tables mapping from fast loop PCs to slow loop PCs. The compiler would have to generate these tables. This seems like a very clever plan B, but I think first we should try adding a no-op compare and branch and see if it's actually a problem for dense numerical kernels.

@randall77
Contributor

The new SSA register allocator handles situations like this well, keeping everything in registers on the common edges and spill/call/restore on the unlikely case.

@RLH
Contributor
RLH commented Sep 11, 2015

The load is not dependent on anything in the loop. Assuming the loop is tight, the load is almost certainly from a location already in the L1 cache, and the branch will be highly predictable. Just to make sure the branch predictor doesn't even need to be warmed up, we can make it a backward branch. I would be sort of surprised if, on an out-of-order machine, the cost were even noticeable. That said, build and measure is the only way to be sure.
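
In that build-and-measure spirit, here is a rough hand-written approximation one could benchmark (hypothetical; the compiler-inserted check would look different, but this gives a ballpark for the cost of the extra load-and-branch). It would live in a _test.go file:

    package kernel

    import "testing"

    var preemptFlag bool // stand-in for the per-goroutine flag the runtime would set
    var sink int

    func BenchmarkLoop(b *testing.B) {
        s := 0
        for i := 0; i < b.N; i++ {
            s += i
        }
        sink = s
    }

    func BenchmarkLoopWithCheck(b *testing.B) {
        s := 0
        for i := 0; i < b.N; i++ {
            s += i
            if preemptFlag { // emulates the inserted load-and-branch
                panic("never set in this benchmark")
            }
        }
        sink = s
    }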


@rsc rsc modified the milestone: Unplanned, Go1.6Early Nov 4, 2015
@gopherbot

CL https://golang.org/cl/19516 mentions this issue.

@gopherbot gopherbot pushed a commit that referenced this issue Feb 16, 2016
@aclements aclements runtime: fix deadlock in TestCrashDumpsAllThreads
TestCrashDumpsAllThreads carefully sets the number of Ps to one
greater than the number of non-preemptible loops it starts so that the
main goroutine can continue to run (necessary because of #10958).
However, if GC starts, it can take over that one spare P and lock up
the system while waiting for the non-preemptible loops, causing the
test to eventually time out. This deadlock is easily reproducible if
you run the runtime test with GOGC=1.

Fix this by forcing GOGC=off when running this test.

Change-Id: Ifb22da5ce33f9a61700a326ea92fcf4b049721d1
Reviewed-on: https://go-review.googlesource.com/19516
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
7c22af8
@yaxinlx
yaxinlx commented Jun 12, 2016

It would be great if there were a compiler flag to detect greedy goroutines.

@celer
celer commented Jun 13, 2016

We're rewriting our product from scratch in Go, and one of the more memorable failure modes of the prior implementation was an infinite loop in an open source library, causing lots of customer pain. So we'd love to see this addressed. (Sadly, this fix alone isn't enough, as you still ideally need a way to find goroutines that are spinning and stop them.)

So our use case is maintaining high availability in the face of a number of errantly looping goroutines.

@aclements aclements modified the milestone: Go1.8Early, Unplanned Jun 13, 2016
@aclements
Member

Moving to Go1.8Early since we've been getting more reports of this problem lately (though I think they're still all from either testing code or buggy code).

@yaxinlx, I'm not sure what you would want the compiler to report. Most code probably has lots of non-preemptible loops, but they're all trivially short and bounded, so I suspect any compiler output would be too noisy to be useful. I think it would be much better to just fix this issue than to put effort into complex tools for dropping it in users' laps.

@aclements
Member

I wanted to note an interesting case that hadn't occurred to me until recently: currently we omit the stack growth check from functions with very small frames (the exact value differs between arches), which means we also don't check for preemption in these functions. As a result, even a loop that contains a function call may be non-preemptible if the function it calls has a very small frame. That function call could even be a method call or a call through a function pointer, in which case the compiler has no way to soundly detect that it's calling a function without a preemption check when it's compiling the loop.
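
A hypothetical illustration of the method/function-pointer case: whether the loop below has a preemption point depends entirely on the callee behind f, which the compiler cannot see when compiling the loop; if f turns out to be a small leaf that gets promoted to nosplit, the loop has none.

    // The compiler compiling this loop cannot know whether the callee keeps
    // its stack-growth (and hence preemption) check in its prologue.
    func apply(f func(int) int, xs []int) {
        for i, x := range xs {
            xs[i] = f(x) // indirect call: may or may not be a preemption point
        }
    }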

Any fix for this issue has to deal with this case. I can think of a few possible solutions, and there may be others: 1) put preemption points in all function prologues or 2) insert preemption points on back-edges conservatively if the compiler can't prove that at least one call in the loop is preemptible. As an extension of 2, we could also 3) insert preemption points in all method prologues, which would increase the number of loops the compiler can prove already contain a preemption point.

@aclements aclements added the NeedsFix label Jun 13, 2016
@randall77 randall77 was assigned by aclements Jun 13, 2016
@quentinmit quentinmit modified the milestone: Go1.9Early, Go1.8Early Sep 29, 2016
@gopherbot

CL https://golang.org/cl/31766 mentions this issue.

@gopherbot gopherbot pushed a commit that referenced this issue Oct 28, 2016
@aclements aclements runtime: disable stack rescanning by default
With the hybrid barrier in place, we can now disable stack rescanning
by default. This commit adds a "gcrescanstacks" GODEBUG variable that
is off by default but can be set to re-enable STW stack rescanning.
The plan is to leave this off but available in Go 1.8 for debugging
and as a fallback.

With this change, worst-case mark termination time at GOMAXPROCS=12
*not* including time spent stopping the world (which is still
unbounded) is reliably under 100 µs, with a 95%ile around 50 µs in
every benchmark I tried (the go1 benchmarks, the x/benchmarks garbage
benchmark, and the gcbench activegs and rpc benchmarks). Including
time spent stopping the world usually adds about 20 µs to total STW
time at GOMAXPROCS=12, but I've seen it add around 150 µs in these
benchmarks when a goroutine takes time to reach a safe point (see
issue #10958) or when stopping the world races with goroutine
switches. At GOMAXPROCS=1, where this isn't an issue, worst case STW
is typically 30 µs.

The go-gcbench activegs benchmark is designed to stress large numbers
of dirty stacks. This commit reduces 95%ile STW time for 500k dirty
stacks by nearly three orders of magnitude, from 150ms to 195µs.

This has little effect on the throughput of the go1 benchmarks or the
x/benchmarks benchmarks.

name         old time/op  new time/op  delta
XGarbage-12  2.31ms ± 0%  2.32ms ± 1%  +0.28%  (p=0.001 n=17+16)
XJSON-12     12.4ms ± 0%  12.4ms ± 0%  +0.41%  (p=0.000 n=18+18)
XHTTP-12     11.8µs ± 0%  11.8µs ± 1%    ~     (p=0.492 n=20+18)

It reduces the tail latency of the x/benchmarks HTTP benchmark:

name      old p50-time  new p50-time  delta
XHTTP-12    489µs ± 0%    491µs ± 1%  +0.54%  (p=0.000 n=20+18)

name      old p95-time  new p95-time  delta
XHTTP-12    957µs ± 1%    960µs ± 1%  +0.28%  (p=0.002 n=20+17)

name      old p99-time  new p99-time  delta
XHTTP-12   1.76ms ± 1%   1.64ms ± 1%  -7.20%  (p=0.000 n=20+18)

Comparing to the beginning of the hybrid barrier implementation
("runtime: parallelize STW mcache flushing") shows that the hybrid
barrier trades a small performance impact for much better STW latency,
as expected. The magnitude of the performance impact is generally
small:

name                      old time/op    new time/op    delta
BinaryTree17-12              2.37s ± 1%     2.42s ± 1%  +2.04%  (p=0.000 n=19+18)
Fannkuch11-12                2.84s ± 0%     2.72s ± 0%  -4.00%  (p=0.000 n=19+19)
FmtFprintfEmpty-12          44.2ns ± 1%    45.2ns ± 1%  +2.20%  (p=0.000 n=17+19)
FmtFprintfString-12          130ns ± 1%     134ns ± 0%  +2.94%  (p=0.000 n=18+16)
FmtFprintfInt-12             114ns ± 1%     117ns ± 0%  +3.01%  (p=0.000 n=19+15)
FmtFprintfIntInt-12          176ns ± 1%     182ns ± 0%  +3.17%  (p=0.000 n=20+15)
FmtFprintfPrefixedInt-12     186ns ± 1%     187ns ± 1%  +1.04%  (p=0.000 n=20+19)
FmtFprintfFloat-12           251ns ± 1%     250ns ± 1%  -0.74%  (p=0.000 n=17+18)
FmtManyArgs-12               746ns ± 1%     761ns ± 0%  +2.08%  (p=0.000 n=19+20)
GobDecode-12                6.57ms ± 1%    6.65ms ± 1%  +1.11%  (p=0.000 n=19+20)
GobEncode-12                5.59ms ± 1%    5.65ms ± 0%  +1.08%  (p=0.000 n=17+17)
Gzip-12                      223ms ± 1%     223ms ± 1%  -0.31%  (p=0.006 n=20+20)
Gunzip-12                   38.0ms ± 0%    37.9ms ± 1%  -0.25%  (p=0.009 n=19+20)
HTTPClientServer-12         77.5µs ± 1%    78.9µs ± 2%  +1.89%  (p=0.000 n=20+20)
JSONEncode-12               14.7ms ± 1%    14.9ms ± 0%  +0.75%  (p=0.000 n=20+20)
JSONDecode-12               53.0ms ± 1%    55.9ms ± 1%  +5.54%  (p=0.000 n=19+19)
Mandelbrot200-12            3.81ms ± 0%    3.81ms ± 1%  +0.20%  (p=0.023 n=17+19)
GoParse-12                  3.17ms ± 1%    3.18ms ± 1%    ~     (p=0.057 n=20+19)
RegexpMatchEasy0_32-12      71.7ns ± 1%    70.4ns ± 1%  -1.77%  (p=0.000 n=19+20)
RegexpMatchEasy0_1K-12       946ns ± 0%     946ns ± 0%    ~     (p=0.405 n=18+18)
RegexpMatchEasy1_32-12      67.2ns ± 2%    67.3ns ± 2%    ~     (p=0.732 n=20+20)
RegexpMatchEasy1_1K-12       374ns ± 1%     378ns ± 1%  +1.14%  (p=0.000 n=18+19)
RegexpMatchMedium_32-12      107ns ± 1%     107ns ± 1%    ~     (p=0.259 n=18+20)
RegexpMatchMedium_1K-12     34.2µs ± 1%    34.5µs ± 1%  +1.03%  (p=0.000 n=18+18)
RegexpMatchHard_32-12       1.77µs ± 1%    1.79µs ± 1%  +0.73%  (p=0.000 n=19+18)
RegexpMatchHard_1K-12       53.6µs ± 1%    54.2µs ± 1%  +1.10%  (p=0.000 n=19+19)
Template-12                 61.5ms ± 1%    63.9ms ± 0%  +3.96%  (p=0.000 n=18+18)
TimeParse-12                 303ns ± 1%     300ns ± 1%  -1.08%  (p=0.000 n=19+20)
TimeFormat-12                318ns ± 1%     320ns ± 0%  +0.79%  (p=0.000 n=19+19)
Revcomp-12 (*)               509ms ± 3%     504ms ± 0%    ~     (p=0.967 n=7+12)
[Geo mean]                  54.3µs         54.8µs       +0.88%

(*) Revcomp is highly non-linear, so I only took samples with 2
iterations.

name         old time/op  new time/op  delta
XGarbage-12  2.25ms ± 0%  2.32ms ± 1%  +2.74%  (p=0.000 n=16+16)
XJSON-12     11.6ms ± 0%  12.4ms ± 0%  +6.81%  (p=0.000 n=18+18)
XHTTP-12     11.6µs ± 1%  11.8µs ± 1%  +1.62%  (p=0.000 n=17+18)

Updates #17503.

Updates #17099, since you can't have a rescan list bug if there's no
rescan list. I'm not marking it as fixed, since gcrescanstacks can
still be set to re-enable the rescan lists.

Change-Id: I6e926b4c2dbd4cd56721869d4f817bdbb330b851
Reviewed-on: https://go-review.googlesource.com/31766
Reviewed-by: Rick Hudson <rlh@golang.org>
bd640c8
@aclements
Member

@rhysh found an interesting example of this problem in the standard library in #17831: the base64 decoding loop in encoding/base64.Encoding.decode has no preemption points, so if you're base64 decoding something large it can significantly delay GC.

@aclements
Member

From what I understand, @rhysh's case is particularly interesting since the application is more throughput-sensitive than latency-sensitive, yet the STW delays significantly reduce parallelism and in turn significantly reduce throughput. So, in this case, the overhead of the backwards-branch check would certainly pay for itself in overall application throughput.

@dr2chase
Contributor
dr2chase commented Nov 7, 2016

Any reason not to handle this as a special case: split it into two loops, where the outer explicitly checks for preemption and the inner runs for few enough iterations that we don't care?

@mdempsky
Member
mdempsky commented Nov 7, 2016

@dr2chase Isn't that effectively the same thing as adding a loop counter?

@aclements
Member

@dr2chase, by "special case" do you mean specifically encoding/base64? This certainly isn't the only loop like this in std (this same pattern shows up all over encoding/* at least: ascii85.Encode, base32.Encode/decode, base64.Encode, hex.Encode/Decode, etc).

@dr2chase
Contributor
dr2chase commented Nov 7, 2016 edited

I was imagining modifying the library by hand, but if there are a dozen instances already, maybe we should reconsider that. I imagine something along the lines of replacing

for i := 0; i < N; i++ { ... }

with

j := 0
for ; j <= N-1000; j += 1000 {
    for i := j; i < j+1000; i++ { ... }
    preemption_check()
}
for i := j; i < N; i++ { ... }

It's slightly more efficient than running a separate counter, not sure we care. There might be some advantage if we have ambitions of SIMD-fying inner loops, since we'd probably do a transformation like that anyhow.

@aclements
Member

@dr2chase, I feel like we may be getting ahead of ourselves here. If we have the cycles, I would like to try an experiment with the simplest possible approach to this: just have the compiler insert stack bound checks in control flow cycles with no preemptible calls and see how much overhead that really has. @cherrymui thinks this should be an easy thing to try. If we start with that experiment, we can get a better sense of whether more sophisticated techniques are even necessary.

@aclements
Member

https://go-review.googlesource.com/c/33093/ adds GOEXPERIMENT=preemptibleloops, which enables the compiler to add preemption points to all loops. As of CL 4, this adds preemption points to essentially all loops, even if they already have preemption points. The overhead is quite high on some benchmarks, but it's clear that there's also a lot of headroom available to improve this. The current plan is to commit this to Go 1.8 under the GOEXPERIMENT for diagnostic purposes, and to try to reduce the overhead and enable this by default for Go 1.9.

Some things we can do to improve the overhead (many of which have been discussed above):

  • Improve the efficiency of the check. I haven't looked at the generated code, but we're probably not getting the tightest check possible. In particular, we may be introducing spills into the hot path, rather than sinking them to the preemption path.
  • Don't add preemption points to loops that already have preemption points. This requires knowing which functions are or could be nosplit, which is complicated by the auto-nosplit logic. There's some evidence that auto-nosplit isn't doing us any good anyway, so maybe we just disable that? Unfortunately, this doesn't help with the tightest of loops, where the overhead is the highest.
  • Unroll loops enough to amortize the cost of the check. Unrolling is generally discouraged these days because CPUs are really, really good at running tight loops.
  • Use a register check instead of a memory check. Currently we're checking the stack bound, which requires a memory load. A different approach would be to reserve some register during these loops, have the compiler record the presence of this register in the PCDATA, and set the register by sending a signal to the thread. This would trade a cheaper check for a more costly preemption, but requires mucking with signals, which is always a pain. Relative to @griesemer's suggestion of creating a second copy of the loop, this requires stealing a register, but seems simpler overall to me.
  • Use asynchronous STW (suggested by @cherrymui). At this point, sweep termination and mark termination basically don't need goroutines to be at safe points. We could stop all threads asynchronously using a signal, perform sweep/mark termination, and then resume the threads. This doesn't help with stack scanning, which must be done at a safe point, so an unpreemptible loop could still delay mark termination if it ran for long enough. There are some sequences in the runtime that must be "atomic" with respect to the garbage collector, but in these it would be easy for the signal to set a flag, let the thread continue, and then check the flag at the end of the sequence.
@ianlancetaylor
Contributor

We can record whether a function is nosplit or not, including auto-nosplit, in the export data, so I don't see any reason here to disable auto-nosplit. Certainly we should only instrument non-preemptible loops, which is to say, any case where there is a path from L to a branch to L that does not call any non-nosplit functions.
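
A sketch of that condition on the control-flow graph (illustrative only, not the actual compiler code; the block type and field names are made up): a block needs a check only if it lies on some cycle that contains no call to a non-nosplit function.

    type block struct {
        succs           []*block
        preemptibleCall bool // block calls at least one non-nosplit function
    }

    // needsCheck reports whether there is a path from b back to b that never
    // passes through a block that makes a preemptible call.
    func needsCheck(b *block) bool {
        if b.preemptibleCall {
            return false // every cycle through b already has a preemption point
        }
        seen := map[*block]bool{}
        var walk func(cur *block) bool
        walk = func(cur *block) bool {
            for _, s := range cur.succs {
                if s == b {
                    return true // closed a call-free cycle back to b
                }
                if seen[s] || s.preemptibleCall {
                    continue
                }
                seen[s] = true
                if walk(s) {
                    return true
                }
            }
            return false
        }
        return walk(b)
    }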

I think signals are problematic both because they are slow and because relying on them will make the Go runtime less portable to different environments.

I continue to believe that the simplest approach with minimal overhead is to introduce a new temporary variable into every non-preemptible loop. Rewrite each backward branch to L to branch instead to L':

L':
    tmp--
    if tmp == 0 {
        if runtime.preempt {
            preempt()
        }
        tmp = 16384
    }
    goto L

That should be a straightforward AST transformation early in the compiler. Let the rest of the compiler decide whether to put tmp into a register or keep it on the stack. Either way it will be faster than a memory load from shared memory.

@gopherbot

CL https://golang.org/cl/33093 mentions this issue.

@dr2chase
Contributor
dr2chase commented Nov 17, 2016 edited

Here's a quick comparison of all-backedges vs. call-sensitive (including nosplit) timings, where call-sensitive is the best-case improvement (w/o unrolling or other clever games), since it treats nosplit functions as if they performed scheduling checks.

all-backedges (current CL):
AMD64, geomean+5%; worst: RevComp+45%, MandelBrot+20%, Gzip+11%.
PPC64, geomean+4%; worst: MandelBrot+67%, Fannkuch+16%, Revcomp+9%.

call-sensitive:
AMD64, geomean+4.7%; worst: RevComp+32%, MandelBrot+21%, Gzip+11%.
PPC64, geomean+2%; worst: MandelBrot+74%, FmtFprintfPrefixedInt+8.2%, FannKuch+6.8%

It's not clear whether something funky is going on with the PPC box (there was another person logged on, and it has a huge number of processors), but three of the RegexpMatchEasy family sped up, by 1%, 9.8%, and 11.8%. I checked this again with an overnight 100-run test and got the same results.

I think we have some unrolling in our future, but this suggests that the existing GOEXPERIMENT CL is a reasonable place to stop, especially for AMD64.

@aclements
Member

We can record whether a function is nosplit or not, including auto-nosplit, in the export data, so I don't see any reason here to disable auto-nosplit.

That's fine when calling functions in other packages, but for a call within the same package, auto-nosplit isn't determined until long after the pass that inserts preemption points.

@dr2chase, is it convenient to take a look at how tight the generated code is? Not necessarily to try to improve it right now, just to get a sense of how much codegen improvement there may be. And would it be easy to try @ianlancetaylor's suggestion?

@dr2chase
Contributor

Trying @ianlancetaylor's suggestion is, I think, not terribly hard, especially if we aren't picky about resetting tmp before each loop, aren't worried about irreducible loops, and do it for all loops. Inserting phi functions for tmp coming from different loops is a minor problem, but it's been solved before.

I think the code quality issue is more a matter of what behavior we induce in the register allocator.

@aclements
Member

Perhaps we shouldn't reset tmp between loops. If tmp is function-wide and only set at entry to the function, we would hit reasonable preemption points regardless of the loop structure of the function. Otherwise, if there are several loops, we may not hit the threshold in any individual loop, but the function could still run for a long time.
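
A hand-written sketch of what that might look like after the transformation, reusing the hypothetical preemptRequested flag and preemptPoint helper from the sketch further up the thread: the counter is initialized once per call and carries across both loops, so even if neither loop is long on its own, the function as a whole still reaches a check.

    func process(a, b []float64) (s float64) {
        tmp := 16384 // function-wide counter, set only at entry
        for i := range a {
            s += a[i]
            if tmp--; tmp == 0 {
                if preemptRequested {
                    preemptPoint()
                }
                tmp = 16384
            }
        }
        for i := range b {
            s += b[i] * b[i]
            if tmp--; tmp == 0 { // same counter keeps counting down here
                if preemptRequested {
                    preemptPoint()
                }
                tmp = 16384
            }
        }
        return s
    }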

@gopherbot

CL https://golang.org/cl/33910 mentions this issue.

@dr2chase
Contributor
dr2chase commented Dec 19, 2016 edited

Update: the Dec 14 version of the CL above is a sketch of what we want; it is correct and adequate for GOEXPERIMENT, but I think it needs the following changes/enhancements:

  1. the register allocator needs to be improved to "sink" spills down into rarely taken branches when this is possible. The current code to sink spills into loop exit edges is a special case of this.

  2. loop unrolling needs to happen sometimes.

  3. we might want to disable the nosplit optimization in the assembler so that it is easier to exclude cases where there is already a call in the loop. Cherry measured the gains from that optimization, and they were not large.

  4. to be more predictable in pause times, we need to estimate the amount of time per loop iteration and decrement the counter in proportion. The compiler can count shortest/longest paths in terms of SSA instructions; this may require smarter placement of decrements (which will in turn require more phi functions), but let's not worry about that yet.

  5. the constant used to reinitialize the counter probably needs to be precomputed by the runtime based on the processor where a program is run.

As an example of picking the counter size, suppose our goal is to check for an interrupt every 50 microseconds. Suppose a 1 GHz processor; 50 usec = 50,000 cycles. A 10-"instruction" loop executed at one IPC should check every 5000 iterations, and let's say that no loop is shorter than 10 instructions thanks to unrolling; if the counter is initialized to 5000, then decrements by 1 will obtain the desired check latency. If the processor speed is 2 GHz, initialize the counter to 10000 (10000 iterations, each taking 5 ns); if it is 500 MHz, initialize the counter to 2500. Loops of up to 15 instructions decrement by 1, 15-25 by 2, 25-35 by 3, etc.; that is a compiled-in constant, but the runtime picks the starting value to ensure that checks happen often enough.
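
Restating that arithmetic as a toy helper (names and signature are illustrative only, not anything in the runtime):

    // counterInit returns the counter reinitialization value for a loop of
    // instrs instructions at ipc instructions per cycle on a ghz-GHz
    // processor, targeting one check every targetNS nanoseconds.
    func counterInit(ghz, instrs, ipc, targetNS float64) int {
        nsPerIter := (instrs / ipc) / ghz // cycles per iteration / cycles per ns
        return int(targetNS / nsPerIter)
    }

    // counterInit(1, 10, 1, 50000) == 5000    (the 1 GHz case above)
    // counterInit(2, 10, 1, 50000) == 10000   (2 GHz)
    // counterInit(0.5, 10, 1, 50000) == 2500  (500 MHz)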

This is only to get checks to within an order of magnitude, but not adjusting for loop size would fail by several orders of magnitude. Depending on measured overheads, we might also decide to check more aggressively.

@gopherbot gopherbot pushed a commit that referenced this issue Jan 9, 2017
@dr2chase dr2chase cmd/compile: insert scheduling checks on loop backedges
Loop breaking with a counter.  Benchmarked (see comments),
eyeball checked for sanity on popular loops.  This code
ought to handle loops in general, and properly inserts phi
functions in cases where the earlier version might not have.

Includes test, plus modifications to test/run.go to deal with
timeout and killing looping test.  Tests broken by the addition
of extra code (branch frequency and live vars) for added
checks turn the check insertion off.

If GOEXPERIMENT=preemptibleloops, the compiler inserts reschedule
checks on every backedge of every reducible loop.  Alternately,
specifying GO_GCFLAGS=-d=ssa/insert_resched_checks/on will
enable it for a single compilation, but because the core Go
libraries contain some loops that may run long, this is less
likely to have the desired effect.

This is intended as a tool to help in the study and diagnosis
of GC and other latency problems, now that goal STW GC latency
is on the order of 100 microseconds or less.

Updates #17831.
Updates #10958.

Change-Id: I6206c163a5b0248e3f21eb4fc65f73a179e1f639
Reviewed-on: https://go-review.googlesource.com/33910
Run-TryBot: David Chase <drchase@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
7f1ff65