runtime: tight loops should be preemptible #10958

Open
aclements opened this Issue May 26, 2015 · 72 comments

@aclements (Member) commented May 26, 2015

Currently goroutines are only preemptible at function call points. Hence, it's possible to write a tight loop (e.g., a numerical kernel or a spin on an atomic) with no calls or allocation that arbitrarily delays preemption. This can result in arbitrarily long pause times as the GC waits for all goroutines to stop.

In unusual situations, this can even lead to deadlock when trying to stop the world. For example, the runtime's TestGoroutineParallelism tries to prevent GC during the test to avoid deadlock. It runs several goroutines in tight loops that communicate through a shared atomic variable. If the coordinator that starts these is paused part way through, it will deadlock.

One possible fix is to insert preemption points on control flow loops that otherwise have no preemption points. Depending on the cost of this check, it may also require unrolling such loops to amortize the overhead of the check.

This has been a longstanding issue, so I don't think it's necessary to fix it for 1.5. It can cause the 1.5 GC to miss its 10ms STW goal, but code like numerical kernels is probably not latency-sensitive. And as far as I can tell, causing a deadlock like the runtime test can do requires looking for trouble.

@RLH
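
To make the proposed back-edge check concrete, here is a minimal hand-written sketch in ordinary Go (illustrative only: the real check would be emitted by the compiler, and preemptFlag/preemptPoint are made-up stand-ins for runtime internals):

package main

// preemptFlag stands in for a per-goroutine "please stop" flag that the
// runtime would set when it wants to stop the world (hypothetical name).
var preemptFlag int32

// preemptPoint stands in for a call into the scheduler at a safe point.
//go:noinline
func preemptPoint() {}

func sum(xs []int) int {
	s := 0
	for i := 0; i < len(xs); i++ {
		s += xs[i]
		// Back-edge check: one load, one compare, one almost-never-taken
		// branch. With unrolling, this would run once per unrolled body.
		if preemptFlag != 0 {
			preemptPoint()
		}
	}
	return s
}

func main() {
	_ = sum(make([]int, 1<<20))
}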

@aclements aclements added this to the Go1.6 milestone May 26, 2015

@minux (Member) commented May 26, 2015

This comment has been minimized.

@randall77 (Contributor) commented May 26, 2015

The problem with interrupts is that the goroutine can be at an arbitrary address. The interrupted address is probably at a place where we have no data about where the pointers are. We'd either have to record a lot more frame layout/type/liveness information (including registers), or we'd have to simulate the goroutine forward to a safepoint. Both are challenging.

@minux (Member) commented May 26, 2015

This comment has been minimized.

@ianlancetaylor (Contributor) commented May 26, 2015

It seems to me that we can preempt at a write barrier. So the only loops we are talking about are those that make no writes to the heap and make no function calls. If we think in terms of adding

    uint8 b
  loop:
    ++b
    if b == 0 {
        preemptCheck()
    }

then the normal path through the loop will have two extra instructions (add/beq) where the add may be to either a register or a memory location, depending on overall register pressure. This will be measurable in tight loops but for most cases shouldn't be too bad.
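
A runnable rendering of this counter idea, for readers who want to try it (a sketch only: preemptCheck is a made-up stand-in for a call into the scheduler, and in practice the check would be compiler-generated):

package main

// preemptCheck stands in for the runtime routine that would consult the
// per-goroutine preemption flag and yield at a safe point (hypothetical).
//go:noinline
func preemptCheck() {}

func busyWork(n int) int {
	var b uint8 // wraps around every 256 iterations
	acc := 0
	for i := 0; i < n; i++ {
		acc += i * i
		b++
		if b == 0 { // taken once every 256 iterations
			preemptCheck()
		}
	}
	return acc
}

func main() {
	_ = busyWork(1 << 20)
}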

@randall77 (Contributor) commented May 26, 2015

No pointer writes to the heap.

But maybe this is enough? If we can preempt at the first write barrier, then even if we let the mutator run it can't modify anything that the GC cares about. So maybe we just set the write barrier enabled bit and let goroutines run. Once a goroutine sees the write barrier bit (how do we force that?) it can't modify any memory the GC cares about. So it is safe to start the GC without waiting for every goroutine to stop.
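
A rough illustration of the distinction being drawn here (a sketch; the names node, linkAll, and sum are invented for the example): a loop that stores pointers into heap objects is compiled with a write barrier on each store, which could serve as a hook once the write-barrier-enabled bit is set, whereas a purely arithmetic loop has no pointer writes and no calls at all, and that is exactly the kind of loop still left uncovered.

package main

type node struct {
	next *node
	val  int
}

// linkAll stores pointers into heap-allocated nodes; each assignment to
// n.next is a heap pointer write, compiled with a write barrier.
func linkAll(nodes []*node) *node {
	var head *node
	for _, n := range nodes {
		n.next = head // pointer write: write-barrier code is emitted here
		head = n
	}
	return head
}

// sum performs no pointer writes and calls no functions, so without an
// inserted check the runtime has nothing in the loop to hook into.
func sum(xs []int) int {
	s := 0
	for _, x := range xs {
		s += x
	}
	return s
}

func main() {
	_ = linkAll([]*node{{val: 1}, {val: 2}, {val: 3}})
	_ = sum([]int{1, 2, 3})
}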

@aclements (Member) commented May 27, 2015

The problem of inserting artificial preemption points is reduced numerical performance.

True. That's why I suggested unrolling loops, which AFAIK is a standard solution to this problem.

However, I think the check can be done in just two instructions, even without adding Ian's loop counter. Just CMP the preempt flag and branch if it's set. That branch will almost never be hit, so it will be highly predictable, and the preempt flag should be in the L1, so the check may in fact be extremely cheap.

No pointer writes to the heap. But maybe this is enough?

This certainly reduces the set of programs that have this problem, but I don't think it actually helps with either of the examples I gave, since numerical kernels probably won't have heap pointer writes, and the runtime test that can deadlock the GC certainly has no pointer writes.

@RLH (Contributor) commented May 27, 2015

This is the reference for doing a GC safepoint at every instruction. It is not something we want to do for x86, much less the various other architectures we need to support:
http://dl.acm.org/citation.cfm?id=301652

Most systems I know of do what Austin suggests: unroll the loop and insert a check. No numbers, but 8 unrolls seems to be what I recall. Not only is the branch highly predictable, but the check does not introduce any dependencies, making it close to free on an out-of-order machine. There have been other approaches, such as code plugging and predicated branches, but HW has moved on. I had not seen Ian's suggestion.


@kingluo commented Sep 9, 2015

@aclements,

Currently goroutines are only preemptible at function call points.

To be precise, the preemption check seems to happen only in newstack()?

For example,

package main

import (
    "fmt"
    "runtime"
    "time"
)

func test() {
    a := 100
    for i := 1; i < 1000; i++ {
        a = i*100/i + a
    }
}

func main() {
    runtime.GOMAXPROCS(1)
    go func() {
        for {
            test()
        }
    }()
    time.Sleep(100 * time.Millisecond)
    fmt.Println("hello world")
}

test() is not inlined, and the infinite loop calls test(), but the goroutine is never preempted at those calls, because morestack() --> newstack() is never involved, so "hello world" is never printed.

@randall77 (Contributor) commented Sep 9, 2015

test() is not inlined, but since it is a leaf it is promoted to nosplit, so it doesn't have the preemption check any more. Fixing that sounds much easier than the rest of this bug. Maybe we could forbid nosplit promotion for functions called from loops?
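
As a practical aside on the example above (a workaround for current runtimes, not the fix this issue is about): giving the loop an explicit preemption point, for example by calling runtime.Gosched() every so often, lets the rest of the program make progress:

package main

import (
	"fmt"
	"runtime"
	"time"
)

func test() {
	a := 100
	for i := 1; i < 1000; i++ {
		a = i*100/i + a
	}
}

func main() {
	runtime.GOMAXPROCS(1)
	go func() {
		for i := 0; ; i++ {
			test()
			if i%1000 == 0 {
				runtime.Gosched() // explicit yield: a guaranteed preemption point
			}
		}
	}()
	time.Sleep(100 * time.Millisecond)
	fmt.Println("hello world") // now prints, because the loop yields
}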

@griesemer (Contributor) commented Sep 10, 2015

For reference, the same problem appeared in the HotSpot JVM. I remember two approaches:

  1. Around 1997/1998, Rene Schmidt (http://jaoo.dk/aarhus2007/speaker/Rene+W.+Schmidt) implemented the following mechanism: Threads running a tight loop w/o function calls would receive a signal to temporarily suspend them. The runtime then dynamically generated a partial "copy" of the loop instructions the thread was running, from the current PC to the (unconditional) backward branch, except that the backward branch was replaced with a call into the runtime (leading to proper suspension at a safe point). The thread was then restarted with the pc modified such that it would run the newly generated piece of code. That code would run to the end of the loop body and then suspend itself at which point the code copy was discarded and the pc (return address) adjusted to continue with the original code (after the safe point).

This mechanism was ingenious but also insanely complicated. It was abandoned eventually for:

  2. A simple test and branch (2 additional instructions) at the end of a loop body which didn't contain any calls. As far as I recall, we didn't do any form of loop unrolling and the performance implications were not significant in the overall picture of a larger application.

My vote would be for 2) to start, in loops where @ianlancetaylor's suggestion doesn't work. I suspect that for all but the smallest long-running inner loops (where unrolling might make sense independently), the performance penalty is acceptable.

If this is not good enough, and we don't want to pay the code size cost of unrolling a loop multiple times, here's another idea based on 1): Instead of the backward branch check as in 2) plus unrolling, keep the original loop, and generate (at compile time) a 2nd version of the loop body that ends in a runtime call to suspend itself instead of the backward branch. The code size cost is about the cost of having the loop unrolled twice. When the goroutine needs to run to a safe point, use a signal to temporarily suspend the goroutine, switch the pc to the corresponding pc in the copy of the loop body, continue with execution there and have the code suspend itself at a safe point. There's no dynamic code generation involved, generating the extra loop body is trivial at compile-time, the extra amount of code is less than with loop unrolling, and there's only a little bit of runtime work to modify the pc. The regular code would run at full speed if no garbage collection is needed.

@nightlyone (Contributor) commented Sep 11, 2015

Another idea, similar to @ianlancetaylor's, is to estimate the cost (in ns) of the loop body and only check for required suspension every N iterations of that loop.

Once the compiler can unroll/unpeel, that check-every-N logic can be either inlined after the unrolled body, if unrolling is beneficial, or kept when unrolling makes no sense for that loop body.

Such logic also reads better when using debuggers.

@griesemer (Contributor) commented Sep 11, 2015

@nightlyone The check-every-N seems more complex than just having two extra instructions (compare and branch). And if unrolling is done, that compare-and-branch is already needed only every N loop iterations (where N is the number of times the loop was unrolled).

@nightlyone (Contributor) commented Sep 11, 2015

@griesemer I'm not sure how the test will avoid atomic loads; that's why I suggested the check-every-N.

So the pseudo-go-code before unrolling would look like this:

loop:   
        // loop-body        

        if counter > N {          
                counter = 0                       

                if need_stop = atomic.LoadBool(&runtime.getg().need_stop); need_stop {
                        runtime.Gosched()                                    
                }                                                               
        }                                                                                                                       
        counter++                                                                                                           
        goto loop

after unrolling it would look like

loop:   
        // loop-body0
        // loop-body1
        // ...
        // loop-bodyM        

        if counter > N/M {          
                counter = 0                       

                if need_stop = atomic.LoadBool(&runtime.getg().need_stop); need_stop {
                        runtime.Gosched()                                    
                }                                                               
        }                                                                                                                       
        counter++                                                                                                           
        goto loop

So the code inserted would be a constant overhead, but still run every N loops.

@aclements (Member) commented Sep 11, 2015

The load doesn't have to be atomic. The current preemption check in the function prologue isn't atomic.

One tricky bit with the compare and branch is that the actual preemption point needs to have no live registers. Presumably, for performance reasons, we want the loop to be able to keep things in registers, so the code we branch to on a preempt has to flush any live registers before the preempt and reload them after the preempt. I don't think this will be particularly hard, but it is something to consider, since it might affect which stage of the compiler is responsible for generating this code.

@griesemer (Contributor) commented Sep 11, 2015

@aclements Indeed. Which is why perhaps switching the pc to a 2nd version of the loop body that ends in a safe point might not be much more complex and permit the loop to run at full speed in the normal case.

@aclements (Member) commented Sep 11, 2015

@aclements Indeed. Which is why perhaps switching the pc to a 2nd version of the loop body that ends in a safe point might not be much more complex and permit the loop to run at full speed in the normal case.

From the runtime's perspective, I think this would be more complicated because stealing a signal is a logistical pain and we'd have to deal with tables mapping from fast loop PCs to slow loop PCs. The compiler would have to generate these tables. This seems like a very clever plan B, but I think first we should try adding a no-op compare and branch and see if it's actually a problem for dense numerical kernels.

@randall77 (Contributor) commented Sep 11, 2015

The new SSA register allocator handles situations like this well, keeping everything in registers on the common edges and spill/call/restore on the unlikely case.

@RLH (Contributor) commented Sep 11, 2015

The load is not dependent on anything in the loop. Assuming the loop is tight, the value is almost certainly in a location already in the L1 cache, and the branch will be highly predictable. Just to make sure the branch predictor doesn't even need to be warmed up, we can make it a backward branch. I would be sort of surprised if, on an out-of-order machine, the cost would even be noticed. That said, build and measure is the only way to be sure.


@gopherbot commented Feb 16, 2016

CL https://golang.org/cl/19516 mentions this issue.

gopherbot pushed a commit that referenced this issue Feb 16, 2016

runtime: fix deadlock in TestCrashDumpsAllThreads
TestCrashDumpsAllThreads carefully sets the number of Ps to one
greater than the number of non-preemptible loops it starts so that the
main goroutine can continue to run (necessary because of #10958).
However, if GC starts, it can take over that one spare P and lock up
the system while waiting for the non-preemptible loops, causing the
test to eventually time out. This deadlock is easily reproducible if
you run the runtime test with GOGC=1.

Fix this by forcing GOGC=off when running this test.

Change-Id: Ifb22da5ce33f9a61700a326ea92fcf4b049721d1
Reviewed-on: https://go-review.googlesource.com/19516
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
@yaxinlx commented Jun 12, 2016

It would be great if there were a compiler flag to detect greedy goroutines.

@celer commented Jun 13, 2016

We're rewriting our product from scratch in Go, and one of the more memorable failure modes of the prior implementation was an infinite loop in an open-source library, which caused lots of customer pain. So we'd love to see this addressed. (Sadly, this fix alone isn't enough, as you still ideally need a way to find goroutines which are spinning and stop them.)

So our use case is maintaining high availability in the face of a number of errantly looping goroutines.

@aclements aclements modified the milestones: Go1.8Early, Unplanned Jun 13, 2016

@aclements (Member) commented Jun 13, 2016

Moving to Go1.8Early since we've been getting more problem reports from this lately (though I think they're still all from either testing code or buggy code).

@yaxinlx, I'm not sure what you would want the compiler to report. Most code probably has lots of non-preemptible loops, but they're all trivially short and bounded, so I suspect any compiler output would be too noisy to be useful. I think it would be much better to just fix this issue than to put effort into complex tools for dropping it in users' laps.

@aclements (Member) commented Jun 13, 2016

I wanted to note an interesting case that hadn't occurred to me until recently: currently we omit the stack growth check from functions with very small frames (the exact threshold differs between arches), which means we also don't check for preemption in these functions. As a result, even a loop that contains a function call may be non-preemptible if the function it calls has a very small frame. That function call could even be a method call or a function pointer, in which case the compiler has no way to soundly detect that it's calling a function without a preemption check when it's compiling the loop.

Any fix for this issue has to deal with this case. I can think of a few possible solutions, and there may be others: 1) put preemption points in all function prologues or 2) insert preemption points on back-edges conservatively if the compiler can't prove that at least one call in the loop is preemptible. As an extension of 2, we could also 3) insert preemption points in all method prologues, which would increase the number of loops the compiler can prove already contain a preemption point.
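
A hedged sketch of the function-pointer case described above (illustrative only; whether a given leaf really loses its prologue check depends on the compiler version and architecture). The loop does contain a call, but the callee is a tiny leaf reached through a function value, so the compiler has no way to know, while compiling the loop, that the call carries no preemption check:

package main

import (
	"fmt"
	"runtime"
	"time"
)

var sink int

// tiny is a leaf with a very small frame; such functions may be compiled
// without a stack-growth check and therefore without a preemption check.
func tiny(x int) int { return x + 1 }

// spin loops forever; it makes a call every iteration, but if f has no
// prologue check the loop still has no usable preemption point.
func spin(f func(int) int) {
	for i := 0; ; i++ {
		sink = f(i)
	}
}

func main() {
	runtime.GOMAXPROCS(1)
	go spin(tiny)
	time.Sleep(100 * time.Millisecond)
	// On runtimes without loop (or asynchronous) preemption this line may
	// never be reached, because the spinning goroutine hogs the only P.
	fmt.Println("done")
}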

@randall77 randall77 removed their assignment Aug 24, 2017

@dr2chase (Contributor) commented Sep 26, 2017

Some progress reports, pro and con. First, I tried blind-stupid unrolling by 2 in another experiment (where the loop is simply stamped out twice, including the test and any increment) and that was generally not a help. That was CL40996.

I did some benchmarking (from the benchmarks I collected for https://github.com/dr2chase/bent) to see how the two preemption tests compare, applied to Tip, versus 1.9. If we use fault-based preemption, and consider the geomean, we're roughly there, and we still have at least one big optimization (improved inlining) in the pipeline. Here are the results; further discussion below:

name \ time/op                             Go1.9.stdout  ReschedFault.stdout  ReschedTest.stdout  Tip.stdout
FastTest2KB-12                               131ns ± 0%           163ns ± 0%          232ns ± 0%    128ns ± 0%
BaseTest2KB-12                               606ns ± 0%          1193ns ± 0%         1619ns ± 0%    602ns ± 1%
Encoding4KBVerySparse-12                    17.2µs ± 0%          18.8µs ± 0%         19.6µs ± 1%   17.3µs ± 0%
Join_8-12                                    2.13s ± 1%           2.25s ± 1%          2.30s ± 1%    2.11s ± 2%
HashimotoLight-12                           1.62ms ± 1%          1.92ms ± 1%         2.05ms ± 4%   1.62ms ± 1%
ChainRead_full_10k-12                        343ms ± 1%           349ms ± 2%          345ms ± 2%    328ms ± 4%
Sha3_224_MTU-12                             4.83µs ± 0%          4.85µs ± 0%         4.87µs ± 0%   4.83µs ± 0%
GenSharedKeyP256-12                         74.2µs ± 0%          74.9µs ± 0%         71.6µs ± 0%   71.6µs ± 0%
OpDiv128-12                                  280ns ± 1%           287ns ± 1%          294ns ± 1%    290ns ± 0%
SinTableLookup-12                           1.72ns ± 0%          0.76ns ± 0%         0.87ns ± 0%   0.50ns ± 0%
SinStdLib-12                                10.1ns ± 0%           9.8ns ± 0%         10.2ns ± 0%   14.0ns ± 0%
Run/10k/1-12                                 25.6s ± 0%           26.8s ± 0%          27.3s ± 0%    25.0s ± 0%
Run/10k/16-12                                4.77s ± 2%           4.99s ± 2%          5.01s ± 2%    4.67s ± 1%
GetObject5MbFS-12                           4.33ms ± 1%          4.36ms ± 1%         4.39ms ± 1%   4.34ms ± 1%
Dnrm2MediumPosInc-12                        4.02µs ± 0%          4.02µs ± 0%         4.02µs ± 0%   4.04µs ± 0%
DasumMediumUnitaryInc-12                     870ns ± 0%           927ns ± 0%         1032ns ± 0%    859ns ± 0%
Dgeev/Circulant10-12                        43.8µs ± 2%          44.9µs ± 1%         45.2µs ± 1%   42.5µs ± 1%
Dgeev/Circulant100-12                       10.9ms ± 1%          10.5ms ± 1%         10.8ms ± 1%   10.6ms ± 1%
MulWorkspaceDense1000Hundredth-12           65.3ms ± 0%          65.5ms ± 0%         65.8ms ± 0%   65.3ms ± 1%
ScaleVec10000Inc20-12                       22.2µs ± 1%          22.2µs ± 1%         22.2µs ± 1%   22.3µs ± 1%
ValidateVersionTildeFail-12                  749ns ± 1%           753ns ± 1%          783ns ± 0%    747ns ± 1%
StripHTML-12                                3.01µs ± 0%          3.02µs ± 0%         3.09µs ± 1%   2.99µs ± 0%
ReaderContains-12                           9.61µs ± 0%          9.73µs ± 0%        10.10µs ± 0%   9.62µs ± 0%
ParsePage-12                                57.8µs ± 1%          58.5µs ± 1%         59.2µs ± 1%   57.5µs ± 1%
EncodeCodecFromInternalProtobuf-12          4.50µs ± 1%          4.75µs ± 1%         4.82µs ± 1%   4.21µs ± 1%
List1kNodes30kPods-12                       1.34ms ± 2%          1.35ms ± 4%         1.38ms ± 3%   1.34ms ± 2%
NewV5-12                                     340ns ± 0%           308ns ± 0%          309ns ± 0%    315ns ± 0%
MarshalToString-12                          69.8ns ± 0%          68.8ns ± 0%         70.3ns ± 0%   67.5ns ± 0%
TarjanSCCGnp_10_tenth-12                    7.28µs ± 4%          6.91µs ± 6%         6.98µs ± 3%   6.82µs ± 4%
TarjanSCCGnp_1000_half-12                   82.2ms ± 1%          82.1ms ± 2%         82.2ms ± 0%   80.6ms ± 3%
AStarUndirectedmallWorld_10_2_2_2_Heur-12   13.9µs ± 5%          13.2µs ± 3%         13.5µs ± 5%   13.5µs ± 8%
LouvainDirectedMultiplex-12                 19.5ms ± 0%          18.4ms ± 0%         18.7ms ± 0%   18.2ms ± 0%
WalkAllBreadthFirstGnp_10_tenth-12          3.00µs ± 3%          2.70µs ± 3%         2.77µs ± 3%   2.66µs ± 4%
WalkAllBreadthFirstGnp_1000_tenth-12        9.85ms ± 3%          9.60ms ± 4%         9.76ms ± 4%   9.71ms ± 3%
[Geo mean]                                  73.7µs               74.3µs              77.3µs        70.3µs     

The worst-performing test of the bunch is BenchmarkBaseTest2KB, which measures the performance of code not run on x86, amd64, ppc64[le], or s390x.

func BenchmarkBaseTest2KB(b *testing.B) { benchmarkBaseTest(b, 2048) }
...
func benchmarkBaseTest(b *testing.B, size int) {
	p := make([]byte, size)
	for i := 0; i < b.N; i++ {
		safeTestBytes(p)
	}
}
...
func safeTestBytes(p []byte) bool {
	for i := 0; i < len(p); i++ {
		if p[i] != 0 {
			return true
		}
	}
	return false
}

Naive x2 unrolling reduces the pain of the preemption-checking case but slows the no-preemption case:

old: BaseTest2KB-12              606ns ± 0%          1193ns ± 0%         1619ns ± 0%    602ns ± 1%
new: BaseTest2KB-12              797ns ± 0%          1047ns ± 1%         1332ns ± 0%   798ns ± 0%

func safeTestBytes(p []byte) bool {
	i := 0
	for ; i < len(p)-1; i += 2 {
		if p[i] != 0 {
			return true
		}
		if p[i+1] != 0 {
			return true
		}
	}
	for ; i < len(p); i++ {
		if p[i] != 0 {
			return true
		}
	}
	return false
}

A by-4 unrolling:

+0+1+2+3 BaseTest2KB-12              745ns ± 0%           852ns ± 0%          921ns ± 0%   745ns ± 0%
+3+2+1+0 BaseTest2KB-12              743ns ± 0%           837ns ± 0%          968ns ± 0%   743ns ± 0%

func safeTestBytes(p []byte) bool {
	i := 0
	for ; i < len(p)-3; i += 4 {
		if p[i+0] != 0 {
			return true
		}
		if p[i+1] != 0 {
			return true
		}
		if p[i+2] != 0 {
			return true
		}
		if p[i+3] != 0 {
			return true
		}
	}
	for ; i < len(p); i++ {
		if p[i] != 0 {
			return true
		}
	}
	return false
}

There's some possibility of doing this automatically, but it's also possible to do this by hand for problematic loops. One reason not to pursue by-hand tweaks is that we might implement preemption differently in the future, for instance by trapping the thread and single-stepping forward to a designated safe point (perhaps enhancing pointer maps to include registers); that would not only remove the need for the change, it would also work best with the original loop (which executes faster and has a smaller loop body to step through).

@gopherbot commented Oct 4, 2017

Change https://golang.org/cl/68012 mentions this issue: runtime: keep rescheduling fault page together

@gopherbot commented Oct 4, 2017

Change https://golang.org/cl/68014 mentions this issue: cmd/compile, runtime: make bad safe-points throw

@gopherbot commented Oct 4, 2017

Change https://golang.org/cl/68015 mentions this issue: runtime: normalize (*sigctxt).fault() type

@gopherbot commented Oct 4, 2017

Change https://golang.org/cl/68020 mentions this issue: runtime: ignore loop preemption signals in GDB

@gopherbot commented Oct 4, 2017

Change https://golang.org/cl/68013 mentions this issue: cmd/compile, runtime: indirect fault-based loop preemption

@gopherbot commented Oct 4, 2017

Change https://golang.org/cl/68010 mentions this issue: cmd/compile: disallow loop preemption in the runtime

@gopherbot commented Oct 4, 2017

Change https://golang.org/cl/68016 mentions this issue: runtime: use page unmapping to preempt loops in STW

@gopherbot commented Oct 4, 2017

Change https://golang.org/cl/68011 mentions this issue: cmd/compile: record rescheduling target PC

@gopherbot commented Oct 4, 2017

Change https://golang.org/cl/68017 mentions this issue: runtime: don't acquire m.locks in forEachP

@gopherbot commented Oct 4, 2017

Change https://golang.org/cl/68018 mentions this issue: runtime: use global preemption in forEachP and stack scanning

gopherbot pushed a commit that referenced this issue Oct 5, 2017

runtime: normalize (*sigctxt).fault() type
(*sigctxt).fault() currently returns either uintptr, uint32, or uint64
depending on the platform. Make them all return uintptr.

For #10958 (but a nice change on its own).

Change-Id: I7813e779d0edcba112dd47fda776f4ce6e50e227
Reviewed-on: https://go-review.googlesource.com/68015
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
@DanielMorsing (Contributor) commented Oct 5, 2017

Do the tight loops that cause issues end up having that many safepoints that can be preempted at? Safepoints are most commonly calls and there's already a mechanism for handling those.

@dr2chase (Contributor) commented Oct 5, 2017

Loops with calls to non-leaf functions don't need preemption checks, but the leaf/non-leaf choice is made late in the pipeline and is not recorded in the export data. And in a test where all loops cut by calls were exempted from safepoint insertion, it didn't help that much -- loops with calls in them are already doing enough "real work" that the safepoint check doesn't make a big difference. The "bad" loops are tiny and contain no calls.

@DanielMorsing (Contributor) commented Oct 6, 2017

I think we're either talking at cross-purposes or I'm just not getting it. Reading the current CLs, it seems that the instruction that causes the rescheduling will only be valid at GC safepoints. It's possible to construct tight loops that have none, for example this package:

package foobar

type foo struct {
	b []byte
}

func bar(f []*foo) int {
	var k int
	for i := 0; i < len(f); i++ {
		x := f[i].b
		f[i] = nil
		for j := 0; j < len(x); j++ {
			k += int(x[j])
		}
	}
	return k
}

Here, the inner loop will have no safepoints to put the rescheduling instruction at, since the x pointer is kept in a register and there are no references to it elsewhere.

Is there a plan for handling this kind of loop?

@dr2chase (Contributor) commented Oct 6, 2017

This is that plan. The compiler will insert a safepoint check (of some flavor, TBD) in every such loop.

@gopherbot commented Oct 26, 2017

Change https://golang.org/cl/73710 mentions this issue: runtime: "fix" non-preemptible loop in TestParallelRWMutexReaders

gopherbot pushed a commit that referenced this issue Oct 26, 2017

runtime: "fix" non-preemptible loop in TestParallelRWMutexReaders
TestParallelRWMutexReaders has a non-preemptible loop in it that can
deadlock if GC triggers. "Fix" it like we've fixed similar tests.

Updates #10958.

Change-Id: I13618f522f5ef0c864e7171ad2f655edececacd7
Reviewed-on: https://go-review.googlesource.com/73710
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
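
The "fix it like we've fixed similar tests" pattern generally amounts to switching the collector off for the duration of the test, so that no stop-the-world can be requested while goroutines sit in non-preemptible loops. A minimal sketch of that pattern (the test name is made up, not the actual test):

package foo_test

import (
	"runtime/debug"
	"testing"
)

func TestWithNonPreemptibleLoops(t *testing.T) {
	// SetGCPercent(-1) disables the collector and returns the previous
	// setting; the deferred call restores it when the test exits.
	defer debug.SetGCPercent(debug.SetGCPercent(-1))

	// ... test body with tight spin loops ...
}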

@aclements aclements modified the milestones: Go1.10, Go1.11 Nov 8, 2017

@gopherbot

gopherbot commented Mar 26, 2018

Change https://golang.org/cl/102604 mentions this issue: cmd/compile: don't produce a past-the-end pointer in range loops


@ghost

ghost commented Mar 30, 2018

Goroutines are already pre-empted by the OS. It will even switch between multiple busy goroutines on a single-core box; it's amazing! If your busy goroutines are starving some other thread, then create N+1 threads using GOMAXPROCS, where N is the number of busy goroutines. Problem solved. A lot of people just don't understand this. Controlling how many busy goroutines you are scheduling is a programmer responsibility in any kind of co-operative light-weight threading system. If you can't handle it, write Java.

If the problem is GC... then maybe you should be writing a better garbage collector instead of this.


@docmerlin

docmerlin commented Mar 30, 2018

@fcntl Switching between goroutines on the same thread is cooperative (mostly), not preemptive. The OS can do preemption when it switches between goroutines on different threads, but within a thread it's cooperative (mostly). The exception to this is at function calls, where the runtime can preempt. This can create a problem if you have a very tight, long-running loop: because there are no preemption points, the GC can't stop the world. It's a known issue, which can cause all sorts of nasty leaks.
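
For anyone who wants to see this rather than take it on faith, a small demo along the following lines shows the effect on Go versions from this era (before loop or asynchronous preemption): the runtime.GC() call, which needs a stop-the-world, can stall until the spinning goroutine finishes. This is a sketch, not code from the issue:

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	go func() {
		// Tight loop: no calls, no allocation, hence no preemption points.
		n := 0
		for i := 0; i < 2e9; i++ {
			n += i
		}
		fmt.Println(n)
	}()

	time.Sleep(100 * time.Millisecond) // let the spinner get going
	start := time.Now()
	runtime.GC() // must stop the world; on affected versions, waits for the spinner
	fmt.Println("GC took", time.Since(start))
}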


@bradfitz bradfitz modified the milestones: Go1.11, Go1.12 May 18, 2018

gopherbot pushed a commit that referenced this issue May 22, 2018

cmd/compile: don't produce a past-the-end pointer in range loops
Currently, range loops over slices and arrays are compiled roughly
like:

for i, x := range s { b }
  ⇓
for i, _n, _p := 0, len(s), &s[0]; i < _n; i, _p = i+1, _p + unsafe.Sizeof(s[0]) { b }
  ⇓
i, _n, _p := 0, len(s), &s[0]
goto cond
body:
{ b }
i, _p = i+1, _p + unsafe.Sizeof(s[0])
cond:
if i < _n { goto body } else { goto end }
end:

The problem with this lowering is that _p may temporarily point past
the end of the allocation the moment before the loop terminates. Right
now this isn't a problem because there's never a safe-point during
this brief moment.

We're about to introduce safe-points everywhere, so this bad pointer
is going to be a problem. We could mark the increment as an unsafe
block, but this inhibits reordering opportunities and could result in
infrequent safe-points if the body is short.

Instead, this CL fixes this by changing how we compile range loops to
never produce this past-the-end pointer. It changes the lowering to
roughly:

i, _n, _p := 0, len(s), &s[0]
if i < _n { goto body } else { goto end }
top:
_p += unsafe.Sizeof(s[0])
body:
{ b }
i++
if i < _n { goto top } else { goto end }
end:

Notably, the increment is split into two parts: we increment the index
before checking the condition, but increment the pointer only *after*
the condition check has succeeded.

The implementation builds on the OFORUNTIL construct that was
introduced during the loop preemption experiments, since OFORUNTIL
places the increment and condition after the loop body. To support the
extra "late increment" step, we further define OFORUNTIL's "List"
field to contain the late increment statements. This makes all of this
a relatively small change.

This depends on the improvements to the prove pass in CL 102603. With
the current lowering, bounds-check elimination knows that i < _n in
the body because the body block is dominated by the cond block. In the
new lowering, deriving this fact requires detecting that i < _n on
*both* paths into body and hence is true in body. CL 102603 made prove
able to detect this.

The code size effect of this is minimal. The cmd/go binary on
linux/amd64 increases by 0.17%. Performance-wise, this actually
appears to be a net win, though it's mostly noise:

name                      old time/op    new time/op    delta
BinaryTree17-12              2.80s ± 0%     2.61s ± 1%  -6.88%  (p=0.000 n=20+18)
Fannkuch11-12                2.41s ± 0%     2.42s ± 0%  +0.05%  (p=0.005 n=20+20)
FmtFprintfEmpty-12          41.6ns ± 5%    41.4ns ± 6%    ~     (p=0.765 n=20+19)
FmtFprintfString-12         69.4ns ± 3%    69.3ns ± 1%    ~     (p=0.084 n=19+17)
FmtFprintfInt-12            76.1ns ± 1%    77.3ns ± 1%  +1.57%  (p=0.000 n=19+19)
FmtFprintfIntInt-12          122ns ± 2%     123ns ± 3%  +0.95%  (p=0.015 n=20+20)
FmtFprintfPrefixedInt-12     153ns ± 2%     151ns ± 3%  -1.27%  (p=0.013 n=20+20)
FmtFprintfFloat-12           215ns ± 0%     216ns ± 0%  +0.47%  (p=0.000 n=20+16)
FmtManyArgs-12               486ns ± 1%     498ns ± 0%  +2.40%  (p=0.000 n=20+17)
GobDecode-12                6.43ms ± 0%    6.50ms ± 0%  +1.08%  (p=0.000 n=18+19)
GobEncode-12                5.43ms ± 1%    5.47ms ± 0%  +0.76%  (p=0.000 n=20+20)
Gzip-12                      218ms ± 1%     218ms ± 1%    ~     (p=0.883 n=20+20)
Gunzip-12                   38.8ms ± 0%    38.9ms ± 0%    ~     (p=0.644 n=19+19)
HTTPClientServer-12         76.2µs ± 1%    76.4µs ± 2%    ~     (p=0.218 n=20+20)
JSONEncode-12               12.2ms ± 0%    12.3ms ± 1%  +0.45%  (p=0.000 n=19+19)
JSONDecode-12               54.2ms ± 1%    53.3ms ± 0%  -1.67%  (p=0.000 n=20+20)
Mandelbrot200-12            3.71ms ± 0%    3.71ms ± 0%    ~     (p=0.143 n=19+20)
GoParse-12                  3.22ms ± 0%    3.19ms ± 1%  -0.72%  (p=0.000 n=20+20)
RegexpMatchEasy0_32-12      76.7ns ± 1%    75.8ns ± 1%  -1.19%  (p=0.000 n=20+17)
RegexpMatchEasy0_1K-12       245ns ± 1%     243ns ± 0%  -0.72%  (p=0.000 n=18+17)
RegexpMatchEasy1_32-12      71.9ns ± 0%    71.7ns ± 1%  -0.39%  (p=0.006 n=12+18)
RegexpMatchEasy1_1K-12       358ns ± 1%     354ns ± 1%  -1.13%  (p=0.000 n=20+19)
RegexpMatchMedium_32-12      105ns ± 2%     105ns ± 1%  -0.63%  (p=0.007 n=19+20)
RegexpMatchMedium_1K-12     31.9µs ± 1%    31.9µs ± 1%    ~     (p=1.000 n=17+17)
RegexpMatchHard_32-12       1.51µs ± 1%    1.52µs ± 2%  +0.46%  (p=0.042 n=18+18)
RegexpMatchHard_1K-12       45.3µs ± 1%    45.5µs ± 2%  +0.44%  (p=0.029 n=18+19)
Revcomp-12                   388ms ± 1%     385ms ± 0%  -0.57%  (p=0.000 n=19+18)
Template-12                 63.0ms ± 1%    63.3ms ± 0%  +0.50%  (p=0.000 n=19+20)
TimeParse-12                 309ns ± 1%     307ns ± 0%  -0.62%  (p=0.000 n=20+20)
TimeFormat-12                328ns ± 0%     333ns ± 0%  +1.35%  (p=0.000 n=19+19)
[Geo mean]                  47.0µs         46.9µs       -0.20%

(https://perf.golang.org/search?q=upload:20180326.1)

For #10958.
For #24543.

Change-Id: Icbd52e711fdbe7938a1fea3e6baca1104b53ac3a
Reviewed-on: https://go-review.googlesource.com/102604
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: David Chase <drchase@google.com>
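
To make the past-the-end-pointer problem concrete: nothing changes in user code, only in how the compiler lowers an ordinary range loop. A hedged illustration, with the lowering described in comments (the _p and _n temporaries are the compiler-internal names from the commit message above, not anything visible in the source):

package main

import "fmt"

// sum is an ordinary range loop; the interesting part is the lowering.
func sum(s []int) int {
	total := 0
	// Old lowering: an internal element pointer (_p) was advanced together
	// with the index, so just before the loop exited it briefly pointed one
	// element past the end of s's backing array. Harmless while loops had no
	// safepoints, but a bad pointer for the GC to observe once safepoints
	// are inserted everywhere.
	// New lowering: the pointer is advanced only after the i < _n check has
	// succeeded, so it always points at a valid element at any safepoint.
	for _, v := range s {
		total += v
	}
	return total
}

func main() {
	fmt.Println(sum([]int{1, 2, 3, 4}))
}
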
@gopherbot

gopherbot commented May 23, 2018

Change https://golang.org/cl/114086 mentions this issue: runtime: fix preemption deadlocks in TestDebugCall*


gopherbot pushed a commit that referenced this issue May 24, 2018

runtime: fix preemption deadlocks in TestDebugCall*
TestDebugCall* uses atomic spin loops and hence can deadlock if the
garbage collector is enabled (because of #10958; ironically,
implementing debugger call injection is closely related to fixing this
exact issue, but we're not there yet).

Fix this by disabling the garbage collector during these tests.

Updates #25519 (might fix it, though I suspect not)

Change-Id: If1e454b9cdea8e4b1cd82509b762c75b6acd8476
Reviewed-on: https://go-review.googlesource.com/114086
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
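
The "atomic spin loops" mentioned above are the other common shape of non-preemptible loop. A minimal sketch of the pattern, together with the usual source-level workaround of forcing a scheduling point into the loop (runtime.Gosched is one way to do that; the runtime tests themselves disable GC instead, as in the commit above). The function names are made up for illustration:

package main

import (
	"runtime"
	"sync/atomic"
)

var ready int32

func spinUntilReady() {
	// The atomic load compiles to an instruction (or a nosplit call), so on
	// Go versions without loop or asynchronous preemption there is no
	// preemption point anywhere in this loop.
	for atomic.LoadInt32(&ready) == 0 {
	}
}

func spinUntilReadyPreemptible() {
	for atomic.LoadInt32(&ready) == 0 {
		// Gosched is an ordinary call and a scheduling point, so the runtime
		// can preempt here, at the cost of extra scheduler traffic.
		runtime.Gosched()
	}
}

func main() {
	atomic.StoreInt32(&ready, 1)
	spinUntilReady()
	spinUntilReadyPreemptible()
}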