runtime: tight loops should be preemptible #10958
Comments
aclements added this to the Go1.6 milestone on May 26, 2015
minux
May 26, 2015
Member
Another solution is to send a signal to the thread to force it to be preempted. That is, once a thread is stopped, it masks a certain signal, and if some threads running user code refuse to stop, we send that signal to the current process. In runtime.sigtramp we can force a goroutine switch and then mask the signal.

This solution introduces no overhead in the code, at the expense of complicating the runtime design.
randall77
May 26, 2015
Contributor
The problem with interrupts is that the goroutine can be at an arbitrary address. The interrupted address is probably at a place where we have no data about where the pointers are. We'd either have to record a lot more frame layout/type/liveness information (including registers), or we'd have to simulate the goroutine forward to a safepoint. Both are challenging.
minux
May 26, 2015
Member
The problem with inserting artificial preemption points is reduced numerical performance.

Yet another solution is to just ignore the problem: the user is supposed to insert runtime.Gosched() calls at appropriate points as needed.

By the way, isn't the long-term GC plan to never stop the world?
ianlancetaylor
May 26, 2015
Contributor
It seems to me that we can preempt at a write barrier. So the only loops we are talking about are those that make no writes to the heap and make no function calls. If we think in terms of adding
    var b uint8
loop:
    b++
    if b == 0 {
        preemptCheck()
    }
then the normal path through the loop will have two extra instructions (add/beq) where the add may be to either a register or a memory location, depending on overall register pressure. This will be measurable in tight loops but for most cases shouldn't be too bad.
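A hand-written Go sketch of the same counter idea (illustrative only: preemptCheck is stood in for by runtime.Gosched, and in practice the compiler, not the programmer, would insert the check):

package main

import "runtime"

// sum illustrates the counter-based check by hand: the uint8 counter wraps
// every 256 iterations, so the scheduler call is reached only on the
// overflow path.
func sum(xs []int) int {
    var b uint8
    total := 0
    for _, x := range xs {
        total += x
        b++
        if b == 0 {
            runtime.Gosched() // stand-in for the inserted preemption check
        }
    }
    return total
}

func main() {
    _ = sum(make([]int, 1<<20))
}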
randall77
May 26, 2015
Contributor
No pointer writes to the heap.
But maybe this is enough? If we can preempt at the first write barrier, then even if we let the mutator run it can't modify anything that the GC cares about. So maybe we just set the write barrier enabled bit and let goroutines run. Once a goroutine sees the write barrier bit (how do we force that?) it can't modify any memory the GC cares about. So it is safe to start the GC without waiting for every goroutine to stop.
aclements
May 27, 2015
Member
> The problem of inserting artificial preemption points is reduced numerical performance.

True. That's why I suggested unrolling loops, which AFAIK is a standard solution to this problem.

However, I think the check can be done in just two instructions, even without adding Ian's loop counter. Just CMP the preempt flag and branch if it's set. That branch will almost never be hit, so it will be highly predictable, and the preempt flag should be in the L1, so the check may in fact be extremely cheap.

> No pointer writes to the heap. But maybe this is enough?

This certainly reduces the set of programs that have this problem, but I don't think it actually helps with either of the examples I gave, since numerical kernels probably won't have heap pointer writes, and the runtime test that can deadlock the GC certainly has no pointer writes.
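A hand-written analogue of that two-instruction check (the flag, its name, and the yield are stand-ins; the real flag would be a per-goroutine field maintained by the runtime, and in user Go code a plain read of a shared variable like this is a data race, tolerated here only for illustration):

package main

import "runtime"

// preemptRequested stands in for a runtime-managed preempt flag; the
// compiled check is just a compare of that flag and a rarely-taken branch.
var preemptRequested bool

func dot(a, b []float64) float64 {
    var s float64
    for i := range a {
        s += a[i] * b[i]
        if preemptRequested { // CMP flag + branch
            runtime.Gosched() // stand-in for the runtime's preemption path
        }
    }
    return s
}

func main() {
    v := make([]float64, 1024)
    _ = dot(v, v)
}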
RLH
May 27, 2015
Contributor
This is the reference for doing a GC safepoint at every instruction. It is not something we want to do for x86, much less the various other architectures we need to support: http://dl.acm.org/citation.cfm?id=301652

Most systems I know of do what Austin suggests: unroll the loop and insert a check. No numbers, but 8 unrolls seems to be what I recall. Not only is the branch highly predictable, but the check does not introduce any dependencies, making it close to free on an out-of-order machine. There have been other approaches, such as code plugging and predicated branches, but HW has moved on. I had not seen Ian's suggestion.
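A Go sketch of that unroll-and-check shape (the unroll factor of 8 follows the recollection above; the flag, the yield, and the assumption that the slice length is a multiple of 8 are only there to keep the sketch short):

package main

import "runtime"

var preemptRequested bool // stand-in for a runtime-managed flag

// sumUnrolled processes 8 elements per iteration, so the preemption check
// runs once per unrolled body and its cost is amortized over 8 iterations.
func sumUnrolled(xs []float64) float64 {
    var s float64
    for i := 0; i+8 <= len(xs); i += 8 {
        s += xs[i] + xs[i+1] + xs[i+2] + xs[i+3] +
            xs[i+4] + xs[i+5] + xs[i+6] + xs[i+7]
        if preemptRequested { // one check per 8 iterations
            runtime.Gosched()
        }
    }
    return s
}

func main() {
    _ = sumUnrolled(make([]float64, 1<<16))
}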
aclements referenced this issue on Jun 29, 2015: runtime: golang scheduler is not preemptive - it's cooperative? #11462 (closed)
aclements modified the milestones: Go1.6Early, Go1.6 on Jun 30, 2015
kingluo
Sep 9, 2015
> Currently goroutines are only preemptible at function call points.

To be precise, the preemption check seems to happen only in newstack()? For example:

package main

import (
    "fmt"
    "runtime"
    "time"
)

func test() {
    a := 100
    for i := 1; i < 1000; i++ {
        a = i*100/i + a
    }
}

func main() {
    runtime.GOMAXPROCS(1)
    go func() {
        for {
            test()
        }
    }()
    time.Sleep(100 * time.Millisecond)
    fmt.Println("hello world")
}

test() is not inlined, and the infinite loop calls test(), but it does not preempt at those calls, because morestack() --> newstack() is not involved, so "hello world" is never printed.
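For comparison (and as the cooperative workaround mentioned elsewhere in this thread), the same program with one explicit yield added in the loop does print "hello world"; this is just the example above plus a single runtime.Gosched() call:

package main

import (
    "fmt"
    "runtime"
    "time"
)

func test() {
    a := 100
    for i := 1; i < 1000; i++ {
        a = i*100/i + a
    }
}

func main() {
    runtime.GOMAXPROCS(1)
    go func() {
        for {
            test()
            runtime.Gosched() // explicit yield: gives main's goroutine a chance to run
        }
    }()
    time.Sleep(100 * time.Millisecond)
    fmt.Println("hello world") // now prints
}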
randall77
Sep 9, 2015
Contributor
test() is not inlined, but since it is a leaf it is promoted to nosplit, so it doesn't have the preemption check any more. Fixing that sounds much easier than the rest of this bug. Maybe we could forbid nosplit promotion for functions called from loops?
griesemer
Sep 10, 2015
Contributor
For reference, the same problem appeared in the HotSpot JVM. I remember two approaches:

1) Around 1997/1998, Rene Schmidt (http://jaoo.dk/aarhus2007/speaker/Rene+W.+Schmidt) implemented the following mechanism: threads running a tight loop w/o function calls would receive a signal to temporarily suspend them. The runtime then dynamically generated a partial "copy" of the loop instructions the thread was running, from the current PC to the (unconditional) backward branch, except that the backward branch was replaced with a call into the runtime (leading to proper suspension at a safe point). The thread was then restarted with the pc modified such that it would run the newly generated piece of code. That code would run to the end of the loop body and then suspend itself, at which point the code copy was discarded and the pc (return address) adjusted to continue with the original code (after the safe point). This mechanism was ingenious but also insanely complicated. It was eventually abandoned for:

2) A simple test and branch (2 additional instructions) at the end of a loop body which didn't contain any calls. As far as I recall, we didn't do any form of loop unrolling, and the performance implications were not significant in the overall picture of a larger application.

My vote would be for 2), to start, in loops where @ianlancetaylor's suggestion doesn't work. I suspect that for all but the smallest long-running inner loops (where unrolling might make sense independently), the performance penalty is acceptable.

If this is not good enough, and we don't want to pay the code size cost of unrolling a loop multiple times, here's another idea based on 1): instead of the backward-branch check as in 2) plus unrolling, keep the original loop, and generate (at compile time) a 2nd version of the loop body that ends in a runtime call to suspend itself instead of the backward branch. The code size cost is about the cost of having the loop unrolled twice. When the goroutine needs to run to a safe point, use a signal to temporarily suspend the goroutine, switch the pc to the corresponding pc in the copy of the loop body, continue with execution there, and have the code suspend itself at a safe point. There's no dynamic code generation involved, generating the extra loop body is trivial at compile time, the extra amount of code is less than with loop unrolling, and there's only a little bit of runtime work to modify the pc. The regular code would run at full speed if no garbage collection is needed.
nightlyone
Sep 11, 2015
Contributor
Another idea, similar to @ianlancetaylor's, is to estimate the cost (in ns) of the loop and only check for required suspension every N iterations of the loop body.

Once the compiler can unroll/unpeel, that check-every-N logic can either be inlined after the unrolled body, if unrolling is beneficial, or kept when unrolling makes no sense for that loop body.

Such logic also reads better when using debuggers.
griesemer
Sep 11, 2015
Contributor
@nightlyone The check-every-N seems more complex than just having two extra instructions (compare and branch). And if unrolling is done, that compare-and-branch is already needed only every N loop iterations (where N is the number of times the loop was unrolled).
nightlyone
Sep 11, 2015
Contributor
@griesemer I'm not sure how the test will avoid atomic loads; that's why I suggested the check-every-N. The pseudo-Go-code before unrolling would look like this:

loop:
    // loop-body
    if counter > N {
        counter = 0
        if need_stop = atomic.LoadBool(&runtime.getg().need_stop); need_stop {
            runtime.Gosched()
        }
    }
    counter++
    goto loop

After unrolling it would look like:

loop:
    // loop-body0
    // loop-body1
    // ...
    // loop-bodyM
    if counter > N/M {
        counter = 0
        if need_stop = atomic.LoadBool(&runtime.getg().need_stop); need_stop {
            runtime.Gosched()
        }
    }
    counter++
    goto loop

So the inserted code would be a constant overhead, but would still run only every N iterations.
aclements
Sep 11, 2015
Member
The load doesn't have to be atomic. The current preemption check in the function prologue isn't atomic.
One tricky bit with the compare and branch is that the actual preemption point needs to have no live registers. Presumably, for performance reasons, we want the loop to be able to keep things in registers, so the code we branch to on a preempt has to flush any live registers before the preempt and reload them after the preempt. I don't think this will be particularly hard, but it is something to consider, since it might affect which stage of the compiler is responsible for generating this code.
griesemer
Sep 11, 2015
Contributor
@aclements Indeed. Which is why perhaps switching the pc to a 2nd version of the loop body that ends in a safe point might not be much more complex and permit the loop to run at full speed in the normal case.
aclements
Sep 11, 2015
Member
> @aclements Indeed. Which is why perhaps switching the pc to a 2nd version of the loop body that ends in a safe point might not be much more complex and permit the loop to run at full speed in the normal case.

From the runtime's perspective, I think this would be more complicated because stealing a signal is a logistical pain and we'd have to deal with tables mapping from fast-loop PCs to slow-loop PCs. The compiler would have to generate these tables. This seems like a very clever plan B, but I think first we should try adding a no-op compare and branch and see if it's actually a problem for dense numerical kernels.
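For anyone who wants to make exactly that measurement by hand, a minimal benchmark sketch (the flag, the names, and the vector size are arbitrary, and a hand-written check only approximates what the compiler would emit):

package kernels

import "testing"

var preemptRequested bool // stand-in for a runtime-managed flag
var sink float64          // keeps results live so the loops aren't discarded

func dotPlain(a, b []float64) float64 {
    var s float64
    for i := range a {
        s += a[i] * b[i]
    }
    return s
}

func dotChecked(a, b []float64) float64 {
    var s float64
    for i := range a {
        s += a[i] * b[i]
        if preemptRequested { // never true in this benchmark
            panic("unreachable")
        }
    }
    return s
}

func BenchmarkDotPlain(b *testing.B) {
    x := make([]float64, 4096)
    for i := 0; i < b.N; i++ {
        sink = dotPlain(x, x)
    }
}

func BenchmarkDotChecked(b *testing.B) {
    x := make([]float64, 4096)
    for i := 0; i < b.N; i++ {
        sink = dotChecked(x, x)
    }
}

Saved as, say, loopcheck_test.go, it runs with go test -bench Dot; the difference between the two benchmarks gives a rough upper bound on the cost of the check in a dense inner loop.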
randall77
Sep 11, 2015
Contributor
The new SSA register allocator handles situations like this well, keeping everything in registers on the common edges and spill/call/restore on the unlikely case.
RLH
Sep 11, 2015
Contributor
The load is not dependent on anything in the loop. Assuming the loop is tight, the value is almost certainly in a location already in the L1 cache, and the branch will be highly predictable. Just to make sure the branch predictor doesn't even need to be warmed up, we can make it a backward branch. I would be sort of surprised if, on an out-of-order machine, the cost would even be noticed. That said, build and measure is the only way to be sure.
rsc modified the milestones: Unplanned, Go1.6Early on Nov 4, 2015
aclements referenced this issue on Dec 9, 2015: runtime: goroutine starvation due to Gosched #13546 (closed)
gopherbot
commented
Feb 16, 2016
CL https://golang.org/cl/19516 mentions this issue.
pushed a commit that referenced this issue on Feb 16, 2016
creker referenced this issue on Apr 26, 2016: runtime: tight loop hangs process completely after some time #15442 (closed)
mdempsky referenced this issue on May 19, 2016: runtime: under edge conditions, select fails to read from time.After() channel #14561 (closed)
wsc1
Sep 30, 2018
Hi all,

I do not have a deep understanding of this issue, but I do have a concern about compatibility and about controllability on the part of the programmer.

Most importantly, there are also deliberate tight loops without pre-emption which are useful and may be relied upon. I am relying on this currently in audio I/O, where I do not want the tight loop to be pre-empted unless I tell it to be, in order to reliably exchange real-time data as it is played or recorded. Who knows what other code relies on it?

To me, cooperative pre-emption is part of the language. Without it, the existence of runtime.Gosched() seems useless. I unfortunately do not have a solution for the cases where programs should cooperatively pre-empt but fail to.

But I also remember some very old discussions where it was suggested to simply place a function call in the hot spot causing problems. This would place control of the scheduling in the hands of the programmer, where it has been squarely (modulo OS threads, and at least with GOMAXPROCS=1) for quite some time.
CAFxX
Sep 30, 2018
Contributor
@wsc1 In general, all kernels in common use today also do non-cooperative pre-emption[1], and in practice it works, so I don't really see how adding a similar model to Go could have any impact on the workload you're describing (especially considering that you describe it as an I/O workload, and all I/O operations in Go have always been potential pre-emption points).
[1] unless you use a specialized real-time kernel or a real-time scheduling class on a kernel that supports them
wsc1
Oct 1, 2018
@CAFxX yes, kernels pre-empt and must do so. A runtime isn't a kernel and must work in the context of what the host OS supplies.
For audio: 1) Yes, real-time OS scheduling is the norm, at least in the sense of real-time scheduling priorities; it happens every time you listen to music, enter a call, or watch a movie on many, many devices. 2) It's not really I/O in the sense of a syscall on a file descriptor. It depends on what is being interfaced, but often enough it works like this: an IRQ goes to the kernel from an audio-specific hardware clock, the hardware driver is triggered by the kernel and invokes a user-supplied callback function, and that callback accesses hardware memory, likely something like DMA. If you are pre-empted inside that callback, the real-time quality goes out the window and the audio will glitch. I don't want Go to be doomed to this forever and hence banned from doing audio effectively. So it's not as if the Go code makes a syscall and has the corresponding OS I/O overhead associated with files, network, etc.; the data is already in memory by construction.

For more specifics, consider the following from zc/sio:
func (r *Cb) toC(addr *uint32) error {
    var sz uint32
    i := 0
    for {
        sz = atomic.LoadUint32(addr)
        if atomic.CompareAndSwapUint32(addr, sz, sz-1) {
            return nil
        }
        i++
        if i%atomicTryLen == 0 {
            if i >= atomicTryLim {
                return ErrCApiLost
            }
            // runtime.Gosched may or may not invoke a syscall if many g's on m;
            // use sparingly
            runtime.Gosched()
        }
    }
}
In code like the above, I would expect the introduction of pre-emptive scheduling to make it less reliable at doing its job on time. I'm not a runtime expert, so maybe I'm missing something and scheduling already happens on atomic calls or something (I would guess, and hope, not).

I also don't want to downplay the known problems or the importance of fixing them. I should also say I have not had time to read all the issues and material related to the proposal, and I am not a runtime scheduling expert, but from what I do know, it scares me.

Finally, your comment/question does not address whether programmers have depended on cooperative pre-emption in the Go runtime, nor why runtime.Gosched would be of any use in a pre-emptive scheduler. I have not seen that addressed in the proposal or related material, but there is a lot of it; if anyone has a pointer, it would help.

The presence of runtime.Gosched implies (to me) that scheduling does not run except at predefined points that the programmer can work with; otherwise I don't see a use for it at all. Changing that feels to me like a language-semantics change, a sort of deprecation of runtime.Gosched without providing an alternative. The example above is one that seems to me might well be negatively affected. There is no way that I know of to establish what else might be negatively affected, or its scope. Those authors aren't filing issues because their code works, and they might have no idea about this proposal or its potential effects.
rasky
Oct 1, 2018
Member
I think it would help if you could provide a fully working example of a program that works thanks to your library and that you are worried could stop working after preemption. If it's a mock of real-world software used in production, that would be best.
My two cents is that I assume that an API could be added to mark sections that can't be pre-empted (I assume something like that is also needed within the runtime itself, in critical sections), but I don't have any specific insights on the matter.
wsc1
Oct 1, 2018
@rasky Yes, an API to prevent pre-emption would work (provided it doesn't sys-call).

For an example, take any callback-based audio API which runs on a special thread (e.g. Jack, Apple CoreAudio, Android AAudio), and imagine you want Go to do the C callback work without pre-emption (to the extent possible; yes, the OS can pre-empt, but it is unlikely on a real-time thread) and without sys calls. All the above audio APIs clearly document that you must not invoke things which can take unbounded time under OS load, such as locks/mutexes. Moreover, if you make sys-calls you introduce OS context switching, which makes the real-timing of the callback more variable, which in turn makes the code in the callback inappropriate for handling real-time audio data.

The library above also comes with a test to emulate that. Because of its relation to OS scheduling, note that to test reliability one must load the OS outside the program in question, so you cannot draw conclusions from testing one program without also placing the entire host OS under test. Sorry I can't provide such a test which controls the whole OS at this time. But it should be easy to see: if OS real-time threads are involved in running user code, then forcing the kernel into a sys call, or forcing the Go code into a reschedule (which can sys-call) inside that user code, will take longer, potentially much longer in terms of wall clock time, than not doing so. Then the audio can glitch, which in turn means no one would choose Go to do the audio processing with expectations of reliability.

runtime.Gosched() can invoke sys calls; the programmer should (continue to) be able to avoid that.
bcmills
Oct 1, 2018
Member
> if OS real time threads are involved in running user code, then forcing the kernel into a sys call or forcing the go code into a reschedule (which can sys call) inside that user code will take longer, potentially much longer in terms of wall clock time, than not.
If a goroutine is locked to an OS thread and not allocating memory in the steady state, then it seems like the only reason to preempt it would be to mark its stack to complete a GC cycle, which should not take very long anyway.
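For readers who haven't used that mechanism, a minimal sketch of the pattern (the channel handoff, buffer size, and per-sample work are illustrative only, not a recommendation for real audio code):

package main

import "runtime"

// process stays on one OS thread and avoids allocation in its steady state:
// the caller reuses a single buffer, and work arrives over a buffered channel.
func process(in <-chan []float32, done chan<- struct{}) {
    runtime.LockOSThread() // pin this goroutine to its OS thread
    defer runtime.UnlockOSThread()
    for buf := range in {
        for i := range buf {
            buf[i] *= 0.5 // stand-in for real per-sample work
        }
    }
    close(done)
}

func main() {
    in := make(chan []float32, 4)
    done := make(chan struct{})
    go process(in, done)
    buf := make([]float32, 256)
    for i := 0; i < 8; i++ {
        in <- buf // the same buffer is reused; no per-iteration allocation
    }
    close(in)
    <-done
}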
wsc1
Oct 1, 2018
> > if OS real time threads are involved in running user code, then forcing the kernel into a sys call or forcing the go code into a reschedule (which can sys call) inside that user code will take longer, potentially much longer in terms of wall clock time, than not.
>
> If a goroutine is locked to an OS thread and not allocating memory in the steady state, then it seems like the only reason to preempt it would be to mark its stack to complete a GC cycle, which should not take very long anyway.

Yes, GC pause time is not the primary worry here. Taking away established mechanisms to handle pre-emption, and the ability for a programmer to write Go code which doesn't sys-call on the current m over the real-time span associated with a critical section of code, is the worry for my audio use case. There are an unknown number of other use cases of runtime.Gosched() and no way I know of to guarantee their feedback.

Although LockOSThread would very much alleviate many problems with uncontrolled pre-emption, it also introduces a lot of new considerations and problems. It does not seem to me like a suitable replacement for all programs which currently rely on cooperative scheduling. Again, I would expect there are no issues for those programs now because they work according to the current language semantics of cooperative pre-emption.

One thing I don't understand about the issues related to the pre-emptive scheduling proposal is this (please note I haven't had time to really study them; this is just a big-picture impression): if 10 years ago Go said "scheduling is cooperative, and so if you loop without function calls or communication on a thread, it will spin lock and cause associated problems, so use runtime.Gosched() appropriately" (which is basically my memory of it), then why today are we proposing to insert runtime.Gosched() automatically instead of telling the programmers to do it to solve those issues? I think at least many programmers would prefer to have the control over pre-emption as it is today and take the associated responsibility.
dr2chase
Oct 1, 2018
Contributor
You can get a pretty-near worst-case test of this right now by either

- building an entire system with GOEXPERIMENT=preemptibleloops ./make.bash, which will insert explicit (i.e., slowing your program down) checks that allow preemption in all loops, or
- compiling your application (but not the runtime or other packages that it uses) with -gcflags=-d=ssa/insert_resched_checks/on

Lack of preemption causes problems for other people, and these have been costly to debug, so assume that this change will occur. We've been working on it for some time, have tried (to some extent) four different techniques for addressing it, and have discarded three of them as either too costly or too risky. What you get with the two incantations above is one of the too-slow solutions.

If preemptible loops turn out to be a problem for you, there are ways to work around this, but it would be really nice to measure first. One reason is that it might be only a hypothetical problem; another is that if you do experience latency glitches, they may turn out to be instances of actual bugs that, if fixed, would reduce this to a non-problem. (I know of several such potential bugs; their priority depends on their harm, preferably measured. See #27732.) I just learned to use the execution tracer to study GC/scheduler bugs, and it would probably be helpful for you. Failing that, a small example taken from your application that demonstrates (measures?) the problem would be helpful to us.

And yes, there are ways to work around this, but again, we would prefer to fix bugs rather than have users sprinkle incantations in their code to work around bugs. (For example, once upon a time some people "knew" that adding an empty switch statement to their code would prevent it from being inlined.) Ideally each vile hack that is inserted would include a // TODO: remove when #xyzzy is fixed annotation.
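For reference, a minimal way to capture such an execution trace from a Go program (the output file name is arbitrary):

package main

import (
    "log"
    "os"
    "runtime/trace"
)

func main() {
    f, err := os.Create("trace.out")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
    if err := trace.Start(f); err != nil {
        log.Fatal(err)
    }
    defer trace.Stop() // runs before f.Close on exit

    // ... the workload under investigation goes here ...
}

The resulting file can then be examined with go tool trace trace.out.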
wsc1
Oct 1, 2018
> You can get a pretty-near worst-case test of this right now by either
> - building an entire system with GOEXPERIMENT=preemptibleloops ./make.bash, which will insert explicit (i.e., slowing your program down) checks that allow preemption in all loops, or
> - compiling your application (but not the runtime or other packages that it uses) with -gcflags=-d=ssa/insert_resched_checks/on

Thanks, will check it out.

> Lack of preemption causes problems for other people, and these have been costly to debug, so assume that this change will occur. We've been working on it for some time and tried (to some extent) four different techniques for addressing it, and discarded three of them as either too costly or too risky. What you get with the two incantations above is one of the too-slow solutions.
> If preemptible loops turn out to be a problem for you, there are ways to work around this, but it would be really nice to measure first. One reason is that it might only be a hypothetical problem,

Pre-empts which can cause sys calls are not hypothetical problems in my opinion. People who have been working with audio for a long time have used real-time OS thread scheduling long enough to know very well that sys calls during a callback cause glitching. So they don't do it, and they won't adopt a program that does do it when given an option to choose one that doesn't. That is very much not hypothetical; it is the reality of quality real-time audio software.

> another is that if you do experience latency glitches, they may turn out to be instances of actual bugs that, if fixed, would reduce this to a non-problem. (I know of several such potential bugs; their priority depends on their harm, preferably measured. See #27732.) I just learned to use the execution tracer to study GC/scheduler bugs, and it would probably be helpful for you.

Thanks, will look into that as well.

> Failing that, a small example taken from your application that demonstrates (measures?) the problem would be helpful to us.

As previously noted, there is no small application, because demonstrating the problem involves placing the entire host OS under test and load, to measure glitching in situ and see how well Go is interacting with OS real-time scheduling. Clearly, a lack of sys-calls on a real-time thread will perform better than the same thing with sys-calls in any case.

A small application which measures timing jitter is possible and necessary for me in the near future anyway, and I'm glad it may be helpful here as well. But it would take a lot more to test the timing jitter in situ, as that measure is also a measure of how well the app is interacting with the OS's expectations of user code, not just how long a GC pause takes standalone. This latter in situ measure is what matters. The former, run standalone, only gives a best-case measure which applies only to dedicated hardware, which is not the expected deployment for most real-time audio processing software, and does not exercise the role of pre-empts in fulfilling OS expectations.

> And yes, there are ways to work around this,

I am interested in learning more about that.

> but again, we would prefer to fix bugs rather than have users sprinkle incantations in their code to work around bugs. (For example, once upon a time some people "knew" that adding an empty switch statement to their code would prevent it from being inlined.) Ideally each vile hack that is inserted would include a // TODO: remove when #xyzzy is fixed annotation.

Yes, linking code to issues is a good thing, but I'm personally skeptical of linking source code to a code hosting site I don't control. Ideally, a distributed issue tracker would solve that.

Thanks for the much-needed overview of the status. I'll update here with measurements and tests as I get to them unless y'all suggest a more appropriate venue.
wsc1
Oct 2, 2018
> If preemptible loops turn out to be a problem for you, there are ways to work around this, but it would be really nice to measure first. [...] Failing that, a small example taken from your application that demonstrates (measures?) the problem would be helpful to us.

As it seems to me there is a disconnect about what it means to measure and test the issue for the case of real-time audio, I wanted to try to clarify.

We have the following hierarchical structure:

OS kernel
|- Go runtime
|- Other programs/system load

Let me denote the OS kernel by K, the Go runtime by R, and the other programs by O. Processes/threads in R and O are subject to a partition of the OS scheduling. I will denote by High the real-time priority and by Low the normal priority. High-priority processes/threads, once scheduled to run, always remain runnable by K until they yield, their time is up, or they sys-call.
Now suppose there is a goroutine g in R running on some real-time High priority thread, doing processing for audio PCM data. If g is pre-empted by R, then the pre-emption may invoke a sys-call depending on whether, for example, a runtime scheduling semaphore or lock fails a user land fast-path.
The real-time constraints in this case are wall-clock, and so they must take into account O and the effects of O on K. The existence of a single sys-call in R pre-emption of g will make K context switch from the thread on which g is running.
Now if O is small, then the sys call won't necessarily take much time. In this case, we can measure latency accurately but only IF we control the whole OS. It is not so interesting a measure because the condition of control of the OS is unrealistic.
If, on the other hand, at any point during the real-time audio I/O or processing, O spikes during a pre-emption of g by R, then any sys-call by R will take longer. Given that in many use-cases the audio I/O is useless, or greatly devalued, if it glitches, and that we don't control O, we should assume that any sys-call on the thread of g, especially a lock or file I/O or allocation, will take unbounded time w.r.t. the real-time needs of g.

Now, how to measure this? Do I run g standalone and measure the latency or timing jitter? (*) No: that doesn't provide any information about the effect of a sys-call under load on the wall-clock real-timing of g.
In this issue commentary, I have not seen any acknowledgement of (*). To the contrary, I have only seen suggestions that it might not be a real concern, despite the fact that it is well-established in real-time audio software culture.
Now if I look at this as a game-theoretical question where O is an adversary to R, then I can easily, in theory at least, construct a benchmark of the whole OS where I look at the source code for R, find a likely sys-call, and instruct O to use what resources it has to slow down K as much as possible in implementing the sys-call. I can also lower the latency requirements of g as much as I want. There will always be a tradeoff point for a given set of resources for the OS, O, g, and R.
On the other hand, if there is no sys-call in the real time allotted by the OS to g, then the game above doesn't matter so much because g is always scheduled High and running (running because otherwise the question of pre-emption doesn't matter).
So one way (let's call this M1) to "measure" the jitter is this: when a pre-emption of g occurs, instruct all sys-calls by R on the thread of g to crash the program and consider the result a total failure. We can then measure time to failures for different implementations of g and R and discount implementing O.
Another way (let's call this M2) is to implement O, measure the actual timing jitter of g, and find the tradeoff points at which failures don't occur for a long time.
My point here is that there is a necessarily qualitative/analytical aspect to the problem centered around sys-calls in Go's runtime. The simplest way to evaluate and proceed would be to eliminate entirely the possibility of a sys-call when g is working under real time constraints. I can already do that today with cooperative Go runtime scheduling.
From what I have learned about the runtime so far, I see there are different levels of locks for different tasks (from Hacking.md). On some OSs some of the tasks are doable with atomics in user-land and some are not. On all OS's some of the tasks are not doable in user land without a sys call (eg signal blocking). Pre-emption does not specify which of these will be invoked during the real-time allocated to a running g.
So, to measure this, I'd like to advocate that the best measure considers any sys call during pre-emption the primary source of failures, and then the measure will be either
a) by construction, there are no such failures; or
b) for a given K, O, R, g setup, how long it takes to reach such a failure or latency-requirement violation.

Some things like LockOSThread play a role, as noted by @bcmills.

However, one question I have is whether or not the pre-emption mechanism destined to go into the runtime uses system calls in order to pre-empt in the first place. If so, it makes M2, aka case b) above, simply dependent on the time it takes for a pre-emption to occur.
Thanks for any thoughts or hopefully acknowledgement of (*).
CC @hajimehoshi @yobert @gordonklaus @dskinner
note: not Go members, but on github/google/oboe
CC @philburk @dturner
Real time audio callback requirements docs:
- most cited blog
- stackoverflow
- juce forum
- chrome developer docs
- android priority inversion
- oboe callback requirements (see "callbacks do's and don'ts")
- audio programming
wsc1
referenced this issue
Oct 2, 2018
Open
unreliability in C callback APIs which execute the callback on foreign thread #17
wsc1
Oct 4, 2018
For a data point on the numeric tight-loop effects of -gcflags=-d=ssa/insert_resched_checks/on:
hole9.cnf.txt is a file which you can either rename with a .cnf extension or cat as input to http://github.com/irifrance/gini.
It is a bit slower for this case; see below.
The benchmark is a classic for stress-testing SAT solvers; level 9 gives something that runs long enough to exercise tight loops without cache warm-up time influencing the result, and short enough to give an answer without too long a wait.
One would expect a similar slowdown for a wide variety of SAT problems.
It is unrelated to the audio case.
⎣ ⇨ go install github.com/irifrance/gini/...
⎡ /Users/scott/Dev/gini/src/github.com/irifrance/gini:scott@pavillion1 (18-10-04 15:00:07) ───────────────────── (86365|0)⎤
⎣ ⇨ time gini ~/G/Benchmarks/SAT/hole9.cnf
c [gini] 2018/10/04 15:00:11 /Users/scott/G/Benchmarks/SAT/hole9.cnf
c [gini] 2018/10/04 15:00:11 parsed dimacs in 1.37853ms
s UNSATISFIABLE
gini ~/G/Benchmarks/SAT/hole9.cnf 15.45s user 0.11s system 97% cpu 15.961 total
⎣ ⇨ go install -gcflags=-d=ssa/insert_resched_checks/on github.com/irifrance/gini/...
⎡ /Users/scott/Dev/gini/src/github.com/irifrance/gini:scott@pavillion1 (18-10-04 15:01:04) ───────────────────── (86367|0)⎤
⎣ ⇨ time gini ~/G/Benchmarks/SAT/hole9.cnf
c [gini] 2018/10/04 15:01:07 /Users/scott/G/Benchmarks/SAT/hole9.cnf
c [gini] 2018/10/04 15:01:07 parsed dimacs in 1.109492ms
s UNSATISFIABLE
gini ~/G/Benchmarks/SAT/hole9.cnf 15.80s user 0.09s system 98% cpu 16.115 total
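For anyone who wants to try the same kind of A/B comparison on a smaller kernel, here is a minimal benchmark sketch; the summation loop and the package name are arbitrary stand-ins I made up, not gini code.

// tightloop_test.go
package tightloop

import "testing"

// sum is a call-free loop, so without inserted resched checks it contains
// no pre-emption point.
func sum(n int) int {
	s := 0
	for i := 0; i < n; i++ {
		s += i ^ (i << 1)
	}
	return s
}

// sink keeps the compiler from eliminating the loop as dead code.
var sink int

func BenchmarkTightLoop(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = sum(1 << 16)
	}
}

Running go test -bench=TightLoop once as-is and once with the -gcflags=-d=ssa/insert_resched_checks/on flag used above gives a rough per-loop cost of the inserted checks; whether it lands near the roughly 2% seen for gini here will depend on the kernel.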
wsc1
referenced this issue
Oct 4, 2018
Open
is this API appropriate, especially for real time use #3
wsc1
Oct 4, 2018
Most importantly, there are also deliberate tight loops without pre-emption which are useful and may be relied upon.
Having thought about the audio case and its relation to syscalls, I would guess it might be worth thinking about whether pre-emption could cause existing Go user syscalls, or the runtime's own, to fail.
For example, a loop like:
for ... {
	syscall.SysCall(syscall.A, ...)
	syscall.SysCall(syscall.B, ...)
}
Or if pre-emption is at loop boundaries only then the variant form:
for i := range ... {
	if i%2 == 0 {
		syscall.SysCall(syscall.A, ...)
	} else {
		syscall.SysCall(syscall.B, ...)
	}
}
Then I would think the programmer likely assumes, today, that no syscall other than the ones in her program happens between syscall.A and syscall.B.
If such a syscall then interacts badly with a syscall used by the runtime, that could be a source of problems which is very hard to debug, both for the Go programmer and for the runtime developer. For example, signal-related syscalls.
Not to say I think the above is common; just food for thought.
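As a concrete, hedged illustration of that pattern (not code from any real project), the sketch below issues two syscalls back to back in a loop and records the largest wall-clock gap between them; Gettimeofday stands in for the arbitrary A and B above, and the iteration count is arbitrary.

package main

import (
	"fmt"
	"syscall"
	"time"
)

func main() {
	var worst time.Duration
	for i := 0; i < 100000; i++ {
		var a, b syscall.Timeval
		if err := syscall.Gettimeofday(&a); err != nil { // "syscall A"
			panic(err)
		}
		if err := syscall.Gettimeofday(&b); err != nil { // "syscall B"
			panic(err)
		}
		gap := time.Duration(b.Sec-a.Sec)*time.Second +
			time.Duration(b.Usec-a.Usec)*time.Microsecond
		if gap > worst {
			worst = gap
		}
	}
	// Anything scheduled between A and B shows up as a gap far larger than
	// the cost of two gettimeofday calls.
	fmt.Println("worst observed gap between A and B:", worst)
}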
wsc1
Oct 4, 2018
Perhaps not inserting runtime.Gosched into loops that already contain syscalls or runtime.Gosched calls would be worth considering?
wsc1
Oct 7, 2018
Or if pre-emption is at loop boundaries only then the variant form:
for i := range ... {
	if i%2 == 0 {
		syscall.SysCall(syscall.A, ...)
	} else {
		syscall.SysCall(syscall.B, ...)
	}
}
Actually, it is deeper than that, and independent of whether the compiler treats syscalls specially with respect to pre-emption in loops, because the syscalls can occur outside the loops that would be pre-empted:
syscall.SysCall(syscall.A, ...)
for {
	// built-in math/logic only
}
syscall.SysCall(syscall.B, ...)
dr2chase
Oct 8, 2018
Contributor
We're going to fix this bug (that is, the bug with this number, that Go code must respond to a request for a handshake w/ the runtime in a very timely fashion). It's a problem for users in already-running code, and has been for some time. In the worst case, applications can hang because of this, which is not acceptable.
The go runtime can't provide guarantees like "the Go scheduler contains no syscalls" -- for one thing, as soon as a goroutine is locked to an OS thread, starting and stopping that thread requires syscalls. I looked, and as the runtime is currently written, a locked thread is always put to sleep when its quantum expires, even if it is destined to be the next runnable goroutine because there are no others that are runnable. We have no evidence this is causing anyone problems -- there's no reproducible test case, no problems with blown SLOs, etc.
I think it would be helpful if you could start small, and with the plainest possible Go, so we actually have some examples to work with, to measure latencies, and figure out their causes, and also figure out what the constraints are. There will almost certainly be problems, but they'll be actual problems, not hypothesized problems, and we'll be able to measure them and decide what's the best way to deal with them. Crucially, because we have example code to work with, we can use it to create test cases so that whatever bugs get fixed, stay fixed.
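For reference, the goroutine-to-thread locking mentioned above looks roughly like the sketch below; the channel-fed callback loop, buffer size, and period are purely illustrative assumptions, and, as noted, parking and unparking the locked thread still involves syscalls inside the scheduler.

package main

import (
	"runtime"
	"time"
)

func audioLoop(frames <-chan []float32) {
	runtime.LockOSThread() // this goroutine now keeps its OS thread to itself
	defer runtime.UnlockOSThread()
	for buf := range frames {
		_ = buf // hand the buffer to the (hypothetical) audio API here
	}
}

func main() {
	frames := make(chan []float32, 4)
	go audioLoop(frames)
	for i := 0; i < 10; i++ {
		frames <- make([]float32, 256)
		time.Sleep(10 * time.Millisecond) // stand-in for the callback period
	}
	close(frames)
}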
wsc1
Oct 8, 2018
We're going to fix this bug (that is, the bug with this number, that Go code must respond to a request for a handshake w/ the runtime in a very timely fashion). It's a problem for users in already-running code, and has been for some time. In the worst case, applications can hang because of this, which is not acceptable.
I'm glad it will help those for whom it causes problems. I don't want to prevent that.
As a long-time Go user, I don't classify this as a bug; knowing the system won't pre-empt is a long-standing quality of the language that I, and probably others, have knowingly made use of and invested time in.
So, one man's bug is another's feature. A veritable testament to the success of Go. Let's keep being inclusive so more things like that will happen.
wsc1
Oct 8, 2018
The go runtime can't provide guarantees like "the Go scheduler contains no syscalls" -- for one thing, as soon as a goroutine is locked to an OS thread, starting and stopping that thread requires syscalls. I looked, and as the runtime is currently written, a locked thread is always put to sleep when its quantum expires, even if it is destined to be the next runnable goroutine because there are no others that are runnable. We have no evidence this is causing anyone problems -- there's no reproducible test case, no problems with blown SLOs, etc.
Thanks for checking and the info. I agree the scheduler can't guarantee no sys calls. It could probably benefit from more use of lockless structures.
wsc1
Oct 8, 2018
I think it would be helpful if you could start small, and with the plainest possible Go, so we actually have some examples to work with, to measure latencies, and figure out their causes, and also figure out what the constraints are. There will almost certainly be problems, but they'll be actual problems, not hypothesized problems, and we'll be able to measure them and decide what's the best way to deal with them. Crucially, because we have example code to work with, we can use it to create test cases so that whatever bugs get fixed, stay fixed.
Thanks. As previously noted, we're dealing with wall-clock time, and it appears to me from all commentary that no-one understands that you can't put wall-clock time latency tests in a box and pass it around the cloud for analysis without simulating the entire system or having control of the hardware.
If a concrete test is what is necessary for deeming the problem more than hypothetical, then see above: if you provide a way to host or receive such tests, I'm happy to help advance that. If you do not, then I guess we'll just have to agree to disagree about whether the problem is hypothetical, and I'll be on my way.
Thanks in any case.
ianlancetaylor
Oct 8, 2018
Contributor
@wsc1 I suppose I'm beating a dead horse, but you haven't demonstrated a problem. You've pointed to a possible problem. That's not the same thing. That's why we keep calling it hypothetical.
This issue was opened because of real problems demonstrated by real users with real code. You have made clear that the solutions being discussed to those real problems may possibly cause problems in other code. But we don't know. So let's try fixing the real problems and find out what happens with the other code. Then, based on real experience, let's decide how to fix all the real problems.
Nobody is saying that we should break audio code. We're saying: let's see what happens. It doesn't make sense not to fix real problems because of the fear of hypothetical problems. Perhaps the fix under consideration causes too many other problems and cannot be implemented. But let's try it and find out, not reject it without trying.
Alternatively, if you have a proposal for a different fix to the real problems we're trying to fix, by all means tell us.
wsc1
Oct 9, 2018
Thanks. My problem with the fix is real. I would appreciate it if there were a way to demonstrate that. But the Go team does not seem to have the resources to test, receive, or evaluate those real problems.
Such a circumstance doesn't, to me, merit dismissing what has been presented as "not real"; I think a better categorisation would be "we can't know (given our resources) how much of a problem it is".
Also, this change was hidden from my view of Go activities, obscured by everything else that was pushed, and it was not mentioned by anyone during a long discussion on golang-dev when the themes were quite clearly related. That cost me a great deal of time, as well as the need to rethink, redesign, and re-implement about a million things. If that cost is not real to you or y'all, well, then I don't appreciate that, nor do I think anyone would.
ianlancetaylor
Oct 9, 2018
Contributor
I'm sorry if it seems like we are wasting your time. That is not my intent. I don't understand what you mean when you say that we lack resources; that may well be true, but if you send us benchmarks we can run them.
In general we seem to be talking past each other. Perhaps best to revisit this if and when the design described in this issue is implemented.
AndrewSav
Oct 9, 2018
@wsc1 you wrote:
Because of its relation to OS scheduling, […] Note that to test reliability, one must load the OS outside the program in question, so you cannot draw conclusions from testing one program without also placing the entire host OS under test. Sorry I can't provide such a test which controls the whole OS at this time.
You also wrote:
it appears to me from all commentary that no-one understands that you can't put wall-clock time latency tests in a box and pass it around the cloud for analysis without simulating the entire system or having control of the hardware.
I can tell you that I personally definitely do not understand it. Of course it says nothing about whether others understand it, or not, but perhaps some explanation can be in order? In particular:
- "one must load the OS outside the program in question" - what does that mean? How do you "load" OS? Do you mean booting it up? If yes surely booting it up happens outside the program in question. What do you mean then?
- "so you cannot draw conclusions from testing one program without also placing the entire host OS under test" - why is that, you cannot draw conclusions? also what does it mean, specifically, "placing entire host OS under test". How does one do that and why?
- "Sorry I can't provide such a test which controls the whole OS at this time." Why cannot you? What is involved?
Thank you in advance, I would really appreciate your insight!
wsc1
Oct 9, 2018
I'm sorry if it seems like we are wasting your time. That is not my intent.
Thanks.
I don't understand what you mean when you say that we lack resources; that may well be true, but if you send us benchmarks we can run them.
Thanks again.
In general we seem to be talking past each other. Perhaps best to revisit this if and when the design described in this issue is implemented.
Yes, I agree
@AndrewSav if you want to follow up about the latency tests nonetheless feel free to mail me or comment on the "Questions" issue at github.com/zikichombo/sio.
dr2chase
Oct 9, 2018
Contributor
no-one understands that you can't put wall-clock time latency tests in a box and pass it around the cloud for analysis without simulating the entire system or having control of the hardware.
We can currently eyeball execution traces for latency problems that the Go runtime causes. It gives a quite reasonable rendition of virtual-ish time, where OS-induced latency appears as "the runtime didn't touch this thread, but it mysteriously did nothing for a little while". For runtime-induced latency purposes, we mostly ignore those implicit glitches (we need to keep an eye on them, for cases where the runtime does a syscall before logging the event, but I think you get the picture). We could automate this.
There would be an annoying one-time investment in a useful bug harness, but I think it would be adequate to this purpose. Simple rule of thumb, if it's not tested, it doesn't work. If we intend to support this sort of low latency behavior, we need a test.
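For reference, capturing such a trace needs only the standard runtime/trace package; the output file name and the sleep standing in for the workload below are arbitrary, and the result is inspected afterwards with go tool trace trace.out.

package main

import (
	"os"
	"runtime/trace"
	"time"
)

func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		panic(err)
	}
	// Run the latency-sensitive workload here; a sleep stands in for it.
	time.Sleep(500 * time.Millisecond)
	trace.Stop()
}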
wsc1
Oct 11, 2018
@wsc1 you wrote:
I can tell you that I personally definitely do not understand it. Of course it says nothing about whether others understand it, or not, but perhaps some explanation can be in order? In particular:
- "one must load the OS outside the program in question" - what does that mean? How do you "load" OS? Do you mean booting it up? If yes surely booting it up happens outside the program in question. What do you mean then?
I mean placing the operating system under load from processes other than the one(s) being measured.
- "so you cannot draw conclusions from testing one program without also placing the entire host OS under test" - why is that, you cannot draw conclusions? also what does it mean, specifically, "placing entire host OS under test". How does one do that and why?
Because we are dealing with wall-clock time, which includes the time the operating system devotes to other processes that compete for resources with the software in question.
So to know how the software behaves in place with respect to wall-clock time, we need the whole picture: the time spent inside the process is less relevant to the problem than how the process interacts with other programs in the context of an operating system. (This is in part thanks to the great work on Go's garbage-collection runtime so far.)
- "Sorry I can't provide such a test which controls the whole OS at this time." Why cannot you? What is involved?
I can, have, and do test on what hardware I have. However, the test framework @dr2chase could make available to me does not provide an interface for specifying what the operating system is doing at the same time, and, since the test environment might not be real hardware but a container, a VM, Kubernetes, or the like, the wall-clock timings are not expected to agree reliably when we run the tests.
Also, on my end, it requires finding a reasonable and useful model of what those other processes are, so we can turn a knob, see a latency, and then use that knob as a reference point for real operating-system situations.
These measures are, as @dr2chase mentioned, something he can eyeball/estimate, but I'm interested in measuring time to failure given uncontrolled external influences, with and without real-time threads (which require root access).
These problems lead me to think that maybe it's better to simulate the timing of the system calls by modifying Go solely for the purpose of the test (which would enable precise measurement on all platforms under simulation), or maybe it's better to modify the execution tracer to output times under a simulation after a test has already been run.
But all this is work, and it requires coordination beyond standard benchmarking and test cases.
It is also work necessitated in part by the assertion that the runtime should be able to pre-empt the programmer at any time, without the programmer being able to control that. To the extent that I can control pre-emption as a Go programmer (which is possible today and has been for a long time), this problem is my concern on my hardware, rather than a matter of coordinating tests and defining OS simulations through the Go issue tracker.
I prefer independence, because depending on the resources @dr2chase can make available to me gives the Go project control over my timeline, which I simply can't afford.
Thank you in advance, I would really appreciate your insight!
Thanks for your questions.
wsc1
Oct 11, 2018
@aclements
Very impressive work on the pre-emption.
Please don't take my timeline concerns as an objection to the work, as it is clear it fixes many other problems. I'll just stick with go1.11 and work out compatibility later.
wsc1
Oct 11, 2018
There would be an annoying one-time investment in a useful bug harness, but I think it would be adequate to this purpose. Simple rule of thumb, if it's not tested, it doesn't work. If we intend to support this sort of low latency behavior, we need a test.
Thanks very much for the offer. However, I can't afford to make the feasibility of what I am working on subject to your timeline and resources. I'll make my own tests, stay with go1.11, and try to keep in mind things that might help us communicate wall-clock tests for getting TTF along the way. One such thing that comes to mind is making the scheduler's random seed settable, for reproducibility.
aclements commented May 26, 2015
Currently goroutines are only preemptible at function call points. Hence, it's possible to write a tight loop (e.g., a numerical kernel or a spin on an atomic) with no calls or allocation that arbitrarily delays preemption. This can result in arbitrarily long pause times as the GC waits for all goroutines to stop.
In unusual situations, this can even lead to deadlock when trying to stop the world. For example, the runtime's TestGoroutineParallelism tries to prevent GC during the test to avoid deadlock. It runs several goroutines in tight loops that communicate through a shared atomic variable. If the coordinator that starts these is paused part way through, it will deadlock.
One possible fix is to insert preemption points on control flow loops that otherwise have no preemption points. Depending on the cost of this check, it may also require unrolling such loops to amortize the overhead of the check.
This has been a longstanding issue, so I don't think it's necessary to fix it for 1.5. It can cause the 1.5 GC to miss its 10ms STW goal, but code like numerical kernels is probably not latency-sensitive. And as far as I can tell, causing a deadlock like the runtime test can do requires looking for trouble.
@RLH
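A minimal sketch of the effect (the trip count and sleep are arbitrary, and it assumes GOMAXPROCS >= 2): on a toolchain without loop pre-emption, runtime.GC below blocks until the call-free loop finishes; with asynchronous pre-emption it returns almost immediately.

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	go func() {
		var x int64
		for i := int64(0); i < 5e9; i++ { // no calls, no allocation
			x += i
		}
		_ = x
	}()

	time.Sleep(10 * time.Millisecond) // let the spinner get going

	start := time.Now()
	runtime.GC() // needs a stop-the-world, which must wait until the loop can be preempted
	fmt.Println("runtime.GC returned after", time.Since(start))
}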