runtime: tight loops should be preemptible #10958

Open
aclements opened this Issue May 26, 2015 · 105 comments

@aclements
Member

aclements commented May 26, 2015

Currently goroutines are only preemptible at function call points. Hence, it's possible to write a tight loop (e.g., a numerical kernel or a spin on an atomic) with no calls or allocation that arbitrarily delays preemption. This can result in arbitrarily long pause times as the GC waits for all goroutines to stop.

In unusual situations, this can even lead to deadlock when trying to stop the world. For example, the runtime's TestGoroutineParallelism tries to prevent GC during the test to avoid deadlock. It runs several goroutines in tight loops that communicate through a shared atomic variable. If the coordinator that starts these is paused part way through, it will deadlock.
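
For a concrete picture, here is a hedged sketch of the problematic shape (not the actual runtime test; whether it actually wedges depends on timing and on when the collector tries to stop the world):

    package main

    import (
        "runtime"
        "sync/atomic"
    )

    func main() {
        runtime.GOMAXPROCS(2)
        var flag uint32
        // Two goroutines spin on an atomic with no calls, allocation, or
        // stack growth, so with purely cooperative preemption the runtime
        // has no way to stop them.
        for i := 0; i < 2; i++ {
            go func() {
                for atomic.LoadUint32(&flag) == 0 {
                }
            }()
        }
        // If a stop-the-world begins before this store runs, the GC waits
        // forever for the spinners while the goroutine that would release
        // them is already stopped.
        atomic.StoreUint32(&flag, 1)
        println("spinners released")
    }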

One possible fix is to insert preemption points on control flow loops that otherwise have no preemption points. Depending on the cost of this check, it may also require unrolling such loops to amortize the overhead of the check.

This has been a longstanding issue, so I don't think it's necessary to fix it for 1.5. It can cause the 1.5 GC to miss its 10ms STW goal, but code like numerical kernels is probably not latency-sensitive. And as far as I can tell, causing a deadlock like the runtime test can do requires looking for trouble.

@RLH

@aclements aclements added this to the Go1.6 milestone May 26, 2015

@minux

Member

minux commented May 26, 2015

@randall77

Contributor

randall77 commented May 26, 2015

The problem with interrupts is that the goroutine can be at an arbitrary address. The interrupted address is probably at a place where we have no data about where the pointers are. We'd either have to record a lot more frame layout/type/liveness information (including registers), or we'd have to simulate the goroutine forward to a safepoint. Both are challenging.

@minux

Member

minux commented May 26, 2015

@ianlancetaylor

Contributor

ianlancetaylor commented May 26, 2015

It seems to me that we can preempt at a write barrier. So the only loops we are talking about are those that make no writes to the heap and make no function calls. If we think in terms of adding

    var b uint8
  loop:
    b++
    if b == 0 {
        preemptCheck()
    }

then the normal path through the loop will have two extra instructions (add/beq) where the add may be to either a register or a memory location, depending on overall register pressure. This will be measurable in tight loops but for most cases shouldn't be too bad.

@randall77

Contributor

randall77 commented May 26, 2015

No pointer writes to the heap.

But maybe this is enough? If we can preempt at the first write barrier, then even if we let the mutator run it can't modify anything that the GC cares about. So maybe we just set the write barrier enabled bit and let goroutines run. Once a goroutine sees the write barrier bit (how do we force that?) it can't modify any memory the GC cares about. So it is safe to start the GC without waiting for every goroutine to stop.

@aclements

Member

aclements commented May 27, 2015

> The problem of inserting artificial preemption points is reduced numerical performance.

True. That's why I suggested unrolling loops, which AFAIK is a standard solution to this problem.

However, I think the check can be done in just two instructions, even without adding Ian's loop counter. Just CMP the preempt flag and branch if it's set. That branch will almost never be hit, so it will be highly predictable, and the preempt flag should be in the L1, so the check may in fact be extremely cheap.

> No pointer writes to the heap. But maybe this is enough?

This certainly reduces the set of programs that have this problem, but I don't think it actually helps with either of the examples I gave, since numerical kernels probably won't have heap pointer writes, and the runtime test that can deadlock the GC certainly has no pointer writes.

@RLH

Contributor

RLH commented May 27, 2015

This is the reference for doing a GC safepoint at every instruction. It is not something we want to do for x86, much less the various other architectures we need to support: http://dl.acm.org/citation.cfm?id=301652

Most systems I know of do what Austin suggests: unroll the loop and insert a check. No numbers, but 8 unrolls seems to be what I recall. Not only is the branch highly predictable, but the check does not introduce any dependencies, making it close to free on an out-of-order machine. There have been other approaches, such as code plugging and predicated branches, but hardware has moved on. I had not seen Ian's suggestion.
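
For illustration, a hand-written Go sketch of the unroll-and-check shape being described (the intent is that the compiler would emit this; preemptRequested and checkPreempt are hypothetical stand-ins for the real per-goroutine flag and safepoint):

    package kernel

    import "runtime"

    // preemptRequested stands in for the runtime's per-goroutine preempt flag.
    var preemptRequested bool

    func checkPreempt() {
        if preemptRequested { // almost never taken, so highly predictable
            runtime.Gosched() // stand-in for reaching a real safepoint
        }
    }

    // sum shows the "unroll 8x, check once per trip" shape of a tight loop.
    func sum(xs []float64) float64 {
        var s float64
        i := 0
        for ; i+8 <= len(xs); i += 8 {
            s += xs[i] + xs[i+1] + xs[i+2] + xs[i+3] +
                xs[i+4] + xs[i+5] + xs[i+6] + xs[i+7]
            checkPreempt()
        }
        for ; i < len(xs); i++ { // leftover iterations
            s += xs[i]
        }
        return s
    }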

@kingluo

kingluo commented Sep 9, 2015

@aclements,

> Currently goroutines are only preemptible at function call points.

To be precise, it seems the preemption check happens only in newstack()?

For example,

package main

import (
    "fmt"
    "runtime"
    "time"
)

func test() {
    a := 100
    for i := 1; i < 1000; i++ {
        a = i*100/i + a
    }
}

func main() {
    runtime.GOMAXPROCS(1)
    go func() {
        for {
            test()
        }
    }()
    time.Sleep(100 * time.Millisecond)
    fmt.Println("hello world")
}

test() is not inlined, and the infinite loop calls test(), but it is not preempted at those calls because morestack() --> newstack() is never involved, so "hello world" is never printed.
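
For what it's worth, until the compiler inserts such checks, a workaround for programs shaped like the one above is to yield explicitly from the hot loop (a hedged sketch dropped into the program above; the yield frequency is an arbitrary choice):

    go func() {
        for n := 0; ; n++ {
            test()
            if n%1024 == 0 {
                runtime.Gosched() // give the scheduler and the GC a chance to run
            }
        }
    }()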

@randall77

Contributor

randall77 commented Sep 9, 2015

test() is not inlined, but since it is a leaf it is promoted to nosplit, so it doesn't have the preemption check any more. Fixing that sounds much easier than the rest of this bug. Maybe we could forbid nosplit promotion for functions called from loops?

@griesemer

Contributor

griesemer commented Sep 10, 2015

For reference, the same problem appeared in the HotSpot JVM. I remember two approaches:

  1. Around 1997/1998, Rene Schmidt (http://jaoo.dk/aarhus2007/speaker/Rene+W.+Schmidt) implemented the following mechanism: Threads running a tight loop w/o function calls would receive a signal to temporarily suspend them. The runtime then dynamically generated a partial "copy" of the loop instructions the thread was running, from the current PC to the (unconditional) backward branch, except that the backward branch was replaced with a call into the runtime (leading to proper suspension at a safe point). The thread was then restarted with the pc modified such that it would run the newly generated piece of code. That code would run to the end of the loop body and then suspend itself at which point the code copy was discarded and the pc (return address) adjusted to continue with the original code (after the safe point).

This mechanism was ingenious but also insanely complicated. It was abandoned eventually for:

  2. A simple test and branch (2 additional instructions) at the end of a loop body which didn't contain any calls. As far as I recall, we didn't do any form of loop unrolling, and the performance implications were not significant in the overall picture of a larger application.

My vote would be for 2) to start, in loops where @ianlancetaylor's suggestion doesn't work. I suspect that for all but the smallest long-running inner loops (where unrolling might make sense independently), the performance penalty is acceptable.

If this is not good enough, and we don't want to pay the code size cost of unrolling a loop multiple times, here's another idea based on 1): Instead of the backward branch check as in 2) plus unrolling, keep the original loop, and generate (at compile time) a 2nd version of the loop body that ends in a runtime call to suspend itself instead of the backward branch. The code size cost is about the cost of having the loop unrolled twice. When the goroutine needs to run to a safe point, use a signal to temporarily suspend the goroutine, switch the pc to the corresponding pc in the copy of the loop body, continue with execution there and have the code suspend itself at a safe point. There's no dynamic code generation involved, generating the extra loop body is trivial at compile-time, the extra amount of code is less than with loop unrolling, and there's only a little bit of runtime work to modify the pc. The regular code would run at full speed if no garbage collection is needed.

@nightlyone

Contributor

nightlyone commented Sep 11, 2015

Another idea, similar to @ianlancetaylor's, is to estimate the cost (in ns) of the loop body and only check for required suspension every N iterations of that loop.

Once the compiler can unroll/unpeel, that check-every-N logic can be either inlined after the unrolled body, if unrolling is beneficial, or kept when unrolling makes no sense for that loop body.

Such logic also reads better when using debuggers.

@griesemer

Contributor

griesemer commented Sep 11, 2015

@nightlyone The check-every-N seems more complex than just having two extra instructions (compare and branch). And if unrolling is done, that compare-and-branch is already needed only every N loop iterations (where N is the number of times the loop was unrolled).

@nightlyone

Contributor

nightlyone commented Sep 11, 2015

@griesemer not sure how the test will avoid atomic loads; that's why I suggested the check-every-N.

So the pseudo-go-code before unrolling would look like this:

loop:
        // loop-body

        if counter > N {
                counter = 0

                if need_stop := atomic.LoadBool(&runtime.getg().need_stop); need_stop {
                        runtime.Gosched()
                }
        }
        counter++
        goto loop

after unrolling it would look like

loop:
        // loop-body1
        // loop-body2
        // ...
        // loop-bodyM

        if counter > N/M {
                counter = 0

                if need_stop := atomic.LoadBool(&runtime.getg().need_stop); need_stop {
                        runtime.Gosched()
                }
        }
        counter++
        goto loop

So the inserted code would be a constant overhead, but the full check would still run only every N iterations.

@aclements

Member

aclements commented Sep 11, 2015

The load doesn't have to be atomic. The current preemption check in the function prologue isn't atomic.

One tricky bit with the compare and branch is that the actual preemption point needs to have no live registers. Presumably, for performance reasons, we want the loop to be able to keep things in registers, so the code we branch to on a preempt has to flush any live registers before the preempt and reload them after the preempt. I don't think this will be particularly hard, but it is something to consider, since it might affect which stage of the compiler is responsible for generating this code.

@griesemer

Contributor

griesemer commented Sep 11, 2015

@aclements Indeed. Which is why perhaps switching the pc to a 2nd version of the loop body that ends in a safe point might not be much more complex and permit the loop to run at full speed in the normal case.

@aclements

Member

aclements commented Sep 11, 2015

> @aclements Indeed. Which is why perhaps switching the pc to a 2nd version of the loop body that ends in a safe point might not be much more complex and permit the loop to run at full speed in the normal case.

From the runtime's perspective, I think this would be more complicated because stealing a signal is a logistical pain and we'd have to deal with tables mapping from fast loop PCs to slow loop PCs. The compiler would have to generate these tables. This seems like a very clever plan B, but I think first we should try adding a no-op compare and branch and see if it's actually a problem for dense numerical kernels.

@randall77

Contributor

randall77 commented Sep 11, 2015

The new SSA register allocator handles situations like this well, keeping everything in registers on the common edges and spill/call/restore on the unlikely case.

@RLH

Contributor

RLH commented Sep 11, 2015

The load is not dependent on anything in the loop. Assuming the loop is tight, the value is almost certainly in a location already in the L1 cache, and the branch will be highly predictable. Just to make sure the branch predictor doesn't even need to be warmed up, we can make it a backward branch. I would be sort of surprised if, on an out-of-order machine, the cost were even noticeable. That said, build and measure is the only way to be sure.


@gopherbot

gopherbot commented Feb 16, 2016

CL https://golang.org/cl/19516 mentions this issue.

gopherbot pushed a commit that referenced this issue Feb 16, 2016

runtime: fix deadlock in TestCrashDumpsAllThreads
TestCrashDumpsAllThreads carefully sets the number of Ps to one
greater than the number of non-preemptible loops it starts so that the
main goroutine can continue to run (necessary because of #10958).
However, if GC starts, it can take over that one spare P and lock up
the system while waiting for the non-preemptible loops, causing the
test to eventually time out. This deadlock is easily reproducible if
you run the runtime test with GOGC=1.

Fix this by forcing GOGC=off when running this test.

Change-Id: Ifb22da5ce33f9a61700a326ea92fcf4b049721d1
Reviewed-on: https://go-review.googlesource.com/19516
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
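
For reference, the commit forces GOGC=off for the test, presumably via the environment of the test's child process; the in-process equivalent is runtime/debug.SetGCPercent, sketched here rather than reproducing the actual CL:

    package main

    import "runtime/debug"

    func main() {
        // Disable the collector entirely (the in-process equivalent of
        // GOGC=off), restoring the previous setting when done.
        old := debug.SetGCPercent(-1)
        defer debug.SetGCPercent(old)

        // ... run the non-preemptible workload without risking a GC
        // stop-the-world taking the spare P ...
    }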
@wsc1

wsc1 commented Oct 4, 2018

For a data point on the effect of -gcflags=-d=ssa/insert_resched_checks/on on numeric tight loops: hole9.cnf.txt is a file which you can either rename with a .cnf extension or cat as input to http://github.com/irifrance/gini. It is a bit slower for this case; see below.

The benchmark is a classic for stress-testing SAT solvers. Level 9 gives something that runs long enough to exercise tight loops without cache warmup time influencing the result, and short enough to give a result without needing to wait too long. One would expect a similar slowdown for a wide variety of SAT problems.

It is unrelated to the audio case.

⎣ ⇨ go install   github.com/irifrance/gini/... 
⎡ /Users/scott/Dev/gini/src/github.com/irifrance/gini:scott@pavillion1 (18-10-04 15:00:07) ───────────────────── (86365|0)⎤
⎣ ⇨ time gini ~/G/Benchmarks/SAT/hole9.cnf    
c [gini] 2018/10/04 15:00:11 /Users/scott/G/Benchmarks/SAT/hole9.cnf
c [gini] 2018/10/04 15:00:11 parsed dimacs in 1.37853ms
s UNSATISFIABLE
gini ~/G/Benchmarks/SAT/hole9.cnf  15.45s user 0.11s system 97% cpu 15.961 total

⎣ ⇨ go install  -gcflags=-d=ssa/insert_resched_checks/on github.com/irifrance/gini/...
⎡ /Users/scott/Dev/gini/src/github.com/irifrance/gini:scott@pavillion1 (18-10-04 15:01:04) ───────────────────── (86367|0)⎤
⎣ ⇨ time gini ~/G/Benchmarks/SAT/hole9.cnf                                            
c [gini] 2018/10/04 15:01:07 /Users/scott/G/Benchmarks/SAT/hole9.cnf
c [gini] 2018/10/04 15:01:07 parsed dimacs in 1.109492ms
s UNSATISFIABLE
gini ~/G/Benchmarks/SAT/hole9.cnf  15.80s user 0.09s system 98% cpu 16.115 total
@wsc1

wsc1 commented Oct 4, 2018

Most importantly, there are also deliberate tight loops without pre-emption which are useful and may be relied upon.

Having thought about the audio case and its relation to syscalls, I would guess that it might be worth thinking about whether or not pre-emption could cause existing Go user syscalls, or the runtime's own, to fail.

For example, a loop like:

for ... {
   syscall.Syscall(syscall.A, ...)
   syscall.Syscall(syscall.B, ...)
}

Or, if pre-emption is at loop boundaries only, the variant form:

for i := range ... {
   if i%2 == 0 {
       syscall.Syscall(syscall.A, ...)
   } else {
       syscall.Syscall(syscall.B, ...)
   }
}

In either case, I would think the programmer currently assumes that no syscall other than those made by her program happens in between syscall.A and syscall.B.

If such a syscall then interacts badly with a syscall used by the runtime, that could be a source of problems that is very hard to debug, both for the Go programmer and the runtime developer. For example, signal-related syscalls.

Not to say I think the above is common; just food for thought.

@wsc1

wsc1 commented Oct 4, 2018

Perhaps it would be worth considering not inserting runtime.Gosched into loops that already contain syscalls or runtime.Gosched?

@wsc1

wsc1 commented Oct 7, 2018

> Or, if pre-emption is at loop boundaries only, the variant form:
>
>     for i := range ... {
>        if i%2 == 0 {
>            syscall.Syscall(syscall.A, ...)
>        } else {
>            syscall.Syscall(syscall.B, ...)
>        }
>     }

Actually, it is deeper than that, and independent of whether the compiler treats syscalls specially w.r.t. pre-emption in loops, because the syscalls can occur outside the loop that would be preempted:

syscall.Syscall(syscall.A, ...)
for {
   // built-in math/logic only
}
syscall.Syscall(syscall.B, ...)
@dr2chase

Contributor

dr2chase commented Oct 8, 2018

We're going to fix this bug (that is, the bug with this number, that Go code must respond to a request for a handshake w/ the runtime in a very timely fashion). It's a problem for users in already-running code, and has been for some time. In the worst case, applications can hang because of this, which is not acceptable.

The Go runtime can't provide guarantees like "the Go scheduler contains no syscalls" -- for one thing, as soon as a goroutine is locked to an OS thread, starting and stopping that thread requires syscalls. I looked, and as the runtime is currently written, a locked thread is always put to sleep when its quantum expires, even if it is destined to be the next runnable goroutine because there are no others that are runnable. We have no evidence this is causing anyone problems -- there's no reproducible test case, no problems with blown SLOs, etc.

I think it would be helpful if you could start small, and with the plainest possible Go, so we actually have some examples to work with, to measure latencies, and figure out their causes, and also figure out what the constraints are. There will almost certainly be problems, but they'll be actual problems, not hypothesized problems, and we'll be able to measure them and decide what's the best way to deal with them. Crucially, because we have example code to work with, we can use it to create test cases so that whatever bugs get fixed, stay fixed.

@wsc1

wsc1 commented Oct 8, 2018

> We're going to fix this bug (that is, the bug with this number, that Go code must respond to a request for a handshake w/ the runtime in a very timely fashion). It's a problem for users in already-running code, and has been for some time. In the worst case, applications can hang because of this, which is not acceptable.

I'm glad it will help those for whom it causes problems. I don't want to prevent that.

As a long-time Go user, I don't classify this as a bug; knowing the system won't pre-empt is a long-standing quality of the language that I and probably others have knowingly made use of and invested time in.

So, one man's bug is another's feature. A veritable testament to the success of Go. Let's keep being inclusive so more things like that will happen.

@wsc1

wsc1 commented Oct 8, 2018

> The Go runtime can't provide guarantees like "the Go scheduler contains no syscalls" -- for one thing, as soon as a goroutine is locked to an OS thread, starting and stopping that thread requires syscalls. I looked, and as the runtime is currently written, a locked thread is always put to sleep when its quantum expires, even if it is destined to be the next runnable goroutine because there are no others that are runnable. We have no evidence this is causing anyone problems -- there's no reproducible test case, no problems with blown SLOs, etc.

Thanks for checking and the info. I agree the scheduler can't guarantee no sys calls. It could probably benefit from more use of lockless structures.

@wsc1

wsc1 commented Oct 8, 2018

> I think it would be helpful if you could start small, and with the plainest possible Go, so we actually have some examples to work with, to measure latencies, and figure out their causes, and also figure out what the constraints are. There will almost certainly be problems, but they'll be actual problems, not hypothesized problems, and we'll be able to measure them and decide what's the best way to deal with them. Crucially, because we have example code to work with, we can use it to create test cases so that whatever bugs get fixed, stay fixed.

Thanks. As previously noted, we're dealing with wall-clock time, and it appears to me from all commentary that no-one understands that you can't put wall-clock time latency tests in a box and pass it around the cloud for analysis without simulating the entire system or having control of the hardware.

If a concrete test is what is necessary for deeming a problem more than hypothetical, then see above: if you provide a way to host or receive such tests, I'm happy to help advance that. If you do not, then I guess we'll just have to agree to disagree about whether the problem is hypothetical, and I'll be on my way.

Thanks in any case.

@ianlancetaylor

Contributor

ianlancetaylor commented Oct 8, 2018

@wsc1 I suppose I'm beating a dead horse, but you haven't demonstrated a problem. You've pointed to a possible problem. That's not the same thing. That's why we keep calling it hypothetical.

This issue was opened because of real problems demonstrated by real users with real code. You have made clear that the solutions being discussed to those real problems may possibly cause problems in other code. But we don't know. So let's try fixing the real problems and find out what happens with the other code. Then, based on real experience, let's decide how to fix all the real problems.

Nobody is saying that we should break audio code. We're saying: let's see what happens. It doesn't make sense to not fix real problems because of the fear of hypothetical problems. Perhaps the fix under consideration causes too many other problems and can not be implemented. But let's try it and find out, not reject it without trying.

Alternatively, if you have a proposal for a different fix to the real problems we're trying to fix, by all means tell us.

@wsc1

wsc1 commented Oct 9, 2018

@ianlancetaylor

Thanks. My problem with the fix is real. I would appreciate it if there were a way to demonstrate that. But the Go team does not seem to have the resources to test, receive, or evaluate those real problems.
Such a circumstance doesn't, to me, merit discrediting what has been presented as "not real"; I think a better categorisation would be "we can't know (given our resources) how much of a problem it is".

Also, this change was hidden from my view of Go activities: I was distracted by everything else that was pushed, and it was not mentioned by anyone during a long discussion on golang-dev even though the themes were quite clearly related. That cost me a great deal of time, as well as the need to rethink, design, and implement about a million things. If that cost is not real to you, well, I don't appreciate that, nor do I think anyone would.

@ianlancetaylor

Contributor

ianlancetaylor commented Oct 9, 2018

I'm sorry if it seems like we are wasting your time. That is not my intent. I don't understand what you mean when you say that we lack resources; that may well be true, but if you send us benchmarks we can run them.

In general we seem to be talking past each other. Perhaps best to revisit this if and when the design described in this issue is implemented.

@AndrewSav

AndrewSav commented Oct 9, 2018

@wsc1 you wrote:

> Because of its relation to OS scheduling, Note that to test reliability, one must load the OS outside the program in question, so you cannot draw conclusions from testing one program without also placing the entire host OS under test. Sorry I can't provide such a test which controls the whole OS at this time.

You also wrote:

> it appears to me from all commentary that no-one understands that you can't put wall-clock time latency tests in a box and pass it around the cloud for analysis without simulating the entire system or having control of the hardware.

I can tell you that I personally definitely do not understand it. Of course, that says nothing about whether others understand it or not, but perhaps some explanation is in order? In particular:

  • "one must load the OS outside the program in question" - what does that mean? How do you "load" OS? Do you mean booting it up? If yes surely booting it up happens outside the program in question. What do you mean then?
  • "so you cannot draw conclusions from testing one program without also placing the entire host OS under test" - why is that, you cannot draw conclusions? also what does it mean, specifically, "placing entire host OS under test". How does one do that and why?
  • "Sorry I can't provide such a test which controls the whole OS at this time." Why cannot you? What is involved?

Thank you in advance, I would really appreciate your insight!

@wsc1

wsc1 commented Oct 9, 2018

> I'm sorry if it seems like we are wasting your time. That is not my intent.

Thanks.

> I don't understand what you mean when you say that we lack resources; that may well be true, but if you send us benchmarks we can run them.

Thanks again.

> In general we seem to be talking past each other. Perhaps best to revisit this if and when the design described in this issue is implemented.

Yes, I agree

@AndrewSav if you want to follow up about the latency tests nonetheless feel free to mail me or comment on the "Questions" issue at github.com/zikichombo/sio.

@dr2chase

Contributor

dr2chase commented Oct 9, 2018

> no-one understands that you can't put wall-clock time latency tests in a box and pass it around the cloud for analysis without simulating the entire system or having control of the hardware.

We can currently eyeball execution traces for latency problems that the Go runtime causes. It gives a quite reasonable rendition of virtual-ish time, where OS-induced latency appears as "the runtime didn't touch this thread, but it mysteriously did nothing for a little while". For runtime-induced latency purposes, we mostly ignore those implicit glitches (we need to keep an eye on them, for cases where the runtime does a syscall before logging the event, but I think you get the picture). We could automate this.

There would be an annoying one-time investment in a useful bug harness, but I think it would be adequate to this purpose. Simple rule of thumb, if it's not tested, it doesn't work. If we intend to support this sort of low latency behavior, we need a test.
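
For concreteness, capturing such an execution trace uses the standard runtime/trace package (a minimal sketch; the output file name is arbitrary and the result is viewed with `go tool trace`):

    package main

    import (
        "log"
        "os"
        "runtime/trace"
    )

    func main() {
        f, err := os.Create("trace.out")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        if err := trace.Start(f); err != nil {
            log.Fatal(err)
        }
        defer trace.Stop()

        // ... run the latency-sensitive workload here, then inspect the
        // result with: go tool trace trace.out
    }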

@wsc1

wsc1 commented Oct 11, 2018

> @wsc1 you wrote:
>
> I can tell you that I personally definitely do not understand it. Of course, that says nothing about whether others understand it or not, but perhaps some explanation is in order? In particular:
>
>   • "one must load the OS outside the program in question" - what does that mean? How do you "load" OS? Do you mean booting it up? If yes surely booting it up happens outside the program in question. What do you mean then?

I mean placing the operating system under load with processes other than the one(s) being measured.

>   • "so you cannot draw conclusions from testing one program without also placing the entire host OS under test" - why is that, you cannot draw conclusions? also what does it mean, specifically, "placing entire host OS under test". How does one do that and why?

Because we are dealing with wall-clock time, which means time that includes the time the operating system devotes to other processes which compete for resources with the piece of software in question.

So to know how the software behaves in place with respect to wall-clock time, we need to know the whole picture; the time inside the process is less relevant to solving the problem than how the process interacts with other programs in the context of an operating system. (This is in part thanks to the great work on Go's garbage-collected runtime so far.)

>   • "Sorry I can't provide such a test which controls the whole OS at this time." Why cannot you? What is involved?

I can, have, and do test on what hardware I have. However, the test framework @dr2chase could make available to me does not provide an interface for readily specifying what the operating system is doing at the same time, and, I would guess, it might not even be hardware but rather a container or VM or Kubernetes or something, so the wall-clock timing is not expected to reliably agree when we run the tests.

Also, on my end, it requires finding a reasonable and useful model of what those other processes are, so we can turn a knob, see a latency, and then use that knob as a reference point for real operating-system situations.

These measures are, as @dr2chase mentioned, something he can eyeball/estimate, but I'm interested in measuring time to failure given uncontrolled external influences, also with and without real-time threads (which require root access).

These problems lead me to think maybe it's better to simulate the timing of the system calls by modifying Go for the sole purpose of the test (which would enable precise measurement on all platforms under simulation), or maybe it's better to modify the execution tracer to output times under a simulation after a test has already been run.

But all this is work and requires coordination to communicate beyond standard benchmarking and test cases.

It is also work which is in part necessitated by assertions that the runtime should be able to pre-empt the programmer at any time without the programmer being able to control that. To the extent I can control the pre-emptions as a Go programmer (which is possible today and has been for a long time), this problem becomes my concern on my hardware, and not one of coordinating tests and defining OS simulations through the Go issue tracker.

I prefer independence, because introducing a dependency on the resources @dr2chase can make available to me gives the Go language control over my timeline, which I simply can't afford.

> Thank you in advance, I would really appreciate your insight!

Thanks for your questions.

@wsc1

wsc1 commented Oct 11, 2018

@aclements
Very impressive work on the pre-emption.

Please don't take my timeline concerns as an objection to the work, as it is clear it fixes many other problems. I'll just stick with go1.11 and work out compatibility later.

@wsc1

wsc1 commented Oct 11, 2018

> There would be an annoying one-time investment in a useful bug harness, but I think it would be adequate to this purpose. Simple rule of thumb, if it's not tested, it doesn't work. If we intend to support this sort of low latency behavior, we need a test.

Thanks very much for the offer. However, I can't afford to make the feasibility of support for what I am working on subject to your timeline and resources. I'll make my own tests, stay with go1.11, and try to keep in mind things that might help us communicate wall-clock tests for measuring time to failure (TTF) along the way. One such thing that comes to mind is making the scheduler's random seed settable for reproducibility.

@wsc1

wsc1 commented Oct 17, 2018

> There would be an annoying one-time investment in a useful bug harness, but I think it would be adequate to this purpose. Simple rule of thumb, if it's not tested, it doesn't work. If we intend to support this sort of low latency behavior, we need a test.

> Thanks very much for the offer. However, I can't afford to make the feasibility of support for what I am working on subject to your timeline and resources. I'll make my own tests, stay with go1.11, and try to keep in mind things that might help us communicate wall-clock tests for measuring time to failure (TTF) along the way. One such thing that comes to mind is making the scheduler's random seed settable for reproducibility.

For a reference about another project's hosting of latency tests, please see this

@robaho

robaho commented Dec 12, 2018

This may have already been discussed, but given the reluctance to do a test and branch, due to cost and code size, isn't another method to load a 'flag' memory address? When the address is unprotected the load succeeds; when the runtime wants a safepoint it protects the address, causing a trap on the read, and the trap handler uses the scheduler to pause. This way it is a single instruction in the loops, and in almost all cases the address will be in the L1 cache, so the performance impact should be negligible.

Btw, if you review http://psy-lob-saw.blogspot.com/2014/03/where-is-my-safepoint.html - the process I described is how OpenJDK (at least then) did the safepoint check, and it also outlines the rationale.

@dr2chase

Contributor

dr2chase commented Dec 12, 2018

We tried that; the performance hit on some benchmarks was still surprisingly annoying.
At least some of that cost was caused by glitches in register allocation which I think have been fixed, but it still has higher overhead than what we're attempting/planning to do now.

Go (in its current/usual implementation) has an advantage over Java (usual case) because its "threads" (goroutines) are not OS threads and generally cooperatively schedule themselves. That means that the runtime knows the state of most threads and at worst only has to interrupt about as many threads as there are processors, which makes interrupt-based schemes plausible.

@robaho

robaho commented Dec 12, 2018

That makes sense. It would seem to be racy, though, with threads entering/exiting syscalls, because interrupting those might be problematic, leading to more locking around the syscall and interrupt. But I’m sure you guys have a plan for that. :)

@robaho

robaho commented Dec 12, 2018

@dr2chase does that scale, though? Imagine running on a 1024-core machine: it would seem that interrupts would be far less efficient; even 64 cores could lead to significant latencies compared to polling.

@aclements

Member

aclements commented Dec 12, 2018

@robaho, that's an interesting question, and it can be asked of any approach. For example, with the page unmapping approach, interrupting all of the threads requires a global TLB shootdown, so it becomes a question of how well that scales. Polling a global flag requires either a memory barrier on the read side at every poll (which will be very expensive in the non-preemption case) or a global memory barrier when the flag is set, which, like the global TLB shootdown, requires an IPI broadcast to all cores. Both of these approaches also have the downside that they can only be used for global preemption, so they're not useful for regular scheduler preemptions.

To support scheduler preemption, polling requires a per-something flag. Currently we use a per-goroutine flag. But then it's a question of how scalably we can set those flags, in addition to what the non-preemption overhead of these checks is.
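
As a rough illustration of the per-goroutine-flag option (all names here are hypothetical; the real runtime folds the check into the stack-bound test in the function prologue rather than using a separate field like this):

    package sched

    // g is a hypothetical goroutine descriptor with a preempt flag that the
    // scheduler sets with a plain store.
    type g struct {
        preempt bool
    }

    // yield stands in for handing the goroutine back to the scheduler at a
    // safepoint.
    func yield(gp *g) {
        gp.preempt = false
    }

    // pollPreempt is what a compiler-inserted check on a loop back edge would
    // conceptually do: one plain load and one almost-never-taken branch.
    func pollPreempt(gp *g) {
        if gp.preempt {
            yield(gp)
        }
    }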

A signal-driven approach supports scheduler preemptions, and I believe it can be made quite scalable. For example, we could use a broadcast tree to deliver the signals.
