proposal: runtime: add SchedStats API #15490
Comments
You can get all of these statistics by parsing the tracing output from
runtime/trace.
@minux, while true, runtime/trace seems like a pretty high-overhead way to collect what amounts to a fairly small amount of information. It's certainly low overhead for what it does, but what it does is much more than what's needed here. The metrics @deft-code wants are primarily intended for continuous monitoring (based on offline conversations), so collecting them needs to be cheap.
Here are the notes on the desired metrics I had from our meeting a while ago:

- Ring buffer of sampled durations between entering and exiting the runnable state
- Four global stats
- Maybe the current number of running goroutines
Assigning to @aclements to decide what we're willing to support long-term.
Ping @aclements
Still up to @aclements.
Ping @aclements. Can you look at this during the release candidate quiet period?
@deft-code, do the specific stats I suggested in #15490 (comment) address your needs?
Sorry, I'd lost track of the fact that there was a concrete proposal doc for this: https://github.com/deft-code/proposal/blob/master/design/15490-schedstats.md

@deft-code, could you mail a CL to add this to the go-proposal repository and, once it's submitted, edit your first post to link to it? Thanks.
I'll get on top of it.
Thanks!
CL https://golang.org/cl/38180 mentions this issue.
Do we need to keep this issue open, or should we accept it?
There's definitely still work to do on how and what exactly the API should expose, but I think it's pretty clear we need to provide some visibility into the scheduler.
Teams inside Google are patching in CL 38180 and getting some experience with it. If others would like to do the same, please do. We'll probably wait until Go 1.10 to decide whether to add the API officially. Putting the proposal on hold until then.
Update: Some teams inside Google tried CL 38180 for monitoring CPU load and performing load shedding. However, they've found that the stats provided in the CL aren't a good indicator of CPU load. In particular, it seemed that since goroutines are so cheap (compared to, say, using runnable threads as a load indicator), the runnable goroutine count often fluctuated dramatically, even when the system was under normal load. It was common to see a huge number of goroutines newly started or woken that would exit or sleep almost immediately once run. If the load shedder happened to sample the SchedStats during one of these spikes, it would think the system was overloaded.

There may be some other stat that is a more robust indicator of load. For example, maybe smoothing would help: the runtime could provide the time-integral of the runnable count, from which an application could compute the average runnable count over any desired time window. Or it could expose something similar for the running goroutines to give a measure of idle time (once a system was overloaded, this wouldn't be able to tell how overloaded it was).
DO NOT SUBMIT. This is an experimental API.

This introduces a runtime.SchedStats API to mirror the existing runtime.MemStats API. Currently, SchedStats reports the number of goroutines in four major states: running, runnable, non-Go (syscall/cgo), and blocked. The intent is that these can be used to determine the CPU load of a Go process and to perform load shedding. This is *not* a complete solution, since the Go scheduler cannot account for threads in syscalls or cgo; however, a complete solution can be built by combining these statistics with kernel-provided statistics. The comments on SchedStats attempt to make this clear.

ReadSchedStats collects these counts efficiently by scanning the P states and using a running count of the number of goroutines in syscalls that don't own a P (which avoids doing any additional accounting in the syscall fast path). This way, it avoids scanning all of the goroutines, which could be expensive. With this approach, at GOMAXPROCS=4, ReadSchedStats takes only 33 ns.

Updates golang#15490, golang#17089.
Change-Id: I202f33eea5d10c83dbf41cb45c8c619ff17fa4c4
We are looking at ways of detecting overload in CockroachDB, and statistics like those described in #15490 (comment) or in the proposal would be extremely helpful. Is there any hope of getting something like this into Go any time soon? I can help with the work if folks are in agreement on what to build. As a timid first step, I was thinking of adding a

Also, is there any more information about the experiments that used the number of runnable goroutines as an overload indicator? I understand the fluctuations are an issue, but wouldn't sampling this info frequently and looking at an average over a reasonable timeframe address that?
MemStats provides a way to monitor allocation and garbage collection.
We need a similar facility to monitor the scheduler.
Briefly: