runtime: expose number of running/runnable goroutines #17089

Open
fd opened this Issue Sep 13, 2016 · 15 comments


fd commented Sep 13, 2016

Summary

I'd like to propose a way to expose the number of active (running + runnable) goroutines.

Background

My primary use case for this metric is to estimate application load (num-active-goroutines / num-cpu) in order to implement load shedding. Other metrics, like the times() syscall, don't expose application overload and don't work well in the presence of noisy neighbours.

Plan

Currently the runtime package includes runtime.NumGoroutine() int which returns the number of live, non-system goroutines.

The runtime package could be extended to include runtime.NumActiveGoroutine() int. NumActiveGoroutine() should count all goroutines where isSystemGoroutine() is false and where status is _Grunnable|_Grunning|_Gsyscall.

It seems that such a function would need to acquire sched.lock and allglock. This could have some performance implications.
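The load ratio from the Background section can be sketched with the APIs that exist today. Since runtime.NumActiveGoroutine is only proposed here, this sketch substitutes runtime.NumGoroutine, which also counts blocked goroutines and so overestimates load. The names loadEstimate and shouldShed and the threshold parameter are hypothetical.

```go
package main

import "runtime"

// loadEstimate approximates "num-active-goroutines / num-cpu".
// Assumption: runtime.NumGoroutine stands in for the proposed
// NumActiveGoroutine, so blocked goroutines inflate the result.
func loadEstimate() float64 {
	return float64(runtime.NumGoroutine()) / float64(runtime.NumCPU())
}

// shouldShed reports whether the estimate exceeds a caller-chosen threshold.
func shouldShed(threshold float64) bool {
	return loadEstimate() > threshold
}
```

A server could call shouldShed before accepting new work. The right threshold depends on the application and would need tuning.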

@quentinmit quentinmit added this to the Go1.8Maybe milestone Sep 13, 2016

Contributor

quentinmit commented Sep 13, 2016

I'm going to tentatively mark this as a feature request for runtime, instead of a proposal, since it seems pretty uncontroversial to me.

@quentinmit quentinmit changed the title from Proposal: Expose number of running/runnable goroutines to runtime: expose number of running/runnable goroutines Sep 13, 2016

Contributor

davecheney commented Sep 13, 2016

It doesn't seem like a very interesting number; it'll always be less than or equal to GOMAXPROCS.


Contributor

ianlancetaylor commented Sep 13, 2016

@davecheney The suggestion counts runnable goroutines, so it can be larger than GOMAXPROCS.

My concern is that I don't see that this adds anything very useful over NumGoroutine. If you are worried about shedding load then I don't see why you want to ignore the system goroutines. And there aren't very many system goroutines anyhow, so if you are in a condition where load shedding is relevant they are just going to be a rounding error.

Contributor

randall77 commented Sep 13, 2016

I think more importantly he doesn't want to count goroutines which are blocked (which NumGoroutine does count).

Contributor

ianlancetaylor commented Sep 13, 2016

Oh, I see, but then the goroutines in state _Gsyscall are ambiguous, as they could be blocked.

Contributor

randall77 commented Sep 13, 2016

Indeed. Only things blocked in Go (select{}, ...) would not be counted if we used the raw goroutine states.

fd commented Sep 14, 2016

@ianlancetaylor: The suggestion counts runnable goroutines, so it can be larger than GOMAXPROCS.

That is correct.

@ianlancetaylor: My concern is that I don't see that this adds anything very useful over NumGoroutine.

Using NumGoroutine breaks down when you have long running goroutines that do background work (like periodically refreshing cache entries). This approach also breaks down for proxy servers as they spend most of their time waiting on the network.

@ianlancetaylor: If you are worried about shedding load then I don't see why you want to ignore the system goroutines.

Based on the POC that I made, it seems that at least some system goroutines always appear active. I could be wrong here, as they might be blocking on a syscall (like, say, the netpoller).

The other reason I think system goroutines should be excluded is because NumGoroutine also excludes them.

@ianlancetaylor: Oh, I see, but then the goroutines in state _Gsyscall are ambiguous, as they could be blocked.

That is correct. Maybe g.waitreason should be taken into account too?
Otherwise, the _Gsyscall state could be excluded.

@randall77: Only things blocked in Go (select{}, ...) would not be counted if we used the raw goroutine states.

How can this be detected?


Here is the POC code I wrote: https://gist.github.com/fd/7136de67a56e174d8c06cb505f7278aa

Contributor

randall77 commented Sep 14, 2016

Goroutines blocked in the Go runtime have the _Gwaiting state. You probably don't want to count those; they contribute nothing to CPU load (but do consume some memory).

It is not clear whether you should count goroutines in the _Gsyscall state. Whether you want to count them depends on whether they are doing real work in the syscall (reading a large file, say) or waiting (a read on an idle network socket). I don't think the runtime has the information needed to make that call, although we might be able to make some approximation. That's what makes this problem hard.
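For experimentation without runtime changes, the per-state counts discussed above can be approximated by parsing the header lines of a full runtime.Stack dump. This is only a sketch: the helper names goroutineStates and parseStates are hypothetical, the header text format is not a stable API, and a full dump stops the world, so it is far too expensive for a hot path.

```go
package main

import (
	"bufio"
	"bytes"
	"runtime"
	"strings"
)

// goroutineStates dumps all goroutine stacks and counts goroutines by the
// status shown in each header line, e.g. "goroutine 7 [runnable]:".
func goroutineStates() map[string]int {
	buf := make([]byte, 1<<16)
	for {
		n := runtime.Stack(buf, true) // all=true stops the world briefly
		if n < len(buf) {
			buf = buf[:n]
			break
		}
		buf = make([]byte, 2*len(buf)) // dump was truncated; retry with more room
	}
	return parseStates(buf)
}

// parseStates extracts the bracketed status from each goroutine header line.
func parseStates(dump []byte) map[string]int {
	counts := make(map[string]int)
	sc := bufio.NewScanner(bytes.NewReader(dump))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "goroutine ") || !strings.HasSuffix(line, "]:") {
			continue
		}
		open := strings.LastIndex(line, "[")
		if open < 0 {
			continue
		}
		status := line[open+1 : len(line)-2]
		// Drop annotations such as ", 2 minutes" or ", locked to thread".
		if i := strings.Index(status, ","); i >= 0 {
			status = status[:i]
		}
		counts[status]++
	}
	return counts
}
```

Summing the "running" and "runnable" entries gives a rough active count. Goroutines reported as "syscall" carry the same ambiguity described above.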

fd commented Sep 14, 2016

So, how about this:

  • Timers and network I/O are managed by the runtime and don't result in _Gsyscall goroutines (except for the system goroutines, which should be excluded).
  • Cgo calls can include syscalls which cannot be tracked. As a result, these syscalls are always counted.
  • Applications accessing the network through cgo (which hides the syscalls) should be rare.
  • Other syscalls generally result in CPU load and thus should be included.

So unless you are heavily using something like gopkg.in/fsnotify.v1, NumActiveGoroutine should be a decent approximation of the actual workload.

Including _Gsyscall should be a good starting point for NumActiveGoroutine.
The runtime could be extended to record the called syscall in G.
Then syscall package could be extended with a list of syscalls that result in some form of idling.
Given these changes, NumActiveGoroutine can decide whether to consider the goroutine active or not. Syscalls called from cgo are still hidden in this scenario.

Remember, it is not my goal to find an accurate estimate of the CPU utilisation. Instead, it is my goal to find a good-enough estimate of the application utilisation. I included an excerpt from Site Reliability Engineering, How Google Runs Production Systems, which seems to suggest that Google uses a similar metric/approach.


Site Reliability Engineering, How Google Runs Production Systems - p. 366

The utilization signals we use are based on the state local to the task (since the goal of the signals is to protect the task) and we have implementations for various signals. The most generally useful signal is based on the “load” in the process, which is determined using a system we call executor load average.

To find the executor load average, we count the number of active threads in the process. In this case, “active” refers to threads that are currently running or ready to run and waiting for a free processor. We smooth this value with exponential decay and begin rejecting requests as the number of active threads grows beyond the number of processors available to the task. That means that an incoming request that has a very large fan-out (i.e., one that schedules a burst of a very large number of short-lived operations) will cause the load to spike very briefly, but the smoothing will mostly swallow that spike. However, if the operations are not short-lived (i.e., the load increases and remains high for a significant amount of time), the task will start rejecting requests.
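The smoothing described in the excerpt can be sketched as an exponentially weighted moving average over periodic samples of the active count. The type loadAverage, its methods, and the alpha weight are hypothetical; the excerpt does not say how the decay is parameterised, and the source of the samples is left to the caller.

```go
package main

import "sync"

// loadAverage smooths sampled active counts with exponential decay.
type loadAverage struct {
	mu     sync.Mutex
	alpha  float64 // weight of the newest sample, 0 < alpha <= 1 (assumed knob)
	value  float64
	seeded bool
}

// observe folds one sample into the average and returns the new value.
func (l *loadAverage) observe(active int) float64 {
	l.mu.Lock()
	defer l.mu.Unlock()
	if !l.seeded {
		l.value, l.seeded = float64(active), true
	} else {
		l.value = l.alpha*float64(active) + (1-l.alpha)*l.value
	}
	return l.value
}

// overloaded reports whether the smoothed load exceeds the processor limit.
func (l *loadAverage) overloaded(limit int) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.value > float64(limit)
}
```

With alpha set to 0.5 and a limit of 64, a single sample of 100 after an idle period yields 50 and is not rejected, while several consecutive samples of 100 push the average past the limit. A caller would typically pass runtime.GOMAXPROCS(0) as the limit.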

Contributor

ianlancetaylor commented Sep 14, 2016

Using NumGoroutine breaks down when you have long running goroutines that do background work (like periodically refreshing cache entries). This approach also breaks down for proxy servers as they spend most of their time waiting on the network.

As you say, you are looking for an approximation, and you care about load shedding. Unless you start a long running goroutine for each incoming request, the number of long running goroutines should be a tiny fraction of the total number of goroutines, and are therefore ignorable for approximation purposes.

I agree that proxy servers are a problem.

Since you have proof of concept code, do you have a way to see the difference between NumGoroutine and NumActiveGoroutine for a large server?

I would be less concerned about adding NumActiveGoroutines if it weren't for the ambiguity about _Gsyscall. I'm worried about how to document what the result really means for programs that call C code. It's probably unusual to call C code that makes direct network calls, but it's not in the least unusual to call C code that uses the file system, which may be networked, or that uses a library that in turn makes DNS lookups or in some other way uses the network. So while NumActiveGoroutines is easy to understand for pure Go code, I don't see how it's easily generalizable for Go programs that call C code.

One possibility would be to return two numbers: the number of running/runnable goroutines and the number of goroutines waiting for a system call or C code. But that seems to me to be too tied to the current details of how system calls and cgo are implemented.

I assume you are looking for some sort of general framework here, because for any specific program that wants to do load shedding I would say just count the number of active requests.
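Counting active requests in a specific program, as suggested above, could look like the following sketch for an HTTP server. The limiter type, its max field, and the 503 response are illustrative choices, not something specified in this thread.

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// limiter rejects requests once more than max are in flight.
type limiter struct {
	inFlight atomic.Int64
	max      int64        // hypothetical cap on concurrent requests
	next     http.Handler // wrapped handler
}

// acquire reserves a slot and reports whether the request may proceed.
func (l *limiter) acquire() bool {
	if l.inFlight.Add(1) > l.max {
		l.inFlight.Add(-1)
		return false
	}
	return true
}

// release frees a slot taken by a successful acquire.
func (l *limiter) release() {
	l.inFlight.Add(-1)
}

// ServeHTTP sheds load with a 503 when the cap is reached.
func (l *limiter) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	if !l.acquire() {
		http.Error(w, "overloaded", http.StatusServiceUnavailable)
		return
	}
	defer l.release()
	l.next.ServeHTTP(w, r)
}
```

Wrapping a mux as &limiter{max: 100, next: mux} and passing it to http.ListenAndServe applies the cap to every route. This needs Go 1.19 or later for atomic.Int64.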

Contributor

RLH commented Sep 14, 2016

The problem NumActiveGoroutines is trying to solve is when to shed load. Wouldn't monitoring the latency of an application request be a more direct and ultimately more correct way to do this? If latency increases, shed load. If latency improves, increase load.

Is there a use case where this doesn't work but NumActiveGoroutines does?

Discussing the nuances of what _Gidle, _Grunnable, _Grunning, _Gsyscall, and _Gwaiting, plus _Gscanrunning, _Gscanrunnable, _Gscansyscall, and _Gscanidle, mean in this context is a very implementation-dependent discussion.
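The latency-driven alternative suggested above can be sketched as a smoothed request latency compared against a target. The latencyShedder type, the target and alpha fields, and the smoothing choice are all assumptions made for illustration.

```go
package main

import (
	"sync"
	"time"
)

// latencyShedder tracks a smoothed request latency and signals overload
// when it rises above a target.
type latencyShedder struct {
	mu     sync.Mutex
	target time.Duration // latency above which load is shed (assumed knob)
	alpha  float64       // weight of the newest sample, 0 < alpha <= 1
	avg    float64       // smoothed latency in nanoseconds
	seeded bool
}

// record folds one observed request latency into the average.
func (s *latencyShedder) record(d time.Duration) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.seeded {
		s.avg, s.seeded = float64(d), true
		return
	}
	s.avg = s.alpha*float64(d) + (1-s.alpha)*s.avg
}

// shouldShed reports whether the smoothed latency exceeds the target.
func (s *latencyShedder) shouldShed() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.seeded && time.Duration(s.avg) > s.target
}
```

A handler would time each request with time.Since, pass the result to record, and consult shouldShed before accepting the next one. Once latency falls back under the target, shouldShed returns false again and load can increase.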

Contributor

quentinmit commented Oct 11, 2016

Even NumGoroutine does not capture all the work C is doing; the C code may have spawned threads that are independently doing work as well.

I think it's reasonable to say that goroutines in C are not active from the perspective of Go, regardless of what they're calling.

@quentinmit quentinmit added the NeedsFix label Oct 11, 2016

Contributor

rsc commented Oct 21, 2016

This is not uncontroversial.

@rsc rsc modified the milestones: Go1.9, Go1.8Maybe Oct 21, 2016

@rsc rsc added NeedsDecision and removed NeedsFix labels Oct 21, 2016

gopherbot commented Mar 14, 2017

CL https://golang.org/cl/38180 mentions this issue.

@bradfitz bradfitz added NeedsFix and removed NeedsDecision labels May 15, 2017

@aclements aclements modified the milestones: Go1.10Early, Go1.9 Jun 7, 2017

@bradfitz bradfitz removed this from the Go1.10Early milestone Jun 14, 2017

@bradfitz bradfitz modified the milestones: Go1.10, Go1.10Early Jun 14, 2017

@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017

@ianlancetaylor ianlancetaylor modified the milestones: Go1.11, Unplanned Jul 9, 2018
