
proposal: runtime: add per-goroutine CPU stats #41554

Open
asubiotto opened this issue Sep 22, 2020 · 10 comments

@asubiotto asubiotto commented Sep 22, 2020

Per-process CPU stats can currently be obtained via third-party packages like https://github.com/elastic/gosigar. However, I believe certain types of applications need to understand CPU usage at a finer granularity.

Example

At a high level in CockroachDB, whenever an application sends a query to the database, we spawn one or more goroutines to handle the request. If more queries are sent to the database, each gets an independent set of goroutines. Currently, we have no way of showing the database operator how much CPU is used per query. This would let operators understand which queries use more CPU, measure that against their expectations, and act on it, e.g. cancel a query that is using too many resources (such as an accidentally over-intensive analytical query). If we had per-goroutine CPU stats, we could implement this by aggregating CPU usage across all goroutines that were spawned for a given query.

Fundamentally, I think this is similar to bringing up a task manager when you feel like your computer is slow, figuring out which process on your computer is using more resources than expected, and killing that process.

Proposal

Add a function to the runtime package that does something like:

type CPUStats struct {
    User   time.Duration
    System time.Duration
    ...
}

// ReadGoroutineCPUStats writes the calling goroutine's CPU stats
// into stats.
func ReadGoroutineCPUStats(stats *CPUStats)
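
For illustration, a hypothetical use of this API (it does not compile today, since CPUStats and ReadGoroutineCPUStats are only proposed): each goroutine spawned for a query folds its own stats into a shared per-query total before exiting. queryCPUNanos is an illustrative accumulator, not part of the proposal.

// Hypothetical usage of the proposed API.
func runQueryGoroutine(queryCPUNanos *int64, work func()) {
    go func() {
        work()
        var stats runtime.CPUStats // proposed type, does not exist today
        runtime.ReadGoroutineCPUStats(&stats)
        // Charge this goroutine's CPU time to the query's total.
        atomic.AddInt64(queryCPUNanos, int64(stats.User+stats.System))
    }()
}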

Alternatives

From a correctness standpoint, an alternative to offering these stats is to LockOSThread the active goroutine for exclusive thread access and then get coarser-grained, thread-level CPU usage by calling Getrusage for the current thread. The performance impact is unclear.
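
For example, a minimal Linux-only sketch of this alternative, assuming the golang.org/x/sys/unix package (threadCPUTime and doWork are illustrative names, not an established API):

package cpustats

import (
    "runtime"
    "time"

    "golang.org/x/sys/unix"
)

// threadCPUTime reads the current OS thread's user and system CPU time.
func threadCPUTime() (user, system time.Duration, err error) {
    var ru unix.Rusage
    if err := unix.Getrusage(unix.RUSAGE_THREAD, &ru); err != nil {
        return 0, 0, err
    }
    return time.Duration(ru.Utime.Nano()), time.Duration(ru.Stime.Nano()), nil
}

// doWork measures the CPU time its own body consumes.
func doWork() (user, system time.Duration) {
    // Pin the goroutine to its thread so RUSAGE_THREAD is attributable
    // to this goroutine (modulo whatever the thread ran before).
    runtime.LockOSThread()
    defer runtime.UnlockOSThread()

    u0, s0, _ := threadCPUTime()
    // ... the goroutine's actual work goes here ...
    u1, s1, _ := threadCPUTime()
    return u1 - u0, s1 - s0
}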

Additional notes

Obtaining execution statistics during runtime at a fine-grained goroutine level is essential for an application like a database. I'd like to focus this conversation on CPU usage specifically, but the same idea applies to goroutine memory usage. We'd like to be able to tell how much live memory a single goroutine has allocated to be able to decide whether this goroutine should spill a memory-intensive computation to disk, for example. This is reminiscent of #29696 but at a finer-grained level without a feedback mechanism.

I think that offering per-goroutine stats like this is useful even if only from an observability standpoint. Any application that divides work into independent sets of goroutines and would like to track the resource usage of a single group should benefit.

@gopherbot gopherbot added this to the Proposal milestone Sep 22, 2020

@martisch martisch commented Sep 22, 2020

High-level usage of different query paths can be shown by setting profiler labels on the goroutine:
https://golang.org/src/runtime/pprof/runtime.go?s=1362:1432#L26

And doing background profiling on the job:
https://go-review.googlesource.com/c/go/+/102755

Overall Go program usage can be queried from the enclosing container, or directly from the operating system's process stats.



@ianlancetaylor ianlancetaylor commented Sep 22, 2020

Yes, this is exactly what labels are for. A nice thing about labels is that they let you measure CPU or heap performance across a range of goroutines all cooperating on some shared task.

https://golang.org/pkg/runtime/pprof/#Labels
https://golang.org/pkg/runtime/pprof/#Do

Please let us know if you need something that is not addressed by labels.
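
For concreteness, a small sketch of the label-based approach (handleQuery and the "query" label key are illustrative):

package queries

import (
    "context"
    "runtime/pprof"
)

// handleQuery runs a query under a pprof label. CPU profile samples
// taken while run executes, including in goroutines started under ctx,
// carry the "query" label and can be aggregated per query afterwards.
func handleQuery(ctx context.Context, queryID string, run func(context.Context)) {
    pprof.Do(ctx, pprof.Labels("query", queryID), run)
}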



@asubiotto asubiotto commented Sep 23, 2020

Thanks for the suggestion. My main concern with profiling is that there is a non-negligible performance overhead. For example, running a quick workload (95% reads and 5% writes against a CockroachDB SQL table) shows that throughput drops by at least 8% when profiling with a one second interval.

I'm hoping that this information can be gathered by the scheduler in a much cheaper way since the question to answer is not "where has this goroutine spent most of its time" but "how much CPU time has this goroutine used". Would this even be feasible?



@ianlancetaylor ianlancetaylor commented Sep 24, 2020

Ah, I see. I would think that always collecting CPU statistics would be unreasonably expensive. But it does seem possible to collect them upon request in some way, at least when running on GNU/Linux. Every time a thread switched to a different goroutine it would call getrusage with RUSAGE_THREAD. The delta would be stored somewhere with the old goroutine. Then it could be retrieved as you suggest. Memory profiling information could be collected separately.

I don't know how useful this would be for most programs. In Go it is very easy to start a new goroutine, and it is very easy to ask an existing goroutine to do work on your behalf. That means that it's very easy for goroutine based stats to accidentally become very misleading, for example when the program forgets to collect the stats of some newly created goroutine. That is why runtime/pprof uses the labels mechanism.

Perhaps it would also be possible for this mechanism to use the labels mechanism. But then it is hard to see where the data should be stored or how it should be retrieved.
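
A conceptual sketch of the per-switch getrusage accounting described at the top of this comment (illustrative names, not actual runtime code): on each goroutine switch, the thread reads its own rusage and charges the delta to the goroutine that just stopped running.

package sketch

import (
    "time"

    "golang.org/x/sys/unix"
)

// perGoroutineCPU is what the runtime would store with each goroutine.
type perGoroutineCPU struct{ user, system time.Duration }

// onGoroutineSwitch would run in the scheduler whenever a thread stops
// running prev. lastUser/lastSystem hold the thread's rusage as of the
// previous switch; note that this costs one syscall per switch.
func onGoroutineSwitch(prev *perGoroutineCPU, lastUser, lastSystem *time.Duration) {
    var ru unix.Rusage
    if err := unix.Getrusage(unix.RUSAGE_THREAD, &ru); err != nil {
        return
    }
    u := time.Duration(ru.Utime.Nano())
    s := time.Duration(ru.Stime.Nano())
    prev.user += u - *lastUser
    prev.system += s - *lastSystem
    *lastUser, *lastSystem = u, s
}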


@ianlancetaylor ianlancetaylor changed the title proposal: add per-goroutine CPU stats proposal: runtime: add per-goroutine CPU stats Sep 24, 2020
@ianlancetaylor ianlancetaylor added this to Incoming in Proposals Sep 24, 2020

@tbg tbg commented Feb 19, 2021

And doing background profiling on the job:
https://go-review.googlesource.com/c/go/+/102755

Overall Go program usage can be queried from the enclosing container or process stats from the Operating system directly.

I'm curious what the internal experience with this mechanism is and how exactly it is used. I suppose some goroutine will be tasked with continuously reading the background profiling stream, but what does it do with it? Is there any code I can look at that gets from "here's a stream of timestamped label maps" to "this label used X CPU resources"? Do we "just" create a profile every X seconds from the stream? And if that is the idea, how is it different from doing "regular" profiling at a lower sample rate (you might miss a beat every time you restart the sample, but let's assume this is ok)? I feel like I'm missing the bigger picture here.

For the record, I did get the background tracing PR linked above working on the go1.16 branch: cockroachdb/cockroach#60795



@tbg tbg commented Feb 22, 2021

Ah, I see. I would think that always collecting CPU statistics would be unreasonably expensive. But it does seem possible to collect them upon request in some way, at least when running on GNU/Linux. Every time a thread switched to a different goroutine it would call getrusage with RUSAGE_THREAD. The delta would be stored somewhere with the old goroutine. Then it could be retrieved as you suggest. Memory profiling information could be collected separately.

With the RUSAGE_THREAD idea, wouldn't we be adding a syscall per scheduler tick? That seems very expensive and would likely be a non-starter for use cases such as ours where we want to track usage essentially at all times for most of the goroutines.

The most lightweight variant I have seen is based on calling nanotime() on each scheduler tick, as my colleague @knz prototyped here (the code just counts ticks, but imagine summing the nanotime() deltas instead of the ticks):

https://github.com/golang/go/compare/master...cockroachdb:taskgroup?expand=1

I understand that this basic approach has caveats (if threads get preempted, the duration of the preemption will be counted towards the goroutine's CPU time, and there's probably something about cgo calls too), but it seems workable and cheap enough to at least offer as a global opt-in.
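
A conceptual sketch of that accounting, with illustrative names (the actual prototype is in the branch linked above); nanotime here is a stand-in for the runtime's internal monotonic clock:

package sketch

import (
    "sync/atomic"
    "time"
)

var processStart = time.Now()

// nanotime stands in for the runtime's internal monotonic clock.
func nanotime() int64 { return int64(time.Since(processStart)) }

// gAccounting is the bookkeeping the scheduler would keep per goroutine.
type gAccounting struct {
    groupNanos *int64 // shared accumulator for the goroutine's task group
    onCPUSince int64  // nanotime() when the goroutine last went on-CPU
}

// onExecute would run when the scheduler puts the goroutine on a thread.
func onExecute(a *gAccounting) { a.onCPUSince = nanotime() }

// onDeschedule would run when it blocks, yields, or is preempted. Per the
// caveat above, OS-level preemption of the thread during this window is
// charged to the goroutine as well.
func onDeschedule(a *gAccounting) {
    atomic.AddInt64(a.groupNanos, nanotime()-a.onCPUSince)
}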

I don't know how useful this would be for most programs. In Go it is very easy to start a new goroutine, and it is very easy to ask an existing goroutine to do work on your behalf. That means that it's very easy for goroutine based stats to accidentally become very misleading, for example when the program forgets to collect the stats of some newly created goroutine. That is why runtime/pprof uses the labels mechanism.

The PR above avoids this (if I understand you correctly) by using propagation rules very similar to labels. A small initial set of goroutines (ideally just one, at the top of the "logical task") is explicitly tagged with the task group (identified by, say, an int64), to which the app holds a handle for querying. The ID is inherited by child goroutines. Statistics are accrued at the level of the task group, not the individual goroutine (though goroutine-level stats are possible: just create unique task group IDs). I recall that the Go team is averse to accidentally giving users goroutine-local storage, which this approach avoids: there need not be a way to retrieve the current task group from a goroutine. Given a task group ID, one can ask for its stats, but one cannot ask a goroutine for its task group ID.

Perhaps it would also be possible for this mechanism to use the labels mechanism. But then it is hard to see where the data should be stored or how it should be retrieved.

I agree that getting labels and either of the new mechanisms proposed to play together would be nice, but it doesn't seem straightforward. You could assign a counter to each label pair (key -> value), i.e. two goroutines that both have label[k]=v share a task group <k,v>. If a goroutine with labels map[string]string{"l1": "foo", "l2": "bar"} ran for 10ms, we would accrue that time to both label pairs, i.e. conceptually somewhere in the runtime m[tup{"l1", "foo"}] += 10 // ms and similarly for l2. Perhaps the way to access these metrics would be via the runtime/metrics package.

One difficulty is that we lose the simple identification of a task group with a memory address which we had before, because two maps with identical contents identify the same task group.

On the plus side, any labelled approach would avoid the clobbering of counters that could naively result if libraries did their own grouping (similar to how one should not call runtime.SetGoroutineLabels), as counters could be namespaced and/or nested.

You mention above that

A nice thing about labels is that they let you measure CPU or heap performance

but profiler labels are not used for heap profiles yet, and I think it's because of a somewhat similar problem: a goroutine's allocations in general outlive the goroutine, so the current mechanism (where the labels hang off the g) doesn't work and the lifecycle needs to be managed explicitly. (Everything I know about this I learned from #23458 (comment).) Maybe the way forward there also opens a path to sane per-label-pair counters?



@knz knz commented Feb 22, 2021

The most lightweight variant I have seen is based on nanotime() per scheduler ticks, as my colleague @knz prototyped here (the code just counts ticks, but imagine adding up the nanotime() instead of the ticks:

The implementation using nanotime() is in fact already available here: cockroachdb@b089033

Perhaps the way to access these metrics would be via the runtime/metrics package.

Done here: cockroachdb@9020a4b
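
For reference, reading such a counter through runtime/metrics would look something like this; the metric name is hypothetical, standing in for whatever the prototype registers:

package main

import (
    "fmt"
    "runtime/metrics"
)

func main() {
    // Hypothetical metric name, standing in for the prototype's.
    const name = "/sched/taskgroup/cpu-seconds:seconds"

    samples := []metrics.Sample{{Name: name}}
    metrics.Read(samples)
    // Unknown names come back as KindBad, so guard before reading.
    if samples[0].Value.Kind() == metrics.KindFloat64 {
        fmt.Println("task group CPU seconds:", samples[0].Value.Float64())
    }
}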

You could assign a counter to each label pair (key -> value), i.e. two goroutines that both have label[k]=v share a task group <k,v>

One elephant in the room is that looking up labels and traversing a Go map in the allocation hot path is a performance killer.

A nice thing about labels is that they let you measure CPU or heap performance

but profiler labels are not used for heap profiles yet,

This is the other elephant in the room: partitioning the heap allocator by profiler labels would run afoul of the small heap optimization (plus, it would be CPU-expensive to group the profiling data by labels).



@knz knz commented Feb 22, 2021

In case it wasn't clear from the latest comments from @tbg and myself: we believe that doing things at the granularity of individual goroutines is too fine-grained, and it raises painful questions about where to accumulate the metrics when goroutines terminate.

While @tbg is trying to salvage pprof labels as the entity that defines a grouping of goroutines, I am promoting a separate "task group" abstraction, which yields a simpler implementation and lower run-time overhead. I don't know which one is going to win yet (we need to run further experiments), but in any case we don't want a solution that keeps stats per individual goroutine.



@crazycs520 crazycs520 commented May 13, 2021

@knz I really like the feature in https://github.com/cockroachdb/go/commits/crdb-fixes. Could you create a pull request against the official Go repository?



@knz knz commented May 13, 2021

Thank you for your interest!
We're still evaluating whether this approach is useful in practice. Once we have a good story to tell, we'll share it with the Go team.
