
proposal: runtime: add per-goroutine CPU stats #41554

Open
asubiotto opened this issue Sep 22, 2020 · 34 comments · May be fixed by #51347
@asubiotto

asubiotto commented Sep 22, 2020

Per-process CPU stats can currently be obtained via third-party packages like https://github.com/elastic/gosigar. However, I believe certain types of applications need to understand CPU usage at a finer granularity.

Example

At a high level in CockroachDB, whenever an application sends a query to the database, we spawn one or more goroutines to handle the request. If more queries are sent to the database, they each get an independent set of goroutines. Currently, we have no way of showing the database operator how much CPU is used per query. This would let operators understand which queries are using more CPU, measure that against their expectations, and do things like cancel a query that's using too many resources (e.g. an accidentally intensive analytical query). If we had per-goroutine CPU stats, we could implement this by aggregating CPU usage across all goroutines spawned for a given query.

Fundamentally, I think this is similar to bringing up a task manager when you feel like your computer is slow, figuring out which process on your computer is using more resources than expected, and killing that process.

Proposal

Add a function to the runtime package that does something like:

type CPUStats struct {
    User   time.Duration
    System time.Duration
    ...
}

// ReadGoroutineCPUStats writes the calling goroutine's CPU stats into
// *stats.
func ReadGoroutineCPUStats(stats *CPUStats)

Alternatives

From a correctness standpoint, an alternative to offering these stats is to LockOSThread the active goroutine for exclusive thread access and then obtain coarser-grained, thread-level CPU usage by calling Getrusage for the current thread. The performance impact is unclear.

Additional notes

Obtaining execution statistics during runtime at a fine-grained goroutine level is essential for an application like a database. I'd like to focus this conversation on CPU usage specifically, but the same idea applies to goroutine memory usage. We'd like to be able to tell how much live memory a single goroutine has allocated to be able to decide whether this goroutine should spill a memory-intensive computation to disk, for example. This is reminiscent of #29696 but at a finer-grained level without a feedback mechanism.

I think that offering per-goroutine stats like this is useful even if it's just from an observability standpoint. Any application that divides work into independent sets of goroutines and would like to track the resource usage of a single group should benefit.

@gopherbot gopherbot added this to the Proposal milestone Sep 22, 2020
@martisch
Contributor

martisch commented Sep 22, 2020

High-level usage of different query paths can be surfaced by setting profiling labels on the goroutine:
https://golang.org/src/runtime/pprof/runtime.go?s=1362:1432#L26

And doing background profiling on the job:
https://go-review.googlesource.com/c/go/+/102755

Overall Go program usage can be queried from the enclosing container, or directly from the operating system's process stats.

@ianlancetaylor
Contributor

ianlancetaylor commented Sep 22, 2020

Yes, this is exactly what labels are for. A nice thing about labels is that they let you measure CPU or heap performance across a range of goroutines all cooperating on some shared task.

https://golang.org/pkg/runtime/pprof/#Labels
https://golang.org/pkg/runtime/pprof/#Do

Please let us know if you need something that is not addressed by labels.

@asubiotto
Author

asubiotto commented Sep 23, 2020

Thanks for the suggestion. My main concern with profiling is that there is a non-negligible performance overhead. For example, running a quick workload (95% reads and 5% writes against a CockroachDB SQL table) shows that throughput drops by at least 8% when profiling with a one second interval.

I'm hoping that this information can be gathered by the scheduler in a much cheaper way since the question to answer is not "where has this goroutine spent most of its time" but "how much CPU time has this goroutine used". Would this even be feasible?

@ianlancetaylor
Contributor

ianlancetaylor commented Sep 24, 2020

Ah, I see. I would think that always collecting CPU statistics would be unreasonably expensive. But it does seem possible to collect them upon request in some way, at least when running on GNU/Linux. Every time a thread switched to a different goroutine it would call getrusage with RUSAGE_THREAD. The delta would be stored somewhere with the old goroutine. Then it could be retrieved as you suggest. Memory profiling information could be collected separately.

I don't know how useful this would be for most programs. In Go it is very easy to start a new goroutine, and it is very easy to ask an existing goroutine to do work on your behalf. That means that it's very easy for goroutine based stats to accidentally become very misleading, for example when the program forgets to collect the stats of some newly created goroutine. That is why runtime/pprof uses the labels mechanism.

Perhaps it would also be possible for this mechanism to use the labels mechanism. But then it is hard to see where the data should be stored or how it should be retrieved.

@ianlancetaylor ianlancetaylor changed the title proposal: add per-goroutine CPU stats proposal: runtime: add per-goroutine CPU stats Sep 24, 2020
@ianlancetaylor ianlancetaylor added this to Incoming in Proposals (old) Sep 24, 2020
@tbg

tbg commented Feb 19, 2021

And doing background profiling on the job:
https://go-review.googlesource.com/c/go/+/102755

Overall Go program usage can be queried from the enclosing container or process stats from the Operating system directly.

I'm curious what the internal experience with this mechanism is and how it is used exactly. I suppose some goroutine will be tasked with continuously reading the background profiling stream, but what does it do with it? Is there any code I can look at for how to get to "here's a stream of timestamped label maps" to "this label used X CPU resources"? Do we "just" create a profile every X seconds from the stream? And if that is the idea, how is that different from doing "regular" profiling at a lower sample rate (you might miss a beat every time you restart the sample, but let's assume this is ok)? I feel like I'm missing the bigger picture here.

For the record, I did get the background tracing PR linked above working on the go1.16 branch: cockroachdb/cockroach#60795

@tbg

tbg commented Feb 22, 2021

Ah, I see. I would think that always collecting CPU statistics would be unreasonably expensive. But it does seem possible to collect them upon request in some way, at least when running on GNU/Linux. Every time a thread switched to a different goroutine it would call getrusage with RUSAGE_THREAD. The delta would be stored somewhere with the old goroutine. Then it could be retrieved as you suggest. Memory profiling information could be collected separately.

With the RUSAGE_THREAD idea, wouldn't we be adding a syscall per scheduler tick? That seems very expensive and would likely be a non-starter for use cases such as ours where we want to track usage essentially at all times for most of the goroutines.

The most lightweight variant I have seen is based on nanotime() per scheduler tick, as my colleague @knz prototyped here (the code just counts ticks, but imagine adding up the nanotime() instead of the ticks):

https://github.com/golang/go/compare/master...cockroachdb:taskgroup?expand=1

I understand that this basic approach has caveats (if threads get preempted, the duration of the preemption will be counted towards the goroutine's CPU time, and there's probably something about cgo calls too) but it seems workable and cheap enough to at least be able to opt into globally.

I don't know how useful this would be for most programs. In Go it is very easy to start a new goroutine, and it is very easy to ask an existing goroutine to do work on your behalf. That means that it's very easy for goroutine based stats to accidentally become very misleading, for example when the program forgets to collect the stats of some newly created goroutine. That is why runtime/pprof uses the labels mechanism.

The PR above avoids this (if I understand you correctly) by using propagation rules very similar to labels. A small initial set of goroutines (ideally just one, at the top of the "logical task") is explicitly labelled via the task group (identified by, say, an int64) to which the app holds a handle (for querying). The ID is inherited by child goroutines. Statistics are accrued at the level of the task group, not at the individual goroutine level (though goroutine level is possible: just create unique task group IDs). I recall that the Go team is averse to giving users goroutine-local storage by accident, which this approach would not do (there would not need to be a way to retrieve the current task group from a goroutine): given the task group ID, one can ask for its stats, but one cannot ask a goroutine for its task group ID.

Perhaps it would also be possible for this mechanism to use the labels mechanism. But then it is hard to see where the data should be stored or how it should be retrieved.

I agree that getting labels and either of the new mechanisms proposed to play together would be nice, but it doesn't seem straightforward. You could assign a counter to each label pair (key -> value), i.e. two goroutines that both have label[k]=v share a task group <k,v>. If a goroutine with labels map[string]string{"l1": "foo", "l2": "bar"} ran for 10ms, we would accrue those 10ms to both of its label pairs, i.e. conceptually somewhere in the runtime m[tup{"l1", "foo"}] += 10 // ms, and similarly for l2. Perhaps the way to access these metrics would be via the runtime/metrics package.

One difficulty is that we lose the simple identification of a task group with a memory address which we had before, because two maps with identical contents identify the same task group.

On the plus side, any labelled approach would avoid the clobbering of counters that could naively result if libraries did their own grouping (similar to how one should not call runtime.SetGoroutineLabels), as counters could be namespaced and/or nested.

You mention above that

A nice thing about labels is that they let you measure CPU or heap performance

but profiler labels are not used for heap profiles yet, and I think it's because of a somewhat similar problem - a goroutine's allocations in general outlive the goroutine, so the current mechanism (where the labels hang off the g) doesn't work and the lifecycle needs to be managed explicitly. (Everything I know about this I learned from #23458 (comment)). Maybe the way forward there opens a way forward for sane per-label-pair counters?

@knz

knz commented Feb 22, 2021

The most lightweight variant I have seen is based on nanotime() per scheduler ticks, as my colleague @knz prototyped here (the code just counts ticks, but imagine adding up the nanotime() instead of the ticks:

The implementation using nanotime() is in fact ready already here: cockroachdb@b089033

Perhaps the way to access these metrics would be via the runtime/metrics package.

Done here: cockroachdb@9020a4b

You could assign to each label pair (key -> value) assigned a counter (i.e. two goroutines with both have label[k]=v share a task group <k,v>

One elephant in the room is that looking up labels and traversing a Go map in the allocation hot path is a performance killer.

A nice thing about labels is that they let you measure CPU or heap performance

but profiler labels are not used for heap profiles yet,

This is the other elephant in the room: partitioning the heap allocator by profiler labels would run afoul of the small heap optimization. (Plus, it would be CPU-expensive to group the profiling data by labels.)

@knz

knz commented Feb 22, 2021

In case it wasn't clear from the latest comments from @tbg and myself: we believe that doing things at the granularity of individual goroutines is too fine-grained, and raises painful questions about where to accumulate the metrics when goroutines terminate.

While @tbg is trying to salvage pprof labels as the entity that defines a grouping of goroutines, I am promoting a separate "task group" abstraction which yields a simpler implementation and a lower run-time overhead. I don't know which one is going to win yet—we need to run further experiments—but in any case we don't want a solution that does stats per individual goroutines.

@crazycs520

crazycs520 commented May 13, 2021

@knz I really like the feature in https://github.com/cockroachdb/go/commits/crdb-fixes. Could you create a pull request to go official?

@knz

knz commented May 13, 2021

Thank you for your interest!
We're still evaluating whether this approach is useful in practice. Once we have a good story to tell, we'll share it with the Go team.

irfansharif added a commit to irfansharif/go that referenced this issue Feb 24, 2022
Fixes github.com/golang/issues/41554.

This commit introduces a /sched/goroutine/running:nanoseconds metric,
defined as the total time spent by a goroutine in the running state.
This measurement is useful for systems that would benefit from
fine-grained CPU attribution.

An alternative for scheduler-backed CPU attribution would be the use of
profiler labels. Given it's common to spawn multiple goroutines for the
same task, goroutine-backed statistics can easily become misleading.
Profiler labels instead let you measure CPU performance across a set of
cooperating goroutines. That said, it comes with overhead that makes it
infeasible to always enable. For high-performance systems that care
about fine-grained CPU attribution (e.g. databases that want to
measure total CPU time spent processing each request), profiler labels
are too cost-prohibitive, especially given the Go runtime has a much
cheaper view of the data needed.

It's worth noting that we already export /sched/latencies:seconds to
track scheduling latencies, i.e. time spent by a goroutine in the
runnable state. This commit does effectively the same, except for the
running state. Users are free to use this metric to power histograms or
tracking on-CPU time across a set of goroutines.

Change-Id: Id21ae4fcee0cd5f983604d61dad373098a0966bc
irfansharif added a commit to irfansharif/go that referenced this issue Feb 24, 2022
@gopherbot

gopherbot commented Feb 24, 2022

Change https://go.dev/cl/387874 mentions this issue: runtime,runtime/metrics: track on-cpu time per goroutine

irfansharif added a commit to irfansharif/go that referenced this issue Feb 24, 2022
irfansharif added a commit to irfansharif/go that referenced this issue Feb 24, 2022
@prattmic
Member

prattmic commented Feb 24, 2022

@irfansharif sent https://go.dev/cl/387874 today which adds a /sched/goroutine/running:nanoseconds metric to runtime/metrics, reporting total time in _Grunning for the current goroutine. This is effectively equivalent to the ReadGoroutineCPUStats API from the first comment, but hidden behind the runtime/metrics API.

FWIW, I don't think this API is a good fit for runtime/metrics, which typically returns the same results regardless of the calling goroutine [1]. Given the other metrics available, like /sched/latencies:seconds, I expected this to be a histogram of time spent in running, which I think could be useful, but not for the resource isolation requirements in this proposal.

[1] We also expect some monitoring systems to discover and export all metrics. This metric would be meaningless to export directly, as it would only report on a single reporter goroutine.

cc @golang/runtime @mknyszek

@irfansharif

irfansharif commented Feb 24, 2022

Thanks for the quick turnaround. I'm happy to go the route of ReadGoroutineCPUStats or a more direct GoroutineRunningNanos-like API; I only went with runtime/metrics given the flexibility of the API and because direct APIs have commentary favoring the runtime/metrics variants. I would also be happy with adding just the private grunningnanos() helper for external dependents like CRDB to go:linkname against directly, though I can see why that's unsatisfying to include in the stdlib. If the nanos were just tracked in type g struct, that too would be sufficient.

We also expect some monitoring systems to discover and export all metrics. This metric would be meaningless to export directly, as it would only report on a single reporter goroutine.

Great point. When documenting that the metric is scoped only to the calling goroutine, I hoped that would be sufficient for monitoring systems to know to filter it out explicitly.

irfansharif added a commit to irfansharif/go that referenced this issue Feb 24, 2022
irfansharif added a commit to irfansharif/go that referenced this issue Feb 24, 2022
irfansharif added a commit to irfansharif/go that referenced this issue Feb 24, 2022
@irfansharif

irfansharif commented Feb 25, 2022

On the CRDB side I've filed cockroachdb/cockroach#77041, which I hope adds more color as to why we're eager to get more observability into runtime internals.

Unrelated but perhaps of interest to the golang/runtime folks: we also use the # of runnable goroutines as an indicator of CPU saturation. We use this to power our admission control machinery for stability + better tail latencies. For that metric, we were able to get by with peeking at already-tracked runtime internals. For this running-nanos metric, however, I don't think we have similar allowances, hence the patch.

https://github.com/cockroachdb/cockroach/blob/0679c4cb6578/pkg/util/goschedstats/runnable.go#L22-L34

https://github.com/cockroachdb/cockroach/blob/0679c4cb6578/pkg/util/admission/granter.go#L29-L37

@aclements
Member

aclements commented Feb 25, 2022

For the number of runnable goroutines, that would be quite reasonable to add to runtime/metrics. I actually thought we had that, but it looks like we only have the total number of goroutines. It would perhaps make sense to add something like the /memory/classes hierarchy, where we subdivide the total number of goroutines by state.

@aclements
Member

aclements commented Feb 25, 2022

For example, running a quick workload (95% reads and 5% writes against a CockroachDB SQL table) shows that throughput drops by at least 8% when profiling with a one second interval.

I find this incredibly surprising. I would expect the impact of profiling to be well under 1%. In fact, it's not obvious to me that statistical profiling would be more expensive than tracking time on every goroutine switch in an application that's heavy on goroutine scheduling.

If profiling didn't have this overhead, would it make sense to use profiling labels plus a stream of profile events to accumulate the information you need?

@irfansharif

irfansharif commented Feb 25, 2022

If profiling didn't have this overhead, would it make sense to use profiling labels plus a stream of profile events to accumulate the information you need?

We attempted something similar in cockroachdb/cockroach#60508. I can try re-running a stock workload with profiling enabled to measure what the overhead is, perhaps things have changed since that 2020 comment and/or we didn't measure correctly. I'll do it next week.

In fact, it's not obvious to me that statistical profiling would be more expensive than tracking time on every goroutine switch in an application that's heavy on goroutine scheduling.

For my own understanding:

  • wouldn't tracking time on goroutines be faster than profiling, since we'd avoid logging the entire stack trace? The stack traces also present as memory overhead.
  • wouldn't it be slower to compute the total on-CPU time for a trace by analyzing the stream of profile events instead of reading off a counter?
  • does the overhead for https://go-review.googlesource.com/c/go/+/387874/ come primarily from the use of nanotime() on every goroutine switch? Or are we also worried about the two fields added to type g struct? I didn't know how to evaluate the latter. Given we already invoke nanotime() every 8th switch to track scheduler latencies, is this still a concern?

irfansharif added a commit to cockroachdb/go that referenced this issue Mar 14, 2022
@andy-kimball

andy-kimball commented Mar 17, 2022

To add to what Irfan has said, the inability to opt in to per-goroutine CPU tracking is a critical gap in the Go runtime for us. We're building a serverless database where we charge users per request, and doing that well requires finer granularity of CPU tracking. Using the profiling functionality isn't a good match for this scenario, because we want a precise measurement, not a statistical sample. Simple prepared queries can take just a few hundred microseconds to execute, which (as I understand it) is below the sampling granularity. In addition, charging usage based on samples is a blunt instrument.

While the details of Irfan's patch may not fit the Go runtime patterns well enough, can we get guidance on what would work better? As Irfan said, the current code calls nanotime every 8th goroutine switch. Even if the Go team is not comfortable with calling nanotime on every switch as in this patch, perhaps there could be an opt-in mechanism that enables apps to set gTrackingPeriod equal to 1 (and makes it a variable instead of a constant), so that only those apps pay the overhead (and are you sure calling nanotime here is actually a performance bottleneck to begin with?).

@g-talbot

g-talbot commented Apr 1, 2022

For server applications that often have to actually measure this on individual requests for accounting/loadbalancing/etc. purposes, it's good to know this programmatically. As I recall, Google has this as a class in C++ that uses the Linux performance counters support to give per-thread/fiber user/system CPU time used between specified points in a thread/fiber's execution. Can this approach be adopted for Golang, as many companies, like Cockroach, are running into this in larger-scale systems?

@g-talbot

g-talbot commented Apr 1, 2022

One other comment I'll make about profiling is that, unless you're using some sort of reservoir sampling approach, profiling will give you a great picture of the common case but miss important edge cases. This is because the common cases will dominate in the sampling. It's not unusual to need to be able to find events in the 3 or 4 9's (or more) that have vastly different performance characteristics and never get sampled, but when they happen, cause the rest of the system to "hiccup" and have an inordinate effect. Having a way to directly measure goroutines helps with that too.

@andreimatei
Contributor

andreimatei commented Apr 1, 2022

I also wanted to get the temperature of the Go team around putting some Linux perf events around the scheduler. In particular, a few days ago, this user events patch was merged into the kernel allowing user space programs to register perf events that can be dynamically enabled and disabled from outside the program through the tracefs file system. The program can check whether a particular event is enabled and, if it is, it outputs data to trace by writing to a tracefs file. The events can then be consumed through standard interfaces (perf, BPF, etc).
In theory, runtime scheduler events outputted to perf can be joined with kernel scheduler events, and true on-cpu time can be computed for goroutines.
I don't know the cost of producing the respective events when enabled (the cost is basically zero when disabled).

Irrespective of other decisions for adding Go native APIs for introspection discussed in this proposal, would a patch exploring the perf path be welcomed?

@mknyszek
Contributor

mknyszek commented Apr 14, 2022

@andreimatei I think supporting something like that would be of great interest to us.

The devil is in the details however. Like, it would be great to just dump the execution trace (runtime/trace) events into a kernel trace and be able to correlate everything together. But how exactly would that be enabled? Maybe if a GODEBUG setting is present, StartTrace actually starts a kernel trace instead? I'm honestly not sure how well that fits with how user tracepoints are modeled. I'm interested in hearing and/or seeing more details about what you had in mind.

@andreimatei
Contributor

andreimatei commented Apr 27, 2022

Maybe if a GODEBUG setting is present, StartTrace actually starts a kernel trace instead?

@mknyszek what I had in mind was to add some type of userspace tracepoints into the scheduler, making available the scheduler events as well as different arguments to tracers such as perf. For the use cases discussed in this issue, I imagine that the profiler labels of the goroutines involved would be good arguments to include in the tracepoints, as they can be used by the go program to attach arbitrary information to its goroutines. Then, one could use standard Linux techniques perf record, perf script, etc to collect the data at runtime, analyze it, and profit. Depending on exactly how we do it, BPF programs could also be attached to these tracepoints (courtesy of the Linux tracing infrastructure), which opens the door to collecting more arbitrary information from the running program.
I would try to not have anything controlled by a GODEBUG, but instead have the probes always available for a tracer to attach to them.

As you perhaps suggest, demonstrating that an execution trace equivalent to runtime/trace can be generated purely based on these tracepoints, and additionally that it can contain information invisible to the runtime around the kernel thread scheduling, would be very cool.

There are different details to work out. For example, what kind of tracepoints we would use. The new user events type that I mentioned in my previous message seems to have quickly fallen out of favor in the kernel; I think they might end up being removed. But, there are other types. In particular, there's DTrace USDTs, which are supported by the various Linux tools. These also have a feature (USDT semaphores) that makes them practically zero cost when tracing is not enabled. Except USDTs are not natively available to Go programs, I believe, or even to the Go runtime (as you need the compiler and linker to cooperate). There are tricks available out there, but I don't know how suitable they are for the runtime and whether they preserve the zero-cost property. (Btw, I think that supporting USDTs would generally be a major improvement to Go.) In theory, for a fixed version of the runtime, we could also use dynamic uprobes to extract the information we want. These don't need any changes to the runtime, but reading the arguments from memory is tedious. However, a simple noinline dummy function that promises a somewhat stable interface and groups the arguments together for the only purpose of them being easily readable by a probe attached to that function could be workable.
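The "noinline dummy function" idea at the end of that comment might look roughly like the sketch below. All names here are hypothetical; no such hook exists in the runtime today:

```go
package main

import "fmt"

// probeGoroutineSwitch exists only so a dynamic uprobe attached to its
// symbol can read the grouped scheduler arguments. go:noinline keeps
// the symbol (and its ABI-visible arguments) intact for the probe.
//
//go:noinline
func probeGoroutineSwitch(goid uint64, runningNanos int64) uint64 {
	// A uprobe on this symbol reads goid and runningNanos directly from
	// registers/stack; the body deliberately does no real work.
	return goid
}

func main() {
	fmt.Println("probe fired for goroutine", probeGoroutineSwitch(42, 1_000_000))
}
```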

@irfansharif

irfansharif commented May 13, 2022

I can try re-running a stock workload with profiling enabled to measure what the overhead is, perhaps things have changed since that 2020 comment and/or we didn't measure correctly. I'll do it next week.

Apologies for being slow. The profiling overhead is not as stark as @asubiotto (hi Alfonso!) originally described, so perhaps we did measure things incorrectly.

would it make sense to use profiling labels plus a stream of profile events to accumulate the information you need?

In addition to @andy-kimball's point about wanting to measure at a granularity finer than 100Hz/10ms for microsecond accounting, another reason not to rely on the profiling approach is its inaccuracy (profile event proportions for stack traces don't match actual on-CPU proportions) and imprecision (repeated measurements have a lot of variability), which we'd inherit if we built control loops on top of it; I'm borrowing the terminology and experimental results from #36821. I was also curious to see how measured CPU time would fare against the tests in that proposal, so I tried the following two things (available at https://github.com/irfansharif/runner):

// TestEquivalentGoroutines is a variant of the "parallel test" in
// https://github.com/golang/go/issues/36821. It tests whether goroutines that
// (should) spend the same amount of time on-CPU have similar measured on-CPU
// time.
func TestEquivalentGoroutines(t *testing.T) {
  mu := struct {
    sync.Mutex
    nanos map[int]int64
  }{}
  mu.nanos = make(map[int]int64)

  f := func(wg *sync.WaitGroup, id int) {
    defer wg.Done()

    var sum int
    for i := 0; i < 500000000; i++ {
      sum -= i / 2
      sum *= i
      sum /= i/3 + 1
      sum -= i / 4
    }

    nanos := grunningnanos()
    mu.Lock()
    mu.nanos[id] = nanos
    mu.Unlock()
  }

  var wg sync.WaitGroup
  for i := 0; i < 10; i++ {
    i := i // copy loop variable
    wg.Add(1)
    go f(&wg, i)
  }
  wg.Wait()

  mu.Lock()
  defer mu.Unlock()

  total := int64(0)
  for _, nanos := range mu.nanos {
    total += nanos
  }

  minexp, maxexp := float64(0.085), float64(0.115)
  for i, nanos := range mu.nanos {
    got := float64(nanos) / float64(total)

    assert.Greaterf(t, got, minexp,
      "expected proportion > %f, got %f", minexp, got)
    assert.Lessf(t, got, maxexp,
      "expected proportion < %f, got %f", maxexp, got)

    t.Logf("%d's got %0.2f%% of total time", i, got*100)
  }
}


// TestProportionalGoroutines is a variant of the "serial test" in
// https://github.com/golang/go/issues/36821. It tests whether goroutines that
// (should) spend a proportion of time on-CPU have proportionate measured on-CPU
// time.
func TestProportionalGoroutines(t *testing.T) {
  f := func(wg *sync.WaitGroup, v uint64, trip uint64, result *int64) {
    defer wg.Done()

    ret := v
    for i := trip; i > 0; i-- {
      ret += i
      ret = ret ^ (i + 0xcafebabe)
    }

    nanos := grunningnanos()
    atomic.AddInt64(result, nanos)
  }

  results := make([]int64, 10, 10)
  var wg sync.WaitGroup

  for iters := 0; iters < 10000; iters++ {
    for i := uint64(0); i < 10; i++ {
      i := i // copy loop variable
      wg.Add(1)
      go f(&wg, i+1, (i+1)*100000, &results[i])
    }
  }

  wg.Wait()

  total := int64(0)
  for _, result := range results {
    total += result
  }

  initial := float64(results[0]) / float64(total)
  maxdelta := float64(0.5)
  for i, result := range results {
    got := float64(result) / float64(total)
    mult := got / initial
    assert.InDelta(t, float64(i+1), mult, maxdelta)

    t.Logf("%d's got %0.2f%% of total time (%fx)", i, got*100, mult)
  }
}

The results suggest that the measured on-CPU time is both accurate and precise:

=== RUN   TestEquivalentGoroutines
    0's got  9.98% of total time
    1's got  9.53% of total time
    2's got  9.22% of total time
    3's got 10.42% of total time
    4's got  9.84% of total time
    5's got 10.43% of total time
    6's got 10.50% of total time
    7's got 10.21% of total time
    8's got 10.03% of total time
    9's got  9.86% of total time

=== RUN   TestProportionalGoroutines
    0's got  1.87% of total time (1.000000x)
    1's got  3.60% of total time (1.931999x)
    2's got  5.41% of total time (2.899312x)
    3's got  7.21% of total time (3.864451x)
    4's got  9.11% of total time (4.880925x)
    5's got 10.94% of total time (5.864723x)
    6's got 12.77% of total time (6.842004x)
    7's got 14.34% of total time (7.685840x)
    8's got 16.58% of total time (8.885060x)
    9's got 18.18% of total time (9.741030x)

the inability to opt-in to per goroutine CPU tracking is a critical gap in the Go runtime for us.

does the overhead for https://go-review.googlesource.com/c/go/+/387874/ come primarily from the use of nanotime() on every goroutine switch?

How do the package maintainers feel about making #51347 opt-in with zero overhead when opted out? We could, for example, only track the nanoseconds if a GODEBUG var is set, relying on branch prediction to get it right almost always and never invoking the additional nanotime() if unset. EDIT: Actually, opting in/out through GODEBUG wouldn't work, as we want CRDB binaries to always opt into this tracking without needing the envvar to be set. The zero-overhead approach would still work if we introduced an API to switch this tracking on/off (but that of course is not a net-zero API change).
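The opt-in gate described above can be sketched outside the runtime. The names below are hypothetical, and a real runtime implementation would use its internal atomics and clock; the point is only that the disabled path is a single predictable branch:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// cpuTrackingEnabled gates the extra clock reads. When off, the hot
// path pays one well-predicted load-and-branch and never touches the
// clock.
var cpuTrackingEnabled atomic.Bool

// SetCPUTracking is the hypothetical opt-in API (not a real runtime
// function).
func SetCPUTracking(on bool) { cpuTrackingEnabled.Store(on) }

// maybeSampleClock reads the clock only when tracking is enabled.
func maybeSampleClock(clock func() int64) (int64, bool) {
	if !cpuTrackingEnabled.Load() {
		return 0, false
	}
	return clock(), true
}

func main() {
	calls := 0
	clock := func() int64 { calls++; return int64(calls) }

	if _, ok := maybeSampleClock(clock); ok {
		panic("clock sampled while tracking disabled")
	}
	SetCPUTracking(true)
	if _, ok := maybeSampleClock(clock); !ok {
		panic("clock not sampled while tracking enabled")
	}
	fmt.Println("clock calls:", calls)
}
```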

@rhysh
Contributor

rhysh commented May 13, 2022

The report in #36821 reflects bugs in / shortcomings of Go's CPU profiling on Linux as of early 2020. @irfansharif , have you found that it is still inaccurate on Linux as of Go 1.18, or is it possible that the runtime/pprof sampling profiler could give acceptable results for on-CPU time?

Some of the earlier comments here pointed to map-based goroutine labels adding unacceptable computation overhead. If those were more efficient, would goroutine labels plus the sampling profiler work well enough?

There's also the question from @andy-kimball of granularity, in particular for billing customers for their CPU time. I don't fully understand why sampling isn't an option, even for (repeated) operations that take only a few hundred microseconds each. It seems like the work of repeated operations would show up in over 1000 samples (at 100 Hz) well before it's consumed $0.01 of CPU time (at current cloud prices).
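The arithmetic behind that claim, with an assumed (illustrative, not quoted) cloud vCPU price:

```go
package main

import "fmt"

func main() {
	const sampleHz = 100.0     // default CPU profiling rate
	const usdPerCPUHour = 0.04 // assumed vCPU price, USD
	// CPU-seconds purchasable for one cent, and how many profiler
	// samples that much CPU time would accumulate.
	cpuSeconds := 0.01 / usdPerCPUHour * 3600
	samples := cpuSeconds * sampleHz
	fmt.Printf("$0.01 buys %.0f CPU-seconds, i.e. ~%.0f samples\n", cpuSeconds, samples)
}
```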

Overall it seems to me that the current tools we have are close, so I'm interested in how we can improve them enough to be useful here.

irfansharif added a commit to irfansharif/go that referenced this issue May 13, 2022
Fixes golang#41554.

This commit introduces a /sched/goroutine/running:nanoseconds metric,
defined as the total time spent by a goroutine in the running state.
This measurement is useful for systems that would benefit from
fine-grained CPU attribution.

An alternative to scheduler-backed CPU attribution would be the use of
profiler labels. Given it's common to spawn multiple goroutines for the
same task, goroutine-backed statistics can easily become misleading.
Profiler labels instead let you measure CPU performance across a set of
cooperating goroutines. That said, it has two downsides:
- performance overhead; for high-performance systems that care about
  fine-grained CPU attribution (databases for e.g. that want to measure
  total CPU time spent processing each request), profiler labels are too
  cost-prohibitive, especially given the Go runtime has a much cheaper
  and more granular view of the data needed
- inaccuracy and imprecision, as evaluated in
  golang#36821

It's worth noting that we already export /sched/latencies:seconds to
track scheduling latencies, i.e. time spent by a goroutine in the
runnable state (go-review.googlesource.com/c/go/+/308933). This commit
does effectively the same except for the running state on the requesting
goroutine. Users are free to use this metric to power histograms or
tracking on-CPU time across a set of goroutines.

Change-Id: Ie78336a3ddeca0521ae29cce57bc7a5ea67da297
@andy-kimball

andy-kimball commented May 14, 2022

We would like to get to the point where a user can run an EXPLAIN ANALYZE and get back a cost for that single statement. This should work on a dev/staging cluster (i.e. with little or no other load), and give results that are similar to a production cluster that is fully loaded. Sampling would not work well for that.

Similarly, we'd like to be able to show a list of all recent SQL statements that have been run and show the CPU consumed by each. While some statements may have thousands of iterations and show an accurate number, there's often a "long tail" of other SQL statements that have only a few iterations. Those would often show 0, since we didn't happen to get a "hit". While it's perhaps better than nothing, we're trying to enable a better customer experience than that.

@rhysh
Contributor

rhysh commented May 16, 2022

Collecting a timestamp every time the scheduler starts or stops a goroutine, tracking goroutines' running time to sub-microsecond levels, sounds close to key features of the execution tracer (runtime/trace). That data stream also describes when goroutines interact with each other, which allows tracking which parts of the work each goroutine does are for particular inbound requests.

Rather than build up additional ways to see data that's already available via runtime/trace, are there ways to pare down or modify the execution tracer's view to something with lower complexity and overhead? I have guesses (below), but I'm curious for your view on how the execution tracer falls short today for your needs.

Is it:

  • CPU overhead (from collecting timestamps, from collecting stack traces, writing out the byte stream, or some other part) is too high?
  • Expensive startup (does work proportional to the number of living goroutines)?
  • Runtime does not flush partial results in a timely manner?
  • Inconvenient to parse?

@mknyszek
Contributor

mknyszek commented May 16, 2022

@rhysh I am currently having similar thoughts, and I think a low-enough overhead, user-parseable trace (with some way of inserting markers at the user level) sounds like it could potentially resolve this situation, assuming it's OK to stream this data out and analyze it programmatically out-of-band.

For this to work, though, the trace format itself has to be more scalable and robust, too. On scalability, the traces come out to be pretty large, and they unfortunately basically need to be read fully into memory. On robustness, you really always need a perfect trace to get any useful information out (or at least, the tools back out if it's not perfect). For instance, partial traces don't exist as a concept because there's no enforcement that, say, a block of really old events actually appears early in the trace, forcing you to always deal with the whole trace (from StartTrace to StopTrace).

I'm currently thinking a bit about tracing going forward, and this has been on my mind. No promises of course, but Go's tracing story in general could use some improvements.

@ajwerner

ajwerner commented Jun 2, 2022

What would it take to get this proposal moved into the Active column of the Proposals board?

@rsc
Contributor

rsc commented Jun 22, 2022

Anyone interested in tracing, I can't recommend highly enough Dick Sites's new book Understanding Software Dynamics.

For an intro see his Stanford talk.

It seems to me that really good tracing along these lines would help a lot more than scattered per-goroutine CPU stats.

@irfansharif

irfansharif commented Jun 22, 2022

I agree that really good tracing would help, but in the environments where we're hoping to use per-goroutine CPU stats, and the contexts we're hoping to use them in (CPU controllers), we're not running with the kinds of kernel patches (https://github.com/dicksites/KUtrace) Dick's tracing techniques seem to be predicated on. I've not evaluated how the more mainstream kernel tracing techniques (ftrace, strace) or BPF probes fare performance-wise.

I also recently learned how Go's own GC is able to bound its CPU use to 25% (https://github.com/golang/proposal/blob/master/design/44167-gc-pacer-redesign.md#a-note-about-cpu-utilization, authored by participants in this thread), which effectively uses a similar idea: capturing the grunning time spent in GC assists. This kind of CPU control is similar to what we could do if per-goroutine CPU stats were exposed by the runtime; it would let us hard-cap a certain tenant on a given machine to some fixed % of CPU (with the usual caveats around how accurate this measure is).

@rsc
Contributor

rsc commented Jun 22, 2022

@irfansharif I think there's an interesting question of how much we could learn with a Go-runtime-run version of the tracers (GUTrace, say) and no kernel help. I think quite a lot. The point was inspiration, not direct adoption.

The setting where tracing would not help is if you want the program to observe itself and respond, like the pacer does. The tracing I am thinking of has a workflow more like the Go profiler, where you collect the profile/trace and then interact with it outside the program.

@irfansharif

irfansharif commented Jun 22, 2022

The setting where tracing would not help is if you want the program to observe itself and respond, like the pacer does.

Ack. This is precisely the class of problems I'm hoping to push on with per-goroutine CPU stats (or at least just the grunning time).

@irfansharif

irfansharif commented Jun 22, 2022

@irfansharif, have you found that it is still inaccurate on Linux as of Go 1.18, or is it possible that the runtime/pprof sampling profiler could give acceptable results for on-CPU time?

For posterity, running the reproductions from https://github.com/chabbimilind/GoPprofDemo#serial-program:

$ go version
go version devel go1.19-7e33e9e7a3 Fri May 13 14:03:15 2022 -0400 darwin/arm64

$ time go run serial.go && go tool pprof -top serial_prof |  grep 'expect'
      60ms 20.00% 20.00%       70ms 23.33%  main.H_expect_14_546 (inline)
      50ms 16.67% 36.67%       50ms 16.67%  main.G_expect_12_73 (inline)
      50ms 16.67% 53.33%       60ms 20.00%  main.J_expect_18_18 (inline)
      30ms 10.00% 63.33%       30ms 10.00%  main.F_expect_10_91 (inline)
      20ms  6.67% 70.00%       20ms  6.67%  main.C_expect_5_46 (inline)
      20ms  6.67% 76.67%       20ms  6.67%  main.E_expect_9_09 (inline)
      20ms  6.67% 83.33%       20ms  6.67%  main.I_expect_16_36 (inline)
      10ms  3.33%   100%       10ms  3.33%  main.D_expect_7_27 (inline)

________________________________________________________
Executed in  793.74 millis    fish           external
   usr time  473.95 millis    0.13 millis  473.83 millis
   sys time  178.80 millis    3.86 millis  174.94 millis

Still somewhat inaccurate + imprecise (we observe variances across multiple runs) for small total running time (controllable by -m), increasing in accuracy + precision when sampling over larger durations.
