runtime: scheduler is slow when goroutines are frequently woken #18237
Comments
@philhofer, any chance you could try Go 1.8beta1? Even if a bug were found in Go 1.7, that branch is closed for all but security issues at this point. Go 1.8 should be a drop-in replacement for 1.7. See https://beta.golang.org/doc/go1.8 for details. The SSA back end for ARM will probably help your little devices a fair bit. See https://dave.cheney.net/2016/11/19/go-1-8-toolchain-improvements |
(Tagging this Go 1.9, unless you can reproduce on 1.8 and @aclements thinks it's easily fixable enough for 1.8) |
@philhofer, I'd be keen to see the .svg versions of those profiles if you are able to attach them to the issue.
|
@bradfitz Yes, we're busy trying to get 1.8beta1 on some hardware to benchmark it. We're very excited about the arm performance improvements. (However, these profiles are on an Intel Xeon host, which I presume will perform similarly between 1.7 and 1.8, unless there have been substantial changes made to the scheduler that I missed?) @davecheney Yes; I'll try to post a slightly-redacted one. |
Update: most of the scheduler activity is caused by blocking network reads. The call chain goes across two call stacks, which makes it a little tough to track down through stack traces alone, but here it is:
The raw call counts suggest that roughly 90% of the …

@davecheney I haven't extracted our profile format into the pprof format yet, but I hope that answers the same question you were hoping the svg web would answer. |
Oh, sorry, missed that. In any case, please test 1.8 wherever possible in the next two weeks. It's getting increasingly hard to make changes to 1.8. The next two weeks are the sweet spot for bug reports. Thanks! |
We just finished our first set of runs on 1.8, and things look pretty much identical on our x86 machines.
|
@aclements, what's the status here? |
Ping @aclements |
I have a little more information, in case you're interested.

Fundamentally, the issue here is that … Now, in a sane world we could wire up … So, part of this is Linux's fault, and part of it is caused by the scheduler being generally slow. (Consider: in that profile, we spend nearly twice as much time in the scheduler as we do checksumming every single byte received over the network.) |
Thanks for the extra information, @philhofer. That's very useful in understanding what's going on here. Given how much time you're spending in … I'm not really sure what to do about this. It would at least help confirm this if you could post an execution trace (in this case, a sufficiently zoomed-in screen shot is probably fine, since there's no easy way to redact the stacks in an execution trace). |
I have a similar issue, with no network involved. My project performs application protocol analysis against a libpcap capture. Different pools of goroutines perform reading the raw trace from disk, packet parsing, and TCP/IP flow reassembly. CPU profiling indicates > 25% of total time spent in findrunnable. I'm running on 64-bit OSX, so most of that time is in kevent. @aclements description does not appear to fit my situation, since more data is always available throughout a run. Whenever individual goroutines block, it's because they have just dispatched work to one or more goroutines further down the pipeline. I'm running go version go1.9.1 darwin/amd64. The project is open-source, so I can point you to the source and the SVG profiles generated from my perf tests. Would that be helpful, and would it be better to keep in this issue or file a new one? |
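For context, a minimal sketch of a staged pipeline of goroutine pools like the one described (hypothetical stage names and pool sizes, not the project's actual code):

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical three-stage pipeline: read -> parse -> reassemble.
// Each stage is a pool of goroutines fed by a channel, mirroring the
// shape described above.
func main() {
	raw := make(chan []byte, 64)
	parsed := make(chan string, 64)

	var parsers, assemblers sync.WaitGroup

	// Parsing pool.
	for i := 0; i < 4; i++ {
		parsers.Add(1)
		go func() {
			defer parsers.Done()
			for pkt := range raw {
				parsed <- fmt.Sprintf("pkt(%d bytes)", len(pkt))
			}
		}()
	}

	// Reassembly pool.
	for i := 0; i < 2; i++ {
		assemblers.Add(1)
		go func() {
			defer assemblers.Done()
			for p := range parsed {
				_ = p // flow reassembly would happen here
			}
		}()
	}

	// Reader stage: hand raw "packets" to the parsers.
	for i := 0; i < 1000; i++ {
		raw <- make([]byte, 1500)
	}
	close(raw)
	parsers.Wait()
	close(parsed)
	assemblers.Wait()
}
```

Each handoff between stages is a channel operation, so whenever a stage briefly runs dry its goroutines park and the scheduler has to wake them again as work arrives, which is the kind of frequent block/wake pattern this issue is about.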
> The project is open-source, so I can point you to the source and the SVG profiles generated from my perf tests. Would that be helpful, and would it be better to keep in this issue or file a new one?

Please file a new issue. If it turns out this is a duplicate we can merge them. Thanks.
|
I was able to capture a trace of the original issue @philhofer described and wanted to add the requested screenshots to verify that this is the scheduler worst-case scenario described by @aclements. Though the profiling samples nicely show the time being spent in …

From a macro view, here's about 40ms total: most of the tiny slivers are network wake-ups to read an MTU off a particular socket, but not enough data to fill the desired buffer (think 1500-byte MTU but 64k desired buffers). The burst of longer operations on the right is processing that happened once enough data had been received to do higher-level work with the data (Reed-Solomon computation in this case).

The next screenshot is a zoom in to the small-goroutine section (~2ms total): I've selected a tiny slice, and that's the identical stack across all the very small routines. I think this tells the story of the scheduler constantly going idle, then being woken up by the network.

I'm also willing to post more screenshots like this if there are more specific questions. |
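A minimal sketch of the read pattern described above, assuming a plain TCP loopback connection: the reader wants 64 KiB at a time while data arrives in MTU-sized chunks, so each arrival wakes the reading goroutine (and the scheduler) well before the buffer fills.

```go
package main

import (
	"io"
	"log"
	"net"
	"time"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	defer ln.Close()

	// Writer: emit MTU-sized chunks with small gaps, like packets
	// trickling in off the wire.
	go func() {
		conn, err := net.Dial("tcp", ln.Addr().String())
		if err != nil {
			log.Fatal(err)
		}
		defer conn.Close()
		chunk := make([]byte, 1500)
		for i := 0; i < 64; i++ {
			if _, err := conn.Write(chunk); err != nil {
				return
			}
			time.Sleep(time.Millisecond)
		}
	}()

	conn, err := ln.Accept()
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Reader: wants 64 KiB at a time, but each 1500-byte arrival wakes
	// it (and the scheduler) long before the buffer is full.
	buf := make([]byte, 64*1024)
	if _, err := io.ReadFull(conn, buf); err != nil {
		log.Fatal(err)
	}
}
```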
Change https://golang.org/cl/228577 mentions this issue: |
Change https://golang.org/cl/259578 mentions this issue: |
Change https://golang.org/cl/264477 mentions this issue: |
Work stealing is a scalability bottleneck in the scheduler. Since each P has a work queue, work stealing must look at every P to determine if there is any work. The number of Ps scales linearly with GOMAXPROCS (i.e., the number of Ps _is_ GOMAXPROCS), thus this work scales linearly with GOMAXPROCS.

Work stealing is a later attempt by a P to find work before it goes idle. Since the P has no work of its own, extra costs here tend not to directly affect application-level benchmarks. Where they show up is extra CPU usage by the process as a whole. These costs get particularly expensive for applications that transition between blocked and running frequently.

Long term, we need a more scalable approach in general, but for now we can make a simple observation: idle Ps ([1]) cannot possibly have anything in their runq, so we need not bother checking at all.

We track idle Ps via a new global bitmap, updated in pidleput/pidleget. This is already a slow path (requires sched.lock), so we don't expect high contention there.

Using a single bitmap avoids the need to touch every P to read p.status. Currently, the bitmap approach is not significantly better than reading p.status. However, in a future CL I'd like to apply a similar optimization to timers. Once done, findrunnable would not touch most Ps at all (in mostly idle programs), which will avoid memory latency to pull those Ps into cache.

When reading this bitmap, we are racing with Ps going in and out of idle, so there are a few cases to consider:

1. _Prunning -> _Pidle: Running P goes idle after we check the bitmap. In this case, we will try to steal (and find nothing) so there is no harm.

2. _Pidle -> _Prunning while spinning: A P that starts running may queue new work that we miss. This is OK: (a) that P cannot go back to sleep without completing its work, and (b) more fundamentally, we will recheck after we drop our P.

3. _Pidle -> _Prunning after spinning: After spinning, we really can miss work from a newly woken P. (a) above still applies here as well, but this is also the same delicate dance case described in findrunnable: if nothing is spinning anymore, the other P will unpark a thread to run the work it submits.

Benchmark results from WakeupParallel/syscall/pair/race/1ms (see golang.org/cl/228577):

name                    old msec     new msec     delta
Perf-task-clock-8       250 ± 1%     247 ± 4%       ~     (p=0.690 n=5+5)
Perf-task-clock-16      258 ± 2%     259 ± 2%       ~     (p=0.841 n=5+5)
Perf-task-clock-32      284 ± 2%     270 ± 4%     -4.94%  (p=0.032 n=5+5)
Perf-task-clock-64      326 ± 3%     303 ± 2%     -6.92%  (p=0.008 n=5+5)
Perf-task-clock-128     407 ± 2%     363 ± 5%    -10.69%  (p=0.008 n=5+5)
Perf-task-clock-256     561 ± 1%     481 ± 1%    -14.20%  (p=0.016 n=4+5)
Perf-task-clock-512     840 ± 5%     683 ± 2%    -18.70%  (p=0.008 n=5+5)
Perf-task-clock-1024   1.38k ±14%   1.07k ± 2%   -21.85%  (p=0.008 n=5+5)

[1] "Idle Ps" here refers to _Pidle Ps in the sched.pidle list. In other contexts, Ps may temporarily transition through _Pidle (e.g., in handoffp); those Ps may have work.

Updates #28808
Updates #18237

Change-Id: Ieeb958bd72e7d8fb375b0b1f414e8d7378b14e29
Reviewed-on: https://go-review.googlesource.com/c/go/+/259578
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Austin Clements <austin@google.com>
Trust: Michael Pratt <mpratt@google.com>
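The idle-P bitmap at the heart of this change can be sketched roughly as follows (simplified from the runtime's pMask: the real code updates it under sched.lock in pidleput/pidleget and uses atomic loads on the read side):

```go
package main

import "fmt"

// pMask is a bitmap with one bit per P, in the spirit of the change
// described above (simplified: no locking or atomics here).
type pMask []uint32

func (p pMask) read(id uint32) bool {
	word, mask := id/32, uint32(1)<<(id%32)
	return p[word]&mask != 0
}

func (p pMask) set(id uint32) {
	word, mask := id/32, uint32(1)<<(id%32)
	p[word] |= mask
}

func (p pMask) clear(id uint32) {
	word, mask := id/32, uint32(1)<<(id%32)
	p[word] &^= mask
}

func main() {
	const gomaxprocs = 8
	idle := make(pMask, (gomaxprocs+31)/32)

	idle.set(3) // P 3 parked via pidleput
	idle.set(5)

	// Work stealing: skip Ps whose idle bit is set, since an idle P
	// cannot have anything in its local run queue.
	for id := uint32(0); id < gomaxprocs; id++ {
		if idle.read(id) {
			continue
		}
		fmt.Printf("would try to steal from P%d\n", id)
	}
}
```

Because a P only lands on the sched.pidle list after its local run queue is empty, a set bit never hides runnable work; the race cases above cover the transitions around that.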
Change https://golang.org/cl/264697 mentions this issue: |
Following golang.org/cl/259578, findrunnable still must touch every other P in checkTimers in order to look for timers to steal. This scales poorly with GOMAXPROCS and potentially performs poorly by pulling remote Ps into cache.

Add timerpMask, a bitmask that tracks whether each P may have any timers on its timer heap.

Ideally we would update this field on any timer add / remove to always keep it up to date. Unfortunately, updating a shared global structure is antithetical to sharding timers by P, and doing so approximately doubles the cost of addtimer / deltimer in microbenchmarks.

Instead we only (potentially) clear the mask when the P goes idle. This covers the best case of avoiding looking at a P _at all_ when it is idle and has no timers. See the comment on updateTimerPMask for more details on the trade-off. Future CLs may be able to expand cases we can avoid looking at the timers.

Note that the addition of idlepMask to p.init is a no-op. The zero value of the mask is the correct init value so it is not necessary, but it is included for clarity.

Benchmark results from WakeupParallel/syscall/pair/race/1ms (see golang.org/cl/228577). Note that these are on top of golang.org/cl/259578:

name                    old msec     new msec     delta
Perf-task-clock-8       244 ± 4%     246 ± 4%       ~     (p=0.841 n=5+5)
Perf-task-clock-16      247 ±11%     252 ± 4%       ~     (p=1.000 n=5+5)
Perf-task-clock-32      270 ± 1%     268 ± 2%       ~     (p=0.548 n=5+5)
Perf-task-clock-64      302 ± 3%     296 ± 1%       ~     (p=0.222 n=5+5)
Perf-task-clock-128     358 ± 3%     352 ± 2%       ~     (p=0.310 n=5+5)
Perf-task-clock-256     483 ± 3%     458 ± 1%     -5.16%  (p=0.008 n=5+5)
Perf-task-clock-512     663 ± 1%     612 ± 4%     -7.61%  (p=0.008 n=5+5)
Perf-task-clock-1024   1.06k ± 1%   0.95k ± 2%   -10.24%  (p=0.008 n=5+5)

Updates #28808
Updates #18237

Change-Id: I4239cd89f21ad16dfbbef58d81981da48acd0605
Reviewed-on: https://go-review.googlesource.com/c/go/+/264477
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Trust: Michael Pratt <mpratt@google.com>
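A similarly simplified sketch of how a stealing loop can consult such a per-P "may have timers" mask (the real checkTimers and updateTimerPMask handle the locking and races that this ignores):

```go
package main

import "fmt"

// Sketch of the idea described above: a P's timer heap is only
// examined if its bit is set. Bits are conservatively left set and
// only cleared when a P goes idle with an empty timer heap, so a set
// bit means "may have timers", never "definitely has none".
func main() {
	const gomaxprocs = 8

	// One bool per P stands in for the real bitmap.
	mayHaveTimers := [gomaxprocs]bool{2: true, 6: true}

	for id := 0; id < gomaxprocs; id++ {
		if !mayHaveTimers[id] {
			continue // skip: no need to pull that P's timer heap into cache
		}
		fmt.Printf("would run checkTimers for P%d\n", id)
	}
}
```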
Change https://golang.org/cl/266367 mentions this issue: |
In golang.org/cl/264477, I missed this new block after rebasing past golang.org/cl/232298. These fields must be zero if there are no timers.

Updates #28808
Updates #18237

Change-Id: I2d9e1cbf326497c833daa26b11aed9a1e12c2270
Reviewed-on: https://go-review.googlesource.com/c/go/+/266367
Run-TryBot: Michael Pratt <mpratt@google.com>
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: Michael Pratt <mpratt@google.com> |
Hello gophers. We have a similar issue: mcall, findrunnable, and schedule excessively consume CPU time (mcall is up to 75% on stack). The application itself consumes 30-50% CPU while it should mainly be in I/O wait. The application is basically a proxy with tiny additional logic: it receives 20-byte packets over TCP/IP, accumulates up to 500 bytes, and sends them on again over TCP/IP.

What makes it worse in our case is that mcall/findrunnable are very aggressive, to the point where the application is mainly running runtime functions, and our own code gets delayed so much that we see a very visible lag of up to seconds.

The issue described by @philhofer is not applicable in our case: we read and write relatively small packets (should be below 1500 bytes), and increasing packet size (batch write / batch read) seems to decrease the CPU consumption. We use a simple protocol on top of TCP/IP: …

Please let me know if you need the full profile or source code.
|
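For illustration, a hedged sketch of the accumulate-and-forward step described above, using the 20-byte records and 500-byte batches from the description (hypothetical code, not the application's actual implementation):

```go
package main

import (
	"fmt"
	"io"
	"net"
)

// forward copies fixed-size 20-byte records from src to dst, flushing
// once 500 bytes have accumulated, roughly the shape described above.
// Framing and error handling are deliberately simplified.
func forward(src io.Reader, dst io.Writer) error {
	batch := make([]byte, 0, 500)
	record := make([]byte, 20)
	for {
		if _, err := io.ReadFull(src, record); err != nil {
			return err
		}
		batch = append(batch, record...)
		if len(batch) >= cap(batch) {
			if _, err := dst.Write(batch); err != nil {
				return err
			}
			batch = batch[:0]
		}
	}
}

type countingWriter struct{}

func (countingWriter) Write(p []byte) (int, error) {
	fmt.Println("flushed", len(p), "bytes")
	return len(p), nil
}

func main() {
	// Demo over an in-memory pipe: 50 small records in, one ~500-byte
	// write out per 25 records.
	client, server := net.Pipe()
	go func() {
		defer client.Close()
		rec := make([]byte, 20)
		for i := 0; i < 50; i++ {
			client.Write(rec)
		}
	}()
	err := forward(server, countingWriter{})
	fmt.Println("forward stopped:", err)
}
```

Each small read still wakes a goroutine, but batching the writes keeps the number of outbound syscalls and wake-ups on the far side down, which matches the observation that larger batches reduce CPU consumption.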
@aka-rider What CPU, OS, and version of Go did this profile come from? The amount of time spent in nanotime looks suspicious. |
@aka-rider In addition to Chris' questions, could you describe the application a bit more? In particular:
I'm particularly curious about the last question, as you mention "lag up to seconds", yet your profile only shows 118s of CPU on 300s wall-time, i.e., less than 1 CPU of work on average. Unless the system is under heavy load, I wouldn't expect to see such extreme latency, which may be indicative of a scheduling bug rather than overall performance (what this issue is about). |
I forgot to mention that Go 1.16 will include several timer overhead improvements, so if you can test on tip (if this isn't already on tip), that would be helpful. |
Hi @ChrisHines,

We are running virtual machines in KVM: …

I agree, nanotime also bothers me. I came across this ticket #27707, and the following snippet:

    package main

    import (
        "time"
    )

    func main() {
        for {
            time.Sleep(time.Millisecond)
        }
    }

consumes ~10% CPU on our cluster (it's < 2% CPU on my Mac and my colleague's Ubuntu laptop) |
Hi @prattmic,

We have 1 goroutine per incoming connection (receive) and 1 goroutine per outgoing connection (async send). The number of connections is 10-200. Interestingly, the CPU consumption doesn't depend much on the number of connections: it's ~0% with 0 connections, 30% with one connection, and 30-50% with 200 connections.

This particular profile comes from our development environment; the number of connections was ~6-8, and the machine was not busy. We used to see significant delays with an increased number of connections. I tried GOMAXPROCS 1, 2, 5, 20, 50, and it doesn't seem to change anything.

The initial design used channels. Every read goroutine would write received data to a single buffered channel … Right now, we use a circular buffer with a mutex: all receiving goroutines write into it and one goroutine reads from it, and communication with the send goroutines is the same, except there is a buffer+mutex per sending goroutine.

I also tried to get rid of the asynchronous send goroutines and write to TCP directly from the processing goroutine; it increased the CPU consumption from 30% to 50% for a single connection. It makes me think that it is specifically TCPConn.Write (and maybe Read) that causes this behavior. |
@aka-rider lots of time in nanotime … I don't know of a handy way to see which clock is in use besides reading it directly from a program. https://gist.github.com/prattmic/a9816731294715426d7b85eb091d0102 will dump out the VDSO parameter page, from which on my system I can see the clock mode at offset 0x80 is 0x1 (VDSO_CLOCKMODE_TSC), which is what I want. (Offset and location vary depending on kernel version...)

All of this is quite a bit of a tangent from the issue at hand, though, as even a slow VDSO shouldn't be causing this much trouble. I'll take a look at the rest of your summary tomorrow. |
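As a rough complement to the VDSO dump, and assuming Linux, the kernel's active clocksource can also be read from sysfs and the per-call cost of the clock estimated directly (a coarse sketch, not a substitute for inspecting the VDSO clock mode):

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"time"
)

func main() {
	// On Linux the active clocksource is exported via sysfs;
	// "kvm-clock" or "xen" here usually means a paravirtual clock
	// rather than a raw TSC.
	const path = "/sys/devices/system/clocksource/clocksource0/current_clocksource"
	if b, err := os.ReadFile(path); err == nil {
		fmt.Println("clocksource:", strings.TrimSpace(string(b)))
	} else {
		fmt.Println("could not read clocksource:", err)
	}

	// Rough cost of a single clock read; on a healthy VDSO path this
	// is typically a few tens of nanoseconds per call.
	const n = 1_000_000
	start := time.Now()
	for i := 0; i < n; i++ {
		_ = time.Now()
	}
	fmt.Printf("time.Now: ~%v per call\n", time.Since(start)/n)
}
```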
Hi @prattmic
|
With gotip, the CPU profile and consumption are very similar to what they were before.
|
It does seem that your system is using VCLOCK_PVCLOCK instead of TSC, though looking at the implementation it seems pretty efficient except for the VCLOCK_NONE case, which the comment says will only affect Xen, not KVM. I still don't think the clock source is really the underlying problem here, but it may be interesting if you can run this workload in a non-virtualized environment to see if it behaves better.

More importantly, I now notice that the entry point to ~90% of your time in the scheduler is due to calls to runtime.Gosched. |
Thanks for your suggestion, @prattmic.

It seems to behave better in a non-virtualized environment, although it's hard to reproduce the exact workload. Lab benchmarks on our laptops don't have the same CPU consumption.

I had the following in my code (in the next lines there's a call to …):

    for cycles := 0; cycles < 10000; cycles++ {
        if payload, stream, err = s.dequeue(p); err != nil || payload != nil {
            return
        }
        runtime.Gosched()
    }

The following profile is with the code above commented out:

The CPU consumption stays roughly the same, about 30% |
Ah, this looks better. Not better in that usage isn't down much, but better in that it now looks like the symptoms this issue was originally intended to address. That is, there may be lots of fast cycles between idle and active, where the scheduler spends lots of time futilely trying to find work, then finally goes to sleep only to be re-woken almost immediately.

Is there a particular reason you need that runtime.Gosched loop?

To help verify the above behavior, could you provide the full pprof profile for this run? Plus a runtime trace? You may need to make the run fairly short if the trace is unreasonably large. |
Hi.

Right now the application behaves well without it, so it can go. The reason it was added is that waiting on a chan could leave a goroutine sleeping for too long. It seems like waiting on a Cond is better in that sense.

pprof.buffer.samples.cpu.008.pb.gz

I will think of how to get the trace from the Docker container. |
In general the Go scheduler reacts badly to any kind of busy loop, whether implemented with runtime.Gosched or otherwise. |
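For comparison with the busy loop above, a hedged sketch of a mutex+Cond protected queue in which the consumer parks in Wait instead of spinning with runtime.Gosched (hypothetical queue type, not the application's actual code):

```go
package main

import (
	"fmt"
	"sync"
)

// queue is a minimal mutex+Cond protected buffer: the consumer sleeps
// in Wait instead of polling, and is woken only when a producer
// actually enqueues something.
type queue struct {
	mu    sync.Mutex
	cond  *sync.Cond
	items [][]byte
}

func newQueue() *queue {
	q := &queue{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

func (q *queue) enqueue(p []byte) {
	q.mu.Lock()
	q.items = append(q.items, p)
	q.mu.Unlock()
	q.cond.Signal()
}

func (q *queue) dequeue() []byte {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.items) == 0 {
		q.cond.Wait() // parks the goroutine; no scheduler spinning
	}
	p := q.items[0]
	q.items = q.items[1:]
	return p
}

func main() {
	q := newQueue()
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for i := 0; i < 3; i++ {
			fmt.Println("got", len(q.dequeue()), "bytes")
		}
	}()
	for i := 0; i < 3; i++ {
		q.enqueue(make([]byte, 20))
	}
	wg.Wait()
}
```

Signal is only called when something is actually enqueued, so the consumer is woken exactly when there is work rather than on every pass through the scheduler.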
Thanks for letting me know, @ianlancetaylor. Please find the attached trace. |
What version of Go are you using (go version)?

go1.7.3

What operating system and processor architecture are you using (go env)?

linux/amd64; Xeon E5-2670 (dual-socket 6-core packages, non-HT)
Our profiles indicate that we're spending an enormous number of cycles in runtime.findrunnable (and its callees) on the hosts that serve as our protocol heads. Briefly, these hosts translate HTTP CRUD operations into sets of transactions to be performed on our storage hosts, so the only real I/O these hosts do is networking.
Here's what I see in our cpu profiles when I run a benchmark with 40 clients against a single host backed by 60 storage controllers:
... here's the same benchmark, but this time against two hosts backed by (the same) 60 storage controllers:
Interestingly, the single-head cpu consumption is at 560% of 1200%, and the dual-head cpu consumption is at 470% and 468% of 1200%, respectively.
A couple notable details:

Proportionally more time is spent in runtime.findrunnable in the single-node case. I'd expect that system to have on average 2x the number of goroutines, but I didn't think more goroutines would cause the proportional amount of time in the scheduler to increase. (I had presumed that more goroutines meant less work-stealing and polling, which would mean proportionally less time doing expensive stuff like syscalls and atomics.)

Let me know if there are other details I can provide.
Thanks,
Phil