
runtime: scheduler is slow when goroutines are frequently woken #18237

philhofer opened this issue Dec 7, 2016 · 35 comments


@philhofer commented Dec 7, 2016

What version of Go are you using (go version)?


What operating system and processor architecture are you using (go env)?

linux/amd64; Xeon E5-2670 (dual-socket 6-core packages, non-HT)

Our profiles indicate that we're spending an enormous number of cycles in runtime.findrunnable (and its callees) on the hosts that serve as our protocol heads.

Briefly, these hosts translate HTTP CRUD operations into sets of transactions to be performed on our storage hosts, so the only real I/O these hosts do is networking.

Here's what I see in our cpu profiles when I run a benchmark with 40 clients against a single host backed by 60 storage controllers:

host 486938695e10692ab3a6a554cf47486b: 7356 samples
 top flat  pct symbol            
1831 2030 24.9 syscall.Syscall   
 900  900 12.2 i/c/siphash.blocks
 835  835 11.4 runtime.futex     
 661  661  9.0 runtime.epollwait 
 224  224  3.0 runtime.memmove   
 180  297  2.4 runtime.runqgrab  
 176 2584  2.4 runtime.findrunnable
 171  171  2.3 runtime/internal/atomic.Cas
 116  116  1.6 runtime/internal/atomic.Xchg
  85   85  1.2 runtime/internal/atomic.Load
host 486938695e10692ab3a6a554cf47486b
"runtime.findrunnable" -- in 2584 samples of 7356 (35.1%)
1 callers:
  in  flat symbol
2584 2694.0 runtime.schedule
21 callees:
 out  flat symbol
  67 130.0 runtime.unlock
  20 46.0 runtime/internal/atomic.Xadd
  14 85.0 runtime/internal/atomic.Load
 406 406.0 runtime.injectglist
 488 488.0 runtime.stopm
 331 331.0 runtime.runqsteal
 139 238.0 runtime.lock
  16 31.0 runtime/internal/atomic.Xchg64
  26 26.0 runtime.pidleput
   2  2.0 runtime.releasep
  59 66.0 runtime.runqempty
  21 161.0 runtime.casgstatus
 777 777.0 runtime.netpoll
   9  9.0 runtime/internal/atomic.Store64
   8  8.0 runtime.netpollinited
   2  8.0 runtime.acquirep
  10 15.0 runtime.pidleget
   8  8.0 runtime.globrunqget
   2 12.0 runtime.fastrand1
   2  2.0 runtime.nanotime
   1 10.0 runtime.runqget

... here's the same benchmark, but this time against two hosts backed by (the same) 60 storage controllers:

host 91b42bdeee8bc69fe40c33dff7c146ac: 6563 samples
 top flat  pct symbol            
1695 1829 25.8 syscall.Syscall   
 977  977 14.9 i/c/siphash.blocks
 639  639  9.7 runtime.futex     
 431  431  6.6 runtime.memmove   
 373  373  5.7 runtime.epollwait 
 155  221  2.4 runtime.runqgrab  
 112 1756  1.7 runtime.findrunnable
 100  100  1.5 runtime/internal/atomic.Cas
  89   89  1.4 runtime/internal/atomic.Xchg
  83   83  1.3 runtime.usleep    
host f8e02f9facaa304dce98c8d876270a10: 6540 samples
 top flat  pct symbol            
1593 1716 24.4 syscall.Syscall   
 895  895 13.7 i/c/siphash.blocks
 598  598  9.1 runtime.futex     
 399  399  6.1 runtime.memmove   
 385  385  5.9 runtime.epollwait 
 130  130  2.0 runtime/internal/atomic.Cas
 128  233  2.0 runtime.runqgrab  
 104 1763  1.6 runtime.findrunnable
 102  102  1.6 runtime.usleep    
 101  101  1.5 runtime/internal/atomic.Xchg

host 91b42bdeee8bc69fe40c33dff7c146ac
"runtime.findrunnable" -- in 1756 samples of 6563 (26.8%)
1 callers:
  in  flat symbol
1756 1846.0 runtime.schedule
20 callees:
 out  flat symbol
  41 98.0 runtime.unlock
   5 53.0 runtime/internal/atomic.Load
  45 51.0 runtime.runqempty
  12 12.0 runtime/internal/atomic.Store64
   8 91.0 runtime.casgstatus
  15 49.0 runtime/internal/atomic.Xadd
 364 365.0 runtime.stopm
 443 443.0 runtime.netpoll
 108 172.0 runtime.lock
 295 295.0 runtime.injectglist
 246 246.0 runtime.runqsteal
   3  3.0 runtime.releasep
  30 30.0 runtime.pidleput
   8 16.0 runtime.pidleget
   4  4.0 runtime.netpollinited
   3 12.0 runtime.runqget
   9  9.0 runtime.globrunqget
   3 22.0 runtime/internal/atomic.Xchg64
   1  7.0 runtime.fastrand1
   1  1.0 runtime.nanotime
host f8e02f9facaa304dce98c8d876270a10
1 callers:
  in  flat symbol
1763 1853.0 runtime.schedule
21 callees:
 out  flat symbol
 268 268.0 runtime.runqsteal
  24 24.0 runtime.pidleput
 477 477.0 runtime.netpoll
 109 167.0 runtime.lock
   4 12.0 runtime.acquirep
   6 58.0 runtime/internal/atomic.Load
   7  7.0 runtime/internal/atomic.Store64
 298 298.0 runtime.injectglist
  49 54.0 runtime.runqempty
  33 71.0 runtime.unlock
  11 117.0 runtime.casgstatus
 327 328.0 runtime.stopm
   5 12.0 runtime.pidleget
  10 10.0 runtime.globrunqget
   5  9.0 runtime.runqget
   7  7.0 runtime.netpollinited
  12 40.0 runtime/internal/atomic.Xadd
   1  7.0 runtime.fastrand1
   4 24.0 runtime/internal/atomic.Xchg64
   1  1.0 runtime.releasep
   1  1.0 runtime.nanotime

Interestingly, the single-head CPU consumption is at 560% of 1200%, and the dual-head CPU consumption is at 470% and 468% of 1200%, respectively.

A couple of notable details:

  • Performance is substantially worse in the single-host case (65% of the dual-host case), despite the fact that it is only half-loaded and backed by the same set of storage nodes running an identical front-end workload. I suppose some of this could be chalked up to head-of-line blocking, but I suspect there's more going on. In principle I'd expect very little difference between the two configurations, since none of these requests need to synchronize.
  • Proportionally more time (35% vs 27%) is spent in runtime.findrunnable in the single-node case. I'd expect that system to have on average 2x the number of goroutines, but I didn't think more goroutines would cause the proportional amount of time in the scheduler to increase. (I had presumed that more goroutines meant less work-stealing and polling, which would mean proportionally less time doing expensive stuff like syscalls and atomics.)

Let me know if there are other details I can provide.


@bradfitz commented Dec 7, 2016

@philhofer, any chance you could try Go 1.8beta1? Even if a bug were found in Go 1.7, that branch is closed for all but security issues at this point.

Go 1.8 should be a drop-in replacement for 1.7. The SSA back end for ARM will probably help your little devices a fair bit.

@bradfitz added this to the Go1.9 milestone Dec 7, 2016

@bradfitz commented Dec 7, 2016

(Tagging this Go 1.9, unless you can reproduce on 1.8 and @aclements thinks it's easily fixable enough for 1.8)

@davecheney commented Dec 7, 2016

@philhofer (author) commented Dec 7, 2016

@bradfitz Yes, we're busy trying to get 1.8beta1 on some hardware to benchmark it. We're very excited about the arm performance improvements. (However, these profiles are on an Intel Xeon host, which I presume will perform similarly between 1.7 and 1.8, unless there have been substantial changes made to the scheduler that I missed?)

@davecheney Yes; I'll try to post a slightly-redacted one.

@philhofer (author) commented Dec 8, 2016

Update: most of the scheduler activity is caused by blocking network reads.

The call chain goes across two call stacks, which makes it a little tough to track down through stack traces alone, but here it is:

  • net.(*netFD).Read
  • runtime.park_m

The raw call counts suggest that roughly 90% of the runtime.schedule calls are a consequence of this particular chain of events.

@davecheney I haven't extracted our profile format into the pprof format yet, but I hope that answers the same question you were hoping the svg web would answer.

@bradfitz commented Dec 8, 2016

Yes, we're busy trying to get 1.8beta1 on some hardware to benchmark it. We're very excited about the arm performance improvements. (However, these profiles are on an Intel Xeon host, which I presume will perform similarly between 1.7 and 1.8, unless there have been substantial changes made to the scheduler that I missed?)

Oh, sorry, missed that. In any case, please test 1.8 wherever possible in the next two weeks. It's getting increasingly hard to make changes to 1.8. The next two weeks are the sweet spot for bug reports. Thanks!

@philhofer (author) commented Dec 12, 2016

We just finished our first set of runs on 1.8, and things look pretty much identical on our x86 machines.

host 4edd58c28c7b9b548cc360334bae7af7: 6619 samples
 top flat  pct symbol            
1766 1872 26.7 syscall.Syscall   
 993  993 15.0 i/c/siphash.blocks
 720  720 10.9 runtime.futex     
 461  461  7.0 runtime.epollwait 
 443  443  6.7 runtime.memmove   
 173 1759  2.6 runtime.findrunnable
 107  107  1.6 runtime.casgstatus
  88  136  1.3 runtime.lock      
  86  136  1.3 runtime.runqgrab  
  64   64  1.0 runtime.usleep    
host f40105cffd2f1ec62e180b34677fc560: 6665 samples
 top flat  pct symbol            
1704 1789 25.6 syscall.Syscall   
 976  976 14.6 i/c/siphash.blocks
 666  666 10.0 runtime.futex     
 469  469  7.0 runtime.epollwait 
 408  408  6.1 runtime.memmove   
 168 1768  2.5 runtime.findrunnable
  99   99  1.5 runtime.casgstatus
  95  145  1.4 runtime.lock      
  89   89  1.3 runtime.usleep    
  86  157  1.3 runtime.runqgrab  

@bradfitz commented Jun 29, 2017

@aclements, what's the status here?

@bradfitz commented Jul 6, 2017

Ping @aclements

@philhofer (author) commented Jul 6, 2017

I have a little more information, in case you're interested.

Fundamentally, the issue here is that io.ReadFull(socket, buf) where len(buf) is, say, 64kB (or really any number that is a large-ish multiple of your 1500-byte MTU), causes the scheduler to wake up that goroutine len(buf)/1500 times, since that's the number of times that data becomes available through epoll. So, if you have 20 goroutines doing this with 64kB buffers, then you'll eat more than 850 scheduler wakeups before all those buffers have been filled.

Now, in a sane world we could wire up io.ReadFull on a socket such that it called setsockopt(SO_RCVLOWAT) so that the caller didn't receive a notification until there was plenty of data to read, but, frustratingly, SO_RCVLOWAT doesn't work with poll or select.

So, part of this is Linux's fault, and part of it is caused by the scheduler being generally slow. (Consider: in that profile, we spend nearly twice as much time in the scheduler as we do checksumming every single byte received over the network.)
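
The wakeup arithmetic above can be sketched without a network at all: wrap a reader that returns at most one MTU per Read call (roughly what epoll readiness delivers) and count how many reads io.ReadFull needs to fill a large buffer. This is an illustrative sketch; chunkReader and fillCount are invented names, not code from this issue.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// chunkReader simulates a socket that returns at most mtu bytes per Read,
// the way readiness notifications deliver one MTU-sized frame at a time.
type chunkReader struct {
	src   io.Reader
	mtu   int
	reads int // number of Read calls, a stand-in for scheduler wakeups
}

func (c *chunkReader) Read(p []byte) (int, error) {
	c.reads++
	if len(p) > c.mtu {
		p = p[:c.mtu]
	}
	return c.src.Read(p)
}

// fillCount reports how many Read calls io.ReadFull needs to fill a
// bufSize-byte buffer when data arrives in mtu-sized chunks.
func fillCount(bufSize, mtu int) int {
	cr := &chunkReader{src: bytes.NewReader(make([]byte, bufSize)), mtu: mtu}
	buf := make([]byte, bufSize)
	if _, err := io.ReadFull(cr, buf); err != nil {
		panic(err)
	}
	return cr.reads
}

func main() {
	// 64 KiB buffer over a 1500-byte MTU: 44 reads, i.e. ~44 wakeups.
	fmt.Println(fillCount(64*1024, 1500))
}
```

Each Read here stands in for one goroutine wakeup, so 20 goroutines each filling a 64 KiB buffer would cost 20 × 44 = 880 wakeups, consistent with the "more than 850" figure above.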

@aclements commented Jul 6, 2017

Thanks for the extra information, @philhofer. That's very useful in understanding what's going on here.

Given how much time you're spending in findrunnable, it sounds like you're constantly switching between having something to do and being idle. Presumably the 1500 byte frames are coming in just a little slower than you can process them, so the runtime is constantly looking for work to do, going to sleep, and then immediately being woken up for the next frame. This is the most expensive path in the scheduler (we optimize for the case where there's another goroutine ready to run, which is extremely fast) and there's an implicit assumption here that the cost of going to sleep doesn't really matter if there's nothing to do. But that's violated if new work is coming in at just the wrong rate.

I'm not really sure what to do about this. It would at least help confirm this if you could post an execution trace (in this case, a sufficiently zoomed-in screen shot is probably fine, since there's no easy way to redact the stacks in an execution trace).

@aclements modified the milestones: Go1.10, Go1.9 Jul 6, 2017
@rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017

@mspielberg commented Jan 2, 2018

I have a similar issue, with no network involved.

My project performs application protocol analysis against a libpcap capture. Different pools of goroutines perform reading the raw trace from disk, packet parsing, and TCP/IP flow reassembly. CPU profiling indicates > 25% of total time spent in findrunnable. I'm running on 64-bit OSX, so most of that time is in kevent.

@aclements' description does not appear to fit my situation, since more data is always available throughout a run. Whenever individual goroutines block, it's because they have just dispatched work to one or more goroutines further down the pipeline.

I'm running go version go1.9.1 darwin/amd64.

The project is open-source, so I can point you to the source and the SVG profiles generated from my perf tests. Would that be helpful, and would it be better to keep in this issue or file a new one?

@davecheney commented Jan 2, 2018

@jeffdh commented Jan 3, 2018

I was able to capture a trace of the original issue @philhofer described and wanted to add the requested screenshots to verify that this is the scheduler worst case scenario described by @aclements.

Though the profiling samples nicely show the time being spent in runtime.findrunnable, the trace viewer doesn't make it quite as clear since the scheduling behavior has to be inferred from the white space. Here's a couple of screenshots that roughly show the behavior of the socket getting serviced constantly, but no meaningful progress from the program's perspective.

From a macro view, here's about 40ms total:
screen shot 2017-08-11 at 2 37 33 pm

Most of the tiny slivers are network wake-ups to read an MTU off a particular socket, but not enough data to fill the desired buffer (think 1500 MTU but 64k desired buffers). The burst of longer operations on the right is processing that happened when enough data has been received to do higher level work with the data (Reed-Solomon computation in this case).

Next screenshot is a zoom in to the small goroutine section (~2ms total):
screen shot 2017-08-11 at 2 42 10 pm

I've selected a tiny slice and that's the identical stack across all the very small routines.

I think this tells the story of the scheduler constantly going idle, then being woken up by the network. Also willing to post some screenshots like this, if there's more specific questions.

@ianlancetaylor modified the milestones: Go1.11, Go1.12 Jul 10, 2018
@ianlancetaylor changed the title runtime: runtime.findrunnable chewing cycles → runtime: scheduler is slow when goroutines are frequently woken Jul 10, 2018
@aclements modified the milestones: Go1.12, Go1.13 Jan 8, 2019

@gopherbot commented Apr 16, 2020

Change mentions this issue: runtime: expand BenchmarkWakeupParallel tests

@gopherbot commented Oct 5, 2020

Change mentions this issue: runtime: don't attempt to steal from idle Ps

@gopherbot commented Oct 22, 2020

Change mentions this issue: runtime: try to elide timer stealing if P has no timers

gopherbot pushed a commit that referenced this issue Oct 23, 2020
Work stealing is a scalability bottleneck in the scheduler. Since each P
has a work queue, work stealing must look at every P to determine if
there is any work. The number of Ps scales linearly with GOMAXPROCS
(i.e., the number of Ps _is_ GOMAXPROCS), thus this work scales linearly
with GOMAXPROCS.

Work stealing is a later attempt by a P to find work before it goes
idle. Since the P has no work of its own, extra costs here tend not to
directly affect application-level benchmarks. Where they show up is
extra CPU usage by the process as a whole. These costs get particularly
expensive for applications that transition between blocked and running
frequently.

Long term, we need a more scalable approach in general, but for now we
can make a simple observation: idle Ps ([1]) cannot possibly have
anything in their runq, so we need not bother checking at all.

We track idle Ps via a new global bitmap, updated in pidleput/pidleget.
This is already a slow path (requires sched.lock), so we don't expect
high contention there.

Using a single bitmap avoids the need to touch every P to read p.status.
Currently, the bitmap approach is not significantly better than reading
p.status. However, in a future CL I'd like to apply a similiar
optimization to timers. Once done, findrunnable would not touch most Ps
at all (in mostly idle programs), which will avoid memory latency to
pull those Ps into cache.

When reading this bitmap, we are racing with Ps going in and out of
idle, so there are a few cases to consider:

1. _Prunning -> _Pidle: Running P goes idle after we check the bitmap.
In this case, we will try to steal (and find nothing), so there is no
harm done.

2. _Pidle -> _Prunning while spinning: A P that starts running may queue
new work that we miss. This is OK: (a) that P cannot go back to sleep
without completing its work, and (b) more fundamentally, we will recheck
after we drop our P.

3. _Pidle -> _Prunning after spinning: After spinning, we really can
miss work from a newly woken P. (a) above still applies here as well,
but this is also the same delicate dance case described in findrunnable:
if nothing is spinning anymore, the other P will unpark a thread to run
the work it submits.

Benchmark results from WakeupParallel/syscall/pair/race/1ms:

name                            old msec          new msec   delta
Perf-task-clock-8               250 ± 1%          247 ± 4%     ~     (p=0.690 n=5+5)
Perf-task-clock-16              258 ± 2%          259 ± 2%     ~     (p=0.841 n=5+5)
Perf-task-clock-32              284 ± 2%          270 ± 4%   -4.94%  (p=0.032 n=5+5)
Perf-task-clock-64              326 ± 3%          303 ± 2%   -6.92%  (p=0.008 n=5+5)
Perf-task-clock-128             407 ± 2%          363 ± 5%  -10.69%  (p=0.008 n=5+5)
Perf-task-clock-256             561 ± 1%          481 ± 1%  -14.20%  (p=0.016 n=4+5)
Perf-task-clock-512             840 ± 5%          683 ± 2%  -18.70%  (p=0.008 n=5+5)
Perf-task-clock-1024          1.38k ±14%        1.07k ± 2%  -21.85%  (p=0.008 n=5+5)

[1] "Idle Ps" here refers to _Pidle Ps in the sched.pidle list. In other
contexts, Ps may temporarily transition through _Pidle (e.g., in
handoffp); those Ps may have work.

Updates #28808
Updates #18237

Change-Id: Ieeb958bd72e7d8fb375b0b1f414e8d7378b14e29
Run-TryBot: Michael Pratt <>
TryBot-Result: Go Bot <>
Reviewed-by: Michael Knyszek <>
Reviewed-by: Austin Clements <>
Trust: Michael Pratt <>
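
The idle-P bitmap described in the commit message above can be illustrated with a toy version in ordinary Go. pMask and its methods are invented for this sketch; the runtime's real mask uses runtime-internal atomics and is updated in pidleput/pidleget:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// pMask is a toy per-P bitmap: one bit per P, set when the P goes on the
// idle list and cleared when it comes off, so a stealing P can skip idle
// Ps without touching their run queues at all.
type pMask []uint32

func newPMask(nprocs int) pMask { return make(pMask, (nprocs+31)/32) }

// set marks P id as idle (CAS loop, since stealers read concurrently).
func (m pMask) set(id int) {
	word, bit := id/32, uint32(1)<<(id%32)
	for {
		old := atomic.LoadUint32(&m[word])
		if atomic.CompareAndSwapUint32(&m[word], old, old|bit) {
			return
		}
	}
}

// clear marks P id as running again.
func (m pMask) clear(id int) {
	word, bit := id/32, uint32(1)<<(id%32)
	for {
		old := atomic.LoadUint32(&m[word])
		if atomic.CompareAndSwapUint32(&m[word], old, old&^bit) {
			return
		}
	}
}

// isSet reports whether P id is idle; a stealer skips it when true.
func (m pMask) isSet(id int) bool {
	return atomic.LoadUint32(&m[id/32])&(uint32(1)<<(id%32)) != 0
}

func main() {
	m := newPMask(8)
	m.set(5) // P 5 goes idle: work stealing can now skip it
	fmt.Println(m.isSet(5)) // prints true
}
```

A single shared bitmap keeps the stealer's reads to a handful of cache lines instead of touching every P's status word, which is the point of the optimization described above.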

@gopherbot commented Oct 23, 2020

Change mentions this issue: runtime: try to elide timer stealing if P has no timers

gopherbot pushed a commit that referenced this issue Oct 28, 2020
Following the previous change, findrunnable still must touch every
other P in checkTimers in order to look for timers to steal. This scales
poorly with GOMAXPROCS and potentially performs poorly by pulling remote
Ps into cache.

Add timerpMask, a bitmask that tracks whether each P may have any timers
on its timer heap.

Ideally we would update this field on any timer add / remove to always
keep it up to date. Unfortunately, updating a shared global structure is
antithetical to sharding timers by P, and doing so approximately doubles
the cost of addtimer / deltimer in microbenchmarks.

Instead we only (potentially) clear the mask when the P goes idle. This
covers the best case of avoiding looking at a P _at all_ when it is idle
and has no timers. See the comment on updateTimerPMask for more details
on the trade-off. Future CLs may be able to expand cases we can avoid
looking at the timers.

Note that the addition of idlepMask to p.init is a no-op. The zero value
of the mask is the correct init value so it is not necessary, but it is
included for clarity.

Benchmark results from WakeupParallel/syscall/pair/race/1ms. Note that these are on top of the previous change:

name                        old msec           new msec   delta
Perf-task-clock-8           244 ± 4%           246 ± 4%     ~     (p=0.841 n=5+5)
Perf-task-clock-16          247 ±11%           252 ± 4%     ~     (p=1.000 n=5+5)
Perf-task-clock-32          270 ± 1%           268 ± 2%     ~     (p=0.548 n=5+5)
Perf-task-clock-64          302 ± 3%           296 ± 1%     ~     (p=0.222 n=5+5)
Perf-task-clock-128         358 ± 3%           352 ± 2%     ~     (p=0.310 n=5+5)
Perf-task-clock-256         483 ± 3%           458 ± 1%   -5.16%  (p=0.008 n=5+5)
Perf-task-clock-512         663 ± 1%           612 ± 4%   -7.61%  (p=0.008 n=5+5)
Perf-task-clock-1024      1.06k ± 1%         0.95k ± 2%  -10.24%  (p=0.008 n=5+5)

Updates #28808
Updates #18237

Change-Id: I4239cd89f21ad16dfbbef58d81981da48acd0605
Run-TryBot: Michael Pratt <>
TryBot-Result: Go Bot <>
Reviewed-by: Michael Knyszek <>
Trust: Michael Pratt <>

@gopherbot commented Oct 29, 2020

Change mentions this issue: runtime: elide timer re-check if P has no timers

gopherbot pushed a commit that referenced this issue Oct 30, 2020
I missed this new block after rebasing past an earlier change. These fields must be zero if there are no timers.

Updates #28808
Updates #18237

Change-Id: I2d9e1cbf326497c833daa26b11aed9a1e12c2270
Run-TryBot: Michael Pratt <>
Reviewed-by: Austin Clements <>
Reviewed-by: Ian Lance Taylor <>
Reviewed-by: Michael Knyszek <>
TryBot-Result: Go Bot <>
Trust: Michael Pratt <>

@aka-rider commented Dec 3, 2020

Hello gophers.

We have a similar issue — mcall, findrunnable, schedule excessively consume CPU time (mcall is up to 75% on stack). The application itself consumes 30-50% CPU while it mainly should be in I/O wait. The application is basically a proxy with tiny additional logic (receive 20 bytes packets over TCP/IP, accumulate up to 500 bytes and send it again over TCP/IP).
We see that the CPU workload comes specifically from the network I/O (and a bit from timers); without connections the application is almost idle.

What makes it worse in our case is that mcall/findrunnable are very aggressive, to the point where the application is mainly running runtime functions, and our own code gets delayed so much that we see a very visible lag of up to seconds.
I had to rewrite parts of our application from waiting on a channel / mutex to an infinite loop with Sleep; otherwise our code didn't get scheduled in time.

The issue described by @philhofer is not applicable in our case, we read and write relatively small (should be below 1500) packets, and increasing packet size (batch write / batch read) seems to decrease the CPU consumption.

We use a simple protocol on top of TCP/IP: [varint size][payload], and for reading we use bufio.Reader on top of net.TCPConn with the default size of 4096 (but the size doesn't change a thing).

Please let me know if you need full profile or source code.

(pprof) top20 -cum
Showing nodes accounting for 83.07s, 69.89% of 118.85s total
Dropped 529 nodes (cum <= 0.59s)
Showing top 20 nodes out of 86
      flat  flat%   sum%        cum   cum%
     1.23s  1.03%  1.03%     89.74s 75.51%  runtime.mcall
     1.02s  0.86%  1.89%     78.96s 66.44%  runtime.gosched_m
     1.20s  1.01%  2.90%     77.97s 65.60%  runtime.goschedImpl
     2.48s  2.09%  4.99%     73.73s 62.04%  runtime.schedule
     4.63s  3.90%  8.89%     43.51s 36.61%  runtime.findrunnable
     5.12s  4.31% 13.19%     36.30s 30.54%  runtime.checkTimers
    28.59s 24.06% 37.25%     30.33s 25.52%  runtime.nanotime (inline)
     0.04s 0.034% 37.28%     13.85s 11.65%  main.main.func2
     1.62s  1.36% 38.65%     11.61s  9.77%*sub).Receive
     5.11s  4.30% 42.94%      9.86s  8.30%*sub).dequeue
         0     0% 42.94%      9.48s  7.98%  runtime.park_m
     9.39s  7.90% 50.85%      9.39s  7.90%  runtime.futex
     0.02s 0.017% 50.86%      8.79s  7.40%  runtime.lock (partial-inline)
     0.15s  0.13% 50.99%      8.77s  7.38%  runtime.lockWithRank
     7.86s  6.61% 57.60%      8.62s  7.25%  runtime.lock2
         0     0% 57.60%      7.71s  6.49%  runtime.unlock (partial-inline)
     0.02s 0.017% 57.62%      7.71s  6.49%  runtime.unlockWithRank (inline)
     7.61s  6.40% 64.02%      7.69s  6.47%  runtime.unlock2
     6.98s  5.87% 69.89%         7s  5.89%  runtime.casgstatus
         0     0% 69.89%      6.23s  5.24%  runtime.mstart

@ChrisHines commented Dec 3, 2020

@aka-rider What CPU, and OS, and version of Go did this profile come from? The amount of time spent in nanotime jumps out to me as surprisingly high compared to what I've seen in other profiles.

@prattmic commented Dec 3, 2020

@aka-rider In addition to Chris' questions, could you describe the application a bit more? In particular:

  • Approximately how many goroutines do you have? What is GOMAXPROCS?
  • What kind of cross-goroutine synchronization is there? i.e., is the proxy receiving data, accumulating, and then sending all from the same goroutine, or is it sending data around?
  • What kind of load do you expect this application to be under? What is the overall system load?

I'm particularly curious about the last question, as you mention "lag up to seconds", yet your profile only shows 118s of CPU on 300s wall-time. i.e., less than 1 CPU of work on average. Unless the system is under heavy load, I wouldn't expect to see such extreme latency, which may be indicative of a scheduling bug rather than overall performance (what this issue is about).

@prattmic commented Dec 3, 2020

I forgot to mention that Go 1.16 will include several timer overhead improvements, so if you can test on tip (if this isn't already on tip), that would be helpful.

@aka-rider commented Dec 3, 2020

Hi @ChrisHines
Yeah, I forgot to mention that.

We are running virtual machines in KVM: model name: AMD EPYC Processor (with IBPB), on Ubuntu 16.04 (both host and guest).
The Go version is 1.15, but we have had this issue since 1.14 (and most probably since 1.13).

I agree, nanotime also bothers me.

I came across this ticket #27707, and the following snippet (the import and loop body were truncated above; reconstructed here as the millisecond sleep loop discussed in that issue):

package main

import "time"

func main() {
	for {
		time.Sleep(time.Millisecond)
	}
}

consumes ~10% CPU on our cluster (it's < 2% CPU on my Mac and my colleague's Ubuntu laptop)

@aka-rider commented Dec 3, 2020

Hi @prattmic

We have 1 goroutine per incoming connection (receive), 1 goroutine per outgoing connection (async send)
All messages are processed by a single goroutine.

The number of connections is 10-200. Interestingly, the CPU consumption doesn't depend much on the number of connections: it's ~0% with 0 connections, 30% with one connection, and 30-50% with 200 connections.

This particular profile comes from our development environment; the number of connections was ~6-8, and the machine was not busy. We used to see significant delays with an increased number of connections.
Our servers are not at capacity; overall CPU/memory/bandwidth usage is below 50%, even with 200 connections.

I tried GOMAXPROCS 1, 2, 5, 20, 50 — and it doesn't seem to change anything.

The initial design used channels. Every read goroutine would write received data to a single buffered channel (chan []byte); the processing goroutine would receive, process, and write the data to the channel of the appropriate send goroutine (also a buffered chan []byte).
We experienced significant delays within the application. Part of the problem came from GC pressure (each receiving goroutine would allocate memory for each incoming message), and part of it was in the channel communication: from my understanding, receiving goroutines weren't scheduled for a significant amount of time, which would create a vicious circle: more data comes into the channel, nobody reads it, and more CPU is spent in the runtime.

Right now, we use a circular buffer with a mutex: all receiving goroutines write into it and one goroutine reads from it. Communication with the send goroutines is the same, except there is a buffer+mutex per sending goroutine.
In that way we no longer see delays, but the CPU consumption is high even with a single incoming and single outgoing connection.

I also tried to get rid of asynchronous send goroutines, and write to TCP directly from the processing goroutine, it increased the CPU consumption from 30% to 50% for a single connection.

It makes me think that it is specifically TCPConn.Write (and maybe Read) that causes this behavior.
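
The mutex-guarded circular buffer described above can be sketched as follows. ring and its methods are invented names for the stated design (many receive goroutines push, one processing goroutine pops), not the poster's code:

```go
package main

import (
	"fmt"
	"sync"
)

// ring is a fixed-capacity FIFO of messages guarded by a single mutex.
type ring struct {
	mu            sync.Mutex
	buf           [][]byte
	head, tail, n int
}

func newRing(capacity int) *ring { return &ring{buf: make([][]byte, capacity)} }

// push appends a message; it reports false when the ring is full, so the
// caller can decide whether to drop, retry, or apply backpressure.
func (r *ring) push(m []byte) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.n == len(r.buf) {
		return false
	}
	r.buf[r.tail] = m
	r.tail = (r.tail + 1) % len(r.buf)
	r.n++
	return true
}

// pop removes the oldest message; ok is false when the ring is empty.
func (r *ring) pop() ([]byte, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.n == 0 {
		return nil, false
	}
	m := r.buf[r.head]
	r.head = (r.head + 1) % len(r.buf)
	r.n--
	return m, true
}

func main() {
	r := newRing(4)
	r.push([]byte("a"))
	r.push([]byte("b"))
	m, _ := r.pop()
	fmt.Printf("%s\n", m) // prints a
}
```

Compared with a buffered channel, the consumer here polls (or sleeps) rather than parking in the scheduler, which trades some latency and CPU for avoiding the park/unpark path discussed in this issue.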

@prattmic commented Dec 3, 2020

@aka-rider lots of time in nanotime could be explained by an inefficient VDSO implementation. Inside of a VM, the VDSO may not be using the fastest time mechanism (via the TSC), which can slow down clock reads significantly.

I don't know of a handy way to see which clock is in use besides reading it directly from a program. Dumping the VDSO parameter page shows it: on my system I can see the clock mode at offset 0x80 is 0x1 (VDSO_CLOCKMODE_TSC), which is what I want. (Offset and location vary depending on kernel version...)

All of this is quite a bit of a tangent from the issue at hand, though, as even a slow VDSO shouldn't be causing this much trouble.

I'll take a look at the rest of your summary tomorrow.
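
As a rough way to observe slow clock reads from inside a VM without dumping the VDSO, one can time a burst of time.Now calls from Go. This is a hypothetical probe (clockReadCost is not an existing tool); with a TSC-backed VDSO a read costs a few tens of nanoseconds, while a slow fallback path shows a much larger per-read cost:

```go
package main

import (
	"fmt"
	"time"
)

// clockReadCost estimates the average cost of one wall-clock read by
// timing iters back-to-back calls to time.Now. It is only a rough probe:
// loop overhead is included, and results vary with the clock source.
func clockReadCost(iters int) time.Duration {
	start := time.Now()
	for i := 0; i < iters; i++ {
		_ = time.Now()
	}
	return time.Since(start) / time.Duration(iters)
}

func main() {
	fmt.Println(clockReadCost(1_000_000))
}
```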

@aka-rider commented Dec 4, 2020

Hi @prattmic

VDSO base: 0x7ffda7efc000

0x7ffda7efa000: 0x00000000 0x00000000 0x00000000 0x00000000
0x7ffda7efa010: 0x00000000 0x00000000 0x00000000 0x00000000
0x7ffda7efa020: 0x00000000 0x00000000 0x00000000 0x00000000
0x7ffda7efa030: 0x00000000 0x00000000 0x00000000 0x00000000
0x7ffda7efa040: 0x00000000 0x00000000 0x00000000 0x00000000
0x7ffda7efa050: 0x00000000 0x00000000 0x00000000 0x00000000
0x7ffda7efa060: 0x00000000 0x00000000 0x00000000 0x00000000
0x7ffda7efa070: 0x00000000 0x00000000 0x00000000 0x00000000
0x7ffda7efa080: 0x63998e0e 0x00000002 0x9086e69f 0x0028c85b
0x7ffda7efa090: 0xffffffff 0xffffffff 0x007fff2f 0x00000017
0x7ffda7efa0a0: 0xe7b5cf71 0x001b3ec7 0x5fca3a56 0x00000000
0x7ffda7efa0b0: 0x00338eb2 0x00000000 0xca35cf71 0x0011930a
0x7ffda7efa0c0: 0x5fca3a56 0x00000000 0x367d8fcf 0x00000000
0x7ffda7efa0d0: 0x00338eb2 0x00000000 0x23261594 0x00000000
0x7ffda7efa0e0: 0x00000000 0x00000000 0x00000000 0x00000000
0x7ffda7efa0f0: 0x00000000 0x00000000 0x00000000 0x00000000
0x7ffda7efa100: 0x00000000 0x00000000 0x00000000 0x00000000
0x7ffda7efa110: 0x00000000 0x00000000 0x00000000 0x00000000
0x7ffda7efa120: 0x00000000 0x00000000 0x00000000 0x00000000
0x7ffda7efa130: 0x00000000 0x00000000 0x00000000 0x00000000
0x7ffda7efa140: 0x00000000 0x00000000 0x00000000 0x00000000
0x7ffda7efa150: 0x00000000 0x00000000 0x00000000 0x00000000
0x7ffda7efa160: 0x00000000 0x00000000 0x00000000 0x00000000
0x7ffda7efa170: 0x00000000 0x00000000 0x00000000 0x00000000
0x7ffda7efa180: 0x00000000 0x00000000 0x00000000 0x00000000
0x7ffda7efa190: 0x00000000 0x00000000 0x00000000 0x00000000
0x7ffda7efa1a0: 0x00000000 0x00000000 0x00000000 0x00000000
0x7ffda7efa1b0: 0x00000000 0x00000000 0x00000000 0x00000000
0x7ffda7efa1c0: 0x00000000 0x00000000 0x00000000 0x00000000
0x7ffda7efa1d0: 0x00000000 0x00000000 0x00000000 0x00000000
0x7ffda7efa1e0: 0x00000000 0x00000000 0x00000000 0x00000000
0x7ffda7efa1f0: 0x00000000 0x00000000 0x00000000 0x00000000
uname -a
Linux 4.4.0-193-generic #224-Ubuntu SMP Tue Oct 6 17:15:28 UTC 2020 x86_64 GNU/Linux

@aka-rider commented Dec 4, 2020

With gotip, the CPU profile and consumption are very similar to what they were before:

(gdb) p 'runtime.buildVersion'
$1 = 0xa68d6c "devel +b67b7dd Fri Dec 4 08:49:16 2020 +0000"
(pprof) top10
Showing nodes accounting for 8960ms, 70.05% of 12790ms total
Dropped 151 nodes (cum <= 63.95ms)
Showing top 10 nodes out of 83
      flat  flat%   sum%        cum   cum%
    2820ms 22.05% 22.05%     2960ms 23.14%  runtime.nanotime (inline)
    1170ms  9.15% 31.20%     1170ms  9.15%  runtime.futex
     860ms  6.72% 37.92%      870ms  6.80%  runtime.unlock2
     850ms  6.65% 44.57%      850ms  6.65%  runtime.casgstatus
     710ms  5.55% 50.12%      790ms  6.18%  runtime.lock2
     650ms  5.08% 55.20%     5180ms 40.50%  runtime.findrunnable
     590ms  4.61% 59.81%     3810ms 29.79%  runtime.checkTimers
     570ms  4.46% 64.27%      980ms  7.66%  (*sub).dequeue
     460ms  3.60% 67.87%      460ms  3.60%  runtime.epollwait
     280ms  2.19% 70.05%      310ms  2.42%  sync.(*Mutex).Unlock
(pprof) top10 -cum
Showing nodes accounting for 4.89s, 38.23% of 12.79s total
Dropped 151 nodes (cum <= 0.06s)
Showing top 10 nodes out of 83
      flat  flat%   sum%        cum   cum%
     0.15s  1.17%  1.17%     10.27s 80.30%  runtime.mcall
     0.22s  1.72%  2.89%      8.57s 67.01%  runtime.schedule
     0.12s  0.94%  3.83%      8.44s 65.99%  runtime.gosched_m
     0.13s  1.02%  4.85%      8.32s 65.05%  runtime.goschedImpl
     0.65s  5.08%  9.93%      5.18s 40.50%  runtime.findrunnable
     0.59s  4.61% 14.54%      3.81s 29.79%  runtime.checkTimers
     2.82s 22.05% 36.59%      2.96s 23.14%  runtime.nanotime (inline)
         0     0% 36.59%      1.71s 13.37%  main.main.func2
         0     0% 36.59%      1.68s 13.14%  runtime.park_m
     0.21s  1.64% 38.23%      1.20s  9.38%  (*sub).Receive

@prattmic prattmic commented Dec 4, 2020

It does seem that your system is using VCLOCK_PVCLOCK instead of TSC, though looking at the implementation it seems pretty efficient except for the VCLOCK_NONE case, which the comment says will only affect Xen, not KVM. I still don't think the clock source is really the underlying problem here, but it may be interesting if you can run this workload in a non-virtualized environment to see if it behaves better.

More importantly, I now notice that the entry point to ~90% of your time in the scheduler is due to calls to runtime.Gosched. The standard library and runtime rarely call Gosched, so I'm wondering if some part of your application is calling it (and perhaps way too much)? (Note that Gosched calls mcall, which unfortunately breaks the stack trace in the profile, so we can't directly see the caller).


@aka-rider aka-rider commented Dec 4, 2020

Thank for your suggestion @prattmic.

if you can run this workload in a non-virtualized environment to see if it behaves better.

It seems to behave better in a non-virtualized environment, although it's hard to reproduce the exact workload. Lab benchmarks on our laptops don't show the same CPU consumption.

I had the following in my code (on the next lines there's a call to Cond.Wait()):

for cycles := 0; cycles < 10000; cycles++ {
	if payload, stream, err = s.dequeue(p); err != nil || payload != nil {
		break
	}
	runtime.Gosched() // yields on every empty poll; this is the Gosched call discussed above
}

The following is the profile with the code above commented out:

(pprof) top10
Showing nodes accounting for 51460ms, 69.38% of 74170ms total
Dropped 442 nodes (cum <= 370.85ms)
Showing top 10 nodes out of 92
      flat  flat%   sum%        cum   cum%
   30820ms 41.55% 41.55%    30820ms 41.55%  runtime.futex
    4150ms  5.60% 47.15%     4180ms  5.64%  runtime.runqgrab
    2770ms  3.73% 50.88%     2770ms  3.73%  runtime.usleep
    2750ms  3.71% 54.59%    39370ms 53.08%  runtime.findrunnable
    2690ms  3.63% 58.22%     2920ms  3.94%  syscall.Syscall
    2570ms  3.47% 61.68%     2570ms  3.47%  runtime.epollwait
    2030ms  2.74% 64.42%     9530ms 12.85%  runtime.checkTimers
    1490ms  2.01% 66.43%     1490ms  2.01%  runtime.write1
    1200ms  1.62% 68.05%     1450ms  1.95%  runtime.nanotime (inline)
     990ms  1.33% 69.38%     6440ms  8.68%  runtime.runtimer
(pprof) top10 -cum
Showing nodes accounting for 34.50s, 46.51% of 74.17s total
Dropped 442 nodes (cum <= 0.37s)
Showing top 10 nodes out of 92
      flat  flat%   sum%        cum   cum%
     0.05s 0.067% 0.067%     46.58s 62.80%  runtime.mcall
     0.04s 0.054%  0.12%     46.50s 62.69%  runtime.park_m
     0.19s  0.26%  0.38%     46.38s 62.53%  runtime.schedule
     2.75s  3.71%  4.09%     39.37s 53.08%  runtime.findrunnable
    30.82s 41.55% 45.64%     30.82s 41.55%  runtime.futex
     0.23s  0.31% 45.95%     16.10s 21.71%  runtime.futexsleep
         0     0% 45.95%     15.99s 21.56%  runtime.mstart
         0     0% 45.95%     15.99s 21.56%  runtime.mstart1
     0.30s   0.4% 46.35%     15.99s 21.56%  runtime.sysmon
     0.12s  0.16% 46.51%     15.50s 20.90%  runtime.startm

The CPU consumption stays roughly the same — about 30%


@prattmic prattmic commented Dec 4, 2020

Ah, this looks better. Not better in that usage is down much, but better in that it now shows the symptoms this issue was originally intended to address: i.e., there may be lots of fast cycles between idle and active, where the scheduler spends lots of time futilely trying to find work, then finally goes to sleep only to be re-woken almost immediately.

Is there a particular reason you need that Gosched in your code, or can it go? If it can go, then this will probably be in a better place for future improvements. (I'm a bit surprised that this didn't improve things at least by a few percent).

To help verify the above behavior, could you provide the full pprof profile for this run? Plus a runtime trace? You may need to make the run fairly short if the trace is unreasonably large.


@aka-rider aka-rider commented Dec 7, 2020


Is there a particular reason you need that Gosched in your code, or can it go?

Right now the application behaves well without it, so it can go. The reason it was added is that waiting on a channel could leave a goroutine sleeping for too long. Waiting on a Cond seems better in that sense.


I will think of how to get the trace from Docker container.


@ianlancetaylor ianlancetaylor commented Dec 7, 2020

In general the Go scheduler reacts badly to any kind of busy loop, whether implemented with runtime.Gosched or otherwise.


@aka-rider aka-rider commented Dec 8, 2020

In general the Go scheduler reacts badly to any kind of busy loop

Thanks for letting me know @ianlancetaylor

Please find the attached trace
