runtime: scheduler is slow when goroutines are frequently woken #18237

Open
philhofer opened this Issue Dec 7, 2016 · 14 comments

@philhofer
Contributor

philhofer commented Dec 7, 2016

What version of Go are you using (go version)?

go1.7.3

What operating system and processor architecture are you using (go env)?

linux/amd64; Xeon E5-2670 (dual-socket 6-core packages, non-HT)

Our profiles indicate that we're spending an enormous number of cycles in runtime.findrunnable (and its callees) on the hosts that serve as our protocol heads.

Briefly, these hosts translate HTTP CRUD operations into sets of transactions to be performed on our storage hosts, so the only real I/O they do is networking.

Here's what I see in our cpu profiles when I run a benchmark with 40 clients against a single host backed by 60 storage controllers:

host 486938695e10692ab3a6a554cf47486b: 7356 samples
 top flat  pct symbol            
1831 2030 24.9 syscall.Syscall   
 900  900 12.2 i/c/siphash.blocks
 835  835 11.4 runtime.futex     
 661  661  9.0 runtime.epollwait 
 224  224  3.0 runtime.memmove   
 180  297  2.4 runtime.runqgrab  
 176 2584  2.4 runtime.findrunnable
 171  171  2.3 runtime/internal/atomic.Cas
 116  116  1.6 runtime/internal/atomic.Xchg
  85   85  1.2 runtime/internal/atomic.Load
-------------------------------------------------------------------------------------
host 486938695e10692ab3a6a554cf47486b
"runtime.findrunnable" -- in 2584 samples of 7356 (35.1%)
1 callers:
  in  flat symbol
2584 2694.0 runtime.schedule
21 callees:
 out  flat symbol
  67 130.0 runtime.unlock
  20 46.0 runtime/internal/atomic.Xadd
  14 85.0 runtime/internal/atomic.Load
 406 406.0 runtime.injectglist
 488 488.0 runtime.stopm
 331 331.0 runtime.runqsteal
 139 238.0 runtime.lock
  16 31.0 runtime/internal/atomic.Xchg64
  26 26.0 runtime.pidleput
   2  2.0 runtime.releasep
  59 66.0 runtime.runqempty
  21 161.0 runtime.casgstatus
 777 777.0 runtime.netpoll
   9  9.0 runtime/internal/atomic.Store64
   8  8.0 runtime.netpollinited
   2  8.0 runtime.acquirep
  10 15.0 runtime.pidleget
   8  8.0 runtime.globrunqget
   2 12.0 runtime.fastrand1
   2  2.0 runtime.nanotime
   1 10.0 runtime.runqget

... here's the same benchmark, but this time against two hosts backed by (the same) 60 storage controllers:

host 91b42bdeee8bc69fe40c33dff7c146ac: 6563 samples
 top flat  pct symbol            
1695 1829 25.8 syscall.Syscall   
 977  977 14.9 i/c/siphash.blocks
 639  639  9.7 runtime.futex     
 431  431  6.6 runtime.memmove   
 373  373  5.7 runtime.epollwait 
 155  221  2.4 runtime.runqgrab  
 112 1756  1.7 runtime.findrunnable
 100  100  1.5 runtime/internal/atomic.Cas
  89   89  1.4 runtime/internal/atomic.Xchg
  83   83  1.3 runtime.usleep    
--------------------------
host f8e02f9facaa304dce98c8d876270a10: 6540 samples
 top flat  pct symbol            
1593 1716 24.4 syscall.Syscall   
 895  895 13.7 i/c/siphash.blocks
 598  598  9.1 runtime.futex     
 399  399  6.1 runtime.memmove   
 385  385  5.9 runtime.epollwait 
 130  130  2.0 runtime/internal/atomic.Cas
 128  233  2.0 runtime.runqgrab  
 104 1763  1.6 runtime.findrunnable
 102  102  1.6 runtime.usleep    
 101  101  1.5 runtime/internal/atomic.Xchg

host 91b42bdeee8bc69fe40c33dff7c146ac
"runtime.findrunnable" -- in 1756 samples of 6563 (26.8%)
1 callers:
  in  flat symbol
1756 1846.0 runtime.schedule
20 callees:
 out  flat symbol
  41 98.0 runtime.unlock
   5 53.0 runtime/internal/atomic.Load
  45 51.0 runtime.runqempty
  12 12.0 runtime/internal/atomic.Store64
   8 91.0 runtime.casgstatus
  15 49.0 runtime/internal/atomic.Xadd
 364 365.0 runtime.stopm
 443 443.0 runtime.netpoll
 108 172.0 runtime.lock
 295 295.0 runtime.injectglist
 246 246.0 runtime.runqsteal
   3  3.0 runtime.releasep
  30 30.0 runtime.pidleput
   8 16.0 runtime.pidleget
   4  4.0 runtime.netpollinited
   3 12.0 runtime.runqget
   9  9.0 runtime.globrunqget
   3 22.0 runtime/internal/atomic.Xchg64
   1  7.0 runtime.fastrand1
   1  1.0 runtime.nanotime
-----------------
host f8e02f9facaa304dce98c8d876270a10
1 callers:
  in  flat symbol
1763 1853.0 runtime.schedule
21 callees:
 out  flat symbol
 268 268.0 runtime.runqsteal
  24 24.0 runtime.pidleput
 477 477.0 runtime.netpoll
 109 167.0 runtime.lock
   4 12.0 runtime.acquirep
   6 58.0 runtime/internal/atomic.Load
   7  7.0 runtime/internal/atomic.Store64
 298 298.0 runtime.injectglist
  49 54.0 runtime.runqempty
  33 71.0 runtime.unlock
  11 117.0 runtime.casgstatus
 327 328.0 runtime.stopm
   5 12.0 runtime.pidleget
  10 10.0 runtime.globrunqget
   5  9.0 runtime.runqget
   7  7.0 runtime.netpollinited
  12 40.0 runtime/internal/atomic.Xadd
   1  7.0 runtime.fastrand1
   4 24.0 runtime/internal/atomic.Xchg64
   1  1.0 runtime.releasep
   1  1.0 runtime.nanotime

Interestingly, the single-head cpu consumption is at 560% of 1200%, and the dual-head cpu consumption is at 470% and 468% of 1200%, respectively.

A couple of notable details:

  • Performance is substantially worse in the single-host case (65% of the dual-host case), despite the fact that it is only half-loaded and backed by the same set of storage nodes running an identical front-end workload. I suppose some of this could be chalked up to head-of-line blocking, but I suspect there's more going on. In principle I'd expect very little difference between the two configurations, since none of these requests need to synchronize.
  • Proportionally more time (35% vs 27%) is spent in runtime.findrunnable in the single-node case. I'd expect that system to have on average 2x the number of goroutines, but I didn't think more goroutines would cause the proportional amount of time in the scheduler to increase. (I had presumed that more goroutines meant less work-stealing and polling, which would mean proportionally less time doing expensive stuff like syscalls and atomics.)

Let me know if there are other details I can provide.

Thanks,
Phil

@bradfitz

Member

bradfitz commented Dec 7, 2016

@philhofer, any chance you could try Go 1.8beta1? Even if a bug were found in Go 1.7, that branch is closed for all but security issues at this point.

Go 1.8 should be a drop-in replacement for 1.7. See https://beta.golang.org/doc/go1.8 for details. The SSA back end for ARM will probably help your little devices a fair bit. See https://dave.cheney.net/2016/11/19/go-1-8-toolchain-improvements

@bradfitz bradfitz added this to the Go1.9 milestone Dec 7, 2016

@bradfitz

Member

bradfitz commented Dec 7, 2016

(Tagging this Go 1.9, unless you can reproduce on 1.8 and @aclements thinks it's easily fixable enough for 1.8)

@davecheney

Contributor

davecheney commented Dec 7, 2016

@philhofer

Contributor

philhofer commented Dec 7, 2016

@bradfitz Yes, we're busy trying to get 1.8beta1 on some hardware to benchmark it. We're very excited about the arm performance improvements. (However, these profiles are on an Intel Xeon host, which I presume will perform similarly between 1.7 and 1.8, unless there have been substantial changes made to the scheduler that I missed?)

@davecheney Yes; I'll try to post a slightly-redacted one.

@philhofer

Contributor

philhofer commented Dec 8, 2016

Update: most of the scheduler activity is caused by blocking network reads.

The call chain goes across two call stacks, which makes it a little tough to track down through stack traces alone, but here it is:

  • net.(*netFD).Read
    net.(*pollDesc).wait
    net.(*pollDesc).waitRead
    net.runtime_pollWait
    runtime.netpollblock
    runtime.gopark
    runtime.mcall(park_m)
  • runtime.park_m
    runtime.schedule
    runtime.findrunnable
    (etc)

The raw call counts suggest that roughly 90% of the runtime.schedule calls are a consequence of this particular chain of events.
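
For illustration, here is a minimal sketch of the goroutine-side code behind those stacks; the names are hypothetical and not taken from our code. A plain blocking read on a net.Conn that finds no buffered data descends through net.runtime_pollWait and runtime.netpollblock, parks the goroutine, and leaves the thread to run runtime.schedule / runtime.findrunnable until epoll reports the socket readable again:

package sketch

import "net"

// readLoop is a hypothetical example of the pattern that produces the
// call chain above. Every conn.Read that finds the socket empty parks
// this goroutine (gopark); the freed thread then goes through
// runtime.schedule -> runtime.findrunnable looking for other work.
func readLoop(conn net.Conn, handle func([]byte)) error {
	buf := make([]byte, 64*1024)
	for {
		n, err := conn.Read(buf) // blocks (parks) whenever no data is ready
		if err != nil {
			return err
		}
		handle(buf[:n])
	}
}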

@davecheney I haven't converted our profiles into the pprof format yet, but I hope that answers the same question you were hoping the SVG web view would answer.

@bradfitz

Member

bradfitz commented Dec 8, 2016

Yes, we're busy trying to get 1.8beta1 on some hardware to benchmark it. We're very excited about the arm performance improvements. (However, these profiles are on an Intel Xeon host, which I presume will perform similarly between 1.7 and 1.8, unless there have been substantial changes made to the scheduler that I missed?)

Oh, sorry, missed that. In any case, please test 1.8 wherever possible in the next two weeks. It's getting increasingly hard to make changes to 1.8. The next two weeks are the sweet spot for bug reports. Thanks!

@philhofer

Contributor

philhofer commented Dec 12, 2016

We just finished our first set of runs on 1.8, and things look pretty much identical on our x86 machines.

--------------------------
host 4edd58c28c7b9b548cc360334bae7af7: 6619 samples
 top flat  pct symbol            
1766 1872 26.7 syscall.Syscall   
 993  993 15.0 i/c/siphash.blocks
 720  720 10.9 runtime.futex     
 461  461  7.0 runtime.epollwait 
 443  443  6.7 runtime.memmove   
 173 1759  2.6 runtime.findrunnable
 107  107  1.6 runtime.casgstatus
  88  136  1.3 runtime.lock      
  86  136  1.3 runtime.runqgrab  
  64   64  1.0 runtime.usleep    
--------------------------
host f40105cffd2f1ec62e180b34677fc560: 6665 samples
 top flat  pct symbol            
1704 1789 25.6 syscall.Syscall   
 976  976 14.6 i/c/siphash.blocks
 666  666 10.0 runtime.futex     
 469  469  7.0 runtime.epollwait 
 408  408  6.1 runtime.memmove   
 168 1768  2.5 runtime.findrunnable
  99   99  1.5 runtime.casgstatus
  95  145  1.4 runtime.lock      
  89   89  1.3 runtime.usleep    
  86  157  1.3 runtime.runqgrab  

@bradfitz

Member

bradfitz commented Jun 29, 2017

@aclements, what's the status here?

@bradfitz

Member

bradfitz commented Jul 6, 2017

Ping @aclements

@philhofer

Contributor

philhofer commented Jul 6, 2017

I have a little more information, in case you're interested.

Fundamentally, the issue here is that io.ReadFull(socket, buf) where len(buf) is, say, 64kB (or really any number that is a large-ish multiple of your 1500-byte MTU), causes the scheduler to wake up that goroutine len(buf)/1500 times, since that's the number of times that data becomes available through epoll. So, if you have 20 goroutines doing this with 64kB buffers, then you'll eat more than 850 scheduler wakeups before all those buffers have been filled.
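
As a concrete sketch of that pattern (the function name is made up for illustration):

package sketch

import (
	"io"
	"net"
)

// fillBuffer fills a 64 KiB buffer with io.ReadFull over a TCP
// connection. With a 1500-byte MTU the socket typically becomes
// readable one frame at a time, so this goroutine can be parked and
// re-woken on the order of 64*1024/1500 ≈ 44 times before ReadFull
// returns; 20 goroutines doing this works out to roughly 850+
// scheduler wakeups, each paying the findrunnable path.
func fillBuffer(conn net.Conn) ([]byte, error) {
	buf := make([]byte, 64*1024)
	if _, err := io.ReadFull(conn, buf); err != nil {
		return nil, err
	}
	return buf, nil
}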

Now, in a sane world we could wire up io.ReadFull on a socket such that it called setsockopt(SO_RCVLOWAT), so that the caller didn't receive a notification until there was plenty of data to read, but, frustratingly, SO_RCVLOWAT doesn't work with poll or select.
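
(For illustration only: setting that option from Go is easy enough via SyscallConn on a *net.TCPConn, available since Go 1.9. The helper below is hypothetical, and as just noted it wouldn't actually reduce wakeups here, because epoll ignores SO_RCVLOWAT when reporting readiness.)

package sketch

import (
	"net"
	"syscall"
)

// setRcvLowat is a hypothetical helper that raises SO_RCVLOWAT on a
// TCP connection so that read(2) wouldn't return until at least n
// bytes are buffered. Linux's epoll does not honor SO_RCVLOWAT for
// readiness notifications, so the netpoller would still wake the
// goroutine for each arriving frame.
func setRcvLowat(c *net.TCPConn, n int) error {
	raw, err := c.SyscallConn()
	if err != nil {
		return err
	}
	var sockErr error
	if err := raw.Control(func(fd uintptr) {
		sockErr = syscall.SetsockoptInt(int(fd), syscall.SOL_SOCKET, syscall.SO_RCVLOWAT, n)
	}); err != nil {
		return err
	}
	return sockErr
}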

So, part of this is Linux's fault, and part of it is caused by the scheduler being generally slow. (Consider: in that profile, we spend nearly twice as much time in the scheduler as we do checksumming every single byte received over the network.)

@aclements

Member

aclements commented Jul 6, 2017

Thanks for the extra information, @philhofer. That's very useful in understanding what's going on here.

Given how much time you're spending in findrunnable, it sounds like you're constantly switching between having something to do and being idle. Presumably the 1500 byte frames are coming in just a little slower than you can process them, so the runtime is constantly looking for work to do, going to sleep, and then immediately being woken up for the next frame. This is the most expensive path in the scheduler (we optimize for the case where there's another goroutine ready to run, which is extremely fast) and there's an implicit assumption here that the cost of going to sleep doesn't really matter if there's nothing to do. But that's violated if new work is coming in at just the wrong rate.

I'm not really sure what to do about this. It would at least help confirm this if you could post an execution trace (in this case, a sufficiently zoomed-in screen shot is probably fine, since there's no easy way to redact the stacks in an execution trace).

@aclements aclements modified the milestones: Go1.10, Go1.9 Jul 6, 2017

@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017

@mspielberg

mspielberg commented Jan 2, 2018

I have a similar issue, with no network involved.

My project performs application protocol analysis against a libpcap capture. Different pools of goroutines perform reading the raw trace from disk, packet parsing, and TCP/IP flow reassembly. CPU profiling indicates > 25% of total time spent in findrunnable. I'm running on 64-bit OSX, so most of that time is in kevent.

@aclements's description does not appear to fit my situation, since more data is always available throughout a run. Whenever individual goroutines block, it's because they have just dispatched work to one or more goroutines further down the pipeline.
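
For context, a rough sketch of that kind of staged pipeline (all names here are hypothetical, not from the project): each stage blocks only when handing work to the next stage over a channel.

package main

import "fmt"

type rawChunk []byte
type packet struct{ payload []byte }

// reader stands in for the goroutines that read the raw trace from disk.
func reader(out chan<- rawChunk) {
	defer close(out)
	for i := 0; i < 4; i++ {
		out <- rawChunk(fmt.Sprintf("chunk-%d", i))
	}
}

// parser stands in for packet parsing; it blocks only when the
// reassembly stage falls behind and its channel is full.
func parser(in <-chan rawChunk, out chan<- packet) {
	defer close(out)
	for c := range in {
		out <- packet{payload: c}
	}
}

// reassemble stands in for TCP/IP flow reassembly.
func reassemble(in <-chan packet, done chan<- struct{}) {
	for p := range in {
		_ = p // real work would happen here
	}
	close(done)
}

func main() {
	chunks := make(chan rawChunk, 8)
	packets := make(chan packet, 8)
	done := make(chan struct{})
	go reader(chunks)
	go parser(chunks, packets)
	go reassemble(packets, done)
	<-done
}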

I'm running go version go1.9.1 darwin/amd64.

The project is open source, so I can point you to the source and the SVG profiles generated from my perf tests. Would that be helpful, and would it be better to keep it in this issue or file a new one?

@davecheney

Contributor

davecheney commented Jan 2, 2018

@jeffdh

jeffdh commented Jan 3, 2018

I was able to capture a trace of the original issue @philhofer described and wanted to add the requested screenshots to verify that this is the scheduler worst-case scenario described by @aclements.

Though the profiling samples nicely show the time being spent in runtime.findrunnable, the trace viewer doesn't make it quite as clear, since the scheduling behavior has to be inferred from the white space. Here are a couple of screenshots that roughly show the socket being serviced constantly while the program makes no meaningful progress.

From a macro view, here's about 40ms total:
[screenshot: execution trace, macro view of ~40 ms (2017-08-11)]

Most of the tiny slivers are network wake-ups that read one MTU's worth of data off a particular socket, but not enough to fill the desired buffer (think a 1500-byte MTU against 64 kB buffers). The burst of longer operations on the right is the processing that happened once enough data had been received to do higher-level work with it (Reed-Solomon computation in this case).

The next screenshot is a zoom-in on the small-goroutine section (~2 ms total):
[screenshot: execution trace, zoomed to the ~2 ms goroutine section (2017-08-11)]

I've selected a tiny slice, and the identical stack appears across all of the very small goroutines.

I think this tells the story of the scheduler constantly going idle and then being woken up by the network. I'm also willing to post more screenshots like this if there are more specific questions.

@ianlancetaylor ianlancetaylor modified the milestones: Go1.11, Go1.12 Jul 10, 2018

@ianlancetaylor ianlancetaylor changed the title from runtime: runtime.findrunnable chewing cycles to runtime: scheduler is slow when goroutines are frequently woken Jul 10, 2018
