Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: epoll scalability problem with 192 core machine and 1k+ ready sockets #65064

Open
prattmic opened this issue Jan 11, 2024 · 26 comments
Open
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Linux Performance
Milestone

Comments

@prattmic
Copy link
Member

prattmic commented Jan 11, 2024

Split from #31908 (comment) and full write-up at https://jazco.dev/2024/01/10/golang-and-epoll/.

tl;dr is that a program on a 192 core machine with >2500 sockets and with >1k becoming ready at once results in huge costs in netpoll -> epoll_wait (~65% of total CPU).

Most interesting is that sharding these connections across 8 processes seems to solve the problem, implying some kind of super-linear scaling.

That the profile shows the time spent in epoll_wait itself, this may be a scalability problem in the kernel itself, but we may still be able to mitigate.

@ericvolp12, some questions if you don't mind answering:

  • Which version of Go are you using? And which kernel version?
  • Do you happen to have a reproducer for this problem that you could share? (Sounds like no?)
  • On a similar note, do you have a perf profile of this problem that shows where the time in the kernel is spent?
  • The 128 event buffer size is mentioned several times, but it is not obvious to me that increasing this size would actually solve the problem. Did you try increasing the size and see improved results?

cc @golang/runtime

@prattmic prattmic added Performance NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Jan 11, 2024
@prattmic prattmic added this to the Backlog milestone Jan 11, 2024
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jan 11, 2024
@prattmic
Copy link
Member Author

There is a small chance that #56424 is related, though it seems unlikely as that was at a much smaller scale.

@ericvolp12
Copy link

ericvolp12 commented Jan 11, 2024

  • Which version of Go are you using? And which kernel version?

We're running on the golang:1.21-bullseye docker image base which is currently using: go version go1.21.6 linux/amd64, kernel version 5.15.0-91-generic on Ubuntu

  • Do you happen to have a reproducer for this problem that you could share? (Sounds like no?)

We don't have a reproducer for this problem right now unfortunately, but our suspicion is that it should be easy to replicate by serving or making hundreds of thousands of fast network requests in a go application using TCP.

  • On a similar note, do you have a perf profile of this problem that shows where the time in the kernel is spent?

We don't have a perf profile unfortunately, most of our discovery was done via pprof profiles from the running binary and testing different configurations (4, then 8 containers per host).

  • The 128 event buffer size is mentioned several times, but it is not obvious to me that increasing this size would actually solve the problem. Did you try increasing the size and see improved results?

We did not try increasing the buffer size, it wasn't apparent there was a way to do that without running a custom build of Go and at the time running more than one container was a more accessible solution for us.

Thanks for looking into this, it was definitely a interesting thing to find in the wild!

@whyrusleeping
Copy link

whyrusleeping commented Jan 11, 2024

For some more context, the EpollWait time in the profile was 2800 seconds on a 30 second profile.

Also I don't necessarily think that the epoll buffer itself is the problem, rather just how epoll works under the hood with thousands of 'ready' sockets and hundreds of threads.

The application under load had around 3500 open sockets, http2 clients making requests to our grpc service on one end and us making requests to scyllaDB on the other.

@prattmic
Copy link
Member Author

Thanks for the details! I'll try to write a reproducer when I have some free time, not sure when I'll get to it.

it wasn't apparent there was a way to do that without running a custom build of Go

Indeed, you'd need to manually modify the runtime. Note that is possible to simply edit the runtime source in GOROOT and rebuild your program (no special steps required for the runtime, it is treated like any other package). But if you build in a Docker container it is probably a pain to edit the runtime source.

@prattmic
Copy link
Member Author

Some thoughts from brainstorming for posterity:

My best theory at the moment (though I'd really like to see perf to confirm) is that ~90 threads are calling epoll_wait at once (probably at this non-blocking netpoll: https://cs.opensource.google/go/go/+/master:src/runtime/proc.go;l=3230;drc=dcbe77246922fe7ef41f07df228f47a37803f360). The kernel has a mutex around the entire copy-out portion of epoll_wait, so there is probably a lot of time waiting for the mutex. If that is the case, some form of rate-limiting on how many threads make the syscall at once may be effective. N.B. that this non-blocking netpoll is not load-bearing for correctness, so occasionally skipping it would be OK.

@whyrusleeping
Copy link

Yeah, it was the netpoll call inside findRunnable (though i didnt have my source mapping set up at the time to confirm the exact line numbers).
I overwrite the profile i took from the degerate case unfortunately, if helpful we can probably reorient things back down to a single process per machine and run some tests with perf.

I've also got a spare test machine with the same CPU i can use to try out a repro test case as well.

@sschepens
Copy link

is go using the same epoll instance accross all threads? that might be the underlying problem, most high-throughput applications (nginx, envoy, netty) create several instances (usually one per thread together with an event loop) and connections get distributed to all epoll instances some way or another.

@panjf2000
Copy link
Member

panjf2000 commented Jan 12, 2024

is go using the same epoll instance accross all threads? that might be the underlying problem, most high-throughput applications (nginx, envoy, netty) create several instances (usually one per thread together with an event loop) and connections get distributed to all epoll instances some way or another.

Good point! And to answer your question, yes, Go has been using the single (and global) epoll/kqueue/poll instance internally since the day Go netpoll was introduced. I actually had this concern for a few years, but never got a chance to spot that kind of performance bottleneck emerge. What I had in mind is that we can make a transition from single epoll instance to per-P epoll instances, or just multiple global epoll instances simply, which could also help.

From where I stand, I reckon that refactoring the current epoll from a single instance to multiple instances would require much less work than introducing io_uring. What is more, given the current Go codebase, io_uring is better suited for file I/O than for network I/O. Oh boy, I can already imagine now how many obstacles we'll have to go through before io_uring is implemented for network I/O eventually, and also transparently.

To sum up, multiple epoll instances should be able to gain sufficient credits for the performance boost of network I/O, and in consideration of the complexity from introducing io_uring for network I/O, I think the former is more feasible at this stage.

@sschepens
Copy link

sschepens commented Jan 12, 2024

using multiple epoll instances would mean that connections or fds would now be bound to a single thread? does this means that it could be possible for connection imbalances to happen where some threads could be handling many long lived connections while others be mostly idle?

@panjf2000
Copy link
Member

panjf2000 commented Jan 13, 2024

using multiple epoll instances would mean that connections or fds would now be bound to a single thread? does this means that it could be possible for connection imbalances to happen where some threads could be handling many long lived connections while others be mostly idle?

This is one of the potential issues we may encounter and need to resolve if we decide to introduce multiple epoll instances for Go runtime. But I don't think it's going to be our big concern cuz there are ways for us to mitigate that, for instance, the work-stealing mechanism, or just to put surplus tasks in the global run queue.

I actually drafted a WIP implementation of multiple epoll/kqueue/poll instances a long time ago on my local computer, and I can take on this if we eventually decide to introduce multiple netpollers after the root cause of this issue has been revealed.

@panjf2000 panjf2000 added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Jan 13, 2024
@gopherbot gopherbot removed the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 13, 2024
@panjf2000 panjf2000 added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 13, 2024
@gopherbot gopherbot removed the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Jan 13, 2024
@errantmind
Copy link

errantmind commented Jan 13, 2024

A casual observation (not go specific): one reason epoll doesn't scale well when a single epoll instance is shared across threads is the file descriptor table, which is typically shared across the process. This is one of the reasons why, say, 8 separate processes usually performs better than a single process with 8 threads. The impact is present both with multiple epoll instances (per thread), or a single epoll instance shared across threads. The way to circumvent this is to unshare (syscall) the file descriptor table across threads upon thread creation, then create an epoll instance per thread. This yields similar performance to a multi process approach (within 1% in my experience). After that you can distribute the work however you want, maybe with SO_REUSEPORT. Also, be careful unsharing the file descriptor table, it is not appropriate for all situations.

Side note, if you are sharing an epoll instance across threads you should use edge triggered to avoid all threads from being woken up, most unnecessarily.

This is my experience anyway when using a thread per core model, although the principle would apply regardless of the number of threads. I don't know anything about go internals so I'll leave it there.

@bwerthmann
Copy link

bwerthmann commented Jan 15, 2024

I don't want to derail this issue, let me know if I should move this to a separate bug...

We are seeing a similar issue on a system with 128 cores, we're only reading from 96 Unix Sockets, 1 per goroutine. Go was spending much time in netpoll -> epoll_wait and perf top reported much of time in the kernel in osq_lock.

I'm looking for the profiles from the Go App, in the mean time I can share that we reproduced this issue with a simple socat invocation:

image

I wrote a workaround that does not invoke netpoll at all, instead it just makes raw syscalls and throughput improved by 5x (which is the next bottleneck in the App, also related to Mutex being slow). My microbenchmark with raw syscalls just reading from unix sockets performed >10x in terms of bandwidth with the 96 producers.

Let me know if there's anything I can do to help.

@bwerthmann
Copy link

These kernel patches may be of interest:
locking/osq_lock: Fix false sharing of optimistic_spin_node in osq_lock, may not be accepted yet?

https://lore.kernel.org/lkml/20230615120152.20836-1-guohui@uniontech.com/

@panjf2000
Copy link
Member

I wrote a workaround that does not invoke netpoll at all, instead it just makes raw syscalls and throughput improved by 5x

Just to make sure I don't misread what said, you achieved that by using raw syscalls of socket(), bind(), listen(), connect(), read() write(), etc. instead of the APIs provided by std net, right?
@bwerthmann

@bwerthmann
Copy link

I wrote a workaround that does not invoke netpoll at all, instead it just makes raw syscalls and throughput improved by 5x

Just to make sure I don't misread what said, you achieved that by using raw syscalls of socket(), bind(), listen(), connect(), read() write(), etc. instead of the APIs provided by std net, right?
@bwerthmann

Correct. I'll ask today if I can share an example.

@valyala
Copy link
Contributor

valyala commented Jan 17, 2024

I think it would be great if Go runtime could maintain a separate epoll file descriptor (epfd) per each P. Then every P could register file descriptors in its own local epfd and call epoll() on it when its local list of goroutines ready to run becomes empty and it needs to find runnable goroutine. This scheme has the following benefits:

  • Goroutines, which work with network, will tend to stay on the same P, since the file descriptors created by the goroutine are registered in P-local epfd. Even if the goroutine migrates to another P for some reason, it will migrate to the original P after the next network IO. This improves locality of data accessed by the goroutine, so it remains for longer in P-local CPU caches. This should improve the overall performance, since access to local CPU caches is usually faster than access to shared memory.
  • This should improve scalability of epoll() calls, since every P will poll its own epfd, thus removing bottlenecks related to access synchronization to shared epfd in kernel space.

Such a scheme may result in imbalance of goroutines among P workers, if a single goroutine creates many network connections (e.g. server accept loop). Then the Go scheduler will migrate all the goroutines, which make IO on these connections, to the original P where the original goroutine created all these network connections. This can be solved by periodic even re-distribution of the registered network connections among P-local epfds. For example, if P cannot find ready to run goroutines in local queue and in local epfd, then it can steal a few network connections from the busiest P, to de-register them from that P's epfd and then to register them in local epfd. The busiest P can be determined from some rough per-P CPU usage stats.

@aclements
Copy link
Member

I agree that most likely we need multiple epoll FDs, with some sort of affinity.

@bwerthmann , since you're able to get perf profiles, could you get one with perf record -g? I'd love to see where the osq_lock call is coming from to confirm the hypothesis.

It would be really helpful if someone could create a benchmark that reproduces this issue. If it can be done with only 96 UNIX domain sockets, it may not even be especially hard.

@aclements
Copy link
Member

If we want to go deep here, it might even be possible for the Go scheduler to become RX queue aware using sockopts like SO_INCOMING_CPU or SO_INCOMING_NAPI_ID. I suspect we can do a lot better without bringing in that complexity, but it's an interesting opportunity to consider.

@bwerthmann
Copy link

bwerthmann commented Jan 23, 2024

@aclements profile as requested. Taken with go1.21.5 on a 128 core machine:
image

@aclements
Copy link
Member

@bwerthmann , thanks for the profile. Are you able to get one with call stacks that extend into the kernel? What I really want to see is what in the kernel is spending so much time on osq_lock.

@aclements
Copy link
Member

Nevermind! I was reading your profile backwards. 😅

@bwerthmann
Copy link

bwerthmann commented Jan 24, 2024

@aclements here's a stack sample with the main func at the top:
image

profile from FlameScope:

blanks on the left and right are when the profile stopped and started.

image

Flamegraph:

image

@bwerthmann
Copy link

bwerthmann commented Jan 29, 2024

I'd love to see where the osq_lock call is coming from to confirm the hypothesis.

@aclements what are your thoughts on the profiles?

@prattmic
Copy link
Member Author

@bwerthmann Thanks for the detailed profile! This seems to confirm my suspicion in #65064 (comment).

The mutex being taken appears to be https://elixir.bootlin.com/linux/v5.10.209/source/fs/eventpoll.c#L696 [1]. This is held around a loop over the ready list (https://elixir.bootlin.com/linux/v5.10.209/source/fs/eventpoll.c#L1722), which double-checks that events are still ready (ep_item_poll) and then copies them to the userspace output buffer (__put_user). This loop exits when it reaches the end of the ready list or when it has written the max events that fit in the userspace output buffer, whichever is sooner.

With a very long ready list, we're probably hitting the 128 event limit specified by netpoll. It's possible shrinking this could actually help by making the critical section shorter, but probably not nearly as much as reducing concurrent calls to epoll_wait (either directly, or sharding across multiple epoll FDs).

As an aside, I also see a fair amount of contention on runtime locks in your profile (probably sched.lock), so that bottleneck would likely come up next.

[1] ep_scan_ready_list was removed completely in 2020 in Linux v5.11 in a fairly large epoll refactor: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1a825a6a0e7eb55c83c06f3c74631c2eeeb7d27f, but the lock simply moved to the caller and seems to protect a similar critical section. #65064 (comment) noted Linux v5.15, so presumably it has similar issues.

@ianlancetaylor
Copy link
Contributor

It seems to me that we can partially mitigate the immediate issue by just limiting the number of P's that do a non-blocking epoll call in findRunnable. We get no advantage from having multiple P's call netpoll(0) simultaneously. And if we prevent that from happening, then we seem likely to avoid the contention in the kernel. We'll still have contention in userspace, but we have that anyhow, and that will continue until we are able to do a major overhaul of the scheduler for NUMA support.

If anybody who can easily recreate the issue has time for experimentation, it might be interesting to see whether https://go.dev/cl/564197 makes any difference. Thanks.

@gopherbot
Copy link

Change https://go.dev/cl/564197 mentions this issue: runtime: only poll network from one P at a time in findRunnable

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Linux Performance
Projects
Development

No branches or pull requests