runtime: epoll scalability problem with 192 core machine and 1k+ ready sockets #65064
There is a small chance that #56424 is related, though it seems unlikely as that was at a much smaller scale.
We're running on the
Unfortunately, we don't have a reproducer for this problem right now, but our suspicion is that it should be easy to replicate by serving or making hundreds of thousands of fast network requests in a Go application using TCP.
We don't have a
We did not try increasing the buffer size; it wasn't apparent there was a way to do that without running a custom build of Go, and at the time running more than one container was a more accessible solution for us. Thanks for looking into this, it was definitely an interesting thing to find in the wild!
For some more context, the EpollWait time in the profile was 2800 seconds on a 30-second profile. Also, I don't necessarily think that the epoll buffer itself is the problem, but rather how epoll works under the hood with thousands of 'ready' sockets and hundreds of threads. The application under load had around 3500 open sockets: HTTP/2 clients making requests to our gRPC service on one end, and us making requests to ScyllaDB on the other.
Thanks for the details! I'll try to write a reproducer when I have some free time, not sure when I'll get to it.
Indeed, you'd need to manually modify the runtime. Note that it is possible to simply edit the runtime source in GOROOT and rebuild your program (no special steps are required for the runtime; it is treated like any other package). But if you build in a Docker container, it is probably a pain to edit the runtime source.
Some thoughts from brainstorming, for posterity: My best theory at the moment (though I'd really like to see perf to confirm) is that ~90 threads are calling epoll_wait at once (probably at this non-blocking netpoll: https://cs.opensource.google/go/go/+/master:src/runtime/proc.go;l=3230;drc=dcbe77246922fe7ef41f07df228f47a37803f360). The kernel has a mutex around the entire copy-out portion of epoll_wait, so there is probably a lot of time spent waiting for that mutex. If that is the case, some form of rate-limiting on how many threads make the syscall at once may be effective (a sketch follows below). N.B. that this non-blocking netpoll is not load-bearing for correctness, so occasionally skipping it would be OK.
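To illustrate that rate-limiting idea in isolation (this is not the runtime's code; the names maxConcurrentPolls, activePolls, and tryPoll are invented for the sketch), one could gate the optional poll behind an atomic counter so that only a bounded number of threads attempt the syscall at once:

```go
// Illustrative sketch only: cap how many callers attempt a non-blocking poll
// concurrently. Losing the race just skips the poll, which is acceptable when
// the poll is an optimization rather than required for correctness.
package poller

import "sync/atomic"

const maxConcurrentPolls = 4 // invented bound, purely for illustration

var activePolls atomic.Int32

// tryPoll runs poll only if fewer than maxConcurrentPolls callers are already
// inside it; otherwise it reports that this round was skipped.
func tryPoll(poll func()) (ran bool) {
	if activePolls.Add(1) > maxConcurrentPolls {
		activePolls.Add(-1)
		return false // too many concurrent pollers; skip
	}
	defer activePolls.Add(-1)
	poll()
	return true
}
```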
Yeah, it was the netpoll call inside findRunnable (though I didn't have my source mapping set up at the time to confirm the exact line numbers). I've also got a spare test machine with the same CPU I can use to try out a repro test case.
Is Go using the same epoll instance across all threads? That might be the underlying problem: most high-throughput applications (nginx, Envoy, Netty) create several instances, usually one per thread together with an event loop, and connections get distributed across the epoll instances one way or another.
Good point! And to answer your question: yes, Go has been using a single (and global) epoll instance across all threads. From where I stand, I reckon that refactoring the current netpoll to use multiple epoll instances could be worthwhile. To sum up, multiple epoll instances seem worth exploring.
Using multiple epoll instances would raise its own issues, though.
This is one of the potential issues we may encounter and would need to resolve if we decide to introduce multiple epoll instances. I actually drafted a WIP implementation of multiple epoll instances.
A casual observation (not Go-specific): one reason epoll doesn't scale well when a single epoll instance is shared across threads is the file descriptor table, which is typically shared across the process. This is one of the reasons why, say, 8 separate processes usually perform better than a single process with 8 threads. The impact is present both with multiple epoll instances (one per thread) and with a single epoll instance shared across threads. The way to circumvent this is to unshare (syscall) the file descriptor table across threads upon thread creation, then create an epoll instance per thread. This yields performance similar to a multi-process approach (within 1% in my experience). After that you can distribute the work however you want, maybe with SO_REUSEPORT. Be careful unsharing the file descriptor table, though; it is not appropriate for all situations. Side note: if you are sharing an epoll instance across threads, you should use edge-triggered mode to avoid waking all threads, most of them unnecessarily. This is my experience anyway when using a thread-per-core model, although the principle would apply regardless of the number of threads. I don't know anything about Go internals, so I'll leave it there.
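For illustration, a minimal sketch of the per-worker, edge-triggered pattern described above, written against golang.org/x/sys/unix (it omits the unshare step, and the poller type and its methods are invented for the sketch; the Go runtime's netpoller does not expose anything like this):

```go
// Illustrative sketch: one epoll instance per worker, with edge-triggered
// registration so only the owning worker is woken for a given fd.
// Assumes fds are non-blocking and are distributed to workers by the caller.
package pollsketch

import "golang.org/x/sys/unix"

type poller struct {
	epfd int
}

func newPoller() (*poller, error) {
	epfd, err := unix.EpollCreate1(unix.EPOLL_CLOEXEC)
	if err != nil {
		return nil, err
	}
	return &poller{epfd: epfd}, nil
}

// add registers fd for edge-triggered readiness on this worker's instance only.
func (p *poller) add(fd int) error {
	ev := unix.EpollEvent{
		Events: unix.EPOLLIN | unix.EPOLLOUT | unix.EPOLLET,
		Fd:     int32(fd),
	}
	return unix.EpollCtl(p.epfd, unix.EPOLL_CTL_ADD, fd, &ev)
}

// wait blocks for events on this worker's instance; since no other worker
// shares epfd, there is no cross-thread contention on its ready list.
func (p *poller) wait(events []unix.EpollEvent) (int, error) {
	return unix.EpollWait(p.epfd, events, -1)
}
```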
I don't want to derail this issue; let me know if I should move this to a separate bug. We are seeing a similar issue on a system with 128 cores, where we're only reading from 96 Unix sockets, one per goroutine. Go was spending much of its time in epoll_wait. I'm looking for the profiles from the Go app; in the meantime I can share that we reproduced this issue with a simple test program. I wrote a workaround that does not invoke the netpoller at all. Let me know if there's anything I can do to help.
These kernel patches may be of interest: https://lore.kernel.org/lkml/20230615120152.20836-1-guohui@uniontech.com/
Just to make sure I don't misread what @bwerthmann said: you achieved that by using raw syscalls, bypassing the Go netpoller?
Correct. I'll ask today if I can share an example.
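For illustration, a sketch of that kind of workaround, assuming a stream-oriented AF_UNIX socket and golang.org/x/sys/unix (the actual workaround wasn't shared in this thread, so readLoop and its details are invented):

```go
// Hypothetical sketch: read a Unix socket with blocking raw syscalls so the
// fd is never registered with the runtime netpoller (no epoll involvement).
// Each blocking Read occupies an OS thread for its duration.
package rawread

import "golang.org/x/sys/unix"

func readLoop(path string, handle func([]byte)) error {
	fd, err := unix.Socket(unix.AF_UNIX, unix.SOCK_STREAM, 0)
	if err != nil {
		return err
	}
	defer unix.Close(fd)

	if err := unix.Connect(fd, &unix.SockaddrUnix{Name: path}); err != nil {
		return err
	}

	// The fd stays in blocking mode, so unix.Read parks the calling thread
	// in the kernel instead of going through netpoll/epoll_wait.
	buf := make([]byte, 64<<10)
	for {
		n, err := unix.Read(fd, buf)
		if err != nil {
			return err
		}
		if n == 0 {
			return nil // EOF
		}
		handle(buf[:n])
	}
}
```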
I think it would be great if the Go runtime could maintain a separate epoll file descriptor (epfd) per P. Then every P could register file descriptors in its own local epfd and call epoll_wait on it independently.
Such a scheme may result in an imbalance of goroutines among Ps if a single goroutine creates many network connections (e.g. a server goroutine accepting all incoming connections).
I agree that most likely we need multiple epoll FDs, with some sort of affinity. @bwerthmann, since you're able to get perf profiles, could you get one with kernel call stacks? It would be really helpful if someone could create a benchmark that reproduces this issue. If it can be done with only 96 UNIX domain sockets, it may not even be especially hard.
If we want to go deep here, it might even be possible for the Go scheduler to become RX-queue aware using sockopts like SO_INCOMING_CPU or SO_INCOMING_NAPI_ID.
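For reference, a small sketch of how a user-space program could read those per-socket hints with golang.org/x/sys/unix (purely illustrative; the Go scheduler does not currently use these, and rxHints is an invented helper):

```go
// Illustrative: query which CPU and NAPI (RX queue) ID the kernel associates
// with a connected socket. The values are steering hints, not guarantees.
package rxinfo

import (
	"fmt"
	"net"

	"golang.org/x/sys/unix"
)

func rxHints(conn *net.TCPConn) error {
	raw, err := conn.SyscallConn()
	if err != nil {
		return err
	}
	return raw.Control(func(fd uintptr) {
		if cpu, err := unix.GetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_INCOMING_CPU); err == nil {
			fmt.Println("SO_INCOMING_CPU:", cpu)
		}
		if napi, err := unix.GetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_INCOMING_NAPI_ID); err == nil {
			fmt.Println("SO_INCOMING_NAPI_ID:", napi)
		}
	})
}
```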
@aclements, profile as requested, taken with perf.
@bwerthmann, thanks for the profile. Are you able to get one with call stacks that extend into the kernel? What I really want to see is what in the kernel is spending so much time on epoll_wait.
Nevermind! I was reading your profile backwards. 😅
@aclements, here's a stack sample with the main func at the top, plus a profile from FlameScope (the blanks on the left and right are when the profile stopped and started) and a flamegraph.
@aclements, what are your thoughts on the profiles?
@bwerthmann Thanks for the detailed profile! This seems to confirm my suspicion in #65064 (comment). The mutex being taken appears to be https://elixir.bootlin.com/linux/v5.10.209/source/fs/eventpoll.c#L696. This is held around a loop over the ready list (https://elixir.bootlin.com/linux/v5.10.209/source/fs/eventpoll.c#L1722), which double-checks that events are still ready before copying them out to user space. With a very long ready list, we're probably hitting the 128-event limit specified by netpoll. It's possible that shrinking this limit could actually help by making the critical section shorter, but probably not nearly as much as reducing concurrent calls to epoll_wait (either directly, or by sharding across multiple epoll FDs). As an aside, I also see a fair amount of contention on runtime locks in your profile (probably the scheduler lock).
It seems to me that we can partially mitigate the immediate issue by just limiting the number of P's that do a non-blocking netpoll. If anybody who can easily recreate the issue has time for experimentation, it might be interesting to see whether https://go.dev/cl/564197 makes any difference. Thanks.
Change https://go.dev/cl/564197 mentions this issue.
Is anybody interested in seeing whether https://go.dev/cl/564197 fixes the problem? To be clear, I'm not going to submit it unless I have some reason to think that it helps. Thanks.
Is there any chance you could apply CL 564197 in production? Or maybe in a dev/test environment to which you replay live traffic using some tool like goreplay? @ericvolp12 @whyrusleeping
I might have some cycles this week to test my reproducer.
Great! Thanks! @bwerthmann
Are there any instructions or easy buttons for checking out the changes in https://go-review.googlesource.com/c/go/+/564197/ into my GOROOT?
I've had other priorities; I'd like to get back to this in a few weeks or so. Sorry for the delay.
For reproducing (I don't have a 192-core machine to check), you could probably use https://github.com/fortio/fortio. Happy to help with using it if that's useful, but something like a large number of concurrent connections should do it.
Will this be solved with io_uring?
I have some cycles to test this out, as we have a reproducer of sorts that generates random latencies.
epoll contention on TCP causes latency build-up when we have high-volume ingress. This PR is an attempt to relieve this pressure. The upstream issue golang/go#65064 seems to be a deeper problem; we haven't yet tried the fix provided in that issue, but this change helps without changing the compiler. Of course, this is a workaround for now, hoping for a more comprehensive fix from the Go runtime.
Split from #31908 (comment) and full write-up at https://jazco.dev/2024/01/10/golang-and-epoll/.
The tl;dr is that a program on a 192-core machine with >2500 sockets, and with >1k of them becoming ready at once, spends huge amounts of time in netpoll -> epoll_wait (~65% of total CPU). Most interesting is that sharding these connections across 8 processes seems to solve the problem, implying some kind of super-linear scaling.

Given that the profile shows the time spent in epoll_wait itself, this may be a scalability problem in the kernel itself, but we may still be able to mitigate it.

@ericvolp12, some questions if you don't mind answering:

- Do you have a perf profile of this problem that shows where the time in the kernel is spent?

cc @golang/runtime