runtime: allow short-term drop of work conservation to increase CPU efficiency #54622
Labels
- compiler/runtime: Issues related to the Go compiler and/or runtime.
- NeedsInvestigation: Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
- Performance

Comments
prattmic added the Performance, NeedsInvestigation, and compiler/runtime labels on Aug 23, 2022
Being able to tune such a cutoff for a given goroutine could help ease situations where `runtime.findrunnable` is eating too much CPU.
copybara-service bot pushed a commit to google/gvisor that referenced this issue on Oct 3, 2022:
Before this CL, using `kernel.Task.Block*` to wait for host FD readiness requires going through fdnotifier, resulting in a significant amount of overhead. As an example, consider an application blocking in `recvmsg` on a hostinet socket:

- The task goroutine invokes `recvmsg` and gets `EAGAIN`.
- The task goroutine heap-allocates a `waiter.Entry` and a channel of `waiter.EventMask`, and invokes `epoll_ctl` to add the socket FD to fdnotifier's epoll FD.
- The task goroutine invokes `recvmsg` and gets `EAGAIN`, again.
- The task goroutine blocks in Go (on the channel select in `kernel.Task.block`). If the thread that was running the task goroutine can find idle goroutines to run, then it does so; otherwise, it invokes `futex(FUTEX_WAIT)` to block in the host. Note that the vast majority of the sentry's "work" consists of executing application code, during which the corresponding task goroutines appear to the Go scheduler to be blocked in host syscalls; furthermore, time that *is* spent executing sentry code (in Go) is overhead relative to the application's execution. Consequently, the sentry has relatively little Go code to execute and is generally optimized to have less, making this tradeoff less favorable than in (presumably) more typical Go programs.
- When the socket FD becomes readable, fdnotifier's goroutine returns from `epoll_wait` and wakes the task goroutine, usually invoking `futex(FUTEX_WAKE)` to wake another thread. It then yields control of its thread to other goroutines, improving wakeup-to-execution latency for the task goroutine. The `futex(FUTEX_WAKE)` is skipped if any of the following are true:
  - `GOMAXPROCS` threads are already executing goroutines. For reasons described above, we expect this to occur infrequently.
  - At least one already-running thread is in the "spinning" state, because it was itself recently woken but has not yet started executing goroutines.
  - At least one already-running thread is in the "spinning" state, because it recently ran out of goroutines to run and is busy-polling before going to sleep.

  A "spinning" thread stops spinning either because it successfully busy-polls for an idle goroutine to run, or because it times out while busy-polling; in the former case the thread usually invokes `futex(FUTEX_WAKE)` to wake *another* thread as described above, and in the latter case the thread invokes `futex(FUTEX_WAIT)` to go to sleep.
- The task goroutine invokes `recvmsg` and succeeds.
- The task goroutine invokes `epoll_ctl` to remove the socket FD from fdnotifier's epoll FD.

This CL demonstrates how fdnotifier may be replaced by making host syscalls from task goroutine context. After this CL, following per-thread initialization (`sigprocmask`), the same scenario results in:

- The task goroutine invokes `recvmsg` and gets `EAGAIN`.
- The task goroutine invokes `ppoll` on the host FD, which returns when the socket FD becomes available. The Go runtime maintains a thread called "sysmon" which runs periodically. When this thread determines that another thread has been blocked in a host syscall for "long enough" (20-40us + slack) and there are idle goroutines to run, it steals that thread's runqueue and invokes `futex(FUTEX_WAKE)` to wake another thread to run the stolen runqueue.
- The task goroutine invokes `recvmsg` and succeeds.

For now, this functionality is only used in hostinet, where socket methods are responsible for blocking; applying it more generally (e.g. to `read(2)` from hostinet sockets) requires additional work to move e.g. `read(2)` blocking from `//pkg/sentry/syscalls/linux` into file description implementations.

Some of the overheads before this CL are tractable without removing fdnotifier. The `sleep` package only requires one allocation (of `sleep.Waker`) per registration, and the `syncevent` package requires none. `EventRegister` can return the last known readiness mask to avoid the second `recvmsg`. However, the interactions with the Go runtime, in particular the many `FUTEX_WAKE`s we incur when waking goroutines due to `ready() -> wakep() -> schedule() -> resetspinning() -> wakep()`, are not. The leading alternative solutions to the same problem are `sleep.Sleeper.AssertAndFetch` and "change the Go runtime". The former is very dubiously safe; it works by transiently lying about `runtime.sched.nmspinning`, a global variable, and not calling `runtime.resetspinning()` when it stops lying, so side effects start at "`runtime.wakep()` is disabled globally rather than only on the caller's thread" and go from there. The latter is dubiously tractable for reasons including the atypicality of the sentry described above, though see golang/go#54622.

PiperOrigin-RevId: 478129654
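The channel-based blocking path the commit message describes can be sketched in plain Go. This is a standalone illustration, not gVisor code: `EventMask`, `EventIn`, and `waitForReady` are names invented here, loosely modeled on gVisor's `waiter` package. The cross-goroutine channel send is where the Go runtime may perform the `futex(FUTEX_WAKE)` discussed above.

```go
package main

import "fmt"

// EventMask models a set of readiness events (hypothetical type,
// loosely modeled on gVisor's waiter.EventMask).
type EventMask uint64

// EventIn indicates the FD is readable.
const EventIn EventMask = 1

// waitForReady stands in for kernel.Task.block: the task goroutine
// blocks in Go until the notifier reports readiness on the channel.
func waitForReady(ch <-chan EventMask) EventMask {
	return <-ch
}

func main() {
	// The task goroutine allocates and registers a channel; the
	// notifier goroutine (standing in for the epoll_wait loop)
	// sends readiness on it, potentially waking another thread.
	ch := make(chan EventMask, 1)

	go func() {
		ch <- EventIn // fdnotifier stand-in: FD became readable
	}()

	fmt.Println(waitForReady(ch) == EventIn)
}
```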
Change https://go.dev/cl/473656 mentions this issue:
Currently Go's scheduler is work conserving, meaning that it maximizes utilization of CPU resources: if there is new work and an idle P, we always wake that P to run the new work, rather than leaving work runnable while Ps sit idle.
In general, this is a good property that prevents us from getting in odd edge cases of under-utilizing CPU resources when we have work that could fill them. However, this is not without cost. Most notably, waking a thread to run a P is a heavyweight operation. If the work completes quickly, the thread may need to go back to sleep, which is another heavyweight operation.
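As a toy illustration of the rule above, here is a standalone model (not the runtime's actual `wakep` code; the `sched` type and its counters are invented for this sketch) showing how work conservation forces a heavyweight wakeup for every work item that arrives while a P is idle:

```go
package main

import "fmt"

// sched is a simplified model of the work-conserving rule: whenever
// new work arrives and an idle P exists, wake a thread to run it,
// regardless of how short-lived the work turns out to be.
type sched struct {
	idlePs   int // idle Ps available to run work
	runnable int // runnable goroutines with no P
	wakes    int // heavyweight thread wakeups performed
}

// submit models readying one goroutine. Work conservation means we
// never leave a goroutine runnable while a P sits idle.
func (s *sched) submit() {
	s.runnable++
	if s.idlePs > 0 {
		s.idlePs--
		s.runnable--
		s.wakes++ // futex(FUTEX_WAKE): costly if the work is short
	}
}

func main() {
	s := &sched{idlePs: 4}
	for i := 0; i < 10; i++ {
		s.submit()
	}
	// Every idle P triggered a wakeup; the rest of the work queues.
	fmt.Println(s.wakes, s.runnable)
}
```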
This inefficiency has been a significant source of overhead in applications I've profiled. It is a contributing factor in #18237 and #21827, the target of #32113, and probably other issues I can't find at the moment.
There are several relevant measurements here w.r.t. the cost of scheduling new work:

- `wake`: the cost of waking a thread to run the new work.
- `wait`: how long the new work would wait before an already-running thread picks it up.
- `run`: how long the new work runs.
- `sleep`: the cost of the woken thread going back to sleep once the work completes.

Some observations here:

- If `wake > wait`, then waking a thread is useless. It is worse for latency and CPU usage. However, this case is likely quite rare.
- If `wake + sleep ≫ run`, then waking a thread is quite wasteful w.r.t. CPU usage. There is no objective cutoff here, because latency is likely still better than `wait`, but we are paying quite a high price for better latency.

This last case is what I'd like to improve; we could make a trade-off that we'll accept higher latency in these cases in exchange for better CPU efficiency by not waking a thread (immediately) for the new work, thus dropping work conservation.
A prototype of this concept was created by @nixprime several months ago. It did this via a "latency target" knob in the scheduler defining how long we were willing to let `wait` get before waking a thread. This prototype achieved very solid improvements (+15% productivity for some applications) with very small target latencies (~20-40us).

Such a knob would be difficult to set intuitively, but 20-40us latencies are sufficiently low that I believe we may be able to make this a permanent, knob-less feature of the scheduler, perhaps with additional heuristics to avoid the worst cases (e.g., predicting that `run` will be very high?).

This needs more investigation, prototyping, and experimenting.
cc @aclements @mknyszek