runtime: allow short-term drop of work conservation to increase CPU efficiency #54622

Open
prattmic opened this issue Aug 23, 2022 · 1 comment


prattmic commented Aug 23, 2022

Currently Go's scheduler is work conserving, meaning that it maximizes utilization of CPU resources. That is, if there is new work and an idle P, we will always wake that P to run the new work, rather than allowing runnable work and idle Ps.

In general, this is a good property that prevents us from getting in odd edge cases of under-utilizing CPU resources when we have work that could fill them. However, this is not without cost. Most notably, waking a thread to run a P is a heavyweight operation. If the work completes quickly, the thread may need to go back to sleep, which is another heavyweight operation.

This inefficiency has been a significant source of overhead in applications I've profiled. It is a contributor to #18237 and #21827, the target of #32113, and probably other issues I can't find at the moment.

There are several relevant measurements here w.r.t. cost of scheduling new work:

  • The time to wake a sleeping thread to run this work. (wake)
  • Alternatively, the time to wait for the work to schedule on an already running thread if we don't wake. (wait)
  • The time the work will run for. (run)
  • The time to put a thread with nothing to do to sleep. (sleep)

Some observations here:

  • If wake > wait, then waking a thread is useless. It is worse for latency and CPU usage. However, this case is likely quite rare.
  • If wake + sleep ≫ run, then waking a thread is quite wasteful w.r.t. CPU usage. There is no objective cutoff here, because latency is likely still better than wait, but we are paying quite a high price for better latency.

This last case is what I'd like to improve; we could make a trade-off that we'll accept higher latency in these cases in exchange for better CPU efficiency by not waking a thread (immediately) for the new work, thus dropping work conservation.

A prototype of this concept was created by @nixprime several months ago. It added a "latency target" knob to the scheduler defining how long we are willing to let wait grow before waking a thread. This prototype achieved very solid improvements (+15% productivity for some applications) with very small target latencies (~20-40us).

Such a knob would be difficult to set intuitively, but 20-40us latencies are sufficiently low that I believe we may be able to make this a permanent, knob-less feature of the scheduler, perhaps with additional heuristics to avoid the worst cases (e.g., predicting that run will be very high?).

This needs more investigation, prototyping, and experimenting.

cc @aclements @mknyszek

@prattmic prattmic added Performance NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. compiler/runtime Issues related to the Go compiler and/or runtime. labels Aug 23, 2022
@prattmic prattmic added this to the Backlog milestone Aug 23, 2022
@prattmic prattmic self-assigned this Aug 23, 2022

d0rc commented Sep 15, 2022

Being able to tune such a cutoff for a given goroutine could help ease situations where `runtime.findrunnable` is eating too much CPU.

copybara-service bot pushed a commit to google/gvisor that referenced this issue Oct 3, 2022
Before this CL, using `kernel.Task.Block*` to wait for host FD readiness
requires going through fdnotifier, resulting in a significant amount of
overhead. As an example, consider an application's blocking `recvmsg` on a
hostinet socket:

- The task goroutine invokes `recvmsg` and gets `EAGAIN`.

- The task goroutine heap-allocates a `waiter.Entry` and a channel of
  `waiter.EventMask`, and invokes `epoll_ctl` to add the socket FD to
  fdnotifier's epoll FD.

- The task goroutine invokes `recvmsg` and gets `EAGAIN`, again.

- The task goroutine blocks in Go (on the channel select in
  `kernel.Task.block`). If the thread that was running the task goroutine can
  find idle goroutines to run, then it does so; otherwise, it invokes
  `futex(FUTEX_WAIT)` to block in the host.

  Note that the vast majority of the sentry's "work" consists of executing
  application code, during which the corresponding task goroutines appear to
  the Go scheduler to be blocked in host syscalls; furthermore, time that *is*
  spent executing sentry code (in Go) is overhead relative to the application's
  execution. Consequently, the sentry has relatively little Go code to execute
  and is generally optimized to have less, making this tradeoff less favorable
  than in (presumably) more typical Go programs.

- When the socket FD becomes readable, fdnotifier's goroutine returns from
  `epoll_wait` and wakes the task goroutine, usually invoking
  `futex(FUTEX_WAKE)` to wake another thread. It then yields control of its
  thread to other goroutines, improving wakeup-to-execution latency for the
  task goroutine.

  The `futex(FUTEX_WAKE)` is skipped if any of the following are true:

  - `GOMAXPROCS` threads are already executing goroutines. For reasons
    described above, we expect this to occur infrequently.

  - At least one already-running thread is in the "spinning" state, because it
    was itself recently woken but has not yet started executing goroutines.

  - At least one already-running thread is in the "spinning" state, because it
    recently ran out of goroutines to run and is busy-polling before going to
    sleep.

  A "spinning" thread stops spinning either because it successfully busy-polls
  for an idle goroutine to run, or it times out while busy-polling in the
  latter case; in the former case the thread usually invokes
  `futex(FUTEX_WAKE)` to wake *another* thread as described above, and in the
  latter case the thread invokes `futex(FUTEX_WAIT)` to go to sleep.

- The task goroutine invokes `recvmsg` and succeeds.

- The task goroutine invokes `epoll_ctl` to remove the socket FD from
  fdnotifier's epoll FD.

This CL demonstrates how fdnotifier may be replaced by making host syscalls
from task goroutine context. After this CL, after per-thread initialization
(`sigprocmask`), the same scenario results in:

- The task goroutine invokes `recvmsg` and gets `EAGAIN`.

- The task goroutine invokes `ppoll` on the host FD, which returns when the
  socket FD becomes available.

  The Go runtime maintains a thread called "sysmon" which runs periodically.
  When this thread determines that another thread has been blocked in a host
  syscall for "long enough" (20-40us + slack) and there are idle goroutines to
  run, it steals that thread's runqueue and invokes `futex(FUTEX_WAKE)` to wake
  another thread to run the stolen runqueue.

- The task goroutine invokes `recvmsg` and succeeds.

For now, this functionality is only used in hostinet where socket methods are
responsible for blocking; applying it more generally (e.g. to `read(2)` from
hostinet sockets) requires additional work to move e.g. `read(2)` blocking from
`//pkg/sentry/syscalls/linux` into file description implementations.

Some of the overheads before this CL are tractable without removing fdnotifier.
The `sleep` package only requires one allocation - of `sleep.Waker` - per
registration, and the `syncevent` package requires none. `EventRegister` can
return the last known readiness mask to avoid the second `recvmsg`. However,
the interactions with the Go runtime - and in particular the many `FUTEX_WAKE`s
we incur when waking goroutines due to `ready() -> wakep() -> schedule() ->
resetspinning() -> wakep()` - are not. The leading alternative solutions to the
same problem are `sleep.Sleeper.AssertAndFetch` and "change the Go runtime".
The former is very dubiously safe; it works by transiently lying about
`runtime.sched.nmspinning`, a global variable, and not calling
`runtime.resetspinning()` when it stops lying, so side effects start at
"`runtime.wakep()` is disabled globally rather than only on the caller's
thread" and go from there. The latter is dubiously tractable for reasons
including the atypicality of the sentry described above, though see
golang/go#54622.

PiperOrigin-RevId: 478129654