runtime: allow short-term drop of work conservation to increase CPU efficiency #54622

Open
prattmic opened this issue Aug 23, 2022 · 1 comment


prattmic commented Aug 23, 2022

Currently Go's scheduler is work conserving, meaning that it maximizes utilization of CPU resources. That is, if there is new work and an idle P, we will always wake that P to run the new work, rather than allowing runnable work and idle Ps.

In general, this is a good property that prevents us from getting in odd edge cases of under-utilizing CPU resources when we have work that could fill them. However, this is not without cost. Most notably, waking a thread to run a P is a heavyweight operation. If the work completes quickly, the thread may need to go back to sleep, which is another heavyweight operation.

This inefficiency has been a significant source of overhead in applications I've profiled. It is a contributor to #18237 and #21827, the target of #32113, and probably other issues I can't find at the moment.

There are several relevant measurements here w.r.t. cost of scheduling new work:

  • The time to wake a sleeping thread to run this work. (wake)
  • Alternatively, the time to wait for the work to schedule on an already running thread if we don't wake. (wait)
  • The time the work will run for. (run)
  • The time to put a thread with nothing to do to sleep. (sleep)

Some observations here:

  • If wake > wait, then waking a thread is useless. It is worse for latency and CPU usage. However, this case is likely quite rare.
  • If wake + sleep ≫ run, then waking a thread is quite wasteful w.r.t. CPU usage. There is no objective cutoff here, because latency is likely still better than wait, but we are paying quite a high price for better latency.

This last case is what I'd like to improve; we could make a trade-off that we'll accept higher latency in these cases in exchange for better CPU efficiency by not waking a thread (immediately) for the new work, thus dropping work conservation.

A prototype of this concept was created by @nixprime several months ago. It added a "latency target" knob to the scheduler defining how long we are willing to let wait grow before waking a thread. This prototype achieved very solid improvements (+15% productivity for some applications) with very small target latencies (~20-40us).

Such a knob would be difficult to set intuitively, but 20-40us latencies are sufficiently low that I believe we may be able to make this a permanent, knob-less feature of the scheduler, perhaps with additional heuristics to avoid the worst cases (e.g., predicting that run will be very high?).

This needs more investigation, prototyping, and experimenting.

cc @aclements @mknyszek

@prattmic prattmic added Performance NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. compiler/runtime Issues related to the Go compiler and/or runtime. labels Aug 23, 2022
@prattmic prattmic added this to the Backlog milestone Aug 23, 2022
@prattmic prattmic self-assigned this Aug 23, 2022

d0rc commented Sep 15, 2022

Being able to tune such a cutoff for a given goroutine could help ease situations where `runtime.findrunnable` is eating too much CPU.

copybara-service bot pushed a commit to google/gvisor that referenced this issue Oct 3, 2022
Before this CL, using `kernel.Task.Block*` to wait for host FD readiness
requires going through fdnotifier, resulting in a significant amount of
overhead. As an example, consider an application's blocking `recvmsg` on a
hostinet socket:

- The task goroutine invokes `recvmsg` and gets `EAGAIN`.

- The task goroutine heap-allocates a `waiter.Entry` and a channel of
  `waiter.EventMask`, and invokes `epoll_ctl` to add the socket FD to
  fdnotifier's epoll FD.

- The task goroutine invokes `recvmsg` and gets `EAGAIN`, again.

- The task goroutine blocks in Go (on the channel select in
  `kernel.Task.block`). If the thread that was running the task goroutine can
  find idle goroutines to run, then it does so; otherwise, it invokes
  `futex(FUTEX_WAIT)` to block in the host.

  Note that the vast majority of the sentry's "work" consists of executing
  application code, during which the corresponding task goroutines appear to
  the Go scheduler to be blocked in host syscalls; furthermore, time that *is*
  spent executing sentry code (in Go) is overhead relative to the application's
  execution. Consequently, the sentry has relatively little Go code to execute
  and is generally optimized to have less, making this tradeoff less favorable
  than in (presumably) more typical Go programs.

- When the socket FD becomes readable, fdnotifier's goroutine returns from
  `epoll_wait` and wakes the task goroutine, usually invoking
  `futex(FUTEX_WAKE)` to wake another thread. It then yields control of its
  thread to other goroutines, improving wakeup-to-execution latency for the
  task goroutine.

  The `futex(FUTEX_WAKE)` is skipped if any of the following are true:

  - `GOMAXPROCS` threads are already executing goroutines. For reasons
    described above, we expect this to occur infrequently.

  - At least one already-running thread is in the "spinning" state, because it
    was itself recently woken but has not yet started executing goroutines.

  - At least one already-running thread is in the "spinning" state, because it
    recently ran out of goroutines to run and is busy-polling before going to
    sleep.

  A "spinning" thread stops spinning either because it successfully busy-polls
  for an idle goroutine to run, or it times out while busy-polling in the
  latter case; in the former case the thread usually invokes
  `futex(FUTEX_WAKE)` to wake *another* thread as described above, and in the
  latter case the thread invokes `futex(FUTEX_WAIT)` to go to sleep.

- The task goroutine invokes `recvmsg` and succeeds.

- The task goroutine invokes `epoll_ctl` to remove the socket FD from
  fdnotifier's epoll FD.

This CL demonstrates how fdnotifier may be replaced by making host syscalls
from task goroutine context. After this CL, after per-thread initialization
(`sigprocmask`), the same scenario results in:

- The task goroutine invokes `recvmsg` and gets `EAGAIN`.

- The task goroutine invokes `ppoll` on the host FD, which returns when the
  socket FD becomes available.

  The Go runtime maintains a thread called "sysmon" which runs periodically.
  When this thread determines that another thread has been blocked in a host
  syscall for "long enough" (20-40us + slack) and there are idle goroutines to
  run, it steals that thread's runqueue and invokes `futex(FUTEX_WAKE)` to wake
  another thread to run the stolen runqueue.

- The task goroutine invokes `recvmsg` and succeeds.

For now, this functionality is only used in hostinet where socket methods are
responsible for blocking; applying it more generally (e.g. to `read(2)` from
hostinet sockets) requires additional work to move e.g. `read(2)` blocking from
`//pkg/sentry/syscalls/linux` into file description implementations.

Some of the overheads before this CL are tractable without removing fdnotifier.
The `sleep` package only requires one allocation - of `sleep.Waker` - per
registration, and the `syncevent` package requires none. `EventRegister` can
return the last known readiness mask to avoid the second `recvmsg`. However,
the interactions with the Go runtime - and in particular the many `FUTEX_WAKE`s
we incur when waking goroutines due to `ready() -> wakep() -> schedule() ->
resetspinning() -> wakep()` - are not. The leading alternative solutions to the
same problem are `sleep.Sleeper.AssertAndFetch` and "change the Go runtime".
The former is very dubiously safe; it works by transiently lying about
`runtime.sched.nmspinning`, a global variable, and not calling
`runtime.resetspinning()` when it stops lying, so side effects start at
"`runtime.wakep()` is disabled globally rather than only on the caller's
thread" and go from there. The latter is dubiously tractable for reasons
including the atypicality of the sentry described above, though see
golang/go#54622.

PiperOrigin-RevId: 478129654