
runtime: eliminate the notion of a "syscall state" #58492

Open
CannibalVox opened this issue Feb 13, 2023 · 10 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

@CannibalVox

CannibalVox commented Feb 13, 2023

Abstract

Prevent longer-than-microsecond syscalls from causing excessive context-switch churn by eliminating the syscall state altogether. Goroutines will no longer enter a special syscall state when making syscalls or cgo calls. Instead, the syscall will be executed by a separate syscall thread while the original goroutine sits in an ordinary parked state. Every Go M will now consist of a primary thread and a syscall thread.

Background

There are several ongoing issues with scheduler performance related to decisions to scale up or down the number of OS threads (Ms) used for executing goroutines. In #54622, the case is laid out that unnecessarily raising threads for a brief boost in workload can have undesirable performance implications. Effectively, this task identifies that the largest extant performance issues in the scheduler today are related to unnecessary thread creation and destruction. However, spinning up threads as a result of syscalls can have even more serious performance implications than those identified in that issue:

  • Moving the P the syscall originally came from to a different M can cause a context switch.
  • If there is not enough work to sustain an additional P when the syscall returns (which is almost certainly the case), then the scheduler is extremely likely to return Go to the state it was in before the offending syscall. This means spinning down the M that performed the syscall, in addition to any intermediate states that might be traversed before Go eventually decides to return to a single P.
  • If longer syscalls are made repeatedly in a Go program, this can cause a very high percentage of system CPU to be dedicated to context switches and thread orchestration.

This usage pattern was recently revealed to be an issue in #58336, in which it appears that Windows network calls via WSARecv/WSASend are blocking rather than nonblocking. A simple Go network proxy running on Windows will perform thousands of context switches per second because long calls repeatedly change which M the program's two Gs run on. It does not do this on other operating systems, where those network calls are nonblocking, which allows the G to return to the P it came from without a new M being provisioned.
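
For concreteness, a proxy of roughly this shape reproduces the pattern being described (this is only a sketch; the listen and backend addresses are placeholders):

```go
// Minimal TCP proxy sketch. On Windows, each io.Copy below ends up in blocking
// WSARecv/WSASend calls, so every read/write pushes the calling G through the
// syscall path described in this issue.
package main

import (
	"io"
	"log"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:8080") // placeholder listen address
	if err != nil {
		log.Fatal(err)
	}
	for {
		client, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go func(client net.Conn) {
			defer client.Close()
			backend, err := net.Dial("tcp", "127.0.0.1:9090") // placeholder backend address
			if err != nil {
				return
			}
			defer backend.Close()
			go io.Copy(backend, client) // each copy blocks in network read/write syscalls
			io.Copy(client, backend)
		}(client)
	}
}
```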

Generally speaking, spinning up a new thread for the syscall state is always a problem; the Go team has previously chosen to address it by exempting short stints in the syscall state from this behavior. By doing so, they have separated syscall behavior into three classes:

  • In the most common case, nonblocking syscalls return within nanoseconds and do not cause any problems because no new threads are created.
  • In the second-most-common case, blocking syscalls are made rarely and the unnecessary context switch goes mostly unnoticed.
  • In the last case, frequent blocking syscalls, Go performance becomes untenable.

Proposal

I propose that every M be created with two threads instead of one: a thread for executing Go code and a thread for executing syscalls. When a goroutine attempts to execute a syscall, it will be carried out on the syscall thread while the original goroutine stays in a completely ordinary parked state. Other goroutines that attempt to carry out syscalls during this time will park while waiting for the syscall thread to become available. Additionally, if there are other Ps whose syscall threads have less traffic, they could choose to steal Gs that have syscall work.

This will ensure that while longer syscalls will occupy shared syscall resources, which may become saturated, they will not cause M flapping or context switching. In a more advanced design, syscall thread contention could be used as a metric for P scaling, and that would be much easier to measure and respond to than the current situation, where long syscalls spin up additional Ms that don't easily fit into the existing scheduler architecture and must be dealt with after the fact.
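
As a rough user-space illustration of the shape of this design (not runtime code; syscallRequest and startSyscallThread are names invented for this sketch, and the example syscall is Linux-only), the requesting goroutine parks on a channel while a thread-locked helper runs the call:

```go
// Sketch: syscall work is handed to a dedicated OS thread while the requesting
// goroutine parks, instead of the requesting thread entering a syscall state.
package main

import (
	"fmt"
	"runtime"
	"syscall"
)

type syscallRequest struct {
	fn   func() (uintptr, error) // the syscall to run
	done chan struct{}           // closed when the result is ready
	ret  uintptr
	err  error
}

// startSyscallThread is a stand-in for the per-M syscall thread.
func startSyscallThread() chan<- *syscallRequest {
	reqs := make(chan *syscallRequest)
	go func() {
		runtime.LockOSThread() // pin this goroutine to its own OS thread
		for req := range reqs {
			req.ret, req.err = req.fn()
			close(req.done) // unpark the requesting goroutine
		}
	}()
	return reqs
}

func main() {
	reqs := startSyscallThread()
	req := &syscallRequest{
		fn: func() (uintptr, error) {
			// Example syscall (Linux): getpid; any blocking call would work the same way.
			pid, _, errno := syscall.Syscall(syscall.SYS_GETPID, 0, 0, 0)
			if errno != 0 {
				return 0, errno
			}
			return pid, nil
		},
		done: make(chan struct{}),
	}
	reqs <- req
	<-req.done // the caller parks here rather than entering a syscall state
	fmt.Println("pid:", req.ret, "err:", req.err)
}
```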

Rationale

The biggest problem with syscall and cgo performance today is that the threads created by long syscalls do not have any place within the Go scheduler's understanding of itself. It has a very tightly tuned understanding of how many Ms should be running, and there is no way for it to respond appropriately to a new M suddenly being dumped into the middle of the scheduler, which is what long syscalls do.

Additionally, while moving the P to a new M after a syscall passes the threshold allows the 90% case to perform very well, it also guarantees a context switch in the 10% case, which is often unacceptable. In order to have a guaranteed path to a context-switch-free syscall, we need a way for syscalls to be handled without pulling the existing M away from the P. That means there must be some sort of dedicated thread for syscalls, somewhere.

Alternatives

Also considered was the idea of a thread pool that lives outside the M/P/G scheduler architecture and is used to process syscalls. The thread pool would consist of a stack of threads, scaling between 1 and GOMAXPROCS threads, and a queue of syscall requests. New threads would be added when wait times on the queue passed a certain threshold, and threads would be removed on the garbage-collection cadence in the same way sync.Pool items are, using a victim list to remove unused threads and eventually spin them down.

While idle threads would make up a much lower percentage of total program resources, and this design is more flexible in the face of syscall contention, it would require much more complicated orchestration. It also has a problem with OS-locked threads, since there is no way to guarantee that the same thread services syscalls for a particular P. This problem could be solved by having syscalls on OS-locked threads execute inline instead of via the pool (OS-locked threads technically never needed the syscall state, since there are no other waiting Gs when such a goroutine is running a syscall), but this would require a much larger scope of changes within the scheduler.
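
A rough sketch of what that pool could look like (illustrative only; syscallPool is a name invented here, and the queue-wait-time scaling and victim list are omitted):

```go
// Sketch of the pool alternative: a bounded set of OS-locked worker goroutines
// drains a shared queue of syscall closures while callers park until completion.
package main

import (
	"fmt"
	"runtime"
	"time"
)

type syscallPool struct {
	queue chan func()
}

func newSyscallPool() *syscallPool {
	p := &syscallPool{queue: make(chan func())}
	// The alternative described above scales between 1 and GOMAXPROCS workers;
	// this sketch simply starts the maximum up front.
	for i := 0; i < runtime.GOMAXPROCS(0); i++ {
		go func() {
			runtime.LockOSThread() // each worker owns a dedicated OS thread
			for fn := range p.queue {
				fn()
			}
		}()
	}
	return p
}

// do runs fn on a pool thread and parks the caller until it completes.
func (p *syscallPool) do(fn func()) {
	done := make(chan struct{})
	p.queue <- func() {
		fn()
		close(done)
	}
	<-done
}

func main() {
	pool := newSyscallPool()
	pool.do(func() {
		time.Sleep(10 * time.Millisecond) // stand-in for a long blocking syscall
	})
	fmt.Println("syscall finished on a pool thread")
}
```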

Another alternative would be to tune the scheduler to prefer placing goroutines that have recently made long-running syscalls onto their own P and to avoid spinning that P down until some time has passed since the last long syscall. We would then choose not to create a new M during long syscalls in cases where the origin P has no additional Gs to serve, even if the syscall extended past the threshold. This has the following downsides:

  • Today, programs have GOMAXPROCS Ps available for running Go at all times, because long syscalls are removed from the scheduler while they are running. This change would remove available Ps equal to the number of goroutines that interact with long syscalls. If there are more goroutines that interact with long syscalls than GOMAXPROCS, we would be back where we started in terms of context switching and M thrashing.
  • It seems to me that “time since last long syscall” is not a very good measure of long-syscall contention or throughput, and running long syscalls on a dedicated resource would make it easier to measure whether there is contention and whether it can be reduced by Ps stealing work.

Compatibility

Because this is a change to an internal system, it would not cause language compatibility issues. Additionally, while performance characteristics for large programs without long-running syscalls (and this is most Go programs) would change slightly, adding even a few dozen idle threads would not make a measurable difference in Go performance. On the other hand, an entire class of Go applications, including network-heavy applications on Windows, would suddenly perform much better.

Late edit: It just occurred to me that another class of Go programs would perform much worse unless #21827 is addressed: parking goroutines on OS-locked threads tends to create context switches of its own. Alternatively, the very inflammatory title of this issue could be changed, and the syscall state could be kept to indicate "I am currently waiting on the syscall thread to work".

@gopherbot gopherbot added this to the Proposal milestone Feb 13, 2023
@seankhliao seankhliao changed the title proposal: runtime: Eliminate The Syscall State proposal: runtime: eliminate the syscall state Feb 13, 2023
@seankhliao
Member

cc @golang/runtime

@mknyszek
Contributor

Moving this out of proposal. (In the past we have phrased these kinds of internal changes as proposals but I think we've stopped doing that as the proposal process became more of an actual process. And given that all the changes here would be internal, I don't see a reason why this needs to go through the proposal review process. This is more about the merits of the implementation anyway.)

@mknyszek mknyszek modified the milestones: Proposal, Unplanned Feb 13, 2023
@mknyszek mknyszek added compiler/runtime Issues related to the Go compiler and/or runtime. and removed Proposal labels Feb 13, 2023
@mknyszek mknyszek changed the title proposal: runtime: eliminate the syscall state runtime: eliminate the notion of a "syscall state" Feb 13, 2023
@prattmic
Member

prattmic commented Feb 13, 2023

In #54622, the case is laid out that unnecessarily raising threads for a brief boost in workload can have undesirable performance implications. Effectively, this task identifies that the largest extant performance issues in the scheduler today are related to unnecessary thread creation and destruction.

For clarification, the issue is not unnecessary thread creation and destruction, but unnecessary thread wake and sleep. Most programs reach a steady state of thread count fairly quickly (we ~never destroy threads). It is the wakeup of an idle thread and subsequent sleep when that thread has nothing else to do that is expensive.

IIUC, this proposal introduces a wakeup of the syscall thread for every syscall (unless the syscall thread is already running). I suspect that this would result in a significant performance degradation for most programs, even if it improves the tail case for long syscalls.

In #54622, thread sleep is particularly expensive because the Go runtime does so much work trying to find something to do prior to sleep. This proposal wouldn't have that problem; the conditions for the syscall thread to sleep would be much simpler. But I still think the OS-level churn of requiring a thread wakeup (a several microsecond ordeal) just to make any syscall will be a non-starter.

@prattmic
Member

Compatibility

Users often get/set various bits of thread-specific state via syscall.Syscall and having them fetch from a different thread would break those use cases.

That said, the scheduler can migrate goroutines between threads at any time, so I think we could argue this only matters for goroutines that called runtime.LockOSThread. Those would need to make syscalls directly on the calling thread.
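
To illustrate the kind of thread affinity being referred to (a Linux-only sketch; the kernel thread ID is just one example of per-thread state):

```go
// Sketch: state keyed to the calling thread is only meaningful if the syscall
// actually runs on the thread the goroutine is locked to.
package main

import (
	"fmt"
	"runtime"
	"syscall"
)

func main() {
	runtime.LockOSThread() // this goroutine now owns one OS thread
	defer runtime.UnlockOSThread()

	tid := syscall.Gettid() // per-thread state: the kernel thread ID
	// Any per-thread attribute set from here (priority, affinity, signal mask, ...)
	// applies to this specific tid. If the runtime transparently ran syscalls on a
	// separate syscall thread, calls like this would observe or mutate the wrong
	// thread's state.
	fmt.Println("locked to OS thread with tid", tid)
}
```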

@CannibalVox
Author

IIUC, this proposal introduces a wakeup of the syscall thread for every syscall (unless the syscall thread is already running). I suspect that this would result in a significant performance degradation for most programs, even if it improves the tail case for long syscalls.

The intent was that the threads would be live and waiting on some sort of sync primitive rather than needing to be resumed.

Users often get/set various bits of thread-specific state via syscall.Syscall and having them fetch from a different thread would break those use cases.

I would expect syscall.Syscall to be executed on the syscall thread, so for OS-locked threads, thread context would all be present in the same place: the syscall thread.

@CannibalVox
Author

But I still think the OS-level churn of requiring a thread wakeup (a several microsecond ordeal) just to make any syscall will be a non-starter.

Additionally, not to put too fine a point on it, but this is already the plan for syscalls that take longer than a microsecond.

@prattmic
Member

The intent was that the threads would be live and waiting on some sort of sync primitive rather than needing to be resumed

Could you be more specific about what you mean here? The main options I can think of here are:

  1. Busy loop
  2. Busy loop with PAUSE instruction
  3. Loop calling sched_yield (or equivalent syscall)
  4. Block in futex (or other wake-able syscall)

1 and 2 burn CPU continuously (2 slightly more efficiently), 3 burns CPU unless the system is fully loaded, and 4 requires a wake-up (and is what I was referring to).
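
For a rough feel for the cost of option 4 (sketch only; a Go channel round trip between two goroutines is a stand-in for the futex park/wake pair, and the measured number varies by machine and load):

```go
// Sketch: measure the round-trip cost of waking a parked, thread-locked worker
// and parking again until it responds.
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	const iters = 100000
	req := make(chan struct{})
	resp := make(chan struct{})
	go func() {
		runtime.LockOSThread() // stand-in for a dedicated syscall thread
		for range req {
			resp <- struct{}{}
		}
	}()

	start := time.Now()
	for i := 0; i < iters; i++ {
		req <- struct{}{} // wake the worker
		<-resp            // park until it responds
	}
	fmt.Printf("%v per wake/park round trip\n", time.Since(start)/iters)
}
```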


@CannibalVox
Author

CannibalVox commented Feb 13, 2023

The intent was that the threads would be live and waiting on some sort of sync primitive rather than needing to be resumed

Could you be more specific about what you mean here? The main options I can think of here are:

  1. Busy loop
  2. Busy loop with PAUSE instruction
  3. Loop calling sched_yield (or equivalent syscall)
  4. Block in futex (or other wake-able syscall)

1 and 2 burn CPU continuously (2 slightly more efficiently), 3 burns CPU unless the system is fully loaded, and 4 requires a wake-up (and is what I was referring to).

I understand now. I guess the faster wakeups in Go primitives are due to the fact that the P stays in motion continuously.

It's safe to say that the design as written won't work, then, but that mainly pushes me toward the alternatives. As you identified, waking and sleeping a thread with every syscall is fairly untenable: Go is in a state right now where network communication on Windows has massive performance issues because it does exactly that. Having one thread burn CPU per P is unacceptable, but having one thread total do it, plus others for short periods at the tail end of a burst, is not. The current situation is fairly dire.

@prattmic
Member

I certainly agree that the bad cases of syscall churn could use improvement. I haven't had a chance to look closely at #58336, but it seems like that provides a good example case.

@prattmic prattmic added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Feb 14, 2023