
runtime: eliminate the notion of a "syscall state" #58492

@CannibalVox


Abstract

Prevent longer-than-microsecond syscalls from causing excessive context change churn by eliminating the syscall state altogether. Goroutines will no longer enter into a special syscall state when making syscalls or cgo calls. Instead, the syscall will be executed by a separate syscall thread while the original goroutine is in an ordinary parked state. All Go Ms will now consist of a primary thread and a syscall thread.

Background

There are several ongoing issues with scheduler performance related to decisions to scale the number of OS threads (Ms) used for executing goroutines up or down. In #54622, the case is laid out that unnecessarily raising threads for a brief boost in workload can have undesirable performance implications. Effectively, that issue identifies that the largest extant performance issues in the scheduler today are related to unnecessary thread creation and destruction. However, spinning up threads as a result of syscalls can have much more serious performance implications even than those identified there:

  • Moving the P the syscall originally came from to a different M can cause a context switch
  • If there is not enough work to sustain an additional P when the syscall returns (which is almost certainly the case), then the scheduler is extremely likely to return Go to the state it was in before the offending syscall. This means spinning down the M that performed the syscall, in addition to any intermediate states that might be traversed before Go eventually decides to return to a single P.
  • If longer syscalls are made repeatedly in a Go program, this can cause a very high percentage of system CPU to be dedicated to context switches and thread orchestration.

This usage pattern was recently revealed to be an issue in #58336, in which it appears that Windows network calls via WSARecv/WSASend are blocking rather than nonblocking. A simple Go network proxy run on Windows will perform thousands of context switches per second due to long calls repeatedly changing which M the program’s 2 G’s are run on. It does not do this on other operating systems, as those network calls are nonblocking there, which allows the G to return to the P it came from without a new M being provisioned.

Generally speaking, the behavior of spinning up a new thread for the syscall state is always a problem; the Go team has previously chosen to address it by ensuring that short stints in the syscall state do not engage in this behavior. By doing so, they have separated syscall behavior into three classes:

  • In the most common case, nonblocking syscalls return within nanoseconds and do not cause any problems, because no new threads are created.
  • In the second-most-common case, blocking syscalls are made rarely and the unnecessary context switch goes mostly unnoticed.
  • In the last case, frequent blocking syscalls, Go performance becomes untenable.

Proposal

I propose that every M be created with two threads instead of one: a thread for executing Go code and a thread for executing syscalls. When a goroutine attempts to execute a syscall, it will be carried out on the syscall thread while the original goroutine will stay in a completely ordinary parked state. Other goroutines that attempt to carry out syscalls during this time will park while waiting on the syscall thread to become available. Additionally, if there are other Ps with syscall threads that have less traffic, they could choose to steal G’s that have syscall work.

This will ensure that while longer syscalls will occupy shared syscall resources, which may become saturated, they will not cause M flapping or context switching. In an advanced case, syscall thread contention could be used as a metric for P scaling, and that would be much easier to measure and respond to than the situation right now, where long syscalls spin up additional Ms that don’t easily fit into the existing scheduler architecture and must be dealt with after the fact.
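The shape of the proposal can be approximated in user space with a dedicated OS-locked worker and channel-based parking. This is only a sketch under stated assumptions: the names (syscallServer, do) are hypothetical, and a real implementation would live inside the runtime rather than on top of it.

```go
package main

import (
	"fmt"
	"runtime"
)

// syscallServer models the proposed per-M syscall thread: a dedicated
// OS thread that executes syscalls submitted by goroutines, while each
// submitting goroutine parks in an ordinary channel wait.
type syscallServer struct {
	work chan func()
}

func newSyscallServer() *syscallServer {
	s := &syscallServer{work: make(chan func())}
	go func() {
		// Pin the worker to its own OS thread, standing in for the
		// proposal's dedicated syscall thread attached to an M.
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()
		for fn := range s.work {
			fn()
		}
	}()
	return s
}

// do submits a "syscall" and parks the caller until it completes.
// If the syscall thread is busy, additional callers queue up on the
// unbuffered work channel, matching the "park while waiting on the
// syscall thread" behavior described above.
func (s *syscallServer) do(syscall func() int) int {
	done := make(chan int)
	s.work <- func() { done <- syscall() }
	return <-done // an ordinary parked state: a plain channel receive
}

func main() {
	s := newSyscallServer()
	n := s.do(func() int { return 42 }) // stand-in for a blocking syscall
	fmt.Println(n)
}
```

The key property is that however long the submitted call blocks, no M is created or destroyed and no P changes hands; the cost is confined to the dedicated thread.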

Rationale

The biggest problem with syscall and cgo performance today is that the threads created by long syscalls do not have any place within the Go scheduler’s understanding of itself. The scheduler has a very tightly tuned understanding of how many Ms should be running, and there is no way for it to respond appropriately to a new M suddenly being dumped into the middle of it, which is what long syscalls do.

Additionally, while moving the P to a new M after a syscall passes the threshold allows the 90% case to perform very well, it also guarantees a context switch in the 10% case, which is often unacceptable. In order to have a guaranteed route for a context-switch-free syscall, we need a route for syscalls to be handled without pulling the existing M away from the P. That means that there must be some sort of dedicated thread for syscalls, somewhere.

Alternatives

Also considered was the idea of a thread pool that lives outside of the M/P/G scheduler architecture and is used to process syscalls. The thread pool would consist of a stack of threads, which would scale between 1 and GOMAXPROCS threads, and a queue of syscall requests. New threads would be added when wait times on the queue passed a certain threshold, and threads would be removed on the garbage collector cadence in the same way items in a sync.Pool are, using a victim list to remove unused threads and eventually spin them down.

While idle threads would make up a much lower percentage of total program resources, and this approach is more flexible under syscall contention, it would require much more complicated orchestration. It also has a problem with OS-locked threads, since there is no way to guarantee that the same thread services syscalls for a particular P. This could be solved by having syscalls on OS-locked threads execute inline instead of via the pool (OS-locked threads technically never needed the syscall state, since there are no other waiting Gs when a goroutine is running a syscall), but this would require a much larger scope of changes within the scheduler.
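A minimal sketch of this alternative follows, with the wait-time-based scaling and the GC-cadence victim list elided for brevity: the pool simply starts at its GOMAXPROCS maximum, and all names are hypothetical.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// syscallPool is a toy version of the thread-pool alternative: a set of
// OS-thread-pinned workers, capped at GOMAXPROCS, draining one shared
// queue of syscall requests that lives outside the M/P/G scheduler.
type syscallPool struct {
	queue chan func()
	wg    sync.WaitGroup
}

func newSyscallPool() *syscallPool {
	p := &syscallPool{queue: make(chan func(), 64)}
	for i := 0; i < runtime.GOMAXPROCS(0); i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			runtime.LockOSThread() // each worker owns an OS thread
			for fn := range p.queue {
				fn()
			}
		}()
	}
	return p
}

// do enqueues a request and parks the caller until a worker runs it.
func (p *syscallPool) do(syscall func() int) int {
	done := make(chan int)
	p.queue <- func() { done <- syscall() }
	return <-done
}

func (p *syscallPool) close() {
	close(p.queue)
	p.wg.Wait()
}

func main() {
	p := newSyscallPool()
	sum := 0
	for i := 1; i <= 4; i++ {
		i := i
		sum += p.do(func() int { return i })
	}
	p.close()
	fmt.Println(sum)
}
```

The OS-locked-thread problem discussed above is visible here: any idle worker may pick a request off the shared queue, so a G that must run its syscalls on one specific thread cannot use the pool.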

Another alternative would be to tune the scheduler to prefer to place goroutines that have recently made long-running syscalls into their own P and avoid spinning it down until some time has passed since the last long syscall. We would then choose not to create a new M during long syscalls in cases where the origin P has no additional G’s to serve, even if the syscall extended past the threshold. This has the following downsides:

  • Today, programs have GOMAXPROCS P’s available for running Go at all times, because long syscalls are removed from the scheduler while they are running. This change would remove available P’s equal to the number of goroutines that interact with long syscalls. If there are more goroutines that interact with long syscalls than GOMAXPROCS, we would be back where we started in terms of context switching and M thrashing.
  • It seems to me that “time since last long syscall” is not a very good measure of long syscall contention or throughput, and running long syscalls on a dedicated resource would make it easier to measure whether there is contention and whether it can be reduced by P’s stealing work.

Compatibility

Because this is a change to an internal system, it would not cause language compatibility issues. Additionally, while performance characteristics for large programs without long-running syscalls would change slightly (and this is most Go programs), adding even a few dozen idle threads would not make a measurable difference in Go performance. On the other hand, an entire class of Go applications would suddenly perform much better, including network-heavy applications on Windows.

Late edit: It just occurred to me that another class of Go programs would perform much worse unless #21827 is addressed: parking goroutines on OS-locked threads tends to create context switches itself. Alternatively, the very inflammatory title of this issue could be changed, and the syscall state could be kept to indicate "I am currently waiting on the syscall thread to work".

Metadata

Labels: NeedsInvestigation (someone must examine and confirm this is a valid issue and not a duplicate of an existing one), compiler/runtime (issues related to the Go compiler and/or runtime)