Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: runtime: allow N goroutines to be simultaneously locked to the same OS thread #49848

Closed
aarzilli opened this issue Nov 29, 2021 · 14 comments
Labels
Projects
Milestone

Comments

@aarzilli
Copy link
Contributor

@aarzilli aarzilli commented Nov 29, 2021

Summary

The aim of this proposal is to extend the LockOSThread mechanism so that it is possible to create a N:1 association between N goroutines and 1 OS thread, by adding the following function to the runtime package

func LockToRecvOSThread[T any](ch <- chan T)

When LockToRecvOSThread is called the calling goroutine will be locked to the same thread as the goroutine receiving on ch, which is assumed to be already locked to a OS thread.

Programs could use this interface to work around the performance problems described in issue #21827, in some cases.

Motivation

Operating systems and C libraries will sometimes insist that all calls be made from the same OS thread. Examples of this include macOS's GUI API, OpenGL and ptrace on Linux.

This is introduces some difficulties in Go, where goroutines are normally scheduled randomly on a pool of OS threads. The recommended way to deal with such APIs is to call LockOSThread, which will create a 1:1 association between a goroutine and a thread.

Users of LockOSThread however also opt out of Go's concurrency paradigm, all calls to the special API have to happen in the same goroutine and the solution to the problem has to be written in a sequential style, even if a concurrent solution would be otherwise easier to implement.

This leads users of LockOSThread to sometime create what I will call a "syscall server" goroutine:

func syscallServer() chan <- func() {
	ch := make(chan func())
	go func() {
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()
		for fn := range ch {
			fn()
		}
	}()
	return ch
}

This arrangement allows programs that need to use LockOSThread to be written in a normal concurrent style. For example, Delve has one such facility and github.com/faiface/mainthread is a library implementing this technique in a general way.

The problem with this approach is that the interaction of LockOSThread and channel send imposes a much greater performance penalty in the program than it would be expected from simply using a channel: normal use of a channel is fast because it doesn't involve the OS scheduler and the Go scheduler has special optimized paths for it. But when LockOSThread is involved both the full cost of Go scheduler and the full cost of the OS scheduler are paid every time a send on the channel happens.

On some workloads this overhead can be as high as 60% of the execution time.

The proposed LockToRecvOSThread API would eliminate the need of a syscall server and allow programs that need to use a API that needs LockOSThread to place as many goroutines as they need on the blessed thread, improving their performance.

Case study: Delve

It could be argued that the syscall server is simply a common anti-pattern: you picked the wrong level of granularity for a costly operation (which you could have reasonably expected to be costly).

However I have one example, Delve, where (I think) getting rid of the syscall server would reduce the code quality.

Delve supports call injection, it's possible for the user to write something like:

(dlv) call f(x) + i

and Delve will evaluate the f(x) + i expression by resuming execution of the target program as needed to evaluate the call to functionf. When this happens three goroutines are primarily involved:

  • the aforementioned "syscall server" goroutine
  • a goroutine running the event loop of Delve, which takes care of stop, resume and breakpoint handling (which could happen concurrently to the call injection)
  • a goroutine running a recursive AST interpreter for the expression passed to call, this goroutine will occasionally ask the "event loop" goroutine to resume the target process to progress call injections.

Both the event loop goroutine and interpreter goroutine will issue requests to the syscall server to read and write memory and to resume execution of the target program.

To rewrite this so that there is no need for a syscall server the expression interpreter would have to be rewritten, likely in a continuation passing style, which would make it considerably less understandable.

Behavior of LockToRecvOSThread

The goroutine calling LockToRecvOSThread will be locked to the same goroutine as the goroutine that is currently receiving on ch:

  • if ch is nil or closed LockToRecvOSThread will panic
  • if no goroutine is receiving from ch LockToRecvOSThread will wait until one appears
  • if the goroutine calling LockToRecvOSThread is already locked to a thread LockToRecvOSThread will panic
  • if the goroutine receiving from ch did not call LockOSThread, LockToRecvOSThread will panic

The thread lock acquired by LockToRecvOSThread can be released by calling UnlockOSThread.

Possible problems / notes

  • Because of how LockOSThread is currently implemented this proposal requires a non-trivial change to Go scheduler, a complex, performance sensitive piece of code that few people understand well.

  • It could be that this change will degrade the performance of other programs.

  • It could be that the assumptions about what causes the negative interaction between channel operations and LockOSThread are wrong and this does not solve the problem.

  • It could be that most programs that have the problem described in issue #21827 can be easily refactored to have a different granularity of requests to the "syscall server" goroutine and this is only actually useful to Delve.

  • It could be that the problem in issue #21827 can be solved by other optimizations, that do not require new API and user program changes.

  • The proposed API requires programs that wish to use LockToRecvOSThread to "leak" one channel and one goroutine. It is possible to imagine different APIs, for example:

      GetLockedOSThread() *OSThread
      LockToSpecificOSThread(*OSThread)
    

    In my opinion the proposed API has a smaller surface area and is less ambiguous as to what happens to the OS thread when UnlockOSThread is called, and the resource loss of one goroutine and one channel is negligible.

@gopherbot gopherbot added this to the Proposal milestone Nov 29, 2021
@bcmills
Copy link
Member

@bcmills bcmills commented Nov 29, 2021

When LockToRecvOSThread is called the calling goroutine will be locked to the same thread as the goroutine receiving on ch,

There is in general no guarantee that any goroutine is currently “receiving on” ch. (For example, one might have just received a value and unparked.) How would you specify the behavior in that case?

More generally, though, I think this API is too channel-specific. There are also valid synchronization patterns one might want to use with a thread-local API that do not involve channels: for example, one could use an atomic.Value or sync.Mutex wrapping a queue of callbacks, along with a channel or sync.Cond that is closed or signaled when the queue becomes non-empty.

The problem with this approach is that the interaction of LockOSThread and channel send imposes a much greater performance penalty in the program than it would be expected from simply using a channel: normal use of a channel is fast because it doesn't involve the OS scheduler and the Go scheduler has special optimized paths for it. But when LockOSThread is involved both the full cost of Go scheduler and the full cost of the OS scheduler are paid every time a send on the channel happens.

Can you give some more detail on this? Would it be feasible to optimize the compiler and/or runtime to avoid this cost?

@ianlancetaylor ianlancetaylor changed the title Proposal: runtime: allow N goroutines to be simultaneously locked to the same OS thread proposal: runtime: allow N goroutines to be simultaneously locked to the same OS thread Nov 29, 2021
@ianlancetaylor ianlancetaylor added this to Incoming in Proposals Nov 29, 2021
@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Nov 29, 2021

In your example of Delve, it's not obvious to me that forcing the event loop and the AST interpreter to share a thread will be an overall win. It means that while the AST interpreter is running the program will not pay attention to events. I'm concerned that the performance will not be as desired, and that the next step will be a request for some sort of goroutine prioritization.

That said, like @bcmills I don't think an API based on channels makes sense here. What about something more like

// A ThreadLocker permits multiple goroutines to be locked to a single thread.
// Values of type ThreadLocker must be obtained by calling MakeThreadLocker.
type ThreadLocker struct { /* unexported fields */ }

// MakeThreadLocker returns a new ThreadLocker.
func MakeThreadLocker() *ThreadLocker

// Lock the currently running goroutine to the ThreadLocker. If this is the first goroutine to be locked to
// this ThreadLocker, then the goroutine will be locked to the currently running thread, and the currently
// running thread will be associated with this ThreadLocker.
// Otherwise, the goroutine will be locked to the thread associated with this ThreadLocker,
// meaning that it will be immediately preempted, and that when the method returns the goroutine
// will be running on the thread.
// A thread associated with a ThreadLocker will only run goroutines locked to that thread;
// if there are no such goroutines ready to run, the thread will wait until one is available.
// This panics if the ThreadLocker was not created by MakeThreadLocker.
func (*ThreadLocker) Lock()

// Unlock the currently running goroutine from the ThreadLocker. The goroutine will be immediately
// preempted, and when the method returns the goroutine will be running on a different thread.
// If this is the last goroutine running on the associated thread, that thread will exit.
// The same happens when the last locked goroutine returns or calls Goexit.
// This panics if the current goroutine is not locked to this ThreadLocker.
func (*ThreadLocker) Unlock()

@aarzilli
Copy link
Contributor Author

@aarzilli aarzilli commented Nov 29, 2021

When LockToRecvOSThread is called the calling goroutine will be locked to the same thread as the goroutine receiving on ch,

There is in general no guarantee that any goroutine is currently “receiving on” ch. (For example, one might have just received a value and unparked.) How would you specify the behavior in that case?

It would wait until there is one. Other details of the behavior are in the "Behavior of LockToRecvOSThread" section.

More generally, though, I think this API is too channel-specific. There are also valid synchronization patterns one might want to use with a thread-local API that do not involve channels: for example, one could use an atomic.Value or sync.Mutex wrapping a queue of callbacks, along with a channel or sync.Cond that is closed or signaled when the queue becomes non-empty.

Nothing in this proposal, that I can see, prevents other synchronization from being used by goroutines locked to the same OS thread. The LockToRecvOSThread call is just a way to extend the OS thread lock to more goroutines. Also, I'm not particularly married to this specific API, as far as I am concerned it is just one of several reasonable ones.

The problem with this approach is that the interaction of LockOSThread and channel send imposes a much greater performance penalty in the program than it would be expected from simply using a channel: normal use of a channel is fast because it doesn't involve the OS scheduler and the Go scheduler has special optimized paths for it. But when LockOSThread is involved both the full cost of Go scheduler and the full cost of the OS scheduler are paid every time a send on the channel happens.

Can you give some more detail on this? Would it be feasible to optimize the compiler and/or runtime to avoid this cost?

See:

I find the assessment in #18023 convincing.

@aarzilli
Copy link
Contributor Author

@aarzilli aarzilli commented Nov 29, 2021

In your example of Delve, it's not obvious to me that forcing the event loop and the AST interpreter to share a thread will be an overall win. It means that while the AST interpreter is running the program will not pay attention to events. I'm concerned that the performance will not be as desired, and that the next step will be a request for some sort of goroutine prioritization.

There is no parallelism between the even loop and the AST interpreter: the event loop only runs when the target process is running and the AST interpreter only runs when it isn't. I have experimented with running the event loop entirely inside the LockOSThread goroutine and I do get the expected performance improvement (as long as there is no AST interpreter, otherwise it will deadlock when it needs to communicate with it).

That said, like @bcmills I don't think an API based on channels makes sense here. What about something more like

This looks fine to me too.

@rsc
Copy link
Contributor

@rsc rsc commented Dec 1, 2021

I used to program in a system that let you arrange programs this way, and it was easy to end up with deadlocks, because the one "goroutine" in a thread was blocked in a system call and stopping the other from running (and maybe it was going to wake up the system call by its actions).

We designed LockOSThread to avoid this problem by making it a single goroutine.

If there are performance issues, it sounds like we should fix the runtime scheduler, not introduce new API that might lead to worse problems.

@aarzilli
Copy link
Contributor Author

@aarzilli aarzilli commented Dec 1, 2021

Would this opinion change if the performance issue was an inevitable consequence of the design of LockOSThread?

@rsc
Copy link
Contributor

@rsc rsc commented Dec 1, 2021

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc rsc moved this from Incoming to Active in Proposals Dec 1, 2021
@aarzilli
Copy link
Contributor Author

@aarzilli aarzilli commented Dec 6, 2021

I made a simple experiment, I rewrote the benchmark of #21827 in go and a rough equivalent in C with pthreads. With Go, using a LockOSThread server I get 8µs per server request with C I get 5µs per ping/pong.

So there is some optimization that could be done in the scheduler (or in the channel implementation), but even assuming it could be as fast as the C implementation it would still be much slower than the version that does not use LockOSThread, which runs at 0.5µs per request.

@rsc
Copy link
Contributor

@rsc rsc commented Dec 8, 2021

It's unfortunate if LockOSThread has a performance problem, but the vast majority of programs don't need LockOSThread, and there are significant semantic problems with this, so it doesn't seem like we should do this.

@rsc rsc moved this from Active to Likely Decline in Proposals Dec 8, 2021
@rsc
Copy link
Contributor

@rsc rsc commented Dec 8, 2021

Based on the discussion above, this proposal seems like a likely decline.
— rsc for the proposal review group

@aarzilli
Copy link
Contributor Author

@aarzilli aarzilli commented Dec 9, 2021

I think it's worth pointing out that the system call server, which this is aiming to replace, is just as much prone to deadlocks. So this would replace slow deadlock prone code with fast deadlock prone code.
Also asynchronous preemption is a mitigating factor here, that didn't exist when LockOSThread was originally added to the language.

@bcmills
Copy link
Member

@bcmills bcmills commented Dec 9, 2021

I think it's worth pointing out that the system call server, which this is aiming to replace, is just as much prone to deadlocks. So this would replace slow deadlock prone code with fast deadlock prone code.

That may be true, but debuggability is also a concern: today it is viable to use goroutine dumps and/or static or dynamic analyzers to diagnose a deadlock involving channel communication, whereas debugging the proposed API would require substantial new debugging tools.

@aarzilli
Copy link
Contributor Author

@aarzilli aarzilli commented Dec 9, 2021

That may be true, but debuggability is also a concern: today it is viable to use goroutine dumps and/or static or dynamic analyzers to diagnose a deadlock involving channel communication, whereas debugging the proposed API would require substantial new debugging tools.

I think detecting this already requires a substantial amount of application specific investigation. What you would see is the server goroutine blocked on a system call and one or more other goroutines waiting to to send on the server's channel.
This is normal, every time you look at the program you will see something like this, what you need to know is what's going to make the blocked syscall return and what the goroutines waiting for the server are going to do.
If anything it's going to be easier to determine which goroutines are waiting to run on the locked thread, compared to having to examine the caller frames of runtime.chansend.

@rsc rsc moved this from Likely Decline to Declined in Proposals Dec 15, 2021
@rsc
Copy link
Contributor

@rsc rsc commented Dec 15, 2021

No change in consensus, so declined.
— rsc for the proposal review group

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Proposals
Declined
Development

No branches or pull requests

5 participants