runtime,iter: the interaction of Pull with runtime.LockOSThread is currently buggy, and generally not well-defined #65889

Closed
mknyszek opened this issue Feb 22, 2024 · 57 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FixPending Issues that have a fix which has not yet been reviewed or submitted. NeedsFix The path to resolution is known, but the work has not been done. release-blocker
Milestone

Comments

@mknyszek
Contributor

Problem

The current implementation of iter.Pull in the rangefunc experiment creates a new goroutine and uses an optimized goroutine switch path during iteration.

The problem with this implementation, as it stands, is that it interacts poorly with runtime.LockOSThread. If the goroutine spawned for the iterator (let's call it the iterator goroutine) calls runtime.LockOSThread, first and foremost, that call will not be respected. For example, if the next function is passed to a different goroutine, the runtime will not schedule the iterator goroutine on the thread it was locked to. Bad things might happen. Simultaneously, even in the easy case, where the iterator goroutine always switches with the same calling goroutine on the same thread, once that iterator goroutine exits, it will take down the current thread with it. That is expected, but it happens before the calling goroutine is placed back into the scheduler's ownership. The calling goroutine instead stays blocked forever, with no hope of ever executing again.

One possible solution

The most straightforward fix is to fall back to a slower kind of goroutine switching if either the iterator goroutine or the calling goroutine locks itself to a thread. This slower switching would probably be done with channels, and would have semantics equivalent to the existing implementation of iter.Pull. Because at least one of the goroutines is locked to a thread, the overhead of OS context switching also comes into play, so this implementation would likely be dramatically slower.

This fix, then, has a really unfortunate performance cliff. One could imagine that a UI library (for example) locks a goroutine to a thread because it needs to interact with C libraries that require staying on the main thread. Using iter.Pull on such a goroutine would be a very, very bad idea if this workaround were adopted. The resulting performance regression would be very surprising.

The proposed solution

Discussing this with @aclements, @dr2chase, and @cherrymui, we have an alternative fix that avoids the performance cliff. Specifically, we would restrict iterators from calling runtime.LockOSThread, by panicking if it is called from an iterator goroutine (the runtime handles the iterator goroutine's creation, and thus can enforce this). Meanwhile, the calling goroutine (the caller of next, that is) is fully free to be locked to a thread.

If the iterator is prevented from locking itself to a thread from the iterator goroutine, then it's totally fine for that iterator goroutine to just execute on the locked calling goroutine's thread. That iterator goroutine effectively gets locked to the same thread while it's executing, and we know that it will eventually switch back to the calling goroutine. It's fine for the calling goroutine to send next to another goroutine as well, even if that other goroutine is locked, because it can only execute once the iterator goroutine has ceded control of the thread (by the semantics and implementation of iter.Pull).

One hazard of this approach is that the iterator, running on the iterator goroutine, may care about thread-local state, and threads locked to goroutines tend to be locked precisely because there is some important thread-local state that needs to be preserved. However, if the iterator goroutine really cared about thread-local state, it should call runtime.LockOSThread, which in this case will panic. That does mean the iterator is slightly restricted in what it can do, but we feel it's better for the code author to find that out up front than to debug some subtle issue where runtime.LockOSThread and thread-local state are not what they expect.

Furthermore, we think it's very unlikely that an iterator will want to lock itself to a thread anyway, so the extra restriction is unlikely to matter to the vast majority of Go users. Simultaneously, making this choice lets us keep iter.Pull fast for many more (more realistic) use cases, lets us avoid the pain of maintaining two implementations of iter.Pull, and prevents surprising performance cliffs for users of iter.Pull.

@mknyszek mknyszek added this to the Go1.23 milestone Feb 22, 2024
@mknyszek mknyszek added NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. compiler/runtime Issues related to the Go compiler and/or runtime. labels Feb 22, 2024
@mknyszek mknyszek changed the title iter: the interaction of Pull with runtime.LockOSThread is currently buggy, and generally not well-defined runtime,iter: the interaction of Pull with runtime.LockOSThread is currently buggy, and generally not well-defined Feb 22, 2024
@mknyszek
Contributor Author

CC @rsc

@thediveo
Contributor

Just to get this right, as I'm often working on system-related Go code dealing with Linux kernel namespaces that needs OS-level thread locking: the caller creating the iterator and asking to iterate needs to lock the thread, as the iterator itself cannot, correct?

Is there a way for the iterator to ensure its caller has locked the thread?

@prattmic
Member

Specifically, we would restrict iterators from being able to call runtime.LockOSThread by panicking if called from an iterator goroutine

I worry this is too restrictive, as some libraries call LockOSThread temporarily as part of normal operation. Being unable to use these libraries seems rather strange.

A few examples I have found:

Admittedly I haven't found a ton of cases like this, but from a user perspective it seems like a limitation quite orthogonal to for-loop iterators.

@mknyszek
Contributor Author

Just to get this right, as I'm often working on system-related Go code dealing with Linux kernel namespaces that needs OS-level thread locking: the caller creating the iterator and asking to iterate needs to lock the thread, as the iterator itself cannot, correct?

Correct. The caller creating the iterator and any caller of next or stop is allowed to be locked to the thread. Just the iter.Seq passed to iter.Pull would not be allowed.

Is there a way for the iterator to ensure its caller has locked the thread?

I'm not sure I understand. The iter.Seq implementation is executed on a separate goroutine when passed to iter.Pull. Because it's a different goroutine altogether, there's already no way to ensure anything about the calling goroutine's state. The iter.Seq implementation would have to lock itself to an OS thread if it needed to (which the proposed fix does not allow).

I worry this is too restrictive, as some libraries call LockOSThread temporarily as part of normal operation. Being unable to use these libraries seems rather strange.

That's a good point. You might not be in control of, or be aware of, what the library is doing. I think the glog example is especially interesting, though I'm not sure the git2go example makes a lot of sense to me (calling into C already temporarily binds the goroutine to the thread, and that will continue to work).

We could also punt the question, pick a more restrictive implementation to begin with, and see what happens in practice. We can always backwards-compatibly lift the restriction in the future.

Thinking out loud, could we have a slow implementation, and fall back to it only if the iterator goroutine locks itself to a thread? That might be OK? However, we then have a bit of a "worst of both worlds" situation. Not only do you have a performance cliff (it's just in a different place now), you still have the pitfall wherein the (unlocked) iterator goroutine may see thread state the code author didn't expect it to.

@thediveo
Contributor

Slow would be better than panic.

@prattmic
Member

I'm not sure the git2go example makes a lot of sense to me (calling into C already temporarily binds the goroutine to the thread, and that will continue to work).

I was confused as well initially. The problem here is that MakeGitError calls C.git_last_error(), which presumably uses something thread-local similar to errno.

	ecode := C.git_blame_options_init(&opts, C.GIT_BLAME_OPTIONS_VERSION)
	if ecode < 0 {
		return BlameOptions{}, MakeGitError(ecode)
	}

@chad-bekmezian-snap

IMO, it's better to start with something more restrictive (such as panicking for now), and then lift some of those restrictions over time as issues arise and solutions become clearer.

@eliasnaur
Contributor

Very interesting. I believe pull iterators may well alleviate the need for #64777 (and #64755), but I ran into issues with runtime.LockOSThread and Cgo callbacks. I documented my findings in #65946 in case they're not a duplicate of this issue.

@mknyszek
Contributor Author

mknyszek commented Feb 27, 2024

@aclements, @cherrymui, @dr2chase, and I discussed this offline a little more, and if, hypothetically, panics are off the table, I think we have an idea of what the implementation would have to look like.

If the calling goroutine is locked but the iterator goroutine isn't, that iterator goroutine can share the thread with the calling goroutine. If next moves to a different goroutine on a different thread, the iterator might run there, and that's fine. If the iterator really needs to read thread-local state, it must call runtime.LockOSThread. This case is still fast.

If the iterator goroutine calls runtime.LockOSThread, it immediately sets a flag in the coro implementation to fall back to a channel-based implementation. It then wakes the calling goroutine, which must check whether that flag is set. (This can be checked non-atomically, because the calling goroutine must be blocked when this happens.) The iterator goroutine then calls into the scheduler. Upon executing later, it locks itself to whatever thread it ends up running on. This ensures that the calling goroutine and the iterator goroutine, if both locked, are guaranteed to be locked to different threads.

Even if the calling goroutine is not locked to a thread, for simplicity, we'll always fall back to the channel-based implementation.

In short: the iterator goroutine can share threads with thread-locked goroutines, but the calling goroutine cannot.

The reason for this asymmetry is that you can pass the iterator goroutine's context to another thread, and it would be difficult to support a calling goroutine being able to share a thread with the iterator goroutine when the iterator goroutine is locked specifically. (The calling goroutine would have to be able to force itself to switch threads to the iterator goroutine's thread. This might require a substantial amount of work in the scheduler.)

We suspect that an iterator wanting to lock itself to a thread is less common in general than the calling goroutine being locked to a thread. There's still a performance pitfall here, but it hopefully shouldn't show up most of the time. That is, if a thread-locked goroutine is iterating over some straightforward all-in-Go data structures with iter.Pull (say, a pair of trees), that will still be fast.

@eliasnaur
Contributor

What happens when the iterator goroutine is switched? Will this program successfully hand over the OS main thread to a different goroutine?

package main

import (
	"fmt"
	"iter"
	"runtime"
)

func init() {
	// Lock main thread to main goroutine.
	runtime.LockOSThread()
}

func main() {
	next, _ := iter.Pull(switchThread)
	// Pass thread to iterator, and get back a different
	// thread.
	next()

	// Continue on a different, non-main thread.
	fmt.Println("successfully passed thread")
}

func switchThread(yield func(struct{}) bool) {
	// Unblock the waiting pull iterator with different
	// thread.
	go yield(struct{}{})

	// Perform main-thread work.
	fmt.Println("received thread")
}

If so, #64777 and #64755 can be closed.

@aarzilli
Contributor

My understanding is that this is what would happen

@eliasnaur
Contributor

eliasnaur commented Mar 3, 2024

My understanding is that this is what would happen

I don't understand how main can get the main thread back. switchThread is called on the locked main thread and doesn't give it up. The original proposal states:

That iterator goroutine effectively gets locked to the same thread while it's executing

But to your argument the proposal also states

and we know that it will eventually switch back to the calling goroutine.

which is not necessarily true with the go yield() hack.

@aarzilli
Contributor

aarzilli commented Mar 3, 2024

I don't understand how main can get the main thread back. switchThread is called on the locked main thread and doesn't give it up. The original proposal states

I was talking about @mknyszek's comment that immediately precedes yours.

@eliasnaur
Contributor

And of course, as long as the iterator goroutine runs on the locked main thread, I can force the runtime to let me keep it:

package main

import (
	"fmt"
	"iter"
	"runtime"
)

/*
static void main_thread_api(void) {
    switchBack();
    call_main_thread_api_that_never_returns();
}
*/
import "C"

func init() {
	runtime.LockOSThread()
}

func main() {
	next, _ := iter.Pull(switchThread)
	// Runs on the main thread.
	next()
	// Continues on some other thread.
	fmt.Println("successfully passed thread")
}

//export switchBack
func switchBack() {
	// Runs in a Cgo callback which effectively locks the main
	// thread so the runtime can't give it back through the call
	// to yield.
	go yieldMain(struct{}{})
}

var yieldMain func(struct{}) bool

func switchThread(yield func(struct{}) bool) {
	yieldMain = yield
	fmt.Println("received main thread, running main thread API")
	C.main_thread_api()
}

@eliasnaur
Contributor

I don't understand how main can get the main thread back. switchThread is called on the locked main thread and doesn't give it up. The original proposal states

I was talking about @mknyszek's comment that immediately precedes yours.

Which states:

If the calling goroutine is locked but the iterator goroutine isn't, that iterator goroutine can share the thread with the calling goroutine.

which to me seems to support my use-case: my calling goroutine is locked to the main thread, the iterator goroutine isn't, so they can (will?) share threads.

@aarzilli
Contributor

aarzilli commented Mar 3, 2024

I don't understand how main can get the main thread back. switchThread is called on the locked main thread and doesn't give it up. The original proposal states

I was talking about @mknyszek's comment that immediately precedes yours.

Which states:

If the calling goroutine is locked but the iterator goroutine isn't, that iterator goroutine can share the thread with the calling goroutine.

which to me seems to support my use-case: my calling goroutine is locked to the main thread, the iterator goroutine isn't, so they can (will?) share threads.

My understanding is that the iterator goroutine can run on the locked thread, but it isn't locked to it. The scheduler can still migrate it to a different thread if it wants to. So you don't have any guarantees as to which thread will execute any of the instructions in switchThread.

It would be weird if it worked the way you are proposing; it would mean that, in general, people writing iterators shouldn't pass yield to a different goroutine, because doing so could result in removing a thread lock.

To be honest, I'm now a bit worried about this thread sharing, @mknyszek: what happens if the body of switchThread calls something that blocks the thread (some blocking cgo code or a syscall)? Couldn't that potentially lead to a deadlock? (In theory the main thread should be able to run because yield was called, but it cannot because its locked thread is unavailable forever.)

@thediveo
Contributor

thediveo commented Mar 3, 2024

yield confusion

@eliasnaur
Contributor

I don't understand how main can get the main thread back. switchThread is called on the locked main thread and doesn't give it up. The original proposal states

I was talking about @mknyszek's comment that immediately precedes yours.

Which states:

If the calling goroutine is locked but the iterator goroutine isn't, that iterator goroutine can share the thread with the calling goroutine.

which to me seems to support my use-case: my calling goroutine is locked to the main thread, the iterator goroutine isn't, so they can (will?) share threads.

My understanding is that the iterator goroutine can run on the locked thread, but it isn't locked to it. The scheduler can still migrate it to a different thread if it wants to. So you don't have any guarantees as to which thread will execute any of the instructions in switchThread.

It would be weird if it worked the way you are proposing; it would mean that, in general, people writing iterators shouldn't pass yield to a different goroutine, because doing so could result in removing a thread lock.

I see. Your analysis makes sense, but then I'm surprised that the iterator does not run on the same thread as the caller, regardless of iter.Pull. In particular:

package main

import "runtime"

func main() {
	runtime.LockOSThread()

	it := func(yield func() bool) {
		// Surprisingly, this body of code may run on
		// a completely different thread.
	}
	for range it {
	}
}

@aarzilli
Contributor

aarzilli commented Mar 4, 2024

I see. Your analysis makes sense, but then I'm surprised that the iterator does not run on the same thread as the caller, regardless of iter.Pull

No, iter.Pull would be what makes the difference, IIUC. If you are using for range f, everything will run on the same goroutine (and therefore the same thread, if LockOSThread is used).

@mknyszek
Contributor Author

mknyszek commented Mar 4, 2024

As @aarzilli says, this does not affect for range f. You'd have to explicitly pass the function next returned by iter.Pull to a different goroutine.

To be honest, I'm now a bit worried about this thread sharing, @mknyszek: what happens if the body of switchThread calls something that blocks the thread (some blocking cgo code or a syscall)? Couldn't that potentially lead to a deadlock (in theory the main thread should be able to run because yield was called, but it cannot because its locked thread is unavailable forever).

That example is breaking my brain. I need time to digest that.

It may well be that having the iterator call into cgo or a syscall means we need to treat it the same as calling LockOSThread, meaning it'd be kicked to a different thread immediately.

@eliasnaur
Contributor

As @aarzilli says, this does not affect for range f. You'd have to explicitly pass the function next returned by iter.Pull to a different goroutine.

Are you saying that iter.Pull behaves differently than for range f? And only some of the time, in a nondeterministic way? In particular:

package main

import "runtime"
import "iter"

func main() {
	runtime.LockOSThread()

	it := func(yield func(struct{}) bool) {
		// Always on the main thread if called in a
		// `for range f`, (most of?) the time on the
		// main thread if called by `iter.Pull`.
	}

	for _ = range it {
	}

	next, _ := iter.Pull(it)
	for {
		next()
	}

	// Bonus question: what happens when
	// yield is called from another goroutine?
	it2 := func(yield func(struct{}) bool) {
		go yield(struct{}{})
	}
	for _ = range it2 {

	}
}

To be honest, I'm now a bit worried about this thread sharing, @mknyszek: what happens if the body of switchThread calls something that blocks the thread (some blocking cgo code or a syscall)? Couldn't that potentially lead to a deadlock (in theory the main thread should be able to run because yield was called, but it cannot because its locked thread is unavailable forever).

That example is breaking my brain. I need time to digest that.

It may well be that having the iterator call into cgo or a syscall means we need to treat it the same as calling LockOSThread, meaning it'd be kicked to a different thread immediately.

Yeah, considering cgo/syscalls as implicitly calling LockOSThread for the duration of their calls seems consistent.

@mknyszek
Contributor Author

mknyszek commented Mar 4, 2024

Are you saying that iter.Pull behaves differently than for range f? And only some of the time, in a nondeterministic way?

iter.Pull is defined as creating a new goroutine to run the iterator, so I'm not surprised it behaves differently. for range f, meanwhile, is defined as iterating on the same goroutine: it's just a bunch of function calls. iter.Pull has approximately the same semantics as passing values back and forth between goroutines across a channel (it's really more of an unbuffered channel used to synchronize a shared memory location, but close enough). It's just very tightly optimized, meaning that the goroutines hand control to each other directly instead of going through actual channels.

@eliasnaur
Contributor

Are you saying that iter.Pull behaves differently than for range f? And only some of the time, in a nondeterministic way?

iter.Pull is defined as creating a new goroutine to run the iterator

FWIW, the documentation doesn't mention goroutines:

 % GOEXPERIMENT=rangefunc go doc iter.Pull
package iter // import "iter"

func Pull[V any](seq Seq[V]) (next func() (V, bool), stop func())
    Pull converts the “push-style” iterator sequence seq into a “pull-style”
    iterator accessed by the two functions next and stop.

    Next returns the next value in the sequence and a boolean indicating whether
    the value is valid. When the sequence is over, next returns the zero V and
    false. It is valid to call next after reaching the end of the sequence
    or after calling stop. These calls will continue to return the zero V and
    false.

    Stop ends the iteration. It must be called when the caller is no longer
    interested in next values and next has not yet signaled that the sequence is
    over (with a false boolean return). It is valid to call stop multiple times
    and when next has already returned false.

    It is an error to call next or stop from multiple goroutines simultaneously.

In any case it seems better to have consistent behaviour between for range func and iter.Pull. Here's one way:

  • Guarantee that the iterator goroutine runs on the same thread as the calling goroutine, same as for for range func.
  • If the thread is locked, yield called from another goroutine will panic (similar to how calling yield after iteration has stopped also panics).

This has several advantages:

  • Same thread locking behaviour as for range func.
  • The yield panic guarantees that a locked thread cannot be stolen by the iterator goroutine, only temporarily transferred while the caller is blocked.
  • No need for a slow channel-based fallback for iter.Pull: calling LockOSThread from the iterator goroutine behaves the same as if the calling goroutine had called it.
  • Bonus: fixes #64777 (runtime/mainthread: new package to manage main thread).

@dr2chase
Contributor

I've been trying to figure out what should/shouldn't work and where something must change. I have a default bias that LockOSThread is the pig at the party here, but maybe that's wrong.

I think that these things should by-default work, with possible exceptions in the case of LockOSThread "somewhere":

  • a rangefunc iterator type Seq[V any] func(yield func(V) bool) is a function, and a function is a value; there are no restrictions on how it is used.
  • the result of calling iter.Pull[V any](seq Seq[V]) is a pair of functions, and functions are values; there are no restrictions on how they are used, other than "be sure to call stop() after done calling next()".
  • a goroutine that has already called LockOSThread should be able to use rangefunc iterators and Pull functions. In general there's no way to know where iterators might appear.

The last case can create problems, depending on why it called LockOSThread. Did the goroutine need it for reasons unrelated to the iteration (context), or was it because of iteration (iterator), or was it because the loop body needed to run on a particular OS thread (body)?

If it's just context, it doesn't matter what the range function does or where the loop body runs; they are incidental. And if Pull is called, that is not a problem either, whether it uses a goroutine or a coroutine, because the iterator is insensitive to goroutine/thread.

If it's the loop body, then Pull is not an issue because Pull works on iterators. In the default case of a rangefunc that does not explicitly do interesting things with goroutines (range functions can do things with goroutines, but it's not the default, and it looks like running yield, i.e. the loop body, in a different goroutine should perhaps be discouraged), this will also "just work": the range function runs in the same goroutine as the body and as its callers, so it all works there, too.

The interesting case is the iterator. I think this breaks down into three subcases, which differ in how they deal with undoing LockOSThread.

  1. the caller is expected to call LockOSThread, before the iteration, and then undo it afterwards. In this case, the need for LockOSThread and the special binding is not a secret, so the programmer is to some extent "on notice".
  2. the caller is not expected to call LockOSThread, but the same call that provides the range function also provides a dispose function, to be called after the iterator is done, that undoes the LockOSThread. There are two subcases; in both, there is a clue for the programmer that this is not a standard range function:
    a. Lock/UnlockOSthread occur in the caller's context
    b. the range function or its generator creates a separate goroutine, which is Locked, and communicates with it via channels.
  3. the rangefunc-maker implements cleanup with a finalizer. The programmer may not notice that anything special is going on, and might treat the iterator function like any other value. HOWEVER, finalizers run in a different goroutine; to make the OS-thread unlock happen, there must be an explicitly created goroutine that is locked to an OS thread, generates values and stuffs them into a channel, and at the same time waits on its own "go away" channel for notification that it is no longer needed and should unlock the OS thread; that is the channel the finalizer writes to. This means that the range function is actually insensitive to what goroutine it is run in.

In the first case, it's obvious that LockOSThread is needed, and thus the special rule applies, "don't pass the range function to other goroutines". This case is also sensitive to where the Pull function executes, so a coroutine implementation is preferred. 2a is the same.

For 2b and 3, because the locked goroutine is hidden within the range function, it doesn't matter where the range function itself runs.

I think the only tricky case is for an iterator that requires locking to an OS thread, but these are either obvious when they are created, or else insulate iterator value generation inside another goroutine.

It seems like a range function that runs "yield" (the loop body) in a different thread than the outer caller is officially a bad idea unless this is documented (i.e., it has a purpose).

I think it is also a bad idea if a range function calls LockOSThread (in the caller's goroutine, so, 2a) without unlocking before it returns to the caller. That implies a need for a balancing UnlockOSThread that is not necessarily provided. GoodStyle™ would be to lock the OS thread when the range function is created, not when it is called.

This is the best I've got on this much time. I'll think about it some more, but maybe other people will have better thoughts, improve it, or shoot it down.

@eliasnaur
Contributor

eliasnaur commented Apr 16, 2024

Thank you for the detailed example. I believe I've identified where our "coroutine" semantics differ. Hopefully, the following explanation is clearer.

First, assumptions:

  • There should be no observable difference between rangefunc and iter.Pull behaviour, except for performance.
  • Since goroutines have no IDs, the only observable difference is the underlying OS thread.
  • Since the Go runtime is free to move non-threadlocked goroutines among non-locked threads at any time, the only observable difference is the underlying locked thread, if any.

Then my proposal: let the locked thread (if any) and its lock count follow goroutine switches between caller and iterator, for both rangefunc and iter.Pull.

The consequences are:

  • No more than one goroutine has a particular thread locked at any one time, eliminating any deadlock conditions.
  • A locked thread follows execution, as expected: once LockOSThread is called by either caller or iterator, the execution trace has that thread locked from then on.
  • It's now possible to "steal" locked threads, e.g. if the caller calls LockOSThread and the iterator yields from another goroutine. This is perhaps surprising, but notably already possible with the current implementation of rangefunc:
func main() {
	// This iterator runs on the locked thread.
	rng := func(yield func(struct{}) bool) {
		go yield(struct{}{}) // Run the loop body on a different thread.
		select {} // Or a blocking Cgo function.
	}
	runtime.LockOSThread()
	for range rng {
		// The loop body is *not* running on the locked thread!
	}
}

Suppose goroutine A creates a pull function P, and passes it to goroutine B. B calls P, P calls LockOSThread and returns a value to B. B calls LockOSThread (just to be really unambiguous that B and P now both have a lock on the same OS thread), and acks A in a channel, and then blocks in a cgo call waiting for some almost-never-happen event, normally B would still be waiting there on process exit.

With my proposal, the following happens:

  • A creates a pull function P, and passes it to goroutine B. B calls P.
  • P calls LockOSThread and is locked to thread T1.
  • P returns a value (and the locked thread T1) to B and blocks in yield.
  • B continues locked to T1, calls LockOSThread which increments the lock count on T1.
  • B calls a Cgo function and is blocked forever (on T1).

(Alternate scenario for gluing B and P to the same thread: B is already LockOSThreaded, and P then calls LockOSThread when B calls it. Normally in this situation we would increment a counter.)

A calls LockOSThread. A calls P. What happens? I don't think we can get the OS Thread loose from the cgo call.

  • A calls LockOSThread and is locked to thread T2.
  • A calls P, thereby passing T2 to P, and blocks in next.
  • P continues locked on T2.

It's not a formal deadlock, but it's not great.

By passing locked threads between caller and iterator, LockOSThread deadlocks are no longer possible.

Another interesting case occurs if A is locked to an OS thread and creates two pull functions P and Q (and, for the sake of non-ambiguity, pulls a value out of each one, and they each also call LockOSThread), then hands P and Q to separate goroutines B and C, who attempt to call P and Q concurrently. We can time-slice those, of course, so no deadlock, but that's new and interesting for the Go scheduler.

  • A locks to thread T.
  • A calls P, passing T to it.
  • P runs locked on T, calls LockOSThread and increments the lock count of T.
  • P returns some value to A, passing T (and lock count 2) back to A.
  • A continues locked to T.

The same happens for Q, leaving A locked to T.

Now, B and C have no threads locked so P and Q will run on some arbitrary thread when called by them.

@dr2chase
Contributor

This seems problematic:

  • P calls LockOSThread and is locked to thread T1.
  • ...
  • A calls P, thereby passing T2 to P, and blocks in next.
  • P continues locked on T2.

Since P was previously locked to T1, and might have made that call expecting that subsequent calls would also be locked to T1, running on T2 seems wrong. However, in my "how is this used anyway?" scenario, the iterator inside the pull function "should not" have exported (left unreversed) a call to LockOSThread; that should have occurred in the creating context, perhaps, and this is even something we can check for in the Pull implementation and panic on.

Locking the OS thread when the iterator is created in A/T2 creates the opposite problem of "what happens in B/T1 when B calls P", and how do we distinguish that "bad" case from the "not bad" case where the iterator does not care that it is thread-locked, and just happens to be embedded within code that A called, and A was locked to T2 for other reasons?

I am inclined to assert that if there is an iterator that requires thread-locking to work properly, that it must be coded to hide that from its callers by creating its own goroutine, locking that to an OS thread, and using it to generate values, which are then passed to the yield function in the caller's context (that way, the calling function can also require thread-locking for the body, perhaps even to a different thread). Probably such an iterator would be paired with a dispose/stop function and might also have a finalizer, to recover the goroutine and locked OS thread.

So if I were to try to get closer to a proposal: Pull functions use coroutines, and if the locked count changes between entry and exit of the yield function (i.e., an unpaired lock/unlock), panic; and if an iterator (a range function) wishes to run on a locked OS thread, it must create a separate goroutine for that purpose, as the calling context is not guaranteed.

@eliasnaur
Contributor

I am inclined to assert that if there is an iterator that requires thread-locking to work properly, that it must be coded to hide that from its callers by creating its own goroutine, locking that to an OS thread, and using it to generate values, which are then passed to the yield function in the caller's context (that way, the calling function can also require thread-locking for the body, perhaps even to a different thread). Probably such an iterator would be paired with a dispose/stop function and might also have a finalizer, to recover the goroutine and locked OS thread.

This only applies to iterators that require a locked thread separate from their caller, correct? That is, iterators that must be locked to their caller's locked thread will just work like the equivalent rangefunc does.

So if I were to try to get closer to a proposal, Pull functions use coroutines, and if the locked count changes between entry and exit to the yield function (i.e., unpaired lock/unlock), panic,

Would rangefunc iterators also panic? If not, why not?

and if an iterator (a range function) wishes to run on a locked OS thread, it must create a separate goroutine for that purpose; the calling context is not guaranteed.

By "calling context is not guaranteed" do you mean "the calling context is what's calling next"? That is, the context is well-defined but under control of the caller and thus may change during each call to yield.

@dr2chase
Contributor

This only applies to iterators that require a locked thread separate from their caller, correct? That is, iterators that must be locked to their caller's locked thread will just work like the equivalent rangefunc does.

I think so, including "if you pass the range function to a different goroutine, it will see a different context" (don't do that if you want it to run in a constant context -- in this situation the programmer is/should-be aware that thread locking is in play).

Would rangefunc iterators also panic? If not, why not?

Probably not, and the reason is cost. We're one optimization and some boosted inlining away from range function iterators running really well, and all the checking code that gets inserted into the generated yield function constant-propagates away. Checking the locked-thread-state won't go away like that. We might include a more-checking mode for this, maybe we would always check it anyway.

By "calling context is not guaranteed" do you mean "the calling context is what's calling next"? That is, the context is well-defined but under control of the caller and thus may change during each call to yield.

Right.

The underlying problem here is that LockOSThread is probably the wrong primitive, it would have been better expressed as func WithLockedOSThread(runsInLockedContext func()), but, oops, mistakes were made. This would have made some of the problematic cases unsayable. We had the go-func and defer-func examples right there to copy, and we blew it, oh well.

One performance-related question -- if we declare that it is GoodStyle™ for range functions that need locking to create a separate locked goroutine and access that through channels, that's going to be even more expensive when passed to Pull. A problem with range function iterators is that they're opaque -- they don't provide the option of offering a cheaper implementation of Pull (e.g., a SliceIterator has a much cheaper implementation of Pull, but that's not available). It seems like in this case if you know that the iterator will require a thread locked goroutine, we might be better off inverting the relationship, have Pull be the default, and the range function derived from that. (I am not at all sure about this, but I don't like the pile of expensive layers.)

@dr2chase
Contributor

The other reason that rangefunc iterators likely won't panic is that the range function is written by a human and really is "just a function". The compiler rewrite of

for x := range somerangefunction {
   body
}

is

   <declare and initialize some state variables>
   yield := func(x T) bool {
      <fiddle with state variables>
      body // with tweaked rewrites for break/continue/goto/return
      return true
   }
   somerangefunction(yield)

The most we could check is that thread locking state doesn't change between initialization and each iteration of the loop body, but the bare range function is just a function.

And strictly speaking, nobody's required to use iter.Pull, a somewhat slower non-coroutine version would also work. I think the interesting question here is whether we want to sign up for the different, extra, and potentially useful semantics of passing OS-thread-locked state across a coroutine switch. Originally, this had just been a performance optimization.

@eliasnaur
Contributor

Would rangefunc iterators also panic? If not, why not?

Probably not, and the reason is cost. We're one optimization and some boosted inlining away from range function iterators running really well, and all the checking code that gets inserted into the generated yield function constant-propagates away. Checking the locked-thread-state won't go away like that. We might include a more-checking mode for this, maybe we would always check it anyway.

May I suggest not panic'ing for iter.Pull either? It's a subtle panic condition, it seems ok to require LockOSThread users to be careful, and most importantly I think it's important to be able to refactor rangefuncs into iter.Pull equivalents and be certain the behaviour is equal.

On the other hand, starting with the panic and relaxing it later is backwards compatible.

And strictly speaking, nobody's required to use iter.Pull, a somewhat slower non-coroutine version would also work.
I think the interesting question here is whether we want to sign up for the different, extra, and potentially useful semantics of passing OS-thread-locked state across a coroutine switch. Originally, this had just been a performance optimization.

iter.Pull is less useful if it doesn't match rangefunc behaviour. Go interpreters come to mind, as well as my reflect example.

In addition, starting with a non-context-passing iter.Pull cannot later be changed to pass context in a backwards compatible way.

@cherrymui
Member

I'm not sure I understand the reflect example. Could you explain more about it, maybe with a concrete example?

As I understand, rangefunc is just a function, so the way to use reflection to run a rangefunc loop is just Call the rangefunc, passing in the "body" written in terms of reflection.

@eliasnaur
Contributor

I'm not sure I understand the reflect example. Could you explain more about it, maybe with a concrete example?

Please ignore the reflect example; I failed to find an instance that is not a case of the general second example in my comment. That is, cases where (1) iter.Pull is required, and (2) the iterator must run on the same locked thread as the caller.

@timothy-king
Contributor

I am very sympathetic to wanting to implement range-over-func using iter.Pull. I spent a couple of weeks in December trying [unsuccessfully] to do this for x/tools/go/ssa and interp. I was eventually convinced that I was trying to pound a square peg through a round hole. x/tools/go/ssa has a proposed implementation based on function rewriting today.

The FAQ already covers that iter.Pull and similar approaches are not equivalent: https://go.dev/wiki/RangefuncExperiment#why-are-the-semantics-not-exactly-like-if-the-iterator-function-ran-in-a-coroutine-or-goroutine . Let me try to give an expanded version. This explanation is mostly outside of the scope of this issue, but I think it addresses what you are asking.

There are two main proposed semantics/implementations for range-over-func statements:

for x := range somerangefunction {
  body
}

The first implementation is function rewriting. This takes a range-over-function statement and constructs a new synthetic function that communicates across function boundaries via hidden state, calls the iterator function on the synthetic function, and reestablishes control flow after the function call. Adjusting what David had a bit:

   <declare and initialize some state variables>
   yield := func(x T) bool {
      <fiddle with state variables>
      body // with tweaked rewrites for break/continue/goto/return
      return true
   }
   somerangefunction(yield)
   <finish tweaked break/continue/goto/return and panicking>

What exactly those tweaks are is a bit complicated, but these should be mostly invisible to regular users. They are not invisible to compilers and interpreters, though. The overwhelming advantage of this approach is that it is as fast as users expect. This is mostly accomplished by inlining somerangefunction and yield. See https://go.dev/wiki/RangefuncExperiment#can-range-over-a-function-perform-as-well-as-hand-written-loops

The second implementation is to concurrently execute somerangefunction and to communicate between the concurrent routines. In practice, concurrent routines are implemented via goroutines in Go. This leads to a candidate semantics based around iter.Pull or similar.

next, stop := iter.Pull(somerangefunction) // execute somerangefunction in another goroutine
defer stop()
for x, done := next(); !done; x, done = next() {
  body
}
stop()

This is a bit oversimplified, but the key point is that somerangefunction executes concurrently with body in a different goroutine. What is great about iter.Pull is that it can be a library function. It does not need extra compiler help to change body or somerangefunction. This leads to the lowering above being simple.

There are an enormous number of additional possible tweaks to both of the above approaches (runtime.coro for example), but those are the two big categories of proposed semantics for range-over-func statements.

These semantics do differ. Stacks and goroutines are observable in Go. It is easier to see this with an example.

var count int
func somerangefunc(y func(x int) bool) {
  defer func() { if r := recover(); r != nil { count++; panic(r) } }()
  for y(rand.Int()) {}
}
func f() {
  defer func() { _ = recover() }()
  for x := range somerangefunc {
    switch x {
    case 1: panic(x)
    case 2: runtime.Goexit()
    case 3: break
    }
  }
}

Let's suppose rand.Int() returns 1 first, so panic(x) happens. What are the stacks at the panic(x) call in the different semantics?

  • Function rewriting has one stack [panic(x), yield, somerangefunc, f]
  • Concurrent execution has two stacks [panic(x), f] and [yield_pull, somerangefunc, iter.Pull]. (yield_pull is an internal lambda in iter.Pull.)

Because the stacks are different, the panic executes differently in these two cases.

  • Function rewriting: panic(x) goes through yield, then somerangefunc, count is incremented, r is rethrown, and the panic is recovered by f.
  • Concurrent execution: panic(x) is recovered by f. It does not go through yield_pull or somerangefunc. Count is not incremented.

So the value of count is an example of an observable difference in the two semantics. The root cause is that the goroutines and calls are structured differently in the two proposals, and this is observable.

It is also worth considering that a panic within the iterator function (somerangefunc in the example) plays out differently as well. One semantics terminates the program. runtime.Goexit() within the iterator function also plays out differently. Exercises left to the reader. This is not even getting into whether the runtime's stack traces are considered observable. The runtime library is particularly prone to having functions like LockOSThread that make them even more observable.

In the end, range-over-func was chosen to be equivalent to the function rewrite semantics. This is mostly due to efficiency (and partially debugging and partially robustness). So users of range-over-func should have the function rewrite semantics in mind when they use it (the simple version above), not iter.Pull. This gives iter.Pull a bit of a sharp edge that needs to be documented.

You can then ask whether anything can be done to narrow the divide. Maybe. But I would recommend against trying to do this on the issue tracker. You need to poke some fundamental stuff like "what is a stack in Go?" or "what if there were a second kind of coroutine in the runtime?" and do this in a backward-compatible fashion. I recommend taking a few days [or weeks] and trying to come up with something offline.

I hope this helped.


With background out of the way, on to specific questions.

#65889 (comment) :

Just like reflect.MapIter, some users would like to obtain range-over-func behaviour from reflection code without the explicit for {} loop. I imagine ssa/interp is one such user.

ssa/interp does not need this. The ssa package does the function rewrite so the interpreter did not need serious modification. It did need modification to handle defer elegantly. (Alternatives for the defer changes exist but they are really ugly.)

This is trivially achieved with a consistent iter.Pull, but very difficult otherwise.

It is not trivial. And it probably will be difficult. Range-over-function is a big language change enabled by compiler magic. This places a burden on tools sensitive to detailed semantics like compilers and interpreters.

FWIW ssa/builder.go is now ~17% more lines of code.

func f() {
	runtime.LockOSThread()
	lockedIter := func(yield func(Element) bool) {
		// This no longer runs on the locked thread!
	}
	next, stop := iter.Pull(lockedIter)
	defer stop()
	// Use next and process elements.
}

lockedIter is called from another goroutine than f. If you inline iter.Pull, the go statement becomes clear and this ceases to be surprising. So I would say this example works as intended. Yes, this is a sharp edge, and that is unfortunate.

For example, a Go interpreter should be able to faithfully implement rangefunc with iter.Pull.

I do not know how to do this. Maybe someone more clever than me has already solved this? Based on my own experience updating an interpreter, one can get something 90% of an implementation of range-over-func using iter.Pull, but it will not end up 100% equivalent to the function rewrite semantics. And interpreters need to be >= 99% to be useful. The stacks are different and trying to work around this is hard.

range-over-func is a big language change, and I expect interpreters to need to do a big adjustment. I cannot accurately predict what these will be for other interpreters. Interpreters likely need to understand the code transformation to understand the semantics of a range-over-func statement. A good place to start is the comments here https://go.googlesource.com/go/+/refs/heads/master/src/cmd/compile/internal/rangefunc/rewrite.go (and the tests in rangefunc_test.go). Interpreter maintainers can then determine what support they need from the Go standard library to implement range-over-func in their tool. They can file proposals requesting these features.

FWIW I suspect what they need is to attach state equivalent to #next to range-over-func yield functions, to attach meaning to #next based on the program locations, and handle the "<finish tweaked break/continue/go/return and panicking>" step based on this value. Again they need to understand the details.

iter.Pull is less useful if it doesn't match rangefunc behaviour.

Fair enough. But hopefully it is still useful enough.

That is, cases where (1) iter.Pull is required, and (2) the iterator must run on the same locked thread as the caller.

Maybe try to rephrase how you think of this requirement as 'On the statement go g() the function g must run on the same locked thread as the go g() statement'. I'll leave it to the folks that work on the runtime to address whether/how this is doable.

@eliasnaur
Contributor

Thank you very much for your detailed comment.

Your section about function rewriting and Go interpreters made me realize that function rewriting is irrelevant for iter.Pull in a basic sense: there's no control flow to rewrite.

Given iter.Pull is implemented with goroutines, the remaining question relevant to this issue is the LockOSThread behaviour.

That is, cases where (1) iter.Pull is required, and (2) the iterator must run on the same locked thread as the caller.

Maybe try to rephrase how you think of this requirement as 'On the statement go g() the function g must run on the same locked thread as the go g() statement'. I'll leave it to the folks that work on the runtime to address whether/how this is doable.

I tried and failed to understand this rephrasing. Can you elaborate? What I'm proposing is not that two goroutines must run on the same locked thread. I'm proposing that next and yield both act as cooperative scheduling points. That is, the thread entering next or yield is the same thread that resumes execution from the return of its counterpart. Calling next from (locked) thread T will see yield return (locked) on thread T and vice versa.

I don't have enough knowledge of the Go runtime to know whether such behaviour is easy to implement, but I presume the runtime scheduler is already involved every time a goroutine sends or receives from a channel, which is what next and yield do.

@timothy-king
Contributor

I'm proposing that next and yield both act as cooperative scheduling points. That is, the thread entering next or yield is the same thread that resumes execution from the return of its counterpart. Calling next from (locked) thread T will see yield return (locked) on thread T and vice versa.

Seems like I misunderstood what you were proposing. I will let others comment on the plausibility of implementing this. The internal coroswitch seems like a very special channel, but I don't know enough details.

I will point out that seq might call yield from another goroutine. So even if next and stop are guaranteed to switch to yield on T when LockOSThread is true, seq may still execute on a different goroutine after yield returns. seq is just some user-provided function, so it could do this. So a caller of iter.Pull(seq) that cares about LockOSThread would need to be aware of some details of the implementation of seq. Maybe good enough?

@eliasnaur
Contributor

I'm proposing that next and yield both act as cooperative scheduling points. That is, the thread entering next or yield is the same thread that resumes execution from the return of its counterpart. Calling next from (locked) thread T will see yield return (locked) on thread T and vice versa.

Seems like I misunderstood what you were proposing. I will let others comment on the plausibility of implementing this. The internal coroswitch seems like a very special channel, but I don't know enough details.

I will point out that seq might call yield from another goroutine. So even if next and stop are guaranteed to switch to yield on T when LockOSThread is true, seq may still execute on a different goroutine after yield returns. seq is just some user-provided function, so it could do this. So a caller of iter.Pull(seq) that cares about LockOSThread would need to be aware of some details of the implementation of seq. Maybe good enough?

Indeed. On the other hand without cooperative scheduling, a caller of iter.Pull that cares about LockOSThread would need to be aware of the different behaviour from rangefunc: from the perspective of a LockOSThread user, rangefunc function rewriting effectively cooperatively schedules between the caller and iterator.

So it seems to me it's a trade-off between two kinds of sharp edges.

@dr2chase
Contributor

dr2chase commented May 8, 2024

With the freeze approaching, we need to lock in an answer to this issue. Building on everyone's input, this post explains what we plan to implement for this release. We're sure this won't please everyone, but we believe this handles the major use cases efficiently, we know how to implement it in the next few weeks, and it's flexible enough that we believe it can be relaxed in the future if necessary.

One thing to keep in mind here is that LockOSThread is a spanner in the works; this is not first-class semantics, this is something that Go does to support thread-linked state in operating systems, which in turn is mostly an artifact of C lacking closures, the resulting proliferation of global state, and the collision of that global state with threads and multiprocessors. Explicit use of LockOSThread gets second-class support here. Iterators over state that requires LockOSThread will either not work with Pull, or will require an extra layer of pushing results back and forth to an additional (locked) goroutine.

The second thing to keep in mind is that Pull iterators, even with coroutines, are slower than plain iteration. Thus, we expect people to use Pull iterators only when necessary. Their primary use case is operations over 2 (or more) iterators, and because of their higher cost, it's likely that these binary (or more-ary) operations will not even be symmetric; they'll iterate normally over one operand, and Pull the others.

TLDR, what we propose to implement is:

At creation of the coro backing the iter.Pull, snapshot the thread lock count and (if non-zero) the thread. Every later coro switch can donate the locked thread across the switch, assuming the thread lock count and (if non-zero) thread match the snapshot; otherwise it throws.

The effect of this is that the caller of iter.Pull may be on a locked thread, as long as it doesn't lock or unlock the thread and then call next (though it may change the lock count transiently). And, likewise, the iterator function may lock or unlock the thread, as long as it doesn't do it around the call to yield.

The advantage of this proposal is that it maintains good performance composability. That is, Pull iterators remain fast in the presence of the two oblivious uses of LockOSThread.

  • If someone uses a Pull iterator in a library, and that library is used by some goroutine that happens to have already locked itself to an OS thread, that will all run well (quickly). The author of the library has no way of knowing this context, and the caller of the library has no way of knowing that they use Pull.
  • Or, if someone calls LockOSThread transiently, for example in a logging function, that will not force the coroutine to its own thread, and the pulled iteration will continue to be as efficient as possible.

The disadvantage of this proposal is that if someone wishes to write an iterator over some OS-thread state, and that iterator should work properly when converted to a Pull iterator, then either

  • the user of the iterator must know this so they can call LockOSThread before converting the iterator to a Pull iterator. This is somewhat ugly.
  • or, the iterator must create its own additional goroutine which calls LockOSThread, and pass control and values back and forth to that goroutine with channels. This is somewhat extra-slow.

And throw is not great, but this is not a recoverable error; if the thread locking state isn't right, the runtime is not in a correct state.

@thediveo
Contributor

thediveo commented May 8, 2024

One thing to keep in mind here is that LockOSThread is a spanner in the works; this is not first-class semantics, this is something that Go does to support thread-linked state in operating systems, which in turn is mostly an artifact of C lacking closures, the resulting proliferation of global state, and the collision of that global state with threads and multiprocessors. Explicit use of LockOSThread gets second-class support here. Iterators over state that requires LockOSThread will either not work with Pull, or will require an extra layer of pushing results back and forth to an additional (locked) goroutine.

*cough* I would like to make you aware that LockOSThread isn't necessarily tied to C; rather, it is essential in order to deal with Linux thread (task) state, not least namespace attachments and other state such as the (effective) UID and GID. This is independent of any C closure design or lack thereof; it is OS-level specific, independent of any particular language ecosystem. In the case of UID and GID state, I would assume this applies to other Unix-like OSes with native thread/task support as well.

IIRC, Docker/runc was one of the prominent use cases for introducing LockOSThread when it switched to the Linux namespaces technology (which, IIRC, wasn't part of the initial implementation).

If anything, you'll need to put up a big, very bright, flashing warning sign in Go's documentation to keep contributors to, say, any of the existing container engine/executor code bases from infecting them with unexpected and difficult-to-diagnose bugs -- calling this "second-class support" is a totally misleading euphemism for the problems this can and probably will cause.

@mknyszek mknyszek self-assigned this May 8, 2024
@mknyszek mknyszek moved this from Todo to In Progress in Go Compiler / Runtime May 8, 2024
@aclements
Member

@thediveo , yes, you're right of course that LockOSThread is necessary for many OS interactions, and we apologize for misrepresenting its use cases. I believe the semantics we're moving forward with shouldn't actually cause problems with systems-level code that needs to mix LockOSThread and iter.Pull, though. This approach has much better composability than the original proposal, without causing unpredictable performance cliffs. There are also ways we can relax it if we hear that it's too constrained in practice.

@mknyszek mknyszek removed their assignment May 8, 2024
@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. FixPending Issues that have a fix which has not yet been reviewed or submitted. and removed NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. labels May 15, 2024
@gopherbot
Contributor

Change https://go.dev/cl/583675 mentions this issue: runtime: fix coro interactions with thread-locked goroutines
