runtime: optimization to reduce P churn #32113

Open
amscanne opened this issue May 17, 2019 · 10 comments

@amscanne

commented May 17, 2019

Background

The following is a fairly frequent pattern in our code and in other projects:

goroutine1:

ch1 <- data (1)
result = <-ch2 (2)

goroutine2:

data = <- ch1 (3)
// do work...
ch2 <- result (4)

The scheduler exhibits two different behaviors, depending on whether goroutine2 is busy and there are available Ps.

  • If goroutine2 is busy or there are no idle Ps, then the behavior is fine. The item will be enqueued in the channel, goroutine2 is marked as runnable if needed, and eventually goroutine1 will yield.
  • If goroutine2 is not busy and there are idle Ps, then the behavior is sub-optimal. The operation in (1) will mark goroutine2 as runnable and wake up some idle P via a relatively expensive system call [1]. Ultimately the wake will likely result in an IPI to wake an idle core, if there are any. The next P will be scheduled and a race between (2) and (3) ensues.

In the second case, if the P wakes and successfully steals the now-runnable goroutine2, i.e. (3) happens first, then goroutine2 starts executing on the new P. If the P wakes but does not successfully steal goroutine2, i.e. (4) happens first and goroutine2 is run locally, then a large number of cycles have simply been wasted. Either way, the whole dance repeats with the result at (4). In both cases, we spend a large number of cycles and pay interprocessor coordination costs for what should be a simple goroutine context switch.

There are further problems caused by this: it introduces unnecessary work stealing and bounces goroutines between system threads and cores, leading to locality inefficiencies.

Ideal schedule

With an oracle, the ideal schedule after (1) would be:

  • If goroutine2 is running or there are no idle Ps, enqueue only (current behavior).
  • If goroutine1 will not block, or its P has other goroutines in its runqueue, wake idle Ps (current behavior).
  • If goroutine1 will block immediately and there are no other goroutines in its P's local runqueue, do not wake any other Ps. goroutine2 will be executed by the current P immediately after goroutine1 blocks.

In essence, we want to yield goroutine1's remaining time to goroutine2 in this case, or at least avoid all the wasted signaling overhead. To put it another way: if goroutine1 is about to block, its P fills the role of the "idle P" far more efficiently than a freshly woken one.

Proposal

It may be possible to specifically optimize for this case in the compiler, just as certain loop patterns are optimized.

In the case where a blocking channel send is immediately followed by a blocking channel receive, I propose an optimization that tries to avoid these scheduler round trips.

Here's a rough sketch of the idea:

  • runqput returns a bool that indicates whether the newly placed G is the only item on the queue. (Alternatively we could just check the runq length below.)
  • goready takes an additional parameter "deferwake" which skips the wake operation if true. By default this will be false everywhere, which implements current behavior.
  • chansend accepts a similar "deferwake" parameter. This is plumbed through to send, and will be AND'ed with the result of runqput. The deferwake parameter will be passed as true if the compiler detects a blocking receive immediately following the blocking send statement (or possibly in the same block, see below).
  • chanrecv also accepts a "deferwake" parameter, which will be set to true only when preceded by a call to chansend with deferwake also set to true. If this is true AND the current goroutine will not yield as a result of the recv AND the current runqueue length > 0 AND there are idle Ps, then at this point we perform the deferred wakeup. (See the sketch below.)
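
Here is a self-contained model of that plumbing. Every name below is hypothetical; the real runqput, goready, chansend, and chanrecv in runtime/proc.go and runtime/chan.go have different signatures, so this only illustrates the deferwake logic:

package main

import "fmt"

type g struct{ id int }

type p struct {
	runq   []*g // local run queue (model)
	idlePs int  // count of idle Ps in the system (model)
}

// runqput enqueues gp and reports whether it is now the only queued G.
func (pp *p) runqput(gp *g) (onlyItem bool) {
	pp.runq = append(pp.runq, gp)
	return len(pp.runq) == 1
}

// wakep stands in for the expensive wake path: a futex/semaphore
// syscall and, likely, an IPI to wake an idle core.
func (pp *p) wakep() { fmt.Println("wakep: waking an idle P") }

// goready marks gp runnable. With deferwake set, the wake is skipped
// when gp is alone on the local queue: the current P will pick it up
// itself once the running goroutine blocks.
func (pp *p) goready(gp *g, deferwake bool) {
	only := pp.runqput(gp)
	if deferwake && only {
		return
	}
	pp.wakep()
}

// deferredWake is what chanrecv would do on entry: if the wake was
// deferred but this goroutine will not block after all, and local work
// remains queued while Ps sit idle, perform the wake now.
func (pp *p) deferredWake(willBlock bool) {
	if !willBlock && len(pp.runq) > 0 && pp.idlePs > 0 {
		pp.wakep()
	}
}

func main() {
	pp := &p{idlePs: 1}
	pp.goready(&g{id: 2}, true) // deferred: no wakep call here
	pp.deferredWake(false)      // recv won't block, work queued, idle P: wake now
}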

Rejected alternatives

I thought about this problem a few years ago when it caused issues, and at the time considered the possibility of a different channel operator, something like:

ch1 <~ data

This operator would write to the channel and immediately yield to the other goroutine if it was not already running (otherwise it would fall back to the existing channel behavior). Using this operator in the situation above would make it much more efficient.

However, this is a language change, and confusing to users. When do you use which operator? It would be good to have the effect of this optimization out of the box.
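
For comparison, the closest approximation available in the language today is an explicit yield after the send. This only hints to the scheduler rather than guaranteeing a direct handoff, so it does not capture the proposed <~ semantics, but it sketches the intent:

package main

import (
	"fmt"
	"runtime"
)

func main() {
	ch1 := make(chan int)
	done := make(chan struct{})
	go func() {
		fmt.Println("got", <-ch1)
		close(done)
	}()
	ch1 <- 42
	// Rough stand-in for "ch1 <~ 42": yield the current P so the
	// receiver can run, rather than racing it on another P.
	runtime.Gosched()
	<-done
}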

Extensions

  • This optimization may apply to other kinds of wakes. I consider only channels today.
  • The optimization could be extended to cases where a blocking channel receive appears later in the same block as the blocking send, not necessarily as the immediately following statement.

[1] https://github.com/golang/go/blob/master/src/runtime/proc.go#L665

@gopherbot gopherbot added this to the Proposal milestone May 17, 2019

@gopherbot gopherbot added the Proposal label May 17, 2019

@randall77

Contributor

commented May 17, 2019

What about just enforcing a minimum delay between when a G is created and when it can be stolen? That gives the local P time to finish the spawning G (finish = either done or block) and pick up the new G itself.

The delay would be on the order of the overhead of moving a G between processors (syscalls, cache warmup, etc.).

The tricky part is to not even wake the remote P when the goroutine is queued. We want a timer somehow that can be cancelled if the G is started locally.
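
A minimal model of that idea, using hypothetical names (the real work-stealing logic lives in runtime.findrunnable, and a G carries no such timestamp today). The tricky part noted above, arranging a wake timer that can be cancelled, is not modeled here:

package main

import (
	"fmt"
	"time"
)

// stealDelay approximates the cost of moving a G between processors.
const stealDelay = 5 * time.Microsecond

type g struct {
	readyAt time.Time // when the G became runnable (hypothetical field)
}

// canSteal reports whether a remote P may take gp. Within stealDelay
// of becoming runnable, gp stays reserved for the local P, which may
// pick it up itself once the spawning goroutine finishes or blocks.
func canSteal(gp *g) bool {
	return time.Since(gp.readyAt) >= stealDelay
}

func main() {
	gp := &g{readyAt: time.Now()}
	fmt.Println(canSteal(gp)) // false until stealDelay elapses
}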

@bradfitz bradfitz changed the title proposal: optimization to reduce P churn proposal: runtime: optimization to reduce P churn May 17, 2019

@amscanne

Author

commented May 17, 2019

Yes, most of the waste is generated by the wakeup call itself. Ensuring that the other P does not steal the G is probably a minor improvement, but you're still going to waste a ton of cycles (maybe even doing these wakeups twice: on (1) and (4)).

I think using a timer gets much trickier. This is the reason I have limited the proposal to compiler-identified sequences of "chansend(block=true); chanrecv(block=true)" calls. It's possible that the system thread could be preempted between those calls, but if the system is busy (even though Ps in this process may still be idle), it's probably even more valuable not to waste useless cycles.

@amscanne

Author

commented May 17, 2019

(Totally open to a timer, but I'm concerned about replacing a P wakeup with a kick to sysmon in order to enforce the timer, which solves the locality issue but still burns cycles.)

@dvyukov

Member

commented May 18, 2019

Also see #8903, which was about a similar problem. I don't remember all the details exactly now, but as far as I remember my proposal was somewhat more generic, while yours wins in simplicity and is most likely safer from potential negative effects in corner cases.

@rsc

Contributor

commented May 28, 2019

This has come up repeatedly. Obviously it is easy to recognize and fuse

ch1 <- data (1)
result = <-ch2 (2)

It's harder to recognize the pattern in more complex code that would still benefit from the optimization, though. We've fiddled with heuristics in the runtime to try to wait a little bit before stealing a G from a P, and so on. Probably more tuning is needed.

It's unclear this needs to be a proposal, unless you are proposing a language change, and it sounds like you've backed away from that.

The way forward with a suggestion like this is to try implementing it and see how much of an improvement (and how general of an improvement) it yields.

@rsc rsc added the Performance label May 28, 2019

@rsc rsc modified the milestones: Proposal, Unplanned May 28, 2019

@rsc rsc added NeedsInvestigation and removed Proposal labels May 28, 2019

@gopherbot gopherbot added the Proposal label May 28, 2019

@golang golang deleted a comment from rsc May 28, 2019

@randall77

Contributor

commented May 28, 2019

Related:

#27345 (start working on a new goroutine immediately, on the parent's stack)
#18237 (lots of time in findrunnable)

I also remember an issue related to ready goroutines ping-ponging around Ps, but I can't find it at the moment.

@amscanne

Author

commented May 29, 2019

I backed away from a language-change proposal based on the assumption that it would likely not be accepted. My personal preference would be to have an operation like <~ that immediately switches to the other goroutine if it is currently waiting (and behaves like a normal channel operation if it is busy). But I realize that the existence of this operator might be confusing.

I think it's unclear how much of an impact this would have in general. This is probably just a tiny optimization that doesn't matter in the general case, but it can help in a few very specific ones. For us, it might let us structure some goroutine interactions much more efficiently.

I hacked something together, and it seems like there's a decent effect on microbenchmarks at least (unless I screwed something up).

Code:

// A self-contained test file; the package name here is assumed from the
// "_/home/amscanne/gotest/spin" path in the output below.
package spin

import (
	"sync"
	"testing"
)

func BenchmarkPingPong(b *testing.B) {
	var wg sync.WaitGroup
	defer wg.Wait()

	ch1 := make(chan struct{}, 1)
	ch2 := make(chan struct{}, 1)
	wg.Add(2)
	// goroutine1: send on ch1, then wait on ch2, b.N times.
	go func() {
		defer wg.Done()
		for i := 0; i < b.N; i++ {
			ch1 <- struct{}{}
			<-ch2
		}
	}()
	// goroutine2: the mirror image, offset by one receive so every
	// send pairs with a waiting receiver.
	go func() {
		defer wg.Done()
		<-ch1
		for i := 0; i < b.N-1; i++ {
			ch2 <- struct{}{}
			<-ch1
		}
		ch2 <- struct{}{}
	}()
}

Before:

/usr/bin/time /usr/bin/go test -bench=.* -benchtime=5s
goos: linux
goarch: amd64
BenchmarkPingPong-4   	20000000	       563 ns/op
PASS
ok  	_/home/amscanne/gotest/spin	11.805s
12.68user 1.00system 0:11.98elapsed 114%CPU (0avgtext+0avgdata 46036maxresident)k
0inputs+3816outputs (0major+19758minor)pagefaults 0swaps

After:

/usr/bin/time go test -bench=.* -benchtime=5s
goos: linux
goarch: amd64
BenchmarkPingPong-4   	20000000	       330 ns/op
PASS
ok  	_/home/amscanne/gotest/spin	6.949s
7.11user 0.05system 0:07.11elapsed 100%CPU (0avgtext+0avgdata 46460maxresident)k
0inputs+3824outputs (0major+19084minor)pagefaults 0swaps

The system time is telling (roughly 20x higher before the change), and the extra 14% in CPU usage is indicative of an additional P waking up with nothing to do. (Or maybe it occasionally successfully steals the goroutine, which is also bad.)

Assuming this small optimization is readily acceptable -- what's the best way to group those operations and transform the channel calls? The runtime bits are straightforward, but any up-front guidance on the compiler side is appreciated. Otherwise, I'm just planning to add a specialized scan over statement lists in walkstmtlist (see the sketch below), but maybe there's a better way.
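
To make the intent concrete, here is a toy version of that scan over a made-up statement IR. The real compiler would walk its own AST node types; every name below is hypothetical:

package main

import "fmt"

type op int

const (
	opSend op = iota // blocking channel send
	opRecv           // blocking channel receive
	opOther
)

type stmt struct {
	op        op
	deferwake bool // set on qualifying send/recv pairs
}

// markDeferredWakes flags a blocking send immediately followed by a
// blocking receive, so that both would lower to chansend/chanrecv
// calls with deferwake=true.
func markDeferredWakes(stmts []*stmt) {
	for i := 0; i+1 < len(stmts); i++ {
		if stmts[i].op == opSend && stmts[i+1].op == opRecv {
			stmts[i].deferwake = true
			stmts[i+1].deferwake = true
		}
	}
}

func main() {
	list := []*stmt{{op: opSend}, {op: opRecv}, {op: opOther}}
	markDeferredWakes(list)
	fmt.Println(list[0].deferwake, list[1].deferwake) // true true
}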

@rsc

Contributor

commented Jun 4, 2019

Given that there is no language change here anymore, I'm going to move this to being a regular issue.

@rsc rsc changed the title proposal: runtime: optimization to reduce P churn runtime: optimization to reduce P churn Jun 4, 2019

@rsc rsc removed the Proposal label Jun 4, 2019
