
runtime: bgsweep does not yield to netpoller-readied goroutines #56060

Open
rhysh opened this issue Oct 5, 2022 · 15 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

@rhysh
Contributor

rhysh commented Oct 5, 2022

When bgsweep yields, it puts itself on the global run queue. When its P looks for work, the P finds that same goroutine in the global run queue before considering other Ps' local run queues. This results in bgsweep running at higher priority than any sort of work that gets enqueued in a P's local run queue (such as from the netpoller, which may be the easiest type to see in execution traces).

For a 4-thread process, this means the GC consumes 1/4 of available processing time for longer than I'd expect (and perhaps longer than necessary). And if I understand this behavior correctly, a 2-thread process could end up spending 1/2 of its available processing time running GC-related code, in excess of the GC's "25% overhead" design goal.

CC @golang/runtime

What version of Go are you using (go version)?

$ go version
go version go1.19.1 linux/amd64

Does this issue reproduce with the latest release?

I haven't checked since this week's release of go1.19.2 (though the change log for that doesn't mention anything related). I also haven't checked after https://go.dev/cl/429615 (for issue #54767), which isn't part of a stable release yet. That change still attempts to yield using mcall(gosched_m).

What operating system and processor architecture are you using (go env)?

The production system runs on a Linux / AMD64 VM that presents four hardware threads.

What did you do?

The service receives inbound HTTP requests, makes outbound HTTP requests to its dependencies, and does some calculation/processing. This allocates memory, and under load it does a GC cycle about twice per second.

What did you expect to see?

I expected the runtime.bgsweep goroutine to run at a very low priority, to yield its processor to other goroutines if any are runnable. I expected that to include goroutines that are made runnable as a result of inbound network traffic.

What did you see instead?

In execution traces, I see the program keeping 3 of the 4 processors busy with user goroutines and 1 of the 4 processors busy with runtime.bgsweep (Proc 3 in the image below). I see the count of runnable goroutines increase beyond zero, while bgsweep continues to yield and re-acquire its processor. I see goroutines made runnable by the netpoller wait up to several milliseconds before they're picked up by a processor (Proc 0 in the image below), during which the bgsweep goroutine is rescheduled over 1000 times (Proc 3).

In Go 1.19, bgsweep calls Gosched. In tip, it calls goschedIfBusy. Both lead to mcall(gosched_m), which leads to goschedImpl(gp), which leads to globrunqput(gp), which puts gp on the scheduler's global run queue.

When a P finishes its current work and gets more from findRunnable, it will sometimes find no "special" work (safe point function, trace reader, GC mark worker, finalizer), no work in its local run queue, and no work in the global run queue. It will then poll the network and enqueue the resulting goroutine list via injectglist. When injectglist runs with a P and finds that there are no (other) idle Ps, it will place as many as it can (up to 256?) of the newly-runnable goroutines in that P's own local run queue. (I think the netpoller will only return 128 at a time. But this is a problem even with a small number of goroutines enqueued this way.)

The P that had been running the bgsweep worker needs new work, so it calls findRunnable. It finds no "special" work, finds nothing in its own local run queue, and then checks the global run queue. There it finds the bgsweep goroutine that it dropped off a moment earlier. It picks it back up and continues the sweep work.

In the meantime, the goroutines that another P made runnable (because it grabbed everything that the netpoller had to offer) remain in a different local run queue. No P ever gets to the point of calling stealWork.

bgsweep hogs P, leading to delay in processing inbound network traffic

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Oct 5, 2022
@ianlancetaylor
Contributor

ianlancetaylor commented Oct 6, 2022

CC @mknyszek @golang/runtime

@mknyszek
Contributor

mknyszek commented Oct 6, 2022

In #54767 I suggested that we treat the background sweeper the same way we treat idle mark workers. That seems like it would fix this? Unfortunately I think that requires additional logic in findRunnable and serves to make background workers more special. Maybe we need to introduce a more general concept of idle priority goroutines internally to the runtime. Then, they can all just sit on some queue when they're eligible.

@prattmic
Member

prattmic commented Oct 6, 2022

In #54767 I suggested that we treat the background sweeper the same way we treat idle mark workers. That seems like it would fix this?

I think that would work, however we would need to get bgsweep to yield far less often. Going all the way through findRunnable is rather expensive, and if bgsweep still yields often but there is nothing else to do then we'll be spending tons of CPU going through findRunnable over and over again.

@rhysh
Contributor Author

rhysh commented Oct 6, 2022

I'm not sure whether this is specific to bgsweep, or if it's a problem with Gosched for any purpose. The docs promise that "Gosched yields the processor, allowing other goroutines to run", but it only yields to a particular set of "other goroutines" (those in the local run queue of this P, those in the global run queue, and some "special" cases like the trace reader or finalizer).

Is there a supported way for application code to do low-priority / non-latency-sensitive background work, if not through calls to Gosched?

What if Gosched (and goschedIfBusy) were to leave a note on the P that would cause its next run of findRunnable to deprioritize the global run queue case, at least low enough for it to make a weak attempt at stealWork first (maybe not attempting to grab runnext)? Or to trigger that behavior change every N calls to Gosched, or every 1000 or 100 µs?

@cagedmantis cagedmantis added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Oct 6, 2022
@cagedmantis cagedmantis added this to the Backlog milestone Oct 6, 2022
@mknyszek
Contributor

mknyszek commented Nov 7, 2022

What if Gosched (and goschedIfBusy) were to leave a note on the P that would cause its next run of findRunnable to deprioritize the global run queue case

Thinking about this again, what if instead of deprioritizing the global run queue case, Gosched left a note as to which goroutine called it, and findRunnable ignored that goroutine specifically unless it got to the point where there was nothing else left to execute?

I understand where you're coming from on the point that Gosched isn't really following the spirit of the documentation, because it's not actually going to open up space on a system with unevenly balanced Ps or with goroutines ready in the netpoller. I think in another thread the biggest concern with changing Gosched directly was unintended effects and hurting users' performance. However if we specifically only bias Gosched against the goroutine that just called it, that seems safer.

@rhysh
Contributor Author

rhysh commented Nov 8, 2022

So instead of deprioritizing the entire global run queue case, it would deprioritize the "global run queue has a single entry, which is the goroutine this P just put there" case? Yes, that sounds good. Can you check my understanding below? Thanks.

I've sketched it with a note attached to the P. Maybe it could be global; the global run queue is global anyway. But I've also seen recently (in go1.18.8) that it's possible for the sweeper to run on two Ps (the bgsweep goroutine, and a goroutine that made an explicit call to runtime.GC). So we may also want to protect against those two goroutines bouncing back and forth between two Ps and failing to steal work from any other Ps. And, the state of "this P hasn't run any other goroutines since it put gp onto the global run queue" is per-P.

There are three calls from findRunnable to globrunqget: First, the mod-61 "ensure fairness" check. Second, the check between the local run queue and the netpoller. Third, the final effort before dropping the P.

We'd change the behavior of the second globrunqget call.

  • gosched_m or goschedImpl would leave a note on the P like gp.m.p.ptr().lastGosched = gp
  • the start of findRunnable would save and clear the note, before the top label
  • after the second call to globrunqget, before releasing sched.lock, check against the local copy of the note and decide whether to call globrunqput
  • if the goroutine was one that this particular P just put into the global run queue via gosched_m / goschedImpl, this run through findRunnable would proceed to the netpoll and stealWork portions
  • but the last-ditch effort to call globrunqget before pidleput would still be able to return that goroutine

@gopherbot

gopherbot commented Nov 11, 2022

Change https://go.dev/cl/449735 mentions this issue: runtime: avoid finding own Gosched goroutine

@rhysh
Contributor Author

rhysh commented Nov 11, 2022

Here's a reproducer with user code calling runtime.Gosched, and execution traces of it running with CL 449735 at PS 3, and with that CL's parent. @mknyszek do you have ideas on how to turn this into a non-flaky test? And, do you see a path to this being part of Go 1.20? Thanks.

This reproducer uses GOMAXPROCS-1 goroutines to create and rejoin a handful of workers (10), each of which does 100µs work units. These newly-created goroutines should end up in the P's local run queue. It uses one goroutine to do 100µs work units, interspersed with calls to runtime.Gosched. That goroutine should run with lower priority, but the presence of this issue means it won't.

Here's a 200ms view of the execution trace from the CL's parent (with this issue present). The goroutine that calls Gosched captures Proc 0, while the other 30 goroutines (each of 3 worker managers creates 10) doing the same kind of work share the other 3 Ps. (The single color of goroutine regions on Proc 0 indicates there's only one goroutine that it runs.)

Issue present, single goroutine captures Proc 0

Here's a 200ms view of the execution trace with CL 449735 (with this issue at least partially fixed). All processors run a variety of goroutines.

Issue resolved

The issue isn't entirely gone; two goroutines can work together to capture a P. This zoomed-in view of CL 449735's execution trace (covering a few hundred microseconds) shows Proc 2 switching between the user goroutine that calls Gosched and the bgsweep goroutine. The sweeper shows up in tiny gaps between G19's executions. The reproducer I wrote doesn't cause much GC sweep work (so Proc 2 doesn't stay captured for long), but I expect that the effect could continue for as long as the sweeper (or other Gosched-calling goroutine) is running/runnable.

Two goroutines collude to capture a P

package main

import (
	"context"
	"flag"
	"runtime"
	"sync"
	"sync/atomic"
	"testing"
	"time"
)

var (
	testDur  = flag.Duration("test-duration", 10*time.Second, "Duration of complete test")
	loopDur  = flag.Duration("loop-duration", 100*time.Microsecond, "Duration of each tasklet")
	tasklets = flag.Int("tasklets", 10, "Number of small jobs each worker creates")
)

func TestGosched(t *testing.T) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	var (
		workers = runtime.GOMAXPROCS(0) - 1
	)

	var wg1 sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg1.Add(1)
		go func() {
			defer wg1.Done()
			for ctx.Err() == nil {
				var wg2 sync.WaitGroup
				for j := 0; j < *tasklets; j++ {
					wg2.Add(1)
					go func() {
						defer wg2.Done()
						hog(&salt, *loopDur)
					}()
				}
				wg2.Wait()
			}
		}()
	}

	start := time.Now()
	for time.Since(start) < *testDur {
		runtime.Gosched()
		hog(&salt, *loopDur)
	}

	cancel()
	wg1.Wait()
}

var salt int32

func hog(y *int32, d time.Duration) {
	accum := atomic.LoadInt32(y)
	for t0 := time.Now(); time.Since(t0) < d; {
		accum = cpuHog2(accum)
	}
	atomic.StoreInt32(y, accum)
}

func cpuHog2(x int32) int32 {
	foo := x
	for i := 0; i < 1e4; i++ {
		if foo > 0 {
			foo *= foo
		} else {
			foo *= foo + 2
		}
	}
	return foo
}

@mknyszek
Contributor

mknyszek commented Nov 14, 2022

Thanks for the change and the in-depth analysis. I think this is leading me to conclude that the only solution to bgsweep not capturing a P with another goroutine is to fully treat it the same way as an idle-priority mark worker.

The fact that two arbitrary user goroutines calling Gosched can still collude to capture a P isn't great. The situation certainly seems better (and so maybe it's worth moving forward with it for now), but it almost feels like there should be a separate, perhaps global, Gosched list that gets checked only after netpoll and work stealing, but before the last-ditch attempt at looking at the global run queue again. Unfortunately, with a separate list, Gosched goroutines might now fully starve. That can maybe be resolved by just checking this list ahead of others occasionally (like findRunnable does with the global queue). To be clear, I'm not sure this is actually the right solution because of the additional complexity, but I think at least this gets the semantics right? Maybe it's far too much slack compared to what Gosched does today. At least @rhysh's proposed change doesn't really add much more slack to Gosched compared to today, just allowing for exactly one round of stealing or netpolling.

Part of the difficulty for me here is defining what spirit of Gosched is supposed to be. How much slack does it actually mean?

@ianlancetaylor
Contributor

ianlancetaylor commented Nov 14, 2022

Intuitively I think that Gosched should mean something like "get the current set of ready goroutines; put this goroutine to sleep until all of those goroutines have had a chance to run, then let this goroutine run." But it's not precise, and it's fine if this goroutine runs earlier or later than that. It's just a general statement of intent.

@rhysh
Contributor Author

rhysh commented Nov 14, 2022

Yes, from what I've seen in the last few releases, the idle priority mark workers seem to behave pretty well: they yield to other work, and they're able to keep working for hundreds of microseconds without (as far as I see) spending lots of their P's time interacting with the scheduler.

For what it's worth, I have the same intuition on the spirit of Gosched.

The global run queue is global of course (so only scales so well on larger machines), but I'm also uneasy about behaviors that require visiting each P. We have that already with stealWork, and it seems like any option we've discussed for fixing Gosched would involve at least a bit of increase to cross-P communication. Although at the times when Gosched should steal work, it should be able to do so without visiting many Ps.

Maybe Gosched could put more than the calling goroutine in the global runq. A carefully-managed system goroutine (similar to fing) could precede the goroutine that called Gosched (if it's not already Runnable), and could force the P that runs it to have a particular scheduling behavior, like "run the netpoller and stealWork, perhaps after pushing your local runq to the head of the globrunq (so the results of stealWork are run first)". That system goroutine would put itself to sleep until the next Gosched call.

In an app with lots of Gosched use and lots of depth to the global runq, that system goroutine would be enqueued in the global runq only once at a time, which could help make this behavior cheaper as the app gets busier.

@mknyszek
Contributor

mknyszek commented Nov 14, 2022

"get the current set of ready goroutines; put this goroutine to sleep until all of those goroutines have had a chance to run, then let this goroutine run."

That seems reasonable. My follow-up question with this is "how long" the goroutine remains in this state. I think we can all agree this isn't currently what Gosched does. With @rhysh's CL, it's only true for the call into the scheduler resulting from that Gosched call, and that Gosched call only. This means that two goroutines can exploit a loophole in "put this goroutine to sleep until all of those goroutines have had a chance to run" wherein they just hand time back and forth to each other (at least, on one unit of parallelism).

they yield to other work, and they're able to keep working for hundreds of microseconds without (as far as I see) spending lots of their P's time interacting with the scheduler. ... For what it's worth, I have the same intuition on the spirit of Gosched.

Treating Gosched goroutines like idle priority mark workers (at least until they get scheduled again) extends the time window the goroutine remains in this state until the next quiescence in the system. This avoids the two-goroutine conspiracy, but I'm worried that it enables starvation. There's already a lot of code out there that calls Gosched and currently there's little chance of starvation from just a single call (no more than starvation of anything on the global queue). That seems like it has the potential for significant breakage.

Maybe Gosched could put more than the calling goroutine in the global runq.

I think this is an interesting idea for idle priority goroutines in general. It's almost like the "idle process" equivalent of a goroutine.

@mknyszek
Contributor

mknyszek commented Nov 14, 2022

That seems like it has the potential for significant breakage.

On the other hand, maybe that's OK? runtime.Gosched is vague and its precise behavior is an implementation detail. Used without care, it can harm the performance of your application, not unlike runtime.GC.

@rhysh
Contributor Author

rhysh commented Nov 15, 2022

I'm not sure we're on the same page about the sweep work and Gosched. It sounded to me like you intended (at some indefinite point) to change the bgsweep work to take a similar form to the idle mark workers, and thus to not use Gosched (or goschedIfBusy). For the sweep work, I think that sounds like a positive change.

I don't think it's right for all users of Gosched to end up treated as idle workers.

What I thought @ianlancetaylor was describing (and how I expect Gosched to work) as the idealized behavior is that its callers would end up at the end of a virtual queue in which all Ps participate: any goroutine in any queue (P-local or global) that is Runnable would have one opportunity to run first, and then the goroutine that called Gosched would have an opportunity to run. Goroutines that become Runnable after the Gosched call would come after it in the virtual queue, so would not be entitled to run ahead of it. The Gosched call would create a sort of barrier.

My follow-up question with this is "how long" the goroutine remains in this state.

If we kept track of a timestamp (or other monotonic counter, corresponding to position in the idealized global queue above) for when each goroutine transitioned into the Runnable state, and a timestamp for when a goroutine called Gosched, the goroutine that called Gosched would remain not-quite-runnable until no other goroutines had an earlier timestamp. (I think that's the idealized behavior, not a practical implementation strategy.)


My CL (449735) definitely has the exploitable loophole you describe. It's not a new loophole, and it's smaller than without the CL (takes two colluding goroutines, rather than only one), but it's certainly still present. Is it good to have at least a partial fix, or should we hold off on half measures?

Do you think there's potential in the "Maybe Gosched could put more than the calling goroutine in the global runq" / "force the P that runs it to have a particular scheduling behavior" idea, @mknyszek? I could try it out, but fing is pretty weird, so I'm not sure whether that pattern is something you'd like to proliferate.

@mknyszek
Contributor

mknyszek commented Nov 15, 2022

Ah, got it. I misunderstood what you were saying. I think we're on the same page now, thanks for clarifying.

If we kept track of a timestamp (or other monotonic counter, corresponding to position in the idealized global queue above) for when each goroutine transitioned into the Runnable state, and a timestamp for when a goroutine called Gosched, the goroutine that called Gosched would remain not-quite-runnable until no other goroutines had an earlier timestamp. (I think that's the idealized behavior, not a practical implementation strategy.)

This sounds right. I think a ticket system (where each Gosched increments the next ticket number, creating the barrier you describe) is an interesting idea, actually. It's a lot like the various generation counters we have. The only downside is that would have to be a single global resource. But yeah it's also complex, which is what I assume you mean by not a practical implementation strategy, haha.

Is it good to have at least a partial fix, or should we hold off on half measures?

I'm personally not opposed to landing what you have in general, but I think it might be too close to the freeze to land now. I don't expect it to have many merge conflicts just sitting around, and the freeze is shorter this time around (hopefully). I'm happy to review and iterate with you on it, and I can land it when the tree opens.

FWIW, I'm holding back some of my own changes as well, so you won't be alone. :P Plus it gives us some time to bake this a little more.

I could try it out, but fing is pretty weird so I'm not sure whether that pattern is something you'd like to proliferate.

Yeah... you have a point about fing. I'm not sure if it's worth it.

Projects
Status: Todo
Development

No branches or pull requests

6 participants