
runtime: mark termination is slow to restart mutator #45894

Open
rhysh opened this issue Apr 30, 2021 · 6 comments

@rhysh rhysh commented Apr 30, 2021

What version of Go are you using (go version)?

Most of the data I have is from go1.15.6.

Does this issue reproduce with the latest release?

Yes, I have seen the same effect (as viewed from execution traces) in an app that uses go1.16.3.

What operating system and processor architecture are you using (go env)?

GOOS=linux and GOARCH=amd64

What did you do?

I have an app that uses go1.15.6 on machines with 96 hyperthreads (GOMAXPROCS=96), and an app that uses go1.16.3 on machines with 36 hyperthreads (GOMAXPROCS=36).

What did you expect to see?

I expected the app's goroutines to resume work immediately after the runtime declares end of the mark termination stop-the-world phase. I expected processors to resume work one at a time, staggered by at most a few tens of microseconds, with dozens of threads resuming useful work every millisecond.

What did you see instead?

I see in execution traces that these applications are slow to resume useful work at the end of the garbage collector's mark phase. The application using Go 1.15 and 96-hyperthread machines has a particularly low latency goal (tail of a few milliseconds), and the slow resumption of work introduces about 5 to 10ms of extra idle time during each GC run.

My read of the traces, of profiles from perf, and the runtime source is that although runtime.gcMarkDone prepares goroutines to run (in large part by injecting the Gs that were blocked on assist credit into the global run queue) and runtime.gcMarkTermination restarts the world (so Ps can pick up and execute Gs), the gcMarkTermination function follows that up immediately by using runtime.forEachP to call runtime.mcache.prepareForSweep.

When the goroutine that's finishing up the mark phase calls runtime.forEachP, it has only recently allowed the world to restart, so most Ps are still idle. That function obtains runtime.sched.lock and holds it while iterating the list of idle Ps and calling the provided function for each one. My understanding of sched.lock is that it guards access to the global goroutine run queue (among other things), and that in the two applications where I've seen this effect, nearly every runnable goroutine in the program is sitting in the global run queue, since nearly all of them became blocked on assist credit during the mark phase.
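To make the lock-holding pattern concrete, here is an over-simplified model of the shape described above. This is not the runtime's actual code; `schedModel`, `idlePs`, and `forEachIdleP` are invented stand-ins for sched.lock, the idle P list, and forEachP:

```go
package main

import (
	"fmt"
	"sync"
)

// schedModel stands in for the parts of runtime.sched relevant here:
// one lock (sched.lock) guarding both the global run queue and the
// idle P list.
type schedModel struct {
	lock   sync.Mutex // stands in for runtime.sched.lock
	idlePs []int      // stands in for the idle P list
}

// forEachIdleP mirrors the problematic shape: the lock is acquired once
// and held across every per-P callback (e.g. mcache.prepareForSweep),
// so any M that needs the same lock to pull a G off the global run
// queue must wait until every callback has finished.
func (s *schedModel) forEachIdleP(fn func(p int)) {
	s.lock.Lock()
	defer s.lock.Unlock() // not released until ALL idle Ps are handled
	for _, p := range s.idlePs {
		fn(p)
	}
}

func main() {
	s := &schedModel{idlePs: []int{0, 1, 2}}
	calls := 0
	s.forEachIdleP(func(p int) { calls++ })
	fmt.Println("callbacks run under a single lock acquisition:", calls) // prints 3
}
```

With GOMAXPROCS=96 and nearly all Ps idle, that single critical section covers dozens of callbacks, which matches the multi-millisecond stall described below.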

I've included some screenshots from go tool trace and a contiguous section of the output of go tool trace -d below the fold, all from a single execution trace of the Go 1.15 app on a 96-hyperthread machine (during a load test, so it's especially busy). I'll walk through my understanding of how the events it shows line up with the code in runtime.gcMarkDone and its callees.

  • At 413.835212ms on P 43, the GCDone event comes from runtime.gcMarkTermination, https://github.com/golang/go/blob/go1.15.6/src/runtime/mgc.go#L1691
  • At 413.924727ms on P 43, the GCSTWDone event comes from runtime.startTheWorldWithSema, called from runtime.gcMarkTermination, https://github.com/golang/go/blob/go1.15.6/src/runtime/mgc.go#L1752
  • At 414.023884ms on P 0, the ProcStart event is the beginning of the app getting back to work. There are many attempts to get threads back to work -- in startTheWorldWithSema, in injectglist, etc. -- so I'd expect nearly every P to already be eligible to run.
  • At 414.028108ms on P 0, the GoStart event means the P is running real Go code, for goroutine 1. Lucky!
  • At 414.622347ms on P 1, the ProcStart event attempts to get some work
  • At 417.049823ms on P 0, the GoPreempt would probably come from runtime.forEachP (probably a delayed acknowledgement of the first attempt, https://github.com/golang/go/blob/go1.15.6/src/runtime/proc.go#L1333)
  • At 419.467080ms, P 1 finally starts a goroutine, 4.844733ms after its ProcStart event and 5.438972ms after P 0's GoStart. The GCMarkAssistDone event right after suggests that the goroutine would have been on the global run queue.
  • At 419.475251ms, P 62 starts a goroutine, only 0.008171ms (8µs) after P 1's first GoStart.
  • At 419.478259ms, P 2 starts a goroutine, only 0.003008ms (3µs) after P 62's GoStart.

What I see in the perf profiles (collected with -T for timestamps, -g for full call stacks, and -F 997 for about 1ms per sample) is that the process does very little work at times (only one or two CPUs worth of samples, rather than dozens), and when I see what work it's doing during those times it is inside calls from runtime.gcMarkDone to runtime.systemstack to runtime.gcMarkTermination.func4 to runtime.forEachP for several samples in a row.

413784823 GoUnblock p=43 g=3653436610 off=7142743 g=3653434585 seq=6
413832289 HeapAlloc p=43 g=3653436610 off=7142755 mem=21571879176
413832801 GoUnblock p=43 g=3653436610 off=7142763 g=3 seq=32967
413835212 GCDone p=43 g=0 off=7142771
413835361 HeapGoal p=43 g=3653436610 off=7142773 mem=43143758352
413840844 Gomaxprocs p=43 g=3653436610 off=7142781 procs=96
413924727 GCSTWDone p=43 g=0 off=7142787
414023884 ProcStart p=0 g=0 off=6440481 thread=50
414028108 GoStart p=0 g=1 off=6440486 g=1 seq=50
414029516 HeapAlloc p=0 g=1 off=6440491 mem=21571885928
414622347 ProcStart p=1 g=0 off=9652526 thread=123
414690699 ProcStart p=62 g=0 off=7722494 thread=120
417049823 GoPreempt p=0 g=1 off=6440498
419467080 GoStart p=1 g=3653436412 off=9652531 g=3653436412 seq=5
419467550 GCMarkAssistDone p=1 g=3653436412 off=9652541
419468531 HeapAlloc p=1 g=3653436412 off=9652543 mem=21571892712
419470622 HeapAlloc p=1 g=3653436412 off=9652550 mem=21571899624
419472328 ProcStart p=2 g=0 off=9587511 thread=117
419472478 GoEnd p=1 g=3653436412 off=9652557
419474014 GoStart p=1 g=3653436380 off=9652559 g=3653436380 seq=3
419474270 GCMarkAssistDone p=1 g=3653436380 off=9652567
419474782 HeapAlloc p=1 g=3653436380 off=9652569 mem=21571906088
419475251 GoStart p=62 g=3653434994 off=7722499 g=3653434994 seq=0
419476488 HeapAlloc p=1 g=3653436380 off=9652576 mem=21571912456
419477342 GoUnblock p=62 g=3653434994 off=7722508 g=3653435051 seq=2
419477726 HeapAlloc p=1 g=3653436380 off=9652583 mem=21571917832
419478259 GoStart p=2 g=3653434087 off=9587516 g=3653434087 seq=7
419478835 GCMarkAssistDone p=2 g=3653434087 off=9587525

[screenshot: overview]

[screenshot: proc43]

CC @aclements @mknyszek

@cherrymui cherrymui added this to the Backlog milestone Apr 30, 2021

@mknyszek mknyszek commented May 1, 2021

Thanks for the really detailed report! I think your analysis of the situation makes a lot of sense. Maybe we can do something more clever about flushing mcaches. Honestly, forEachP seems like overkill in this situation. One simple idea I have is a much simpler loop that just preempts each P directly without holding sched.lock.

Another idea (and I like this one better) is to handle the idle Ps here and then tell each running P to do it when it next calls back into the scheduler (actively allocating Ps will already flush parts of it themselves). There's a chance we miss a P (e.g. a P transitions from active to idle and stays that way after we set the mark) because it'll be racy, but we can double-check in sweep termination. Missing a P should be unlikely. EDIT: This is already (almost) what forEachP does, minus the races. :)

This must happen before the next GC cycle; we can delay the whole process until sweep termination but then that just pushes the problem somewhere else.
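[Editorial sketch] The deferred-flush idea above might look roughly like this toy model: idle Ps are flushed eagerly, and each running P just gets a flag set so it flushes itself on its next trip through the scheduler. All names here are invented for illustration; none of this is the runtime's actual code:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// pModel is an invented stand-in for a runtime P.
type pModel struct {
	idle       bool
	needsFlush int32 // set to 1 to request a flush; cleared with CAS
	flushed    bool
}

func (pp *pModel) prepareForSweep() { pp.flushed = true }

// markForFlush is what mark termination would do: cheap, and without
// holding a global lock across the whole P list.
func markForFlush(ps []*pModel) {
	for _, pp := range ps {
		if pp.idle {
			pp.prepareForSweep() // nobody else owns an idle P's mcache
		} else {
			atomic.StoreInt32(&pp.needsFlush, 1) // defer to the P itself
		}
	}
}

// schedule is the check each running P would make on its next entry
// into the scheduler.
func (pp *pModel) schedule() {
	if atomic.CompareAndSwapInt32(&pp.needsFlush, 1, 0) {
		pp.prepareForSweep()
	}
}

func main() {
	ps := []*pModel{{idle: true}, {}, {}}
	markForFlush(ps) // flushes the idle P, flags the two busy ones
	ps[1].schedule() // each busy P flushes itself later...
	ps[2].schedule() // ...on its next scheduler entry
	n := 0
	for _, pp := range ps {
		if pp.flushed {
			n++
		}
	}
	fmt.Println("Ps flushed:", n) // prints 3
}
```

The race mentioned in the comment is visible even in this sketch: a P could go idle between the `idle` check and the flag being observed, which is why a double-check in sweep termination would be needed.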

CC @prattmic


@gopherbot gopherbot commented May 1, 2021

Change https://golang.org/cl/315171 mentions this issue: runtime: share sched.lock during forEachP


@rhysh rhysh commented May 1, 2021

I think you're right that this is best addressed with a change to how the runtime prepares for sweep. A secondary effect of this is that a user goroutine can end up on the hook for several milliseconds of work (when GOMAXPROCS is large), if they happen to be doing an assist at the end of mark.

But because it's not quite May 1st (and with it the Go 1.17 freeze) in my time zone, I sent a proposal (CL 315171 above) to change forEachP to release and re-acquire the lock between every call to the provided function. It looks like runtime.unlock will do a handoff if there's an M waiting on the lock, so the world should be able to restart quickly while forEachP chugs along.


@mknyszek mknyszek commented May 3, 2021

It occurs to me that what I laid out earlier is basically what forEachP already does, except it's racy because it avoids acquiring sched.lock altogether. Given that forEachP is only used in two places (though, notably, one of those places is mark termination and we need to be pretty careful there) just modifying forEachP sounds like it could be the right way forward for mitigating this specific issue. I do still worry about the mark termination algorithm, though.


@prattmic prattmic commented May 3, 2021

One idea that came up in brainstorming with @mknyszek is that we could potentially have findrunnable call prepareForSweep on idle Ps if it can't find any other work to do (really this just means calling acquirep on each of them).

This has the advantage of allowing all of the runnable Gs to get going first after STW and only afterwards start worrying about the idle Ps. In general, I don't think we'd move cost to the start of the next GC, because we should handle all the idle Ps as long as we run out of work on one M at some point. And if we never run out of work, there shouldn't be any idle Ps at all.

The biggest downside to this is more complexity in the scheduler, where we've been trying to reduce complexity. There's also an argument that having the GC flush all of the idle Ps could reduce latency to start a P, but in practice I don't think that makes much difference (and is obviously the opposite case in this bug).
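[Editorial sketch] A toy model of the brainstormed alternative, with invented names (the real change would live in findrunnable/acquirep): an M that runs out of runnable goroutines picks up an idle P that still needs its mcache flushed and does the flush itself, instead of the GC doing every P up front under sched.lock:

```go
package main

import "fmt"

// idlePModel is an invented stand-in for an idle runtime P.
type idlePModel struct {
	id         int
	needsFlush bool
}

// findFlushWork stands in for the proposed findrunnable fallback: only
// consulted once the M has no other work to do.
func findFlushWork(ps []*idlePModel) *idlePModel {
	for _, pp := range ps {
		if pp.needsFlush {
			pp.needsFlush = false // "acquirep" would flush as a side effect
			return pp
		}
	}
	return nil // no idle Ps left to flush
}

func main() {
	ps := []*idlePModel{{0, true}, {1, false}, {2, true}}
	flushed := 0
	for findFlushWork(ps) != nil {
		flushed++
	}
	fmt.Println("idle Ps flushed by otherwise-idle Ms:", flushed) // prints 2
}
```

This matches the comment's cost argument: runnable Gs drain first, and flush work only happens on Ms that would otherwise spin or park.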


@rhysh rhysh commented May 6, 2021

I have a reproducer. It's in the form of a benchmark for ease of setup, though it's not exactly a benchmark. The code is at the bottom, and I've included a few screenshots of execution traces that it generates. (The benchmark tries to calculate some measures of how often the application code is able to run, but the main value is in looking at the execution traces.)

This issue is a combination of a few circumstances:

  • The shape of the heap means there's not enough mark work or mark assist credit to go around, so the goroutines that allocate memory end up blocked in the assist queue until the end of the GC. These end up in the global run queue via gcWakeAllAssists.
  • The code to restart the world only launches Ps that have goroutines in their local run queue, through the interaction of procresize (returning that list of Ps) and startTheWorldWithSema (launching them).
  • If there are idle Ps and runnable goroutines, a spinning M can pick them up. (The runtime makes sure to trigger the start of this if there's a chance of finding work.) If the spinning M finds a P and some goroutines, it starts executing those after kicking off another spinning M. Pulling work from the global run queue in that way requires holding sched.lock.
  • Moving into the sweep phase uses forEachP, which does the work itself for any P that is idle. It obtains sched.lock before it begins and does not release it until it's finished handling every idle P.

Data from perf record -g -a -e 'sched:sched_switch' -e 'sched:sched_wakeup' combined with execution traces that show fast vs slow resumption after stop-the-worlds point to the "spinning M" mechanism being in play when the application is slow to resume.

The execution traces below (from the reproducer at the very bottom) show the difference in speed between procresize+startTheWorldWithSema resuming lots of Ps (fast) and relying instead on the mspinning mechanism to get back to work (slow). These traces are from a server with two Intel sockets and a total of 96 hyperthreads. I've zoomed them all so the major time divisions are at 5ms intervals.


Here's go1.16.3 with a fast resume during the start of mark. When there are a lot of active Ps right before the STW, the program is able to have a lot of active Ps very soon after the STW. This is good behavior -- the runtime is able to deliver.

[screenshot: go1.16.3-startmark-good]

Here's another part of the go1.16.3 execution trace, showing a slow resume at the end of mark. Nearly all of the Gs are in the mark assist queue (the left half of the "Goroutines" ribbon is light green), and nearly all of the Ps are idle (very thin lavender "Threads" ribbon). Once the idle Ps start getting back to work (growing lavender, start of colored ribbons in the "Proc 0" etc rows), it takes about 4ms for the program to get back to work completely. This is the high-level bug behavior.

[screenshot: go1.16.3-startsweep-bad]

Here's a third part of the go1.16.3 execution trace, with an interesting middle-ground. After the STW, the lavender "Threads" ribbon grows to about 80% of its full height and stays there for a few hundred microseconds before growing, more slowly than before, to 100%. I interpret this as almost all of the Ps having had local work (so procresize returns them and they're able to start immediately), then forEachP holding sched.lock for a while and preventing further growth, and finally the mspinning mechanism finding the remaining Ps and putting them to work.

[screenshot: go1.16.3-startsweep-ok]

Here's Go tip at the parent of PS 6, go version devel go1.17-8e91458b19 Fri Apr 30 20:00:36 2021 +0000. It shows similar behavior to the second go1.16.3 image. This is from a different phase in the benchmark's lifecycle, so the "Goroutines" ribbon looks a little different. The idle Ps take about 5ms to get to work.

[screenshot: gotip-startsweep-bad]

Here's PS 6, which (1) assigns goroutines from the global run queue to idle Ps as part of starting the world and (2) does not hold sched.lock in the parts of forEachP that do substantial work. It's from the same part of the benchmark's lifecycle as the previous image, but it only takes about 300µs for the Ps to get to work.

[screenshot: cl6-startsweep-good]


package test

import (
	"encoding/hex"
	"flag"
	"math/rand"
	"os"
	"runtime"
	"sort"
	"sync"
	"sync/atomic"
	"testing"
	"time"
)

var (
	sink    atomic.Value
	ballast *link

	busyLen        = flag.Int("busy-bytes", 1000000, "busy loop size")
	bucketDuration = flag.Duration("bucket-duration", 100*time.Microsecond, "size of time buckets")
	workerAlloc    = flag.Int("worker-alloc", 30000, "target for memory allocation between sleeps")
	ballastSize    = flag.Int("ballast", 30000000, "Number of bytes to retain")
)

const (
	ballastPointerCount = 100
	ballastByteCount    = 1000
)

type link struct {
	a [ballastPointerCount]*link
	b [ballastByteCount]byte
}

func TestMain(m *testing.M) {
	flag.Parse()

	for i := 0; i < (*ballastSize)/(8*ballastPointerCount+ballastByteCount); i++ {
		var next link
		for i := range next.a {
			next.a[i] = ballast
		}
		ballast = &next
	}

	os.Exit(m.Run())
}

// BenchmarkResume attempts to demonstrate issue 45894. It generates some
// measurements, but its main result is execution traces that show an
// application that is slow to resume work after the GC's mark phase ends.
//
// It does its work by launching many goroutines that allocate memory in several
// sizes: so each P will have work to do in mcache.prepareForSweep, and for many
// chances for goroutines to get blocked on mark assist credit.
//
// It records the time when each goroutine was able to make progress. That might
// end up as a concise measure of how well the program is able to make progress
// while the GC runs, but for now the clearest signal comes from looking at
// execution traces (via the "-test.trace" flag and "go tool trace" command).
func BenchmarkResume(b *testing.B) {
	whenCh := make(chan time.Time, b.N)
	when := make([]time.Time, 0, b.N)
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for w := range whenCh {
			when = append(when, w)
		}
	}()

	var mem runtime.MemStats
	runtime.ReadMemStats(&mem)

	busy := make([]byte, *busyLen)

	var seed int64
	b.SetParallelism(10)
	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		rng := rand.New(rand.NewSource(atomic.AddInt64(&seed, 1)))
		busyDest := make([]byte, hex.EncodedLen(len(busy)))
		for pb.Next() {
			runtime.Gosched()
			hex.Encode(busyDest, busy[:rng.Intn(len(busy))])
			todo := *workerAlloc
			var v []byte
			for _, class := range mem.BySize {
				if todo < int(class.Size) {
					break
				}
				todo -= int(class.Size)
				v = make([]byte, int(class.Size))
			}
			sink.Store(v)
			v = make([]byte, todo)
			whenCh <- time.Now()
		}
	})
	b.StopTimer()
	close(whenCh)
	wg.Wait()

	// Trim off first and last quarter of the time range, to catch the test in
	// steady state.
	sort.Slice(when, func(i, j int) bool { return when[i].Before(when[j]) })
	when = when[len(when)/4 : 3*len(when)/4]

	if len(when) < 2 {
		return
	}

	start := when[0]
	end := when[len(when)-1]

	buckets := make([]int, 1+int(end.Sub(start)/(*bucketDuration)))
	for _, t := range when {
		delta := t.Sub(start)
		bucket := int(delta.Truncate((*bucketDuration)) / (*bucketDuration))
		buckets[bucket]++
	}
	// Trim off the last bucket, which may not cover the full interval.
	buckets = buckets[:len(buckets)-1]

	if len(buckets) < 1 {
		return
	}

	sort.Sort(sort.IntSlice(buckets))

	first := len(buckets)
	for i, n := range buckets {
		if n > 0 {
			first = i
			break
		}
	}

	// See note at top about how useful these measurements are for describing
	// the issue.
	b.ReportMetric(100*float64(first)/float64(len(buckets)), "empty-percent")
	b.ReportMetric(float64(len(buckets)), "buckets")
	b.ReportMetric(float64(buckets[len(buckets)/2]), "count-p50")
	b.ReportMetric(float64(buckets[len(buckets)/10]), "count-p90")
	b.ReportMetric(float64(buckets[len(buckets)/20]), "count-p95")
}