
runtime: GoroutineProfile causes latency spikes in high-goroutine-count application #33250

dpolansky opened this issue Jul 24, 2019 · 2 comments


@dpolansky dpolansky commented Jul 24, 2019

What version of Go are you using (go version)?

$ go version
go version devel +60f14fddfe Wed Jun 26 16:35:14 2019 +0000 darwin/amd64

Does this issue reproduce with the latest release?

The issue reproduces on go1.12.7 and go1.13beta1.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOENV="/Users/dpolans/Library/Application Support/go/env"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/cm/rp_0xvp15fn5mp4d672t2qsxkl2s83/T/go-build643112776=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I have a (production, latency-sensitive) Go service that has a median goroutine count of about 300,000 during steady state operations. The program periodically collects profiling data, including goroutine profiles.
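The issue doesn't show the service's collection code, but a minimal sketch of what "periodically collects goroutine profiles" typically looks like via runtime/pprof is below. The function names (captureGoroutineProfile, collectPeriodically) and the interval/sink plumbing are illustrative, not from the real service; the key point is that each WriteTo call triggers the stop-the-world goroutine walk this issue is about.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"runtime/pprof"
	"time"
)

// captureGoroutineProfile writes one goroutine profile (debug=0, the
// compressed proto format) into a buffer. Under the hood this stops the
// world while it walks every goroutine's stack.
func captureGoroutineProfile() ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 0); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// collectPeriodically emulates the service described above: one goroutine
// requests a goroutine profile on a fixed interval until the context ends.
func collectPeriodically(ctx context.Context, interval time.Duration, sink func([]byte)) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			if p, err := captureGoroutineProfile(); err == nil {
				sink(p)
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 120*time.Millisecond)
	defer cancel()
	collected := 0
	collectPeriodically(ctx, 25*time.Millisecond, func(p []byte) { collected++ })
	fmt.Println("collected at least one profile:", collected > 0)
}
```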

What did you expect to see?

I expected the application’s performance behavior to be only slightly affected by collecting the goroutine profile (as is the case for collecting a heap profile, mutex contention profile, or CPU profile). I expected the magnitude of the latency impact to not scale with the number of goroutines (as the performance impact of heap profiling doesn’t change significantly with the size of the heap).

I expect some latency impact from a stop-the-world pause to prepare for the profile. I do not expect the duration of that pause to scale in proportion to the number of goroutines.

What did you see instead?

When goroutine profiling is enabled, the 99th-percentile latency for certain operations increases significantly. I have other programs with far fewer goroutines that collect goroutine profiles without incurring the same latency spikes.

The benchmark attempts to replicate the latency spike by emulating the work done by the real application. The real application has one goroutine that periodically requests a goroutine profile while the other goroutines do some operation and report how long it took. The benchmark sets up a scenario with one goroutine that continuously collects goroutine profiles, n idle goroutines, and one goroutine that reports how long it takes to sleep for 1 millisecond. The benchmark runs with several values of n to demonstrate the latency impact as n grows. The results show that while a goroutine profile is being collected, the p90, p99, and p99.9 wall-clock times the reproducer takes to sleep for 1ms increase with the number of live goroutines.

$ go test -bench=. -benchtime 1000x
goos: darwin
goarch: amd64
pkg: playground/runtime-slow-goroutineprofile
BenchmarkGoroutineProfile/1e5_other-8         	    1000	   1369474 ns/op	         4.97 ms-max	         0.477 ms-p90	         0.507 ms-p99	         0.532 ms-p99.9	      1001 work-count
BenchmarkGoroutineProfile/2e5_other-8         	    1000	   1376884 ns/op	         0.538 ms-max	         0.477 ms-p90	         0.516 ms-p99	         0.536 ms-p99.9	       999 work-count
BenchmarkGoroutineProfile/5e5_other-8         	    1000	   1365024 ns/op	         4.89 ms-max	         0.476 ms-p90	         0.507 ms-p99	         0.544 ms-p99.9	      1001 work-count
BenchmarkGoroutineProfile/1e6_other-8         	    1000	   1377884 ns/op	         3.91 ms-max	         0.478 ms-p90	         0.522 ms-p99	         0.900 ms-p99.9	      1002 work-count
BenchmarkGoroutineProfile/1e5_profile-8       	    1000	  83836454 ns/op	       344 ms-max	       148 ms-p90	       284 ms-p99	       338 ms-p99.9	       579 work-count
BenchmarkGoroutineProfile/2e5_profile-8       	    1000	 111897245 ns/op	       558 ms-max	       261 ms-p90	       342 ms-p99	       514 ms-p99.9	       430 work-count
BenchmarkGoroutineProfile/5e5_profile-8       	    1000	 238983513 ns/op	      1383 ms-max	       629 ms-p90	      1205 ms-p99	      1318 ms-p99.9	       379 work-count
BenchmarkGoroutineProfile/1e6_profile-8       	    1000	 129128664 ns/op	      2409 ms-max	      1086 ms-p90	      1234 ms-p99	      2405 ms-p99.9	       106 work-count
ok  	playground/runtime-slow-goroutineprofile	607.172s


package issue

import (
	"context"
	"runtime"
	"sort"
	"sync"
	"sync/atomic"
	"testing"
	"time"
)

func BenchmarkGoroutineProfile(b *testing.B) {
	testcase := func(n int, profile bool) func(b *testing.B) {
		return func(b *testing.B) {
			var wg sync.WaitGroup
			ctx, cancel := context.WithCancel(context.Background())
			sr := make([]runtime.StackRecord, 2e6)

			// create bystanders that idle until the benchmark ends
			for i := 0; i < n; i++ {
				wg.Add(1)
				go func() {
					defer wg.Done()
					<-ctx.Done()
				}()
			}
			// give the bystanders a moment to start up
			time.Sleep(1000 * time.Millisecond)

			// our villain: continuously collects goroutine profiles
			var workCount uint64
			wg.Add(1)
			go func() {
				defer wg.Done()

				for ctx.Err() == nil {
					atomic.AddUint64(&workCount, 1)

					if profile {
						runtime.GoroutineProfile(sr) // this function iterates over runtime.allgs while all other work is stopped
					}
					time.Sleep(1 * time.Millisecond)
				}
			}()

			// record how long it takes to sleep for 1 millisecond b.N times
			times := make([]float64, b.N)
			sleep := 1 * time.Millisecond

			b.ResetTimer()
			prev := time.Now()
			for i := 0; i < b.N; i++ {
				time.Sleep(sleep)
				now := time.Now()
				// overshoot beyond the requested 1ms sleep, in milliseconds
				times[i] = (now.Sub(prev) - sleep).Seconds() / time.Millisecond.Seconds()
				prev = now
			}
			b.StopTimer()

			cancel()
			wg.Wait()

			// sort descending so times[0] is the max and the indices
			// below pick out the slow tail
			sort.Sort(sort.Reverse(sort.Float64Slice(times)))
			b.ReportMetric(times[len(times)/10], "ms-p90")
			b.ReportMetric(times[len(times)/100], "ms-p99")
			b.ReportMetric(times[len(times)/1000], "ms-p99.9")
			b.ReportMetric(times[0], "ms-max")
			b.ReportMetric(float64(workCount), "work-count")
		}
	}

	b.Run("1e5 other", testcase(1e5, false))
	b.Run("2e5 other", testcase(2e5, false))
	b.Run("5e5 other", testcase(5e5, false))
	b.Run("1e6 other", testcase(1e6, false))
	b.Run("1e5 profile", testcase(1e5, true))
	b.Run("2e5 profile", testcase(2e5, true))
	b.Run("5e5 profile", testcase(5e5, true))
	b.Run("1e6 profile", testcase(1e6, true))
}

CC @aclements @rhysh


@rhysh rhysh commented Oct 4, 2019

@aclements what do you think is the right approach here?

Today, runtime.GoroutineProfile stops the world and keeps it stopped while iterating over runtime.allgs.

Could it instead inspect only the _Grunning goroutines (via allp) while the world is stopped, set a flag that's checked on the transition into _Grunning, and iterate over the rest of allgs while the program executes?

What level of consistency is required in the goroutine profile? Currently it's completely consistent, since the world is stopped. The function's docs also promise that the result slice won't be modified unless it writes out the full profile. The gcount function usually gives an estimate; is it safe to call while the world is stopped, and will that result in an exact count?
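For reference, the documented contract mentioned above (the result slice is untouched unless the full profile fits) is what drives the usual size-then-retry calling pattern. A minimal sketch, with a helper name (goroutineRecords) and slack size of my choosing:

```go
package main

import (
	"fmt"
	"runtime"
)

// goroutineRecords relies on GoroutineProfile's documented behavior: if
// len(p) < n it does not modify p and returns (n, false), so a first call
// with a nil slice yields the goroutine count. The count can change
// between calls, hence the retry loop and the small slack.
func goroutineRecords() []runtime.StackRecord {
	n, _ := runtime.GoroutineProfile(nil) // first call: just learn the count
	for {
		records := make([]runtime.StackRecord, n+10)
		var ok bool
		n, ok = runtime.GoroutineProfile(records)
		if ok {
			return records[:n]
		}
		// slice was too short; n now holds the larger count, retry
	}
}

func main() {
	recs := goroutineRecords()
	fmt.Println("profiled at least one goroutine:", len(recs) >= 1)
}
```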

The GC has a couple mechanisms for visiting all goroutines: stack scanning and stack shrinking. Extending those seems complex, but might lead to a low execution overhead. The simpler option I can think of involves adding a field to runtime.g to track if it's been included in the current profile, and adding a check to runtime.execute (the transition into _Grunning state / update of runtime.(*m).curg) to profile goroutines before they change their call stacks to be inconsistent with the rest of the profile. What level of complexity, and what level of overhead, would be acceptable in support of this fix?

@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019

@hyangah hyangah commented Dec 11, 2019

I also want to see GoroutineProfile scale. On the other hand, we often use GoroutineProfile when investigating issues involving locks and the relationships among consumer/producer goroutines. It would be nice if we could still perform that kind of analysis with the goroutine profile after this issue is addressed.
