runtime: maybe allgs should shrink after peak load #34457

Open
cch123 opened this issue Sep 22, 2019 · 6 comments

Comments

@cch123
Contributor

commented Sep 22, 2019

What version of Go are you using (go version)?

$ go version
go version go1.12.4 linux/amd64

Does this issue reproduce with the latest release?

Y

What operating system and processor architecture are you using (go env)?

any

What did you do?

When serving a peak load, the system creates a lot of goroutines; after the peak, the leftover goroutine objects cause extra CPU consumption.

This can be reproduced by:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof"
	"time"
)

func sayhello(wr http.ResponseWriter, r *http.Request) {}

func main() {
	// Simulate a peak load: spawn a million goroutines that exit after 10 seconds.
	for i := 0; i < 1000000; i++ {
		go func() {
			time.Sleep(time.Second * 10)
		}()
	}
	http.HandleFunc("/", sayhello)
	err := http.ListenAndServe(":9090", nil)
	if err != nil {
		log.Fatal("ListenAndServe:", err)
	}
}

After the 10 seconds have passed and the goroutines have exited, the in-use objects still remain the same.
[flame graph from the heap profile: flame3]
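
A rough way to confirm this without the pprof flame graph (just a sketch; the exact numbers depend on the machine and Go version, and the 15-second sleep is only there to let every goroutine finish) is to force a collection after the goroutines exit and print the heap statistics:

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// Simulate the same peak load as above.
	for i := 0; i < 1000000; i++ {
		go func() {
			time.Sleep(10 * time.Second)
		}()
	}

	// Wait until every goroutine has exited, then force a collection.
	time.Sleep(15 * time.Second)
	runtime.GC()

	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Printf("goroutines: %d, heap objects: %d, heap inuse: %d MiB\n",
		runtime.NumGoroutine(), ms.HeapObjects, ms.HeapInuse>>20)
}

If the observation above holds, the heap-object count should stay on the order of the number of goroutines created, since the g structures allocated by malg are kept for reuse rather than freed.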

What did you expect to see?

The global list of goroutines (allgs) shrinks back to a proper size.

What did you see instead?

Many in-use objects created by malg remain.

@cch123 cch123 changed the title maybe allgs should shrink after peak load runtime : maybe allgs should shrink after peak load Sep 22, 2019
@cch123 cch123 changed the title runtime : maybe allgs should shrink after peak load runtime: maybe allgs should shrink after peak load Sep 22, 2019
@zboya

commented Sep 23, 2019

In fact, allgs is never reduced, which is not good for stability. The runtime should provide a strategy to shrink it; for example, if sysmon finds that more than half of the gs are dead, it could release them.
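
To make the suggestion concrete, here is a rough sketch of that kind of policy over a made-up goroutine list; the types and the release mechanics are purely hypothetical and not how the runtime actually stores allgs:

package main

import "fmt"

// Hypothetical stand-ins for the runtime's goroutine records.
type gStatus int

const (
	gRunnable gStatus = iota
	gDead
)

type g struct{ status gStatus }

// maybeShrink rebuilds the list when more than half of the entries are dead,
// so the dead ones become unreferenced and collectable.
func maybeShrink(allgs []*g) []*g {
	dead := 0
	for _, gp := range allgs {
		if gp.status == gDead {
			dead++
		}
	}
	if dead*2 <= len(allgs) {
		return allgs // less than half dead: leave the list alone
	}
	live := make([]*g, 0, len(allgs)-dead)
	for _, gp := range allgs {
		if gp.status != gDead {
			live = append(live, gp)
		}
	}
	return live
}

func main() {
	gs := []*g{{status: gDead}, {status: gDead}, {status: gRunnable}}
	fmt.Println(len(maybeShrink(gs))) // 1
}

The hard part, as discussed later in this thread, is knowing when it is actually safe to drop an entry; this sketch glosses over that entirely.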

@changkun

Contributor

commented Sep 24, 2019

Very interesting practical observation. The actual cause of the issue comes down to the scalability of the GC.

I played with this example for a while, and here is a simpler test that may better describe and benchmark the issue:

package runtime_test // hypothetical package name; any *_test.go file works

import (
	"fmt"
	"runtime"
	"sync"
	"testing"
	"time"
)

func BenchmarkGCLargeAllG(b *testing.B) {
	wg := sync.WaitGroup{}

	for ng := 100; ng <= 1000000; ng *= 10 {
		b.Run(fmt.Sprintf("#g-%d", ng), func(b *testing.B) {
			// Prepare loads of goroutines and wait
			// all goroutines terminate.
			wg.Add(ng)
			for i := 0; i < ng; i++ {
				go func() {
					time.Sleep(100 * time.Millisecond)
					wg.Done()
				}()
			}
			wg.Wait()

			// Run GC once for cleanup
			runtime.GC()

			// Now record GC scalability
			b.ResetTimer()
			b.RunParallel(func(pb *testing.PB) {
				for pb.Next() {
					runtime.GC()
				}
			})
		})

	}
}

A sample benchstat output:

name                      time/op
GCLargeAllG/#g-100-6      22.6µs ± 6%
GCLargeAllG/#g-1000-6     39.1µs ± 2%
GCLargeAllG/#g-10000-6     180µs ± 3%
GCLargeAllG/#g-100000-6   1.69ms ± 1%
GCLargeAllG/#g-1000000-6  16.0ms ± 4%

One can observe that the GC scales poorly (e.g. in runtime.gcResetMarkState) once a certain number of gs have been allocated in allgs. In a large-scale service, GC can run much more slowly after a peak load, and it never recovers.

@aclements

Member

commented Sep 25, 2019

Your observation is correct. Currently the runtime never frees the g objects created for goroutines, though it does reuse them. The main reason for this is that the scheduler often manipulates g pointers without write barriers (a lot of scheduler code runs without a P, and hence cannot have write barriers), and this makes it very hard to determine when a g can be garbage collected.

One possible solution is to use an RCU-like reclamation scheme over the Ms that understands when each M's scheduler passes through a quiescent state. Then we could schedule unused gs to be reclaimed after a grace period, when all of the Ms have been in a quiescent state. Unfortunately, we can't simply use STWs to detect this grace period because those stop all Ps, so, just like the write barriers, those won't protect against scheduler instances manipulating gs without a P.
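
A minimal, purely illustrative sketch of that quiescent-state idea (toy types, nothing to do with the real scheduler): each M publishes the global epoch whenever it passes through a point where it holds no g pointers, and a g unlinked at epoch E may be freed once every M has published an epoch at or after E.

package main

import (
	"fmt"
	"sync/atomic"
)

// em is a stand-in for an M in this toy quiescent-state scheme.
type em struct {
	quiescent uint64 // last global epoch this M observed at a quiescent point
}

type reclaimer struct {
	globalEpoch uint64
	ms          []*em
}

// unlink records the epoch at which a g was removed from allgs; the g may
// only be freed after a full grace period beyond this epoch.
func (r *reclaimer) unlink() uint64 {
	return atomic.AddUint64(&r.globalEpoch, 1)
}

// quiescentPoint is called by an M at a point where its scheduler holds no
// g pointers; it publishes the current global epoch.
func (r *reclaimer) quiescentPoint(mp *em) {
	atomic.StoreUint64(&mp.quiescent, atomic.LoadUint64(&r.globalEpoch))
}

// graceElapsed reports whether every M has passed a quiescent point at or
// after epoch e, i.e. no M can still hold a stale pointer to the g.
func (r *reclaimer) graceElapsed(e uint64) bool {
	for _, mp := range r.ms {
		if atomic.LoadUint64(&mp.quiescent) < e {
			return false
		}
	}
	return true
}

func main() {
	r := &reclaimer{ms: []*em{{}, {}}}
	e := r.unlink()
	fmt.Println(r.graceElapsed(e)) // false: no M has been quiescent since the unlink
	for _, mp := range r.ms {
		r.quiescentPoint(mp)
	}
	fmt.Println(r.graceElapsed(e)) // true: safe to let the g be collected
}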

@changkun, I'm not sure what your benchmark is measuring. Calling runtime.GC from within a RunParallel doesn't make sense. The garbage collector is already concurrent, and calling runtime.GC doesn't start another garbage collection until the first one is done. Furthermore, if there are several pending runtime.GC calls, they'll all be coalesced into a single GC. If the intent is to just measure how long a GC takes, just call runtime.GC without the RunParallel.
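
For what it's worth, a minimal version of that measurement (just a sketch of the suggestion above, with the goroutine setup omitted) would be:

package runtime_test // hypothetical; any *_test.go file works

import (
	"runtime"
	"testing"
)

// Each runtime.GC call blocks until a complete collection has finished,
// so timing this serial loop measures whole GC cycles directly, without
// the coalescing that RunParallel runs into.
func BenchmarkGCSerial(b *testing.B) {
	for i := 0; i < b.N; i++ {
		runtime.GC()
	}
}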

@changkun

Contributor

commented Sep 25, 2019

@aclements It's actually intentional. Since the issue occurs with allgs, using RunParallel is intended to measure contention on allglock. Calling runtime.GC is just a representative, since all related calls will be serialized. If we write an export test function:

// place in runtime/export_test.go
func TravelAllGs() {
	lock(&allglock)
	for _, gp := range allgs {
		_ = gp
	}
	unlock(&allglock)
}

Then do:

b.RunParallel(func(pb *testing.PB) {
	for pb.Next() {
		runtime.TravelAllGs()
	}
})

We can also easily see the expense of looping over allgs here (which also implies the GC's scan work):

name                      time/op
GCLargeAllG/#g-100-6       132ns ±13%
GCLargeAllG/#g-1000-6      419ns ± 1%
GCLargeAllG/#g-10000-6    3.03µs ± 0%
GCLargeAllG/#g-100000-6   27.2µs ± 0%
GCLargeAllG/#g-1000000-6   260µs ± 1%

As you pointed out, it is hard to determine when a g can be collected. While playing with this observation, I was imagining that allgs would preferably be collected during GC (like what sync.Pool does) rather than by sysmon, if we can find a way to determine when it is safe.
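
For reference, the sync.Pool behavior being alluded to can be seen in a small standalone demo (nothing here touches the runtime internals): objects held only by a pool are released across garbage collections, so the pool shrinks back after load instead of growing forever.

package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	var newCalls int
	pool := sync.Pool{New: func() interface{} {
		newCalls++
		return make([]byte, 1<<20)
	}}

	pool.Put(pool.Get()) // allocate one buffer and hand it back to the pool
	fmt.Println("New calls before GC:", newCalls) // 1

	runtime.GC() // the collector clears the pool's primary cache...
	runtime.GC() // ...and (since Go 1.13) its victim cache as well

	pool.Get() // the pooled buffer is gone, so New runs again
	fmt.Println("New calls after GC:", newCalls) // 2
}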

@aclements

Member

commented Sep 26, 2019

Calling runtime.GC within a RunParallel does not measure contention on allglock. The GCs are serialized by runtime.GC itself, so they're not fighting over allglock, and they're coalesced by runtime.GC, so calling runtime.GC N times concurrently can result in anywhere from 1 to N GCs depending on vagaries of scheduling.

Benchmark aside, though, I think we're all clear on the issue that allgs is never collected and that impacts GC time and heap size.

Since gs are just heap allocated, it would make the most sense to collect them during GC like other heap allocations. The question is when it's safe to unlink them from allgs and allow them to be collected, given that the normal GC reachability invariants don't apply to gs. (At the same time, we don't want to be over-aggressive about unlinking them from allgs either, since we want the allocation pooling behavior to reduce the cost of starting a goroutine.) This is certainly doable, though it would require a fair amount of care.
