
runtime: long pauses STW (sweep termination) on massive block allocation #31222

Open
un000 opened this issue Apr 3, 2019 · 14 comments
Comments


@un000 un000 commented Apr 3, 2019

What version of Go are you using (go version)?

~ ᐅ go version
go version go1.12.1 darwin/amd64

What operating system and processor architecture are you using (go env)?

go env Output
~ ᐅ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/un0/Library/Caches/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/un0/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/Cellar/go/1.12.1/libexec"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.12.1/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/bf/sm9v_xtn4tj6xz_rmmwp5tzm0000gn/T/go-build942797956=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

We have a big entity cache inside the application, which is loaded in a background goroutine.
When that goroutine allocates a large slice, the latency inside all response handlers increases.

Here is a reproducer:

package main

import (
	"fmt"
	"math/rand"
	"os"
	"runtime/trace"
	"time"
)

type Rates struct {
	a, b, c, d float64
}

var maxWait time.Duration

func main() {
	const size = 100e6

	out, err := os.OpenFile("prof", os.O_WRONLY|os.O_CREATE|os.O_TRUNC, os.ModePerm)
	if err != nil {
		panic(err)
	}
	defer func() {
		_ = out.Sync()
		_ = out.Close()
	}()
	err = trace.Start(out)
	if err != nil {
		panic(err)
	}

	// simulate some work
	go loop()

	time.Sleep(3 * time.Second)
	// make a huge allocation in parallel
	go func() {
		_ = make([]Rates, size)		// << huge alloc
	}()
	time.Sleep(3 * time.Second)

	trace.Stop()

	fmt.Println("maxWait =", maxWait.String())
}

func loop() {
	for {
		now := time.Now()
		r := make([]Rates, 0)
		for i := 0; i < 100; i++ {
			r = append(r, Rates{
				rand.Float64(), rand.Float64(), rand.Float64(), rand.Float64(),
			})
		}
		d := time.Since(now)
		if maxWait < d {
			maxWait = d
		}
	}
}

[image]

What did you expect to see?

The worker goroutine (loop in the example) should not pause for more than 1–5 ms.

What did you see instead?

43ms pauses

@un000 un000 changed the title runtime GC: long pauses STW (sweep termination) on massive allocation runtime GC: long pauses STW (sweep termination) on massive block allocation Apr 3, 2019
@agnivade agnivade changed the title runtime GC: long pauses STW (sweep termination) on massive block allocation runtime: long pauses STW (sweep termination) on massive block allocation Apr 3, 2019

@agnivade agnivade commented Oct 8, 2019


@mknyszek mknyszek commented Oct 29, 2019

To be clear, the reproducer is making a 3.2 GiB allocation which is going to take a while for the OS to fulfill.

However, I dug into this a bit and I'm seeing ~1s pauses, though sometimes only ~50ms pauses.

The 50ms pauses represent a lower bound for how fast the OS can give us back pages. The ~1s pauses actually come from the fact that we're zeroing the entire 3.2 GiB allocation, even though it just came from the OS.

We have an optimization which avoids re-zeroing fresh pages, but unfortunately this optimization is conservative. For example, if a needzero span at the end of the heap is contiguous with the new 3.2 GiB allocation, then those two are going to coalesce, and the 3.2 GiB free space will be marked needzero, when really only the first N KiB of it actually needs to be zeroed.

One way we could fix this is by making needzero a number, and if we can guarantee that only the first N KiB need zeroing, then we only zero that. If it's some weird swiss cheese of non-zeroed memory then we just assume the whole thing needs zeroing, rather than trying to keep track of which parts need zeroing.

I'm not sure if this fix is worth it though, since future allocations would still have to go back and zero this memory anyway, causing the 1s delay to return.


@aclements aclements commented Oct 29, 2019

For 1.15, one of my planned follow-up tasks on non-cooperative preemption is to make memclrNoHeapPointers preemptible. For large pointer-free allocations, this should make the long zeroing operation preemptible.


@evanj evanj commented Apr 16, 2020

In case others are wondering if they are running into this: memclrNoHeapPointers shows up in pprof CPU profiles, so you should be able to find code paths calling it. You can also look for really large allocations in the alloc_space profile. (For the record, we are seeing ~second long GC pauses due to this issue in a production application using Go 1.14; thanks for the excellent reproduction which helped me understand this)


@changkun changkun commented Aug 30, 2021

I just encountered a >1s GC pause scenario, any suggestions on how to avoid the long pause?

[image]


@mknyszek mknyszek commented Aug 30, 2021

@changkun Is that with Go 1.17? Zeroing some large allocations is now preemptible, if you suspect that's the problem here.

1 second is quite a while. Is this reproducible? Can you get a GC trace where this GC happens? That would help in narrowing down the source of the pause (that is, which part of the cycle it happened). Even if it doesn't show up in the trace (like, this is about trying to stop the world, not actually having the world stopped) that's still useful information.


@changkun changkun commented Aug 30, 2021

@mknyszek

I am using a tip version go version devel go1.18-527609d47b Wed Aug 25 17:07:58 2021 +0200 darwin/arm64. The profile is collected from an app running on iOS.

The GC pause reported by runtime/metrics package is as follows:

1.2032e-05:	*
2.2528e-05:	*
3.2768e-05:	**
3.3792e-05:	***
3.4816e-05:	***
3.584e-05:	**
3.6864e-05:	**
3.7888e-05:	**
3.8912e-05:	*
3.9936e-05:	*
4.1984e-05:	*
4.608e-05:	*
4.7104e-05:	*
8.3968e-05:	*
0.000139264:	*
0.010747904:	*
0.012320768:	*
0.017301504:	*
0.01835008:	**
0.019398656:	**
0.019922944:	*
0.020447232:	*
0.02097152:	*
0.021495808:	**
0.022020096:	**
0.02359296:	*
0.037748736:	*
0.040894464:	*
0.042991616:	*
0.044040192:	*
0.04718592:	*
0.049283072:	*
0.071303168:	**
0.075497472:	***
0.081788928:	*
0.08388608:	*
0.090177536:	*
0.096468992:	**
0.142606336:	*
0.14680064:	**
1.476395008:	*

Examples:

GC idle took 0.02 seconds:

[image]

GC fractional took 1.2 seconds and 1.8 seconds:

[image]
[image]

Sweep termination took 1.8 seconds:

[image]

I hope these are good examples; if not, there are more in the trace profile:

https://drive.google.com/file/d/1OaMruMiu8RWBXclQDRuf7MGDkzjkR4PP/view?usp=sharing


I had a few runs where the maximum observation was 4 seconds (I didn't collect a trace profile for that case, but the reproducer should be able to observe it again):

[image]


The reproducer (depends on develop branch f52dd5f2461b4b6396df83a6a723037f74a75a72):

package main

import (
	"image/color"
	"time"

	"fyne.io/fyne/v2"
	"fyne.io/fyne/v2/app"
	"fyne.io/fyne/v2/canvas"
	"fyne.io/fyne/v2/container"
	"fyne.io/fyne/v2/layout"
	"fyne.io/fyne/v2/test"
	"fyne.io/fyne/v2/widget"
)

func main() {
	app := app.NewWithID("iosrend")
	win := app.NewWindow("Some Window")

	cells := make([]fyne.CanvasObject, 0, 80*25)
	for i := 0; i < 80; i++ {
		for j := 0; j < 25; j++ {
			cells = append(cells, canvas.NewRectangle(color.RGBA{R: 255, G: 255, B: 255, A: 128}))
		}
	}
	container := container.New(layout.NewGridLayout(25), cells...)

	c := 0
	b := widget.NewButton("START", func() {
		for i := 0; i < 80; i++ {
			for j := 0; j < 25; j++ {
				index := i*25 + j
				obj, _ := container.Objects[index].(*canvas.Rectangle)
				if c == 0 {
					obj.FillColor = color.RGBA{R: 20, G: 30, B: 100, A: 128}
				} else if c == 1 {
					obj.FillColor = color.RGBA{R: 100, G: 20, B: 30, A: 128}
				} else if c == 2 {
					obj.FillColor = color.RGBA{R: 30, G: 100, B: 20, A: 128}
				}
				obj.Refresh()
			}
		}
		c = (c + 1) % 3
	})
	go func() {
		tk := time.NewTicker(50 * time.Millisecond)
		for range tk.C {
			test.Tap(b)
		}
	}()

	win.SetContent(fyne.NewContainerWithLayout(layout.NewBorderLayout(nil, b, nil, nil), container, b))
	win.ShowAndRun()
}


@aclements aclements commented Sep 9, 2021

Thanks for the examples. The GC idle and fractional traces aren't STW. Something odd appears to be happening in the GC idle example you gave, because only G37 is running for a long time; that may or may not be a runtime issue, but it wouldn't show up as pause time. The sweep termination trace, however, is STW. Do you know which of the traces you uploaded is that one? We'd need to look at the stack trace of G21 at the end of the STW to get a sense of where this is. From the screenshot, G21 is clearly doing some extremely long (~35 second?!) operation.


@changkun changkun commented Sep 9, 2021

Thanks for the examples. The GC idle and fractional traces aren't STW. Something odd appears to be happening in the GC idle example you gave, because only G37 is running for a long time; that may or may not be a runtime issue, but it wouldn't show up as pause time.

@aclements, thanks for the follow-up. I wasn't able (or patient enough) to open the huge, slow-loading trace file, but the collected runtime/metrics data shows the largest STW was ~4s. From the traces, it seems those regions are not doing further rendering work (issuing GPU draw calls). This phenomenon is always surrounded by GC events, hence my assumption that the program has trouble cooperating with the GC.

The sweep termination trace, however, is STW. Do you know which of the traces you uploaded is that one? We'd need to look at the stack trace of G21 at the end of the STW to get a sense of where this is. From the screenshot, G21 is clearly doing some extremely long (~35 second?!) operation.

Yes, you can download the trace file here:
https://drive.google.com/file/d/1OaMruMiu8RWBXclQDRuf7MGDkzjkR4PP/view?usp=sharing
and the G21 is in http://[::]:8080/trace?start=46979199&end=48050088

G21 is supposed to run for the application's lifetime, except for a clear request. As the name of G21 reveals, it is the goroutine behind an unbounded channel: the implementation is just a for loop that takes input from a channel, appends it to a slice buffer, and sends it out whenever a receiver is waiting. Here is the code:

type UnboundedCanvasObjectChan struct { in, out chan fyne.CanvasObject }

func NewUnboundedCanvasObjectChan() *UnboundedCanvasObjectChan {
	ch := &UnboundedCanvasObjectChan{
		in:  make(chan fyne.CanvasObject, 128),
		out: make(chan fyne.CanvasObject, 128),
	}
	go func() {
		q := make([]fyne.CanvasObject, 0, 1<<10)
		for {
			e, ok := <-ch.in
			if !ok {
				close(ch.out)
				return
			}
			q = append(q, e)
			for len(q) > 0 {
				select {
				case ch.out <- q[0]:
					q = q[1:]
				case e, ok := <-ch.in:
					if ok {
						q = append(q, e)
						break
					}
					for _, e := range q {
						ch.out <- e
					}
					close(ch.out)
					return
				}
			}
			if cap(q) < 1<<5 {
				q = make([]fyne.CanvasObject, 0, 1<<10)
			}
		}
	}()
	return ch
}

func (ch *UnboundedCanvasObjectChan) In() chan<- fyne.CanvasObject { return ch.in }
func (ch *UnboundedCanvasObjectChan) Out() <-chan fyne.CanvasObject { return ch.out }


@aclements aclements commented Sep 9, 2021

Thanks. From the trace, I can confirm that this is a problem with mallocgc blocking preemption. Specifically, the exit stack trace is:

runtime.mallocgc:1165
runtime.growslice:269
fyne.io/fyne/v2/internal/async.NewUnboundedCanvasObjectChan.func1:47

This reflects the first preemption point after the non-preemptible operation. So it's a little hard to say exactly where this time is going in mallocgc, but we can tell it's from an append in NewUnboundedCanvasObjectChan (the one on line 47). fyne.CanvasObject contains pointers, so the zeroing of that memory is currently non-preemptible.

Unfortunately, I think all this tells us is that this is the known issue with allocating large blocks of pointerful memory causing long non-preemptible regions.

I am surprised it's a whole second, and that makes me wonder if there's something beyond zeroing the memory at play here. @changkun , do you know how large these slices get? What hardware are you running this on?


@changkun changkun commented Sep 9, 2021

@aclements Thanks again for the confirmation. If it is a known issue, is there any known workaround?

The unbounded channel code is for delivering draw operations. It is hard to know exactly how large the slice gets, but I can make a rough estimate:

The code runs for about 20 seconds without GC running in the background. Assume the slice was cleared by the last GC and starts from zero size. At least 2000 pointer objects must be appended to the slice every 50 milliseconds, i.e. 400 ticks × 2000 = 800,000 objects cached in the slice before sweep termination. If a pointer object is 16 bytes, the entire slice is roughly at least 12.8 MB.

The reproducer (in previous comments) was running on iOS 15, iPhone 12 Pro (without Max).


@aclements aclements commented Sep 13, 2021

Hmm. It definitely shouldn't take anywhere near that long just to zero 12MB. For lack of a better hypothesis, my best guess is that there's some unfortunate OS interaction involved, like it has to page in more memory for this process but to do that it has to shut down some other cached process (that's a total guess).

In terms of workarounds, unfortunately the answer is to make smaller allocations. That might be a better code structure here anyway: you're effectively creating a deque, which is often implemented as a linked list of fixed-size blocks. That would limit each individual allocation and allow GC to more effectively reclaim memory.


@changkun changkun commented Sep 14, 2021

Fixed-size blocks are a nice structure, but they add some implementation complexity around resizing. Before doing that, I still wonder why append causes a huge allocation. Especially:

		q := make([]*int, 0, 1<<10) // slice
		for {
			e, ok := <-ch.in
			...
			q = append(q, e)                         // (1) large alloc (?)
			for len(q) > 0 {
				select {
				case ch.out <- q[0]:
					q = q[1:]                // (2) here
				case e, ok := <-ch.in:
					if ok {
						q = append(q, e) // (3) large alloc (?)
						break
					}

The previous estimate (~12MB) is really a worst-case scenario, meaning no receiver is available for 20 seconds, which is entirely unrealistic; a real case should be better than that. The slice operation in (2) moves the head of the slice, which removes the reference to the previous element. A large allocation in append means it is doubling an already huge slice capacity. This seems to suggest that the slice caching the elements can only be GCed very late. Is there a particular reason that the GC does not collect those unreferenced elements in the slice?


@aclements aclements commented Sep 17, 2021

Is there a particular reason that GC does not collect the those unreferenced elements in the slice?

The GC doesn't work in partial objects. That is, it doesn't track the "earliest" reference into a slice, just that there was some reference to the slice. And once the entire slice is retained by GC, it also retains everything it points to.

One could certainly imagine GC tracking the earliest pointer into a slice and only scanning the slice past that point (in fact, you could do this for all objects). This gets really tricky when there are multiple references to an object. For example, the GC could have already followed one reference and scanned the object past that point, only to later find a reference that points earlier in the object and either have to scan it again or track the point in each object it last scanned from. It's all theoretically doable, but it wouldn't pay for its own costs in most cases.

If you want GC to collect the unreferenced elements, set q[0] to nil before slicing it. But that's not going to help with the large allocations.
