runtime: program appears to spend 10% more time in GC on tip 3c47ead than on Go1.13.3 #35430

Open
ardan-bkennedy opened this issue Nov 7, 2019 · 5 comments

@ardan-bkennedy ardan-bkennedy commented Nov 7, 2019

What version of Go are you using (go version)?

$ gotip version
go version devel +3c47ead Thu Nov 7 19:20:57 2019 +0000 darwin/amd64

Does this issue reproduce with the latest release?

The current release, 1.13.3, runs it faster. In fact, a version of gotip from yesterday had this program spending 50% of its time in GC; with this latest version of tip it is down to 33%.

What operating system and processor architecture are you using (go env)?

go env Output
$ gotip env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/bill/Library/Caches/go-build"
GOENV="/Users/bill/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/bill/code/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/Users/bill/sdk/gotip"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/bill/sdk/gotip/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/f8/nl6gsnzs1m7530bkx9ct8rzc0000gn/T/go-build761411139=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

https://github.com/ardanlabs/gotraining/tree/master/topics/go/profiling/trace

With the following code changes.

// Uncomment these two lines.
44     trace.Start(os.Stdout)
45     defer trace.Stop()

Comment out line 53 and uncomment line 56.

52     topic := "president"
53     // n := freq(topic, docs)
54     // n := freqConcurrent(topic, docs)
55     // n := freqConcurrentSem(topic, docs)
56     n := freqNumCPU(topic, docs)
57     // n := freqNumCPUTasks(topic, docs)
58     // n := freqActor(topic, docs)

Run the program

$ gotip build
$ ./trace > t.out
$ gotip tool trace t.out
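
For reference, a minimal, self-contained sketch of the tracing setup the steps above rely on (not the actual gotraining program; the work being traced is elided):

package main

import (
	"log"
	"os"
	"runtime/trace"
)

func main() {
	// The trace is written to stdout, which the run step above
	// redirects into t.out for `gotip tool trace`.
	if err := trace.Start(os.Stdout); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	// ... open, read, decode, and search the documents here ...
}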

What did you expect to see?

I expected the GC to be at or under 25% of the total run time for the program, and I didn't expect the program to run slower. Also, the freqConcurrent version of the algorithm used to run in a comparable time; now on tip it is faster as well, by close to 300 milliseconds.

What did you see instead?

With the latest version of tip for today, I saw GC using 33% of the total run time.

On Tip

GC | 282,674,620 ns wall | 282,674,620 ns self | 674,641 ns average | 419 occurrences
Selection start: 3,595,151 ns
Selection extent: 845,408,873 ns
Total run time: 849.3 ms

On 1.13.3

GC | 174,446,968 ns wall | 174,446,968 ns self | 425,480 ns average | 410 occurrences
Selection start: 2,872,528 ns
Selection extent: 763,358,190 ns
Total run time: 768.0 ms
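
For reference, the GC share implied by these numbers (rounded):

tip:    282,674,620 ns / 849,300,000 ns ≈ 33.3% of the run in GC
1.13.3: 174,446,968 ns / 768,000,000 ns ≈ 22.7% of the run in GC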

@odeke-em odeke-em commented Nov 7, 2019

Thank you for reporting this issue @ardan-bkennedy!

Kindly paging @mknyszek @randall77 @aclements @RLH.

@odeke-em odeke-em changed the title runtime/GC: Program appears to spend 10% more time in GC on tip runtime: program appears to spend 10% more time in GC on tip 3c47ead than on Go1.13.3 Nov 7, 2019

@mknyszek mknyszek commented Nov 7, 2019

This is likely related to golang.org/cl/200439 which allows the GC to assist more than 25% in cases where there's a high rate of allocation.

Although this seems like a regression, please stay tuned. I'm currently in the process of landing a set of patches related to #35112 and by the end, with this additional GC use, it's a net win for heavily allocating applications (AFAICT).

The reason we're allowing GC to exceed 25% in these cases is that #35112 makes the page allocator fast enough to out-run the GC and drive the trigger ratio to very low values (like 0.01), which means the next mark phase starts almost immediately; pretty much all new memory would then be allocated black, leading to an unnecessary RSS increase. By bounding the trigger ratio as in golang.org/cl/200439, your application may end up assisting more, but in my experiments the latency win from #35112 should still beat that latency hit by a significant margin.
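
To make the trigger-ratio numbers concrete, a rough back-of-the-envelope sketch (a deliberately simplified model of the pacer, assuming GOGC=100 and a 4 MiB live heap after the previous cycle; the real pacer is more involved):

package main

import "fmt"

func main() {
	// Simplified model (assumption): the next GC is triggered once the heap
	// has grown by triggerRatio over what was marked live last cycle.
	const heapMarked = 4.0            // MiB live after the last GC (assumed)
	const heapGoal = heapMarked * 2.0 // GOGC=100 => goal of roughly 8 MiB

	for _, triggerRatio := range []float64{0.5, 0.01} {
		trigger := heapMarked * (1 + triggerRatio)
		fmt.Printf("trigger ratio %.2f: marking starts near %.2f MiB (goal %.1f MiB)\n",
			triggerRatio, trigger, heapGoal)
	}
	// At 0.01 the next mark phase starts at ~4.04 MiB, i.e. almost
	// immediately, so most newly allocated memory is allocated black.
}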

I'll poke this thread again when I've finished landing the full stack of changes, so please try again at that point.

In the meantime, could you provide some information about your application? In particular:

  • What is the value of GOMAXPROCS when running this program?
  • How heavily does it allocate/do you expect it to allocate?
    • Does it perform these allocations concurrently?

This will help me get a better idea of whether this will be a win, or whether this is a loss in single-threaded performance or something else (a quick way to dump the first two numbers is sketched below).
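
A sketch for gathering GOMAXPROCS and a rough allocation count using the standard runtime APIs (runtime.GOMAXPROCS and runtime.ReadMemStats); the workload itself is elided:

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Passing 0 reports the current setting without changing it.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))

	// ... run the workload here, then snapshot the allocation counters ...

	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Printf("cumulative allocations: %d objects, %d bytes (heap in use: %d bytes)\n",
		ms.Mallocs, ms.TotalAlloc, ms.HeapInuse)
}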


@ardan-bkennedy ardan-bkennedy commented Nov 7, 2019

Hardware Overview:

  Model Name:	MacBook Pro
  Model Identifier:	MacBookPro15,1
  Processor Name:	6-Core Intel Core i9
  Processor Speed:	2.9 GHz
  Number of Processors:	1
  Total Number of Cores:	6
  L2 Cache (per Core):	256 KB
  L3 Cache:	12 MB
  Hyper-Threading Technology:	Enabled
  Memory:	32 GB

This runs as a 12-threaded Go program, so the code is using a pool of 12 goroutines, and the GC is keeping the heap at 4 MB. In the version of the code that creates a goroutine per file, I see the heap grow as high as 80 MB.

The program is opening, reading, decoding, and searching 4,000 files. It's memory intensive to an extent. Throwing 4,000 goroutines at this problem on tip finishes the work faster than using a pool. That was never the case in 1.13.
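
As a rough illustration of the two shapes being compared here, a hand-written sketch (not the actual gotraining code; processFile is a placeholder for the per-document open/read/decode/search work):

package main

import (
	"runtime"
	"sync"
	"sync/atomic"
)

// processFile stands in for the real per-document work.
func processFile(topic, doc string) int32 { return 0 }

// freqPool: a bounded pool of runtime.NumCPU() goroutines pulling
// documents off a channel (the shape of the pool variant).
func freqPool(topic string, docs []string) int {
	ch := make(chan string, len(docs))
	for _, doc := range docs {
		ch <- doc
	}
	close(ch)

	var total int32
	var wg sync.WaitGroup
	workers := runtime.NumCPU()
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer wg.Done()
			for doc := range ch {
				atomic.AddInt32(&total, processFile(topic, doc))
			}
		}()
	}
	wg.Wait()
	return int(total)
}

// freqPerFile: one goroutine per document (the shape of the
// goroutine-per-file variant).
func freqPerFile(topic string, docs []string) int {
	var total int32
	var wg sync.WaitGroup
	wg.Add(len(docs))
	for _, doc := range docs {
		go func(doc string) {
			defer wg.Done()
			atomic.AddInt32(&total, processFile(topic, doc))
		}(doc)
	}
	wg.Wait()
	return int(total)
}

func main() {}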


@ardan-bkennedy ardan-bkennedy commented Nov 7, 2019

I find this interesting. This is my understanding.

A priority of the pacer is to maintain a smaller heap over time and to reduce mark assist (MA) so more M's can be used for application work during any GC cycle. A GC may start early (before the heap reaches the GC percent threshold) if it means reducing MA time. In the end, the total GC time would stay at or below 25%.

This change is allowing the GC time to grow above 25% to help reduce the size of the heap in some heavy allocation scenarios. This will increase the amount of MA time and reduce the application throughput during a GC?

Your hope is the performance loss there is gained back in the allocator?

In the end, the heap size remains as small as possible?


@mknyszek mknyszek commented Nov 7, 2019

I find this interesting. This is my understanding.

A priority of the pacer is to maintain a smaller heap over time and to reduce mark assist (MA) so more M's can be used for application work during any GC cycle. A GC may start early (before the heap reaches the GC percent threshold) if it means reducing MA time. In the end, the total GC time would stay at or below 25%.

Pretty much, though I wouldn't characterize it as "may start early", but rather as just "starts earlier". It's the pacer's job to drive GC use to 25%, and its primary tool for doing so is deciding when to start a GC.

This change is allowing the GC time to grow above 25% to help reduce the size of the heap in some heavy allocation scenarios. This will increase the amount of MA time and reduce the application throughput during a GC?

Both latency and throughput, but yes that's correct.

Your hope is the performance loss there is gained back in the allocator?

Correct. A heavily allocating RPC benchmark was able to drive the pacer to start a GC at the half-way point (trigger ratio = 0.5) in Go 1.13. The same benchmark drove the trigger ratio to 0.01 with the new allocator. The most convincing evidence that the allocator simply got faster was that the only thing that brought the trigger ratio back up was adding a sleep on the critical path.

In the end, this RPC benchmark saw a significant improvement in tail latency (-20% or more) and throughput (+30% or more), even with the new threshold.

In the end, the heap size remains as small as possible?

Not quite. The threshold in that above CL was chosen to keep the heap size roughly the same across Go versions.

@agnivade agnivade added this to the Go1.14 milestone Nov 8, 2019