
runtime: program appears to spend 10% more time in GC on tip 3c47ead than on Go1.13.3 #35430

Closed
ardan-bkennedy opened this issue Nov 7, 2019 · 24 comments

@ardan-bkennedy commented Nov 7, 2019

What version of Go are you using (go version)?

$ gotip version
go version devel +3c47ead Thu Nov 7 19:20:57 2019 +0000 darwin/amd64

Does this issue reproduce with the latest release?

The current release, 1.13.3, runs faster. In fact, with a version of gotip from yesterday this program spent 50% of its time in GC. With this latest version of tip, it is now running at 33%.

What operating system and processor architecture are you using (go env)?

go env Output
$ gotip env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/bill/Library/Caches/go-build"
GOENV="/Users/bill/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/bill/code/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/Users/bill/sdk/gotip"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/bill/sdk/gotip/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/f8/nl6gsnzs1m7530bkx9ct8rzc0000gn/T/go-build761411139=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

https://github.com/ardanlabs/gotraining/tree/master/topics/go/profiling/trace

With the following code changes.

// Uncomment these two lines.
44     trace.Start(os.Stdout)
45     defer trace.Stop()

Comment out line 53 and uncomment line 56.

52     topic := "president"
53     // n := freq(topic, docs)
54     // n := freqConcurrent(topic, docs)
55     // n := freqConcurrentSem(topic, docs)
56     n := freqNumCPU(topic, docs)
57     // n := freqNumCPUTasks(topic, docs)
58     // n := freqActor(topic, docs)

Run the program

$ gotip build
$ ./trace > t.out
$ gotip tool trace t.out

What did you expect to see?

I expected GC to be at or under 25% of the program's total run time. I didn't expect the program to run slower. Also, the freqConcurrent version of the algorithm used to run in a comparable time; now on tip it is faster as well, by close to 300 milliseconds.

What did you see instead?

With the latest version of tip for today, I saw GC using 33% of the total run time.

On Tip

GC | 282,674,620 ns | 282,674,620 ns | 674,641 ns | 419
Selection start: 3,595,151 ns
Selection extent: 845,408,873 ns
Total Run time: 849.3ms

On 1.13.3

GC | 174,446,968 ns | 174,446,968 ns | 425,480 ns | 410
Selection start: 2,872,528 ns
Selection extent: 763,358,190 ns
Total Run time: 768.0ms

@odeke-em (Member) commented Nov 7, 2019

Thank you for reporting this issue @ardan-bkennedy!

Kindly paging @mknyszek @randall77 @aclements @RLH.

@odeke-em odeke-em changed the title runtime/GC: Program appears to spend 10% more time in GC on tip runtime: program appears to spend 10% more time in GC on tip 3c47ead than on Go1.13.3 Nov 7, 2019
@mknyszek (Contributor) commented Nov 7, 2019

This is likely related to golang.org/cl/200439, which allows the GC to assist more than 25% in cases where there's a high rate of allocation.

Although this seems like a regression, please stay tuned. I'm currently in the process of landing a set of patches related to #35112 and by the end, with this additional GC use, it's a net win for heavily allocating applications (AFAICT).

The reason we're allowing GC to exceed 25% in these cases is because #35112 makes the page allocator fast enough to out-run the GC and drive the trigger ratio to very low values (like 0.01), which means the next mark phase is starting almost immediately, meaning pretty much all new memory would be allocated black, leading to an unnecessary RSS increase. By bounding the trigger ratio like in golang.org/cl/200439, your application may end up assisting more, but the latency win from #35112 should still beat that latency hit by a significant margin in my experiments.

I'll poke this thread again when I've finished landing the full stack of changes, so please try again at that point.

In the meantime, could you provide some information about your application? In particular:

  • What is the value of GOMAXPROCS when running this program?
  • How heavily does it allocate/do you expect it to allocate?
    • Does it perform these allocations concurrently?

This will help me get a better idea of whether this will be a win, or whether this is a loss in single-threaded performance or something else.

@ardan-bkennedy (Author) commented Nov 7, 2019

Hardware Overview:

  Model Name:	MacBook Pro
  Model Identifier:	MacBookPro15,1
  Processor Name:	6-Core Intel Core i9
  Processor Speed:	2.9 GHz
  Number of Processors:	1
  Total Number of Cores:	6
  L2 Cache (per Core):	256 KB
  L3 Cache:	12 MB
  Hyper-Threading Technology:	Enabled
  Memory:	32 GB

This runs as a 12-thread Go program. So the code is using a pool of 12 goroutines, and the GC is keeping the heap at 4 MB. In the version of the code that creates a goroutine per file, I see the heap grow as high as 80 MB.

The program is opening, reading, decoding, and searching 4,000 files. It's memory intensive to an extent. Throwing 4,000 goroutines at this problem on tip finishes the work faster than using a pool. That was never the case in 1.13.

@ardan-bkennedy (Author) commented Nov 7, 2019

I find this interesting. This is my understanding.

A priority of the pacer is to maintain a smaller heap over time and to reduce mark assist (MA) so more M's can be used for application work during any GC cycle. A GC may start early (before the heap reaches the GC percent threshold) if it means reducing MA time. In the end, the total GC time would stay at or below 25%.

This change is allowing the GC time to grow above 25% to help reduce the size of the heap in some heavy allocation scenarios. This will increase the amount of MA time and reduce the application throughput during a GC?

Your hope is the performance loss there is gained back in the allocator?

In the end, the heap size remains as small as possible?

@mknyszek (Contributor) commented Nov 7, 2019

I find this interesting. This is my understanding.

A priority of the pacer is to maintain a smaller heap over time and to reduce mark assist (MA) so more M's can be used for application work during any GC cycle. A GC may start early (before the heap reaches the GC percent threshold) if it means reducing MA time. In the end, the total GC time would stay at or below 25%.

Pretty much, though I wouldn't characterize it as "may start early", but rather as just "starts earlier". It's the pacer's job to drive GC use to 25%, and its primary tool for doing so is deciding when to start a GC.

This change is allowing the GC time to grow above 25% to help reduce the size of the heap in some heavy allocation scenarios. This will increase the amount of MA time and reduce the application throughput during a GC?

Both latency and throughput, but yes that's correct.

Your hope is the performance loss there is gained back in the allocator?

Correct. A heavily allocating RPC benchmark was able to drive the pacer to start a GC at the half-way point (trigger ratio = 0.5) in Go 1.13. The same benchmark drove the trigger ratio to 0.01 with the new allocator. The most convincing evidence that this was simply the allocator getting faster was that the only thing that brought the trigger ratio back up was adding a sleep on the critical path.

In the end, this RPC benchmark saw a significant improvement in tail latency (-20% or more) and throughput (+30% or more), even with the new threshold.

In the end, the heap size remains as small as possible?

Not quite. The threshold in the CL above was chosen to keep the heap size roughly the same across Go versions.

@agnivade agnivade added this to the Go1.14 milestone Nov 8, 2019
@ianlancetaylor (Contributor) commented Nov 16, 2019

@mknyszek Can you let @ardan-bkennedy know when to re-run tests for this issue? Thanks.

@mknyszek (Contributor) commented Nov 18, 2019

@ianlancetaylor Ah! Sorry. I completely forgot.

@ardan-bkennedy If it's on Linux, Windows, or FreeBSD, feel free to try again from tip any time. :) I'm still working out some issues on the less popular platforms.

@ardan-bkennedy (Author) commented Nov 19, 2019

@mknyszek I am running on a Mac. I need time to test this on Linux.

Side note: I find it interesting that you consider Darwin a less popular platform, when most developers I know are working on it.

@mknyszek (Contributor) commented Nov 19, 2019

@mknyszek I am running on a Mac. I need time to test this on Linux.

Side note: I find it interesting that you consider Darwin a less popular platform, when most developers I know are working on it.

@ardan-bkennedy That's my mistake, I omitted it by accident. I do consider it a popular platform. Please give it a try.

The "less popular" platforms I had in mind were AIX and OpenBSD, so really anything that's not those two, though AIX should be OK now.

@ardan-bkennedy (Author) commented Nov 19, 2019

I gave Darwin a try today, and GC actually ran closer to 60%. I just downloaded tip once more and ran it again; the program looks like it is running slower, but GC now looks like it is at 37%.

12 Days Ago

GC | 282,674,620 ns | 282,674,620 ns | 674,641 ns | 419
Selection start: 3,595,151 ns
Selection extent: 845,408,873 ns
Total Run time: 849.3ms

Tonight
go version devel +8b1e8a424a Tue Nov 19 19:59:21 2019 +0000 darwin/amd64

GC | 341,260,881 ns | 341,260,881 ns | 773,834 ns | 441
Selection start: 2,867,305 ns
Selection extent: 915,831,104 ns
Total Run time: 919.96ms

@mknyszek (Contributor) commented Nov 20, 2019

@ardan-bkennedy Interesting. I wonder to what extent we're seeing additional start-up costs here, considering that the application only runs for about a second (though 70 ms is kind of a lot, I wonder what the distributions look like).

@ardan-bkennedy (Author) commented Nov 20, 2019

@mknyszek the program is available for you to run; I added instructions earlier. On my machine I expect the program to run as fast as the fan-out version I run with 4,000 goroutines, which is ~750 ms.

@ardan-bkennedy (Author) commented Dec 5, 2019

@mknyszek In class today someone familiar with GC in the JVM suggested I run the program with a 40 MB heap. That allowed the program to run brilliantly.

GOGC=1000 ./trace > t.out

GC | 12,385,913 ns | 12,385,913 ns | 619,296 ns | 20
Selection start: 32,301,149 ns
Selection extent: 452,915,504 ns
Total Run time: 505.645 ms

This is the first time I've understood the relationship between the GC percent setting and how the pacer runs.

BTW, using GOGC=100 on the latest tip ran worse today.

$ gotip version
go version devel +aab0280476 Wed Dec 4 14:53:20 2019 +0000 darwin/amd64

@mknyszek (Contributor) commented Dec 5, 2019

@ardan-bkennedy I tried this out on both Linux and macOS and cannot reproduce a difference. On Linux it was consistently faster (but only by a few ms out of a roughly one-second run, so I'm hesitant to claim any wins here). On macOS it takes roughly the same amount of time either way.

What is GOMAXPROCS for you?

Also, looking at GODEBUG=gcpacertrace=1, the GC utilization is similar overall in each GC (u_a, the output is a bit hard to read), and in Go 1.13.5 already goes higher than 25% for me quite often. I'm not sure whether there's a problem here with that at all.

Also, how are you producing the output you're posting in this issue? I haven't used the trace tool extensively (except for the in-browser viewer) so this is a mode I'm not familiar with.

@ardan-bkennedy (Author) commented Dec 6, 2019

@mknyszek Do you think there is any chance we can jump on a hangout? Then I can show you what I am doing and why it appears this algorithm is not behaving the same as it did in 1.12 and 1.13 on Darwin/amd64.

bill@ardanlabs.com

GOMAXPROCS is 12
macOS Catalina (10.15.2 beta)
2.9 GHz 6-Core Intel Core i9
32 GB 2400 MHz DDR4

@ardan-bkennedy (Author) commented Dec 12, 2019

@mknyszek I am home for the next few weeks and I have time to jump on a hangout to show you what I am seeing if you have the time. IMHO, this is not resolved. However, this is an interesting example to teach how to slow the pacer down using GOGC to handle these types of situations.

@ardan-bkennedy (Author) commented Jan 16, 2020

@mknyszek things got much worse today on Darwin. Now the 4,000-goroutine program is running at 50% GC. I am seeing a ton of mark assist on the first GC, which I never saw before. When I use GOGC=1000 the program performs very well. The heap size is really causing the pacer to over-GC, IMHO.

BTW, I am now running a new 8-core Mac.

Is there any chance we can get on a video chat so I can show you what is happening?

@mknyszek (Contributor) commented Jan 27, 2020

Hey @ardan-bkennedy, I think it would be best if we kept the chat here. My apologies for the delay in replying.

Could you share more information about how you're generating the output? Specifically this:

GC | 12,385,913 ns | 12,385,913 ns | 619,296 ns | 20
Selection start: 32,301,149 ns
Selection extent: 452,915,504 ns
Total Run time: 505.645 ms

You mentioned that you got it from a trace, but if you have more concrete details about how you extracted this out of a trace that would be very helpful.

Also, could you show a comparison of these numbers for you between Go 1.13 and tip?

I tried to produce a wall-clock time regression with both the freqConcurrent and freqNumCPU configurations (since you indicated there was an increase in total run time) on my 4-core macOS machine and have been unable to.

Now the 4000 goroutine program is running at 50% GC. I am seeing a TON of mark assist on the first GC which I never saw before.

As I mentioned earlier in this thread, more mark assists overall is expected with the new release. Note that when an application changes phases (e.g. it starts doing lots of work and moves toward a steady-state) the GC tends to over-do it and then converge to a steady-state. Given what I mentioned earlier, it makes sense that you'd see a lot more mark assists in these GCs. An increase in time spent in mark assists isn't necessarily a bad thing.

The heap size is really causing the pacer to over GC, IMHO.

I'm not quite sure what you mean by "over GC" here. In the default configuration, GOGC=100, the heap will grow to 2x the live heap and pace the mark phase to start and end before it gets there. At GOGC=1000, the heap will grow to 11x the live heap. Since the amount of mark work is proportional to only the size of the live heap, in the steady-state (the live heap isn't growing or shrinking) the application is going to spend the same amount of time GCing regardless of the heap goal. By dialing GOGC up, you're trading off a bigger heap size in exchange for fewer GCs.

@ardan-bkennedy (Author) commented Jan 30, 2020

These numbers are coming from the tracing tool by selecting all the [GC] blocks.

I appreciate you want to keep this here and I do too. It would be very helpful for me to show you this. If we can do that, we could document what we have found.

I believe I shouldn't have to dial up GOGC to get the results I'm expecting. The pacer seems to be over-pacing, IMO, something it didn't do in the past (versions 1.12 and 1.13).

I wish I could post images but that is not really an option.

I'm home today, tomorrow and all next week. Just give me 30 minutes to show you what I am seeing.

@aclements (Member) commented Feb 3, 2020

Thanks, we now understand where you're getting those numbers from, which is very helpful.

What you're measuring simply isn't GC time. And it's unrelated to the 25% goal. You're measuring the wall-clock time that GC is active. While it's true that spending a higher or lower fraction of wall-clock time with GC active can affect the program, these effects are secondary, and generally not large unless that fraction gets a fair bit closer to 100%.

The GC's 25% goal is not about the fraction of wall-clock time spent with GC active. It's that, while the GC is active, it consumes 25% of CPU time. For example, the GC could be active 100% of wall-clock time, but using one of four CPUs while leaving the other three to the application's goroutines, and that would satisfy the 25% goal.

This is why we strongly encourage people to look at end-to-end performance measurements. Increased wall-clock time spent with GC active only matters if it helps explain a change in an application measurement that does matter, such as a program's overall throughput/running time or latency. We haven't been able to find any measurable impact on your application.

@ardan-bkennedy (Author) commented Feb 3, 2020

Apologies for all my ignorance in all of this. Just to be sure of everything, I downloaded 1.12.16 and ran the NumCPU algorithm against 1.12.6, 1.13.6, and the latest gotip as of now.

1.12.6 : perf -> 864.339ms : GC wall clock -> 314ms
1.13.6 : perf -> 759.603ms : GC wall clock -> 262ms
tip    : perf -> 841.276ms : GC wall clock -> 432ms

This is a small dataset, so maybe you consider these equal. I am running on an 8-core Intel. Seeing the performance drop back to 1.12 levels on tip is what triggered all of this for me in the first place.

How can I properly measure the GC time?

Since using GOGC=1000 considerably shaves 200ms off the time to run the program, is this reasonable to do in practice for something like this?

I would like guidance (if possible) on pacing and manually controlling the heap size.

@toothrot toothrot modified the milestones: Go1.14, Go1.15 Feb 25, 2020
@ianlancetaylor (Contributor) commented Jun 15, 2020

@mknyszek What is the status of this issue? Are we going to do anything about this for 1.15?

@ardan-bkennedy (Author) commented Jun 16, 2020

This has been resolved AFAIAC. You can close it.

@aclements (Member) commented Jun 16, 2020

Thanks for the update! Closing.

@aclements aclements closed this Jun 16, 2020