Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: performance degradation in go 1.12 #36521

Closed
interviewQ opened this issue Jan 13, 2020 · 25 comments
Closed

runtime: performance degradation in go 1.12 #36521

interviewQ opened this issue Jan 13, 2020 · 25 comments
Assignees
Milestone

Comments

@interviewQ
Copy link

@interviewQ interviewQ commented Jan 13, 2020

What version of Go are you using (go version)?

go 1.11.12, planning to upgrade to go 1.12.15

Does this issue reproduce with the latest release?

Yes, with go 1.12.15

What operating system and processor architecture are you using (go env)?

Ubuntu 18.04

What did you do?

Since go 1.11 is about to go EOL, we are planning to upgrade to the next version. I attempted
to upgrade to go 1.12.15, but I am seeing 15-20% performance degradation. A similar result was seen with go 1.13.6.

What did you expect to see?

I was hoping to see performance improvement with a newer go compiler.

What did you see instead?

Instead of performance improvement, saw performance degradation instead.

Is there anything in newer go versions (1.12 and beyond) that could answer the above performance degradation? I read online that the scheduler was modified to allow preemption of long-running goroutines. While I do not know if this could cause the performance issue, is there any way to revert this scheduler behavior just for test purposes. Any other suggestions are welcome.

@ALTree
Copy link
Member

@ALTree ALTree commented Jan 13, 2020

Hi,

thanks for reporting this, but your description of the issue is far too vague to be actionable.

Please provide a runnable, auto-contained benchmark that demonstrates the problem on the latest 1.13 release, together with some benchmarking results on the platform you are running on.

@ALTree ALTree changed the title Performance degradation in go 1.12 runtime: performance degradation in go 1.12 Jan 13, 2020
@interviewQ
Copy link
Author

@interviewQ interviewQ commented Jan 13, 2020

Thanks for the response. I do see the performance issue with 1.13 also. Give me a few days to come up with a sample code that demonstrates the issue.

@interviewQ
Copy link
Author

@interviewQ interviewQ commented Feb 4, 2020

Sorry for the delay, it took me a while to narrow down the problem code path. Attached is a sample benchmark test. I am running this test on my Mac and here are the results -

1.11.13 -

goos: darwin
goarch: amd64
BenchmarkBlockSplit-12 1 2506669772 ns/op
PASS

1.13.7 -

goos: darwin
goarch: amd64
BenchmarkBlockSplit-12 1 2945086182 ns/op
PASS

So, 2.5 sec -> 2.9 sec equates to 16% drop in performance.
perf.txt

@randall77
Copy link
Contributor

@randall77 randall77 commented Feb 4, 2020

Here's a simple reproducer. I can reproduce on Linux.

package main

import (
	"fmt"
	"sync"
	"time"
)

var sink []byte

func main() {
	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			for i := 0; i < 16000; i++ {
				sink = make([]byte, 64*1024)
			}
			wg.Done()
		}()
	}
	wg.Wait()
	fmt.Printf("time: %f\n", time.Since(start).Seconds())
}

This test is really allocation heavy. It's going to depend on the precise details of how allocation is done. I get these times:

go1.11.13: 0.689 sec
go1.12.16: 0.973 sec
go1.13.7: 1.151 sec
tip: 1.099 sec

I think most of the difference has to do with how big the heap grows. Here's my eyeballing the maxiumum RSS (adding lots of iterations to the innermost loop so I can watch top while it runs):

go1.11.13: 310MB
go1.12.16: 202MB
go1.13.7: 158MB
tip: 98MB

This test is stressing the allocator so much that it goes way beyond 200% of live heap data, which should only be 640KB or so. You'll see that we've made steady improvements on how much over we go. Of course, that means that we'll run garbage collections more often, which will hurt the run time.

I'm not sure there's any bug here. It just seems like a different speed/memory tradeoff. To actually demonstrate a regression, I think we'd need to hold heap size constant somehow.

@mknyszek

@interviewQ
Copy link
Author

@interviewQ interviewQ commented Feb 4, 2020

Our app which is showing performance degradation is also allocation heavy, which is why we are seeing this performance degradation with newer go runtime. Having said that, over the last few weeks I have made lots of changes to improve memory management and did see improvements that make the performance drop lower. From what you are saying above, it seems like memory allocator heavy apps will have to take a hit when they upgrade to a newer go runtime.

In your test above it seems time taken increased from 0.689 sec -> 1.151 sec which is a 65% increase. Is this big drop in performance solely attributed to running GC more often?

@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Feb 4, 2020

@interviewQ Some background: there was a known regression in the slow path of the allocator in Go 1.12, which was intentional in order to support returning memory to the OS more eagerly. For the vast majority of programs, this wasn't noticed. The regression only showed up in microbenchmarks, not in any real production services or applications (as far as we could see/find). You need to be really allocation bound to notice this, as @randall77's benchmark shows. Note that @randall77's reproducer makes many 64 KiB allocations without ever touching that memory (aside from zeroing), which would already meaningfully reduce the load on the allocator.

However, it's also been known that allocating heavily and in parallel has had serious lock contention issues since at least Go 1.11. This was noticed when we worked on the regression in Go 1.12. In Go 1.14 we worked to fix these issues (#35112) and it went reasonably well. I'm fairly confident that the allocator is now faster, so this seems to me like it's the change in the default trade-off in these "way past the heap goal" cases as @randall77 says.

With that being said, @interviewQ could you:

  1. Check if using the tip of the master branch for Go results in less memory used for your application.
  2. If so, try cranking up GOGC to the point where you're using the same amount of memory as before, then measure CPU performance again.

Hopefully the net effect will be that we made performance better. :) If not, we can move on from there.

@interviewQ
Copy link
Author

@interviewQ interviewQ commented Feb 4, 2020

First of all thanks a lot for these quick replies and very useful info.

I had tested with 1.14beta1 in Dec and that still showed the performance issue.
Did the fix for 35112 go after 1.14 beta1 was released? I was planning on retesting with 1.14 when it gets released (was due on Feb 1). However, I could try the tip of the master branch.

In our app, we have 1000 goroutines, all allocating memory in parallel, so from what you are saying above "allocating heavily and in parallel" is what applies to our app.

Using sync.Pool is something I have thought about but it seems non-trivial since I will need to figure out when to put the buffer back to the pool. It is not impossible just hard to figure out using sync.Pool.

What if I have 1 goroutine whose only job is to alloc memory. These 1000 goroutines can talk to this "allocator" goroutine (via channels). That way these huge allocations happen from only 1 goroutine and contention is eliminated. Please let me know if this makes sense. Also, I come from C++ world and hence new to Go, so please pardon my ignorance.

@mknyszek

@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Feb 4, 2020

First of all thanks a lot for these quick replies and very useful info.

I had tested with 1.14beta1 in Dec and that still showed the performance issue.
Did the fix for 35112 go after 1.14 beta1 was released? I was planning on retesting with 1.14 when it gets released (was due on Feb 1). However, I could try the tip of the master branch.

There may be a regression in the default configuration, but have you looked at memory use like I mentioned? If it went down, you could increase GOGC from the default until your memory use is the same as before (in the steady-state) and see if it performs better? Since you mentioned you're relatively new to Go, check out the comment at the top of https://golang.org/pkg/runtime for an explanation of GOGC, and also see https://golang.org/pkg/runtime/debug/#SetGCPercent for more information.

In our app, we have 1000 goroutines, all allocating memory in parallel, so from what you are saying above "allocating heavily and in parallel" is what applies to our app.

The number of goroutines doesn't really matter since that's just concurrency. The GOMAXPROCS value is your actual level of parallelism (up to the number of independent CPU cores on your machine). What is GOMAXPROCS for your application?

Using sync.Pool is something I have thought about but it seems non-trivial since I will need to figure out when to put the buffer back to the pool. It is not impossible just hard to figure out using sync.Pool.

What if I have 1 goroutine whose only job is to alloc memory. These 1000 goroutines can talk to this "allocator" goroutine (via channels). That way these huge allocations happen from only 1 goroutine and contention is eliminated. Please let me know if this makes sense. Also, I come from C++ world and hence new to Go, so please pardon my ignorance.

Contention isn't really eliminated; you're just moving it from the allocator and to the channels in this case. Before trying to restructure your code for this please give my suggestion above a try. You might just have to increase GOGC to get the same level of CPU performance for the same memory usage.

@interviewQ
Copy link
Author

@interviewQ interviewQ commented Feb 4, 2020

GOMAXPROCS = number of CPU cores. Our app runs on EC2 instances, so it depends on which instance we are running on. In my test GOMAXPROCS = 72. I will try out your suggestions and will get back to you (most likely tomorrow).

@interviewQ
Copy link
Author

@interviewQ interviewQ commented Feb 4, 2020

I ran tests with various go versions and checked RSS usage.

1.11.13 - 3.3 GB
1.13.7 - 3.4GB
1.14 (tip of master) - 4.6 GB

Actually, RSS increased with 1.14, due to this I did not so the GOGC adjustment. On this note, I see mention of SetMaxHeap function in runtime (https://blog.golang.org/ismmkeynote). Is this something that would be included in future versions of Go?

Relative to 1.11.13, performance-wise I see 25% drop with 1.13.7 and 12% drop with 1.14.

By reducing heap allocation in the app I do see performance improvement, so it is clear allocator is the bottleneck.

@mknyszek

@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Feb 5, 2020

Just out of curiosity, how did you measure RSS?

I just want to be very precise on this because the virtual memory usage of Go increased significantly with 1.14 (around 600 MiB, which admittedly doesn't account for everything you're seeing), but most of that memory is not mapped in/committed.

Thanks for the quick turnaround on this and for your cooperation. 72 cores is not a level of parallelism I've personally ever tested, it's possible we have some scalability problems at that level that we haven't seen before.

As another experiment, can you try setting GOMAXPROCS to 48 or so? There are other things I'd be interested in looking at as well, but I'll wait for your reply first.

@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Feb 5, 2020

@interviewQ Also, if you'd be willing to share a GC trace, that would give us a lot more insight into what the runtime is doing. You can collect one by running your application with GODEBUG=gctrace=1 set and capturing STDERR.

@interviewQ
Copy link
Author

@interviewQ interviewQ commented Feb 5, 2020

@mknyszek
I measured RSS by looking into "top" output, did you mean something else when you mentioned memory usage of the application. I had collected gctrace but did not save it. Will run the test again and give you the output. Planning to run the following tests -

  • GOMAXPROCS=72, gctrace=1, go 1.11.13
  • GOMAXPROCS=72, gctrace=1, go 1.13.7
  • GOMAXPROCS=48, gctrace=1, go 1.11.13
  • GOMAXPROCS=48, gctrace=1, go 1.13.7

Do you also need results from 1.14 (top of master branch)?

@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Feb 5, 2020

@interviewQ I did not mean anything else, that's exactly it. Thank you for confirming.

Running those sounds great to me. If you have the time/resources to try tip of the master branch as well that would be very helpful too.

@interviewQ
Copy link
Author

@interviewQ interviewQ commented Feb 6, 2020

@mknyszek
Finished tests and am attaching results. I saw go1.14rc1 was released yesterday, so tested with that.

Keeping 1.11.13 as the baseline, with GOMAXPROCS=72, I see huge perf drop (25% or so) with 1.13.7. There is a perf drop with 1.14rc1 also but very less (around 5%).

Keeping 1.11.13 as the baseline, with GOMAXPROCS=48, I see huge perf drop (20% or so) with 1.13.7. With 1.14rc1 perf seems the same as 1.11.13.

Comparing GOMAXPROCS=72 with GOMAXPROCS=48, we are reducing CPU resources by 35%, but perf does not drop that much.
1.11.3 - perf drops around 7%
1.13.7 - perf remains the same
1.14rc1 - perf drop around 3%

The app in question is CPU bound, there is no real i/o to slow it down. It is also memory alloc heavy.

In the attached logs if you see logs for GOMAXPROCS=48, you will see it starts with 72P but after half a sec or so changes to 48P. I am changing this programmatically by calling runtime.GOMAXPROCS. The real test starts only after this setting has been made.

Archive.zip

@interviewQ
Copy link
Author

@interviewQ interviewQ commented Feb 6, 2020

If you have any code changes in go runtime, I can test it with our app and give it a whirl.

@interviewQ
Copy link
Author

@interviewQ interviewQ commented Feb 10, 2020

@mknyszek
I ran tests on an EC2 instance with 36 core CPU. With this, I see a similar performance with go 1.11.13 vs 1.13.7. Go 1.14rc1 is actually slightly faster than go1.11.13. So, the performance issue which I am seeing is only with CPUs with large number of cores. This points to the lock contention issue in the memory allocator that you mention above. I do see this has improved considerably in Go 1.14, but it is still slower than Go 1.11 (only with large number of cores). Is there any way to improve this any further in Go1.14.

@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Feb 10, 2020

@mknyszek
I ran tests on an EC2 instance with 36 core CPU. With this, I see a similar performance with go 1.11.13 vs 1.13.7. Go 1.14rc1 is actually slightly faster than go1.11.13. So, the performance issue which I am seeing is only with CPUs with large number of cores. This points to the lock contention issue in the memory allocator that you mention above. I do see this has improved considerably in Go 1.14, but it is still slower than Go 1.11 (only with large number of cores). Is there any way to improve this any further in Go1.14.

Thinking about this more, it doesn't make sense that if Go 1.11 collapses at 48 cores that it would do better than Go 1.14 at 72 (in terms of allocations). One case I can think of is that the Go 1.14 allocator just completely breaks down at 72 cores, doing worse than Go 1.11.

So I set up a 72-core VM and ran some allocator scalability microbenchmarks. These show that the 1.14 allocator scales much better than Go 1.11: around 1.5-2x throughput improvement at 72 cores for a range of allocation sizes (anywhere from 1 KiB to 64 KiB).

I think that perhaps the scalability of another part of the runtime got worse in Go 1.12, and that perhaps scalability improvements are carrying things a bit. I don't know what that is, but now that I have a 72-core VM I can experiment more deeply.

I poured over the GC traces for Go 1.11 and Go 1.14 for several hours, doing a bunch of aggregate analysis, and things have certainly changed a lot (e.g. much more time spent in assists in Go 1.14 vs. older versions; this is not totally unexpected), but nothing stands out as "this is obviously the cause" to me. Though, since I don't think this relates to #35112 anymore, I'm going to take a closer look at the Go 1.13 GC traces and see what I can glean from them.

@networkimprov
Copy link

@networkimprov networkimprov commented Feb 10, 2020

@gopherbot remove WaitingForInfo

@interviewQ
Copy link
Author

@interviewQ interviewQ commented Feb 10, 2020

So I set up a 72-core VM and ran some allocator scalability microbenchmarks. These show that the 1.14 allocator scales much better than Go 1.11: around 1.5-2x throughput improvement at 72 cores for a range of allocation sizes (anywhere from 1 KiB to 64 KiB).

@mknyszek
In the app I am testing, we have several large-sized block allocations. A lot of the allocations are fixed 64K blocks. Other allocations are for blocks sized 50K-120K. I just thought of providing this info. Maybe for larger block allocs and that too in parallel with large number of cores (72),
we have hit some scalability limits with go 1.14 vs go 1.11.

@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Feb 10, 2020

@interviewQ Right, I figured based on the original benchmark you provided, that's why I explicitly tried out 64 KiB allocations. I can try out larger ones but up to 128 KiB they all have the potential to be satisfied out the page cache, so the scalability of the allocator should be fine.

@aclements suggested to me that if there is some scalability problem in the runtime here, and your application is indeed CPU-bound, we should be able to figure out what it is (or get a strong hint) by just looking at a differential profile for the same version of Go but for different GOMAXPROCS. So for example, for Go 1.14, if you could collect a pprof CPU profile of your application running with GOMAXPROCS=36 and GOMAXPROCS=48 and GOMAXPROCS=72, you can look at the diff in pprof and see if anything is growing. If anything is, that will point heavily to whatever the bottleneck is.

You can collect a profile by using https://golang.org/pkg/runtime/pprof/ and then running go tool pprof --diff_base=low_count.cpu high_count.cpu then typing the command "top 20" in the interactive input.

@dmitshur dmitshur added this to the Go1.14 milestone Feb 12, 2020
@dmitshur
Copy link
Member

@dmitshur dmitshur commented Feb 13, 2020

It's not clear to me yet if this should be a release blocker for 1.14, but I'll tentatively add it for visibility so we can discuss it in a release meeting.

@dmitshur
Copy link
Member

@dmitshur dmitshur commented Feb 14, 2020

We've looked over this, and it should not be a release blocker for Go 1.14, because it's not a regression that is specific to 1.14 (it affects previous releases too). Removing release-blocker.

@dmitshur dmitshur modified the milestones: Go1.14, Go1.15 Feb 14, 2020
@mknyszek mknyszek self-assigned this Mar 12, 2020
@gopherbot
Copy link

@gopherbot gopherbot commented Apr 12, 2020

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

@gopherbot gopherbot closed this Apr 12, 2020
@networkimprov
Copy link

@networkimprov networkimprov commented Apr 12, 2020

@interviewQ I think this was waiting on pprof results from you, described in #36521 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
7 participants
You can’t perform that action at this time.