runtime: performance degradation in go 1.12 #36521
Comments
Hi, thanks for reporting this, but your description of the issue is far too vague to be actionable. Please provide a runnable, self-contained benchmark that demonstrates the problem on the latest 1.13 release, together with some benchmarking results on the platform you are running on.
Thanks for the response. I do see the performance issue with 1.13 also. Give me a few days to come up with sample code that demonstrates the issue.
Sorry for the delay, it took me a while to narrow down the problem code path. Attached is a sample benchmark test. I am running this test on my Mac (goos: darwin) and here are the results: 1.11.13 - ~2.5 sec, 1.13.7 - ~2.9 sec. So, 2.5 sec -> 2.9 sec equates to a 16% drop in performance.
Here's a simple reproducer. I can reproduce on Linux.
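The reproducer itself is not reproduced in this thread; purely as an illustration (not the original attachment), an allocation-heavy loop along the lines described below might look like this. The sizes and iteration count are assumptions based on the numbers quoted later (64 KiB allocations, roughly 640 KB of live data):

```go
package main

import "fmt"

// Assumed sizes: many 64 KiB allocations with only ~640 KB kept live,
// so the allocator, not the live heap, dominates the run time.
const (
	bufSize = 64 << 10 // 64 KiB per allocation
	live    = 10       // ~10 * 64 KiB ≈ 640 KB of live data
)

var bufs [][]byte

func main() {
	bufs = make([][]byte, live)
	for i := 0; i < 1000000; i++ {
		// Each iteration discards one live buffer and allocates a fresh one,
		// keeping the live heap tiny while hammering the allocator.
		bufs[i%live] = make([]byte, bufSize)
	}
	fmt.Println(len(bufs[0]))
}
```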
This test is really allocation heavy. It's going to depend on the precise details of how allocation is done. I get these times:
I think most of the difference has to do with how big the heap grows. Here's my eyeballing of the maximum RSS (adding lots of iterations to the innermost loop so I can watch it):
This test is stressing the allocator so much that it goes way beyond 200% of live heap data, which should only be 640KB or so. You'll see that we've made steady improvements on how much over we go. Of course, that means that we'll run garbage collections more often, which will hurt the run time. I'm not sure there's any bug here. It just seems like a different speed/memory tradeoff. To actually demonstrate a regression, I think we'd need to hold heap size constant somehow.
Our app, which is showing performance degradation, is also allocation heavy, which is why we are seeing this degradation with the newer Go runtime. Having said that, over the last few weeks I have made lots of changes to improve memory management and did see improvements that make the performance drop smaller. From what you are saying above, it seems like allocation-heavy apps will have to take a hit when they upgrade to a newer Go runtime. In your test above, the time taken increased from 0.689 sec -> 1.151 sec, which is a 65% increase. Is this big drop in performance solely attributed to running GC more often?
@interviewQ Some background: there was a known regression in the slow path of the allocator in Go 1.12, which was intentional in order to support returning memory to the OS more eagerly. For the vast majority of programs, this wasn't noticed. The regression only showed up in microbenchmarks, not in any real production services or applications (as far as we could see/find). You need to be really allocation bound to notice this, as @randall77's benchmark shows. Note that @randall77's reproducer makes many 64 KiB allocations without ever touching that memory (aside from zeroing); if your application's allocation pattern is similar, reusing those buffers (for example with sync.Pool) would already meaningfully reduce the load on the allocator. However, it's also been known that allocating heavily and in parallel has had serious lock contention issues since at least Go 1.11. This was noticed when we worked on the regression in Go 1.12. In Go 1.14 we worked to fix these issues (#35112) and it went reasonably well. I'm fairly confident that the allocator is now faster, so this seems to me like it's the change in the default trade-off in these "way past the heap goal" cases as @randall77 says. With that being said, @interviewQ, could you try the latest Go 1.14 release candidate and compare both the performance and the memory use of your application against Go 1.11?
Hopefully the net effect will be that we made performance better. :) If not, we can move on from there.
First of all, thanks a lot for these quick replies and very useful info. I had tested with 1.14beta1 in December and that still showed the performance issue. In our app, we have 1000 goroutines, all allocating memory in parallel, so from what you are saying above, "allocating heavily and in parallel" is what applies to our app. Using sync.Pool is something I have thought about, but it seems non-trivial since I will need to figure out when to put the buffer back into the pool. It is not impossible, just hard to figure out using sync.Pool. What if I have one goroutine whose only job is to allocate memory? These 1000 goroutines can talk to this "allocator" goroutine (via channels). That way the huge allocations happen from only one goroutine and contention is eliminated. Please let me know if this makes sense. Also, I come from the C++ world and am new to Go, so please pardon my ignorance.
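As a rough sketch only (names and buffer size are assumed, not taken from the app), the single "allocator goroutine" idea might look something like this; as the reply below notes, the contention largely moves onto the channel rather than disappearing:

```go
package main

import "sync"

// Assumed buffer size for illustration.
const bufSize = 64 << 10

// allocator owns all large allocations and hands out a fresh buffer per request.
func allocator(requests <-chan chan []byte) {
	for reply := range requests {
		reply <- make([]byte, bufSize)
	}
}

func main() {
	requests := make(chan chan []byte)
	go allocator(requests)

	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ { // the 1000 worker goroutines mentioned above
		wg.Add(1)
		go func() {
			defer wg.Done()
			reply := make(chan []byte, 1)
			requests <- reply // ask the allocator goroutine for a buffer
			buf := <-reply
			_ = buf // ... the worker would use buf here ...
		}()
	}
	wg.Wait()
	close(requests)
}
```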
There may be a regression in the default configuration, but have you looked at memory use like I mentioned? If it went down, you could increase GOGC from the default until your memory use is the same as before (in the steady-state) and see if it performs better? Since you mentioned you're relatively new to Go, check out the comment at the top of https://golang.org/pkg/runtime for an explanation of GOGC, and also see https://golang.org/pkg/runtime/debug/#SetGCPercent for more information.
The number of goroutines doesn't really matter since that's just concurrency. The GOMAXPROCS value is your actual level of parallelism (up to the number of independent CPU cores on your machine). What is GOMAXPROCS for your application?
Contention isn't really eliminated; you're just moving it from the allocator to the channels in this case. Before trying to restructure your code for this, please give my suggestion above a try. You might just have to increase GOGC to get the same level of CPU performance for the same memory usage.
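A minimal sketch of the GOGC adjustment being suggested, assuming it is done at startup via runtime/debug (equivalent to launching the program with the GOGC environment variable set; 200 is only an example value, not a recommendation):

```go
package main

import "runtime/debug"

func main() {
	// Equivalent to GOGC=200: the heap may grow to roughly 3x the live data
	// between collections instead of the default 2x (GOGC=100), trading
	// memory for fewer garbage collection cycles.
	debug.SetGCPercent(200)

	// ... run the allocation-heavy workload here ...
}
```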
GOMAXPROCS = number of CPU cores. Our app runs on EC2 instances, so it depends on which instance we are running on. In my test GOMAXPROCS = 72. I will try out your suggestions and will get back to you (most likely tomorrow).
I ran tests with various Go versions and checked RSS usage. 1.11.13 - 3.3 GB. Actually, RSS increased with 1.14, so I did not do the GOGC adjustment. On this note, I see mention of a SetMaxHeap function in the runtime (https://blog.golang.org/ismmkeynote). Is this something that would be included in future versions of Go? Relative to 1.11.13, performance-wise I see a 25% drop with 1.13.7 and a 12% drop with 1.14. By reducing heap allocation in the app I do see performance improvement, so it is clear the allocator is the bottleneck.
Just out of curiosity, how did you measure RSS? I just want to be very precise on this because the virtual memory usage of Go increased significantly with 1.14 (around 600 MiB, which admittedly doesn't account for everything you're seeing), but most of that memory is not mapped in/committed. Thanks for the quick turnaround on this and for your cooperation. 72 cores is not a level of parallelism I've personally ever tested; it's possible we have some scalability problems at that level that we haven't seen before. As another experiment, can you try setting GOMAXPROCS to 48 or so? There are other things I'd be interested in looking at as well, but I'll wait for your reply first.
@interviewQ Also, if you'd be willing to share a GC trace, that would give us a lot more insight into what the runtime is doing. You can collect one by running your application with the environment variable GODEBUG=gctrace=1 set; the trace is printed to standard error.
@mknyszek Do you also need results from 1.14 (top of master branch)?
@interviewQ I did not mean anything else, that's exactly it. Thank you for confirming. Running those sounds great to me. If you have the time/resources to try tip of the master branch as well, that would be very helpful too.
@mknyszek Keeping 1.11.13 as the baseline, with GOMAXPROCS=72 I see a huge perf drop (25% or so) with 1.13.7. There is a perf drop with 1.14rc1 also, but a much smaller one (around 5%). Keeping 1.11.13 as the baseline, with GOMAXPROCS=48 I see a huge perf drop (20% or so) with 1.13.7. With 1.14rc1 perf seems the same as 1.11.13. Comparing GOMAXPROCS=72 with GOMAXPROCS=48, we are reducing CPU resources by a third, but perf does not drop that much. The app in question is CPU bound; there is no real I/O to slow it down. It is also memory-alloc heavy. In the attached logs, if you look at the logs for GOMAXPROCS=48, you will see it starts with 72 P but after half a sec or so changes to 48 P. I am changing this programmatically by calling runtime.GOMAXPROCS. The real test starts only after this setting has been made.
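A minimal sketch of that programmatic GOMAXPROCS change, assuming it is applied before the measured phase begins (48 is just the value from the experiment above):

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Lower the parallelism level before measurement starts; the call
	// returns the previous setting (72 in the run described above).
	prev := runtime.GOMAXPROCS(48)
	fmt.Printf("GOMAXPROCS changed from %d to %d\n", prev, runtime.GOMAXPROCS(0))

	// ... start the real test only after this point ...
}
```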
If you have any code changes in the Go runtime, I can test them with our app and give it a whirl.
@mknyszek
Thinking about this more, it doesn't make sense that, if Go 1.11 collapses at 48 cores, it would do better than Go 1.14 at 72 (in terms of allocations). One case I can think of is that the Go 1.14 allocator just completely breaks down at 72 cores, doing worse than Go 1.11. So I set up a 72-core VM and ran some allocator scalability microbenchmarks. These show that the 1.14 allocator scales much better than Go 1.11: around a 1.5-2x throughput improvement at 72 cores for a range of allocation sizes (anywhere from 1 KiB to 64 KiB). I think that perhaps the scalability of another part of the runtime got worse in Go 1.12, and that the allocator's scalability improvements are carrying things a bit. I don't know what that part is, but now that I have a 72-core VM I can experiment more deeply. I pored over the GC traces for Go 1.11 and Go 1.14 for several hours, doing a bunch of aggregate analysis, and things have certainly changed a lot (e.g. much more time spent in assists in Go 1.14 vs. older versions; this is not totally unexpected), but nothing stands out as "this is obviously the cause" to me. Though, since I don't think this relates to #35112 anymore, I'm going to take a closer look at the Go 1.13 GC traces and see what I can glean from them.
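These are not the microbenchmarks that were actually run; as an illustration of the general shape of such an allocator scalability benchmark, with a couple of assumed sizes from the 1 KiB to 64 KiB range:

```go
package alloc_test

import "testing"

// sink keeps the allocated buffer reachable so the compiler cannot
// optimize the allocation away.
var sink []byte

func benchAlloc(b *testing.B, size int) {
	b.RunParallel(func(pb *testing.PB) {
		var buf []byte
		for pb.Next() {
			buf = make([]byte, size)
		}
		sink = buf
	})
}

func BenchmarkAlloc1KiB(b *testing.B)  { benchAlloc(b, 1<<10) }
func BenchmarkAlloc64KiB(b *testing.B) { benchAlloc(b, 64<<10) }
```

Running it with go test -bench=. -cpu=1,48,72 would then show how allocation throughput scales with GOMAXPROCS.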
@gopherbot remove WaitingForInfo
@mknyszek
@interviewQ Right, I figured based on the original benchmark you provided; that's why I explicitly tried out 64 KiB allocations. I can try out larger ones, but up to 128 KiB they all have the potential to be satisfied out of the page cache, so the scalability of the allocator should be fine. @aclements suggested to me that if there is some scalability problem in the runtime here, and your application is indeed CPU-bound, we should be able to figure out what it is (or get a strong hint) by just looking at a differential profile for the same version of Go but for different GOMAXPROCS values. So, for example, for Go 1.14, if you could collect a pprof CPU profile of your application running with GOMAXPROCS=36, GOMAXPROCS=48, and GOMAXPROCS=72, you could look at the diff in pprof and see if anything is growing. If anything is, that will point heavily to whatever the bottleneck is. You can collect a profile by using https://golang.org/pkg/runtime/pprof/ and then inspecting the results with go tool pprof.
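A minimal sketch of wiring in that CPU profile collection with runtime/pprof (the file name and placement are assumptions; the same program would be run once per GOMAXPROCS value and the resulting files compared with go tool pprof):

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	// One profile per GOMAXPROCS setting; the file name is arbitrary.
	f, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	// ... run the CPU-bound workload here ...
}
```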
It's not clear to me yet if this should be a release blocker for 1.14, but I'll tentatively add it for visibility so we can discuss it in a release meeting.
We've looked over this, and it should not be a release blocker for Go 1.14, because it's not a regression that is specific to 1.14 (it affects previous releases too). Removing release-blocker.
Timed out in state WaitingForInfo. Closing. (I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)
@interviewQ I think this was waiting on pprof results from you, described in #36521 (comment).
What version of Go are you using (go version)?
go 1.11.12, planning to upgrade to go 1.12.15
Does this issue reproduce with the latest release?
Yes, with go 1.12.15
What operating system and processor architecture are you using (go env)?
Ubuntu 18.04
What did you do?
Since go 1.11 is about to go EOL, we are planning to upgrade to the next version. I attempted to upgrade to go 1.12.15, but I am seeing 15-20% performance degradation. A similar result was seen with go 1.13.6.
What did you expect to see?
I was hoping to see performance improvement with a newer go compiler.
What did you see instead?
Instead of a performance improvement, I saw performance degradation.
Is there anything in newer Go versions (1.12 and beyond) that could explain the above performance degradation? I read online that the scheduler was modified to allow preemption of long-running goroutines. While I do not know if this could cause the performance issue, is there any way to revert this scheduler behavior just for test purposes? Any other suggestions are welcome.