runtime: performance degradation on tip on high core count machines #67858
CC @golang/runtime
These logging benchmarks appear to involve incredibly tight loops, which leads me to believe that it's possibly due to unfortunate microarchitectural effects -- see my comment on #67857. And my next question would be: does this in any way correspond to production-level regressions? Unless the world is getting stopped a lot, I would not expect 15…

One thing I might suggest trying is setting GOGC=off. For example, a benchmark that allocates in a loop, immediately drops the memory, and has a teeny tiny live heap (a lot like most microbenchmarks!) is going to see a very, very high number of GC cycles, because it is going to sit at the minimum total heap size of 4 MiB. This means tons of STWs and CPU time spent on mark work that does not translate at all to real production systems with sizable live heaps. The tricky part with the GC is that its frequency (and thus its cost) is proportional to your live heap via GOGC. I'm curious to know if setting GOGC=off makes the difference disappear.

I'll also add: are you able to compare profiles before and after (e.g. with pprof's diffing support)?
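As a concrete sketch of the allocate-and-drop pattern described above (illustrative code, not from this issue), a benchmark like the following sits near the 4 MiB heap minimum and spends a large share of its time in the GC:

```go
package gcbench_test

import "testing"

// sink keeps the allocation from being optimized away. Each iteration
// overwrites it, so the previous slice becomes garbage immediately and
// the live heap stays tiny while the allocation rate stays high.
var sink []byte

func BenchmarkAllocateAndDrop(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		sink = make([]byte, 512)
	}
}
```

Comparing such a benchmark under GOGC=off versus the default is a quick way to tell whether a regression comes from GC work or from the benchmarked code itself.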
Hey @mknyszek - thanks for the quick response & ideas!

Your explanation w.r.t. microbenchmarks being a GC torture test makes sense: these tests allocate memory that quickly becomes dead, which leads to a situation with a consistent allocation rate but a low heap target, causing GC to be triggered often. I suppose this would exacerbate any GC performance degradations that are otherwise minuscule in larger benchmarks/applications.

It's unfortunately not realistic for us to actually test Go tip in production, but we may be able to do this once an RC version is tagged.

I did turn off the GC and re-ran these benchmarks, and as you predicted, this does seem to result in no degradation between the versions.

Like I mentioned before, we had a hard time discerning any differences in profiles between the two versions, but I tried to make the issue worse by forcing more GC with GOGC=50 in hopes that differences between the profiles would surface better, and did find that …
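For completeness, the same GOGC experiments can also be pinned inside the benchmark binary itself; this is only a sketch of that option, and setting GOGC on the command line is equivalent and needs no recompile:

```go
package zapbench_test

import (
	"os"
	"runtime/debug"
	"testing"
)

// TestMain fixes the GC target for every benchmark in the package,
// mirroring the GOGC=off / GOGC=50 runs described above.
func TestMain(m *testing.M) {
	debug.SetGCPercent(50) // use -1 to disable the GC entirely (GOGC=off)
	os.Exit(m.Run())
}
```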
No problem. These reports are important. :) I hope my response did not discourage future reports. Thanks for the thorough issue you filed -- it's genuinely helpful that y'all do performance testing against tip.
Yeah, it's a bit unfortunate how easy it is to generate benchmarks that do this.
Understandable. Hopefully the RC gives us some more feedback.
Got it, that's good to know! Thanks for checking.
That's unfortunate; just to clarify, is that even with the automated diffing? The only other thing I might suggest is doing the same kind of differential profile with Linux perf.

I'm going to leave this issue open for now while I investigate #67822, in case it becomes relevant.
I ran these benchmarks on my high-core 2-socket machine, with and without link-time randomization (as described in this note: #67822 (comment)). Here's what I see with no randomization; rock steady:
Now here's what happens when I add in text layout randomization (10 instances of -count=3 runs, each with a random seed):
Note the …
Hmm... FWIW I wasn't able to reproduce this using the script from #67822 (comment), including the same 10 seeds. Perhaps I'm doing something wrong. No linker randomization:
Linker randomization:
I did sanity-check that go was being invoked correctly.
Very strange.
Interesting -- looks like machine type (microarchitecture) matters a lot here, which I suppose I should have expected all along. I don't have access to a 2-socket (96-core) AMD EPYC 7B13, but I do have a single-socket AMD EPYC 7B13, and I can see the degradation hold up just as you mention. I get roughly this
both with and without layout randomization. Thanks.
Go version
go version go1.22.4 linux/amd64
Output of go env in your module/workspace:
What did you do?
We have been doing some performance testing of Go tip at Uber in preparation for Go 1.23.
What did you see happen?
We have noticed a degradation of around 8% on Linux machines with a large number of cores (96) in all of Zap’s Field logging benchmarks. These benchmarks look something like this:
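(A minimal sketch along these lines; the logger construction and the specific field below are illustrative, not the exact Zap benchmark source.)

```go
package zapbench_test

import (
	"io"
	"testing"

	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

func BenchmarkBoolField(b *testing.B) {
	// Log to a discarded sink so the benchmark measures field handling
	// and encoding rather than I/O.
	logger := zap.New(zapcore.NewCore(
		zapcore.NewJSONEncoder(zap.NewProductionEncoderConfig()),
		zapcore.AddSync(io.Discard),
		zapcore.InfoLevel,
	))
	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			logger.Info("Boolean.", zap.Bool("foo", true))
		}
	})
}
```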
We don’t have an isolated Linux environment available to us, so these results are susceptible to a slight noisy-neighbor problem, but we have consistently seen some amount of degradation on these benchmarks:
We fiddled with GOMAXPROCS a bit and noticed the degradation is definitely related to parallelism.

We didn’t see a whole lot in CPU profiles other than a general increase of about 2-4% of samples taken in the runtime package.
We were able to use git bisect to identify e995aa95cb5f379c1df5d5511ee09970261d877f as one cause. Specifically, the added calls to nanotime() seem to cause degradation in these highly parallelized benchmarks. However, this commit alone does not seem to account for the entire degradation:
We weren’t able to reliably identify any additional commits beyond this one that accounted for more of the degradation.
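For a rough sense of how per-operation clock reads scale with parallelism on a given machine, a parallel time.Now benchmark can act as a crude proxy. This is only a sketch: time.Now is not the same code path as the runtime-internal nanotime() call, so it does not measure the bisected commit itself.

```go
package timebench_test

import (
	"sync/atomic"
	"testing"
	"time"
)

// sink prevents the compiler from eliminating the clock reads.
var sink atomic.Int64

// BenchmarkMonotonicRead measures the per-iteration cost of time.Now
// (a wall-clock plus monotonic-clock read) under full parallelism.
func BenchmarkMonotonicRead(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		var local int64
		for pb.Next() {
			local += time.Now().UnixNano()
		}
		sink.Add(local)
	})
}
```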
Note: this is not a duplicate of #67857, but rather an investigation of different Zap benchmark degradations.
What did you expect to see?
No practical degradation.