runtime: performance degradation on tip that disappears when traced/profiled #67857
Comments
We've seen this particular commit as a culprit as well, but as you point out, this code never actually runs during the benchmarks. You need tracing enabled for it to do anything, and it does help tracing overheads, so I'm reluctant to roll it back.

In these sorts of situations, I'd usually blame the benchmark for being too small. For example, the CL causes some code to move across cache line boundaries (or something like that), perturbing some super tight loop that is hit very often. Given that you're seeing this on arm64 I'm inclined to say it's not quite that exactly, but some kind of unfortunate microarchitectural effect.

I was meaning to look into this (and still will), but given that you're already doing some work on it, are you able to dig deeper and see what differences are showing up at the microarchitectural level (say, with Linux perf or something)? A 20% regression should stick out like a sore thumb when comparing Linux perf results (assuming that that doesn't also eliminate the regression).
Hey @mknyszek - thanks for the response. We were able to grab an isolated Linux machine with 24 cores and do some more investigation. To reduce any possible noise, we moved the code out of a benchmark environment and into a standalone program that ran the same call to Marshal with the same parallelism factor. The degradation on this particular machine was around 5-10%.

First, we looked at runtime/metrics.
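As a minimal sketch, values like these can be pulled with the runtime/metrics package; the specific metric names below are illustrative, chosen to match what is discussed in this thread (GC cycle count, heap goal, mutex wait), not the exact set that was collected:

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	// Illustrative metric names; see https://pkg.go.dev/runtime/metrics
	// for the full supported set.
	samples := []metrics.Sample{
		{Name: "/gc/cycles/total:gc-cycles"},
		{Name: "/gc/heap/goal:bytes"},
		{Name: "/sync/mutex/wait/total:seconds"},
	}
	metrics.Read(samples)
	for _, s := range samples {
		switch s.Value.Kind() {
		case metrics.KindUint64:
			fmt.Printf("%s = %d\n", s.Name, s.Value.Uint64())
		case metrics.KindFloat64:
			fmt.Printf("%s = %g\n", s.Name, s.Value.Float64())
		}
	}
}
```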
Second, we looked at Linux perf results. Across multiple runs, the counters that stayed consistently higher on the "bad" commit were cpu-migrations and context-switches. For example, running on 73981695a2518b6eae7f8ffe74d224691c60d433 (good):

and on 5bba5b256ce287736da60372c5cf634395d7b1a3 (bad):
Collecting a CPU profile did seem to make the migrations go back to "good" levels, but we have yet to fully investigate profiling vs. not profiling. We noticed other commits also performing better when profiled, so we are partially baffled by this at the moment. Finally, running with link-time randomization (as @thanm did in #67858) did not seem to have any effect.

At this point, one theory we have is that for some reason threads are yielding more frequently (including GC threads), which is causing more frequent GC cycles. Could this be due to increased lock contention somewhere (since we see such a large increase in mutex wait time)? Or maybe the OS is forcing yields more frequently for some reason? We're still unsure how the bisected commit would cause either, though.
Thanks for doing a more in-depth analysis. I think I know what the problem is, and the change you're seeing in the runtime/metrics numbers makes sense.

The old trace map implementation kept a substantial amount of memory in an array. These maps are only defined in globals, and there are about 4 of them if I recall correctly (6 on tip, I think). Even though the pointers in the map are to not-in-heap memory, the pointers are just regular unsafe pointers, so the big array appears as scannable memory in globals to the GC. Because scannable globals contribute to the heap goal since Go 1.18, I think their disappearance is likely the source of the effect. My first guess is that these extra globals in the "good" commit are behaving like a ballast. This theory is consistent with the change in the number of GC cycles. What's odd is that the heap goal remains the same, but it may just be a rounding error in the reported value.

Would you be able to share the STDERR of running your benchmark with GODEBUG=gctrace=1?

If this theory is right, then unfortunately, this is exactly one of those artifacts that only really affects programs with really small live heaps that allocate a lot. If I am right, I do not think we should do anything about it. (That is, I do not think it is worthwhile to restore the ballast.) What may prove my theory wrong is that 300 KiB is still very little. But I do think this is still in the right general direction. (This may also explain the higher context-switch and cpu-migration counts, since more frequent GC cycles mean more thread wakeups.)
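As a rough sketch of why the globals matter here (this is the approximation from the GC guide's description of the post-1.18 pacer, not the exact pacer math):

$$\text{heap goal} \approx H_{\text{live}} + \frac{\text{GOGC}}{100}\left(H_{\text{live}} + S_{\text{stacks}} + S_{\text{globals}}\right)$$

With a live heap of only a few MB, dropping a few hundred KiB of scannable globals from $S_{\text{globals}}$ lowers the goal slightly, which in an allocation-heavy loop means proportionally more GC cycles (though, as noted above, a few hundred KiB is a small fraction of the goal).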
Ah - interesting. I definitely was misinterpreting what the runtime/metrics snapshot was showing. I was able to trace the GC and found evidence that supports your theory, though I'm not too sure if the absolute amount of discrepancy is what is expected, and the consistent 0 MB globals throws me off.

It seems like during GC for the "bad" commit, the mean heap size and goal (taken from the rounded values in the output) are 3.3 MB and 7.3 MB, respectively, while for the "good" commit they're around 4.5 MB and 10 MB, respectively. Here's the full output for bad vs. good, if you'd like to take a look (just FYI - this was taken on the machine I used to post the original issue, not the Linux machine I used for my most recent comment).

I'm supposing that when I collected runtime/metrics before, the heap goal at that point in time (after the benchmark had completed) for both good & bad happened to be the minimum heap size of 4 MB.
Yeah, absolutely - I think the issue - while worth investigating - is definitely a very specific scenario that does not represent real applications with sizeable heaps.
Yeah, that's likely an artifact of rounding down.
Yep, that sounds like it's the problem. It makes sense that the live heap size also might be slightly larger as a side-effect, since everything allocated during the mark phase is counted as part of the next cycle's live heap. If there's more for the GC to do, it's a longer mark phase, and thus a bigger live heap.

Thanks! I'll poke at it.
Go version
go version go1.22.4 darwin/arm64
Output of go env in your module/workspace:

What did you do?
While doing some initial testing in preparation for Go 1.23 in the coming months, we ran some of Uber’s open-source benchmarks using the tip of Go.
What did you see happen?
We saw surprising degradations of around 20% in a couple of Zap benchmarks, the most concerning of which was a simple test that measured concurrent calls to json.Marshal (nothing Zap-related). Here's a simplified version of that benchmark:
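A minimal sketch of that kind of benchmark, assuming a simple payload struct (the actual struct, field names, and setup in the original benchmark differ from this):

```go
package bench

import (
	"encoding/json"
	"testing"
)

// payload is a stand-in for whatever struct the real benchmark marshals.
type payload struct {
	Name  string   `json:"name"`
	Value int      `json:"value"`
	Tags  []string `json:"tags"`
}

func BenchmarkJSONMarshalParallel(b *testing.B) {
	p := payload{Name: "example", Value: 42, Tags: []string{"a", "b", "c"}}
	b.ReportAllocs()
	b.RunParallel(func(pb *testing.PB) {
		// Each goroutine repeatedly marshals the same value.
		for pb.Next() {
			if _, err := json.Marshal(p); err != nil {
				b.Error(err)
			}
		}
	})
}
```

Running it with `go test -bench . -cpu N` varies GOMAXPROCS and therefore the parallelism of RunParallel.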
To reproduce:
Some notes about reproducing:
Our Investigation
Using git bisect revealed that commit 5bba5b256ce287736da60372c5cf634395d7b1a3 is what is causing the degradation in the benchmarks.
This is a surprising result, as the offending commit seems entirely unrelated to the benchmarks. We tried to see how this new code could be getting hit at all, but were unable to find any link via either dlv breakpoints or print statements in the runtime package.
We did try to collect CPU profiles while running the benchmarks to investigate further, but, interestingly, the degradation disappears when collecting a CPU profile.
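For reference, one way to attach a CPU profile to a standalone reproduction is via runtime/pprof (with `go test`, the -cpuprofile flag is the equivalent); `runWorkload` below is a placeholder for the actual reproduction, which isn't shown here:

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

// runWorkload stands in for the standalone reproduction
// (parallel json.Marshal calls); it is a placeholder.
func runWorkload() { /* ... */ }

func main() {
	// Write a CPU profile covering the whole workload.
	f, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	runWorkload()
}
```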
Since this issue seems to be unrelated to Zap specifically, and reproduces with calls to encoding/json, we wanted to bring it up in case it points to a more general issue.
What did you expect to see?
No practical degradation.