New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmd/compile: revisit statement boundaries CL peformance and binary size impact #25426
Comments
We got a nice improvement in compilation time from improvements to imports; we get to spend some of that on other things. There is in fact a dependence on value order, and not depending on value order would induce a much larger compile-time cost. The value to the end user is that debugger breakpoints more often correspond to the correct position in the source code, and that the stepping/nexting experience within a debugger is much less erratic and confusing. This also means that printing values in the debugger more often yields the "correct" result (i.e., conforming to the user's expectations). As a side-effect, it becomes less necessary to get this same effect by changing the line number of "unimportant" instructions (to avoid debugger-visible line-number churn) which should make profiling slightly more accurate. Late in the game I thought of what could be a better approach to this that should be both faster and less sensitive to value order, but it was very late, and the alternate method is completely untested. |
So, re late-in-the-game method, how about we make investigating that a high priority for 1.12? It won't reduce allocations much beyond what we do now, because that's already been optimized down to O(1) allocation per function (I think) and it won't help binary size in the least because that information takes space no matter what. We are, however, talking about compressing debugging information in the near-ish future, so there's that hope. |
Thanks, @dr2chase. I want to read a bit more before I reply to your comments. One quick question, though. Was the intent that this CL only change DWARF info? I ask because I notice that the hello world text bytes are up 1.22%, and I would have expected that DWARF changes would only impact overall binary size (which is up 4%). And a quick late night test with |
I am not sure where that extra text came from, and I don't think I touched any of the optimizations. It is possible that some of the scheduler heuristics changed slightly because I rolled back some of the line-number-removals previously done to clean up debugger experience, and line number is one of the tie-breakers. |
I see differences, they go in both directions, I think there is some other order-dependent bug in rewrite triggered by my change to the value ordering in fuse. Here, "+" is the post-CL version, i.e., shorter.
|
I did a quick experiment with the order of visiting blocks and values in rewrite.go, and it is definitely order-sensitive. Sadly, I proved this only by making the text larger. I suspect if we visited blocks in reverse post order and also scheduled their contents, that we would get the best outcome, though any new values allocated will probably invalidate the schedule. |
This is amusing. The cause of the larger text is
I probably should have done this in two CLs, I'll let that be my lesson. Your grumbling about a need for tools to understand rewrite application is also looking prescient.
|
Sorry for all the delays from me. Here's my set of reactions having skimmed the CL and your investigatory results (thanks!).
And with that list, I'm going to probably most disappear again for another couple of days. My apologies. :( |
I could start on the alternative approach, since that is hoped-for for 1.12, but I think the order dependence is more annoying than fraught. I'm pretty sure the quadratic risk has been reduced after dealing with Austin's remarks; not 100% certain, but its result is now used for skip-ahead which I think fixes it. pclntab is "s", whatever that is, and it might be the fault of the size command that it is considered text. I don't want to undo the line numbering change just for the sake of shrinking that table; that change is believed to make the numbering more accurate (for profiling purposes) relative to earlier compromise between debugging experience and profiling accuracy. I'm unsure about the goodness of detecting statements earlier; this worked better than I expected, enough that I would want examples (bug reports) showing failure. As far as I know, most of the existing failures are caused by peculiar combinations of rewriting and ordering, and not by screwups in initial ordering. It's my belief thats changing to the block-oriented approach would avoid all these bad-luck bugs. I'll look into the proposed cleanups, with an eye towards tiny, tractable CLs, since this is late in the game. |
I recommended an alternative and simpler statement tracking scheme in CL 93663: https://go-review.googlesource.com/c/go/+/93663#message-5b114391669f7db29632eb5b66dbec412c169e6a |
Change https://golang.org/cl/113717 mentions this issue: |
I checked for quadratic behavior of In the third it is not:
Quadratic behavior is limited to the number of statements of in a row that are marked |
I'm not following this:
|
There's a glitch in how attributes from procs that do not generate code are combined, and the workaround for this glitch appeared in two places. "One big pile is better than two little ones." Updates #25426. Change-Id: I252f9adc5b77591720a61fa22e6f9dda33d95350 Reviewed-on: https://go-review.googlesource.com/113717 Run-TryBot: David Chase <drchase@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
Change https://golang.org/cl/113719 mentions this issue: |
And with good reason. :) Please ignore. Thanks for checking on the quadratic concern. |
There's semantically-but-not-literally equivalent code in two cases for joining blocks' value lists in ssa/fuse.go. It can be made literally equivalent, then commoned up. Updates #25426. Change-Id: Id1819366c9d22e5126f9203dcd4c622423994110 Reviewed-on: https://go-review.googlesource.com/113719 Run-TryBot: David Chase <drchase@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
What is the status of this issue? Is there more to do for 1.11? For 1.12? |
Not more to do in 1.11, for 1.12 we might look into one or another different and perhaps more efficient ways of assigning statement boundaries. |
There is a blog post about big |
As I read the blog post, it was compressed, but for performance reasons we quit compressing it. |
There has been a push recently to further reduce start-up times; I don't think unilaterally deciding to (naively) compress A better alternative is to do some serious engineering work and look for alternative encodings of that data. I experimented a few years ago with using a smaller encoding for pcvalue pairs. From my notes to myself at that time:
There are also other streaming integer compression techniques, like varint-GB, Stream V-Byte, Roaring Bitmaps. They are all optimized for different purposes, so it is reasonably likely that anything would need to be adapted a bit for Go's particular data structure and needed. The point is merely that there is room for experimentation before reaching for the compressions sledgehammer. Another thing that I've pondered is supporting some kind of lookback in the encoding. Inlined code is very expensive in the pclntab, because the encoding required for a cross-file, cross-line jump is long. But for inlined code, we are often hopping back and forth between a few locations. Having a way to refer not to the previous value but to a value up to values back could help. (This is similar to keyframes and lookback in video encoding.) This has a runtime cost; again, investigation needed. And there's more to the pclntab than just pcvalue entries. The next step here is probably for someone to spend some quality time investigating all the design decisions in pclntab and coming up with an improved design and quantifying the impact it would have (as @dr2chase says). |
Yet another option is to compress subsections of pclntab using a compression technique that is designed for speed and low overhead, such that it can get decompressed on the fly, every single time it is needed, without appreciable performance loss. I am far from an expert in this area, but heatshrink and blosc come to mind. |
It has always bothered me that we have essentially a separate pcln table for every function. I suspect lots of functions are small, in which case maybe we should batch the tables across a larger region, say a 4K page. |
One of the things I noticed while examining is that there is a lot of repetition in symbol names. As a snippet from
One idea here would be to encode all names as a lookup to 3 different names. e.g. There's also inefficiency in encoding anonymous structs:
Here one idea would be to use the variable name for the typename symbol rather than the type description itself. Of course, the anonymous struct types are easy to fix manually as well. |
Volunteer (and often reluctant) toolspeed cop here. (Sorry, @dr2chase.)
CL 102435 has a non-trivial impact on compilation speed, memory usage, and binary size:
Reading the CL description, I can't quite tell what the value is to the user that offsets that impact.
I'd like to revisit whether the speed, memory impact, and binary size impacts can be mitigated at all before Go 1.11 is released, and I personally would like to understand better what benefits the CL brings.
Also, as an aside, I am slightly concerned about this bit from the CL description:
Value order should not matter, and cannot be relied upon. (And @randall77 has an old, outstanding CL to randomize Value order to enforce that.) And it is unclear whether the fuse.go changes need to be permanent or not.
I have very little dedicated laptop time in the immediate future due to personal life stuff, but I wanted to flag this right away. I'll look carefully at the CL and the impact as soon as I can.
cc @dr2chase @aclements @bradfitz
The text was updated successfully, but these errors were encountered: