cmd/compile: revisit statement boundaries CL performance and binary size impact #25426

Open
josharian opened this Issue May 16, 2018 · 17 comments

@josharian
Contributor

josharian commented May 16, 2018

Volunteer (and often reluctant) toolspeed cop here. (Sorry, @dr2chase.)

CL 102435 has a non-trivial impact on compilation speed, memory usage, and binary size:

name        old time/op       new time/op       delta
Template          176ms ± 2%        181ms ± 3%  +2.61%  (p=0.000 n=45+50)
Unicode          87.5ms ± 5%       87.9ms ± 4%    ~     (p=0.147 n=48+49)
GoTypes           557ms ± 4%        569ms ± 2%  +2.18%  (p=0.000 n=42+44)
Compiler          2.65s ± 3%        2.70s ± 3%  +1.82%  (p=0.000 n=49+49)
SSA               7.16s ± 2%        7.37s ± 2%  +3.00%  (p=0.000 n=48+47)
Flate             118ms ± 2%        123ms ± 3%  +4.05%  (p=0.000 n=48+49)
GoParser          138ms ± 3%        143ms ± 2%  +3.28%  (p=0.000 n=49+47)
Reflect           360ms ± 3%        367ms ± 3%  +1.76%  (p=0.000 n=48+48)
Tar               157ms ± 4%        160ms ± 3%  +2.31%  (p=0.000 n=50+49)
XML               201ms ± 4%        207ms ± 3%  +2.79%  (p=0.000 n=48+49)
[Geo mean]        353ms             362ms       +2.42%

name        old user-time/op  new user-time/op  delta
Template          215ms ± 3%        219ms ± 3%  +1.67%  (p=0.000 n=48+49)
Unicode           110ms ± 5%        110ms ± 3%    ~     (p=0.051 n=48+46)
GoTypes           741ms ± 4%        749ms ± 3%  +1.05%  (p=0.000 n=47+46)
Compiler          3.60s ± 4%        3.63s ± 2%  +0.84%  (p=0.002 n=44+49)
SSA               10.3s ± 4%        10.5s ± 2%  +2.13%  (p=0.000 n=44+46)
Flate             138ms ± 3%        143ms ± 3%  +3.28%  (p=0.000 n=48+46)
GoParser          159ms ± 3%        175ms ± 4%  +9.82%  (p=0.000 n=50+47)
Reflect           464ms ± 2%        466ms ± 3%  +0.47%  (p=0.020 n=47+49)
Tar               195ms ± 4%        198ms ± 3%  +1.40%  (p=0.000 n=50+46)
XML               241ms ± 9%        258ms ± 3%  +7.04%  (p=0.000 n=50+48)
[Geo mean]        446ms             458ms       +2.79%

name        old alloc/op      new alloc/op      delta
Template         35.1MB ± 0%       36.8MB ± 0%   +4.91%  (p=0.008 n=5+5)
Unicode          29.3MB ± 0%       29.8MB ± 0%   +1.59%  (p=0.008 n=5+5)
GoTypes           115MB ± 0%        121MB ± 0%   +5.15%  (p=0.008 n=5+5)
Compiler          521MB ± 0%        560MB ± 0%   +7.48%  (p=0.008 n=5+5)
SSA              1.71GB ± 0%       1.91GB ± 0%  +11.69%  (p=0.008 n=5+5)
Flate            24.2MB ± 0%       25.4MB ± 0%   +4.91%  (p=0.008 n=5+5)
GoParser         28.1MB ± 0%       29.5MB ± 0%   +4.87%  (p=0.008 n=5+5)
Reflect          78.7MB ± 0%       82.4MB ± 0%   +4.65%  (p=0.008 n=5+5)
Tar              34.5MB ± 0%       36.1MB ± 0%   +4.62%  (p=0.008 n=5+5)
XML              43.3MB ± 0%       45.5MB ± 0%   +5.27%  (p=0.008 n=5+5)
[Geo mean]       78.1MB            82.4MB        +5.48%

name        old allocs/op     new allocs/op     delta
Template           328k ± 0%         336k ± 0%   +2.59%  (p=0.008 n=5+5)
Unicode            336k ± 0%         338k ± 0%   +0.37%  (p=0.008 n=5+5)
GoTypes           1.14M ± 0%        1.17M ± 0%   +2.29%  (p=0.008 n=5+5)
Compiler          4.77M ± 0%        4.88M ± 0%   +2.23%  (p=0.008 n=5+5)
SSA               13.7M ± 0%        14.0M ± 0%   +2.49%  (p=0.008 n=5+5)
Flate              220k ± 0%         226k ± 0%   +2.71%  (p=0.008 n=5+5)
GoParser           273k ± 0%         280k ± 0%   +2.29%  (p=0.008 n=5+5)
Reflect            940k ± 0%         971k ± 0%   +3.32%  (p=0.008 n=5+5)
Tar                321k ± 0%         330k ± 0%   +2.65%  (p=0.008 n=5+5)
XML                383k ± 0%         390k ± 0%   +1.94%  (p=0.008 n=5+5)
[Geo mean]         751k              768k        +2.28%

name        old text-bytes    new text-bytes    delta
HelloSize         672kB ± 0%        680kB ± 0%  +1.22%  (p=0.000 n=50+50)

name        old data-bytes    new data-bytes    delta
HelloSize         134kB ± 0%        134kB ± 0%    ~     (all equal)

name        old exe-bytes     new exe-bytes     delta
HelloSize        1.43MB ± 0%       1.49MB ± 0%  +4.00%  (p=0.000 n=50+50)

Reading the CL description, I can't quite tell what the value is to the user that offsets that impact.

I'd like to revisit whether the speed, memory, and binary size impacts can be mitigated at all before Go 1.11 is released, and I personally would like to understand better what benefits the CL brings.

Also, as an aside, I am slightly concerned about this bit from the CL description:

The code in fuse.go that glued together the value slices
of two blocks produced a result that depended on the
former capacities (not lengths) of the two slices.  This
causes differences in the 386 bootstrap, and also can
sometimes put values into an order that does a worse job
of preserving statement boundaries when values are removed.

Value order should not matter, and cannot be relied upon. (And @randall77 has an old, outstanding CL to randomize Value order to enforce that.) And it is unclear whether the fuse.go changes need to be permanent or not.

I have very little dedicated laptop time in the immediate future due to personal life stuff, but I wanted to flag this right away. I'll look carefully at the CL and the impact as soon as I can.

cc @dr2chase @aclements @bradfitz

@josharian josharian added this to the Go1.11 milestone May 16, 2018

@dr2chase

Contributor

dr2chase commented May 16, 2018

We got a nice improvement in compilation time from improvements to imports; we get to spend some of that on other things.

There is in fact a dependence on value order, and not depending on value order would induce a much larger compile-time cost.

The value to the end user is that debugger breakpoints more often correspond to the correct position in the source code, and that the stepping/nexting experience within a debugger is much less erratic and confusing. This also means that printing values in the debugger more often yields the "correct" result (i.e., conforming to the user's expectations).

As a side effect, it becomes less necessary to get this same effect by changing the line number of "unimportant" instructions (to avoid debugger-visible line-number churn), which should make profiling slightly more accurate.

Late in the game I thought of what could be a better approach to this that should be both faster and less sensitive to value order, but it was very late, and the alternate method is completely untested.

@dr2chase

Contributor

dr2chase commented May 16, 2018

So, re late-in-the-game method, how about we make investigating that a high priority for 1.12?
The method is to quit tracking at the value level until late scheduling, record instead (for each block) the sequence of statements encountered within it, and then pin statement boundaries to values during/after scheduling. It should cost less time because it's only done once, and it removes the value-order dependence, which is a good thing.
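
For concreteness, here is a toy Go sketch of the bookkeeping being proposed. Every type and name here is invented for illustration and has no relation to the actual cmd/compile/internal/ssa API: each block only records, in order, the source lines of the statements it contains, and after scheduling each recorded line is pinned to the first value in the block that can carry a statement mark.

    package main

    import "fmt"

    // Toy model of per-block statement tracking: record statement lines while
    // building the block, then pin each line to a suitable value after scheduling.
    type value struct {
        line   int
        isStmt bool
        good   bool // whether this op is acceptable as a statement boundary
    }

    type block struct {
        stmtLines []int    // statement source lines seen while building the block, in order
        values    []*value // values in final (post-scheduling) order
    }

    // pinStatements marks, for each recorded statement line, the first value in
    // the block on that line whose op can carry the statement boundary.
    func pinStatements(b *block) {
        for _, line := range b.stmtLines {
            for _, v := range b.values {
                if v.line == line && v.good {
                    v.isStmt = true
                    break
                }
            }
        }
    }

    func main() {
        b := &block{
            stmtLines: []int{10, 11},
            values: []*value{
                {line: 10, good: false}, // e.g. a register copy: skip it
                {line: 10, good: true},
                {line: 11, good: true},
            },
        }
        pinStatements(b)
        for i, v := range b.values {
            fmt.Printf("v%d: line %d, isStmt=%v\n", i, v.line, v.isStmt)
        }
    }

The appeal is exactly what is described above: the pinning happens once, after scheduling, so value order in the middle of the pipeline stops mattering.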

It won't reduce allocations much beyond what we do now, because that's already been optimized down to O(1) allocation per function (I think) and it won't help binary size in the least because that information takes space no matter what. We are, however, talking about compressing debugging information in the near-ish future, so there's that hope.

@josharian

Contributor

josharian commented May 17, 2018

Thanks, @dr2chase. I want to read a bit more before I reply to your comments.

One quick question, though. Was the intent that this CL only change DWARF info? I ask because I notice that the hello world text bytes are up 1.22%, and I would have expected that DWARF changes would only impact overall binary size (which is up 4%). And a quick late-night test with -ldflags=-w shows a binary size increase. The obvious follow-on questions (if I understand correctly) are: Where does the extra 1.22% come from? And has the effectiveness of any (important) optimizations been impaired?

@dr2chase

Contributor

dr2chase commented May 17, 2018

I am not sure where that extra text came from, and I don't think I touched any of the optimizations.
I'll look.

It is possible that some of the scheduler heuristics changed slightly because I rolled back some of the line-number-removals previously done to clean up debugger experience, and line number is one of the tie-breakers.

@dr2chase

Contributor

dr2chase commented May 17, 2018

I see differences; they go in both directions. I think there is some other order-dependent bug in rewrite, triggered by my change to the value ordering in fuse. Here, "+" is the post-CL version, i.e., shorter.
This is from strconv.roundShortest:

-    v148 = LEAQ <*decimal> {.autotmp_16} v2 : R9
-    v173 = MOVBloadidx1 <byte> v119 v148 v115 : R10 (u[byte])
+    v173 = MOVBloadidx1 <byte> {.autotmp_16} v119 v2 v115 : R9 (u[byte])
@dr2chase

Contributor

dr2chase commented May 17, 2018

I did a quick experiment with the order of visiting blocks and values in rewrite.go, and it is definitely order-sensitive. Sadly, I proved this only by making the text larger. I suspect that if we visited blocks in reverse postorder and also scheduled their contents, we would get the best outcome, though any new values allocated will probably invalidate the schedule.

@dr2chase

Contributor

dr2chase commented May 17, 2018

This is amusing. The cause of the larger text is runtime.pclntab.
But rewrite IS order dependent.
However, changing fuse to preserve value order, no matter what the respective block.Values capacities are, actually tickles this bug in the proper direction, producing smaller code, not larger (and hence probably slightly more efficient code).

runtime.pclntab is larger because I rolled back some changes intended to improve debugging the old way, where "unimportant" instructions are assigned the line number of their neighbors to avoid debugger-step-churn.

I probably should have done this in two CLs; I'll let that be my lesson. Your grumbling about a need for tools to understand rewrite application is also looking prescient.

-238    runtime.epclntab        2928,2690
-80     encoding/asn1.parseField        14784,14704
-48     encoding/json.(*decodeState).object     8320,8272
-48     vendor/golang_org/x/crypto/cryptobyte.(*String).ReadASN1Integer 1968,1920
-32     encoding/json.(*decodeState).literalStore       10176,10144
-32     fmt.(*pp).printArg      2688,2656
-32     reflect.DeepEqual       928,896
-32     regexp/syntax.(*parser).repeat  1552,1520
-32     runtime.mapaccess1      576,544
-32     runtime.mapaccess1_fast32       448,416
-32     runtime.updatememstats  672,640
-32     text/template.(*state).idealConstant    1104,1072
-32     text/template.(*state).walkRange        2128,2096
-32     vendor/golang_org/x/crypto/cryptobyte.(*String).ReadOptionalASN1Integer 2016,1984
-16     crypto/tls.(*serverHelloMsg).marshal    3280,3264
-16     crypto/x509.(*Certificate).checkNameConstraints 2512,2496
-16     encoding/asn1.MarshalWithParams 720,704
-16     encoding/asn1.UnmarshalWithParams       720,704
-16     encoding/gob.(*Decoder).Decode  576,560
-16     encoding/gob.(*Decoder).recvType        544,528
-16     encoding/gob.(*Encoder).Encode  288,272
-16     encoding/gob.(*Encoder).sendActualType  1696,1680
-16     encoding/gob.compileEnc 1808,1792
-16     fmt.(*pp).catchPanic    1264,1248
-16     fmt.(*pp).fmtBytes      2416,2400
-16     fmt.(*pp).printValue    10256,10240
-16     fmt.(*ss).scanOne       5024,5008
-16     fmt.intFromArg  512,496
-16     internal/poll.(*FD).Accept      704,688
-16     math/big.lehmerSimulate 384,368
-16     net/http.(*Request).write       3408,3392
-16     net/http.(*http2ClientConn).encodeHeaders       2736,2720
-16     reflect.Value.Elem      480,464
-16     reflect.Value.Field     352,336
-16     regexp.mergeRuneSets    2608,2592
-16     runtime.(*mcache).releaseAll    176,160
-16     runtime.(*mspan).sweep  2400,2384
-16     runtime.adjustpointers  576,560
-16     runtime.advanceEvacuationMark   240,224
-16     runtime.clearpools      368,352
-16     runtime.freeStackSpans  432,416
-16     runtime.freedefer.func1 368,352
-16     runtime.mallocinit      608,592
-16     runtime.mapaccess1_fast64       448,432
-16     runtime.mapaccess1_faststr      1088,1072
-16     runtime.mapaccess2      576,560
-16     runtime.mapaccess2_fast32       464,448
-16     runtime.mapaccess2_fast64       464,448
-16     runtime.mapaccess2_faststr      1152,1136
-16     runtime.moduledataverify1       1696,1680
-16     runtime.newdefer.func1  480,464
-16     runtime.purgecachedstats        192,176
-16     runtime.typelinksinit   1760,1744
-16     runtime.wbBufFlush1     640,624
-16     runtime/pprof.parseProcSelfMaps 1248,1232
-16     strconv.roundShortest   960,944
-16     text/template.(*state).evalArg  3872,3856
-16     text/template.(*state).evalCommand      2400,2384
-16     text/template.(*state).evalEmptyInterface       2048,2032
-16     text/template.(*state).evalField        3792,3776
-16     text/template.addValueFuncs     1360,1344
-16     text/template.evalArgs  544,528
-16     text/template.length    704,688
-16     vendor/golang_org/x/crypto/chacha20poly1305/internal/chacha20.XORKeyStream      656,640
-16     vendor/golang_org/x/net/http2/hpack.newStaticTable      864,848
-16     vendor/golang_org/x/net/route.ParseRIB  1232,1216
-4      runtime.findfunctab     17053,17049
4       runtime.erodata 387,391
4       runtime.etypes  387,391
16      crypto/elliptic.p224DoubleJacobian      1104,1120
16      crypto/tls.(*Conn).readRecord   5808,5824
16      crypto/tls.(*clientHelloMsg).unmarshal  4432,4448
16      encoding/gob.(*Encoder).encodeInterface 2624,2640
16      net/http.(*http2Framer).WriteGoAway     864,880
16      regexp.(*machine).onepass       1936,1952
16      syscall.ParseDirent     1168,1184
16      text/template.(*state).evalCall 3824,3840
16      text/template.(*state).walkTemplate     992,1008
64      internal/poll.(*FD).ReadDirent  368,432
288     go.func.*       182310,182598
58414   runtime.pclntab 1671824,1730238
@josharian

Contributor

josharian commented May 18, 2018

Sorry for all the delays from me. Here's my set of reactions having skimmed the CL and your investigatory results (thanks!).

  • I think we should investigate the alternative approach, possibly even for 1.11 depending on how it works out. The dependence on Value order seems fraught with peril, and the alternative approach seems a lot cleaner.

  • We might even (for 1.12?) want to identify statements much earlier, instead of during AST to SSA translation. Many statements present in AST to SSA translation are part of lowering or ordering code, and aren't actually present in the source. It seems to me we should mark statements right after parsing, since that's the moment at which we have the closest correspondence to the input source. It'd also sidestep the AVARLIVE etc handling.

  • Nice find on the fuse change and rewrite rules! I'd like to find some way to programmatically detect order-dependence in the rewrite rules, so that we can start systematically fixing them. I don't have any simple ideas, though. (One complicated idea, sketched after this list, is to have a debug mode in which we do the following. Before rewriting, make a deep copy of a function, its blocks, and its values. Randomize block and value order. Rewrite. Make another deep copy, randomize again, and rewrite. Then use topological ordering to correlate blocks, and some kind of isomorphism to correlate values. Check whether the results are identical. If not, complain. Repeat.) Or alternatively, switch to a more sophisticated rewrite rule generation engine.

  • I think we should send a CL soon to put back the line-number-as-last-resort scheduling heuristic, to shrink pclntab back, if feasible.

  • Why is pclntab part of TEXT? I'd expect it to be RODATA.

  • I'm a bit concerned that the rewrite rules might now be (mildly) quadratic. Note that for each Value changed we potentially do O(n) work as part of nextGoodStatementIndex. This could conceivably blow up for code that has many, many Values in a single Block. Unlikely, but still concerning.

  • I am mildly depressed by the number of sparse data structures we have. I wonder whether some judicious embedding/composition might help. (Or generics. :P) For later.

  • It'd be nice to refactor the handling of APCDATA, AFUNCDATA, and ANOP into a function, which could then have nice docs attached to it explaining what is happening and why.

  • Can you avoid generating the liveOrderStmts slice if you're not going to use it, in liveValues? That might help with allocs.

  • In fuse, we should clear the tail of c.Values in the in-place case, to allow them to be GC'd. Also, if you make the new slice with the correct length instead of the correct cap in the reallocate case, then you could copy instead of append, and the copying code could be shared.

  • Can you move numberlines after "early phielim", and maybe "early copyelim"? That'd give it much less work to do, since those passes tend to significantly reduce the number of Values present.

  • Are we sure that numberlines isn't quadratic, for similar reasons as the rewrite rules above?
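
For the order-dependence idea above, here is a rough, self-contained Go sketch of the shape such a debug check could take. Everything in it is invented for illustration: the value type, the stand-in rewrite pass, and the string fingerprinting. A real version would deep-copy an ssa.Func and correlate blocks and values structurally rather than by sorted strings.

    package main

    import (
        "fmt"
        "math/rand"
        "sort"
    )

    // Toy order-dependence probe: run the same rewrite-like pass over several
    // randomly shuffled copies of one input and compare an order-insensitive
    // fingerprint of the results.
    type value struct {
        name string
        op   string
        args []string // names of argument values
    }

    // rewritePass is a placeholder for the generated rewrite rules.
    func rewritePass(vs []value) []value {
        out := make([]value, len(vs))
        copy(out, vs)
        return out
    }

    // fingerprint renders each value and sorts the renderings, so the comparison
    // ignores the order in which values happen to be stored.
    func fingerprint(vs []value) string {
        lines := make([]string, len(vs))
        for i, v := range vs {
            lines[i] = fmt.Sprintf("%s = %s %v", v.name, v.op, v.args)
        }
        sort.Strings(lines)
        return fmt.Sprint(lines)
    }

    // orderDependent reports whether shuffling the input changes the result of
    // the pass (up to reordering of the output).
    func orderDependent(vs []value, trials int) bool {
        var base string
        for t := 0; t < trials; t++ {
            cp := make([]value, len(vs))
            copy(cp, vs)
            rand.Shuffle(len(cp), func(i, j int) { cp[i], cp[j] = cp[j], cp[i] })
            fp := fingerprint(rewritePass(cp))
            if t == 0 {
                base = fp
            } else if fp != base {
                return true // same input, different result: the pass is order-dependent
            }
        }
        return false
    }

    func main() {
        vs := []value{
            {name: "v1", op: "Const64"},
            {name: "v2", op: "Const64"},
            {name: "v3", op: "Add64", args: []string{"v1", "v2"}},
        }
        fmt.Println("order dependent:", orderDependent(vs, 8))
    }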

And with that list, I'm probably going to mostly disappear again for another couple of days. My apologies. :(

@dr2chase

Contributor

dr2chase commented May 18, 2018

I could start on the alternative approach, since that is hoped for in 1.12, but I think the order dependence is more annoying than fraught. I'm pretty sure the quadratic risk has been reduced after dealing with Austin's remarks; not 100% certain, but its result is now used for skip-ahead, which I think fixes it.

pclntab is "s", whatever that is, and it might be the fault of the size command that it is considered text. I don't want to undo the line-numbering change just for the sake of shrinking that table; that change is believed to make the numbering more accurate (for profiling purposes) relative to the earlier compromise between debugging experience and profiling accuracy.

I'm unsure about the goodness of detecting statements earlier; this worked better than I expected, enough that I would want examples (bug reports) showing failure. As far as I know, most of the existing failures are caused by peculiar combinations of rewriting and ordering, and not by screwups in initial ordering. It's my belief that changing to the block-oriented approach would avoid all these bad-luck bugs.

I'll look into the proposed cleanups, with an eye towards tiny, tractable CLs, since this is late in the game.

@mdempsky

Member

mdempsky commented May 18, 2018

I recommended an alternative and simpler statement tracking scheme in CL 93663: https://go-review.googlesource.com/c/go/+/93663#message-5b114391669f7db29632eb5b66dbec412c169e6a

@gopherbot

gopherbot commented May 18, 2018

Change https://golang.org/cl/113717 mentions this issue: cmd/compile: refactor and cleanup of common code/workaround

@dr2chase

Contributor

dr2chase commented May 18, 2018

I checked for quadratic behavior of nextGoodStatementIndex; I think it is mostly okay. There are three uses, and in two of them the result is used to skip ahead (and the cost of skipping ahead is proportional to the distance skipped).

In the third it is not:

    if v.Pos.IsStmt() == src.PosIsStmt {
        if k := nextGoodStatementIndex(v, j, b); k != j {
            v.Pos = v.Pos.WithNotStmt()
            b.Values[k].Pos = b.Values[k].Pos.WithIsStmt()
        }
    }

Quadratic behavior is limited to the number of statements in a row that are marked PosIsStmt for the same line with a "not Good" opcode. Such runs are not supposed to occur, though rewriting might create them -- but their size would be limited to the expansion of an instruction that was marked as a statement. I'd like to do better, but I think this risk is well-fenced.
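
To illustrate the skip-ahead argument for the other two uses, here is a tiny self-contained model (all names invented; the real code is the number-lines/scheduling logic): the helper's forward scan costs time proportional to the distance it looks, but because the caller jumps its index to the returned position, the overall loop stays linear.

    package main

    import "fmt"

    // nextGood scans forward from j to the next "good" position (or the end);
    // its cost is proportional to how far it has to look.
    func nextGood(good []bool, j int) int {
        for ; j < len(good)-1 && !good[j]; j++ {
        }
        return j
    }

    // scan models the skip-ahead callers: each forward scan is paid for once,
    // because j jumps to the index that nextGood returned.
    func scan(good []bool) (work int) {
        for j := 0; j < len(good); j++ {
            k := nextGood(good, j)
            work += k - j + 1 // cost of this call's forward scan
            j = k             // skip ahead past positions already examined
        }
        return work
    }

    func main() {
        good := []bool{false, false, true, false, true, true, false, false, true}
        // Total work stays O(len(good)) rather than O(len(good)^2).
        fmt.Println("values:", len(good), "work:", scan(good))
    }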

@dr2chase

Contributor

dr2chase commented May 18, 2018

I'm not following this:

In fuse, we should clear the tail of c.Values in the in-place case, to allow them to be GC'd.

c.Values gets b.Values followed by c.Values. For the in-place case, c.Values is expanded, then the old values are moved to higher indices with copy(c.Values[bl:], c.Values). This creates no new non-zero entries, since the "old" c.Values is overwritten by some combination of values from b and c, and the newly addressed area (between indices cl and bl+cl) is filled with the last bl elements of the old c.Values.
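
A toy version of the in-place case being described, with plain strings standing in for *ssa.Value (the bl/cl names follow the comment above; this only illustrates the slice mechanics, not the real fuse code):

    package main

    import "fmt"

    // fuseInPlace joins b's values in front of c's values inside c's existing
    // backing array: grow c within its capacity, shift the old c elements to
    // higher indices (copy handles the overlap), then copy b into the front.
    func fuseInPlace(b, c []string) []string {
        bl, cl := len(b), len(c)
        c = c[:bl+cl]        // assumes cap(c) >= bl+cl, i.e. the in-place case
        copy(c[bl:], c[:cl]) // old c values move to indices bl..bl+cl-1
        copy(c[:bl], b)      // b's values come first
        return c
    }

    func main() {
        c := make([]string, 2, 8) // extra capacity makes the in-place case apply
        c[0], c[1] = "c0", "c1"
        b := []string{"b0", "b1", "b2"}
        fmt.Println(fuseInPlace(b, c)) // [b0 b1 b2 c0 c1]
    }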

gopherbot pushed a commit that referenced this issue May 18, 2018

cmd/compile: refactor and cleanup of common code/workaround
There's a glitch in how attributes from procs that do not
generate code are combined, and the workaround for this
glitch appeared in two places.

"One big pile is better than two little ones."

Updates #25426.

Change-Id: I252f9adc5b77591720a61fa22e6f9dda33d95350
Reviewed-on: https://go-review.googlesource.com/113717
Run-TryBot: David Chase <drchase@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
@gopherbot

gopherbot commented May 18, 2018

Change https://golang.org/cl/113719 mentions this issue: cmd/compile: common up code in fuse for joining blocks

@josharian

Contributor

josharian commented May 21, 2018

I'm not following this

And with good reason. :) Please ignore.

Thanks for checking on the quadratic concern.

gopherbot pushed a commit that referenced this issue May 22, 2018

cmd/compile: common up code in fuse for joining blocks
There's semantically-but-not-literally equivalent code in
two cases for joining blocks' value lists in ssa/fuse.go.
It can be made literally equivalent, then commoned up.

Updates #25426.

Change-Id: Id1819366c9d22e5126f9203dcd4c622423994110
Reviewed-on: https://go-review.googlesource.com/113719
Run-TryBot: David Chase <drchase@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
@ianlancetaylor

Contributor

ianlancetaylor commented Jul 10, 2018

What is the status of this issue? Is there more to do for 1.11? For 1.12?

@dr2chase

Contributor

dr2chase commented Jul 10, 2018

Nothing more to do in 1.11; for 1.12 we might look into one or another different, and perhaps more efficient, way of assigning statement boundaries.
