cmd/compile: performance problems with many long, exported identifiers #18602

cretz opened this Issue Jan 10, 2017 · 22 comments

@cretz
cretz commented Jan 10, 2017 edited

What version of Go are you using (go version)?

1.7, 1.8+

What operating system and processor architecture are you using (go env)?

win64 and nix64

What did you do?

Tried to compile 150MB of code in a single package. Really it was a simple main.go that referenced this huge package; I ran "go build", but it was the compile executable that consumed ridiculous resources. This happens even when I cut the code down to half that size (~75 MB). Disabling inlining and/or optimizations only delays the extreme resource usage; it does not eliminate it.

What did you expect to see?

A successful build with modest RAM usage that does not climb to extreme proportions, ideally one that "streams" the build so it is limited only by the disk space needed for the result.

What did you see instead?

Memory consumption proportional to the code size, ending in an out-of-memory failure (Windows) or swapping forever until a kill signal (Linux).

I am afraid I cannot provide the large set of code at the moment. But if necessary, I can build a Go generator that produces a ton of interfaces, structs, and functions. Original thread: https://groups.google.com/forum/#!topic/golang-nuts/sBBkQ1_xf2Q.

@bradfitz
Member

Without a specific repro, I don't foresee many people jumping on this bug.

But once you have a repro, it becomes much more exciting to debug and fix.

@cretz
cretz commented Jan 10, 2017

@bradfitz I was hoping the bug was obvious ("large code can't be compiled") but I understand the need to reproduce. I will take some time to write an extreme Go code generator to replicate my issue (or maybe post a link to a large set of code causing this issue).

@bradfitz
Member

@cretz, no, it's not obvious at all. It's not even obvious whether you're hitting problems in the compiler (e.g. the SSA optimization phases, and there are many), or in the linker. Every time a bug like this has come up, it's had a very unique cause and fix.

@bradfitz bradfitz changed the title from Go compile consuming lots of RAM on really large packages to cmd/compile: (or cmd/link?) consumes lots of RAM on really large packages Jan 11, 2017
@cretz
cretz commented Jan 11, 2017 edited

@bradfitz - Code to reproduce: repro-18602.zip

I was able to get ~130MB of the code to compress down to 10MB so I could attach it to this issue. See the link above. It's from a transpiler I am working on here. Simply extract the ZIP, navigate into the extracted folder, and run go build to see the issue. For me on Windows:

    # _/C_/work/temp/repro/rt
    runtime: VirtualAlloc of 1048576 bytes failed with errno=1455
    fatal error: runtime: cannot map pages in arena address space

    runtime stack:
    runtime.throw(0x921776, 0x30)
            c:/go/src/runtime/panic.go:566 +0x9c
    runtime.sysMap(0xc159810000, 0x100000, 0x7fc01, 0xb7eb38)
            c:/go/src/runtime/mem_windows.go:116 +0x12c
    runtime.(*mheap).sysAlloc(0xb5cbc0, 0x100000, 0xb5cbd8)
            c:/go/src/runtime/malloc.go:407 +0x381
    runtime.(*mheap).grow(0xb5cbc0, 0x1, 0x0)
            c:/go/src/runtime/mheap.go:726 +0x69
    runtime.(*mheap).allocSpanLocked(0xb5cbc0, 0x1, 0xb4afc0)
            c:/go/src/runtime/mheap.go:630 +0x4f9
    runtime.(*mheap).alloc_m(0xb5cbc0, 0x1, 0xa, 0xa65c908)
            c:/go/src/runtime/mheap.go:515 +0xee
    runtime.(*mheap).alloc.func1()
            c:/go/src/runtime/mheap.go:579 +0x52
    runtime.systemstack(0x7fdb0)
            c:/go/src/runtime/asm_amd64.s:314 +0xb5
    runtime.(*mheap).alloc(0xb5cbc0, 0x1, 0x1000000000a, 0x0)
            c:/go/src/runtime/mheap.go:580 +0x7a
    runtime.(*mcentral).grow(0xb5e2f0, 0x0)
            c:/go/src/runtime/mcentral.go:210 +0x9b
    runtime.(*mcentral).cacheSpan(0xb5e2f0, 0xa65c908)
            c:/go/src/runtime/mcentral.go:91 +0x101
    runtime.(*mcache).refill(0x1a0000, 0xc10000000a, 0xa65c908)
            c:/go/src/runtime/mcache.go:121 +0xbc
    runtime.(*mcache).nextFree.func1()
            c:/go/src/runtime/malloc.go:505 +0x3a
    runtime.systemstack(0xb4b000)
            c:/go/src/runtime/asm_amd64.s:298 +0x7e
    runtime.mstart()
            c:/go/src/runtime/proc.go:1079

    goroutine 1 [running]:
    runtime.systemstack_switch()
            c:/go/src/runtime/asm_amd64.s:252 fp=0xc105a60690 sp=0xc105a60688
    runtime.(*mcache).nextFree(0x1a0000, 0xa, 0x0, 0xd, 0xd)
            c:/go/src/runtime/malloc.go:506 +0xb9 fp=0xc105a606e8 sp=0xc105a60690
    runtime.mallocgc(0x90, 0x900300, 0x1, 0xc1597f8bd0)
            c:/go/src/runtime/malloc.go:658 +0x848 fp=0xc105a60788 sp=0xc105a606e8
    runtime.newobject(0x900300, 0xd)
            c:/go/src/runtime/malloc.go:785 +0x3f fp=0xc105a607b8 sp=0xc105a60788
    cmd/compile/internal/gc.Nod(0xc105a60701, 0x0, 0x0, 0x6c6e692e00000001)
            c:/go/src/cmd/compile/internal/gc/subr.go:335 +0x38 fp=0xc105a607e0 sp=0xc105a607b8
    cmd/compile/internal/gc.newname(0xc1597f8bd0, 0x7)
            c:/go/src/cmd/compile/internal/gc/dcl.go:334 +0x51 fp=0xc105a60820 sp=0xc105a607e0
    cmd/compile/internal/gc.newlabel_inl(0xc105a609d8)
            c:/go/src/cmd/compile/internal/gc/inl.go:883 +0x68 fp=0xc105a60850 sp=0xc105a60820
    cmd/compile/internal/gc.mkinlcall1(0xc07db27320, 0xc07da78630, 0x0, 0x0)
            c:/go/src/cmd/compile/internal/gc/inl.go:769 +0xddc fp=0xc105a60d98 sp=0xc105a60850
    cmd/compile/internal/gc.mkinlcall(0xc07db27320, 0xc07da78630, 0x0, 0x0)
            c:/go/src/cmd/compile/internal/gc/inl.go:500 +0x7f fp=0xc105a60dd0 sp=0xc105a60d98
    cmd/compile/internal/gc.inlnode(0xc07db27320, 0x128fdf)
            c:/go/src/cmd/compile/internal/gc/inl.go:481 +0x948 fp=0xc105a60fe8 sp=0xc105a60dd0
    cmd/compile/internal/gc.inlnode(0xc07db273b0, 0x128fdf)
            c:/go/src/cmd/compile/internal/gc/inl.go:365 +0x11e fp=0xc105a61200 sp=0xc105a60fe8
    cmd/compile/internal/gc.inlnode(0xc07db274d0, 0xc10d8ab440)
            c:/go/src/cmd/compile/internal/gc/inl.go:365 +0x11e fp=0xc105a61418 sp=0xc105a61200
    cmd/compile/internal/gc.inlnodelist(0xc07db28180)
            c:/go/src/cmd/compile/internal/gc/inl.go:318 +0x69 fp=0xc105a61458 sp=0xc105a61418
    cmd/compile/internal/gc.inlnode(0xc07db27560, 0xc07db26cf0)
            c:/go/src/cmd/compile/internal/gc/inl.go:379 +0x1c6 fp=0xc105a61670 sp=0xc105a61458
    cmd/compile/internal/gc.inlnodelist(0xc07db29a80)
            c:/go/src/cmd/compile/internal/gc/inl.go:318 +0x69 fp=0xc105a616b0 sp=0xc105a61670
    cmd/compile/internal/gc.inlnode(0xc07daf4ab0, 0x0)
            c:/go/src/cmd/compile/internal/gc/inl.go:436 +0x353 fp=0xc105a618c8 sp=0xc105a616b0
    cmd/compile/internal/gc.inlcalls(0xc07daf4ab0)
            c:/go/src/cmd/compile/internal/gc/inl.go:274 +0x57 fp=0xc105a61908 sp=0xc105a618c8
    cmd/compile/internal/gc.Main.func1(0xc14134f008, 0x6, 0xff, 0xc105a61900)
            c:/go/src/cmd/compile/internal/gc/main.go:435 +0x71 fp=0xc105a61938 sp=0xc105a61908
    cmd/compile/internal/gc.(*bottomUpVisitor).visit(0xc105a61b88, 0xc07daf4ab0, 0x393fe)
            c:/go/src/cmd/compile/internal/gc/esc.go:106 +0x2b5 fp=0xc105a619d8 sp=0xc105a61938
    cmd/compile/internal/gc.(*bottomUpVisitor).visitcode(0xc105a61b88, 0xc07daf25a0, 0x393fe, 0x393fe)
            c:/go/src/cmd/compile/internal/gc/esc.go:137 +0x175 fp=0xc105a61a08 sp=0xc105a619d8
    cmd/compile/internal/gc.(*bottomUpVisitor).visitcode(0xc105a61b88, 0xc07daf2630, 0xc1000393fe, 0x393fe)
            c:/go/src/cmd/compile/internal/gc/esc.go:126 +0x9c fp=0xc105a61a38 sp=0xc105a61a08
    cmd/compile/internal/gc.(*bottomUpVisitor).visitcode(0xc105a61b88, 0xc07daf26c0, 0x393fe, 0x393fe)
            c:/go/src/cmd/compile/internal/gc/esc.go:126 +0x9c fp=0xc105a61a68 sp=0xc105a61a38
    cmd/compile/internal/gc.(*bottomUpVisitor).visitcodelist(0xc105a61b88, 0xc07dacde80, 0xc1000393fe, 0xc105a61b0c)
            c:/go/src/cmd/compile/internal/gc/esc.go:114 +0x6d fp=0xc105a61ab0 sp=0xc105a61a68
    cmd/compile/internal/gc.(*bottomUpVisitor).visit(0xc105a61b88, 0xc07daec480, 0xc1000393f0)
            c:/go/src/cmd/compile/internal/gc/esc.go:80 +0x143 fp=0xc105a61b50 sp=0xc105a61ab0
    cmd/compile/internal/gc.visitBottomUp(0xc12ed46000, 0x432b4, 0x4c000, 0x9625c8)
            c:/go/src/cmd/compile/internal/gc/esc.go:62 +0xdd fp=0xc105a61bc8 sp=0xc105a61b50
    cmd/compile/internal/gc.Main()
            c:/go/src/cmd/compile/internal/gc/main.go:438 +0x1c50 fp=0xc105a61e10 sp=0xc105a61bc8
    cmd/compile/internal/amd64.Main()
            c:/go/src/cmd/compile/internal/amd64/galign.go:93 +0x301 fp=0xc105a61e48 sp=0xc105a61e10
    main.main()
            c:/go/src/cmd/compile/main.go:33 +0x2aa fp=0xc105a61ed8 sp=0xc105a61e48
    runtime.main()
            c:/go/src/runtime/proc.go:183 +0x1f0 fp=0xc105a61f30 sp=0xc105a61ed8
    runtime.goexit()
            c:/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc105a61f38 sp=0xc105a61f30

I was afraid it was a generic, obvious issue due to the extreme size of the code and the fact that it's all in one package. I don't expect the code to run (I figure it'd panic; I never got it to compile). But if it's a more nuanced issue, that would be fantastic.

@ianlancetaylor ianlancetaylor changed the title from cmd/compile: (or cmd/link?) consumes lots of RAM on really large packages to cmd/compile: consumes lots of RAM on really large package Jan 11, 2017
@ianlancetaylor ianlancetaylor added this to the Go1.9Maybe milestone Jan 11, 2017
@josharian
Contributor
josharian commented Jan 11, 2017 edited

I'll leave a compile running overnight to see whether I can get some useful pprof output.

OnUnmappableCharacter__desc____obj__Java__nio__charset__CodingErrorAction__ret____obj__Java__nio__charset__CharsetEncoder -- wow.

@laboger
Contributor
laboger commented Jan 11, 2017

Building Kubernetes produces several very large programs. There could be a reproducer there.
https://github.com/kubernetes/kubernetes.git
Set up your GOPATH so it gets this:
go get -u github.com/jteeuwen/go-bindata/go-bindata
Then add its bin directory to your PATH.
Depending on your platform, cd kubernetes, then make.

Just a suggestion....

@josharian
Contributor

Compile with memory profiling complete. It took almost 7 hours and peaked at 13 GB of RAM:

    24205.02 real     28947.74 user      1399.21 sys
12711182336  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
  95223080  page reclaims
        89  page faults
         0  swaps
       111  block input operations
        44  block output operations
         0  messages sent
         0  messages received
         3  signals received
   2251562  voluntary context switches
  53741995  involuntary context switches

Longest phases were:

  • parse: 11m
  • compilefuncs: 15m
  • dumpobj: 6h30m!

The resulting object file was 1.14 GB!

Memory profile, using -alloc_space:

o.pdf

The only unusual things I see in the profile are:

  • ssa.fuseBlockPlain: Never seen this be significant in a memory profile before. Probably a result of unusual machine-generated code patterns. Once my second compile (cpu profile) is complete, I'll check for obvious optimizations here. In any case, the panics in the original report indicate that compilation never even made it to the backend, where this function is.

  • Lots of allocation rooted at methodsym and methodfunc and typefmt for TFUNC. I think this is a direct result of the extraordinarily long names in the code, particularly in interfaces. Note that a long name used in an interface gets used in many places: In the interface representation, as part of the method set of any type satisfying that interface, as part of the name of many functions, etc.

As for the extremely long dumpobj time and giant object file (which will lead to massive link times and/or giant binaries), I suspect that that is again the result of very many very long exported identifiers. Hopefully when my cpu profile run completes in five hours, it'll be much clearer whether there are obvious non-linearities in dumpobj.

@cretz though there may be some improvements available in the compiler, I don't anticipate anything breathtaking; you will almost certainly also need to change your generated code as well. As a start, I suggest altering your code generator to (a) avoid exporting any identifiers you reasonably can and (b) give your symbols shorter names, perhaps using truncated hashes or a global numbering scheme. (You'll probably also then need to generate a lookup table or some such to help with demangling, for debugging. Sorry about that.) If you do those things, please report back with the results, and maybe with an updated code dump. Thanks!

@josharian josharian changed the title from cmd/compile: consumes lots of RAM on really large package to cmd/compile: performance problems with many long, exported identifiers Jan 11, 2017
@josharian josharian added the ToolSpeed label Jan 11, 2017
@josharian
Contributor

@laboger thanks, but to mangle Tolstoy, every interminable compile is interminable in its own way.

@cretz
cretz commented Jan 11, 2017

@josharian - Thanks for the update. I figured my use case was just really extreme. I can unexport some things. I can also reduce the identifier size (I wanted my identifiers to help w/ runtime reflection, but no big deal, I can maintain a separate blob w/ that info). I was hoping that even though the obj file is big, that DCE would remove a lot of bloat from the final binary due to having most functions as methods, but I am unfamiliar w/ the internals of the linker.

"whether there are obvious non-linearities in dumpobj"

I think this might be a key point in general. Ideally some of the work can be streamed and not held in memory for the life of the package compile.

At least now y'all have a good stress test case.

@griesemer
Contributor

If it's really very long identifiers that cause problems with the compile time, we should try to get to the bottom of this, rather than find work-arounds. Exporting long identifiers shouldn't cause undue compilation times - there's a bug somewhere.

As one data point: The export data writer uses a map to canonicalize strings - that is, a string that's been seen before (this includes exported identifiers) will only appear once in the export data. But the same identifiers may appear elsewhere in the obj file.

@josharian
Contributor

Here's a CPU profile:

cf.pdf

Aha!

Hello, gc.testdclstack. This is not the first time we've had problems with testdclstack. See #14781. Robert suggested only enabling it in a special debug mode in the compiler. It is probably time to do that, perhaps even for Go 1.8. I'll see about sending a CL soon.

With gc.testdclstack eliminated, the parse phase drops from 11m to 13s. Still waiting to see how much it helps elsewhere.

Eliminating gc.testdclstack won't help with memory usage, though. My compile is still at 7gb and growing. ssa.fuseBlockPlain changes may help, experimentation still to come.

If it's really very long identifiers that cause problems with the compile time, we should try to get to the bottom of this, rather than find work-arounds.

I don't think it's just very long exported identifiers. It is also the sheer number of them, and probably also some structural things (per the other comments I've made). Squinting at the profiles, the long identifiers is maybe 10% of the memory issue; I just suggested it to @cretz as a good, straightforward first step (and experiment to confirm what I'm seeing).

@griesemer
Contributor

@josharian Excellent, thanks for tracking this down. We should perhaps disable testdclstack (in non-debug mode) even for 1.8. It's just a verification step.

@bradfitz
Member

@griesemer, @josharian, send a CL soon if so.

@gopherbot

CL https://golang.org/cl/35113 mentions this issue.

@gopherbot

CL https://golang.org/cl/35114 mentions this issue.

@cretz
cretz commented Jan 11, 2017

For anyone wanting it, here is a test case w/ smaller identifiers: repro-18602-smaller-idents.zip. Granted, I still think repro-18602.zip is a quality stress test case too.

@josharian
Contributor

After the CLs above, time is reduced to a half hour, and max rss is down a bit:

     1461.67 real      1588.13 user       374.84 sys
11645718528  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
  38278408  page reclaims
        75  page faults
         0  swaps
       312  block input operations
       193  block output operations
         0  messages sent
         0  messages received
         3  signals received
    167410  voluntary context switches
  11764445  involuntary context switches
  • parse now takes 11s.
  • dumpobj now takes 7m.

For reference, here's an alloc_space profile output:

alloc_space_after_fuse.pdf

Aside from the things I've already mentioned, disabling inlining would probably help noticeably now with memory usage.
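Concretely, that experiment can be run with compiler flags (standard gc flags; the reporter noted earlier that this only delays the problem in his case, so treat it as a mitigation to measure, not a fix):

```shell
# Disable optimizations (-N) and inlining (-l) for this build, trading
# runtime performance for lower compile-time memory:
go build -gcflags='-N -l'
```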

There might be further optimizations available to further speed up dumpobj or shrink the object file size by reusing more strings somewhere, but I'll leave that to @griesemer (export info) and @crawshaw (reflect info).

Thanks for the new test case; I'll take a look at that later or tomorrow.

@gopherbot gopherbot pushed a commit that referenced this issue Jan 11, 2017
@josharian josharian cmd/compile: hide testdclstack behind debug flag
This reduces compilation time for the program
in #18602 from 7 hours to 30 min.

Updates #14781
Updates #18602

Change-Id: I3c4af878a08920e6373d3b3b0c4453ee002e32eb
Reviewed-on: https://go-review.googlesource.com/35113
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
Reviewed-by: Robert Griesemer <gri@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
f65abf6
@griesemer
Contributor

Was testdclstack involved in dumpobj? Or what's the reason for the 55x reduction there?

@josharian
Contributor

Was testdclstack involved in dumpobj?

Yes.

@josharian
Contributor

At least on my machine, the new code you posted compiles ~10-15% faster, but memory usage doesn't shrink significantly; I guess I was wrong. The object file is still 856 MB, though, so you're still probably going to have slow linking and a very large (and probably slow) binary.

I don't plan to investigate this further at the moment.

@cretz
cretz commented Jan 12, 2017

No prob. I appreciate that y'all are leaving this open so it can be revisited in the future if anyone wants a crazy test case for compiler performance.

@josharian josharian referenced this issue in rsc/compilebench Jan 15, 2017: Add torture tests #7
