
cmd/compile: random performance fluctuations after unrelated changes #8717

dvyukov opened this issue Sep 12, 2014 · 8 comments


@dvyukov dvyukov commented Sep 12, 2014

We see constant performance fluctuations after unrelated changes on the perf dashboard.
For example:
;commit0=fee5fcd5f87e75235d93fb297123feb15a59ae38&builder=linux-amd64-perf&benchmark=json
;commit0=0a5fafdd2343b083457d0baf6487dfce0f01e25f&builder=windows-amd64-perf&benchmark=json
;commit0=fc588981a45afa430a2d2cd29d234403cb86e1bd&builder=windows-amd64-perf&benchmark=json

I can reproduce it locally as well. I took two consecutive commits:
changeset:   21142:91110f70916a
summary:     runtime: allow crash from gsignal stack
changeset:   21141:0768bb1fd027
summary:     net: fix inconsistent behavior across platforms in SetKeepAlivePeriod

and ran the go.benchmarks binary as:
./bench -bench=json -benchtime=3s -benchnum=100
alternating between the old and new binaries.

The results are:


The difference is not that big (probably because my processor is less sensitive to
whatever is being penalized), but it is clearly observable.

It looks like code alignment issue. Probably loops are not 16-byte aligned or something
like this.

I guess this also penalizes user binaries in the same random and unpredictable way.

@rsc rsc commented Oct 20, 2014

Comment 1:

If you want to try to figure out how this could be loop alignment, please go ahead. I
spent days on this a few years ago and got absolutely nowhere. I can't find any evidence
that loop alignment matters. It may be something else entirely, but in general modern
CPUs are black boxes that can misbehave on a whim and - at least for people not privy to
the inner workings - simply cannot be fully understood. They are subtle and quick to
anger.
If you want to try the loop alignment hypothesis, you could edit src/liblink/asm6.c.
Look for LoopAlign (MaxLoopPad = 0 means alignment is turned off right now).
I am removing the Release-Go1.5 tag. If someone wants to work on this, great, but I am
not going to promise to waste any more time on this. Long ago I made peace with the fact
that this kind of variation is something we just have to live with sometimes.

Labels changed: added release-none, removed release-go1.5.

Status changed to Accepted.


@dvyukov dvyukov commented Oct 21, 2014

Comment 2:

I can confirm your conclusion. We need to wait until Go becomes important enough that
processor manufacturers allocate engineers to optimize for it.
I've tried to align back-branch targets and all branch targets at 16 bytes, with no
success. Alignment of back-branch targets increased binary size by 5.1%; aligning all
branch targets, by 8.3%. So if we do it, we need something smarter, e.g. align only
within real loops.
I've checked that in both binaries the stack segment address and the fs register have
equal values, so we can strike those out.
Since the code has moved, the data segment also has a different address. So maybe it's
related to data. But I don't see any functions in the profile that heavily access global
data...
@rsc rsc added this to the Unplanned milestone Apr 10, 2015
@rsc rsc changed the title cmd/gc: random performance fluctuations after unrelated changes cmd/compile: random performance fluctuations after unrelated changes Jun 8, 2015

@dvyukov dvyukov commented May 11, 2016

Just debugged another case, which turned out to be this issue.
After inserting a dead function call, the program becomes 2.5% slower.

go version devel +b4538d7 Wed May 11 06:00:33 2016 +0000 linux/amd64
with this patch:

--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -1028,12 +1028,41 @@ func startTheWorldWithSema() {
        add := needaddgcproc()

+       var p1 *p
        procs := gomaxprocs
-       if newprocs != 0 {
-               procs = newprocs
-               newprocs = 0
+       if true {
+               if newprocs != 0 {
+                       procs = newprocs
+                       newprocs = 0
+               }
+               p1 = procresize(procs)
+       } else {
+               if newprocs != 0 {
+                       procs = newprocs
+                       newprocs = 0
+                       p1 = procresize(procs)
+               } else {
+                       _g_.m.p.ptr().status = _Prunning
+                       var runnablePs *p
+                       for i := gomaxprocs - 1; i >= 0; i-- {
+                               p := allp[i]
+                               if _g_.m.p.ptr() == p {
+                                       continue
+                               }
+                               p.status = _Pidle
+                               if runqempty(p) {
+                                       pidleput(p)
+                               } else {
+                                       p.m.set(mget())
+                                       runnablePs = p
+                               }
+                       }
+                       p1 = runnablePs
+               }
+       }
-       p1 := procresize(procs)
        sched.gcwaiting = 0
        if sched.sysmonwait != 0 {
                sched.sysmonwait = 0

Then, depending on the presence of the following patch:

--- a/src/runtime/mgc.go
+++ b/src/runtime/mgc.go
@@ -1608,7 +1608,7 @@ func gcMark(start_time int64) {

-       cachestats()
+       //cachestats()

        // Update the reachable heap stat.
        memstats.heap_reachable = work.bytesMarked

The test program:
performance changes by ±2.5%.

With the call commented out I consistently see:

$ nice -20 taskset -c 0-11 time /tmp/gif100
7.90user 0.05system 0:07.95elapsed 100%CPU

Without the call commented out:

$ nice -20 taskset -c 0-11 time /tmp/gif101
8.11user 0.05system 0:08.15elapsed 100%CPU

All time is spent in computations:

  95.11%  gif101   gif101             [.] image/draw.drawPaletted
   0.92%  gif101   gif101             [.] compress/lzw.(*encoder).Write
   0.79%  gif101   gif101             [.] image.(*Paletted).At
   0.58%  gif101   gif101             [.] compress/lzw.(*decoder).decode
   0.37%  gif101   gif101             [.] image/color.(*RGBA).RGBA
   0.32%  gif101   gif101             [.] compress/lzw.(*decoder).readLSB
   0.23%  gif101   gif101             [.] compress/lzw.(*encoder).writeLSB        

drawPaletted magically becomes faster after the change.

Diff in CPU profiles still does not make any sense to me; it looks like the percentages are just randomly shuffled.
Faster version:
Slower version:

@aclements @RLH @randall77 @dr2chase


@dvyukov dvyukov commented May 11, 2016

Loop alignment still makes the program slower.


@dvyukov dvyukov commented May 11, 2016

In the fast version the function is aligned to 0x10:

/tmp/gif103:0000000000494a10 T image/draw.drawPaletted

and in the slow version to 0x20:

/tmp/gif102:00000000004949a0 T image/draw.drawPaletted

If I set function alignment to 0x20 (which is broken due to asm functions, so it actually gives me 0x10 alignment for the function), it mostly fixes the problem:

7.96user 0.05system 0:08.01elapsed 100%CPU
@bradfitz bradfitz modified the milestones: Go1.8, Unplanned May 11, 2016

@tombergan tombergan commented Aug 2, 2016

I'm reminded of:

From the second paper:

Mytkowicz et al. [the first paper] show that just changing the size of environment variables can trigger performance degradation as high as 300%; we find that simply changing the link order of object files can cause performance to decrease by as much as 57%.

Microsoft Research had a tool that would link a program multiple times, where each binary used a different (randomized) function order, then they'd run tests and pick the best function order. Unfortunately, my Google-fu is failing me and I cannot find a reference. (The closest I can find is VC's /ORDER linker option, which looks like it could be used to implement this feature.)


@navytux navytux commented Apr 3, 2018

For reference: the effect is fully explained here:

#18977 (comment)
#18977 (comment)
#18977 (comment)


@navytux navytux commented Apr 3, 2018

(including a way to denoise the timings)
