New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: poor register allocator behavior in compression code #16122

Open
flanglet opened this Issue Jun 20, 2016 · 17 comments

Comments

Projects
None yet
9 participants
@flanglet

flanglet commented Jun 20, 2016

Please answer these questions before submitting your issue. Thanks!

  1. What version of Go are you using (go version)?
    go1.6 windows/amd64 and go1.7beta1 windows/amd64
  2. What operating system and processor architecture are you using (go env)?
    set GOARCH=amd64
    set GOBIN=
    set GOEXE=.exe
    set GOHOSTARCH=amd64
    set GOHOSTOS=windows
    set GOOS=windows
    set GOPATH=E:\Users\fred\Documents\Prog\kanzi\go
    set GORACE=
    set GOROOT=E:\Program Files\go
    set GOTOOLDIR=E:\Program Files\go\pkg\tool\windows_amd64
    set GO15VENDOREXPERIMENT=1
    set CC=gcc
    set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0
    set CXX=g++
    set CGO_ENABLED=1
  3. What did you do?
    I ran "go build TestZRLT.go" then "TestZRLT.exe" both for Go 1.6 and 1.7 beta1
    The source code is very simple: https://github.com/flanglet/kanzi/blob/master/go/src/kanzi/test/TestZRLT.go.
    It runs a correctness and a performance test tor the Zero Run Length Transform:
    https://github.com/flanglet/kanzi/blob/master/go/src/kanzi/function/ZRLT.go.
  4. What did you expect to see?
    I expected to see no performance regression from 1.6 to 1.7beta1
  5. What did you see instead?
    ZRLT encoding is much faster with 1.7beta1 but decoding is much slower.

Output for 1.6:
Speed test
Iterations: 50000

ZRLT encoding [ms]: 10694
Throughput [MB/s]: 222
ZRLT decoding [ms]: 7419
Throughput [MB/s]: 321

ZRLT encoding [ms]: 10753
Throughput [MB/s]: 221
ZRLT decoding [ms]: 7472
Throughput [MB/s]: 319

ZRLT encoding [ms]: 10724
Throughput [MB/s]: 222
ZRLT decoding [ms]: 7393
Throughput [MB/s]: 322

Output for 1.7beta1:
Speed test
Iterations: 50000

ZRLT encoding [ms]: 6834
Throughput [MB/s]: 348
ZRLT decoding [ms]: 11560
Throughput [MB/s]: 206

ZRLT encoding [ms]: 6828
Throughput [MB/s]: 349
ZRLT decoding [ms]: 11589
Throughput [MB/s]: 205

ZRLT encoding [ms]: 6790
Throughput [MB/s]: 351
ZRLT decoding [ms]: 11558
Throughput [MB/s]: 206

I narrowed down the issue to the run length decoding loop:

for val <= 1 {
    runLength = (runLength << 1) | int(val)
    srcIdx++

    if srcIdx >= srcEnd {
            break
    }

    val = src[srcIdx]
}

If I replace 'for val <= 1 {' with 'for val&1 == val {', the decoding becomes much faster (although not as fast as with Go 1.6)

Output for 1.7beta1 with code change:

Speed test
Iterations: 50000

ZRLT encoding [ms]: 6800
Throughput [MB/s]: 350
ZRLT decoding [ms]: 7669
Throughput [MB/s]: 310

ZRLT encoding [ms]: 6813
Throughput [MB/s]: 349
ZRLT decoding [ms]: 7689
Throughput [MB/s]: 310

ZRLT encoding [ms]: 6775
Throughput [MB/s]: 351
ZRLT decoding [ms]: 7662
Throughput [MB/s]: 311

@adg adg changed the title from Go 1.7beta1 performance drop vs. Go 1.6 in compression code to cmd/compile: Go 1.7beta1 performance drop vs. Go 1.6 in compression code Jun 20, 2016

@adg

This comment has been minimized.

Contributor

adg commented Jun 20, 2016

@adg adg added this to the Go1.7Maybe milestone Jun 20, 2016

@randall77

This comment has been minimized.

Contributor

randall77 commented Jun 21, 2016

I'm not seeing what you are seeing.
The performance of 1.6.2 and tip are nearly identical for decode (tip is quite a bit better for encode).
I also don't see any variance between val <= 1 and val&1 == val.

That said, the code for the inner loop in decode isn't very good. The register allocator uses more copies than are necessary.

The indexing produces code like:

  y = x + 1
  load from p + x + 1

The latter is a single complex-addressing mode load, and the x+1 was folded into it.
In between those two instructions, both y and x are live. So the y=x+1 can't be done in place and a bunch of moves need to fix things up afterward.

I'll try to fix this up during the 1.8 cycle.

@randall77 randall77 modified the milestones: Go1.8, Go1.7Maybe Jun 21, 2016

@randall77 randall77 self-assigned this Jun 21, 2016

@flanglet

This comment has been minimized.

flanglet commented Jun 21, 2016

OK. I suppose the issue has been addressed between 1.7.beta1 and tip then. What remains is an optimization.

@quentinmit quentinmit changed the title from cmd/compile: Go 1.7beta1 performance drop vs. Go 1.6 in compression code to cmd/compile: poor register allocator behavior in compression code Jun 30, 2016

@quentinmit quentinmit added the NeedsFix label Jun 30, 2016

@rsc

This comment has been minimized.

Contributor

rsc commented Oct 21, 2016

@randall77, anything left here for Go 1.8?

@randall77

This comment has been minimized.

Contributor

randall77 commented Oct 21, 2016

No, I did not get to fixing anything else here. Punting to 1.9.

@randall77 randall77 modified the milestones: Go1.9, Go1.8 Oct 21, 2016

@josharian

This comment has been minimized.

Contributor

josharian commented May 16, 2017

I grabbed commit 3872c76b9b410c44428e74b2065f3e2291cb8095 of github.com/flanglet/kanzi based on the date of the commit and of this issue, extracted all relevant code into a single file, and updated it to use regular test/benchmark form: https://gist.github.com/josharian/e0bc6e238d4914a44289b44bc4ae3640.

This shows a steady regression over time (tip at b53acd8):

name \ time/op  17          18          tip
Speed/encode-8  314µs ± 1%  406µs ± 1%  438µs ± 2%
Speed/decode-8  314µs ± 1%  333µs ± 1%  356µs ± 1%
@josharian

This comment has been minimized.

Contributor

josharian commented May 16, 2017

CL 43491 applied to tip helps a fair amount:

name            old time/op  new time/op  delta
Speed/encode-8   438µs ± 2%   374µs ± 0%  -14.71%  (p=0.000 n=10+9)
Speed/decode-8   356µs ± 1%   320µs ± 1%   -9.93%  (p=0.000 n=10+10)

That brings it back below 1.8 levels, although still not as good as 1.7.

@josharian

This comment has been minimized.

Contributor

josharian commented May 18, 2017

CL 43491 is in, so we're back below the 1.8 level. Changing to milestone 1.10 for the original goal of getting back to 1.7 levels.

@josharian josharian modified the milestones: Go1.10, Go1.9 May 18, 2017

@dr2chase

This comment has been minimized.

Contributor

dr2chase commented May 22, 2017

I spent some time (intermission at a concert) noodling with pictures of loops, and it seems to me that there's families of loops where we can arrange to have "optimal" answers, and we should probably arrange to do that in the block ordering code and then see what's left for heuristics.

For example, if there is a block in the loop that dominates all the exits and itself is conditional with an exit successor, rotate that block to the bottom, and its in-loop successor to the beginning of the run of blocks for the loop.

If we have a diamond with an exit on one of the two arms, A -> (B,C), B -> (X,D), C -> D, D -> A, order it C D A B.

In "layout", we'd do this by detecting transition into a loop, decoding the loop type, and inferring the best start block if it fits one of our models, and I think that the rest just falls out -- i.e., if you start with C, then the only logical successor is D then A, then B. This might subsume the loop rotation phase.

@bradfitz bradfitz modified the milestones: Go1.10, Go1.11 Nov 28, 2017

@flanglet

This comment has been minimized.

flanglet commented Dec 11, 2017

go1.10beta1 shows a 15% performance regression on this code vs.
go 1.9.2 (see bench at [https://github.com/flanglet/kanzi/blob/master/go/src/kanzi/benchmark/Functions_test.go]).

go1.9.2 windows/amd64

go test Functions_test.go -bench ZRLT -benchtime 5s -cpuprofile=r:\cpu.prof

BenchmarkZRLT-8            10000           1032659 ns/op
PASS
ok      command-line-arguments  10.680s

Duration: 10.60s, Total samples = 10.42s (98.28%)
Showing nodes accounting for 10.41s, 99.90% of 10.42s total
Dropped 2 nodes (cum <= 0.05s)
      flat  flat%   sum%        cum   cum%
     6.39s 61.32% 61.32%      6.39s 61.32%  kanzi/function.(*ZRLT).Forward 
     4.02s 38.58% 99.90%      4.02s 38.58%  kanzi/function.(*ZRLT).Inverse  <===
         0     0% 99.90%     10.41s 99.90%  command-line-arguments.BenchmarkZRLT
         0     0% 99.90%     10.41s 99.90%  testing.(*B).launch 
         0     0% 99.90%     10.41s 99.90%  testing.(*B).runN

go1.10beta1 windows/amd64

go test Functions_test.go -bench ZRLT -benchtime 5s -cpuprofile=r:\cpu.prof

BenchmarkZRLT-8            10000           1185367 ns/op
PASS
ok      command-line-arguments  12.179s

Duration: 12.10s, Total samples = 12.61s (104.20%)

Showing nodes accounting for 12.27s, 97.30% of 12.61s total
Dropped 46 nodes (cum <= 0.06s)
Showing top 10 nodes out of 24 
      flat  flat%   sum%        cum   cum%
     6.36s 50.44% 50.44%      6.36s 50.44%  kanzi/function.(*ZRLT).Forward
     5.58s 44.25% 94.69%      5.58s 44.25%  kanzi/function.(*ZRLT).Inverse   <===
     0.08s  0.63% 95.32%      0.08s  0.63%  runtime.stdcall1
     0.06s  0.48% 95.80%      0.13s  1.03%  runtime/pprof.(*profMap).lookup
     0.04s  0.32% 96.11%      0.07s  0.56%  runtime.mapaccess1_fast64
     0.04s  0.32% 96.43%      0.13s  1.03%  runtime.ready
     0.04s  0.32% 96.75%      0.17s  1.35%  runtime/pprof.(*profileBuilder).addC
PUData
     0.03s  0.24% 96.99%      0.07s  0.56%  runtime/pprof.readProfile
     0.02s  0.16% 97.15%      0.08s  0.63%  runtime.findrunnable
     0.02s  0.16% 97.30%      0.12s  0.95%  runtime.park_m

Problematic lines: 166 and 194


ROUTINE ======================== kanzi/function.(*ZRLT).Inverse 
     5.58s      5.58s (flat, cum) 44.25% of Total
         .          .    147:   runLength := 1
         .          .    148:   srcIdx, dstIdx := 0, 0
         .          .    149:   var err error
         .          .    150:
         .          .    151:   if srcIdx < srcEnd {
     410ms      410ms    152:           for dstIdx < dstEnd {
      90ms       90ms    153:                   if runLength > 1 {
     260ms      260ms    154:                           runLength--
      60ms       60ms    155:                           dst[dstIdx] = 0
         .          .    156:                           dstIdx++
         .          .    157:                           continue
         .          .    158:                   }
         .          .    159:
     340ms      340ms    160:                   if srcIdx >= srcEnd {
         .          .    161:                           break
         .          .    162:                   }
         .          .    163:
     770ms      770ms    164:                   val := src[srcIdx]
         .          .    165:
     1.14s      1.14s    166:                   if val <= 1 {
         .          .    167:                           // Generate the run leng
th bit by bit (but force MSB)
         .          .    168:                           runLength = 1
         .          .    169:
     100ms      100ms    170:                           for val <= 1 {
      50ms       50ms    171:                                   runLength = (run
Length << 1) | int(val)
      40ms       40ms    172:                                   srcIdx++
         .          .    173:
         .          .    174:                                   if srcIdx >= src
End {
         .          .    175:                                           break
         .          .    176:                                   }
         .          .    177:
      30ms       30ms    178:                                   val = src[srcIdx
]
         .          .    179:                           }
         .          .    180:
      50ms       50ms    181:                           continue
         .          .    182:                   }
         .          .    183:
         .          .    184:                   // Regular data processing
     340ms      340ms    185:                   if val == 0xFF {
         .          .    186:                           srcIdx++
         .          .    187:
         .          .    188:                           if srcIdx >= srcEnd {
         .          .    189:                                   break
         .          .    190:                           }
         .          .    191:
         .          .    192:                           dst[dstIdx] = 0xFE + src
[srcIdx]
         .          .    193:                   } else {
     1.62s      1.62s    194:                           dst[dstIdx] = val - 1
         .          .    195:                   }
         .          .    196:
         .          .    197:                   srcIdx++
     280ms      280ms    198:                   dstIdx++
         .          .    199:           }
         .          .    200:   }
         .          .    201:
         .          .    202:   // If runLength is not 1, add trailing 0s
         .          .    203:   end := dstIdx + runLength - 1

@randall77

This comment has been minimized.

Contributor

randall77 commented Dec 11, 2017

@flanglet I don't see the same performance issues you do.
(I'm on linux/amd64, in case that makes a difference.)

$ ~/go1.9.2/bin/go test -test.bench=ZRLT -test.count=10 > log192
$ ~/goTip/bin/go test -test.bench=ZRLT -test.count=10 > logTip
$ benchstat log192 logTip 
name  old time/op  new time/op  delta
ZRLT   541µs ± 9%   547µs ± 9%   ~     (p=0.393 n=10+10)

There's a lot of noise in this benchmark; it varies a lot from run to run. That makes it hard to make any definitive statements.

@flanglet

This comment has been minimized.

flanglet commented Dec 12, 2017

I don't see the same performance issues you do.
(I'm on linux/amd64, in case that makes a difference.)

I also do not see much difference on a Linux machine. But the performance drop is reproducible every time the benchmark runs on the Windows 7 machine.

There's a lot of noise in this benchmark; it varies a lot from run to run. That makes it hard to make any definitive statements.

Increase the test count. I will work on reducing the noise.

@randall77

This comment has been minimized.

Contributor

randall77 commented Dec 12, 2017

@flanglet : Then I don't understand what is going on. The register allocator is identical on linux and windows, so this may be an entirely separate problem from the one that started this issue.
Are your linux and windows boxes two separate machines? Might it be the CPU, not the OS?

@flanglet

This comment has been minimized.

flanglet commented Dec 12, 2017

The Windows 7 machine has an Intel Core7 2600 (Sandy bridge) CPU.
The Linux machine has an Intel Core7 7700K (Kaby Lake) CPU.
So. yeah, it might be CPU related.

@randall77

This comment has been minimized.

Contributor

randall77 commented Dec 12, 2017

@TocarIP , might be up your alley.

@TocarIP

This comment has been minimized.

Contributor

TocarIP commented Dec 12, 2017

Couldn't reproduce on Sandy Bridge
model name : Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
I see:
ZRLT-24 935µs ± 0% 931µs ± 1% -0.40% (p=0.009 n=10+10)

@flanglet

This comment has been minimized.

flanglet commented Dec 12, 2017

Well, that is weird. It happens every time on my machine.
FWIW, I downloaded the go package from https://golang.org/dl/#go1.10beta1. Maybe you tested a slightly different version ?
Anyway, I will wait for newer releases and keep on testing this case.

@bradfitz bradfitz modified the milestones: Go1.11, Unplanned Jun 19, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment