New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: stack growth during critical section harms performance #18020

Open
gyuho opened this Issue Nov 22, 2016 · 15 comments

Comments

Projects
None yet
@gyuho
Contributor

gyuho commented Nov 22, 2016

Hello,

etcd observed significant performance regression with Go 1.7.3.

Details follow or please visit etcd-io/etcd#6876.


What version of Go are you using (go version)?

Go 1.6.3 and Go 1.7.3


What operating system and processor architecture are you using (go env)?

$ go env
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/gyuho/go"
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build467209749=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"

Google Cloud VM with 8 vCPUs, 16 GB memory, ubuntu-1610

gcloud compute instances create test-machine --custom-cpu=8 --custom-memory=16 --image-family=ubuntu-1610 --image-project=ubuntu-os-cloud --boot-disk-size=150 --boot-disk-type="pd-ssd" --zone us-west1-a


What did you do?

Build etcd binaries, others

Build etcd binaries in different Go versions with etcd commit etcd-io/etcd@faeeb2f.

Go version is the only difference!

mkdir -p ${GOPATH}/src/github.com/coreos
cd ${GOPATH}/src/github.com/coreos
git clone https://github.com/coreos/etcd.git
cd ${GOPATH}/src/github.com/coreos/etcd
git reset --hard faeeb2fc7514c5caf7a9a0cc03ac9ee2ff94438b

./build
./bin/etcd --version

go build -v ./cmd/tools/benchmark
./benchmark help

Or

wget https://storage.googleapis.com/etcd/tip/etcd-v3.0.2-go1.6.3
wget https://storage.googleapis.com/etcd/tip/etcd-v3.0.2-go1.7.3
wget https://storage.googleapis.com/etcd/tip/etcdctl-master-go1.7.3
wget https://storage.googleapis.com/etcd/tip/benchmark-master-go1.7.3

sudo chmod +x ./etcd-v3.0.2-go1.6.3      && ./etcd-v3.0.2-go1.6.3 --version
sudo chmod +x ./etcd-v3.0.2-go1.7.3      && ./etcd-v3.0.2-go1.7.3 --version
sudo chmod +x ./etcdctl-master-go1.7.3   && ./etcdctl-master-go1.7.3 --version
sudo chmod +x ./benchmark-master-go1.7.3 && ./benchmark-master-go1.7.3 help
Run single-node cluster, benchmark

Go 1.6.3 + master branch Go 1.7.3 client

rm -rf test.etcd
./etcd-v3.0.2-go1.6.3 --name test \
    --listen-client-urls http://localhost:2379 \
    --advertise-client-urls http://localhost:2379 \
    --listen-peer-urls http://localhost:2380 \
    --initial-advertise-peer-urls http://localhost:2380 \
    --initial-cluster test=http://localhost:2380 \
    --initial-cluster-token test-token \
    --initial-cluster-state new \
    --enable-pprof

ETCDCTL_API=3 ./etcdctl-master-go1.7.3 --endpoints=localhost:2379 put foo bar
ETCDCTL_API=3 ./etcdctl-master-go1.7.3 --endpoints=localhost:2379 get foo --consistency=s
./benchmark-master-go1.7.3 --endpoints=localhost:2379 --conns=100 --clients=1000 range foo --consistency=s --total=200000

Go 1.7.3 + master branch Go 1.7.3 client

rm -rf test.etcd
./etcd-v3.0.2-go1.7.3 --name test \
    --listen-client-urls http://localhost:2379 \
    --advertise-client-urls http://localhost:2379 \
    --listen-peer-urls http://localhost:2380 \
    --initial-advertise-peer-urls http://localhost:2380 \
    --initial-cluster test=http://localhost:2380 \
    --initial-cluster-token test-token \
    --initial-cluster-state new \
    --enable-pprof

ETCDCTL_API=3 ./etcdctl-master-go1.7.3 --endpoints=localhost:2379 put foo bar
ETCDCTL_API=3 ./etcdctl-master-go1.7.3 --endpoints=localhost:2379 get foo --consistency=s
./benchmark-master-go1.7.3 --endpoints=localhost:2379 --conns=100 --clients=1000 range foo --consistency=s --total=200000

What did you expect to see?

etcd with Go 1.6.3 shows:

Summary:
  Total:	55.4612 secs.
  Requests/sec:	36061.2548

What did you see instead?

etcd with Go 1.7.3 is 2x slower:

Summary:
  Total:	105.3251 secs.
  Requests/sec:	18988.8279

We expected Go 1.7.3 to be comparable or better.


Here's our profiling data

Note: etcd uses old golang.org/x/net/http2 but same behavior happens with latest http2; see etcd-io/etcd#6876 (comment).

Could anybody help us debug this around Go runtime?
Please let me know if more information is needed.

Thanks a lot in advance!

@mvdan

This comment has been minimized.

Member

mvdan commented Nov 22, 2016

Did you try tip?

@gyuho

This comment has been minimized.

Contributor

gyuho commented Nov 22, 2016

I did; tip has slightly better performance than Go 1.7.3, but still much slower than Go 1.6.3.

@rsc rsc added this to the Go1.8Maybe milestone Nov 22, 2016

@rsc

This comment has been minimized.

Contributor

rsc commented Nov 22, 2016

Looked at the pb.gz profiles in etcd-io/etcd#6876. Slightly confusing because the Go 1.7 profile ran 2X as long as the Go 1.6 profile so the total times are not directly comparable. However, in Go 1.6 the scheduler ran for about 2s, while in Go 1.7 the scheduler ran for about 20s, so about 5X more. That probably has something to do with it.

http2 probably does not have anything to do with it. From the profile it seems clear that you are running a vendored copy of http2, not the one from the Go distribution itself.

Go 1.6:

screen shot 2016-11-22 at 3 45 29 pm

Go 1.7:

screen shot 2016-11-22 at 3 44 52 pm

We might be able to fix this in Go 1.8 if the fix is simple.

/cc @aclements @RLH

@rsc rsc changed the title from runtime: performance regression in Go 1.7.3 and http2 to runtime: performance regression in Go 1.7.3 scheduler Nov 22, 2016

@gyuho

This comment has been minimized.

Contributor

gyuho commented Nov 22, 2016

I gave different duration to test the same workloads, but they are confusing for profiling.

So I collected pprof data with same duration 30-second:

./benchmark-master-go1.7.3 --endpoints=localhost:2379 --conns=100 --clients=1000 range foo --consistency=s --total=2000000

# Go 1.6.3
  Total:	55.5685 secs.
  Requests/sec:	35991.5985

# Go 1.7.3
Summary:
  Total:	105.6422 secs.
  Requests/sec:	18931.8332

And here are pprof binaries if that helps.

Go 1.6.3

wget https://storage.googleapis.com/etcd/tip/etcd-v3.0.2-go1.6.3
etcd-v3.0.2-go1.6.3-serializable-reads-00.pdf
pprof.localhost:2379.samples.cpu.001.pb.gz

Go 1.7.3

wget https://storage.googleapis.com/etcd/tip/etcd-v3.0.2-go1.7.3
etcd-v3.0.2-go1.7.3-serializable-reads-00.pdf
pprof.etcd-v3.0.2-go1.7.3.localhost:2379.samples.cpu.001.pb.gz

Thanks a lot!

@timothysc

This comment has been minimized.

timothysc commented Nov 28, 2016

@bradfitz

This comment has been minimized.

Member

bradfitz commented Dec 15, 2016

I haven't seen any update on this bug about anything Go 1.8-related, so I'm going to kick this down the road further.

Please try Go 1.8 and file bugs about any performance problems, thanks!

In the future, please start your trials of new Go versions earlier. Testing the move from Go 1.6 to Go 1.7 doesn't help us much when we're just about done with Go 1.8.

@bradfitz bradfitz modified the milestones: Go1.9, Go1.8Maybe Dec 15, 2016

@gyuho

This comment has been minimized.

Contributor

gyuho commented Dec 15, 2016

@bradfitz Thanks!

I ran the same tests with go1.6.4 vs go1.8beta1, and got same results

# all single-node cluster
go1.6.4     took   49.8130 secs.
go1.8beta1  took   100.3238 secs.

Here are benchmark binaries and profiled data:

go1.6.4
etcd-v3.0.2-go1.6.4-2M-serializable-reads-2016121501.pdf
pprof.localhost-2379.samples.cpu.001.pb.gz

go1.8beta1
etcd-v3.0.2-go1.8beta1-2M-serializable-reads-2016121501.pdf
pprof.etcd-v3.0.2-go1.8beta1.localhost-2379.samples.cpu.001.pb.gz

wget https://storage.googleapis.com/etcd/tip/etcd-v3.0.2-go1.6.4
wget https://storage.googleapis.com/etcd/tip/etcd-v3.0.2-go1.8beta1
wget https://storage.googleapis.com/etcd/tip/etcdctl-master-go1.8beta1
wget https://storage.googleapis.com/etcd/tip/benchmark-master-go1.8beta1

sudo chmod +x ./etcd-v3.0.2-go1.6.4         && ./etcd-v3.0.2-go1.6.4 --version
sudo chmod +x ./etcd-v3.0.2-go1.8beta1      && ./etcd-v3.0.2-go1.8beta1 --version
sudo chmod +x ./etcdctl-master-go1.8beta1   && ./etcdctl-master-go1.8beta1 --version
sudo chmod +x ./benchmark-master-go1.8beta1 && ./benchmark-master-go1.8beta1 help
@bradfitz

This comment has been minimized.

Member

bradfitz commented Dec 15, 2016

Leaving for @aclements to decide what to do here, and when.

@TocarIP

This comment has been minimized.

Contributor

TocarIP commented Jan 21, 2017

I've figured out, whats going on.
I was able to reproduce this issue only with GOMAXPROCS >= 3
This started happening on a6fb2ae
Which enables new ssa compiler. Because of that, stack frames are now shorter, so stack growth happens while locks are taken. Previously due lo larger stackframes stack growth was happening outside of critical section. I've verified this by enabling blocking profiler, and indeed in bad case much more time is spent waiting on locks:
214.42mins of 215.35mins total (99.57%) in bad case
2376s of 2376.23s total ( 100%) in good case
From looking at profiles, it seems like the guilty lock is the one in coreos/etcd/etcdserver.(*authApplierV3).Apply, but I didn't verify this.

I've verified that following approaches make this regression disappear:

  1. Bumping default stack size to 8kb or changing growth factor to 4
  2. Putting large array on stack to force stack growth earlier.
    E. g:
+var q [160]int
+
 func metricsUnaryInterceptor(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (resp interface{}, err error) {
        service, method := splitMethodName(info.FullMethod)
        receivedCounter.WithLabelValues(service, method).Inc()
 -
+       var ccc [160]int
        start := time.Now()
        resp, err = handler(ctx, req)
        if err != nil {
                failedCounter.WithLabelValues(service, method, grpc.Code(err).String()).Inc()
        }
        handlingDuration.WithLabelValues(service, method).Observe(time.Since(start).Seconds())
 -
+       q = ccc
        return resp, err
}

However both of them look like hacks to me.
As for real solution, I think that something like @aclements suggestion from #18138 about allocating larger stack for functions that are known to require it, is the right way.

@xiang90

This comment has been minimized.

xiang90 commented Jan 21, 2017

@TocarIP

From looking at profiles, it seems like the guilty lock is the one in coreos/etcd/etcdserver.(*authApplierV3).Apply, but I didn't verify this.

I can confirm this. We also suspect that there is a more serious lock contention triggered by some runtime changes. So we change the locking mechanism, the problem disappeared.

But as you mentioned, this issue is probably still worth fixing.

@gopherbot

This comment has been minimized.

gopherbot commented Jun 9, 2017

CL https://golang.org/cl/45142 mentions this issue.

@aclements aclements changed the title from runtime: performance regression in Go 1.7.3 scheduler to runtime: stack growth during critical section harms performance Jun 9, 2017

@aclements

This comment has been minimized.

Member

aclements commented Jun 9, 2017

@TocarIP thanks for the excellent analysis.

If you can still reproduce this, I'd be curious what the effect of CL 45142 is. I'm really not sure what the right answer here is, but that hack might mitigate things from both this issue and #18138.

I've been thinking for a while that the latency of stack growth can be problematic, but this is an interesting case where it's not just a problem with large stacks. The old segmented stacks approach had less of this problem, but it had other anomalies. I wonder if it's possible to efficiently but incrementally move a stack.

@aclements aclements modified the milestones: Go1.10, Go1.9 Jun 9, 2017

@TocarIP

This comment has been minimized.

Contributor

TocarIP commented Jun 9, 2017

I've tested CL 45142, and it doesn't help

@josharian

This comment has been minimized.

Contributor

josharian commented Jun 13, 2017

At my request, @TocarIP also checked CL 43150 (thanks!). It's not a deep fix, but it does at least help a little--from 13.4s to 12.4s.

@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017

@josharian

This comment has been minimized.

Contributor

josharian commented Apr 30, 2018

I’ve done a bit of work on stack copying this cycle. Much of it is in. https://golang.org/cl/109001 is under review. I have one more I hope to mail soon, but no promises. If anyone can still reproduce readily, I’d be curious to hear whether / how much tip has improved, and whether the linked and future CLs help.

@ianlancetaylor ianlancetaylor modified the milestones: Go1.11, Go1.12 Jul 10, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment