
runtime: significant performance improvement on single core machines #15395

icecrime opened this Issue Apr 21, 2016 · 7 comments


icecrime commented Apr 21, 2016

1. What version of Go are you using (go version)?

go version go1.5.4 linux/amd64

2. What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH=""
GORACE=""
GOROOT="/root/go"
GOTOOLDIR="/root/go/pkg/tool/linux_amd64"
GO15VENDOREXPERIMENT=""
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0"
CXX="g++"
CGO_ENABLED="1"

Hosts are running stock Ubuntu 15.10:

# uname -r
4.2.0-27-generic

3. What did you do?
4. What did you expect to see?
5. What did you see instead?

Looking into the performance of running containers on Docker, we realized there's a significant difference between running on a single-core machine and a multi-core machine. The results seem independent of the GOMAXPROCS value (using GOMAXPROCS=1 on the multi-core machine remains significantly slower).

Single core:

# time ./docker run --rm busybox true
real    0m0.255s

Multi core:

# time ./docker run --rm busybox true
real    0m0.449s

We cannot attribute that difference to Go with 100% certainty, but we could use your help explaining some of the profiling results we've obtained so far.

Profiling / trace

We instrumented the code to find out where that difference materialized, and it turns out to be quite evenly distributed. However, syscalls consistently take much more time on the multi-core machine (most obvious with slower syscalls such as syscall.Unmount).

Using go tool trace to dig further, it appears that we're seeing discontinuities in goroutine execution on the multi core machine that the single core one doesn't expose, even with GOMAXPROCS=1.

Single core (link to the trace file)


Multi core with GOMAXPROCS=1 (link to the trace file)


Link to the binary which produced the trace files.

Host information

The two hosts are virtual machines running in the same Digital Ocean zone, with the exact same CPU (Intel(R) Xeon(R) CPU E5-2630L v2 @ 2.40GHz). The issue was also reproduced locally in VirtualBox VMs with different core counts.

Please let us know if there's any more information we can provide, or if you need us to test with different builds of Go. Thanks for any help you can provide!

Cc @crosbymichael @tonistiigi.

@bradfitz bradfitz added this to the Unplanned milestone Apr 22, 2016

@bradfitz bradfitz changed the title from Significant performance improvement on single core machines to runtime: significant performance improvement on single core machines Apr 22, 2016

unclejack (Contributor) commented Aug 1, 2016

@icecrime Have you tested with Go 1.7rcX?


icecrime commented Aug 1, 2016

@unclejack I haven't, but it might be worth it indeed.


vincentwoo commented Aug 1, 2016

@aclements have you had a chance to look at this issue? This affects a ton of downstream users of golang on multicore machines, and I'd love to hear about it!


vincentwoo commented Aug 22, 2016

@icecrime any changes in perf with the official release of 1.7?

aclements (Member) commented Aug 23, 2016

@icecrime, I suspect this is not a problem with Go. According to the syscall blocking profile in the multicore trace you attached to your original message, you're spending 50ms blocked in the unmount syscall (from github.com/docker/docker/daemon/graphdriver/aufs.Unmount). There are plenty of other syscalls in that trace that you don't spend appreciable time in, suggesting that the performance problem is in unmount (possibly AUFS unmount) itself.

It would be worth trying with Go 1.7, though if it is an issue in the kernel, that won't make a difference. As an experiment, you could also try invoking just that unmount syscall from, say, C, and time that. If it also takes a long time in C, then we know it's not a Go issue. Running this under "perf record" may also prove enlightening, though if the delay comes from blocking not much will show up.


icecrime commented Aug 23, 2016

@aclements Thanks for your reply. Unmount was just one example; it is indeed one of the slower syscalls. However, we measured several times, and Unmount alone clearly isn't responsible for the difference.

I remember we instrumented multiple places along that code path, and no particular segment could explain the timing difference between single-core and multi-core machines. That's why we ended up suspecting a runtime issue.


vincentwoo commented Jan 25, 2017

Hi @aclements and @icecrime. Has there been any additional work on determining whether this is a runtime or a kernel issue? If it actually is a runtime issue, it has pretty serious impacts for the whole Go ecosystem!
