Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/go: performance regression after F_FULLFSYNC fcntl change on Darwin #27415

Closed
mark-rushakoff opened this issue Aug 31, 2018 · 11 comments

Comments

Projects
None yet
5 participants
@mark-rushakoff
Copy link
Contributor

commented Aug 31, 2018

I'm seeing a reliable ~3x slowdown with go test -race on a particular set of tests doing a lot of concurrent operations in several goroutines. I haven't yet spent the time to reduce the test case or to identify the go commit introducing the regression.

Tried with go 1.11:

$ go version; go env
go version go1.11 darwin/amd64
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/mr/Library/Caches/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/mr/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/Cellar/go/1.11/libexec"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.11/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/ct/bl4_z3g51ks8239_r2k07v_40000gn/T/go-build102870134=/tmp/go-build -gno-record-gcc-switches -fno-common"

and go tip:

$ gotip version; gotip env
go version devel +7c96f9b527 Fri Aug 31 02:14:24 2018 +0000 darwin/amd64
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/mr/Library/Caches/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/mr/go"
GOPROXY=""
GORACE=""
GOROOT="/Users/mr/gotip/src/github.com/golang/go"
GOTMPDIR=""
GOTOOLDIR="/Users/mr/gotip/src/github.com/golang/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/ct/bl4_z3g51ks8239_r2k07v_40000gn/T/go-build924877100=/tmp/go-build -gno-record-gcc-switches -fno-common"

To reproduce:

go get github.com/influxdata/platform
cd $GOPATH/src/github.com/influxdata/platform
git checkout origin/test/taskservice
dep ensure -v -vendor-only
go test -race -count=5 ./task

(That branch will likely get rebased onto master in the near future, so I've got a fork at mark-rushakoff/platform@a8269bc with the commit as it is today.)

I'm consistently seeing times like:

$ go test -race -count=5 ./task
ok  	github.com/influxdata/platform/task	8.513s
$ gotip test -race -count=5 ./task
ok  	github.com/influxdata/platform/task	25.863s

For a small test suite that doesn't exercise concurrency, the slowdown is negligible:

$ go test -race -count=5 ./task/options/
ok  	github.com/influxdata/platform/task/options	1.602s
$ gotip test -race -count=5 ./task/options/
ok  	github.com/influxdata/platform/task/options	1.655s

I might spend some time bisecting go to identify where the slowdown was introduced, or I might take a look at trace output, but I wanted to file this issue in case anyone else had time and interest to look at it before I do.

@ALTree

This comment has been minimized.

Copy link
Member

commented Aug 31, 2018

If I'm reading this correctly the regression is in tip but not in 1.11? You have to bisect between 227 commits, which will take 8 steps... at ~2 minutes at step, it's less than 20 minutes. It seems doable to me.

@mark-rushakoff

This comment has been minimized.

Copy link
Contributor Author

commented Aug 31, 2018

I'm seeing an even more drastic slowdown without using -race:

bash-3.2$ go test  -count=5 ./task
ok  	github.com/influxdata/platform/task	1.845s
bash-3.2$ gotip test  -count=5 ./task
ok  	github.com/influxdata/platform/task	19.957s
bash-3.2$ go test  -count=5 ./task
ok  	github.com/influxdata/platform/task	1.841s
bash-3.2$ gotip test  -count=5 ./task
ok  	github.com/influxdata/platform/task	20.574s
@mark-rushakoff

This comment has been minimized.

Copy link
Contributor Author

commented Aug 31, 2018

Okay, I bisected to find be10ad7 (internal/poll: use F_FULLFSYNC fcntl for FD.Fsync on OS X) was the first "bad" commit.

The test suite in question is exercising some concurrent operations backed by a coreos/bbolt database. The commit message details look like the commit was a necessary change:

As reported in #26650 and also cautioned on the man page
for fsync on OS X, fsync doesn't properly flush content
to permanent storage, and might cause corruption of data if
the OS crashes or if the drive loses power. Thus it is recommended
to use the F_FULLFSYNC fcntl, which flushes all buffered data to
permanent storage and is important for applications such as
databases that require a strict ordering of writes.

Maybe this is just a necessary regression for file correctness on Darwin?

Here's the tail of my bisect run, if that's useful information:

bash-3.2$ gotip version; gotip test -count=5 ./task; gotip test -count=5 -race ./task
go version devel +37ea182660 Wed Aug 22 21:38:27 2018 +0000 darwin/amd64
ok  	github.com/influxdata/platform/task	1.693s
ok  	github.com/influxdata/platform/task	8.448s
bash-3.2$ gotip version; gotip test -count=5 ./task; gotip test -count=5 -race ./task
go version devel +37ea182660 Wed Aug 22 21:38:27 2018 +0000 darwin/amd64
ok  	github.com/influxdata/platform/task	1.976s
ok  	github.com/influxdata/platform/task	8.973s
bash-3.2$ gotip version; gotip test -count=5 ./task; gotip test -count=5 -race ./task
go version devel +ca8ba0675a Wed Aug 22 22:47:21 2018 +0000 darwin/amd64
ok  	github.com/influxdata/platform/task	20.477s
ok  	github.com/influxdata/platform/task	26.229s
bash-3.2$ gotip version; gotip test -count=5 ./task; gotip test -count=5 -race ./task
go version devel +ca8ba0675a Wed Aug 22 22:47:21 2018 +0000 darwin/amd64
ok  	github.com/influxdata/platform/task	21.889s
ok  	github.com/influxdata/platform/task	27.399s
bash-3.2$ gotip version; gotip test -count=5 ./task; gotip test -count=5 -race ./task
go version devel +be10ad7622 Wed Aug 22 22:26:40 2018 +0000 darwin/amd64
ok  	github.com/influxdata/platform/task	21.406s
ok  	github.com/influxdata/platform/task	26.720s
bash-3.2$ gotip version; gotip test -count=5 ./task; gotip test -count=5 -race ./task
go version devel +be10ad7622 Wed Aug 22 22:26:40 2018 +0000 darwin/amd64
ok  	github.com/influxdata/platform/task	21.632s
ok  	github.com/influxdata/platform/task	26.340s
bash-3.2$ gotip version; gotip test -count=5 ./task; gotip test -count=5 -race ./task
go version devel +ccb70bd19c Wed Aug 22 22:11:24 2018 +0000 darwin/amd64
ok  	github.com/influxdata/platform/task	1.786s
ok  	github.com/influxdata/platform/task	8.437s
bash-3.2$ gotip version; gotip test -count=5 ./task; gotip test -count=5 -race ./task
go version devel +ccb70bd19c Wed Aug 22 22:11:24 2018 +0000 darwin/amd64
ok  	github.com/influxdata/platform/task	1.821s
ok  	github.com/influxdata/platform/task	9.150s

/cc @odeke-em @ianlancetaylor

@mark-rushakoff mark-rushakoff changed the title cmd/go: performance regression on `go test -race` at tip cmd/go: performance regression after F_FULLFSYNC fcntl change on Darwin Aug 31, 2018

@ALTree ALTree added this to the Go1.12 milestone Aug 31, 2018

@odeke-em

This comment has been minimized.

Copy link
Member

commented Sep 2, 2018

Thank you for the investigation and report @mark-rushakoff!

In deed, the regression is present but for consistency and correctness to avoid data corruption, the change seems inevitable.

Could we please compare the speeds and throughput on Linux vs on OS X vs Windows?

@mark-rushakoff

This comment has been minimized.

Copy link
Contributor Author

commented Sep 3, 2018

Unfortunately, I don't have access to a Linux or Windows machine to compare results. Fortunately, I've created a repository with a smaller reproduction case so that someone with Linux or Windows can easily help: https://github.com/mark-rushakoff/goissue27415

The case there still depends on coreos/bbolt, which is included in vendor. I tried calling (*os.File).Sync() but wasn't able to reproduce the issue (consistent with the comment in the commit); and I tried using syscall.Fsync but also wasn't able to reproduce it, although it's very possible that was user error on my part.

The benchmark in the repository is this:

package goissue27415_test

import (
	"encoding/binary"
	"io/ioutil"
	"os"
	"path/filepath"
	"testing"

	bolt "github.com/coreos/bbolt"
)

func BenchmarkBboltWrites(b *testing.B) {
	dir, err := ioutil.TempDir("", "")
	if err != nil {
		b.Fatal(err)
	}

	path := filepath.Join(dir, "bolt.db")
	defer os.Remove(path)

	db, err := bolt.Open(path, 0644, nil)
	if err != nil {
		b.Fatal(err)
	}
	defer db.Close()

	var buf [8]byte
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		binary.BigEndian.PutUint64(buf[:], uint64(i))

		if err := db.Update(func(tx *bolt.Tx) error {
			// Create a bucket.
			bucket, err := tx.CreateBucketIfNotExists([]byte("numbers"))
			if err != nil {
				return err
			}

			if err := bucket.Put(buf[:], []byte("ok")); err != nil {
				return err
			}
			return nil
		}); err != nil {
			b.Error(err)
		}
	}
}

And by comparing go version go1.11 darwin/amd64 with go version devel +be10ad7622 Wed Aug 22 22:26:40 2018 +0000 darwin/amd64, I got the following benchstat output:

name                old time/op  new time/op  delta
BboltWrites-4   281µs ± 1%  8283µs ±11%  +2843.16%  (p=0.000 n=8+10)

(Edit: updated code and benchstat for simpler test case)

@odeke-em

This comment has been minimized.

Copy link
Member

commented Sep 3, 2018

At @mark-rushakoff I was able to toggle remote tests/benchmarks on both my Linux PowerEdge server and Macbook Pro. However, as you'll see the numbers of cores vary with results. The results are just to compare the range of tolerance/overhead across the various OSes

Linux

BenchmarkBboltWrites/10-4         	      10	 102957964 ns/op
BenchmarkBboltWrites/10-4         	      20	  71143284 ns/op
BenchmarkBboltWrites/10-4         	      20	 107467903 ns/op
BenchmarkBboltWrites/10-4         	      20	  56733746 ns/op
BenchmarkBboltWrites/10-4         	      20	  56198276 ns/op
BenchmarkBboltWrites/10-4         	      20	  97683423 ns/op
BenchmarkBboltWrites/10-4         	      50	  92507200 ns/op
BenchmarkBboltWrites/10-4         	      20	  73986142 ns/op
BenchmarkBboltWrites/10-4         	      20	  85993264 ns/op
BenchmarkBboltWrites/10-4         	      10	 142830820 ns/op
BenchmarkBboltWrites/100-4        	      20	  89995302 ns/op
BenchmarkBboltWrites/100-4        	      10	 104773303 ns/op
BenchmarkBboltWrites/100-4        	      20	 110999693 ns/op
BenchmarkBboltWrites/100-4        	      20	  70266489 ns/op
BenchmarkBboltWrites/100-4        	      20	  82401870 ns/op
BenchmarkBboltWrites/100-4        	      20	  73706927 ns/op
BenchmarkBboltWrites/100-4        	      20	  78203692 ns/op
BenchmarkBboltWrites/100-4        	      20	  73272481 ns/op
BenchmarkBboltWrites/100-4        	      20	  83584688 ns/op
BenchmarkBboltWrites/100-4        	      20	  60500892 ns/op
BenchmarkBboltWrites/1000-4       	      20	  80169125 ns/op
BenchmarkBboltWrites/1000-4       	      20	  66684793 ns/op
BenchmarkBboltWrites/1000-4       	      20	  87746197 ns/op
BenchmarkBboltWrites/1000-4       	       5	 203727481 ns/op
BenchmarkBboltWrites/1000-4       	      20	  87576367 ns/op
BenchmarkBboltWrites/1000-4       	      10	 103791336 ns/op
BenchmarkBboltWrites/1000-4       	      10	 104779844 ns/op
BenchmarkBboltWrites/1000-4       	      20	 100336926 ns/op
BenchmarkBboltWrites/1000-4       	      20	  88736597 ns/op
BenchmarkBboltWrites/1000-4       	      20	  71800534 ns/op

OS X Macbook Pro

Forcibly setting GOMAXPROCS to 4 since it is an 8 core machine

GOMAXPROCS=4 go test -bench=. -count=10
BenchmarkBboltWrites/10-4         	     100	  12040480 ns/op
BenchmarkBboltWrites/10-4         	     100	  12081625 ns/op
BenchmarkBboltWrites/10-4         	     100	  12340859 ns/op
BenchmarkBboltWrites/10-4         	     100	  12650864 ns/op
BenchmarkBboltWrites/10-4         	     100	  12846532 ns/op
BenchmarkBboltWrites/10-4         	     100	  12731753 ns/op
BenchmarkBboltWrites/10-4         	     100	  12683881 ns/op
BenchmarkBboltWrites/10-4         	     100	  12340144 ns/op
BenchmarkBboltWrites/10-4         	     100	  11625433 ns/op
BenchmarkBboltWrites/10-4         	     100	  12977623 ns/op
BenchmarkBboltWrites/100-4        	     100	  12117082 ns/op
BenchmarkBboltWrites/100-4        	     100	  12740611 ns/op
BenchmarkBboltWrites/100-4        	     100	  12934473 ns/op
BenchmarkBboltWrites/100-4        	     100	  12111835 ns/op
BenchmarkBboltWrites/100-4        	     100	  12959604 ns/op
BenchmarkBboltWrites/100-4        	     100	  12464744 ns/op
BenchmarkBboltWrites/100-4        	     100	  12544378 ns/op
BenchmarkBboltWrites/100-4        	     100	  12591169 ns/op
BenchmarkBboltWrites/100-4        	     100	  12499221 ns/op
BenchmarkBboltWrites/100-4        	     100	  12667389 ns/op
BenchmarkBboltWrites/1000-4       	     100	  12749705 ns/op
BenchmarkBboltWrites/1000-4       	     100	  12642107 ns/op
BenchmarkBboltWrites/1000-4       	     100	  12252822 ns/op
BenchmarkBboltWrites/1000-4       	     100	  11995750 ns/op
BenchmarkBboltWrites/1000-4       	     100	  12517070 ns/op
BenchmarkBboltWrites/1000-4       	     100	  12207655 ns/op
BenchmarkBboltWrites/1000-4       	     100	  12940175 ns/op
BenchmarkBboltWrites/1000-4       	     100	  12508852 ns/op
BenchmarkBboltWrites/1000-4       	     100	  12423794 ns/op
BenchmarkBboltWrites/1000-4       	     100	  13028134 ns/op

Results

The results are quite interesting, on OS X even with that change, those writes are apparently ~85% faster than on Linux

benchstat linux.txt osx.txt 
name                old time/op  new time/op  delta
BboltWrites/10-4    88.8ms ±61%  12.4ms ± 6%  -85.99%  (p=0.000 n=10+10)
BboltWrites/100-4   82.8ms ±34%  12.6ms ± 4%  -84.82%  (p=0.000 n=10+10)
BboltWrites/1000-4  88.0ms ±24%  12.5ms ± 4%  -85.76%  (p=0.000 n=9+10)

Unfortunately I don't have a WIndows machine but I'll kindly page @alexbrainman to help with those benchmarks if available

However, I've also got a state of the art Macbook Pro with Flash storage while my Linux machine has a hard drive -- this might be an explanation for the discrepancy: the comparison of apples to oranges, and since that change doesn't even affect any other OS I think we just need to make a judgement call on whether we can tolerate data corruption for speed.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Sep 3, 2018

On Darwin it seems clearly correct to use F_FULLFSYNC, and a reduction in speed of calls to the Sync method is an inevitable result. This sounds like working as intended. If you want to trade off data corruption for speed, don't call the Sync method.

@mark-rushakoff

This comment has been minimized.

Copy link
Contributor Author

commented Sep 3, 2018

That's fair @ianlancetaylor. I assume this performance change will surprise many other users; maybe this issue should be repurposed to track that the F_FULLSYNC change and its consequences are clearly documented in the Go1.12 release notes?

@alexbrainman

This comment has been minimized.

Copy link
Member

commented Sep 4, 2018

Unfortunately I don't have a WIndows machine but I'll kindly page @alexbrainman to help with those benchmarks if available

I did not bother with the benchmarks, because

https://go-review.googlesource.com/c/go/+/130676

did not change any windows code. Please, let me know, if I am wrong.

Alex

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Sep 7, 2018

I added a RELNOTES tag to the CL, so we will add something to the release notes. Closing this issue.

@odeke-em

This comment has been minimized.

Copy link
Member

commented Sep 7, 2018

did not change any windows code. Please, let me know, if I am wrong

@alexbrainman no, we meant to say just raw performance numbers regardless of the operating system, to compare with what OS X was seeing, for ballparking how OS X would be suffering. Anyways, no need for it anymore :)

I added a RELNOTES tag to the CL, so we will add something to the release notes. Closing this issue.

Thank you Ian!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.