Description
What version of Go are you using (go version)?
$ go version
go version go1.15.5 darwin/amd64
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/xx/Library/Caches/go-build"
GOENV="/Users/xx/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/xx/gocode/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/xx/gocode"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/Cellar/go/1.15.5/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.15.5/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/xr/hsc7mk1n5glg6vfqb4z5nlkh0000gn/T/go-build663468880=/tmp/go-build -gno-record-gcc-switches -fno-common"
What did you do?
When running parallel benchmarks, testing.PB.Next() returns only a boolean indicating whether another iteration exists.
This is insufficient in many situations where one needs to know the current iteration index.
See the Go code in the attached stencil.go.zip file.
Consider the serial elision of a 3-point stencil example: for each iteration index i, it reads from an input slice in, computes in[i-1] + in[i] + in[i+1] (treating an out-of-range neighbor as 0), and writes the result to out[i] in an output slice out.
func stencilSerial() {
r := testing.Benchmark(func(b *testing.B) {
in := make([]int, b.N)
out := make([]int, b.N)
for i := 0; i < b.N; i++ {
in[i] = i
}
for i := 0; i < b.N; i++ {
left := 0
if i > 0 {
left = in[i-1]
}
right := 0
if i < int(b.N)-1 {
right = in[i+1]
}
out[i] = left + in[i] + right
}
})
fmt.Printf("\n stencilSerial %v", r)
}
Now try to convert it to a parallel benchmark. Here is a first, racy version, which is definitely incorrect.
The counter in for i := 0; pb.Next(); i++ is local to each goroutine, so multiple goroutines overwrite the same index out[i], and not every index is read or updated. The benchmark is therefore functionally incorrect.
Furthermore, cache-line ping-ponging degrades the performance of this parallel code (shown at the end of this issue).
func stencilParallelRacy() {
r := testing.Benchmark(func(b *testing.B) {
in := make([]int, b.N)
out := make([]int, b.N)
for i := 0; i < b.N; i++ {
in[i] = i
}
b.RunParallel(func(pb *testing.PB) {
for i := 0; pb.Next(); i++ {
left := 0
if i > 0 {
left = in[i-1]
}
right := 0
if i < int(b.N)-1 {
right = in[i+1]
}
// racy update, and not all out[i] are written!
out[i] = left + in[i] + right
}
})
})
fmt.Printf("\n stencilParallelRacy %v", r)
}
A somewhat saner parallel version creates the in and out slices inside each goroutine, as shown below.
However, this benchmark has several problems.
- It multiplies memory use by a factor of GOMAXPROCS. Because of dynamic load balancing, one cannot know how many iterations each goroutine will run, so each goroutine has to allocate slices of the full length b.N.
- It is not functionally equivalent to the original, because not every index of the input slices is read and not every index of the output slices is written.
func stencilParallel() {
r := testing.Benchmark(func(b *testing.B) {
b.RunParallel(func(pb *testing.PB) {
in := make([]int, b.N)
out := make([]int, b.N)
for i := 0; i < b.N; i++ {
in[i] = i
}
for i := 0; pb.Next(); i++ {
left := 0
if i > 0 {
left = in[i-1]
}
right := 0
if i < int(b.N)-1 {
right = in[i+1]
}
out[i] = left + in[i] + right
}
})
})
fmt.Printf("\n stencilParallel %v", r)
}
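For comparison, the existing API does allow a correct shared-slice version if the benchmark hands out indices itself. The following is a minimal sketch, assuming a shared atomic counter named next (a name introduced here for illustration, not taken from stencil.go) and relying on RunParallel distributing exactly b.N iterations across its goroutines.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"testing"
)

// stencilParallelAtomic is a hypothetical workaround that uses only the
// existing testing.PB API: every successful pb.Next() claims one unique
// index from a shared atomic counter.
func stencilParallelAtomic(b *testing.B) {
	in := make([]int, b.N)
	out := make([]int, b.N)
	for i := range in {
		in[i] = i
	}
	var next int64 // number of indices handed out so far
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			// AddInt64 returns the new value, so subtract 1 to get
			// a zero-based index in [0, b.N).
			i := int(atomic.AddInt64(&next, 1)) - 1
			left := 0
			if i > 0 {
				left = in[i-1]
			}
			right := 0
			if i < b.N-1 {
				right = in[i+1]
			}
			out[i] = left + in[i] + right
		}
	})
}

func main() {
	r := testing.Benchmark(stencilParallelAtomic)
	fmt.Printf("stencilParallelAtomic %v\n", r)
}
```

This sketch pays for an extra atomic operation on every iteration, and neighboring indices tend to land on different goroutines, so it likely still suffers from cache-line contention on out. It was not measured as part of this issue.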
We need a func (pb *PB) NextIndex() (int, bool) API that can return the iteration index so that each iteration inside parallel region can take the appropriate action. See the code below, with the proposed NextIndex API in action.
func stencilParallelNextIndex() {
r := testing.Benchmark(func(b *testing.B) {
in := make([]int, b.N)
out := make([]int, b.N)
for i := 0; i < b.N; i++ {
in[i] = i
}
b.RunParallel(func(pb *testing.PB) {
for {
i, ok := pb.NextIndex()
if !ok {
break
}
left := 0
if i > 0 {
left = in[i-1]
}
right := 0
if int(i) < b.N-1 {
right = in[i+1]
}
out[i] = left + in[i] + right
}
})
})
fmt.Printf("\n stencilParallelNextIndex %v", r)
}
It is worth observing the running time of each of these examples (I have implemented NextIndex locally) when scaling GOMAXPROCS from 1 to 8.
GOMAXPROCS=1 go run stencil.go
stencilSerial 233812798 6.19 ns/op
stencilParallelRacy 323430168 5.72 ns/op
stencilParallel 314360209 3.81 ns/op
stencilParallelNextIndex 309754171 3.89 ns/op
GOMAXPROCS=2 go run stencil.go
stencilSerial 234993747 6.34 ns/op
stencilParallelRacy 437399902 5.94 ns/op
stencilParallel 432057882 6.89 ns/op
stencilParallelNextIndex 443609341 2.66 ns/op
GOMAXPROCS=4 go run stencil.go
stencilSerial 243237640 6.39 ns/op
stencilParallelRacy 542654800 3.41 ns/op
stencilParallel 194487830 7.74 ns/op
stencilParallelNextIndex 572145619 2.10 ns/op
GOMAXPROCS=8 go run stencil.go
stencilSerial 240228440 6.31 ns/op
stencilParallelRacy 697957074 4.28 ns/op
stencilParallel 100000000 15.0 ns/op
stencilParallelNextIndex 651700406 1.87 ns/op
stencilSerial, as expected, stays roughly constant.
stencilParallelRacy scales poorly because of heavy cache-line contention (and, of course, it produces an incorrect result).
stencilParallel scales very poorly, getting slower as GOMAXPROCS grows, because of its bloated memory needs and the associated initialization in each goroutine.
stencilParallelNextIndex shows higher (although not perfectly linear) throughput as GOMAXPROCS increases, which is the desirable behavior when writing parallel benchmarks.
This is a small change that introduces one new API.
What did you expect to see?
NA
What did you see instead?
NA