testing: benchmark iteration reports incorrectly #41637

Closed
changkun opened this issue Sep 26, 2020 · 20 comments


@changkun changkun commented Sep 26, 2020

What version of Go are you using (go version)?

$ go version
go version devel +8266570ba7 Fri Sep 25 19:10:05 2020 +0000 darwin/amd64

Does this issue reproduce with the latest release?

Yes. 1.15.2.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/changkun/Library/Caches/go-build"
GOENV="/Users/changkun/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/changkun/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/changkun/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/Users/changkun/dev/godev/go-gerrit"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/changkun/dev/godev/go-gerrit/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/changkun/dev/godev/go-gerrit/src/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/jd/vym7tt9s2_379d4ccjcj1xrw0000gn/T/go-build637901099=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

According to the doc:

The benchmark function must run the target code b.N times. During benchmark execution, b.N is adjusted until the benchmark function lasts long enough to be timed reliably. The output

BenchmarkRandInt-8   	68453040	        17.8 ns/op

means that the loop ran 68453040 times at a speed of 17.8 ns per loop.

But the following test fails and the benchmark runs the target code more than b.N times.

// main_test.go
func TestResultIteration(t *testing.T) {
	var v int32
	r := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			atomic.AddInt32(&v, 1)
		}
	})

	if r.N != int(v) {
		t.Fatalf("incorrect iteration count, want %d got %d", v, r.N)
	}
}

What did you expect to see?

PASS

What did you see instead?

=== RUN   TestResultIteration
    main_test.go:27: incorrect iteration count, want 359815341 got 258805240
--- FAIL: TestResultIteration (1.67s)
FAIL

@davecheney davecheney commented Sep 26, 2020

Because the benchmark function is run more than once.

(~/src) % cat main_test.go
package main

import (
        "sync/atomic"
        "testing"
)

// main_test.go
func TestResultIteration(t *testing.T) {
        var v int32
        var count int
        testing.Benchmark(func(b *testing.B) {
                count++
                for i := 0; i < b.N; i++ {
                        atomic.AddInt32(&v, 1)
                }
        })
        t.Logf("benchmark function ran %v times", count)

}
(~/src) % go test -bench=. -v main_test.go
=== RUN   TestResultIteration
    main_test.go:18: benchmark function ran 6 times
--- PASS: TestResultIteration (1.57s)
PASS
ok      command-line-arguments  1.582s

@changkun changkun commented Sep 26, 2020

@davecheney Yes. Should the fix go to the doc or to the code?


@davecheney davecheney commented Sep 26, 2020

@davecheney Yes. Should the fix go to the doc or to the code?

First, what's the actual problem?


@changkun changkun commented Sep 26, 2020

First, what's the actual problem?

The test states the actual problem.


@davecheney davecheney commented Sep 26, 2020

I don't think this is a bug; testing.Benchmark returns a BenchmarkResult, which is the result of a benchmark run.

https://golang.org/pkg/testing/#BenchmarkResult


@changkun changkun commented Sep 26, 2020

I don't think this is a bug; testing.Benchmark returns a BenchmarkResult, which is the result of a benchmark run.

Maybe. But the doc leaves the impression that the target code runs exactly b.N times. If I have a shared resource, running the benchmark function multiple times manipulates the shared resource more than expected.

The implementation could run the benchmark incrementally instead of making multiple trial runs to predict the iteration count. That way, the testing facility could offer a more consistent result.

The other option would be to at least document this behavior.


@gopherbot gopherbot commented Sep 26, 2020

Change https://golang.org/cl/257647 mentions this issue: testing: grow benchmark iteration incrementally

@andybons andybons added this to the Unplanned milestone Sep 29, 2020


@jayconrod jayconrod commented Sep 29, 2020

Doesn't seem like a bug.

The benchmark function runs more than once with varying numbers of iterations. It starts with a small number of iterations so that if the benchmark function is slow, it doesn't take too long. It ramps up to a larger number of iterations for fast functions, since shorter runs will be noisy. BenchmarkResult only includes the results from the last run. It would be a bug for it to include results from earlier runs, since we expect them to be noisy.

What would change in the documentation? It seems concise but complete. The sentence "During benchmark execution, b.N is adjusted until the benchmark function lasts long enough to be timed reliably." describes this behavior.


@changkun changkun commented Sep 29, 2020

@jayconrod Thanks for your response.

It ramps up to a larger number of iterations for fast functions, since shorter runs will be noisy

Is this the core reason that the current implementation iterates multiple times to predict the actual run batch size? As mentioned above, the implementation could run the benchmark incrementally instead of making multiple trial runs. Statistically speaking, a final comparison with benchstat can filter out noise sources and provide comparable results.

"During benchmark execution, b.N is adjusted until the benchmark function lasts long enough to be timed reliably." describes this behavior.

Not really. The first impression could be: "b.N increases incrementally until it meets the time constraint." The confusion is over how "timed reliably" is defined in this case. There is no prior knowledge about the benchmark function, so it is entirely possible that the default time window (1s) is too short, and the developer is responsible for choosing the bench time. Thus the feasible way to produce comparable results is still to verify further with benchstat.

Would you be willing to review CL 257647? It proposes a code change rather than a doc change and also addresses #27217.


@jayconrod jayconrod commented Sep 29, 2020

Is this the core reason that the current implementation iterates multiple times to predict the actual run batch size?

I think so. I'm not super-familiar with this code, but that's probably how I would write it.

As mentioned above, the implementation could run the benchmark incrementally other than trying multiple times.

I think by "incrementally", you mean that go test would accumulate the elapsed time and iteration count over several calls to the benchmark function instead of just returning the time and count from the last run (please correct me if I have that wrong).

That could add significant noise in fast benchmarks where the loop body only takes a few cycles to execute, i.e., fewer cycles than it takes to measure elapsed time and do the other benchmarking calculations. It's been a while since I did low-level benchmarking, but on older architectures, getting the time took a couple hundred cycles since it flushed CPU pipelines. That stuff adds a lot of noise.


@changkun changkun commented Sep 29, 2020

That could add significant noise in fast benchmarks where the loop body only takes a few cycles to execute, i.e., fewer cycles than it takes to measure elapsed time and do the other benchmarking calculations. It's been a while since I did low-level benchmarking, but on older architectures, getting the time took a couple hundred cycles since it flushed CPU pipelines. That stuff adds a lot of noise.

Indeed, that's the theoretical expectation. But I measured a few benchmark suites, for instance the go1 bench.

Two executions of the go1 bench before CL 257647 (tip v.s. tip):

name                     old time/op    new time/op    delta
BinaryTree17-6              2.08s ± 2%     2.02s ± 2%  -2.81%  (p=0.000 n=9+9)
Fannkuch11-6                1.91s ± 1%     1.89s ± 0%  -0.70%  (p=0.000 n=10+10)
FmtFprintfEmpty-6          29.8ns ± 2%    29.1ns ± 1%  -2.17%  (p=0.001 n=10+8)
FmtFprintfString-6         51.4ns ± 4%    50.7ns ± 3%    ~     (p=0.071 n=10+10)
FmtFprintfInt-6            55.2ns ± 3%    54.9ns ± 3%    ~     (p=0.183 n=10+10)
FmtFprintfIntInt-6         84.0ns ± 1%    83.5ns ± 3%    ~     (p=0.093 n=10+10)
FmtFprintfPrefixedInt-6    94.2ns ± 1%    93.4ns ± 3%    ~     (p=0.136 n=10+10)
FmtFprintfFloat-6           151ns ± 0%     149ns ± 0%  -1.06%  (p=0.000 n=6+10)
FmtManyArgs-6               361ns ± 0%     358ns ± 1%  -0.94%  (p=0.000 n=10+10)
GobDecode-6                3.82ms ± 1%    3.75ms ± 1%  -1.83%  (p=0.000 n=10+10)
GobEncode-6                3.01ms ± 1%    2.94ms ± 1%  -2.20%  (p=0.000 n=9+9)
Gzip-6                      165ms ± 1%     161ms ± 0%  -2.41%  (p=0.000 n=9+10)
Gunzip-6                   24.7ms ± 1%    24.4ms ± 1%  -1.21%  (p=0.000 n=10+9)
HTTPClientServer-6         56.3µs ± 2%    54.7µs ± 0%  -2.92%  (p=0.000 n=9+8)
JSONEncode-6               6.35ms ± 2%    6.19ms ± 1%  -2.48%  (p=0.000 n=10+9)
JSONDecode-6               28.0ms ± 0%    27.9ms ± 1%  -0.48%  (p=0.008 n=9+9)
Mandelbrot200-6            3.31ms ± 0%    3.30ms ± 0%  -0.36%  (p=0.004 n=8+9)
GoParse-6                  2.27ms ± 0%    2.27ms ± 1%  -0.32%  (p=0.046 n=8+9)
RegexpMatchEasy0_32-6      45.7ns ± 0%    46.1ns ± 2%  +0.78%  (p=0.018 n=10+9)
RegexpMatchEasy0_1K-6       126ns ± 0%     126ns ± 1%    ~     (p=0.405 n=8+10)
RegexpMatchEasy1_32-6      41.9ns ± 3%    42.4ns ± 9%    ~     (p=0.373 n=10+10)
RegexpMatchEasy1_1K-6       213ns ± 1%     210ns ± 1%  -1.11%  (p=0.000 n=9+10)
RegexpMatchMedium_32-6     3.90ns ± 1%    3.84ns ± 0%  -1.31%  (p=0.000 n=9+9)
RegexpMatchMedium_1K-6     21.7µs ± 0%    21.9µs ± 6%    ~     (p=0.315 n=8+10)
RegexpMatchHard_32-6       1.04µs ± 1%    1.04µs ± 1%    ~     (p=0.914 n=8+10)
RegexpMatchHard_1K-6       31.4µs ± 1%    31.0µs ± 1%  -1.34%  (p=0.000 n=10+10)
Revcomp-6                   264ms ± 0%     260ms ± 2%  -1.41%  (p=0.004 n=9+9)
Template-6                 35.1ms ± 0%    34.9ms ± 1%  -0.38%  (p=0.021 n=8+9)
TimeParse-6                 224ns ± 1%     222ns ± 0%  -0.76%  (p=0.000 n=10+8)
TimeFormat-6                219ns ± 0%     218ns ± 0%  -0.46%  (p=0.000 n=9+9)

name                     old speed      new speed      delta
GobDecode-6               201MB/s ± 1%   205MB/s ± 1%  +1.87%  (p=0.000 n=10+10)
GobEncode-6               255MB/s ± 1%   261MB/s ± 1%  +2.25%  (p=0.000 n=9+9)
Gzip-6                    118MB/s ± 1%   121MB/s ± 0%  +2.47%  (p=0.000 n=9+10)
Gunzip-6                  786MB/s ± 1%   796MB/s ± 1%  +1.22%  (p=0.000 n=10+9)
JSONEncode-6              306MB/s ± 2%   313MB/s ± 1%  +2.54%  (p=0.000 n=10+9)
JSONDecode-6             69.3MB/s ± 0%  69.6MB/s ± 1%  +0.49%  (p=0.007 n=9+9)
GoParse-6                25.5MB/s ± 0%  25.6MB/s ± 1%  +0.33%  (p=0.038 n=8+9)
RegexpMatchEasy0_32-6     700MB/s ± 1%   695MB/s ± 2%  -0.72%  (p=0.017 n=10+9)
RegexpMatchEasy0_1K-6    8.07GB/s ± 1%  8.12GB/s ± 1%    ~     (p=0.105 n=10+10)
RegexpMatchEasy1_32-6     764MB/s ± 3%   756MB/s ± 9%    ~     (p=0.315 n=10+10)
RegexpMatchEasy1_1K-6    4.81GB/s ± 1%  4.87GB/s ± 0%  +1.31%  (p=0.000 n=10+9)
RegexpMatchMedium_32-6    257MB/s ± 1%   260MB/s ± 0%  +1.31%  (p=0.000 n=9+9)
RegexpMatchMedium_1K-6   47.3MB/s ± 0%  46.8MB/s ± 6%    ~     (p=0.304 n=8+10)
RegexpMatchHard_32-6     30.6MB/s ± 1%  30.7MB/s ± 1%    ~     (p=0.897 n=8+10)
RegexpMatchHard_1K-6     32.6MB/s ± 1%  33.1MB/s ± 1%  +1.36%  (p=0.000 n=10+10)
Revcomp-6                 962MB/s ± 0%   976MB/s ± 2%  +1.44%  (p=0.004 n=9+9)
Template-6               55.3MB/s ± 0%  55.5MB/s ± 1%  +0.38%  (p=0.021 n=8+9)

Go 1 bench before and after the CL (tip v.s. tip+CL):

name                     old time/op    new time/op    delta
BinaryTree17-6              2.02s ± 2%     1.72s ± 1%  -14.78%  (p=0.000 n=9+9)
Fannkuch11-6                1.89s ± 0%     1.89s ± 0%   -0.19%  (p=0.004 n=10+8)
FmtFprintfEmpty-6          29.1ns ± 1%    29.0ns ± 1%     ~     (p=0.292 n=8+10)
FmtFprintfString-6         50.7ns ± 3%    50.4ns ± 1%     ~     (p=0.780 n=10+10)
FmtFprintfInt-6            54.9ns ± 3%    54.6ns ± 1%     ~     (p=0.743 n=10+8)
FmtFprintfIntInt-6         83.5ns ± 3%    84.8ns ± 1%   +1.64%  (p=0.022 n=10+10)
FmtFprintfPrefixedInt-6    93.4ns ± 3%    94.5ns ± 2%     ~     (p=0.085 n=10+10)
FmtFprintfFloat-6           149ns ± 0%     150ns ± 1%   +0.54%  (p=0.019 n=10+10)
FmtManyArgs-6               358ns ± 1%     357ns ± 1%     ~     (p=0.301 n=10+10)
GobDecode-6                3.75ms ± 1%    3.75ms ± 1%     ~     (p=0.633 n=10+8)
GobEncode-6                2.94ms ± 1%    2.97ms ± 2%   +0.76%  (p=0.031 n=9+9)
Gzip-6                      161ms ± 0%     160ms ± 0%   -0.46%  (p=0.003 n=10+9)
Gunzip-6                   24.4ms ± 1%    24.2ms ± 1%   -0.71%  (p=0.004 n=9+10)
HTTPClientServer-6         54.7µs ± 0%    52.4µs ± 1%   -4.20%  (p=0.000 n=8+9)
JSONEncode-6               6.19ms ± 1%    6.31ms ± 1%   +1.92%  (p=0.000 n=9+10)
JSONDecode-6               27.9ms ± 1%    27.9ms ± 0%     ~     (p=0.190 n=9+9)
Mandelbrot200-6            3.30ms ± 0%    3.27ms ± 1%   -0.87%  (p=0.000 n=9+10)
GoParse-6                  2.27ms ± 1%    2.28ms ± 1%   +0.46%  (p=0.017 n=9+10)
RegexpMatchEasy0_32-6      46.1ns ± 2%    44.6ns ± 1%   -3.16%  (p=0.000 n=9+9)
RegexpMatchEasy0_1K-6       126ns ± 1%     126ns ± 1%     ~     (p=0.192 n=10+9)
RegexpMatchEasy1_32-6      42.4ns ± 9%    41.3ns ± 1%     ~     (p=0.666 n=10+10)
RegexpMatchEasy1_1K-6       210ns ± 1%     208ns ± 1%   -0.92%  (p=0.000 n=10+8)
RegexpMatchMedium_32-6     3.84ns ± 0%    3.73ns ± 1%   -2.92%  (p=0.000 n=9+10)
RegexpMatchMedium_1K-6     21.9µs ± 6%    22.0µs ± 1%     ~     (p=0.156 n=10+9)
RegexpMatchHard_32-6       1.04µs ± 1%    1.03µs ± 1%     ~     (p=0.127 n=10+9)
RegexpMatchHard_1K-6       31.0µs ± 1%    31.2µs ± 1%   +0.65%  (p=0.005 n=10+10)
Revcomp-6                   260ms ± 2%     265ms ± 1%   +1.69%  (p=0.002 n=9+8)
Template-6                 34.9ms ± 1%    35.1ms ± 0%     ~     (p=0.063 n=9+9)
TimeParse-6                 222ns ± 0%     227ns ± 0%   +2.25%  (p=0.000 n=8+10)
TimeFormat-6                218ns ± 0%     216ns ± 0%   -0.92%  (p=0.000 n=9+9)

name                     old speed      new speed      delta
GobDecode-6               205MB/s ± 1%   205MB/s ± 1%     ~     (p=0.633 n=10+8)
GobEncode-6               261MB/s ± 1%   259MB/s ± 2%   -0.75%  (p=0.031 n=9+9)
Gzip-6                    121MB/s ± 0%   121MB/s ± 0%   +0.46%  (p=0.003 n=10+9)
Gunzip-6                  796MB/s ± 1%   801MB/s ± 1%   +0.71%  (p=0.004 n=9+10)
JSONEncode-6              313MB/s ± 1%   308MB/s ± 1%   -1.89%  (p=0.000 n=9+10)
JSONDecode-6             69.6MB/s ± 1%  69.6MB/s ± 0%     ~     (p=0.216 n=9+9)
GoParse-6                25.6MB/s ± 1%  25.4MB/s ± 1%   -0.46%  (p=0.017 n=9+10)
RegexpMatchEasy0_32-6     695MB/s ± 2%   717MB/s ± 1%   +3.22%  (p=0.000 n=9+9)
RegexpMatchEasy0_1K-6    8.12GB/s ± 1%  8.15GB/s ± 1%     ~     (p=0.243 n=10+9)
RegexpMatchEasy1_32-6     756MB/s ± 9%   775MB/s ± 1%     ~     (p=0.684 n=10+10)
RegexpMatchEasy1_1K-6    4.87GB/s ± 0%  4.90GB/s ± 1%   +0.64%  (p=0.010 n=9+10)
RegexpMatchMedium_32-6    260MB/s ± 0%   268MB/s ± 1%   +3.06%  (p=0.000 n=9+10)
RegexpMatchMedium_1K-6   46.8MB/s ± 6%  46.6MB/s ± 1%     ~     (p=0.156 n=10+9)
RegexpMatchHard_32-6     30.7MB/s ± 1%  30.9MB/s ± 1%     ~     (p=0.108 n=10+9)
RegexpMatchHard_1K-6     33.1MB/s ± 1%  32.9MB/s ± 1%   -0.64%  (p=0.005 n=10+10)
Revcomp-6                 976MB/s ± 2%   960MB/s ± 1%   -1.67%  (p=0.002 n=9+8)
Template-6               55.5MB/s ± 1%  55.4MB/s ± 0%     ~     (p=0.066 n=9+9)

which doesn't show significant noise in fast benchmarks (with the CL) on an i5-8500B. Meanwhile, the total execution time of the suite improves significantly, from 438.864s to 215.693s.


@jayconrod jayconrod commented Sep 30, 2020

These are slower benchmarks than I was thinking of. I'd expect to see more noise in something that runs in 10 ns or less, like acquiring an uncontested lock.

The execution time improvement is nice, but precision and accuracy are the highest priority. Since I don't think the current methodology is actually wrong, any proposed change should at least hold those steady. There are some pretty sizable differences in times above (BinaryTree17 is different by 14.78%).


@changkun changkun commented Sep 30, 2020

There are some pretty sizable differences in times above (BinaryTree17 is different by 14.78%).

The entries with a bigger diff are clearly not caused by the CL, as you can tell from the others. Since we are talking about microbenchmarks here, I am not fully convinced by this argument. Could you perhaps suggest more counterexamples?
Of course, a stricter benchmark in a tightly controlled environment could be conducted for the CL, but what's shown here is a quick test for a first impression.

As you noted, we are talking about code that runs in 10 ns or less. It is pretty easy to verify with atomics:

func BenchmarkAtomic(b *testing.B) {
	var v int32
	for i := 0; i < b.N; i++ {
		atomic.AddInt32(&v, 1)
	}
}

tip vs tip+CL:

name       old time/op  new time/op  delta
Atomic-12  4.58ns ± 0%  4.58ns ± 0%   ~     (p=1.094 n=8+9)

@bcmills bcmills commented Sep 30, 2020

What is the concrete problem you are trying to address here?

(*testing.B).N requests the number of iterations for the user-supplied function to run; it does not report the number of iterations actually run. (Compare #38677.)


@bcmills bcmills commented Sep 30, 2020

For improving the overall execution time of fast benchmarks, see #10930.


@bcmills bcmills commented Sep 30, 2020

The implementation could run the benchmark incrementally instead of making multiple trial runs to predict the iteration count. That way, the testing facility could offer a more consistent result.

Running the benchmark incrementally would potentially be faster, but not necessarily more consistent. Many benchmarks have a non-trivial transient at the start of the benchmark (for example, due to CPU cache misses). Summing the results of multiple incremental runs would also sum the transient effects, whereas taking only the last run (with the final computed N) includes only a single transient, and should therefore have less noise for certain benchmarks.


@changkun changkun commented Sep 30, 2020

Many benchmarks have a non-trivial transient at the start of the benchmark (for example, due to CPU cache misses) [...] should therefore have less noise for certain benchmarks.

Yes, I know. But the argument for running it incrementally is that the noise can be filtered out when the results are later verified with significance testing.

What is the concrete problem you are trying to address here?

The original purpose was to address this: I have a shared resource, and running the benchmark function multiple times leads to more manipulation of the shared resource than expected. A subsequent investigation found that this is also one of the causes of #27217, as described in the CL 257647 message.


@bcmills bcmills commented Sep 30, 2020

the argument for running it incrementally is that the noise can be filtered out when the results are later verified with significance testing.

Note that the testing package does not currently perform any significance testing when computing b.N.


@changkun changkun commented Sep 30, 2020

Understood. Closing, since no change is going to proceed.

@changkun changkun closed this Sep 30, 2020