
x/perf/benchstat: Return status code 1 when benchmarks change significantly #20728

Open
nicholasjackson opened this Issue Jun 19, 2017 · 3 comments

nicholasjackson commented Jun 19, 2017

Running benchstat on a CI server to detect anomalies currently requires the user to parse the command's output to pick up any deltas. To make this simpler, I propose that benchstat return status code 1 when any of the benchmarks has changed significantly.

In the event that backwards compatibility is required, a new flag could be added to activate this behaviour.

Example:

// ...
var (
	flagDeltaTest    = flag.String("delta-test", "utest", "significance `test` to apply to delta: utest, ttest, or none")
	flagAlpha        = flag.Float64("alpha", 0.05, "consider change significant if p < `α`")
	flagGeomean      = flag.Bool("geomean", false, "print the geometric mean of each file")
	flagHTML         = flag.Bool("html", false, "print results as an HTML table")
	flagSplit        = flag.String("split", "pkg,goos,goarch", "split benchmarks by `labels`")
	flagFailOnChange = flag.Bool("failonchange", false, "return exit code 1 when benchmarks have significant change")
)

func main() {
	// ...

	// With -failonchange set, exit non-zero if any row changed significantly.
	if *flagFailOnChange {
		for _, row := range c.Rows {
			if row.Changed != 0 {
				fmt.Fprintln(os.Stderr, "one or more benchmarks have changed significantly")
				os.Exit(1)
			}
		}
	}
}
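
A CI step could then rely on the exit status directly instead of scraping output; a minimal usage sketch, assuming the proposed -failonchange flag (old.txt and new.txt are illustrative file names):

# hypothetical invocation: fails the build when any benchmark changed significantly
benchstat -failonchange old.txt new.txt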

@gopherbot gopherbot added this to the Unreleased milestone Jun 19, 2017

mvdan commented Jun 19, 2017

If you run the benchmarks enough times, you could get a statistically significant change as small as 0.50%, which could well be caused by factors one can't control, like code alignment. For example, see this false positive I encountered a while ago, which was as high as 4%: #17250

You could change the flag to be a threshold, but I'm not sure that would be a good solution.

Do you have an idea of how we could deal with these? I feel like the flag would be fairly useless given the high likelihood of false positives.

mvdan commented Jun 19, 2017

Oh, forget my point about the threshold; I didn't know about -alpha.
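
For reference, tightening -alpha (declared in the flag list above) raises the significance bar and should cut down on false positives; a small usage sketch with illustrative file names:

# require p < 0.01 instead of the default 0.05 before reporting a change
benchstat -alpha 0.01 old.txt new.txt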

rochaporto commented Aug 10, 2017

+1 on this. We currently have a non-voting CI job for exactly this, parsing the output as described in the initial description, which is not ideal.
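
For context, that workaround looks roughly like this; a minimal sketch, assuming benchstat's default text output, where an insignificant delta is printed as "~" and a significant one as a signed percentage (file names are illustrative):

# fail the job if any delta column contains a signed percentage
if benchstat old.txt new.txt | grep -qE '[+-][0-9]+(\.[0-9]+)?%'; then
    echo "one or more benchmarks changed significantly" >&2
    exit 1
fi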
