cmd/go: add support for dealing with flaky tests

## Background

First off, flaky tests (a test that usually passes but sometimes fails) are the worst and you should not write them and fix them immediately.

That said, the larger teams & projects & tests get, the more likely they get introduced. And sometimes it's technically and/or politically hard to get them fixed. (e.g. the person who wrote it originally left the company or works on a new team) You're then left deciding whether to skip/delete the entire test (which might be otherwise super useful), or teach your team to become immune to and start ignoring test failures, which is the worst, when teams start submitting when CI's red, becoming blind to real test failures.

Google doesn't have this problem internally because Google's build system supports detecting & annotating flaky tests: https://bazel.build/reference/be/common-definitions

Tailscale has its [own test wrapper](https://github.com/tailscale/tailscale/blob/main/cmd/testwrapper/testwrapper.go) that retries tests [annotated with `flakytest.Mark`](https://github.com/tailscale/tailscale/blob/main/cmd/testwrapper/flakytest/flakytest.go) as flaky. We can't use `go test -exec=...` unfortunately, as that prevents caching (https://github.com/golang/go/issues/27207). So instead we need a separate tool wraps `cmd/go` (kinda awkwardly, as it turns out).

Go internally also uses these flaky test markers 79 times:

* https://pkg.go.dev/internal/testenv#SkipFlaky
* https://pkg.go.dev/internal/testenv#SkipFlakyNet

So, clearly flaky tests are a problem. And `cmd/go` isn't helping.

## Proposal

I propose that Go:

* let users annotate known flaky tests (ala `tb.MarkFlaky(urlOrMessage string)`)
* let `cmd/go` support re-running marked flaky tests up to N times (default 3 like Bazel?). Maybe add `-retrycount` or `-maxcount` aside the existing `-count` flag?
* output machine readable output from `cmd/go` (that `test2json` recognizes) to say that a test was flaky and failed but eventually passed, that users can plug into their dashboards/analytics, to find tests that are no longer flaky
* optional: support re-running even tests not annotated as flaky after they fail, to discover that they're flaky. (perhaps it can still fail the test if not annotated, but it can fail with a different failure type, so the users learn it's flaky and can go fix or annotate)

## FAQ

Won't this encourage writing flaky tests? No. Everybody hates flaky tests.

Won't this discourage fixing flaky tests? No. We're already retrying flaking tests. We're just fighting cmd/go to do so.

Why do you have flaky tests? Unit tests for pure functions aren't flaking. The tests usually flaking are big and involve timeouts and the network and subprocesses, even hitting localhost servers started by the tests. Many CI systems are super slow & oversubscribed. Even super generous network timeouts can fail.

Wasn't there already a proposal about this? Yes, in 2018: #27181. The summary was to not write flaky tests. I'm trying again.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

cmd/go: add support for dealing with flaky tests #62244

Background

Proposal

FAQ

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

cmd/go: add support for dealing with flaky tests #62244

Description

Background

Proposal

FAQ

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions