Description
Background
First off, flaky tests (tests that usually pass but sometimes fail) are the worst: you should not write them, and you should fix them immediately.
That said, the larger teams, projects, and tests get, the more likely flaky tests are to be introduced. And sometimes it's technically and/or politically hard to get them fixed (e.g. the person who originally wrote the test left the company or works on a new team). You're then left deciding whether to skip or delete the entire test (which might otherwise be super useful), or to teach your team to become immune to test failures and start ignoring them. The latter is the worst: teams start submitting when CI is red and become blind to real test failures.
Google doesn't have this problem internally because Google's build system supports detecting & annotating flaky tests: https://bazel.build/reference/be/common-definitions
Tailscale has its own test wrapper that retries tests annotated with flakytest.Mark as flaky. We can't use go test -exec=... unfortunately, as that prevents caching (#27207). So instead we need a separate tool that wraps cmd/go (kinda awkwardly, as it turns out).
Go internally also uses these flaky test markers 79 times.
So, clearly flaky tests are a problem. And cmd/go isn't helping.
Proposal
I propose that Go:
- let users annotate known flaky tests (à la tb.MarkFlaky(urlOrMessage string))
- let cmd/go support re-running marked flaky tests up to N times (default 3 like Bazel?). Maybe add -retrycount or -maxcount alongside the existing -count flag?
- emit machine-readable output from cmd/go (that test2json recognizes) to say that a test was flaky and failed but eventually passed, which users can plug into their dashboards/analytics to find tests that are no longer flaky
- optional: support re-running even tests not annotated as flaky after they fail, to discover that they're flaky. (Perhaps it can still fail the test if not annotated, but it can fail with a different failure type, so the users learn it's flaky and can go fix or annotate.)
FAQ
Won't this encourage writing flaky tests? No. Everybody hates flaky tests.
Won't this discourage fixing flaky tests? No. We're already retrying flaky tests. We're just fighting cmd/go to do so.
Why do you have flaky tests? Unit tests for pure functions aren't flaking. The tests that usually flake are big and involve timeouts, the network, and subprocesses, even hitting localhost servers started by the tests. Many CI systems are super slow & oversubscribed. Even super generous network timeouts can fail.
Wasn't there already a proposal about this? Yes, in 2018: #27181. The summary was to not write flaky tests. I'm trying again.