Measure flakiness of new tests #3541

Open
GeoffreyBooth opened this issue Oct 25, 2023 · 4 comments

@GeoffreyBooth
Member

As discussed in nodejs/TSC#1457, could we somehow have a way for CI to measure the flakiness of new tests before they land? Something like:

  • For every PR, identify tests that are added by the PR (probably tests that run in the PR’s branch that didn’t run for main).
  • Run measure-flakiness on them.
  • Fail CI unless the new tests pass the flakiness cutoff, on all platforms.

This obviously won’t help with existing flaky tests, but I would expect it to prevent most new flaky tests from landing on main; and it would strongly motivate contributors to improve their tests, because their PRs would be blocked from landing until they did.

It also wouldn’t help if a test becomes flaky after it has landed, because of later changes to the API it tests. But still, I think this is better than the status quo.
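
To make it concrete, here’s a rough sketch of what such a check could look like, run from the root of a node checkout. The repeat count, the failure cutoff, and the way a single test gets invoked are all placeholder assumptions, not an existing script or job:

```ts
// flakiness-check.ts (hypothetical): run each test added by the PR many times
// and fail if any of them fails more often than an arbitrary cutoff.
import { execSync, spawnSync } from 'node:child_process';

const RUNS = Number(process.env.FLAKINESS_RUNS ?? 100); // assumed repeat count
const MAX_FAILURES = 0;                                  // assumed cutoff

// Test files that exist on this branch but not on main.
const newTests = execSync(
  'git diff --name-only --diff-filter=A origin/main...HEAD',
  { encoding: 'utf8' },
)
  .split('\n')
  .filter((file) => /^test\/.+\/test-.+\.m?js$/.test(file));

let flaky = false;
for (const test of newTests) {
  let failures = 0;
  for (let i = 0; i < RUNS; i++) {
    // How a single test is invoked is illustrative; adjust to match CI.
    const run = spawnSync('python3', ['tools/test.py', test], { stdio: 'ignore' });
    if (run.status !== 0) failures++;
  }
  console.log(`${test}: ${failures}/${RUNS} failures`);
  if (failures > MAX_FAILURES) flaky = true;
}

process.exit(flaky ? 1 : 0);
```

CI would fail the build when the script exits nonzero, and the same loop could run on each platform in the matrix.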

Related: #3056 cc @nodejs/tsc

@RafaelGSS
Member

I think there are two kinds of flakiness.

  1. When the test relies on the environment, for instance writing to disk. Here the flakiness appears when the machine doesn't satisfy the test's requirements, for example when the disk is full.
  2. When the test relies on timers or on operations that can suffer time-of-check/time-of-use (TOCTOU) races.

While I believe we can measure flakiness for the second kind, the first one would be very hard to reproduce. AFAIK most of our flaky tests are related to the first type of flakiness.
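
For example, something like this (a made-up test, just to illustrate the second kind) will usually pass on an idle machine but can fail under load, and repeating it many times in CI would expose that:

```ts
// Hypothetical timing-dependent test: it assumes the callback resumes
// "soon enough" after the timer fires, which isn't guaranteed under load.
import assert from 'node:assert';
import { setTimeout as sleep } from 'node:timers/promises';

async function main() {
  const start = Date.now();
  await sleep(100);
  const elapsed = Date.now() - start;
  // On a busy CI machine the event loop can easily be delayed past 150 ms,
  // so this assertion fails intermittently.
  assert.ok(elapsed < 150, `timer resumed after ${elapsed} ms`);
}

main();
```

Running it a few hundred times makes the failure rate visible; the first kind of flakiness (a full disk, a dying machine) wouldn't show up that way.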

@GeoffreyBooth
Member Author

AFAIK most of our flaky tests are related to the first type of flakiness.

Really? So then there are lots of tests marked as flaky where there’s nothing wrong with the test itself; it just happened to get marked flaky during a rough patch in the life of the machine running our test suite?

If so, wouldn’t the solution there be to improve the environment itself? Rather than long-running machines that need rebooting and so on, we could run our tests within Docker containers or EC2 instances that are created fresh for each run and then discarded. Or at least have some kind of automatic maintenance on the machines, like automatically restarting them every few hours or clearing their disk space after each run, something along those lines.

@RafaelGSS
Member

Well, I haven't looked at the tests marked as flaky (some of them are quite old); I'm speaking from my experience handling some PRs, so that may not be an authoritative statement.

If so, wouldn’t the solution there be to improve the environment itself?

Possibly. As someone who isn't on the build team, I may lack some context, but I assume it would require upgrading machines and adding more nodes; both come with a cost and need someone to champion them.

I will wait for someone from the build team to jump in and correct me if I'm wrong.

@GeoffreyBooth
Member Author

I will wait for someone from the build team to jump in and correct me if I’m wrong.

I would love it if you’re right: it’s much easier to increase the machines’ capacity than it is to refactor tests.
