CI randomly fails due to mock system bringup timeout #1390

@BenjaminPelletier

Describe the bug
One of the CI tests, seemingly at random, often fails on a PR check because one of the mock system containers did not become healthy in time. Simply rerunning the failed tests without changing anything virtually always resolves the problem, so the failures appear to be spurious. #1353 attempted to address this, but that approach does not seem to have been effective or correct: I'm seeing failures when the attempt to bring up the system has not yet been running for 90 seconds. For instance, this run spent only 63 seconds attempting to bring up the mock system (much of that downloading and extracting the CRDB image) before failing:

Image

To reproduce
Observe checks on PRs. Many PRs have a single check that failed in this way.

Difference from expected behavior
The CI should almost never succeed when retrying after a failure unless the failure was caused by something completely beyond its control, like a network outage (the CI should be a good indicator of actual code faults).

Possible solution
It seems that the time to download and extract supporting (CRDB) images may be counted against the container startup timeout. Ideally, we would exclude this time from the bringup timeout and have a separate timeout for image acquisition, so that network failure and bringup failure are both still detected, but independently of one another. Regardless, each timeout should clearly and correctly relate to what it measures, and the 90 seconds from #1353 does not appear to be the controlling timeout, since the failure above occurred after only 63 seconds.
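
As a rough illustration of the "separate timeouts" idea, here is a minimal Python sketch that runs image acquisition and container bringup as two phases, each with its own budget, and reports which phase overran. The names `run_phase`, `bring_up`, and `PhaseTimeout`, and the 300-second pull budget, are hypothetical and not taken from this repository; only the 90-second bringup figure comes from #1353. The commands are passed in by the caller because the project's actual bringup scripts are not shown here.

```python
import subprocess
import sys
import time


class PhaseTimeout(RuntimeError):
    """Raised when one bringup phase exceeds its own time budget."""


def run_phase(name, cmd, timeout_s):
    """Run one phase as a subprocess with its own timeout.

    Returns the elapsed wall-clock seconds. A non-zero exit status raises
    subprocess.CalledProcessError; an overrun raises PhaseTimeout naming
    the phase, so the CI log says which budget was exceeded.
    """
    start = time.monotonic()
    try:
        subprocess.run(cmd, check=True, timeout=timeout_s)
    except subprocess.TimeoutExpired as exc:
        raise PhaseTimeout(f"{name} exceeded {timeout_s}s") from exc
    return time.monotonic() - start


def bring_up(pull_cmd, up_cmd, pull_timeout_s=300, health_timeout_s=90):
    """Acquire images first, then start containers, timing each separately.

    pull_timeout_s is an assumed value for illustration only.
    """
    pull_elapsed = run_phase("image acquisition", pull_cmd, pull_timeout_s)
    up_elapsed = run_phase("container bringup", up_cmd, health_timeout_s)
    return pull_elapsed, up_elapsed


# Hypothetical usage, assuming the mock system is started with Docker Compose:
#   bring_up(["docker", "compose", "pull"],
#            ["docker", "compose", "up", "-d", "--wait"])
```

With this split, a slow image download can only exhaust the acquisition budget, and the bringup clock starts only once the images are already local.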

Labels: P2 (Normal priority), bug (Something isn't working), tooling