Describe the bug
A random one of the CI tests often fails a PR check because one of the mock system containers didn't become healthy in time. Simply rerunning the failed tests without changing anything virtually always resolves the problem, so the failures appear spurious. #1353 attempted to address this, but that approach seems not to have been effective or correct, since I'm seeing failures before the bringup attempt has even run for 90 seconds. For instance, this run attempted to bring up the mock system for only 63 seconds (much of that spent downloading and extracting the CRDB image) before failing.
To reproduce
Observe checks on PRs. Many PRs have a single check that failed in this way.
Difference from expected behavior
Retrying the CI after a failure should almost never succeed unless something completely beyond its control, like a network outage, caused the failure; the CI should be a reliable indicator of actual code faults.
Possible solution
It seems that the time spent downloading and extracting supporting images (CRDB) may be counted against the container startup timeout. Ideally, that time would be excluded from the bringup timeout, with a separate timeout for image acquisition, so that network failures and image bringup failures are both still detected, but independently of one another. Regardless, the timeouts should clearly and correctly relate to what they are measuring, and the 90-second timeout from #1353 evidently isn't the controlling one, since the failure above occurred after only 63 seconds.
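The separation proposed above could look roughly like the sketch below: image acquisition and container health-checking each get their own clock, and the startup clock only starts once the image is local. Everything here is illustrative (the function and enum names, the closure-based probes, and the polling loop are assumptions for the sketch, not the project's actual API).

```rust
use std::time::{Duration, Instant};

/// Hypothetical outcome of a bringup attempt (illustrative, not omicron's API).
#[derive(Debug, PartialEq)]
enum BringupResult {
    ImagePullTimedOut,
    StartupTimedOut,
    Healthy,
}

/// Sketch: run image acquisition and container startup under independent
/// timeouts, so a slow image pull cannot eat into the startup budget.
/// `pull_done` and `healthy` stand in for whatever probes the real CI uses.
fn bring_up(
    mut pull_done: impl FnMut() -> bool,
    mut healthy: impl FnMut() -> bool,
    pull_timeout: Duration,
    startup_timeout: Duration,
) -> BringupResult {
    // Phase 1: acquire the image, timed on its own.
    let pull_start = Instant::now();
    while !pull_done() {
        if pull_start.elapsed() > pull_timeout {
            return BringupResult::ImagePullTimedOut;
        }
    }
    // Phase 2: only now start the startup clock, so the full
    // `startup_timeout` is available to the container itself.
    let startup_start = Instant::now();
    while !healthy() {
        if startup_start.elapsed() > startup_timeout {
            return BringupResult::StartupTimedOut;
        }
    }
    BringupResult::Healthy
}
```

With this split, a timeout failure reported by the CI would unambiguously name which phase exceeded its budget, which is exactly the property the current single timeout lacks.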