prevent dendrite check from hogging a CI machine for 90 minutes #9767

Merged
iliana merged 1 commit into main from iliana/timeouts-timeouts-always-wrong on Feb 2, 2026

@iliana (Contributor) commented Jan 31, 2026

@jclulow reported this job: https://buildomat.eng.oxide.computer/wg/0/details/01KG5JNVNBG9BBVWG8S874737E/Ar9EJfdSbD1qqfZ2jmuXxeF7mjn15W2b1GpI5ljIsFKgOJNn/01KG5JPR9Q4AHGSK2A6CYK3FPM#S367

Our check that the switch zone is up was, for whatever reason, hanging for approximately 195 seconds on each iteration. This resulted in what should have been a 30-second timeout becoming about 90 minutes.
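The arithmetic behind that blow-up can be sketched as follows. The iteration count of 30 is an assumption (inferred from the 30-second figure and the later mention of 30 retries), as is the one-second intended cost per iteration; only the 195-second stall comes from the description above.

```python
# Assumption: the check loops 30 times and each iteration was meant to
# cost about one second. Neither number is confirmed by the PR itself.
ITERATIONS = 30
INTENDED_SECONDS_PER_ITERATION = 1

# From the PR description: each iteration actually hung for about 195 seconds.
OBSERVED_SECONDS_PER_ITERATION = 195


def total_minutes(iterations: int, seconds_per_iteration: float) -> float:
    """Total wall-clock time of the loop, in minutes."""
    return iterations * seconds_per_iteration / 60


intended_minutes = total_minutes(ITERATIONS, INTENDED_SECONDS_PER_ITERATION)
observed_minutes = total_minutes(ITERATIONS, OBSERVED_SECONDS_PER_ITERATION)

print(f"intended: {intended_minutes} min, observed: {observed_minutes} min")
```

Under those assumptions the loop takes 97.5 minutes instead of half a minute, which is in line with the roughly 90 minutes reported.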

This does not fix the root cause, whatever it is, but it does keep the job from hogging one of the two machines available to run this CI job.
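One way to get a hard bound of this kind, sketched in Python rather than in whatever the CI script actually uses, is to give every attempt its own timeout and cap the number of attempts. The names `wait_until_ready`, `probe`, and `worst_case_seconds`, and all default values, are hypothetical and not taken from the change itself.

```python
import time
from typing import Callable


def worst_case_seconds(max_attempts: int, per_attempt_timeout: float, pause: float) -> float:
    """Upper bound on total wait: every attempt times out, with a pause between attempts."""
    return max_attempts * per_attempt_timeout + (max_attempts - 1) * pause


def wait_until_ready(
    probe: Callable[[float], bool],
    max_attempts: int = 30,          # hypothetical default
    per_attempt_timeout: float = 1.0,  # hypothetical default, in seconds
    pause: float = 1.0,              # hypothetical default, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call probe until it reports readiness; return the 1-based attempt number.

    probe receives the per-attempt timeout and is responsible for honoring it
    (for example by passing it to a socket or HTTP client). Raises TimeoutError
    if no attempt succeeds.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            if probe(per_attempt_timeout):
                return attempt
        except OSError:
            # Connection refused, reset, or timed out: treat as "not ready yet".
            pass
        if attempt < max_attempts:
            sleep(pause)
    raise TimeoutError(f"service not ready after {max_attempts} attempts")
```

With these illustrative defaults the loop cannot run longer than `worst_case_seconds(30, 1.0, 1.0)`, that is 59 seconds, provided the probe really does respect the timeout it is given. A probe that ignores its timeout reintroduces the original problem.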

@FelixMcFelix (Contributor) commented Feb 2, 2026

I think this is fine. In the successful CI run here, it takes around 13 attempts (~23s) to reach the service in the switch zone. It looks like in recent successful runs on main we get a few quick RSTs, then one query takes ~20s before failing, and then a query succeeds. So we're waiting roughly as long as before, and being more honest about it. I don't know what the upper bound or variance on omicron1 zone bringup is, but I assume we can always adjust the retry count if it turns out we're regularly hitting the 30-retry limit (other than in cases where the switch zone just isn't coming up).
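As a rough check on the headroom, the figures above (13 attempts in about 23 seconds, a limit of 30 retries) can be turned into a projected time budget. This sketch assumes every attempt costs about the same, which is a simplification; `projected_budget_seconds` is a hypothetical helper, not something from the change.

```python
def projected_budget_seconds(
    observed_attempts: int, observed_seconds: float, max_attempts: int
) -> float:
    """Scale the observed average cost per attempt up to the retry limit.

    Assumes attempts are uniform in duration, which real runs may not be.
    """
    seconds_per_attempt = observed_seconds / observed_attempts
    return seconds_per_attempt * max_attempts


# 13 attempts in ~23 s (from the comment above), projected to 30 retries.
budget = projected_budget_seconds(13, 23.0, 30)
print(f"about {budget:.0f} s available before giving up")
```

At that rate the 30-retry limit corresponds to a little under a minute, so the successful run used a bit less than half of the available budget.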

@iliana merged commit 44e65c3 into main on Feb 2, 2026
16 checks passed
@iliana deleted the iliana/timeouts-timeouts-always-wrong branch on February 2, 2026 20:20