tests/installed: bump reboot timeout to 180s #1545
Conversation
I think the problem is more that the reboot playbook is still racy. It's really easy to reproduce just by lowering the timeouts locally. I am struggling to figure out how to wrestle Ansible into doing this in a race-free way. Our existing playbooks mostly get by with long delays, which is exactly the thing we were trying to avoid by doing VM-in-container. The core problem with the existing code is being able to retry after SSH-level failures if we happen to get back in before the reboot. What I have so far is:
Basically it feels like we're really fighting the system; we'd need to move to a model where we generate an inventory, break out of the playbook each time we do a reboot, and resume as a distinct playbook or something.
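The race described above can be sketched roughly as follows (a hypothetical fragment; the task names, trigger command, and timings are assumptions, not the actual playbook — only the module names are real Ansible):

```yaml
# Fire the reboot asynchronously so the task doesn't die with the SSH session.
- name: Reboot the host
  shell: sleep 2 && systemctl reboot
  async: 1
  poll: 0

# The racy part: if we reconnect before the reboot actually tears SSH down,
# this "succeeds" against the old boot and the play resumes too early.
- name: Wait for the host to come back
  wait_for_connection:
    delay: 10     # a long delay papers over the race, which is what we want to avoid
    timeout: 120
```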
Ansible does have a `meta: reset_connection`, but... it fails if the connection was actually already torn down. Sigh.
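For reference, that meta task is just the following (standard Ansible syntax):

```yaml
# Drops the cached SSH connection so the next task reconnects from scratch,
# but it errors out if the connection has already been torn down by the reboot.
- meta: reset_connection
```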
Hmm, that's tricky. What if instead of
I think the problem with that is twofold:
That said, I've been banging on this most of the morning and am happy to context-switch away 😄
Here's what I've been playing with:
I failed to give up on this and ended up with: #1548
Nice! I took a few minutes to take an inventory (pun intended) of the last few recent flakes we've seen. Of the 9 examined, only 2 were actually due to rebooting, while 3 were due to
Ah sure, actually might as well bump it even higher I guess.
It seems like 240 retries is just not long enough for all the non-destructive tests running in parallel to finish. Let's crank that up to 500 retries.
The description said 500, but the content only did 300. Since we really want both, I slurped this into #1548
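A hedged sketch of the kind of polling task being tuned here (the marker path and task name are made up; `until`/`retries`/`delay` are the real Ansible keywords). At Ansible's default 5-second delay, 240 retries is roughly 20 minutes of waiting, while 500 stretches that to roughly 40:

```yaml
- name: Wait for the parallel non-destructive tests to finish
  stat:
    path: /run/tests-done        # hypothetical completion marker
  register: done
  until: done.stat.exists
  retries: 500                   # bumped from 240; description and content should agree
  delay: 5
```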
We've been seeing a lot of CI test failures due to Ansible timing out
waiting for the host to come back up after a reboot.
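In playbook terms, the fix this PR makes is approximately the following (a sketch under assumptions, not the literal diff; the task name is invented):

```yaml
- name: Wait for the host to come back up after reboot
  wait_for_connection:
    timeout: 180   # bumped to 180s to stop CI flakes from slow reboots
```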