Reboot e2e test made more robust by using nohup. #9117
EXPERIMENTAL JENKINS PR BUILDER: e2e build succeeded.
What are the 'sleep 10's for?
To prevent a race: one process is waiting for the SSH connection to finish while the other is doing something nasty with the machine. I wanted to give the first process some time to finish its session.
@quinton-hoole PTAL
You should rather explicitly wait on the condition that you require to be true (e.g. the previous SSH session closing, or whatever). And fail with an explicit error if that does not occur within the timeout (e.g. 10 sec). Otherwise you're inviting both flakiness and long delays in e2e tests, because we'll never know quite how much time to sleep, and it will inevitably be extended every time we see flakiness.
In fact, looking a bit deeper, rebootNode() already does the checks for you. So I don't believe that you need to sleep at all. https://github.com/jszczepkowski/kubernetes/blob/e2e-nodes/test/e2e/reboot.go#L217
The check in rebootNode verifies that pods are running. My sleep is not related to pods but to the SSH session. The bash command issued by nohup may start immediately, before the SSH session is closed. We don't want the reboot or network partition to start before then, so that the session closes cleanly (we have a check for this in rebootNode). The sleep makes this race much less likely. Making the condition explicit would be really complicated and would obscure the test (we would need additional synchronization abstractions). I think the sleep is good enough.
Fair enough, but please put the above explanation in as a comment in the code for the benefit of the next person who has to come along and debug this. Thanks. Then LGTM.
Reboot e2e test made more robust by using nohup in ssh commands. Fixes kubernetes#9062. Follow-up of kubernetes#8784.
EXPERIMENTAL JENKINS PR BUILDER: e2e build succeeded.
I've added the comments. The PR should be ready now.