Reboot e2e test made more robust by using nohup. #9117

Merged
merged 1 commit into from
Jun 11, 2015

Conversation

jszczepkowski
Contributor

Reboot e2e test made more robust by using nohup in ssh commands. Fixes #9062. Follow-up of #8784.

@k8s-bot

k8s-bot commented Jun 2, 2015

EXPERIMENTAL JENKINS PR BUILDER: e2e build succeeded.

@ghost

ghost commented Jun 2, 2015

What are the 'sleep 10's for?

@jszczepkowski
Contributor Author

To prevent a race: one process is waiting for the SSH connection to finish while the other is doing something nasty to the machine. So I wanted to give the first process some time to finish the session.

@jszczepkowski
Contributor Author

@quinton-hoole PTAL

@ghost

ghost commented Jun 8, 2015

You should rather explicitly wait on the condition that you require to be true (e.g. the previous SSH session closing, or whatever). And fail with an explicit error if that does not occur within the timeout (e.g. 10 sec). Otherwise you're inviting both flakiness and long delays in e2e tests, because we'll never know quite how much time to sleep, and it will inevitably be extended every time we see flakiness.

@ghost

ghost commented Jun 8, 2015

In fact looking a bit deeper, rebootNode() already does the checks for you. So I don't believe that you need to sleep at all.

https://github.com/jszczepkowski/kubernetes/blob/e2e-nodes/test/e2e/reboot.go#L217

@jszczepkowski
Contributor Author

The check in rebootNode verifies that pods are running. My sleep is not related to pods but to the ssh session. The bash command launched via nohup may start immediately, before the ssh session is closed. We don't want the reboot or network partition to begin before the session has closed, so that the session closes cleanly (we have a check for this in rebootNode). The sleep makes this race much less likely.

Making this condition explicit would be really complicated and would obscure the test (we would need some additional synchronization abstractions). I think the sleep is good enough.
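For readers following the thread, here is a rough sketch of the pattern being described: the disruptive action is wrapped in a detached `nohup` shell with a leading sleep, so the ssh session can return before the action runs. The helper name `detachedCmd` is hypothetical, and the exact command strings used in reboot.go may differ.

```go
package main

import "fmt"

// detachedCmd is a hypothetical helper that wraps action so it runs in the
// background under nohup after delaySeconds, with output discarded. The ssh
// session that issues it can therefore close before the action starts.
// Note: action must not contain single quotes, since it is embedded in a
// single-quoted sh -c argument.
func detachedCmd(delaySeconds int, action string) string {
	return fmt.Sprintf("nohup sh -c 'sleep %d && %s' >/dev/null 2>&1 &", delaySeconds, action)
}

func main() {
	fmt.Println(detachedCmd(10, "sudo reboot"))
	// prints: nohup sh -c 'sleep 10 && sudo reboot' >/dev/null 2>&1 &
}
```

The sleep only narrows the race window; it does not guarantee that the session has closed before the action begins.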

@ghost

ghost commented Jun 10, 2015

Fair enough, but please put the above explanation in as a comment in the code for the benefit of the next person who has to come along and debug. Thanks. Then LGTM.

Reboot e2e test made more robust by using nohup in ssh commands. Fixes kubernetes#9062. Follow-up of kubernetes#8784.
@k8s-bot

k8s-bot commented Jun 10, 2015

EXPERIMENTAL JENKINS PR BUILDER: e2e build succeeded.

@jszczepkowski
Contributor Author

I've added the comments. The PR should be ready now.

@ghost ghost added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed cla: yes area/test-infra labels Jun 10, 2015
ArtfulCoder added a commit that referenced this pull request Jun 11, 2015
Reboot e2e test made more robust by using nohup.
@ArtfulCoder ArtfulCoder merged commit 237b968 into kubernetes:master Jun 11, 2015
@jszczepkowski jszczepkowski unassigned ghost Aug 12, 2015

Successfully merging this pull request may close these issues.

Reboot e2e test is flaky
4 participants