
increase docker-healthcheck response timeout #5644

Merged: 1 commit merged into kubernetes:master on Nov 18, 2018

Conversation

@tatobi (Contributor) commented Aug 16, 2018

Under increased I/O load, the 10-second timeout is not enough on small or heavily loaded systems, so I propose 60 seconds. The kubelet timeout for detecting health problems is 2 minutes (120 seconds) by default. Secondly, a docker restart can heavily load the host OS, even on large systems, because many pods initialize at the same time; a continuous dockerd restart loop, effectively a deadlocked node, has been observed. Thirdly, because of the forcibly closed sockets and the kernel TCP TIME_WAIT behaviour, TCP sockets are not usable immediately after a restart, so waiting for the FIN timeout is necessary before starting services again.
Workaround #1 for: #5434
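For context, a minimal sketch of the approach described above, assuming a shell healthcheck that restarts docker via systemd. This is illustrative only, not the exact docker-healthcheck script changed by this PR; the function names and systemctl calls are assumptions.

```bash
#!/bin/bash
# Illustrative sketch only, not the exact docker-healthcheck script in this PR.

TIMEOUT=60  # previously 10s; 60s tolerates heavy I/O on small or loaded nodes

docker_healthy() {
  # Docker is considered healthy if it answers `docker ps` within $TIMEOUT seconds.
  timeout "${TIMEOUT}" docker ps > /dev/null 2>&1
}

restart_docker() {
  systemctl stop docker
  # Forcibly closed sockets linger until the kernel's FIN timeout expires,
  # so wait that long before starting services again (see the description above).
  sleep "$(cat /proc/sys/net/ipv4/tcp_fin_timeout)"
  systemctl start docker
}

docker_healthy || restart_docker
```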

@k8s-ci-robot (Contributor)

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot added the cncf-cla: no, needs-ok-to-test, and size/S labels on Aug 16, 2018
@ihoegen (Contributor) commented Aug 17, 2018

Also, @tatobi, be sure to sign the CLA.

/ok-to-test

@k8s-ci-robot added the cncf-cla: yes label and removed the needs-ok-to-test and cncf-cla: no labels on Aug 17, 2018
@ihoegen (Contributor) commented Aug 17, 2018

/lgtm

@k8s-ci-robot added the lgtm label on Aug 17, 2018
@chrisz100 (Contributor) left a comment

Can we make this timeout dynamic instead? I can see people who don't want to wait 60 seconds here and would prefer to fail early.

@tatobi (Contributor, Author) commented Sep 1, 2018

We could, but we would need to identify precisely which metrics should drive the dynamic value. The current approach is admittedly a bit rough; we could check the running containers' statuses instead, perhaps in a next version. Do you think monitoring every pod's status after a stop/start operation would be more appropriate?

@chrisz100 (Contributor)
I wouldn't say we need to go down to that level, as that would be quite costly in compute. But making the timeout dynamic would allow it to be tailored to specific environments and their characteristics.
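For illustration, the simplest form of this could be a per-environment override with the new value as the fallback; a hypothetical sketch (the DOCKER_HEALTHCHECK_TIMEOUT variable is not part of this PR):

```bash
# Hypothetical: let operators tune the healthcheck timeout per environment,
# falling back to the 60s default proposed in this PR.
TIMEOUT="${DOCKER_HEALTHCHECK_TIMEOUT:-60}"

if ! timeout "${TIMEOUT}" docker ps > /dev/null 2>&1; then
  systemctl restart docker
fi
```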

@justinsb (Member)
/retest

I think we should get this in; the behaviour here is much better - fewer false positives, and likely much more visible because of the /proc/sys/net/ipv4/tcp_fin_timeout sleep.

/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ihoegen, justinsb, tatobi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Nov 17, 2018
@justinsb (Member)

The errors Travis is picking up are the old misspellings, because Travis doesn't test the code after rebasing.

Force-merging.

@justinsb merged commit 902a3e4 into kubernetes:master on Nov 18, 2018
Labels: approved, cncf-cla: yes, lgtm, size/S