
increase docker-healthcheck response timeout #5644

Merged: 1 commit merged into kubernetes:master on Nov 18, 2018

Conversation

@tatobi (Contributor) commented Aug 16, 2018

Under increased I/O load, the 10-second timeout is not enough on small or heavily loaded systems, so I propose 60 seconds. The kubelet timeout for detecting health problems is 2 minutes (120 seconds) by default. Secondly, a docker restart can heavily load the host OS, even on large systems, because many pods initialize at the same time; a continuous dockerd restart loop, effectively a deadlocked node, has been observed. Thirdly, because of the forcibly closed sockets and the kernel TCP TIME_WAIT behaviour, TCP sockets are not usable immediately after a restart, so waiting for the FIN timeout is necessary before starting services again.
Workaround #1 for: #5434
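For context, a minimal sketch of the approach described above, assuming a shell healthcheck that restarts docker via systemd. This is illustrative only, not the exact docker-healthcheck script changed by this PR; the function names and systemctl calls are assumptions.

```bash
#!/bin/bash
# Illustrative sketch only, not the exact docker-healthcheck script in this PR.

TIMEOUT=60  # previously 10s; 60s tolerates heavy I/O on small or loaded nodes

docker_healthy() {
  # Docker is considered healthy if it answers `docker ps` within $TIMEOUT seconds.
  timeout "${TIMEOUT}" docker ps > /dev/null 2>&1
}

restart_docker() {
  systemctl stop docker
  # Forcibly closed sockets linger until the kernel's FIN timeout expires,
  # so wait that long before starting services again (see the description above).
  sleep "$(cat /proc/sys/net/ipv4/tcp_fin_timeout)"
  systemctl start docker
}

docker_healthy || restart_docker
```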

@k8s-ci-robot (Contributor)

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot added the cncf-cla: no, needs-ok-to-test, and size/S labels on Aug 16, 2018
@ihoegen (Contributor) commented Aug 17, 2018

Also, @tatobi, be sure to sign the CLA.

/ok-to-test

@k8s-ci-robot added the cncf-cla: yes label and removed the needs-ok-to-test and cncf-cla: no labels on Aug 17, 2018
@ihoegen (Contributor) commented Aug 17, 2018

/lgtm

@k8s-ci-robot added the lgtm label on Aug 17, 2018
@chrisz100 (Contributor) left a comment

Can we make this timeout dynamic instead? I can see people who don't want to wait 60 seconds here and would prefer to fail early.

@tatobi (Contributor, Author) commented Sep 1, 2018

We could, but we would need to identify precisely which metrics should drive the dynamic value. The current approach is admittedly a bit rough; we could check the running containers' statuses instead, perhaps in a next version. Do you think monitoring every pod's status after a stop/start operation would be more appropriate?

@chrisz100 (Contributor)
I wouldn't say we need to go down to that level, as that would be quite costly in compute. But making the timeout dynamic would allow it to be tailored to specific environments and their characteristics.
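For illustration, the simplest form of this could be a per-environment override with the new value as the fallback; a hypothetical sketch (the DOCKER_HEALTHCHECK_TIMEOUT variable is not part of this PR):

```bash
# Hypothetical: let operators tune the healthcheck timeout per environment,
# falling back to the 60s default proposed in this PR.
TIMEOUT="${DOCKER_HEALTHCHECK_TIMEOUT:-60}"

if ! timeout "${TIMEOUT}" docker ps > /dev/null 2>&1; then
  systemctl restart docker
fi
```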

@justinsb (Member)
/retest

I think we should get this in; the behaviour here is much better - fewer false positives, and likely much more visible because of the /proc/sys/net/ipv4/tcp_fin_timeout sleep.

/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ihoegen, justinsb, tatobi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Nov 17, 2018
@justinsb (Member)

The errors Travis is picking up are the old misspellings, because Travis doesn't test the code after rebasing.

Force-merging.

@justinsb merged commit 902a3e4 into kubernetes:master on Nov 18, 2018
Labels: approved, cncf-cla: yes, lgtm, size/S