Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix docker-healthcheck to work around Docker bug. #6448

Merged
merged 1 commit into from
Mar 16, 2019

Conversation

tsuna
Copy link
Contributor

@tsuna tsuna commented Feb 5, 2019

This is a workaround to better detect moby/moby#38642 when Docker starts
up and remains stuck. In this case docker ps will return nothing and
exit 0, but no container can actually start. A better (but more expensive
and more intrusive test) would be to docker run --rm some cheap test
container to confirm we can actually start a container.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 5, 2019
@k8s-ci-robot
Copy link
Contributor

Hi @tsuna. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 5, 2019
@tsuna
Copy link
Contributor Author

tsuna commented Feb 5, 2019

/assign @justinsb

@zetaab
Copy link
Member

zetaab commented Feb 5, 2019

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 5, 2019
@tsuna
Copy link
Contributor Author

tsuna commented Feb 5, 2019

e2e test due to what appears to be an infrastructure issue — this hit another one of my PRs as well:

Unable to list AWS regions: AuthFailure: AWS was not able to validate the provided access credentials

@chrisz100
Copy link
Contributor

/retest
Hope dies last

Copy link
Contributor

@chrisz100 chrisz100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 7, 2019
@tsuna
Copy link
Contributor Author

tsuna commented Feb 7, 2019

Thanks for the LGTM!
Same failure again on the E2E test. Did AWS just decide to hate me? :(

@chrisz100
Copy link
Contributor

It’s not aws but the test infra that is having some issues lately. Just a matter of time till the system is restored.

@tsuna
Copy link
Contributor Author

tsuna commented Feb 14, 2019

Can we /retest plz

@chrisz100
Copy link
Contributor

/retest

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 22, 2019
@tsuna
Copy link
Contributor Author

tsuna commented Feb 22, 2019

JFYI: I just rebased the change on top of the latest master.

@tsuna
Copy link
Contributor Author

tsuna commented Feb 22, 2019

Hi @chrisz100, could you please re-instate the LGTM so this can be merged? I didn't change anything in the code other than rebasing. Thanks in advance. :)

@chrisz100
Copy link
Contributor

Looks good!

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 22, 2019
This is a workaround to better detect moby/moby#38642 when Docker starts
up and remains stuck.  In this case `docker ps` will return nothing and
exit 0, but no container can actually start.  A better (but more expensive
and more intrusive test) would be to `docker run --rm` some cheap test
container to confirm we can actually start a container.
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 27, 2019
@tsuna
Copy link
Contributor Author

tsuna commented Feb 27, 2019

Rebased...

@tsuna
Copy link
Contributor Author

tsuna commented Mar 4, 2019

@chrisz100 could you reinstate the LGTM please? 😄

@tsuna
Copy link
Contributor Author

tsuna commented Mar 8, 2019

🙏

@justinsb justinsb added this to the 1.12 milestone Mar 14, 2019
@chrisz100
Copy link
Contributor

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 16, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chrisz100, tsuna

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 16, 2019
@k8s-ci-robot k8s-ci-robot merged commit d2db91b into kubernetes:master Mar 16, 2019
@loshz
Copy link

loshz commented Mar 21, 2019

@tsuna how did you go about updating the healthcheck to this new version? we are experiencing this bug 4/5 times a day in several of our clusters and it's currently causing outages, so would love to get this deployed!

@tsuna
Copy link
Contributor Author

tsuna commented Mar 21, 2019

You need to build and deploy and use kops from master.

(you need to have Bazel installed prior to running this)

go get -d -u k8s.io/kops
cd $GOPATH/src/k8s.io/kops
export S3_BUCKET_NAME=<a public S3 bucket you created>
export KOPS_BASE_URL=https://${S3_BUCKET_NAME}.s3.amazonaws.com/kops/1.12.0-alpha.1/
make kops kops-install dev-upload UPLOAD_DEST=s3://${S3_BUCKET_NAME}

and then use .build/local/kops instead.

Do you believe you're running into moby/moby#38642 as well? Or is it a different underlying Docker bug?

@loshz
Copy link

loshz commented Mar 21, 2019

@tsuna thanks - I'll give it a go.

I'm not 100% sure what our exact issue is, but it does sound similar to moby/moby#38642

We're finding ourselves having to restart Docker (18.06.1) on multiple machines daily, due to RPC/sandbox/network errors.

@tsuna
Copy link
Contributor Author

tsuna commented Mar 21, 2019

In the case of the bug I was running into you could tell for sure by looking at the output of docker network ls. If that came back empty then it's almost certainly that bug. If you see anything in there then it's a different issue.

But whatever bug you're running into, is it also the case that docker ps keeps working even when Docker is unhappy?

@loshz
Copy link

loshz commented Mar 22, 2019

After further inspection, I'm not 100% certain if we are experiencing the same problem. It's definitely network related, but I have a feeling it's a bug in the CNI amongst other things.

@loshz
Copy link

loshz commented Mar 27, 2019

Ok, I take the above comment back. I think we are experiencing the same issue.

We see Docker just stop on a regular basis. After SSHing onto a node I see no networks:

$ sudo docker network ls
NETWORK ID          NAME                DRIVER              SCOPE

@tsuna
Copy link
Contributor Author

tsuna commented Mar 27, 2019

Yeah that's a hallmark of the Docker bug I filed. Do you happen to know if the Docker daemon restarted shortly before the problem started? Maybe due to the healthcheck failing, due to the host being overloaded? Or due to a kernel OOM? If you can add any supporting evidence to moby/moby#38642 to help move the bug forward, that would be nice, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants