
"kubectl exec" sometimes incorrectly returns empty string causing tests to flake #34256

Closed

saad-ali opened this issue Oct 6, 2016 · 28 comments

Labels: area/docker, kind/bug, lifecycle/rotten, sig/node

@saad-ali (Member) commented Oct 6, 2016

@jingxu97 noticed (#28081 (comment)) that a lot of the "Pod Disks should..." PD e2e tests are very flaky on GCI because kubectl exec incorrectly returns an empty string.

Jing's report:

Since last week we noticed that all 6 PD tests have become flaky, but ONLY on the GCI image. The failures are all similar: create a PD, write a string to a file (echo $string > $file), and read it back (cat $file) via "kubectl exec". Sometimes the cat command returns an empty string. Through investigation, we found that the data is written to the file successfully and the mounts are all fine. The only problem is that "kubectl exec" running the cat command sometimes returns an empty string on stdout. The following three tests prove this point.

  1. Use the existing PD test, but instead of reading once, read the file in a loop 100 times. Sometimes, in the middle of these 100 reads, a single read returns an empty string and all the rest are fine.

  2. Manually create a pod and use a "kubectl exec" command with "echo >" to write a string to a file. Then run a bash script that loops, issuing "kubectl exec" with "cat" many times. The cat returns the string, but after about 100 runs it returns an empty string. Running the loop again, it returns normally for about 300 iterations, then fails to return the string once more. In the meantime, the data and mounts are untouched.

  3. As Saad suggested, to make sure this is not caused by the "cat" command itself, we created a pod whose container command writes data to a file and then repeatedly reads it with 'cat' 1000 times. Checking the container's output with "kubectl logs" shows that all the reads are fine.

The conclusion so far is that the issue is not on the storage side: a "kubectl exec" invocation of the "cat" command sometimes fails to return the output correctly. Since this command execution involves several steps (passing the command to the container, executing it, and returning the output from the container to the terminal), we need to investigate more to find the root cause.

Please let me know if you have any questions.
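
Experiment 3 can be reconstructed along these lines; the image, names, and counts here are illustrative rather than taken from the report, and this assumes a kubectl where run --restart=Never creates a bare pod:

    # Hypothetical reconstruction of experiment 3: the write/read loop runs
    # entirely inside the container, so kubectl exec is out of the picture.
    kubectl run cat-test --image=busybox --restart=Never -- \
      sh -c 'echo "this is my file" > /tmp/myfile;
             i=0; while [ $i -lt 1000 ]; do cat /tmp/myfile; i=$((i+1)); done'

    # Every one of the 1000 in-container reads should show up in the log.
    kubectl logs cat-test | grep -c "this is my file"

If the count comes back 1000 every time, cat and the filesystem are in the clear.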

CC @pwittrock

Step two is a good way to repro this.

@ncdc (Member) commented Oct 7, 2016

I have previously investigated and fixed multiple issues where kubectl exec wasn't receiving all the data it was expecting. I have attempted to reproduce this recently, as I too have noticed the increasing number of test flakes of this nature. I updated my local checkout, made sure my KUBERNETES_PROVIDER was set to GCE, and then I ran cluster/kube-up.sh, which I believe runs the GCI image by default. I ran kubectl exec nginx echo running in container (one of the frequently flaking tests) in an endless loop for close to 2 hours and could not get it to flake.

If anyone can provide steps for how to reproduce this easily, or an actual cluster where this is flaking, I will be extremely happy and willing to debug.
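
For reference, a harness along these lines is enough to catch the flake when it does occur; the pod name and expected output are from the test mentioned above, while the loop itself is only a sketch:

    # Run kubectl exec endlessly and stop the moment the output is wrong.
    while true; do
        out=$(kubectl exec nginx echo running in container)
        if [ "${out}" != "running in container" ]; then
            echo "flaked: got '${out}'"
            break
        fi
    done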

@mengqiy (Member) commented Oct 7, 2016

@ncdc I'm pasting the steps that @jingxu97 gave me for reproducing it. They work for me.

ssh to the master and then cat /etc/lsb-release to make sure the cluster is using the GCI image.

Steps to reproduce:

  1. Use the gcloud command to create a disk: https://cloud.google.com/sdk/gcloud/reference/compute/disks/create
  2. Create a pod which references this PD:
    https://gist.github.com/jingxu97/b9afc582fa52eb83042ac9d4a5b337ea
  3. Use kubectl to write a string to a file in the pod's container:
    kubectl exec -c= -- echo '123' > filepath
  4. Use a loop to read the file through kubectl (see the sketch below):
    https://gist.github.com/jingxu97/57f5ccd83cb608a418664f9802294025
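
One detail worth noting in step 3: as written, the local shell interprets > filepath, so to be certain the write lands inside the container the command can be wrapped in sh -c. A sketch of steps 3 and 4 with placeholder pod, container, and path names (the gists above are the authoritative versions):

    # Step 3: write the string inside the container (names are placeholders).
    kubectl exec mypod -c mycontainer -- sh -c "echo '123' > /mnt/pd/testfile"

    # Step 4: read it back repeatedly and flag any empty result.
    for i in $(seq 1 1000); do
        out=$(kubectl exec mypod -c mycontainer -- cat /mnt/pd/testfile)
        [ -z "${out}" ] && echo "empty read on iteration ${i}"
    done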

@ncdc (Member) commented Oct 7, 2016

Can you reproduce without the PD?


@ncdc (Member) commented Oct 7, 2016

I was finally able to get this to reproduce. It took close to 5000 kubectl exec calls before it did :-(. I'm going to up my log level and add some debug logging and see what I get.

@jingxu97 (Contributor) commented Oct 7, 2016

Actually we haven't tried that yet. Mengqi, could you also try creating a pod without a PD attached, keeping the rest the same? Thanks!

Jing


@bowei (Member) commented Oct 7, 2016

This flake (#27023), which checks the contents of /etc/hosts, also seems to be caused by an exec of "cat <file>" returning an empty string.

In this case, the test POSTs an exec API request directly to the API server rather than going through kubectl. Note that this flake only occurs on GCI.

@ncdc (Member) commented Oct 7, 2016

I am running it in an endless loop with spdystream debugging enabled, but so far it hasn't failed. I'm amazed this seems to happen as frequently as it does in the Jenkins tests, because I'm not able to make it fail quickly.

@ncdc (Member) commented Oct 7, 2016

Finally got it to flake with spdystream debugging on. Here's normal behavior in the kubelet:

(0xc82132b4a0) Add stream frame: 1
(0xc82132b4a0) (0xc820e37860) Stream added, broadcasting: 1
(0xc82132b4a0) Add stream frame: 3
(0xc82132b4a0) (0xc820e360a0) Stream added, broadcasting: 3
(0xc82132b4a0) Add stream frame: 5
(0xc82132b4a0) (0xc820e361e0) Stream added, broadcasting: 5
(0xc820e360a0) (3) Writing data frame
(0xc820e37860) (1) Writing data frame
(0xc82132b4a0) (0xc820e37860) Stream removed, broadcasting: 1
(0xc82132b4a0) (0xc820e360a0) Stream removed, broadcasting: 3
(0xc82132b4a0) (0xc820e361e0) Stream removed, broadcasting: 5

And here's the flake:

(0xc820c40370) Add stream frame: 1
(0xc820c40370) (0xc820f2a0a0) Stream added, broadcasting: 1
(0xc820c40370) Add stream frame: 3
(0xc820c40370) (0xc820b930e0) Stream added, broadcasting: 3
(0xc820c40370) Add stream frame: 5
(0xc820c40370) (0xc820f2a280) Stream added, broadcasting: 5
(0xc820f2a0a0) (1) Writing data frame
(0xc820c40370) (0xc820f2a0a0) Stream removed, broadcasting: 1
(0xc820c40370) (0xc820b930e0) Stream removed, broadcasting: 3
(0xc820c40370) (0xc820f2a280) Stream removed, broadcasting: 5

Stream 1 is an "error" stream where the kubelet can report internal errors.

Stream 3 is stdout from the container.

Stream 5 is stderr from the container.

In the normal log, we see (0xc820e360a0) (3) Writing data frame, which in my case is the output from cat-ing a file. In the flake log, we don't see any data frames for stream 3.

Now to figure out why stdout from the container is getting lost. Could be a bug in docker. Could be a bug in how we invoke docker exec. Could be a race in how we interact with docker exec and our spdy streams...
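
The stream split is also visible from the client side: since stdout and stderr of the execed process travel on streams 3 and 5 respectively, kubectl reattaches them to its own stdout and stderr, so they can be captured separately. A quick illustration, assuming the nginx pod from earlier:

    # stdout (stream 3) and stderr (stream 5) arrive on separate SPDY streams,
    # so the two files below should end up with one line each.
    kubectl exec nginx -- sh -c 'echo to-stdout; echo to-stderr >&2' \
        1>stdout.txt 2>stderr.txt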

@bprashanth (Contributor) commented

Hmm, exec has a lot of moving parts, which makes it hard to debug. Can you avoid the flake in the test? If you're execing something, do it in a retry loop (a sketch follows below). If we need better guarantees, we should replace kubectl exec with something like ssh+nsenter (random idea; ssh will also flake, but it fails cleanly with an error).

I don't see how we can provide stability guarantees when this is built on docker exec, which the docker people recommend using only for debugging.
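
A minimal retry wrapper of the kind suggested here might look as follows; the function name and retry count are illustrative, and it assumes the expected output is non-empty:

    # Hypothetical helper: retry kubectl exec until it produces non-empty
    # output, up to a fixed number of attempts.
    exec_with_retry() {
        local attempts=5 out i
        for i in $(seq 1 ${attempts}); do
            out=$(kubectl exec "$@")
            if [ -n "${out}" ]; then
                echo "${out}"
                return 0
            fi
        done
        return 1
    }

    # Usage (placeholder names): exec_with_retry mypod -- cat /mnt/pd/testfile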

@ncdc (Member) commented Oct 7, 2016

@bprashanth while you could do that for the PD tests, we have tests specifically for kubectl exec that are also flaking.

The big news, which I just established, is that this appears to be a bug in docker in the GCI image. I ran the following and it flaked on me:

while true; do output=$(docker exec nginx cat /tmp/myfile); echo ${output}; if [ "${output}" != "this is my file" ]; then break; fi; done

ncdc removed their assignment Oct 7, 2016

@ncdc (Member) commented Oct 7, 2016

@runcom are you aware of anything in upstream docker 1.11.2 where docker exec sometimes doesn't send stdout back to the client?

@bprashanth (Contributor) commented

@kubernetes/sig-node for clarity on the suspect docker issue mentioned in #34256 (comment)

@runcom (Contributor) commented Oct 7, 2016

@ncdc I'm not sure, I remember something but can't check right now (afk). I will later today.

yujuhong added the area/os/gci and sig/node labels Oct 7, 2016

@yujuhong (Contributor) commented Oct 7, 2016

/cc @vishh @mtaufen @dchen1107

@mtaufen (Contributor) commented Oct 11, 2016

sweet

@calebamiles (Contributor) commented

@saad-ali, I commented upstream. Is it correct that we'll need a backport to the 1.11.x and 1.12.x series unless we're planning on bumping to docker 1.13? That seems somewhat risky, but we should probably do something about all these CI flakes that are possibly related to this issue.

@saad-ali (Member, Author) commented

Good question. @kubernetes/sig-node @vishh @dchen1107 what's the plan for this? Is it a 1.5 blocker?

saad-ali added this to the v1.5 milestone Nov 10, 2016

@ncdc (Member) commented Nov 10, 2016

Docker 1.13 isn't out yet. Unfortunately the best fix for this particular issue at this time is probably to ensure that all the PD tests that use kubectl exec retry it a few times, such as in #34357.

@saad-ali (Member, Author) commented

Ack. @jingxu97 started adding retries for PD tests in #35560

@dims (Member) commented Nov 17, 2016

Guess we need to punt this until docker is ready with the fixes we need.

dims modified the milestones: v1.6, v1.5 Nov 17, 2016

@mengqiy (Member) commented Mar 9, 2017

@ncdc It seems that the upstream issue moby/moby#27289 has been fixed. Please confirm. Can we close this issue?

@ncdc (Member) commented Mar 9, 2017

@ymqytw no, we can't close this issue until everyone moves to docker 1.13+.

@mengqiy (Member) commented Mar 9, 2017

I see. Then we should remove this issue from the 1.6 milestone.

ethernetdan removed this from the v1.6 milestone Mar 10, 2017
k8s-github-robot pushed a commit that referenced this issue Mar 15, 2017
Automatic merge from submit-queue

Retry calls to ReadFileViaContainer in PD tests

**What this PR does / why we need it**:
kubectl exec occasionally fails to return a valid output string. It seems to be an issue with docker (#34256). This PR retries the 'kubectl exec' call to work around the issue. This should fix the flaky PD test issues.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #28283

**Release note**:

NONE

@fejta-bot commented

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Dec 22, 2017

@fejta-bot commented

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jan 21, 2018

@fejta-bot commented

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
