
"kubectl exec" sometimes incorrectly returns empty string causing tests to flake #34256

Closed

saad-ali opened this issue Oct 6, 2016 · 28 comments

Labels: area/docker, kind/bug, lifecycle/rotten, sig/node

@saad-ali (Member) commented Oct 6, 2016

@jingxu97 noticed (#28081 (comment)) that a lot of the "Pod Disks should..." PD e2e tests are very flaky on GCI because kubectl exec incorrectly returns an empty string.

Jing's report:

Since last week we noticed that all 6 PD tests have become flaky, but ONLY on the GCI image. The failures are all similar: create a PD, write a string to a file (echo $string > $file), and read it back (cat $file) via "kubectl exec". Sometimes the cat command returns an empty string. Through investigation, we found that the data is written to the file successfully and the mounts are all fine. The only problem is that "kubectl exec" running the cat command sometimes returns an empty string on stdout. The following three tests prove this point.

  1. Use the existing PD test, but instead of reading once, read the file in a loop 100 times. Sometimes, in the middle of these 100 reads, a single read returns an empty string and all the rest are fine.

  2. Manually create a pod and use a "kubectl exec" command with "echo >" to write a string to a file. Then run a bash script that loops, issuing "kubectl exec" with "cat" many times. The cat returns the string, but after about 100 runs it returns an empty string. Running the loop again, it returns normally for about 300 iterations, then fails to return the string once more. In the meantime, the data and mounts are untouched.

  3. As Saad suggested, to make sure this is not caused by the "cat" command itself, we created a pod whose container command writes data to a file and then repeatedly reads it with 'cat' 1000 times. Checking the container's output with "kubectl logs" shows that all the reads are fine.

The conclusion so far is that the issue is not on the storage side: a "kubectl exec" invocation of the "cat" command sometimes fails to return the output correctly. Since this command execution involves several steps (passing the command to the container, executing it, and returning the output from the container to the terminal), we need to investigate more to find the root cause.

Please let me know if you have any questions.
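
Experiment 3 can be reconstructed along these lines; the image, names, and counts here are illustrative rather than taken from the report, and this assumes a kubectl where run --restart=Never creates a bare pod:

    # Hypothetical reconstruction of experiment 3: the write/read loop runs
    # entirely inside the container, so kubectl exec is out of the picture.
    kubectl run cat-test --image=busybox --restart=Never -- \
      sh -c 'echo "this is my file" > /tmp/myfile;
             i=0; while [ $i -lt 1000 ]; do cat /tmp/myfile; i=$((i+1)); done'

    # Every one of the 1000 in-container reads should show up in the log.
    kubectl logs cat-test | grep -c "this is my file"

If the count comes back 1000 every time, cat and the filesystem are in the clear.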

CC @pwittrock

Step two is a good way to repro this.

@ncdc (Member) commented Oct 7, 2016

I have previously investigated and fixed multiple issues where kubectl exec wasn't receiving all the data it was expecting. I have attempted to reproduce this recently, as I too have noticed the increasing number of test flakes of this nature. I updated my local checkout, made sure my KUBERNETES_PROVIDER was set to GCE, and then I ran cluster/kube-up.sh, which I believe runs the GCI image by default. I ran kubectl exec nginx echo running in container (one of the frequently flaking tests) in an endless loop for close to 2 hours and could not get it to flake.

If anyone can provide steps for how to reproduce this easily, or an actual cluster where this is flaking, I will be extremely happy and willing to debug.
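
For reference, a harness along these lines is enough to catch the flake when it does occur; the pod name and expected output are from the test mentioned above, while the loop itself is only a sketch:

    # Run kubectl exec endlessly and stop the moment the output is wrong.
    while true; do
        out=$(kubectl exec nginx echo running in container)
        if [ "${out}" != "running in container" ]; then
            echo "flaked: got '${out}'"
            break
        fi
    done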

@mengqiy (Member) commented Oct 7, 2016

@ncdc I'm pasting the steps that @jingxu97 gave me for reproducing it. They work for me.

ssh to the master and then cat /etc/lsb-release to make sure the cluster is using the GCI image.

Steps to reproduce:

  1. Use the gcloud command to create a disk: https://cloud.google.com/sdk/gcloud/reference/compute/disks/create
  2. Create a pod which references this PD:
    https://gist.github.com/jingxu97/b9afc582fa52eb83042ac9d4a5b337ea
  3. Use kubectl to write a string to a file in the pod's container:
    kubectl exec -c= -- echo '123' > filepath
  4. Use a loop to read the file through kubectl (see the sketch below):
    https://gist.github.com/jingxu97/57f5ccd83cb608a418664f9802294025
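
One detail worth noting in step 3: as written, the local shell interprets > filepath, so to be certain the write lands inside the container the command can be wrapped in sh -c. A sketch of steps 3 and 4 with placeholder pod, container, and path names (the gists above are the authoritative versions):

    # Step 3: write the string inside the container (names are placeholders).
    kubectl exec mypod -c mycontainer -- sh -c "echo '123' > /mnt/pd/testfile"

    # Step 4: read it back repeatedly and flag any empty result.
    for i in $(seq 1 1000); do
        out=$(kubectl exec mypod -c mycontainer -- cat /mnt/pd/testfile)
        [ -z "${out}" ] && echo "empty read on iteration ${i}"
    done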

@ncdc (Member) commented Oct 7, 2016

Can you reproduce without the PD?


@ncdc (Member) commented Oct 7, 2016

I was finally able to get this to reproduce. It took close to 5000 kubectl exec calls before it did :-(. I'm going to up my log level and add some debug logging and see what I get.

@jingxu97 (Contributor) commented Oct 7, 2016

Actually we haven't tried that yet. Mengqi, could you also try creating a pod without a PD attached, keeping the rest the same? Thanks!

Jing


@bowei (Member) commented Oct 7, 2016

This flake (#27023), which checks the contents of /etc/hosts, also seems to be caused by an exec of "cat <file>" returning an empty string.

In this case, the test POSTs an exec API request directly to the API server rather than going through kubectl. Note that this flake only occurs on GCI.

@ncdc (Member) commented Oct 7, 2016

I am running it in an endless loop with spdystream debugging enabled, but so far it hasn't failed. I'm amazed this seems to happen as frequently as it does in the Jenkins tests, because I'm not able to make it fail quickly.

@ncdc (Member) commented Oct 7, 2016

Finally got it to flake with spdystream debugging on. Here's normal behavior in the kubelet:

(0xc82132b4a0) Add stream frame: 1
(0xc82132b4a0) (0xc820e37860) Stream added, broadcasting: 1
(0xc82132b4a0) Add stream frame: 3
(0xc82132b4a0) (0xc820e360a0) Stream added, broadcasting: 3
(0xc82132b4a0) Add stream frame: 5
(0xc82132b4a0) (0xc820e361e0) Stream added, broadcasting: 5
(0xc820e360a0) (3) Writing data frame
(0xc820e37860) (1) Writing data frame
(0xc82132b4a0) (0xc820e37860) Stream removed, broadcasting: 1
(0xc82132b4a0) (0xc820e360a0) Stream removed, broadcasting: 3
(0xc82132b4a0) (0xc820e361e0) Stream removed, broadcasting: 5

And here's the flake:

(0xc820c40370) Add stream frame: 1
(0xc820c40370) (0xc820f2a0a0) Stream added, broadcasting: 1
(0xc820c40370) Add stream frame: 3
(0xc820c40370) (0xc820b930e0) Stream added, broadcasting: 3
(0xc820c40370) Add stream frame: 5
(0xc820c40370) (0xc820f2a280) Stream added, broadcasting: 5
(0xc820f2a0a0) (1) Writing data frame
(0xc820c40370) (0xc820f2a0a0) Stream removed, broadcasting: 1
(0xc820c40370) (0xc820b930e0) Stream removed, broadcasting: 3
(0xc820c40370) (0xc820f2a280) Stream removed, broadcasting: 5

Stream 1 is an "error" stream where the kubelet can report internal errors.

Stream 3 is stdout from the container.

Stream 5 is stderr from the container.

In the normal log, we see (0xc820e360a0) (3) Writing data frame, which in my case is the output from cat-ing a file. In the flake log, we don't see any data frames for stream 3.

Now to figure out why stdout from the container is getting lost. Could be a bug in docker. Could be a bug in how we invoke docker exec. Could be a race in how we interact with docker exec and our spdy streams...
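
The stream split is also visible from the client side: since stdout and stderr of the execed process travel on streams 3 and 5 respectively, kubectl reattaches them to its own stdout and stderr, so they can be captured separately. A quick illustration, assuming the nginx pod from earlier:

    # stdout (stream 3) and stderr (stream 5) arrive on separate SPDY streams,
    # so the two files below should end up with one line each.
    kubectl exec nginx -- sh -c 'echo to-stdout; echo to-stderr >&2' \
        1>stdout.txt 2>stderr.txt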

@bprashanth (Contributor) commented

Hmm, exec has a lot of moving parts, which makes it hard to debug. Can you avoid the flake in the test? If you're execing something, do it in a retry loop (a sketch follows below). If we need better guarantees, we should replace kubectl exec with something like ssh+nsenter (random idea; ssh will also flake, but it fails cleanly with an error).

I don't see how we can provide stability guarantees when this is built on docker exec, which the docker people recommend using only for debugging.
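
A minimal retry wrapper of the kind suggested here might look as follows; the function name and retry count are illustrative, and it assumes the expected output is non-empty:

    # Hypothetical helper: retry kubectl exec until it produces non-empty
    # output, up to a fixed number of attempts.
    exec_with_retry() {
        local attempts=5 out i
        for i in $(seq 1 ${attempts}); do
            out=$(kubectl exec "$@")
            if [ -n "${out}" ]; then
                echo "${out}"
                return 0
            fi
        done
        return 1
    }

    # Usage (placeholder names): exec_with_retry mypod -- cat /mnt/pd/testfile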

@ncdc (Member) commented Oct 7, 2016

@bprashanth while you could do that for the PD tests, we have tests specifically for kubectl exec that are also flaking.

The big news, which I just established, is that this appears to be a bug in docker in the GCI image. I ran the following and it flaked on me:

while true; do output=$(docker exec nginx cat /tmp/myfile); echo ${output}; if [ "${output}" != "this is my file" ]; then break; fi; done

ncdc removed their assignment Oct 7, 2016

@ncdc (Member) commented Oct 7, 2016

@runcom are you aware of anything in upstream docker 1.11.2 where docker exec sometimes doesn't send stdout back to the client?

@bprashanth (Contributor) commented

@kubernetes/sig-node for clarity on the suspect docker issue mentioned in #34256 (comment)

@runcom (Contributor) commented Oct 7, 2016

@ncdc I'm not sure, I remember something but can't check right now (afk). I will later today.

yujuhong added the area/os/gci and sig/node labels Oct 7, 2016

@yujuhong (Contributor) commented Oct 7, 2016

/cc @vishh @mtaufen @dchen1107

@mtaufen (Contributor) commented Oct 11, 2016

sweet

@calebamiles (Contributor) commented

@saad-ali, I commented upstream. Is it correct that we'll need a backport to the 1.11.x and 1.12.x series unless we're planning on bumping to docker 1.13? That seems somewhat risky, but we should probably do something about all these CI flakes that are possibly related to this issue.

@saad-ali (Member, Author) commented

Good question. @kubernetes/sig-node @vishh @dchen1107 what's the plan for this? Is it a 1.5 blocker?

saad-ali added this to the v1.5 milestone Nov 10, 2016

@ncdc (Member) commented Nov 10, 2016

Docker 1.13 isn't out yet. Unfortunately the best fix for this particular issue at this time is probably to ensure that all the PD tests that use kubectl exec retry it a few times, such as in #34357.

@saad-ali (Member, Author) commented

Ack. @jingxu97 started adding retries for PD tests in #35560

@dims (Member) commented Nov 17, 2016

Guess we need to punt this until docker is ready with the fixes we need.

dims modified the milestones: v1.6, v1.5 Nov 17, 2016

@mengqiy (Member) commented Mar 9, 2017

@ncdc It seems that the upstream issue moby/moby#27289 has been fixed. Please confirm. Can we close this issue?

@ncdc (Member) commented Mar 9, 2017

@ymqytw no, we can't close this issue until everyone moves to docker 1.13+.

@mengqiy (Member) commented Mar 9, 2017

I see. Then we should remove this issue from the 1.6 milestone.

ethernetdan removed this from the v1.6 milestone Mar 10, 2017
k8s-github-robot pushed a commit that referenced this issue Mar 15, 2017
Automatic merge from submit-queue

Retry calls to ReadFileViaContainer in PD tests

**What this PR does / why we need it**:
kubectl exec occasionally fails to return a valid output string. It seems to be an issue with docker (#34256). This PR retries the 'kubectl exec' call to work around the issue. This should fix the flaky PD test issues.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #28283

**Release note**:

NONE

@fejta-bot commented

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Dec 22, 2017

@fejta-bot commented

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jan 21, 2018

@fejta-bot commented

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
