Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

apiserver: command exec in pod randomly drops stdin #112834

Open
pohly opened this issue Oct 3, 2022 · 23 comments
Open

apiserver: command exec in pod randomly drops stdin #112834

pohly opened this issue Oct 3, 2022 · 23 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/testing Categorizes an issue or PR as relevant to SIG Testing. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@pohly
Copy link
Contributor

pohly commented Oct 3, 2022

What happened?

I was using

// ExecWithOptions executes a command in the specified container,
// returning stdout, stderr and error. `options` allowed for
// additional parameters to be passed.
func (f *Framework) ExecWithOptions(options ExecOptions) (string, string, error) {
if !options.Quiet {
Logf("ExecWithOptions %+v", options)
}
config, err := LoadConfig()
ExpectNoError(err, "failed to load restclient config")
const tty = false
Logf("ExecWithOptions: Clientset creation")
req := f.ClientSet.CoreV1().RESTClient().Post().
Resource("pods").
Name(options.PodName).
Namespace(options.Namespace).
SubResource("exec").
Param("container", options.ContainerName)
req.VersionedParams(&v1.PodExecOptions{
Container: options.ContainerName,
Command: options.Command,
Stdin: options.Stdin != nil,
Stdout: options.CaptureStdout,
Stderr: options.CaptureStderr,
TTY: tty,
}, scheme.ParameterCodec)
var stdout, stderr bytes.Buffer
Logf("ExecWithOptions: execute(POST %s)", req.URL())
err = execute("POST", req.URL(), config, options.Stdin, &stdout, &stderr, tty)
if options.PreserveWhitespace {
return stdout.String(), stderr.String(), err
}
return strings.TrimSpace(stdout.String()), strings.TrimSpace(stderr.String()), err
}
(similar to kubectl exec -i) with dd as command and some data as stdin to create a file inside a container. This worked fine most of the time during local development. When used inside a Prow job, I ran into test flakes that could be traced down to dd returning after writing zero bytes although stdin was non-empty. No error was returned.

What did you expect to happen?

The data should be transferred reliably.

How can we reproduce it (as minimally and precisely as possible)?

Sorry, no test for this right now.

Anything else we need to know?

No response

Kubernetes version

462f412 (roughly current master)

Cloud provider

kind

OS version

No response

Install tools

kind

Container runtime (CRI) and version (if applicable)

containerd (from master)

Related plugins (CNI, CSI, ...) and versions (if applicable)

No response

@pohly pohly added the kind/bug Categorizes issue or PR as related to a bug. label Oct 3, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 3, 2022
pohly added a commit to pohly/kubernetes that referenced this issue Oct 5, 2022
It turned out to be unreliable (see
kubernetes#112834).  Encoding the data
inside the command as input for base64 is a workaround that is fine for small
amounts of data. It becomes less efficient and/or unusable for large amounts.
@neolit123
Copy link
Member

/sig testing

@k8s-ci-robot k8s-ci-robot added sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 8, 2022
@aojea
Copy link
Member

aojea commented Oct 10, 2022

maybe something breaking the spdy encapsulation?
it can be between e2e -- apiserver --- kubelet -- containerd

@pohly
Copy link
Contributor Author

pohly commented Oct 11, 2022

Yes, could be anywhere. Probably a good first step would be to write a stress test for this functionality.

@aojea
Copy link
Member

aojea commented Oct 11, 2022

yeah, it suffered of memory leaks in the past, I don't think that this feature has been stress tested

pohly added a commit to pohly/kubernetes that referenced this issue Oct 11, 2022
It turned out to be unreliable (see
kubernetes#112834).  Encoding the data
inside the command as input for base64 is a workaround that is fine for small
amounts of data. It becomes less efficient and/or unusable for large amounts.
jaehnri pushed a commit to jaehnri/kubernetes that referenced this issue Jan 3, 2023
It turned out to be unreliable (see
kubernetes#112834).  Encoding the data
inside the command as input for base64 is a workaround that is fine for small
amounts of data. It becomes less efficient and/or unusable for large amounts.
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 9, 2023
@pohly
Copy link
Contributor Author

pohly commented Jan 9, 2023

/sig apimachinery

This is used by test/e2e/framework, but ultimately this is caused somewhere else.

/remove-lifecycle stale

@k8s-ci-robot
Copy link
Contributor

@pohly: The label(s) sig/apimachinery cannot be applied, because the repository doesn't have them.

In response to this:

/sig apimachinery

This is used by test/e2e/framework, but ultimately this is caused somewhere else.

/remove-lifecycle stale

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 9, 2023
@pohly
Copy link
Contributor Author

pohly commented Jan 9, 2023

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Jan 9, 2023
@cici37
Copy link
Contributor

cici37 commented Jan 12, 2023

We are at api machinery bugscrub meeting.
Thanks for the tagging! Looks like the root cause hasn't been identified and test/e2e/storage/drivers/proxy/io.go could be used to reproduce?
@jpbetz Would you mind to take a look or help with redirecting to the right person? Thanks

@pohly
Copy link
Contributor Author

pohly commented Jan 13, 2023

I have asked a colleague to write a stress test for this. He won't be able to solve it, but it should be a start towards identifying the root cause.

/cc @hj-johannes-lee

@aojea
Copy link
Member

aojea commented Jan 13, 2023

the tests in pkg/kubelet/cri/streaming/server_test.go may serve as base

@cici37
Copy link
Contributor

cici37 commented Jan 24, 2023

/cc @hj-johannes-lee @pohly
Would you mind sharing if the stress test is planned or if there is any other plan on this? Thank you!

@hj-johannes-lee
Copy link
Contributor

@cici37 Hi, I am working on it, but could not reproduce the exact problem.! We were supposed to have a meeting and discuss this matter, but we couldn't.
What I did until now is as follows:

  1. Deploying a busybox pod.
  2. Using ExecWithOptions with stdin option (here, data is read from a file that has a long enough contents) for creating a file inside a container.
  3. Comparing the value of stdout read from the container to the value of data read from the file.
  4. Looping the previous process so it will continue until the case is caught.

Let me discuss with @pohly asap and come with the "working" stress test.!

@hj-johannes-lee
Copy link
Contributor

@cici37 @pohly
I just submitted a pr, but I know it should be improved in many ways.!

@cici37
Copy link
Contributor

cici37 commented Feb 14, 2023

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 14, 2023
@shawkins
Copy link

shawkins commented Mar 6, 2023

We in the fabric8 java client are seeing similar problems fabric8io/kubernetes-client#4910 with exec stdin across a variety of our supported http clients. We're also seeing a problem the other direction on stdOut. Occasionally we'll see the web socket close being sent to before all of the stdOut data messages have been received. We are considering a workaround of waiting for the exit code message, but it is definitely not expected behavior to see additional messages after close.

@aojea
Copy link
Member

aojea commented Mar 6, 2023

websocket has a known issue #89899 (comment) and is reliable than the spdy implementation

@shawkins
Copy link

shawkins commented Mar 7, 2023

@aojea that known issue is fairly easy to workaround, at least for our internal usage scenarios. When you are done with stdIn, you send the web socket close. As noted in that issue it's not a great solution because you won't reliably see command completion - but the api server should process everything that was sent. Isn't SPDY deprecated?

The problem identified here is that the apiserver websocket seems to be the handling close out of order - when sending data it will process the incoming close prior to fully processing data messages it has already received, and on the send side it will send the close ahead of pending data messages.

@aojea
Copy link
Member

aojea commented Mar 7, 2023

Isn't SPDY deprecated?

#7452

but is the one used by kubectl exec/portforward 😄

There was one person trying to move on to websockets by default but it seems that is hard to keep the compatibility between versions #60140 (comment) and the ROI doesn't seem to be enough to try such complicated change

The problem identified here is that the apiserver websocket seems to be the handling close out of order - when sending data it will process the incoming close prior to fully processing data messages it has already received, and on the send side it will send the close ahead of pending data messages.

the close before the data is sent is similar to this #60140 (comment), but I'm just doing some quick correlations here , you should verify and double check those links

@shawkins
Copy link

shawkins commented Mar 9, 2023

the close before the data is sent is similar to this #60140 (comment), but I'm just doing some quick correlations here , you should verify and double check those links

My apologies. After reviewing these issues and the relevant net, apiserver, and kubernetes code, the issues that we are hitting are definitely within the fabric8 kubernetes client code. I'll get those addressed and see how often, if at all, we're actually hitting this or #60140

@k8s-triage-robot
Copy link

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Mar 8, 2024
@cici37
Copy link
Contributor

cici37 commented Apr 11, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 11, 2024
@XciD
Copy link

XciD commented Apr 11, 2024

Hello,

Not sure it's the same issue. I don't succeed to use dd with stdin with kube exec.

echo 'test' > file

cat file | kubectl exec -i pod -- dd of=/home/user/file
0+0 records in
0+0 records out
0 bytes copied, 0.000927544 s, 0.0 kB/s

kubectl exec pod -- ls -lh /home/user/file
-rw-r--r-- 1 user user 0 Apr 11 22:19 /home/user/file

But running:

kubectl exec -it pod -- dd of=/home/user/file

And typing something with \d works

Also this works:

cat file | k exec -i pod -- bash -c 'cat -- | dd of=/tmp/file'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/testing Categorizes an issue or PR as relevant to SIG Testing. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

No branches or pull requests

10 participants